Improving sort performance in a GROUP BY clause
I have two tables in Postgres 9.4.1, `events` and `event_refs`, with the following schemas:
`events` table

    CREATE TABLE events (
      id serial NOT NULL PRIMARY KEY,
      event_type text NOT NULL,
      event_path jsonb,
      event_data jsonb,
      created_at timestamp with time zone NOT NULL
    );

    -- Index on type and created time
    CREATE INDEX events_event_type_created_at_idx
      ON events (event_type, created_at);
`event_refs` table

    CREATE TABLE event_refs (
      event_id integer NOT NULL,
      reference_key text NOT NULL,
      reference_value text NOT NULL,
      CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
      CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
        REFERENCES events (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
    );
Both tables contain 2M rows. This is the query I am running:

    SELECT
      EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
    FROM
      events
    INNER JOIN
      event_refs
    ON
      event_refs.event_id = events.id AND
      event_refs.reference_key = 'project'
    WHERE
        events.event_type = 'event1' OR
        events.event_type = 'event2' AND
        events.created_at >= '2015-07-01 00:00:00+08:00' AND
        events.created_at < '2015-12-01 00:00:00+08:00'
    GROUP BY
      event_refs.reference_value
    HAVING
      COUNT(*) > 1
I am aware of the operator precedence in the `WHERE` clause. The date filter is only supposed to apply to events of type 'event2'.
This is the `EXPLAIN ANALYZE` output:

    GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
      Group Key: event_refs.reference_value
      Filter: (count(*) > 1)
      Rows Removed by Filter: 41315
      ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
            Sort Key: event_refs.reference_value
            Sort Method: external merge  Disk: 3904kB
            ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
                  Hash Cond: (event_refs.event_id = events.id)
                  ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                        Filter: (reference_key = 'project'::text)
                  ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                        Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                        ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                              Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                              Heap Blocks: exact=14911
                              ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                    ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                          Index Cond: (event_type = 'event1'::text)
                                    ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                          Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
    Planning time: 0.493 ms
    Execution time: 4178.517 ms
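I am aware that the filter on the `event_refs` table scan is not filtering anything. That is a result of my test data; different types will be added later. Everything up to and including the `Hash Join` seems reasonable given my test data, but I am wondering whether it is possible to speed up the `Sort` that comes from the `GROUP BY` clause.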
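I have tried adding a b-tree index on the `reference_value` column, but the planner does not seem to use it. If I am not mistaken (and I very well could be, so please tell me), it is sorting 153766 rows. Wouldn't an index benefit this sort?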
work_mem
This is what makes your sort expensive:

    Sort Method: external merge  Disk: 3904kB
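The sort spills to disk, which kills performance. You need more RAM. In particular, you need to increase **`work_mem`**. The manual:

`work_mem` (`integer`)
Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.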
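In this particular case, raising the setting by just 4MB should do the trick. More generally, you will need a lot more in your full deployment of 60M rows, and a general `work_mem` setting that is too high is counterproductive (read the manual!). So consider setting it high enough locally, just for your query:

    BEGIN;
    SET LOCAL work_mem = '500MB';  -- adapt to your needs
    SELECT ...;
    COMMIT;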
Note that even `SET LOCAL` stays in effect until the end of the transaction. If you put more into the same transaction, you might want to reset it:

    RESET work_mem;
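To see the scope of the setting in practice, a minimal sketch like the following can be run in psql. It relies only on the built-in `SHOW` command; the `'500MB'` value is the same placeholder as above, not a recommendation.

```sql
SHOW work_mem;                  -- session-level value before the transaction

BEGIN;
SET LOCAL work_mem = '500MB';   -- applies only inside this transaction
SHOW work_mem;                  -- reports 500MB
-- Run EXPLAIN ANALYZE on the query here and check the "Sort Method" line.
-- A sort that fits in memory is reported as "quicksort  Memory: ..."
-- instead of "external merge  Disk: ...". With enough work_mem the planner
-- may also switch to a different plan that avoids the sort altogether.
COMMIT;

SHOW work_mem;                  -- back to the session-level value
```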
Or encapsulate the query in a function with a function-local setting.
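A function-local setting could look roughly like this. It is only a sketch: the function name `funnel_times` is hypothetical, the body is the query from the question with its original `WHERE` logic, and `'500MB'` is again a placeholder. The `SET` clause of `CREATE FUNCTION` applies the value for the duration of each call and restores the previous value afterwards.

```sql
-- Hypothetical function name; adapt the body and the memory value to your needs.
CREATE OR REPLACE FUNCTION funnel_times()
  RETURNS SETOF double precision
  LANGUAGE sql STABLE
  SET work_mem = '500MB'        -- in effect only while the function executes
AS
$func$
SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at)))::double precision
FROM   events e
JOIN   event_refs r ON r.event_id = e.id
                   AND r.reference_key = 'project'
WHERE  e.event_type = 'event1'
   OR  e.event_type = 'event2'
   AND e.created_at >= '2015-07-01 00:00:00+08:00'
   AND e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING COUNT(*) > 1
$func$;

-- Usage:
SELECT * FROM funnel_times();
```

The caller then needs neither an explicit transaction nor a `RESET`.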
Indexes

I would also try these indexes:

    CREATE INDEX events_event_type_created_at_idx
      ON events (event_type, created_at, id);

Adding `id` as the last column only makes sense if you get index-only scans out of it. And a partial index on `event_refs`:

    CREATE INDEX event_refs_foo_idx
      ON event_refs (event_id, reference_value)
      WHERE reference_key = 'project';
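The predicate `WHERE reference_key = 'project'` does not help much in your test case (except for query planning), but it should help a lot in your full deployment, where "there will be different types added later". This should also allow index-only scans.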
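Whether an index-only scan actually happens can be checked with something like the sketch below. It assumes the partial index `event_refs_foo_idx` from above exists. Two caveats, both stated from memory rather than from the source: index-only scans depend on an up-to-date visibility map (hence the `VACUUM`), and older releases such as 9.4 may refuse an index-only scan when the index predicate references a column that is not part of the index, a restriction that was relaxed in 9.6 as far as I recall. With test data where every row matches the predicate, the planner may also simply prefer a sequential scan.

```sql
VACUUM ANALYZE event_refs;   -- refresh visibility map and planner statistics

EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id, reference_value
FROM   event_refs
WHERE  reference_key = 'project';
-- In the output, look for "Index Only Scan using event_refs_foo_idx"
-- and a low "Heap Fetches" count.
```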
Possible alternative query

Since you are selecting large parts of `events` anyway, this alternative query might be faster (it depends heavily on data distribution):

    SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at))) as funnel_time
    FROM   events e
    JOIN  (
       SELECT event_id, reference_value, count(*) AS ct
       FROM   event_refs
       WHERE  reference_key = 'project'
       GROUP  BY event_id, reference_value
       ) r ON r.event_id = e.id
    WHERE (e.event_type = 'event1' OR
           e.event_type = 'event2')  -- see below !
    AND    e.created_at >= '2015-07-01 00:00:00+08:00'
    AND    e.created_at <  '2015-12-01 00:00:00+08:00'
    GROUP  BY r.reference_value
    HAVING sum(r.ct) > 1;
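I suspect a bug in the query, and that you want parentheses in your `WHERE` clause like I added. According to operator precedence, `AND` binds before `OR`.

The alternative query only makes sense if there are many rows per `(event_id, reference_value)` in `event_refs`. Again, the index above would help.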
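The effect of operator precedence can be seen in isolation with constant boolean values. The column aliases here are made up for illustration:

```sql
SELECT true OR false AND false   AS and_binds_first,  -- parsed as true OR (false AND false): true
       (true OR false) AND false AS or_forced_first;  -- parentheses change the result: false
```

Applied to the query in the question, the version without parentheses keeps every 'event1' row regardless of date and applies the date range only to 'event2' rows, while the parenthesized version applies the date range to both types.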