PostgreSQL

Improving sort performance in a GROUP BY clause

  • June 8, 2020

I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:

events table

CREATE TABLE events (
 id serial NOT NULL PRIMARY KEY,
 event_type text NOT NULL,
 event_path jsonb,
 event_data jsonb,
 created_at timestamp with time zone NOT NULL
);

-- Index on type and created time

CREATE INDEX events_event_type_created_at_idx
 ON events (event_type, created_at);

event_refs table

CREATE TABLE event_refs (
 event_id integer NOT NULL,
 reference_key text NOT NULL,
 reference_value text NOT NULL,
 CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
 CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
     REFERENCES events (id) MATCH SIMPLE
     ON UPDATE NO ACTION ON DELETE NO ACTION
);

Both tables contain 2M rows. This is the query I am running:

SELECT
 EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
FROM
 events
INNER JOIN
 event_refs
ON
 event_refs.event_id = events.id AND
 event_refs.reference_key = 'project'
WHERE
   events.event_type = 'event1' OR
   events.event_type = 'event2' AND
   events.created_at >= '2015-07-01 00:00:00+08:00' AND
   events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1

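I am aware of the operator precedence in the WHERE clause. It is only supposed to filter events of type 'event2' by date; events of type 'event1' are deliberately not restricted by date.

Here is the EXPLAIN ANALYZE output: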

GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
  Group Key: event_refs.reference_value
  Filter: (count(*) > 1)
  Rows Removed by Filter: 41315
  ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
        Sort Key: event_refs.reference_value
        Sort Method: external merge  Disk: 3904kB
        ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
              Hash Cond: (event_refs.event_id = events.id)
              ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                    Filter: (reference_key = 'project'::text)
              ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                    Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                    ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                          Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                          Heap Blocks: exact=14911
                          ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                      Index Cond: (event_type = 'event1'::text)
                                ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                      Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
Planning time: 0.493 ms
Execution time: 4178.517 ms

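I am aware that the filter on the event_refs table scan is not filtering anything. That is a consequence of my test data; different types will be added later.

Everything up to and including the Hash Join seems reasonable given my test data, but I am wondering whether it is possible to speed up the Sort caused by the GROUP BY clause?

I have tried adding a b-tree index on the reference_value column, but the planner does not seem to use it. If I am not mistaken (and I may well be, so please tell me), it is sorting 153766 rows. Wouldn't an index benefit this sorting step?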

work_mem

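This is what makes your sort expensive:

Sort Method: external merge  Disk: 3904kB

The sort spills to disk, which kills performance. You need more RAM. In particular, you need to increase the setting for work_mem. The manual:

work_mem (integer)

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.

In this particular case, raising the setting by just 4MB should do the trick. Generally, you are going to need a lot more in your full deployment of 60M rows, and a general work_mem setting that is too high is counterproductive (read the manual!). So consider setting it high enough locally, just for your query, like: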

BEGIN;
SET LOCAL work_mem = '500MB';  -- adapt to your needs
SELECT ...;
COMMIT;
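
To check whether a given value is enough, one option is to rerun EXPLAIN ANALYZE inside the same transaction and read the Sort node. The sketch below reuses the query from the question; the 16MB value is only an assumed starting point for testing, not a recommendation. If the sort fits in memory, the Sort node should report an in-memory method with a Memory figure (for example quicksort) instead of external merge with a Disk figure.

```sql
BEGIN;
SHOW work_mem;                    -- current value, for comparison
SET LOCAL work_mem = '16MB';      -- assumed test value; adjust as needed
EXPLAIN (ANALYZE, BUFFERS)
SELECT EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) AS funnel_time
FROM   events
JOIN   event_refs ON event_refs.event_id = events.id
                 AND event_refs.reference_key = 'project'
WHERE  events.event_type = 'event1'
   OR  events.event_type = 'event2'
   AND events.created_at >= '2015-07-01 00:00:00+08:00'
   AND events.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY event_refs.reference_value
HAVING COUNT(*) > 1;
ROLLBACK;                         -- nothing to keep; this also discards the SET LOCAL
```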

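Be aware that even SET LOCAL stays in effect until the end of the transaction. If you run more statements in the same transaction, you may want to reset it afterwards: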

RESET work_mem;

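Or encapsulate the query in a function with a function-local setting, so that the higher work_mem applies only while the function runs.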
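
A minimal sketch of the function variant, assuming the query from the question is wrapped unchanged. The function name funnel_times is hypothetical, and the 500MB figure is simply copied from the SET LOCAL example. The SET clause of CREATE FUNCTION applies the setting on entry and restores the previous value on exit.

```sql
-- funnel_times is a hypothetical name; 500MB mirrors the SET LOCAL example.
CREATE OR REPLACE FUNCTION funnel_times()
  RETURNS TABLE (funnel_time double precision)
  LANGUAGE sql
  SET work_mem = '500MB'   -- in effect only while the function executes
AS
$func$
SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at)))::double precision
FROM   events e
JOIN   event_refs r ON r.event_id = e.id
                   AND r.reference_key = 'project'
WHERE  e.event_type = 'event1'
   OR  e.event_type = 'event2'
   AND e.created_at >= '2015-07-01 00:00:00+08:00'
   AND e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING COUNT(*) > 1;
$func$;

-- Usage: the caller's own work_mem is unchanged after the call.
SELECT * FROM funnel_times();
```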

Indexes

I would also try these indexes:

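-- New name, because events_event_type_created_at_idx already exists on (event_type, created_at);
-- that two-column index becomes redundant once this one is in place.
CREATE INDEX events_event_type_created_at_id_idx ON events (event_type, created_at, id);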

Adding id as the last column only makes sense if you get index-only scans out of it.

And a partial index on event_refs:

CREATE INDEX event_refs_foo_idx ON event_refs (event_id, reference_value)
WHERE  reference_key = 'project';

The predicate WHERE reference_key = 'project' does not help much in your test case (except for query planning), but it should help a lot in your full deployment, where different types will be added later.

This should also allow index-only scans.
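
Whether an index-only scan is actually chosen can be checked with EXPLAIN. The sketch below assumes the partial index event_refs_foo_idx has been created; the probe query is illustrative only. Look for an "Index Only Scan" node and a low "Heap Fetches" count. Two caveats apply: the planner may still prefer a sequential scan while every row matches the predicate, as in the test data, and releases before 9.6 reportedly could not pick an index-only scan when the partial index's predicate column (reference_key here) is not itself part of the index, so the result on 9.4 may differ.

```sql
-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM ANALYZE event_refs;

-- Illustrative probe: touches only columns covered by the partial index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id, reference_value
FROM   event_refs
WHERE  reference_key = 'project';
```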

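Possible alternative query

Since you are selecting large parts of events, this alternative query might be faster (it depends heavily on data distribution):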

SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at))) as funnel_time
FROM   events e
JOIN  (
  SELECT event_id, reference_value, count(*) AS ct
  FROM   event_refs
  WHERE  reference_key = 'project'                   
  GROUP  BY event_id, reference_value
  ) r ON r.event_id = e.id
WHERE (e.event_type = 'event1' OR
      e.event_type = 'event2')        -- see below !
AND    e.created_at >= '2015-07-01 00:00:00+08:00'
AND    e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING sum(r.ct) > 1;

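I suspect a bug in your query, and that you want parentheses in the WHERE clause like the ones I added. According to operator precedence, AND binds before OR.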
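
The precedence rule can be seen in isolation with constant booleans. In this small sketch the column aliases are arbitrary: the first expression is parsed as true OR (false AND false) and yields true, while the explicitly grouped one yields false.

```sql
SELECT (true OR false AND false)   AS as_written,   -- same as true OR (false AND false)
       ((true OR false) AND false) AS with_parens;  -- OR evaluated first
```

Applied to the original WHERE clause, this means the date range restricts only rows with event_type = 'event2' unless the two event_type conditions are wrapped in parentheses.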

The alternative query only makes sense if there are many rows per (event_id, reference_value) in event_refs. Again, the index above would help.

Quoted from: https://dba.stackexchange.com/questions/125094