PostgreSQL

How can I incrementally optimize a COUNT(*) ... GROUP BY query in PostgreSQL?

  • May 16, 2022

I have a PostgreSQL table with millions of rows (timestamp, event, distinct_id, properties) where I typically:

  • filter by event (low cardinality) and timestamp
  • group by date_trunc('week', timestamp) and distinct_id (high cardinality)

I have already squeezed the performance of the COUNT(*) ... GROUP BY week query, specifically this one:

SELECT
 date_trunc('week', timestamp) AS "timestamp"
 , count(*) AS "count"
FROM telemetry_events
WHERE (
 (event = 'view load' OR event = 'view:loaded')
 AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY
 date_trunc('week', timestamp)
ORDER BY
 date_trunc('week', timestamp) ASC
;

by creating two indexes:

CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");

Execution time is currently just under 7 seconds, which I think is pretty good (at least for now). Here is the EXPLAIN ANALYZE:

GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
 Group Key: (date_trunc('week'::text, "timestamp"))
 ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
       Sort Key: (date_trunc('week'::text, "timestamp"))
       Sort Method: quicksort  Memory: 8189kB
       ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
             Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
             Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
             Rows Removed by Filter: 639794
             Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms

However, if I add the high-cardinality distinct_id to the GROUP BY, the query gets much slower (45 to 70 seconds):

SELECT
 date_trunc('week', timestamp) AS "timestamp"
 , distinct_id
 , count(*) AS "count"
FROM telemetry_events
WHERE (
 (event = 'view load' OR event = 'view:loaded')
 AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (
 date_trunc('week', timestamp)
 , distinct_id
)
ORDER BY
 date_trunc('week', timestamp) ASC
;

Here is the EXPLAIN ANALYZE:

Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
 Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
 ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
       Workers Planned: 2
       Workers Launched: 2
       ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
             Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
             ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                   Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                   Sort Method: quicksort  Memory: 4520kB
                   Worker 0:  Sort Method: quicksort  Memory: 4507kB
                   Worker 1:  Sort Method: quicksort  Memory: 4679kB
                   ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                         Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                         Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                         Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms

What would be the best approach to try to optimize these queries?

A few sub-questions as well: is it reasonable for me to expect to keep this query under ~10 seconds? How fast can I make PostgreSQL go for this kind of OLAP workload?

Why do you have (event = 'view load' OR event = 'view:loaded')? Do those actually mean two different things, or do you just have dirty data with two different spellings of the same thing?
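
One quick way to check which situation you are in, as a minimal diagnostic sketch (my own, not from the original exchange): count how many rows each spelling accounts for.

-- A lopsided split between the two spellings suggests dirty data
-- rather than two genuinely distinct events.
SELECT event, count(*)
FROM telemetry_events
WHERE event IN ('view load', 'view:loaded')
GROUP BY event;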

Your first query just uses the index as a skinny version of the table. Note that the Index Cond does not include the leading column of the index, which for a btree index means it is scanning the entire index and merely applying the other column as an "in-index filter". That can still be faster than scanning the table, because the index is probably much smaller than the table.
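
You can verify that size difference for this table; a minimal sketch using PostgreSQL's built-in size functions:

-- Compare the size of the table with the size of the index being scanned
SELECT
 pg_size_pretty(pg_relation_size('public.telemetry_events')) AS table_size
 , pg_size_pretty(pg_relation_size('public.telemetry_events_event_timestamp_idx')) AS index_size
;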

In your second query that doesn't work, because the index lacks one of the columns it needs, so it has to visit the table anyway. A good index for the second query would be ("timestamp", event, distinct_id). That index should also work better for your first query than the current one.
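
A sketch of that index (the index name here is my own choice, not prescribed by the answer):

CREATE INDEX telemetry_events_timestamp_event_distinct_id_idx
 ON public.telemetry_events USING btree ("timestamp", event, distinct_id);

With all three referenced columns in the index, the second query can potentially be answered by an index-only scan, as the first one was (Heap Fetches: 0).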

But better yet might be to clean up your data so that you don't need the OR at all, or to create a partial index that filters on the OR condition.
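
A sketch of both options (the index name is illustrative, and the UPDATE assumes 'view:loaded' is the canonical spelling):

-- Option 1: a partial index that bakes the OR condition into its predicate,
-- leaving only "timestamp" and distinct_id as key columns
CREATE INDEX telemetry_events_view_loaded_idx
 ON public.telemetry_events USING btree ("timestamp", distinct_id)
 WHERE (event = 'view load' OR event = 'view:loaded');

-- Option 2: fold the two spellings together once, so the OR
-- (and the event column in the index) is no longer needed
UPDATE telemetry_events
SET event = 'view:loaded'
WHERE event = 'view load';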

Source: https://dba.stackexchange.com/questions/312189