How to incrementally optimize COUNT(*) ... GROUP BY queries in PostgreSQL?
I have a PostgreSQL table with a few million rows (timestamp, event, distinct_id, properties) where I need to:

- filter on `event` (low cardinality) and `timestamp`,
- and group by `date_trunc('week', timestamp)` and `distinct_id` (high cardinality).

I have already squeezed good performance out of the `COUNT(*) ... GROUP BY week` queries, specifically this one:

```sql
SELECT date_trunc('week', timestamp) AS "timestamp",
       count(*) AS "count"
FROM telemetry_events
WHERE (
    (event = 'view load' OR event = 'view:loaded')
    AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY date_trunc('week', timestamp)
ORDER BY date_trunc('week', timestamp) ASC;
```
by creating two indexes:

```sql
CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");
```
The execution time is currently just under 7 seconds, which I think is pretty good (at least for now). Here is the `EXPLAIN ANALYZE`:

```
GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp"))
  ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
        Sort Key: (date_trunc('week'::text, "timestamp"))
        Sort Method: quicksort  Memory: 8189kB
        ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
              Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
              Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
              Rows Removed by Filter: 639794
              Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms
```
However, if I add the high-cardinality `distinct_id` to the `GROUP BY`, the query becomes much slower (45 to 70 seconds):

```sql
SELECT date_trunc('week', timestamp) AS "timestamp",
       distinct_id,
       count(*) AS "count"
FROM telemetry_events
WHERE (
    (event = 'view load' OR event = 'view:loaded')
    AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (date_trunc('week', timestamp), distinct_id)
ORDER BY date_trunc('week', timestamp) ASC;
```
Here is the `EXPLAIN ANALYZE`:

```
Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
  ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
              Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
              ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                    Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                    Sort Method: quicksort  Memory: 4520kB
                    Worker 0:  Sort Method: quicksort  Memory: 4507kB
                    Worker 1:  Sort Method: quicksort  Memory: 4679kB
                    ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                          Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                          Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                          Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms
```
What is the best way to try to optimize these queries?
A couple of sub-questions: Is it reasonable for me to expect to keep this query under ~10 seconds? How fast can I make PostgreSQL go for this kind of OLAP-style workload?
Why do you have `(event = 'view load' OR event = 'view:loaded')`? Do these actually mean two different things, or do you just have dirty data with two different spellings of the same thing?

Your first query just uses the index as a skinny version of the table. Note that the `Index Cond` does not include the leading column of the index, and for a btree index that means it scans the entire index, using the other column only as an "in-index filter". This can still be faster than scanning the table, because the index is probably much smaller than the table. In your second query this doesn't work, because the index is missing one of the necessary columns, so it has to visit the table anyway.

A good index for your second query would be:
("timestamp",event,distinct_id)
This index should also work better for the first query than your current indexes do. But better yet might be to clean up your data so that you don't need the `OR`, or to create a partial index that filters on the `OR` condition.
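If the two spellings really do mean the same thing, the cleanup could be a one-time normalization along these lines (a sketch only; it assumes `'view:loaded'` is the spelling you want to keep, and that whatever writes the events is fixed first so the old spelling stops arriving):

```sql
-- Assumption: 'view:loaded' is the canonical spelling.
UPDATE telemetry_events
SET event = 'view:loaded'
WHERE event = 'view load';
```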
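And here is a sketch of the partial-index variant, which bakes the `OR` condition into the index predicate so the index only stores matching rows (the index name is again hypothetical):

```sql
-- The planner will only consider this index when the query's WHERE
-- clause implies the index predicate, as both queries above do.
CREATE INDEX telemetry_events_view_load_ts_idx
    ON public.telemetry_events USING btree ("timestamp", distinct_id)
    WHERE event = 'view load' OR event = 'view:loaded';
```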