How to incrementally optimize COUNT(*) ... GROUP BY queries in PostgreSQL?
I have a PostgreSQL table with a few million rows (timestamp, event, distinct_id, properties) where I need to:

- filter on `event` (low cardinality) and `timestamp`,
- and group by `date_trunc('week', timestamp)` and `distinct_id` (high cardinality).

I have already squeezed good performance out of the `COUNT(*) ... GROUP BY week` queries, specifically this one:

```sql
SELECT date_trunc('week', timestamp) AS "timestamp",
       count(*) AS "count"
FROM telemetry_events
WHERE (
    (event = 'view load' OR event = 'view:loaded')
    AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY date_trunc('week', timestamp)
ORDER BY date_trunc('week', timestamp) ASC;
```
by creating two indexes:

```sql
CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");
```
The execution time is currently just under 7 seconds, which I think is pretty good (at least for now). Here is the `EXPLAIN ANALYZE`:

```
GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp"))
  ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
        Sort Key: (date_trunc('week'::text, "timestamp"))
        Sort Method: quicksort  Memory: 8189kB
        ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
              Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
              Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
              Rows Removed by Filter: 639794
              Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms
```
However, if I add the high-cardinality `distinct_id` to the `GROUP BY`, the query becomes much slower (45 to 70 seconds):

```sql
SELECT date_trunc('week', timestamp) AS "timestamp",
       distinct_id,
       count(*) AS "count"
FROM telemetry_events
WHERE (
    (event = 'view load' OR event = 'view:loaded')
    AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (date_trunc('week', timestamp), distinct_id)
ORDER BY date_trunc('week', timestamp) ASC;
```
Here is the `EXPLAIN ANALYZE`:

```
Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
  ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
              Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
              ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                    Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                    Sort Method: quicksort  Memory: 4520kB
                    Worker 0:  Sort Method: quicksort  Memory: 4507kB
                    Worker 1:  Sort Method: quicksort  Memory: 4679kB
                    ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                          Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                          Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                          Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms
```
What is the best way to try to optimize these queries?
A couple of sub-questions: Is it reasonable for me to expect to keep this query under ~10 seconds? How fast can I make PostgreSQL go for this kind of OLAP-style workload?
Why do you have `(event = 'view load' OR event = 'view:loaded')`? Do these actually mean two different things, or do you just have dirty data with two different spellings of the same thing?

Your first query just uses the index as a skinny version of the table. Note that the `Index Cond` does not include the leading column of the index, and for a btree index that means it scans the entire index, using the other column only as an "in-index filter". This can still be faster than scanning the table, because the index is probably much smaller than the table. In your second query this doesn't work, because the index is missing one of the necessary columns, so it has to visit the table anyway.

A good index for your second query would be:
("timestamp",event,distinct_id)
This index should also work better for the first query than your current indexes do. But better yet might be to clean up your data so that you don't need the `OR`, or to create a partial index that filters on the `OR` condition.
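If the two spellings really do mean the same thing, the cleanup could be a one-time normalization along these lines (a sketch only; it assumes `'view:loaded'` is the spelling you want to keep, and that whatever writes the events is fixed first so the old spelling stops arriving):

```sql
-- Assumption: 'view:loaded' is the canonical spelling.
UPDATE telemetry_events
SET event = 'view:loaded'
WHERE event = 'view load';
```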
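And here is a sketch of the partial-index variant, which bakes the `OR` condition into the index predicate so the index only stores matching rows (the index name is again hypothetical):

```sql
-- The planner will only consider this index when the query's WHERE
-- clause implies the index predicate, as both queries above do.
CREATE INDEX telemetry_events_view_load_ts_idx
    ON public.telemetry_events USING btree ("timestamp", distinct_id)
    WHERE event = 'view load' OR event = 'view:loaded';
```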