Postgresql
使用 where 子句計算近似值
我有以下問題,我們有一張公寓設施表,看起來像這樣:
我想執行以下查詢
SELECT crawled_name, count(f.id) AS count FROM facility f WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '') GROUP BY f.crawled_name, apartment_id having count(apartment_id) > 10000 LIMIT 10;
在表達到 1 億個條目後,此查詢變得緩慢。我已經嘗試在 facility_characteristic 上創建索引,但由於 where 子句匹配超過 5-10% 的條目,postgres 執行順序掃描。
我找到了關於使用近似值來加快 Postgresql 計數的 答案
select DISTINCT(crawled_name), a.estimate_ct from facility f INNER JOIN (SELECT v."crawled_name" as name, (c.reltuples * freq)::int AS estimate_ct FROM pg_stats s CROSS JOIN LATERAL unnest(s.most_common_vals::text::text[] -- use your actual data type , s.most_common_freqs) WITH ORDINALITY v ("crawled_name", freq, ord) CROSS JOIN ( SELECT reltuples FROM pg_class WHERE oid = regclass 'facility' ) c WHERE schemaname = 'public' AND tablename = 'facility' AND attname = 'crawled_name' -- case sensitive ORDER BY v.ord LIMIT 100) as a on f.crawled_name = a.name where (f.facility_characteristic IS NULL OR f.facility_characteristic = '') order by a.estimate_ct desc;
此查詢速度更快但不夠快。有人可以幫助我提高速度嗎?
結果
EXPLAIN (ANALYZE, BUFFERS) SELECT crawled_name, count(f.id) AS count FROM facility f WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '') GROUP BY f.crawled_name, apartment_id having count(apartment_id) > 10000 LIMIT 10;
是
Limit (cost=11632509.62..11632511.00 rows=10 width=28) (actual time=1648959.720..1648959.720 rows=0 loops=1) Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859 -> GroupAggregate (cost=11632509.62..12224422.47 rows=4304821 width=28) (actual time=1648959.718..1648959.718 rows=0 loops=1) Group Key: crawled_name, apartment_id Filter: (count(apartment_id) > 10000) Rows Removed by Filter: 27660633 Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859 -> Sort (cost=11632509.62..11740130.14 rows=43048207 width=28) (actual time=1341268.470..1633080.886 rows=39997609 loops=1) Sort Key: crawled_name, apartment_id Sort Method: external merge Disk: 1679168kB Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859 -> Seq Scan on facility f (cost=0.00..3084227.90 rows=43048207 width=28) (actual time=0.026..106099.542 rows=39997609 loops=1) Filter: ((facility_characteristic IS NULL) OR (facility_characteristic = ''::text)) Rows Removed by Filter: 63180551 Buffers: shared hit=82712 read=1625330 dirtied=191588 written=64279 Planning time: 0.787 ms Execution time: 1649266.193 ms
在 ziggy 回答後解釋分析輸出:
Limit (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.144..36064.183 rows=72 loops=1) -> Sort (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.142..36064.154 rows=72 loops=1) Sort Key: (count(*)) Sort Method: quicksort Memory: 30kB -> HashAggregate (cost=886570.29..886570.41 rows=9 width=12) (actual time=36053.920..36064.093 rows=72 loops=1) Group Key: crawled_name Filter: (count(*) > 100000) Rows Removed by Filter: 57147 -> Bitmap Heap Scan on facility (cost=12260.48..882687.80 rows=517666 width=12) (actual time=2209.363..22588.928 rows=40050167 loops=1) Recheck Cond: ((crawled_name IS NOT NULL) AND (NULLIF(facility_characteristic, ''::text) IS NULL)) Rows Removed by Index Recheck: 57509124 Heap Blocks: exact=33902 lossy=950736 -> Bitmap Index Scan on facility_crawled_name_idx (cost=0.00..12131.06 rows=517666 width=0) (actual time=2201.781..2201.781 rows=40193222 loops=1) Index Cond: (crawled_name IS NOT NULL) Planning time: 0.103 ms Execution time: 36065.583 ms
你有多少獨特的價值
crawled_name
?apartment_id
實現超過 20 億的價值(在未來 5-10 年內)有多可行?請注意,
COUNT(id)
它與COUNT(*)
(因為id
不可為空)相同,並且COUNT(apartment_id) > 10000
有效地COUNT(*) > 10000
與WHERE apartment_id IS NOT NULL
. 您所做的不僅使理解查詢變得複雜,而且還可能導致 PostgreSQL 選擇次優策略。通過使用COUNT(*)
instead ofCOUNT(id)
,PostgreSQL 可以在不包含 的索引上計算整個事物,並且根本id
不需要讀取表,因此它應該處理得更快。假設你目前的
apartment_id
價值只有幾百萬,而且在不久的將來你不太可能達到 20 億。您可以通過創建索引和調整查詢來加快速度,如下所示:CREATE INDEX ON facility (crawled_name, (apartment_id)::int) WHERE NULLIF(facility_characteristic, '') IS NULL AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL; SELECT crawled_name, COUNT(*) FROM facility WHERE NULLIF(facility_characteristic, '') IS NULL AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL GROUP BY crawled_name, apartment_id::int HAVING COUNT(*) > 10000 LIMIT 10;
crawled_name
如果您有一個範圍在幾十個(或幾百個)範圍內的固定值列表,您可以通過將列的類型轉換為列舉來顯著加快速度。您必須先刪除使用此列的索引:DO $do$ DECLARE _l text[]; BEGIN _l := ARRAY(SELECT DISTINCT crawled_name FROM facility WHERE crawled_name > '' ORDER BY 1); EXECUTE $$CREATE TYPE crawled_name_enum AS enum ('$$ || array_to_string(_l, $$','$$) || $$');$$; ALTER TABLE facility ALTER COLUMN crawled_name TYPE crawled_name_enum USING (crawled_name::crawled_name_enum); END; $do$;