使用 where 子句計算近似值

June 2, 2016

我有以下問題，我們有一張公寓設施表，看起來像這樣：

我想執行以下查詢

SELECT crawled_name, count(f.id) AS count FROM facility f
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
GROUP BY f.crawled_name, apartment_id
 having count(apartment_id) &gt; 10000
LIMIT 10;

在表達到 1 億個條目後，此查詢變得緩慢。我已經嘗試在 facility_characteristic 上創建索引，但由於 where 子句匹配超過 5-10% 的條目，postgres 執行順序掃描。

我找到了關於使用近似值來加快 Postgresql 計數的答案

select DISTINCT(crawled_name), a.estimate_ct from facility f
INNER JOIN (SELECT v."crawled_name" as name, (c.reltuples * freq)::int AS estimate_ct
          FROM   pg_stats s
            CROSS  JOIN LATERAL
                          unnest(s.most_common_vals::text::text[]  -- use your actual data type
                          , s.most_common_freqs) WITH ORDINALITY v ("crawled_name", freq, ord)
            CROSS  JOIN (
                          SELECT reltuples FROM pg_class
                          WHERE oid = regclass 'facility'
                        ) c
          WHERE  schemaname = 'public'
                 AND    tablename  = 'facility'
                 AND    attname    = 'crawled_name'  -- case sensitive
          ORDER  BY v.ord
          LIMIT  100) as a on f.crawled_name = a.name
where
 (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
order by a.estimate_ct desc;

此查詢速度更快但不夠快。有人可以幫助我提高速度嗎？

結果

EXPLAIN (ANALYZE, BUFFERS) SELECT crawled_name, count(f.id) AS count FROM facility f
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
GROUP BY f.crawled_name, apartment_id
 having count(apartment_id) &gt; 10000
LIMIT 10;

是

Limit  (cost=11632509.62..11632511.00 rows=10 width=28) (actual time=1648959.720..1648959.720 rows=0 loops=1)
 Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
 -&gt;  GroupAggregate  (cost=11632509.62..12224422.47 rows=4304821 width=28) (actual time=1648959.718..1648959.718 rows=0 loops=1)
       Group Key: crawled_name, apartment_id
       Filter: (count(apartment_id) &gt; 10000)
       Rows Removed by Filter: 27660633
       Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
       -&gt;  Sort  (cost=11632509.62..11740130.14 rows=43048207 width=28) (actual time=1341268.470..1633080.886 rows=39997609 loops=1)
             Sort Key: crawled_name, apartment_id
             Sort Method: external merge  Disk: 1679168kB
             Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
             -&gt;  Seq Scan on facility f  (cost=0.00..3084227.90 rows=43048207 width=28) (actual time=0.026..106099.542 rows=39997609 loops=1)
                   Filter: ((facility_characteristic IS NULL) OR (facility_characteristic = ''::text))
                   Rows Removed by Filter: 63180551
                   Buffers: shared hit=82712 read=1625330 dirtied=191588 written=64279
Planning time: 0.787 ms
Execution time: 1649266.193 ms

在 ziggy 回答後解釋分析輸出：

   Limit  (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.144..36064.183 rows=72 loops=1)
 -&gt;  Sort  (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.142..36064.154 rows=72 loops=1)
       Sort Key: (count(*))
       Sort Method: quicksort  Memory: 30kB
       -&gt;  HashAggregate  (cost=886570.29..886570.41 rows=9 width=12) (actual time=36053.920..36064.093 rows=72 loops=1)
             Group Key: crawled_name
             Filter: (count(*) &gt; 100000)
             Rows Removed by Filter: 57147
             -&gt;  Bitmap Heap Scan on facility  (cost=12260.48..882687.80 rows=517666 width=12) (actual time=2209.363..22588.928 rows=40050167 loops=1)
                   Recheck Cond: ((crawled_name IS NOT NULL) AND (NULLIF(facility_characteristic, ''::text) IS NULL))
                   Rows Removed by Index Recheck: 57509124
                   Heap Blocks: exact=33902 lossy=950736
                   -&gt;  Bitmap Index Scan on facility_crawled_name_idx  (cost=0.00..12131.06 rows=517666 width=0) (actual time=2201.781..2201.781 rows=40193222 loops=1)
                         Index Cond: (crawled_name IS NOT NULL)
Planning time: 0.103 ms
Execution time: 36065.583 ms

你有多少獨特的價值crawled_name？apartment_id實現超過 20 億的價值（在未來 5-10 年內）有多可行？
請注意，COUNT(id)它與COUNT(*)（因為id不可為空）相同，並且COUNT(apartment_id) > 10000有效地COUNT(*) > 10000與WHERE apartment_id IS NOT NULL. 您所做的不僅使理解查詢變得複雜，而且還可能導致 PostgreSQL 選擇次優策略。通過使用COUNT(*)instead of COUNT(id)，PostgreSQL 可以在不包含的索引上計算整個事物，並且根本id不需要讀取表，因此它應該處理得更快。
假設你目前的apartment_id價值只有幾百萬，而且在不久的將來你不太可能達到 20 億。您可以通過創建索引和調整查詢來加快速度，如下所示：
CREATE INDEX ON facility (crawled_name, (apartment_id)::int)
 WHERE NULLIF(facility_characteristic, '') IS NULL
       AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;

SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
     AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) &gt; 10000
LIMIT 10;
crawled_name如果您有一個範圍在幾十個（或幾百個）範圍內的固定值列表，您可以通過將列的類型轉換為列舉來顯著加快速度。您必須先刪除使用此列的索引：
DO $do$ DECLARE _l text[]; BEGIN
 _l := ARRAY(SELECT DISTINCT crawled_name FROM facility WHERE crawled_name &gt; '' ORDER BY 1);
 EXECUTE $$CREATE TYPE crawled_name_enum AS enum ('$$ ||
           array_to_string(_l, $$','$$) || $$');$$;

 ALTER TABLE facility ALTER COLUMN crawled_name
   TYPE crawled_name_enum USING (crawled_name::crawled_name_enum);
END; $do$;

引用自：https://dba.stackexchange.com/questions/140112

使用 where 子句計算近似值

相關問答

我們可以為 JSONB 數據類型的鍵/值創建索引嗎？

提高 PostgreSQL 表中大掃描計算的性能

Postgres 正在執行順序掃描而不是索引掃描

提高大型 PostgresSQL 表中 COUNT/GROUP-BY 的性能？

當沒有結果並且指定了 LIMIT 時，SELECT 非常慢

優化“WHERE x BETWEEN a AND b GROUP BY y”查詢