PostgreSQL

Calculating an approximate count using a WHERE clause

  • June 2, 2016

I have the following problem: we have a table of apartment facilities that looks like this:

(screenshot of the facility table, not reproduced here)
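
Judging from the queries below, the table contains at least the following columns (a sketch; the column types are assumptions):

-- Assumed shape of the facility table; only the columns referenced in the
-- queries are known, the types are guesses.
CREATE TABLE facility (
    id                      bigserial PRIMARY KEY,
    apartment_id            bigint,
    crawled_name            text,
    facility_characteristic text
);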

I want to run the following query:

SELECT crawled_name, count(f.id) AS count FROM facility f
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
GROUP BY f.crawled_name, apartment_id
HAVING count(apartment_id) > 10000
LIMIT 10;

This query became slow once the table reached 100 million rows. I have already tried creating an index on facility_characteristic, but because the WHERE clause matches more than 5-10% of the rows, Postgres performs a sequential scan anyway.
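
For context, such an index would be a plain b-tree on the column, roughly as sketched below; with about 40 million of the ~103 million rows matching the predicate (see the EXPLAIN output further down), the planner reasonably prefers a sequential scan.

-- Assumed form of the single-column index that was tried. Because roughly
-- 40% of all rows match the WHERE clause, the planner ignores it and
-- sequentially scans the table instead.
CREATE INDEX ON facility (facility_characteristic);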

I found an answer about speeding up counts in PostgreSQL using approximate values, and adapted it into the query below.
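
The core idea of that answer is to read the planner's statistics instead of scanning the table; in its simplest form the estimate comes straight from pg_class (a minimal sketch, not the adapted query itself):

-- Fast approximate row count from planner statistics; accuracy depends on
-- how recently the table was ANALYZEd or autovacuumed.
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'facility'::regclass;

The adapted query, which pulls per-value frequencies from pg_stats: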

select DISTINCT(crawled_name), a.estimate_ct from facility f
INNER JOIN (SELECT v."crawled_name" as name, (c.reltuples * freq)::int AS estimate_ct
          FROM   pg_stats s
            CROSS  JOIN LATERAL
                          unnest(s.most_common_vals::text::text[]  -- use your actual data type
                          , s.most_common_freqs) WITH ORDINALITY v ("crawled_name", freq, ord)
            CROSS  JOIN (
                          SELECT reltuples FROM pg_class
                          WHERE oid = regclass 'facility'
                        ) c
          WHERE  schemaname = 'public'
                 AND    tablename  = 'facility'
                 AND    attname    = 'crawled_name'  -- case sensitive
          ORDER  BY v.ord
          LIMIT  100) as a on f.crawled_name = a.name
where
 (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
order by a.estimate_ct desc;

This query is faster, but still not fast enough. Can anyone help me speed it up further?

Results

EXPLAIN (ANALYZE, BUFFERS) SELECT crawled_name, count(f.id) AS count FROM facility f
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
GROUP BY f.crawled_name, apartment_id
HAVING count(apartment_id) > 10000
LIMIT 10;

Limit  (cost=11632509.62..11632511.00 rows=10 width=28) (actual time=1648959.720..1648959.720 rows=0 loops=1)
 Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
 ->  GroupAggregate  (cost=11632509.62..12224422.47 rows=4304821 width=28) (actual time=1648959.718..1648959.718 rows=0 loops=1)
       Group Key: crawled_name, apartment_id
       Filter: (count(apartment_id) > 10000)
       Rows Removed by Filter: 27660633
       Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
       ->  Sort  (cost=11632509.62..11740130.14 rows=43048207 width=28) (actual time=1341268.470..1633080.886 rows=39997609 loops=1)
             Sort Key: crawled_name, apartment_id
             Sort Method: external merge  Disk: 1679168kB
             Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
             ->  Seq Scan on facility f  (cost=0.00..3084227.90 rows=43048207 width=28) (actual time=0.026..106099.542 rows=39997609 loops=1)
                   Filter: ((facility_characteristic IS NULL) OR (facility_characteristic = ''::text))
                   Rows Removed by Filter: 63180551
                   Buffers: shared hit=82712 read=1625330 dirtied=191588 written=64279
Planning time: 0.787 ms
Execution time: 1649266.193 ms

EXPLAIN ANALYZE output after ziggy's answer:

   Limit  (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.144..36064.183 rows=72 loops=1)
 ->  Sort  (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.142..36064.154 rows=72 loops=1)
       Sort Key: (count(*))
       Sort Method: quicksort  Memory: 30kB
       ->  HashAggregate  (cost=886570.29..886570.41 rows=9 width=12) (actual time=36053.920..36064.093 rows=72 loops=1)
             Group Key: crawled_name
             Filter: (count(*) > 100000)
             Rows Removed by Filter: 57147
             ->  Bitmap Heap Scan on facility  (cost=12260.48..882687.80 rows=517666 width=12) (actual time=2209.363..22588.928 rows=40050167 loops=1)
                   Recheck Cond: ((crawled_name IS NOT NULL) AND (NULLIF(facility_characteristic, ''::text) IS NULL))
                   Rows Removed by Index Recheck: 57509124
                   Heap Blocks: exact=33902 lossy=950736
                   ->  Bitmap Index Scan on facility_crawled_name_idx  (cost=0.00..12131.06 rows=517666 width=0) (actual time=2201.781..2201.781 rows=40193222 loops=1)
                         Index Cond: (crawled_name IS NOT NULL)
Planning time: 0.103 ms
Execution time: 36065.583 ms

How many distinct values of crawled_name do you have, and how likely is it that apartment_id will reach values above 2 billion (within the next 5-10 years)?

Note that COUNT(id) is the same as COUNT(*) (because id is not nullable), and that COUNT(apartment_id) > 10000 is effectively COUNT(*) > 10000 combined with WHERE apartment_id IS NOT NULL. What you wrote not only makes the query harder to understand, it may also lead PostgreSQL to choose a suboptimal strategy. By using COUNT(*) instead of COUNT(id), PostgreSQL can compute the whole thing on an index that does not contain id and never has to read the table at all, so it should be processed faster.
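
Put as code, the equivalence described above looks like this (a sketch; the facility_characteristic filter is omitted for brevity):

-- count(id) counts rows where id IS NOT NULL; since id is NOT NULL, it is
-- identical to count(*). Likewise, count(apartment_id) > 10000 only counts
-- rows with a non-NULL apartment_id, which is clearer written as:
SELECT crawled_name, apartment_id, count(*)
FROM   facility
WHERE  apartment_id IS NOT NULL
GROUP  BY crawled_name, apartment_id
HAVING count(*) > 10000;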

Assuming your apartment_id values are currently only in the millions and you are unlikely to reach 2 billion in the near future, you can speed this up by creating an index and tweaking the query as follows:

CREATE INDEX ON facility (crawled_name, (apartment_id::int))
 WHERE NULLIF(facility_characteristic, '') IS NULL
       AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;

SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
     AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) > 10000
LIMIT 10;
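
Whether the new partial index is actually used can be checked by re-running the query under EXPLAIN; an index-only scan also needs a reasonably fresh visibility map, so a VACUUM beforehand helps (a sketch):

-- Refresh statistics and the visibility map so an index-only scan is possible,
-- then check that the plan uses the partial index created above.
VACUUM ANALYZE facility;

EXPLAIN (ANALYZE, BUFFERS)
SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
      AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) > 10000
LIMIT 10;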

If crawled_name has a fixed list of values in the range of a few dozen (or a few hundred), you can speed things up significantly by converting the column's type to an enum. You have to drop the indexes that use this column first:

DO $do$
DECLARE _l text[];
BEGIN
 -- collect all distinct, non-empty crawled_name values as enum labels
 _l := ARRAY(SELECT DISTINCT crawled_name FROM facility WHERE crawled_name > '' ORDER BY 1);

 -- build and run the CREATE TYPE statement dynamically
 EXECUTE $$CREATE TYPE crawled_name_enum AS enum ('$$ ||
           array_to_string(_l, $$','$$) || $$');$$;

 -- rewrite the column in place to the new enum type
 ALTER TABLE facility ALTER COLUMN crawled_name
   TYPE crawled_name_enum USING (crawled_name::crawled_name_enum);
END; $do$;
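
Around the type change, the dependent index from above would have to be dropped first and recreated afterwards, roughly like this (the index name is hypothetical; the real names can be listed with \d facility in psql):

-- Hypothetical index name; look up the actual one with \d facility.
DROP INDEX IF EXISTS facility_crawled_name_apartment_id_idx;

-- ... run the DO block above to convert the column to the enum type ...

-- Recreate the partial index on the (now enum-typed) crawled_name column.
CREATE INDEX ON facility (crawled_name, (apartment_id::int))
 WHERE NULLIF(facility_characteristic, '') IS NULL
       AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;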

Quoted from: https://dba.stackexchange.com/questions/140112