Why does PostgreSQL use a sequential scan instead of an index scan?

  • December 4, 2021

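Hi all, I have a problem with a PostgreSQL database query and am wondering if anyone can help. In some scenarios my query seems to ignore the index I created, which is used to join the two tables data and data_area. When this happens, it uses a sequential scan and the query runs much more slowly.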

Sequential scan (about 5 minutes)

Unique  (cost=15368261.82..15369053.96 rows=200 width=1942) (actual time=301266.832..301346.936 rows=153812 loops=1)
  CTE data
    ->  Bitmap Heap Scan on data  (cost=6086.77..610089.54 rows=321976 width=297) (actual time=26.286..197.625 rows=335130 loops=1)
          Recheck Cond: (datasetid = 1)
          Filter: ((readingdatetime >= '1920-01-01 00:00:00'::timestamp without time zone) AND (readingdatetime <= '2013-03-11 00:00:00'::timestamp without time zone) AND (depth >= 0::double precision) AND (depth <= 99999::double precision))
          ->  Bitmap Index Scan on data_datasetid_index  (cost=0.00..6006.27 rows=324789 width=0) (actual time=25.462..25.462 rows=335130 loops=1)
                Index Cond: (datasetid = 1)
  ->  Sort  (cost=15368261.82..15368657.89 rows=158427 width=1942) (actual time=301266.829..301287.110 rows=155194 loops=1)
        Sort Key: data.id
        Sort Method: quicksort  Memory: 81999kB
        ->  Hash Left Join  (cost=15174943.29..15354578.91 rows=158427 width=1942) (actual time=300068.588..301052.832 rows=155194 loops=1)
              Hash Cond: (data_area.area_id = area.id)
              ->  Hash Join  (cost=15174792.93..15351854.12 rows=158427 width=684) (actual time=300066.288..300971.644 rows=155194 loops=1)
                    Hash Cond: (data.id = data_area.data_id)
                    ->  CTE Scan on data  (cost=0.00..6439.52 rows=321976 width=676) (actual time=26.290..313.842 rows=335130 loops=1)
                    ->  Hash  (cost=14857017.62..14857017.62 rows=25422025 width=8) (actual time=300028.260..300028.260 rows=26709939 loops=1)
                          Buckets: 4194304  Batches: 1  Memory Usage: 1043357kB
                          ->  Seq Scan on data_area  (cost=0.00..14857017.62 rows=25422025 width=8) (actual time=182921.056..291687.996 rows=26709939 loops=1)
                                Filter: (area_id = ANY ('{28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11}'::integer[]))
              ->  Hash  (cost=108.49..108.49 rows=3349 width=1258) (actual time=2.256..2.256 rows=3349 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 584kB
                    ->  Seq Scan on area  (cost=0.00..108.49 rows=3349 width=1258) (actual time=0.007..0.666 rows=3349 loops=1)
Total runtime: 301493.379 ms

Index scan (about 3 seconds), on explain.depesz.com

Unique  (cost=17352256.47..17353067.50 rows=200 width=1942) (actual time=3603.303..3681.619 rows=153812 loops=1)
  CTE data
    ->  Bitmap Heap Scan on data  (cost=6284.60..619979.56 rows=332340 width=297) (actual time=26.201..262.314 rows=335130 loops=1)
          Recheck Cond: (datasetid = 1)
          Filter: ((readingdatetime >= '1920-01-01 00:00:00'::timestamp without time zone) AND (readingdatetime <= '2013-03-11 00:00:00'::timestamp without time zone) AND (depth >= 0::double precision) AND (depth <= 99999::double precision))
          ->  Bitmap Index Scan on data_datasetid_index  (cost=0.00..6201.51 rows=335354 width=0) (actual time=25.381..25.381 rows=335130 loops=1)
                Index Cond: (datasetid = 1)
  ->  Sort  (cost=17352256.47..17352661.98 rows=162206 width=1942) (actual time=3603.302..3623.113 rows=155194 loops=1)
        Sort Key: data.id
        Sort Method: quicksort  Memory: 81999kB
        ->  Hash Left Join  (cost=1296.08..17338219.59 rows=162206 width=1942) (actual time=29.980..3375.921 rows=155194 loops=1)
              Hash Cond: (data_area.area_id = area.id)
              ->  Nested Loop  (cost=0.00..17334287.66 rows=162206 width=684) (actual time=26.903..3268.674 rows=155194 loops=1)
                    ->  CTE Scan on data  (cost=0.00..6646.80 rows=332340 width=676) (actual time=26.205..421.858 rows=335130 loops=1)
                    ->  Index Scan using data_area_pkey on data_area  (cost=0.00..52.13 rows=1 width=8) (actual time=0.006..0.008 rows=0 loops=335130)
                          Index Cond: (data_id = data.id)
                          Filter: (area_id = ANY ('{28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11}'::integer[]))
              ->  Hash  (cost=1254.22..1254.22 rows=3349 width=1258) (actual time=3.057..3.057 rows=3349 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 584kB
                    ->  Index Scan using area_primary_key on area  (cost=0.00..1254.22 rows=3349 width=1258) (actual time=0.012..1.429 rows=3349 loops=1)
Total runtime: 3706.630 ms

Table structure

This is the structure of the data_area table. I can provide the other tables if needed.

CREATE TABLE data_area
(
 data_id integer NOT NULL,
 area_id integer NOT NULL,
 CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),
 CONSTRAINT data_area_area_id_fk FOREIGN KEY (area_id)
     REFERENCES area (id) MATCH SIMPLE
     ON UPDATE NO ACTION ON DELETE NO ACTION,
 CONSTRAINT data_area_data_id_fk FOREIGN KEY (data_id)
     REFERENCES data (id) MATCH SIMPLE
     ON UPDATE CASCADE ON DELETE CASCADE
);

Query

WITH data AS (
   SELECT * 
   FROM data 
   WHERE 
       datasetid IN (1) 
       AND (readingdatetime BETWEEN '1920-01-01' AND '2013-03-11') 
       AND depth BETWEEN 0 AND 99999
)
SELECT * 
FROM ( 
   SELECT DISTINCT ON (data.id) data.id, * 
   FROM 
       data, 
       data_area 
       LEFT JOIN area ON area_id = area.id 
   WHERE 
       data_id = data.id 
       AND area_id IN (28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11) 
) as s;

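The query returns 153812 rows. I ran set enable_seqscan = false; to disable sequential scans and obtain the index scan result.

I have tried running ANALYZE on the database and increasing the statistics gathered on the columns used in the query, but nothing seems to help.

Could anyone shed some light on this or suggest anything else I should try?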
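
For reference, the statistics changes mentioned in the question usually look something like the sketch below. The target value 1000 is an assumption chosen for illustration (the default statistics target is 100); the question does not say which value or which columns were actually used.

```sql
-- Collect a larger sample for the join and filter columns of data_area.
-- 1000 is an assumed value, not the one used in the question.
ALTER TABLE data_area ALTER COLUMN data_id SET STATISTICS 1000;
ALTER TABLE data_area ALTER COLUMN area_id SET STATISTICS 1000;

-- Refresh the planner statistics so the new target takes effect.
ANALYZE data_area;
```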

Note this line:

->  Index Scan using data_area_pkey on data_area  (cost=0.00..52.13 rows=1 width=8) 
   (actual time=0.006..0.008 rows=0 loops=335130)

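If you calculate the total cost, taking the loops into account, it is 52.13 * 335130 = 17470326.9. That is larger than the 14857017.62 estimated for the sequential scan alternative, which is why the planner does not use the index.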
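
The comparison can be reproduced directly in psql. This is only the back-of-the-envelope arithmetic from the two plans above, not the planner's full cost model; the column aliases are made up for readability.

```sql
-- Per-loop index scan cost times the number of loops,
-- compared with the one-off sequential scan cost from the slow plan.
SELECT 52.13 * 335130               AS nested_loop_index_cost,
       14857017.62                  AS seq_scan_cost,
       52.13 * 335130 > 14857017.62 AS index_path_estimated_dearer;
```

The last column comes out true: on paper the index path looks more expensive, even though it runs about 80 times faster in practice.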

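So the optimizer is overestimating the cost of the index scan. My guess is that your data is physically sorted in index order (because of a clustered index or the way it was loaded), and/or you have plenty of cache memory, and/or you have nice fast disks. As a result, very little random I/O actually takes place.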

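You should also check the correlation column in pg_stats, which the optimizer uses to assess clustering when it calculates the index cost, and finally try changing random_page_cost and cpu_index_tuple_cost to match your system.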
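
A minimal sketch of both checks follows. The trial values for the two cost settings are assumptions to experiment with, not recommendations; they are set at session level only, so nothing changes for other connections.

```sql
-- How closely the physical row order follows each column's order.
-- Values near +1 or -1 mean the table is well clustered on that column.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'data_area'
  AND attname IN ('data_id', 'area_id');

-- Session-only trial values (assumed for illustration).
-- The defaults are random_page_cost = 4.0 and cpu_index_tuple_cost = 0.005.
SET random_page_cost = 2.0;
SET cpu_index_tuple_cost = 0.001;

-- Re-run EXPLAIN ANALYZE on the query here, then restore the defaults:
RESET random_page_cost;
RESET cpu_index_tuple_cost;
```

If a lower random_page_cost makes the planner choose the index scan on its own, that suggests the default overstates the price of random reads on this hardware.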

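Your CTE does nothing except "outsource" some WHERE conditions, most of which look equivalent to WHERE TRUE. Since CTEs normally sit behind an optimization fence (meaning each one is optimized on its own, which is always the case in PostgreSQL releases before 12), they can help a lot with certain queries. In this case, however, I would expect exactly the opposite effect.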

I would try rewriting the query to be as simple as possible:

SELECT d.id, * 
FROM 
   data d 
   JOIN data_area da ON da.data_id = d.id
   LEFT JOIN area a ON da.area_id = a.id 
WHERE 
   d.datasetid IN (1) 
   AND da.area_id IN (28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11) 
   AND (readingdatetime BETWEEN '1920-01-01' AND '2013-03-11') -- this and the next condition don't do anything, I think
   AND depth BETWEEN 0 AND 99999
;

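Then check whether the index is used. There is also a good chance that you do not need all of the output columns (at least the two columns of the junction table are redundant).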
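
One way to run that check is sketched below, with a trimmed column list. The selected columns are only the ones that appear in the question; which columns the application really needs is an assumption here. The BUFFERS option requires PostgreSQL 9.0 or later.

```sql
-- Plan and execute the simplified query with default planner settings.
-- In the output, look for "Index Scan using data_area_pkey on data_area"
-- rather than "Seq Scan on data_area".
EXPLAIN (ANALYZE, BUFFERS)
SELECT d.id, d.datasetid, d.readingdatetime, d.depth, a.id AS area_id
FROM
   data d
   JOIN data_area da ON da.data_id = d.id
   LEFT JOIN area a ON da.area_id = a.id
WHERE
   d.datasetid IN (1)
   AND da.area_id IN (28,29,30,31,32,33,25,26,27,18,19,20,21,12,13,14,15,16,17,34,35,1,2,3,4,5,6,22,23,24,7,8,9,10,11)
   AND d.readingdatetime BETWEEN '1920-01-01' AND '2013-03-11'
   AND d.depth BETWEEN 0 AND 99999
;
```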

Please report back, and tell us which PostgreSQL version you are using.

Source: https://dba.stackexchange.com/questions/36374