Postgresql

在 PostgreSQL 中,如何使點陣圖索引掃描並行化?

  • June 6, 2021

我想提高 SQL 語句的性能。

我在第 13 版。這是範常式式碼和我感興趣的查詢。

drop table ords;
CREATE TABLE ords (
ORD_ID INT NOT NULL,
CUST_ID VARCHAR(10) NOT NULL,
ORD_DATE DATE NOT NULL,
ETC_CONTENT VARCHAR(100));

ALTER TABLE ords ADD CONSTRAINT ORDS_PK PRIMARY KEY(ORD_ID);

CREATE INDEX ORDS_X01 ON ORDS (CUST_ID);

INSERT INTO ORDS
SELECT i
     ,lpad(mod(i,1000)::text,10,'cust')
     ,date '2021-06-07'+mod(i,624)
     ,rpad('x',100,'x')
 FROM generate_series(1,1000000) a(i);

drop table delivery;

CREATE TABLE delivery (
ORD_ID INT NOT NULL,
VEHICLE_ID VARCHAR(10) NOT NULL,
START_DATE DATE NOT NULL,
END_DATE   DATE NOT NULL,
ETC_REMARKS VARCHAR(100));

INSERT INTO DELIVERY
SELECT i
    , MOD(i,1000)
    , date '2021-01-01' + mod(i,1000)
    , date '2021-01-05' + mod(i,1000)
    , rpad('x',100,'x')
 FROM generate_series(1,1000000) a(i);

ALTER TABLE DELIVERY ADD CONSTRAINT DELIVERY_PK primary key (ORD_ID);
CREATE INDEX DELIVERY_X01 ON DELIVERY(END_DATE, START_DATE);
CREATE INDEX DELIVERY_X02 ON DELIVERY(VEHICLE_ID);
select pg_relation_size('ords'), pg_relation_size('delivery');
analyze ords;
analyze delivery;
EXPLAIN(ANALYZE, BUFFERS, COSTS OFF)
SELECT A.*, B.*
 FROM ORDS A LEFT JOIN DELIVERY B
   ON (A.ORD_ID = B.ORD_ID
       AND (B.START_DATE <= DATE '2021-07-12' AND B.END_DATE >= DATE '2021-07-10'
            OR (B.VEHICLE_ID > '990')
            )
       )
WHERE A.ORD_DATE BETWEEN DATE '2021-06-01' AND DATE '2021-07-10'
;

下面是執行計劃。

Gather (actual time=86.645..101.685 rows=54501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=13615 read=23995, temp read=1196 written=1272
  ->  Parallel Hash Left Join (actual time=83.360..87.135 rows=18167 loops=3)
        Hash Cond: (a.ord_id = b.ord_id)
        Buffers: shared hit=13614 read=23995, temp read=1196 written=1272
        ->  Parallel Seq Scan on ords a (actual time=0.047..34.335 rows=18167 loops=3)
              Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date))
              Rows Removed by Filter: 315166
              Buffers: shared hit=4968 read=14263
        ->  Parallel Hash (actual time=42.999..42.999 rows=5333 loops=3)
              Buckets: 32768  Batches: 8  Memory Usage: 608kB
              Buffers: shared hit=8450 read=9732, temp written=280
              ->  Parallel Seq Scan on delivery b (actual time=0.069..40.615 rows=5333 loops=3)
                    Filter: (((start_date <= '2021-07-12'::date) AND (end_date >= '2021-07-10'::date)) OR ((vehicle_id)::text > '990'::text))
                    Rows Removed by Filter: 328000
                    Buffers: shared hit=8450 read=9732
Planning:
  Buffers: shared hit=20
Planning Time: 0.357 ms
Execution Time: 103.282 ms

我曾預計在從 DELIVERY 表中獲取行時,會出現兩個使用 delivery_x01 和 delivery_x02 的點陣圖索引掃描,然後是 BitmapOr 操作。與我想的不同,planner 選擇使用並行性進行表掃描。

為了將我預期的執行計劃與 PostgreSQL 選擇的計劃進行比較,我將參數 max_parallel_workers_per_gather 設置為 0 並重新執行 SQL 語句。

設置 max_parallel_workers_per_gather = 0;

–我重新執行了查詢,這是生成的執行計劃。

Hash Right Join (actual time=100.080..119.375 rows=54501 loops=1)
  Hash Cond: (b.ord_id = a.ord_id)
  Buffers: shared hit=3304 read=18847, temp read=903 written=903
  ->  Bitmap Heap Scan on delivery b (actual time=1.374..4.277 rows=16000 loops=1)
        Recheck Cond: (((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date)) OR ((ve
hicle_id)::text > '990'::text))
        Heap Blocks: exact=2182
        Buffers: shared hit=2919
        ->  BitmapOr (actual time=1.108..1.109 rows=0 loops=1)
              Buffers: shared hit=737
              ->  Bitmap Index Scan on delivery_x01 (actual time=0.809..0.810 rows=7000 loops=1)
                    Index Cond: ((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date)
)
                    Buffers: shared hit=726
              ->  Bitmap Index Scan on delivery_x02 (actual time=0.298..0.298 rows=9000 loops=1)
                    Index Cond: ((vehicle_id)::text > '990'::text)
                    Buffers: shared hit=11
  ->  Hash (actual time=98.373..98.374 rows=54501 loops=1)
        Buckets: 32768  Batches: 4  Memory Usage: 2331kB
        Buffers: shared hit=384 read=18847, temp written=697
        ->  Seq Scan on ords a (actual time=0.122..85.072 rows=54501 loops=1)
              Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date))
              Rows Removed by Filter: 945499
              Buffers: shared hit=384 read=18847
Planning:
  Buffers: shared hit=12
Planning Time: 0.232 ms
Execution Time: 120.843 ms

通過使用 Bitmap Index Scan 和 BitmapOr 操作,我可以減少塊 I/O 的數量,但執行時間從 103 ms 增加到 120 ms。似乎並行性是執行時間差距的主要因素。所以我推斷如果在點陣圖索引掃描操作中啟動並行性,查詢會變得更快。

最後,我的問題是:如何使點陣圖索引掃描操作並行化?


以下是我想要得到的執行計劃。

Gather
 Workers Planned: 2
 Workers Launched: 2
 Parallel Hash Right Join
 ->  Hash Cond: (b.ord_id = a.ord_id)
   -> Parallel Bitmap Heap Scan on delivery
      -> BitmapOr
         -> Parallel Bitmap Index Scan on delivery_x01
         -> Parallel Bitmap INdex Scan on delivery_x02
   -> Parallel Hash
      -> Parallel Seq Scan on ords

據我所知,沒有並行點陣圖索引掃描之類的東西。所以為了讓它使用其中的一個,你需要做的第一件事就是實現它。但是您需要某種聚合器節點位於其之上。我還不清楚如何以正確和高效的方式實現它。

現在,如果您確實實現了這一點,它可能不會有太大的不同。我認為根本問題在於您將範圍表示為單獨的起點和終點,而不是實際範圍。PostgreSQL 支持範圍類型作為一等實體,使用它們可以解決很多問題。

如果我執行,SET enable_seqscan=0強制它並重新執行你的SELECT,我得到一個並行點陣圖掃描。

                                                                    QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
Gather (actual time=105.482..119.050 rows=54501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=3002 read=26091 written=6, temp read=1194 written=1264
  ->  Parallel Hash Left Join (actual time=99.531..102.736 rows=18167 loops=3)
        Hash Cond: (a.ord_id = b.ord_id)
        Buffers: shared hit=3002 read=26091 written=6, temp read=1194 written=1264
        ->  Parallel Index Scan using ords_pk on ords a (actual time=0.097..65.549 rows=18167 loops=3)
              Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date))
              Rows Removed by Filter: 315166
              Buffers: shared hit=2911 read=21661 written=6
        ->  Parallel Hash (actual time=29.389..29.391 rows=5333 loops=3)
              Buckets: 32768  Batches: 8  Memory Usage: 576kB
              Buffers: shared hit=7 read=4430, temp written=280
              ->  Parallel Bitmap Heap Scan on delivery b (actual time=23.930..27.802 rows=5333 loops=3)
                    Recheck Cond: (((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date)) OR ((vehicle_id)::text > '990'::text))
                    Heap Blocks: exact=973
                    Buffers: shared hit=7 read=4430
                    ->  BitmapOr (actual time=28.203..28.205 rows=0 loops=1)
                          Buffers: shared hit=3 read=2252
                          ->  Bitmap Index Scan on delivery_x01 (actual time=26.972..26.972 rows=7000 loops=1)
                                Index Cond: ((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date))
                                Buffers: shared read=2227
                          ->  Bitmap Index Scan on delivery_x02 (actual time=1.229..1.229 rows=9000 loops=1)
                                Index Cond: ((vehicle_id)::text > '990'::text)
                                Buffers: shared hit=3 read=25
Planning Time: 0.748 ms
Execution Time: 120.634 ms
(28 rows)

那裡的大問題是,使用 seqscan 仍然更快,雖然不是很多

                                                                 QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
Gather (actual time=96.129..109.602 rows=54501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=14282 read=23295, temp read=1199 written=1248
  ->  Parallel Hash Left Join (actual time=91.822..95.166 rows=18167 loops=3)
        Hash Cond: (a.ord_id = b.ord_id)
        Buffers: shared hit=14282 read=23295, temp read=1199 written=1248
        ->  Parallel Seq Scan on ords a (actual time=0.042..31.783 rows=18167 loops=3)
              Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date))
              Rows Removed by Filter: 315166
              Buffers: shared hit=14126 read=5105
        ->  Parallel Hash (actual time=56.069..56.070 rows=5333 loops=3)
              Buckets: 32768  Batches: 8  Memory Usage: 608kB
              Buffers: shared hit=6 read=18176, temp written=256
              ->  Parallel Seq Scan on delivery b (actual time=0.098..54.431 rows=5333 loops=3)
                    Filter: (((start_date <= '2021-07-12'::date) AND (end_date >= '2021-07-10'::date)) OR ((vehicle_id)::text > '990'::text))
                    Rows Removed by Filter: 328000
                    Buffers: shared hit=6 read=18176
Planning Time: 0.764 ms
Execution Time: 111.118 ms
(20 rows)

所以真正的問題是你為什麼想要這個?而給定一個並行點陣圖掃描速度更快的場景,PostgreSQL會不會自然而然地選擇它呢?

引用自:https://dba.stackexchange.com/questions/293830