在 PostgreSQL 中,如何使點陣圖索引掃描並行化?
我想提高 SQL 語句的性能。
我在第 13 版。這是範常式式碼和我感興趣的查詢。
drop table ords; CREATE TABLE ords ( ORD_ID INT NOT NULL, CUST_ID VARCHAR(10) NOT NULL, ORD_DATE DATE NOT NULL, ETC_CONTENT VARCHAR(100)); ALTER TABLE ords ADD CONSTRAINT ORDS_PK PRIMARY KEY(ORD_ID); CREATE INDEX ORDS_X01 ON ORDS (CUST_ID); INSERT INTO ORDS SELECT i ,lpad(mod(i,1000)::text,10,'cust') ,date '2021-06-07'+mod(i,624) ,rpad('x',100,'x') FROM generate_series(1,1000000) a(i); drop table delivery; CREATE TABLE delivery ( ORD_ID INT NOT NULL, VEHICLE_ID VARCHAR(10) NOT NULL, START_DATE DATE NOT NULL, END_DATE DATE NOT NULL, ETC_REMARKS VARCHAR(100)); INSERT INTO DELIVERY SELECT i , MOD(i,1000) , date '2021-01-01' + mod(i,1000) , date '2021-01-05' + mod(i,1000) , rpad('x',100,'x') FROM generate_series(1,1000000) a(i); ALTER TABLE DELIVERY ADD CONSTRAINT DELIVERY_PK primary key (ORD_ID); CREATE INDEX DELIVERY_X01 ON DELIVERY(END_DATE, START_DATE); CREATE INDEX DELIVERY_X02 ON DELIVERY(VEHICLE_ID); select pg_relation_size('ords'), pg_relation_size('delivery'); analyze ords; analyze delivery; EXPLAIN(ANALYZE, BUFFERS, COSTS OFF) SELECT A.*, B.* FROM ORDS A LEFT JOIN DELIVERY B ON (A.ORD_ID = B.ORD_ID AND (B.START_DATE <= DATE '2021-07-12' AND B.END_DATE >= DATE '2021-07-10' OR (B.VEHICLE_ID > '990') ) ) WHERE A.ORD_DATE BETWEEN DATE '2021-06-01' AND DATE '2021-07-10' ;
下面是執行計劃。
Gather (actual time=86.645..101.685 rows=54501 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=13615 read=23995, temp read=1196 written=1272 -> Parallel Hash Left Join (actual time=83.360..87.135 rows=18167 loops=3) Hash Cond: (a.ord_id = b.ord_id) Buffers: shared hit=13614 read=23995, temp read=1196 written=1272 -> Parallel Seq Scan on ords a (actual time=0.047..34.335 rows=18167 loops=3) Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date)) Rows Removed by Filter: 315166 Buffers: shared hit=4968 read=14263 -> Parallel Hash (actual time=42.999..42.999 rows=5333 loops=3) Buckets: 32768 Batches: 8 Memory Usage: 608kB Buffers: shared hit=8450 read=9732, temp written=280 -> Parallel Seq Scan on delivery b (actual time=0.069..40.615 rows=5333 loops=3) Filter: (((start_date <= '2021-07-12'::date) AND (end_date >= '2021-07-10'::date)) OR ((vehicle_id)::text > '990'::text)) Rows Removed by Filter: 328000 Buffers: shared hit=8450 read=9732 Planning: Buffers: shared hit=20 Planning Time: 0.357 ms Execution Time: 103.282 ms
我曾預計在從 DELIVERY 表中獲取行時,會出現兩個使用 delivery_x01 和 delivery_x02 的點陣圖索引掃描,然後是 BitmapOr 操作。與我想的不同,planner 選擇使用並行性進行表掃描。
為了將我預期的執行計劃與 PostgreSQL 選擇的計劃進行比較,我將參數 max_parallel_workers_per_gather 設置為 0 並重新執行 SQL 語句。
設置 max_parallel_workers_per_gather = 0;
–我重新執行了查詢,這是生成的執行計劃。
Hash Right Join (actual time=100.080..119.375 rows=54501 loops=1) Hash Cond: (b.ord_id = a.ord_id) Buffers: shared hit=3304 read=18847, temp read=903 written=903 -> Bitmap Heap Scan on delivery b (actual time=1.374..4.277 rows=16000 loops=1) Recheck Cond: (((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date)) OR ((ve hicle_id)::text > '990'::text)) Heap Blocks: exact=2182 Buffers: shared hit=2919 -> BitmapOr (actual time=1.108..1.109 rows=0 loops=1) Buffers: shared hit=737 -> Bitmap Index Scan on delivery_x01 (actual time=0.809..0.810 rows=7000 loops=1) Index Cond: ((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date) ) Buffers: shared hit=726 -> Bitmap Index Scan on delivery_x02 (actual time=0.298..0.298 rows=9000 loops=1) Index Cond: ((vehicle_id)::text > '990'::text) Buffers: shared hit=11 -> Hash (actual time=98.373..98.374 rows=54501 loops=1) Buckets: 32768 Batches: 4 Memory Usage: 2331kB Buffers: shared hit=384 read=18847, temp written=697 -> Seq Scan on ords a (actual time=0.122..85.072 rows=54501 loops=1) Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date)) Rows Removed by Filter: 945499 Buffers: shared hit=384 read=18847 Planning: Buffers: shared hit=12 Planning Time: 0.232 ms Execution Time: 120.843 ms
通過使用 Bitmap Index Scan 和 BitmapOr 操作,我可以減少塊 I/O 的數量,但執行時間從 103 ms 增加到 120 ms。似乎並行性是執行時間差距的主要因素。所以我推斷如果在點陣圖索引掃描操作中啟動並行性,查詢會變得更快。
最後,我的問題是:如何使點陣圖索引掃描操作並行化?
以下是我想要得到的執行計劃。
Gather Workers Planned: 2 Workers Launched: 2 Parallel Hash Right Join -> Hash Cond: (b.ord_id = a.ord_id) -> Parallel Bitmap Heap Scan on delivery -> BitmapOr -> Parallel Bitmap Index Scan on delivery_x01 -> Parallel Bitmap INdex Scan on delivery_x02 -> Parallel Hash -> Parallel Seq Scan on ords
據我所知,沒有並行點陣圖索引掃描之類的東西。所以為了讓它使用其中的一個,你需要做的第一件事就是實現它。但是您需要某種聚合器節點位於其之上。我還不清楚如何以正確和高效的方式實現它。
現在,如果您確實實現了這一點,它可能不會有太大的不同。我認為根本問題在於您將範圍表示為單獨的起點和終點,而不是實際範圍。PostgreSQL 支持範圍類型作為一等實體,使用它們可以解決很多問題。
如果我執行,
SET enable_seqscan=0
強制它並重新執行你的SELECT
,我得到一個並行點陣圖掃描。QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Gather (actual time=105.482..119.050 rows=54501 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=3002 read=26091 written=6, temp read=1194 written=1264 -> Parallel Hash Left Join (actual time=99.531..102.736 rows=18167 loops=3) Hash Cond: (a.ord_id = b.ord_id) Buffers: shared hit=3002 read=26091 written=6, temp read=1194 written=1264 -> Parallel Index Scan using ords_pk on ords a (actual time=0.097..65.549 rows=18167 loops=3) Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date)) Rows Removed by Filter: 315166 Buffers: shared hit=2911 read=21661 written=6 -> Parallel Hash (actual time=29.389..29.391 rows=5333 loops=3) Buckets: 32768 Batches: 8 Memory Usage: 576kB Buffers: shared hit=7 read=4430, temp written=280 -> Parallel Bitmap Heap Scan on delivery b (actual time=23.930..27.802 rows=5333 loops=3) Recheck Cond: (((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date)) OR ((vehicle_id)::text > '990'::text)) Heap Blocks: exact=973 Buffers: shared hit=7 read=4430 -> BitmapOr (actual time=28.203..28.205 rows=0 loops=1) Buffers: shared hit=3 read=2252 -> Bitmap Index Scan on delivery_x01 (actual time=26.972..26.972 rows=7000 loops=1) Index Cond: ((end_date >= '2021-07-10'::date) AND (start_date <= '2021-07-12'::date)) Buffers: shared read=2227 -> Bitmap Index Scan on delivery_x02 (actual time=1.229..1.229 rows=9000 loops=1) Index Cond: ((vehicle_id)::text > '990'::text) Buffers: shared hit=3 read=25 Planning Time: 0.748 ms Execution Time: 120.634 ms (28 rows)
那裡的大問題是,使用 seqscan 仍然更快,雖然不是很多
QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Gather (actual time=96.129..109.602 rows=54501 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=14282 read=23295, temp read=1199 written=1248 -> Parallel Hash Left Join (actual time=91.822..95.166 rows=18167 loops=3) Hash Cond: (a.ord_id = b.ord_id) Buffers: shared hit=14282 read=23295, temp read=1199 written=1248 -> Parallel Seq Scan on ords a (actual time=0.042..31.783 rows=18167 loops=3) Filter: ((ord_date >= '2021-06-01'::date) AND (ord_date <= '2021-07-10'::date)) Rows Removed by Filter: 315166 Buffers: shared hit=14126 read=5105 -> Parallel Hash (actual time=56.069..56.070 rows=5333 loops=3) Buckets: 32768 Batches: 8 Memory Usage: 608kB Buffers: shared hit=6 read=18176, temp written=256 -> Parallel Seq Scan on delivery b (actual time=0.098..54.431 rows=5333 loops=3) Filter: (((start_date <= '2021-07-12'::date) AND (end_date >= '2021-07-10'::date)) OR ((vehicle_id)::text > '990'::text)) Rows Removed by Filter: 328000 Buffers: shared hit=6 read=18176 Planning Time: 0.764 ms Execution Time: 111.118 ms (20 rows)
所以真正的問題是你為什麼想要這個?而給定一個並行點陣圖掃描速度更快的場景,PostgreSQL會不會自然而然地選擇它呢?