PostgreSQL 在使用 NOT-IN 時不會生成僅索引計劃,但在將 IN 與多列覆蓋索引一起使用時會生成
我有兩張桌子:
A(id, x, cid)
和B(cid)
。我需要從 A 中獲取一些記錄,並排除 B 中具有相同 cid 的記錄。我有 btree 索引
A (cid, id)
。查詢 1(未輸入):
select id from a where cid not in (select distinct cid from b);
計劃:
Seq Scan on a (cost=20418.20..340328.83 rows=5128145 width=8) (actual time=367.337..3820.175 rows=169046 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 10087249 SubPlan 1 -> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=351.301..357.395 rows=41493 loops=1) Group Key: b.cid -> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.530..148.063 rows=1132787 loops=1) Planning time: 0.254 ms Execution time: 3827.324 ms
它不會生成
index-only-scan
我認為應該合理的 a,因為 indexbtree on A(cid, id)
是一個覆蓋索引。但是,如果我使用 IN 運算符,它可以生成
index-only-scan
,如下所示:查詢 2 (IN):
select id from a where cid in (select distinct cid from b);
計劃:
Nested Loop (cost=20284.27..420607.83 rows=10256290 width=8) (actual time=290.225..2349.182 rows=10087249 loops=1) -> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=290.162..304.054 rows=41493 loops=1) Group Key: b.cid -> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.042..95.151 rows=1132787 loops=1) -> Index Only Scan using idx_a_cid_id on a (cost=0.43..27.61 rows=961 width=12) (actual time=0.005..0.028 rows=243 loops=41493) Index Cond: (cid = b.cid) Heap Fetches: 0 Planning time: 0.197 ms Execution time: 2672.898 ms
如果您可能會考慮因為 和 的基數
A
差異B
太大,那是真的。A
包含10256295
行,而B
包含41493
不同cid
的 s。但是,如果我用相同的邏輯手動將其重寫
Query-1
為以下內容,但僅使用,postgres 仍然可以生成,如下所示:Query-3``IN``index-only-scan
查詢 3 (IN-sub(NOT-IN)):
select id from a where cid in (select cid from a where cid not in (select distinct cid from b));
計劃:
Nested Loop (cost=325220.51..722952.01 rows=10256290 width=8) (actual time=3741.854..5607.756 rows=169046 loops=1) -> HashAggregate (cost=325220.07..325326.82 rows=10675 width=4) (actual time=3741.133..3763.512 rows=51758 loops=1) Group Key: a_1.cid -> Index Only Scan using idx_a_cid on a a_1 (cost=20418.64..312399.71 rows=5128145 width=4) (actual time=377.384..3691.304 rows=169046 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 10087249 Heap Fetches: 0 SubPlan 1 -> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=359.015..365.864 rows=41493 loops=1) Group Key: b.cid -> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.455..144.879 rows=1132787 loops=1) -> Index Only Scan using idx_a_cid_id on a (cost=0.43..27.64 rows=961 width=12) (actual time=0.033..0.035 rows=3 loops=51758) Index Cond: (cid = a_1.cid) Heap Fetches: 0 Planning time: 2.758 ms Execution time: 5617.930 ms
所以我現在很困惑,是否是因為
NOT-IN
運營商本身太難/太貴而無法使用index-only-scan
?還是僅僅因為 PostgreSQL 的查詢優化器不夠聰明,無法生成一個?順便說一句,我的實驗是在 PostgreSQL-9.6 上進行的。
謝謝!
跟進(2019 年 3 月 28 日):
使用 jjanes 的方法,它有效!
SET cpu_index_tuple_cost TO 0;
SET random_page_cost TO 1;
再次解釋 Query-1:
explain analyze select id from a where cid not in (select distinct cid from b);
計劃:
Index Only Scan using idx_a_cid_id on a (cost=20418.64..188115.26 rows=5128145 width=8) (actual time=308.552..1865.567 rows=169046 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 10087249 Heap Fetches: 0 SubPlan 1 -> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=289.169..297.363 rows=41493 loops=1) Group Key: b.cid -> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.020..94.442 rows=1132787 loops=1) Planning time: 0.112 ms Execution time: 1871.197 ms
謝謝你們!
它選擇它認為會更快的計劃。
如果您
SET cpu_index_tuple_cost TO 0
和SET random_page_cost to 1
您可能會使用僅索引掃描找到它。並且還發現它並不可靠,或者至少不會好很多。