PostgreSQL 在使用 NOT-IN 時不會生成僅索引計劃，但在將 IN 與多列覆蓋索引一起使用時會生成

March 29, 2019

我有兩張桌子：A(id, x, cid)和B(cid)。我需要從 A 中獲取一些記錄，並排除 B 中具有相同 cid 的記錄。
我有 btree 索引A (cid, id)。
查詢 1（未輸入）：select id from a where cid not in (select distinct cid from b);
計劃：
Seq Scan on a  (cost=20418.20..340328.83 rows=5128145 width=8) (actual time=367.337..3820.175 rows=169046 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 10087249
  SubPlan 1
    -&gt;  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=351.301..357.395 rows=41493 loops=1)
          Group Key: b.cid
          -&gt;  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.530..148.063 rows=1132787 loops=1)
Planning time: 0.254 ms
Execution time: 3827.324 ms
它不會生成index-only-scan我認為應該合理的 a，因為 indexbtree on A(cid, id)是一個覆蓋索引。
但是，如果我使用 IN 運算符，它可以生成index-only-scan，如下所示：
查詢 2 (IN)：select id from a where cid in (select distinct cid from b);
計劃：
Nested Loop  (cost=20284.27..420607.83 rows=10256290 width=8) (actual time=290.225..2349.182 rows=10087249 loops=1)
  -&gt;  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=290.162..304.054 rows=41493 loops=1)
        Group Key: b.cid
        -&gt;  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.042..95.151 rows=1132787 loops=1)
  -&gt;  Index Only Scan using idx_a_cid_id on a  (cost=0.43..27.61 rows=961 width=12) (actual time=0.005..0.028 rows=243 loops=41493)
        Index Cond: (cid = b.cid)
        Heap Fetches: 0
Planning time: 0.197 ms
Execution time: 2672.898 ms
如果您可能會考慮因為和的基數A差異B太大，那是真的。A包含10256295行，而B包含41493不同cid的 s。
但是，如果我用相同的邏輯手動將其重寫Query-1為以下內容，但僅使用，postgres 仍然可以生成，如下所示：Query-3``IN``index-only-scan
查詢 3 (IN-sub(NOT-IN))：select id from a where cid in (select cid from a where cid not in (select distinct cid from b));
計劃：
Nested Loop  (cost=325220.51..722952.01 rows=10256290 width=8) (actual time=3741.854..5607.756 rows=169046 loops=1)
  -&gt;  HashAggregate  (cost=325220.07..325326.82 rows=10675 width=4) (actual time=3741.133..3763.512 rows=51758 loops=1)
        Group Key: a_1.cid
        -&gt;  Index Only Scan using idx_a_cid on a a_1  (cost=20418.64..312399.71 rows=5128145 width=4) (actual time=377.384..3691.304 rows=169046 loops=1)
              Filter: (NOT (hashed SubPlan 1))
              Rows Removed by Filter: 10087249
              Heap Fetches: 0
              SubPlan 1
                -&gt;  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=359.015..365.864 rows=41493 loops=1)
                      Group Key: b.cid
                      -&gt;  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.455..144.879 rows=1132787 loops=1)
  -&gt;  Index Only Scan using idx_a_cid_id on a  (cost=0.43..27.64 rows=961 width=12) (actual time=0.033..0.035 rows=3 loops=51758)
        Index Cond: (cid = a_1.cid)
        Heap Fetches: 0
Planning time: 2.758 ms
Execution time: 5617.930 ms
所以我現在很困惑，是否是因為NOT-IN運營商本身太難/太貴而無法使用index-only-scan？還是僅僅因為 PostgreSQL 的查詢優化器不夠聰明，無法生成一個？
順便說一句，我的實驗是在 PostgreSQL-9.6 上進行的。
謝謝！
跟進（2019 年 3 月 28 日）：
使用 jjanes 的方法，它有效！
SET cpu_index_tuple_cost TO 0;
SET random_page_cost TO 1;
再次解釋 Query-1：
explain analyze select id from a where cid not in (select distinct cid from b);
計劃：
Index Only Scan using idx_a_cid_id on a  (cost=20418.64..188115.26 rows=5128145 width=8) (actual time=308.552..1865.567 rows=169046 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 10087249
  Heap Fetches: 0
  SubPlan 1
    -&gt;  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=289.169..297.363 rows=41493 loops=1)
          Group Key: b.cid
          -&gt;  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.020..94.442 rows=1132787 loops=1)
Planning time: 0.112 ms
Execution time: 1871.197 ms
謝謝你們！

它選擇它認為會更快的計劃。
如果您SET cpu_index_tuple_cost TO 0和SET random_page_cost to 1您可能會使用僅索引掃描找到它。
並且還發現它並不可靠，或者至少不會好很多。

引用自：https://dba.stackexchange.com/questions/233329

PostgreSQL 在使用 NOT-IN 時不會生成僅索引計劃，但在將 IN 與多列覆蓋索引一起使用時會生成

相關問答

Postgresql 9.6 大型管理工作的最佳設置（min_wal 和 max_wal）

當涉及 OR 時，Postgres 選擇過濾器而不是索引 cond

在 Postgres 9.6 中，為什麼 GIN 索引不用於帶有 text/int 數組的 JSONB 列？

使用計數/排序的 postgres 查詢性能緩慢

PostgreSQL中的漸進式子表掃描？

Postgresql 無法使用我的覆蓋索引並退回到更慢的點陣圖掃描