Postgresql

PostgreSQL 在使用 NOT-IN 時不會生成僅索引計劃,但在將 IN 與多列覆蓋索引一起使用時會生成

  • March 29, 2019

我有兩張桌子:A(id, x, cid)B(cid)。我需要從 A 中獲取一些記錄,並排除 B 中具有相同 cid 的記錄。

我有 btree 索引A (cid, id)

查詢 1(未輸入)select id from a where cid not in (select distinct cid from b);

計劃

Seq Scan on a  (cost=20418.20..340328.83 rows=5128145 width=8) (actual time=367.337..3820.175 rows=169046 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 10087249
  SubPlan 1
    ->  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=351.301..357.395 rows=41493 loops=1)
          Group Key: b.cid
          ->  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.530..148.063 rows=1132787 loops=1)
Planning time: 0.254 ms
Execution time: 3827.324 ms

它不會生成index-only-scan我認為應該合理的 a,因為 indexbtree on A(cid, id)是一個覆蓋索引。

但是,如果我使用 IN 運算符,它可以生成index-only-scan,如下所示:

查詢 2 (IN)select id from a where cid in (select distinct cid from b);

計劃

Nested Loop  (cost=20284.27..420607.83 rows=10256290 width=8) (actual time=290.225..2349.182 rows=10087249 loops=1)
  ->  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=290.162..304.054 rows=41493 loops=1)
        Group Key: b.cid
        ->  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.042..95.151 rows=1132787 loops=1)
  ->  Index Only Scan using idx_a_cid_id on a  (cost=0.43..27.61 rows=961 width=12) (actual time=0.005..0.028 rows=243 loops=41493)
        Index Cond: (cid = b.cid)
        Heap Fetches: 0
Planning time: 0.197 ms
Execution time: 2672.898 ms

如果您可能會考慮因為 和 的基數A差異B太大,那是真的。A包含10256295行,而B包含41493不同cid的 s。

但是,如果我用相同的邏輯手動將其重寫Query-1為以下內容,但僅使用,postgres 仍然可以生成,如下所示:Query-3``IN``index-only-scan

查詢 3 (IN-sub(NOT-IN))select id from a where cid in (select cid from a where cid not in (select distinct cid from b));

計劃

Nested Loop  (cost=325220.51..722952.01 rows=10256290 width=8) (actual time=3741.854..5607.756 rows=169046 loops=1)
  ->  HashAggregate  (cost=325220.07..325326.82 rows=10675 width=4) (actual time=3741.133..3763.512 rows=51758 loops=1)
        Group Key: a_1.cid
        ->  Index Only Scan using idx_a_cid on a a_1  (cost=20418.64..312399.71 rows=5128145 width=4) (actual time=377.384..3691.304 rows=169046 loops=1)
              Filter: (NOT (hashed SubPlan 1))
              Rows Removed by Filter: 10087249
              Heap Fetches: 0
              SubPlan 1
                ->  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=359.015..365.864 rows=41493 loops=1)
                      Group Key: b.cid
                      ->  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.455..144.879 rows=1132787 loops=1)
  ->  Index Only Scan using idx_a_cid_id on a  (cost=0.43..27.64 rows=961 width=12) (actual time=0.033..0.035 rows=3 loops=51758)
        Index Cond: (cid = a_1.cid)
        Heap Fetches: 0
Planning time: 2.758 ms
Execution time: 5617.930 ms

所以我現在很困惑,是否是因為NOT-IN運營商本身太難/太貴而無法使用index-only-scan?還是僅僅因為 PostgreSQL 的查詢優化器不夠聰明,無法生成一個?

順便說一句,我的實驗是在 PostgreSQL-9.6 上進行的。

謝謝!

跟進(2019 年 3 月 28 日):

使用 jjanes 的方法,它有效

SET cpu_index_tuple_cost TO 0;

SET random_page_cost TO 1;

再次解釋 Query-1

explain analyze select id from a where cid not in (select distinct cid from b);

計劃

Index Only Scan using idx_a_cid_id on a  (cost=20418.64..188115.26 rows=5128145 width=8) (actual time=308.552..1865.567 rows=169046 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 10087249
  Heap Fetches: 0
  SubPlan 1
    ->  HashAggregate  (cost=20283.84..20391.33 rows=10749 width=4) (actual time=289.169..297.363 rows=41493 loops=1)
          Group Key: b.cid
          ->  Seq Scan on b  (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.020..94.442 rows=1132787 loops=1)
Planning time: 0.112 ms
Execution time: 1871.197 ms

謝謝你們!

它選擇它認為會更快的計劃。

如果您SET cpu_index_tuple_cost TO 0SET random_page_cost to 1您可能會使用僅索引掃描找到它。

並且還發現它並不可靠,或者至少不會好很多。

引用自:https://dba.stackexchange.com/questions/233329