如何優化索引列上的 IN 查詢

July 26, 2019

我有一個超過 50M 記錄的表。其中一個欄位是COLOR_CODE. 我在列上設置了一個索引，COLOR_CODE如下所示：

"mytable_colorcode_idx" btree (color_code)

我注意到當我執行下面的查詢時，執行時間更長

SELECT count(total_amount) FROM mytable 
WHERE color_code in ('red','green') and sale_date = '1970'

OR但是，使用子句的執行時間更快：

SELECT count(total_amount) FROM mytable 
WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'

查詢計劃IN

explain analyze SELECT count(total_amount) FROM mytable 
WHERE color_code in ('red','green') and sale_date = '1970'
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=2074238.07..2074238.08 rows=1 width=8) (actual time=63520.150..63520.150 rows=1 loops=1)
  -&gt;  Bitmap Heap Scan on mytable  (cost=53504.73..2069923.27 rows=1725919 width=6) (actual time=3509.920..63080.519 rows=1727037 loops=1)
        Recheck Cond: ((color_code)::text = ANY ('{red,green}'::text[]))
        Rows Removed by Index Recheck: 5067635
        Filter: (sale_date = 1970)
        Heap Blocks: exact=38679 lossy=496680
        -&gt;  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..53073.26 rows=1725919 width=0) (actual time=3501.777..3501.777 rows=1727037 loops=1)
              Index Cond: ((color_code)::text = ANY ('{red,green}'::text[]))
Planning time: 0.165 ms
Execution time: 63524.100 ms
(10 rows)

查詢計劃OR

explain analyze SELECT count(total_amount) FROM mytable 
   WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'

   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=2081265.36..2081265.37 rows=1 width=8) (actual time=18895.998..18895.998 rows=1 loops=1)
  -&gt;  Bitmap Heap Scan on mytable  (cost=56223.06..2076956.39 rows=1723588 width=6) (actual time=161.335..18468.146 rows=1727037 loops=1)
        Recheck Cond: (((color_code)::text = 'red'::text) OR ((color_code)::text = 'green'::text))
        Rows Removed by Index Recheck: 5067635
        Filter: (((color_code)::text = 'red'::text) OR (((color_code)::text = 'green'::text) AND (sale_date = 1970)))
        Heap Blocks: exact=38679 lossy=496680
        -&gt;  BitmapOr  (cost=56223.06..56223.06 rows=1725919 width=0) (actual time=153.683..153.684 rows=0 loops=1)
              -&gt;  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..663.35 rows=20655 width=0) (actual time=3.935..3.935 rows=26768 loops=1)
                    Index Cond: ((color_code)::text = 'red'::text)
              -&gt;  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..54697.91 rows=1705264 width=0) (actual time=149.745..149.746 rows=1700269 loops=1)
                    Index Cond: ((color_code)::text = 'green'::text)
Planning time: 0.162 ms
Execution time: 18896.785 ms
(13 rows)

更新

如果我添加一個索引（color_code、total_count 和 sale_date），我注意到根本沒有使用任何索引。而是進行部分掃描。

"mytable_color_total_count_sale_Date_idx" btree (color_code, total_count, sale_date)  



                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate  (cost=2099755.26..2099755.27 rows=1 width=8) (actual time=97066.585..97066.586 rows=1 loops=1)
  -&gt;  Gather  (cost=2099755.04..2099755.25 rows=2 width=8) (actual time=97063.512..97069.838 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -&gt;  Partial Aggregate  (cost=2098755.04..2098755.05 rows=1 width=8) (actual time=97061.531..97061.532 rows=1 loops=3)
              -&gt;  Parallel Seq Scan on mytable  (cost=0.00..2096119.69 rows=1054140 width=6) (actual time=27782.491..96730.232 rows=841604 loops=3)
                    Filter: ((sale_date = 1970) AND ((color_code)::text = ANY ('{red,green}'::text[])))
                    Rows Removed by Filter: 4196103
Planning time: 0.161 ms
Execution time: 97069.896 ms
(10 rows)

問題

除了將其轉換為子句之外，還有其他方法可以通過IN子句查詢進行優化OR嗎？

您無法比較以下性能：
WHERE color_code in ('red','green') and sale_date = '1970'
和：
WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'
因為它們在邏輯上不等效（將返回不同的結果）。一個簡單的例子：
with T (color_code, sale_date) as ( 
    values ('red', '1970'), ('green','1969')
) 
select * from T 
where color_code in ('green', 'red') 
  and sale_date = '1970';

color_code | sale_date 
------------+-----------
red        | 1970
(1 row)
然而：
with T (color_code, sale_date) as ( 
   values ('red', '1970'), ('green','1969')
) 
select * from T 
where color_code = 'green' or color_code = 'red' 
 and sale_date = '1970';

color_code | sale_date 
------------+-----------
red        | 1970
green      | 1969
(2 rows)
簡而言之AND，優先級高於OR因此您的優化表達式A OR B AND C被評估為A OR (B AND C). 您的原始表達式被評估為(A OR B) AND C。
為了使比較有意義，您需要將查詢更改為：
select * from T 
where (color_code = 'green' or color_code = 'red') 
 and sale_date = '1970';
我的猜測是，您不會在性能方面與您的原始表達有太大差異。
也就是說，我建議使用如下索引：
CREATE INDEX ... ON ... (sale_date, color_code)

我認為您看到的時間差異只是一種記憶體效果，具體取決於您首先執行的查詢。這可能不是由您指定查詢的方式引起的真正差異（儘管正如 Lennart 所描述的，您的查詢並不真正等效，因為您在 OR 部分周圍缺少括號 - 儘管您的所有行似乎sale_date = '1970'無論如何都滿足，所以這種差異是一般來說很重要，但在確切的例子中沒有區別）
您可以執行一些操作來加快此查詢的兩個規範。
首先，看看這一行：
    Heap Blocks: exact=38679 lossy=496680
這意味著您的 work_mem 不足以容納整個點陣圖。所有這些有損塊都需要重新檢查其中的每一行，這需要時間。增加 work_mem 可以防止這種情況發生，並且應該加快查詢速度。理想情況下，有損塊將降至零（此時“有損”標籤將不再顯示。
其次，擁有索引on mytable (color_code, sale_date, total_count)可以允許僅進行索引掃描，因為所有需要的數據都在索引中，並且根本不必訪問表（假設表保持良好的真空狀態）。
這些是互斥的：如果您進行僅索引掃描而不是點陣圖掃描，那麼 work_mem 不再重要。

引用自：https://dba.stackexchange.com/questions/243790

如何優化索引列上的 IN 查詢

相關問答

Postgres 不使用 group by 的表達式索引

慢查詢/索引創建（PostgreSQL 9.2）

帶有 WHERE 條件和 GROUP BY 的 SQL 查詢的索引

FROM 子句中的相關函式是否針對每一行執行？

使用大 IN 優化 Postgres 查詢

大表中的慢速索引掃描