Postgres query with a large IN list, and joining against a temporary table doesn't seem to work
Edit: the query plans in the question body come from EXPLAIN, but as @jjanes suggested, EXPLAIN (ANALYZE, BUFFERS) is probably more useful. Since the output is very large, I uploaded it here: https://gist.github.com/vr2262/ab3cfb69ac758b5161e27d9cb77ad05f

I have a query that selects records from a table using a WHERE ... IN clause on an indexed bigint column. Up to a certain number of values (7835, as it happens) the query is fast (around 150 ms for sequential IDs, around 1 second for random IDs), but adding one more value results in a different query plan, and the query takes about 150 seconds. I looked around for other answers, and the solution suggested by https://dba.stackexchange.com/a/91254 (and elsewhere) is to insert the values into an indexed temporary table and join against it. In practice, however, that actually made the query slightly slower.
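For anyone trying to reproduce the plans in the gist, EXPLAIN (ANALYZE, BUFFERS) is just prefixed to the statement itself. A minimal sketch using the table and column names from the query below, with the IN list shortened for readability (note that the ANALYZE option actually executes the query):

EXPLAIN (ANALYZE, BUFFERS)
SELECT my_table.id
FROM my_table
WHERE my_table.joined_table_2_id = 1
  AND my_table.big_where_id IN (1, 2, 3);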
Here is the original query:
SELECT
    my_table.id AS my_table_id,
    my_table.joined_table_2_id AS my_table_joined_table_2_id,
    my_table.big_where_id AS my_table_big_where_id,
    ST_AsGeoJSON(my_table.geog) AS unrelated_geog,
    joined_table_1.id AS joined_table_1_id
FROM my_table
LEFT OUTER JOIN joined_table_a AS joined_table_1
    ON my_table.id = joined_table_1.my_table_id
LEFT OUTER JOIN joined_table_b AS joined_table_2
    ON joined_table_2.id = my_table.joined_table_2_id
WHERE my_table.joined_table_2_id = 1
  AND my_table.big_where_id IN (1, 2, 3, ..., 7835);
...and the corresponding fast query plan:
Gather  (cost=36576.06..15864926.71 rows=44743 width=139)
  Workers Planned: 2
  ->  Hash Left Join  (cost=35576.06..15859452.41 rows=18643 width=139)
        Hash Cond: (my_table.joined_table_2_id = joined_table_2.id)
        ->  Nested Loop Left Join  (cost=35574.99..15854534.26 rows=18643 width=246)
              ->  Parallel Bitmap Heap Scan on my_table  (cost=35574.42..89742.05 rows=2845 width=201)
                    Recheck Cond: ((joined_table_2_id = 1) AND (big_where_id = ANY ('{1,2,3,...}'::bigint[])))
                    ->  Bitmap Index Scan on my_table_joined_table_2_id_big_where_id_key  (cost=0.00..35572.71 rows=6829 width=0)
                          Index Cond: ((joined_table_2_id = 1) AND (big_where_id = ANY ('{1,2,3,...}'::bigint[])))
              ->  Index Scan using ix_joined_table_a_my_table_id on joined_table_a joined_table_1  (cost=0.57..5512.89 rows=2834 width=53)
                    Index Cond: (my_table_id = my_table.id)
        ->  Hash  (cost=1.05..1.05 rows=1 width=14)
              ->  Seq Scan on joined_table_b joined_table_2  (cost=0.00..1.05 rows=1 width=14)
                    Filter: (id = 1)
With one more big_where_id value, the query plan becomes:

Hash Left Join  (cost=50982.39..15870462.06 rows=44750 width=139)
  Hash Cond: (my_table.joined_table_2_id = joined_table_2.id)
  ->  Hash Right Join  (cost=50981.33..15858658.19 rows=44750 width=246)
        Hash Cond: (joined_table_1.my_table_id = my_table.id)
        ->  Seq Scan on joined_table_a joined_table_1  (cost=0.00..14184914.72 rows=618195072 width=53)
        ->  Hash  (cost=50895.95..50895.95 rows=6830 width=201)
              ->  Index Scan using my_table_joined_table_2_id_big_where_id_key on my_table  (cost=0.57..50895.95 rows=6830 width=201)
                    Index Cond: ((joined_table_2_id = 1) AND (big_where_id = ANY ('{1,2,3,...}'::bigint[])))
  ->  Hash  (cost=1.05..1.05 rows=1 width=14)
        ->  Seq Scan on joined_table_b joined_table_2  (cost=0.00..1.05 rows=1 width=14)
              Filter: (id = 1)
I tried using a temporary table like this:
CREATE TEMPORARY TABLE temp_table (id INTEGER PRIMARY KEY);
INSERT INTO temp_table (id) SELECT generate_series(1, 7836);

SELECT
    my_table.id AS my_table_id,
    my_table.joined_table_2_id AS my_table_joined_table_2_id,
    my_table.big_where_id AS my_table_big_where_id,
    ST_AsGeoJSON(my_table.geog) AS unrelated_geog,
    joined_table_1.id AS joined_table_1_id
FROM my_table
LEFT OUTER JOIN joined_table_a AS joined_table_1
    ON my_table.id = joined_table_1.my_table_id
LEFT OUTER JOIN joined_table_b AS joined_table_2
    ON joined_table_2.id = my_table.joined_table_2_id
JOIN temp_table ON my_table.big_where_id = temp_table.id
WHERE my_table.joined_table_2_id = 1;
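An aside I have not verified for this case: autovacuum does not process temporary tables, so unless statistics are collected explicitly the planner has to guess at temp_table's row count. It might be worth running ANALYZE between the INSERT and the SELECT:

-- collect statistics so the planner knows how many IDs are in the temp table
ANALYZE temp_table;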
...but, as mentioned, it was slightly slower than before. Here is the query plan (from EXPLAIN on the SELECT):

Hash Left Join  (cost=126858.69..28741416.19 rows=138238 width=139)
  Hash Cond: (my_table.joined_table_2_id = joined_table_2.id)
  ->  Hash Right Join  (cost=126857.60..28706108.24 rows=138238 width=246)
        Hash Cond: (joined_table_1.my_table_id = my_table.id)
        ->  Seq Scan on joined_table_a joined_table_1  (cost=0.00..14184914.72 rows=618195072 width=53)
        ->  Hash  (cost=125995.86..125995.86 rows=21099 width=201)
              ->  Nested Loop  (cost=0.57..125995.86 rows=21099 width=201)
                    ->  Seq Scan on temp_table  (cost=0.00..159.75 rows=11475 width=4)
                    ->  Index Scan using ix_my_table_big_where_id on my_table  (cost=0.57..10.95 rows=2 width=201)
                          Index Cond: (big_where_id = temp_table.id)
  ->  Hash  (cost=1.04..1.04 rows=4 width=14)
        ->  Seq Scan on joined_table_b joined_table_2  (cost=0.00..1.04 rows=4 width=14)
Maybe a plain JOIN on the temporary table is the wrong approach? I had no luck trying other kinds of joins either, though.
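For reference, another shape this kind of filter is sometimes written in is a join against an unnested array rather than a temporary table. A sketch only, with a short array literal standing in for the full list of IDs; whether it changes the plan here is an open question:

SELECT my_table.id AS my_table_id
FROM my_table
JOIN unnest(ARRAY[1, 2, 3]::bigint[]) AS ids(big_where_id)
    ON my_table.big_where_id = ids.big_where_id
WHERE my_table.joined_table_2_id = 1;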
This looks like a case of unfair measurement. You are presumably executing the same query over and over, just adding one more element to the IN list each time. But that means almost all of the data the "fast" plan needs is in constant use and already cached. If you changed the parameter tested against joined_table_2_id on every execution (rather than using 1 the whole time), or picked a different set of roughly 7000 random values for the IN list on every execution instead of just using the series 1..7NNN, would the fast plan still be fast?
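A sketch of one way to pull a fresh batch of roughly 7000 random big_where_id values for such a re-test (this samples values that actually exist in my_table; ORDER BY random() is slow on big tables, but fine for a one-off experiment):

-- grab ~7000 random existing IDs to paste into the IN list for the next run
SELECT big_where_id
FROM my_table
ORDER BY random()
LIMIT 7000;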
If it is still much faster than the alternative even with random parameters, that indicates that random_page_cost is set too high relative to seq_page_cost for the storage system you have. The default settings for those (4 and 1) are generally appropriate for hard drives, not for SSDs.
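A sketch of how one might check those settings and, as a session-local experiment, try an SSD-like value before re-running the query (1.1 is a commonly used ballpark for random_page_cost on SSDs; that number is an assumption here, not a measurement):

SHOW random_page_cost;   -- default 4
SHOW seq_page_cost;      -- default 1

-- affects only the current session; re-run the query afterwards to compare plans
SET random_page_cost = 1.1;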