PostgreSQL

Postgres query with a large IN is slow, and joining on a temporary table doesn't seem to help

  • July 7, 2022

Edit: the query plans in the question body come from EXPLAIN, but as @jjanes suggested, EXPLAIN (ANALYZE, BUFFERS) might be more useful. Since the output is very large, I uploaded it here: https://gist.github.com/vr2262/ab3cfb69ac758b5161e27d9cb77ad05f

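I have a query that selects records from a table using a WHERE ... IN on an indexed bigint column. Up to a certain number of values (7835, as it happens), the query is fast (about 150 ms for sequential IDs, about 1 second for random IDs), but adding one more value produces a different query plan, and the query then takes about 150 seconds. I looked around at other answers, and the solution suggested at https://dba.stackexchange.com/a/91254 (and elsewhere) is to insert the values into an indexed temporary table and join against it. However, that actually made the query a bit slower.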

Here is the original query:

SELECT
 my_table.id AS my_table_id,
 my_table.joined_table_2_id AS my_table_joined_table_2_id,
 my_table.big_where_id AS my_table_big_where_id,
 ST_AsGeoJSON(my_table.geog) AS unrelated_geog,
 joined_table_1.id AS joined_table_1_id
FROM
 my_table
LEFT OUTER JOIN
 joined_table_a AS joined_table_1 ON my_table.id = joined_table_1.my_table_id
LEFT OUTER JOIN
 joined_table_b AS joined_table_2 ON joined_table_2.id = my_table.joined_table_2_id
WHERE
 my_table.joined_table_2_id = 1
 AND my_table.big_where_id IN (1, 2, 3, ..., 7835);

…and the corresponding fast query plan:

Gather  (cost=36576.06..15864926.71 rows=44743 width=139)
 Workers Planned: 2
 ->  Hash Left Join  (cost=35576.06..15859452.41 rows=18643 width=139)
       Hash Cond: (my_table.joined_table_2_id = joined_table_2.id)
       ->  Nested Loop Left Join  (cost=35574.99..15854534.26 rows=18643 width=246)
             ->  Parallel Bitmap Heap Scan on my_table  (cost=35574.42..89742.05 rows=2845 width=201)
                   Recheck Cond: ((joined_table_2_id = 1) AND (big_where_id = ANY ('{1,2,3,...}'::bigint[])))
                   ->  Bitmap Index Scan on my_table_joined_table_2_id_big_where_id_key  (cost=0.00..35572.71 rows=6829 width=0)
                         Index Cond: ((joined_table_2_id = 1) AND (big_where_id = ANY ('{1,2,3,...}'::bigint[])))
             ->  Index Scan using ix_joined_table_a_my_table_id on joined_table_a joined_table_1  (cost=0.57..5512.89 rows=2834 width=53)
                   Index Cond: (my_table_id = my_table.id)
       ->  Hash  (cost=1.05..1.05 rows=1 width=14)
             ->  Seq Scan on joined_table_b joined_table_2  (cost=0.00..1.05 rows=1 width=14)
                   Filter: (id = 1)

With one more value for big_where_id, the query plan changes to:

Hash Left Join  (cost=50982.39..15870462.06 rows=44750 width=139)
 Hash Cond: (my_table.joined_table_2_id = joined_table_2.id)
 ->  Hash Right Join  (cost=50981.33..15858658.19 rows=44750 width=246)
       Hash Cond: (joined_table_1.my_table_id = my_table.id)
       ->  Seq Scan on joined_table_a joined_table_1  (cost=0.00..14184914.72 rows=618195072 width=53)
       ->  Hash  (cost=50895.95..50895.95 rows=6830 width=201)
             ->  Index Scan using my_table_joined_table_2_id_big_where_id_key on my_table  (cost=0.57..50895.95 rows=6830 width=201)
                   Index Cond: ((joined_table_2_id = 1) AND (big_where_id = ANY ('{1,2,3,...}'::bigint[])))
 ->  Hash  (cost=1.05..1.05 rows=1 width=14)
       ->  Seq Scan on joined_table_b joined_table_2  (cost=0.00..1.05 rows=1 width=14)
             Filter: (id = 1)

I tried using a temporary table like this:

CREATE TEMPORARY TABLE temp_table (id INTEGER PRIMARY KEY);
INSERT INTO temp_table (id) SELECT generate_series(1, 7836);
SELECT
 my_table.id AS my_table_id,
 my_table.joined_table_2_id AS my_table_joined_table_2_id,
 my_table.big_where_id AS my_table_big_where_id,
 ST_AsGeoJSON(my_table.geog) AS unrelated_geog,
 joined_table_1.id AS joined_table_1_id
FROM
 my_table
LEFT OUTER JOIN
 joined_table_a AS joined_table_1 ON my_table.id = joined_table_1.my_table_id
LEFT OUTER JOIN
 joined_table_b AS joined_table_2 ON joined_table_2.id = my_table.joined_table_2_id
JOIN
 temp_table ON my_table.big_where_id = temp_table.id
WHERE
 my_table.joined_table_2_id = 1;

…but as mentioned, it is a bit slower than before. Here is the query plan (from EXPLAIN on the SELECT):

Hash Left Join  (cost=126858.69..28741416.19 rows=138238 width=139)
 Hash Cond: (my_table.joined_table_2_id = joined_table_2.id)
 ->  Hash Right Join  (cost=126857.60..28706108.24 rows=138238 width=246)
       Hash Cond: (joined_table_1.my_table_id = my_table.id)
       ->  Seq Scan on joined_table_a joined_table_1  (cost=0.00..14184914.72 rows=618195072 width=53)
       ->  Hash  (cost=125995.86..125995.86 rows=21099 width=201)
             ->  Nested Loop  (cost=0.57..125995.86 rows=21099 width=201)
                   ->  Seq Scan on temp_table  (cost=0.00..159.75 rows=11475 width=4)
                   ->  Index Scan using ix_my_table_big_where_id on my_table  (cost=0.57..10.95 rows=2 width=201)
                         Index Cond: (big_where_id = temp_table.id)
 ->  Hash  (cost=1.04..1.04 rows=4 width=14)
       ->  Seq Scan on joined_table_b joined_table_2  (cost=0.00..1.04 rows=4 width=14)

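Perhaps a regular JOIN on the temporary table isn't the right approach? I haven't had any luck with other kinds of joins either, though.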
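One detail in the temporary-table plan above: temp_table is estimated at rows=11475 although 7836 rows were inserted, which suggests the planner had no statistics for it. Autovacuum does not process temporary tables, so statistics have to be collected by hand. The sketch below is a variation on the setup from the question, not a confirmed fix: it adds an explicit ANALYZE and declares id as BIGINT to match the bigint column being joined (both changes are assumptions about what might help, and the planner may still pick the same plan).

```sql
-- Variation on the temp table setup from the question (assumed, untested against the real data).
-- BIGINT matches the type of my_table.big_where_id.
CREATE TEMPORARY TABLE temp_table (id BIGINT PRIMARY KEY);
INSERT INTO temp_table (id) SELECT generate_series(1, 7836);

-- Temporary tables are never analyzed by autovacuum, so collect statistics manually
-- before running the SELECT that joins against temp_table.
ANALYZE temp_table;
```

After this, the row estimate for the scan on temp_table in EXPLAIN should be close to the real row count.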

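This looks like a case of unfair measurement. You are probably executing the same query over and over, just adding one more element to the IN list each time. But that means almost all the data needed by the "fast" plan is used frequently and is already cached. If you change the value that joined_table_2_id is tested against on each execution (rather than always using '1'), or choose about 7000 random values for the IN list differently on each execution instead of just using the series 1..7NNN, is the fast plan still fast?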
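A minimal psql sketch of such a test follows. It builds a literal list of random IDs so that the planner sees the actual values, as it does with the hand-written IN list. The upper bound 1000000 is a placeholder for the real range of big_where_id, the query is trimmed to one join for brevity, and \gset and :id_list are psql features, so this will not work from other clients as written.

```sql
-- Build a comma-separated list of up to 7000 distinct random IDs and store it
-- in the psql variable id_list. 1000000 is an assumed upper bound; adjust it.
SELECT string_agg(DISTINCT (1 + floor(random() * 1000000))::bigint::text, ', ') AS id_list
FROM generate_series(1, 7000)
\gset

-- psql substitutes :id_list textually, so the server receives a plain literal list.
EXPLAIN (ANALYZE, BUFFERS)
SELECT my_table.id, joined_table_1.id
FROM my_table
LEFT OUTER JOIN joined_table_a AS joined_table_1
  ON my_table.id = joined_table_1.my_table_id
WHERE my_table.joined_table_2_id = 1
  AND my_table.big_where_id IN (:id_list);
```

In the output, the Buffers lines distinguish shared hit (pages already in PostgreSQL's shared buffers) from read (pages requested from the operating system). A run dominated by hits is measuring a warm cache.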

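If it is still much faster than the alternative even with random parameters, that suggests your random_page_cost is set too high relative to seq_page_cost for your storage system. The default settings for those (4 and 1) are generally suitable for hard disk drives, not for SSDs.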
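One way to try this without touching the server configuration is to change the setting for the current session only and compare plans. The value 1.1 below is just a commonly suggested starting point for SSD-backed storage, not a measured figure for this system.

```sql
-- Inspect the current planner cost settings (defaults are 4 and 1).
SHOW random_page_cost;
SHOW seq_page_cost;

-- Lower the random-access cost for this session only (1.1 is an assumed starting value).
SET random_page_cost = 1.1;

-- Re-run EXPLAIN (ANALYZE, BUFFERS) on the slow query here and check whether the
-- planner goes back to the nested loop with index scans on joined_table_a.

-- Restore the configured value for the rest of the session.
RESET random_page_cost;
```

If the lower value consistently gives better plans, it can be made persistent in postgresql.conf or with ALTER SYSTEM.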

Source: https://dba.stackexchange.com/questions/314136