帶有大表的 UPDATE FROM 很慢並且使用 Seq Scans
我有一個大表(最終可能有 10 億行,但目前約為 2600 萬行),我想一次性為給定分組在最高 PK 上設置一個標誌。
我選擇創建一個臨時表來儲存應該設置的 PK,
current=true
其餘的都應該設置current=false
。我製作了一個臨時表而不是物化視圖,但我認為它不會產生真正的區別。為每個發現最大 ID 的過程並不太痛苦:
CREATE TABLE assertion ( pk integer NOT NULL, a bigint NOT NULL, b bigint NOT NULL, c bigint NOT NULL, d integer NOT NULL, current boolean DEFAULT false NOT NULL ); CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true); CREATE INDEX assertion_current_idx1 ON assertion USING btree (current); CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true); SELECT COUNT(pk) FROM assertion; -- 26916858 -- Time: 2912.403 ms (00:02.912) CREATE TEMPORARY TABLE assertion_current AS (SELECT MAX(pk) as pk, a, b, c, d FROM assertion GROUP BY a, b, c, d); -- Time: 72218.755 ms (01:12.219) ANALYZE assertion_current; CREATE INDEX ON assertion_current(pk); -- Time: 22107.698 ms (00:22.108) SELECT COUNT(pk) FROM assertion_current; -- 26455092 -- Time: 15650.078 ms (00:15.650)
根據 的計數
assertion_current
,我們需要為 98% 的行設置“目前”標誌為真。棘手的是如何
assertion
根據目前值在合理的時間內更新表格。有一個a, b, c, d, current
必須維護的唯一約束,因此對current
列的更新需要是原子的,以避免破壞約束。我有幾個選擇:
選項1
僅更新那些
current
更改的值。這具有根據索引欄位更新所需的最少行數的好處:BEGIN; UPDATE assertion SET current = false WHERE assertion.current = true AND PK NOT IN (SELECT pk FROM assertion_current); UPDATE assertion SET current = true WHERE assertion.current = false AND PK IN (SELECT pk FROM assertion_current); COMMIT;
但是這兩個查詢都涉及序列掃描
assertion_current
(我認為)必須乘以大量行。Update on assertion (cost=0.12..431141.55 rows=0 width=0) -> Index Scan using assertion_current_idx on assertion (cost=0.12..431141.55 rows=1 width=7) Index Cond: (current = true) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..787318.40 rows=29982560 width=4) -> Seq Scan on assertion_current (cost=0.00..520285.60 rows=29982560 width=4)
和
Update on assertion (cost=595242.56..596693.92 rows=0 width=0) -> Nested Loop (cost=595242.56..596693.92 rows=17974196 width=13) -> HashAggregate (cost=595242.00..595244.00 rows=200 width=10) Group Key: assertion_current.pk -> Seq Scan on assertion_current (cost=0.00..520285.60 rows=29982560 width=10) -> Index Scan using assertion_pkey on assertion (cost=0.56..8.58 rows=1 width=10) Index Cond: (pk = assertion_current.pk) Filter: (NOT current)
這意味著這些查詢之一(許多目前為真或許多目前為假)總是需要很長時間。
選項 2
單次通過,但必須不必要地觸摸每一行。
UPDATE assertion SET current = (CASE WHEN assertion.pk IN (select PK from assertion_current) THEN TRUE ELSE FALSE END)
但這會導致再次對 assertion_current 進行序列掃描
Update on assertion (cost=0.00..15498697380303.70 rows=0 width=0) -> Seq Scan on assertion (cost=0.00..15498697380303.70 rows=35948392 width=7) SubPlan 1 -> Materialize (cost=0.00..787318.40 rows=29982560 width=4) -> Seq Scan on assertion_current (cost=0.00..520285.60 rows=29982560 width=4)
選項 3
與選項 1 類似,但
WHERE
在更新中使用:BEGIN; UPDATE assertion SET current = false WHERE current = true; UPDATE assertion SET current = true FROM assertion_current WHERE assertion.pk = assertion_current.pk; COMMIT;
但第二個查詢涉及兩次 seq 掃描:
Update on assertion (cost=1654256.82..2721576.65 rows=0 width=0) -> Hash Join (cost=1654256.82..2721576.65 rows=29982560 width=13) Hash Cond: (assertion_current.pk = assertion.pk) -> Seq Scan on assertion_current (cost=0.00..520285.60 rows=29982560 width=10) -> Hash (cost=1029371.92..1029371.92 rows=35948392 width=10) -> Seq Scan on assertion (cost=0.00..1029371.92 rows=35948392 width=10)
選項 4
謝謝@jjanes,這花了> 6個小時,所以我取消了它。
UPDATE assertion SET current = not current WHERE current <> (CASE WHEN assertion.pk IN (select PK from assertion_current) THEN TRUE ELSE FALSE END)
生產
Update on assertion (cost=0.00..11832617068493.14 rows=0 width=0) -> Seq Scan on assertion (cost=0.00..11832617068493.14 rows=27307890 width=7) Filter: (current <> CASE WHEN (SubPlan 1) THEN true ELSE false END) SubPlan 1 -> Materialize (cost=0.00..787318.40 rows=29982560 width=4) -> Seq Scan on assertion_current (cost=0.00..520285.60 rows=29982560 width=4)
選項 5
謝謝@a_horse_with_no_name。這在我的機器上需要 24 分鐘。
UPDATE assertion tg SET current = EXISTS (SELECT pk FROM assertion_current cr WHERE cr.pk = tg.pk);
給
Update on assertion tg (cost=0.00..233024784.94 rows=0 width=0) -> Seq Scan on assertion tg (cost=0.00..233024784.94 rows=27445116 width=7) SubPlan 1 -> Index Only Scan using assertion_current_pk_idx on assertion_current cr (cost=0.44..8.46 rows=1 width=0) Index Cond: (pk = tg.pk)
有沒有更好的方法來及時實現這一目標?
…我們需要為 98% 的行設置“目前”標誌為真
NOT current
罕見的情況也會如此。到目前為止,您似乎一直在嘗試從頭到尾做事。您目前的索引對給定的數據分佈沒有幫助:
CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true); CREATE INDEX assertion_current_idx1 ON assertion USING btree (current); CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);
我們需要保留
UNIQUE
索引來強制執行您的要求,它也很有用:CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion (a, b, c, d) WHERE current;
但簡化
(current = true)
為current
. 執行冗餘表達式沒有意義,只需使用該boolean
值。
assertion_current_idx
絕對沒有意義,永遠不會被使用。但它仍然必須保持最新。算了吧。
assertion_current_idx1
幾乎一樣毫無意義。至少對於查找current = false
. 但是使用這個部分索引要便宜得多——它也支持我在下面建議的第二個查詢:CREATE INDEX assertion_not_current_idx ON assertion (a, b, c, d, pk) WHERE NOT current;
在下面的“初始查詢”之後創建此索引。
請注意,我完全跳過臨時表以支持 CTE。更少的成本。
初始查詢
您在後來的評論中透露,最初***“所有行都已設置
current = false
” 。我們可以使用更簡單、更快的查詢*來初始化。只需更新每組中沒有其他具有更大 PK 的行。沒有獨特的違規行為,沒有“非目前”的更新:UPDATE assertion a SET current = true WHERE NOT EXISTS ( SELECT FROM assertion x WHERE (x.a, x.b, x.c, x.d) = (a.a, a.b, a.c, a.d) AND x.pk > a.pk );
一般查詢
假設每組不能有目前行*。*
WITH new_current AS ( -- only these groups require updates SELECT * FROM ( -- get row with greatest non-current pk per group SELECT a, b, c, d, MAX(pk) AS new_pk , (SELECT a2.pk -- get current row of the same group FROM assertion a2 WHERE (a2.a, a2.b, a2.c, a2.d) = (a1.a, a1.b, a1.c, a1.d) AND a2.current ) AS old_pk FROM assertion a1 WHERE NOT current GROUP BY a, b, c, d ) a1 WHERE (old_pk < new_pk -- only if old pk is lower (!) OR old_pk IS NULL) -- or does not exist ) , up1(dummy) AS ( -- update current to false FIRST UPDATE assertion a SET current = false FROM new_current n WHERE (a.a, a.b, a.c, a.d, a.pk) = (n.a, n.b, n.c, n.d, n.old_pk) -- only matches existing old pk RETURNING true ) UPDATE assertion a -- then update the new current row SET current = true FROM new_current n LEFT JOIN (SELECT FROM up1 LIMIT 1) AS force_update_order ON true -- !!! WHERE (a.a, a.b, a.c, a.d, a.pk) = (n.a, n.b, n.c, n.d, n.new_pk);
到目前為止,您嘗試過的其他任何東西都應該相形見絀。
子查詢通過簡單的聚合
a1
獲取每個組中非目前最大的行。這是最佳選擇,因為每組的候選行 ( )*很少。*否則,使用模擬索引跳過掃描進一步優化此步驟:pk``max()``NOT current
棘手的部分是保持
UNIQUE
約束愉快。它不允許在任何給定時間每組有兩個目前行。同一查詢的 CTE 之間沒有執行順序 - 只要一個不引用另一個。我放入了這樣一個虛擬引用來強制更新順序。當還沒有目前行時,CTE
up1
不返回任何行。所以我們使用LEFT JOIN
. 並且LIMIT 1
永遠不要重複行。