帶有大表的 UPDATE FROM 很慢並且使用 Seq Scans

June 16, 2022

我有一個大表（最終可能有 10 億行，但目前約為 2600 萬行），我想一次性為給定分組在最高 PK 上設置一個標誌。

我選擇創建一個臨時表來儲存應該設置的 PK，current=true其餘的都應該設置current=false。我製作了一個臨時表而不是物化視圖，但我認為它不會產生真正的區別。

為每個發現最大 ID 的過程並不太痛苦：

CREATE TABLE assertion (
   pk integer NOT NULL,
   a bigint NOT NULL,
   b bigint NOT NULL,
   c bigint NOT NULL,
   d integer NOT NULL,
   current boolean DEFAULT false NOT NULL
);

CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true);
CREATE INDEX assertion_current_idx1 ON assertion USING btree (current);
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);

SELECT COUNT(pk) FROM assertion;

-- 26916858
-- Time: 2912.403 ms (00:02.912)

CREATE TEMPORARY TABLE assertion_current AS
   (SELECT MAX(pk) as pk, a, b, c, d
     FROM assertion
     GROUP BY a, b, c, d);

-- Time: 72218.755 ms (01:12.219)

ANALYZE assertion_current;

CREATE INDEX ON assertion_current(pk);

-- Time: 22107.698 ms (00:22.108)

SELECT COUNT(pk) FROM assertion_current;

-- 26455092
-- Time: 15650.078 ms (00:15.650)

根據的計數assertion_current，我們需要為 98% 的行設置“目前”標誌為真。

棘手的是如何assertion根據目前值在合理的時間內更新表格。有一個a, b, c, d, current必須維護的唯一約束，因此對current列的更新需要是原子的，以避免破壞約束。

我有幾個選擇：

選項1

僅更新那些current更改的值。這具有根據索引欄位更新所需的最少行數的好處：


BEGIN;
UPDATE assertion
  SET current = false
  WHERE assertion.current = true AND PK NOT IN (SELECT pk FROM assertion_current);
UPDATE assertion
  SET current = true
  WHERE assertion.current = false AND PK IN (SELECT pk FROM assertion_current);
COMMIT;

但是這兩個查詢都涉及序列掃描assertion_current（我認為）必須乘以大量行。

Update on assertion  (cost=0.12..431141.55 rows=0 width=0)
  -&gt;  Index Scan using assertion_current_idx on assertion  (cost=0.12..431141.55 rows=1 width=7)
        Index Cond: (current = true)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          -&gt;  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                -&gt;  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

和

Update on assertion  (cost=595242.56..596693.92 rows=0 width=0)
  -&gt;  Nested Loop  (cost=595242.56..596693.92 rows=17974196 width=13)
        -&gt;  HashAggregate  (cost=595242.00..595244.00 rows=200 width=10)
              Group Key: assertion_current.pk
              -&gt;  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=10)
        -&gt;  Index Scan using assertion_pkey on assertion  (cost=0.56..8.58 rows=1 width=10)
              Index Cond: (pk = assertion_current.pk)
              Filter: (NOT current)

這意味著這些查詢之一（許多目前為真或許多目前為假）總是需要很長時間。

選項 2

單次通過，但必須不必要地觸摸每一行。

UPDATE assertion
  SET current =
    (CASE WHEN assertion.pk IN (select PK from assertion_current)
    THEN TRUE ELSE FALSE END)

但這會導致再次對 assertion_current 進行序列掃描

Update on assertion  (cost=0.00..15498697380303.70 rows=0 width=0)
  -&gt;  Seq Scan on assertion  (cost=0.00..15498697380303.70 rows=35948392 width=7)
        SubPlan 1
          -&gt;  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                -&gt;  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

選項 3

與選項 1 類似，但WHERE在更新中使用：

BEGIN;
UPDATE assertion SET current = false WHERE current = true;
UPDATE assertion SET current = true FROM assertion_current
 WHERE assertion.pk = assertion_current.pk;
COMMIT;

但第二個查詢涉及兩次 seq 掃描：

Update on assertion  (cost=1654256.82..2721576.65 rows=0 width=0)
  -&gt;  Hash Join  (cost=1654256.82..2721576.65 rows=29982560 width=13)
        Hash Cond: (assertion_current.pk = assertion.pk)
        -&gt;  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=10)
        -&gt;  Hash  (cost=1029371.92..1029371.92 rows=35948392 width=10)
              -&gt;  Seq Scan on assertion  (cost=0.00..1029371.92 rows=35948392 width=10)

選項 4

謝謝@jjanes，這花了> 6個小時，所以我取消了它。

UPDATE assertion
  SET current = not current
  WHERE current &lt;&gt;
    (CASE WHEN assertion.pk IN (select PK from assertion_current)
    THEN TRUE ELSE FALSE END)

生產

Update on assertion  (cost=0.00..11832617068493.14 rows=0 width=0)
  -&gt;  Seq Scan on assertion  (cost=0.00..11832617068493.14 rows=27307890 width=7)
        Filter: (current &lt;&gt; CASE WHEN (SubPlan 1) THEN true ELSE false END)
        SubPlan 1
          -&gt;  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                -&gt;  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

選項 5

謝謝@a_horse_with_no_name。這在我的機器上需要 24 分鐘。

UPDATE assertion tg SET current = EXISTS (SELECT pk FROM assertion_current cr WHERE cr.pk = tg.pk);

給

Update on assertion tg  (cost=0.00..233024784.94 rows=0 width=0)
  -&gt;  Seq Scan on assertion tg  (cost=0.00..233024784.94 rows=27445116 width=7)
        SubPlan 1
          -&gt;  Index Only Scan using assertion_current_pk_idx on assertion_current cr  (cost=0.44..8.46 rows=1 width=0)
                Index Cond: (pk = tg.pk)

有沒有更好的方法來及時實現這一目標？

…我們需要為 98% 的行設置“目前”標誌為真
NOT current罕見的情況也會如此。到目前為止，您似乎一直在嘗試從頭到尾做事。
您目前的索引對給定的數據分佈沒有幫助：
CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true);  
CREATE INDEX assertion_current_idx1 ON assertion USING btree (current);  
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);
我們需要保留UNIQUE索引來強制執行您的要求，它也很有用：
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion (a, b, c, d) WHERE current;
但簡化(current = true)為current. 執行冗餘表達式沒有意義，只需使用該boolean值。
assertion_current_idx絕對沒有意義，永遠不會被使用。但它仍然必須保持最新。算了吧。
assertion_current_idx1幾乎一樣毫無意義。至少對於查找current = false. 但是使用這個部分索引要便宜得多——它也支持我在下面建議的第二個查詢：
CREATE INDEX assertion_not_current_idx ON assertion (a, b, c, d, pk) WHERE NOT current;
在下面的“初始查詢”之後創建此索引。
請注意，我完全跳過臨時表以支持 CTE。更少的成本。
初始查詢
您在後來的評論中透露，最初***“所有行都已設置current = false” 。我們可以使用更簡單、更快的查詢*來初始化。只需更新每組中沒有其他具有更大 PK 的行。沒有獨特的違規行為，沒有“非目前”的更新：
UPDATE assertion a
SET    current = true
WHERE  NOT EXISTS (
  SELECT FROM assertion x
  WHERE (x.a, x.b, x.c, x.d)
      = (a.a, a.b, a.c, a.d)
  AND   x.pk &gt; a.pk
  );
一般查詢
假設每組不能有目前行*。*
WITH new_current AS (  -- only these groups require updates
  SELECT *
  FROM  (
     -- get row with greatest non-current pk per group
     SELECT a, b, c, d, MAX(pk) AS new_pk
         , (SELECT a2.pk  -- get current row of the same group
            FROM   assertion a2
            WHERE (a2.a, a2.b, a2.c, a2.d)
                = (a1.a, a1.b, a1.c, a1.d)
            AND    a2.current
           ) AS old_pk
     FROM   assertion a1
     WHERE  NOT current
     GROUP  BY a, b, c, d
     ) a1
  WHERE (old_pk &lt; new_pk   -- only if old pk is lower (!)
      OR old_pk IS NULL)   -- or does not exist
  )
, up1(dummy) AS (  -- update current to false FIRST
  UPDATE assertion a
  SET    current = false
  FROM   new_current n
  WHERE  (a.a, a.b, a.c, a.d, a.pk)
       = (n.a, n.b, n.c, n.d, n.old_pk)  -- only matches existing old pk
  RETURNING true
  )
UPDATE assertion a  -- then update the new current row
SET    current = true
FROM   new_current n
LEFT   JOIN (SELECT FROM up1 LIMIT 1) AS force_update_order ON true  -- !!!
WHERE  (a.a, a.b, a.c, a.d, a.pk)
    = (n.a, n.b, n.c, n.d, n.new_pk);
到目前為止，您嘗試過的其他任何東西都應該相形見絀。
子查詢通過簡單的聚合a1獲取每個組中非目前最大的行。這是最佳選擇，因為每組的候選行 ( )*很少。*否則，使用模擬索引跳過掃描進一步優化此步驟：pk``max()``NOT current
優化 GROUP BY 查詢以檢索每個使用者的最新行
從大表中獲取給定鍵的不同值的最佳性能
棘手的部分是保持UNIQUE約束愉快。它不允許在任何給定時間每組有兩個目前行。同一查詢的 CTE 之間沒有執行順序 - 只要一個不引用另一個。我放入了這樣一個虛擬引用來強制更新順序。
當還沒有目前行時，CTEup1不返回任何行。所以我們使用LEFT JOIN. 並且LIMIT 1永遠不要重複行。

引用自：https://dba.stackexchange.com/questions/313403

帶有大表的 UPDATE FROM 很慢並且使用 Seq Scans

選項1

選項 2

選項 3

選項 4

選項 5

初始查詢

一般查詢

相關問答

如果 PostgreSQL DB 或伺服器異常關閉，臟數據會發生什麼？

哪個索引與 pg_trgm 擴展配合得更好？

更新速度慢

更新具有相同值的行實際上會更新該行嗎？

臨時表上的索引使用情況

當 PostgreSQL 中的計數器達到零時，如何有效地刪除記錄？