Update the id of a table on rows with the same name

September 28, 2015

This may be a newbie question, but I don't know how to solve it. I have a table like this:
name | id       | value
A      1286487    1286333
B      1286489    1286403
C      1286495    1286455
C      1286496    1286375
D      1286503    1286341
B      1286506    1286343
I want to update this table so that it looks like this:
name | id       | value
A      1286487    1286333
B      1286489    1286403
C      1286495    1286455
C      1286495    1286375
D      1286503    1286341
B      1286489    1286343
So the rows named B and C get the same id as the first row with that name. Can anyone help me?

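To meet the requirement, we first have to find the id we need - unsurprisingly, this can be found with the min() function. Then we do the update - the quickest way to write it is probably the following: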

UPDATE minupdate m 
  SET id = (SELECT min(id) 
              FROM minupdate
             WHERE name = m.name);

The downside is that with a larger number of rows it can be *very slow*. I populated the table with about 50,000 random rows using the following statement:

INSERT INTO minupdate (name, id)
SELECT translate(substr(i::text, 1, 1), '1234567890', 'ABCDEFGHIJ'), 
      random() * 100000
FROM generate_series(1, 50000) t(i);
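
To reproduce this test, the table has to exist before the INSERT runs. The answer does not show its definition, so the following is only a sketch: the column types are assumptions (integer for id would match the width=4 reported for it in the plans), not the author's actual DDL.

```sql
-- Hypothetical definition; the original answer does not show it.
CREATE TABLE minupdate (
    name  text,     -- assumed type
    id    integer,  -- assumed; random() * 100000 is cast to integer on insert
    value integer   -- assumed; left NULL by the test INSERT
);
```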

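On my test box, without an index, this took a very long time - I did not wait for it to produce an actual execution plan (using EXPLAIN (ANALYZE, BUFFERS)), but cancelled it and ran a plain EXPLAIN instead to see what was wrong: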

Update on minupdate m  (cost=0.00..136044281.50 rows=87720 width=12)
  ->  Seq Scan on minupdate m  (cost=0.00..136044281.50 rows=87720 width=12)
        SubPlan 1
          ->  Aggregate  (cost=1550.87..1550.88 rows=1 width=4)
                ->  Seq Scan on minupdate  (cost=0.00..1526.50 rows=9747 width=4)
                      Filter: (name = m.name)

So I added an index on (name, id) to get faster results:

Update on minupdate m  (cost=0.00..417486.00 rows=50000 width=15) (actual time=1771.008..1771.008 rows=0 loops=1)
  Buffers: shared hit=394845 read=653 dirtied=654
  ->  Seq Scan on minupdate m  (cost=0.00..417486.00 rows=50000 width=15) (actual time=0.062..1317.366 rows=50000 loops=1)
        Buffers: shared hit=172641 read=192
        SubPlan 2
          ->  Result  (cost=8.31..8.32 rows=1 width=0) (actual time=0.018..0.020 rows=1 loops=50000)
                Buffers: shared hit=171655 read=192
                InitPlan 1 (returns $1)
                  ->  Limit  (cost=0.29..8.31 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=50000)
                        Buffers: shared hit=171655 read=192
                        ->  Index Only Scan using minupdate_name_id_idx on minupdate  (cost=0.29..8.31 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=50000)
                              Index Cond: ((name = m.name) AND (id IS NOT NULL))
                              Heap Fetches: 50000
                              Buffers: shared hit=171655 read=192
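
The index statement itself is not shown in the answer. The name minupdate_name_id_idx in the plan above is what PostgreSQL generates by default for an unnamed index on (name, id) of minupdate, so it was presumably created with something along these lines (an assumption, not the author's exact command):

```sql
-- Assumed statement; with no explicit name, PostgreSQL calls the index
-- minupdate_name_id_idx, matching the name that appears in the plan.
CREATE INDEX ON minupdate (name, id);
```

With name as the leading column and id second, the smallest id for a given name is the first index entry for that name, which is why the plan can use a Limit over an Index Only Scan instead of aggregating.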

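Now the problem is this: the inner part of the plan is executed 50,000 times - once for every row to be updated. In the version without an index, each of those executions may scan the whole table - no wonder it takes ages.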

But there is more. Those 50,000 loops are ugly - let's find something better:

EXPLAIN (ANALYZE, BUFFERS)
UPDATE minupdate m 
  SET id = mins.id 
 FROM (SELECT name, min(id) AS id 
         FROM minupdate 
        GROUP BY name
      ) mins
WHERE m.name = mins.name;

This gives a much leaner plan:

Update on minupdate m  (cost=1861.76..4525.59 rows=91698 width=42) (actual time=995.845..995.845 rows=0 loops=1)
  Buffers: shared hit=241809 read=394 dirtied=843
  ->  Hash Join  (cost=1861.76..4525.59 rows=91698 width=42) (actual time=218.866..542.437 rows=50000 loops=1)
        Hash Cond: (m.name = mins.name)
        Buffers: shared hit=972 dirtied=427
        ->  Seq Scan on minupdate m  (cost=0.00..1402.98 rows=91698 width=8) (actual time=0.027..106.868 rows=50000 loops=1)
              Buffers: shared hit=486 dirtied=1
        ->  Hash  (cost=1861.65..1861.65 rows=9 width=36) (actual time=218.816..218.816 rows=9 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 1kB
              Buffers: shared hit=486 dirtied=426
              ->  Subquery Scan on mins  (cost=1861.47..1861.65 rows=9 width=36) (actual time=218.738..218.790 rows=9 loops=1)
                    Buffers: shared hit=486 dirtied=426
                    ->  HashAggregate  (cost=1861.47..1861.56 rows=9 width=6) (actual time=218.722..218.740 rows=9 loops=1)
                          Group Key: minupdate.name
                          Buffers: shared hit=486 dirtied=426
                          ->  Seq Scan on minupdate  (cost=0.00..1402.98 rows=91698 width=6) (actual time=0.007..111.580 rows=50000 loops=1)
                                Buffers: shared hit=486 dirtied=426
Planning time: 0.164 ms
Execution time: 995.908 ms

Now there are two sequential scan nodes in there (somewhat surprisingly), but both are executed only once (loops=1).

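What I did here was remove the correlated subquery and replace it with an inline view (the subquery after FROM), which is joined to the table being updated. This approach usually gives a better plan than the naive one. On big tables, the performance difference can be very significant.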
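
Whichever form of the UPDATE is used, the result can be sanity-checked afterwards. The query below is not part of the original answer; it is a small sketch that lists any name still associated with more than one id, so an empty result means every name now shares a single id.

```sql
-- Sanity check (not from the original answer): after the UPDATE,
-- no name should map to more than one distinct id.
SELECT name, count(DISTINCT id) AS distinct_ids
  FROM minupdate
 GROUP BY name
HAVING count(DISTINCT id) > 1;
```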

Quoted from: https://dba.stackexchange.com/questions/116324
