Postgres UPDATE with data from another table - index-only scan used for correlated subquery but not for join

February 4, 2020

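Context

I am tuning a bulk UPDATE that selects from another (large) table. My intention is to provide a covering index to support an index-only scan of the source table. I realise the source table must be vacuumed to update its visibility map.

My investigation so far suggests that the optimiser chooses an index-only scan of the source table when the UPDATE uses a correlated subquery, but appears to use a standard index scan when the UPDATE uses a join (UPDATE...FROM). I am asking this question to understand why.

I provide a simplified example here to illustrate the difference.

I am using Postgres 9.6.8, but I get very similar plans on 10.11 and 11.6. I have reproduced the plans on a vanilla Postgres 9.6 install in Docker using the official image, and also on db<>fiddle.

Setup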

CREATE TABLE lookup (
   surrogate_key   BIGINT PRIMARY KEY,
   natural_key     TEXT NOT NULL UNIQUE,
   data            TEXT NOT NULL);

INSERT INTO lookup
SELECT id, 'nk'||id, random()::text
FROM generate_series(1,400000) id;

CREATE UNIQUE INDEX lookup_ix ON lookup(natural_key, surrogate_key);

VACUUM ANALYSE lookup;

CREATE TABLE target (
   target_id               BIGINT PRIMARY KEY,
   lookup_natural_key      TEXT NOT NULL,
   lookup_surrogate_key    BIGINT,
   data                    TEXT NOT NULL
);

INSERT INTO target (target_id, lookup_natural_key, data)
SELECT id+1000, 'nk'||id, random()::text
FROM generate_series(1,1000) id;

ANALYSE target;
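
Because an index-only scan relies on the visibility map, one way to sanity-check the setup is to compare relallvisible with relpages for lookup in pg_class. This is only a sketch, assuming the tables were created exactly as above; both columns are estimates maintained by VACUUM and ANALYSE rather than exact counts.

```sql
-- Assumes the lookup table from the setup above already exists
-- and has just been vacuumed.
SELECT relname,
       relpages,       -- estimated number of heap pages
       relallvisible   -- estimated pages marked all-visible in the visibility map
FROM pg_class
WHERE relname = 'lookup';
```

If relallvisible is close to relpages, most heap pages are marked all-visible, so an index-only scan should be able to avoid most heap fetches.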

Update using a join

EXPLAIN (ANALYSE, VERBOSE, BUFFERS)
UPDATE target
SET lookup_surrogate_key = surrogate_key
FROM lookup
WHERE lookup_natural_key = natural_key;

Standard index scan on lookup_ix, so heap blocks are read from the lookup table:

Update on public.target  (cost=0.42..7109.00 rows=1000 width=54) (actual time=76.688..76.688 rows=0 loops=1)
 Buffers: shared hit=8514 read=550 dirtied=16
 ->  Nested Loop  (cost=0.42..7109.00 rows=1000 width=54) (actual time=0.050..62.493 rows=1000 loops=1)
       Output: target.target_id, target.lookup_natural_key, lookup.surrogate_key, target.data, target.ctid, lookup.ctid
       Buffers: shared hit=3479 read=535
       ->  Seq Scan on public.target  (cost=0.00..19.00 rows=1000 width=40) (actual time=0.013..7.691 rows=1000 loops=1)
             Output: target.target_id, target.lookup_natural_key, target.data, target.ctid
             Buffers: shared hit=9
       ->  Index Scan using lookup_ix on public.lookup  (cost=0.42..7.08 rows=1 width=22) (actual time=0.020..0.027 rows=1 loops=1000)
             Output: lookup.surrogate_key, lookup.ctid, lookup.natural_key
             Index Cond: (lookup.natural_key = target.lookup_natural_key)
             Buffers: shared hit=3470 read=535
Planning time: 0.431 ms
Execution time: 76.826 ms

Update using a correlated subquery

EXPLAIN (ANALYSE, VERBOSE, BUFFERS)
UPDATE target
SET lookup_surrogate_key = (
   SELECT surrogate_key
   FROM lookup
   WHERE lookup_natural_key = natural_key);

Index-only scan on lookup_ix, as intended:

Update on public.target  (cost=0.00..4459.00 rows=1000 width=47) (actual time=52.947..52.947 rows=0 loops=1)
 Buffers: shared hit=8050 read=15 dirtied=16
 ->  Seq Scan on public.target  (cost=0.00..4459.00 rows=1000 width=47) (actual time=0.052..40.306 rows=1000 loops=1)
       Output: target.target_id, target.lookup_natural_key, (SubPlan 1), target.data, target.ctid
       Buffers: shared hit=3015
       SubPlan 1
         ->  Index Only Scan using lookup_ix on public.lookup  (cost=0.42..4.44 rows=1 width=8) (actual time=0.013..0.019 rows=1 loops=1000)
               Output: lookup.surrogate_key
               Index Cond: (lookup.natural_key = target.lookup_natural_key)
               Heap Fetches: 0
               Buffers: shared hit=3006
Planning time: 0.130 ms
Execution time: 52.987 ms

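I understand that the queries are not logically identical (they behave differently when lookup has no rows or multiple rows for a given natural_key), but I am surprised by the different usage of lookup_ix.

Can anyone explain why the join version cannot use an index-only scan?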

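An index-only scan can only be used if all columns returned by the index scan are stored in the index.
Your index lookup_ix contains only the columns natural_key and surrogate_key, but from the execution plan you can see: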
->  Index Scan using lookup_ix on public.lookup
     Output: lookup.surrogate_key, lookup.ctid, lookup.natural_key
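So the index scan also fetches ctid, the physical address of the row. That is because we are performing an UPDATE; see the following comment in src/include/nodes/plannodes.h: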
* When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we have to uniquely
* identify all the source rows, not only those from the target relations, so
* that we can perform EvalPlanQual rechecking at need.  For plain tables we
* can just fetch the TID, much as for a target relation; this case is
* represented by ROW_MARK_REFERENCE.  Otherwise (for example for VALUES or
* FUNCTION scans) we have to copy the whole row value.  ROW_MARK_COPY is
* pretty inefficient, since most of the time we'll never need the data; but
* fortunately the overhead is usually not performance-critical in practice.
* By default we use ROW_MARK_COPY for foreign tables, but if the FDW has
* a concept of rowid it can request to use ROW_MARK_REFERENCE instead.
* (Again, this probably doesn't make sense if a physical remote fetch is
* needed, but for FDWs that map to local storage it might be credible.)
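(EvalPlanQual is a PostgreSQL jargon term and function name for fetching the latest version of a row that is going to be locked or modified.)
To confirm that this is the case, try running EXPLAIN on the same query without the UPDATE; you will see that the plan is the same, except that the ctids are not fetched and an index-only scan is used.
Now there is one thing that surprises me:
Each (leaf) index entry does contain the ctid of the row it references, so I cannot see why it should not be possible to fetch it with an index-only scan. I have asked on the mailing list, but it seems that changing this may not be trivial.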
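
A minimal sketch of that check, assuming the lookup and target tables from the question. The select list is my own choice for illustration: it references only columns of lookup that are stored in lookup_ix, so nothing in the query forces a heap fetch on lookup.

```sql
-- Same join as the UPDATE ... FROM statement, written as a plain SELECT.
-- Only lookup.natural_key and lookup.surrogate_key are referenced,
-- both of which are contained in lookup_ix.
EXPLAIN (ANALYSE, VERBOSE, BUFFERS)
SELECT target.target_id, lookup.surrogate_key
FROM target
JOIN lookup ON target.lookup_natural_key = lookup.natural_key;
```

In the plan, look at the Output line of the scan on lookup: without the UPDATE, lookup.ctid should no longer be listed. Adding a column that is not in the index, such as lookup.data, to the select list would again require reading the heap.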

Source: https://dba.stackexchange.com/questions/258696
