Postgresql

Postgres UPDATE 使用來自另一個表的數據 - 僅索引掃描用於相關子查詢但不連接

  • February 4, 2020

語境

我正在調整UPDATE從另一個(大)表中選擇的批量。我的目的是提供一個覆蓋索引來支持源表的僅索引掃描。我意識到必須清理源表以更新其可見性圖。

到目前為止,我的調查表明,優化器選擇僅在UPDATE使用相關子查詢時對源表進行索引掃描,但在使用連接時似乎使用標準索引掃描(UPDATE...FROM)。我問這個問題是為了理解為什麼。

我在這裡提供一個簡化的例子來說明差異。

我使用的是 Postgres 9.6.8,但 10.11 和 11.6 的計劃非常相似。我已經使用官方圖像複製了在 Docker 中安裝 vanilla 9.6 Postgres 的計劃,以及此處的 db<>fiddle 。

設置

CREATE TABLE lookup (
   surrogate_key   BIGINT PRIMARY KEY,
   natural_key     TEXT NOT NULL UNIQUE,
   data            TEXT NOT NULL);

INSERT INTO lookup
SELECT id, 'nk'||id, random()::text
FROM generate_series(1,400000) id;

CREATE UNIQUE INDEX lookup_ix ON lookup(natural_key, surrogate_key);

VACUUM ANALYSE lookup;

CREATE TABLE target (
   target_id               BIGINT PRIMARY KEY,
   lookup_natural_key      TEXT NOT NULL,
   lookup_surrogate_key    BIGINT,
   data                    TEXT NOT NULL
);

INSERT INTO target (target_id, lookup_natural_key, data)
SELECT id+1000, 'nk'||id, random()::text
FROM generate_series(1,1000) id;

ANALYSE target;

使用連接更新

EXPLAIN (ANALYSE, VERBOSE, BUFFERS)
UPDATE target
SET lookup_surrogate_key = surrogate_key
FROM lookup
WHERE lookup_natural_key = natural_key;

標準索引掃描- 因此從表lookup_ix中讀取堆塊:lookup

Update on public.target  (cost=0.42..7109.00 rows=1000 width=54) (actual time=76.688..76.688 rows=0 loops=1)
 Buffers: shared hit=8514 read=550 dirtied=16
 -&gt;  Nested Loop  (cost=0.42..7109.00 rows=1000 width=54) (actual time=0.050..62.493 rows=1000 loops=1)
       Output: target.target_id, target.lookup_natural_key, lookup.surrogate_key, target.data, target.ctid, lookup.ctid
       Buffers: shared hit=3479 read=535
       -&gt;  Seq Scan on public.target  (cost=0.00..19.00 rows=1000 width=40) (actual time=0.013..7.691 rows=1000 loops=1)
             Output: target.target_id, target.lookup_natural_key, target.data, target.ctid
             Buffers: shared hit=9
       -&gt;  Index Scan using lookup_ix on public.lookup  (cost=0.42..7.08 rows=1 width=22) (actual time=0.020..0.027 rows=1 loops=1000)
             Output: lookup.surrogate_key, lookup.ctid, lookup.natural_key
             Index Cond: (lookup.natural_key = target.lookup_natural_key)
             Buffers: shared hit=3470 read=535
Planning time: 0.431 ms
Execution time: 76.826 ms

使用相關子查詢更新

EXPLAIN (ANALYSE, VERBOSE, BUFFERS)
UPDATE target
SET lookup_surrogate_key = (
   SELECT surrogate_key
   FROM lookup
   WHERE lookup_natural_key = natural_key);

lookup_ix僅按預期掃描索引:

Update on public.target  (cost=0.00..4459.00 rows=1000 width=47) (actual time=52.947..52.947 rows=0 loops=1)
 Buffers: shared hit=8050 read=15 dirtied=16
 -&gt;  Seq Scan on public.target  (cost=0.00..4459.00 rows=1000 width=47) (actual time=0.052..40.306 rows=1000 loops=1)
       Output: target.target_id, target.lookup_natural_key, (SubPlan 1), target.data, target.ctid
       Buffers: shared hit=3015
       SubPlan 1
         -&gt;  Index Only Scan using lookup_ix on public.lookup  (cost=0.42..4.44 rows=1 width=8) (actual time=0.013..0.019 rows=1 loops=1000)
               Output: lookup.surrogate_key
               Index Cond: (lookup.natural_key = target.lookup_natural_key)
               Heap Fetches: 0
               Buffers: shared hit=3006
Planning time: 0.130 ms
Execution time: 52.987 ms

db<>在這裡擺弄

我知道查詢在邏輯上並不相同(當lookup給定的沒有/多行時行為不同natural_key),但我對lookup_ix.

誰能解釋一下為什麼加入版本不能使用僅索引掃描?

只有當索引掃描返回的所有列都儲存在索引中時,才能使用僅索引掃描。

您的索引lookup_ix僅包含列natural_keysurrogate_key,但從執行計劃中您可以看到:

-&gt;  Index Scan using lookup_ix on public.lookup
     Output: lookup.surrogate_key, lookup.ctid, lookup.natural_key

所以索引掃描獲取行ctid的物理地址。這是因為我們要執行UPDATE; 請參閱以下評論src\include/nodes/plannodes.h

* When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we have to uniquely
* identify all the source rows, not only those from the target relations, so
* that we can perform EvalPlanQual rechecking at need.  For plain tables we
* can just fetch the TID, much as for a target relation; this case is
* represented by ROW_MARK_REFERENCE.  Otherwise (for example for VALUES or
* FUNCTION scans) we have to copy the whole row value.  ROW_MARK_COPY is
* pretty inefficient, since most of the time we'll never need the data; but
* fortunately the overhead is usually not performance-critical in practice.
* By default we use ROW_MARK_COPY for foreign tables, but if the FDW has
* a concept of rowid it can request to use ROW_MARK_REFERENCE instead.
* (Again, this probably doesn't make sense if a physical remote fetch is
* needed, but for FDWs that map to local storage it might be credible.)

EvalPlanQualPostgreSQL 是一個行話術語/函式名稱,用於獲取將被鎖定或修改的行的最新版本。)

要確定是這種情況,請嘗試EXPLAIN在沒有更新的情況下執行相同的查詢,您將看到計劃是相同的,只是ctid沒有獲取 s,並且它使用僅索引掃描。


現在有一件事情讓我感到驚訝:

每個(葉)索引條目確實包含ctid它引用的行,所以我看不出為什麼不能通過僅索引掃描來獲取它的原因。我在郵件列表上問過,但似乎改變它可能不是微不足道的。

引用自:https://dba.stackexchange.com/questions/258696