PostgreSQL

Suboptimal plan when joining on an ID column with incorrect statistics

  • December 29, 2020

We are having problems with a join query between 2 tables (only the columns relevant to the query are shown):

CREATE TABLE IF NOT EXISTS payments (
   id TEXT PRIMARY KEY,
   consumer_id TEXT,
   created_at TIMESTAMP WITHOUT TIME ZONE,
   status TEXT,
   tier TEXT,
   number_of_installments INTEGER,
   authorization_result TEXT,
   authorization_details JSONB,
   last_updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL
);

The assessmentreports table is partitioned using declarative partitioning, with the following structure and indexes on all partitions:

CREATE TABLE IF NOT EXISTS assessmentreports (
   id TEXT PRIMARY KEY,
   payment_id TEXT,
   kind TEXT,
   created_at TIMESTAMP WITHOUT TIME ZONE,
   result TEXT,
   metadata JSONB,
   last_updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL
) PARTITION BY RANGE (created_at);

CREATE INDEX assessmentreports_kind_result ON public.assessmentreports USING btree (kind, result);
CREATE INDEX assessmentreports_payment_id ON public.assessmentreports USING btree (payment_id);
CREATE UNIQUE INDEX assessmentreports_pkey ON public.assessmentreports USING btree (created_at, id);
CREATE INDEX assessmentreports_last_updated_at_idx ON public.assessmentreports USING brin (last_updated_at);

There can be multiple entries in assessmentreports for each payment_id, each with a different kind. Unfortunately, kind is not unique for a given payment_id.

Here is a representative example:

SELECT
 payments.id,
 status,
 consumer_id,
 tier,
 authorization_result,
 authorization_details,
 number_of_installments,
 max(result) FILTER (WHERE (kind = 'credit_check')) AS credit_result,
 max(result) FILTER (WHERE (kind = 'new_credit_check')) AS new_check_result,
 max(result) FILTER (WHERE (kind = 'address_assessment')) AS reject_inference_result,
 max(metadata ->> 'credit_record_id') FILTER (WHERE (kind = 'new_credit_check')) AS new_credit_record_id,
 max(metadata ->> 'credit_record_id') FILTER (WHERE (kind = 'credit_check')) AS credit_record_id
FROM payments
LEFT JOIN assessmentreports ON assessmentreports.payment_id = payments.id
 AND kind IN ('credit_check', 'new_credit_check', 'address_assessment')
 AND assessmentreports.created_at < now() -- To remove future partitions from plan
WHERE payments.last_updated >= now() - '2 hours'::INTERVAL
GROUP BY 1

If we keep the time window at 2 hours, the planner chooses an index scan on payment_id = id and the query is very fast (2 seconds). Note that I have removed the per-partition nodes because they all look the same.

GroupAggregate  (cost=6231334.74..6231477.18 rows=2187 width=305)
 Group Key: payments.id
 ->  Sort  (cost=6231334.74..6231343.35 rows=3445 width=467)
       Sort Key: payments.id
       ->  Nested Loop Left Join  (cost=0.99..6231132.34 rows=3445 width=467)
             ->  Index Scan using payments_last_updated on payments (cost=0.56..1801.97 rows=2187 width=145)
                   Index Cond: (last_updated >= '2020-12-25 13:57:26.927564'::timestamp without time zone)
             ->  Append  (cost=0.43..2846.99 rows=135 width=344)
                   ->  Index Scan using assessmentreports_2016_payment_id on assessmentreports_2016  (cost=0.43..14.47 rows=1 width=181)
                         Index Cond: (payment_id = payments.id)
                         Filter: ((created_at < '2020-12-25 15:57:26.927564'::timestamp without time zone) AND (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[])))
                   ->  Index Scan using assessmentreports_201701_payment_id on assessmentreports_201701  (cost=0.42..13.39 rows=1 width=161)
                         Index Cond: (payment_id = payments.id)
                         Filter: ((created_at < '2020-12-25 15:57:26.927564'::timestamp without time zone) AND (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[])))
                   ->  Index Scan using assessmentreports_201702_payment_id on assessmentreports_201702  (cost=0.43..17.74 rows=1 width=192)
                         Index Cond: (payment_id = payments.id)
                         Filter: ((created_at < '2020-12-25 15:57:26.927564'::timestamp without time zone) AND (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[])))

Increasing the window even slightly beyond 2 hours changes the plan completely, and performance is several orders of magnitude worse (40+ minutes), even though the number of rows in payments passing the last_updated_at filter does not grow by much. Again, I have removed the per-partition nodes because they all look the same:

GroupAggregate  (cost=35839796.76..35841181.68 rows=21288 width=304)
 Group Key: payments.id
 ->  Sort  (cost=35839796.76..35839880.48 rows=33487 width=463)
       Sort Key: payments.id
       ->  Hash Join  (cost=25622.81..35830297.49 rows=33487 width=463)
             Hash Cond: (assessmentreports_2016.payment_id = payments.id)
             ->  Append  (cost=7457.32..35712704.68 rows=37877067 width=341)
                   ->  Bitmap Heap Scan on assessmentreports_2016  (cost=7457.32..181237.37 rows=228085 width=181)
                         Recheck Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
                         Filter: (created_at <= '2020-12-25 17:02:52.008321'::timestamp without time zone)
                         ->  Bitmap Index Scan on assessmentreports_2016_kind_result  (cost=0.00..7400.30 rows=228085 width=0)
                               Index Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
                   ->  Bitmap Heap Scan on assessmentreports_201701  (cost=4018.67..32291.56 rows=130762 width=161)
                         Recheck Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
                         Filter: (created_at <= '2020-12-25 17:02:52.008321'::timestamp without time zone)
                         ->  Bitmap Index Scan on assessmentreports_201701_kind_result  (cost=0.00..3985.98 rows=130762 width=0)
                               Index Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
             ->  Hash  (cost=17899.39..17899.39 rows=21288 width=144)
                   ->  Index Scan using payments_last_updated on payments  (cost=0.56..17899.39 rows=21288 width=144)
                         Index Cond: (last_updated >= '2020-12-25 07:02:52.008321'::timestamp without time zone)

My understanding of the first plan is that it filters assessmentreports on payments.id = payment_id after filtering payments by last_updated_at. Naively that makes sense to me, because it picks out far fewer payments than filtering on kind alone would.

What we have tried so far

  • random_page_cost is set to 4, but it should probably be 1.1 or 1 since we are on SSDs. We tried setting it per transaction, and that seems to extend the use of the index-scan plan to a 10-hour window. The 10-hour window completes in 4 seconds, while the "bitmap" plan had not finished after 10 minutes.
  • Created a custom dependencies statistics object on (payment_id, kind), but that does not seem to do anything. (Both attempts are sketched after this list.)
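For reference, a minimal sketch of both attempts; the statistics object name payment_kind_dep is our own choice, and the value 1.1 is illustrative:

BEGIN;
SET LOCAL random_page_cost = 1.1;  -- planner cost override for this transaction only
-- ... run the join query from above ...
COMMIT;

CREATE STATISTICS payment_kind_dep (dependencies)
    ON payment_id, kind FROM assessmentreports;
ANALYZE assessmentreports;  -- extended statistics are only populated by ANALYZE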

Any ideas? Are we on the right track in thinking that the index scan plan is better suited to these data volumes and PostgreSQL is simply getting confused? Looking at the statistics for that payment_id column, I did not see anything that would help the planner understand how much filtering on payment_id reduces the row count, although it seems to understand well what filtering on kind does.
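For completeness, this is roughly how those per-column statistics can be inspected; pg_stats is the standard view, and the partition name is taken from the plans above:

SELECT attname, n_distinct, null_frac, correlation
FROM   pg_stats
WHERE  tablename = 'assessmentreports_2016'
AND    attname IN ('payment_id', 'kind');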

Approximate statistics

I have to be deliberately vague here, so as not to share too much information.

  1. The total number of rows in assessmentreports (all partitions combined) is well over a billion, at 202 GB in size.
  2. The number of kinds has grown over time and there are now more than 20.
  3. The number of assessmentreports per payment depends on the payment, but the vast majority have more than 20.
  4. payments is 33 GB in size and is not partitioned.

Hardware

We run PostgreSQL 12 (not Redshift) on RDS with 16 vCPUs, 48 GB RAM, and SSD storage.

Indexes

The query plan reveals that you have an index payments_last_updated. That is all we need from payments.

As for assessmentreports:

"There can be multiple entries in assessmentreports for each payment_id, each with a different kind."

So there could (should) be this UNIQUE index:

CREATE UNIQUE INDEX assessmentreports_payment_id_kind_uni ON assessmentreports (payment_id, kind);

That should help to get index scans.
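Note that PostgreSQL 12 only allows a unique index on a partitioned table if it includes the partition key (created_at here), so the statement above would fail on the parent table. A per-partition variant would look roughly like this (partition name taken from the plans above):

CREATE UNIQUE INDEX assessmentreports_2016_payment_id_kind_uni
    ON assessmentreports_2016 (payment_id, kind);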

Query

I would try moving the aggregation down into a LATERAL subquery. That should favor index scans, in combination with the new(?) tailored index above. It also does away with the outer aggregation.

SELECT p.* --  or your list of payments columns
    , a.*
FROM   payments p
LEFT   JOIN LATERAL (
  SELECT max(result)                          FILTER (WHERE kind = 'credit_check')       AS credit_result
       , max(result)                          FILTER (WHERE kind = 'new_credit_check')   AS new_check_result
       , max(result)                          FILTER (WHERE kind = 'address_assessment') AS reject_inference_result
       , max(metadata ->> 'credit_record_id') FILTER (WHERE kind = 'new_credit_check')   AS new_credit_record_id
       , max(metadata ->> 'credit_record_id') FILTER (WHERE kind = 'credit_check')       AS credit_record_id
  FROM   assessmentreports
  WHERE  payment_id = p.id
  AND    kind IN ('credit_check', 'new_credit_check', 'address_assessment')
  AND    created_at < now()
  ) a ON true
WHERE  p.last_updated_at >= now() - interval '2 hours';

Setup

More than 500,000,000 rows in assessmentreports, and you operate with text PK and FK columns? That is typically more expensive to store and process than integer (or bigint if you must).
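As a sketch of the idea only (hypothetical table name; actually migrating billions of existing text keys is its own project), a numeric surrogate key would look like:

CREATE TABLE payments_new (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
   -- remaining columns as in payments above
);
-- assessmentreports.payment_id would then become bigint as well,
-- which shrinks every index containing it.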

You may have more to gain from favorably reordering your table columns.

metadata JSONB is particularly suspicious for a table of this cardinality. Use dedicated columns instead, if possible. That is more (space-)efficient, and NULL storage in a column is very cheap.
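For example, the only metadata key the query above reads is credit_record_id. A dedicated column could replace it; a sketch, keeping in mind that backfilling a table of this size would have to be batched:

ALTER TABLE assessmentreports ADD COLUMN credit_record_id text;

UPDATE assessmentreports
SET    credit_record_id = metadata ->> 'credit_record_id'
WHERE  metadata ? 'credit_record_id';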

48 GB of RAM is probably not enough to cache most of the data. (More of it would fit if you wasted less storage.) So **random_page_cost** should be at least 1.1, even with SSD storage.

This index looks odd:

CREATE UNIQUE INDEX assessmentreports_pkey ON assessmentreports USING btree (created_at, id);

For one thing, don't name a unique index "pkey"; only the PK index should be named like that. And why would you have a unique index on (created_at, id) to begin with?

All the usual performance advice applies. In particular: effective_cache_size. Set it high. Like:

effective_cache_size = 36GB
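On RDS this is set through the DB parameter group; on a self-managed instance the equivalent would be:

ALTER SYSTEM SET effective_cache_size = '36GB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();  -- neither setting requires a restart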

The manual:

"A higher value makes it more likely index scans will be used."

Finally, I see no **parallelism** in your query plans. That could help a lot with tables of your caliber. Have you configured it properly? The manual:

"Whenever PostgreSQL needs to combine rows from multiple sources into a single result set, it uses an Append or MergeAppend plan node. This commonly happens when implementing UNION ALL or when scanning a partitioned table. Such nodes can be used in parallel plans just as they can in any other plan. However, in a parallel plan, the planner may instead use a Parallel Append node."
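A quick way to check and experiment; the values are illustrative for a 16-vCPU machine, not tuned recommendations:

SHOW max_parallel_workers_per_gather;       -- 0 disables parallel plans entirely
SET  max_parallel_workers_per_gather = 4;   -- per session, for testing
-- max_parallel_workers caps the total number of parallel workers;
-- max_worker_processes can only be raised with a server restart.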

Quoted from: https://dba.stackexchange.com/questions/282131