Postgresql

添加 WHERE 子句會降低查詢速度

  • June 17, 2021

我們執行一個數據庫

x86_64-pc-linux-gnu 上的 PostgreSQL 12.6,由 Debian clang 版本 10.0.1 編譯,64 位

我們正在嘗試獲取最新條目具有status“失敗”的結果集的頂部行。

主表是product_tracking

                                         Table "public.product_tracking"
   Column     |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description
---------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
product_id    | character varying(512)   |           | not null |         | extended |              |
delivery_name | character varying(512)   |           | not null |         | extended |              |
feed_gid      | uuid                     |           | not null |         | plain    |              |
status        | status                   |           | not null |         | plain    |              |
updated_at    | timestamp with time zone |           |          | now()   | plain    |              |
errors        | text                     |           |          |         | extended |              |
created_at    | timestamp with time zone |           |          |         | plain    |              |
Indexes:
   "product_tracking_pkey" PRIMARY KEY, btree (product_id, delivery_name, feed_gid)
   "product_tracking_created_at_idx" btree (created_at)
   "product_tracking_created_at_product_idx" btree (created_at, product_id, delivery_name, feed_gid)
   "product_tracking_delivery_name_idx" btree (delivery_name)
   "product_tracking_feed_gid_idx" btree (feed_gid)
   "product_tracking_product_id_idx" btree (product_id)

我們最近添加了索引product_tracking_created_at_product_idx,希望這會加快新查詢的排序。

原始查詢和計劃:

explain analyze
WITH required_feeds AS (
   SELECT gid, name feed_name
   FROM feeds
)
SELECT product_id, feed_gid, errors
FROM product_tracking pt1
        INNER JOIN required_feeds f ON (pt1.feed_gid = f.gid)
WHERE updated_at = (SELECT MAX(updated_at)
                   FROM product_tracking pt2
                   WHERE pt1.product_id = pt2.product_id
                     AND pt1.feed_gid = pt2.feed_gid)
 AND status = 'Failed'
ORDER BY created_at desc, product_id desc, delivery_name desc, feed_gid desc
LIMIT 100
Limit  (cost=0.56..23433928.19 rows=100 width=138) (actual time=0.245..4.107 rows=100 loops=1)
 ->  Nested Loop  (cost=0.56..111311156.82 rows=475 width=138) (actual time=0.244..4.092 rows=100 loops=1)
       Join Filter: (pt1.feed_gid = feeds.gid)
       Rows Removed by Join Filter: 4888
       ->  Index Scan Backward using product_tracking_created_at_product_idx on product_tracking pt1  (cost=0.56..111309426.64 rows=475 width=138) (actual time=0.210..3.190 rows=100 loops=1)
             Filter: ((status = 'Failed'::status) AND (updated_at = (SubPlan 1)))
             Rows Removed by Filter: 1515
             SubPlan 1
               ->  Aggregate  (cost=8.65..8.66 rows=1 width=8) (actual time=0.017..0.017 rows=1 loops=100)
                     ->  Index Scan using product_tracking_pkey on product_tracking pt2  (cost=0.56..8.65 rows=1 width=8) (actual time=0.013..0.016 rows=1 loops=100)
                           Index Cond: (((product_id)::text = (pt1.product_id)::text) AND (feed_gid = pt1.feed_gid))
       ->  Materialize  (cost=0.00..6.64 rows=243 width=16) (actual time=0.000..0.004 rows=50 loops=100)
             ->  Seq Scan on feeds  (cost=0.00..5.43 rows=243 width=16) (actual time=0.009..0.054 rows=234 loops=1)
Planning Time: 1.855 ms
Execution Time: 4.238 ms

使用過濾器,required_feeds我們得到一個看起來像這樣的執行計劃。我正在使用feed1feed2作為範例,實際上這些提要會發生變化。

explain analyze
WITH required_feeds AS (
   SELECT gid, name feed_name
   FROM feeds
   WHERE name in ('feed1', 'feed2')
)
SELECT product_id, feed_gid, errors
FROM product_tracking pt1
        INNER JOIN required_feeds f ON (pt1.feed_gid = f.gid)
WHERE updated_at = (SELECT MAX(updated_at)
                   FROM product_tracking pt2
                   WHERE pt1.product_id = pt2.product_id
                     AND pt1.feed_gid = pt2.feed_gid)
 AND status = 'Failed'
ORDER BY created_at desc, product_id desc, delivery_name desc, feed_gid desc
LIMIT 100
Limit  (cost=1485702.73..1485702.74 rows=4 width=138) (actual time=2265.019..2265.047 rows=100 loops=1)
 ->  Sort  (cost=1485702.73..1485702.74 rows=4 width=138) (actual time=2265.018..2265.034 rows=100 loops=1)
       Sort Key: pt1.created_at DESC, pt1.product_id DESC, pt1.delivery_name DESC, pt1.feed_gid DESC
       Sort Method: top-N heapsort  Memory: 72kB
       ->  Nested Loop  (cost=2183.11..1485702.69 rows=4 width=138) (actual time=269.623..2260.551 rows=13814 loops=1)
             ->  Seq Scan on feeds  (cost=0.00..6.04 rows=2 width=16) (actual time=0.013..0.069 rows=2 loops=1)
                   Filter: (name = ANY {pkinteractive,iok}'::text[]))
                   Rows Removed by Filter: 243
             ->  Bitmap Heap Scan on product_tracking pt1  (cost=2183.11..742848.30 rows=3 width=138) (actual time=142.994..1127.950 rows=6907 loops=2)
                   Recheck Cond: (feed_gid = feeds.gid)
                   Rows Removed by Index Recheck: 2814781
                   Filter: ((status = 'Failed'::status) AND (updated_at = (SubPlan 1)))
                   Rows Removed by Filter: 1751784
                   Heap Blocks: exact=99595 lossy=98823
                   ->  Bitmap Index Scan on product_tracking_feed_gid_idx  (cost=0.00..2183.11 rows=71806 width=0) (actual time=124.666..124.666 rows=1799676 loops=2)
                         Index Cond: (feed_gid = feeds.gid)
                   SubPlan 1
                     ->  Aggregate  (cost=8.65..8.66 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=16149)
                           ->  Index Scan using product_tracking_pkey on product_tracking pt2  (cost=0.56..8.65 rows=1 width=8) (actual time=0.012..0.012 rows=2 loops=16149)
                                 Index Cond: (((product_id)::text = (pt1.product_id)::text) AND (feed_gid = pt1.feed_gid))
Planning Time: 1.840 ms
Execution Time: 2265.242 ms

可能有用的統計數據:

SELECT count(*) AS row_count,
      avg(length(product_id)) AS avg_prod_len,
      avg(length(delivery_name)) AS avg_delivery_len,
      count(*) FILTER (WHERE status = 'Failed') AS ct_failed,
      count(DISTINCT (product_id)) AS distinct_product_id,
      count(DISTINCT (delivery_name)) AS distinct_delivery_name,
      count(DISTINCT (feed_gid)) AS distinct_feed_gid,
      count(DISTINCT (status)) AS distinct_status,
      count(DISTINCT (updated_at)) AS distinct_updated_at,
      count(DISTINCT (errors)) AS distinct_errors,
      count(DISTINCT (created_at)) AS distinct_created_at
FROM product_tracking;

row_count = 11601030
avg_prod_len = 12.48
avg_delivery_len = 17.298
ct_failed = 74881
distinct_product_id = 8638613
distinct_delivery_name = 7315794
distinct_feed_gid = 245
distinct_status = 3
distinct_updated_at = 9096954
distinct_errors = 16664
distinct_created_at = 8772269

feeds只有 245 行,所有值目前都是唯一的。

是什麼導致速度變慢,您有什麼建議?

補充評論:

feed_gid遺憾的是必須是 UUID,這是來自我們不擁有的下游系統的映射。status可以是 ‘Failed’、‘Pending’ 或 ‘Succeeded’,所以 aboolean不太可能。全天每分鐘都有幾筆針對此表的事務,因此在短期內重新排序或調整表設計是不可能的。


謝謝,歐文。由於您的輸入,我設法將查詢更改為:

SELECT * FROM product_tracking pt1
JOIN feeds f ON (f.gid = feed_gid)
WHERE updated_at = (SELECT MAX(updated_at)
                   FROM product_tracking pt2
                   WHERE pt1.feed_gid IN (
                       SELECT gid
                       from feeds
                       where name IN ('feed1')
                   )
                     AND pt1.product_id = pt2.product_id
                     AND pt1.feed_gid = pt2.feed_gid
)
 AND status = 'Failed'
ORDER BY created_at desc, product_id desc, delivery_name desc, feed_gid desc
LIMIT 100

Limit  (cost=0.56..23431662.26 rows=100 width=176) (actual time=1599.557..4797.615 rows=4 loops=1)
 Buffers: shared hit=12319324
 ->  Nested Loop  (cost=0.56..112003343.48 rows=478 width=176) (actual time=1599.556..4797.610 rows=4 loops=1)
       Join Filter: (pt1.feed_gid = f.gid)
       Rows Removed by Join Filter: 432
       Buffers: shared hit=12319324
       ->  Index Scan Backward using product_tracking_created_at_product_idx on product_tracking pt1  (cost=0.56..112001602.36 rows=478 width=150) (actual time=1599.483..4797.469 rows=4 loops=1)
             Filter: ((status = 'Failed'::status) AND (updated_at = (SubPlan 2)))
             Rows Removed by Filter: 12781499
             Buffers: shared hit=12319323
             SubPlan 2
               ->  Aggregate  (cost=8.65..8.66 rows=1 width=8) (actual time=0.000..0.001 rows=1 loops=96451)
                     Buffers: shared hit=227
                     ->  Result  (cost=0.56..8.65 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=96451)
                           One-Time Filter: (hashed SubPlan 1)
                           Buffers: shared hit=227
                           ->  Index Scan using product_tracking_pkey on product_tracking pt2  (cost=0.56..8.65 rows=1 width=8) (actual time=0.027..0.030 rows=3 loops=31)
                                 Index Cond: (((product_id)::text = (pt1.product_id)::text) AND (feed_gid = pt1.feed_gid))
                                 Buffers: shared hit=224
                           SubPlan 1
                             ->  Seq Scan on feeds  (cost=0.00..6.04 rows=1 width=16) (actual time=0.018..0.035 rows=1 loops=1)
                                   Filter: (name = 'absolute'::text)
                                   Rows Removed by Filter: 244
                                   Buffers: shared hit=3
       ->  Materialize  (cost=0.00..6.64 rows=243 width=26) (actual time=0.007..0.020 rows=109 loops=4)
             Buffers: shared hit=1
             ->  Seq Scan on feeds f  (cost=0.00..5.43 rows=243 width=26) (actual time=0.021..0.031 rows=109 loops=1)
                   Buffers: shared hit=1
Planning Time: 0.583 ms
Execution Time: 4797.695 ms

如果索引不需要向下掃描太遠,這會導致查詢超快product_tracking_created_at_product_idx。對於feed最近失敗的一個,查詢大約是半秒。對於feed一年多前的查詢,查詢最多可能需要 5 秒(如此共享查詢)。

切換查詢計劃可能有很多原因。一個非常糟糕的計劃通常表明不准確的列統計資訊和/或成本常數。它從對索引掃描的錯誤估計開始product_tracking_feed_gid_idx

(cost=0.00..2183.11 **rows=71806** width=0) (actual time=124.666..124.666 **rows=1799676** loops=2)

產生比 Postgres 預期更多的行*。*它並不止於此。關於ANALYZE統計和成本設置的相關答案數不勝數。這是一個啟動器:

最重要的是,一個表上的過濾器組合,連接到另一個表,過濾更多(獲得一列的最大值),按其他列排序,然後LIMIT是眾所周知的難以計劃。根據數據分佈的細節,非常不同的查詢計劃可能更可取。如果成本設置和列統計資訊不准確(尤其是數據分佈不均勻),這可能會迅速惡化。有關的:

小心記憶體查詢計劃,用於像這樣的棘手查詢,例如準備好的語句或嵌套在 PL/pgSQL 函式中的查詢。通常情況下,Postgres 不會因此而墮落,但如果前幾個呼叫在相同的查詢計劃下工作正常,它可能會。(動態 SQLEXECUTE可以是 PL/pgSQL 的一種解決方法。)

更快的查詢

在看到您添加的值頻率後,我希望這個等效查詢執行得更好:

SELECT product_id, feed_gid, errors
FROM   product_tracking  pt1
WHERE  status = 'Failed'  -- very selective (< 1% qualify)
AND    feed_gid IN (SELECT gid FROM feeds WHERE name IN ('feed1', 'feed2'))  -- also selective
AND    NOT EXISTS (  -- no later update
  SELECT FROM product_tracking pt2
  WHERE  pt2.product_id = pt1.product_id
  AND    pt2.feed_gid   = pt1.feed_gid
  AND    pt2.updated_at > pt1.updated_at
  )
ORDER  BY created_at DESC, product_id DESC, delivery_name DESC, feed_gid DESC
LIMIT  100;

我刪除了無意義的 CTE。

構造NOT EXISTS為允許具有相同條件的多行updated_at符合條件,就像您的原始行一樣。

至關重要的是,添加這個(體積小!)多列部分索引以使其快速

CREATE INDEX product_tracking_failed_idx ON product_tracking (created_at DESC, product_id DESC, delivery_name DESC, feed_gid DESC)
WHERE status = 'Failed';

出於此查詢的目的,它替換了您現有的索引product_tracking_created_at_product_idx(也針對外部ORDER BY子句量身定制),但為此目的它小更快。

也許這個更簡單的索引也可以完成這項工作。(測試一下!):

CREATE INDEX product_tracking_failed_idx ON product_tracking (created_at DESC)
WHERE status = 'Failed';

這個索引支持NOT EXISTS

CREATE INDEX product_tracking_feed_gid_product_id_updated_at_idx
ON product_tracking (feed_gid, product_id, updated_at DESC NULLS LAST);

我看到了優化表的列類型、列順序和索引的額外潛力。

如果updated_at並且created_at從不為 NULL,則應聲明它們NOT NULL

至於:

status可以是 Failed、Pending 或 Succeeded,所以布爾值不太可能

也可能:

succeeded boolean
  true - 'Succeeded'
  false - 'Failed'
  NULL - 'Pending'

1 字節而不是 7 - 10 字節,查詢更快更簡單。

但這超出了這個問題的範圍——顯然,這不是一個選擇。

引用自:https://dba.stackexchange.com/questions/294297