添加 WHERE 子句會降低查詢速度
我們執行一個數據庫
x86_64-pc-linux-gnu 上的 PostgreSQL 12.6,由 Debian clang 版本 10.0.1 編譯,64 位
我們正在嘗試獲取最新條目具有
status
“失敗”的結果集的頂部行。主表是
product_tracking
:Table "public.product_tracking" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ---------------+--------------------------+-----------+----------+---------+----------+--------------+------------- product_id | character varying(512) | | not null | | extended | | delivery_name | character varying(512) | | not null | | extended | | feed_gid | uuid | | not null | | plain | | status | status | | not null | | plain | | updated_at | timestamp with time zone | | | now() | plain | | errors | text | | | | extended | | created_at | timestamp with time zone | | | | plain | | Indexes: "product_tracking_pkey" PRIMARY KEY, btree (product_id, delivery_name, feed_gid) "product_tracking_created_at_idx" btree (created_at) "product_tracking_created_at_product_idx" btree (created_at, product_id, delivery_name, feed_gid) "product_tracking_delivery_name_idx" btree (delivery_name) "product_tracking_feed_gid_idx" btree (feed_gid) "product_tracking_product_id_idx" btree (product_id)
我們最近添加了索引
product_tracking_created_at_product_idx
,希望這會加快新查詢的排序。原始查詢和計劃:
explain analyze WITH required_feeds AS ( SELECT gid, name feed_name FROM feeds ) SELECT product_id, feed_gid, errors FROM product_tracking pt1 INNER JOIN required_feeds f ON (pt1.feed_gid = f.gid) WHERE updated_at = (SELECT MAX(updated_at) FROM product_tracking pt2 WHERE pt1.product_id = pt2.product_id AND pt1.feed_gid = pt2.feed_gid) AND status = 'Failed' ORDER BY created_at desc, product_id desc, delivery_name desc, feed_gid desc LIMIT 100
Limit (cost=0.56..23433928.19 rows=100 width=138) (actual time=0.245..4.107 rows=100 loops=1) -> Nested Loop (cost=0.56..111311156.82 rows=475 width=138) (actual time=0.244..4.092 rows=100 loops=1) Join Filter: (pt1.feed_gid = feeds.gid) Rows Removed by Join Filter: 4888 -> Index Scan Backward using product_tracking_created_at_product_idx on product_tracking pt1 (cost=0.56..111309426.64 rows=475 width=138) (actual time=0.210..3.190 rows=100 loops=1) Filter: ((status = 'Failed'::status) AND (updated_at = (SubPlan 1))) Rows Removed by Filter: 1515 SubPlan 1 -> Aggregate (cost=8.65..8.66 rows=1 width=8) (actual time=0.017..0.017 rows=1 loops=100) -> Index Scan using product_tracking_pkey on product_tracking pt2 (cost=0.56..8.65 rows=1 width=8) (actual time=0.013..0.016 rows=1 loops=100) Index Cond: (((product_id)::text = (pt1.product_id)::text) AND (feed_gid = pt1.feed_gid)) -> Materialize (cost=0.00..6.64 rows=243 width=16) (actual time=0.000..0.004 rows=50 loops=100) -> Seq Scan on feeds (cost=0.00..5.43 rows=243 width=16) (actual time=0.009..0.054 rows=234 loops=1) Planning Time: 1.855 ms Execution Time: 4.238 ms
使用過濾器,
required_feeds
我們得到一個看起來像這樣的執行計劃。我正在使用feed1
並feed2
作為範例,實際上這些提要會發生變化。explain analyze WITH required_feeds AS ( SELECT gid, name feed_name FROM feeds WHERE name in ('feed1', 'feed2') ) SELECT product_id, feed_gid, errors FROM product_tracking pt1 INNER JOIN required_feeds f ON (pt1.feed_gid = f.gid) WHERE updated_at = (SELECT MAX(updated_at) FROM product_tracking pt2 WHERE pt1.product_id = pt2.product_id AND pt1.feed_gid = pt2.feed_gid) AND status = 'Failed' ORDER BY created_at desc, product_id desc, delivery_name desc, feed_gid desc LIMIT 100
Limit (cost=1485702.73..1485702.74 rows=4 width=138) (actual time=2265.019..2265.047 rows=100 loops=1) -> Sort (cost=1485702.73..1485702.74 rows=4 width=138) (actual time=2265.018..2265.034 rows=100 loops=1) Sort Key: pt1.created_at DESC, pt1.product_id DESC, pt1.delivery_name DESC, pt1.feed_gid DESC Sort Method: top-N heapsort Memory: 72kB -> Nested Loop (cost=2183.11..1485702.69 rows=4 width=138) (actual time=269.623..2260.551 rows=13814 loops=1) -> Seq Scan on feeds (cost=0.00..6.04 rows=2 width=16) (actual time=0.013..0.069 rows=2 loops=1) Filter: (name = ANY {pkinteractive,iok}'::text[])) Rows Removed by Filter: 243 -> Bitmap Heap Scan on product_tracking pt1 (cost=2183.11..742848.30 rows=3 width=138) (actual time=142.994..1127.950 rows=6907 loops=2) Recheck Cond: (feed_gid = feeds.gid) Rows Removed by Index Recheck: 2814781 Filter: ((status = 'Failed'::status) AND (updated_at = (SubPlan 1))) Rows Removed by Filter: 1751784 Heap Blocks: exact=99595 lossy=98823 -> Bitmap Index Scan on product_tracking_feed_gid_idx (cost=0.00..2183.11 rows=71806 width=0) (actual time=124.666..124.666 rows=1799676 loops=2) Index Cond: (feed_gid = feeds.gid) SubPlan 1 -> Aggregate (cost=8.65..8.66 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=16149) -> Index Scan using product_tracking_pkey on product_tracking pt2 (cost=0.56..8.65 rows=1 width=8) (actual time=0.012..0.012 rows=2 loops=16149) Index Cond: (((product_id)::text = (pt1.product_id)::text) AND (feed_gid = pt1.feed_gid)) Planning Time: 1.840 ms Execution Time: 2265.242 ms
可能有用的統計數據:
SELECT count(*) AS row_count, avg(length(product_id)) AS avg_prod_len, avg(length(delivery_name)) AS avg_delivery_len, count(*) FILTER (WHERE status = 'Failed') AS ct_failed, count(DISTINCT (product_id)) AS distinct_product_id, count(DISTINCT (delivery_name)) AS distinct_delivery_name, count(DISTINCT (feed_gid)) AS distinct_feed_gid, count(DISTINCT (status)) AS distinct_status, count(DISTINCT (updated_at)) AS distinct_updated_at, count(DISTINCT (errors)) AS distinct_errors, count(DISTINCT (created_at)) AS distinct_created_at FROM product_tracking; row_count = 11601030 avg_prod_len = 12.48 avg_delivery_len = 17.298 ct_failed = 74881 distinct_product_id = 8638613 distinct_delivery_name = 7315794 distinct_feed_gid = 245 distinct_status = 3 distinct_updated_at = 9096954 distinct_errors = 16664 distinct_created_at = 8772269
表
feeds
只有 245 行,所有值目前都是唯一的。是什麼導致速度變慢,您有什麼建議?
補充評論:
feed_gid
遺憾的是必須是 UUID,這是來自我們不擁有的下游系統的映射。status
可以是 ‘Failed’、‘Pending’ 或 ‘Succeeded’,所以 aboolean
不太可能。全天每分鐘都有幾筆針對此表的事務,因此在短期內重新排序或調整表設計是不可能的。謝謝,歐文。由於您的輸入,我設法將查詢更改為:
SELECT * FROM product_tracking pt1 JOIN feeds f ON (f.gid = feed_gid) WHERE updated_at = (SELECT MAX(updated_at) FROM product_tracking pt2 WHERE pt1.feed_gid IN ( SELECT gid from feeds where name IN ('feed1') ) AND pt1.product_id = pt2.product_id AND pt1.feed_gid = pt2.feed_gid ) AND status = 'Failed' ORDER BY created_at desc, product_id desc, delivery_name desc, feed_gid desc LIMIT 100 Limit (cost=0.56..23431662.26 rows=100 width=176) (actual time=1599.557..4797.615 rows=4 loops=1) Buffers: shared hit=12319324 -> Nested Loop (cost=0.56..112003343.48 rows=478 width=176) (actual time=1599.556..4797.610 rows=4 loops=1) Join Filter: (pt1.feed_gid = f.gid) Rows Removed by Join Filter: 432 Buffers: shared hit=12319324 -> Index Scan Backward using product_tracking_created_at_product_idx on product_tracking pt1 (cost=0.56..112001602.36 rows=478 width=150) (actual time=1599.483..4797.469 rows=4 loops=1) Filter: ((status = 'Failed'::status) AND (updated_at = (SubPlan 2))) Rows Removed by Filter: 12781499 Buffers: shared hit=12319323 SubPlan 2 -> Aggregate (cost=8.65..8.66 rows=1 width=8) (actual time=0.000..0.001 rows=1 loops=96451) Buffers: shared hit=227 -> Result (cost=0.56..8.65 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=96451) One-Time Filter: (hashed SubPlan 1) Buffers: shared hit=227 -> Index Scan using product_tracking_pkey on product_tracking pt2 (cost=0.56..8.65 rows=1 width=8) (actual time=0.027..0.030 rows=3 loops=31) Index Cond: (((product_id)::text = (pt1.product_id)::text) AND (feed_gid = pt1.feed_gid)) Buffers: shared hit=224 SubPlan 1 -> Seq Scan on feeds (cost=0.00..6.04 rows=1 width=16) (actual time=0.018..0.035 rows=1 loops=1) Filter: (name = 'absolute'::text) Rows Removed by Filter: 244 Buffers: shared hit=3 -> Materialize (cost=0.00..6.64 rows=243 width=26) (actual time=0.007..0.020 rows=109 loops=4) Buffers: shared hit=1 -> Seq Scan on feeds f (cost=0.00..5.43 rows=243 width=26) (actual time=0.021..0.031 rows=109 loops=1) Buffers: shared hit=1 Planning Time: 0.583 ms Execution Time: 4797.695 ms
如果索引不需要向下掃描太遠,這會導致查詢超快
product_tracking_created_at_product_idx
。對於feed
最近失敗的一個,查詢大約是半秒。對於feed
一年多前的查詢,查詢最多可能需要 5 秒(如此共享查詢)。
切換查詢計劃可能有很多原因。一個非常糟糕的計劃通常表明不准確的列統計資訊和/或成本常數。它從對索引掃描的錯誤估計開始
product_tracking_feed_gid_idx
:(cost=0.00..2183.11 **rows=71806** width=0) (actual time=124.666..124.666 **rows=1799676** loops=2)
產生比 Postgres 預期更多的行*。*它並不止於此。關於
ANALYZE
統計和成本設置的相關答案數不勝數。這是一個啟動器:最重要的是,一個表上的過濾器組合,連接到另一個表,過濾更多(獲得一列的最大值),按其他列排序,然後
LIMIT
是眾所周知的難以計劃。根據數據分佈的細節,非常不同的查詢計劃可能更可取。如果成本設置和列統計資訊不准確(尤其是數據分佈不均勻),這可能會迅速惡化。有關的:小心記憶體查詢計劃,用於像這樣的棘手查詢,例如準備好的語句或嵌套在 PL/pgSQL 函式中的查詢。通常情況下,Postgres 不會因此而墮落,但如果前幾個呼叫在相同的查詢計劃下工作正常,它可能會。(動態 SQL
EXECUTE
可以是 PL/pgSQL 的一種解決方法。)更快的查詢
在看到您添加的值頻率後,我希望這個等效查詢執行得更好:
SELECT product_id, feed_gid, errors FROM product_tracking pt1 WHERE status = 'Failed' -- very selective (< 1% qualify) AND feed_gid IN (SELECT gid FROM feeds WHERE name IN ('feed1', 'feed2')) -- also selective AND NOT EXISTS ( -- no later update SELECT FROM product_tracking pt2 WHERE pt2.product_id = pt1.product_id AND pt2.feed_gid = pt1.feed_gid AND pt2.updated_at > pt1.updated_at ) ORDER BY created_at DESC, product_id DESC, delivery_name DESC, feed_gid DESC LIMIT 100;
我刪除了無意義的 CTE。
構造
NOT EXISTS
為允許具有相同條件的多行updated_at
符合條件,就像您的原始行一樣。至關重要的是,添加這個(體積小!)多列部分索引以使其快速:
CREATE INDEX product_tracking_failed_idx ON product_tracking (created_at DESC, product_id DESC, delivery_name DESC, feed_gid DESC) WHERE status = 'Failed';
出於此查詢的目的,它替換了您現有的索引
product_tracking_created_at_product_idx
(也針對外部ORDER BY
子句量身定制),但為此目的它更小更快。也許這個更簡單的索引也可以完成這項工作。(測試一下!):
CREATE INDEX product_tracking_failed_idx ON product_tracking (created_at DESC) WHERE status = 'Failed';
這個索引支持
NOT EXISTS
:CREATE INDEX product_tracking_feed_gid_product_id_updated_at_idx ON product_tracking (feed_gid, product_id, updated_at DESC NULLS LAST);
我看到了優化表的列類型、列順序和索引的額外潛力。
如果
updated_at
並且created_at
從不為 NULL,則應聲明它們NOT NULL
。至於:
status
可以是 Failed、Pending 或 Succeeded,所以布爾值不太可能也可能:
succeeded boolean true - 'Succeeded' false - 'Failed' NULL - 'Pending'
1 字節而不是 7 - 10 字節,查詢更快更簡單。
但這超出了這個問題的範圍——顯然,這不是一個選擇。