PostgreSQL
How to optimize a self-join view on a 50-million-row table
I have the following simplified table with about 50M rows.
table sample (
    id uuid not null primary key,
    measured_date timestamp with time zone not null,
    segment_id uuid not null,
    activity_id uuid not null,
    value integer not null
);
Indexes:
    "sample_pkey" PRIMARY KEY, btree (id)
    "sample_idx" btree (segment_id, measured_date)
    "sample_uniq" UNIQUE CONSTRAINT, btree (segment_id, activity_id, measured_date)
    "sample_activity_idx" btree (activity_id)
I want to get my calculated values for every measured gap (between two distinct dates).
My query is the following:
SELECT ROW_NUMBER() OVER () AS id,
       t1.segment_id AS segment_id,
       t1.activity_id AS activity_id,
       t1.measured_date AS from_date,
       t2.measured_date AS to_date,
       t2.value AS cumulative_progress,
       (t2.value - t1.value) AS marginal_progress
FROM sample AS t1
JOIN sample AS t2
  ON t1.activity_id = t2.activity_id
 AND t1.segment_id = t2.segment_id
 AND t1.measured_date < t2.measured_date
WHERE t1.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
  AND t1.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
ORDER BY from_date ASC, to_date ASC;
The result comes back quickly:
QUERY PLAN
Sort  (cost=17.23..17.23 rows=1 width=154) (actual time=1.011..1.028 rows=171 loops=1)
  Sort Key: t1.measured_date, t2.measured_date
  Sort Method: quicksort  Memory: 70kB
  ->  WindowAgg  (cost=1.13..17.22 rows=1 width=154) (actual time=0.063..0.914 rows=171 loops=1)
        ->  Nested Loop  (cost=1.13..17.18 rows=1 width=124) (actual time=0.056..0.698 rows=171 loops=1)
              Join Filter: (t1.measured_date < t2.measured_date)
              Rows Removed by Join Filter: 190
              ->  Index Scan using sample_uniq on sample t1  (cost=0.56..8.58 rows=1 width=70) (actual time=0.021..0.043 rows=19 loops=1)
                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))
              ->  Index Scan using sample_uniq on sample t2  (cost=0.56..8.58 rows=1 width=86) (actual time=0.005..0.030 rows=19 loops=19)
                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))
Planning Time: 0.321 ms
Execution Time: 1.097 ms
However, when I create a view around this query, the results are usually poor.
Creating the view:
CREATE OR REPLACE VIEW sample_marginal AS
SELECT ROW_NUMBER() OVER () AS id,
       t1.segment_id AS segment_id,
       t1.activity_id AS activity_id,
       t1.measured_date AS from_date,
       t2.measured_date AS to_date,
       t2.value AS cumulative_progress,
       (t2.value - t1.value) AS marginal_progress
FROM sample AS t1
JOIN sample AS t2
  ON t1.activity_id = t2.activity_id
 AND t1.segment_id = t2.segment_id
 AND t1.measured_date < t2.measured_date;
Querying the view:
SELECT *
FROM sample_marginal
WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
  AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
ORDER BY from_date ASC, to_date ASC;

QUERY PLAN
Subquery Scan on sample_marginal  (cost=14106816.44..23333603.42 rows=2 width=154)
  Filter: ((sample_marginal.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (sample_marginal.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))
  ->  WindowAgg  (cost=14106816.44..22564147.32 rows=51297073 width=154)
        ->  Gather  (cost=14106816.44..20768749.77 rows=51297073 width=124)
              Workers Planned: 2
              ->  Merge Join  (cost=14105816.44..15638042.47 rows=21373780 width=124)
                    Merge Cond: ((t2.activity_id = t1.activity_id) AND (t2.segment_id = t1.segment_id))
                    Join Filter: (t1.updated_by_date < t2.updated_by_date)
                    ->  Sort  (cost=4751689.83..4797948.93 rows=18503642 width=86)
                          Sort Key: t2.activity_id, t2.segment_id
                          ->  Parallel Seq Scan on sample t2  (cost=0.00..1632749.42 rows=18503642 width=86)
                    ->  Materialize  (cost=9354126.62..9576170.32 rows=44408740 width=70)
                          ->  Sort  (cost=9354126.62..9465148.47 rows=44408740 width=70)
                                Sort Key: t1.activity_id, t1.segment_id
                                ->  Seq Scan on sample t1  (cost=0.00..1891800.40 rows=44408740 width=70)
I have actually never waited for this query to finish, because it takes so long; I would like to fix it before I can use it.
Running count(*) on the view, however, returns 1.8B rows (and is, of course, slow to execute).
Any ideas on how to improve my view?
The parts of an SQL statement are processed in a certain order (logically, at least). In the case at hand, the WHERE condition is evaluated before the window function row_number(). Your view definition and your query on the view invert that order by effectively introducing a subquery:

SELECT *
FROM (SELECT ROW_NUMBER() OVER () AS id,
             t1.segment_id AS segment_id,
             t1.activity_id AS activity_id,
             t1.measured_date AS from_date,
             t2.measured_date AS to_date,
             t2.value AS cumulative_progress,
             (t2.value - t1.value) AS marginal_progress
      FROM sample AS t1
      JOIN sample AS t2
        ON t1.activity_id = t2.activity_id
       AND t1.segment_id = t2.segment_id
       AND t1.measured_date < t2.measured_date
     ) AS sample_marginal
WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
  AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
ORDER BY from_date ASC, to_date ASC;
The optimizer cannot do anything about that: filtering before computing row_number() would change the numbers assigned, so the WHERE clause cannot be pushed down into the view.
If you remove the window function call from the view definition (it seems pointless anyway), performance should improve.
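A minimal sketch of the revised view, assuming nothing downstream depends on the id column produced by row_number():

```sql
-- Same join as before, but without ROW_NUMBER() OVER (),
-- so the planner can push WHERE conditions down to the base table
-- and use the sample_uniq index again.
CREATE VIEW sample_marginal AS
SELECT t1.segment_id AS segment_id,
       t1.activity_id AS activity_id,
       t1.measured_date AS from_date,
       t2.measured_date AS to_date,
       t2.value AS cumulative_progress,
       (t2.value - t1.value) AS marginal_progress
FROM sample AS t1
JOIN sample AS t2
  ON t1.activity_id = t2.activity_id
 AND t1.segment_id = t2.segment_id
 AND t1.measured_date < t2.measured_date;
```

Note that CREATE OR REPLACE VIEW cannot drop an existing column, so the old view has to be removed first with DROP VIEW sample_marginal;. With the window function gone, filtering the view on segment_id and activity_id should produce a plan similar to the fast ad-hoc query above.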