Postgresql

如何優化5000萬行表的自連接視圖

  • April 28, 2022

我有以下帶有約 50M 行的簡化表。

table sample (
   id uuid not null primary key,
   measured_date timestamp with time zone not null,
   segment_id uuid not null,
   activity_id uuid not null,
   value integer not null
);
Indexes:
   "sample_pkey" PRIMARY KEY, btree (id)
   "sample_idx" btree (segment_id, measured_date)
   "sample_uniq" UNIQUE CONSTRAINT, btree (segment_id, activity_id, measured_date)
   "sample_activity_idx" btree (activity_id)

我想為每個測量的差距(兩個不同日期之間)獲得我的計算值。

我的查詢如下:

SELECT ROW_NUMBER () OVER () AS id,
       t1.segment_id AS segment_id,
       t1.activity_id AS activity_id,
       t1.measured_date AS from_date,
       t2.measured_date AS to_date,
       t2.value AS cumulative_progress,
       (t2.value - t1.value) AS marginal_progress,
   FROM sample AS t1 JOIN sample AS t2
   ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id and t1.measured_date < t2.measured_date
   WHERE t1.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53' AND t1.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09' ORDER BY from_date asc, to_date asc;

結果很快

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sort  (cost=17.23..17.23 rows=1 width=154) (actual time=1.011..1.028 rows=171 loops=1)                                                                                  |
|  Sort Key: t1.measured_date, t2.measured_date                                                                                                                          |
|  Sort Method: quicksort  Memory: 70kB                                                                                                                                  |
|  ->  WindowAgg  (cost=1.13..17.22 rows=1 width=154) (actual time=0.063..0.914 rows=171 loops=1)                                                                        |
|        ->  Nested Loop  (cost=1.13..17.18 rows=1 width=124) (actual time=0.056..0.698 rows=171 loops=1)                                                                |
|              Join Filter: (t1.measured_date < t2.measured_date)                                                                                                        |
|              Rows Removed by Join Filter: 190                                                                                                                          |
|              ->  Index Scan using sample_uniq on sample t1  (cost=0.56..8.58 rows=1 width=70) (actual time=0.021..0.043 rows=19 loops=1)                               |
|                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))          |
|              ->  Index Scan using sample_uniq on sample t2  (cost=0.56..8.58 rows=1 width=86) (actual time=0.005..0.030 rows=19 loops=19)                              |
|                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))          |
|Planning Time: 0.321 ms                                                                                                                                                 |
|Execution Time: 1.097 ms                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

但是,當我創建視圖以使用此查詢時,結果通常很差。

創建視圖:

CREATE OR REPLACE VIEW sample_marginal AS
       SELECT ROW_NUMBER () OVER () AS id,
       t1.segment_id AS segment_id,
       t1.activity_id AS activity_id,
       t1.measured_date AS from_date,
       t2.measured_date AS to_date,
       t2.value AS cumulative_progress,
       (t2.value - t1.value) AS marginal_progress,
   FROM sample AS t1 JOIN sample AS t2
   ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id and t1.measured_date < t2.measured_date;

查詢視圖:

SELECT * FROM sample_marginal WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53' AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09' ORDER BY from_date asc, to_date asc;

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Subquery Scan on sample_marginal  (cost=14106816.44..23333603.42 rows=2 width=154)                                                                                        |
|  Filter: ((sample_marginal.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (sample_marginal.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))  |
|  ->  WindowAgg  (cost=14106816.44..22564147.32 rows=51297073 width=154)                                                                                                  |
|        ->  Gather  (cost=14106816.44..20768749.77 rows=51297073 width=124)                                                                                               |
|              Workers Planned: 2                                                                                                                                          |
|              ->  Merge Join  (cost=14105816.44..15638042.47 rows=21373780 width=124)                                                                                     |
|                    Merge Cond: ((t2.activity_id = t1.activity_id) AND (t2.segment_id = t1.segment_id))                                                                   |
|                    Join Filter: (t1.updated_by_date < t2.updated_by_date)                                                                                                |
|                    ->  Sort  (cost=4751689.83..4797948.93 rows=18503642 width=86)                                                                                        |
|                          Sort Key: t2.activity_id, t2.segment_id                                                                                                         |
|                          ->  Parallel Seq Scan on sample t2  (cost=0.00..1632749.42 rows=18503642 width=86)                                                              |
|                    ->  Materialize  (cost=9354126.62..9576170.32 rows=44408740 width=70)                                                                                 |
|                          ->  Sort  (cost=9354126.62..9465148.47 rows=44408740 width=70)                                                                                  |
|                                Sort Key: t1.activity_id, t1.segment_id                                                                                                   |
|                                ->  Seq Scan on sample t1  (cost=0.00..1891800.40 rows=44408740 width=70)                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

實際上我從來沒有等到這個查詢完成,因為它花了很長時間,我想在我可以使用它之前修復它。

但是,執行 count(*) 會產生 1.8B 行(當然執行速度很慢)。

關於如何改善我的觀點的任何想法?

SQL 語句的各個部分按特定順序(至少在邏輯上)進行處理。在手頭的情況下,WHERE條件是在視窗函式之前評估的row_number()。您的視圖定義和視圖上的查詢通過有效地引入子查詢來反轉該順序:

SELECT *
FROM (SELECT ROW_NUMBER () OVER () AS id,
            t1.segment_id AS segment_id,
            t1.activity_id AS activity_id,
            t1.measured_date AS from_date,
            t2.measured_date AS to_date,
            t2.value AS cumulative_progress,
            (t2.value - t1.value) AS marginal_progress,
     FROM sample AS t1
        JOIN sample AS t2
           ON t1.activity_id = t2.activity_id
              AND t1.segment_id = t2.segment_id
              and t1.measured_date < t2.measured_date
    ) AS sample_marginal
WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
 AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
ORDER BY from_date asc, to_date asc;

優化器對此無能為力。

如果您從視圖定義中刪除視窗函式呼叫(無論如何這似乎毫無意義),性能應該會提高。

引用自:https://dba.stackexchange.com/questions/311414