PostgreSQL

Creating a 7-day rolling average for each feature in PostGIS

  • May 22, 2020

I have a time-series dataset that looks like this:

uid | geom     | date       | count |
-------------------------------------
1   | FeatureA | 2016-02-01 | 1     | 
2   | FeatureA | 2016-02-02 | 2     | 
3   | FeatureA | 2016-02-03 | 3     | 
4   | FeatureA | 2016-02-04 | 4     | 
5   | FeatureA | 2016-02-05 | 5     | 
6   | FeatureA | 2016-02-06 | 9     | 
7   | FeatureA | 2016-02-07 | 11    | 
8   | FeatureA | 2016-02-08 | 15    | 
9   | FeatureA | 2016-02-09 | 17    | 
10  | FeatureA | 2016-02-10 | 20    | 
11  | FeatureB | 2016-02-01 | 2     | 
12  | FeatureB | 2016-02-02 | 2     | 
13  | FeatureB | 2016-02-03 | 8     | 
14  | FeatureB | 2016-02-04 | 4     | 
15  | FeatureB | 2016-02-05 | 5     | 
16  | FeatureB | 2016-02-06 | 15    | 
17  | FeatureB | 2016-02-07 | 11    | 
18  | FeatureB | 2016-02-08 | 15    | 
19  | FeatureB | 2016-02-09 | 19    | 
20  | FeatureB | 2016-02-10 | 25    | 

I want to compute a 7-day rolling average for each feature in the dataset (about 2000 features). In Postgres, a 7-day rolling average can be computed with window functions.
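For anyone who wants to reproduce this, the sample above could be loaded roughly as follows. This is only a sketch: the column types are assumptions, and geom is a plain text label standing in for a real PostGIS geometry column.

```sql
-- Minimal reproduction sketch. Table and column names follow the sample
-- above; the column types are assumptions, and geom is a text placeholder
-- for what would really be a PostGIS geometry column.
CREATE TABLE features (
    uid   integer PRIMARY KEY,
    geom  text    NOT NULL,
    date  date    NOT NULL,
    count integer NOT NULL
);

INSERT INTO features (uid, geom, date, count) VALUES
    ( 1, 'FeatureA', '2016-02-01',  1),
    ( 2, 'FeatureA', '2016-02-02',  2),
    ( 3, 'FeatureA', '2016-02-03',  3),
    ( 4, 'FeatureA', '2016-02-04',  4),
    ( 5, 'FeatureA', '2016-02-05',  5),
    ( 6, 'FeatureA', '2016-02-06',  9),
    ( 7, 'FeatureA', '2016-02-07', 11),
    ( 8, 'FeatureA', '2016-02-08', 15),
    ( 9, 'FeatureA', '2016-02-09', 17),
    (10, 'FeatureA', '2016-02-10', 20),
    (11, 'FeatureB', '2016-02-01',  2),
    (12, 'FeatureB', '2016-02-02',  2),
    (13, 'FeatureB', '2016-02-03',  8),
    (14, 'FeatureB', '2016-02-04',  4),
    (15, 'FeatureB', '2016-02-05',  5),
    (16, 'FeatureB', '2016-02-06', 15),
    (17, 'FeatureB', '2016-02-07', 11),
    (18, 'FeatureB', '2016-02-08', 15),
    (19, 'FeatureB', '2016-02-09', 19),
    (20, 'FeatureB', '2016-02-10', 25);
```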

The following code almost works:

SELECT geom,date,count,  
      AVG(count)
           OVER(PARTITION BY geom ORDER BY geom, date ROWS BETWEEN CURRENT ROW AND 7 Following) AS rolling_avg_count
FROM features;

This gives the following output:

geom     | date       | count | rolling_avg_count
--------------------------------------------------
FeatureA | 2016-02-01 | 1     | 6.25
FeatureA | 2016-02-02 | 2     | 8.25
FeatureA | 2016-02-03 | 3     | 10.5
FeatureA | 2016-02-04 | 4     | 11.57
FeatureA | 2016-02-05 | 5     | 12.83
FeatureA | 2016-02-06 | 9     | 14.4
FeatureA | 2016-02-07 | 11    | 15.75
FeatureA | 2016-02-08 | 15    | 17.33
FeatureA | 2016-02-09 | 17    | 18.5
FeatureA | 2016-02-10 | 20    | 20
FeatureB | 2016-02-01 | 2     | 7.75
FeatureB | 2016-02-02 | 2     | 9.875
FeatureB | 2016-02-03 | 8     | 12.75
FeatureB | 2016-02-04 | 4     | 13.43
FeatureB | 2016-02-05 | 5     | 15
FeatureB | 2016-02-06 | 15    | 17
FeatureB | 2016-02-07 | 11    | 17.5
FeatureB | 2016-02-08 | 15    | 19.67
FeatureB | 2016-02-09 | 19    | 22
FeatureB | 2016-02-10 | 25    | 25

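However, the output keeps computing the average right up to the end of the partition. For example, the rolling average for uid 10 is 20 (computed from a single record). I want the calculation to stop when fewer than 7 rows follow.

Ideally, the output would look like this: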

geom     | date       | count | rolling_avg_count
--------------------------------------------------
FeatureA | 2016-02-01 | 1     | 6.25
FeatureA | 2016-02-02 | 2     | 8.25
FeatureA | 2016-02-03 | 3     | 10.5
FeatureA | 2016-02-04 | 4     | 11.57
FeatureA | 2016-02-05 | 5     | 
FeatureA | 2016-02-06 | 9     | 
FeatureA | 2016-02-07 | 11    | 
FeatureA | 2016-02-08 | 15    | 
FeatureA | 2016-02-09 | 17    | 
FeatureA | 2016-02-10 | 20    | 
FeatureB | 2016-02-01 | 2     | 7.75
FeatureB | 2016-02-02 | 2     | 9.875
FeatureB | 2016-02-03 | 8     | 12.75
FeatureB | 2016-02-04 | 4     | 13.43
FeatureB | 2016-02-05 | 5     | 
FeatureB | 2016-02-06 | 15    | 
FeatureB | 2016-02-07 | 11    | 
FeatureB | 2016-02-08 | 15    | 
FeatureB | 2016-02-09 | 19    | 
FeatureB | 2016-02-10 | 25    | 

This produces your desired result, after fixing the off-by-one errors ②:

SELECT geom, date, count
    , CASE WHEN rn < 7 THEN NULL  -- ②
           ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
FROM  (
  SELECT geom, date, count
       , AVG(count) OVER (PARTITION BY geom
                          ORDER BY date -- ①
                          ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING -- ②
                         ) AS rolling_avg_count
       , row_number() OVER (PARTITION BY geom ORDER BY date DESC) AS rn
  FROM   features
  ) sub;


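① With PARTITION BY geom, there is no need to repeat geom in ORDER BY.

② You had off-by-one errors:

"I want to compute a 7-day rolling average"

But you show the computation of an 8-day rolling average (the current row plus 7 following rows).

"I want the calculation to stop when fewer than 7 rows follow."

But you show results for rows with only 6 following rows (the values 11.57 and 13.43).

I made it an actual 7-day average that stops when fewer than 6 rows follow.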
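To see the frame size directly, a sketch along these lines may help. It counts the rows in each frame with count(*) over the same window and only reports an average when the frame holds a full 7 rows. The aliases rows_in_frame and w are illustrative names, not from the question, and this is an alternative formulation rather than the query above.

```sql
-- Sketch: inspect how many rows each frame really contains.
-- "features" and its columns come from the question; the names
-- rows_in_frame and w are illustrative only.
SELECT geom, date, count
     , count(*) OVER w AS rows_in_frame       -- 7 while the frame is full, then 6, 5, ... 1
     , CASE WHEN count(*) OVER w = 7          -- keep the average only for a full 7-row frame
            THEN round(avg(count) OVER w, 2)
       END AS rolling_avg_count               -- NULL otherwise
FROM   features
WINDOW w AS (PARTITION BY geom
             ORDER BY date
             ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING)
ORDER  BY geom, date;
```

With the sample data, the frame for FeatureA on 2016-02-01 covers the counts 1, 2, 3, 4, 5, 9, 11, so the average is 35 / 7 = 5.00 rather than the 6.25 of the 8-row frame. Note that a ROWS frame counts rows, not calendar days, so this only equals a 7-day average if every feature has exactly one row per day with no gaps, as in the sample.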

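Performance optimization

Because of the descending sort order in the second window function, the query above adds an extra sort step to the query plan.

This alternative query avoids that with an additional count():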

SELECT geom, date, count
    , CASE WHEN rn < 6 THEN NULL -- ③
           ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
FROM  (
  SELECT geom, date, count
       , avg(count)   OVER (PARTITION BY geom
                            ORDER BY date
                            ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING
                           ) AS rolling_avg_count
       , count(*)     OVER (PARTITION BY geom)
       - row_number() OVER (PARTITION BY geom ORDER BY date) AS rn
  FROM   features
  ) sub;


This is slightly more verbose, since it adds another window function. But I expect it to perform better.

③ Adjusted for a 0-based row number. (Cheaper than adding + 1.)
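To make the relationship between the two row counters concrete, they can be put side by side, as in the sketch below. The aliases rn_desc and rn_zero are illustrative names only.

```sql
-- Sketch: compare the two ways of counting rows from the end of each partition.
-- rn_desc and rn_zero are illustrative aliases, not part of the queries above.
SELECT geom, date
     , row_number() OVER (PARTITION BY geom ORDER BY date DESC) AS rn_desc  -- 1-based: the last row gets 1
     , count(*)     OVER (PARTITION BY geom)
     - row_number() OVER (PARTITION BY geom ORDER BY date)      AS rn_zero  -- 0-based: number of following rows
FROM   features
ORDER  BY geom, date;
```

rn_desc is always rn_zero + 1, which is why the threshold moves from rn < 7 in the first query to rn < 6 in the second. Prefixing each full query with EXPLAIN is one way to check whether the extra sort step actually disappears on a given data set.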

Source: https://dba.stackexchange.com/questions/267610