PostgreSQL
Creating a 7-day rolling average for each feature in PostGIS
I have a time-series dataset that looks like this:
uid | geom     | date       | count
----+----------+------------+------
  1 | FeatureA | 2016-02-01 |     1
  2 | FeatureA | 2016-02-02 |     2
  3 | FeatureA | 2016-02-03 |     3
  4 | FeatureA | 2016-02-04 |     4
  5 | FeatureA | 2016-02-05 |     5
  6 | FeatureA | 2016-02-06 |     9
  7 | FeatureA | 2016-02-07 |    11
  8 | FeatureA | 2016-02-08 |    15
  9 | FeatureA | 2016-02-09 |    17
 10 | FeatureA | 2016-02-10 |    20
 11 | FeatureB | 2016-02-01 |     2
 12 | FeatureB | 2016-02-02 |     2
 13 | FeatureB | 2016-02-03 |     8
 14 | FeatureB | 2016-02-04 |     4
 15 | FeatureB | 2016-02-05 |     5
 16 | FeatureB | 2016-02-06 |    15
 17 | FeatureB | 2016-02-07 |    11
 18 | FeatureB | 2016-02-08 |    15
 19 | FeatureB | 2016-02-09 |    19
 20 | FeatureB | 2016-02-10 |    25
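I want to calculate a 7-day rolling average for each feature in the dataset (about 2,000 features). In Postgres, a 7-day rolling average can be computed with a window function, as described here.
The following code almost works: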
    SELECT geom, date, count,
           AVG(count) OVER (PARTITION BY geom
                            ORDER BY geom, date
                            ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING) AS rolling_avg_count
    FROM features;
This gives the following output:
geom     | date       | count | rolling_avg_count
---------+------------+-------+------------------
FeatureA | 2016-02-01 |     1 | 6.25
FeatureA | 2016-02-02 |     2 | 8.25
FeatureA | 2016-02-03 |     3 | 10.5
FeatureA | 2016-02-04 |     4 | 11.57
FeatureA | 2016-02-05 |     5 | 12.83
FeatureA | 2016-02-06 |     9 | 14.4
FeatureA | 2016-02-07 |    11 | 15.75
FeatureA | 2016-02-08 |    15 | 17.33
FeatureA | 2016-02-09 |    17 | 18.5
FeatureA | 2016-02-10 |    20 | 20
FeatureB | 2016-02-01 |     2 | 7.75
FeatureB | 2016-02-02 |     2 | 9.875
FeatureB | 2016-02-03 |     8 | 12.75
FeatureB | 2016-02-04 |     4 | 13.43
FeatureB | 2016-02-05 |     5 | 15
FeatureB | 2016-02-06 |    15 | 17
FeatureB | 2016-02-07 |    11 | 17.5
FeatureB | 2016-02-08 |    15 | 19.67
FeatureB | 2016-02-10 |    25 | 25
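However, the output keeps computing averages all the way to the end of each partition. For example, the rolling average for uid 10 is 20 (computed from a single record). I would like the calculation to stop when fewer than 7 rows follow. Ideally, the output would look like this: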
geom     | date       | count | rolling_avg_count
---------+------------+-------+------------------
FeatureA | 2016-02-01 |     1 | 6.25
FeatureA | 2016-02-02 |     2 | 8.25
FeatureA | 2016-02-03 |     3 | 10.5
FeatureA | 2016-02-04 |     4 | 11.57
FeatureA | 2016-02-05 |     5 |
FeatureA | 2016-02-06 |     9 |
FeatureA | 2016-02-07 |    11 |
FeatureA | 2016-02-08 |    15 |
FeatureA | 2016-02-09 |    17 |
FeatureA | 2016-02-10 |    20 |
FeatureB | 2016-02-01 |     2 | 7.75
FeatureB | 2016-02-02 |     2 | 9.875
FeatureB | 2016-02-03 |     8 | 12.75
FeatureB | 2016-02-04 |     4 | 13.43
FeatureB | 2016-02-05 |     5 |
FeatureB | 2016-02-06 |    15 |
FeatureB | 2016-02-07 |    11 |
FeatureB | 2016-02-08 |    15 |
FeatureB | 2016-02-10 |    25 |
This produces the desired result, after fixing an off-by-one error ②:
    SELECT geom, date, count
         , CASE WHEN rn < 7 THEN NULL  -- ②
                ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
    FROM  (
       SELECT geom, date, count
            , AVG(count) OVER (PARTITION BY geom ORDER BY date  -- ①
                               ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING  -- ②
                              ) AS rolling_avg_count
            , row_number() OVER (PARTITION BY geom ORDER BY date DESC) AS rn
       FROM   features
       ) sub;
db<>fiddle here
① With PARTITION BY geom, there is no need for geom in ORDER BY.
② You had an off-by-one error:
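"I want to calculate a 7-day rolling average"
But you show the calculation of an 8-day rolling average (1 + 7): the current row plus the 7 rows that follow it.
"I would like the calculation to stop when fewer than 7 rows follow."
But you still display results for rows with only 6 following rows (the values 11.57 and 13.43).
I made it an actual 7-day average that stops when fewer than 6 rows follow.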
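To make the off-by-one concrete, here is a minimal Python sketch (not part of the SQL solution) that mimics the two window frames on the FeatureA counts from the question. The helper name `forward_avg` and its parameters are made up for illustration. With 7 following rows it reproduces the values in the question's output (6.25, 8.25, 10.5, 11.57, ...); with 6 following rows and a full-window requirement it gives the true 7-day averages.

```python
# Counts for FeatureA, ordered by date (2016-02-01 .. 2016-02-10), taken from the question.
counts = [1, 2, 3, 4, 5, 9, 11, 15, 17, 20]


def forward_avg(values, following, require_full=False):
    """Mimic AVG(x) OVER (ORDER BY date ROWS BETWEEN CURRENT ROW AND <following> FOLLOWING).

    With require_full=True, rows whose frame is shorter than following + 1
    rows yield None, like the CASE ... THEN NULL guard in the SQL.
    """
    result = []
    for i in range(len(values)):
        window = values[i : i + following + 1]  # current row plus up to `following` rows
        if require_full and len(window) < following + 1:
            result.append(None)
        else:
            result.append(round(sum(window) / len(window), 2))
    return result


eight_row = forward_avg(counts, following=7)                     # the question's frame: 1 + 7 rows
seven_row = forward_avg(counts, following=6, require_full=True)  # the corrected frame: 1 + 6 rows

print(eight_row)
print(seven_row)
```

Note that the first three 7-day values differ from the ones in the question's "ideal" table, because those were computed over 8 rows.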
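Performance optimization
Because of the descending sort order in the second window function, the query above adds an extra sort step to the query plan.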
This alternative query avoids that with an additional count():

    SELECT geom, date, count
         , CASE WHEN rn < 6 THEN NULL  -- ③
                ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
    FROM  (
       SELECT geom, date, count
            , avg(count) OVER (PARTITION BY geom ORDER BY date
                               ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING
                              ) AS rolling_avg_count
            , count(*) OVER (PARTITION BY geom)
            - row_number() OVER (PARTITION BY geom ORDER BY date) AS rn
       FROM   features
       ) sub;
db<>fiddle here
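It is a bit more verbose and adds one more window function, but I expect it to perform better.
③ Adjusted for the now 0-based row number. (Cheaper than adding + 1.)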
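As a rough cross-check that the two formulations of rn agree, the sketch below runs both queries against the sample data and compares the results. It uses Python's sqlite3 module as a stand-in for PostgreSQL, which is an assumption: it needs an SQLite build with window functions (3.25 or later), and it says nothing about PostgreSQL query plans or performance. The `TEMPLATE` string and the helper `run` are illustrative names, not from the original answer.

```python
import sqlite3

# Sample counts from the question, one value per day starting 2016-02-01.
counts = {
    "FeatureA": [1, 2, 3, 4, 5, 9, 11, 15, 17, 20],
    "FeatureB": [2, 2, 8, 4, 5, 15, 11, 15, 19, 25],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute("CREATE TABLE features (geom TEXT, date TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [
        (geom, f"2016-02-{day:02d}", value)
        for geom, values in counts.items()
        for day, value in enumerate(values, start=1)
    ],
)

# Same shape as the queries above; only the rn expression and threshold vary.
TEMPLATE = """
SELECT geom, date, count,
       CASE WHEN rn < {threshold} THEN NULL
            ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
FROM (
    SELECT geom, date, count,
           avg(count) OVER (PARTITION BY geom ORDER BY date
                            ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS rolling_avg_count,
           {rn_expr} AS rn
    FROM features
) sub
ORDER BY geom, date
"""


def run(rn_expr, threshold):
    """Execute the template with the given rn definition and NULL threshold."""
    return conn.execute(TEMPLATE.format(rn_expr=rn_expr, threshold=threshold)).fetchall()


# Variant 1: 1-based row number counted from the end of the partition.
desc_result = run("row_number() OVER (PARTITION BY geom ORDER BY date DESC)", 7)

# Variant 2: 0-based count of following rows (partition size minus ascending row number).
count_result = run(
    "count(*) OVER (PARTITION BY geom)"
    " - row_number() OVER (PARTITION BY geom ORDER BY date)",
    6,
)

print(desc_result == count_result)
```

Both variants should blank out the last six rows of each feature, since only rows with at least 6 following rows have a full 7-row frame.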