計算某個欄位的 40 天移動平均線
我有一個表格,用於儲存有關呼叫中心使用者呼叫的資訊。該表具有 call_id、呼叫日期、呼叫的實際日期和時間、呼叫類型以及與呼叫關聯的分數。
我的要求是計算通話日得分的 40 天移動平均值。40 天應從通話日期的前一天開始。如果過去 40 天內沒有通話,則應包括計算移動平均線的通話日期行。
以下是樣本數據:
select * from test_aes;
輸出:
call_id | call_dt_key | call_type_id | call_dt_tm | aes_raw 1 | 2016-01-01 | CT1 | 2016-01-01 00:00:10-08 | 10 2 | 2016-01-01 | CT1 | 2016-01-01 00:00:20-08 | 20 3 | 2016-01-01 | CT1 | 2016-01-01 00:00:30-08 | 10 4 | 2016-01-01 | CT1 | 2016-01-01 00:00:40-08 | 20 5 | 2016-01-01 | CT1 | 2016-01-01 00:00:50-08 | 10 6 | 2016-01-01 | CT1 | 2016-01-01 00:01:00-08 | 20 7 | 2016-01-01 | CT1 | 2016-01-01 00:02:00-08 | 10 8 | 2016-01-01 | CT1 | 2016-01-01 00:03:00-08 | 20 9 | 2016-01-01 | CT1 | 2016-01-01 00:04:00-08 | 10 10 | 2016-01-01 | CT1 | 2016-01-01 00:05:00-08 | 20 11 | 2016-01-05 | CT1 | 2016-01-05 00:00:10-08 | 10 12 | 2016-01-05 | CT1 | 2016-01-05 00:00:20-08 | 10 13 | 2016-01-05 | CT1 | 2016-01-05 00:00:30-08 | 20 14 | 2016-01-05 | CT1 | 2016-01-05 00:00:40-08 | 20 15 | 2016-01-05 | CT1 | 2016-01-05 00:00:50-08 | 20 16 | 2016-01-10 | CT1 | 2016-01-10 00:00:10-08 | 10 17 | 2016-01-10 | CT1 | 2016-01-10 00:00:20-08 | 20 18 | 2016-01-15 | CT1 | 2016-01-15 00:00:10-08 | 10 19 | 2016-01-15 | CT1 | 2016-01-15 00:00:20-08 | 20 20 | 2016-01-15 | CT1 | 2016-01-15 00:00:30-08 | 20 21 | 2016-01-16 | CT1 | 2016-01-16 00:00:10-08 | 20 22 | 2016-01-16 | CT1 | 2016-01-16 00:00:20-08 | 10 23 | 2016-01-16 | CT1 | 2016-01-16 00:00:30-08 | 20 24 | 2016-01-20 | CT1 | 2016-01-20 00:00:10-08 | 20 25 | 2016-01-20 | CT1 | 2016-01-20 00:00:20-08 | 10 26 | 2016-01-21 | CT1 | 2016-01-21 00:00:10-08 | 10 27 | 2016-01-21 | CT1 | 2016-01-21 00:00:20-08 | 20 28 | 2016-01-31 | CT1 | 2016-01-31 00:00:10-08 | 10 29 | 2016-01-31 | CT1 | 2016-01-31 00:00:20-08 | 20 30 | 2016-02-01 | CT1 | 2016-02-01 00:00:10-08 | 10 31 | 2016-02-01 | CT1 | 2016-02-01 00:00:20-08 | 20 32 | 2016-02-10 | CT1 | 2016-02-10 00:00:10-08 | 10 33 | 2016-02-10 | CT1 | 2016-02-10 00:00:20-08 | 20 34 | 2016-02-15 | CT1 | 2016-02-15 00:00:15-08 | 10 35 | 2016-02-15 | CT1 | 2016-02-15 00:00:20-08 | 20 36 | 2016-02-26 | CT1 | 2016-02-26 00:00:15-08 | 10 37 | 2016-02-26 | CT1 | 2016-02-26 00:00:20-08 | 20 38 | 2016-03-04 | CT1 | 2016-03-04 00:00:15-08 | 10 39 | 2016-03-04 | CT1 | 2016-03-04 00:00:20-08 | 20 40 | 2016-03-18 | CT1 | 2016-03-18 00:00:15-07 | 10 41 | 2016-03-18 | CT1 | 2016-03-18 00:00:20-07 | 20
因此輸出應該是:
call_dt_key | average_40 2016-01-01 | 15.0000 (include rows for 2016-01-01) 2016-01-05 | 15.0000 (don't include rows for 2016-01-05) 2016-01-10 | 15.3333 (don't include rows for 2016-01-10) 2016-01-15 | 15.2941 (don't include rows for 2016-01-15) 2016-01-16 | 15.5000 (don't include rows for 2016-01-16) 2016-01-20 | 15.6522 (don't include rows for 2016-01-20) 2016-01-21 | 15.6000 (don't include rows for 2016-01-21) 2016-01-31 | 15.5556 (don't include rows for 2016-01-31) 2016-02-01 | 15.5172 (don't include rows for 2016-02-01) 2016-02-10 | 15.4839 (start date 2015-12-31 end date 2016-02-09) 2016-02-15 | 15.6522 (start date 2016-01-05 end date 2016-02-14) 2016-02-26 | 15.3333 (start date 2016-01-16 end date 2016-02-25) 2016-03-04 | 15.0000 (start date 2016-01-23 end date 2016-03-03) 2016-03-18 | 15.0000 (start date 2016-02-06 end date 2016-03-17)
以下連結中的架構和測試數據:SQL Fiddle
我不能
ROWS
在AVG
視窗定義中使用,因為test_aes
某一天有數千行。
從問題中並不清楚該
call_type_id
列的作用是什麼。在你澄清之前,我會忽略它。沒有視窗函式
這是一個根本不使用視窗函式的簡單變體。
確保 上有一個索引
(call_dt_key, aes_raw)
。
CTE_Dates
返回表中所有日期的列表併計算每天的平均值。這average_current_day
是第一天需要的。伺服器將以任何方式掃描整個索引,因此計算這樣的平均值很便宜。然後,對於每個不同的日子,我使用自聯接來計算前 40 天的平均值。這將返回第一天,在主查詢
NULL
中替換為。average_current_day
您不必在這裡使用 CTE,它只是使查詢更易於閱讀。
WITH CTE_Dates AS ( SELECT call_dt_key ,call_dt_key - INTERVAL '41 day' AS dt_from ,call_dt_key - INTERVAL '1 day' AS dt_to ,AVG(test_aes.aes_raw) AS average_current_day FROM test_aes GROUP BY call_dt_key ) SELECT CTE_Dates.call_dt_key ,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40 FROM CTE_Dates LEFT JOIN LATERAL ( SELECT AVG(test_aes.aes_raw) AS average_40 FROM test_aes WHERE test_aes.call_dt_key >= CTE_Dates.dt_from AND test_aes.call_dt_key <= CTE_Dates.dt_to ) AS prev40 ON true ORDER BY call_dt_key;
結果
| call_dt_key | average_40 | |----------------------------|--------------------| | January, 01 2016 00:00:00 | 15 | | January, 05 2016 00:00:00 | 15 | | January, 10 2016 00:00:00 | 15.333333333333334 | | January, 15 2016 00:00:00 | 15.294117647058824 | | January, 16 2016 00:00:00 | 15.5 | | January, 20 2016 00:00:00 | 15.652173913043478 | | January, 21 2016 00:00:00 | 15.6 | | January, 31 2016 00:00:00 | 15.555555555555555 | | February, 01 2016 00:00:00 | 15.517241379310345 | | February, 10 2016 00:00:00 | 15.483870967741936 | | February, 15 2016 00:00:00 | 15.652173913043478 | | February, 26 2016 00:00:00 | 15.333333333333334 | | March, 04 2016 00:00:00 | 15 | | March, 18 2016 00:00:00 | 15 |
這是SQL 小提琴。
使用推薦的索引,這個解決方案應該不會太糟糕。
有一個類似的問題,但對於 SQL Server(使用視窗函式的日期範圍滾動總和)。Postgres 似乎支持
RANGE
指定大小的視窗,而 SQL Server 目前不支持。因此,Postgres 的解決方案可能會更簡單一些。關鍵部分是:
AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)
要使用這些視窗函式計算移動平均值,您可能必須首先填補日期中的空白,以便表格每天至少有一行(這些虛擬行中的
NULL
值aes_raw
)。…
正如Erwin Brandstetter在他的回答中正確指出的那樣,目前(從 Postgres 9.5 開始)Postgres 中的
RANGE
子句仍然具有類似於 SQL Server 的限制。文件說:值PRECEDING 和值FOLLOWING 情況目前僅允許在 ROWS 模式下使用。
因此,
RANGE
即使您使用 Postgres 9.5,上述方法也不適合您。使用視窗函式
您可以使用上述 SQL Server 問題中概述的方法。例如,將您的數據分組為每日總和,添加缺失天數的行,計算移動
SUM
和COUNT
使用OVER
withROWS
,然後計算移動平均值。這些方面的東西:
WITH CTE_Dates AS ( SELECT call_dt_key ,SUM(test_aes.aes_raw) AS sum_daily ,COUNT(*) AS cnt_daily ,AVG(test_aes.aes_raw) AS avg_daily ,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date FROM test_aes GROUP BY call_dt_key ) ,CTE_AllDates AS ( SELECT CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt ,avg_daily ,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END) OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) /SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END) OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40 FROM CTE_Dates INNER JOIN LATERAL generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day') AS all_dates(dt) ON true ) SELECT final_dt ,COALESCE(avg_40, avg_daily) AS final_avg FROM CTE_AllDates WHERE final_dt IS NOT NULL ORDER BY final_dt;
結果與第一個變體中的相同。請參閱SQL 小提琴。
同樣,這可以用沒有 CTE 的內聯子查詢來編寫。
值得在真實數據上檢查不同變體的性能。