Postgresql

計算某個欄位的 40 天移動平均線

  • June 10, 2016

我有一個表格,用於儲存有關呼叫中心使用者呼叫的資訊。該表具有 call_id、呼叫日期、呼叫的實際日期和時間、呼叫類型以及與呼叫關聯的分數。

我的要求是計算通話日得分的 40 天移動平均值。40 天應從通話日期的前一天開始。如果過去 40 天內沒有通話,則應包括計算移動平均線的通話日期行。

以下是樣本數據:

select * from test_aes;

輸出:

call_id | call_dt_key | call_type_id |       call_dt_tm       | aes_raw
  1    | 2016-01-01  | CT1          | 2016-01-01 00:00:10-08 |      10
  2    | 2016-01-01  | CT1          | 2016-01-01 00:00:20-08 |      20
  3    | 2016-01-01  | CT1          | 2016-01-01 00:00:30-08 |      10
  4    | 2016-01-01  | CT1          | 2016-01-01 00:00:40-08 |      20
  5    | 2016-01-01  | CT1          | 2016-01-01 00:00:50-08 |      10
  6    | 2016-01-01  | CT1          | 2016-01-01 00:01:00-08 |      20
  7    | 2016-01-01  | CT1          | 2016-01-01 00:02:00-08 |      10
  8    | 2016-01-01  | CT1          | 2016-01-01 00:03:00-08 |      20
  9    | 2016-01-01  | CT1          | 2016-01-01 00:04:00-08 |      10
 10    | 2016-01-01  | CT1          | 2016-01-01 00:05:00-08 |      20
 11    | 2016-01-05  | CT1          | 2016-01-05 00:00:10-08 |      10
 12    | 2016-01-05  | CT1          | 2016-01-05 00:00:20-08 |      10
 13    | 2016-01-05  | CT1          | 2016-01-05 00:00:30-08 |      20
 14    | 2016-01-05  | CT1          | 2016-01-05 00:00:40-08 |      20
 15    | 2016-01-05  | CT1          | 2016-01-05 00:00:50-08 |      20
 16    | 2016-01-10  | CT1          | 2016-01-10 00:00:10-08 |      10
 17    | 2016-01-10  | CT1          | 2016-01-10 00:00:20-08 |      20
 18    | 2016-01-15  | CT1          | 2016-01-15 00:00:10-08 |      10
 19    | 2016-01-15  | CT1          | 2016-01-15 00:00:20-08 |      20
 20    | 2016-01-15  | CT1          | 2016-01-15 00:00:30-08 |      20
 21    | 2016-01-16  | CT1          | 2016-01-16 00:00:10-08 |      20
 22    | 2016-01-16  | CT1          | 2016-01-16 00:00:20-08 |      10
 23    | 2016-01-16  | CT1          | 2016-01-16 00:00:30-08 |      20
 24    | 2016-01-20  | CT1          | 2016-01-20 00:00:10-08 |      20
 25    | 2016-01-20  | CT1          | 2016-01-20 00:00:20-08 |      10
 26    | 2016-01-21  | CT1          | 2016-01-21 00:00:10-08 |      10
 27    | 2016-01-21  | CT1          | 2016-01-21 00:00:20-08 |      20
 28    | 2016-01-31  | CT1          | 2016-01-31 00:00:10-08 |      10
 29    | 2016-01-31  | CT1          | 2016-01-31 00:00:20-08 |      20
 30    | 2016-02-01  | CT1          | 2016-02-01 00:00:10-08 |      10
 31    | 2016-02-01  | CT1          | 2016-02-01 00:00:20-08 |      20
 32    | 2016-02-10  | CT1          | 2016-02-10 00:00:10-08 |      10
 33    | 2016-02-10  | CT1          | 2016-02-10 00:00:20-08 |      20
 34    | 2016-02-15  | CT1          | 2016-02-15 00:00:15-08 |      10
 35    | 2016-02-15  | CT1          | 2016-02-15 00:00:20-08 |      20
 36    | 2016-02-26  | CT1          | 2016-02-26 00:00:15-08 |      10
 37    | 2016-02-26  | CT1          | 2016-02-26 00:00:20-08 |      20
 38    | 2016-03-04  | CT1          | 2016-03-04 00:00:15-08 |      10
 39    | 2016-03-04  | CT1          | 2016-03-04 00:00:20-08 |      20
 40    | 2016-03-18  | CT1          | 2016-03-18 00:00:15-07 |      10
 41    | 2016-03-18  | CT1          | 2016-03-18 00:00:20-07 |      20

因此輸出應該是:

call_dt_key | average_40 
2016-01-01  | 15.0000 (include rows for 2016-01-01)  
2016-01-05  | 15.0000 (don't include rows for 2016-01-05)  
2016-01-10  | 15.3333 (don't include rows for 2016-01-10)  
2016-01-15  | 15.2941 (don't include rows for 2016-01-15)  
2016-01-16  | 15.5000 (don't include rows for 2016-01-16)  
2016-01-20  | 15.6522 (don't include rows for 2016-01-20)  
2016-01-21  | 15.6000 (don't include rows for 2016-01-21)  
2016-01-31  | 15.5556 (don't include rows for 2016-01-31)  
2016-02-01  | 15.5172 (don't include rows for 2016-02-01)  
2016-02-10  | 15.4839 (start date 2015-12-31 end date 2016-02-09)
2016-02-15  | 15.6522 (start date 2016-01-05 end date 2016-02-14)   
2016-02-26  | 15.3333 (start date 2016-01-16 end date 2016-02-25)  
2016-03-04  | 15.0000 (start date 2016-01-23 end date 2016-03-03)  
2016-03-18  | 15.0000 (start date 2016-02-06 end date 2016-03-17)  

以下連結中的架構和測試數據:SQL Fiddle

我不能ROWSAVG視窗定義中使用,因為test_aes某一天有數千行。

從問題中並不清楚該call_type_id列的作用是什麼。在你澄清之前,我會忽略它。

沒有視窗函式

這是一個根本不使用視窗函式的簡單變體。

確保 上有一個索引(call_dt_key, aes_raw)

CTE_Dates返回表中所有日期的列表併計算每天的平均值。這average_current_day是第一天需要的。伺服器將以任何方式掃描整個索引,因此計算這樣的平均值很便宜。

然後,對於每個不同的日子,我使用自聯接來計算前 40 天的平均值。這將返回第一天,在主查詢NULL中替換為。average_current_day

您不必在這裡使用 CTE,它只是使查詢更易於閱讀。

WITH
CTE_Dates
AS
(
   SELECT
       call_dt_key
       ,call_dt_key - INTERVAL '41 day' AS dt_from
       ,call_dt_key - INTERVAL '1 day' AS dt_to
       ,AVG(test_aes.aes_raw) AS average_current_day
   FROM test_aes
   GROUP BY call_dt_key
)
SELECT
   CTE_Dates.call_dt_key
   ,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
   CTE_Dates
   LEFT JOIN LATERAL
   (
       SELECT AVG(test_aes.aes_raw) AS average_40
       FROM test_aes
       WHERE
               test_aes.call_dt_key >= CTE_Dates.dt_from
           AND test_aes.call_dt_key <= CTE_Dates.dt_to
   ) AS prev40 ON true
ORDER BY call_dt_key;

結果

|                call_dt_key |         average_40 |
|----------------------------|--------------------|
|  January, 01 2016 00:00:00 |                 15 |
|  January, 05 2016 00:00:00 |                 15 |
|  January, 10 2016 00:00:00 | 15.333333333333334 |
|  January, 15 2016 00:00:00 | 15.294117647058824 |
|  January, 16 2016 00:00:00 |               15.5 |
|  January, 20 2016 00:00:00 | 15.652173913043478 |
|  January, 21 2016 00:00:00 |               15.6 |
|  January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
|    March, 04 2016 00:00:00 |                 15 |
|    March, 18 2016 00:00:00 |                 15 |

這是SQL 小提琴

使用推薦的索引,這個解決方案應該不會太糟糕。


有一個類似的問題,但對於 SQL Server(使用視窗函式的日期範圍滾動總和)。Postgres 似乎支持RANGE指定大小的視窗,而 SQL Server 目前不支持。因此,Postgres 的解決方案可能會更簡單一些。

關鍵部分是:

AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)

要使用這些視窗函式計算移動平均值,您可能必須首先填補日期中的空白,以便表格每天至少有一行(這些虛擬行中的NULLaes_raw)。

正如Erwin Brandstetter在他的回答中正確指出的那樣,目前(從 Postgres 9.5 開始)Postgres 中的RANGE子句仍然具有類似於 SQL Server 的限制。文件說

PRECEDING 和FOLLOWING 情況目前僅允許在 ROWS 模式下使用。

因此,RANGE即使您使用 Postgres 9.5,上述方法也不適合您。


使用視窗函式

您可以使用上述 SQL Server 問題中概述的方法。例如,將您的數據分組為每日總和,添加缺失天數的行,計算移動SUMCOUNT使用OVERwith ROWS,然後計算移動平均值。

這些方面的東西:

WITH
CTE_Dates
AS
(
   SELECT
       call_dt_key
       ,SUM(test_aes.aes_raw) AS sum_daily
       ,COUNT(*) AS cnt_daily
       ,AVG(test_aes.aes_raw) AS avg_daily
       ,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
   FROM test_aes
   GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
   SELECT
       CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
       ,avg_daily
       ,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END) 
           OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
       /SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END) 
           OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
   FROM
       CTE_Dates
       INNER JOIN LATERAL
           generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day') 
           AS all_dates(dt) ON true
)
SELECT
   final_dt
   ,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;

結果與第一個變體中的相同。請參閱SQL 小提琴

同樣,這可以用沒有 CTE 的內聯子查詢來編寫。

值得在真實數據上檢查不同變體的性能。

引用自:https://dba.stackexchange.com/questions/140358