使用視窗函式的日期範圍滾動總和

December 30, 2016

我需要計算一個日期範圍內的滾動總和。為了說明，使用AdventureWorks 範例數據庫，以下假設語法將完全滿足我的需要：
SELECT
   TH.ProductID,
   TH.TransactionDate,
   TH.ActualCost,
   RollingSum45 = SUM(TH.ActualCost) OVER (
       PARTITION BY TH.ProductID
       ORDER BY TH.TransactionDate
       RANGE BETWEEN 
           INTERVAL 45 DAY PRECEDING
           AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
   TH.ProductID,
   TH.TransactionDate,
   TH.ReferenceOrderID;
遺憾的是，RANGE窗框範圍目前不允許 SQL Server 中的間隔。
我知道我可以使用子查詢和正常（非視窗）聚合來編寫解決方案：
SELECT 
   TH.ProductID,
   TH.TransactionDate,
   TH.ActualCost,
   RollingSum45 =
   (
       SELECT SUM(TH2.ActualCost)
       FROM Production.TransactionHistory AS TH2
       WHERE
           TH2.ProductID = TH.ProductID
           AND TH2.TransactionDate &lt;= TH.TransactionDate
           AND TH2.TransactionDate &gt;= DATEADD(DAY, -45, TH.TransactionDate)
   )
FROM Production.TransactionHistory AS TH
ORDER BY
   TH.ProductID,
   TH.TransactionDate,
   TH.ReferenceOrderID;
給定以下索引：
CREATE UNIQUE INDEX i
ON Production.TransactionHistory
   (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE
   (ActualCost);
執行計劃是：
雖然不是非常低效，但似乎應該可以僅使用 SQL Server 2012、2014 或 2016（到目前為止）支持的視窗聚合和分析函式來表達此查詢。
為清楚起見，我正在尋找一種對數據執行單次傳遞的解決方案。
在 T-SQL 中，這很可能意味著子句將OVER完成工作，而執行計劃將包含 Window Spools 和 Window Aggregates。所有使用該OVER子句的語言元素都是公平的遊戲。SQLCLR 解決方案是可以接受的，只要它保證產生正確的結果。
對於 T-SQL 解決方案，執行計劃中的雜湊、排序和視窗假離線/聚合越少越好。隨意添加索引，但不允許使用單獨的結構（例如，沒有預先計算的表與觸發器保持同步）。允許參考表（數字表、日期表等）
理想情況下，解決方案將以與上述子查詢版本相同的順序產生完全相同的結果，但任何可以說是正確的也是可以接受的。性能始終是一個考慮因素，因此解決方案至少應該是相當有效的。
**專用聊天室：**我創建了一個公共聊天室，用於與此問題及其答案相關的討論。任何擁有至少 20 個聲望點的使用者都可以直接參與。如果您的代表少於 20 並且想參加，請在下面的評論中聯繫我。
關於“使用視窗函式的日期範圍滾動求和”的討論

好問題，保羅！我使用了幾種不同的方法，一種在 T-SQL 中，一種在 CLR 中。
T-SQL 快速總結
T-SQL 方法可以概括為以下步驟：
取產品/日期的叉積
合併觀察到的銷售數據
將該數據聚合到產品/日期級別
根據此匯總數據計算過去 45 天的滾動總和（其中包含填寫的任何“缺失”天數）
將這些結果過濾到僅具有一個或多個銷售的產品/日期配對
使用SET STATISTICS IO ON，這種方法會報告Table 'TransactionHistory'. Scan count 1, logical reads 484，它確認了桌子上的“單次通過”。供參考，原始循環搜尋查詢報告Table 'TransactionHistory'. Scan count 113444, logical reads 438366。
據報導SET STATISTICS TIME ON，CPU時間是514ms。2231ms這與原始查詢相比是有利的。
CLR 快速總結
CLR 總結可以概括為以下步驟：
將數據讀入記憶體，按產品和日期排序
在處理每筆交易時，將總成本添加到執行總成本中。每當一個交易與前一個交易是不同的產品時，將執行總計重置為 0。
維護指向與目前事務具有相同（產品、日期）的第一個事務的指針。每當遇到與該（產品，日期）的最後一筆交易時，計算該交易的滾動總和並將其應用於具有相同（產品，日期）的所有交易
將所有結果返回給使用者！
使用SET STATISTICS IO ON，此方法報告未發生邏輯 I/O！哇，完美的解決方案！（實際上，似乎SET STATISTICS IO並沒有報告 CLR 內部發生的 I/O。但是從程式碼中可以很容易地看出，對錶進行了一次掃描，並按照 Paul 建議的索引按順序檢索數據。
據報導SET STATISTICS TIME ON，CPU時間是現在187ms。所以這是對 T-SQL 方法的相當大的改進。不幸的是，這兩種方法的總經過時間非常相似，大約為半秒。但是，基於 CLR 的方法確實必須向控制台輸出 113K 行（而按產品/日期分組的 T-SQL 方法只有 52K），所以這就是我關注 CPU 時間的原因。
這種方法的另一大優點是它產生與原始循環/搜尋方法完全相同的結果，即使在同一天多次銷售產品的情況下，每筆交易也包括一行。（在 AdventureWorks 上，我專門比較了逐行的結果，並確認它們與 Paul 的原始查詢相符。）
這種方法的一個缺點，至少在目前的形式中，是它讀取記憶體中的所有數據。然而，已經設計的算法在任何給定時間只嚴格需要記憶體中的目前視窗幀，並且可以更新以適用於超出記憶體的數據集。Paul 在他的回答中通過生成該算法的實現來說明這一點，該算法僅將滑動視窗儲存在記憶體中。這是以授予 CLR 程序集更高的權限為代價的，但絕對值得將此解決方案擴展到任意大的數據集。
T-SQL - 一次掃描，按日期分組
最初設定
USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d &lt; '1/1/2021')
BEGIN
   INSERT INTO dbo.calendar (d) VALUES (@d)
   SELECT @d =  DATEADD(DAY, 1, @d)
END
GO
查詢
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
       @maxAnalysisDate DATE = '2008-09-03'  -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
   SELECT ProductID, TransactionDate, NumOrders, ActualCost,
       SUM(ActualCost) OVER (
               PARTITION BY ProductId ORDER BY TransactionDate 
               ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
           ) AS RollingSum45
   FROM (
       -- The full cross-product of products and dates, combined with actual cost information for that product/date
       SELECT p.ProductID, c.d AS TransactionDate,
           COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
       FROM Production.Product p
       JOIN dbo.calendar c
           ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
       LEFT OUTER JOIN Production.TransactionHistory TH
           ON TH.ProductId = p.productId
           AND TH.TransactionDate = c.d
       GROUP BY P.ProductID, c.d
   ) aggsByDay
) rollingSums
WHERE NumOrders &gt; 0
ORDER BY ProductID, TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1)
執行計劃
從執行計劃中，我們看到 Paul 提出的原始索引足以讓我們執行單次有序掃描Production.TransactionHistory，使用合併連接將交易歷史與每個可能的產品/日期組合結合起來。
假設
這種方法中有一些重要的假設。我想這將由保羅來決定它們是否可以接受:)
我正在使用Production.Product桌子。該表在上免費提供，AdventureWorks2012並且該關係由來自的外鍵強制執行Production.TransactionHistory，因此我將其解釋為公平遊戲。
這種方法依賴於事務沒有時間組件的事實AdventureWorks2012；如果他們這樣做了，那麼如果不首先通過交易歷史記錄，就無法生成完整的產品/日期組合。
我正在生成一個行集，每個產品/日期對只包含一行。我認為這是“可以說是正確的”，並且在許多情況下是返回更理想的結果。對於每個產品/日期，我添加了一個NumOrders列來指示發生了多少銷售。如果產品在同一日期多次銷售（例如319/ 2007-09-05 00:00:00.000），請參閱以下螢幕截圖，以比較原始查詢與建議查詢的結果
CLR - 一次掃描，完整的未分組結果集
主要功能體
這裡沒什麼可看的；函式主體聲明輸入（必須匹配相應的 SQL 函式），建立 SQL 連接，並打開 SQLReader。
// SQL CLR function for rolling SUMs on AdventureWorks2012.Production.TransactionHistory
[SqlFunction(DataAccess = DataAccessKind.Read,
   FillRowMethodName = "RollingSum_Fill",
   TableDefinition = "ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT," +
                     "ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT")]
public static IEnumerable RollingSumTvf(SqlInt32 rollingPeriodDays) {
   using (var connection = new SqlConnection("context connection=true;")) {
       connection.Open();
       List&lt;TrxnRollingSum&gt; trxns;
       using (var cmd = connection.CreateCommand()) {
           //Read the transaction history (note: the order is important!)
           cmd.CommandText = @"SELECT ProductId, TransactionDate, ReferenceOrderID,
                                   CAST(ActualCost AS FLOAT) AS ActualCost 
                               FROM Production.TransactionHistory 
                               ORDER BY ProductId, TransactionDate";
           using (var reader = cmd.ExecuteReader()) {
               trxns = ComputeRollingSums(reader, rollingPeriodDays.Value);
           }
       }

       return trxns;
   }
}
核心邏輯
我已經分離出主要邏輯，以便更容易關注：
// Given a SqlReader with transaction history data, computes / returns the rolling sums
private static List&lt;TrxnRollingSum&gt; ComputeRollingSums(SqlDataReader reader,
                                                       int rollingPeriodDays) {
   var startIndexOfRollingPeriod = 0;
   var rollingSumIndex = 0;
   var trxns = new List&lt;TrxnRollingSum&gt;();

   // Prior to the loop, initialize "next" to be the first transaction
   var nextTrxn = GetNextTrxn(reader, null);
   while (nextTrxn != null)
   {
       var currTrxn = nextTrxn;
       nextTrxn = GetNextTrxn(reader, currTrxn);
       trxns.Add(currTrxn);

       // If the next transaction is not the same product/date as the current
       // transaction, we can finalize the rolling sum for the current transaction
       // and all previous transactions for the same product/date
       var finalizeRollingSum = nextTrxn == null || (nextTrxn != null &&
                               (currTrxn.ProductId != nextTrxn.ProductId ||
                               currTrxn.TransactionDate != nextTrxn.TransactionDate));
       if (finalizeRollingSum)
       {
           // Advance the pointer to the first transaction (for the same product)
           // that occurs within the rolling period
           while (startIndexOfRollingPeriod &lt; trxns.Count
               && trxns[startIndexOfRollingPeriod].TransactionDate &lt;
                   currTrxn.TransactionDate.AddDays(-1 * rollingPeriodDays))
           {
               startIndexOfRollingPeriod++;
           }

           // Compute the rolling sum as the cumulative sum (for this product),
           // minus the cumulative sum for prior to the beginning of the rolling window
           var sumPriorToWindow = trxns[startIndexOfRollingPeriod].PrevSum;
           var rollingSum = currTrxn.ActualCost + currTrxn.PrevSum - sumPriorToWindow;
           // Fill in the rolling sum for all transactions sharing this product/date
           while (rollingSumIndex &lt; trxns.Count)
           {
               trxns[rollingSumIndex++].RollingSum = rollingSum;
           }
       }

       // If this is the last transaction for this product, reset the rolling period
       if (nextTrxn != null && currTrxn.ProductId != nextTrxn.ProductId)
       {
           startIndexOfRollingPeriod = trxns.Count;
       }
   }

   return trxns;
}
幫手
以下邏輯可以內聯編寫，但將它們拆分為自己的方法時更容易閱讀。
private static TrxnRollingSum GetNextTrxn(SqlDataReader r, TrxnRollingSum currTrxn) {
   TrxnRollingSum nextTrxn = null;
   if (r.Read()) {
       nextTrxn = new TrxnRollingSum {
           ProductId = r.GetInt32(0),
           TransactionDate = r.GetDateTime(1),
           ReferenceOrderId = r.GetInt32(2),
           ActualCost = r.GetDouble(3),
           PrevSum = 0 };
       if (currTrxn != null) {
           nextTrxn.PrevSum = (nextTrxn.ProductId == currTrxn.ProductId)
                   ? currTrxn.PrevSum + currTrxn.ActualCost : 0;
       }
   }
   return nextTrxn;
}

// Represents the output to be returned
// Note that the ReferenceOrderId/PrevSum fields are for debugging only
private class TrxnRollingSum {
   public int ProductId { get; set; }
   public DateTime TransactionDate { get; set; }
   public int ReferenceOrderId { get; set; }
   public double ActualCost { get; set; }
   public double PrevSum { get; set; }
   public double RollingSum { get; set; }
}

// The function that generates the result data for each row
// (Such a function is mandatory for SQL CLR table-valued functions)
public static void RollingSum_Fill(object trxnWithRollingSumObj,
                                   out int productId,
                                   out DateTime transactionDate, 
                                   out int referenceOrderId, out double actualCost,
                                   out double prevCumulativeSum,
                                   out double rollingSum) {
   var trxn = (TrxnRollingSum)trxnWithRollingSumObj;
   productId = trxn.ProductId;
   transactionDate = trxn.TransactionDate;
   referenceOrderId = trxn.ReferenceOrderId;
   actualCost = trxn.ActualCost;
   prevCumulativeSum = trxn.PrevSum;
   rollingSum = trxn.RollingSum;
}
在 SQL 中將所有內容捆綁在一起
到目前為止，一切都在 C# 中，所以讓我們看看實際涉及的 SQL。（或者，您可以使用此部署腳本直接從我的程序集的位創建程序集，而不是自己編譯。）
USE AdventureWorks2012; /* GPATTERSON2\SQL2014DEVELOPER */
GO

-- Enable CLR
EXEC sp_configure 'clr enabled', 1;
GO
RECONFIGURE;
GO

-- Create the assembly based on the dll generated by compiling the CLR project
-- I've also included the "assembly bits" version that can be run without compiling
CREATE ASSEMBLY ClrPlayground
-- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version
FROM 'C:\FullPathGoesHere\ClrPlayground\bin\Debug\ClrPlayground.dll'
WITH PERMISSION_SET = safe;
GO

--Create a function from the assembly
CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT)
RETURNS TABLE ( ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT,
               ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT)
-- The function yields rows in order, so let SQL Server know to avoid an extra sort
ORDER (ProductID, TransactionDate, ReferenceOrderID)
AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf;
GO

-- Now we can actually use the TVF!
SELECT * 
FROM dbo.RollingSumTvf(45) 
ORDER BY ProductId, TransactionDate, ReferenceOrderId
GO
注意事項
CLR 方法為優化算法提供了更大的靈活性，並且可能由 C# 專家進一步調整。然而，CLR 策略也有缺點。要記住幾件事：
這種 CLR 方法將數據集的副本保存在記憶體中。可以使用流式方法，但我遇到了最初的困難，發現存在一個突出的 Connect 問題，抱怨 SQL 2008+ 中的更改使使用這種方法變得更加困難。TRUSTWORTHY這仍然是可能的（正如 Paul 所展示的），但需要通過將數據庫設置為並授予EXTERNAL_ACCESSCLR 程序集來獲得更高級別的權限。因此存在一些麻煩和潛在的安全隱患，但回報是一種流式方法，它可以更好地擴展到比 AdventureWorks 上的數據集更大的數據集。
某些 DBA 可能不太容易訪問 CLR，這使得這樣的功能更像是一個黑盒子，不那麼透明，不那麼容易修改，不那麼容易部署，也許也不那麼容易調試。與 T-SQL 方法相比，這是一個很大的缺點。
獎勵：T-SQL #2 - 我實際使用的實用方法
在嘗試創造性地思考這個問題一段時間後，我想我也會發布一個相當簡單實用的方法，如果它出現在我的日常工作中，我可能會選擇解決這個問題。它確實利用了 SQL 2012+ 視窗功能，但不是以問題所希望的開創性方式：
-- Compute all running costs into a #temp table; Note that this query could simply read
-- from Production.TransactionHistory, but a CROSS APPLY by product allows the window 
-- function to be computed independently per product, supporting a parallel query plan
SELECT t.*
INTO #runningCosts
FROM Production.Product p
CROSS APPLY (
   SELECT t.ProductId, t.TransactionDate, t.ReferenceOrderId, t.ActualCost,
       -- Running sum of the cost for this product, including all ties on TransactionDate
       SUM(t.ActualCost) OVER (
           ORDER BY t.TransactionDate 
           RANGE UNBOUNDED PRECEDING) AS RunningCost
   FROM Production.TransactionHistory t
   WHERE t.ProductId = p.ProductId
) t
GO

-- Key the table in our output order
ALTER TABLE #runningCosts
ADD PRIMARY KEY (ProductId, TransactionDate, ReferenceOrderId)
GO

SELECT r.ProductId, r.TransactionDate, r.ReferenceOrderId, r.ActualCost,
   -- Cumulative running cost - running cost prior to the sliding window
   r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45
FROM #runningCosts r
OUTER APPLY (
   -- For each transaction, find the running cost just before the sliding window begins
   SELECT TOP 1 b.RunningCost
   FROM #runningCosts b
   WHERE b.ProductId = r.ProductId
       AND b.TransactionDate &lt; DATEADD(DAY, -45, r.TransactionDate)
   ORDER BY b.TransactionDate DESC
) w
ORDER BY r.ProductId, r.TransactionDate, r.ReferenceOrderId
GO
這實際上產生了一個相當簡單的整體查詢計劃，即使同時查看兩個相關的查詢計劃：

我喜歡這種方法的幾個原因：
它產生問題陳述中請求的完整結果集（與大多數其他 T-SQL 解決方案相反，後者返回結果的分組版本）。
易於解釋、理解和調試；一年後我不會回來，想知道我怎麼能在不破壞正確性或性能的情況下做一個小改動
它在900ms提供的數據集上執行，而不是2700ms原始循環搜尋的
如果數據更密集（每天有更多事務），則計算複雜度不會隨著滑動視窗中的事務數二次增長（就像原始查詢一樣）；我認為這解決了 Paul 想要避免多次掃描的部分擔憂
由於新的 tempdb 延遲寫入功能，它在 SQL 2012+ 的最近更新中基本上沒有 tempdb I/O
對於非常大的數據集，如果記憶體壓力成為一個問題，將每個產品的工作分成單獨的批次是微不足道的
幾個潛在的警告：
雖然它在技術上只掃描一次 Production.TransactionHistory，但它並不是真正的“一次掃描”方法，因為 #temp 表大小相似，並且還需要在該表上執行額外的邏輯 I/O。但是，我認為這與我們可以手動控制的工作台並沒有太大區別，因為我們已經定義了它的精確結構
根據您的環境，可以將 tempdb 的使用視為正面（例如，它位於一組單獨的 SSD 驅動器上）或負面（伺服器上的高並發性，已經有很多 tempdb 爭用）