What is a scalable way to simulate HASHBYTES using a SQL CLR scalar function?
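As part of our ETL process, we compare rows from staging against the reporting database to figure out whether any of the columns have actually changed since the data was last loaded.
The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers when many concurrent worker threads are all calling HASHBYTES.
Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96-core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 to 12. Testing with MAXDOP 1 showed the same scalability bottleneck.
As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements: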
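- The function must be able to participate in parallel queries
- The function must be deterministic
- The function must take an NVARCHAR or VARBINARY string as input (all relevant columns concatenated together)
- The typical input size of the string will be 100 - 20000 characters. 20000 is not a maximum
- The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.
- The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)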
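To put the collision requirement in rough numbers: hashes are only compared between rows that share the same key, so a changed row goes undetected only when its new hash happens to equal its old one. For an idealized b-bit hash that chance is about 2^-b per changed row. The sketch below works this out for a hypothetical load of one billion changed rows (an assumed figure, not taken from the question). Real functions, CHECKSUM in particular, can behave worse than this ideal model.

```sql
-- Expected number of missed changes = changed_rows / 2^bits,
-- assuming an ideal hash whose outputs are uniformly distributed.
DECLARE @changed_rows float = 1e9;  -- assumed workload size, for illustration only

SELECT
    v.hash_size,
    v.bits,
    @changed_rows / POWER(CAST(2 AS float), v.bits) AS expected_missed_changes
FROM (VALUES
    ('32-bit (CHECKSUM-sized)', 32),
    ('128-bit (MD5-sized)', 128),
    ('256-bit (SHA2_256-sized)', 256)
) AS v(hash_size, bits);
```

Under this model a 32-bit value misses a change roughly once in every four billion comparisons, while a 128-bit output leaves the expected number of misses negligibly small.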
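For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It is a CCI, which does not support triggers or computed columns (there are other problems as well that I don't want to get into).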
What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR, so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:

DROP TABLE IF EXISTS #CHANGED_IDS;

SELECT stg.ID INTO #CHANGED_IDS
FROM (
    SELECT ID,
    CAST( HASHBYTES ('SHA2_256',
        CAST(FK1 AS NVARCHAR(19)) +
        CAST(FK2 AS NVARCHAR(19)) +
        CAST(FK3 AS NVARCHAR(19)) +
        CAST(FK4 AS NVARCHAR(19)) +
        CAST(FK5 AS NVARCHAR(19)) +
        CAST(FK6 AS NVARCHAR(19)) +
        CAST(FK7 AS NVARCHAR(19)) +
        CAST(FK8 AS NVARCHAR(19)) +
        CAST(FK9 AS NVARCHAR(19)) +
        CAST(FK10 AS NVARCHAR(19)) +
        CAST(FK11 AS NVARCHAR(19)) +
        CAST(FK12 AS NVARCHAR(19)) +
        CAST(FK13 AS NVARCHAR(19)) +
        CAST(FK14 AS NVARCHAR(19)) +
        CAST(FK15 AS NVARCHAR(19)) +
        CAST(STR1 AS NVARCHAR(500)) +
        CAST(STR2 AS NVARCHAR(500)) +
        CAST(STR3 AS NVARCHAR(500)) +
        CAST(STR4 AS NVARCHAR(500)) +
        CAST(STR5 AS NVARCHAR(500)) +
        CAST(COMP1 AS NVARCHAR(1)) +
        CAST(COMP2 AS NVARCHAR(1)) +
        CAST(COMP3 AS NVARCHAR(1)) +
        CAST(COMP4 AS NVARCHAR(1)) +
        CAST(COMP5 AS NVARCHAR(1))) AS BINARY(32)) HASH1
    FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
    SELECT ID,
    CAST(HASHBYTES ('SHA2_256',
        CAST(FK1 AS NVARCHAR(19)) +
        CAST(FK2 AS NVARCHAR(19)) +
        CAST(FK3 AS NVARCHAR(19)) +
        CAST(FK4 AS NVARCHAR(19)) +
        CAST(FK5 AS NVARCHAR(19)) +
        CAST(FK6 AS NVARCHAR(19)) +
        CAST(FK7 AS NVARCHAR(19)) +
        CAST(FK8 AS NVARCHAR(19)) +
        CAST(FK9 AS NVARCHAR(19)) +
        CAST(FK10 AS NVARCHAR(19)) +
        CAST(FK11 AS NVARCHAR(19)) +
        CAST(FK12 AS NVARCHAR(19)) +
        CAST(FK13 AS NVARCHAR(19)) +
        CAST(FK14 AS NVARCHAR(19)) +
        CAST(FK15 AS NVARCHAR(19)) +
        CAST(STR1 AS NVARCHAR(500)) +
        CAST(STR2 AS NVARCHAR(500)) +
        CAST(STR3 AS NVARCHAR(500)) +
        CAST(STR4 AS NVARCHAR(500)) +
        CAST(STR5 AS NVARCHAR(500)) +
        CAST(COMP1 AS NVARCHAR(1)) +
        CAST(COMP2 AS NVARCHAR(1)) +
        CAST(COMP3 AS NVARCHAR(1)) +
        CAST(COMP4 AS NVARCHAR(1)) +
        CAST(COMP5 AS NVARCHAR(1)) ) AS BINARY(32)) HASH1
    FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);
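One detail the rough query above glosses over: concatenating values with no delimiter lets different rows produce the same input string, and therefore the same hash, whichever hash function is used. A minimal sketch with made-up values:

```sql
-- Two different (FK1, FK2) pairs whose undelimited concatenation is N'123'.
DECLARE @a1 bigint = 1,  @a2 bigint = 23;
DECLARE @b1 bigint = 12, @b2 bigint = 3;

SELECT
    -- No delimiter: both sides hash N'123', so the difference is invisible.
    CASE WHEN HASHBYTES('SHA2_256',
                  CAST(@a1 AS nvarchar(19)) + CAST(@a2 AS nvarchar(19)))
            = HASHBYTES('SHA2_256',
                  CAST(@b1 AS nvarchar(19)) + CAST(@b2 AS nvarchar(19)))
         THEN 'same hash' ELSE 'different hash'
    END AS without_delimiter,
    -- With a delimiter: N'1|23' versus N'12|3' are distinct inputs.
    CASE WHEN HASHBYTES('SHA2_256',
                  CAST(@a1 AS nvarchar(19)) + N'|' + CAST(@a2 AS nvarchar(19)))
            = HASHBYTES('SHA2_256',
                  CAST(@b1 AS nvarchar(19)) + N'|' + CAST(@b2 AS nvarchar(19)))
         THEN 'same hash' ELSE 'different hash'
    END AS with_delimiter;
```

A delimiter removes this ambiguity for numeric columns. String columns that can themselves contain the delimiter character would still need some care.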
To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:

CREATE TABLE dbo.HASH_ME (
    ID BIGINT NOT NULL,
    FK1 BIGINT NOT NULL,
    FK2 BIGINT NOT NULL,
    FK3 BIGINT NOT NULL,
    FK4 BIGINT NOT NULL,
    FK5 BIGINT NOT NULL,
    FK6 BIGINT NOT NULL,
    FK7 BIGINT NOT NULL,
    FK8 BIGINT NOT NULL,
    FK9 BIGINT NOT NULL,
    FK10 BIGINT NOT NULL,
    FK11 BIGINT NOT NULL,
    FK12 BIGINT NOT NULL,
    FK13 BIGINT NOT NULL,
    FK14 BIGINT NOT NULL,
    FK15 BIGINT NOT NULL,
    STR1 NVARCHAR(500) NOT NULL,
    STR2 NVARCHAR(500) NOT NULL,
    STR3 NVARCHAR(500) NOT NULL,
    STR4 NVARCHAR(500) NOT NULL,
    STR5 NVARCHAR(2000) NOT NULL,
    COMP1 TINYINT NOT NULL,
    COMP2 TINYINT NOT NULL,
    COMP3 TINYINT NOT NULL,
    COMP4 TINYINT NOT NULL,
    COMP5 TINYINT NOT NULL
);

INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
    RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
    RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
    RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
    REPLICATE(CHAR(65 + RN % 10 ), 30)
    ,REPLICATE(CHAR(65 + RN % 10 ), 30)
    ,REPLICATE(CHAR(65 + RN % 10 ), 30)
    ,REPLICATE(CHAR(65 + RN % 10 ), 30)
    ,REPLICATE(CHAR(65 + RN % 10 ), 1000),
    0,1,0,1,0
FROM (
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

SELECT MAX(HASHBYTES('SHA2_256',
    CAST(N'' AS NVARCHAR(MAX)) + N'|' +
    CAST(FK1 AS NVARCHAR(19)) + N'|' +
    CAST(FK2 AS NVARCHAR(19)) + N'|' +
    CAST(FK3 AS NVARCHAR(19)) + N'|' +
    CAST(FK4 AS NVARCHAR(19)) + N'|' +
    CAST(FK5 AS NVARCHAR(19)) + N'|' +
    CAST(FK6 AS NVARCHAR(19)) + N'|' +
    CAST(FK7 AS NVARCHAR(19)) + N'|' +
    CAST(FK8 AS NVARCHAR(19)) + N'|' +
    CAST(FK9 AS NVARCHAR(19)) + N'|' +
    CAST(FK10 AS NVARCHAR(19)) + N'|' +
    CAST(FK11 AS NVARCHAR(19)) + N'|' +
    CAST(FK12 AS NVARCHAR(19)) + N'|' +
    CAST(FK13 AS NVARCHAR(19)) + N'|' +
    CAST(FK14 AS NVARCHAR(19)) + N'|' +
    CAST(FK15 AS NVARCHAR(19)) + N'|' +
    CAST(STR1 AS NVARCHAR(500)) + N'|' +
    CAST(STR2 AS NVARCHAR(500)) + N'|' +
    CAST(STR3 AS NVARCHAR(500)) + N'|' +
    CAST(STR4 AS NVARCHAR(500)) + N'|' +
    CAST(STR5 AS NVARCHAR(2000)) + N'|' +
    CAST(COMP1 AS NVARCHAR(1)) + N'|' +
    CAST(COMP2 AS NVARCHAR(1)) + N'|' +
    CAST(COMP3 AS NVARCHAR(1)) + N'|' +
    CAST(COMP4 AS NVARCHAR(1)) + N'|' +
    CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);
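Since the goal is stated as hashes per second, a small timing wrapper around a single-threaded query may be a useful starting point. This is only a sketch: the row generator and the single hashed expression are placeholders rather than the benchmark table above, and a scalability test would run several copies of this concurrently from separate sessions.

```sql
-- Time one MAXDOP 1 pass and report hashes per second.
DECLARE @start datetime2(7) = SYSUTCDATETIME();
DECLARE @rows bigint;
DECLARE @max_hash varbinary(32);  -- taking MAX forces every hash to be computed

SELECT
    @rows = COUNT_BIG(*),
    @max_hash = MAX(HASHBYTES('SHA2_256',
        CAST(q.RN AS nvarchar(19)) + N'|' + REPLICATE(N'A', 1000)))
FROM (
    SELECT TOP (100000)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
    FROM master..spt_values AS t1
    CROSS JOIN master..spt_values AS t2
) AS q
OPTION (MAXDOP 1);

DECLARE @elapsed_ms int = DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME());

SELECT
    @rows AS rows_hashed,
    @elapsed_ms AS elapsed_ms,
    -- NULLIF guards against a divide-by-zero on a very fast run
    @rows * 1000.0 / NULLIF(@elapsed_ms, 0) AS hashes_per_second;
```

Comparing hashes_per_second per session as the number of concurrent sessions grows is one way to see whether throughput per thread holds up.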
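Since you're just looking for changes, you don't need a cryptographic hash function.
You could choose one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, which is licensed under the permissive, OSI-approved MIT license.
SpookyHash is a popular choice.
Example implementation
Source code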
using Microsoft.SqlServer.Server;
using System.Data.HashFunction.SpookyHash;
using System.Data.SqlTypes;

public partial class UserDefinedFunctions
{
    // Non-LOB version: input limited to 8000 bytes
    [SqlFunction
        (
            DataAccess = DataAccessKind.None,
            SystemDataAccess = SystemDataAccessKind.None,
            IsDeterministic = true,
            IsPrecise = true
        )
    ]
    public static byte[] SpookyHash
        (
            [SqlFacet (MaxSize = 8000)]
            SqlBinary Input
        )
    {
        ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
        return sh.ComputeHash(Input.Value).Hash;
    }

    // LOB version: accepts varbinary(max)
    [SqlFunction
        (
            DataAccess = DataAccessKind.None,
            IsDeterministic = true,
            IsPrecise = true,
            SystemDataAccess = SystemDataAccessKind.None
        )
    ]
    public static byte[] SpookyHashLOB
        (
            [SqlFacet (MaxSize = -1)]
            SqlBinary Input
        )
    {
        ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
        return sh.ComputeHash(Input.Value).Hash;
    }
}
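The source provides two functions: one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be faster.
You might be able to wrap a LOB binary in COMPRESS to get it under the 8000-byte limit, if that turns out to be worthwhile for performance. Alternatively, you could break the LOB up into segments of under 8000 bytes, or simply reserve HASHBYTES for the LOB case (since longer inputs scale better).
Pre-built code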
You can obviously grab the package yourself and compile everything, but I built the assemblies below to make quick testing easier:
https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300
T-SQL functions
CREATE FUNCTION dbo.SpookyHash
(
    @Input varbinary(8000)
)
RETURNS binary(16)
WITH
    RETURNS NULL ON NULL INPUT,
    EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
GO
CREATE FUNCTION dbo.SpookyHashLOB
(
    @Input varbinary(max)
)
RETURNS binary(16)
WITH
    RETURNS NULL ON NULL INPUT,
    EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GO
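With both functions in place, a query could route each input by length so that only genuinely large values pay for the LOB version. The following is a sketch rather than tested code. It assumes dbo.SpookyHash and dbo.SpookyHashLOB have been created as above in the current database, and the sample values are made up:

```sql
-- Pick the non-LOB function whenever the input fits in 8000 bytes.
SELECT
    v.label,
    DATALENGTH(v.payload) AS input_bytes,
    CASE
        WHEN DATALENGTH(v.payload) <= 8000
            THEN dbo.SpookyHash(CONVERT(varbinary(8000), v.payload))
        ELSE dbo.SpookyHashLOB(v.payload)
    END AS hash16
FROM (VALUES
    -- 100 nvarchar characters = 200 bytes
    (N'short', CONVERT(varbinary(max), REPLICATE(N'A', 100))),
    -- 20000 nvarchar characters = 40000 bytes
    (N'long',  CONVERT(varbinary(max),
                   REPLICATE(CONVERT(nvarchar(max), N'A'), 20000)))
) AS v(label, payload);
```

Both functions run the same hash over the same bytes, so a given value should hash identically whichever branch it takes. It is worth measuring whether the CASE expression itself costs more than it saves.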
Usage
An example use given the sample data in the question:
SELECT
    HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
    ON HT2.ID = HT1.ID
    AND dbo.SpookyHash
    (
        CONVERT(binary(8), HT2.FK1) + 0x7C +
        CONVERT(binary(8), HT2.FK2) + 0x7C +
        CONVERT(binary(8), HT2.FK3) + 0x7C +
        CONVERT(binary(8), HT2.FK4) + 0x7C +
        CONVERT(binary(8), HT2.FK5) + 0x7C +
        CONVERT(binary(8), HT2.FK6) + 0x7C +
        CONVERT(binary(8), HT2.FK7) + 0x7C +
        CONVERT(binary(8), HT2.FK8) + 0x7C +
        CONVERT(binary(8), HT2.FK9) + 0x7C +
        CONVERT(binary(8), HT2.FK10) + 0x7C +
        CONVERT(binary(8), HT2.FK11) + 0x7C +
        CONVERT(binary(8), HT2.FK12) + 0x7C +
        CONVERT(binary(8), HT2.FK13) + 0x7C +
        CONVERT(binary(8), HT2.FK14) + 0x7C +
        CONVERT(binary(8), HT2.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
        CONVERT(binary(1), HT2.COMP1) + 0x7C +
        CONVERT(binary(1), HT2.COMP2) + 0x7C +
        CONVERT(binary(1), HT2.COMP3) + 0x7C +
        CONVERT(binary(1), HT2.COMP4) + 0x7C +
        CONVERT(binary(1), HT2.COMP5)
    )
    <> dbo.SpookyHash
    (
        CONVERT(binary(8), HT1.FK1) + 0x7C +
        CONVERT(binary(8), HT1.FK2) + 0x7C +
        CONVERT(binary(8), HT1.FK3) + 0x7C +
        CONVERT(binary(8), HT1.FK4) + 0x7C +
        CONVERT(binary(8), HT1.FK5) + 0x7C +
        CONVERT(binary(8), HT1.FK6) + 0x7C +
        CONVERT(binary(8), HT1.FK7) + 0x7C +
        CONVERT(binary(8), HT1.FK8) + 0x7C +
        CONVERT(binary(8), HT1.FK9) + 0x7C +
        CONVERT(binary(8), HT1.FK10) + 0x7C +
        CONVERT(binary(8), HT1.FK11) + 0x7C +
        CONVERT(binary(8), HT1.FK12) + 0x7C +
        CONVERT(binary(8), HT1.FK13) + 0x7C +
        CONVERT(binary(8), HT1.FK14) + 0x7C +
        CONVERT(binary(8), HT1.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
        CONVERT(binary(1), HT1.COMP1) + 0x7C +
        CONVERT(binary(1), HT1.COMP2) + 0x7C +
        CONVERT(binary(1), HT1.COMP3) + 0x7C +
        CONVERT(binary(1), HT1.COMP4) + 0x7C +
        CONVERT(binary(1), HT1.COMP5)
    );
When using the LOB version, the first parameter should be cast or converted to varbinary(max).
[Execution plan image]
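Safe Spooky
The Data.HashFunction library uses a number of CLR language features that are considered UNSAFE by SQL Server. It is possible to write a basic Spooky Hash compatible with SAFE status. An example I wrote based on Jon Hanna's SpookilySharp is below: https://gist.github.com/SQLKiwi/7a5bb26b0bee56f6d28a1d26669ce8f2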
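For context on where the UNSAFE and SAFE distinction shows up in practice, here is a sketch of registering an assembly and checking its permission set. The file path is hypothetical, and the assembly name Spooky is taken from the EXTERNAL NAME clauses above. On SQL Server 2017 and later, the 'clr strict security' option treats every assembly as UNSAFE for authorization purposes, so the assembly may additionally need to be signed or trusted whichever permission set is requested.

```sql
-- CLR integration must be enabled on the instance first.
EXECUTE sys.sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Hypothetical path: point this at wherever the compiled DLL lives.
-- Use PERMISSION_SET = UNSAFE for the Data.HashFunction build,
-- or SAFE for an implementation written to be SAFE-compatible.
CREATE ASSEMBLY Spooky
FROM N'C:\Temp\Spooky.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Confirm how the assembly was registered.
SELECT name, permission_set_desc
FROM sys.assemblies
WHERE name = N'Spooky';
```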