數據倉庫和數據湖更新的最佳索引策略
我們有一個 sql server 數據庫,用作數據湖和數據倉庫。數據庫中的每個表都有一些標準化的定義,因為我們現在有 600 個左右的表,因此維護需要在某種程度上自動化。
載入每個表的一般過程是首先將表的副本載入到 changeLog 模式中的雜湊表中(如果我們可以確定更改的記錄是什麼,有時只載入更改的記錄),然後將 changeLog 表與目標表進行比較桌子。目標表用於報告,因此這種 changeLog 方法允許我們持久化目標表並僅應用最少的 UPDATE/INSERT 操作。
每個目標表都有一個唯一鍵/業務鍵,可通過配置表辨識,並具有在每個表中命名相同的標準化審計列。審計專欄告訴我們
- 當記錄被添加到數據倉庫時
- 上次在數據倉庫中更新記錄的時間
- 記錄是否已在源中刪除
- 使用 HASHBYTES(‘SHA2_256’, CONCAT()) 更改的記錄標識符
更改後的記錄標識符曾經是 CHECKSUM(),但我們發現 CHECKSUM() 的衝突率太高而無法信任。我剛剛在每個表中添加了一個 HASHBYTES() 列並填充了它。
我將 HASHBYTES() 列創建為 VARBINARY(MAX)。現在,每次載入表時,我們可以通過將在 changeLog 表中計算的新 HASHBYTES() 值與目標表中的持久值進行比較來判斷是否需要更新記錄。
我立即註意到從 INT CHECKSUM() 切換到 VARBINARY(MAX) HASHBYTES() 導致更新檢查過程顯著減慢。我在每個 CHECKSUM 列上都有 NONCLUSTERED 索引,但在我剛剛添加的 HASHBYTES 列上沒有。每個表的唯一鍵也有聚集索引。
- 添加以檢查更新的理想索引是什麼?
- 我可以為每個表添加一個標準化索引嗎?
- VARBINARY(MAX) 是正確的數據類型還是可以安全地將其縮小到更小的大小?
希望這足以讓這個問題變得有意義。我需要盡快加快這個過程。
編輯:我正在添加一個大型 SQL 腳本作為範例,它具有表的 changeLog 版本和目標版本的範例表定義,以及為更新目標版本而執行的查詢。
--OBJECT DEFINITIONS CREATE TABLE [changeLog].[DimSalesOffice]( [Sales Office Code] [varchar](3) NOT NULL, [CCN_Key] [uniqueidentifier] NOT NULL, [Sales Office Name (Short)] [nvarchar](14) NOT NULL, [Sales Office Name (Long)] [nvarchar](50) NOT NULL, [Sales Office City] [nvarchar](50) NOT NULL, [Sales Office StateProvince] [nvarchar](50) NOT NULL, [Sales Office Postal Code] [nvarchar](20) NOT NULL, [Sales Office Country Code] [varchar](3) NOT NULL, [Sales Office Address] [nvarchar](65) NOT NULL, [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL, [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL, [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL, [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL, [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL, [Sales Office City (Native Language)] [nvarchar](100) NOT NULL, [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL, [Sales Office Region] [nvarchar](100) NULL, [Native Language Code] [varchar](2) NOT NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[DimSalesOffice]( [SalesOffice_Key] [uniqueidentifier] NOT NULL, [Sales Office Code] [varchar](3) NOT NULL, [CCN_Key] [uniqueidentifier] NOT NULL, [Sales Office Name (Short)] [nvarchar](14) NOT NULL, [Sales Office Name (Long)] [nvarchar](50) NOT NULL, [Sales Office City] [nvarchar](50) NOT NULL, [Sales Office StateProvince] [nvarchar](50) NOT NULL, [Sales Office Postal Code] [nvarchar](20) NOT NULL, [Sales Office Country Code] [varchar](3) NOT NULL, [Sales Office Address] [nvarchar](65) NOT NULL, [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL, [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL, [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL, [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL, [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL, [Sales Office City (Native Language)] [nvarchar](100) NOT NULL, [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL, [Sales Office Region] [nvarchar](100) NULL, [Native Language Code] [varchar](2) NOT NULL, [DW_CreatedOn] [datetime2](7) NULL, [DW_ModifiedOn] [datetime2](7) NULL, [DW_IsDeleted?] [bit] NULL, [DW_Checksum] [int] NULL, [Source_ModifiedOn] [datetime2](7) NULL, [DW_Hashbytes] [varbinary](max) NULL, PRIMARY KEY NONCLUSTERED ( [SalesOffice_Key] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY] GO ALTER TABLE [dbo].[DimSalesOffice] ADD DEFAULT (newsequentialid()) FOR [SalesOffice_Key] GO CREATE UNIQUE CLUSTERED INDEX [IX_UK_DimSalesOffice] ON [dbo].[DimSalesOffice] ( [Sales Office Code] ASC ) GO --MERGE QUERY DECLARE @InsertRecordCount INT, @UpdateRecordCount INT; /*****UPDATE*****/ UPDATE [dbo].[DimSalesOffice] SET [CCN_Key] = [Source].[CCN_Key], [Sales Office Name (Short)] = [Source].[Sales Office Name (Short)], [Sales Office Name (Long)] = [Source].[Sales Office Name (Long)], [Sales Office City] = [Source].[Sales Office City], [Sales Office StateProvince] = [Source].[Sales Office StateProvince], [Sales Office Postal Code] = [Source].[Sales Office Postal Code], [Sales Office Country Code] = [Source].[Sales Office Country Code], [Sales Office Address] = [Source].[Sales Office Address], [Sales Office Address (Line 2)] = [Source].[Sales Office Address (Line 2)], [Sales Office Name (Short - Native Language)] = [Source].[Sales Office Name (Short - Native Language)], [Sales Office Name (Long - Native Language)] = [Source].[Sales Office Name (Long - Native Language)], [Sales Office Address (Native Language)] = [Source].[Sales Office Address (Native Language)], [Sales Office Address (Line 2 - Native Language)] = [Source].[Sales Office Address (Line 2 - Native Language)], [Sales Office City (Native Language)] = [Source].[Sales Office City (Native Language)], [Sales Office StateProvince (Native Language)] = [Source].[Sales Office StateProvince (Native Language)], [Sales Office Region] = [Source].[Sales Office Region], [Native Language Code] = [Source].[Native Language Code], [DW_Checksum] = CHECKSUM( [Source].[CCN_Key], [Source].[Sales Office Name (Short)], [Source].[Sales Office Name (Long)], [Source].[Sales Office City], [Source].[Sales Office StateProvince], [Source].[Sales Office Postal Code], [Source].[Sales Office Country Code], [Source].[Sales Office Address], [Source].[Sales Office Address (Line 2)], [Source].[Sales Office Name (Short - Native Language)], [Source].[Sales Office Name (Long - Native Language)], [Source].[Sales Office Address (Native Language)], [Source].[Sales Office Address (Line 2 - Native Language)], [Source].[Sales Office City (Native Language)], [Source].[Sales Office StateProvince (Native Language)], [Source].[Sales Office Region], [Source].[Native Language Code], 0 ), [DW_Hashbytes] = HASHBYTES( 'SHA2_256', ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|' + '0' ), [Source_ModifiedOn] = NULL, [DW_ModifiedOn] = GETUTCDATE(), [DW_IsDeleted?] = 0 FROM [changeLog].[DimSalesOffice] [Source] JOIN [dbo].[DimSalesOffice] ON [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code] AND ISNULL([DimSalesOffice].[DW_Hashbytes], HASHBYTES('SHA2_256', '')) <> HASHBYTES( 'SHA2_256', ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|' + '0' ) SET @UpdateRecordCount = @@ROWCOUNT; /*****Soft Deletes*****/ UPDATE [dbo].[DimSalesOffice] SET [DW_Checksum] = 0, [DW_Hashbytes] = HASHBYTES( 'SHA2_256', ISNULL(CAST([DimSalesOffice].[CCN_Key] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office City] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([DimSalesOffice].[Native Language Code] AS NVARCHAR(MAX)), '') + '|' + '1' ), [Source_ModifiedOn] = NULL, [DW_ModifiedOn] = GETUTCDATE(), [DW_IsDeleted?] = 1 FROM [dbo].[DimSalesOffice] WHERE NOT EXISTS ( SELECT 1 FROM [changeLog].[DimSalesOffice] [Source] WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code] ) SET @UpdateRecordCount = @UpdateRecordCount + @@ROWCOUNT; /*****INSERT*****/ INSERT INTO [dbo].[DimSalesOffice] ( [Sales Office Code], [CCN_Key], [Sales Office Name (Short)], [Sales Office Name (Long)], [Sales Office City], [Sales Office StateProvince], [Sales Office Postal Code], [Sales Office Country Code], [Sales Office Address], [Sales Office Address (Line 2)], [Sales Office Name (Short - Native Language)], [Sales Office Name (Long - Native Language)], [Sales Office Address (Native Language)], [Sales Office Address (Line 2 - Native Language)], [Sales Office City (Native Language)], [Sales Office StateProvince (Native Language)], [Sales Office Region], [Native Language Code], [DW_Checksum], [DW_Hashbytes], [Source_ModifiedOn], [DW_ModifiedOn], [DW_IsDeleted?], [DW_CreatedOn] ) SELECT [Source].[Sales Office Code], [Source].[CCN_Key], [Source].[Sales Office Name (Short)], [Source].[Sales Office Name (Long)], [Source].[Sales Office City], [Source].[Sales Office StateProvince], [Source].[Sales Office Postal Code], [Source].[Sales Office Country Code], [Source].[Sales Office Address], [Source].[Sales Office Address (Line 2)], [Source].[Sales Office Name (Short - Native Language)], [Source].[Sales Office Name (Long - Native Language)], [Source].[Sales Office Address (Native Language)], [Source].[Sales Office Address (Line 2 - Native Language)], [Source].[Sales Office City (Native Language)], [Source].[Sales Office StateProvince (Native Language)], [Source].[Sales Office Region], [Source].[Native Language Code], [DW_Checksum] = CHECKSUM( [Source].[CCN_Key], [Source].[Sales Office Name (Short)], [Source].[Sales Office Name (Long)], [Source].[Sales Office City], [Source].[Sales Office StateProvince], [Source].[Sales Office Postal Code], [Source].[Sales Office Country Code], [Source].[Sales Office Address], [Source].[Sales Office Address (Line 2)], [Source].[Sales Office Name (Short - Native Language)], [Source].[Sales Office Name (Long - Native Language)], [Source].[Sales Office Address (Native Language)], [Source].[Sales Office Address (Line 2 - Native Language)], [Source].[Sales Office City (Native Language)], [Source].[Sales Office StateProvince (Native Language)], [Source].[Sales Office Region], [Source].[Native Language Code], 0 ), [DW_Hashbytes] = HASHBYTES( 'SHA2_256', ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|' + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|' + '0' ), [Source_ModifiedOn] = NULL, [DW_ModifiedOn] = GETUTCDATE(), [DW_IsDeleted?] = 0, [DW_CreatedOn] = GETUTCDATE() FROM [changeLog].[DimSalesOffice] [Source] WHERE NOT EXISTS ( SELECT 1 FROM [dbo].[DimSalesOffice] WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code] ) SET @InsertRecordCount = @@ROWCOUNT; SELECT [Update Record Count] = @UpdateRecordCount, [Insert Record Count] = @InsertRecordCount;
反過來回答,因為它更具語言意義:
- VARBINARY(MAX) 是正確的數據類型還是可以安全地將其縮小到更小的大小?
正如我在評論中提到的,
SHA2_256 means
它將輸出散列為 256 位,也就是 32 字節(1 字節中 8 位),這意味著VARBINARY
您需要的最大大小是VARBINARY(32)
. 這在文件中提到HASBYTES()
:輸出符合算法標準:MD2、MD4、MD5為128位(16字節);SHA 和 SHA1 為 160 位(20 字節);SHA2_256 為 256 位(32 字節),SHA2_512 為512 位(64 字節)。
所以,是的,你可以安全地將你的尺寸
VARBINARY
縮小到VARBINARY(32)
.
- 添加以檢查更新的理想索引是什麼?
- 我可以為每個表添加一個標準化索引嗎?
是的,在減少
VARBINARY
欄位的大小後,可以將其添加到表的索引中。我會推荐一個非聚集索引,它以主鍵欄位開頭,然後HASHBYTES()
在定義中將計算欄位排在第二位。使其成為非聚集索引的原因是因為您通常希望避免索引中的熱列,尤其是與每個非聚集索引一起儲存的聚集索引。頻繁更新的熱列會導致對索引的大量寫入,並且在聚集索引的情況下,這些寫入也必鬚髮生在每個非聚集索引上(因為聚集索引與它一起儲存)。所有列上的行雜湊肯定會經常更改。
以主鍵欄位開頭是有意義的,因為您需要先按該欄位連接以匹配相同的行,然後按
HASHBYTES()
計算欄位連接以檢查它們是否不同。您將如何改進雜湊字節計算?我正在考慮索引一個計算列。
是的,您可以使用計算列(甚至不需要持久化但可以)並對其進行索引,或者您也可以創建索引視圖,我之前都做過。我會首先為計算列拍攝,因為它比索引視圖更靈活一點,將實際行雜湊耦合到表本身的行中,並且管理的對象更少。