Sql-Server

數據倉庫和數據湖更新的最佳索引策略

  • June 17, 2022

我們有一個 sql server 數據庫,用作數據湖和數據倉庫。數據庫中的每個表都有一些標準化的定義,因為我們現在有 600 個左右的表,因此維護需要在某種程度上自動化。

載入每個表的一般過程是首先將表的副本載入到 changeLog 模式中的雜湊表中(如果我們可以確定更改的記錄是什麼,有時只載入更改的記錄),然後將 changeLog 表與目標表進行比較桌子。目標表用於報告,因此這種 changeLog 方法允許我們持久化目標表並僅應用最少的 UPDATE/INSERT 操作。

每個目標表都有一個唯一鍵/業務鍵,可通過配置表辨識,並具有在每個表中命名相同的標準化審計列。審計專欄告訴我們

  1. 當記錄被添加到數據倉庫時
  2. 上次在數據倉庫中更新記錄的時間
  3. 記錄是否已在源中刪除
  4. 使用 HASHBYTES(‘SHA2_256’, CONCAT()) 更改的記錄標識符

更改後的記錄標識符曾經是 CHECKSUM(),但我們發現 CHECKSUM() 的衝突率太高而無法信任。我剛剛在每個表中添加了一個 HASHBYTES() 列並填充了它。

我將 HASHBYTES() 列創建為 VARBINARY(MAX)。現在,每次載入表時,我們可以通過將在 changeLog 表中計算的新 HASHBYTES() 值與目標表中的持久值進行比較來判斷是否需要更新記錄。

我立即註意到從 INT CHECKSUM() 切換到 VARBINARY(MAX) HASHBYTES() 導致更新檢查過程顯著減慢。我在每個 CHECKSUM 列上都有 NONCLUSTERED 索引,但在我剛剛添加的 HASHBYTES 列上沒有。每個表的唯一鍵也有聚集索引。

  1. 添加以檢查更新的理想索引是什麼?
  2. 我可以為每個表添加一個標準化索引嗎?
  3. VARBINARY(MAX) 是正確的數據類型還是可以安全地將其縮小到更小的大小?

希望這足以讓這個問題變得有意義。我需要盡快加快這個過程。

編輯:我正在添加一個大型 SQL 腳本作為範例,它具有表的 changeLog 版本和目標版本的範例表定義,以及為更新目標版本而執行的查詢。

--OBJECT DEFINITIONS
CREATE TABLE [changeLog].[DimSalesOffice](
   [Sales Office Code] [varchar](3) NOT NULL,
   [CCN_Key] [uniqueidentifier] NOT NULL,
   [Sales Office Name (Short)] [nvarchar](14) NOT NULL,
   [Sales Office Name (Long)] [nvarchar](50) NOT NULL,
   [Sales Office City] [nvarchar](50) NOT NULL,
   [Sales Office StateProvince] [nvarchar](50) NOT NULL,
   [Sales Office Postal Code] [nvarchar](20) NOT NULL,
   [Sales Office Country Code] [varchar](3) NOT NULL,
   [Sales Office Address] [nvarchar](65) NOT NULL,
   [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
   [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
   [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
   [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
   [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
   [Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
   [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
   [Sales Office Region] [nvarchar](100) NULL,
   [Native Language Code] [varchar](2) NOT NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[DimSalesOffice](
   [SalesOffice_Key] [uniqueidentifier] NOT NULL,
   [Sales Office Code] [varchar](3) NOT NULL,
   [CCN_Key] [uniqueidentifier] NOT NULL,
   [Sales Office Name (Short)] [nvarchar](14) NOT NULL,
   [Sales Office Name (Long)] [nvarchar](50) NOT NULL,
   [Sales Office City] [nvarchar](50) NOT NULL,
   [Sales Office StateProvince] [nvarchar](50) NOT NULL,
   [Sales Office Postal Code] [nvarchar](20) NOT NULL,
   [Sales Office Country Code] [varchar](3) NOT NULL,
   [Sales Office Address] [nvarchar](65) NOT NULL,
   [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
   [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
   [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
   [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
   [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
   [Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
   [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
   [Sales Office Region] [nvarchar](100) NULL,
   [Native Language Code] [varchar](2) NOT NULL,
   [DW_CreatedOn] [datetime2](7) NULL,
   [DW_ModifiedOn] [datetime2](7) NULL,
   [DW_IsDeleted?] [bit] NULL,
   [DW_Checksum] [int] NULL,
   [Source_ModifiedOn] [datetime2](7) NULL,
   [DW_Hashbytes] [varbinary](max) NULL,
PRIMARY KEY NONCLUSTERED 
(
   [SalesOffice_Key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[DimSalesOffice] ADD  DEFAULT (newsequentialid()) FOR [SalesOffice_Key]
GO

CREATE UNIQUE CLUSTERED INDEX [IX_UK_DimSalesOffice] ON [dbo].[DimSalesOffice]
(
   [Sales Office Code] ASC
)
GO

--MERGE QUERY
DECLARE @InsertRecordCount INT, @UpdateRecordCount INT;

/*****UPDATE*****/
UPDATE [dbo].[DimSalesOffice] SET
   [CCN_Key] = [Source].[CCN_Key],
   [Sales Office Name (Short)] = [Source].[Sales Office Name (Short)],
   [Sales Office Name (Long)] = [Source].[Sales Office Name (Long)],
   [Sales Office City] = [Source].[Sales Office City],
   [Sales Office StateProvince] = [Source].[Sales Office StateProvince],
   [Sales Office Postal Code] = [Source].[Sales Office Postal Code],
   [Sales Office Country Code] = [Source].[Sales Office Country Code],
   [Sales Office Address] = [Source].[Sales Office Address],
   [Sales Office Address (Line 2)] = [Source].[Sales Office Address (Line 2)],
   [Sales Office Name (Short - Native Language)] = [Source].[Sales Office Name (Short - Native Language)],
   [Sales Office Name (Long - Native Language)] = [Source].[Sales Office Name (Long - Native Language)],
   [Sales Office Address (Native Language)] = [Source].[Sales Office Address (Native Language)],
   [Sales Office Address (Line 2 - Native Language)] = [Source].[Sales Office Address (Line 2 - Native Language)],
   [Sales Office City (Native Language)] = [Source].[Sales Office City (Native Language)],
   [Sales Office StateProvince (Native Language)] = [Source].[Sales Office StateProvince (Native Language)],
   [Sales Office Region] = [Source].[Sales Office Region],
   [Native Language Code] = [Source].[Native Language Code],
   [DW_Checksum] =
       CHECKSUM(
           [Source].[CCN_Key],
           [Source].[Sales Office Name (Short)],
           [Source].[Sales Office Name (Long)],
           [Source].[Sales Office City],
           [Source].[Sales Office StateProvince],
           [Source].[Sales Office Postal Code],
           [Source].[Sales Office Country Code],
           [Source].[Sales Office Address],
           [Source].[Sales Office Address (Line 2)],
           [Source].[Sales Office Name (Short - Native Language)],
           [Source].[Sales Office Name (Long - Native Language)],
           [Source].[Sales Office Address (Native Language)],
           [Source].[Sales Office Address (Line 2 - Native Language)],
           [Source].[Sales Office City (Native Language)],
           [Source].[Sales Office StateProvince (Native Language)],
           [Source].[Sales Office Region],
           [Source].[Native Language Code],
           0
       ),
   [DW_Hashbytes] = 
       HASHBYTES(
           'SHA2_256',
           ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
           + '0'
       ),
   [Source_ModifiedOn] = NULL,
   [DW_ModifiedOn] = GETUTCDATE(),
   [DW_IsDeleted?] = 0
FROM [changeLog].[DimSalesOffice] [Source]
JOIN [dbo].[DimSalesOffice]
   ON [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
   AND ISNULL([DimSalesOffice].[DW_Hashbytes], HASHBYTES('SHA2_256', '')) <> HASHBYTES(
           'SHA2_256',
           ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
           + '0'
       )
SET @UpdateRecordCount = @@ROWCOUNT;

/*****Soft Deletes*****/
UPDATE [dbo].[DimSalesOffice] SET
   [DW_Checksum] = 0,
   [DW_Hashbytes] = 
       HASHBYTES(
           'SHA2_256',
           ISNULL(CAST([DimSalesOffice].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([DimSalesOffice].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
           + '1'
       ),
   [Source_ModifiedOn] = NULL,
   [DW_ModifiedOn] = GETUTCDATE(),
   [DW_IsDeleted?] = 1
FROM [dbo].[DimSalesOffice]
WHERE NOT EXISTS
   (
       SELECT 1
       FROM [changeLog].[DimSalesOffice] [Source]
       WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
   )
SET @UpdateRecordCount = @UpdateRecordCount + @@ROWCOUNT;

/*****INSERT*****/
INSERT INTO [dbo].[DimSalesOffice]
   (
       [Sales Office Code],
       [CCN_Key],
       [Sales Office Name (Short)],
       [Sales Office Name (Long)],
       [Sales Office City],
       [Sales Office StateProvince],
       [Sales Office Postal Code],
       [Sales Office Country Code],
       [Sales Office Address],
       [Sales Office Address (Line 2)],
       [Sales Office Name (Short - Native Language)],
       [Sales Office Name (Long - Native Language)],
       [Sales Office Address (Native Language)],
       [Sales Office Address (Line 2 - Native Language)],
       [Sales Office City (Native Language)],
       [Sales Office StateProvince (Native Language)],
       [Sales Office Region],
       [Native Language Code],
       [DW_Checksum],
       [DW_Hashbytes],
       [Source_ModifiedOn],
       [DW_ModifiedOn],
       [DW_IsDeleted?],
       [DW_CreatedOn]
   )
SELECT
   [Source].[Sales Office Code],
   [Source].[CCN_Key],
   [Source].[Sales Office Name (Short)],
   [Source].[Sales Office Name (Long)],
   [Source].[Sales Office City],
   [Source].[Sales Office StateProvince],
   [Source].[Sales Office Postal Code],
   [Source].[Sales Office Country Code],
   [Source].[Sales Office Address],
   [Source].[Sales Office Address (Line 2)],
   [Source].[Sales Office Name (Short - Native Language)],
   [Source].[Sales Office Name (Long - Native Language)],
   [Source].[Sales Office Address (Native Language)],
   [Source].[Sales Office Address (Line 2 - Native Language)],
   [Source].[Sales Office City (Native Language)],
   [Source].[Sales Office StateProvince (Native Language)],
   [Source].[Sales Office Region],
   [Source].[Native Language Code],
   [DW_Checksum] = 
       CHECKSUM(
           [Source].[CCN_Key],
           [Source].[Sales Office Name (Short)],
           [Source].[Sales Office Name (Long)],
           [Source].[Sales Office City],
           [Source].[Sales Office StateProvince],
           [Source].[Sales Office Postal Code],
           [Source].[Sales Office Country Code],
           [Source].[Sales Office Address],
           [Source].[Sales Office Address (Line 2)],
           [Source].[Sales Office Name (Short - Native Language)],
           [Source].[Sales Office Name (Long - Native Language)],
           [Source].[Sales Office Address (Native Language)],
           [Source].[Sales Office Address (Line 2 - Native Language)],
           [Source].[Sales Office City (Native Language)],
           [Source].[Sales Office StateProvince (Native Language)],
           [Source].[Sales Office Region],
           [Source].[Native Language Code],
           0
       ),
   [DW_Hashbytes] = 
       HASHBYTES(
           'SHA2_256',
           ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
           + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
           + '0'
       ),
   [Source_ModifiedOn] = NULL,
   [DW_ModifiedOn] = GETUTCDATE(),
   [DW_IsDeleted?] = 0,
   [DW_CreatedOn] = GETUTCDATE()
FROM [changeLog].[DimSalesOffice] [Source]
WHERE NOT EXISTS
   (
       SELECT 1
       FROM [dbo].[DimSalesOffice]
       WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
   )
SET @InsertRecordCount = @@ROWCOUNT;

SELECT [Update Record Count] = @UpdateRecordCount, [Insert Record Count] = @InsertRecordCount;

反過來回答,因為它更具語言意義:

  1. VARBINARY(MAX) 是正確的數據類型還是可以安全地將其縮小到更小的大小?

正如我在評論中提到的,SHA2_256 means它將輸出散列為 256 位,也就是 32 字節(1 字節中 8 位),這意味著VARBINARY您需要的最大大小是VARBINARY(32). 這在文件中提到HASBYTES()

輸出符合算法標準:MD2、MD4、MD5為128位(16字節);SHA 和 SHA1 為 160 位(20 字節);SHA2_256 為 256 位(32 字節),SHA2_512 為512 位(64 字節)。

所以,是的,你可以安全地將你的尺寸VARBINARY縮小到VARBINARY(32).

  1. 添加以檢查更新的理想索引是什麼?
  2. 我可以為每個表添加一個標準化索引嗎?

是的,在減少VARBINARY欄位的大小後,可以將其添加到表的索引中。我會推荐一個非聚集索引,它以主鍵欄位開頭,然後HASHBYTES()在定義中將計算欄位排在第二位。

使其成為非聚集索引的原因是因為您通常希望避免索引中的熱列,尤其是與每個非聚集索引一起儲存的聚集索引。頻繁更新的熱列會導致對索引的大量寫入,並且在聚集索引的情況下,這些寫入也必鬚髮生在每個非聚集索引上(因為聚集索引與它一起儲存)。所有列上的行雜湊肯定會經常更改。

以主鍵欄位開頭是有意義的,因為您需要先按該欄位連接以匹配相同的行,然後按HASHBYTES()計算欄位連接以檢查它們是否不同。

您將如何改進雜湊字節計算?我正在考慮索引一個計算列。

是的,您可以使用計算列(甚至不需要持久化但可以)並對其進行索引,或者您也可以創建索引視圖,我之前都做過。我會首先為計算列拍攝,因為它比索引視圖更靈活一點,將實際行雜湊耦合到表本身的行中,並且管理的對象更少。

引用自:https://dba.stackexchange.com/questions/313401