SQL Server 儲存過程漸變維度
有人可以對我們的 Kimball 數據倉庫的這個緩慢變化的維度進行程式碼審查嗎?這是類型 2 維度。我在網際網路上查看了很多程式碼,但是我想使用非 TSQL 特定的命令(試圖防止使用 OUTPUT 和 $Action)。如果您能找到一種方法來提高效率/優化,請隨時進行編輯。我們將來可能會轉移到 MySQL 或 PostgreSQL。
表:
create table dbo.Stagingfood ( FoodNaturalId int primary key identity(1,1), FoodName varchar(255), FoodCategory varchar(255) ) create table dbo.Dimfood ( DimFoodId int primary key identity(1,1), FoodNaturalId int, FoodName varchar(255), FoodCategory varchar(255), begindate datetime, enddate datetime )
程式碼:
create procedure dbo.DimFoodImport as declare @newdatetime datetime begin transaction -- Insert new records which do not exist insert into dbo.DimFood ( FoodNaturalId, FoodName, FoodCategory, begindate, enddate ) select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory, getdate() as begindate, '12/31/9999' as enddate from dbo.Stagingfood stgfood left join dbo.DimFood df on df.FoodNaturalId = stgfood.FoodNaturalId where df.FoodNaturalId is null -- Close off existing records that have changed set @newdatetime = getdate() update dbo.DimFood set EndDate = @newdatetime from dbo.DimFood df left join dbo.Stagingfood stgfood on df.FoodNaturalId = stgfood.FoodNaturalId and df.enddate = '12/31/9999' where stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory -- Insert new updated records insert into dbo.DimFood ( FoodNaturalId, FoodName, FoodCategory, begindate, enddate ) select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory, @newdatetime as begindate, '12/31/9999' as enddate from dbo.Stagingfood stgfood left join dbo.DimFood df on df.FoodNaturalId = stgfood.FoodNaturalId and df.EndDate = @newdatetime where stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory commit transaction
假設我有適當的索引,未來的 SCD 表將有 10-15 列。以上只是範例。
如果您可以找到一種方法來提高效率/優化,請隨時編輯
在不了解索引等的情況下,我嘗試用一些範例數據填充臨時表:
(FoodName,FoodCategory) values('Chicken','Meat') go 1000 insert into dbo.Stagingfood (FoodName,FoodCategory) values('Veal','Meat') go 10000 insert into dbo.Stagingfood (FoodName,FoodCategory) values('Porc','Meat') go 20000
第一部分:插入
引起我注意的一件事是第一個插入:
– 插入不存在的新記錄
insert into dbo.DimFood ( FoodNaturalId, FoodName, FoodCategory, begindate, enddate ) select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory, getdate() as begindate, '12/31/9999' as enddate from dbo.Stagingfood stgfood left join dbo.DimFood df on df.FoodNaturalId = stgfood.FoodNaturalId where df.FoodNaturalId is null;
當您的數據變大時,這種左連接會導致問題。左連接將使您的查詢始終需要連接所有記錄,然後才能應用過濾器。
我的重寫+添加索引是這樣的:
CREATE NONCLUSTERED INDEX IX_DimFood_FoodNaturalId ON [dbo].[Dimfood] ([FoodNaturalId]); insert into dbo.DimFood ( FoodNaturalId, FoodName, FoodCategory, begindate, enddate ) select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory, getdate() as begindate, '12/31/9999' as enddate from dbo.Stagingfood stgfood where not exists (select * from dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);
您可以檢查兩者的區別,帶有左連接的那個使用合併連接(LEFT OUTER JOIN)進入過濾器,而重寫使用沒有過濾器的左反半連接。
粘貼計劃:
https://www.brentozar.com/pastetheplan/?id=HyVZigsjm
第二部分:更新
這主要是需要用實際數據集進行測試的東西。但是在可以使用內連接的地方進行左連接又是一個不需要的額外步驟。另一件事是,or 子句在優化器中通常不會很好。
這可能是重寫,但 or 子句可能需要保留,這需要再次測試。
-- Close off existing records that have changed
可以更改此左連接,因為您正在檢查現有記錄。
set @newdatetime = getdate()
這種重寫+索引在我的數據集上給出了更好的結果
CREATE NONCLUSTERED INDEX IX_Enddate_FoodNaturalId on [dbo].[Dimfood] (Enddate,[FoodNaturalId]) include(foodname,foodcategory); set @newdatetime = getdate(); update dbo.DimFood set EndDate = @newdatetime from dbo.DimFood df inner join dbo.Stagingfood stgfood on df.FoodNaturalId = stgfood.FoodNaturalId where( df.enddate = '12/31/9999' and (stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory));
如果“或”子句在大數據集上出現問題,您可能必須將它們分散到兩個更新中:
update dbo.DimFood set EndDate = @newdatetime from dbo.DimFood df inner join dbo.Stagingfood stgfood on df.FoodNaturalId = stgfood.FoodNaturalId where( df.enddate = '12/31/9999' and stgfood.FoodName <> df.FoodName) update dbo.DimFood set EndDate = @newdatetime from dbo.DimFood df inner join dbo.Stagingfood stgfood on df.FoodNaturalId = stgfood.FoodNaturalId where( df.enddate = '12/31/9999' and stgfood.FoodCategory <> df.FoodCategory);
計劃不將“或”子句分成兩部分:(第一部分原創,第二部分重寫)
https://www.brentozar.com/pastetheplan/?id=BJDvNZjjm
第三部分:插入新的更新記錄
同樣在這裡,如果您只想檢查新更新的記錄,則可以使用內部連接來改進查詢。您也不需要 where 子句。
-- Insert new updated records insert into dbo.DimFood ( FoodNaturalId, FoodName, FoodCategory, begindate, enddate ) select stgfood.FoodNaturalId, stgfood.FoodName, stgfood.FoodCategory, @newdatetime as begindate, '12/31/9999' as enddate from dbo.Stagingfood stgfood inner join dbo.DimFood df on df.FoodNaturalId = stgfood.FoodNaturalId and df.EndDate = @newdatetime;
粘貼計劃(第一個計劃原始,第二個計劃重寫):