Sql-Server

SQL Server 儲存過程漸變維度

  • October 22, 2018

有人可以對我們的 Kimball 數據倉庫的這個緩慢變化的維度進行程式碼審查嗎?這是類型 2 維度。我在網際網路上查看了很多程式碼,但是我想使用非 TSQL 特定的命令(試圖防止使用 OUTPUT 和 $Action)。如果您能找到一種方法來提高效率/優化,請隨時進行編輯。我們將來可能會轉移到 MySQL 或 PostgreSQL。

https://www.mssqltips.com/sqlservertip/2883/using-the-sql-server-merge-statement-to-process-type-2-slowly-changing-dimensions/

https://sqlblogcasts.com/blogs/atulthakor/archive/2011/01/10/t-sql-scd-slowly-changing-dimension-type-2-using-a-merge-statement.aspx

表:

create table dbo.Stagingfood
(
   FoodNaturalId int primary key identity(1,1),
   FoodName varchar(255),
   FoodCategory varchar(255)
)

create table dbo.Dimfood
(
   DimFoodId int primary key identity(1,1),
   FoodNaturalId int,
   FoodName varchar(255),
   FoodCategory varchar(255),
   begindate datetime,
   enddate datetime
)

程式碼:

create procedure dbo.DimFoodImport
as

declare @newdatetime datetime

begin transaction


   -- Insert new records which do not exist

   insert into dbo.DimFood
   (
       FoodNaturalId,
       FoodName,
       FoodCategory,
       begindate,
       enddate
   )
   select 
       stgfood.FoodNaturalId,
       stgfood.FoodName,
       stgfood.FoodCategory,
       getdate() as begindate,
       '12/31/9999' as enddate
   from dbo.Stagingfood stgfood 
   left join dbo.DimFood df
       on df.FoodNaturalId = stgfood.FoodNaturalId
   where df.FoodNaturalId is null


   -- Close off existing records that have changed
   set @newdatetime = getdate()

   update dbo.DimFood
   set EndDate = @newdatetime
   from dbo.DimFood  df 
   left join dbo.Stagingfood stgfood 
       on df.FoodNaturalId = stgfood.FoodNaturalId
       and df.enddate = '12/31/9999'
   where 
       stgfood.FoodName <> df.FoodName
       or stgfood.FoodCategory <> df.FoodCategory

   -- Insert new updated records

   insert into dbo.DimFood
   (
       FoodNaturalId,
       FoodName,
       FoodCategory,
       begindate,
       enddate
   )
   select 
       stgfood.FoodNaturalId,
       stgfood.FoodName,
       stgfood.FoodCategory,
       @newdatetime as begindate,
       '12/31/9999' as enddate
   from dbo.Stagingfood stgfood 
   left join dbo.DimFood df
       on df.FoodNaturalId = stgfood.FoodNaturalId
       and df.EndDate = @newdatetime
   where 
       stgfood.FoodName <> df.FoodName
       or stgfood.FoodCategory <> df.FoodCategory

commit transaction

假設我有適當的索引,未來的 SCD 表將有 10-15 列。以上只是範例。

如果您可以找到一種方法來提高效率/優化,請隨時編輯

在不了解索引等的情況下,我嘗試用一些範例數據填充臨時表:

   (FoodName,FoodCategory)
values('Chicken','Meat')
go 1000

insert into dbo.Stagingfood 
(FoodName,FoodCategory)
values('Veal','Meat')
go 10000


insert into dbo.Stagingfood 
(FoodName,FoodCategory)
values('Porc','Meat')
go 20000

第一部分:插入

引起我注意的一件事是第一個插入:

– 插入不存在的新記錄

insert into dbo.DimFood
(
   FoodNaturalId,
   FoodName,
   FoodCategory,
   begindate,
   enddate
)
select 
   stgfood.FoodNaturalId,
   stgfood.FoodName,
   stgfood.FoodCategory,
   getdate() as begindate,
   '12/31/9999' as enddate
from dbo.Stagingfood stgfood 
left join dbo.DimFood df
   on df.FoodNaturalId = stgfood.FoodNaturalId
where df.FoodNaturalId is null;

當您的數據變大時,這種左連接會導致問題。左連接將使您的查詢始終需要連接所有記錄,然後才能應用過濾器。

我的重寫+添加索引是這樣的:

CREATE NONCLUSTERED INDEX IX_DimFood_FoodNaturalId
ON [dbo].[Dimfood] ([FoodNaturalId]);

  insert into dbo.DimFood
   (
       FoodNaturalId,
       FoodName,
       FoodCategory,
       begindate,
       enddate
   )
   select 
       stgfood.FoodNaturalId,
       stgfood.FoodName,
       stgfood.FoodCategory,
       getdate() as begindate,
       '12/31/9999' as enddate
   from dbo.Stagingfood stgfood 
   where not exists 
  (select * from  dbo.DimFood df where df.FoodNaturalId = stgfood.FoodNaturalId);

您可以檢查兩者的區別,帶有左連接的那個使用合併連接(LEFT OUTER JOIN)進入過濾器,而重寫使用沒有過濾器的左反半連接。

粘貼計劃:

https://www.brentozar.com/pastetheplan/?id=HyVZigsjm

第二部分:更新

這主要是需要用實際數據集進行測試的東西。但是在可以使用內連接的地方進行左連接又是一個不需要的額外步驟。另一件事是,or 子句在優化器中通常不會很好。

這可能是重寫,但 or 子句可能需要保留,這需要再次測試。

 -- Close off existing records that have changed

可以更改此左連接,因為您正在檢查現有記錄。

   set @newdatetime = getdate()

這種重寫+索引在我的數據集上給出了更好的結果

   CREATE NONCLUSTERED INDEX IX_Enddate_FoodNaturalId
   on [dbo].[Dimfood] (Enddate,[FoodNaturalId])
   include(foodname,foodcategory);
   set @newdatetime = getdate();

   update dbo.DimFood
   set EndDate = @newdatetime
   from dbo.DimFood  df 
   inner join   dbo.Stagingfood stgfood 
   on df.FoodNaturalId = stgfood.FoodNaturalId
   where( df.enddate = '12/31/9999' 
   and (stgfood.FoodName <> df.FoodName or stgfood.FoodCategory <> df.FoodCategory));

如果“或”子句在大數據集上出現問題,您可能必須將它們分散到兩個更新中:

update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood  df      inner join   dbo.Stagingfood stgfood 
on df.FoodNaturalId = stgfood.FoodNaturalId
where( df.enddate = '12/31/9999' and stgfood.FoodName <> df.FoodName)


update dbo.DimFood
set EndDate = @newdatetime
from dbo.DimFood  df      inner join   dbo.Stagingfood stgfood 
on df.FoodNaturalId = stgfood.FoodNaturalId
where( df.enddate = '12/31/9999' and stgfood.FoodCategory <> df.FoodCategory);

計劃不將“或”子句分成兩部分:(第一部分原創,第二部分重寫)

https://www.brentozar.com/pastetheplan/?id=BJDvNZjjm

第三部分:插入新的更新記錄

同樣在這裡,如果您只想檢查新更新的記錄,則可以使用內部連接來改進查詢。您也不需要 where 子句。

  -- Insert new updated records
insert into dbo.DimFood
(
   FoodNaturalId,
   FoodName,
   FoodCategory,
   begindate,
   enddate
)
select 
   stgfood.FoodNaturalId,
   stgfood.FoodName,
   stgfood.FoodCategory,
   @newdatetime as begindate,
   '12/31/9999' as enddate
from dbo.Stagingfood stgfood 
inner join dbo.DimFood df
on df.FoodNaturalId = stgfood.FoodNaturalId
and df.EndDate = @newdatetime;

粘貼計劃(第一個計劃原始,第二個計劃重寫):

https://www.brentozar.com/pastetheplan/?id=BJk2V-sim

引用自:https://dba.stackexchange.com/questions/220704