Sql-Server-2017

高效的維度和事實連接

  • July 9, 2021

我有一個大的事實表和一個簡單星型模式中的小得多的維度表:

--1.
CREATE TABLE dbo.Dim
(
Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
CustomerName VARCHAR(2000)
)
--index
CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);


--2. 
CREATE TABLE dbo.Fact
(
...
PurchaseDate DATE 
CustomerNameId INT CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
...
)
--index
CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;

執行以下簡單查詢,過濾事實表並加入維度:

SELECT sd.CustomerName,f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
)

我們得到以下醜陋的查詢計劃: 在此處輸入圖像描述

有趣的是,維度表傾向於在 4 次迭代中掃描所有 500 000 行,但最終在事實表的該日期範圍內只需要幾千行。

這對於較大的維度表來說是非常低效的,基本上所有的行都是一直掃描的,就像查找表索引甚至不存在一樣。

預期的事情是sql server首先將事實表限制在日期範圍內,然後使用這個有限的CustomerKeyId範圍,它使用索引查找從小維度表中查找CustomerName。

  1. 這真的是星型模式的效率低下,還是我在這裡想念什麼?
  2. 換句話說,我怎麼能強制 sql server 準備有限的 CustomerKeyId 表並只查找那些?(不知何故有CTE?)

這是一個可以玩的範例:

--1.
CREATE TABLE dbo.Dim
(
Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
CustomerName VARCHAR(2000)
)
--index
CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);

with q as
(
  select top 100000 row_number() over (order by (select null)) rn
  from sys.messages m, sys.objects o
)
insert into dim(CustomerName) 
select concat('CustomerName',rn)
from q

--2. 

CREATE TABLE dbo.Fact
(
PurchaseDate DATE,
CustomerNameId INT CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
)
--index
CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;


with q as
(
  select top 10000000 row_number() over (order by (select null)) rn
  from sys.messages m, sys.objects o
)
insert into Fact(PurchaseDate,CustomerNameId) 
select dateadd(day,rn%1000,'20000101'), 1+rn%100000
from q


SELECT sd.CustomerName,f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
)


SELECT sd.CustomerName,f.*
FROM dbo.Fact f
INNER LOOP JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
)

計劃在這裡

您會看到使用索引查找的循環連接比掃描並行執行的每個執行緒上的維度並執行雜湊連接更昂貴:

(70000 rows affected)

SQL Server Execution Times:
  CPU time = 62 ms,  elapsed time = 64 ms.

(70000 rows affected)

SQL Server Execution Times:
  CPU time = 108 ms,  elapsed time = 90 ms.

引用自:https://dba.stackexchange.com/questions/295468