搜尋模式的查詢性能
在SO上發布了相同的問題,但沒有得到答案。
根據 Andrey Nikolov 的評論,嘗試使用Trigram,但表名和列(超過 50 列)是動態的,我的意思是將從前端獲取。
問題:
我有以下範例數據來理解要求。
桌子:
create table ft_test ( col1 int identity(1,1), col2 varchar(max) ); insert into ft_test values('John A Henry'); insert into ft_test values('Dsouza mak Dee'); insert into ft_test values('Smith Ashla Don');
創建全文索引:
CREATE UNIQUE INDEX UIX_test_Col1 ON ft_test(col1); CREATE FULLTEXT CATALOG ftcat WITH ACCENT_SENSITIVITY = OFF CREATE FULLTEXT INDEX ON ft_test (col2 LANGUAGE 1033) KEY INDEX UIX_test_Col1 ON ftcat WITH STOPLIST = SYSTEM
筆記:
- 在表的給定列中搜尋給定模式。
- 我的表中有超過 2 億條記錄。
- 在所需列上創建 FULLTEXT 索引。
尋找表現最好的搜尋查詢。
吉文斯:
- 圖案:
oh
- 要搜尋的列:
col2
預期輸出:
col1 col2 ---------------------- 1 John A Henry
嘗試:
- 使用全文索引:
CONTAINS
SELECT col1, col2 FROM ft_test1 WHERE CONTAINS(col2, '"*oh*"')
沒有輸出。
- 使用全文索引:
FREETEXT
SELECT col1, col2 FROM ft_test1 WHERE FREETEXT(col2, '"*oh*"')
沒有輸出。
- 使用模式索引:
PATINDEX
SELECT col1, col2 FROM ft_test1 WHERE PATINDEX('%oh%',col2)>0
得到了輸出。
執行計劃:
Table Scan
- 使用字元表達式:
CHARINDEX
SELECT col1, col2 FROM ft_test1 WHERE CHARINDEX('oh',col2)>0
得到了輸出。
執行計劃:
Table Scan
- 使用
LIKE
.SELECT col1, col2 FROM ft_test1 WHERE col2 LIKE '%oh%'
得到了輸出。
執行計劃:
Table Scan
…嘗試使用 Trigram,但表名和列(超過 50 列)是動態的…
使用我在具有多列、動態 SQL 或 2 億條記錄的 SQL Server 中的 Trigram Wildcard String Search 中所寫的 trigram 技術沒有什麼特別的問題。
那篇文章的評論顯示人們使用多列、更大的表甚至加密的基本技術。假設進行更改的人了解我的文章中顯示的基本底層實現,所需的修改通常非常簡單。
請注意,儘管三元組搜尋需要至少三個字元的子字元串才能搜尋(顧名思義)。如果您確實需要執行兩個字元的搜尋(如問題範例中所示),則需要進行更多修改,並且您需要仔細評估數據的成本和收益。如果沒有更多資訊,似乎對數億行的兩個字元匹配會產生大量匹配。
如果 n-gram 搜尋不適合您,並且您可以容忍一些延遲,那麼像Elasticsearch這樣的外部解決方案可能值得研究。
例子
下面的程式碼顯示了一種擴展基本三元組搜尋以處理同一個表中的三個字元串列的方法。為簡潔起見,它沒有實現我的文章中顯示的觸發邏輯來保持三元組與底層數據同步。如果需要,則所需的修改遵循大致相同的模式。
CREATE TABLE dbo.Example ( id integer IDENTITY NOT NULL, string1 varchar(10) NOT NULL, string2 varchar(20) NOT NULL, string3 varchar(30) NOT NULL, CONSTRAINT [PK dbo.Example (id)] PRIMARY KEY CLUSTERED (id) ); GO -- 1 million rows INSERT dbo.Example WITH (TABLOCKX) ( string1, string2, string3 ) SELECT TOP (1 * 1000 * 1000) -- 10 characters numeric REPLACE(STR(RAND(CHECKSUM(NEWID())) * 1e10, 10), SPACE(1), '0'), -- 10 characters numeric plus 10 characters [0-9A-F] REPLACE(STR(RAND(CHECKSUM(NEWID())) * 1e10, 10), SPACE(1), '0') + RIGHT(NEWID(), 10), -- 30 characters [0-9A-F] RIGHT(NEWID(), 10) + RIGHT(NEWID(), 10) + RIGHT(NEWID(), 10) FROM master.dbo.spt_values AS SV1 CROSS JOIN master.dbo.spt_values AS SV2 OPTION (MAXDOP 1); GO --- Generate trigrams from a string CREATE FUNCTION dbo.GenerateTrigrams ( @string varchar(255) ) RETURNS table WITH SCHEMABINDING AS RETURN WITH N16 AS ( SELECT V.v FROM ( VALUES (0),(0),(0),(0),(0),(0),(0),(0), (0),(0),(0),(0),(0),(0),(0),(0) ) AS V (v)), -- Numbers table (256) Nums AS ( SELECT n = ROW_NUMBER() OVER (ORDER BY A.v) FROM N16 AS A CROSS JOIN N16 AS B ), Trigrams AS ( -- Every 3-character substring SELECT TOP (CASE WHEN LEN(@string) > 2 THEN LEN(@string) - 2 ELSE 0 END) trigram = SUBSTRING(@string, N.n, 3) FROM Nums AS N ORDER BY N.n ) -- Remove duplicates and ensure all three characters are alphanumeric SELECT DISTINCT T.trigram FROM Trigrams AS T WHERE -- Binary collation comparison so ranges work as expected T.trigram COLLATE Latin1_General_BIN2 NOT LIKE '%[^A-Z0-9a-z]%'; GO -- Trigrams for Example table CREATE TABLE dbo.Example_Trigrams ( id integer NOT NULL, column_id integer NOT NULL, trigram char(3) NOT NULL ); GO -- Generate trigrams for string1 DECLARE @column_id integer = COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), N'string1', 'ColumnId'); INSERT dbo.Example_Trigrams WITH (TABLOCKX) ( id, column_id, trigram ) SELECT E.id, @column_id, GT.trigram FROM dbo.Example AS E CROSS APPLY dbo.GenerateTrigrams(E.string1) AS GT; GO -- Generate trigrams for string2 DECLARE @column_id integer = COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), N'string2', 'ColumnId'); INSERT dbo.Example_Trigrams WITH (TABLOCKX) ( id, column_id, trigram ) SELECT E.id, @column_id, GT.trigram FROM dbo.Example AS E CROSS APPLY dbo.GenerateTrigrams(E.string2) AS GT; GO -- Generate trigrams for string3 DECLARE @column_id integer = COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), N'string3', 'ColumnId'); INSERT dbo.Example_Trigrams WITH (TABLOCKX) ( id, column_id, trigram ) SELECT E.id, @column_id, GT.trigram FROM dbo.Example AS E CROSS APPLY dbo.GenerateTrigrams(E.string3) AS GT; GO -- Trigram search index CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.Example_Trigrams (column_id, trigram, id)] ON dbo.Example_Trigrams (column_id, trigram, id) WITH (DATA_COMPRESSION = ROW); GO -- Selectivity of each trigram (performance optimization) CREATE VIEW dbo.Example_TrigramCounts WITH SCHEMABINDING AS SELECT ET.column_id, ET.trigram, cnt = COUNT_BIG(*) FROM dbo.Example_Trigrams AS ET GROUP BY ET.column_id, ET.trigram; GO -- Materialize the view CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.Example_TrigramCounts (column_id, trigram)] ON dbo.Example_TrigramCounts (column_id, trigram); GO -- Most selective trigrams for a search string -- Always returns a row (NULLs if no trigrams found) CREATE FUNCTION dbo.Example_GetBestTrigrams ( @column_name sysname, @string varchar(255) ) RETURNS table WITH SCHEMABINDING AS RETURN SELECT -- Pivot trigram1 = MAX(CASE WHEN BT.rn = 1 THEN BT.trigram END), trigram2 = MAX(CASE WHEN BT.rn = 2 THEN BT.trigram END), trigram3 = MAX(CASE WHEN BT.rn = 3 THEN BT.trigram END) FROM ( -- Generate trigrams for the search string -- and choose the most selective three SELECT TOP (3) rn = ROW_NUMBER() OVER ( ORDER BY ETC.cnt ASC), GT.trigram FROM dbo.GenerateTrigrams(@string) AS GT JOIN dbo.Example_TrigramCounts AS ETC WITH (NOEXPAND) ON ETC.column_id = COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), @column_name, 'ColumnId') AND ETC.trigram = GT.trigram ORDER BY ETC.cnt ASC ) AS BT; GO -- Returns Example ids matching all provided (non-null) trigrams CREATE FUNCTION dbo.Example_GetTrigramMatchIDs ( @column_name sysname, @Trigram1 char(3), @Trigram2 char(3), @Trigram3 char(3) ) RETURNS @IDs table (id integer PRIMARY KEY) WITH SCHEMABINDING AS BEGIN DECLARE @column_id integer = COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), @column_name, 'ColumnId'); IF @Trigram1 IS NOT NULL BEGIN IF @Trigram2 IS NOT NULL BEGIN IF @Trigram3 IS NOT NULL BEGIN -- 3 trigrams available INSERT @IDs (id) SELECT ET1.id FROM dbo.Example_Trigrams AS ET1 WHERE ET1.trigram = @Trigram1 AND ET1.column_id = @column_id INTERSECT SELECT ET2.id FROM dbo.Example_Trigrams AS ET2 WHERE ET2.trigram = @Trigram2 AND ET2.column_id = @column_id INTERSECT SELECT ET3.id FROM dbo.Example_Trigrams AS ET3 WHERE ET3.trigram = @Trigram3 AND ET3.column_id = @column_id OPTION (MERGE JOIN); END; ELSE BEGIN -- 2 trigrams available INSERT @IDs (id) SELECT ET1.id FROM dbo.Example_Trigrams AS ET1 WHERE ET1.trigram = @Trigram1 AND ET1.column_id = @column_id INTERSECT SELECT ET2.id FROM dbo.Example_Trigrams AS ET2 WHERE ET2.trigram = @Trigram2 AND ET2.column_id = @column_id OPTION (MERGE JOIN); END; END; ELSE BEGIN -- 1 trigram available INSERT @IDs (id) SELECT ET1.id FROM dbo.Example_Trigrams AS ET1 WHERE ET1.trigram = @Trigram1 AND ET1.column_id = @column_id; END; END; RETURN; END; GO -- Search implementation CREATE FUNCTION dbo.Example_TrigramSearch ( @column_name sysname, @search varchar(255) ) RETURNS table WITH SCHEMABINDING AS RETURN SELECT Result.id, Result.string FROM dbo.Example_GetBestTrigrams(@column_name, @search) AS GBT CROSS APPLY ( -- Trigram search SELECT E.id, string = CASE @column_name WHEN 'string1' THEN E.string1 WHEN 'string2' THEN E.string2 WHEN 'string3' THEN E.string3 ELSE NULL END FROM dbo.Example_GetTrigramMatchIDs (@column_name, GBT.trigram1, GBT.trigram2, GBT.trigram3) AS MID JOIN dbo.Example AS E ON E.id = MID.id WHERE -- Residual exact match check CASE @column_name WHEN 'string1' THEN E.string1 WHEN 'string2' THEN E.string2 WHEN 'string3' THEN E.string3 ELSE NULL END LIKE @search -- At least one trigram found AND GBT.trigram1 IS NOT NULL UNION ALL -- Non-trigram search SELECT E.id, string = CASE @column_name WHEN 'string1' THEN E.string1 WHEN 'string2' THEN E.string2 WHEN 'string3' THEN E.string3 ELSE NULL END FROM dbo.Example AS E WHERE CASE @column_name WHEN 'string1' THEN E.string1 WHEN 'string2' THEN E.string2 WHEN 'string3' THEN E.string3 ELSE NULL END LIKE @search -- No trigram found AND GBT.trigram1 IS NULL ) AS Result;
該特定實現選擇允許使用列名作為參數進行三元組搜尋:
SELECT ETS.id, ETS.string FROM dbo.Example_TrigramSearch ( N'string1', '%12345%' ) AS ETS; SELECT ETS.id, ETS.string FROM dbo.Example_TrigramSearch ( N'string2', '%ABC12%' ) AS ETS; SELECT ETS.id, ETS.string FROM dbo.Example_TrigramSearch ( N'string3', '%ABC45%' ) AS ETS;
你應該解釋一下需求而不是性能,一兩列並不顯示實際問題的嚴重性。
無論如何,
CONTAINS
查詢FREETEXT
不起作用,因為它們不搜尋單詞或片語之間的字母。它們適用於以匹配字母開頭的單詞或片語,以及其他條件。包含
此外,在字母前使用萬用字元“*”也沒有任何意義,因為它不會被 CONTAINS 和 FREETEXT 讀取為萬用字元
SELECT col1, col2 FROM ft_test1 WHERE FREETEXT(col2, '"jo*"')
所以上面的查詢將返回輸出,因為單詞以
jo
is開頭John
。還要注意’“jo*”’ * 之前沒有使用,即使使用了也會忽略。很明顯
CONTAINS
,FREETEXT
不是你的要求。以下連結可能會對您有所幫助, 部分單詞搜尋