Sql-Server

搜尋模式的查詢性能

  • January 29, 2019

在SO上發布了相同的問題,但沒有得到答案。

根據 Andrey Nikolov 的評論,嘗試使用Trigram,但表名和列(超過 50 列)是動態的,我的意思是將從前端獲取。

問題

我有以下範例數據來理解要求。

桌子:

create table ft_test
(
   col1 int identity(1,1),
   col2 varchar(max)
);

insert into ft_test values('John A Henry');
insert into ft_test values('Dsouza mak Dee');
insert into ft_test values('Smith Ashla Don');

創建全文索引

CREATE UNIQUE INDEX UIX_test_Col1 ON ft_test(col1);

CREATE FULLTEXT CATALOG ftcat
WITH ACCENT_SENSITIVITY = OFF

CREATE FULLTEXT INDEX ON ft_test  
(col2 LANGUAGE 1033)  
KEY INDEX UIX_test_Col1  
ON ftcat  
WITH STOPLIST = SYSTEM

筆記:

  1. 在表的給定列中搜尋給定模式。
  2. 我的表中有超過 2 億條記錄。
  3. 在所需列上創建 FULLTEXT 索引。

尋找表現最好的搜尋查詢。

吉文斯:

  1. 圖案:oh
  2. 要搜尋的列:col2

預期輸出:

col1    col2
----------------------
1       John A Henry

嘗試:

  1. 使用全文索引:CONTAINS
SELECT col1, col2
FROM ft_test1
WHERE CONTAINS(col2, '"*oh*"')  

沒有輸出。

  1. 使用全文索引:FREETEXT
SELECT col1, col2
FROM ft_test1
WHERE FREETEXT(col2, '"*oh*"')  

沒有輸出。

  1. 使用模式索引:PATINDEX
SELECT col1, col2
FROM ft_test1
WHERE PATINDEX('%oh%',col2)>0

得到了輸出。

執行計劃:Table Scan

  1. 使用字元表達式:CHARINDEX
SELECT col1, col2
FROM ft_test1
WHERE CHARINDEX('oh',col2)>0

得到了輸出。

執行計劃:Table Scan

  1. 使用LIKE.
SELECT col1, col2
FROM ft_test1
WHERE col2 LIKE '%oh%'   

得到了輸出。

執行計劃:Table Scan

…嘗試使用 Trigram,但表名和列(超過 50 列)是動態的…

使用我在具有多列、動態 SQL 或 2 億條記錄的 SQL Server 中的 Trigram Wildcard String Search 中所寫的 trigram 技術沒有什麼特別的問題。

那篇文章的評論顯示人們使用多列、更大的表甚至加密的基本技術。假設進行更改的人了解我的文章中顯示的基本底層實現,所需的修改通常非常簡單。

請注意,儘管三元組搜尋需要至少三個字元的子字元串才能搜尋(顧名思義)。如果您確實需要執行兩個字元的搜尋(如問題範例中所示),則需要進行更多修改,並且您需要仔細評估數據的成本和收益。如果沒有更多資訊,似乎對數億行的兩個字元匹配會產生大量匹配。

如果 n-gram 搜尋不適合您,並且您可以容忍一些延遲,那麼像Elasticsearch這樣的外部解決方案可能值得研究。

例子

下面的程式碼顯示了一種擴展基本三元組搜尋以處理同一個表中的三個字元串列的方法。為簡潔起見,它沒有實現我的文章中顯示的觸發邏輯來保持三元組與底層數據同步。如果需要,則所需的修改遵循大致相同的模式。

CREATE TABLE dbo.Example 
(
   id integer IDENTITY NOT NULL,
   string1 varchar(10) NOT NULL,
   string2 varchar(20) NOT NULL,
   string3 varchar(30) NOT NULL,

   CONSTRAINT [PK dbo.Example (id)]
       PRIMARY KEY CLUSTERED (id)
);
GO
-- 1 million rows
INSERT dbo.Example 
   WITH (TABLOCKX)
(
   string1,
   string2,
   string3
)
SELECT TOP (1 * 1000 * 1000)
   -- 10 characters numeric
   REPLACE(STR(RAND(CHECKSUM(NEWID())) * 1e10, 10), SPACE(1), '0'),
   -- 10 characters numeric plus 10 characters [0-9A-F]
   REPLACE(STR(RAND(CHECKSUM(NEWID())) * 1e10, 10), SPACE(1), '0') + RIGHT(NEWID(), 10),
   -- 30 characters [0-9A-F]
   RIGHT(NEWID(), 10) + RIGHT(NEWID(), 10) + RIGHT(NEWID(), 10)
FROM master.dbo.spt_values AS SV1
CROSS JOIN master.dbo.spt_values AS SV2
OPTION (MAXDOP 1);
GO
--- Generate trigrams from a string
CREATE FUNCTION dbo.GenerateTrigrams
(
   @string varchar(255)
)
RETURNS table
WITH SCHEMABINDING
AS RETURN
   WITH
       N16 AS 
       (
           SELECT V.v 
           FROM 
           (
               VALUES 
                   (0),(0),(0),(0),(0),(0),(0),(0),
                   (0),(0),(0),(0),(0),(0),(0),(0)
           ) AS V (v)),
       -- Numbers table (256)
       Nums AS 
       (
           SELECT n = ROW_NUMBER() OVER (ORDER BY A.v)
           FROM N16 AS A 
           CROSS JOIN N16 AS B
       ),
       Trigrams AS
       (
           -- Every 3-character substring
           SELECT TOP (CASE WHEN LEN(@string) > 2 THEN LEN(@string) - 2 ELSE 0 END)
               trigram = SUBSTRING(@string, N.n, 3)
           FROM Nums AS N
           ORDER BY N.n
       )
   -- Remove duplicates and ensure all three characters are alphanumeric
   SELECT DISTINCT 
       T.trigram
   FROM Trigrams AS T
   WHERE
       -- Binary collation comparison so ranges work as expected
       T.trigram COLLATE Latin1_General_BIN2 NOT LIKE '%[^A-Z0-9a-z]%';
GO
-- Trigrams for Example table
CREATE TABLE dbo.Example_Trigrams
(
   id integer NOT NULL,
   column_id integer NOT NULL,
   trigram char(3) NOT NULL
);
GO
-- Generate trigrams for string1
DECLARE @column_id integer = 
   COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), N'string1', 'ColumnId');

INSERT dbo.Example_Trigrams
   WITH (TABLOCKX)
   (
       id,
       column_id,
       trigram
   )
SELECT
   E.id,
   @column_id,
   GT.trigram
FROM dbo.Example AS E
CROSS APPLY dbo.GenerateTrigrams(E.string1) AS GT;
GO
-- Generate trigrams for string2 
DECLARE @column_id integer = 
   COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), N'string2', 'ColumnId');

INSERT dbo.Example_Trigrams
   WITH (TABLOCKX)
   (
       id,
       column_id,
       trigram
   )
SELECT
   E.id,
   @column_id,
   GT.trigram
FROM dbo.Example AS E
CROSS APPLY dbo.GenerateTrigrams(E.string2) AS GT;
GO
-- Generate trigrams for string3
DECLARE @column_id integer = 
   COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), N'string3', 'ColumnId');

INSERT dbo.Example_Trigrams
   WITH (TABLOCKX)
   (
       id,
       column_id,
       trigram
   )
SELECT
   E.id,
   @column_id,
   GT.trigram
FROM dbo.Example AS E
CROSS APPLY dbo.GenerateTrigrams(E.string3) AS GT;
GO
-- Trigram search index
CREATE UNIQUE CLUSTERED INDEX
   [CUQ dbo.Example_Trigrams (column_id, trigram, id)]
ON dbo.Example_Trigrams (column_id, trigram, id)
WITH (DATA_COMPRESSION = ROW);
GO
-- Selectivity of each trigram (performance optimization)
CREATE VIEW dbo.Example_TrigramCounts
WITH SCHEMABINDING
AS
SELECT
   ET.column_id,
   ET.trigram, 
   cnt = COUNT_BIG(*)
FROM dbo.Example_Trigrams AS ET
GROUP BY
   ET.column_id,
   ET.trigram;
GO
-- Materialize the view
CREATE UNIQUE CLUSTERED INDEX
   [CUQ dbo.Example_TrigramCounts (column_id, trigram)]
ON dbo.Example_TrigramCounts (column_id, trigram);
GO
-- Most selective trigrams for a search string
-- Always returns a row (NULLs if no trigrams found)
CREATE FUNCTION dbo.Example_GetBestTrigrams
(
   @column_name sysname,
   @string varchar(255)
)
RETURNS table
WITH SCHEMABINDING AS
RETURN
   SELECT
       -- Pivot
       trigram1 = MAX(CASE WHEN BT.rn = 1 THEN BT.trigram END),
       trigram2 = MAX(CASE WHEN BT.rn = 2 THEN BT.trigram END),
       trigram3 = MAX(CASE WHEN BT.rn = 3 THEN BT.trigram END)
   FROM 
   (
       -- Generate trigrams for the search string
       -- and choose the most selective three
       SELECT TOP (3)
           rn = ROW_NUMBER() OVER (
               ORDER BY ETC.cnt ASC),
           GT.trigram
       FROM dbo.GenerateTrigrams(@string) AS GT
       JOIN dbo.Example_TrigramCounts AS ETC
           WITH (NOEXPAND)
           ON ETC.column_id = COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), @column_name, 'ColumnId')
           AND ETC.trigram = GT.trigram
       ORDER BY
           ETC.cnt ASC
   ) AS BT;
GO
-- Returns Example ids matching all provided (non-null) trigrams
CREATE FUNCTION dbo.Example_GetTrigramMatchIDs
(
   @column_name sysname,
   @Trigram1 char(3),
   @Trigram2 char(3),
   @Trigram3 char(3)
)
RETURNS @IDs table (id integer PRIMARY KEY)
WITH SCHEMABINDING AS
BEGIN
   DECLARE @column_id integer =
       COLUMNPROPERTY(OBJECT_ID(N'dbo.Example', 'U'), @column_name, 'ColumnId');

   IF  @Trigram1 IS NOT NULL
   BEGIN
       IF @Trigram2 IS NOT NULL
       BEGIN
           IF @Trigram3 IS NOT NULL
           BEGIN
               -- 3 trigrams available
               INSERT @IDs (id)
               SELECT ET1.id
               FROM dbo.Example_Trigrams AS ET1 
               WHERE ET1.trigram = @Trigram1
               AND ET1.column_id = @column_id
               INTERSECT
               SELECT ET2.id
               FROM dbo.Example_Trigrams AS ET2
               WHERE ET2.trigram = @Trigram2
               AND ET2.column_id = @column_id
               INTERSECT
               SELECT ET3.id
               FROM dbo.Example_Trigrams AS ET3
               WHERE ET3.trigram = @Trigram3
               AND ET3.column_id = @column_id
               OPTION (MERGE JOIN);
           END;
           ELSE
           BEGIN
               -- 2 trigrams available
               INSERT @IDs (id)
               SELECT ET1.id
               FROM dbo.Example_Trigrams AS ET1 
               WHERE ET1.trigram = @Trigram1
               AND ET1.column_id = @column_id
               INTERSECT
               SELECT ET2.id
               FROM dbo.Example_Trigrams AS ET2
               WHERE ET2.trigram = @Trigram2
               AND ET2.column_id = @column_id
               OPTION (MERGE JOIN);
           END;
       END;
       ELSE
       BEGIN
           -- 1 trigram available
           INSERT @IDs (id)
           SELECT ET1.id
           FROM dbo.Example_Trigrams AS ET1 
           WHERE ET1.trigram = @Trigram1
           AND ET1.column_id = @column_id;
       END;
   END;

   RETURN;
END;
GO
-- Search implementation
CREATE FUNCTION dbo.Example_TrigramSearch
(
   @column_name sysname,
   @search varchar(255)
)
RETURNS table
WITH SCHEMABINDING
AS
RETURN
   SELECT
       Result.id,
       Result.string
   FROM dbo.Example_GetBestTrigrams(@column_name, @search) AS GBT
   CROSS APPLY
   (
       -- Trigram search
       SELECT
           E.id,
           string =
               CASE @column_name
                   WHEN 'string1' THEN E.string1
                   WHEN 'string2' THEN E.string2
                   WHEN 'string3' THEN E.string3
                   ELSE NULL
               END
       FROM dbo.Example_GetTrigramMatchIDs
           (@column_name, GBT.trigram1, GBT.trigram2, GBT.trigram3) AS MID
       JOIN dbo.Example AS E
           ON E.id = MID.id
       WHERE
           -- Residual exact match check
           CASE @column_name
               WHEN 'string1' THEN E.string1
               WHEN 'string2' THEN E.string2
               WHEN 'string3' THEN E.string3
               ELSE NULL
           END LIKE @search
           -- At least one trigram found 
           AND GBT.trigram1 IS NOT NULL

       UNION ALL

       -- Non-trigram search
       SELECT
           E.id,
           string = 
               CASE @column_name
                   WHEN 'string1' THEN E.string1
                   WHEN 'string2' THEN E.string2
                   WHEN 'string3' THEN E.string3
                   ELSE NULL
               END
       FROM dbo.Example AS E
       WHERE
           CASE @column_name
               WHEN 'string1' THEN E.string1
               WHEN 'string2' THEN E.string2
               WHEN 'string3' THEN E.string3
               ELSE NULL
           END LIKE @search
           -- No trigram found 
           AND GBT.trigram1 IS NULL
   ) AS Result;

該特定實現選擇允許使用列名作為參數進行三元組搜尋:

SELECT
   ETS.id,
   ETS.string
FROM dbo.Example_TrigramSearch
(
   N'string1', 
   '%12345%'
) AS ETS;

SELECT
   ETS.id,
   ETS.string
FROM dbo.Example_TrigramSearch
(
   N'string2', 
   '%ABC12%'
) AS ETS;

SELECT
   ETS.id,
   ETS.string
FROM dbo.Example_TrigramSearch
(
   N'string3', 
   '%ABC45%'
) AS ETS;

你應該解釋一下需求而不是性能,一兩列並不顯示實際問題的嚴重性。

無論如何,CONTAINS查詢FREETEXT不起作用,因為它們不搜尋單詞或片語之間的字母。

它們適用於以匹配字母開頭的單詞或片語,以及其他條件。包含

此外,在字母前使用萬用字元“*”也沒有任何意義,因為它不會被 CONTAINS 和 FREETEXT 讀取為萬用字元

SELECT col1, col2
FROM ft_test1
WHERE FREETEXT(col2, '"jo*"')

所以上面的查詢將返回輸出,因為單詞以jois開頭John。還要注意’“jo*”’ * 之前沒有使用,即使使用了也會忽略。

很明顯CONTAINSFREETEXT不是你的要求。

以下連結可能會對您有所幫助, 部分單詞搜尋

引用自:https://dba.stackexchange.com/questions/226318