Sql-Server

使用 LEVENSHTEIN 函式查詢返回的結果比預期的多

  • November 9, 2020

我的目標是為同一日期內的不同項目在列中選擇相似的字元串,DETAILS並且時差小於 10 分鐘。該表類似於下圖:

-------------------------------------------------------
| ID*  | ITEM |    DATE    |   TIME   |    DETAILS    |
-------------------------------------------------------
| 1    | XYZ  | 20.10.2019 | 12:35:10 | Some_string   |
-------------------------------------------------------
| 2    | ABC  | 20.10.2019 | 12:36:10 | Some_strin    |
-------------------------------------------------------
| 3    | KLM  | 20.10.2019 | 12:36:10 | SSome_sng     |
-------------------------------------------------------
| 4    | XYZ  | 20.10.2019 | 12:55:10 | Another_string|
-------------------------------------------------------

該列中有三個不同的值,ITEM我想在同一天內為每個項目輸入類似的詳細資訊,並且時差小於 10 分鐘。

因此,為了達到預期的結果集,我使用了一個Levenshtein 距離函式,它為 100% 匹配的字元串返回 0,為不匹配的字元串返回一個大於零的整數。DETAILS我在一個語句中使用了這個函式來在給定條件的限制中找到相似的ITEMS,我的查詢是:

SELECT t1.ITEM, t1.DATE, t1.TIME, t1.DETAILS,
   t2.ITEM, t2.DATE, t2.TIME, t2.DETAILS
FROM MY_TABLE t1
   INNER JOIN MY_TABLE t2
   ON t1.ITEM != t2.ITEM
WHERE
   dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) < 20
   AND
   t1.DATE = t2.DATE
   AND
   DATEDIFF(MINUTE, t1.TIME, t2.TIME) <= 10

但是這個陳述給了我一個看起來像笛卡爾積的巨大表格。我怎樣才能得到想要的結果?


DDL:

CREATE TABLE [dbo].[MY_TABLE](
   [ID] int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
   [ITEM] nchar(10) NULL,
   [DATE] date NULL,
   [TIME] time(7) NULL,
   [DETAILS] nchar(100) NULL);
GO

INSERT INTO MY_TABLE([ITEM],[DATE],[TIME],[DETAILS])
VALUES('XYZ', '10.20.2019', '12:35:10', 'Some_string' ),
     ('ABC', '10.20.2019', '12:36:10', 'Some_strin' ),
     ('KLM', '10.20.2019', '12:36:10', 'SSome_sng' ),
     ('XYZ', '10.20.2019', '12:55:10', 'Another_string' );
GO
CREATE function LEVENSHTEIN ( @SourceString nvarchar(100), @TargetString nvarchar(100) )
--Returns the Levenshtein Distance between @SourceString string and @TargetString
--Translated to TSQL by Joseph Gama
--Updated slightly by Phil Factor
returns int
as
BEGIN
DECLARE @Matrix Nvarchar(4000), @LD int, @TargetStringLength int, @SourceStringLength int,
@ii int, @jj int, @CurrentSourceChar nchar(1), @CurrentTargetChar nchar(1),@Cost int, 
@Above int,@AboveAndToLeft int,@ToTheLeft int, @MinimumValueOfCells int
-- Step 1: Set n to be the length of s. Set m to be the length of t. 
--                    If n = 0, return m and exit.
--    If m = 0, return n and exit.
--    Construct a matrix containing 0..m rows and 0..n columns.
if @SourceString is null or @TargetString is null return null
Select @SourceStringLength=LEN(@SourceString), 
    @TargetStringLength=LEN(@TargetString),
    @Matrix=replicate(nchar(0),(@SourceStringLength+1)*(@TargetStringLength+1))
If @SourceStringLength = 0 return @TargetStringLength
If @TargetStringLength = 0 return @SourceStringLength
if (@TargetStringLength+1)*(@SourceStringLength+1)> 4000 return -1
--Step 2: Initialize the first row to 0..n.
--     Initialize the first column to 0..m.
SET @ii=0
WHILE @ii<=@SourceStringLength
   BEGIN
   SET @Matrix=STUFF(@Matrix,@ii+1,1,nchar(@ii))--d(i, 0) = i
   SET @ii=@ii+1
   END
SET @ii=0
WHILE @ii<=@TargetStringLength
   BEGIN
   SET @Matrix=STUFF(@Matrix,@ii*(@SourceStringLength+1)+1,1,nchar(@ii))--d(0, j) = j
   SET @ii=@ii+1
   END
--Step 3 Examine each character of s (i from 1 to n).
SET @ii=1
WHILE @ii<=@SourceStringLength
   BEGIN

--Step 4   Examine each character of t (j from 1 to m).
   SET @jj=1
   WHILE @jj<=@TargetStringLength
       BEGIN
--Step 5 and 6
       Select 
       --Set cell d[i,j] of the matrix equal to the minimum of:
       --a. The cell immediately above plus 1: d[i-1,j] + 1.
       --b. The cell immediately to the left plus 1: d[i,j-1] + 1.
       --c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost
       @Above=unicode(substring(@Matrix,@jj*(@SourceStringLength+1)+@ii-1+1,1))+1,
       @ToTheLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii+1,1))+1,
       @AboveAndToLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii-1+1,1))
        + case when (substring(@SourceString,@ii,1)) = (substring(@TargetString,@jj,1)) 
           then 0 else 1 end--the cost
       -- If s[i] equals t[j], the cost is 0.
     -- If s[i] doesn't equal t[j], the cost is 1.
       -- now calculate the minimum value of the three
       if (@Above < @ToTheLeft) AND (@Above < @AboveAndToLeft) 
           select @MinimumValueOfCells=@Above
     else if (@ToTheLeft < @Above) AND (@ToTheLeft < @AboveAndToLeft)
           select @MinimumValueOfCells=@ToTheLeft
       else
           select @MinimumValueOfCells=@AboveAndToLeft
       Select @Matrix=STUFF(@Matrix,
                  @jj*(@SourceStringLength+1)+@ii+1,1,
                  nchar(@MinimumValueOfCells)),
          @jj=@jj+1
       END
   SET @ii=@ii+1
   END    
--Step 7 After iteration steps (3, 4, 5, 6) are complete, distance is found in cell d[n,m]
return unicode(substring(
  @Matrix,@SourceStringLength*(@TargetStringLength+1)+@TargetStringLength+1,1
  ))
END
go

這個語句給了我一個看起來像笛卡爾積的大表

我會說這是因為你的加入。

FROM MY_TABLE t1
INNER JOIN MY_TABLE t2
ON t1.ITEM != t2.ITEM

當您使用!=運算符時,您將每一行與除具有相同值的列 ITEM 之外的所有其他行連接起來,但它不會避免顛倒行的順序,並且最終會在 TIME 比較上產生負差異. 這是一個例子:

CREATE TABLE [dbo].[MY_TABLE](
   [ITEM] [nchar](10) NULL,
   [DATE] [date] NULL,
   [TIME] [time](7) NULL,
   [DETAILS] [nchar](100) NULL);

INSERT INTO MY_TABLE([ITEM],[DATE],[TIME],[DETAILS])
VALUES('XYZ', '10.20.2019', '12:35:10', 'ronaldo' ),
   ('XRZ', '10.20.2019', '12:36:10', 'ronalde' ),
   ('XRZ', '10.20.2019', '12:36:10', 'ronnaudus' ),
   ('PRZ', '10.20.2019', '12:55:10', 'ronal' );

使用該範例數據執行此查詢:

SELECT t1.ITEM, t1.DATE, t1.TIME, t1.DETAILS,
   t2.ITEM, t2.DATE, t2.TIME, t2.DETAILS, 
   dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) AS 'LEVENSHTEIN',
   DATEDIFF(MINUTE, t1.TIME, t2.TIME) AS 'DATEDIFF'
FROM MY_TABLE t1
   INNER JOIN MY_TABLE t2
   ON t1.ITEM != t2.ITEM
WHERE
   dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) < 20
   AND
   t1.DATE = t2.DATE
   AND
   DATEDIFF(MINUTE, t1.TIME, t2.TIME) <= 10;

我添加了最後兩列,以便更容易看到問題。結果如下:

查詢結果

看到負的 DATEDIFF 了嗎?這就是您在結果中獲得的行數超過預期的原因。要糾正這種情況,只需向您的加入添加一個條件,如下所示:

SELECT t1.ITEM, t1.DATE, t1.TIME, t1.DETAILS,
   t2.ITEM, t2.DATE, t2.TIME, t2.DETAILS, 
   dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) AS 'LEVENSHTEIN',
   DATEDIFF(MINUTE, t1.TIME, t2.TIME) AS 'DATEDIFF'
FROM MY_TABLE t1
   INNER JOIN MY_TABLE t2
   ON t1.ITEM != t2.ITEM
   AND (t1.DATE <= t2.DATE AND t1.TIME <= t2.TIME) --Avoid joining retroactive time
WHERE
   dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) < 20
   AND
   t1.DATE = t2.DATE
   AND
   DATEDIFF(MINUTE, t1.TIME, t2.TIME) <= 10;

引用自:https://dba.stackexchange.com/questions/256038