Sql-Server
使用 LEVENSHTEIN 函式查詢返回的結果比預期的多
我的目標是為同一日期內的不同項目在列中選擇相似的字元串,
DETAILS
並且時差小於 10 分鐘。該表類似於下圖:------------------------------------------------------- | ID* | ITEM | DATE | TIME | DETAILS | ------------------------------------------------------- | 1 | XYZ | 20.10.2019 | 12:35:10 | Some_string | ------------------------------------------------------- | 2 | ABC | 20.10.2019 | 12:36:10 | Some_strin | ------------------------------------------------------- | 3 | KLM | 20.10.2019 | 12:36:10 | SSome_sng | ------------------------------------------------------- | 4 | XYZ | 20.10.2019 | 12:55:10 | Another_string| -------------------------------------------------------
該列中有三個不同的值,
ITEM
我想在同一天內為每個項目輸入類似的詳細資訊,並且時差小於 10 分鐘。因此,為了達到預期的結果集,我使用了一個Levenshtein 距離函式,它為 100% 匹配的字元串返回 0,為不匹配的字元串返回一個大於零的整數。
DETAILS
我在一個語句中使用了這個函式來在給定條件的限制中找到相似的ITEMS
,我的查詢是:SELECT t1.ITEM, t1.DATE, t1.TIME, t1.DETAILS, t2.ITEM, t2.DATE, t2.TIME, t2.DETAILS FROM MY_TABLE t1 INNER JOIN MY_TABLE t2 ON t1.ITEM != t2.ITEM WHERE dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) < 20 AND t1.DATE = t2.DATE AND DATEDIFF(MINUTE, t1.TIME, t2.TIME) <= 10
但是這個陳述給了我一個看起來像笛卡爾積的巨大表格。我怎樣才能得到想要的結果?
DDL:
CREATE TABLE [dbo].[MY_TABLE]( [ID] int NOT NULL IDENTITY(1, 1) PRIMARY KEY, [ITEM] nchar(10) NULL, [DATE] date NULL, [TIME] time(7) NULL, [DETAILS] nchar(100) NULL); GO INSERT INTO MY_TABLE([ITEM],[DATE],[TIME],[DETAILS]) VALUES('XYZ', '10.20.2019', '12:35:10', 'Some_string' ), ('ABC', '10.20.2019', '12:36:10', 'Some_strin' ), ('KLM', '10.20.2019', '12:36:10', 'SSome_sng' ), ('XYZ', '10.20.2019', '12:55:10', 'Another_string' ); GO
CREATE function LEVENSHTEIN ( @SourceString nvarchar(100), @TargetString nvarchar(100) ) --Returns the Levenshtein Distance between @SourceString string and @TargetString --Translated to TSQL by Joseph Gama --Updated slightly by Phil Factor returns int as BEGIN DECLARE @Matrix Nvarchar(4000), @LD int, @TargetStringLength int, @SourceStringLength int, @ii int, @jj int, @CurrentSourceChar nchar(1), @CurrentTargetChar nchar(1),@Cost int, @Above int,@AboveAndToLeft int,@ToTheLeft int, @MinimumValueOfCells int -- Step 1: Set n to be the length of s. Set m to be the length of t. -- If n = 0, return m and exit. -- If m = 0, return n and exit. -- Construct a matrix containing 0..m rows and 0..n columns. if @SourceString is null or @TargetString is null return null Select @SourceStringLength=LEN(@SourceString), @TargetStringLength=LEN(@TargetString), @Matrix=replicate(nchar(0),(@SourceStringLength+1)*(@TargetStringLength+1)) If @SourceStringLength = 0 return @TargetStringLength If @TargetStringLength = 0 return @SourceStringLength if (@TargetStringLength+1)*(@SourceStringLength+1)> 4000 return -1 --Step 2: Initialize the first row to 0..n. -- Initialize the first column to 0..m. SET @ii=0 WHILE @ii<=@SourceStringLength BEGIN SET @Matrix=STUFF(@Matrix,@ii+1,1,nchar(@ii))--d(i, 0) = i SET @ii=@ii+1 END SET @ii=0 WHILE @ii<=@TargetStringLength BEGIN SET @Matrix=STUFF(@Matrix,@ii*(@SourceStringLength+1)+1,1,nchar(@ii))--d(0, j) = j SET @ii=@ii+1 END --Step 3 Examine each character of s (i from 1 to n). SET @ii=1 WHILE @ii<=@SourceStringLength BEGIN --Step 4 Examine each character of t (j from 1 to m). SET @jj=1 WHILE @jj<=@TargetStringLength BEGIN --Step 5 and 6 Select --Set cell d[i,j] of the matrix equal to the minimum of: --a. The cell immediately above plus 1: d[i-1,j] + 1. --b. The cell immediately to the left plus 1: d[i,j-1] + 1. --c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost @Above=unicode(substring(@Matrix,@jj*(@SourceStringLength+1)+@ii-1+1,1))+1, @ToTheLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii+1,1))+1, @AboveAndToLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii-1+1,1)) + case when (substring(@SourceString,@ii,1)) = (substring(@TargetString,@jj,1)) then 0 else 1 end--the cost -- If s[i] equals t[j], the cost is 0. -- If s[i] doesn't equal t[j], the cost is 1. -- now calculate the minimum value of the three if (@Above < @ToTheLeft) AND (@Above < @AboveAndToLeft) select @MinimumValueOfCells=@Above else if (@ToTheLeft < @Above) AND (@ToTheLeft < @AboveAndToLeft) select @MinimumValueOfCells=@ToTheLeft else select @MinimumValueOfCells=@AboveAndToLeft Select @Matrix=STUFF(@Matrix, @jj*(@SourceStringLength+1)+@ii+1,1, nchar(@MinimumValueOfCells)), @jj=@jj+1 END SET @ii=@ii+1 END --Step 7 After iteration steps (3, 4, 5, 6) are complete, distance is found in cell d[n,m] return unicode(substring( @Matrix,@SourceStringLength*(@TargetStringLength+1)+@TargetStringLength+1,1 )) END go
這個語句給了我一個看起來像笛卡爾積的大表
我會說這是因為你的加入。
FROM MY_TABLE t1 INNER JOIN MY_TABLE t2 ON t1.ITEM != t2.ITEM
當您使用
!=
運算符時,您將每一行與除具有相同值的列 ITEM 之外的所有其他行連接起來,但它不會避免顛倒行的順序,並且最終會在 TIME 比較上產生負差異. 這是一個例子:CREATE TABLE [dbo].[MY_TABLE]( [ITEM] [nchar](10) NULL, [DATE] [date] NULL, [TIME] [time](7) NULL, [DETAILS] [nchar](100) NULL); INSERT INTO MY_TABLE([ITEM],[DATE],[TIME],[DETAILS]) VALUES('XYZ', '10.20.2019', '12:35:10', 'ronaldo' ), ('XRZ', '10.20.2019', '12:36:10', 'ronalde' ), ('XRZ', '10.20.2019', '12:36:10', 'ronnaudus' ), ('PRZ', '10.20.2019', '12:55:10', 'ronal' );
使用該範例數據執行此查詢:
SELECT t1.ITEM, t1.DATE, t1.TIME, t1.DETAILS, t2.ITEM, t2.DATE, t2.TIME, t2.DETAILS, dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) AS 'LEVENSHTEIN', DATEDIFF(MINUTE, t1.TIME, t2.TIME) AS 'DATEDIFF' FROM MY_TABLE t1 INNER JOIN MY_TABLE t2 ON t1.ITEM != t2.ITEM WHERE dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) < 20 AND t1.DATE = t2.DATE AND DATEDIFF(MINUTE, t1.TIME, t2.TIME) <= 10;
我添加了最後兩列,以便更容易看到問題。結果如下:
看到負的 DATEDIFF 了嗎?這就是您在結果中獲得的行數超過預期的原因。要糾正這種情況,只需向您的加入添加一個條件,如下所示:
SELECT t1.ITEM, t1.DATE, t1.TIME, t1.DETAILS, t2.ITEM, t2.DATE, t2.TIME, t2.DETAILS, dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) AS 'LEVENSHTEIN', DATEDIFF(MINUTE, t1.TIME, t2.TIME) AS 'DATEDIFF' FROM MY_TABLE t1 INNER JOIN MY_TABLE t2 ON t1.ITEM != t2.ITEM AND (t1.DATE <= t2.DATE AND t1.TIME <= t2.TIME) --Avoid joining retroactive time WHERE dbo.LEVENSHTEIN(t1.DETAILS, t2.DETAILS) < 20 AND t1.DATE = t2.DATE AND DATEDIFF(MINUTE, t1.TIME, t2.TIME) <= 10;