Sql-Server

SQL Server full-text indexer and stoplists/stopwords

  • December 7, 2012

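While working on my graduate project *(text mining with SQL Server 2012 Semantic Search)*, I ran into a situation where I need to post a question on this site, hoping someone can help me.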

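This question is about stoplists and stopwords in SQL Server 2012. I have set up a proof of concept in which I am trying to use the new Semantic Search feature to index documents and list the statistically relevant key phrases. Because I don't want certain words to be indexed, and therefore to appear as statistically relevant key phrases, I am creating a stoplist to exclude those words.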

Stoplist/stopwords for English (LCID 1033):

/* Create stoplist and add words */ 
CREATE FULLTEXT STOPLIST [naam van de stoplist];
  ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'beeten' LANGUAGE 'English';
  ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'centimeter' LANGUAGE 'English';
  ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'info' LANGUAGE 'English';
  ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'ruud' LANGUAGE 'English';
GO
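For reference, a stoplist created without a FROM clause starts out empty, so it holds only the words added explicitly. If the standard English stopwords should be excluded as well, a stoplist can instead be seeded from the system stoplist. A minimal sketch, where the name `DocumentsStoplist` is a hypothetical placeholder that is not part of the setup described in this post:

```sql
/* Create a stoplist pre-filled with the system stopwords
   (DocumentsStoplist is a hypothetical name) */
CREATE FULLTEXT STOPLIST [DocumentsStoplist] FROM SYSTEM STOPLIST;
GO

/* Add the custom words on top of the system stopwords */
ALTER FULLTEXT STOPLIST [DocumentsStoplist] ADD 'beeten' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [DocumentsStoplist] ADD 'centimeter' LANGUAGE 'English';
GO
```

This does not change how the custom words themselves are handled; it only affects whether common words such as "for" or "and" are treated as stopwords too.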

Creating the full-text catalog and the full-text index with the custom stoplist and semantics:

/* Full-Text catalog */
CREATE FULLTEXT CATALOG [ft] WITH ACCENT_SENSITIVITY = ON AS DEFAULT;
GO

/* Full-Text Index */
CREATE FULLTEXT INDEX ON [dbo].[Documents]
   (   file_stream Language 1033 STATISTICAL_SEMANTICS )
   KEY INDEX DocumentsFt
   WITH STOPLIST = [naam van de stoplist];
GO
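Semantic indexing additionally depends on the semantic language statistics database being installed and registered, and on the index language being supported for semantics. A quick check, sketched with two catalog views available in SQL Server 2012:

```sql
/* Returns one row if the semantic language statistics database is registered */
SELECT database_id, register_date, version
FROM sys.fulltext_semantic_language_statistics_database;

/* Languages available for semantic indexing; English is LCID 1033 */
SELECT lcid, name
FROM sys.fulltext_semantic_languages
WHERE lcid = 1033;
```

If either query returns no rows, the STATISTICAL_SEMANTICS option on the index cannot work for that language.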

I tried everything I could think of to check whether I had missed something:

   /*Select all words in the stoplist, with some debug information*/
   SELECT sys.fulltext_stoplists.stoplist_id AS [Stoplist id]
       ,  sys.fulltext_stoplists.name AS [Stoplist]
       ,  sys.database_principals.name AS [Owner]
       ,  sys.fulltext_languages.lcid AS [LCID]
       ,  sys.fulltext_languages.name AS [Taal]
       ,  sys.fulltext_stopwords.stopword AS [Stopwoord] 
   FROM sys.fulltext_languages
   INNER JOIN sys.fulltext_stopwords 
       ON sys.fulltext_stopwords.language_id = sys.fulltext_languages.lcid
   INNER JOIN sys.fulltext_stoplists 
       ON sys.fulltext_stopwords.stoplist_id = sys.fulltext_stoplists.stoplist_id
   INNER JOIN sys.database_principals ON sys.database_principals.principal_id = sys.fulltext_stoplists.principal_id
   WHERE sys.fulltext_stoplists.name = 'naam van de stoplist';

/* List with all Full-Text Indexes (with statistical_semantics) */
SELECT sys.fulltext_catalogs.name [Full-Text catalog]   
   , sys.indexes.name AS [Index] 
   , sys.indexes.type_desc AS [Index type]
   , sys.fulltext_indexes.is_enabled AS [Index in use]
   , sys.fulltext_stoplists.name AS [Stoplist]
   , sys.tables.name AS [Table]
   , sys.columns.name AS [Column]
   , sys.fulltext_index_columns.language_id AS [LCID]
   , sys.fulltext_languages.name AS [Language]
   , sys.fulltext_index_columns.statistical_semantics [Semantic]
FROM sys.fulltext_catalogs
INNER JOIN sys.fulltext_indexes 
   ON sys.fulltext_catalogs.fulltext_catalog_id = sys.fulltext_indexes.fulltext_catalog_id
INNER JOIN sys.fulltext_index_columns 
   ON sys.fulltext_indexes.object_id = sys.fulltext_index_columns.object_id
INNER JOIN sys.indexes 
   ON sys.fulltext_indexes.object_id = sys.indexes.object_id 
   AND sys.fulltext_indexes.unique_index_id = sys.indexes.index_id
INNER JOIN sys.index_columns 
   ON sys.indexes.object_id = sys.index_columns.object_id 
   AND sys.indexes.index_id = sys.index_columns.index_id
INNER JOIN sys.columns 
   ON sys.index_columns.object_id = sys.columns.object_id 
   AND sys.index_columns.column_id = sys.columns.column_id
INNER JOIN sys.tables 
   ON sys.fulltext_indexes.object_id = sys.tables.object_id
INNER JOIN sys.fulltext_languages 
   ON sys.fulltext_index_columns.language_id = sys.fulltext_languages.lcid
LEFT JOIN sys.fulltext_stoplists 
   ON sys.fulltext_indexes.stoplist_id = sys.fulltext_stoplists.stoplist_id    
WHERE sys.fulltext_index_columns.statistical_semantics = 1
ORDER BY sys.fulltext_catalogs.name
       ,sys.indexes.name
       ,sys.index_columns.key_ordinal;

/* Rebuild catalog */
ALTER FULLTEXT CATALOG [ft] REBUILD;
GO

/* Check status of the catalog rebuild */
/*  0 = Idle.
1 = Full population is in progress.
2 = Incremental population is in progress.
3 = Propagation of tracked changes is in progress.
4 = Background update index is in progress, such as automatic change tracking.
5 = Full-text indexing is throttled or paused.
*/
SELECT FULLTEXTCATALOGPROPERTY('ft', 'PopulateStatus') AS Status;
GO

/* Repopulate Full-Text Index */
ALTER FULLTEXT INDEX ON dbo.Documents START UPDATE POPULATION;
GO
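Note that an update population only processes changes tracked since the last population. To force every row to be indexed again with the current stoplist, a full population on the table is an alternative. A sketch using the object names from this post; the status property is read at table level here rather than at catalog level:

```sql
/* Re-index every row of the table so the current stoplist is applied */
ALTER FULLTEXT INDEX ON dbo.Documents START FULL POPULATION;
GO

/* Table-level population status; 0 = idle, 1 = full population in progress */
SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.Documents'), 'TableFulltextPopulateStatus') AS [Populate status];
GO
```

The first statement may be rejected if a population is already running, for example right after the catalog rebuild, so it is best run once the status has returned to 0.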

All of the commands above indicate that the setup is correct.

Yet when I look at the indexed terms, I still see words from the stoplist, such as "beeten".

SELECT * 
FROM sys.dm_fts_index_keywords(DB_ID('SQLServerArticles'), OBJECT_ID('Documents'))
WHERE display_term = 'beeten';
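Since the goal is to keep these words out of the key phrases, it may also be worth looking at the semantic output directly instead of only at the full-text keywords. A sketch, assuming the table and column names from the index definition earlier in this post:

```sql
/* Check whether the stopword still shows up as a semantic key phrase */
SELECT kp.document_key, kp.keyphrase, kp.score
FROM SEMANTICKEYPHRASETABLE(dbo.Documents, file_stream) AS kp
WHERE kp.keyphrase = N'beeten'
ORDER BY kp.score DESC;
```

An empty result would mean the word is absent from the key phrases even if it appears in the keyword DMV.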

I even checked whether the full-text parser works correctly, using the following statement.

SELECT special_term, display_term
FROM sys.dm_fts_parser
(' "testing for fruit and nuts centimeter, any type of Beeten" ', 1033, 8, 0)

This statement returns the following result:

Exact Match testing
Exact Match for
Exact Match fruit
Exact Match and
Exact Match nuts
Noise Word  centimeter
Exact Match any
Exact Match type
Exact Match of
Noise Word  beeten

This result shows that the word "beeten" is treated as a noise word. Shouldn't this word be skipped during indexing? What am I missing?
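The third argument of sys.dm_fts_parser in the call above is the stoplist ID, hard-coded as 8. To be sure the parser test runs against the intended stoplist, the ID can be looked up by name. A sketch; the variable name `@stoplist_id` is introduced here only for illustration:

```sql
/* Look up the stoplist ID by name (NULL if no such stoplist exists,
   in which case the parser uses no stoplist at all) */
DECLARE @stoplist_id int =
   (SELECT stoplist_id
    FROM sys.fulltext_stoplists
    WHERE name = N'naam van de stoplist');

/* Parse the same test string with that stoplist;
   last argument 0 = accent-insensitive, as in the original call */
SELECT special_term, display_term
FROM sys.dm_fts_parser
(N' "testing for fruit and nuts centimeter, any type of Beeten" ', 1033, @stoplist_id, 0);
```

Passing 0 as the stoplist ID would test against the system stoplist instead, which can help when comparing the two.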

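To repeat: because I don't want certain words to be indexed, and therefore to appear as statistically relevant key phrases, I am creating a stoplist to exclude those words.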

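There is a known bug (Microsoft Connect item 753596) that occurs when your system locale is not English: for documents stored in a FileTable, the stopwords of the system locale are used instead of the stopwords of the full-text index.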

Source: https://dba.stackexchange.com/questions/29926