PostgreSQL GIN index not used when ts_query language is fetched from a column

January 2, 2017

I have a table that stores some multilingual content:

CREATE TABLE search (
 content text NOT NULL,
 language regconfig NOT NULL,
 fulltext tsvector
);

CREATE INDEX search_fulltext ON search USING GIN(fulltext);

INSERT INTO search (language, content) VALUES 
 ('dutch', 'Als achter vliegen vliegen vliegen vliegen vliegen vliegen achterna'),
 ('dutch', 'Langs de koele kali liep een kale koeli met een kilo kali op zijn kale koeli-kop.'),
 ('dutch', 'Moeder sneed zeven scheve sneden brood'),
 ('english', 'I saw Susie sitting in a shoe shine shop. Where she sits she shines, and where she shines she sits.'),
 ('english', 'How can a clam cram in a clean cream can?'),
 ('english', 'Can you can a can as a canner can can a can?');

UPDATE search SET fulltext = to_tsvector(language, content);

To make sure I always search with the correct language, I use the following queries:

SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'shine');
(1 row)

SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'vlieg');
(1 row)

Because hardcoding the language does not give the correct results:

SELECT FROM search WHERE fulltext @@ to_tsquery('dutch', 'shine');
(0 rows)

SELECT FROM search WHERE fulltext @@ to_tsquery('english', 'vlieg');
(0 rows)

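The problem, however, is that PostgreSQL does not use the GIN index with the first set of queries and does a sequential scan instead:

SET enable_seqscan = OFF; (Note: I have disabled sequential scans for these examples because the table has so few rows.)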

EXPLAIN ANALYZE SELECT * FROM search WHERE fulltext @@ to_tsquery(language, 'shine');
---
Seq Scan on search  (cost=0.00..17.35 rows=2 width=136) (actual time=0.040..0.044 rows=1 loops=1)
   Filter: (fulltext @@ to_tsquery(language, 'shine'::text))
   Rows Removed by Filter: 5
Planning time: 0.039 ms
Execution time: 0.064 ms
(5 rows)

While it does use the index when the language is hardcoded:

EXPLAIN ANALYZE SELECT FROM search WHERE fulltext @@ to_tsquery('dutch', 'vlieg');
---
Bitmap Heap Scan on search  (cost=12.63..23.66 rows=82 width=0) (actual time=0.044..0.044 rows=1 loops=1)
 Recheck Cond: (fulltext @@ '''vlieg'''::tsquery)
 Heap Blocks: exact=1
 ->  Bitmap Index Scan on search_fulltext  (cost=0.00..12.61 rows=82 width=0) (actual time=0.037..0.037 rows=1 loops=1)
       Index Cond: (fulltext @@ '''vlieg'''::tsquery)
Planning time: 0.128 ms
Execution time: 0.065 ms
(7 rows)

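So my question is: is it at all possible to use a column in the to_tsquery call to pick the correct language configuration and still have Postgres use the GIN index?

I'm using PostgreSQL 9.4.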

Edit:

Here are the execution plans from the real table:

With the language config column:

Seq Scan on search  (cost=0.00..8727.25 rows=188 width=0) (actual time=0.725..352.307 rows=1689 loops=1)
 Filter: (fulltext @@ to_tsquery(language_config, 'example'::text))
 Rows Removed by Filter: 35928
Planning time: 0.053 ms
Execution time: 352.915 ms

When hardcoding the language:

Bitmap Heap Scan on search  (cost=28.65..4088.92 rows=1633 width=0) (actual time=0.514..10.475 rows=1684 loops=1)
 Recheck Cond: (fulltext @@ '''exampl'''::tsquery)
 Heap Blocks: exact=1522
 ->  Bitmap Index Scan on search_fulltext  (cost=0.00..28.24 rows=1633 width=0) (actual time=0.333..0.333 rows=1684 loops=1)
       Index Cond: (fulltext @@ '''exampl'''::tsquery)
Planning time: 0.180 ms
Execution time: 10.564 ms

Edit #2

Tried it with Postgres 9.5, same results.

I suggest a solution with partial expression indexes:
CREATE TABLE search (
  search_id serial PRIMARY KEY
, language  regconfig NOT NULL  -- order of columns matters a bit
, content   text NOT NULL
  --  *no* redundant fulltext tsvector
);
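There is no redundant fulltext column, which makes the table smaller and helps overall performance.
Create one partial expression index per relevant language: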
CREATE INDEX search_fulltext_dutch ON search USING GIN(to_tsvector('dutch', content))
WHERE language = 'dutch'::regconfig;
CREATE INDEX search_fulltext_english ON search USING GIN(to_tsvector('english', content))
WHERE language = 'english'::regconfig;
-- more?
All the partial indexes together are only about as big as your single total index.
Then match the index conditions in the query:
SELECT * FROM search  -- does not return useless column fulltext now
WHERE  language = 'dutch'::regconfig  -- match partial index condition
AND    to_tsvector('dutch', content) @@ to_tsquery('dutch', 'vliegen')

UNION ALL
SELECT * FROM search
WHERE  language = 'english'::regconfig
AND    to_tsvector('english', content) @@ to_tsquery('english', 'vliegen');

-- more?
You get bitmap index scans or index scans this way.
Another index on language may be useful for other purposes, but this query does not need it.
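
To check that a partial expression index is actually picked up, the plan can be inspected with EXPLAIN. The following is a minimal sketch, not part of the original answer: the scratch table search_demo and the index search_demo_dutch are hypothetical names introduced only so the example runs on its own, and enable_seqscan is turned off because a table this small would otherwise be scanned sequentially.

```sql
-- Hypothetical scratch table (assumption: same layout as the table suggested above).
CREATE TEMP TABLE search_demo (
  search_id serial PRIMARY KEY
, language  regconfig NOT NULL
, content   text NOT NULL
);

-- One partial expression index, for Dutch rows only.
CREATE INDEX search_demo_dutch ON search_demo USING GIN (to_tsvector('dutch', content))
WHERE language = 'dutch'::regconfig;

INSERT INTO search_demo (language, content) VALUES
  ('dutch', 'Moeder sneed zeven scheve sneden brood')
, ('english', 'How can a clam cram in a clean cream can?');

-- With so few rows the planner would normally prefer a sequential scan.
SET enable_seqscan = off;

-- Both the WHERE predicate and the indexed expression are repeated verbatim
-- so that the planner can match the query to the partial index.
EXPLAIN
SELECT * FROM search_demo
WHERE  language = 'dutch'::regconfig
AND    to_tsvector('dutch', content) @@ to_tsquery('dutch', 'brood');

RESET enable_seqscan;
```

If the plan does not mention the index, the first things to compare are the language predicate and the to_tsvector expression, since they have to match the index definition.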

@3manuek got me thinking... what if the query is localized beforehand, so that it no longer depends on the row? So I came up with this:

SELECT 
 *
FROM 
 search s 
LEFT JOIN (
 SELECT 'english'::regconfig AS language, to_tsquery('english', 'vliegen') as q
 UNION ALL SELECT 'dutch'::regconfig AS language, to_tsquery('dutch', 'vliegen') as q
 UNION ALL SELECT 'simple'::regconfig AS language, to_tsquery('simple', 'vliegen') as q
) q ON (s.language=q.language) WHERE fulltext @@ q;

The query plan looks like this (on the real database):

Nested Loop  (cost=205.36..1327.05 rows=188 width=1590) (actual time=3.726..7.045 rows=16 loops=1)
 ->  Append  (cost=0.00..0.06 rows=3 width=36) (actual time=0.001..0.006 rows=3 loops=1)
       ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
       ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
       ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
 ->  Bitmap Heap Scan on search s  (cost=205.36..441.70 rows=63 width=1554) (actual time=2.323..2.331 rows=5 loops=3)
       Recheck Cond: ((fulltext @@ ('''vliegen'''::tsquery)) AND ((language)::oid = (('english'::regconfig))::oid))
       Heap Blocks: exact=16
       ->  BitmapAnd  (cost=205.36..205.36 rows=63 width=0) (actual time=2.316..2.316 rows=0 loops=3)
             ->  Bitmap Index Scan on search_fulltext  (cost=0.00..17.41 rows=188 width=0) (actual time=0.022..0.022 rows=9 loops=3)
                   Index Cond: (fulltext @@ ('''vliegen'''::tsquery))
             ->  Bitmap Index Scan on search_language  (cost=0.00..187.67 rows=12539 width=0) (actual time=2.284..2.284 rows=12539 loops=3)
                   Index Cond: ((language)::oid = (('english'::regconfig))::oid)
Planning time: 0.234 ms
Execution time: 7.088 ms

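It seems to work fine, but I'm not very confident in it.

Edit: updated to use UNION ALL instead of UNION, which removes the need for a unique/sort step in the subquery.

Edit 2: it seems that once I created an index on language, the query planner uses that as well: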

CREATE INDEX search_language ON search USING BTREE(language);

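This helps a bit for this query.

Edit 3: switched to a LEFT JOIN, which is probably more technically correct and makes the query match the query plan better.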
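
The same idea can be written without listing one SELECT per language. This is a rough sketch under assumptions, not something tested on the real table: search_join_demo is a hypothetical scratch copy of the question's table, and the list of configurations is expanded with unnest over a regconfig array. The tsquery is computed from the row of the configuration list only, so it does not depend on the table being searched. Whether the planner then chooses a GIN index scan still has to be confirmed with EXPLAIN ANALYZE on real data.

```sql
-- Hypothetical scratch copy of the question's table so the sketch runs on its own.
CREATE TEMP TABLE search_join_demo (
  content  text NOT NULL
, language regconfig NOT NULL
, fulltext tsvector
);

INSERT INTO search_join_demo (language, content) VALUES
  ('dutch', 'Als achter vliegen vliegen vliegen vliegen vliegen vliegen achterna')
, ('english', 'Can you can a can as a canner can can a can?');

UPDATE search_join_demo SET fulltext = to_tsvector(language, content);

CREATE INDEX search_join_demo_fulltext ON search_join_demo USING GIN (fulltext);

-- One row per configuration; each row yields its own localized tsquery,
-- which is then matched only against rows stored with that configuration.
SELECT s.content, s.language
FROM   unnest('{english,dutch,simple}'::regconfig[]) AS l(language)
JOIN   search_join_demo s ON s.language = l.language
WHERE  s.fulltext @@ to_tsquery(l.language, 'vliegen');
```

Adding a language then only means adding an element to the array, at the cost of a plan that may differ from the UNION ALL version shown above.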

Source: https://dba.stackexchange.com/questions/149765
