為什麼我的 tsv 索引沒有被使用?
我正在嘗試使 postgres 全文搜尋工具發揮作用。
我有兩張表,一張我只是為了測試而創建的,而我希望能夠搜尋的實際一張:
測試表:
webarchive=# \d test_sites Table "public.test_sites" Column | Type | Modifiers -------------+----------+--------------------------------------------------------- id | integer | not null default nextval('test_sites_id_seq'::regclass) content | text | tsv_content | tsvector | Indexes: "test_sites_pkey" PRIMARY KEY, btree (id) "idx_test_web_pages_content" gin (tsv_content) Triggers: web_pages_testing_content_change_trigger AFTER INSERT OR UPDATE ON test_sites FOR EACH ROW EXECUTE PROCEDURE web_pages_testing_content_update_func()
“真實”表:
webarchive=# \d web_pages Table "public.web_pages" Column | Type | Modifiers --------------+-----------------------------+-------------------------------------------------------- id | integer | not null default nextval('web_pages_id_seq'::regclass) state | dlstate_enum | not null errno | integer | url | text | not null starturl | text | not null netloc | text | not null file | integer | priority | integer | not null distance | integer | not null is_text | boolean | limit_netloc | boolean | title | citext | mimetype | text | type | itemtype_enum | raw_content | text | content | text | fetchtime | timestamp without time zone | addtime | timestamp without time zone | tsv_content | tsvector | Indexes: "web_pages_pkey" PRIMARY KEY, btree (id) "ix_web_pages_url" UNIQUE, btree (url) "idx_web_pages_content" gin (tsv_content) "idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text)) "ix_web_pages_distance" btree (distance) "ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 "ix_web_pages_priority" btree (priority) "ix_web_pages_type" btree (type) "ix_web_pages_url_ops" btree (url text_pattern_ops) Foreign-key constraints: "web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id) Triggers: web_pages_content_change_trigger AFTER INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()
除了額外的位之外,兩者都有一個
content
列,以及一個tsv_content
帶有gin()
索引的列。tsv_content
每次修改列時都有一個觸發器更新列content
。請注意,另一個
gin
索引工作正常,實際上我最初gin (to_tsvector('english'::regconfig, content::text))
在內容列上也有一個索引,而不是第二列,但是在等待該索引在測試中重建幾次之後,我決定使用一個單獨的列來預儲存 tsvector 值。對測試表執行查詢使用我期望的索引:
webarchive=# EXPLAIN ANALYZE SELECT test_sites.id, test_sites.content, ts_rank_cd(test_sites.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1 FROM test_sites WHERE test_sites.tsv_content @@ to_tsquery($$testing$$); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test_sites (cost=16.45..114.96 rows=25 width=669) (actual time=0.175..3.720 rows=143 loops=1) Recheck Cond: (tsv_content @@ to_tsquery('testing'::text)) Heap Blocks: exact=117 -> Bitmap Index Scan on idx_test_web_pages_content (cost=0.00..16.44 rows=25 width=0) (actual time=0.109..0.109 rows=143 loops=1) Index Cond: (tsv_content @@ to_tsquery('testing'::text)) Planning time: 0.414 ms Execution time: 3.800 ms (7 rows)
然而,在真實表上的完全相同的查詢似乎永遠不會產生任何結果,只是一個普通的舊順序掃描:
webarchive=# EXPLAIN ANALYZE SELECT web_pages.id, web_pages.content, ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1 FROM web_pages WHERE web_pages.tsv_content @@ to_tsquery($$testing$$); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Seq Scan on web_pages (cost=0.00..4406819.80 rows=19751 width=505) (actual time=0.343..142325.954 rows=134949 loops=1) Filter: (tsv_content @@ to_tsquery('testing'::text)) Rows Removed by Filter: 12764373 Planning time: 0.436 ms Execution time: 142341.489 ms (5 rows)
我已將我的工作記憶體增加到 3 GB,看看這是否是問題所在,但事實並非如此。
此外,應該注意的是,這些是相當大的表 - 大約 150GB 的文本跨越 4M 行(在
content
/tsv_content
是8M 的附加行NULL
)。該
test_sites
表有 1/1000 的行web_pages
,因為當每個查詢都需要數分鐘時進行試驗有點令人望而卻步。我正在使用 postgresql 9.5(是的,我自己編譯它,我想要
ON CONFLICT
)。似乎還沒有標籤。我已經閱讀了 9.5 的未解決問題,但我看不出這是其中任何一個問題的結果。
完全重建索引後,問題仍然存在:
webarchive=# ANALYZE web_pages ; ANALYZE webarchive=# EXPLAIN ANALYZE SELECT web_pages.id, web_pages.content, ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1 FROM web_pages WHERE web_pages.tsv_content @@ to_tsquery($$testing$$); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on web_pages (cost=10000000000.00..10005252343.30 rows=25109 width=561) (actual time=7.114..146444.168 rows=134949 loops=1) Filter: (tsv_content @@ to_tsquery('testing'::text)) Rows Removed by Filter: 13137318 Planning time: 0.521 ms Execution time: 146465.188 ms (5 rows)
請注意,我實際上只是
ANALYZE
ed,並且 seqscan 被禁用。
好吧,我花了一些時間在帶有 DB 的磁碟上騰出一些額外的空間,將其他一些數據庫移到另一個 SSD 上。
然後我跑遍
VACUUM ANALYZE
了整個數據庫,現在顯然它注意到我有索引。我之前已經分析和清理過這個表,但顯然它在某種程度上有所作為,而不是一般地而不是特定的表。去搞清楚。
webarchive=# EXPLAIN ANALYZE SELECT web_pages.id, web_pages.content FROM web_pages WHERE web_pages.tsv_content @@ to_tsquery($$testing$$); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on web_pages (cost=1185.79..93687.30 rows=23941 width=189) (actual time=41.448..152.108 rows=134949 loops=1) Recheck Cond: (tsv_content @@ to_tsquery('testing'::text)) Heap Blocks: exact=105166 -> Bitmap Index Scan on idx_web_pages_content (cost=0.00..1179.81 rows=23941 width=0) (actual time=24.940..24.940 rows=134996 loops=1) Index Cond: (tsv_content @@ to_tsquery('testing'::text)) Planning time: 0.452 ms Execution time: 154.942 ms (7 rows)
VACUUM FULL;
既然我有足夠的空間進行處理,我也藉此機會執行了一個。由於我在開發過程中一直在試驗,所以我在表格中遇到了相當多的行流失,我想嘗試合併由此產生的任何文件碎片。