Postgresql

為什麼我的 tsv 索引沒有被使用?

  • October 1, 2015

我正在嘗試使 postgres 全文搜尋工具發揮作用。

我有兩張表,一張我只是為了測試而創建的,而我希望能夠搜尋的實際一張:

測試表:

webarchive=# \d test_sites
                           Table "public.test_sites"
  Column    |   Type   |                        Modifiers
-------------+----------+---------------------------------------------------------
id          | integer  | not null default nextval('test_sites_id_seq'::regclass)
content     | text     |
tsv_content | tsvector |
Indexes:
   "test_sites_pkey" PRIMARY KEY, btree (id)
   "idx_test_web_pages_content" gin (tsv_content)
Triggers:
   web_pages_testing_content_change_trigger AFTER INSERT OR UPDATE ON test_sites FOR EACH ROW EXECUTE PROCEDURE web_pages_testing_content_update_func()

“真實”表:

webarchive=# \d web_pages
                                     Table "public.web_pages"
   Column    |            Type             |                       Modifiers
--------------+-----------------------------+--------------------------------------------------------
id           | integer                     | not null default nextval('web_pages_id_seq'::regclass)
state        | dlstate_enum                | not null
errno        | integer                     |
url          | text                        | not null
starturl     | text                        | not null
netloc       | text                        | not null
file         | integer                     |
priority     | integer                     | not null
distance     | integer                     | not null
is_text      | boolean                     |
limit_netloc | boolean                     |
title        | citext                      |
mimetype     | text                        |
type         | itemtype_enum               |
raw_content  | text                        |
content      | text                        |
fetchtime    | timestamp without time zone |
addtime      | timestamp without time zone |
tsv_content  | tsvector                    |
Indexes:
   "web_pages_pkey" PRIMARY KEY, btree (id)
   "ix_web_pages_url" UNIQUE, btree (url)
   "idx_web_pages_content" gin (tsv_content)
   "idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
   "ix_web_pages_distance" btree (distance)
   "ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000
   "ix_web_pages_priority" btree (priority)
   "ix_web_pages_type" btree (type)
   "ix_web_pages_url_ops" btree (url text_pattern_ops)
Foreign-key constraints:
   "web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
   web_pages_content_change_trigger AFTER INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()

除了額外的位之外,兩者都有一個content列,以及一個tsv_content帶有gin()索引的列。tsv_content每次修改列時都有一個觸發器更新列content

請注意,另一個 gin索引工作正常,實際上我最初gin (to_tsvector('english'::regconfig, content::text))在內容列上也有一個索引,而不是第二列,但是在等待該索引在測試中重建幾次之後,我決定使用一個單獨的列來預儲存 tsvector 值。

對測試表執行查詢使用我期望的索引:

webarchive=# EXPLAIN ANALYZE SELECT
   test_sites.id,
   test_sites.content,
   ts_rank_cd(test_sites.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM
   test_sites
WHERE
   test_sites.tsv_content @@ to_tsquery($$testing$$);
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test_sites  (cost=16.45..114.96 rows=25 width=669) (actual time=0.175..3.720 rows=143 loops=1)
  Recheck Cond: (tsv_content @@ to_tsquery('testing'::text))
  Heap Blocks: exact=117
  ->  Bitmap Index Scan on idx_test_web_pages_content  (cost=0.00..16.44 rows=25 width=0) (actual time=0.109..0.109 rows=143 loops=1)
        Index Cond: (tsv_content @@ to_tsquery('testing'::text))
Planning time: 0.414 ms
Execution time: 3.800 ms
(7 rows)

然而,在真實表上的完全相同的查詢似乎永遠不會產生任何結果,只是一個普通的舊順序掃描:

webarchive=# EXPLAIN ANALYZE SELECT
      web_pages.id,
      web_pages.content,
      ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
  FROM
      web_pages
  WHERE
      web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Seq Scan on web_pages  (cost=0.00..4406819.80 rows=19751 width=505) (actual time=0.343..142325.954 rows=134949 loops=1)
  Filter: (tsv_content @@ to_tsquery('testing'::text))
  Rows Removed by Filter: 12764373
Planning time: 0.436 ms
Execution time: 142341.489 ms
(5 rows)

我已將我的工作記憶體增加到 3 GB,看看這是否是問題所在,但事實並非如此。

此外,應該注意的是,這些是相當大的表 - 大約 150GB 的文本跨越 4M 行(在content/tsv_content是8M 的附加行NULL)。

test_sites表有 1/1000 的行web_pages,因為當每個查詢都需要數分鐘時進行試驗有點令人望而卻步。


我正在使用 postgresql 9.5(是的,我自己編譯它,我想要ON CONFLICT)。似乎還沒有標籤。

我已經閱讀了 9.5 的未解決問題,但我看不出這是其中任何一個問題的結果。


完全重建索​​引後,問題仍然存在:

webarchive=# ANALYZE web_pages ;
ANALYZE
webarchive=# EXPLAIN ANALYZE SELECT
   web_pages.id,
   web_pages.content,
   ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM
   web_pages
WHERE
   web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on web_pages  (cost=10000000000.00..10005252343.30 rows=25109 width=561) (actual time=7.114..146444.168 rows=134949 loops=1)
  Filter: (tsv_content @@ to_tsquery('testing'::text))
  Rows Removed by Filter: 13137318
Planning time: 0.521 ms
Execution time: 146465.188 ms
(5 rows)

請注意,我實際上只是ANALYZEed,並且 seqscan 被禁用。

好吧,我花了一些時間在帶有 DB 的磁碟上騰出一些額外的空間,將其他一些數據庫移到另一個 SSD 上。

然後我跑遍VACUUM ANALYZE 了整個數據庫,現在顯然它注意到我有索引。

我之前已經分析和清理過這個表,但顯然它在某種程度上有所作為,而不是一般地而不是特定的表。去搞清楚。

webarchive=# EXPLAIN ANALYZE SELECT
   web_pages.id,
   web_pages.content
FROM
   web_pages
WHERE
   web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on web_pages  (cost=1185.79..93687.30 rows=23941 width=189) (actual time=41.448..152.108 rows=134949 loops=1)
  Recheck Cond: (tsv_content @@ to_tsquery('testing'::text))
  Heap Blocks: exact=105166
  ->  Bitmap Index Scan on idx_web_pages_content  (cost=0.00..1179.81 rows=23941 width=0) (actual time=24.940..24.940 rows=134996 loops=1)
        Index Cond: (tsv_content @@ to_tsquery('testing'::text))
Planning time: 0.452 ms
Execution time: 154.942 ms
(7 rows)

VACUUM FULL;既然我有足夠的空間進行處理,我也藉此機會執行了一個。由於我在開發過程中一直在試驗,所以我在表格中遇到了相當多的行流失,我想嘗試合併由此產生的任何文件碎片。

引用自:https://dba.stackexchange.com/questions/116549