Best index for similarity function
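I have a table with 6.2 million records, and I need to run similarity search queries against one of its columns. A query can look like this:

    SELECT "lca_test".*
    FROM "lca_test"
    WHERE (similarity(job_title, 'sales executive') > 0.6)
      AND worksite_city = 'los angeles'
    ORDER BY salary ASC
    LIMIT 50 OFFSET 0

More conditions can be added to the WHERE clause (year = X, worksite_state = N, status = 'certified', visa_class = Z).

Some of these queries take a really long time to run, over 30 seconds. Sometimes more than a minute.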
`EXPLAIN ANALYZE` of the query above gives me this:

    Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
      ->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
            Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
            Rows Removed by Filter: 6330130
    Total runtime: 33487.802 ms
I can't figure out how I should index my columns to make this query fast.

EDIT: Here is the Postgres version:

    PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
Here is the table definition:

                                                          Table "public.lca_test"
             Column         |       Type        |                       Modifiers                       | Storage  | Stats target | Description
    ------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
     id                     | integer           | not null default nextval('lca_test_id_seq'::regclass) | plain    |              |
     raw_id                 | integer           |                                                       | plain    |              |
     year                   | integer           |                                                       | plain    |              |
     company_id             | integer           |                                                       | plain    |              |
     visa_class             | character varying |                                                       | extended |              |
     employement_start_date | character varying |                                                       | extended |              |
     employement_end_date   | character varying |                                                       | extended |              |
     employer_name          | character varying |                                                       | extended |              |
     employer_address1      | character varying |                                                       | extended |              |
     employer_address2      | character varying |                                                       | extended |              |
     employer_city          | character varying |                                                       | extended |              |
     employer_state         | character varying |                                                       | extended |              |
     employer_postal_code   | character varying |                                                       | extended |              |
     employer_phone         | character varying |                                                       | extended |              |
     employer_phone_ext     | character varying |                                                       | extended |              |
     job_title              | character varying |                                                       | extended |              |
     soc_code               | character varying |                                                       | extended |              |
     naic_code              | character varying |                                                       | extended |              |
     prevailing_wage        | character varying |                                                       | extended |              |
     pw_unit_of_pay         | character varying |                                                       | extended |              |
     wage_unit_of_pay       | character varying |                                                       | extended |              |
     worksite_city          | character varying |                                                       | extended |              |
     worksite_state         | character varying |                                                       | extended |              |
     worksite_postal_code   | character varying |                                                       | extended |              |
     total_workers          | integer           |                                                       | plain    |              |
     case_status            | character varying |                                                       | extended |              |
     case_no                | character varying |                                                       | extended |              |
     salary                 | real              |                                                       | plain    |              |
     salary_max             | real              |                                                       | plain    |              |
     prevailing_wage_second | real              |                                                       | plain    |              |
     lawyer_id              | integer           |                                                       | plain    |              |
     citizenship            | character varying |                                                       | extended |              |
     class_of_admission     | character varying |                                                       | extended |              |
    Indexes:
        "lca_test_pkey" PRIMARY KEY, btree (id)
        "index_lca_test_on_id_and_salary" btree (id, salary)
        "index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
        "index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
        "index_lca_test_on_id_and_visa_class" btree (id, visa_class)
        "index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
        "index_lca_test_on_lawyer_id" btree (lawyer_id)
        "index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
        "index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
        "index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
        "index_lca_test_on_salary" btree (salary)
        "index_lca_test_on_visa_class" btree (visa_class)
        "index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
        "index_lca_test_on_worksite_state" btree (worksite_state)
        "index_lca_test_on_year_and_company_id" btree (year, company_id)
        "index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
        "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
        "lca_test_company_id" btree (company_id)
        "lca_test_employer_name" btree (employer_name)
        "lca_test_id" btree (id)
        "lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
    Foreign-key constraints:
        "fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
    Has OIDs: no
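It goes without saying that you have the additional module `pg_trgm` installed, which provides the `similarity()` function.

**Similarity operator `%`**

Whatever you do, use the similarity operator `%` instead of the expression `(similarity(job_title, 'sales executive') > 0.6)`. Index support is bound to operators in Postgres, not to functions.

To get the desired minimum similarity of 0.6, set the GUC parameter:

    SET pg_trgm.similarity_threshold = 0.6;  -- once per session

(This parameter was added in Postgres 9.6. In 9.5 or older, which includes your 9.3.5, use the now-deprecated `SELECT set_limit(0.6);` instead.)

The setting stays for the rest of your session until reset. Check with:

    SHOW pg_trgm.similarity_threshold;

(In older versions: `SELECT show_limit();`)

**Simple case**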
Getting just the best matches in column `job_title` for a given string would be a simple case of "nearest neighbor" search. It can be solved with a GiST index using the trigram operator class `gist_trgm_ops` (but not with a GIN index):

    CREATE INDEX trgm_idx ON lca_test USING gist (job_title gist_trgm_ops);

To also include an equality condition on `worksite_city`, you need the additional module `btree_gist`. Run (once per database):

    CREATE EXTENSION btree_gist;

Then:

    CREATE INDEX lcas_trgm_gist_idx ON lca_test USING gist (worksite_city, job_title gist_trgm_ops);

Query:

    SET pg_trgm.similarity_threshold = 0.6;  -- once per session

    SELECT *
    FROM   lca_test
    WHERE  job_title % 'sales executive'
    AND    worksite_city = 'los angeles'
    ORDER  BY (job_title <-> 'sales executive')
    LIMIT  50;
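The operators in that query can be tried on plain string literals first, to get a feel for what a threshold of 0.6 lets through. This is only a sketch: the second literal, 'sales exec', is made up for illustration, and the threshold is set with `set_limit()` on the assumption that the server is 9.5 or older.

```sql
-- Assumes pg_trgm is installed in the current database.
-- On 9.6 or later, use: SET pg_trgm.similarity_threshold = 0.6;
SELECT set_limit(0.6);

SELECT similarity('sales executive', 'sales exec') AS sim       -- between 0 and 1
     , 'sales executive' <-> 'sales exec'          AS dist      -- distance operator
     , 'sales executive' %   'sales exec'          AS is_match; -- compared against the threshold
```

Comparing `sim` with `is_match` for a few real job titles from your table is a cheap way to decide whether 0.6 is the threshold you actually want.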
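`<->` is the "distance" operator: one minus the `similarity()` value.

Postgres can also combine two separate indexes, a plain btree index on `worksite_city` and a separate GiST index on `job_title`, but a multicolumn index should be fastest when you combine the two columns as you do.

**Your case**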
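However, your query sorts by `salary`, not by distance or similarity, which is a different matter altogether. Now both GIN and GiST indexes can be used, and GIN will be faster. (Even more so in later versions, which brought major improvements for GIN indexes. Hint: upgrade!)

It is a similar story for the additional equality check on `worksite_city`: install the additional module `btree_gin`. Run (once per database):

    CREATE EXTENSION btree_gin;

Then:

    CREATE INDEX lcas_trgm_gin_idx ON lca_test USING gin (worksite_city, job_title gin_trgm_ops);

Query:

    SET pg_trgm.similarity_threshold = 0.6;  -- once per session

    SELECT *
    FROM   lca_test
    WHERE  job_title % 'sales executive'
    AND    worksite_city = 'los angeles'
    ORDER  BY salary
    LIMIT  50;  -- OFFSET 0

Again, this should also work (less efficiently) with the simpler index you already have (`"index_lcas_job_title_trigram"`), possibly in combination with other indexes. The best solution depends on the complete picture.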
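To see whether the existing trigram index gets used once the query is written with `%`, you could pair it with a plain btree index on `worksite_city` and inspect the plan. This is a sketch under assumptions: the index name `lca_test_worksite_city_idx` is made up, and whether the planner combines the two indexes in a bitmap scan depends on your data and statistics.

```sql
-- Hypothetical index name; a plain btree index that Postgres may combine
-- with the existing GIN index "index_lcas_job_title_trigram".
CREATE INDEX lca_test_worksite_city_idx ON lca_test (worksite_city);

-- 9.5 or older; on 9.6 or later use: SET pg_trgm.similarity_threshold = 0.6;
SELECT set_limit(0.6);

-- Run the query for real and report timing plus buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles'
ORDER  BY salary
LIMIT  50;
```

If the plan still shows an index scan on `index_lca_test_on_salary` with both conditions applied as a filter, the trigram index is not being used for this query.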
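**Asides**

- You have a lot of indexes. Are you sure they are all in use and worth their maintenance cost?
- You have some dubious data types:

      employement_start_date | character varying
      employement_end_date   | character varying

  It seems those should be `date`. And so on.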
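Both asides can be followed up with a few statements. The sketch below assumes the statistics collector has been running long enough for the scan counts to be meaningful, and that every non-empty value in the column is in a format the cast to `date` accepts. Neither assumption is confirmed by the question.

```sql
-- Scan counts (since statistics were last reset) and on-disk size for each
-- index on lca_test. An idx_scan of 0 marks an index worth a closer look;
-- keep indexes that back constraints, such as lca_test_pkey, regardless.
SELECT indexrelname
     , idx_scan
     , pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'lca_test'
ORDER  BY idx_scan, pg_relation_size(indexrelid) DESC;

-- Convert a text column to date. ASSUMPTION: values are valid date strings;
-- empty strings are mapped to NULL. If any value cannot be cast, the
-- statement fails and nothing is changed.
ALTER TABLE lca_test
  ALTER COLUMN employement_start_date TYPE date
  USING NULLIF(employement_start_date, '')::date;
```

Changing the column type rewrites the table and holds an exclusive lock while it runs, so on 6.2 million rows it is better done in a maintenance window.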