PostgreSQL

Best index for similarity function

  • May 6, 2021

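So I have this table with 6.2 million records, and I have to run search queries with similarity on one of its columns. The queries can look like: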

SELECT  "lca_test".* FROM "lca_test"
WHERE (similarity(job_title, 'sales executive') > 0.6)
AND worksite_city = 'los angeles' 
ORDER BY salary ASC LIMIT 50 OFFSET 0

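More conditions can be added in the WHERE clause (year = X, worksite_state = N, status = 'certified', visa_class = Z).

Running some of those queries can take a really long time, over 30 seconds. Sometimes more than a minute.

EXPLAIN ANALYZE of the previously mentioned query gives me this: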

Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
  ->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
        Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
        Rows Removed by Filter: 6330130
Total runtime: 33487.802 ms

I don't know how I should index my column to make it fast.

EDIT: Here is the Postgres version:

PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit

Here is the table definition:

                                                        Table "public.lca_test"
        Column         |       Type        |                       Modifiers                       | Storage  | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
id                     | integer           | not null default nextval('lca_test_id_seq'::regclass) | plain    |              |
raw_id                 | integer           |                                                       | plain    |              |
year                   | integer           |                                                       | plain    |              |
company_id             | integer           |                                                       | plain    |              |
visa_class             | character varying |                                                       | extended |              |
employement_start_date | character varying |                                                       | extended |              |
employement_end_date   | character varying |                                                       | extended |              |
employer_name          | character varying |                                                       | extended |              |
employer_address1      | character varying |                                                       | extended |              |
employer_address2      | character varying |                                                       | extended |              |
employer_city          | character varying |                                                       | extended |              |
employer_state         | character varying |                                                       | extended |              |
employer_postal_code   | character varying |                                                       | extended |              |
employer_phone         | character varying |                                                       | extended |              |
employer_phone_ext     | character varying |                                                       | extended |              |
job_title              | character varying |                                                       | extended |              |
soc_code               | character varying |                                                       | extended |              |
naic_code              | character varying |                                                       | extended |              |
prevailing_wage        | character varying |                                                       | extended |              |
pw_unit_of_pay         | character varying |                                                       | extended |              |
wage_unit_of_pay       | character varying |                                                       | extended |              |
worksite_city          | character varying |                                                       | extended |              |
worksite_state         | character varying |                                                       | extended |              |
worksite_postal_code   | character varying |                                                       | extended |              |
total_workers          | integer           |                                                       | plain    |              |
case_status            | character varying |                                                       | extended |              |
case_no                | character varying |                                                       | extended |              |
salary                 | real              |                                                       | plain    |              |
salary_max             | real              |                                                       | plain    |              |
prevailing_wage_second | real              |                                                       | plain    |              |
lawyer_id              | integer           |                                                       | plain    |              |
citizenship            | character varying |                                                       | extended |              |
class_of_admission     | character varying |                                                       | extended |              |
Indexes:
   "lca_test_pkey" PRIMARY KEY, btree (id)
   "index_lca_test_on_id_and_salary" btree (id, salary)
   "index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
   "index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
   "index_lca_test_on_id_and_visa_class" btree (id, visa_class)
   "index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
   "index_lca_test_on_lawyer_id" btree (lawyer_id)
   "index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
   "index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
   "index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
   "index_lca_test_on_salary" btree (salary)
   "index_lca_test_on_visa_class" btree (visa_class)
   "index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
   "index_lca_test_on_worksite_state" btree (worksite_state)
   "index_lca_test_on_year_and_company_id" btree (year, company_id)
   "index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
   "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
   "lca_test_company_id" btree (company_id)
   "lca_test_employer_name" btree (employer_name)
   "lca_test_id" btree (id)
   "lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
   "fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no

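It's worth mentioning that you have the additional module pg_trgm installed, which provides the similarity() function.

Similarity operator %

Whatever you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Index support is bound to operators in Postgres, not to functions.

To get the desired minimum similarity of 0.6, set the GUC parameter: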

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

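(On Postgres 9.5 or older, which includes your 9.3.5, use the deprecated SELECT set_limit(0.6); instead. The parameter was added in 9.6.)

The setting stays in effect for the rest of your session unless reset. Check with: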

SHOW pg_trgm.similarity_threshold;

(This used to be SELECT show_limit();)
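
As a quick sanity check, the function and the operator can be compared side by side. This is only a sketch: the three sample titles are made up for illustration, and it assumes pg_trgm is installed on a release where pg_trgm.similarity_threshold exists (on older releases, call set_limit(0.6) instead of the SET).

```sql
SET pg_trgm.similarity_threshold = 0.6;

-- Hypothetical sample titles, not taken from lca_test.
SELECT t.title,
       similarity(t.title, 'sales executive') AS sim,      -- raw score between 0 and 1
       t.title % 'sales executive'            AS is_match  -- true when sim clears the current threshold
FROM  (VALUES ('sales executive'),
              ('senior sales executive'),
              ('software engineer')) AS t(title);
```

Both columns come from the same trigram comparison, but only the % form can be answered from a trigram index.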

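Simple case

If you just wanted the best matches in column job_title for a given string, that would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):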

CREATE INDEX trgm_idx ON lca_test USING gist (job_title gist_trgm_ops);

To also include an equality condition on worksite_city, you need the additional module btree_gist. Run (once per database):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lca_test USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title <-> 'sales executive')
LIMIT  50;

<-> is the "distance" operator: it returns one minus the similarity() value.
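
The relationship between the two can be checked directly. A minimal sketch, assuming pg_trgm is installed; the second string is just an illustrative value:

```sql
-- Explicit casts make sure the text <-> text operator from pg_trgm is chosen.
SELECT 'sales executive'::text <-> 'sales exec'::text          AS distance,
       (1 - similarity('sales executive', 'sales exec'))::real AS one_minus_similarity;
```

Both columns should agree, so ordering by <-> ascending puts the most similar titles first.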

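Postgres can also combine two separate indexes, a plain btree index on worksite_city and a separate GiST index on job_title, but the multicolumn index should be fastest when you combine the two columns like you do.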

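Your case

However, your query sorts by salary, not by distance or similarity, which changes the nature of the game completely. Now we can use either a GIN or a GiST index, and GIN will be faster. (Even more so in later versions, which brought major improvements to GIN indexes. A hint to upgrade!)

Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per database):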

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX lcas_trgm_gin_idx ON lca_test USING gin (worksite_city, job_title gin_trgm_ops);

Query:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
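LIMIT  50; -- OFFSET 0

Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.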

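To see which index the planner actually picks, the rewritten query can be run under EXPLAIN. A sketch, assuming one of the trigram indexes above exists and the similarity threshold has already been set for the session:

```sql
-- ANALYZE executes the query; BUFFERS adds page hit/read counts to the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles'
ORDER  BY salary
LIMIT  50;
```

If the plan still shows an Index Scan on index_lca_test_on_salary with a large "Rows Removed by Filter", the trigram index is not being used. A bitmap scan on the trigram index followed by a sort would be the hoped-for shape.
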
Asides

  • You have a lot of indexes. Are you sure they are all in use and worth their maintenance cost?
  • You have some dubious data types:
 employement_start_date | character varying
 employement_end_date   | character varying

Those look like they should be date. And so on.
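
Both points can be checked from SQL. The following is a sketch under a few assumptions: the statistics collector has been running long enough for idx_scan to be meaningful, and every non-empty value in the two columns is formatted so that a cast to date succeeds (otherwise the ALTER TABLE fails and changes nothing).

```sql
-- Indexes on lca_test, least-scanned first.
-- Candidates to drop show idx_scan = 0; note that lca_test_id
-- duplicates the primary key index lca_test_pkey on (id).
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'lca_test'
ORDER  BY idx_scan, pg_relation_size(indexrelid) DESC;

-- Convert the date-like columns; empty strings become NULL first.
-- This rewrites the table and holds an exclusive lock while it runs.
ALTER TABLE lca_test
  ALTER COLUMN employement_start_date TYPE date
        USING NULLIF(employement_start_date, '')::date,
  ALTER COLUMN employement_end_date   TYPE date
        USING NULLIF(employement_end_date, '')::date;
```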

Source: https://dba.stackexchange.com/questions/103821