PostgreSQL 相似度運算元——如何優化/索引?
我是 SQL 和 PostgreSQL 的新手。我試圖弄清楚如何讓這種類型的查詢利用索引進行三元運算。使用
PostgreSQL 12.7 on x86_64-pc-linux-gnu
基本思想是我們獲取一個搜尋片語,將其拆分為不同的單詞,然後查看我們在搜尋就緒名稱的數據庫列中獲得了多少“相似性”匹配。與搜尋就緒名稱中找到的詞具有相似性的詞越多,得分就越高。我們還將整體搜尋片語與原始名稱相比較,作為“提升”權重的乘數。
dpl_base 表有 71,000 行,如下所示:
dpl_codes 表有 100 行,如下所示:
到目前為止,我都嘗試過:
create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops); create index trgm_idx_gin_dpl_base on dpl_base using gin (denied_name_searchable, denied_name_original gin_trgm_ops);
連同其他各種“標準”指數。無論有沒有索引,查詢 EXPLAIN ANALYZE 都會給出相同的精確計劃。所以指數似乎沒有什麼區別。查詢執行得非常快,通常不到 3 秒。也許我正在追逐一些我不需要的東西……我只是想學習如何正確索引這種設計的查詢:
SET pg_trgm.similarity_threshold = 0.35; SELECT /* create weighting value for the distinct-word hits within the SEARCHABLE column */ /* multiply by the similarity value for the original search phrase, against the ORIGINAL column */ ( ('BAD' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int + ('ACTOR' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int ) * (-(DPLB.DENIED_NAME_ORIGINAL <-> 'Bad Actor') + 1) AS WEIGHT, /* add in the remaining columns from our two tables */ DPLB.DENIED_NAME_ORIGINAL, DPLC.DENIAL_REASON FROM DPL_BASE DPLB INNER JOIN DPL_CODES DPLC ON DPLB.DENIAL_CODE = DPLC.DENIAL_CODE WHERE /* must have at least one hit from our distinct words, in the SEARCHABLE column */ ( ('Bad' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int + ('Actor' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int ) > 0 ORDER BY WEIGHT DESC, DPLB.DENIED_NAME_ORIGINAL ASC;
這是查詢計劃的範例。**任何有關(a)正確索引方法和/或(b)**更好的查詢設計或優化的提示或建議- 將不勝感激。
|QUERY PLAN | |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |Gather Merge (cost=9448.31..11733.27 rows=19584 width=62) (actual time=525.228..529.633 rows=204 loops=1) | | Workers Planned: 2 | | Workers Launched: 2 | | -> Sort (cost=8448.29..8472.77 rows=9792 width=62) (actual time=519.954..520.104 rows=68 loops=3) | | Sort Key: (((((('YOUTH'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer))::double precision * ((- ((dplb.denied_name_original)::text <-> 'Youth Society'::text)) + '1'::double precision))) DESC, dplb.denied_name_original| | Sort Method: quicksort Memory: 34kB | | Worker 0: Sort Method: quicksort Memory: 34kB | | Worker 1: Sort Method: quicksort Memory: 34kB | | -> Hash Join (cost=4.25..7799.21 rows=9792 width=62) (actual time=23.524..519.630 rows=68 loops=3) | | Hash Cond: (dplb.denial_code = dplc.denial_code) | | -> Parallel Seq Scan on dpl_base dplb (cost=0.00..7229.60 rows=9792 width=70) (actual time=22.937..516.516 rows=68 loops=3) | | Filter: (((('YOUTH'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer) > 0) | | Rows Removed by Filter: 23432 | | -> Hash (cost=3.00..3.00 rows=100 width=38) (actual time=0.401..0.407 rows=100 loops=3) | | Buckets: 1024 Batches: 1 Memory Usage: 15kB | | -> Seq Scan on dpl_codes dplc (cost=0.00..3.00 rows=100 width=38) (actual time=0.044..0.216 rows=100 loops=3) | |Planning Time: 0.399 ms | |Execution Time: 530.078 ms
將布爾值轉換為整數然後對它們進行算術運算肯定會搞砸索引。
(foo::int + bar::int) >0
應該與以下內容相同:
foo or bar
只有後者才有更好的機會被索引。還,
'cat' % ANY(string_to_array('hot dog',' '))
應該類似於,但不完全相同
'cat' <% 'hot dog'
但同樣至少有一些機會使用索引。或者,將您的表格分解為一個不同的表格,其中每個元素都有一行,
string_to_array(upper((denied_name_searchable)::text), ' '::text)
這樣您就不需要動態分解它。最後,
create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops);
索引運算符不分佈在
,
. 您需要為每一列指定它。因此,該索引根本不能用於對“denied_name_searchable”進行三元組搜尋。此外,首先在索引中包含“denied_name_original”似乎沒有任何意義。