Postgresql

PostgreSQL 相似度運算元——如何優化/索引?

  • August 5, 2021

我是 SQL 和 PostgreSQL 的新手。我試圖弄清楚如何讓這種類型的查詢利用索引進行三元運算。使用PostgreSQL 12.7 on x86_64-pc-linux-gnu

基本思想是我們獲取一個搜尋片語,將其拆分為不同的單詞,然後查看我們在搜尋就緒名稱的數據庫列中獲得了多少“相似性”匹配。與搜尋就緒名稱中找到的詞具有相似性的詞越多,得分就越高。我們還將整體搜尋片語與原始名稱相比較,作為“提升”權重的乘數。

dpl_base 表有 71,000 行,如下所示:

在此處輸入圖像描述

dpl_codes 表有 100 行,如下所示:

在此處輸入圖像描述

到目前為止,我都嘗試過:

create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops);    
create index trgm_idx_gin_dpl_base on dpl_base using gin (denied_name_searchable, denied_name_original gin_trgm_ops);

連同其他各種“標準”指數。無論有沒有索引,查詢 EXPLAIN ANALYZE 都會給出相同的精確計劃。所以指數似乎沒有什麼區別。查詢執行得非常快,通常不到 3 秒。也許我正在追逐一些我不需要的東西……我只是想學習如何正確索引這種設計的查詢:

SET pg_trgm.similarity_threshold = 0.35;
SELECT
/* create weighting value for the distinct-word hits within the SEARCHABLE column */
/* multiply by the similarity value for the original search phrase, against the ORIGINAL column */
(
 ('BAD' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int + 
 ('ACTOR' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int
) 
* (-(DPLB.DENIED_NAME_ORIGINAL <-> 'Bad Actor') + 1) AS WEIGHT,
/* add in the remaining columns from our two tables */
DPLB.DENIED_NAME_ORIGINAL, DPLC.DENIAL_REASON
FROM DPL_BASE DPLB 
INNER JOIN DPL_CODES DPLC ON DPLB.DENIAL_CODE = DPLC.DENIAL_CODE 
WHERE
/* must have at least one hit from our distinct words, in the SEARCHABLE column */
( 
 ('Bad' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int + 
 ('Actor' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int
) > 0 
ORDER BY WEIGHT DESC, DPLB.DENIED_NAME_ORIGINAL ASC;

這是查詢計劃的範例。**任何有關(a)正確索引方法和/或(b)**更好的查詢設計或優化的提示或建議- 將不勝感激。

|QUERY PLAN                                                                                                                                                                                                                                                                                                                                                                                       |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|Gather Merge  (cost=9448.31..11733.27 rows=19584 width=62) (actual time=525.228..529.633 rows=204 loops=1)                                                                                                                                                                                                                                                                                       |
|  Workers Planned: 2                                                                                                                                                                                                                                                                                                                                                                             |
|  Workers Launched: 2                                                                                                                                                                                                                                                                                                                                                                            |
|  ->  Sort  (cost=8448.29..8472.77 rows=9792 width=62) (actual time=519.954..520.104 rows=68 loops=3)                                                                                                                                                                                                                                                                                            |
|        Sort Key: (((((('YOUTH'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer))::double precision * ((- ((dplb.denied_name_original)::text <-> 'Youth Society'::text)) + '1'::double precision))) DESC, dplb.denied_name_original|
|        Sort Method: quicksort  Memory: 34kB                                                                                                                                                                                                                                                                                                                                                     |
|        Worker 0:  Sort Method: quicksort  Memory: 34kB                                                                                                                                                                                                                                                                                                                                          |
|        Worker 1:  Sort Method: quicksort  Memory: 34kB                                                                                                                                                                                                                                                                                                                                          |
|        ->  Hash Join  (cost=4.25..7799.21 rows=9792 width=62) (actual time=23.524..519.630 rows=68 loops=3)                                                                                                                                                                                                                                                                                     |
|              Hash Cond: (dplb.denial_code = dplc.denial_code)                                                                                                                                                                                                                                                                                                                                   |
|              ->  Parallel Seq Scan on dpl_base dplb  (cost=0.00..7229.60 rows=9792 width=70) (actual time=22.937..516.516 rows=68 loops=3)                                                                                                                                                                                                                                                      |
|                    Filter: (((('YOUTH'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer) > 0)                                                                                                                                                |
|                    Rows Removed by Filter: 23432                                                                                                                                                                                                                                                                                                                                                |
|              ->  Hash  (cost=3.00..3.00 rows=100 width=38) (actual time=0.401..0.407 rows=100 loops=3)                                                                                                                                                                                                                                                                                          |
|                    Buckets: 1024  Batches: 1  Memory Usage: 15kB                                                                                                                                                                                                                                                                                                                                |
|                    ->  Seq Scan on dpl_codes dplc  (cost=0.00..3.00 rows=100 width=38) (actual time=0.044..0.216 rows=100 loops=3)                                                                                                                                                                                                                                                              |
|Planning Time: 0.399 ms                                                                                                                                                                                                                                                                                                                                                                          |
|Execution Time: 530.078 ms    

將布爾值轉換為整數然後對它們進行算術運算肯定會搞砸索引。

(foo::int + bar::int) >0

應該與以下內容相同:

foo or bar

只有後者才有更好的機會被索引。還,

'cat' % ANY(string_to_array('hot dog',' '))

應該類似於,但不完全相同

'cat' <% 'hot dog'

但同樣至少有一些機會使用索引。或者,將您的表格分解為一個不同的表格,其中每個元素都有一行,string_to_array(upper((denied_name_searchable)::text), ' '::text)這樣您就不需要動態分解它。

最後,

create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops);

索引運算符不分佈在,. 您需要為每一列指定它。因此,該索引根本不能用於對“denied_name_searchable”進行三元組搜尋。此外,首先在索引中包含“denied_name_original”似乎沒有任何意義。

引用自:https://dba.stackexchange.com/questions/297647