Postgresql

通過 word_similarity 優化 postgres 排序

  • November 25, 2020

我在 Postgresql 12 中有一個包含 2M 行酒店名稱的表。我正在建構一個預先輸入的類型,以便使用者可以通過名稱快速找到酒店。我玩過不同的 Postgres 選項,包括 FTS、trigrams 和 levenshtein 距離。

pg_trgm 中的 word_similarity 給了我最好的結果,但是每當我需要按相似度分數排序時,事情就會變得太慢(沒有排序查詢以毫秒為單位完成):

SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name ORDER BY sml DESC LIMIT 5;

                                         name                                          |    sml
----------------------------------------------------------------------------------------+------------
Courtyard by Marriott New York Downtown Manhattan/World Trade Center Area              | 0.53846157
Fairfield Inn & Suites by Marriott New York Downtown Manhattan/World Trade Center Area | 0.53846157
Residence Inn by Marriott New York Downtown Manhattan/World Trade Center Area          | 0.53846157
AC Hotel by Marriott New York Times Square                                             |    0.53125
Courtyard by Marriott World Trade Center, Abu Dhabi                                    |  0.5263158
(5 rows)

Time: 9602.969 ms (00:09.603)
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS) SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name ORDER BY sml DESC LIMIT 5;
                                                                                         QUERY PLAN                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=65390.53..65391.11 rows=5 width=27) (actual time=9619.113..9625.482 rows=5 loops=1)
  Output: name, (word_similarity('trade center new york mariott'::text, (name)::text))
  Buffers: shared hit=1746167
  ->  Gather Merge  (cost=65390.53..65589.11 rows=1702 width=27) (actual time=9619.109..9625.474 rows=5 loops=1)
        Output: name, (word_similarity('trade center new york mariott'::text, (name)::text))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=1746167
        ->  Sort  (cost=64390.50..64392.63 rows=851 width=27) (actual time=9612.578..9612.580 rows=4 loops=3)
              Output: name, (word_similarity('trade center new york mariott'::text, (name)::text))
              Sort Key: (word_similarity('trade center new york mariott'::text, (hotels_hotel.name)::text)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              Buffers: shared hit=1746167
              Worker 0: actual time=9610.098..9610.100 rows=5 loops=1
                Buffers: shared hit=581861
              Worker 1: actual time=9609.314..9609.317 rows=5 loops=1
                Buffers: shared hit=579828
              ->  Parallel Index Only Scan using hotels_hotel_name_a9005e17 on public.hotels_hotel  (cost=0.43..64376.37 rows=851 width=27) (actual time=4.040..9606.166 rows=15070 loops=3)
                    Output: name, word_similarity('trade center new york mariott'::text, (name)::text)
                    Filter: ('trade center new york mariott'::text %> (hotels_hotel.name)::text)
                    Rows Removed by Filter: 666002
                    Heap Fetches: 2
                    Buffers: shared hit=1746113
                    Worker 0: actual time=0.281..9603.591 rows=14890 loops=1
                      Buffers: shared hit=581834
                    Worker 1: actual time=8.157..9602.811 rows=14678 loops=1
                      Buffers: shared hit=579801
Planning Time: 0.396 ms
Execution Time: 9625.576 ms
(31 rows)

Time: 9626.933 ms (00:09.627)

**有什麼辦法可以優化這個查詢嗎?**很高興建立一個巨大的索引,因為我認為這是必要的。

為什麼 word_similarity 最適合我的案例:

  • 允許拼寫錯誤(注意我在查詢中拼錯了 marriott)
  • 允許單詞的不同順序

編輯 1

沒有排序的解釋分析:

EXPLAIN ANALYZE SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name LIMIT 5;
                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..1264.75 rows=5 width=27) (actual time=0.391..15.981 rows=5 loops=1)
  ->  Seq Scan on hotels_hotel  (cost=0.00..516777.29 rows=2043 width=27) (actual time=0.384..15.969 rows=5 loops=1)
        Filter: ('trade center new york mariott'::text %> (name)::text)
        Rows Removed by Filter: 292
Planning Time: 0.285 ms
Execution Time: 16.041 ms
(6 rows)

Time: 17.800 ms

編輯 2

表中的相關指數:

CREATE INDEX autocomplete_gist ON public.hotels_hotel USING gist (name gist_trgm_ops)
CREATE INDEX autocomplete_name_idx ON public.hotels_hotel USING gin (name gin_trgm_ops)
CREATE INDEX hotels_hotel_name_a9005e17 ON public.hotels_hotel USING btree (name)
CREATE INDEX hotels_hotel_name_a9005e17_like ON public.hotels_hotel USING btree (name varchar_pattern_ops)
CREATE UNIQUE INDEX hotels_hotel_pkey ON public.hotels_hotel USING btree (id)
CREATE INDEX hotels_hotel_popularity_05985c85 ON public.hotels_hotel USING btree (popularity)

編輯 3

新查詢:

EXPLAIN (ANALYZE, BUFFERS) SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel ORDER BY name <->> 'trade center new york mariott' LIMIT 5;
                                                                      QUERY PLAN

Limit  (cost=0.41..5.65 rows=5 width=31) (actual time=3632.397..3633.554 rows=5 loops=1)
  Buffers: shared hit=1 read=32537
  I/O Timings: read=184.432
  ->  Index Scan using autocomplete_gist on hotels_hotel  (cost=0.41..2140836.75 rows=2043215 width=31) (actual time=3632.394..3633.548 rows=5 loops=1)
        Order By: ((name)::text <->> 'trade center new york mariott'::text)
        Buffers: shared hit=1 read=32537
        I/O Timings: read=184.432
Planning Time: 0.250 ms
Execution Time: 3679.847 ms

使用 GiST 索引:

CREATE INDEX ON hotels_hotel USING gist (name gist_trgm_ops);

並像這樣搜尋:

SELECT name,
      word_similarity('trade center new york mariott', name) AS sml
FROM hotels_hotel
ORDER BY name <->> 'trade center new york mariott'
LIMIT 5;

這將為您提供五個最接近的匹配項。

使用 GIN 索引而不是 GiST 可能會更好。

GIN 不像 GiST 那樣支持 KNN,因此您必須應用具有適當值的匹配操作pg_trgm.word_similarity_threshold,然後對倖存的行進行排序。

SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel 
WHERE name %> 'trade center new york mariott' 
ORDER BY name <->> 'trade center new york mariott' 
LIMIT 5;

規劃器並沒有非常巧妙地區分 GIN 和 GiST 索引的成本,因此您可能需要刪除 GiST 索引以使其使用 GIN。

引用自:https://dba.stackexchange.com/questions/280196