Postgresql
通過 word_similarity 優化 postgres 排序
我在 Postgresql 12 中有一個包含 2M 行酒店名稱的表。我正在建構一個預先輸入的類型,以便使用者可以通過名稱快速找到酒店。我玩過不同的 Postgres 選項,包括 FTS、trigrams 和 levenshtein 距離。
pg_trgm 中的 word_similarity 給了我最好的結果,但是每當我需要按相似度分數排序時,事情就會變得太慢(沒有排序查詢以毫秒為單位完成):
SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name ORDER BY sml DESC LIMIT 5; name | sml ----------------------------------------------------------------------------------------+------------ Courtyard by Marriott New York Downtown Manhattan/World Trade Center Area | 0.53846157 Fairfield Inn & Suites by Marriott New York Downtown Manhattan/World Trade Center Area | 0.53846157 Residence Inn by Marriott New York Downtown Manhattan/World Trade Center Area | 0.53846157 AC Hotel by Marriott New York Times Square | 0.53125 Courtyard by Marriott World Trade Center, Abu Dhabi | 0.5263158 (5 rows) Time: 9602.969 ms (00:09.603)
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS) SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name ORDER BY sml DESC LIMIT 5; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=65390.53..65391.11 rows=5 width=27) (actual time=9619.113..9625.482 rows=5 loops=1) Output: name, (word_similarity('trade center new york mariott'::text, (name)::text)) Buffers: shared hit=1746167 -> Gather Merge (cost=65390.53..65589.11 rows=1702 width=27) (actual time=9619.109..9625.474 rows=5 loops=1) Output: name, (word_similarity('trade center new york mariott'::text, (name)::text)) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=1746167 -> Sort (cost=64390.50..64392.63 rows=851 width=27) (actual time=9612.578..9612.580 rows=4 loops=3) Output: name, (word_similarity('trade center new york mariott'::text, (name)::text)) Sort Key: (word_similarity('trade center new york mariott'::text, (hotels_hotel.name)::text)) DESC Sort Method: top-N heapsort Memory: 25kB Worker 0: Sort Method: top-N heapsort Memory: 25kB Worker 1: Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=1746167 Worker 0: actual time=9610.098..9610.100 rows=5 loops=1 Buffers: shared hit=581861 Worker 1: actual time=9609.314..9609.317 rows=5 loops=1 Buffers: shared hit=579828 -> Parallel Index Only Scan using hotels_hotel_name_a9005e17 on public.hotels_hotel (cost=0.43..64376.37 rows=851 width=27) (actual time=4.040..9606.166 rows=15070 loops=3) Output: name, word_similarity('trade center new york mariott'::text, (name)::text) Filter: ('trade center new york mariott'::text %> (hotels_hotel.name)::text) Rows Removed by Filter: 666002 Heap Fetches: 2 Buffers: shared hit=1746113 Worker 0: actual time=0.281..9603.591 rows=14890 loops=1 Buffers: shared hit=581834 Worker 1: actual time=8.157..9602.811 rows=14678 loops=1 Buffers: shared hit=579801 Planning Time: 0.396 ms Execution Time: 9625.576 ms (31 rows) Time: 9626.933 ms (00:09.627)
**有什麼辦法可以優化這個查詢嗎?**很高興建立一個巨大的索引,因為我認為這是必要的。
為什麼 word_similarity 最適合我的案例:
- 允許拼寫錯誤(注意我在查詢中拼錯了 marriott)
- 允許單詞的不同順序
編輯 1
沒有排序的解釋分析:
EXPLAIN ANALYZE SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name LIMIT 5; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..1264.75 rows=5 width=27) (actual time=0.391..15.981 rows=5 loops=1) -> Seq Scan on hotels_hotel (cost=0.00..516777.29 rows=2043 width=27) (actual time=0.384..15.969 rows=5 loops=1) Filter: ('trade center new york mariott'::text %> (name)::text) Rows Removed by Filter: 292 Planning Time: 0.285 ms Execution Time: 16.041 ms (6 rows) Time: 17.800 ms
編輯 2
表中的相關指數:
CREATE INDEX autocomplete_gist ON public.hotels_hotel USING gist (name gist_trgm_ops) CREATE INDEX autocomplete_name_idx ON public.hotels_hotel USING gin (name gin_trgm_ops) CREATE INDEX hotels_hotel_name_a9005e17 ON public.hotels_hotel USING btree (name) CREATE INDEX hotels_hotel_name_a9005e17_like ON public.hotels_hotel USING btree (name varchar_pattern_ops) CREATE UNIQUE INDEX hotels_hotel_pkey ON public.hotels_hotel USING btree (id) CREATE INDEX hotels_hotel_popularity_05985c85 ON public.hotels_hotel USING btree (popularity)
編輯 3
新查詢:
EXPLAIN (ANALYZE, BUFFERS) SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel ORDER BY name <->> 'trade center new york mariott' LIMIT 5; QUERY PLAN Limit (cost=0.41..5.65 rows=5 width=31) (actual time=3632.397..3633.554 rows=5 loops=1) Buffers: shared hit=1 read=32537 I/O Timings: read=184.432 -> Index Scan using autocomplete_gist on hotels_hotel (cost=0.41..2140836.75 rows=2043215 width=31) (actual time=3632.394..3633.548 rows=5 loops=1) Order By: ((name)::text <->> 'trade center new york mariott'::text) Buffers: shared hit=1 read=32537 I/O Timings: read=184.432 Planning Time: 0.250 ms Execution Time: 3679.847 ms
使用 GiST 索引:
CREATE INDEX ON hotels_hotel USING gist (name gist_trgm_ops);
並像這樣搜尋:
SELECT name, word_similarity('trade center new york mariott', name) AS sml FROM hotels_hotel ORDER BY name <->> 'trade center new york mariott' LIMIT 5;
這將為您提供五個最接近的匹配項。
使用 GIN 索引而不是 GiST 可能會更好。
GIN 不像 GiST 那樣支持 KNN,因此您必須應用具有適當值的匹配操作
pg_trgm.word_similarity_threshold
,然後對倖存的行進行排序。SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel WHERE name %> 'trade center new york mariott' ORDER BY name <->> 'trade center new york mariott' LIMIT 5;
規劃器並沒有非常巧妙地區分 GIN 和 GiST 索引的成本,因此您可能需要刪除 GiST 索引以使其使用 GIN。