Postgresql
cube
查詢排序的性能問題
我有一個在表中
cube
命名的列,該列儲存我轉換為密集格式的欄位中某些文本的矢量化(TF-IDF)表示。我在 上創建了 GIST 索引,但查詢性能有問題。此查詢大約需要 20 秒(在 32GB 機器上大約需要 5MM 行):embedding``documents
select id from documents where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25 order by embedding <-> cube('(0.08470847,...,0.06106149)') asc limit 25
相同的查詢沒有在
order by
毫秒內執行。我不確定如何提高訂購性能。
我對查詢進行了解釋分析,結果如下:
Limit (cost=0.54..323.63 rows=25 width=12) (actual time=18032.104..18704.827 rows=25 loops=1) -> Index Scan using ix_100 on documents (cost=0.54..22895274.16 rows=1771566 width=12) (actual time=18032.101..18704.797 rows=25 loops=1) Order By: (embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube) Filter: ((embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube) < '0.25'::double precision) Planning Time: 1.575 ms Execution Time: 18728.073 ms
我不知道如何從這裡開始,我希望避免在應用層中獲取結果後進行排序,理想情況下應該在數據庫中工作。
有任何想法嗎?
編輯:為帶有限制的查詢添加說明(分析,緩衝區)
詢問:
explain (analyze, buffers) select id from documents where embedding <-> cube('(0.08470847,..,0.06106149)') < 0.25 limit 10;
使用此輸出:
Limit (cost=0.00..7.73 rows=10 width=4) (actual time=0.036..0.076 rows=10 loops=1) Buffers: shared hit=5 -> Seq Scan on documents (cost=0.00..1370989.16 rows=1772915 width=4) (actual time=0.034..0.072 rows=10 loops=1) Filter: ((embedding <-> '(0.084708469999999994..., 0.061061490000000003)'::cube) < '0.25'::double precision) Rows Removed by Filter: 10 Buffers: shared hit=5 Planning Time: 0.107 ms Execution Time: 0.098 ms
編輯 -2 :
每次上次更新的修改查詢和結果返回到每個查詢約 20 秒
Limit (cost=0.54..323.56 rows=25 width=12) (actual time=727.488..21603.571 rows=25 loops=1) Buffers: shared read=1352076 -> Index Scan using ix_100 on documents (cost=0.54..22910761.65 rows=1773159 width=12) (actual time=727.485..21603.535 rows=25 loops=1) Order By: (embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube) Filter: ((embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube) < '0.25'::double precision) Buffers: shared read=1352076 Planning Time: 0.164 ms Execution Time: 21644.516 ms
排序對查詢返回的值起作用。
在這裡,您在嵌入列上有一個索引,但您正在對
embedding <-> cube('(0.08470847,...,0.06106149)')
未編入索引的結果進行排序。因此,首先借助子查詢檢索所需的結果,然後進行排序。
select id,EDistance from ( select id, embedding <-> cube('(0.08470847,...,0.06106149)') EDistance from documents where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25 limit 25 ) t order by EDistance ASC
謝謝!