Postgresql

cube 查詢排序的性能問題

  • July 4, 2019

我有一個在表中cube命名的列,該列儲存我轉換為密集格式的欄位中某些文本的矢量化(TF-IDF)表示。我在 上創建了 GIST 索引,但查詢性能有問題。此查詢大約需要 20 秒(在 32GB 機器上大約需要 5MM 行):embedding``documents

select id 
from documents 
where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25 
order by embedding <-> cube('(0.08470847,...,0.06106149)') asc 
limit 25

相同的查詢沒有order by毫秒內執行。

我不確定如何提高訂購性能。

我對查詢進行了解釋分析,結果如下:

Limit  (cost=0.54..323.63 rows=25 width=12) (actual time=18032.104..18704.827 rows=25 loops=1)
 ->  Index Scan using ix_100 on documents  (cost=0.54..22895274.16 rows=1771566 width=12) (actual time=18032.101..18704.797 rows=25 loops=1)
       Order By: (embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube)
       Filter: ((embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube) < '0.25'::double precision)
Planning Time: 1.575 ms
Execution Time: 18728.073 ms

我不知道如何從這裡開始,我希望避免在應用層中獲取結果後進行排序,理想情況下應該在數據庫中工作。

有任何想法嗎?

編輯:為帶有限制的查詢添加說明(分析,緩衝區)

詢問:

explain (analyze, buffers) 
select id 
from documents 
where embedding <-> cube('(0.08470847,..,0.06106149)') < 0.25 
limit 10;

使用此輸出:

Limit  (cost=0.00..7.73 rows=10 width=4) (actual time=0.036..0.076 rows=10 loops=1)
 Buffers: shared hit=5
 ->  Seq Scan on documents  (cost=0.00..1370989.16 rows=1772915 width=4) (actual time=0.034..0.072 rows=10 loops=1)
       Filter: ((embedding <-> '(0.084708469999999994..., 0.061061490000000003)'::cube) < '0.25'::double precision)
       Rows Removed by Filter: 10
       Buffers: shared hit=5
Planning Time: 0.107 ms
Execution Time: 0.098 ms 

編輯 -2 :

每次上次更新的修改查詢和結果返回到每個查詢約 20 秒

Limit  (cost=0.54..323.56 rows=25 width=12) (actual time=727.488..21603.571 rows=25 loops=1)
 Buffers: shared read=1352076
 ->  Index Scan using ix_100 on documents  (cost=0.54..22910761.65 rows=1773159 width=12) (actual time=727.485..21603.535 rows=25 loops=1)
       Order By: (embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube)
       Filter: ((embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube) < '0.25'::double precision)
       Buffers: shared read=1352076
Planning Time: 0.164 ms
Execution Time: 21644.516 ms

排序對查詢返回的值起作用。

在這裡,您在嵌入列上有一個索引,但您正在對embedding <-> cube('(0.08470847,...,0.06106149)')未編入索引的結果進行排序。

因此,首先借助子查詢檢索所需的結果,然後進行排序。

select id,EDistance
from
(
select id, embedding <-> cube('(0.08470847,...,0.06106149)') EDistance 
from documents 
where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25
limit 25
) t
order by EDistance ASC

謝謝!

引用自:https://dba.stackexchange.com/questions/241983