Performance problem with ordering on `cube` queries

July 4, 2019

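I have a column named `embedding` of type `cube` in the `documents` table. It stores a vectorized (TF-IDF) representation of some text in a field, which I converted to dense format. I created a GiST index on `embedding`, but query performance is a problem. This query takes about 20 seconds (on about 5MM rows, on a 32GB machine):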

select id
from documents
where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25
order by embedding <-> cube('(0.08470847,...,0.06106149)') asc
limit 25

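The same query without the order by runs in milliseconds.

I am not sure how to improve the performance of the ordering.

I ran explain analyze on the query, with the following result: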

Limit  (cost=0.54..323.63 rows=25 width=12) (actual time=18032.104..18704.827 rows=25 loops=1)
 ->  Index Scan using ix_100 on documents  (cost=0.54..22895274.16 rows=1771566 width=12) (actual time=18032.101..18704.797 rows=25 loops=1)
       Order By: (embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube)
       Filter: ((embedding <-> '(0.084708469999999994,... , 0.061061490000000003)'::cube) < '0.25'::double precision)
Planning Time: 1.575 ms
Execution Time: 18728.073 ms

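I don't know where to go from here. I would like to avoid sorting in the application layer after fetching the results; ideally this should work in the database.

Any ideas?

Edit: added explain (analyze, buffers) for the query with limit

Query: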

explain (analyze, buffers)
select id
from documents
where embedding <-> cube('(0.08470847,..,0.06106149)') < 0.25
limit 10;

With this output:

Limit  (cost=0.00..7.73 rows=10 width=4) (actual time=0.036..0.076 rows=10 loops=1)
 Buffers: shared hit=5
 ->  Seq Scan on documents  (cost=0.00..1370989.16 rows=1772915 width=4) (actual time=0.034..0.072 rows=10 loops=1)
       Filter: ((embedding <-> '(0.084708469999999994..., 0.061061490000000003)'::cube) < '0.25'::double precision)
       Rows Removed by Filter: 10
       Buffers: shared hit=5
Planning Time: 0.107 ms
Execution Time: 0.098 ms

Edit 2:

As per the last update, with the modified query the results are back to about 20 seconds per query:

Limit  (cost=0.54..323.56 rows=25 width=12) (actual time=727.488..21603.571 rows=25 loops=1)
 Buffers: shared read=1352076
 ->  Index Scan using ix_100 on documents  (cost=0.54..22910761.65 rows=1773159 width=12) (actual time=727.485..21603.535 rows=25 loops=1)
       Order By: (embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube)
       Filter: ((embedding <-> '(0.0665496899999999947, ... 0.063358020000000001)'::cube) < '0.25'::double precision)
       Buffers: shared read=1352076
Planning Time: 0.164 ms
Execution Time: 21644.516 ms
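
To put the `Buffers: shared read=1352076` line in perspective, the buffer count can be converted to a data volume. The sketch below assumes PostgreSQL's default block size of 8 kB (the actual `block_size` of this server is not stated above); the function and variable names are illustrative only.

```python
# Rough size of the data read by the ordered query.
# Assumption: default PostgreSQL block size of 8192 bytes.
BLOCK_SIZE_BYTES = 8192
shared_read_buffers = 1_352_076  # from "Buffers: shared read=1352076"


def buffers_to_gib(buffers: int, block_size: int = BLOCK_SIZE_BYTES) -> float:
    """Convert a buffer count from EXPLAIN (BUFFERS) to GiB."""
    return buffers * block_size / 2**30


read_gib = buffers_to_gib(shared_read_buffers)
print(f"{read_gib:.1f} GiB read from outside shared_buffers")
```

Under that assumption the ordered query reads roughly 10 GiB to return 25 rows, while the unordered query with limit 10 touched only 5 buffers.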

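Sorting works on the values returned by the query.
Here you have an index on the embedding column, but you are sorting on the result of embedding <-> cube('(0.08470847,...,0.06106149)'), which is not indexed.
So first retrieve the required rows with a subquery, and then sort them.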
select id, EDistance
from
(
select id, embedding <-> cube('(0.08470847,...,0.06106149)') as EDistance
from documents
where embedding <-> cube('(0.08470847,...,0.06106149)') < 0.25
limit 25
) t
order by EDistance asc
Thanks!
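
One thing to check before adopting this rewrite: the inner `limit 25` runs before any ordering, so the subquery returns whichever 25 rows under the threshold the scan happens to find first, and the outer `order by` only sorts those. The query in the question instead asks for the 25 nearest rows. A small sketch in Python, using made-up distances rather than real `cube` values, illustrates the difference:

```python
# Hypothetical distances, in the order a scan might return the rows.
distances = [0.24, 0.10, 0.30, 0.22, 0.05, 0.18, 0.01]
THRESHOLD = 0.25
LIMIT = 3

# Rows that pass the WHERE clause, still in scan order.
matching = [d for d in distances if d < THRESHOLD]

# Subquery form: LIMIT first, then ORDER BY on the limited rows.
limit_then_sort = sorted(matching[:LIMIT])

# Original form: ORDER BY first, then LIMIT (the true nearest rows).
sort_then_limit = sorted(matching)[:LIMIT]

print(limit_then_sort)  # [0.1, 0.22, 0.24]
print(sort_then_limit)  # [0.01, 0.05, 0.1]
```

If any 25 rows within the threshold are acceptable, the subquery form is enough; if the 25 closest rows are required, the two queries are not equivalent.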

Source: https://dba.stackexchange.com/questions/241983
