Postgresql

空間索引能否幫助“範圍-按-限制”查詢

  • January 21, 2019

問這個問題,特別是針對 Postgres,因為它對 R-tree/空間索引有很好的支持。

我們有下表,其中包含單詞及其頻率的樹結構(嵌套集模型):

lexikon
-------
_id   integer  PRIMARY KEY
word  text
frequency integer
lset  integer  UNIQUE KEY
rset  integer  UNIQUE KEY

和查詢:

SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N

我想覆蓋索引會很有用,但我覺得如果範圍內的值(lset, frequency, word)太多,它可能表現不佳。lset``(@High, @Low)

有時,一個簡單的索引(frequency DESC)也可能就足夠了,當使用該索引的搜尋早期產生@N與範圍條件匹配的行時。

但似乎性能很大程度上取決於參數值。

有沒有辦法讓它快速執行,無論範圍(@Low, @High)是寬還是窄,也不管高頻詞是否幸運地在(窄)選定的範圍內?

R-tree/空間索引會有幫助嗎?

添加索引,重寫查詢,重新設計表,沒有限制。

您可以通過首先在頻率較高的行中搜尋來獲得更好的性能。這可以通過“顆粒化”頻率然後按程序逐步執行它們來實現,例如如下:

–testbed 和lexikon虛擬數據:

begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial, 
                     word text, 
                     frequency integer, 
                     lset integer, 
                     width_granule integer);
--
insert into lexikon(word, frequency, lset) 
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'

granule分析(主要用於資訊和調整):

create table granule as 
select width_granule, count(*) as freq, 
      min(frequency) as granule_start, max(frequency) as granule_end 
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
width_granule |  freq  | granule_start | granule_end
---------------+--------+---------------+-------------
            0 | 500000 |             1 |           1
            1 | 300000 |             2 |           4
            2 | 123077 |             5 |          12
            3 |  47512 |            13 |          33
            4 |  18422 |            34 |          90
            5 |   6908 |            91 |         244
            6 |   2580 |           245 |         665
            7 |    949 |           666 |        1808
            8 |    349 |          1811 |        4901
            9 |    129 |          4926 |       13333
           10 |     47 |         13513 |       35714
           11 |     17 |         37037 |       90909
           12 |      7 |        100000 |      250000
           13 |      2 |        333333 |      500000
           14 |      1 |       1000000 |     1000000
*/
alter table granule drop column freq;
--

首先掃描高頻的功能:

create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
      returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
 m integer;
 n integer := 0;
 r record;
begin 
 for r in (select width_granule from granule order by width_granule desc) loop
   return query( select * 
                 from lexikon 
                 where width_granule=r.width_granule 
                       and lset>=p_lset_low and lset<=p_lset_high );
   get diagnostics m = row_count;
   n = n+m;
   exit when n>=p_limit;
 end loop;
end;$$;

結果(時間可能應該加一點鹽,但每個查詢執行兩次以對抗任何記憶體)

首先使用我們編寫的函式:

\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 |      7092 | 23237 |             9
246 | word25112 |      4065 | 25112 |             8
275 | word23825 |      3636 | 23825 |             8
409 | word28660 |      2444 | 28660 |             8
418 | word29923 |      2392 | 29923 |             8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 |      7092 | 23237 |             9
246 | word25112 |      4065 | 25112 |             8
275 | word23825 |      3636 | 23825 |             8
409 | word28660 |      2444 | 28660 |             8
418 | word29923 |      2392 | 29923 |             8
Time: 0.510 ms
*/

然後進行簡單的索引掃描:

select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 |      7092 | 23237 |             9
246 | word25112 |      4065 | 25112 |             8
275 | word23825 |      3636 | 23825 |             8
409 | word28660 |      2444 | 28660 |             8
418 | word29923 |      2392 | 29923 |             8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 |      7092 | 23237 |             9
246 | word25112 |      4065 | 25112 |             8
275 | word23825 |      3636 | 23825 |             8
409 | word28660 |      2444 | 28660 |             8
418 | word29923 |      2392 | 29923 |             8
Time: 51.250 ms
*/
\timing off
--
rollback;

根據您的實際數據,您可能希望改變顆粒的數量以及用於將行放入其中的函式。頻率的實際分佈在這裡很關鍵,條款的預期值和所尋求的範圍limit大小也是如此。lset

引用自:https://dba.stackexchange.com/questions/18300