Postgresql
空間索引能否幫助“範圍-按-限制”查詢
問這個問題,特別是針對 Postgres,因為它對 R-tree/空間索引有很好的支持。
我們有下表,其中包含單詞及其頻率的樹結構(嵌套集模型):
lexikon ------- _id integer PRIMARY KEY word text frequency integer lset integer UNIQUE KEY rset integer UNIQUE KEY
和查詢:
SELECT word FROM lexikon WHERE lset BETWEEN @Low AND @High ORDER BY frequency DESC LIMIT @N
我想覆蓋索引會很有用,但我覺得如果範圍內的值
(lset, frequency, word)
太多,它可能表現不佳。lset``(@High, @Low)
有時,一個簡單的索引
(frequency DESC)
也可能就足夠了,當使用該索引的搜尋早期產生@N
與範圍條件匹配的行時。但似乎性能很大程度上取決於參數值。
有沒有辦法讓它快速執行,無論範圍
(@Low, @High)
是寬還是窄,也不管高頻詞是否幸運地在(窄)選定的範圍內?R-tree/空間索引會有幫助嗎?
添加索引,重寫查詢,重新設計表,沒有限制。
您可以通過首先在頻率較高的行中搜尋來獲得更好的性能。這可以通過“顆粒化”頻率然後按程序逐步執行它們來實現,例如如下:
–testbed 和
lexikon
虛擬數據:begin; set role dba; create role stack; grant stack to dba; create schema authorization stack; set role stack; -- create table lexikon( _id serial, word text, frequency integer, lset integer, width_granule integer); -- insert into lexikon(word, frequency, lset) select word, (1000000/row_number() over(order by random()))::integer as frequency, lset from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z; -- update lexikon set width_granule=ln(frequency)::integer; -- create index on lexikon(width_granule, lset); create index on lexikon(lset); -- the second index is not used with the function but is added to make the timings 'fair'
granule
分析(主要用於資訊和調整):create table granule as select width_granule, count(*) as freq, min(frequency) as granule_start, max(frequency) as granule_end from lexikon group by width_granule; -- select * from granule order by 1; /* width_granule | freq | granule_start | granule_end ---------------+--------+---------------+------------- 0 | 500000 | 1 | 1 1 | 300000 | 2 | 4 2 | 123077 | 5 | 12 3 | 47512 | 13 | 33 4 | 18422 | 34 | 90 5 | 6908 | 91 | 244 6 | 2580 | 245 | 665 7 | 949 | 666 | 1808 8 | 349 | 1811 | 4901 9 | 129 | 4926 | 13333 10 | 47 | 13513 | 35714 11 | 17 | 37037 | 90909 12 | 7 | 100000 | 250000 13 | 2 | 333333 | 500000 14 | 1 | 1000000 | 1000000 */ alter table granule drop column freq; --
首先掃描高頻的功能:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer) returns setof lexikon language plpgsql set search_path to 'stack' as $$ declare m integer; n integer := 0; r record; begin for r in (select width_granule from granule order by width_granule desc) loop return query( select * from lexikon where width_granule=r.width_granule and lset>=p_lset_low and lset<=p_lset_high ); get diagnostics m = row_count; n = n+m; exit when n>=p_limit; end loop; end;$$;
結果(時間可能應該加一點鹽,但每個查詢執行兩次以對抗任何記憶體)
首先使用我們編寫的函式:
\timing on -- select * from f(20000, 30000, 5) order by frequency desc limit 5; /* _id | word | frequency | lset | width_granule -----+-----------+-----------+-------+--------------- 141 | word23237 | 7092 | 23237 | 9 246 | word25112 | 4065 | 25112 | 8 275 | word23825 | 3636 | 23825 | 8 409 | word28660 | 2444 | 28660 | 8 418 | word29923 | 2392 | 29923 | 8 Time: 80.452 ms */ select * from f(20000, 30000, 5) order by frequency desc limit 5; /* _id | word | frequency | lset | width_granule -----+-----------+-----------+-------+--------------- 141 | word23237 | 7092 | 23237 | 9 246 | word25112 | 4065 | 25112 | 8 275 | word23825 | 3636 | 23825 | 8 409 | word28660 | 2444 | 28660 | 8 418 | word29923 | 2392 | 29923 | 8 Time: 0.510 ms */
然後進行簡單的索引掃描:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5; /* _id | word | frequency | lset | width_granule -----+-----------+-----------+-------+--------------- 141 | word23237 | 7092 | 23237 | 9 246 | word25112 | 4065 | 25112 | 8 275 | word23825 | 3636 | 23825 | 8 409 | word28660 | 2444 | 28660 | 8 418 | word29923 | 2392 | 29923 | 8 Time: 218.897 ms */ select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5; /* _id | word | frequency | lset | width_granule -----+-----------+-----------+-------+--------------- 141 | word23237 | 7092 | 23237 | 9 246 | word25112 | 4065 | 25112 | 8 275 | word23825 | 3636 | 23825 | 8 409 | word28660 | 2444 | 28660 | 8 418 | word29923 | 2392 | 29923 | 8 Time: 51.250 ms */ \timing off -- rollback;
根據您的實際數據,您可能希望改變顆粒的數量以及用於將行放入其中的函式。頻率的實際分佈在這裡很關鍵,條款的預期值和所尋求的範圍
limit
大小也是如此。lset