Postgresql
Postgres:為什麼這個三元組索引會減慢正則表達式查詢?
我
keyvalues
在 Postgres 中有一個 TEXT 列:select * from test5 limit 5; id | keyvalues ----+------------------------------------------------------ 1 | ^ first 1 | second 3 2 | ^ first 1 | second 2 ^ first 2 | second 3 3 | ^ first 1 | second 2 | second 3 4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2 5 | ^ first 2 | second 3 ^ first 1 | second 3
我的查詢必須排除
^
匹配中間的字元,所以我使用正則表達式:explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1) -> Gather (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3) -> Parallel Seq Scan on test5 (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3) Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text) Rows Removed by Filter: 1666668 Planning Time: 0.068 ms Execution Time: 7337.184 ms
查詢有效(零行匹配),但在 > 7 秒時太慢了。
我認為用三元組索引會有所幫助,但沒有運氣:
create extension if not exists pg_trgm; create index on test5 using gin (keyvalues gin_trgm_ops); explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1) -> Bitmap Heap Scan on test5 (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1) Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text) Rows Removed by Index Recheck: 5000005 Heap Blocks: exact=47620 -> Bitmap Index Scan on test5_keyvalues_idx (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1) Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text) Planning Time: 0.412 ms Execution Time: 23734.722 ms
使用 trigram 索引的查詢要慢 3 倍!它仍然返回正確的結果(零行)。我希望 trigram 索引能夠立即找出
second 0
任何地方都沒有字元串,並且速度非常快。(動機:我想避免規範化
keyvalues
到另一個表中,所以我希望TEXT
使用文本索引和正則表達式在單個欄位中編碼匹配邏輯。邏輯有效,但是太慢了,就像 JSONB 一樣。)
我希望 trigram 索引能夠立即找出
second 0
任何地方都沒有字元串‘second’ 和 ‘0’ 是單獨的單詞,因此它無法檢測到它們的共同缺失。似乎它可以檢測到“0”的缺失,但來自“contrib/pg_trgm/trgm_regexp.c”的這條評論似乎是相關的:
* Note: Using again the example "foo bar", we will not consider the * trigram " b", though this trigram would be found by the trigram * extraction code. Since we will find " ba", it doesn't seem worth * trying to hack the algorithm to generate the additional trigram.
由於 0 是模式字元串中的最後一個字元,因此也不會有“0a”形式的三元組,所以它只是錯過了這個機會。
即使不是因為這個限制,你的方法似乎也非常脆弱。