Postgresql

Postgres:為什麼這個三元組索引會減慢正則表達式查詢?

  • June 11, 2019

keyvalues在 Postgres 中有一個 TEXT 列:

select * from test5 limit 5;

id |                      keyvalues
----+------------------------------------------------------
 1 | ^ first 1 | second 3
 2 | ^ first 1 | second 2 ^ first 2 | second 3
 3 | ^ first 1 | second 2 | second 3
 4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
 5 | ^ first 2 | second 3 ^ first 1 | second 3

我的查詢必須排除^匹配中間的字元,所以我使用正則表達式:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
  ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
              ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                    Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                    Rows Removed by Filter: 1666668
Planning Time: 0.068 ms
Execution Time: 7337.184 ms

查詢有效(零行匹配),但在 > 7 秒時太慢了。

我認為用三元組索引會有所幫助,但沒有運氣:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
                                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
  ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
        Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
        Rows Removed by Index Recheck: 5000005
        Heap Blocks: exact=47620
        ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
              Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Planning Time: 0.412 ms
Execution Time: 23734.722 ms

使用 trigram 索引的查詢要慢 3 倍!它仍然返回正確的結果(零行)。我希望 trigram 索引能夠立即找出second 0任何地方都沒有字元串,並且速度非常快。

(動機:我想避免規範化keyvalues另一個表中,所以我希望TEXT使用文本索引和正則表達式在單個欄位中編碼匹配邏輯。邏輯有效,但是太慢了,就像 JSONB 一樣。)

我希望 trigram 索引能夠立即找出second 0任何地方都沒有字元串

‘second’ 和 ‘0’ 是單獨的單詞,因此它無法檢測到它們的共同缺失。似乎它可以檢測到“0”的缺失,但來自“contrib/pg_trgm/trgm_regexp.c”的這條評論似乎是相關的:

* Note: Using again the example "foo bar", we will not consider the
* trigram "  b", though this trigram would be found by the trigram
* extraction code.  Since we will find " ba", it doesn't seem worth
* trying to hack the algorithm to generate the additional trigram.

由於 0 是模式字元串中的最後一個字元,因此也不會有“0a”形式的三元組,所以它只是錯過了這個機會。

即使不是因為這個限制,你的方法似乎也非常脆弱。

引用自:https://dba.stackexchange.com/questions/240122