Postgres: Why does this trigram index slow down a regex query?

June 11, 2019

I have a TEXT column, keyvalues, in Postgres:

select * from test5 limit 5;

id |                      keyvalues
----+------------------------------------------------------
 1 | ^ first 1 | second 3
 2 | ^ first 1 | second 2 ^ first 2 | second 3
 3 | ^ first 1 | second 2 | second 3
 4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
 5 | ^ first 2 | second 3 ^ first 1 | second 3

My query must not let a ^ character appear in the middle of a match, so I use a regular expression:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
  ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
              ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                    Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                    Rows Removed by Filter: 1666668
Planning Time: 0.068 ms
Execution Time: 7337.184 ms

The query works (zero rows match), but at more than 7 seconds it is too slow.

I thought a trigram index would help, but no luck:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
                                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
  ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
        Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
        Rows Removed by Index Recheck: 5000005
        Heap Blocks: exact=47620
        ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
              Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Planning Time: 0.412 ms
Execution Time: 23734.722 ms

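With the trigram index the query is about 3 times slower! It still returns the correct result (zero rows). I was hoping the trigram index would figure out immediately that the string second 0 does not occur anywhere, and be very fast.

(Motivation: I want to avoid normalizing keyvalues into another table, so I am hoping to encode the matching logic in a single TEXT field using a text index and a regex. The logic works, but it is too slow, as it is with JSONB.)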

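"I was hoping the trigram index would figure out immediately that the string second 0 does not occur anywhere"
'second' and '0' are separate words, so the index cannot detect their joint absence. It seems like it could detect the absence of '0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems relevant: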
* Note: Using again the example "foo bar", we will not consider the
* trigram "  b", though this trigram would be found by the trigram
* extraction code.  Since we will find " ba", it doesn't seem worth
* trying to hack the algorithm to generate the additional trigram.
Since 0 is the last character in the pattern string, there will not be a trigram of the form " 0a" either, so it just misses that opportunity.
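To get a feel for which trigrams are involved, pg_trgm's show_trgm function lists the trigrams extracted from a plain string. This is only a rough illustration: show_trgm applies the ordinary text extraction, which pads each word with two leading spaces and one trailing space, while the regex extraction in trgm_regexp.c is more conservative, as the comment above says.

```sql
-- Requires the pg_trgm extension (already created in the question).
-- Lists the trigrams pg_trgm extracts from the literal text 'second 0'.
-- Note: this is the plain-text extraction, not the regex extraction.
SELECT show_trgm('second 0');
```

In that output, the trigrams containing 0 come only from the padding around the one-character word. The regex code does not generate the variant with two leading spaces, and presumably it cannot assume a trailing space either, because the pattern ends at 0 rather than at a word boundary.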
Even if it weren't for this limitation, your approach seems very fragile.
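As an experiment rather than a recommendation, one could check whether a redundant ILIKE condition gives the index something more selective to work with. The sketch below assumes the table and index from the question. The extra condition is an addition for illustration: any row that matches the regex must contain second 0, so the result is unchanged. Whether the planner uses the index for it, and whether it is actually faster on this data, is untested here.

```sql
-- Hypothetical experiment: add a redundant ILIKE pre-filter.
-- Every row matching the regex necessarily contains 'second 0'
-- (case-insensitively), so the count is the same as before.
EXPLAIN ANALYZE
SELECT count(*)
FROM test5
WHERE keyvalues ILIKE '%second 0%'
  AND keyvalues ~* '\^ first 1[^\^]+second 0';
```

Comparing the "rows" figure on the Bitmap Index Scan node with the 5000005 reported above would show whether the index filtered anything before the recheck.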

Source: https://dba.stackexchange.com/questions/240122
