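PostgreSQL
Postgres not using expression index for GROUP BY
I'm doing a simple GROUP BY on a single column that has a matching index, and it works fine:
create index on contacts (home_city);
select home_city, count(*) from contacts group by home_city limit 50;
This takes 150 ms on a table with 8 million rows, which is fast enough. The EXPLAIN ANALYZE output: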
Limit  (cost=1000.46..9256.24 rows=50 width=17) (actual time=14.201..76.630 rows=50 loops=1)
  ->  Finalize GroupAggregate  (cost=1000.46..207395.17 rows=1250 width=17) (actual time=14.200..62.196 rows=50 loops=1)
        Group Key: home_city
        ->  Gather Merge  (cost=1000.46..207370.17 rows=2500 width=17) (actual time=14.100..76.557 rows=97 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Partial GroupAggregate  (cost=0.43..206081.59 rows=1250 width=17) (actual time=0.897..56.260 rows=41 loops=3)
                    Group Key: home_city
                    ->  Parallel Index Only Scan using contacts_6_home_city_idx on contacts_6  (cost=0.43..188620.82 rows=3489654 width=9) (actual time=0.026..28.866 rows=105725 loops=3)
                          Heap Fetches: 0
Planning Time: 0.099 ms
Execution Time: 76.680 ms
Now I want to do the same thing with a case-insensitive index:
create index on contacts (lower(home_city));
select lower(home_city), count(*) from contacts group by lower(home_city) limit 50;
This takes 3-4 seconds. It appears to use the wrong index:
Limit  (cost=215869.40..215882.20 rows=50 width=40) (actual time=4253.233..4253.846 rows=50 loops=1)
  ->  Finalize GroupAggregate  (cost=215869.40..216175.66 rows=1197 width=40) (actual time=4253.232..4253.335 rows=50 loops=1)
        Group Key: (lower((home_city)::text))
        ->  Gather Merge  (cost=215869.40..216148.72 rows=2394 width=40) (actual time=4253.219..4253.796 rows=102 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Sort  (cost=214869.38..214872.37 rows=1197 width=40) (actual time=4244.566..4244.630 rows=710 loops=3)
                    Sort Key: (lower((home_city)::text))
                    Sort Method: quicksort  Memory: 88kB
                    Worker 0:  Sort Method: quicksort  Memory: 89kB
                    Worker 1:  Sort Method: quicksort  Memory: 120kB
                    ->  Partial HashAggregate  (cost=214793.22..214808.18 rows=1197 width=40) (actual time=4241.397..4241.658 rows=1021 loops=3)
                          Group Key: lower((home_city)::text)
                          ->  Parallel Index Only Scan using contacts_6_home_city_idx on contacts_6  (cost=0.43..197344.95 rows=3489654 width=32) (actual time=0.059..2918.355 rows=2778720 loops=3)
                                Heap Fetches: 0
Planning Time: 0.118 ms
Execution Time: 4253.906 ms
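When I drop the index on plain home_city, it does a full table scan instead, which takes 6 seconds.
How can I get it to use the correct index? What am I missing here?
(I'm using Postgres 11.5.)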
There is this code in check_index_only in src/backend/optimizer/path/indxpath.c:
/*
 * check_index_only
 *      Determine whether an index-only scan is possible for this index.
 */
static bool
check_index_only(RelOptInfo *rel, IndexOptInfo *index)
{
    [...]

    /*
     * Construct a bitmapset of columns that the index can return back in an
     * index-only scan.  If there are multiple index columns containing the
     * same attribute, all of them must be capable of returning the value,
     * since we might recheck operators on any of them.  (Potentially we could
     * be smarter about that, but it's such a weird situation that it doesn't
     * seem worth spending a lot of sweat on.)
     */
    for (i = 0; i < index->ncolumns; i++)
    {
        int         attno = index->indexkeys[i];

        /*
         * For the moment, we just ignore index expressions.  It might be nice
         * to do something with them, later.
         */
        if (attno == 0)
            continue;
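So, if I read this correctly, index expressions are ignored when the planner considers an index-only scan. Because the query references home_city and the expression index cannot return that column, the planner concludes that an index-only scan on it is not possible.
I don't think this is a matter of principle; it simply hasn't been implemented.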
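The PostgreSQL documentation on index-only scans describes a workaround for this planner limitation: store the underlying column in the expression index as an included column, so that every column the query references is available from the index. The following is only a sketch under that assumption. The index name is hypothetical, and whether the planner actually picks this index for the GROUP BY on this table has not been verified here, so check the resulting plan.

```sql
-- Hypothetical index name. INCLUDE is available from PostgreSQL 11 on.
-- The key column is the expression; home_city is stored as a payload column
-- so the planner sees every referenced column as returnable by the index.
CREATE INDEX contacts_lower_home_city_incl_idx
    ON contacts (lower(home_city)) INCLUDE (home_city);

-- Update the visibility map and statistics; index-only scans avoid
-- heap fetches only for pages marked all-visible.
VACUUM (ANALYZE) contacts;

-- Inspect which index the planner chooses now.
EXPLAIN (ANALYZE)
SELECT lower(home_city), count(*)
FROM contacts
GROUP BY lower(home_city)
LIMIT 50;
```

The included column makes the index larger, so this trades disk space and write cost for the chance of an index-only scan.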
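From PostgreSQL v12 on, you can use a case-insensitive ICU collation to get what you want:
CREATE COLLATION und_ci (
   LOCALE = 'und-u-ks-level2',
   PROVIDER = icu,
   DETERMINISTIC = FALSE
);
Then you have to change the column to use that collation:
ALTER TABLE contacts ALTER home_city TYPE text COLLATE und_ci;
The index on home_city can then be used for case-insensitive searches, and you can use it for an index-only scan as well.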
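With the column switched to the case-insensitive collation, the original query on the plain column should already give case-insensitive groups, so lower() is no longer needed. A minimal sketch, assuming PostgreSQL 12 or later, that the und_ci collation and the ALTER TABLE above have been applied, and that an ordinary index on home_city exists:

```sql
-- Values such as 'Berlin' and 'BERLIN' compare equal under und_ci,
-- so they are counted in the same group. Which spelling is shown for
-- a group is not specified.
EXPLAIN (ANALYZE)
SELECT home_city, count(*)
FROM contacts
GROUP BY home_city
LIMIT 50;
```

Because the query references only the plain column, the planner restriction on index expressions described above does not come into play.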