Postgres not using expression index for GROUP BY

December 10, 2019

I'm doing a simple GROUP BY on a single column that has a matching index, and it works fine:

create index on contacts (home_city);
select home_city, count(*) 
from contacts 
group by home_city 
limit 50;

It takes 150 ms on a table with 8 million rows, which is fast enough. The EXPLAIN ANALYZE output:

"Limit  (cost=1000.46..9256.24 rows=50 width=17) (actual time=14.201..76.630 rows=50 loops=1)"
"  ->  Finalize GroupAggregate  (cost=1000.46..207395.17 rows=1250 width=17) (actual time=14.200..62.196 rows=50 loops=1)"
"        Group Key: home_city"
"        ->  Gather Merge  (cost=1000.46..207370.17 rows=2500 width=17) (actual time=14.100..76.557 rows=97 loops=1)"
"              Workers Planned: 2"
"              Workers Launched: 2"
"              ->  Partial GroupAggregate  (cost=0.43..206081.59 rows=1250 width=17) (actual time=0.897..56.260 rows=41 loops=3)"
"                    Group Key: home_city"
"                    ->  Parallel Index Only Scan using contacts_6_home_city_idx on contacts_6  (cost=0.43..188620.82 rows=3489654 width=9) (actual time=0.026..28.866 rows=105725 loops=3)"
"                          Heap Fetches: 0"
"Planning Time: 0.099 ms"
"Execution Time: 76.680 ms"

Now I want to do the same thing with a case-insensitive index:

create index on contacts (lower(home_city));
select lower(home_city), count(*) 
from contacts 
group by lower(home_city) 
limit 50;

This takes 3-4 seconds. It seems to be using the wrong index:

"Limit  (cost=215869.40..215882.20 rows=50 width=40) (actual time=4253.233..4253.846 rows=50 loops=1)"
"  ->  Finalize GroupAggregate  (cost=215869.40..216175.66 rows=1197 width=40) (actual time=4253.232..4253.335 rows=50 loops=1)"
"        Group Key: (lower((home_city)::text))"
"        ->  Gather Merge  (cost=215869.40..216148.72 rows=2394 width=40) (actual time=4253.219..4253.796 rows=102 loops=1)"
"              Workers Planned: 2"
"              Workers Launched: 2"
"              ->  Sort  (cost=214869.38..214872.37 rows=1197 width=40) (actual time=4244.566..4244.630 rows=710 loops=3)"
"                    Sort Key: (lower((home_city)::text))"
"                    Sort Method: quicksort  Memory: 88kB"
"                    Worker 0:  Sort Method: quicksort  Memory: 89kB"
"                    Worker 1:  Sort Method: quicksort  Memory: 120kB"
"                    ->  Partial HashAggregate  (cost=214793.22..214808.18 rows=1197 width=40) (actual time=4241.397..4241.658 rows=1021 loops=3)"
"                          Group Key: lower((home_city)::text)"
"                          ->  Parallel Index Only Scan using contacts_6_home_city_idx on contacts_6  (cost=0.43..197344.95 rows=3489654 width=32) (actual time=0.059..2918.355 rows=2778720 loops=3)"
"                                Heap Fetches: 0"
"Planning Time: 0.118 ms"
"Execution Time: 4253.906 ms"

When I drop the index on plain home_city, it does a full table scan, which takes 6 seconds.

How can I get it to use the correct index? What am I missing here?

(I'm using Postgres 11.5.)

The function check_index_only in src/backend/optimizer/path/indxpath.c contains this code:
/*
* check_index_only
*      Determine whether an index-only scan is possible for this index.
*/
static bool
check_index_only(RelOptInfo *rel, IndexOptInfo *index)
{

[...]

   /*
    * Construct a bitmapset of columns that the index can return back in an
    * index-only scan.  If there are multiple index columns containing the
    * same attribute, all of them must be capable of returning the value,
    * since we might recheck operators on any of them.  (Potentially we could
    * be smarter about that, but it's such a weird situation that it doesn't
    * seem worth spending a lot of sweat on.)
    */
   for (i = 0; i < index->ncolumns; i++)
   {
       int         attno = index->indexkeys[i];

       /*
        * For the moment, we just ignore index expressions.  It might be nice
        * to do something with them, later.
        */
       if (attno == 0)
           continue;
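So, if I read that correctly, expressions are ignored when an index-only scan is considered.
I think this is not a matter of principle; it simply hasn't been implemented.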
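The PostgreSQL documentation on index-only scans describes a workaround for this planner limitation: add the underlying column to the expression index, so that every column the query references is available from the index. Below is a minimal sketch, assuming the contacts table from the question. The index name contacts_lower_home_city_idx is made up for illustration, and whether the planner actually picks this index for the GROUP BY on 11.5 has not been verified here:

```sql
-- Expression index that also stores the raw column.
-- INCLUDE requires PostgreSQL 11 or later; the index name is hypothetical.
CREATE INDEX contacts_lower_home_city_idx
    ON contacts (lower(home_city)) INCLUDE (home_city);

-- Index-only scans depend on the visibility map, so vacuum (and analyze) first.
VACUUM (ANALYZE) contacts;

-- Inspect the plan to see whether the new index is scanned.
EXPLAIN (ANALYZE)
SELECT lower(home_city), count(*)
FROM contacts
GROUP BY lower(home_city)
LIMIT 50;
```

The trade-off is a larger index, since each entry stores both the lowercased value and the original one.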
From PostgreSQL v12 on, you can use a case-insensitive ICU collation to get what you want:
CREATE COLLATION und_ci (
  LOCALE = 'und-u-ks-level2',
  PROVIDER = icu,
  DETERMINISTIC = FALSE
);
Then you have to change the column to use that collation:
ALTER TABLE contacts ALTER home_city TYPE text COLLATE und_ci;
A regular index on the column can then be used for case-insensitive searches, and you can use it for index-only scans as well.
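Once the column uses the nondeterministic collation, the original query should no longer need a lower() call, because equality on home_city itself ignores case. This is a sketch under the assumption that the und_ci collation and the ALTER TABLE above have been applied on PostgreSQL 12 or later. Note that each group is reported with one of its original spellings, not necessarily the lowercase form:

```sql
-- Grouping compares home_city using und_ci, so 'Berlin' and 'BERLIN'
-- land in the same group. No expression is involved, so the plain
-- index on home_city is a candidate for an index-only scan.
EXPLAIN (ANALYZE)
SELECT home_city, count(*)
FROM contacts
GROUP BY home_city
LIMIT 50;
```

If a canonical lowercase label is needed in the output, something like min(lower(home_city)) could be selected per group, at the cost of evaluating the expression for every row.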

Source: https://dba.stackexchange.com/questions/255353
