索引可以通過 GROUP BY/聚合對整個表（無選擇性）加速查詢嗎？

January 10, 2017

假設我有一個包含 3 列 a、b 和 c 的表。
我可以通過使用索引來加快看起來像這樣的查詢嗎？
SELECT a,b,SUM(c)  # or AVG(c)
FROM table
GROUP BY a,b
ORDER BY a,b
;
如果上述問題是肯定的，您推薦哪種類型的索引以及****它如何工作？

不見得。GROUP BY並且ORDER BY通常需要排序。但是，在這種情況下，HashAggregate使用了 a（可能是因為我們正在處理整個表）。

CREATE TABLE foo AS
SELECT x % 5 AS a, x % 10 AS b, x AS c
FROM generate_series(1,1e6) AS x;

使用 HashAggregate 計劃，

# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Sort  (cost=23668.04..23668.16 rows=50 width=14) (actual time=611.607..611.608 rows=10 loops=1)
  Sort Key: a, b
  Sort Method: quicksort  Memory: 25kB
  -&gt;  HashAggregate  (cost=23666.00..23666.62 rows=50 width=14) (actual time=611.589..611.593 rows=10 loops=1)
        Group Key: a, b
        -&gt;  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.012..71.157 rows=1000000 loops=1)
Planning time: 0.168 ms
Execution time: 611.665 ms

所以我們添加一個索引…

CREATE INDEX idx ON foo (a,b);
VACUUM FULL ANALYZE foo;

…仍然顯示相同的查詢計劃。所以我們禁用 HashAggregate

SET enable_hashagg = false;

然後再試一次..

# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                           QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=0.42..61292.04 rows=50 width=14) (actual time=108.149..655.536 rows=10 loops=1)
  Group Key: a, b
  -&gt;  Index Scan using idx on foo  (cost=0.42..53791.41 rows=1000000 width=14) (actual time=0.066..272.299 rows=1000000 loops=1)
Planning time: 0.121 ms
Execution time: 655.594 ms
(5 rows)

而且，與之前的 611ms 相比，它需要更多的時間 655ms。

需要更快？

如果這還不夠快（並且 611 毫秒來分組和匯總一百萬行也不錯）。然後，MATERIALIZED VIEW如果您的工作量允許（如果查詢很熱和/或不經常更新），您可以使用 a，

CREATE MATERIALIZED VIEW foo2 AS
SELECT a,b,sum(c)
FROM foo
GROUP BY a,b
ORDER BY a,b;

現在，當TABLE foo2. 然後只需REFRESH MATERIALIZED VIEW foo2;刷新視圖即可。或者，您可以創建一個觸發器更新另一個表並使用觸發器對其進行更新。

在實際聚合的列上。

有一些例外，但sum()不是其中之一。大多數聚合不使用索引，因為它們通常不需要索引。例外是特定於訂單的聚合（如min()和max()）。因此，例如，如果在我們為(a,b)您創建索引之後執行 a sum(a)，

# EXPLAIN ANALYZE SELECT sum(a) FROM foo;
                                                    QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=18666.00..18666.01 rows=1 width=4) (actual time=287.063..287.063 rows=1 loops=1)
  -&gt;  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=4) (actual time=0.015..85.435 rows=1000000 loops=1)
Planning time: 0.098 ms
Execution time: 287.104 ms
(4 rows)

您可以看到它仍然使用 seq 掃描。您將看到sum(c)完全沒有索引的相同計劃。現在這是踢球者，

# EXPLAIN ANALYZE SELECT min(a) FROM foo;
                                                             QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=0.48..0.49 rows=1 width=0) (actual time=0.041..0.041 rows=1 loops=1)
  InitPlan 1 (returns $0)
    -&gt;  Limit  (cost=0.42..0.48 rows=1 width=4) (actual time=0.036..0.037 rows=1 loops=1)
          -&gt;  Index Only Scan using idx on foo  (cost=0.42..56291.41 rows=1000000 width=4) (actual time=0.035..0.035 rows=1 loops=1)
                Index Cond: (a IS NOT NULL)
                Heap Fetches: 1
Planning time: 0.171 ms
Execution time: 0.080 ms
(8 rows)

min(a)不像sum(a)可以使用排序，因此查詢計劃器意識到索引掃描，它不是免費的，有一個好處。

在 (a,b,c) 上使用索引證明

無論出於何種原因，如果您想看到進一步索引對c求和無關緊要的證據（如果在閱讀上述內容後您仍然不明白為什麼，請提出問題），

-- turn this back on we turned it off earlier
SET enable_hashagg = true;
DROP INDEX idx;
CREATE INDEX idx ON foo (a,b,c);
VACUUM FULL ANALYZE foo;
EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Sort  (cost=23668.04..23668.16 rows=50 width=14) (actual time=608.888..608.889 rows=10 loops=1)
  Sort Key: a, b
  Sort Method: quicksort  Memory: 25kB
  -&gt;  HashAggregate  (cost=23666.00..23666.62 rows=50 width=14) (actual time=608.869..608.871 rows=10 loops=1)
        Group Key: a, b
        -&gt;  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.015..72.613 rows=1000000 loops=1)
Planning time: 0.130 ms
Execution time: 608.947 ms
(8 rows)

一點改善都沒有。禁用hashagg仍然沒有任何改善。

TLDR；

在這個特定和簡單的案例中，索引無關緊要。計劃者選擇最佳方法。

引用自：https://dba.stackexchange.com/questions/160586

索引可以通過 GROUP BY/聚合對整個表（無選擇性）加速查詢嗎？

需要更快？

在實際聚合的列上。

在 (a,b,c) 上使用索引證明

TLDR；

相關問答

對大量重複值使用什麼索引？

大表的高效分頁

如何使 MySQL 在整數範圍選擇查詢中使用索引

如何在 PostgreSQL 中使 DISTINCT ON 更快？

非聚集索引更新未顯示在執行計劃中

索引對更新列不在索引中的更新語句的影響