Index

索引可以通過 GROUP BY/聚合對整個表(無選擇性)加速查詢嗎?

  • January 10, 2017

假設我有一個包含 3 列 a、b 和 c 的表。

我可以通過使用索引來加快看起來像這樣的查詢嗎?

SELECT a,b,SUM(c)  # or AVG(c)
FROM table
GROUP BY a,b
ORDER BY a,b
;

如果上述問題是肯定的,您推薦哪種類型的索引以及****它如何工作?

不見得。GROUP BY並且ORDER BY通常需要排序。但是,在這種情況下,HashAggregate使用了 a(可能是因為我們正在處理整個表)。

CREATE TABLE foo AS
SELECT x % 5 AS a, x % 10 AS b, x AS c
FROM generate_series(1,1e6) AS x;

使用 HashAggregate 計劃,

# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Sort  (cost=23668.04..23668.16 rows=50 width=14) (actual time=611.607..611.608 rows=10 loops=1)
  Sort Key: a, b
  Sort Method: quicksort  Memory: 25kB
  ->  HashAggregate  (cost=23666.00..23666.62 rows=50 width=14) (actual time=611.589..611.593 rows=10 loops=1)
        Group Key: a, b
        ->  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.012..71.157 rows=1000000 loops=1)
Planning time: 0.168 ms
Execution time: 611.665 ms

所以我們添加一個索引…

CREATE INDEX idx ON foo (a,b);
VACUUM FULL ANALYZE foo;

…仍然顯示相同的查詢計劃。所以我們禁用 HashAggregate

SET enable_hashagg = false;

然後再試一次..

# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                           QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=0.42..61292.04 rows=50 width=14) (actual time=108.149..655.536 rows=10 loops=1)
  Group Key: a, b
  ->  Index Scan using idx on foo  (cost=0.42..53791.41 rows=1000000 width=14) (actual time=0.066..272.299 rows=1000000 loops=1)
Planning time: 0.121 ms
Execution time: 655.594 ms
(5 rows)

而且,與之前的 611ms 相比,它需要更多的時間 655ms。

需要更快?

如果這還不夠快(並且 611 毫秒來分組和匯總一百萬行也不錯)。然後,MATERIALIZED VIEW如果您的工作量允許(如果查詢很熱和/或不經常更新),您可以使用 a,

CREATE MATERIALIZED VIEW foo2 AS
SELECT a,b,sum(c)
FROM foo
GROUP BY a,b
ORDER BY a,b;

現在,當TABLE foo2. 然後只需REFRESH MATERIALIZED VIEW foo2;刷新視圖即可。或者,您可以創建一個觸發器更新另一個表並使用觸發器對其進行更新。

在實際聚合的列上。

有一些例外,但sum()不是其中之一。大多數聚合不使用索引,因為它們通常不需要索引。例外是特定於訂單的聚合(如min()max())。因此,例如,如果在我們為(a,b)您創建索引之後執行 a sum(a)

# EXPLAIN ANALYZE SELECT sum(a) FROM foo;
                                                    QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=18666.00..18666.01 rows=1 width=4) (actual time=287.063..287.063 rows=1 loops=1)
  ->  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=4) (actual time=0.015..85.435 rows=1000000 loops=1)
Planning time: 0.098 ms
Execution time: 287.104 ms
(4 rows)

您可以看到它仍然使用 seq 掃描。您將看到sum(c)完全沒有索引的相同計劃。現在這是踢球者,

# EXPLAIN ANALYZE SELECT min(a) FROM foo;
                                                             QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=0.48..0.49 rows=1 width=0) (actual time=0.041..0.041 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.42..0.48 rows=1 width=4) (actual time=0.036..0.037 rows=1 loops=1)
          ->  Index Only Scan using idx on foo  (cost=0.42..56291.41 rows=1000000 width=4) (actual time=0.035..0.035 rows=1 loops=1)
                Index Cond: (a IS NOT NULL)
                Heap Fetches: 1
Planning time: 0.171 ms
Execution time: 0.080 ms
(8 rows)

min(a)不像sum(a)可以使用排序,因此查詢計劃器意識到索引掃描,它不是免費的,有一個好處。

在 (a,b,c) 上使用索引證明

無論出於何種原因,如果您想看到進一步索引對c求和無關緊要的證據(如果在閱讀上述內容後您仍然不明白為什麼,請提出問題),

-- turn this back on we turned it off earlier
SET enable_hashagg = true;
DROP INDEX idx;
CREATE INDEX idx ON foo (a,b,c);
VACUUM FULL ANALYZE foo;
EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Sort  (cost=23668.04..23668.16 rows=50 width=14) (actual time=608.888..608.889 rows=10 loops=1)
  Sort Key: a, b
  Sort Method: quicksort  Memory: 25kB
  ->  HashAggregate  (cost=23666.00..23666.62 rows=50 width=14) (actual time=608.869..608.871 rows=10 loops=1)
        Group Key: a, b
        ->  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.015..72.613 rows=1000000 loops=1)
Planning time: 0.130 ms
Execution time: 608.947 ms
(8 rows)

一點改善都沒有。禁用hashagg仍然沒有任何改善。

TLDR;

在這個特定和簡單的案例中,索引無關緊要。計劃者選擇最佳方法。

引用自:https://dba.stackexchange.com/questions/160586