PostgreSQL

Improve performance of COUNT/GROUP BY in a large PostgreSQL table?

  • January 3, 2019

I am running PostgreSQL 9.2 and have a relation of 12 columns with about 6,700,000 rows. It contains nodes in 3D space, each referencing a user (the one who created it). To query how many nodes each user has created, I do the following (adding EXPLAIN ANALYZE for more information):

EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                   QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
  ->  Seq Scan on treenode  (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
        Filter: (project_id = 1)
Total runtime: 1747.653 ms

As you can see, this takes about 1.7 seconds. That isn't too bad considering the amount of data, but I wonder whether it can be improved. I tried adding a B-tree index on the user column, but that did not help at all.

Do you have alternative suggestions?


For the sake of completeness, this is the full table definition with all its indexes (without foreign key constraints, references, and triggers):

   Column     |           Type           |                      Modifiers                    
---------------+--------------------------+------------------------------------------------------
id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
user_id       | bigint                   | not null
creation_time | timestamp with time zone | not null default now()
edition_time  | timestamp with time zone | not null default now()
project_id    | bigint                   | not null
location      | double3d                 | not null
reviewer_id   | integer                  | not null default (-1)
review_time   | timestamp with time zone |
editor_id     | integer                  |
parent_id     | bigint                   |
radius        | double precision         | not null default 0
confidence    | integer                  | not null default 5
skeleton_id   | bigint                   |
Indexes:
   "treenode_pkey" PRIMARY KEY, btree (id)
   "treenode_id_key" UNIQUE CONSTRAINT, btree (id)
   "skeleton_id_treenode_index" btree (skeleton_id)
   "treenode_editor_index" btree (editor_id)
   "treenode_location_x_index" btree (((location).x))
   "treenode_location_y_index" btree (((location).y))
   "treenode_location_z_index" btree (((location).z))
   "treenode_parent_id" btree (parent_id)
   "treenode_user_index" btree (user_id)

**Edit:** This is the result when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without EXPLAIN ANALYZE):

EXPLAIN ANALYZE SELECT u.id, ( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
                                                                       QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on auth_user u  (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
          ->  Bitmap Heap Scan on treenode t  (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
                Recheck Cond: ((project_id = 1) AND (user_id = u.id))
                Rows Removed by Index Recheck: 461076
                ->  Bitmap Index Scan on treenode_user_index  (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
                      Index Cond: ((project_id = 1) AND (user_id = u.id))
Total runtime: 5556.190 ms
(9 rows)

Time: 5556.804 ms

**Edit 2:** This is the result when I use an index on (project_id, user_id), as proposed by @erwin-brandstetter (but without the schema optimizations yet). The query runs at the same speed as the original, about 1.5 seconds:

EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                       QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
  ->  Seq Scan on treenode  (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
        Filter: (project_id = 1)
Total runtime: 1807.368 ms
(4 rows)

The main problem is the missing index. But there is more. Use count(*) instead:

SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
  • You have a lot of bigint columns. Probably overkill. Typically, integer is more than enough for columns like project_id and user_id. This would also help the next item. While optimizing the table definition, consider this related answer with an emphasis on data alignment and padding. Most of the rest applies, too.

  • The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.

While being at it, make it a multicolumn index:

CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);

If you follow my advice, integer would be perfect here.
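Narrowing the columns from bigint to integer could be sketched as follows (an assumption for illustration: no views, foreign keys, or application code depend on the old types; note that this rewrites the table):

```sql
-- Narrow 8-byte bigint columns to 4-byte integer.
-- This forces a full table rewrite, so run it in a maintenance window.
ALTER TABLE treenode
    ALTER COLUMN project_id TYPE integer,
    ALTER COLUMN user_id    TYPE integer;
```

Indexes on these columns are rebuilt automatically as part of the rewrite.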

  • user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, this would even apply without user_id being defined NOT NULL.)
  • id is already the primary key; the additional UNIQUE constraint is useless ballast. Drop it:
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
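Dropping the redundant constraint (using the constraint name from the table definition above) could look like this:

```sql
-- The primary key on id already enforces uniqueness and provides
-- an index, so the extra UNIQUE constraint adds nothing but
-- write overhead. Dropping it also drops its backing index.
ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;
```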

Aside: I would not use id as a column name. Use something descriptive like treenode_id.

Added information

Q: How many different project_id and user_id?

A: Not more than five different project_id.

That means Postgres has to read about 20% of the whole table to satisfy your query (6.7M rows spread across at most five projects). Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. No more performance to gain here, except by optimizing the table and server settings.

As for index-only scans: to see how effective these can be, run VACUUM FULL ANALYZE if you can afford it (it locks the table exclusively). Then try your query again. It should now be moderately faster using only the index. Read this related answer first:

As well as the Postgres Wiki on index-only scans and the manual page added with Postgres 9.6.
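The maintenance step and the re-check could be sketched as follows (VACUUM cannot run inside a transaction block):

```sql
-- Rewrite the table, refresh the visibility map, and update
-- planner statistics. Caution: VACUUM FULL takes an
-- ACCESS EXCLUSIVE lock on the table while it runs.
VACUUM FULL ANALYZE treenode;

-- Re-check the plan: with the multicolumn index on
-- (project_id, user_id) and a fresh visibility map, the planner
-- can now choose an Index Only Scan instead of a Seq Scan.
EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```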

Source: https://dba.stackexchange.com/questions/62500