Improve COUNT/GROUP BY performance on a large PostgreSQL table?
I'm running PostgreSQL 9.2 and have a relation of 12 columns with about 6,700,000 rows. It contains nodes in 3D space, each referencing a user (the one who created it). To query how many nodes each user created, I run the following (with `EXPLAIN ANALYZE` added for more information):

```
EXPLAIN ANALYZE SELECT user_id, count(user_id)
FROM treenode
WHERE project_id=1
GROUP BY user_id;

                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1747.653 ms
```
As you can see, this takes about 1.7 seconds. That's not bad given the amount of data, but I wonder whether it can be improved. I tried adding a BTree index on the user column, but that didn't help.

Do you have alternative suggestions?

For the sake of completeness, this is the complete table definition with all indexes (without foreign key constraints, references and triggers):
```
    Column     |           Type           |                      Modifiers
---------------+--------------------------+------------------------------------------------------
 id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
 user_id       | bigint                   | not null
 creation_time | timestamp with time zone | not null default now()
 edition_time  | timestamp with time zone | not null default now()
 project_id    | bigint                   | not null
 location      | double3d                 | not null
 reviewer_id   | integer                  | not null default (-1)
 review_time   | timestamp with time zone |
 editor_id     | integer                  |
 parent_id     | bigint                   |
 radius        | double precision         | not null default 0
 confidence    | integer                  | not null default 5
 skeleton_id   | bigint                   |
Indexes:
    "treenode_pkey" PRIMARY KEY, btree (id)
    "treenode_id_key" UNIQUE CONSTRAINT, btree (id)
    "skeleton_id_treenode_index" btree (skeleton_id)
    "treenode_editor_index" btree (editor_id)
    "treenode_location_x_index" btree (((location).x))
    "treenode_location_y_index" btree (((location).y))
    "treenode_location_z_index" btree (((location).z))
    "treenode_parent_id" btree (parent_id)
    "treenode_user_index" btree (user_id)
```
**Edit:** These are the results when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without `EXPLAIN ANALYZE`):

```
EXPLAIN ANALYZE SELECT u.id,
  ( SELECT COUNT(*) FROM treenode AS t
    WHERE t.project_id=1 AND t.user_id = u.id
  ) AS number_of_nodes
FROM auth_user As u;

                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on auth_user u  (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
           ->  Bitmap Heap Scan on treenode t  (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
                 Recheck Cond: ((project_id = 1) AND (user_id = u.id))
                 Rows Removed by Index Recheck: 461076
                 ->  Bitmap Index Scan on treenode_user_index  (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
                       Index Cond: ((project_id = 1) AND (user_id = u.id))
 Total runtime: 5556.190 ms
(9 rows)

Time: 5556.804 ms
```
**Edit 2:** These are the results when I use an index on `(project_id, user_id)` as suggested by @erwin-brandstetter (but without the schema optimizations yet); the query runs at the same speed as the original, about 1.5 seconds:

```
EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct
FROM treenode
WHERE project_id=1
GROUP BY user_id;

                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1807.368 ms
(4 rows)
```
The main problem is the missing index. But there is more.

```sql
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```
You have a lot of `bigint` columns. That's probably overkill. Typically, `integer` is more than enough for columns like `project_id` and `user_id`. This would also help the next item. While you are optimizing the table definition, consider this related answer, with an emphasis on data alignment and padding; most of the rest applies, too.
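If you do switch those columns, the change might look like the sketch below. It is only an illustration: it assumes all existing values fit into `integer`, that any foreign keys referencing these columns are adjusted as well, and that you can tolerate the table rewrite (each such `ALTER` blocks concurrent access for the duration).

```sql
-- Sketch: shrink 8-byte bigint columns to 4-byte integer.
-- Assumes all existing values fit into the integer range.
ALTER TABLE treenode
  ALTER COLUMN project_id TYPE integer,
  ALTER COLUMN user_id    TYPE integer;
```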
The elephant in the room: there is no index on `project_id`. Create one. That is more important than the rest of this answer. While being at it, make it a multicolumn index:

```sql
CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
```
If you follow my advice, `integer` would be perfect here.

`user_id` is defined `NOT NULL`, so `count(user_id)` is equivalent to `count(*)`, but the latter is a bit shorter and faster. (In this particular query, that would apply even without `user_id` being defined `NOT NULL`.)
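To convince yourself of the equivalence, a quick illustrative comparison (assuming the table above): `count(*)` counts rows, `count(user_id)` counts non-NULL values, so with `user_id NOT NULL` both return the same numbers.

```sql
-- Both columns of this result are identical when user_id is NOT NULL,
-- but count(*) skips the per-row NULL check.
SELECT count(*) AS total_rows, count(user_id) AS non_null_user_ids
FROM   treenode
WHERE  project_id = 1;
```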
`id` is already the primary key; the additional `UNIQUE` constraint is useless ballast. Drop it:

```
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
```
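Dropping it could look like this (the constraint name `treenode_id_key` is taken from the table definition above):

```sql
-- The primary key on id already enforces uniqueness and provides an index,
-- so the extra UNIQUE constraint (and its duplicate index) can go.
ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;
```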
Aside: I would not use `id` as a column name. Use something descriptive like `treenode_id`.

**Added information**
Q: How many different `project_id` and `user_id`?
A: Not more than five different `project_id`.

That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan over the table will be faster than involving any indexes. There is no more performance to gain here, except by optimizing the table and server settings.
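If you want to double-check how selective `project_id = 1` really is, you could inspect the planner statistics or count rows per project. Both queries below are only illustrative; the `GROUP BY` variant performs a full sequential scan, so run it when load permits.

```sql
-- Cheap: inspect the statistics the planner already collected.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'treenode'
AND    attname IN ('project_id', 'user_id');

-- Exact, but scans the whole table.
SELECT project_id, count(*) AS rows_per_project
FROM   treenode
GROUP  BY project_id;
```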
As for index-only scans: to see how effective this can be, run `VACUUM ANALYZE` when you can afford it (it locks the table exclusively). Then try your query again. Using only the index, it should now be somewhat faster. Read this related answer first:
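For reference, that sequence might look like the sketch below; whether the planner actually switches to an index-only scan depends on visibility-map coverage and your configuration.

```sql
-- Refresh statistics and the visibility map, then re-check the plan.
VACUUM ANALYZE treenode;

-- Ideally an Index Only Scan on the new (project_id, user_id) index
-- now replaces the Seq Scan.
EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```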