PostgreSQL

Improve performance of COUNT/GROUP BY in a large PostgreSQL table?

  • January 3, 2019

I am running PostgreSQL 9.2 and have a relation of 12 columns with about 6,700,000 rows. It contains nodes in 3D space, each referencing a user (the one who created it). To query how many nodes each user has created, I do the following (adding EXPLAIN ANALYZE for more information):

EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                   QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
  ->  Seq Scan on treenode  (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
        Filter: (project_id = 1)
Total runtime: 1747.653 ms

As you can see, this takes about 1.7 seconds. That isn't too bad considering the amount of data, but I wonder whether it can be improved. I tried adding a B-tree index on the user column, but that did not help at all.

Do you have alternative suggestions?


For the sake of completeness, this is the full table definition with all its indexes (without foreign key constraints, references, and triggers):

   Column     |           Type           |                      Modifiers                    
---------------+--------------------------+------------------------------------------------------
id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
user_id       | bigint                   | not null
creation_time | timestamp with time zone | not null default now()
edition_time  | timestamp with time zone | not null default now()
project_id    | bigint                   | not null
location      | double3d                 | not null
reviewer_id   | integer                  | not null default (-1)
review_time   | timestamp with time zone |
editor_id     | integer                  |
parent_id     | bigint                   |
radius        | double precision         | not null default 0
confidence    | integer                  | not null default 5
skeleton_id   | bigint                   |
Indexes:
   "treenode_pkey" PRIMARY KEY, btree (id)
   "treenode_id_key" UNIQUE CONSTRAINT, btree (id)
   "skeleton_id_treenode_index" btree (skeleton_id)
   "treenode_editor_index" btree (editor_id)
   "treenode_location_x_index" btree (((location).x))
   "treenode_location_y_index" btree (((location).y))
   "treenode_location_z_index" btree (((location).z))
   "treenode_parent_id" btree (parent_id)
   "treenode_user_index" btree (user_id)

**Edit:** This is the result when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without EXPLAIN ANALYZE):

EXPLAIN ANALYZE SELECT u.id, ( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
                                                                       QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on auth_user u  (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
          ->  Bitmap Heap Scan on treenode t  (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
                Recheck Cond: ((project_id = 1) AND (user_id = u.id))
                Rows Removed by Index Recheck: 461076
                ->  Bitmap Index Scan on treenode_user_index  (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
                      Index Cond: ((project_id = 1) AND (user_id = u.id))
Total runtime: 5556.190 ms
(9 rows)

Time: 5556.804 ms

**Edit 2:** This is the result when I use an index on (project_id, user_id), as proposed by @erwin-brandstetter (but without the schema optimizations yet). The query runs at the same speed as the original, about 1.5 seconds:

EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                       QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
  ->  Seq Scan on treenode  (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
        Filter: (project_id = 1)
Total runtime: 1807.368 ms
(4 rows)

The main problem is the missing index. But there is more. Use count(*) instead:

SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
  • You have a lot of bigint columns. Probably overkill. Typically, integer is more than enough for columns like project_id and user_id. This would also help the next item. While optimizing the table definition, consider this related answer with an emphasis on data alignment and padding. Most of the rest applies, too.

  • The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.

While being at it, make it a multicolumn index:

CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);

If you follow my advice, integer would be perfect here.
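Narrowing the columns from bigint to integer could be sketched as follows (an assumption for illustration: no views, foreign keys, or application code depend on the old types; note that this rewrites the table):

```sql
-- Narrow 8-byte bigint columns to 4-byte integer.
-- This forces a full table rewrite, so run it in a maintenance window.
ALTER TABLE treenode
    ALTER COLUMN project_id TYPE integer,
    ALTER COLUMN user_id    TYPE integer;
```

Indexes on these columns are rebuilt automatically as part of the rewrite.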

  • user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, this would even apply without user_id being defined NOT NULL.)
  • id is already the primary key; the additional UNIQUE constraint is useless ballast. Drop it:
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
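Dropping the redundant constraint (using the constraint name from the table definition above) could look like this:

```sql
-- The primary key on id already enforces uniqueness and provides
-- an index, so the extra UNIQUE constraint adds nothing but
-- write overhead. Dropping it also drops its backing index.
ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;
```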

Aside: I would not use id as a column name. Use something descriptive like treenode_id.

Added information

Q: How many different project_id and user_id?

A: Not more than five different project_id.

That means Postgres has to read about 20% of the whole table to satisfy your query (6.7M rows spread across at most five projects). Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. No more performance to gain here, except by optimizing the table and server settings.

As for index-only scans: to see how effective these can be, run VACUUM FULL ANALYZE if you can afford it (it locks the table exclusively). Then try your query again. It should now be moderately faster using only the index. Read this related answer first:

As well as the Postgres Wiki on index-only scans and the manual page added with Postgres 9.6.
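The maintenance step and the re-check could be sketched as follows (VACUUM cannot run inside a transaction block):

```sql
-- Rewrite the table, refresh the visibility map, and update
-- planner statistics. Caution: VACUUM FULL takes an
-- ACCESS EXCLUSIVE lock on the table while it runs.
VACUUM FULL ANALYZE treenode;

-- Re-check the plan: with the multicolumn index on
-- (project_id, user_id) and a fresh visibility map, the planner
-- can now choose an Index Only Scan instead of a Seq Scan.
EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```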

Source: https://dba.stackexchange.com/questions/62500