Postgresql

多表 GROUP BY 查詢優化

  • February 13, 2015

目前我有一個GROUP BY非常慢的多表/列查詢。以下是查詢使用的表架構:

報告表

                                                               Table "report"
      Column       |           Type           |                           Modifiers                           | Storage  | Stats target | Description 
--------------------+--------------------------+---------------------------------------------------------------+----------+--------------+-------------
id                 | integer                  | not null default nextval('dashboard_report_id_seq'::regclass) | plain    |              | 
county             | character varying(200)   | not null                                                      | extended |              | 
district           | character varying(200)   | not null                                                      | extended |              | 
report_name        | character varying(500)   | not null                                                      | extended |              | 
Indexes:
   "dashboard_report_pkey" PRIMARY KEY, btree (id)

問題表

                                                      Table "question"
  Column    |          Type          |                            Modifiers                            | Storage  | Stats target | Description 
-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------
id          | integer                | not null default nextval('dashboard_question_id_seq'::regclass) | plain    |              | 
name        | character varying(255) | not null                                                        | extended |              | 
label       | character varying(255) | not null                                                        | extended |              | 
report_type | character varying(255) | not null                                                        | extended |              | 
Indexes:
   "dashboard_question_pkey" PRIMARY KEY, btree (id)

問答表

                                                          Table "questionanswer"
  Column    |          Type          |                               Modifiers                               | Storage  | Stats target | Description 
-------------+------------------------+-----------------------------------------------------------------------+----------+--------------+-------------
id          | integer                | not null default nextval('dashboard_questionanswer_id_seq'::regclass) | plain    |              | 
answer      | character varying(255) | not null                                                              | extended |              | 
question_id | integer                | not null                                                              | plain    |              | 
report_id   | integer                | not null                                                              | plain    |              | 
Indexes:
   "dashboard_questionanswer_pkey" PRIMARY KEY, btree (id)
   "dashboard_questionanswer_6f78b20c" btree (report_id)
   "dashboard_questionanswer_7aa0f6ee" btree (question_id)
Foreign-key constraints:
   "dashboard_que_report_id_4c2a87ee585b6121_fk_dashboard_report_id" FOREIGN KEY (report_id) REFERENCES dashboard_report(id) DEFERRABLE INITIALLY DEFERRED
   "dashboard_question_id_522f82097923c241_fk_dashboard_question_id" FOREIGN KEY (question_id) REFERENCES dashboard_question(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

詢問:

SELECT 
   "report"."county", 
   "report"."district", 
   "report"."report_name", 
   "question"."report_type", 
   COUNT(DISTINCT "questionanswer"."report_id")
FROM "questionanswer" 
   INNER JOIN "question" ON ( "questionanswer"."question_id" = "question"."id" )
   INNER JOIN "report" ON ( "questionanswer"."report_id" = "report"."id" )
WHERE 
   "question"."name" = 'touch' 
GROUP BY 
   "report"."county", 
   "report"."district", 
   "report"."report_name", 
   "question"."report_type" ;

我不確定從哪裡開始優化此查詢,目前它在我的系統上相當慢〜 300 毫秒。我確實嘗試單獨索引子句引用的所有列,GROUP BY但這也無助於查詢時間。

所以基本上使用者送出報告,其數據儲存在報告中,每個報告有多個 QuestionAnswers,儲存在 QuestionAnswer 表中。問題元數據位於問題表中。QuestionAnswer 表記錄會很大,接下來是 Report 表數據,Question 表有一組只有 40 個問題。查詢的目的是找出針對特定問題的每個縣、區和報告類型,我們有多少報告回答了該問題。

我的 Postgres Sql 版本是 PostgreSQL 9.3.5。

查詢計劃在這裡

我的機器硬體規格是 MacOSX 10.9.5 4-Core 8 GM RAM SSD,加密文件系統。

GROUP BY會索引幫助引用的所有列嗎?

還是我應該選擇多列索引?

我確實想知道,為什麼你有問題report_type的as 屬性? 儘管如此,您的目標是:

查詢的目的是找出針對特定問題的每個縣、區和報告類型,我們有多少報告回答了該問題。

為什麼要包括report_nameGROUP BY步驟中?這與您的定義相衝突。我認為你應該刪除它:

SELECT r.county, r.district, q.report_type
    , count(DISTINCT r.id) AS reports
FROM   question q 
JOIN   questionanswer qa ON qa.question_id = q.id
JOIN   report         r  ON qa.report_id = r.id
WHERE  q.name = 'touch' 
GROUP  BY 1,2,3;

此外,只要您將查詢限制為單個問題,每個定義的結果中只有一個問題。 report_type將其包含在結果和GROUP BY子句中不會更改數字。

至於性能:要麼創建一個**UNIQUE約束**(按(question_id, report_id)那個順序!),就像我懷疑你應該有的那樣:

ALTER TABLE questionanswer ADD CONSTRAINT qa_uni UNIQUE (question_id, report_id);

或者,除此之外,至少在(question_id, report_id).

為什麼索引/約束中的列順序很重要?

有了UNIQUE約束,查詢變得相當便宜:

    , count(*) AS reports

只要你只有40 questions你不需要索引question.name只要你選擇問題name,你應該仍然UNIQUE對該列有約束。

剩下的就靠PKreport了。


如果您真的想為每個問題計算不同的縣和區,相關查詢:

SELECT q.id, q.name, q.report_type
    , count(DISTINCT r.county)      AS distinct_counties
    , count(DISTINCT r.district)    AS distinct_districts
FROM   question q 
JOIN   questionanswer qa ON qa.question_id = q.id
JOIN   report         r  ON qa.report_id = r.id
WHERE  q.name = 'touch' 
GROUP  BY 1;  -- the PK column covers the whole table

引用自:https://dba.stackexchange.com/questions/91701