點陣圖堆掃描很慢
我有 2 個表:mailings和form_responses。form_responses表與mailings表有一個外鍵,因此mailing具有一對多的 form_responses。我想要做的是計算郵件統計:每個郵件獲取總form_responses以及收到最後一個form_response的時間。我想退回最後收到form_response的前 20 封**郵件。但是當它必須查詢 50k+ 行時,查詢變得非常慢。
郵件(~100k 行)
Column | Type | Modifiers --------------+-----------------------------+--------------------------------------------------- id | integer | pkey project_id | integer | not null deleted | boolean | active | boolean | created_at | timestamp with time zone | Indexes: "mailings_pkey" PRIMARY KEY, btree (id) "mailing_on_project_id_where_not_deleted_and_active" btree (project_id) (WHERE deleted IS NOT true AND active IS TRUE) + individual indexes on almost all columns
form_responses(~6m 行)
Column | Type | Modifiers --------------+-----------------------------+--------------------------------------------------- id | integer | pkey project_id | integer | not null mailing_id | integer | deleted | boolean | is_unanswered| boolean | created_at | timestamp with time zone | Indexes: "form_responses_pkey" PRIMARY KEY, btree (id) "form_responses_on_pid_cr_time_where_not_del_not_is_unans" btree (project_id, created_time) (WHERE deleted IS NOT true AND is_uanswered IS NOT TRUE) + individual indexes on almost all columns
我想在一個日期範圍內獲取項目(id=1)的郵件統計資訊。我只想要未刪除且處於活動狀態的郵件的統計資訊,並且只計算未刪除的回复。
所以這給了我以下查詢
EXPLAIN (ANALYZE, BUFFERS) SELECT count(DISTINCT form_responses.id) AS total, max(form_responses.created_time) AS form_response_last_received, (SELECT mailings.name FROM mailings WHERE id = form_responses.mailing_id) AS name, form_responses.mailing_id AS mailing_id FROM form_responses JOIN mailings ON mailings.id = form_responses.mailing_id WHERE form_responses.deleted IS NOT true AND form_responses.is_unanswered IS NOT true AND form_responses.created_time >= '2017-08-02 00:00:00' AND form_responses.created_time <= '2018-01-29 00:00:00' AND form_responses.project_id = 1 AND mailings.deleted IS NOT true AND mailings.project_id = 1 AND mailings.active IS true GROUP BY form_responses.mailing_id ORDER BY form_response_last_received DESC LIMIT 20 OFFSET 0
這給了我以下輸出:
Limit (cost=308997.69..308997.74 rows=20 width=16) (actual time=7762.554..7762.565 rows=20 loops=1) Buffers: shared hit=74535 read=13513 dirtied=1 -> Sort (cost=308997.69..308998.19 rows=197 width=16) (actual time=7762.552..7762.557 rows=20 loops=1) Sort Key: (max(form_responses.created_time)) Sort Method: top-N heapsort Memory: 26kB Buffers: shared hit=74535 read=13513 dirtied=1 -> GroupAggregate (cost=307039.18..308992.45 rows=197 width=16) (actual time=7673.523..7762.278 rows=686 loops=1) Group Key: form_responses.mailing_id Buffers: shared hit=74535 read=13513 dirtied=1 -> Sort (cost=307039.18..307111.58 rows=28961 width=16) (actual time=7673.277..7702.783 rows=137542 loops=1) Sort Key: form_responses.mailing_id Sort Method: quicksort Memory: 12592kB Buffers: shared hit=71790 read=13512 dirtied=1 -> Hash Join (cost=7376.42..304892.90 rows=28961 width=16) (actual time=44.576..7609.818 rows=137542 loops=1) Hash Cond: (form_responses.mailing_id = mailings.id) Buffers: shared hit=71790 read=13512 dirtied=1 -> Bitmap Heap Scan on form_responses (cost=3815.27..299862.55 rows=134811 width=16) (actual time=41.439..7499.305 rows=137543 loops=1) Recheck Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time <= '2018-01-29 00:00:00+00'::timestamp with time zone) AND (deleted IS NOT TRUE) AND (is_unanswered IS NOT TRUE)) Heap Blocks: exact=84014 Buffers: shared hit=71032 read=13512 -> Bitmap Index Scan on form_responses_on_pid_cr_time_where_not_del_not_is_unans (cost=0.00..3781.57 rows=134811 width=0) (actual time=23.892..23.892 rows=137543 loops=1) Index Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time <= '2018-01-29 00:00:00+00'::timestamp with time zone)) Buffers: shared hit=530 -> Hash (cost=3286.82..3286.82 rows=21946 width=4) (actual time=3.093..3.093 rows=2866 loops=1) Buckets: 4096 Batches: 1 Memory Usage: 101kB Buffers: shared hit=758 dirtied=1 -> Bitmap Heap Scan on mailings (cost=874.50..3286.82 rows=21946 width=4) (actual time=0.403..2.145 rows=2866 loops=1) Recheck Cond: ((project_id = 1) AND (deleted IS NOT TRUE) AND (active IS TRUE)) Heap Blocks: exact=732 Buffers: shared hit=758 dirtied=1 -> Bitmap Index Scan on mailing_on_project_id_where_not_deleted_and_active (cost=0.00..869.01 rows=21946 width=0) (actual time=0.315..0.315 rows=2866 loops=1) Index Cond: (project_id = 1) Buffers: shared hit=26 SubPlan 1 -> Index Scan using mailings_pkey on mailings mailings_1 (cost=0.42..8.44 rows=1 width=26) (actual time=0.003..0.003 rows=1 loops=686) Index Cond: (id = form_responses.mailing_id) Buffers: shared hit=2745 read=1 Planning time: 0.865 ms Execution time: 7764.951 ms
在此範例中,它查詢約 135k 行,最多可達到約 600k。如您所見,99% 的時間都花在點陣圖堆掃描上。我
db.t2.large
在 AWS 上執行一個具有 100gb (SSD) 儲存的 RDS 數據庫,它執行 postgres 9.4.7
- 有沒有辦法避免進行點陣圖堆掃描?
- 或者有什麼方法可以提高這個速度?
- 我可以創建一個覆蓋索引,因此它不必讀取表單響應表本身嗎?
- 我應該將form_responses表聚集在project_id和/或mailing_id上嗎?
- 我應該升級我的硬體嗎?
- 我對我的數據庫要求太多了嗎?
可以避免點陣圖掃描
set enable_bitmapscan to off
。然後你可以看到它選擇了什麼,以及該選擇是如何執行的。這通常是一種探索性工具,在生產系統中以這種方式全域設置通常不是一個好主意。但是您可以只在一個會話中設置,因此在生產系統上進行測試沒有問題。如果您使用相同的參數第二次執行相同的查詢,它會更快嗎?如果是這樣,您可以通過將表預熱
form_responses
到記憶體中來提高性能(例如,使用pg_prewarm
擴展),假設您的記憶體足夠大以容納所有內容。你也可以嘗試增加effective_io_concurrency
,如果 RDS 允許的話。如果您創建一個索引,
mailing_on_project_id_where_not_deleted_and_active
但也包括列列表mailing_id
的id
末尾,那麼它可以像一個覆蓋索引一樣,您可以讓它使用僅索引掃描。你必須保持你的桌子充分吸塵,才能最大限度地發揮作用。預設的清理級別可能不夠激進,具體取決於數據更新的模式。PostgreSQL 基於索引而不是列對錶進行集群。分群
mailing_on_project_id_where_not_deleted_and_active
可能會幫助這個特定的查詢。根據上述答案和結果,您可能需要升級硬體。如果
form_responses
不適合 RAM,則更多 RAM 會有所幫助。如果這不可行並且需要從磁碟讀取,那麼從 HDD 升級到 SSD 會有所幫助(除非 RDS 已經在 SSD 上,在這種情況下,除非它受到限制,否則您首先不應該遇到問題) .此查詢可能會在 PostgreSQL 9.6 中引入並在 10 中改進的一些並行查詢實現中受益,但我不知道 RDS 是否支持這些。