大表上的 Postgres 索引掃描需要很長時間
我在 PostgreSQL 10.10 數據庫(AWS RDS 實例)上有一個大表(約 5 億條記錄,85GB 大小)。這些記錄是由一堆遊戲模擬創建的,這些模擬在模擬中發生時插入“事件”。這是一個範例記錄:
灌木叢 時間 event_id 事件類型 事件標誌 event_json area_id 154052 0 3535337 6 9 {"object": 27561, "subject": 28073, "object_loc": 10778050, "subject_loc": 3737, "captured": 5}
3737 其中
runnum
是執行號,time
是生成事件的模擬時間,event_type
是event_flag
描述事件的列舉,event_json
是格式中的事件數據jsonb
(這允許具有不同負載的不同類型的事件)。我想查詢每次執行在特定場景中“擷取”鹼基的總和。
我有一個這樣定義的索引:
CREATE INDEX events_runnum_type_captured ON details.events USING btree (runnum ASC NULLS LAST, event_type ASC NULLS LAST, ((event_json ->> 'captured'::text)::integer) ASC NULLS LAST) TABLESPACE pg_default;
我的查詢是:
SELECT programming_scenario_id, D.runnum, SUM((A.event_json->>'captured')::int) as times FROM simulations.programming_scenarios F JOIN simulations.runs D ON D.programming_scenario_id = F.id JOIN details.events A ON A.runnum = D.runnum AND A.event_type = 6 WHERE id IN (SELECT id FROM simulations.programming_scenarios WHERE assessment_id = 180) GROUP BY programming_scenario_id, D.runnum
這個查詢需要大約 5 分鐘才能完成,這對我來說似乎很慢,考慮到它有一個完整的覆蓋索引並且可以從僅索引掃描中受益。我已經研究過使用
EXPLAIN
,這是輸出:GroupAggregate (cost=15.15..1464244.05 rows=4272 width=20) Group Key: d.runnum -> Nested Loop (cost=15.15..1454074.88 rows=810116 width=132) -> Nested Loop (cost=14.58..1101.10 rows=291 width=12) Join Filter: (f.id = d.programming_scenario_id) -> Index Scan using runs_runnum_idx on runs d (cost=0.28..873.90 rows=4272 width=12) -> Materialize (cost=14.29..34.97 rows=3 width=8) -> Nested Loop (cost=14.29..34.96 rows=3 width=8) -> HashAggregate (cost=14.02..14.05 rows=3 width=4) Group Key: programming_scenarios.id -> Bitmap Heap Scan on programming_scenarios (cost=4.30..14.01 rows=3 width=4) Recheck Cond: (assessment_id = 180) -> Bitmap Index Scan on programming_scenarios_assessment_id_idx (cost=0.00..4.30 rows=3 width=0) Index Cond: (assessment_id = 180) -> Index Only Scan using programming_scenarios_id_idx on programming_scenarios f (cost=0.28..6.96 rows=1 width=4) Index Cond: (id = programming_scenarios.id) -> Index Scan using events_runnum_type_captured on events a (cost=0.57..4965.23 rows=2781 width=128) Index Cond: ((runnum = d.runnum) AND (event_type = 6))
從對 pgAdmin4 的分析來看,它似乎
Index Scan using events_runnum_type_captured on events a
是罪魁禍首,佔用了大部分執行時間。 我希望 PostgresIndex-Only
在這種情況下使用掃描,因為所有返回的欄位都在索引中。我已經
VACUUM ANALYZE
在表上執行以確保查詢計劃是最新的,但無濟於事。我能做些什麼來加快這個查詢嗎?
編輯:
根據要求,這裡是輸出
EXPLAIN(ANALYZE, BUFFERS)
:GroupAggregate (cost=15.15..1464244.05 rows=4272 width=20) (actual time=734.949..199676.720 rows=500 loops=1) Group Key: d.runnum Buffers: shared hit=201586 read=702520 -> Nested Loop (cost=15.15..1454074.88 rows=810116 width=132) (actual time=0.509..198947.468 rows=1542148 loops=1) Buffers: shared hit=201586 read=702520 -> Nested Loop (cost=14.58..1101.10 rows=291 width=12) (actual time=0.102..12.800 rows=500 loops=1) Join Filter: (f.id = d.programming_scenario_id) Rows Removed by Join Filter: 21005 Buffers: shared hit=71 read=745 -> Index Scan using runs_runnum_idx on runs d (cost=0.28..873.90 rows=4272 width=12) (actual time=0.009..7.647 rows=4301 loops=1) Buffers: shared hit=50 read=745 -> Materialize (cost=14.29..34.97 rows=3 width=8) (actual time=0.000..0.000 rows=5 loops=4301) Buffers: shared hit=21 -> Nested Loop (cost=14.29..34.96 rows=3 width=8) (actual time=0.040..0.047 rows=5 loops=1) Buffers: shared hit=21 -> HashAggregate (cost=14.02..14.05 rows=3 width=4) (actual time=0.021..0.022 rows=5 loops=1) Group Key: programming_scenarios.id Buffers: shared hit=5 -> Bitmap Heap Scan on programming_scenarios (cost=4.30..14.01 rows=3 width=4) (actual time=0.013..0.018 rows=5 loops=1) Recheck Cond: (assessment_id = 180) Heap Blocks: exact=3 Buffers: shared hit=5 -> Bitmap Index Scan on programming_scenarios_assessment_id_idx (cost=0.00..4.30 rows=3 width=0) (actual time=0.010..0.010 rows=5 loops=1) Index Cond: (assessment_id = 180) Buffers: shared hit=2 -> Index Only Scan using programming_scenarios_id_idx on programming_scenarios f (cost=0.28..6.96 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=5) Index Cond: (id = programming_scenarios.id) Heap Fetches: 5 Buffers: shared hit=16 -> Index Scan using events_runnum_type_captured on events a (cost=0.57..4965.23 rows=2781 width=128) (actual time=0.819..397.212 rows=3084 loops=500) Index Cond: ((runnum = d.runnum) AND (event_type = 6)) Buffers: shared hit=201515 read=701775 Planning time: 8.195 ms Execution time: 199677.031 ms
WHERE
謂詞 (id IN (SELECT id FROM simulations.programming_scenarios WHERE assessment_id = 180)
) 匹配總行數的約 15%(約 73M 出 ~500M)。
僅索引掃描不適用於表達式索引中的表達式,規劃器不夠聰明。它認為它需要根列“event_json”在索引中才能使用僅索引掃描。如果您希望索引僅按原樣掃描該表,則必須在索引中包含整個 jsonb 列。基於 N = 1 的樣本量,它至少足夠小以至於這樣做是合理的。
嘗試:
SELECT programming_scenario_id, D.runnum, SUM((A.event_json->>'captured')::int) as times FROM simulations.programming_scenarios F join simulations.programming_scenarios psc on (psc.id = f.id) JOIN simulations.runs D ON D.programming_scenario_id = F.id JOIN details.events A ON A.runnum = D.runnum AND A.event_type = 6 WHERE psc.assessment_id = 180 GROUP BY programming_scenario_id, D.runnum
如果事件類型始終為 6,您可以更改索引:
CREATE INDEX events_runnum_type_captured ON details.events USING btree (runnum ASC NULLS LAST, ((event_json ->> 'captured'::text)::integer) ASC NULLS LAST) where event_type = 6