Grouping in long tables
My application runs on PostgreSQL 9.4, and I've run into a problem I don't know how to solve.

I run a VoIP application, and we have some very long tables (more than 60M rows) containing all the call details. I'm trying to build some reports on them, but some queries are too slow and we don't know how to make them faster.

For example, a single customer can have 700K calls in a month, and I want to do some grouping over them: get the total cost, filter by a json field, and so on.

We use indexes for this, but in this case the indexes are heavy: as I said, many records share the same value (account_name), yet the index is still large.

The billing system I use has the following tables:
CREATE TABLE cdrs_primary (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  tor VARCHAR(16) NOT NULL,
  accid VARCHAR(64) NOT NULL,
  cdrhost VARCHAR(64) NOT NULL,
  cdrsource VARCHAR(64) NOT NULL,
  reqtype VARCHAR(24) NOT NULL,
  direction VARCHAR(8) NOT NULL,
  tenant VARCHAR(64) NOT NULL,
  category VARCHAR(32) NOT NULL,
  account VARCHAR(128) NOT NULL,
  subject VARCHAR(128) NOT NULL,
  destination VARCHAR(128) NOT NULL,
  setup_time TIMESTAMP NOT NULL,
  pdd NUMERIC(12,9) NOT NULL,
  answer_time TIMESTAMP NOT NULL,
  usage NUMERIC(30,9) NOT NULL,
  supplier VARCHAR(128) NOT NULL,
  disconnect_cause VARCHAR(64) NOT NULL,
  created_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid)
);
CREATE INDEX answer_time_idx ON cdrs_primary (answer_time);
CREATE INDEX deleted_at_cp_idx ON cdrs_primary (deleted_at);

CREATE TABLE cdrs_extra (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  extra_fields jsonb NOT NULL,
  created_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid)
);
CREATE INDEX deleted_at_ce_idx ON cdrs_extra (deleted_at);

CREATE TABLE cost_details (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  runid VARCHAR(64) NOT NULL,
  tor VARCHAR(16) NOT NULL,
  direction VARCHAR(8) NOT NULL,
  tenant VARCHAR(128) NOT NULL,
  category VARCHAR(32) NOT NULL,
  account VARCHAR(128) NOT NULL,
  subject VARCHAR(128) NOT NULL,
  destination VARCHAR(128) NOT NULL,
  cost NUMERIC(20,4) NOT NULL,
  timespans jsonb,
  cost_source VARCHAR(64) NOT NULL,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid, runid)
);
CREATE INDEX deleted_at_cd_idx ON cost_details (deleted_at);

CREATE TABLE rated_cdrs (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  runid VARCHAR(64) NOT NULL,
  reqtype VARCHAR(24) NOT NULL,
  direction VARCHAR(8) NOT NULL,
  tenant VARCHAR(64) NOT NULL,
  category VARCHAR(32) NOT NULL,
  account VARCHAR(128) NOT NULL,
  subject VARCHAR(128) NOT NULL,
  destination VARCHAR(128) NOT NULL,
  setup_time TIMESTAMP NOT NULL,
  pdd NUMERIC(12,9) NOT NULL,
  answer_time TIMESTAMP NOT NULL,
  usage NUMERIC(30,9) NOT NULL,
  supplier VARCHAR(128) NOT NULL,
  disconnect_cause VARCHAR(64) NOT NULL,
  cost NUMERIC(20,4) DEFAULT NULL,
  extra_info text,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid, runid)
);
CREATE INDEX deleted_at_rc_idx ON rated_cdrs (deleted_at);
Here, account is the customer name or ID. My idea is to build a query that gives me the following information:
- cdrs_primary.account
- count(cdrs_primary.cdrhost)
- sum(rated_cdrs.cost)
- sum(rated_cdrs.usage)
- direction
This needs to be filtered by the following:

- cdrs_extra.extra_fields->>'connection'
- cost_details.timespans->0->>'MatchedDestId'
- answer_time

My idea was to use the following query:
select p.account,
       p.direction,
       date_trunc('day', r.answer_time),
       count(p.cdrhost) as host,
       sum(r.cost) as cost,
       sum(r.usage)/60 as minutes
from cdrs_primary p
join rated_cdrs r on p.cgrid = r.cgrid
join cost_details c on (r.cgrid = c.cgrid and r.runid = c.runid)
join cdrs_extra e on (e.cgrid = r.cgrid)
where r.setup_time > '2015-08-15'
  and e.extra_fields->>'connection' = '1'
  and c.timespans->0->>'MatchedDestId' IN ('UKN_FM1', 'UKN_FM2')
  and p.account = 'eloy'
group by p.account, p.direction, date_trunc('day', r.answer_time)
Here is the explain:
HashAggregate  (cost=402230.95..402230.99 rows=2 width=40) (actual time=23854.251..23854.251 rows=0 loops=1)
  Group Key: p.account, p.direction, date_trunc('day'::text, r.answer_time)
  Buffers: shared hit=24645 read=151172
  ->  Nested Loop  (cost=317806.74..402230.92 rows=2 width=40) (actual time=23854.247..23854.247 rows=0 loops=1)
        Join Filter: (p.cgrid = c.cgrid)
        Buffers: shared hit=24645 read=151172
        ->  Nested Loop  (cost=317806.18..395160.27 rows=249 width=170) (actual time=23854.246..23854.246 rows=0 loops=1)
              Buffers: shared hit=24645 read=151172
              ->  Merge Join  (cost=317805.62..320484.41 rows=2643 width=106) (actual time=23854.244..23854.244 rows=0 loops=1)
                    Merge Cond: (p.cgrid = e.cgrid)
                    Buffers: shared hit=24645 read=151172
                    ->  Sort  (cost=196879.28..198155.65 rows=510548 width=65) (actual time=19980.899..20121.328 rows=518651 loops=1)
                          Sort Key: p.cgrid
                          Sort Method: quicksort  Memory: 85224kB
                          Buffers: shared hit=1 read=116096
                          ->  Bitmap Heap Scan on cdrs_primary p  (cost=11509.18..148475.03 rows=510548 width=65) (actual time=319.521..15620.892 rows=518651 loops=1)
                                Recheck Cond: ((account)::text = 'eloy'::text)
                                Heap Blocks: exact=114107
                                Buffers: shared hit=1 read=116096
                                ->  Bitmap Index Scan on cdrs_primary_account  (cost=0.00..11381.54 rows=510548 width=0) (actual time=270.386..270.386 rows=518651 loops=1)
                                      Index Cond: ((account)::text = 'eloy'::text)
                                      Buffers: shared read=1990
                    ->  Sort  (cost=120926.33..120976.15 rows=19928 width=41) (actual time=3247.967..3284.225 rows=144073 loops=1)
                          Sort Key: e.cgrid
                          Sort Method: quicksort  Memory: 17400kB
                          Buffers: shared hit=24644 read=35076
                          ->  Seq Scan on cdrs_extra e  (cost=0.00..119503.22 rows=19928 width=41) (actual time=2275.996..2358.434 rows=144073 loops=1)
                                Filter: ((extra_fields ->> 'connection'::text) = '1'::text)
                                Rows Removed by Filter: 3841475
                                Buffers: shared hit=24644 read=35076
              ->  Index Scan using rated_cdrs_runid_cgrid_idx on rated_cdrs r  (cost=0.56..28.24 rows=1 width=64) (never executed)
                    Index Cond: (cgrid = p.cgrid)
                    Filter: (setup_time > '2015-08-15 00:00:00'::timestamp without time zone)
        ->  Index Scan using cost_details_cgrid_runid on cost_details c  (cost=0.56..28.38 rows=1 width=48) (never executed)
              Index Cond: ((cgrid = r.cgrid) AND ((runid)::text = (r.runid)::text))
              Filter: (((timespans -> 0) ->> 'MatchedDestId'::text) = ANY ('{UKN_FM1,UKN_FM2}'::text[]))
Planning time: 4.432 ms
Execution time: 23863.701 ms
I added the following indexes, but I don't think they help at all; the query is still slow.
cdrs_extra:

"cdrs_extra_connection_idx" gin ((extra_fields -> 'connection'::text))

rated_cdrs:

"rated_cdrs_pkey" PRIMARY KEY, btree (id)
"cost_idx" btree (setup_time)
"deleted_at_rc_idx" btree (deleted_at)
"rated_cdrs_runid_cgrid_idx" btree (cgrid, runid)
"setup_time_idx" btree (setup_time DESC)

cost_details:

"cost_details_cgrid_runid" btree (cgrid, runid)
"cost_details_destination_key_idx" gin (((timespans -> 0) -> 'MatchDestId'::text))
The index sizes are as follows:
 tablename    │ indexname                        │ num_rows    │ table_size │ index_size
──────────────┼──────────────────────────────────┼─────────────┼────────────┼────────────
 cdrs_extra   │ cdrs_extra_connection_idx        │ 3.98555e+06 │ 467 MB     │ 4232 kB
 cdrs_extra   │ cdrs_extra_pkey                  │ 3.98555e+06 │ 467 MB     │ 88 MB
 cdrs_extra   │ deleted_at_ce_idx                │ 3.98555e+06 │ 467 MB     │ 159 MB
 cdrs_primary │ answer_time_idx                  │ 3.98555e+06 │ 1020 MB    │ 154 MB
 cdrs_primary │ cdrs_primary_account             │ 3.98555e+06 │ 1020 MB    │ 115 MB
 cdrs_primary │ cdrs_primary_pkey                │ 3.98555e+06 │ 1020 MB    │ 85 MB
 cdrs_primary │ deleted_at_cp_idx                │ 3.98555e+06 │ 1020 MB    │ 85 MB
 cost_details │ cost_details_cgrid_runid         │ 6.80088e+06 │ 7320 MB    │ 471 MB
 cost_details │ cost_details_destination_key_idx │ 6.80088e+06 │ 7320 MB    │ 8056 kB
 cost_details │ cost_details_pkey                │ 6.80088e+06 │ 7320 MB    │ 146 MB
 cost_details │ deleted_at_cd_idx                │ 6.80088e+06 │ 7320 MB    │ 146 MB
 rated_cdrs   │ cost_idx                         │ 7.74959e+06 │ 1595 MB    │ 166 MB
 rated_cdrs   │ deleted_at_rc_idx                │ 7.74959e+06 │ 1595 MB    │ 166 MB
 rated_cdrs   │ rated_cdrs_pkey                  │ 7.74959e+06 │ 1595 MB    │ 166 MB
 rated_cdrs   │ rated_cdrs_runid_cgrid_idx       │ 7.74959e+06 │ 1595 MB    │ 537 MB
 rated_cdrs   │ setup_time_idx                   │ 7.74959e+06 │ 1595 MB    │ 166 MB
Any ideas how to speed up this query?

Thanks
First and most importantly: get rid of querying on json fields. If you know in advance which fields you will query, it's better to make them real columns.

At the cost of widening the table somewhat inefficiently, you can do this with trigger-maintained columns, and then index those columns. It will hurt your insert performance a little, but you already have GIN indexes, so you evidently don't care too much about insert performance anyway.
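A minimal sketch of that idea, assuming you want to promote the 'connection' key of cdrs_extra.extra_fields (the column, function, trigger, and index names are hypothetical):

-- Hypothetical: promote the json key to a real column, kept in sync by a trigger.
ALTER TABLE cdrs_extra ADD COLUMN connection text;

CREATE FUNCTION cdrs_extra_sync_connection() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Copy the json value into the plain column on every write.
    NEW.connection := NEW.extra_fields->>'connection';
    RETURN NEW;
END;
$$;

CREATE TRIGGER cdrs_extra_sync_connection_trg
    BEFORE INSERT OR UPDATE ON cdrs_extra
    FOR EACH ROW EXECUTE PROCEDURE cdrs_extra_sync_connection();

-- Backfill existing rows, then index the plain column.
UPDATE cdrs_extra SET connection = extra_fields->>'connection';
CREATE INDEX cdrs_extra_connection_col_idx ON cdrs_extra (connection);

The report query can then filter on e.connection = '1' and use an ordinary btree index.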
Another approach is to create an expression index on the column, for example:
CREATE INDEX cost_details_timespan_0_matcheddestid ON cost_details((timespans->0->>'MatchedDestId'));
This only helps if you always use exactly that expression in your queries. To make matching easier, it's better to create a wrapper SQL function such as cost_details_get_first_matcheddestid(jsonb), and then use it in both the index and your queries.
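A minimal sketch of such a wrapper (the index name is hypothetical; verify with EXPLAIN that the planner actually uses the index, since SQL-function inlining can affect expression matching):

-- Wrapper so the index expression and the query expression always match.
-- IMMUTABLE is required for the function to be usable in an index.
CREATE FUNCTION cost_details_get_first_matcheddestid(ts jsonb)
RETURNS text
LANGUAGE sql IMMUTABLE AS
$$ SELECT ts->0->>'MatchedDestId' $$;

CREATE INDEX cost_details_first_matcheddestid_idx
    ON cost_details (cost_details_get_first_matcheddestid(timespans));

-- Queries then use the same call:
-- ... WHERE cost_details_get_first_matcheddestid(timespans) IN ('UKN_FM1', 'UKN_FM2')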
Also look into partial indexes, for example:

CREATE INDEX cdrs_extra_cgrid_when_conn1 ON cdrs_extra (cgrid) WHERE (extra_fields->>'connection' = '1');
This can be useful if you always join against the subset of cdrs_extra that has extra_fields->>'connection' = '1'.

Try to keep the big tables slim: move unnecessary and heavily duplicated columns out into side tables, as in the sketch below.
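A purely illustrative sketch; timespans is a plausible candidate here, since cost_details is 7320 MB and the plan above reads timespans rows around 928 bytes wide:

-- Hypothetical: split the bulky jsonb column out of the hot table.
CREATE TABLE cost_details_timespans (
    cgrid     CHAR(40)    NOT NULL,
    runid     VARCHAR(64) NOT NULL,
    timespans jsonb,
    PRIMARY KEY (cgrid, runid)
);

INSERT INTO cost_details_timespans (cgrid, runid, timespans)
SELECT cgrid, runid, timespans FROM cost_details;

ALTER TABLE cost_details DROP COLUMN timespans;

The narrower cost_details then packs more rows per page, making scans and joins on it cheaper; queries that really need timespans join the side table on (cgrid, runid).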
Finally, and especially if you can tolerate some delay before the data becomes available for analysis, consider batch-loading the data into a star schema with a fact table. That sounds complicated and scary, but it really isn't, and it can dramatically improve performance for certain kinds of analytic work.
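For instance, a minimal sketch of such a fact table at daily grain, matching the grouping in the question (all names here are hypothetical):

-- Hypothetical daily-grain fact table for the report in question.
CREATE TABLE fact_calls_daily (
    call_date       date          NOT NULL,
    account         varchar(128)  NOT NULL,
    direction       varchar(8)    NOT NULL,
    matched_dest_id varchar(64),
    connection      text,
    calls           bigint        NOT NULL,
    total_cost      numeric(20,4) NOT NULL,
    total_usage     numeric(30,9) NOT NULL
);
CREATE INDEX fact_calls_daily_account_date_idx ON fact_calls_daily (account, call_date);

-- Periodic (e.g. nightly) batch load of yesterday's CDRs.
INSERT INTO fact_calls_daily
SELECT date_trunc('day', r.answer_time)::date,
       p.account,
       p.direction,
       c.timespans->0->>'MatchedDestId',
       e.extra_fields->>'connection',
       count(*),
       sum(r.cost),
       sum(r.usage)
FROM cdrs_primary p
JOIN rated_cdrs   r ON r.cgrid = p.cgrid
JOIN cost_details c ON c.cgrid = r.cgrid AND c.runid = r.runid
JOIN cdrs_extra   e ON e.cgrid = r.cgrid
WHERE r.answer_time >= date_trunc('day', now() - interval '1 day')
  AND r.answer_time <  date_trunc('day', now())
GROUP BY 1, 2, 3, 4, 5;

Reports then aggregate over this small, pre-grouped table instead of the 60M-row source tables.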
But I think there is a problem in the joins. Querying and fetching a single id takes 5 seconds, which I don't think is normal.

For example, see this query:
explain (analyze, buffers)
select p.account, p.direction, r.cost, r.usage, p.answer_time, c.timespans, e.extra_fields
from cdrs_primary p
join rated_cdrs r on p.cgrid = r.cgrid
join cost_details c on (r.cgrid = c.cgrid and r.runid = c.runid)
join cdrs_extra e on (e.cgrid = r.cgrid)
where p.cgrid = '40be31b7e4a67af15e770ff4086b47c3bbabd821'
Here is the explain:
Merge Join  (cost=1.11..290057.04 rows=46 width=930) (actual time=5152.442..5247.335 rows=2 loops=1)
  Merge Cond: ((r.runid)::text = (c.runid)::text)
  Buffers: shared hit=80834 read=109481
  ->  Nested Loop  (cost=0.56..289971.42 rows=6 width=98) (actual time=5152.376..5247.242 rows=2 loops=1)
        Buffers: shared hit=80828 read=109481
        ->  Index Scan using rated_cdrs_runid_cgrid_idx on rated_cdrs r  (cost=0.56..28.66 rows=6 width=56) (actual time=0.048..0.056 rows=2 loops=1)
              Index Cond: (cgrid = '40be31b7e4a67af15e770ff4086b47c3bbabd821'::bpchar)
              Buffers: shared hit=5
        ->  Materialize  (cost=0.00..289942.69 rows=1 width=124) (actual time=2576.162..2623.587 rows=1 loops=2)
              Buffers: shared hit=80823 read=109481
              ->  Nested Loop  (cost=0.00..289942.69 rows=1 width=124) (actual time=5152.316..5247.165 rows=1 loops=1)
                    Buffers: shared hit=80823 read=109481
                    ->  Seq Scan on cdrs_primary p  (cost=0.00..180403.33 rows=1 width=77) (actual time=3392.014..3392.017 rows=1 loops=1)
                          Filter: (cgrid = '40be31b7e4a67af15e770ff4086b47c3bbabd821'::bpchar)
                          Rows Removed by Filter: 3985545
                          Buffers: shared hit=80759 read=49825
                    ->  Seq Scan on cdrs_extra e  (cost=0.00..109539.35 rows=1 width=47) (actual time=1760.290..1855.133 rows=1 loops=1)
                          Filter: (cgrid = '40be31b7e4a67af15e770ff4086b47c3bbabd821'::bpchar)
                          Rows Removed by Filter: 3985547
                          Buffers: shared hit=64 read=59656
  ->  Materialize  (cost=0.56..84.96 rows=20 width=928) (actual time=0.055..0.061 rows=2 loops=1)
        Buffers: shared hit=6
        ->  Index Scan using cost_details_cgrid_runid on cost_details c  (cost=0.56..84.91 rows=20 width=928) (actual time=0.052..0.056 rows=2 loops=1)
              Index Cond: (cgrid = '40be31b7e4a67af15e770ff4086b47c3bbabd821'::bpchar)
              Buffers: shared hit=6
Planning time: 0.586 ms
Execution time: 5247.446 ms
Can this join be optimized? Is there any PostgreSQL parameter that would help?

Regards,