Optimizing a LATERAL JOIN query on a large table
I am using Postgres 9.5. I have a table that records page hits from several websites. It contains about 32 million rows spanning January 1, 2016 to June 30, 2016.
CREATE TABLE event_pg (
    timestamp_        timestamp without time zone NOT NULL,
    person_id         character(24),
    location_host     varchar(256),
    location_path     varchar(256),
    location_query    varchar(256),
    location_fragment varchar(256)
);
I am trying to tune a query that counts the number of people who performed a given sequence of page hits. The query is meant to answer questions like "How many people viewed the home page, then went to the help site, and then viewed the thank-you page?" The result looks like this:
╔════════════╦════════════╦════════════╗
║ home-page  ║ help site  ║ thankyou   ║
╠════════════╬════════════╬════════════╣
║ 10000      ║ 9800       ║ 1500       ║
╚════════════╩════════════╩════════════╝
Note how the numbers decrease, which makes sense: of the 10,000 people who viewed the home page, 9,800 went on to the help site, and 1,500 of those went on to the thank-you page.
The SQL for the 3-step sequence uses lateral joins and looks like this:
SELECT sum(view_homepage) AS view_homepage
     , sum(use_help)      AS use_help
     , sum(thank_you)     AS thank_you
FROM (
   -- Get the first time each user viewed the homepage.
   SELECT X.person_id
        , 1 AS view_homepage
        , min(timestamp_) AS view_homepage_time
   FROM   event_pg X
   WHERE  X.timestamp_ between '2016-04-23 00:00:00.0' and timestamp '2016-04-30 23:59:59.999'
   AND    X.location_host like '2015.testonline.ca'
   GROUP  BY X.person_id
   ) e1
LEFT JOIN LATERAL (
   SELECT Y.person_id
        , 1 AS use_help
        , timestamp_ AS use_help_time
   FROM   event_pg Y
   WHERE  Y.person_id = e1.person_id
   AND    location_host = 'helpcentre.testonline.ca'
   AND    timestamp_ BETWEEN view_homepage_time AND timestamp '2016-04-30 23:59:59.999'
   ORDER  BY timestamp_
   LIMIT  1
   ) e2 ON true
LEFT JOIN LATERAL (
   SELECT 1 AS thank_you
        , timestamp_ AS thank_you_time
   FROM   event_pg Z
   WHERE  Z.person_id = e2.person_id
   AND    location_fragment = '/file/thank-you'
   AND    timestamp_ BETWEEN use_help_time AND timestamp '2016-04-30 23:59:59.999'
   ORDER  BY timestamp_
   LIMIT  1
   ) e3 ON true;
I have indexes on the timestamp_, person_id and location columns. Queries over a date range of a few days or weeks are very fast (1 s to 10 s). When I run the query over everything between January 1 and July 30, it slows down to well over a minute. If you compare the two explain plans below, you can see that it no longer uses the timestamp_ index and does a Seq Scan instead, because the index doesn't buy us anything when we are querying "all time", i.e. nearly every record in the table.
- Explain for a date range of a few days (uses the index): https://explain.depesz.com/s/2tOi
- Explain for 6 months (seq scan): https://explain.depesz.com/s/c0yq
Now, I realize that the nested-loop nature of the lateral joins gets slower the more records they have to loop over, but is there any way to speed up this query for huge date ranges so that it scales better?
Preliminary notes
- You are using odd data types. character(24)? char(n) is a legacy type and almost always the wrong choice. You have an index on person_id and join on it repeatedly. integer would be more efficient for several reasons. (Or bigint, if you plan to burn through more than 2 billion of them over the lifetime of the table.) One possible migration path is sketched after this list.
- LIKE without wildcards is pointless. Use = instead. It's faster.

  x.location_host LIKE '2015.testonline.ca'
  x.location_host =    '2015.testonline.ca'
- Use count(e1.*) or count(*) instead of adding a dummy column with the value 1 to every subquery. (Except for the last one (e3), you don't need any actual data.)
- You sometimes cast string literals to timestamp (timestamp '2016-04-30 23:59:59.999') and sometimes not ('2016-04-23 00:00:00.0'). That's inconsistent. Either it matters, then do it all the time, or it doesn't, then don't do it at all. It doesn't: a string literal is coerced to timestamp anyway when compared to a timestamp column, so you don't need the explicit cast.
- The Postgres data type timestamp has at most 6 fractional digits. Your BETWEEN expressions leave corner cases open. I replaced them with less error-prone range expressions; see the example right after this list.
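For example, a minimal sketch using the dates from your own query (the exclusive upper bound is simply the first instant of the following day, which is what the rewritten queries below use):

-- Fragile: a hit at '2016-04-30 23:59:59.999500' slips through
WHERE timestamp_ BETWEEN '2016-04-23 00:00:00.0' AND timestamp '2016-04-30 23:59:59.999'

-- Robust half-open range: lower bound inclusive, upper bound exclusive
WHERE timestamp_ >= '2016-04-23'
AND   timestamp_ <  '2016-05-01'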
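As for the data type note above, one possible migration path to an integer key could look roughly like this. This is only a sketch: the names person, legacy_id and person_fk are hypothetical, and backfilling 32 million rows rewrites the table, so plan for locking and bloat.

-- Hypothetical mapping table from the old char(24) IDs to surrogate integer keys
CREATE TABLE person (
   person_id serial PRIMARY KEY,
   legacy_id character(24) UNIQUE NOT NULL
);

INSERT INTO person (legacy_id)
SELECT DISTINCT person_id FROM event_pg;

-- Add the new column and backfill it
ALTER TABLE event_pg ADD COLUMN person_fk integer;

UPDATE event_pg e
SET    person_fk = p.person_id
FROM   person p
WHERE  p.legacy_id = e.person_id;

-- After verifying the backfill, drop the old column and rename the new one:
-- ALTER TABLE event_pg DROP COLUMN person_id;
-- ALTER TABLE event_pg RENAME COLUMN person_fk TO person_id;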
Indexes

Important for performance: create multicolumn indexes.
For the first subquery hp:

CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);
Or, if you can get index-only scans out of it, append person_id to the index:

CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);
For very large time ranges spanning most or all of the table, this index should be preferable. It also supports the hlp subquery, so create it in any case:

CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);
For tnk:

CREATE INDEX event_pg_location_fragment_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);
Optimize with partial indexes
If your predicates on location_host and location_fragment are constants, we can work with cheaper partial indexes instead, especially since your location_* columns look big:
CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_host = '2015.testonline.ca';

CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_host = 'helpcentre.testonline.ca';

CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_fragment = '/file/thank-you';
Consider:

- Again, all of these indexes are substantially smaller and faster with integer or bigint for person_id.
- Typically, you'll want to ANALYZE the table after creating new indexes, or wait until autovacuum does that for you.
- To get index-only scans, your table has to be vacuumed enough. Test right after a VACUUM as proof of concept; see the sketch below. If you are not familiar with index-only scans, read the Postgres Wiki page on index-only scans for details.
Basic query

Implementing what I discussed above. The query for small time ranges (few rows per person_id):

SELECT count(*)::int          AS view_homepage
     , count(hlp.hlp_ts)::int AS use_help
     , count(tnk.yes)::int    AS thank_you
FROM  (
   SELECT DISTINCT ON (person_id)
          person_id, timestamp_ AS hp_ts
   FROM   event_pg
   WHERE  timestamp_ >= '2016-04-23'
   AND    timestamp_ <  '2016-05-01'
   AND    location_host = '2015.testonline.ca'
   ORDER  BY person_id, timestamp_
   ) hp
LEFT  JOIN LATERAL (
   SELECT timestamp_ AS hlp_ts
   FROM   event_pg y
   WHERE  y.person_id = hp.person_id
   AND    timestamp_ >= hp.hp_ts
   AND    timestamp_ <  '2016-05-01'
   AND    location_host = 'helpcentre.testonline.ca'
   ORDER  BY timestamp_
   LIMIT  1
   ) hlp ON true
LEFT  JOIN LATERAL (
   SELECT true AS yes                  -- we only need existence
   FROM   event_pg z
   WHERE  z.person_id = hp.person_id   -- we can use hp here
   AND    location_fragment = '/file/thank-you'
   AND    timestamp_ >= hlp.hlp_ts     -- this introduces a dependency on hlp anyway
   AND    timestamp_ <  '2016-05-01'
   ORDER  BY timestamp_
   LIMIT  1
   ) tnk ON true;
DISTINCT ON is typically cheaper for few rows per person_id. If there are many rows per person_id (more likely for bigger time ranges), a recursive CTE (rCTE) emulating a loose index scan can be (much) faster: see the integration below.
Optimize and automate the best query
This is an old dilemma: one query technique works best for smaller sets, another for bigger sets. In your particular case we have a very good indicator from the start, the length of the given time period, that we can use to decide.
We wrap it all in a PL/pgSQL function.
My implementation switches from DISTINCT ON to the rCTE when the given time period is longer than a set threshold:

CREATE OR REPLACE FUNCTION f_my_counts(_ts_low_inc timestamp, _ts_hi_excl timestamp)
  RETURNS TABLE (view_homepage int, use_help int, thank_you int) AS
$func$
BEGIN
CASE
WHEN _ts_hi_excl <= _ts_low_inc THEN
   RAISE EXCEPTION 'Timestamp _ts_hi_excl (2nd param) must be later than _ts_low_inc!';

WHEN _ts_hi_excl - _ts_low_inc < interval '10 days' THEN  -- example value !!!
   -- DISTINCT ON for few rows per person_id
   RETURN QUERY
   WITH hp AS (
      SELECT DISTINCT ON (person_id)
             person_id, timestamp_ AS hp_ts
      FROM   event_pg
      WHERE  timestamp_ >= _ts_low_inc
      AND    timestamp_ <  _ts_hi_excl
      AND    location_host = '2015.testonline.ca'
      ORDER  BY person_id, timestamp_
      )
   , hlp AS (
      SELECT hp.person_id, hlp.hlp_ts
      FROM   hp
      CROSS  JOIN LATERAL (
         SELECT timestamp_ AS hlp_ts
         FROM   event_pg
         WHERE  person_id = hp.person_id
         AND    timestamp_ >= hp.hp_ts
         AND    timestamp_ <  _ts_hi_excl
         AND    location_host = 'helpcentre.testonline.ca'  -- match partial idx
         ORDER  BY timestamp_
         LIMIT  1
         ) hlp
      )
   SELECT (SELECT count(*)::int FROM hp)   -- AS view_homepage
        , (SELECT count(*)::int FROM hlp)  -- AS use_help
        , (SELECT count(*)::int            -- AS thank_you
           FROM   hlp
           CROSS  JOIN LATERAL (
              SELECT 1  -- we only care for existence
              FROM   event_pg
              WHERE  person_id = hlp.person_id
              AND    location_fragment = '/file/thank-you'
              AND    timestamp_ >= hlp.hlp_ts
              AND    timestamp_ <  _ts_hi_excl
              ORDER  BY timestamp_
              LIMIT  1
              ) tnk
          );

ELSE
   -- rCTE for many rows per person_id
   RETURN QUERY
   WITH RECURSIVE hp AS (
      (  -- parentheses required
      SELECT person_id, timestamp_ AS hp_ts
      FROM   event_pg
      WHERE  timestamp_ >= _ts_low_inc
      AND    timestamp_ <  _ts_hi_excl
      AND    location_host = '2015.testonline.ca'  -- match partial idx
      ORDER  BY person_id, timestamp_
      LIMIT  1
      )
      UNION ALL
      SELECT x.*
      FROM   hp
      ,      LATERAL (
         SELECT person_id, timestamp_ AS hp_ts
         FROM   event_pg
         WHERE  person_id > hp.person_id              -- lateral reference
         AND    timestamp_ >= _ts_low_inc             -- repeat conditions
         AND    timestamp_ <  _ts_hi_excl
         AND    location_host = '2015.testonline.ca'  -- match partial idx
         ORDER  BY person_id, timestamp_
         LIMIT  1
         ) x
      )
   , hlp AS (
      SELECT hp.person_id, hlp.hlp_ts
      FROM   hp
      CROSS  JOIN LATERAL (
         SELECT timestamp_ AS hlp_ts
         FROM   event_pg y
         WHERE  y.person_id = hp.person_id
         AND    location_host = 'helpcentre.testonline.ca'  -- match partial idx
         AND    timestamp_ >= hp.hp_ts
         AND    timestamp_ <  _ts_hi_excl
         ORDER  BY timestamp_
         LIMIT  1
         ) hlp
      )
   SELECT (SELECT count(*)::int FROM hp)   -- AS view_homepage
        , (SELECT count(*)::int FROM hlp)  -- AS use_help
        , (SELECT count(*)::int            -- AS thank_you
           FROM   hlp
           CROSS  JOIN LATERAL (
              SELECT 1  -- we only care for existence
              FROM   event_pg
              WHERE  person_id = hlp.person_id
              AND    location_fragment = '/file/thank-you'
              AND    timestamp_ >= hlp.hlp_ts
              AND    timestamp_ <  _ts_hi_excl
              ORDER  BY timestamp_
              LIMIT  1
              ) tnk
          );
END CASE;
END
$func$  LANGUAGE plpgsql STABLE STRICT;
Call:
SELECT * FROM f_my_counts('2016-01-23', '2016-05-01');
The rCTE works with CTEs by definition. I added CTEs to the DISTINCT ON query as well (as discussed with @Lennart in the comments), which allows us to use CROSS JOIN instead of LEFT JOIN and to reduce the set at every step, because we can count each CTE separately. This cuts both ways:
- On the one hand, we reduce the number of rows, which makes the third join cheaper.
- On the other hand, we add the overhead of the CTEs and need more RAM, which can matter especially for big queries like yours.

You'll have to test which effect outweighs the other; one way to do that is sketched below.
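A simple way to test, as a sketch assuming psql: run the same period through both code paths by temporarily recreating the function with a different threshold, and compare the timings.

\timing on

-- With the threshold at interval '10 days', a long range takes the rCTE branch:
SELECT * FROM f_my_counts('2016-01-01', '2016-07-01');

-- Recreate the function with a much larger threshold (e.g. interval '1 year'),
-- then run the identical call to time the DISTINCT ON branch on the same range:
SELECT * FROM f_my_counts('2016-01-01', '2016-07-01');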