PostgreSQL

Optimizing a LATERAL JOIN query on a big table

  • August 29, 2017

I am using Postgres 9.5. I have a table that records page hits from several web sites. The table contains about 32 million rows spanning January 1, 2016 to June 30, 2016.

CREATE TABLE event_pg (
  timestamp_        timestamp without time zone NOT NULL,
  person_id         character(24),
  location_host     varchar(256),
  location_path     varchar(256),
  location_query    varchar(256),
  location_fragment varchar(256)
);

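I am trying to tune a query that counts the number of people who performed a given sequence of page hits. The query is meant to answer questions like "how many people viewed the home page, then went to the help site, and then viewed the thank-you page?" The result looks like this: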

╔════════════╦════════════╦═════════════╗
║  home-page ║ help site  ║ thankyou    ║
╠════════════╬════════════╬═════════════╣
║ 10000      ║ 9800       ║1500         ║
╚════════════╩════════════╩═════════════╝

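Notice that the numbers decrease, which makes sense: of the 10000 people who viewed the home page, 9800 went on to the help site, and of those, 1500 went on to hit the thank-you page.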

The SQL for a 3-step sequence uses lateral joins as follows:

SELECT 
 sum(view_homepage) AS view_homepage,
 sum(use_help) AS use_help,
 sum(thank_you) AS thank_you
FROM (
 -- Get the first time each user viewed the homepage.
 SELECT X.person_id,
   1 AS view_homepage,
   min(timestamp_) AS view_homepage_time
 FROM event_pg X 
 WHERE X.timestamp_ between '2016-04-23 00:00:00.0' and timestamp '2016-04-30 23:59:59.999'
 AND X.location_host like '2015.testonline.ca'
 GROUP BY X.person_id
) e1 
LEFT JOIN LATERAL (
 SELECT
   Y.person_id,
   1 AS use_help,
   timestamp_ AS use_help_time
 FROM event_pg Y 
 WHERE 
   Y.person_id = e1.person_id AND
   location_host = 'helpcentre.testonline.ca' AND
   timestamp_ BETWEEN view_homepage_time AND timestamp '2016-04-30 23:59:59.999'
 ORDER BY timestamp_
 LIMIT 1
) e2 ON true 
LEFT JOIN LATERAL (
 SELECT
   1 AS thank_you,
   timestamp_ AS thank_you_time
 FROM event_pg Z 
 WHERE Z.person_id = e2.person_id AND
   location_fragment =  '/file/thank-you' AND
   timestamp_ BETWEEN use_help_time AND timestamp '2016-04-30 23:59:59.999'
 ORDER BY timestamp_
 LIMIT 1
) e3 ON true;

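I have an index on the timestamp_, person_id, and location columns. Queries over date ranges of a few days or weeks are very fast (1s to 10s). It gets slow when I run the query for everything between January 1 and July 30: that takes over a minute. Comparing the EXPLAIN output for the two cases shows that the long-range query no longer uses the timestamp_ index and does a Seq Scan instead. The index buys nothing there, because we are querying over "all time" and therefore touching nearly every row in the table.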

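I realize that the nested-loop nature of lateral joins slows down as the number of rows to loop over grows, but is there any way to speed up this query for huge date ranges so that it scales better?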

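Preliminary notes

  • LIKE without wildcards is just an equality test, so use = instead. Replace:

x.location_host LIKE '2015.testonline.ca'

with:

x.location_host = '2015.testonline.ca'

  • Use count(e1.*) or count(*) instead of adding a dummy column with the value 1 to each subquery. (Except for the last one (e3), where you don't need any actual data.)
  • You sometimes cast string literals to timestamp (timestamp '2016-04-30 23:59:59.999') and sometimes you don't ('2016-04-23 00:00:00.0'), which is inconsistent. Either it makes sense, then do it all the time, or it doesn't, then don't do it.

It doesn't. When compared to a timestamp column, a string literal is coerced to timestamp anyway, so you don't need an explicit cast.

  • The Postgres data type timestamp has up to 6 fractional digits. Your BETWEEN expressions leave corner cases. I replaced them with less error-prone expressions.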
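To illustrate the corner case from the last note, here is a minimal, self-contained sketch. The sample value is made up for the demonstration; it falls in the last half millisecond of April 30, which the BETWEEN bound with three fractional digits misses and the half-open range catches.

```sql
-- Hypothetical sample value: a timestamp with more than 3 fractional digits.
SELECT ts
     , ts BETWEEN '2016-04-23 00:00:00.0'
              AND '2016-04-30 23:59:59.999'        AS in_between    -- false
     , (ts >= '2016-04-23' AND ts < '2016-05-01')   AS in_half_open  -- true
FROM  (VALUES (timestamp '2016-04-30 23:59:59.9995')) t(ts);
```

The untyped string literals are coerced to timestamp because they are compared to a timestamp value, so no explicit cast is needed here either.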

Indexes

Important: To optimize performance, create multicolumn indexes.

For the first subquery hp:

CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);

Or, if you can get index-only scans out of it, append person_id to the index:

CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);

For very large time ranges spanning most or all of the table, this index should be preferable. It also supports the hlp subquery, so create it either way:

CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);

For tnk:

CREATE INDEX event_pg_location_fragment_person_id_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);

Optimizing with partial indexes

If your predicates on location_host and location_fragment are constant, we can use much cheaper partial indexes, especially since your location_* columns seem big:

CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = '2015.testonline.ca';

CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = 'helpcentre.testonline.ca';

CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_fragment = '/file/thank-you';

All of these indexes would also be substantially smaller and faster with integer or bigint for person_id instead of character(24).

Typically, you need to ANALYZE the table after creating a new index, or wait until autovacuum kicks in and does it for you.

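To get index-only scans, your table must be vacuumed enough. Test immediately after a VACUUM as proof of concept. If you are not familiar with index-only scans, read the Postgres Wiki page on them for details.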
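One way to run that proof of concept is sketched below, assuming the event_pg table and one of the indexes above already exist. The SELECT is only an illustrative stand-in for the hp subquery, not part of the final query.

```sql
-- Refresh the visibility map and the planner statistics in one go.
VACUUM ANALYZE event_pg;

-- Inspect the plan: look for an "Index Only Scan" node and a low
-- "Heap Fetches" count in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT person_id, min(timestamp_) AS hp_ts
FROM   event_pg
WHERE  location_host = '2015.testonline.ca'
AND    timestamp_ >= '2016-04-23'
AND    timestamp_ <  '2016-05-01'
GROUP  BY person_id;
```

If the plan still shows many heap fetches right after VACUUM, index-only scans are unlikely to pay off for this table under its normal write load.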

Basic query

Implementing what I discussed. Query for small ranges (few rows per person_id):

SELECT count(*)::int           AS view_homepage
    , count(hlp.hlp_ts)::int AS use_help
    , count(tnk.yes)::int     AS thank_you
FROM  (
  SELECT DISTINCT ON (person_id)
         person_id, timestamp_ AS hp_ts
  FROM   event_pg
  WHERE  timestamp_ >= '2016-04-23'
  AND    timestamp_ <  '2016-05-01'
  AND    location_host = '2015.testonline.ca'
  ORDER  BY person_id, timestamp_
  ) hp
LEFT JOIN LATERAL (
  SELECT timestamp_ AS hlp_ts
  FROM   event_pg y 
  WHERE  y.person_id = hp.person_id
  AND    timestamp_ >= hp.hp_ts
  AND    timestamp_ <  '2016-05-01'
  AND    location_host = 'helpcentre.testonline.ca'
  ORDER  BY timestamp_
  LIMIT  1
  ) hlp ON true 
LEFT JOIN LATERAL (
  SELECT true AS yes                   -- we only need existence
  FROM   event_pg z
  WHERE  z.person_id = hp.person_id    -- we can use hp here
  AND    location_fragment = '/file/thank-you'
  AND    timestamp_ >= hlp.hlp_ts      -- this introduces dependency on hlp anyways.
  AND    timestamp_ <  '2016-05-01'
  ORDER  BY timestamp_
  LIMIT  1
  ) tnk ON true;

DISTINCT ON is typically cheaper with few rows per person_id.
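For readers new to DISTINCT ON, here is a small self-contained sketch of what the hp subquery does. The three sample rows are invented for the demonstration; the query keeps the earliest timestamp_ for each person_id.

```sql
-- Hypothetical sample rows standing in for event_pg.
SELECT DISTINCT ON (person_id)
       person_id, timestamp_ AS hp_ts
FROM  (VALUES
         ('a', timestamp '2016-04-24 10:00')
       , ('a', timestamp '2016-04-23 09:00')
       , ('b', timestamp '2016-04-25 12:00')
      ) v(person_id, timestamp_)
ORDER  BY person_id, timestamp_;
-- Returns two rows: ('a', 2016-04-23 09:00:00) and ('b', 2016-04-25 12:00:00)
```

The leading ORDER BY expressions must match the DISTINCT ON expressions; the remaining ones decide which row per group survives.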

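If there are many rows per person_id (more likely for bigger time ranges), the recursive CTE discussed in chapter 1a of the linked answer can be (much) faster. See it integrated below.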

Optimize and automate the best query

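It's the old conundrum: one query technique is best for smaller sets, another for bigger ones. In your particular case we have a very good indicator from the start, the length of the given time period, which we can use to decide.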

We wrap it all in a PL/pgSQL function. My implementation switches from DISTINCT ON to the rCTE when the given time period is longer than a set threshold:

CREATE OR REPLACE FUNCTION f_my_counts(_ts_low_inc timestamp, _ts_hi_excl timestamp)
 RETURNS TABLE (view_homepage int, use_help int, thank_you int) AS
$func$
BEGIN

CASE
WHEN _ts_hi_excl <= _ts_low_inc THEN
  RAISE EXCEPTION 'Timestamp _ts_hi_excl (2nd param) must be later than _ts_low_inc!';

WHEN _ts_hi_excl - _ts_low_inc < interval '10 days' THEN  -- example value !!!
-- DISTINCT ON for few rows per person_id
  RETURN QUERY
  WITH hp AS (
     SELECT DISTINCT ON (person_id)
            person_id, timestamp_ AS hp_ts
     FROM   event_pg
     WHERE  timestamp_ >= _ts_low_inc
     AND    timestamp_ <  _ts_hi_excl
     AND    location_host = '2015.testonline.ca'
     ORDER  BY person_id, timestamp_
     )
   , hlp AS (
     SELECT hp.person_id, hlp.hlp_ts
     FROM   hp
     CROSS  JOIN LATERAL (
        SELECT timestamp_ AS hlp_ts
        FROM   event_pg
        WHERE  person_id = hp.person_id
        AND    timestamp_ >= hp.hp_ts
        AND    timestamp_ < _ts_hi_excl
        AND    location_host = 'helpcentre.testonline.ca'  -- match partial idx
        ORDER  BY timestamp_
        LIMIT  1
        ) hlp
     )
  SELECT (SELECT count(*)::int FROM hp)   -- AS view_homepage
       , (SELECT count(*)::int FROM hlp)  -- AS use_help
       , (SELECT count(*)::int            -- AS thank_you
          FROM   hlp
          CROSS  JOIN LATERAL (
             SELECT 1                     -- we only care for existence
             FROM   event_pg
             WHERE  person_id = hlp.person_id
             AND    location_fragment = '/file/thank-you'
             AND    timestamp_ >= hlp.hlp_ts
             AND    timestamp_ < _ts_hi_excl
             ORDER  BY timestamp_
             LIMIT  1
             ) tnk
          );

ELSE
-- rCTE for many rows per person_id
  RETURN QUERY
  WITH RECURSIVE hp AS (
     (  -- parentheses required
     SELECT person_id, timestamp_ AS hp_ts
     FROM   event_pg
     WHERE  timestamp_ >= _ts_low_inc
     AND    timestamp_ <  _ts_hi_excl
     AND    location_host = '2015.testonline.ca'  -- match partial idx
     ORDER  BY person_id, timestamp_
     LIMIT  1
     )
     UNION ALL
     SELECT x.*
     FROM   hp, LATERAL (
        SELECT person_id, timestamp_ AS hp_ts
        FROM   event_pg
        WHERE  person_id  > hp.person_id  -- lateral reference
        AND    timestamp_ >= _ts_low_inc  -- repeat conditions
        AND    timestamp_ <  _ts_hi_excl
        AND    location_host = '2015.testonline.ca'  -- match partial idx
        ORDER  BY person_id, timestamp_
        LIMIT  1
        ) x
     )
   , hlp AS (
     SELECT hp.person_id, hlp.hlp_ts
     FROM   hp
     CROSS  JOIN LATERAL (
        SELECT timestamp_ AS hlp_ts
        FROM   event_pg y 
        WHERE  y.person_id = hp.person_id
        AND    location_host = 'helpcentre.testonline.ca'  -- match partial idx
        AND    timestamp_ >= hp.hp_ts
        AND    timestamp_ < _ts_hi_excl
        ORDER  BY timestamp_
        LIMIT  1
        ) hlp
     )
  SELECT (SELECT count(*)::int FROM hp)   -- AS view_homepage
       , (SELECT count(*)::int FROM hlp)  -- AS use_help
       , (SELECT count(*)::int            -- AS thank_you
          FROM   hlp
          CROSS  JOIN LATERAL (
             SELECT 1                     -- we only care for existence
             FROM   event_pg
             WHERE  person_id = hlp.person_id
             AND    location_fragment = '/file/thank-you'
             AND    timestamp_ >= hlp.hlp_ts
             AND    timestamp_ < _ts_hi_excl
             ORDER  BY timestamp_
             LIMIT  1
             ) tnk
          );
END CASE;

END
$func$  LANGUAGE plpgsql STABLE STRICT;

Call:

SELECT * FROM f_my_counts('2016-01-23', '2016-05-01');
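The call above spans more than 10 days, so with the example threshold it takes the rCTE branch. As a sketch of the other paths through the function (assuming f_my_counts has been created as shown and the threshold is left at the example value of 10 days):

```sql
-- 8-day range: shorter than the example threshold, so the DISTINCT ON branch runs.
SELECT * FROM f_my_counts('2016-04-23', '2016-05-01');

-- Upper bound not later than lower bound: the first CASE branch raises an exception.
SELECT * FROM f_my_counts('2016-05-01', '2016-04-23');
```

Running the same range with different threshold values is one way to find the crossover point for your data.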

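An rCTE works with a CTE by definition. I also added CTEs to the DISTINCT ON query (as I discussed with @Lennart in the comments). This lets us use CROSS JOIN instead of LEFT JOIN and reduce the set with each step, since we can count each CTE separately. This has effects working in opposite directions: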

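  • On the one hand, we reduce the number of rows, which should make the third join cheaper.
  • On the other hand, we introduce overhead for the CTEs and need considerably more RAM, which may matter especially for big queries like yours.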

You'll have to test which outweighs the other.

Source: https://dba.stackexchange.com/questions/143044