PostgreSQL

SQL query that can count days based on an events query

  • December 4, 2021

I have a table with the following data:

id    date_changed                  color_start         color_end
-------------------------------------------------------------------

1     2020-05-27 16:33:52.000       green                yellow
1     2020-06-11 20:12:18.000       yellow               red
1     2020-06-11 20:20:58.000       red                  green
2     2021-03-03 14:31:44.000       yellow               red
2     2020-08-06 14:59:21.000       green                yellow              
3     2021-04-28 12:36:45.000       green                red
...

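For example, the item with id #2 changed from green to yellow at 2020-08-06 14:59:21, and then from yellow to red at 2021-03-03 14:31:44. I need to count how many items were in the green, yellow and red states between two points in time.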

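After doing some research I tried the following query, which basically lists the events for each day over the past year, but that is not really what I want.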

SELECT d.date, items.id,
count(CASE WHEN items.color_end = 'yellow' THEN 1 ELSE null END) as yellow_count,
count(CASE WHEN items.color_end = 'green' THEN 1 ELSE null END) as green_count,
count(CASE WHEN items.color_end = 'red' THEN 1 ELSE null END) as red_count,
count(CASE WHEN items.color_end = 'yellow' THEN 1 ELSE null end) + 
count(CASE WHEN items.color_end = 'green' THEN 1 ELSE null END) + 
count(CASE WHEN items.color_end = 'red' THEN 1 ELSE null END) as total_count
FROM (SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date 
     FROM generate_series(0, 365, 1) AS offs
    ) d LEFT OUTER JOIN
    events items
    ON d.date = to_char(date_trunc('day', items.date_changed), 'YYYY-MM-DD')
GROUP BY d.date, items.id;

Your problem is that your data model is flawed.

You have 6 records:

id    date_changed                  color_start         color_end
-------------------------------------------------------------------

1     2020-05-27 16:33:52.000       green                yellow
1     2020-06-11 20:12:18.000       yellow               red
1     2020-06-11 20:20:58.000       red                  green
2     2021-03-03 14:31:44.000       yellow               red
2     2020-08-06 14:59:21.000       green                yellow              
3     2021-04-28 12:36:45.000       green                red

There should be 9 records. Your record structure should therefore look like this (example shown for id = 1 only):

id  date_from               date_to                 status

1   -infinity               2020-05-27 16:33:52+01  green
1   2020-05-27 16:33:52+01  2020-06-11 20:12:18+01  yellow
1   2020-06-11 20:12:18+01  2020-06-11 20:20:58+01  red
1   2020-06-11 20:20:58+01  +infinity               green

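You are trying to record two states in a single record, using two different fields. There are therefore two possible refactoring solutions, both worked through in this answer: two TIMESTAMPTZ fields per record, or one TSTZRANGE field per record.

You might also want to consider what I call "ancillary" solutions - i.e. rolling your own with triggers, or possibly Temporal Tables - which are currently only available as (non-contrib) PostgreSQL extensions. If I have any criticisms of PostgreSQL, one of them is that this hasn't (yet) been implemented natively!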

We start with the table and data:

--
--  Original table design from the OP
--

CREATE TABLE test
(
 id INT NOT NULL,
 dc TIMESTAMPTZ(0) NOT NULL,  -- all times are xx:yy:zz.000, so use a precision of 0
 cs TEXT NOT NULL,
 ce TEXT NOT NULL,
 
 UNIQUE (id, dc),
 CHECK (cs != ce)
);

The records:

--
--  OP's data
--

INSERT INTO test VALUES
(1, '2020-05-27 16:33:52', 'green',  'yellow'),
(1, '2020-06-11 20:12:18', 'yellow', 'red'),
(1, '2020-06-11 20:20:58', 'red',    'green'),

(2, '2020-08-06 14:59:21', 'green',  'yellow'),              
(2, '2021-03-03 14:31:44', 'yellow', 'red'),

(3, '2021-04-28 12:36:45', 'green',  'red');

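First solution - two TIMESTAMPTZ fields per record (see the fiddle):

We start by using the LAG() and LEAD() PostgreSQL window functions. Window functions are very powerful and well worth getting to know. They will repay many times over any effort spent studying them.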

--
--  "Foundation" query - using this as a Common Table Expression, we can refactor our
--  data - or better yet, refactor our base tables in our system.
--

SELECT
 id,
 COALESCE(LAG(dc)  OVER (PARTITION BY id ORDER BY dc), '-INFINITY') AS lag_dc,
 cs,
 dc,
 COALESCE(LEAD(dc) OVER (PARTITION BY id ORDER BY dc),  'INFINITY') AS lead_dc,
 ce
FROM
 test;

Result (better viewed on the fiddle):

id  lag_dc                  cs      dc                      lead_dc                 ce
1   -infinity               green   2020-05-27 16:33:52+01  2020-06-11 20:12:18+01  yellow
1   2020-05-27 16:33:52+01  yellow  2020-06-11 20:12:18+01  2020-06-11 20:20:58+01  red
1   2020-06-11 20:12:18+01  red     2020-06-11 20:20:58+01  infinity                green
2   -infinity               green   2020-08-06 14:59:21+01  2021-03-03 14:31:44+00  yellow
2   2020-08-06 14:59:21+01  yellow  2021-03-03 14:31:44+00  infinity                red
3   -infinity               green   2021-04-28 12:36:45+01  infinity                red

I won't reproduce every query from the fiddle here - this is the first one:

--
-- We get the first record of each set (by id) - from '-INFINITY' to the first 
-- date_changed (dc)
--

WITH cte1 AS
(
 SELECT
   id,
   COALESCE(LAG(dc)  OVER (PARTITION BY id ORDER BY dc), '-INFINITY') AS lag_dc,
   cs,
   dc,
   COALESCE(LEAD(dc) OVER (PARTITION BY id ORDER BY dc),  'INFINITY') AS lead_dc,
   ce
 FROM
   test
)
SELECT
 c1.id, c1.lag_dc AS df, c1.dc AS dt, c1.cs
FROM cte1 c1 WHERE lag_dc = '-INFINITY';

Result:

id      df              dt               cs
1   -infinity   2020-05-27 16:33:52+01  green
2   -infinity   2020-08-06 14:59:21+01  green
3   -infinity   2021-04-28 12:36:45+01  green

We then UNION three such queries:

  • the first record for each id,
  • the middle records for each id,
  • the last record for each id,

as follows:

--
-- We now obtain the union of all 3 sets and we have our result!
--

WITH cte AS
(
 SELECT
   id,
   COALESCE(LAG(dc)  OVER (PARTITION BY id ORDER BY dc), '-INFINITY') AS lag_dc,
   cs,
   dc,
   COALESCE(LEAD(dc) OVER (PARTITION BY id ORDER BY dc),  'INFINITY') AS lead_dc,
   ce
 FROM
   test
)
SELECT 
 c1.id, 
 c1.lag_dc AS "Date from:", 
 c1.dc     AS "Date to:", 
 c1.cs     AS "Colour"  
FROM cte c1 
WHERE lag_dc = '-INFINITY'  -- first records
UNION ALL
SELECT  
 c2.id, c2.dc, c2.lead_dc, c2.ce FROM cte c2 WHERE lead_dc = 'INFINITY'  -- last records
UNION ALL
SELECT 
 c3.id, c3.dc, c3.lead_dc, c3.ce FROM cte c3  WHERE c3.lead_dc != 'INFINITY'  -- middle records
ORDER BY 1, 2;

Result:

id  Date from:               Date to:               Colour
1  -infinity                2020-05-27 16:33:52+01 green
1  2020-05-27 16:33:52+01   2020-06-11 20:12:18+01 yellow
1  2020-06-11 20:12:18+01   2020-06-11 20:20:58+01 red
1  2020-06-11 20:20:58+01  infinity                green
2  -infinity                2020-08-06 14:59:21+01 green
2  2020-08-06 14:59:21+01   2021-03-03 14:31:44+00 yellow
2  2021-03-03 14:31:44+00   infinity               red
3  -infinity                2021-04-28 12:36:45+01 green
3  2021-04-28 12:36:45+01   infinity               red
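
The example queries in the rest of this section read from a table called test_rs with the columns ts_from, ts_to and colour, whose definition is only on the fiddle. As a rough sketch, it could be materialised from the same window-function logic along the following lines. The column names and types are assumptions taken from the result headers, and the fiddle's actual definition may differ:

```sql
-- Hypothetical definition of test_rs (the real one is on the fiddle).
-- Assumes the table "test" and its six rows from earlier in this answer.
CREATE TABLE test_rs AS
WITH cte AS
(
  SELECT
    id,
    LAG(dc) OVER (PARTITION BY id ORDER BY dc) AS lag_dc,
    cs,
    dc,
    COALESCE(LEAD(dc) OVER (PARTITION BY id ORDER BY dc), 'INFINITY') AS lead_dc,
    ce
  FROM test
)
-- the state each id was in before its first recorded change
SELECT id, '-INFINITY'::TIMESTAMPTZ AS ts_from, dc AS ts_to, cs AS colour
FROM cte
WHERE lag_dc IS NULL
UNION ALL
-- the state each id entered at every recorded change
SELECT id, dc, lead_dc, ce
FROM cte;
```

The second branch covers both the "middle" and the "last" records in one pass, since the two differ only in whether lead_dc is 'INFINITY'.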

This makes querying the data relatively straightforward - two example queries follow.

First example query: status (colour) changes in 2021:

--
-- Records where the beginning and the end of the range fall
-- anywhere >= 2021-01-01 00:00:00
--
-- It's basically a record of any changes in status in 2021!
--

SELECT
 *
FROM
 test_rs
WHERE ts_from >= '2021-01-01 00:00:00' AND ts_to >= '2021-01-01 00:00:00'
ORDER BY id, ts_from;

Result:

id  ts_from                 ts_to       colour
2   2021-03-03 14:31:44+00  infinity       red
3   2021-04-28 12:36:45+01  infinity       red

Second example query: status counts at a given point in time (New Year's Day 2021):

--
--  Status counts at exactly New Year, 2021 - we know that at that point, we had two
--  entities with status green and 1 with status yellow
--

SELECT 
 colour, COUNT(colour)
FROM
 test_rs
WHERE ts_from <= '2021-01-01 00:00:00' AND ts_to >= '2021-01-01 00:00:00'
GROUP BY colour
ORDER BY colour;

Result:

colour    count
green         2
yellow        1
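
The original question asks for counts over a period rather than at a single instant. One possible extension of the point-in-time query above is to sample the state once per day with generate_series. This is only a sketch: it assumes test_rs has the ts_from, ts_to and colour columns used above, and the date bounds and the sample_ts alias are purely illustrative:

```sql
-- Count how many items are in each colour at midnight of every day in a period.
SELECT
  d.sample_ts,
  COUNT(*) FILTER (WHERE r.colour = 'green')  AS green_count,
  COUNT(*) FILTER (WHERE r.colour = 'yellow') AS yellow_count,
  COUNT(*) FILTER (WHERE r.colour = 'red')    AS red_count
FROM generate_series('2021-01-01'::TIMESTAMPTZ,
                     '2021-01-07'::TIMESTAMPTZ,
                     INTERVAL '1 day') AS d(sample_ts)
LEFT JOIN test_rs r
  ON r.ts_from <= d.sample_ts AND d.sample_ts < r.ts_to  -- half-open: [ts_from, ts_to)
GROUP BY d.sample_ts
ORDER BY d.sample_ts;
```

With the sample data, each of those seven days should report two green items and one yellow item. The half-open comparison means an item is never counted twice when a sample instant coincides exactly with one of its change timestamps.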

Second solution - one TSTZRANGE field per record (see the fiddle):

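From the manual, ranges have inclusive and exclusive bounds, as well as infinite (or unbounded - perhaps a better term) bounds: [ or ] is an inclusive bound, and ( or ) is an exclusive bound. In addition, an omitted lower bound, written (, or [, is unbounded below, and an omitted upper bound, written ,) or ,] is unbounded above.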
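
To make the bound notation concrete, here is a small self-contained query that builds a few ranges and inspects them with the built-in range functions LOWER_INC, UPPER_INC, LOWER_INF and UPPER_INF. The dates are arbitrary illustrative values, not data from the question:

```sql
SELECT
  r            AS range_value,
  LOWER_INC(r) AS lower_inclusive,
  UPPER_INC(r) AS upper_inclusive,
  LOWER_INF(r) AS lower_unbounded,
  UPPER_INF(r) AS upper_unbounded
FROM (VALUES
  (TSTZRANGE('2021-01-01', '2021-02-01', '[)')),  -- inclusive lower, exclusive upper
  (TSTZRANGE('2021-01-01', '2021-02-01', '(]')),  -- exclusive lower, inclusive upper
  (TSTZRANGE(NULL,         '2021-02-01', '[)')),  -- NULL lower bound: unbounded below
  (TSTZRANGE('2021-01-01', NULL,         '[]'))   -- NULL upper bound: unbounded above
) AS v(r);
```

Note that PostgreSQL always treats an unbounded end as exclusive, so the last two rows report the unbounded side as not inclusive, whatever bracket was requested.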

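Refactoring to timestamp range queries:

I won't go through all of the queries leading up to the final result - the process is similar to the one used above. Only the final query is shown: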

WITH cte AS
(

--
--  We don't need COALESCE in this case, since a range treats a NULL bound as
--  unbounded (in effect -INFINITY at the start of the range, +INFINITY at the end).
--

 SELECT
   id,
   LAG(dc)  OVER (PARTITION BY id ORDER BY dc) AS lag_dc,
   cs,
   dc,
   LEAD(dc) OVER (PARTITION BY id ORDER BY dc) AS lead_dc,
   ce
 FROM
   test
)
SELECT
 c.id, 
 TSTZRANGE(c.lag_dc, c.dc, '[)') AS "Date from:/Date to:", 
 c.cs AS "Colour"
FROM cte c WHERE c.lag_dc IS NULL
UNION ALL
SELECT  c.id, TSTZRANGE(c.dc, c.lead_dc, '[)'), c.ce 
FROM cte c WHERE lead_dc IS NULL
UNION ALL
SELECT c.id, TSTZRANGE(c.dc, c.lead_dc, '[)'), c.ce 
FROM cte c 
WHERE c.lead_dc IS NOT NULL  -- middle records (lead_dc is NULL here, never 'INFINITY')
ORDER BY 1, 2;

Result:

id     Date from:/Date to:                                Colour
1   (,"2020-05-27 16:33:52+01")                            green
1   ["2020-05-27 16:33:52+01","2020-06-11 20:12:18+01")   yellow
1   ["2020-06-11 20:12:18+01","2020-06-11 20:20:58+01")      red
1   ["2020-06-11 20:20:58+01",)                            green
2   (,"2020-08-06 14:59:21+01")                            green
2   ["2020-08-06 14:59:21+01","2021-03-03 14:31:44+00")   yellow
2   ["2021-03-03 14:31:44+00",)                              red
3   (,"2021-04-28 12:36:45+01")                            green
3   ["2021-04-28 12:36:45+01",)                              red
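
The range-based example queries that follow read from a test_rs table with a single range column named df_dt, again defined only on the fiddle. A minimal sketch of such a table is given below. The column names are taken from the result headers further down, while the exclusion constraint is my own addition to show one benefit of the range design (it is not claimed to be in the fiddle). It relies on the btree_gist contrib extension, which may need elevated privileges to install:

```sql
-- Hypothetical definition of the range-based test_rs (the real one is on the fiddle).
-- btree_gist lets the plain "id WITH =" comparison take part in a GiST constraint.
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS test_rs;  -- only needed if a table of that name already exists

CREATE TABLE test_rs
(
  id     INT       NOT NULL,
  df_dt  TSTZRANGE NOT NULL,
  colour TEXT      NOT NULL,

  -- no two rows for the same id may cover overlapping periods
  EXCLUDE USING GIST (id WITH =, df_dt WITH &&)
);

INSERT INTO test_rs (id, df_dt, colour)
-- the state each id was in before its first recorded change
SELECT id, TSTZRANGE(NULL, dc, '[)'), cs
FROM (
  SELECT id, dc, cs, LAG(dc) OVER (PARTITION BY id ORDER BY dc) AS lag_dc
  FROM test
) first_rows
WHERE lag_dc IS NULL
UNION ALL
-- the state each id entered at every recorded change
SELECT id, TSTZRANGE(dc, LEAD(dc) OVER (PARTITION BY id ORDER BY dc), '[)'), ce
FROM test;
```

Because every range is half-open ('[)'), consecutive periods for the same id touch without overlapping, so the constraint accepts them.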

First example query: status changes at any time after the start of 2021.

--
-- Records where the beginning and the end of the range fall
-- anywhere >= 2021-01-01 00:00:00
--
-- It's basically a record of any changes in status in 2021!
--

SELECT * FROM test_rs
WHERE 
LOWER(df_dt) > '2021-01-01 00:00:00' AND
df_dt && TSTZRANGE('2021-01-01 00:00:00'::TIMESTAMPTZ, NULL, '[)');

Result:

id        df_dt                   colour
2   ["2021-03-03 14:31:44+00",)      red
3   ["2021-04-28 12:36:45+01",)      red

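We can see that this is equivalent to the first example query above, with the range field taking the place of the two timestamps.

The WHERE clause is worth a look: LOWER(df_dt) and df_dt && TSTZRANGE('2021-01-01 00:00:00'::TIMESTAMPTZ, NULL, '[)'); - it uses the && (overlaps) operator (see the manual - I also found this article helpful).

So, we can see that any time range which overlaps with any time from 2021-01-01 00:00:00 onwards, i.e. TSTZRANGE('2021-01-01 00:00:00'::TIMESTAMPTZ, NULL, '[)'), is picked up - note the use of NULL when constructing the TSTZRANGE value.

Second example query: colour counts at the start of 2021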

--
--  Status counts at exactly New Year, 2021 - we know that at that point, we had two
--  entities with status green and 1 with status yellow
--

SELECT colour, COUNT(colour) FROM test_rs
WHERE df_dt && TSTZRANGE('2021-01-01 00:00:00', '2021-01-01 00:00:00', '[]')
GROUP BY colour;

Result (the same as for the second example query with two timestamps):

colour    count
green         2
yellow        1
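
As an aside, the same point-in-time count can also be written with the range containment operator @> (range contains element) instead of overlapping against a degenerate single-instant range. This is just an alternative formulation, assuming the same test_rs table with its df_dt range column:

```sql
-- Which colour was each item in at exactly 2021-01-01 00:00:00?
SELECT colour, COUNT(*)
FROM test_rs
WHERE df_dt @> '2021-01-01 00:00:00'::TIMESTAMPTZ
GROUP BY colour
ORDER BY colour;
```

With the sample data this should again give two green and one yellow.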

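Ancillary solutions:

There are other ways of refactoring your schema (and/or application code) to keep track of changes in your database, particularly for tables that reference timestamp fields - a full discussion of these is beyond the scope of this answer, but you may want to consider the following:

  • Rolling your own using triggers, or
  • Installing Vlad Arkhipov's Temporal Table solution. I've used it in a POC and it appeared to work well, but it lacks features that, for example, MariaDB's solution has. This involves compiling and installing a C-based extension.
  • Installing the temporal tables extension (also in C) by Vik Fearing, a major PostgreSQL contributor. I haven't used it, but the fact that the author is a major contributor speaks for itself. As of writing, it appears to be up to date!
  • Near Form's temporal_tables functionality which, according to the GitHub link, is: a temporal_tables extension in PL/pgSQL, without the need for external c extension. The goal is to be able to use it on AWS RDS and other hosted solutions, where using custom extensions or c functions is not an option. It sounds interesting, but I haven't tried it, so I can't comment.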
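
To give a flavour of the "roll your own with triggers" option, here is a deliberately minimal sketch. Every table, column, function and trigger name in it is hypothetical, and it is not taken from any of the extensions listed above. It keeps the current colour in one table and lets a trigger maintain a history table of TSTZRANGE periods:

```sql
-- All names below are hypothetical, for illustration only.
CREATE TABLE item_status
(
  id     INT  PRIMARY KEY,
  colour TEXT NOT NULL
);

CREATE TABLE item_status_history
(
  id     INT       NOT NULL,
  colour TEXT      NOT NULL,
  valid  TSTZRANGE NOT NULL
);

CREATE FUNCTION item_status_track() RETURNS trigger
LANGUAGE plpgsql AS
$$
DECLARE
  -- one timestamp for both statements, so periods touch without a gap
  change_ts TIMESTAMPTZ := clock_timestamp();
BEGIN
  IF TG_OP = 'UPDATE' THEN
    -- nothing to record if the colour did not actually change
    IF NEW.colour IS NOT DISTINCT FROM OLD.colour THEN
      RETURN NEW;
    END IF;

    -- close the currently open period for this item
    UPDATE item_status_history
    SET    valid = TSTZRANGE(LOWER(valid), change_ts, '[)')
    WHERE  id = NEW.id AND UPPER_INF(valid);
  END IF;

  -- open a new, unbounded period for the new colour
  INSERT INTO item_status_history (id, colour, valid)
  VALUES (NEW.id, NEW.colour, TSTZRANGE(change_ts, NULL, '[)'));

  RETURN NEW;
END;
$$;

-- EXECUTE FUNCTION needs PostgreSQL 11 or later; older versions use EXECUTE PROCEDURE.
CREATE TRIGGER item_status_track_trg
AFTER INSERT OR UPDATE OF colour ON item_status
FOR EACH ROW EXECUTE FUNCTION item_status_track();
```

After an INSERT INTO item_status followed by an UPDATE of its colour, item_status_history should hold one closed period for the old colour and one open-ended period for the new one. A production version would also need to handle deletes, changes to id and concurrent updates, which this sketch ignores.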

Quoted from: https://dba.stackexchange.com/questions/303434