Postgresql

按時間間隔分組並輸出源和目的station_id和count

  • August 27, 2020

我被一個查詢困住了:

CREATE TABLE public.bulk_sample (
   serial_number character varying(255),
   validation_date timestamp,  -- timestamp of entry and exit
   station_id integer,
   direction integer           -- 1 = Entry | 2 = Exit
);

INSERT INTO public.bulk_sample VALUES
 ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:31:58', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:50:22', 113, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:16:56', 113, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:47:06', 120, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:02:12', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:47:45', 102, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 19:26:38', 102, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 20:17:24', 120, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 07:58:20', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 08:43:35', 104, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 16:38:10', 104, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:15:01', 119, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:42:29', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:48:05', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:17:59', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:25:25', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:16:12', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:32:51', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:31:20', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:39:33', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 20:57:50', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 21:16:25', 120, 2)
;

我必須創建一個查詢,其結果如下

source | dest | Count
120    | 113  |  1
113    | 120  |  1

我嘗試了以下程式碼,但無法獲得所需的結果:

SELECT serial_number
    , count(*)
    , min(validation_date) AS start_time
    , CASE WHEN count(*) > 1 THEN max(validation_date) END AS end_time
FROM  (
  SELECT serial_number, validation_date, count(step OR NULL) OVER (ORDER BY serial_number, 
validation_date) AS grp
  FROM  (
     SELECT *
          , lag(validation_date) OVER (PARTITION BY serial_number ORDER BY validation_date)
          < validation_date - interval '60 min' AS step
     FROM   table1 
      where BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
     ) sub1
  ) sub2
GROUP  BY serial_number, grp;

每次進出之間的時間間隔約為 55 分鐘至 60 分鐘。

我也嘗試過內部聯接,但無法按內部聯接中的時間間隔進行分組

SELECT source.station_id AS source_station ,dest.station_id AS destination_station ,source.count FROM 
   (
       SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1
       WHERE 
           direction = 1 AND 
           validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59' 
       GROUP BY serial_number,station_id
   )source

INNER JOIN 
   (
       SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1
       WHERE 
           direction = 2 AND 
           validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
       GROUP BY serial_number,station_id
   )dest
ON source.serial_number = dest.serial_number and source.station_id <> dest.station_id

挑戰有時是進入日期為空,有時退出日期為空。

這應該是最簡單和最快的,而每個事務serial_number從不重疊

WITH cte AS (
  SELECT serial_number, validation_date, station_id, direction
       , row_number() OVER (PARTITION BY serial_number ORDER BY validation_date) AS rn
  FROM   bulk_sample
  WHERE  validation_date >= '2020-02-01'  -- ①
  AND    validation_date <  '2020-02-02'  -- entry & exit must be within time frame
  )
SELECT s.station_id AS source, d.station_id AS dest, count(*)
FROM   cte s
JOIN   cte d USING (serial_number)
WHERE  s.direction = 1
AND    d.rn = s.rn + 1
GROUP  BY 1, 2
ORDER  BY 1, 2;  -- optional sort order

db<>在這裡擺弄

① 我重寫了WHERE條件,以最佳方式獲得 2020 年 2 月 1 日的所有內容。BETWEEN幾乎總是時間範圍的錯誤工具。看:

此外,當時間分量失去時,假定“2020-02-01”是一個完全有效的timestamp常數。00:00:00

在檢索給定時間範圍內的結果時,一個普通的 btree索引(validation_date)最佳的。對於完整的表,索引(serial_number, validation_date)會更有幫助。

validation_date IS NULL?

查詢繼續工作,而在給定時間範圍內只有最後一個目的地有,因為值碰巧按預設升序排列在最後。但它與. 您必須更仔細地定義這些可以彈出的位置以及如何準確處理它們。serial_number``validation_date IS NULL``NULL``validation_date IS NULL

(2x)uuid而不是varchar(255)for serial_number?

serial_number似乎是一個正好有 64 位數字的十六進制數。如果是這樣,varchar(255)是一個糟糕的選擇。看:

此外,一個uuid(32 個十六進制數字)就足夠了。如果需要所有 64 個十六進制數字,仍然考慮 2uuid列。更小、更快、更安全。考慮:

SELECT *
    , replace(uuid1::text || uuid2::text, '-', '') AS reverse_engineered
    , replace(uuid1::text || uuid2::text, '-', '') = serial_number AS identical
    , pg_column_size(serial_number) AS varchar_size
    , pg_column_size(uuid1) + pg_column_size(uuid2) AS uuid_size
FROM  (
  SELECT serial_number
       , left(serial_number, 32)::uuid  AS uuid1
       , right(serial_number, 32)::uuid AS uuid2
  FROM   bulk_sample
  ) sub;

db<>在這裡擺弄

看:

引用自:https://dba.stackexchange.com/questions/271037