Postgresql
按時間間隔分組並輸出源和目的station_id和count
我被一個查詢困住了:
CREATE TABLE public.bulk_sample ( serial_number character varying(255), validation_date timestamp, -- timestamp of entry and exit station_id integer, direction integer -- 1 = Entry | 2 = Exit ); INSERT INTO public.bulk_sample VALUES ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:31:58', 120, 1) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:50:22', 113, 2) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:16:56', 113, 1) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:47:06', 120, 2) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:02:12', 120, 1) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:47:45', 102, 2) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 19:26:38', 102, 1) , ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 20:17:24', 120, 2) , ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 07:58:20', 119, 1) , ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 08:43:35', 104, 2) , ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 16:38:10', 104, 1) , ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:15:01', 119, 2) , ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:42:29', 119, 1) , ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:48:05', 120, 2) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:17:59', 120, 1) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:25:25', 118, 2) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:16:12', 118, 1) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:32:51', 120, 2) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:31:20', 120, 1) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:39:33', 118, 2) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 20:57:50', 118, 1) , ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 21:16:25', 120, 2) ;
我必須創建一個查詢,其結果如下
source | dest | Count 120 | 113 | 1 113 | 120 | 1
我嘗試了以下程式碼,但無法獲得所需的結果:
SELECT serial_number , count(*) , min(validation_date) AS start_time , CASE WHEN count(*) > 1 THEN max(validation_date) END AS end_time FROM ( SELECT serial_number, validation_date, count(step OR NULL) OVER (ORDER BY serial_number, validation_date) AS grp FROM ( SELECT * , lag(validation_date) OVER (PARTITION BY serial_number ORDER BY validation_date) < validation_date - interval '60 min' AS step FROM table1 where BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59' ) sub1 ) sub2 GROUP BY serial_number, grp;
每次進出之間的時間間隔約為 55 分鐘至 60 分鐘。
我也嘗試過內部聯接,但無法按內部聯接中的時間間隔進行分組
SELECT source.station_id AS source_station ,dest.station_id AS destination_station ,source.count FROM ( SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1 WHERE direction = 1 AND validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59' GROUP BY serial_number,station_id )source INNER JOIN ( SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1 WHERE direction = 2 AND validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59' GROUP BY serial_number,station_id )dest ON source.serial_number = dest.serial_number and source.station_id <> dest.station_id
挑戰有時是進入日期為空,有時退出日期為空。
這應該是最簡單和最快的,而每個事務
serial_number
從不重疊:WITH cte AS ( SELECT serial_number, validation_date, station_id, direction , row_number() OVER (PARTITION BY serial_number ORDER BY validation_date) AS rn FROM bulk_sample WHERE validation_date >= '2020-02-01' -- ① AND validation_date < '2020-02-02' -- entry & exit must be within time frame ) SELECT s.station_id AS source, d.station_id AS dest, count(*) FROM cte s JOIN cte d USING (serial_number) WHERE s.direction = 1 AND d.rn = s.rn + 1 GROUP BY 1, 2 ORDER BY 1, 2; -- optional sort order
db<>在這裡擺弄
① 我重寫了
WHERE
條件,以最佳方式獲得 2020 年 2 月 1 日的所有內容。BETWEEN
幾乎總是時間範圍的錯誤工具。看:此外,當時間分量失去時,假定“2020-02-01”是一個完全有效的
timestamp
常數。00:00:00
在檢索給定時間範圍內的結果時,一個普通的 btree索引是
(validation_date)
最佳的。對於完整的表,索引(serial_number, validation_date)
會更有幫助。
validation_date IS NULL
?查詢繼續工作,而在給定時間範圍內只有最後一個目的地有,因為值碰巧按預設升序排列在最後。但它與. 您必須更仔細地定義這些可以彈出的位置以及如何準確處理它們。
serial_number``validation_date IS NULL``NULL``validation_date IS NULL
(2x)
uuid
而不是varchar(255)
forserial_number
?您
serial_number
似乎是一個正好有 64 位數字的十六進制數。如果是這樣,varchar(255)
是一個糟糕的選擇。看:此外,一個
uuid
(32 個十六進制數字)就足夠了。如果需要所有 64 個十六進制數字,仍然考慮 2uuid
列。更小、更快、更安全。考慮:SELECT * , replace(uuid1::text || uuid2::text, '-', '') AS reverse_engineered , replace(uuid1::text || uuid2::text, '-', '') = serial_number AS identical , pg_column_size(serial_number) AS varchar_size , pg_column_size(uuid1) + pg_column_size(uuid2) AS uuid_size FROM ( SELECT serial_number , left(serial_number, 32)::uuid AS uuid1 , right(serial_number, 32)::uuid AS uuid2 FROM bulk_sample ) sub;
db<>在這裡擺弄
看: