Query

mysql每隔1分鐘獲取3條以上的記錄,返回group_concat ID

  • January 25, 2021

目前我有這個數據集,我需要返回 60 秒範圍內且超過 3 個的分組 ID。

CREATE TABLE test 
(
  `id` bigint NOT NULL AUTO_INCREMENT,
  created_date TIMESTAMP(1) NOT NULL,
  origin_url   VARCHAR (200) NOT NULL,
  client_session_id VARCHAR (50) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UC_PRE_CHAT_TALKID_COL` (`id`)
);
INSERT INTO test VALUES
(1,'2021-01-18 11:02:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(2,'2021-01-18 11:02:35.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(3,'2021-01-18 11:02:03.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(4,'2021-01-18 11:11:28.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii'),
(5,'2021-01-18 11:11:36.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii'),
(6,'2021-01-18 11:11:05.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii');

db<>在這裡擺弄

像這樣的東西:

ids     origin_url              client_session_id
1,2,3   https://testett.com/    znkjoc3gfth2c3m0t1klii
4,5,6   https://rarara.com/     znkjoc3gfth2c3m0t1klii

編輯一些上下文:

我目前開發了一個每 1 分鐘執行一次的 cron 來分析數據庫中機器人記錄的最後 60 秒,我需要將 60 秒內有超過 3 條記錄的對話 ID 分組到相同的 url 和 client_session_id 中。

遵循我正在執行的 SQL:

select
   count(session_id),
   client_session_id,
   GROUP_CONCAT(id) as talkIds,
   origin_url 
from
   bot_talk
where
   created_date &gt; now() - interval 60 second
group by
   client_session_id, origin_url 
having
   count(session_id) &gt;= 3

此查詢按我的預期工作,但有時我的 cron 服務有時會關閉,並且我會失去那些重複的記錄。

我想過在一天結束的時候做一個 SQL(Cron) 來分析過去 24 小時,並根據我上面提到的規則查找重複的記錄?

這是答案 - 見小提琴。我寫給類似問題的另一個答案可能會提供一些更清晰的背景並且更簡單 - 請參閱此處

我要說的是,它提供了一些關於視窗函式的力量的概念。

我在評論中註意到關於什麼構成了一個組存在一些爭論——在這個例子中,我建構了 SQL 以便它選擇一個組,其中所有後續會話在第一個會話的 180 秒(即 3 分鐘)內開始——您可以自己將 180 更改為 60(或其他)。

為了測試,我添加了一些記錄,並CONSTRAINT在表定義中添加了 s。始終最好將盡可能多的內容放入 DDL - 您的數據庫是您保護數據的最後堡壘!

CREATE TABLE test 
(
 id bigint NOT NULL,
 created_date TIMESTAMP(2) NOT NULL,
 origin_url   VARCHAR (200) NOT NULL,
 client_session_id VARCHAR (50) NOT NULL,
 CONSTRAINT test_id_pk PRIMARY KEY (id),
 CONSTRAINT test_cd_url_sess_id_uq UNIQUE (created_date, origin_url, client_session_id)
);

始終使用命名約束 - 這些提供比以下更有意義的消息:... failed... CONSTRAINT xyz_000_43abc has been violated...

我填充它如下:

INSERT INTO test VALUES

--
-- 1 lone record...
--

(1,'2021-01-18 10:30:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'), -- XX

--
-- 4 records within 180 seconds of the first one
--

(2,'2021-01-18 11:02:24.0', 'https://zes.com/', 'znkjoc'),
(3,'2021-01-18 11:02:35.0', 'https://zes.com/', 'znkjoc'),
(4,'2021-01-18 11:03:03.0', 'https://zes.com/', 'znkjoc'),  -- **
(5,'2021-01-18 11:04:15.0', 'https://zes.com/', 'znkjoc'),  -- YY

-- 
-- 3 records within 180s of the first one
--

(6,'2021-01-18 11:49:28.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'),
(7,'2021-01-18 11:49:48.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'), -- **
(8,'2021-01-18 11:50:13.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'), -- **

-- 1 lone record

(9,'2021-01-18 12:57:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),  -- XX


(10,'2021-02-18 09:02:24.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(11,'2021-02-18 09:02:35.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(12,'2021-02-18 09:03:03.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(13,'2021-02-18 09:04:15.0', 'https://rar.com/', 'znkjoc3'); -- ZZ


-- -- XX - Added record &gt; 1 minute from next or previous.
-- -- ** - Changed created_date to get groups within 180 seconds.
-- -- YY - Added record &lt; 3 minutes from previous to give 4 records
-- -- ZZ - Added group of 4 records at the end.

檢查邊緣情況總是值得的 - 數據集開頭/結尾的單個記錄+您想要在開頭和結尾擷取的組!我把它留給你做更詳盡的測試!

我先給出結果:

rn  st  sids                URL:                Session id: Session start time:            Session end time:
1   2   2,3,4,5             https://zes.com/    znkjoc      2021-01-18 11:02:24.00  2021-01-18 11:04:15.00
2   5   10,11,12,13         https://rar.com/    znkjoc3     2021-02-18 09:02:24.00  2021-02-18 09:04:15.00

有一個獎勵 - 您可以免費獲得多個緊密結合的會話的開始和結束!

我確實有一個建議——你真的不應該使用或收集數據作為逗號分隔的列表——SQL 不是為字元串操作而設計的,從這些列表中提取有意義的資訊和知識是很痛苦的——最好有一個單一的原子單個欄位中的基準 - 見1st Normal Form

我留下了我用來得出最終結果的各種“子小提琴” - 希望它們能幫助您了解視窗函式&c …我自己對結果的偏好將採用這種格式(請參閱小提琴 - 有一個記錄/會話 - 你可以根據需要修剪):

sid created_date    st  min_cd  max_cd  f_ts    l_ts    c_ts_asc    c_ts_desc   o_url   c_sess_id
1   2021-01-18 10:30:24.00  1   2021-01-18 10:30:24.00  2021-01-18 10:30:24.00  1   1   1   1   https://zendes.com/ znkjoc3gfth2c3m0t1klii
2   2021-01-18 11:02:24.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  1   4   1   4   https://zes.com/    znkjoc
3   2021-01-18 11:02:35.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  2   3   2   3   https://zes.com/    znkjoc
4   2021-01-18 11:03:03.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  3   2   3   2   https://zes.com/    znkjoc
5   2021-01-18 11:04:15.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  4   1   4   1   https://zes.com/    znkjoc
6   2021-01-18 11:49:28.00  3   2021-01-18 11:49:28.00  2021-01-18 11:50:13.00  1   3   1   3   https://rararar.com/    znkjoc3gfth2c3m0t1klii
7   2021-01-18 11:49:48.00  3   2021-01-18 11:49:28.00  2021-01-18 11:50:13.00  2   2   2   2   https://rararar.com/    znkjoc3gfth2c3m0t1klii
8   2021-01-18 11:50:13.00  3   2021-01-18 11:49:28.00  2021-01-18 11:50:13.00  3   1   3   1   https://rararar.com/    znkjoc3gfth2c3m0t1klii
9   2021-01-18 12:57:24.00  4   2021-01-18 12:57:24.00  2021-01-18 12:57:24.00  1   1   1   1   https://zendes.com/ znkjoc3gfth2c3m0t1klii
10  2021-02-18 09:02:24.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  1   4   1   4   https://rar.com/    znkjoc3
11  2021-02-18 09:02:35.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  2   3   2   3   https://rar.com/    znkjoc3
12  2021-02-18 09:03:03.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  3   2   3   2   https://rar.com/    znkjoc3
13  2021-02-18 09:04:15.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  4   1   4   1   https://rar.com/    znkjoc3

或者另一種方式可能是(再次,參見小提琴):

sid Session no. Start of sessions   End of sessions Session count   o_url   c_sess_id
2   2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  4   https://zes.com/    znkjoc
10  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  4   https://rar.com/    znkjoc3

你有開始和結束以及會話的數量……無論如何,這取決於你。看看這一點(很容易錯過):

MIN(created_date) OVER (PARTITION BY st ORDER BY created_date ASC) AS min_cd,
MAX(created_date) OVER (PARTITION BY st ORDER BY created_date DESC) AS max_cd,

這允許您有一個ASC結束和一個DESCedning 列表。

這意味著您可以這樣做:

WHERE (v.f_ts + v.l_ts) &gt;= 5

並且這對應的記錄必須是成束會話感興趣的記錄。

所以現在,SQL 來了(鼓聲,遠處傳來喇叭聲……)!去喝杯咖啡來閱讀它——它是一個怪物(需要修剪——但這對 OP 和其他任何人來說都是一個練習):

SELECT 
 ROW_NUMBER() OVER () AS rn,
 v.st,
 GROUP_CONCAT(v.sid SEPARATOR ',') as sids,
 o_url AS "URL:",
 c_sess_id AS "Session id:",
 min_cd AS "Session start time:",
 max_cd AS "Session end time:"
FROM
(
 SELECT 
   sid,
   created_date, st, 
   MIN(created_date) OVER (PARTITION BY st ORDER BY created_date ASC) AS min_cd,
   MAX(created_date) OVER (PARTITION BY st ORDER BY created_date DESC) AS max_cd,
   ROW_NUMBER() OVER (PARTITION BY st ORDER BY created_date ASC)  AS f_ts,
   ROW_NUMBER() OVER (PARTITION BY st ORDER BY created_date DESC) AS l_ts,
   COUNT(st) OVER (PARTITION BY st ORDER BY created_date ASC)  AS c_ts_asc,
   COUNT(st) OVER (PARTITION BY st ORDER BY created_date DESC) AS c_ts_desc,
   o_url, 
   c_sess_id
 FROM
 (
   SELECT 
     sid,
     created_date, YYY, 
     SUM(testy) OVER (PARTITION BY testy ORDER BY created_date ASC) AS s, 
     ROW_NUMBER() 
       OVER (PARTITION BY testy ORDER BY created_date ASC) AS rn,
     SUM(testy) OVER y AS st,
     FIRST_VALUE(created_date) OVER (PARTITION BY testy ORDER BY created_date) AS fv,
     LAST_VALUE(created_date) OVER y AS lv,
     o_url,
     c_sess_id
   FROM
   (
     SELECT
       id AS sid,
       LAG(created_date, 1) OVER x AS lag_1,
       created_date,
       LEAD(created_date, 1) OVER x AS lead_1,  

       UNIX_TIMESTAMP(LEAD(created_date, 1) OVER x) - UNIX_TIMESTAMP(created_date) AS XXX,

       UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) AS YYY,

       --
       -- The IMPORTANT one!
       --

       CASE 
         WHEN (UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) &gt; 180
         OR   UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) IS NULL)
         THEN 1
         ELSE 0
       END AS testy,
     
       FIRST_VALUE(created_date) OVER x AS f_val,
 
       UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(FIRST_VALUE(created_date) OVER x) AS c_f_diff,
 
       ABS(TIMESTAMPDIFF(MINUTE, created_date, LEAD(created_date, 1) OVER x)) AS min_diff,  

       UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LEAD(created_date, 1) OVER x) AS ut_d, 

       origin_url AS o_url,
       client_session_id AS c_sess_id

     FROM test
     WINDOW x AS (PARTITION BY origin_url, client_session_id 
                    ORDER BY created_date ASC, origin_url, client_session_id)
     ORDER BY created_date ASC
   ) AS t 
   WINDOW y AS (ORDER BY created_date ASC ROWS BETWEEN UNBOUNDED PRECEDING
                                                   AND CURRENT ROW)
   ORDER BY created_date ASC
 ) AS u
 ORDER BY created_date ASC
) AS v
WHERE (v.f_ts + v.l_ts) &gt;= 5
GROUP BY v.st, o_url, c_sess_id, min_cd, max_cd;

引用自:https://dba.stackexchange.com/questions/283859