GROUP BY 的中位數
假設下表
t1
:================= | 標籤 | 值 | --+ 為簡單起見,val 為非 NULL ================= | a1 | v1 | | a1 | v2 | | a1 | v3 | | a1 | v4 | | a1 | v5 | | a2 | v6 | | a2 | v7 | | a2 | v8 | | a2 | v9 | | ... | ... | =================
如果您在 MySQL 中執行以下腳本:
SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`
您將獲得按列分組的平均值
tag
:================= | 標籤 | 平均 () | ================= | a1 | 平均1 | | a2 | 平均2 | | a3 | 平均3 | | a4 | 平均4 | | ... | ... | =================
此外
AVG()
,MySQL 還具有其他幾個內置函式來計算聚合值(例如SUM()
、MAX()
、COUNT()
和STD()
),這些函式的使用方式與上述腳本中的相同。但是,沒有內置函式median。這個問題已經在 SE 上多次出現過。但是,它們中的大多數都與沒有
GROUP BY
. 唯一一個GROUP BY
似乎是MySql: Count median grouped by day;但是,腳本似乎過於復雜。題
計算這個中位數的簡單方法(如果可能的話)是什麼?
跟進
補充公認答案的優秀文章:http:
//danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/
此查詢可以回答您的問題:中值和分組依據
SELECT tag, AVG(val) as median FROM ( SELECT tag, val, (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct, seq, (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta FROM (SELECT tag, val, @rownum := @rownum + 1 as seq FROM (SELECT * FROM median ORDER BY tag, val) t1 ORDER BY tag, seq ) t3 CROSS JOIN (SELECT @rownum := 0) x HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1) or (ct%2 <> 0 and seq-delta = (ct+1)/2) ) T GROUP BY tag ORDER BY tag;
我在這個數據集上試過(主要來自這裡):
+------+------+ | tag | val | +------+------+ | 1 | 3 | | 1 | 13 |
…(見下面的解釋)
| 3 | 12 | | 3 | 43 | | 3 | 15 | +------+------+
結果是:
+------+---------+ | tag | median | +------+---------+ | 1 | 23.0000 | | 2 | 22.0000 | | 3 | 15.0000 | +------+---------+
解釋
將首先計算內部子查詢:序列為 (1)(2)(3)(4)。
– (4) 計算平均值(2 行或 1 行)
SELECT tag, AVG(val) as median FROM (
– (3) 獲取行來計算中值
SELECT tag, val, (SELECT count(*) FROM median t2 -- +number of lines for the current tag value as ct WHERE t2.tag = t3.tag) as ct, seq, (SELECT count(*) FROM median t2 -- +number of lines before the current tag value as delta WHERE t2.tag < t3.tag) as delta -- to compute the starting line number of a tag FROM (
– (2) 按標籤和序列對數據集進行排序
SELECT tag, val, @rownum := @rownum + 1 as seq -- +@rownum enable to create a sequence from 0 by 1 FROM (
– (1) 按標籤和值對數據集進行排序
SELECT * FROM median ORDER BY tag, val) t1
– (2) 在這裡繼續
ORDER BY tag, seq ) t3 CROSS JOIN (SELECT @rownum := 0) x -- +use to set @rownum to 0 (no data)
– (3) 在這裡繼續
HAVING (ct%2 = 0 -- +when ct is even, select the two lines around the middle and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1) or (ct%2 <> 0 -- +when ct is odd, select the one line in the middle and seq-delta = (floor(ct+1)/2)) ) T
– (4) 繼續這裡
GROUP BY tag ORDER BY tag;
數據集:
after (1) after (2) processing (3) +------+------+ | tag | val | ct delta seq seq-delta +------+------+ | 1 | 3 | 15 0 1 1 ct : odd ct%2 <> 0 | 1 | 5 | 15 0 2 2 floor((ct+1)/2) : 8 | 1 | 7 | 15 0 3 3 | 1 | 12 | 15 0 4 4 | 1 | 13 | 15 0 5 5 | 1 | 14 | 15 0 6 6 | 1 | 21 | 15 0 7 7 | 1 | 23 | 15 0 8 8 ---> keep this line | 1 | 23 | 15 0 9 9 | 1 | 23 | 15 0 10 10 | 1 | 23 | 15 0 11 11 | 1 | 29 | 15 0 12 12 | 1 | 39 | 15 0 13 13 | 1 | 40 | 15 0 14 14 | 1 | 56 | 15 0 15 15 | 2 | 3 | 14 15 16 1 ct : even (ct%2 = 0 ) | 2 | 5 | 14 15 17 2 floor((ct+1)/2) : 7 | 2 | 7 | 14 15 18 3 floor((ct+1)/2)+1 : 8 | 2 | 12 | 14 15 19 4 | 2 | 13 | 14 15 20 5 | 2 | 14 | 14 15 21 6 | 2 | 21 | 14 15 22 7 ---> keep this line | 2 | 23 | 14 15 23 8 ---> keep this line | 2 | 23 | 14 15 24 9 | 2 | 23 | 14 15 25 10 | 2 | 23 | 14 15 26 11 | 2 | 29 | 14 15 27 12 | 2 | 40 | 14 15 28 13 | 2 | 56 | 14 15 29 14 | 3 | 12 | 3 29 30 1 ct : odd ct%2 <> 0 | 3 | 15 | 3 29 31 2 ---> keep floor((ct+1)/2) : 2 | 3 | 43 | 3 29 32 3 +------+------+
(3) 之後的數據集
+------+------+------+------+-------+ | tag | val | ct | seq | delta | +------+------+------+------+-------+ | 1 | 23 | 15 | 8 | 0 | | 2 | 21 | 14 | 22 | 15 | | 2 | 23 | 14 | 23 | 15 | | 3 | 15 | 3 | 31 | 29 | +------+------+------+------+-------+
外部查詢將按標籤值計算 avg(val) 組。
希望這可以幫助。
但是當有空值時,中值計算呢?請參閱下面的 EDIT2
替代方案:使用函式
DELIMITER // CREATE FUNCTION median(pTag int) RETURNS real READS SQL DATA DETERMINISTIC BEGIN DECLARE r real; -- result SELECT AVG(val) INTO r FROM ( SELECT val, (SELECT count(*) FROM median WHERE tag = pTag) as ct, seq FROM (SELECT val, @rownum := @rownum + 1 as seq FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 ORDER BY seq ) t3 CROSS JOIN (SELECT @rownum := 0) x HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1) or (ct%2 <> 0 and seq = (ct+1)/2) ) T; return r; END// DELIMITER ;
但是將為每一行呼叫該函式:
SELECT tag, median(tag) FROM median; -- my test table is 'median' too...
此查詢將“更好”:
select tag, median(tag) from (select distinct tag from median) t;
這就是我能做的!希望能幫助到你!
EDIT2:關於數據中的空值(範例中的列 val)
WHERE val IS NOT NULL
在計算行數的 2 個子查詢和獲取數據的子查詢中,顯示使用 WHERE 子句從源數據中省略空值。EDIT3 (LAST EDIT) : 改變@rownum 位置的初始化
它應該放在最深的級別:以便它在查詢的執行中最快地聲明。
DELIMITER // CREATE FUNCTION median(pTag int) RETURNS real READS SQL DATA DETERMINISTIC BEGIN DECLARE r real; -- result SELECT AVG(val) INTO r FROM ( SELECT val, (SELECT count(*) FROM median WHERE tag = pTag and val is not null) as ct, seq FROM (SELECT val, @rownum := @rownum + 1 as seq FROM (SELECT * FROM median CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here WHERE tag = pTag and val is not null ORDER BY val ) t1 ORDER BY seq ) t3 HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1) or (ct%2 <> 0 and seq = (ct+1)/2) ) T; return r; END// DELIMITER ;
查詢也是如此。
使用 2 個數據集進行測試:
| 4 | NULL | | 4 | 10 | | 4 | 15 | | 4 | 20 | | 5 | NULL | | 5 | NULL | | 5 | NULL | +------+------+
一組 39 行(0.00 秒)
+------+--------------+ | tag | median2(tag) | +------+--------------+ | 1 | 23 | | 2 | 22 | | 3 | 15 | | 4 | 15 | | 5 | NULL | +------+--------------+ 5 rows in set (0.08 sec)