Mysql

GROUP BY 的中位數

  • April 21, 2020

假設下表t1

=================
| 標籤 | 值 | --+ 為簡單起見,val 為非 NULL
=================
| a1 | v1 |
| a1 | v2 |
| a1 | v3 |
| a1 | v4 |
| a1 | v5 |
| a2 | v6 |
| a2 | v7 |
| a2 | v8 |
| a2 | v9 |
| ... | ... |
=================

如果您在 MySQL 中執行以下腳本:

SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`

您將獲得按列分組的平均值tag

=================
| 標籤 | 平均 () |
=================
| a1 | 平均1 |
| a2 | 平均2 |
| a3 | 平均3 |
| a4 | 平均4 |
| ... | ... |
=================

此外AVG(),MySQL 還具有其他幾個內置函式來計算聚合值(例如SUM()MAX()COUNT()STD()),這些函式的使用方式與上述腳本中的相同。但是,沒有內置函式median

這個問題已經在 SE 上多次出現過。但是,它們中的大多數都與沒有GROUP BY. 唯一一個GROUP BY似乎是MySql: Count median grouped by day;但是,腳本似乎過於復雜。

計算這個中位數的簡單方法(如果可能的話)是什麼?

跟進

補充公認答案的優秀文章:http:

//danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/

此查詢可以回答您的問題:中值分組依據

           SELECT tag, AVG(val) as median
           FROM 
           (
             SELECT tag, val,
                 (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
                 seq,
                 (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
               FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
                     FROM (SELECT * FROM median ORDER BY tag, val) t1 
                     ORDER BY tag, seq
                   ) t3 CROSS JOIN (SELECT @rownum := 0) x
               HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
                 or (ct%2 <> 0 and seq-delta = (ct+1)/2)
           ) T
           GROUP BY tag
           ORDER BY tag;

我在這個數據集上試過(主要來自這裡):

           +------+------+
           | tag  | val  |
           +------+------+
           |    1 |    3 |
           |    1 |   13 |

…(見下面的解釋)

           |    3 |   12 |
           |    3 |   43 |
           |    3 |   15 |
           +------+------+

結果是:

           +------+---------+
           | tag  | median  |
           +------+---------+
           |    1 | 23.0000 |
           |    2 | 22.0000 |
           |    3 | 15.0000 |
           +------+---------+

解釋

將首先計算內部子查詢:序列為 (1)(2)(3)(4)。

– (4) 計算平均值(2 行或 1 行)

   SELECT tag, AVG(val) as median                          
     FROM 
     (

– (3) 獲取行來計算中值

       SELECT tag, val,                                       
          (SELECT count(*) FROM median t2                    -- +number of lines for the current tag value as ct
             WHERE t2.tag = t3.tag) as ct,
          seq,
          (SELECT count(*) FROM median t2                    -- +number of lines before the current tag value as delta
             WHERE t2.tag < t3.tag) as delta                --     to compute the starting line number of a tag
        FROM (

– (2) 按標籤和序列對數據集進行排序

               SELECT tag, val,                            
                   @rownum := @rownum + 1 as seq       -- +@rownum enable to create a sequence from 0 by 1
             FROM (

– (1) 按標籤和值對數據集進行排序

                   SELECT * FROM median           
                   ORDER BY tag, val) t1 

– (2) 在這裡繼續

             ORDER BY tag, seq
           ) t3 CROSS JOIN (SELECT @rownum := 0) x            -- +use to set @rownum to 0 (no data)

– (3) 在這裡繼續

        HAVING (ct%2 = 0                                      -- +when ct is even, select the two lines around the middle
                 and seq-delta between floor((ct+1)/2) 
                               and floor((ct+1)/2) +1)
          or (ct%2 <> 0                                       -- +when ct is odd, select the one line in the middle
                 and seq-delta = (floor(ct+1)/2))
     ) T

– (4) 繼續這裡

     GROUP BY tag
     ORDER BY tag;

數據集:

       after (1)     after (2)           processing (3)   
   +------+------+                   
   | tag  | val  |  ct  delta  seq       seq-delta
   +------+------+                   
   |    1 |    3 |  15    0     1        1         ct : odd ct%2 <> 0  
   |    1 |    5 |  15    0     2        2         floor((ct+1)/2) : 8
   |    1 |    7 |  15    0     3        3         
   |    1 |   12 |  15    0     4        4         
   |    1 |   13 |  15    0     5        5
   |    1 |   14 |  15    0     6        6
   |    1 |   21 |  15    0     7        7
   |    1 |   23 |  15    0     8        8 ---> keep this line
   |    1 |   23 |  15    0     9        9 
   |    1 |   23 |  15    0     10       10
   |    1 |   23 |  15    0     11       11
   |    1 |   29 |  15    0     12       12
   |    1 |   39 |  15    0     13       13
   |    1 |   40 |  15    0     14       14
   |    1 |   56 |  15    0     15       15

   |    2 |    3 |  14    15    16        1         ct : even (ct%2 = 0  )
   |    2 |    5 |  14    15    17        2         floor((ct+1)/2) : 7
   |    2 |    7 |  14    15    18        3         floor((ct+1)/2)+1 : 8
   |    2 |   12 |  14    15    19        4
   |    2 |   13 |  14    15    20        5
   |    2 |   14 |  14    15    21        6
   |    2 |   21 |  14    15    22        7 ---> keep this line
   |    2 |   23 |  14    15    23        8 ---> keep this line
   |    2 |   23 |  14    15    24        9
   |    2 |   23 |  14    15    25        10
   |    2 |   23 |  14    15    26        11
   |    2 |   29 |  14    15    27        12
   |    2 |   40 |  14    15    28        13
   |    2 |   56 |  14    15    29        14

   |    3 |   12 |  3     29    30        1                  ct : odd ct%2 <> 0 
   |    3 |   15 |  3     29    31        2 ---> keep        floor((ct+1)/2) : 2
   |    3 |   43 |  3     29    32        3
   +------+------+

(3) 之後的數據集

   +------+------+------+------+-------+
   | tag  | val  | ct   | seq  | delta |
   +------+------+------+------+-------+
   |    1 |   23 |   15 |    8 |     0 |
   |    2 |   21 |   14 |   22 |    15 |
   |    2 |   23 |   14 |   23 |    15 |
   |    3 |   15 |    3 |   31 |    29 |
   +------+------+------+------+-------+

外部查詢將按標籤值計算 avg(val) 組。

希望這可以幫助。

但是當有空值時,中值計算呢?請參閱下面的 EDIT2

替代方案:使用函式

   DELIMITER //
   CREATE FUNCTION median(pTag int)
       RETURNS real
          READS SQL DATA
          DETERMINISTIC
          BEGIN
             DECLARE r real; -- result
   SELECT AVG(val) INTO r
   FROM 
   (
     SELECT val,
          (SELECT count(*) FROM median WHERE tag = pTag) as ct,
          seq
       FROM (SELECT val, @rownum := @rownum + 1 as seq
             FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 
             ORDER BY seq
           ) t3 
           CROSS JOIN (SELECT @rownum := 0) x
       HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
         or (ct%2 <> 0 and seq = (ct+1)/2)
   ) T;
   return r;
   END//
   DELIMITER ;

但是將為每一行呼叫該函式:

SELECT tag, median(tag) FROM median; -- my test table is 'median' too...

此查詢將“更好”:

select tag, median(tag) 
 from (select distinct tag from median) t;

這就是我能做的!希望能幫助到你!

EDIT2:關於數據中的空值(範例中的列 val)

WHERE val IS NOT NULL在計算行數的 2 個子查詢和獲取數據的子查詢中,顯示使用 WHERE 子句從源數據中省略空值。

EDIT3 (LAST EDIT) : 改變@rownum 位置的初始化

它應該放在最深的級別:以便它在查詢的執行中最快地聲明。

DELIMITER //
CREATE FUNCTION median(pTag int)
   RETURNS real
      READS SQL DATA
      DETERMINISTIC
      BEGIN
         DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM 
(
 SELECT val,
      (SELECT count(*) FROM median WHERE tag = pTag and val is not null)     as ct,
      seq
   FROM (SELECT val, @rownum := @rownum + 1 as seq
         FROM (SELECT * FROM median 
                      CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
            WHERE tag = pTag and val is not null ORDER BY val 
         ) t1 
         ORDER BY seq
      ) t3     
   HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
     or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;

查詢也是如此。

使用 2 個數據集進行測試:

|    4 | NULL |
|    4 |   10 |
|    4 |   15 |
|    4 |   20 |
|    5 | NULL |
|    5 | NULL |
|    5 | NULL |
+------+------+

一組 39 行(0.00 秒)

+------+--------------+
| tag  | median2(tag) |
+------+--------------+
|    1 |           23 |
|    2 |           22 |
|    3 |           15 |
|    4 |           15 |
|    5 |         NULL |
+------+--------------+
5 rows in set (0.08 sec)

引用自:https://dba.stackexchange.com/questions/158333