Hive

Cumulative sum using HiveQL

  • October 26, 2016

I have a table in Hive that looks like this:

col1       col2
b           1
b           2
a           3
b           2
c           4
c           5

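How can I use HiveQL to group the rows by col1, sum col2 within each group, order the groups by that sum, and compute a cumulative sum (csum) over the ordered sums?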

id       sum_all     csum
a         3           3
b         5           8
c         9           17

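So far I have only managed the grouping and the sum; I have no idea how to get the cumulative sum. Hive does not support correlated subqueries.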

select col1 as id,
       sum(col2) as sum_all
from t
group by col1
order by sum_all

The result is:

id       sum_all
a         3
b         5
c         9

Since correlated subqueries are not allowed, try using derived tables and joining them.

select 
   a.id,
   a.sum_all,
   sum(b.sum_all) as csum
from
       ( select col1 as id,
                sum(col2) as sum_all
         from t
         group by col1
       )  a
   join
       ( select col1 as id,
                sum(col2) as sum_all
         from t
         group by col1
       )  b
    on
       ( b.sum_all < a.sum_all )
    or ( b.sum_all = a.sum_all and b.id <= a.id )
group by
   a.sum_all, a.id
order by 
   a.sum_all, a.id ;

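This is essentially a self-join on the derived group-by table. It may be more efficient to save the grouped result into a temporary table first and then perform the self-join.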
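A minimal sketch of that temporary-table variant, under a few assumptions: the table name grouped_t is hypothetical, the Hive version is recent enough to support CREATE TEMPORARY TABLE, and the join is written as CROSS JOIN plus WHERE because, to my knowledge, older Hive releases reject non-equality conditions in an ON clause. Strict-mode settings may also block the Cartesian product, so treat this as an illustration rather than a tuned solution.

```sql
-- Materialize the grouped sums once (grouped_t is a hypothetical name).
CREATE TEMPORARY TABLE grouped_t AS
SELECT col1      AS id,
       SUM(col2) AS sum_all
FROM t
GROUP BY col1;

-- Self-join: for each row a, add up every row b that sorts at or before it
-- in (sum_all, id) order. The id tie-breaker keeps equal sums deterministic.
SELECT a.id,
       a.sum_all,
       SUM(b.sum_all) AS csum
FROM grouped_t a
CROSS JOIN grouped_t b
WHERE b.sum_all < a.sum_all
   OR (b.sum_all = a.sum_all AND b.id <= a.id)
GROUP BY a.sum_all, a.id
ORDER BY a.sum_all, a.id;
```

For the sample data in the question, the grouped sums are a = 3, b = 5, c = 9, so the cumulative sums should come out as 3, 8 and 17, matching the desired output. The self-join compares every group with every other group, so its cost grows quadratically with the number of distinct col1 values.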


According to the manual, Hive also has window aggregates, so you can use those as well:

select 
   a.id,
   a.sum_all,
   sum(a.sum_all) over (order by a.sum_all, a.id
                        rows between unbounded preceding
                                 and current row)
       as csum
from
       ( select col1 as id,
                sum(col2) as sum_all
         from t
         group by col1
       )  a
order by 
   sum_all, id ;

Or, applying the window function directly to the grouped query:

select 
   col1 as id,
   sum(col2) as sum_all,
   sum(sum(col2)) over (order by sum(col2), col1
                        rows between unbounded preceding
                                 and current row)
       as csum
from
   t
group by 
   col1
order by 
   sum_all, id ;

Source: https://dba.stackexchange.com/questions/153338