PostgreSQL

Nested aggregate functions over different dimensions

  • December 15, 2017

I have a table in which I periodically record the throughput of every network interface on a set of servers:

create table net (
   ts timestamptz not null,
   host text not null,
   interface text not null,
   recv_bytes bigint not null
);
create index on net (ts);
insert into net (ts, host, interface, recv_bytes) values
   ('2017-01-01 00:00:00+00', 'a', 'eth0',  500),
   ('2017-01-01 00:00:00+00', 'b', 'eth1', 2000),
   ('2017-01-01 00:00:01+00', 'b', 'eth0', 1000),  -- measurements arrive with some jitter, +1s here
   ('2017-01-01 00:00:02+00', 'c', 'eth0',  100),  -- only present in this interval
   ('2017-01-01 00:00:04+00', 'b', 'eth1', 1900),
   ('2017-01-01 00:00:05+00', 'a', 'eth0',  550),
   ('2017-01-01 00:00:05+00', 'b', 'eth0', 1200),

   ('2017-01-01 00:00:10+00', 'a', 'eth0',  600),
   ('2017-01-01 00:00:10+00', 'b', 'eth0', 1500),
   ('2017-01-01 00:00:11+00', 'b', 'eth1', 1900),
   ('2017-01-01 00:00:15+00', 'a', 'eth0',  600),
   ('2017-01-01 00:00:15+00', 'b', 'eth1', 1400),
   ('2017-01-01 00:00:16+00', 'b', 'eth0', 1400),
   ('2017-01-01 00:00:16+00', 'b', 'eth1', 1700);  -- (b,eth1) appears 3 times in this interval

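I want to find the total throughput across all interfaces for a given time range, averaged over 10-second intervals. If a (host, interface) pair has no data in a given interval, it can simply be omitted. This is what I came up with: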

select ts_interval, host, sum(recv_bytes) as recv_bytes
from (
   select
       to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
       host,
       interface,
       avg(recv_bytes) as recv_bytes
   from net
   group by ts_interval, host, interface
) avg_net
where ts_interval >= '2017-01-01 00:00:00+00' and ts_interval < '2017-01-01 00:00:20+00'
group by ts_interval, host
order by ts_interval, host;

Result:

      ts_interval       | host |      recv_bytes
------------------------+------+-----------------------
 2017-01-01 00:00:00+00 | a    |  525.0000000000000000
 2017-01-01 00:00:00+00 | b    | 3050.0000000000000000
 2017-01-01 00:00:00+00 | c    |  100.0000000000000000
 2017-01-01 00:00:10+00 | a    |  600.0000000000000000
 2017-01-01 00:00:10+00 | b    | 3116.6666666666666667

This query seems overly verbose. The essence of what I am trying to do is sum over one dimension and average over another. In pseudo-SQL:

select
   to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
   host,
   sum(avg(recv_bytes OVER ts_interval) OVER host) as recv_bytes
from net
where ts >= '2017-01-01 00:00:00+00' and ts < '2017-01-01 00:00:20+00'
group by ts_interval, host
order by ts_interval, host;

Is there a more concise way to do this?

Answer

You can use generate_series() to generate a time grid and join to it:

SELECT g AS ts_interval, host, sum(recv_bytes) AS recv_bytes
FROM  (
  SELECT g, host, interface, avg(recv_bytes) AS recv_bytes
  FROM   generate_series(timestamptz '2017-01-01 00:00:00+00'
                       , timestamptz '2017-01-01 00:00:10+00'  -- 10 - only lower bound
                       , interval '10 sec') g
  JOIN   net n ON ts >= g
              AND ts <  g + interval '10 sec'
  GROUP  BY g, host, interface
  ) sub
GROUP  BY g, host
ORDER  BY g, host;

Same result. Not much less verbose, but faster if you have an index on (ts).
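
To see what the time grid looks like before it is joined, here is a minimal sketch that runs the same generate_series() call on its own. The aliases lower_bound and upper_bound are illustrative names, not part of the answer. For the bounds used above it returns two rows, one per 10-second interval.

```sql
-- Each row holds the inclusive lower bound of one 10-second interval.
-- The exclusive upper bound is derived, mirroring the join condition
-- ts >= g AND ts < g + interval '10 sec'.
SELECT g                     AS lower_bound
     , g + interval '10 sec' AS upper_bound
FROM   generate_series(timestamptz '2017-01-01 00:00:00+00'
                     , timestamptz '2017-01-01 00:00:10+00'
                     , interval '10 sec') g;
```

Because the series stores only lower bounds, its stop value is 00:00:10 rather than 00:00:20; the last interval still covers timestamps up to, but not including, 00:00:20.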

Then again, the key is to have "sargable" predicates, which you can also get without generate_series():

SELECT ts_interval, host, sum(recv_bytes) as recv_bytes
FROM  (
  SELECT to_timestamp(trunc(extract(epoch from ts) / 10) * 10) AS ts_interval
       , host, interface, avg(recv_bytes) as recv_bytes
  FROM   net
  WHERE  ts >= '2017-01-01 00:00:00+00'
  AND    ts <  '2017-01-01 00:00:20+00' -- just make sure to match bounds
  GROUP  BY 1, 2, 3
  ) avg_net
GROUP  BY 1, 2
ORDER  BY 1, 2;
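
As a rough illustration of what "sargable" means here, the sketch below contrasts two filters on the net table from the question. It assumes the index on net (ts) exists as created there. On a table this small the planner will probably choose a sequential scan either way, so treat the plans as something to compare on realistic data volumes.

```sql
-- Sargable: the bare column ts is compared with constants,
-- so the planner can consider the index on net (ts).
EXPLAIN
SELECT *
FROM   net
WHERE  ts >= '2017-01-01 00:00:00+00'
AND    ts <  '2017-01-01 00:00:20+00';

-- Not sargable: ts is wrapped in an expression, so the plain
-- index on (ts) cannot be used to evaluate this filter.
EXPLAIN
SELECT *
FROM   net
WHERE  to_timestamp(floor(extract(epoch FROM ts) / 10) * 10)
       >= '2017-01-01 00:00:00+00';
```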

Minor point: when working only with positive numbers, trunc() is equivalent to floor() and a bit faster.
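
The restriction to positive numbers matters because the two functions differ for negative input, which is what extract(epoch from ts) produces for timestamps before 1970. A small sketch with made-up sample values:

```sql
-- floor() rounds toward negative infinity, trunc() rounds toward zero.
-- For 1.7 both return 1; for -0.5, floor() returns -1 and trunc() returns 0.
SELECT x
     , floor(x) AS floor_x
     , trunc(x) AS trunc_x
FROM  (VALUES (1.7), (-0.5)) AS t(x);
```

With the sample data from 2017 all epoch values are positive, so both functions assign every row to the same 10-second bucket.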

I used ordinal numbers because you asked for "less verbose" code. But that is not the core of the question here ...

Core issue

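You can run a window function over an aggregate function in a single query level (though it is typically not faster than using a subquery).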

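But not the other way round: you cannot run an aggregate function over a window function in a single query level. It is simply not possible. In SQL, window functions are evaluated after aggregate functions.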
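
As a sketch of the direction that does work, the query below computes the per-interface average with GROUP BY and then sums those averages per (interval, host) with a window function, all in one query level. This is one possible formulation under the question's schema, not necessarily the form the answer had in mind. The bucketing expression is repeated in PARTITION BY because output column aliases cannot be referenced there, and DISTINCT collapses the one-row-per-interface output to one row per host.

```sql
SELECT DISTINCT
       to_timestamp(trunc(extract(epoch FROM ts) / 10) * 10) AS ts_interval
     , host
       -- avg() is the aggregate (per interval, host, interface);
       -- sum(...) OVER is the window function applied on top of it.
     , sum(avg(recv_bytes)) OVER (
          PARTITION BY to_timestamp(trunc(extract(epoch FROM ts) / 10) * 10)
                     , host
       ) AS recv_bytes
FROM   net
WHERE  ts >= '2017-01-01 00:00:00+00'
AND    ts <  '2017-01-01 00:00:20+00'
GROUP  BY 1, host, interface
ORDER  BY 1, 2;

-- The reverse nesting is rejected; uncommenting this raises an error:
-- SELECT sum(row_number() OVER ()) FROM net;
```

This avoids the subquery but is arguably no shorter or clearer than the two-level versions above, which is consistent with the remark that it is typically not faster either.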


Source: https://dba.stackexchange.com/questions/173044