PostgreSQL

Optimizing a view (and its underlying table) for averaging timestamps to the hour

  • December 15, 2017

I have this table:

CREATE TABLE spp.rtprices (
 "interval" timestamp without time zone NOT NULL,
 rtlmp numeric(12,6),
 rtmcc numeric(12,6),
 rtmcl numeric(12,6),
 node_id integer NOT NULL,
 CONSTRAINT rtprices_pkey PRIMARY KEY ("interval", node_id),
 CONSTRAINT rtprices_node_id_fkey FOREIGN KEY (node_id)
     REFERENCES spp.nodes (node_id) MATCH SIMPLE
     ON UPDATE RESTRICT ON DELETE RESTRICT
)

And a related index:

CREATE INDEX rtprices_node_id_interval_idx ON spp.rtprices (node_id, "interval");

On top of it, I created this view:

 CREATE OR REPLACE VIEW spp.rtprices_hourly AS 
 SELECT (rtprices."interval" - '00:05:00'::interval)::date::timestamp without time zone AS pricedate,
 date_part('hour'::text, date_trunc('hour'::text, rtprices."interval" - '00:05:00'::interval))::integer + 1 AS hour,
 rtprices.node_id,
 round(avg(rtprices.rtlmp), 2) AS rtlmp,
 round(avg(rtprices.rtmcc), 2) AS rtmcc,
 round(avg(rtprices.rtmcl), 2) AS rtmcl
 FROM spp.rtprices
 GROUP BY date_part('hour'::text, date_trunc('hour'::text, rtprices."interval" - '00:05:00'::interval))::integer + 1,
          rtprices.node_id,
          (rtprices."interval" - '00:05:00'::interval)::date::timestamp without time zone;

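The point is to give the average of the numeric columns for each hour (there is a data point every 5 minutes). The problem is that querying a single day for a single node_id takes more than 30 seconds to return 24 records.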

explain analyze select * from spp.rtprices_hourly
where node_id=20 and pricedate='2015-02-02'

It returns this:

  "HashAggregate  (cost=1128767.71..1128773.79 rows=135 width=28) (actual time=31155.023..31155.065 rows=24 loops=1)"
"  Group Key: ((date_part('hour'::text, date_trunc('hour'::text, (rtprices."interval" - '00:05:00'::interval))))::integer + 1), rtprices.node_id, (((rtprices."interval" - '00:05:00'::interval))::date)::timestamp without time zone"
"  ->  Bitmap Heap Scan on rtprices  (cost=10629.42..1128732.91 rows=2320 width=28) (actual time=25071.410..31153.715 rows=288 loops=1)"
"        Recheck Cond: (node_id = 20)"
"        Rows Removed by Index Recheck: 7142233"
"        Filter: (((("interval" - '00:05:00'::interval))::date)::timestamp without time zone = '2015-02-02 00:00:00'::timestamp without time zone)"
"        Rows Removed by Filter: 124909"
"        Heap Blocks: exact=43076 lossy=82085"
"        ->  Bitmap Index Scan on rtprices_node_id_interval_idx  (cost=0.00..10628.84 rows=464036 width=0) (actual time=68.999..68.999 rows=125197 loops=1)"
"              Index Cond: (node_id = 20)"
"Planning time: 5.243 ms"
"Execution time: 31155.392 ms"

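Simpler view

For this objective:

    The point is to give the average of the numeric columns for each hour

... truncating to the full hour seems just as good, and it is simpler and cheaper: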

CREATE OR REPLACE VIEW spp.rtprices_hourly AS 
SELECT date_trunc('hour', "interval") AS hour
    , node_id
    , round(avg(rtlmp), 2) AS rtlmp
    , round(avg(rtmcc), 2) AS rtmcc
    , round(avg(rtmcl), 2) AS rtmcl
FROM   spp.rtprices
GROUP  BY 1, 2;
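
One caveat is that plain truncation does not produce the same buckets as the original view, which subtracts 5 minutes first and numbers hours 1 to 24 (an "hour ending" convention). The following is a minimal sketch, using three made-up sample timestamps rather than rows from spp.rtprices, that puts the two schemes side by side so you can decide whether the difference matters for your data:

```sql
-- Hypothetical sample timestamps; no table is needed to run this.
SELECT ts
     , date_trunc('hour', ts)                         AS hour_start   -- simpler view
     , (ts - interval '5 min')::date                  AS pricedate    -- original view
     , date_part('hour', ts - interval '5 min')::int + 1 AS hour_ending  -- original view
FROM  (VALUES (timestamp '2015-02-02 00:00')
            , (timestamp '2015-02-02 00:05')
            , (timestamp '2015-02-02 01:00')) AS v(ts)
ORDER  BY ts;

-- ts                  | hour_start          | pricedate  | hour_ending
-- 2015-02-02 00:00:00 | 2015-02-02 00:00:00 | 2015-02-01 | 24
-- 2015-02-02 00:05:00 | 2015-02-02 00:00:00 | 2015-02-02 | 1
-- 2015-02-02 01:00:00 | 2015-02-02 01:00:00 | 2015-02-02 | 1
```

In the original view, a reading stamped exactly on the hour belongs to the hour that just ended; with date_trunc() it belongs to the hour that is starting. If you need the original convention, truncating "interval" - interval '5 min' instead keeps the same grouping.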

Faster query

Either way, the equivalent query on the view with sargable predicates would be:

SELECT *
FROM   spp.rtprices_hourly
WHERE  node_id = 20
AND    hour >= '2015-02-02 0:0'::timestamp
AND    hour <  '2015-02-03 0:0'::timestamp;

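This is faster, but still not as fast as it could be. The main performance loss is that the index can only be used with an index condition on node_id, the one column the view passes through unchanged. That is also why your index rtprices_node_id_interval_idx, with node_id as its leading column, matters: a B-tree index is most efficient when there are conditions on its leading columns.

The second predicate, on hour, has to be applied as a filter after tuples have been fetched from the heap (that is, after the rows have been read from the table). Most rows are discarded late in the process, so a lot of work is done in vain.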

Direct query is faster

Running the raw query and applying the predicates before aggregating is much faster:

SELECT date_trunc('hour', "interval") AS hour
    , node_id
    , round(avg(rtlmp), 2) AS rtlmp
    , round(avg(rtmcc), 2) AS rtmcc
    , round(avg(rtmcl), 2) AS rtmcl
FROM   spp.rtprices
WHERE  node_id = 20
AND    "interval" >= '2015-02-02 0:0'::timestamp
AND    "interval" <  '2015-02-03 0:0'::timestamp
GROUP  BY 1, 2;

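You will now see index conditions for all predicates. The more efficient index is still the one with node_id first: with the equality column leading, the range condition on "interval" selects a single contiguous slice of the index.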
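
To check this on your own data, you can prefix the direct query with EXPLAIN. This is a sketch that assumes the spp.rtprices table and the rtprices_node_id_interval_idx index from the question exist; it averages only one column to keep the example short, and the actual plan and timings will depend on your data and settings:

```sql
-- Assumes spp.rtprices and rtprices_node_id_interval_idx from the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('hour', "interval") AS hour
     , node_id
     , round(avg(rtlmp), 2) AS rtlmp
FROM   spp.rtprices
WHERE  node_id = 20
AND    "interval" >= timestamp '2015-02-02 00:00'
AND    "interval" <  timestamp '2015-02-03 00:00'
GROUP  BY 1, 2;

-- What to look for in the output:
--   * an "Index Cond" line that lists node_id AND both "interval" bounds
--   * no large "Rows Removed by Filter" / "Rows Removed by Index Recheck" counts
```

In the plan from the question, the "Index Cond" contains only node_id = 20 and the date condition shows up as a "Filter", which is the pattern the direct query is meant to avoid.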

Fast and short: create a function

So this does not work well with a view. Use a function instead:

CREATE OR REPLACE FUNCTION rtprices_hourly(_node_id int
                                        , _from timestamp
                                        , _to timestamp = NULL)
 RETURNS TABLE (
   hour    timestamp
 , node_id int
 , rtlmp   numeric
 , rtmcc   numeric
 , rtmcl   numeric) AS
$func$
SELECT date_trunc('hour', r."interval")  -- AS hour
    , r.node_id
    , round(avg(r.rtlmp), 2)  -- AS rtlmp
    , round(avg(r.rtmcc), 2)  -- AS rtmcc
    , round(avg(r.rtmcl), 2)  -- AS rtmcl
FROM   spp.rtprices r
WHERE  r.node_id     = _node_id
AND    r."interval" >= _from
AND    r."interval" <  COALESCE(_to, _from + interval '1 day')
GROUP  BY 1, 2
$func$  LANGUAGE sql STABLE;
  • Beware of naming conflicts between the OUT parameters and column names. That is why I table-qualified all columns here.

Now you get the best performance with a simple query:

SELECT * FROM rtprices_hourly(1, '2015-2-2 0:0'::timestamp, '2015-2-3 0:0'::timestamp);

I added a convenience feature: if the end timestamp (_to) is omitted, it defaults to "one day later":

SELECT * FROM rtprices_hourly(1, '2015-2-2 0:0'::timestamp);
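
Because the parameters are named, the same call can also be written in named notation, which makes it more obvious which argument is which. This sketch assumes the rtprices_hourly() function defined above has been created and that you are on PostgreSQL 9.5 or later (older versions use := instead of =>):

```sql
-- Assumes rtprices_hourly() from above exists; _to is omitted,
-- so it falls back to _from + interval '1 day' inside the function.
SELECT *
FROM   rtprices_hourly(_node_id => 1
                     , _from    => timestamp '2015-02-02 00:00');
```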


You can query any range:

SELECT * FROM rtprices_hourly(1, '2015-2-2 10:0'::timestamp, '2015-2-2 20:0'::timestamp);

Source: https://dba.stackexchange.com/questions/101956