Postgresql

大表慢窗函式查詢

  • July 10, 2015

我正在對 PostgreSQL 9.4rc1 上的新數據庫設計進行一些性能測試,我看到一些使用視窗函式的查詢非常慢。這是我的表設置:

CREATE TABLE player_stat (
 player_id    VARCHAR(200) NOT NULL,
 stat_id      BIGINT NOT NULL,
 value        BIGINT NOT NULL DEFAULT 0,
 last_updated TIMESTAMP WITH TIME ZONE NOT NULL,
 last_active  TIMESTAMP WITH TIME ZONE DEFAULT NULL,

 CONSTRAINT player_stat_pk PRIMARY KEY (player_id, stat_id),
 CONSTRAINT player_stat_fk1 FOREIGN KEY(stat_id) REFERENCES stat (id)
);
CREATE INDEX player_stat_stat_value_player_desc
 ON player_stat (stat_id, value DESC, player_id ASC);

我在這張表中插入了 3000 萬行,分為 3 個統計數據:

INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 1, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 2, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;
INSERT INTO player_stat (player_id, stat_id, value, last_updated) SELECT x.id, 3, x.v, now() FROM (SELECT generate_series(1,10000000) as id, trunc(random() * (1900-1200) + 1200) as v) AS x;

然後我嘗試對給定統計數據的玩家進行排名(編輯):

SELECT * FROM 
( SELECT player_id
      , rank() OVER (ORDER BY value DESC, player_id ASC)  as rank 
 FROM player_stat 
 WHERE stat_id = 1
) as t 
WHERE rank <= 20 
ORDER BY rank ASC;

此查詢大約需要 5.5 秒才能返回。在其上執行解釋返回以下內容:

"Sort  (cost=1167612.28..1176082.26 rows=3387993 width=15) (actual time=9726.132..9726.135 rows=20 loops=1)"
"  Sort Key: t.rank"
"  Sort Method: quicksort  Memory: 25kB"
"  ->  Subquery Scan on t  (cost=0.56..684349.57 rows=3387993 width=15) (actual time=0.080..9726.116 rows=20 loops=1)"
"        Filter: (t.rank <= 20)"
"        Rows Removed by Filter: 9999980"
"        ->  WindowAgg  (cost=0.56..557299.83 rows=10163979 width=15) (actual time=0.077..8351.124 rows=10000000 loops=1)"
"              ->  Index Only Scan using player_stat_stat_value_player_desc on player_stat  (cost=0.56..379430.20 rows=10163979 width=15) (actual time=0.054..2319.007 rows=10000000 loops=1)"
"                    Index Cond: (stat_id = 1)"
"                    Heap Fetches: 0"
"Planning time: 0.187 ms"
"Execution time: 9726.172 ms"

有什麼辦法可以加快速度嗎?所花費的時間似乎與桌上玩家的數量呈線性增長。

有什麼辦法可以加快速度嗎?

是的。不要使用varchar列作為integer數字。使用integer或者bigint如果您刻錄那麼多 ID - 表和索引要小得多,處理速度更快。由於您在測試中對1000 萬行進行排名,這將產生重大影響。

player_id VARCHAR(200) NOT NULL,

player_id int NOT NULL,

或者uuid如果你必須(我懷疑):

您的查詢對 1000 萬行進行排名。即使直接從索引中讀取並且沒有排序步驟,這也需要一些時間。

旁注:如果您首先批量插入行並在之後添加索引和 PK 約束(和 FK 約束),那將會快得多,而且您可以獲得完美的索引而不會膨脹而無需執行REINDEXor VACUUM FULL. 不過,在測試性能之前,請確保ANALYZE已在桌面上執行。

你沒問的

..但是,在這里四處走動,可能正在尋找什麼。

輸出顯示EXPLAIN您過濾了前 20 行:(t.rank <= 20). 您提出的查詢沒有顯示。實際匹配您的EXPLAIN輸出的查詢將是:

SELECT * FROM (
  SELECT player_id
       , rank() OVER (ORDER BY value DESC, player_id ASC) AS rank
  FROM   player_stat
  WHERE  stat_id = 1
  ) t
WHERE t.rank <= 20;

可以顯著改善

SELECT row_number() OVER (ORDER BY value DESC, player_id ASC) AS rank
    , player_id
FROM   player_stat
WHERE  stat_id = 1
ORDER  BY value DESC, player_id
LIMIT  20;

解釋

  • 性能的重要部分是LIMIT與匹配索引相結合的子句ORDER BY:現在查詢從頂部到索引恰好讀取 20 行,在原始版本中它必須讀取 10000000。我們只使用player_idand value,所以我們仍然可以進行僅索引掃描。剩下的就是花生了。
  • 這都是由於查詢中的事件序列SELECT:視窗函式在之前 LIMIT應用。僅當排序順序一致時,我們才不必考慮其餘適用的 10000000 行。
  • 我們可以使用LIMIT 20,因為前 20 個排名保證不超過 20 行。PK on(player_id, stat_id)保證player_id每一個都是唯一stat_id的,因為它包含在 中ORDER BY,每個等級只分配一次 - 這也意味著我們可以使用稍微便宜一點的row_number()代替。

引用自:https://dba.stackexchange.com/questions/106513