Postgresql

優化對 690,000 行表的昂貴的 GROUP BY / ORDER BY 查詢

  • March 3, 2022

我正在編寫一個查詢來計算他的遊戲得分。它執行 GROUP BY 以匯總儲存在 23 個不同行(由不同“技能”分隔)中的單個玩家統計數據的 hiscore 統計數據。

我提出的以下查詢有效:

SELECT * FROM (
   SELECT
       username,
       SUM(level) total_level,
       SUM(experience) total_experience,
       (SELECT rank FROM overall_hiscore_rankings WHERE username = skill_hiscore.username) rank
       FROM skill_hiscore GROUP BY username ORDER BY rank
    ) a WHERE rank >= 1 AND rank <= 25;

注意:overall_hiscore_rankings 是一個物化視圖,最後的過濾器是我對分頁的 seek 方法的實現,以防止昂貴的 LIMIT OFFSET 分頁。

在將大約 30,000 名玩家價值的數據播種到我的 hiscores 架構中之前,我沒有註意到這有任何問題。因為如上所述,每個玩家都有 23 行不同的行,所以這給我們留下了一個有690,000 行的表。

現在這給了我們非常糟糕的性能,實際上我的數據庫需要100 秒來執行這個查詢:

Sort  (cost=35926899.17..35926907.50 rows=3330 width=35) (actual time=98599.592..98599.595 rows=25 loops=1)
 Sort Key: ((SubPlan 1))
 Sort Method: quicksort  Memory: 26kB
 ->  GroupAggregate  (cost=0.42..35926704.35 rows=3330 width=35) (actual time=149.023..98599.514 rows=25 loops=1)
       Group Key: skill_hiscore.username
       Filter: (((SubPlan 2) >= 1) AND ((SubPlan 3) <= 25))
       Rows Removed by Filter: 29978
       ->  Index Scan using skill_hiscore_username on skill_hiscore  (cost=0.42..46884.88 rows=690063 width=19) (actual time=144.644..293.859 rows=690063 loops=1)
       SubPlan 1
         ->  Seq Scan on overall_hiscore_rankings  (cost=0.00..567.04 rows=1 width=8) (actual time=0.002..1.592 rows=1 loops=25)
               Filter: ((username)::text = (skill_hiscore.username)::text)
               Rows Removed by Filter: 30002
       SubPlan 2
         ->  Seq Scan on overall_hiscore_rankings overall_hiscore_rankings_1  (cost=0.00..567.04 rows=1 width=8) (actual time=0.817..1.633 rows=1 loops=30003)
               Filter: ((username)::text = (skill_hiscore.username)::text)
               Rows Removed by Filter: 30002
       SubPlan 3
         ->  Seq Scan on overall_hiscore_rankings overall_hiscore_rankings_2  (cost=0.00..567.04 rows=1 width=8) (actual time=0.816..1.633 rows=1 loops=30003)
               Filter: ((username)::text = (skill_hiscore.username)::text)
               Rows Removed by Filter: 30002
Planning Time: 2.820 ms
JIT:
 Functions: 21
 Options: Inlining true, Optimization true, Expressions true, Deforming true
 Timing: Generation 1.248 ms, Inlining 44.936 ms, Optimization 60.995 ms, Emission 38.520 ms, Total 145.700 ms
Execution Time: 98623.550 ms

顯然,ORDER BYandGROUP BY導致了最大的性能消耗,但我不確定如何重寫它。

我沒有彈出任何資訊表明此查詢需要 100 秒 - 我從未見過查詢執行得如此糟糕,坦率地說,我不知道如何優化它。

此查詢不受任何索引的支持。

我的主要問題是:這個查詢中缺少數據和數據****的索引是否足以使這個查詢的執行時間為 100 秒GROUP BY``ORDER BY

相關子查詢確實會拖累性能,而且沒有必要。

也缺乏索引。

請嘗試:

SELECT username
    , sum(s.level) AS total_level
    , sum(s.experience) AS total_experience
    , o.rank
FROM   overall_hiscore_rankings o
JOIN   skill_hiscore s USING (username)
WHERE  o.rank >= 1
AND    o.rank <= 25
GROUP  BY username, o.rank -- ?
ORDER  BY o.rank;

如果overall_hiscore_rankings.username是唯一的,您可以添加o.rankGROUP BY. 我需要表定義來確定……

理想情況下,您有如下索引:

CREATE INDEX ON overall_hiscore_rankings (rank, username);
CREATE INDEX ON skill_hiscore (username) INCLUDE (level, experience);

覆蓋索引只有在您的表足夠真空時才有意義。否則只做(username)- 您的查詢計劃中的索引skill_hiscore_username似乎涵蓋了這一點。看:

確保已autovacuum執行,或VACUUM ANALYZE在創建測試案例後手動執行一次以獲取目前列統計資訊並更新可見性圖。

順便說一句,您提到*“ 23個不同的行(由不同的技能分隔)”*,但WHERE rank >= 1 AND rank <= 25似乎選擇了其中的25個。你的意思是WHERE rank > 1 AND rank < 25

引用自:https://dba.stackexchange.com/questions/308156