Postgresql

優化查詢以查找評論文章的前 N 個使用者

  • May 12, 2016

問題

試圖找到最有效的查詢來檢索對文章發表評論的前 N ​​個(範例中為 5 個)使用者,如果使用者擁有最多的關注者,則該使用者被視為“頂部”。查詢優化器似乎沒有選擇正確的路徑。

表(Postgres v9.4.4)

user_account(40k 條記錄)

CREATE TABLE user_account (
 user_id TEXT PRIMARY KEY,
 name TEXT
);

以下(13k 條記錄)

CREATE TABLE following (
 follower_user_id TEXT,
 followed_user_id TEXT,
 PRIMARY KEY (followed_user_id, follower_user_id)
);

follower_count_mv(10.5k 記錄,只有 5 個使用者擁有 > 1 個關注者)

CREATE MATERIALIZED VIEW follower_count_mv AS
SELECT followed_user_id AS user_id, COUNT(*)::int AS follower_count
FROM following
WHERE deleted_at IS NULL
GROUP BY user_id;

CREATE UNIQUE INDEX follower_count_mv_user_id_idx ON follower_count_mv (user_id);
CREATE INDEX follower_count_mv_follower_count_idx ON follower_count_mv (follower_count);

user_post_comment(13.4k 條記錄,但大多數在 3 個文章上)

CREATE TABLE user_post_comment (
 comment_id TEXT PRIMARY KEY,
 user_id TEXT,
 post_id TEXT
)
CREATE INDEX user_post_comment_user_id_idx ON user_post_comment (user_id);
CREATE INDEX user_post_comment_post_id_idx ON user_post_comment (post_id);

我嘗試過的查詢

1)最自然的選擇:加入表格並排序

SELECT user_account.*
FROM user_account
JOIN follower_count_mv ON (user_account.user_id = follower_count_mv.user_id)
JOIN user_post_comment ON (user_account.user_id = user_post_comment.user_id)
WHERE user_post_comment.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
ORDER BY follower_count DESC LIMIT 5;

這是我最初擁有的,但查詢優化器似乎很難找出執行此操作的最佳方法。也許與數據分佈有關?

Limit  (cost=0.99..117.00 rows=5 width=580) (actual time=0.082..148.688 rows=2 loops=1)
 ->  Nested Loop  (cost=0.99..12136.14 rows=523 width=580) (actual time=0.081..148.687 rows=2 loops=1)
       ->  Nested Loop  (cost=0.57..6875.61 rows=1570 width=78) (actual time=0.049..148.624 rows=2 loops=1)
             ->  Index Scan Backward using follower_count_mv_follower_count_idx on follower_count_mv  (cost=0.29..383.25 rows=10483 width=41) (actual time=0.011..1.904 rows=10483 loops=1)
             ->  Index Scan using user_post_comment_user_id_idx on user_post_comment  (cost=0.29..0.61 rows=1 width=37) (actual time=0.014..0.014 rows=0 loops=10483)
                   Index Cond: ((user_id)::text = (follower_count_mv.user_id)::text)
                   Filter: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text)
                   Rows Removed by Filter: 0
       ->  Index Scan using user_account_pkey on user_account  (cost=0.41..3.34 rows=1 width=576) (actual time=0.029..0.030 rows=1 loops=2)
             Index Cond: ((user_id)::text = (follower_count_mv.user_id)::text)
Planning time: 4.172 ms
Execution time: 148.763 ms

它似乎循環了 10483 次……為什麼?

2)#1沒有指定限制(顯然使它更快……)

Sort  (cost=6406.46..6407.76 rows=523 width=580) (actual time=14.574..14.574 rows=2 loops=1)
 Sort Key: follower_count_mv.follower_count
 Sort Method: quicksort  Memory: 26kB
 ->  Nested Loop  (cost=799.36..6382.84 rows=523 width=580) (actual time=11.633..14.545 rows=2 loops=1)
       ->  Hash Join  (cost=798.95..1122.31 rows=1570 width=78) (actual time=11.590..14.469 rows=2 loops=1)
             Hash Cond: ((follower_count_mv.user_id)::text = (user_post_comment.user_id)::text)
             ->  Seq Scan on follower_count_mv  (cost=0.00..202.83 rows=10483 width=41) (actual time=0.005..1.168 rows=10483 loops=1)
             ->  Hash  (cost=773.89..773.89 rows=2005 width=37) (actual time=11.448..11.448 rows=2005 loops=1)
                   Buckets: 1024  Batches: 1  Memory Usage: 136kB
                   ->  Bitmap Heap Scan on user_post_comment  (cost=87.82..773.89 rows=2005 width=37) (actual time=1.211..11.040 rows=2005 loops=1)
                         Recheck Cond: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text)
                         Heap Blocks: exact=105
                         ->  Bitmap Index Scan on user_post_comment_post_id_idx  (cost=0.00..87.32 rows=2005 width=0) (actual time=1.196..1.196 rows=2005 loops=1)
                               Index Cond: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text)
       ->  Index Scan using user_account_pkey on user_account  (cost=0.41..3.34 rows=1 width=576) (actual time=0.034..0.035 rows=1 loops=2)
             Index Cond: ((user_id)::text = (follower_count_mv.user_id)::text)
Planning time: 1.935 ms
Execution time: 14.719 ms

3)最佳(但混亂)方式(我已經能夠找到)

SELECT user_account.*
FROM user_account
JOIN follower_count_mv ON (user_account.user_id = follower_count_mv.user_id)
JOIN user_post_comment ON (user_account.user_id = user_post_comment.user_id)
WHERE user_post_comment.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
AND user_account.user_id IN (SELECT user_id FROM follower_count_mv ORDER BY follower_count DESC LIMIT 5)
ORDER BY follower_count DESC LIMIT 5;

使用子查詢首先計算前 N 個使用者 ID 似乎迫使優化器進行更有效的計算。

Limit  (cost=44.87..44.88 rows=1 width=580) (actual time=0.588..0.588 rows=2 loops=1)
  ->  Sort  (cost=44.87..44.88 rows=1 width=580) (actual time=0.587..0.587 rows=2 loops=1)
        Sort Key: follower_count_mv.follower_count
        Sort Method: quicksort  Memory: 26kB
        ->  Nested Loop  (cost=1.52..44.86 rows=1 width=580) (actual time=0.358..0.571 rows=2 loops=1)
              ->  Nested Loop  (cost=1.23..44.47 rows=1 width=654) (actual time=0.116..0.405 rows=5 loops=1)
                    ->  Nested Loop  (cost=0.95..42.81 rows=5 width=613) (actual time=0.081..0.243 rows=5 loops=1)
                          ->  HashAggregate  (cost=0.53..0.58 rows=5 width=37) (actual time=0.028..0.030 rows=5 loops=1)
                                Group Key: ("ANY_subquery".user_id)::text
                                ->  Subquery Scan on "ANY_subquery"  (cost=0.29..0.52 rows=5 width=37) (actual time=0.014..0.019 rows=5 loops=1)
                                      ->  Limit  (cost=0.29..0.47 rows=5 width=41) (actual time=0.013..0.018 rows=5 loops=1)
                                            ->  Index Scan Backward using follower_count_mv_follower_count_idx on follower_count_mv follower_count_mv_1  (cost=0.29..383.25 rows=10483 width=41) (actual time=0.013..0.018 rows=5 loops=1)
                          ->  Index Scan using user_account_pkey on user_account  (cost=0.41..8.43 rows=1 width=576) (actual time=0.041..0.041 rows=1 loops=5)
                                Index Cond: ((user_id)::text = ("ANY_subquery".user_id)::text)
                    ->  Index Scan using follower_count_mv_user_id_idx on follower_count_mv  (cost=0.29..0.32 rows=1 width=41) (actual time=0.030..0.030 rows=1 loops=5)
                          Index Cond: ((user_id)::text = (user_account.user_id)::text)
              ->  Index Scan using user_post_comment_idx on user_post_comment  (cost=0.29..0.39 rows=1 width=37) (actual time=0.031..0.032 rows=0 loops=5)
                    Index Cond: ((user_id)::text = (user_account.user_id)::text)
                    Filter: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text)
                    Rows Removed by Filter: 1
Planning time: 2.035 ms
Execution time: 0.785 ms

4)加入複合索引並使用子查詢獲取前5名再加入

CREATE INDEX user_post_comment_post_id_user_id_idx ON user_post_comment (post_id, user_id);
CREATE INDEX follower_count_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count);

SELECT ua.*
FROM (
 SELECT user_id
 FROM user_post_comment pc
 JOIN follower_count_mv fc ON (pc.user_id = fc.user_id)
 WHERE post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
 ORDER BY fc.follower_count DESC LIMIT 5
) sub
JOIN user_account ua ON (sub.user_id = ua.user_id)
JOIN follower_count_mv fc ON (sub.user_id = fc.user_id)
ORDER BY follower_count DESC LIMIT 5;

Limit  (cost=57.36..57.38 rows=5 width=579) (actual time=329.718..329.720 rows=2 loops=1)
 ->  Sort  (cost=57.36..57.38 rows=5 width=579) (actual time=329.717..329.717 rows=2 loops=1)
       Sort Key: fc.follower_count
       Sort Method: quicksort  Memory: 26kB
       ->  Nested Loop  (cost=1.40..57.31 rows=5 width=579) (actual time=0.118..329.709 rows=2 loops=1)
             Join Filter: ((pc.user_id)::text = (ua.user_id)::text)
             ->  Nested Loop  (cost=0.98..40.54 rows=5 width=78) (actual time=0.089..329.657 rows=2 loops=1)
                   ->  Limit  (cost=0.70..18.92 rows=5 width=41) (actual time=0.067..329.618 rows=2 loops=1)
                         ->  Nested Loop  (cost=0.70..5721.98 rows=1570 width=41) (actual time=0.067..329.617 rows=2 loops=1)
                               ->  Index Scan Backward using follower_count_mv_follower_count_idx on follower_count_mv fc_1  (cost=0.29..383.25 rows=10483 width=41) (actual time=0.015..1.872 rows=10483 loops=1)
                               ->  Index Only Scan using user_post_comment_post_id_user_id_idx on user_post_comment pc  (cost=0.41..0.50 rows=1 width=37) (actual time=0.031..0.031 rows=0 loops=10483)
                                     Index Cond: ((post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text) AND (user_id = (fc_1.user_id)::text))
                                     Heap Fetches: 0
                   ->  Index Only Scan using follower_count_mv_user_id_follower_count_idx on follower_count_mv fc  (cost=0.29..4.31 rows=1 width=41) (actual time=0.019..0.019 rows=1 loops=2)
                         Index Cond: (user_id = (pc.user_id)::text)
                         Heap Fetches: 0
             ->  Index Scan using user_account_pkey on user_account ua  (cost=0.41..3.34 rows=1 width=575) (actual time=0.023..0.024 rows=1 loops=2)
                   Index Cond: ((user_id)::text = (fc.user_id)::text)
Planning time: 3.007 ms
Execution time: 331.744 ms

估計的時間減少了,但執行時間增加了一倍。我不明白這是什麼原因…

懸而未決的問題

  1. 為什麼#1 循環遍歷整個follower_count_mv表?
  2. 為什麼LIMIT從 #1 中刪除會使查詢優化器選擇不同的查詢計劃,其中估計成本更高,但實際執行時間遠低於 #1?
  3. 為什麼查詢優化器不夠聰明,無法弄清楚它應該執行查詢#3 對查詢#1 所做的事情?數據分佈是否會絆倒它?
  4. 如果數據庫的數據分佈更加規範化,那麼查詢 #1 會表現最好嗎?我仍然不清楚分佈如何影響查詢計劃。
  5. 如果不是#3,在這種情況下使用的最佳查詢是什麼?#3對我來說似乎很老套;有沒有辦法讓查詢優化器對#1 的查詢使用與#2 相同的計劃?

更好的數據類型

text是鍵列的次優數據類型。使用起來會更有效率integer。有關的:

'26c72242-7e3b-4982-92c5-021b622d7ecd'在你的例子中看起來像一個 UUID。如果您需要使用 UUID,仍然不要將它們儲存為text. 適當的數據類型將**uuid**- 效率更高。細節:

索引

您只使用單列索引。由於您需要優化查詢的讀取性能,請添加這兩個多列索引

CREATE INDEX fc_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count DESC);
CREATE INDEX upc_post_id_user_id_idx ON user_post_comment (post_id, user_id);

@Ziggy已經提到了第二個。

索引列的順序很重要:

Btree 索引可以以幾乎相同的成本向後掃描。但是使用匹配的降序排序會更快一些。(有一個 NULL 值的極端情況。)

詢問

對於您範例中的單個文章*,它不會比這更快:*

SELECT ua.*
FROM  (
  SELECT user_id, fc.follower_count 
  FROM  (
     SELECT DISTINCT user_id
     FROM   user_post_comment
     WHERE  post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
     ) pc
  JOIN   follower_count_mv fc USING (user_id)
  ORDER  BY fc.follower_count DESC
  LIMIT  5
  ) sub
JOIN   user_account ua USING (user_id)
ORDER  BY sub.follower_count DESC;

這是假設同一個使用者可以多次評論同一個文章。在加入之前折疊重複是最便宜的follower_count_mv

並直接加入follower_count_mvuser_account用作墊腳石既昂貴又無用。

降到前5名user_account 後才加入。

您沒有指定,但您的查詢有一個外部ORDER BY follower_count DESC. 我只包含follower_count在外部排序的子查詢中。


如果(post_id, user_id)是唯一的(使用者只能對每個文章發表評論),則簡化為:

SELECT ua.*
FROM  (
  SELECT user_id, fc.follower_count
  FROM   user_post_comment pc
  JOIN   follower_count_mv fc USING (user_id)
  WHERE  pc.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
  ORDER  BY fc.follower_count DESC
  LIMIT  5
  ) sub
JOIN   user_account ua USING (user_id)
ORDER  BY sub.follower_count DESC;

一次獲得多個文章的前 N ​​名有點複雜。此相關答案的第 2a 章中的詳細說明:

引用自:https://dba.stackexchange.com/questions/138158