優化查詢以查找評論文章的前 N 個使用者
問題
試圖找到最有效的查詢來檢索對文章發表評論的前 N 個(範例中為 5 個)使用者,如果使用者擁有最多的關注者,則該使用者被視為“頂部”。查詢優化器似乎沒有選擇正確的路徑。
表(Postgres v9.4.4)
user_account(40k 條記錄)
CREATE TABLE user_account ( user_id TEXT PRIMARY KEY, name TEXT );
以下(13k 條記錄)
CREATE TABLE following ( follower_user_id TEXT, followed_user_id TEXT, PRIMARY KEY (followed_user_id, follower_user_id) );
follower_count_mv(10.5k 記錄,只有 5 個使用者擁有 > 1 個關注者)
CREATE MATERIALIZED VIEW follower_count_mv AS SELECT followed_user_id AS user_id, COUNT(*)::int AS follower_count FROM following WHERE deleted_at IS NULL GROUP BY user_id; CREATE UNIQUE INDEX follower_count_mv_user_id_idx ON follower_count_mv (user_id); CREATE INDEX follower_count_mv_follower_count_idx ON follower_count_mv (follower_count);
user_post_comment(13.4k 條記錄,但大多數在 3 個文章上)
CREATE TABLE user_post_comment ( comment_id TEXT PRIMARY KEY, user_id TEXT, post_id TEXT ) CREATE INDEX user_post_comment_user_id_idx ON user_post_comment (user_id); CREATE INDEX user_post_comment_post_id_idx ON user_post_comment (post_id);
我嘗試過的查詢
1)最自然的選擇:加入表格並排序
SELECT user_account.* FROM user_account JOIN follower_count_mv ON (user_account.user_id = follower_count_mv.user_id) JOIN user_post_comment ON (user_account.user_id = user_post_comment.user_id) WHERE user_post_comment.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd' ORDER BY follower_count DESC LIMIT 5;
這是我最初擁有的,但查詢優化器似乎很難找出執行此操作的最佳方法。也許與數據分佈有關?
Limit (cost=0.99..117.00 rows=5 width=580) (actual time=0.082..148.688 rows=2 loops=1) -> Nested Loop (cost=0.99..12136.14 rows=523 width=580) (actual time=0.081..148.687 rows=2 loops=1) -> Nested Loop (cost=0.57..6875.61 rows=1570 width=78) (actual time=0.049..148.624 rows=2 loops=1) -> Index Scan Backward using follower_count_mv_follower_count_idx on follower_count_mv (cost=0.29..383.25 rows=10483 width=41) (actual time=0.011..1.904 rows=10483 loops=1) -> Index Scan using user_post_comment_user_id_idx on user_post_comment (cost=0.29..0.61 rows=1 width=37) (actual time=0.014..0.014 rows=0 loops=10483) Index Cond: ((user_id)::text = (follower_count_mv.user_id)::text) Filter: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text) Rows Removed by Filter: 0 -> Index Scan using user_account_pkey on user_account (cost=0.41..3.34 rows=1 width=576) (actual time=0.029..0.030 rows=1 loops=2) Index Cond: ((user_id)::text = (follower_count_mv.user_id)::text) Planning time: 4.172 ms Execution time: 148.763 ms
它似乎循環了 10483 次……為什麼?
2)#1沒有指定限制(顯然使它更快……)
Sort (cost=6406.46..6407.76 rows=523 width=580) (actual time=14.574..14.574 rows=2 loops=1) Sort Key: follower_count_mv.follower_count Sort Method: quicksort Memory: 26kB -> Nested Loop (cost=799.36..6382.84 rows=523 width=580) (actual time=11.633..14.545 rows=2 loops=1) -> Hash Join (cost=798.95..1122.31 rows=1570 width=78) (actual time=11.590..14.469 rows=2 loops=1) Hash Cond: ((follower_count_mv.user_id)::text = (user_post_comment.user_id)::text) -> Seq Scan on follower_count_mv (cost=0.00..202.83 rows=10483 width=41) (actual time=0.005..1.168 rows=10483 loops=1) -> Hash (cost=773.89..773.89 rows=2005 width=37) (actual time=11.448..11.448 rows=2005 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 136kB -> Bitmap Heap Scan on user_post_comment (cost=87.82..773.89 rows=2005 width=37) (actual time=1.211..11.040 rows=2005 loops=1) Recheck Cond: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text) Heap Blocks: exact=105 -> Bitmap Index Scan on user_post_comment_post_id_idx (cost=0.00..87.32 rows=2005 width=0) (actual time=1.196..1.196 rows=2005 loops=1) Index Cond: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text) -> Index Scan using user_account_pkey on user_account (cost=0.41..3.34 rows=1 width=576) (actual time=0.034..0.035 rows=1 loops=2) Index Cond: ((user_id)::text = (follower_count_mv.user_id)::text) Planning time: 1.935 ms Execution time: 14.719 ms
3)最佳(但混亂)方式(我已經能夠找到)
SELECT user_account.* FROM user_account JOIN follower_count_mv ON (user_account.user_id = follower_count_mv.user_id) JOIN user_post_comment ON (user_account.user_id = user_post_comment.user_id) WHERE user_post_comment.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd' AND user_account.user_id IN (SELECT user_id FROM follower_count_mv ORDER BY follower_count DESC LIMIT 5) ORDER BY follower_count DESC LIMIT 5;
使用子查詢首先計算前 N 個使用者 ID 似乎迫使優化器進行更有效的計算。
Limit (cost=44.87..44.88 rows=1 width=580) (actual time=0.588..0.588 rows=2 loops=1) -> Sort (cost=44.87..44.88 rows=1 width=580) (actual time=0.587..0.587 rows=2 loops=1) Sort Key: follower_count_mv.follower_count Sort Method: quicksort Memory: 26kB -> Nested Loop (cost=1.52..44.86 rows=1 width=580) (actual time=0.358..0.571 rows=2 loops=1) -> Nested Loop (cost=1.23..44.47 rows=1 width=654) (actual time=0.116..0.405 rows=5 loops=1) -> Nested Loop (cost=0.95..42.81 rows=5 width=613) (actual time=0.081..0.243 rows=5 loops=1) -> HashAggregate (cost=0.53..0.58 rows=5 width=37) (actual time=0.028..0.030 rows=5 loops=1) Group Key: ("ANY_subquery".user_id)::text -> Subquery Scan on "ANY_subquery" (cost=0.29..0.52 rows=5 width=37) (actual time=0.014..0.019 rows=5 loops=1) -> Limit (cost=0.29..0.47 rows=5 width=41) (actual time=0.013..0.018 rows=5 loops=1) -> Index Scan Backward using follower_count_mv_follower_count_idx on follower_count_mv follower_count_mv_1 (cost=0.29..383.25 rows=10483 width=41) (actual time=0.013..0.018 rows=5 loops=1) -> Index Scan using user_account_pkey on user_account (cost=0.41..8.43 rows=1 width=576) (actual time=0.041..0.041 rows=1 loops=5) Index Cond: ((user_id)::text = ("ANY_subquery".user_id)::text) -> Index Scan using follower_count_mv_user_id_idx on follower_count_mv (cost=0.29..0.32 rows=1 width=41) (actual time=0.030..0.030 rows=1 loops=5) Index Cond: ((user_id)::text = (user_account.user_id)::text) -> Index Scan using user_post_comment_idx on user_post_comment (cost=0.29..0.39 rows=1 width=37) (actual time=0.031..0.032 rows=0 loops=5) Index Cond: ((user_id)::text = (user_account.user_id)::text) Filter: ((post_id)::text = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text) Rows Removed by Filter: 1 Planning time: 2.035 ms Execution time: 0.785 ms
4)加入複合索引並使用子查詢獲取前5名再加入
CREATE INDEX user_post_comment_post_id_user_id_idx ON user_post_comment (post_id, user_id); CREATE INDEX follower_count_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count); SELECT ua.* FROM ( SELECT user_id FROM user_post_comment pc JOIN follower_count_mv fc ON (pc.user_id = fc.user_id) WHERE post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd' ORDER BY fc.follower_count DESC LIMIT 5 ) sub JOIN user_account ua ON (sub.user_id = ua.user_id) JOIN follower_count_mv fc ON (sub.user_id = fc.user_id) ORDER BY follower_count DESC LIMIT 5; Limit (cost=57.36..57.38 rows=5 width=579) (actual time=329.718..329.720 rows=2 loops=1) -> Sort (cost=57.36..57.38 rows=5 width=579) (actual time=329.717..329.717 rows=2 loops=1) Sort Key: fc.follower_count Sort Method: quicksort Memory: 26kB -> Nested Loop (cost=1.40..57.31 rows=5 width=579) (actual time=0.118..329.709 rows=2 loops=1) Join Filter: ((pc.user_id)::text = (ua.user_id)::text) -> Nested Loop (cost=0.98..40.54 rows=5 width=78) (actual time=0.089..329.657 rows=2 loops=1) -> Limit (cost=0.70..18.92 rows=5 width=41) (actual time=0.067..329.618 rows=2 loops=1) -> Nested Loop (cost=0.70..5721.98 rows=1570 width=41) (actual time=0.067..329.617 rows=2 loops=1) -> Index Scan Backward using follower_count_mv_follower_count_idx on follower_count_mv fc_1 (cost=0.29..383.25 rows=10483 width=41) (actual time=0.015..1.872 rows=10483 loops=1) -> Index Only Scan using user_post_comment_post_id_user_id_idx on user_post_comment pc (cost=0.41..0.50 rows=1 width=37) (actual time=0.031..0.031 rows=0 loops=10483) Index Cond: ((post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'::text) AND (user_id = (fc_1.user_id)::text)) Heap Fetches: 0 -> Index Only Scan using follower_count_mv_user_id_follower_count_idx on follower_count_mv fc (cost=0.29..4.31 rows=1 width=41) (actual time=0.019..0.019 rows=1 loops=2) Index Cond: (user_id = (pc.user_id)::text) Heap Fetches: 0 -> Index Scan using user_account_pkey on user_account ua (cost=0.41..3.34 rows=1 width=575) (actual time=0.023..0.024 rows=1 loops=2) Index Cond: ((user_id)::text = (fc.user_id)::text) Planning time: 3.007 ms Execution time: 331.744 ms
估計的時間減少了,但執行時間增加了一倍。我不明白這是什麼原因…
懸而未決的問題
- 為什麼#1 循環遍歷整個
follower_count_mv
表?- 為什麼
LIMIT
從 #1 中刪除會使查詢優化器選擇不同的查詢計劃,其中估計成本更高,但實際執行時間遠低於 #1?- 為什麼查詢優化器不夠聰明,無法弄清楚它應該執行查詢#3 對查詢#1 所做的事情?數據分佈是否會絆倒它?
- 如果數據庫的數據分佈更加規範化,那麼查詢 #1 會表現最好嗎?我仍然不清楚分佈如何影響查詢計劃。
- 如果不是#3,在這種情況下使用的最佳查詢是什麼?#3對我來說似乎很老套;有沒有辦法讓查詢優化器對#1 的查詢使用與#2 相同的計劃?
更好的數據類型
text
是鍵列的次優數據類型。使用起來會更有效率integer
。有關的:
'26c72242-7e3b-4982-92c5-021b622d7ecd'
在你的例子中看起來像一個 UUID。如果您需要使用 UUID,仍然不要將它們儲存為text
. 適當的數據類型將**uuid
**- 效率更高。細節:索引
您只使用單列索引。由於您需要優化查詢的讀取性能,請添加這兩個多列索引:
CREATE INDEX fc_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count DESC); CREATE INDEX upc_post_id_user_id_idx ON user_post_comment (post_id, user_id);
@Ziggy已經提到了第二個。
索引列的順序很重要:
Btree 索引可以以幾乎相同的成本向後掃描。但是使用匹配的降序排序會更快一些。(有一個 NULL 值的極端情況。)
詢問
對於您範例中的單個文章*,它不會比這更快:*
SELECT ua.* FROM ( SELECT user_id, fc.follower_count FROM ( SELECT DISTINCT user_id FROM user_post_comment WHERE post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd' ) pc JOIN follower_count_mv fc USING (user_id) ORDER BY fc.follower_count DESC LIMIT 5 ) sub JOIN user_account ua USING (user_id) ORDER BY sub.follower_count DESC;
這是假設同一個使用者可以多次評論同一個文章。在加入之前折疊重複是最便宜的
follower_count_mv
。並直接加入
follower_count_mv
。user_account
用作墊腳石既昂貴又無用。降到前5名
user_account
後才加入。您沒有指定,但您的查詢有一個外部
ORDER BY follower_count DESC
. 我只包含follower_count
在外部排序的子查詢中。如果
(post_id, user_id)
是唯一的(使用者只能對每個文章發表評論),則簡化為:SELECT ua.* FROM ( SELECT user_id, fc.follower_count FROM user_post_comment pc JOIN follower_count_mv fc USING (user_id) WHERE pc.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd' ORDER BY fc.follower_count DESC LIMIT 5 ) sub JOIN user_account ua USING (user_id) ORDER BY sub.follower_count DESC;
一次獲得多個文章的前 N 名有點複雜。此相關答案的第 2a 章中的詳細說明: