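PostgreSQL
Optimizing a complex Postgres query with a filter
I asked about this query before and got a really insightful answer. However, I want to be able to break this query down even further, and it starts to get slow again. I'm not sure a partial index will help, since the filter is not on a boolean value.
PostgreSQL version: 9.6.3
So this is the base query, which performs really well: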
EXPLAIN ANALYZE
SELECT posts.*
FROM unnest('{17858,50909,52659,50914,50916,51696,52661,52035,17860,53315,54027,53305}'::int []) s(source_id),
LATERAL (
  SELECT "posts".*
  FROM "posts"
  WHERE (source_id = s.source_id)
    AND ("posts"."deleted_at" IS NOT NULL)
    AND "posts"."rejected_at" IS NULL
  ORDER BY posts.external_created_at DESC
  LIMIT 100
) posts
ORDER BY posts.external_created_at DESC
LIMIT 100 OFFSET 1;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=30895.79..30896.04 rows=100 width=1043) (actual time=5.299..5.337 rows=100 loops=1)
   ->  Sort  (cost=30895.78..30920.78 rows=10000 width=1043) (actual time=5.297..5.325 rows=101 loops=1)
         Sort Key: posts.external_created_at DESC
         Sort Method: top-N heapsort  Memory: 110kB
         ->  Nested Loop  (cost=0.56..30512.87 rows=10000 width=1043) (actual time=0.085..4.077 rows=738 loops=1)
               ->  Function Scan on unnest s  (cost=0.00..1.00 rows=100 width=4) (actual time=0.011..0.016 rows=12 loops=1)
               ->  Limit  (cost=0.56..303.12 rows=100 width=1043) (actual time=0.018..0.298 rows=62 loops=12)
                     ->  Index Scan using index_posts_for_moderation_queue on posts  (cost=0.56..7628.00 rows=2521 width=1043) (actual time=0.017..0.285 rows=62 loops=12)
                           Index Cond: (source_id = s.source_id)
 Planning time: 0.443 ms
 Execution time: 5.433 ms
(11 rows)
And this is the modified version, with the filter, which is much slower:
EXPLAIN ANALYZE
SELECT posts.*
FROM unnest('{17858,50909,52659,50914,50916,51696,52661,52035,17860,53315,54027,53305}'::int []) s(source_id),
LATERAL (
  SELECT "posts".*
  FROM "posts"
  WHERE (source_id = s.source_id)
    AND ("posts"."deleted_at" IS NOT NULL)
    AND "posts"."deleted_by" = 'User'
    AND "posts"."rejected_at" IS NULL
  ORDER BY posts.external_created_at DESC
  LIMIT 100
) posts
ORDER BY posts.external_created_at DESC
LIMIT 100 OFFSET 0;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=551390.03..551390.28 rows=100 width=1043) (actual time=769.522..769.522 rows=0 loops=1)
   ->  Sort  (cost=551390.03..551391.78 rows=700 width=1043) (actual time=769.521..769.521 rows=0 loops=1)
         Sort Key: posts.external_created_at DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=5513.47..551363.28 rows=700 width=1043) (actual time=769.508..769.508 rows=0 loops=1)
               ->  Function Scan on unnest s  (cost=0.00..1.00 rows=100 width=4) (actual time=0.012..0.022 rows=12 loops=1)
               ->  Limit  (cost=5513.47..5513.48 rows=7 width=1043) (actual time=64.122..64.122 rows=0 loops=12)
                     ->  Sort  (cost=5513.47..5513.48 rows=7 width=1043) (actual time=64.120..64.120 rows=0 loops=12)
                           Sort Key: posts.external_created_at DESC
                           Sort Method: quicksort  Memory: 25kB
                           ->  Bitmap Heap Scan on posts  (cost=5485.28..5513.37 rows=7 width=1043) (actual time=64.104..64.104 rows=0 loops=12)
                                 Recheck Cond: ((source_id = s.source_id) AND (deleted_at IS NOT NULL) AND (rejected_at IS NULL) AND ((deleted_by)::text = 'User'::text))
                                 Rows Removed by Index Recheck: 1
                                 Heap Blocks: exact=9
                                 ->  BitmapAnd  (cost=5485.28..5485.28 rows=7 width=0) (actual time=64.098..64.098 rows=0 loops=12)
                                       ->  Bitmap Index Scan on index_posts_for_moderation_queue  (cost=0.00..59.47 rows=2521 width=0) (actual time=0.028..0.028 rows=168 loops=12)
                                             Index Cond: (source_id = s.source_id)
                                       ->  Bitmap Index Scan on index_posts_on_deleted_by  (cost=0.00..5425.55 rows=291865 width=0) (actual time=76.855..76.855 rows=334200 loops=10)
                                             Index Cond: ((deleted_by)::text = 'User'::text)
 Planning time: 0.348 ms
 Execution time: 769.660 ms
(21 rows)
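The only difference between the two is that the second adds AND "posts"."deleted_by" = 'User' to the lateral subquery. The catch is the value 'User': deleted_by is not a boolean and can hold any value.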
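Is there a way to optimize this query further, so that it stays fast even when the deleted_by filter is set?
Here are the database structure, indexes, and settings: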
CREATE TABLE posts (
    id integer NOT NULL,
    source_id integer,
    message text,
    image text,
    external_id text,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    external text,
    like_count integer DEFAULT 0 NOT NULL,
    comment_count integer DEFAULT 0 NOT NULL,
    external_created_at timestamp without time zone,
    deleted_at timestamp without time zone,
    poster_name character varying(255),
    poster_image text,
    poster_url character varying(255),
    poster_id text,
    position integer,
    location character varying(255),
    description text,
    video text,
    rejected_at timestamp without time zone,
    deleted_by character varying(255),
    height integer,
    width integer
);

CREATE INDEX index_posts_on_source_id_and_external_created_at ON posts USING btree (source_id, external_created_at DESC) WHERE deleted_at IS NOT NULL AND rejected_at IS NULL;
CREATE INDEX index_posts_on_deleted_at ON posts USING btree (deleted_at);
CREATE INDEX index_posts_on_deleted_by ON posts USING btree (deleted_by);
CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);
The first index above is the result of the answer to my previous question.
Postgres memory settings:
name, setting, unit
'default_statistics_target','100',''
'effective_cache_size','16384','8kB'
'maintenance_work_mem','16384','kB'
'max_connections','100',''
'random_page_cost','4',NULL
'seq_page_cost','1',NULL
'shared_buffers','16384','8kB'
'work_mem','1024','kB'
Database statistics:
Total posts: 20,997,027
Posts where deleted_at is null: 15,665,487
Distinct source_ids: 22,245
Max number of rows per single source_id: 1,543,950
Min number of rows per single source_id: 1
Most source_ids in a single query: 21
Distinct external_created_at: 11,146,151
Edit
I tried the simplified query from Evan's answer with a different set of source IDs, and it is slow:
EXPLAIN ANALYZE
SELECT *
FROM posts AS p
WHERE source_id IN (159469,120669,120668,120670,120671,120674,120662,120661,120664,109450,109448,109447,108039,159468,157810)
  AND deleted_at IS NOT NULL
  AND deleted_by = 'Filter'
  AND rejected_at IS NULL
ORDER BY external_created_at DESC
LIMIT 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=74114.14..74114.19 rows=100 width=1060) (actual time=2794.981..2794.981 rows=0 loops=1)
   ->  Sort  (cost=74114.14..74115.48 rows=2678 width=1060) (actual time=2794.981..2794.981 rows=0 loops=1)
         Sort Key: external_created_at DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on posts p  (cost=68759.42..74093.67 rows=2678 width=1060) (actual time=2794.977..2794.977 rows=0 loops=1)
               Recheck Cond: ((source_id = ANY ('{159469,120669,120668,120670,120671,120674,120662,120661,120664,109450,109448,109447,108039,159468,157810}'::integer[])) AND (deleted_at IS NOT NULL) AND (rejected_at IS NULL) AND ((deleted_by)::text = 'Filter'::text))
               Rows Removed by Index Recheck: 32326
               Heap Blocks: exact=16019
               ->  BitmapAnd  (cost=68759.42..68759.42 rows=2678 width=0) (actual time=2745.376..2745.376 rows=0 loops=1)
                     ->  Bitmap Index Scan on index_posts_for_moderation_queue  (cost=0.00..830.64 rows=52637 width=0) (actual time=42.319..42.319 rows=272192 loops=1)
                           Index Cond: (source_id = ANY ('{159469,120669,120668,120670,120671,120674,120662,120661,120664,109450,109448,109447,108039,159468,157810}'::integer[]))
                     ->  Bitmap Index Scan on index_posts_on_deleted_by  (cost=0.00..67928.46 rows=6942897 width=0) (actual time=2651.123..2651.123 rows=7863994 loops=1)
                           Index Cond: ((deleted_by)::text = 'Filter'::text)
 Planning time: 0.856 ms
 Execution time: 2795.033 ms
(15 rows)
The reason I use LATERAL is explained in my other question, where this query was optimized previously.
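A few problems right off the bat with the query itself:
- Stop using double quotes. None of these identifiers needs to be double-quoted.
- Never write ", LATERAL". That is SQL-89 join syntax, and it is time to update it. These are all CROSS JOIN LATERAL.
- Don't use string literals for integers. Just use ARRAY[...].
- Don't use CROSS JOIN LATERAL when you can rewrite it as an INNER JOIN.
- Don't use INNER JOIN on literals when you can rewrite it as WHERE x IN ().
- Don't use WHERE x IN when the list comes from SQL. Use EXISTS (this doesn't apply here, but as long as I'm ranting...).
Try this instead: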
EXPLAIN ANALYZE
SELECT p.*
FROM posts AS p
WHERE p.source_id IN (17858,50909,52659,50914,50916,51696,52661,52035,17860,53315,54027,53305)
  AND p.deleted_at IS NOT NULL
  AND p.deleted_by = 'User'
  AND p.rejected_at IS NULL
ORDER BY p.external_created_at DESC
LIMIT 100;
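For comparison, if the per-source LIMIT from the question has to be kept, the syntax points in the list above (no double quotes, an explicit CROSS JOIN LATERAL, an ARRAY constructor instead of a string literal) could be applied to the original lateral query roughly as follows. This is only a sketch of the spelling; it is the same query as the slow one in the question, so on its own it should not be expected to change the plan.

```sql
-- Same logic as the question's filtered query, rewritten with explicit
-- CROSS JOIN LATERAL, an ARRAY[...] constructor, and no quoted identifiers.
EXPLAIN ANALYZE
SELECT p.*
FROM unnest(ARRAY[17858,50909,52659,50914,50916,51696,
                  52661,52035,17860,53315,54027,53305]) AS s(source_id)
CROSS JOIN LATERAL (
    SELECT posts.*
    FROM posts
    WHERE posts.source_id = s.source_id
      AND posts.deleted_at IS NOT NULL
      AND posts.deleted_by = 'User'
      AND posts.rejected_at IS NULL
    ORDER BY posts.external_created_at DESC
    LIMIT 100               -- at most 100 newest rows per source
) AS p
ORDER BY p.external_created_at DESC
LIMIT 100;                  -- newest 100 rows across all sources
```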
Update
Your big problem with that query is simply deleted_by. Here is what I suggest. These are your current indexes:
CREATE INDEX index_posts_on_source_id_and_external_created_at ON posts USING btree (source_id, external_created_at DESC) WHERE deleted_at IS NOT NULL AND rejected_at IS NULL;
CREATE INDEX index_posts_on_deleted_at ON posts USING btree (deleted_at);
CREATE INDEX index_posts_on_deleted_by ON posts USING btree (deleted_by);
CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);
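There is no reason to have both index_posts_on_source_id_and_external_created_at and index_posts_on_source_id. Both of them lead with source_id, so drop index_posts_on_source_id; it only slows down inserts. Second, your big problem is deleted_by. There are two ways to solve it: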
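- One is a composite index, so that we don't have to run two index scans and bitmap-AND them together.
- The other is a predicate (partial) index.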
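As a rough sketch of what those two options might look like: the index names below are hypothetical, and the column order is an assumption chosen so that rows for a given source_id and deleted_by come back already sorted by external_created_at. Neither statement has been run against the real data, so check the result with EXPLAIN ANALYZE.

```sql
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.

-- Drop the index that the answer considers redundant.
DROP INDEX CONCURRENTLY index_posts_on_source_id;

-- Option 1: composite (and still partial) index. With equality on source_id
-- and deleted_by, one index scan can return rows in external_created_at DESC
-- order, so no BitmapAnd with index_posts_on_deleted_by is needed.
CREATE INDEX CONCURRENTLY index_posts_on_source_deleted_by_created  -- hypothetical name
    ON posts USING btree (source_id, deleted_by, external_created_at DESC)
    WHERE deleted_at IS NOT NULL AND rejected_at IS NULL;

-- Option 2: predicate index for one specific deleted_by value. It is smaller,
-- but one such index would be needed per value that gets filtered on.
CREATE INDEX CONCURRENTLY index_posts_deleted_by_user               -- hypothetical name
    ON posts USING btree (source_id, external_created_at DESC)
    WHERE deleted_at IS NOT NULL
      AND rejected_at IS NULL
      AND deleted_by = 'User';
```

The composite index serves any deleted_by value, which matches the question's point that the value "can be anything". The predicate index only pays off if the set of filtered values is small and known in advance.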
If deleted_by can only hold a few kinds of values, you could consider making it an enum type and getting rid of the string comparison.
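A minimal sketch of that conversion, assuming 'User' and 'Filter' (the two values that appear in the question) are the complete set; the type name post_deleted_by is hypothetical. The first query is there to check that assumption, because the ALTER fails if any stored value is missing from the enum.

```sql
-- List the values actually stored before defining the enum.
SELECT deleted_by, count(*)
FROM posts
GROUP BY deleted_by
ORDER BY count(*) DESC;

-- Hypothetical type; extend the label list with every value found above.
CREATE TYPE post_deleted_by AS ENUM ('User', 'Filter');

-- Rewrites the table and rebuilds indexes on the column while holding an
-- exclusive lock, so on ~21M rows this belongs in a maintenance window.
ALTER TABLE posts
    ALTER COLUMN deleted_by TYPE post_deleted_by
    USING deleted_by::text::post_deleted_by;
```

Existing predicates such as deleted_by = 'User' keep working afterwards, since the quoted literal is resolved to the enum type.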