優化具有小 LIMIT 的查詢,以一列為謂詞並按另一列排序
我正在使用 Postgres 9.3.4,我有 4 個查詢,它們的輸入非常相似,但響應時間卻大不相同:
查詢 #1
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599) AND posts.deleted_at IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1) -> Index Scan using index_posts_on_external_created_at on posts (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1) Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[])) Rows Removed by Filter: 6913925 Total runtime: 507361.944 ms
查詢 #2
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627) AND posts.deleted_at IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1) -> Sort (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1) Sort Key: external_created_at Sort Method: top-N heapsort Memory: 80kB -> Index Scan using index_posts_on_source_id on posts (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1) Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[])) Filter: (deleted_at IS NULL) Rows Removed by Filter: 109 Total runtime: 2.125 ms
查詢 #3
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800) AND posts.deleted_at IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..821.25 rows=100 width=1041) (actual time=19.224..55.599 rows=100 loops=1) -> Index Scan using index_posts_on_external_created_at on posts (cost=0.43..14930351.58 rows=1818959 width=1041) (actual time=19.213..55.529 rows=100 loops=1) Filter: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) Rows Removed by Filter: 414 Total runtime: 55.683 ms
查詢 #4
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330) AND posts.deleted_at IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..69055.29 rows=100 width=1041) (actual time=26.094..320.626 rows=100 loops=1) -> Index Scan using index_posts_on_external_created_at on posts (cost=0.43..14930351.58 rows=21621 width=1041) (actual time=26.093..320.538 rows=100 loops=1) Filter: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[])) Rows Removed by Filter: 6156 Total runtime: 320.778 ms
除了查看不同
source_id
s 的文章之外,所有 4 個都是相同的。四個中的三個最終使用以下索引:
CREATE INDEX index_posts_on_external_created_at ON posts USING btree (external_created_at DESC) WHERE (deleted_at IS NULL);
而#2 使用這個索引:
CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);
對我來說有趣的是,在使用
index_posts_on_external_created_at
索引的 3 個中,兩個非常快,而另一個(#1)非常慢。查詢 #2 的文章比其他 3 少得多,因此這可以解釋為什麼它使用
index_posts_on_source_id
索引。但是,如果我擺脫index_posts_on_external_created_at
索引,其他 3 個查詢在強制使用index_posts_on_source_id
索引時會非常慢。這是我對文章表的定義:
CREATE TABLE posts ( id integer NOT NULL, source_id integer, message text, image text, external_id text, created_at timestamp without time zone, updated_at timestamp without time zone, external text, like_count integer DEFAULT 0 NOT NULL, comment_count integer DEFAULT 0 NOT NULL, external_created_at timestamp without time zone, deleted_at timestamp without time zone, poster_name character varying(255), poster_image text, poster_url character varying(255), poster_id text, position integer, location character varying(255), description text, video text, rejected_at timestamp without time zone, deleted_by character varying(255), height integer, width integer );
我試過使用
CLUSTER posts USING index_posts_on_external_created_at
這本質上是一個按 external_created_at 排序的索引,這似乎是我發現的唯一有效方法。但是,我無法在生產中使用它,因為它在執行時會導致全域鎖定幾個小時。我在 heroku 上,所以我無法安裝pg_repack或類似的東西。
為什麼 #1 查詢會這麼慢,而其他查詢會非常快?我能做些什麼來減輕這種情況?
編輯:這是我的查詢,沒有
LIMIT
andORDER
查詢 #1
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599) AND posts.deleted_at IS NULL ORDER BY external_created_at desc; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=7455044.81..7461163.56 rows=2447503 width=1089) (actual time=94903.143..95110.898 rows=238975 loops=1) Sort Key: external_created_at Sort Method: external merge Disk: 81440kB -> Bitmap Heap Scan on posts (cost=60531.78..1339479.50 rows=2447503 width=1089) (actual time=880.150..90988.460 rows=238975 loops=1) Recheck Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[])) Rows Removed by Index Recheck: 5484857 Filter: (deleted_at IS NULL) Rows Removed by Filter: 3108465 -> Bitmap Index Scan on index_posts_on_source_id (cost=0.00..59919.90 rows=3267549 width=0) (actual time=877.904..877.904 rows=3347440 loops=1) Index Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[])) Total runtime: 95534.724 ms
查詢 #2
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627) AND posts.deleted_at IS NULL ORDER BY external_created_at desc; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=36913.72..36935.85 rows=8852 width=1089) (actual time=212.450..212.549 rows=944 loops=1) Sort Key: external_created_at Sort Method: quicksort Memory: 557kB -> Index Scan using index_posts_on_source_id on posts (cost=0.44..32094.90 rows=8852 width=1089) (actual time=1.732..209.590 rows=944 loops=1) Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[])) Filter: (deleted_at IS NULL) Rows Removed by Filter: 109 Total runtime: 214.507 ms
查詢 #3
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800) AND posts.deleted_at IS NULL ORDER BY external_created_at desc; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=5245032.87..5249894.14 rows=1944508 width=1089) (actual time=131032.952..134342.372 rows=1674072 loops=1) Sort Key: external_created_at Sort Method: external merge Disk: 854864kB -> Bitmap Heap Scan on posts (cost=48110.86..1320005.55 rows=1944508 width=1089) (actual time=605.648..91351.334 rows=1674072 loops=1) Recheck Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) Rows Removed by Index Recheck: 5304550 Filter: (deleted_at IS NULL) Rows Removed by Filter: 879414 -> Bitmap Index Scan on index_posts_on_source_id (cost=0.00..47624.73 rows=2596024 width=0) (actual time=602.744..602.744 rows=2553486 loops=1) Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) Total runtime: 136176.868 ms
查詢 #4
EXPLAIN ANALYZE SELECT posts.* FROM posts WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330) AND posts.deleted_at IS NULL ORDER BY external_created_at desc; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=102648.92..102704.24 rows=22129 width=1089) (actual time=15225.250..15256.931 rows=51408 loops=1) Sort Key: external_created_at Sort Method: external merge Disk: 35456kB -> Index Scan using index_posts_on_source_id on posts (cost=0.45..79869.91 rows=22129 width=1089) (actual time=3.975..14803.320 rows=51408 loops=1) Index Cond: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[])) Filter: (deleted_at IS NULL) Rows Removed by Filter: 54 Total runtime: 15397.453 ms
Postgres 記憶體設置:
name, setting, unit 'default_statistics_target','100','' 'effective_cache_size','16384','8kB' 'maintenance_work_mem','16384','kB' 'max_connections','100','' 'random_page_cost','4',NULL 'seq_page_cost','1',NULL 'shared_buffers','16384','8kB' 'work_mem','1024','kB'
數據庫統計:
Total Posts: 20,997,027 Posts where deleted_at is null: 15,665,487 Distinct source_id's: 22,245 Max number of rows per single source_id: 1,543,950 Min number of rows per single source_id: 1 Most source_ids in a single query: 21 Distinct external_created_at: 11,146,151
一般建議
所有關於性能優化的一般建議都適用。預設設置非常保守,其中一些資源設置對於具有數百萬行的表(尤其是)來說太低了。
work_mem
您需要配置 RDBMS 以明智地利用可用 RAM。Postgres Wiki 是一個很好的起點。這超出了這裡單個問題的範圍。但是,我在下面建議的查詢只需要非常適度的資源設置。
還要增加統計目標,
source_id
以便對關鍵列進行更詳細的統計:ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000; -- or similar
然後:
ANALYZE posts;
更多的:
您可以進一步優化儲存(以獲得較小的收益):
詢問
查詢本身很難優化。有關高級性能優化,請參閱@ypercube 的相關問題:
有一個簡單的方法*,如果…*
- 每個查詢的不同數量
source_id
相當少- 而且
LIMIT
它也相當小。…根據您添加的詳細資訊,這對您的情況是正確的。
以下查詢所需的唯一索引:
CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC) WHERE deleted_at IS NULL;
基於您的查詢 #1 的範例:
SELECT p.* FROM unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600 , 17804, 20717, 27598, 27599}'::int[]) s(source_id) , LATERAL ( SELECT * FROM posts WHERE source_id = s.source_id AND deleted_at IS NULL ORDER BY external_created_at DESC LIMIT 100 ) p ORDER BY p.external_created_at DESC LIMIT 100;
這是模擬鬆散的索引掃描,類似於此處詳細討論的內容:
如果n是 source_id 的數量(幸運的是從不 > 21
external_created_at DESC
),我們讓 Postgres 從索引中為每個獲取前 100 行(根據)source_id
,這本身非常快,但最大。(n-1) * 100行是多餘的。鑑於您的價值頻率:22,245
source_id
行,1 到 1,543,950 行 - 總共 20,997,027 行(您沒有澄清所有這些數字是否包括“已刪除”行,但只有約 25% 被“刪除”。)
…我希望其中一些
source_id
’s 開始時的行數少於 100 行。所以我們只需要在最壞的情況下(通常更少)對 2100 行進行排序以保持前 100 行。這不應該表現得那麼糟糕——一旦你配置了 Postgres 的資源設置不錯。如果您有一個包含所有 distinct 的源表,則使用它並儘早
source_id
消除不存在可能是有意義的:source_id
SELECT p.* FROM source s, LATERAL ( ... ) p WHERE s.source_id IN (19082, 19075, 20705, ...) ORDER BY ...
此表單最多可以使用 21 個
IN
值,但請考慮以下相關問題:如果您知道
external_created_at
結果中單個行的最小或最大行數,則可以進行更多優化source_id
…