Postgresql

優化具有小 LIMIT 的查詢,以一列為謂詞並按另一列排序

  • January 17, 2017

我正在使用 Postgres 9.3.4,我有 4 個查詢,它們的輸入非常相似,但響應時間卻大不相同:

查詢 #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
  ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
        Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
        Rows Removed by Filter: 6913925
Total runtime: 507361.944 ms

查詢 #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;                                            

   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
  ->  Sort  (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
        Sort Key: external_created_at
        Sort Method: top-N heapsort  Memory: 80kB
        ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
              Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
              Filter: (deleted_at IS NULL)
              Rows Removed by Filter: 109
Total runtime: 2.125 ms

查詢 #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..821.25 rows=100 width=1041) (actual time=19.224..55.599 rows=100 loops=1)
  ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=1818959 width=1041) (actual time=19.213..55.529 rows=100 loops=1)
        Filter: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
        Rows Removed by Filter: 414
Total runtime: 55.683 ms

查詢 #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..69055.29 rows=100 width=1041) (actual time=26.094..320.626 rows=100 loops=1)
  ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=21621 width=1041) (actual time=26.093..320.538 rows=100 loops=1)
        Filter: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
        Rows Removed by Filter: 6156
Total runtime: 320.778 ms

除了查看不同source_ids 的文章之外,所有 4 個都是相同的。

四個中的三個最終使用以下索引:

CREATE INDEX index_posts_on_external_created_at ON posts USING btree (external_created_at DESC)
WHERE (deleted_at IS NULL);

而#2 使用這個索引:

CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

對我來說有趣的是,在使用index_posts_on_external_created_at索引的 3 個中,兩個非常快,而另一個(#1)非常慢。

查詢 #2 的文章比其他 3 少得多,因此這可以解釋為什麼它使用index_posts_on_source_id索引。但是,如果我擺脫index_posts_on_external_created_at索引,其他 3 個查詢在強制使用index_posts_on_source_id索引時會非常慢。

這是我對文章表的定義:

CREATE TABLE posts (
   id integer NOT NULL,
   source_id integer,
   message text,
   image text,
   external_id text,
   created_at timestamp without time zone,
   updated_at timestamp without time zone,
   external text,
   like_count integer DEFAULT 0 NOT NULL,
   comment_count integer DEFAULT 0 NOT NULL,
   external_created_at timestamp without time zone,
   deleted_at timestamp without time zone,
   poster_name character varying(255),
   poster_image text,
   poster_url character varying(255),
   poster_id text,
   position integer,
   location character varying(255),
   description text,
   video text,
   rejected_at timestamp without time zone,
   deleted_by character varying(255),
   height integer,
   width integer
);

我試過使用CLUSTER posts USING index_posts_on_external_created_at

這本質上是一個按 external_created_at 排序的索引,這似乎是我發現的唯一有效方法。但是,我無法在生產中使用它,因為它在執行時會導致全域鎖定幾個小時。我在 heroku 上,所以我無法安裝pg_repack或類似的東西。

為什麼 #1 查詢會這麼慢,而其他查詢會非常快?我能做些什麼來減輕這種情況?

編輯:這是我的查詢,沒有LIMITandORDER

查詢 #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=7455044.81..7461163.56 rows=2447503 width=1089) (actual time=94903.143..95110.898 rows=238975 loops=1)
  Sort Key: external_created_at
  Sort Method: external merge  Disk: 81440kB
  ->  Bitmap Heap Scan on posts  (cost=60531.78..1339479.50 rows=2447503 width=1089) (actual time=880.150..90988.460 rows=238975 loops=1)
        Recheck Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
        Rows Removed by Index Recheck: 5484857
        Filter: (deleted_at IS NULL)
        Rows Removed by Filter: 3108465
        ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..59919.90 rows=3267549 width=0) (actual time=877.904..877.904 rows=3347440 loops=1)
              Index Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
Total runtime: 95534.724 ms

查詢 #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=36913.72..36935.85 rows=8852 width=1089) (actual time=212.450..212.549 rows=944 loops=1)
  Sort Key: external_created_at
  Sort Method: quicksort  Memory: 557kB
  ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..32094.90 rows=8852 width=1089) (actual time=1.732..209.590 rows=944 loops=1)
        Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
        Filter: (deleted_at IS NULL)
        Rows Removed by Filter: 109
Total runtime: 214.507 ms

查詢 #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=5245032.87..5249894.14 rows=1944508 width=1089) (actual time=131032.952..134342.372 rows=1674072 loops=1)
  Sort Key: external_created_at
  Sort Method: external merge  Disk: 854864kB
  ->  Bitmap Heap Scan on posts  (cost=48110.86..1320005.55 rows=1944508 width=1089) (actual time=605.648..91351.334 rows=1674072 loops=1)
        Recheck Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
        Rows Removed by Index Recheck: 5304550
        Filter: (deleted_at IS NULL)
        Rows Removed by Filter: 879414
        ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..47624.73 rows=2596024 width=0) (actual time=602.744..602.744 rows=2553486 loops=1)
              Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
Total runtime: 136176.868 ms

查詢 #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=102648.92..102704.24 rows=22129 width=1089) (actual time=15225.250..15256.931 rows=51408 loops=1)
  Sort Key: external_created_at
  Sort Method: external merge  Disk: 35456kB
  ->  Index Scan using index_posts_on_source_id on posts  (cost=0.45..79869.91 rows=22129 width=1089) (actual time=3.975..14803.320 rows=51408 loops=1)
        Index Cond: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
        Filter: (deleted_at IS NULL)
        Rows Removed by Filter: 54
Total runtime: 15397.453 ms

Postgres 記憶體設置:

name, setting, unit
'default_statistics_target','100',''
'effective_cache_size','16384','8kB'
'maintenance_work_mem','16384','kB'
'max_connections','100',''
'random_page_cost','4',NULL
'seq_page_cost','1',NULL
'shared_buffers','16384','8kB'
'work_mem','1024','kB'

數據庫統計:

Total Posts: 20,997,027
Posts where deleted_at is null: 15,665,487
Distinct source_id's: 22,245
Max number of rows per single source_id: 1,543,950
Min number of rows per single source_id: 1
Most source_ids in a single query: 21
Distinct external_created_at: 11,146,151

一般建議

所有關於性能優化的一般建議都適用。預設設置非常保守,其中一些資源設置對於具有數百萬行的表(尤其是)來說太低了。work_mem您需要配置 RDBMS 以明智地利用可用 RAM。Postgres Wiki 是一個很好的起點。這超出了這裡單個問題的範圍。

但是,我在下面建議的查詢只需要非常適度的資源設置。

還要增加統計目標,source_id以便對關鍵列進行更詳細的統計:

ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000;  -- or similar

然後:ANALYZE posts;

更多的:

您可以進一步優化儲存(以獲得較小的收益):

詢問

查詢本身很難優化。有關高級性能優化,請參閱@ypercube 的相關問題:

有一個簡單的方法*,如果…*

  • 每個查詢的不同數量source_id相當少
  • 而且LIMIT它也相當小。

…根據您添加的詳細資訊,這對您的情況是正確的。

以下查詢所需的唯一索引:

CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;

基於您的查詢 #1 的範例:

SELECT p.*
FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
             , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
    , LATERAL (
  SELECT *
  FROM   posts
  WHERE  source_id = s.source_id
  AND    deleted_at IS NULL
  ORDER  BY external_created_at DESC
  LIMIT  100
  ) p
ORDER  BY p.external_created_at DESC
LIMIT  100;

這是模擬鬆散的索引掃描,類似於此處詳細討論的內容:

如果n是 source_id 的數量(幸運的是從不 > 21external_created_at DESC ),我們讓 Postgres 從索引中為每個獲取前 100 行(根據) source_id,這本身非常快,但最大。(n-1) * 100行是多餘的。鑑於您的價值頻率:

22,245source_id行,1 到 1,543,950 行 - 總共 20,997,027 行

(您沒有澄清所有這些數字是否包括“已刪除”行,但只有約 25% 被“刪除”。)

…我希望其中一些source_id’s 開始時的行數少於 100 行。所以我們只需要在最壞的情況下(通常更少)對 2100 行進行排序以保持前 100 行。這不應該表現得那麼糟糕——一旦你配置了 Postgres 的資源設置不錯。

如果您有一個包含所有 distinct 的源表,則使用它並儘早source_id消除不存在可能是有意義的:source_id

SELECT p.*
FROM   source s, LATERAL ( ... ) p
WHERE  s.source_id IN (19082, 19075, 20705, ...)
ORDER  BY ...

此表單最多可以使用 21 個IN值,但請考慮以下相關問題:

如果您知道external_created_at結果中單個行的最小或最大行數,則可以進行更多優化source_id

引用自:https://dba.stackexchange.com/questions/124790