PostgreSQL

Postgres chooses a Filter over an Index Cond when an OR is involved

  • June 28, 2019

I have a table that grows by roughly 20 million records per day, and I am trying to paginate through it so people can access all of the data in it, while keeping query times "decent" (defined in my case as less than 30 seconds per query).

To do this I have been using keyset pagination, but for this particular query and table my query times are really slow. This appears to be because the query planner decides to read a whole day's worth of data and apply a Filter to it, rather than using an index condition scan.
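To make the setup concrete, here is a minimal sketch of keyset pagination over a `(timestamp, id)` ordering, with an in-memory SQLite table standing in for the Postgres table (the table and column names are simplified stand-ins, not the real schema):

```python
import sqlite3

# Tiny stand-in table: (id, ts), with deliberately duplicated ts values
# so the id tie-breaker in the keyset cursor actually matters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (id INTEGER PRIMARY KEY, ts TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO positions (id, ts) VALUES (?, ?)",
    [(i, f"2019-03-10 00:00:0{i % 5}") for i in range(1, 21)],
)

def fetch_page(last_ts, last_id, page_size=5):
    # The OR form of the keyset predicate, as in the question:
    # rows strictly after the (last_ts, last_id) cursor in (ts, id) order.
    return conn.execute(
        """
        SELECT id, ts FROM positions
        WHERE ts > ? OR (ts = ? AND id > ?)
        ORDER BY ts, id
        LIMIT ?
        """,
        (last_ts, last_ts, last_id, page_size),
    ).fetchall()

# Page through everything, carrying the last row of each page as the cursor.
cursor = ("", 0)
pages = []
while True:
    page = fetch_page(*cursor)
    if not page:
        break
    pages.append(page)
    cursor = (page[-1][1], page[-1][0])  # (ts, id) of the last row
```

Each page picks up exactly where the previous one left off, with no OFFSET and no skipped or duplicated rows.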

The table looks like this:

create table mmsi_positions_archive
(
   id bigserial not null
       constraint mmsi_positions_archive_pkey
           primary key,
   position_id uuid,
   previous_id uuid,
   mmsi bigint not null,
   collection_type varchar not null,
   accuracy numeric,
   maneuver numeric,
   rate_of_turn numeric,
   status integer,
   speed numeric,
   course numeric,
   heading numeric,
   position geometry(Point,4326),
   timestamp timestamp with time zone not null,
   updated_at timestamp with time zone default now(),
   created_at timestamp with time zone default now()
);

create index ix_mmsi_positions_archive_mmsi
   on mmsi_positions_archive (mmsi);

create index ix_mmsi_positions_archive_position_id
   on mmsi_positions_archive (position_id);

create index ix_mmsi_positions_archive_timestamp_mmsi_id_asc
   on mmsi_positions_archive (timestamp, id);

The columns I am paginating on are timestamp and id. To help the planner, I have also raised the statistics target on timestamp to the maximum of 10 000 and analyzed the table.
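The statistics-target bump described above can be done along these lines (a sketch against the schema shown; 10 000 is the maximum value Postgres allows):

```sql
ALTER TABLE mmsi_positions_archive
    ALTER COLUMN "timestamp" SET STATISTICS 10000;
ANALYZE mmsi_positions_archive;
```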

The table is also partitioned by quarter, but at the moment I am only operating on data from a single partition.

The fast query

SELECT id
FROM mmsi_positions_archive
WHERE timestamp > '2019-03-10 00:00:00.000000+00:00'
 AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100

This gives the following query plan (note that the mmsi_positions_archive table itself is empty; all the data lives in the *_p2019_q1 table):

Limit  (cost=0.60..5.39 rows=100 width=16) (actual time=0.053..0.089 rows=100 loops=1)
 ->  Merge Append  (cost=0.60..773572.19 rows=16149157 width=16) (actual time=0.053..0.082 rows=100 loops=1)
"        Sort Key: mmsi_positions_archive.""timestamp"", mmsi_positions_archive.id"
       ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=1)
"              Sort Key: mmsi_positions_archive.""timestamp"", mmsi_positions_archive.id"
             Sort Method: quicksort  Memory: 25kB
             ->  Seq Scan on mmsi_positions_archive  (cost=0.00..0.00 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
                   Filter: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
       ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1  (cost=0.58..571707.70 rows=16149156 width=16) (actual time=0.043..0.067 rows=100 loops=1)
             Index Cond: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
             Heap Fetches: 0
Planning time: 67.023 ms
Execution time: 0.128 ms

The keyset pagination query (slow)

SELECT id
FROM mmsi_positions_archive
WHERE (timestamp > '2019-03-10 00:00:00.000000+00:00'
          OR (timestamp = '2019-03-10 00:00:00.000000+00:00' AND id >  1032749689))
 AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100

This gives the following explain output and ends up running a lot slower:

Limit  (cost=0.60..25.08 rows=100 width=16) (actual time=332918.152..332918.192 rows=100 loops=1)
 ->  Merge Append  (cost=0.60..41278140.09 rows=168591751 width=16) (actual time=332918.152..332918.189 rows=100 loops=1)
"        Sort Key: mmsi_positions_archive.""timestamp"", mmsi_positions_archive.id"
       ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.004..0.004 rows=0 loops=1)
"              Sort Key: mmsi_positions_archive.""timestamp"", mmsi_positions_archive.id"
             Sort Method: quicksort  Memory: 25kB
             ->  Seq Scan on mmsi_positions_archive  (cost=0.00..0.00 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
                   Filter: (("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone) AND (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) OR (("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone) AND (id > 1032749689))))
       ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1  (cost=0.58..39170743.18 rows=168591750 width=16) (actual time=332918.147..332918.181 rows=100 loops=1)
             Index Cond: ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone)
             Filter: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) OR (("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone) AND (id > 1032749689)))
             Rows Removed by Filter: 953622052
             Heap Fetches: 0
Planning time: 0.778 ms
Execution time: 332918.226 ms

As far as I understand, this ends up slow because the index condition `Index Cond: ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone)` makes it sequentially scan roughly 20 million × 70 rows of index data and then filter most of them out.

Workaround

I did some testing and found that the problem lies with the OR: without the OR, each predicate on its own gives me a fast plan. So I switched it around and wrote a UNION query to fetch the data I want:

SELECT id
FROM (
        SELECT *
        FROM (
                 SELECT id        AS id,
                        timestamp AS timestamp
                 FROM mmsi_positions_archive
                 WHERE timestamp = '2019-03-10 00:00:00.000000+00:00'
                   AND id > 1032749689
                 ORDER BY timestamp, id
                 LIMIT 100
             ) keyset
        UNION
        SELECT *
        FROM (
                 SELECT id        AS id,
                        timestamp AS timestamp
                 FROM mmsi_positions_archive
                 WHERE timestamp > '2019-03-10 00:00:00.000000+00:00'
                   AND timestamp <= '2019-03-11 00:00:00+00:00'
                 ORDER BY timestamp, id
                 LIMIT 100
             ) all_after
    ) archive_ids
ORDER BY timestamp, id
LIMIT 100
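The equivalence this workaround relies on can be checked on a small dataset: the union of the "same timestamp, higher id" range and the "later timestamp" range matches exactly the rows the OR predicate matches. A minimal sketch, with an in-memory SQLite table and simplified names standing in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (id INTEGER PRIMARY KEY, ts TEXT NOT NULL)")
# 30 rows spread over three days, so the cursor day has id gaps.
conn.executemany(
    "INSERT INTO positions (id, ts) VALUES (?, ?)",
    [(i, f"2019-03-{10 + i % 3:02d}") for i in range(1, 31)],
)

cur_ts, cur_id = "2019-03-10", 7  # the keyset cursor

# The OR form from the slow query.
with_or = conn.execute(
    """
    SELECT id, ts FROM positions
    WHERE (ts > ? OR (ts = ? AND id > ?)) AND ts <= '2019-03-11'
    ORDER BY ts, id
    """,
    (cur_ts, cur_ts, cur_id),
).fetchall()

# The UNION form from the workaround: each branch is OR-free.
with_union = conn.execute(
    """
    SELECT id, ts FROM positions WHERE ts = ? AND id > ?
    UNION
    SELECT id, ts FROM positions WHERE ts > ? AND ts <= '2019-03-11'
    ORDER BY ts, id
    """,
    (cur_ts, cur_id, cur_ts),
).fetchall()

assert with_or == with_union  # same rows, same order
```

The inner `ORDER BY ... LIMIT` in each branch of the real query is what lets Postgres satisfy each branch with a cheap index scan before combining them.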

This produces a fast query and the following query plan:

Limit  (cost=34.27..34.52 rows=100 width=16) (actual time=0.232..0.242 rows=100 loops=1)
 ->  Sort  (cost=34.27..34.77 rows=200 width=16) (actual time=0.231..0.238 rows=100 loops=1)
"        Sort Key: mmsi_positions_archive.""timestamp"", mmsi_positions_archive.id"
       Sort Method: quicksort  Memory: 34kB
       ->  HashAggregate  (cost=22.63..24.63 rows=200 width=16) (actual time=0.151..0.167 rows=200 loops=1)
"              Group Key: mmsi_positions_archive.id, mmsi_positions_archive.""timestamp"""
             ->  Append  (cost=0.71..21.63 rows=200 width=16) (actual time=0.028..0.111 rows=200 loops=1)
                   ->  Limit  (cost=0.71..12.24 rows=100 width=16) (actual time=0.028..0.049 rows=100 loops=1)
                         ->  Merge Append  (cost=0.71..17.43 rows=145 width=16) (actual time=0.027..0.046 rows=100 loops=1)
                               Sort Key: mmsi_positions_archive.id
                               ->  Index Scan using mmsi_positions_archive_pkey on mmsi_positions_archive  (cost=0.12..8.14 rows=1 width=16) (actual time=0.010..0.010 rows=0 loops=1)
                                     Index Cond: (id > 1032749689)
                                     Filter: ("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone)
                               ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1  (cost=0.58..7.46 rows=144 width=16) (actual time=0.017..0.028 rows=100 loops=1)
                                     Index Cond: (("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone) AND (id > 1032749689))
                                     Heap Fetches: 0
                   ->  Limit  (cost=0.60..5.39 rows=100 width=16) (actual time=0.012..0.049 rows=100 loops=1)
                         ->  Merge Append  (cost=0.60..773572.19 rows=16149157 width=16) (actual time=0.011..0.044 rows=100 loops=1)
"                                Sort Key: mmsi_positions_archive_1.""timestamp"", mmsi_positions_archive_1.id"
                               ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.005..0.005 rows=0 loops=1)
"                                      Sort Key: mmsi_positions_archive_1.""timestamp"", mmsi_positions_archive_1.id"
                                     Sort Method: quicksort  Memory: 25kB
                                     ->  Seq Scan on mmsi_positions_archive mmsi_positions_archive_1  (cost=0.00..0.00 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
                                           Filter: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
                               ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1 mmsi_positions_archive_p2019q1_1  (cost=0.58..571707.70 rows=16149156 width=16) (actual time=0.006..0.031 rows=100 loops=1)
                                     Index Cond: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
                                     Heap Fetches: 0
Planning time: 1.059 ms
Execution time: 0.312 ms

While I could rewrite my queries to use the UNION approach, I am wondering whether there is some way to better help Postgres do what I want while still using an OR?

I am running this on AWS Aurora Postgres 9.6. I know we are a few major versions behind and I am planning to upgrade as soon as possible, but right now I just need to get this working. :)

Fortunately, this is quite simple in PostgreSQL, because it supports comparisons between "row values" (composite values) that can use the index.

So you can write:

WHERE (timestamp, id) > ('2019-03-10 00:00:00+00:00', 1032749689)
 AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100

Such row values are compared lexicographically, which is exactly what you need here.
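The lexicographic semantics can be illustrated with plain tuples, since Python compares tuples the same way SQL compares row values (the literal values below are the ones from the question):

```python
# SQL:  (timestamp, id) > ('2019-03-10 00:00:00+00', 1032749689)
# is true when timestamp is strictly later, OR when timestamp is equal
# and id is larger -- exactly the OR predicate from the slow query.
cursor = ("2019-03-10 00:00:00+00", 1032749689)

rows = [
    ("2019-03-09 23:59:59+00", 9999999999),  # earlier ts: excluded, id ignored
    ("2019-03-10 00:00:00+00", 1032749689),  # equal to cursor: excluded
    ("2019-03-10 00:00:00+00", 1032749690),  # same ts, larger id: included
    ("2019-03-10 00:00:01+00", 1),           # later ts, any id: included
]

after_cursor = [r for r in rows if r > cursor]
```

Because the whole condition is a single comparison on `(timestamp, id)`, Postgres can turn it into one index condition on the `(timestamp, id)` index instead of a Filter.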

Here is the documentation link for this feature.

引用自:https://dba.stackexchange.com/questions/241591