Postgres chooses a Filter instead of an Index Cond when an OR is involved

I have a table that grows by roughly 20 million records per day, and I'm trying to paginate through it so people can access all of the data in it, while keeping query times "decent" (defined in my case as under 30 seconds per query).

To do this I've been using keyset pagination, but for this particular query and table my query times are really slow. This seems to be because the query planner decides to filter out a day's worth of data and apply a Filter to it, instead of using an Index Cond on the scan.

The table looks like this:
```sql
create table mmsi_positions_archive
(
    id              bigserial not null
        constraint mmsi_positions_archive_pkey primary key,
    position_id     uuid,
    previous_id     uuid,
    mmsi            bigint not null,
    collection_type varchar not null,
    accuracy        numeric,
    maneuver        numeric,
    rate_of_turn    numeric,
    status          integer,
    speed           numeric,
    course          numeric,
    heading         numeric,
    position        geometry(Point, 4326),
    timestamp       timestamp with time zone not null,
    updated_at      timestamp with time zone default now(),
    created_at      timestamp with time zone default now()
);

create index ix_mmsi_positions_archive_mmsi
    on mmsi_positions_archive (mmsi);

create index ix_mmsi_positions_archive_position_id
    on mmsi_positions_archive (position_id);

create index ix_mmsi_positions_archive_timestamp_mmsi_id_asc
    on mmsi_positions_archive (timestamp, id);
```
The columns I'm paginating on are timestamp and id. To help the planner, I've also raised the statistics target on timestamp to the maximum of 10000 and analyzed the table. The table is also partitioned by quarter, but at the moment I'm only operating on data within a single partition.
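For reference, the statistics change described above would look something like this (a sketch; 10000 is the maximum value mentioned above):

```sql
-- Raise the per-column statistics target for the pagination column,
-- then refresh the planner statistics for the table.
ALTER TABLE mmsi_positions_archive ALTER COLUMN "timestamp" SET STATISTICS 10000;
ANALYZE mmsi_positions_archive;
```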
The fast query
```sql
SELECT id
FROM mmsi_positions_archive
WHERE timestamp > '2019-03-10 00:00:00.000000+00:00'
  AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100
```
This gives the following query plan (note that the mmsi_positions_archive parent table itself is empty; all the data lives in the *_p2019_q1 partition):

```
Limit  (cost=0.60..5.39 rows=100 width=16) (actual time=0.053..0.089 rows=100 loops=1)
  ->  Merge Append  (cost=0.60..773572.19 rows=16149157 width=16) (actual time=0.053..0.082 rows=100 loops=1)
        Sort Key: mmsi_positions_archive."timestamp", mmsi_positions_archive.id
        ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=1)
              Sort Key: mmsi_positions_archive."timestamp", mmsi_positions_archive.id
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on mmsi_positions_archive  (cost=0.00..0.00 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
                    Filter: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
        ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1  (cost=0.58..571707.70 rows=16149156 width=16) (actual time=0.043..0.067 rows=100 loops=1)
              Index Cond: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
              Heap Fetches: 0
Planning time: 67.023 ms
Execution time: 0.128 ms
```
The keyset pagination query (slow)
```sql
SELECT id
FROM mmsi_positions_archive
WHERE (timestamp > '2019-03-10 00:00:00.000000+00:00'
       OR (timestamp = '2019-03-10 00:00:00.000000+00:00' AND id > 1032749689))
  AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100
```
This gives the following plan, which ends up executing much more slowly:
```
Limit  (cost=0.60..25.08 rows=100 width=16) (actual time=332918.152..332918.192 rows=100 loops=1)
  ->  Merge Append  (cost=0.60..41278140.09 rows=168591751 width=16) (actual time=332918.152..332918.189 rows=100 loops=1)
        Sort Key: mmsi_positions_archive."timestamp", mmsi_positions_archive.id
        ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.004..0.004 rows=0 loops=1)
              Sort Key: mmsi_positions_archive."timestamp", mmsi_positions_archive.id
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on mmsi_positions_archive  (cost=0.00..0.00 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
                    Filter: (("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone) AND (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) OR (("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone) AND (id > 1032749689))))
        ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1  (cost=0.58..39170743.18 rows=168591750 width=16) (actual time=332918.147..332918.181 rows=100 loops=1)
              Index Cond: ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone)
              Filter: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) OR (("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone) AND (id > 1032749689)))
              Rows Removed by Filter: 953622052
              Heap Fetches: 0
Planning time: 0.778 ms
Execution time: 332918.226 ms
```
As I understand it, this ends up slow because the index condition

```
Index Cond: ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone)
```

means the scan walks sequentially through roughly 20 million × 70 rows of index data and then filters them out.

Workaround
I did some testing and found that the problem lies in the OR in the statement; if I leave the OR out, both halves give me a fast plan. So I switched things around and made a UNION query to fetch the data I want:

```sql
SELECT id
FROM (
    SELECT * FROM (
        SELECT id AS id, timestamp AS timestamp
        FROM mmsi_positions_archive
        WHERE timestamp = '2019-03-10 00:00:00.000000+00:00'
          AND id > 1032749689
        ORDER BY timestamp, id
        LIMIT 100
    ) keyset
    UNION
    SELECT * FROM (
        SELECT id AS id, timestamp AS timestamp
        FROM mmsi_positions_archive
        WHERE timestamp > '2019-03-10 00:00:00.000000+00:00'
          AND timestamp <= '2019-03-11 00:00:00+00:00'
        ORDER BY timestamp, id
        LIMIT 100
    ) all_after
) archive_ids
ORDER BY timestamp, id
LIMIT 100
```
This produces a fast query and the following plan:
```
Limit  (cost=34.27..34.52 rows=100 width=16) (actual time=0.232..0.242 rows=100 loops=1)
  ->  Sort  (cost=34.27..34.77 rows=200 width=16) (actual time=0.231..0.238 rows=100 loops=1)
        Sort Key: mmsi_positions_archive."timestamp", mmsi_positions_archive.id
        Sort Method: quicksort  Memory: 34kB
        ->  HashAggregate  (cost=22.63..24.63 rows=200 width=16) (actual time=0.151..0.167 rows=200 loops=1)
              Group Key: mmsi_positions_archive.id, mmsi_positions_archive."timestamp"
              ->  Append  (cost=0.71..21.63 rows=200 width=16) (actual time=0.028..0.111 rows=200 loops=1)
                    ->  Limit  (cost=0.71..12.24 rows=100 width=16) (actual time=0.028..0.049 rows=100 loops=1)
                          ->  Merge Append  (cost=0.71..17.43 rows=145 width=16) (actual time=0.027..0.046 rows=100 loops=1)
                                Sort Key: mmsi_positions_archive.id
                                ->  Index Scan using mmsi_positions_archive_pkey on mmsi_positions_archive  (cost=0.12..8.14 rows=1 width=16) (actual time=0.010..0.010 rows=0 loops=1)
                                      Index Cond: (id > 1032749689)
                                      Filter: ("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone)
                                ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1  (cost=0.58..7.46 rows=144 width=16) (actual time=0.017..0.028 rows=100 loops=1)
                                      Index Cond: (("timestamp" = '2019-03-10 00:00:00+00'::timestamp with time zone) AND (id > 1032749689))
                                      Heap Fetches: 0
                    ->  Limit  (cost=0.60..5.39 rows=100 width=16) (actual time=0.012..0.049 rows=100 loops=1)
                          ->  Merge Append  (cost=0.60..773572.19 rows=16149157 width=16) (actual time=0.011..0.044 rows=100 loops=1)
                                Sort Key: mmsi_positions_archive_1."timestamp", mmsi_positions_archive_1.id
                                ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.005..0.005 rows=0 loops=1)
                                      Sort Key: mmsi_positions_archive_1."timestamp", mmsi_positions_archive_1.id
                                      Sort Method: quicksort  Memory: 25kB
                                      ->  Seq Scan on mmsi_positions_archive mmsi_positions_archive_1  (cost=0.00..0.00 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
                                            Filter: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
                                ->  Index Only Scan using mmsi_positions_archive_p2019q1_timestamp_id_index on mmsi_positions_archive_p2019q1 mmsi_positions_archive_p2019q1_1  (cost=0.58..571707.70 rows=16149156 width=16) (actual time=0.006..0.031 rows=100 loops=1)
                                      Index Cond: (("timestamp" > '2019-03-10 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2019-03-11 00:00:00+00'::timestamp with time zone))
                                      Heap Fetches: 0
Planning time: 1.059 ms
Execution time: 0.312 ms
```
While I could rewrite my query to use the UNION approach, I'd still like to know whether there is some way to better help Postgres do this with an OR. I'm also running this on AWS Aurora Postgres 9.6. I know we're a few major versions behind, and I'm planning to upgrade as soon as possible, but right now I just need to get this working. :)
Fortunately, this is quite simple in PostgreSQL, because it supports comparisons between "row values" (composite values) that can use an index.

So you can write:
```sql
WHERE (timestamp, id) > ('2019-03-10 00:00:00+00:00', 1032749689)
  AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100
```
The comparison of such row values is done lexicographically, which is exactly what you want here.
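Lexicographic means the first column decides the ordering and the second column only breaks ties. As an analogy (not Postgres itself, and with hypothetical values from the question), Python tuples compare the same way:

```python
# Analogy only: PostgreSQL row-value comparison (timestamp, id) > (t, i)
# behaves like lexicographic tuple comparison, which Python also uses.
cursor = ("2019-03-10 00:00:00+00", 1032749689)  # last row of the previous page

# Same timestamp, larger id: sorts after the cursor.
print(("2019-03-10 00:00:00+00", 1032749690) > cursor)  # True
# Later timestamp: sorts after the cursor regardless of id.
print(("2019-03-11 00:00:00+00", 0) > cursor)           # True
# The cursor row itself: excluded by the strict ">".
print(cursor > cursor)                                   # False
```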
Here is a link to the documentation for that feature.
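For reference, a sketch of the question's slow query rewritten with the row-value comparison (same table, bounds, and LIMIT as above; untested against the poster's schema):

```sql
SELECT id
FROM mmsi_positions_archive
WHERE (timestamp, id) > ('2019-03-10 00:00:00+00:00', 1032749689)
  AND timestamp <= '2019-03-11 00:00:00+00:00'
ORDER BY timestamp, id
LIMIT 100
```

This lets a single range condition cover both the "same timestamp, larger id" and "later timestamp" cases, so the planner can turn it into one Index Cond on the (timestamp, id) index instead of a Filter.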