PostgreSQL

How to get an index scan for OR'ed time range predicates?

  • August 27, 2021

I have a table events with the columns:

id
user_id
time_start
time_end
...

and an index on (time_start, time_end).

SELECT user_id
FROM events
WHERE ((time_start <= '2021-08-24T15:30:00+00:00' AND time_end >= '2021-08-24T15:30:00+00:00') OR
      (time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00'))
GROUP BY user_id;

Its EXPLAIN (ANALYZE, BUFFERS) output:

Group  (cost=243735.42..243998.32 rows=1103 width=4) (actual time=186.533..188.244 rows=166 loops=1)
 Group Key: user_id
 Buffers: shared hit=224848
 ->  Gather Merge  (cost=243735.42..243992.80 rows=2206 width=4) (actual time=186.532..188.199 rows=176 loops=1)
       Workers Planned: 2
       Workers Launched: 2
       Buffers: shared hit=224848
       ->  Sort  (cost=242735.39..242738.15 rows=1103 width=4) (actual time=184.121..184.126 rows=59 loops=3)
             Sort Key: user_id
             Sort Method: quicksort  Memory: 27kB
             Worker 0:  Sort Method: quicksort  Memory: 27kB
             Worker 1:  Sort Method: quicksort  Memory: 28kB
             Buffers: shared hit=224848
             ->  Partial HashAggregate  (cost=242668.62..242679.65 rows=1103 width=4) (actual time=184.065..184.085 rows=59 loops=3)
                   Group Key: user_id
                   Buffers: shared hit=224834
                   ->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4) (actual time=104.085..183.994 rows=64 loops=3)
                         Filter: (((time_start <= '2021-08-24 15:30:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start <= '2021-08-24 15:59:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:59:00+00'::timestamp with time zone)))
                         Rows Removed by Filter: 708728
                         Buffers: shared hit=224834
Planning Time: 0.169 ms
Execution Time: 188.294 ms

Postgres uses a Seq Scan with this filter:

Filter: (((time_start <= '2021-08-24 15:30:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start <= '2021-08-24 15:59:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:59:00+00'::timestamp with time zone)))

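But when I keep only one of the two (time_start, time_end) conditions, Postgres starts using an Index Scan.

How can I change the condition so that Postgres uses an Index Scan instead of a Seq Scan?

I don't want to use UNION like this: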

(SELECT user_id
 FROM events
 WHERE time_start <= '2021-08-24T15:30:00+00:00' AND time_end >= '2021-08-24T15:30:00+00:00'
 GROUP BY user_id)
UNION
(SELECT user_id
 FROM events
 WHERE time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00'
 GROUP BY user_id);

Expression index

A GiST or (even better) SP-GiST expression index on a timestamp range should work wonders.

CREATE INDEX events_right_expr_idx ON events USING spgist (tsrange(time_start, time_end, '[]'));

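Rewrite your query with the "range contains element" operator @> and match the index expression (fully equivalent to your original predicates):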

SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
   OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;

You get a query plan like this:

HashAggregate  (cost=9.90..10.00 rows=10 width=4)
 Group Key: user_id
 ->  Bitmap Heap Scan on events  (cost=2.57..9.88 rows=10 width=4)
       Recheck Cond: ((tsrange(time_start, time_end, '[]'::text) @> '2021-08-24 15:30:00'::timestamp without time zone) OR (tsrange(time_start, time_end, '[]'::text) @> '2021-08-24 15:59:00'::timestamp without time zone))
       ->  BitmapOr  (cost=2.57..2.57 rows=10 width=0)
             ->  Bitmap Index Scan on events_right_expr_idx  (cost=0.00..1.28 rows=5 width=0)
                   Index Cond: (tsrange(time_start, time_end, '[]'::text) @> '2021-08-24 15:30:00'::timestamp without time zone)
             ->  Bitmap Index Scan on events_right_expr_idx  (cost=0.00..1.28 rows=5 width=0)
                   Index Cond: (tsrange(time_start, time_end, '[]'::text) @> '2021-08-24 15:59:00'::timestamp without time zone)

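That should be much faster.

By default, range types assume an inclusive lower bound and an exclusive upper bound: tsrange(time_start, time_end) is equivalent to tsrange(time_start, time_end, '[)').

Since you operate with >= and <=, include both bounds with tsrange(time_start, time_end, '[]').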
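To make the difference between the two bound styles concrete, here is a small sketch that can be run on its own. The literal values are made up for illustration; the only point is what happens when the probed timestamp is exactly the upper bound.

```sql
-- Probe a range with a timestamp equal to its upper bound.
SELECT tsrange('2021-08-24 15:00', '2021-08-24 15:30')       @> timestamp '2021-08-24 15:30' AS default_bounds,   -- false: '[)' excludes the upper bound
       tsrange('2021-08-24 15:00', '2021-08-24 15:30', '[]') @> timestamp '2021-08-24 15:30' AS inclusive_bounds; -- true: '[]' includes it, like time_end >= ...
```

With the default bounds, an event ending exactly at the probed instant would be missed, which is why the index expression and the query both spell out '[]'.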

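Alternatively, store a range column in the table

That should be a bit faster still, since the index is then a plain (not expression) index.

You can add a timestamp range column to the table, like: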

ALTER TABLE events ADD COLUMN ts_range tsrange GENERATED ALWAYS AS (tsrange(time_start, time_end, '[]')) STORED;

Or, more radically, replace time_start and time_end with the range column. Then index and query are a bit simpler:

CREATE INDEX events_right_idx ON events USING spgist (ts_range);

SELECT user_id
FROM   events
WHERE  ts_range @> timestamp '2021-08-24T15:30:00'
   OR ts_range @> timestamp '2021-08-24T15:59:00'
GROUP  BY user_id;

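But the tsrange column occupies more space than two timestamp columns. Weigh costs and benefits.

Asides

Postgres 14 (currently in beta) even allows covering SP-GiST indexes. The release notes:

Allow SP-GiST to use INCLUDE'd columns (Pavel Borisov)

But I don't think you can get an index-only scan for this particular query.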
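For reference, a covering SP-GiST index might be declared as in the sketch below. It assumes Postgres 14 or later and the stored ts_range column described earlier; the index name events_ts_range_covering_idx is made up for illustration. Whether the planner actually chooses an index-only scan is a separate matter and is best checked with EXPLAIN.

```sql
-- Assumption: Postgres 14+, and events.ts_range exists as a stored tsrange column.
-- The index name is hypothetical.
CREATE INDEX events_ts_range_covering_idx
    ON events USING spgist (ts_range) INCLUDE (user_id);

-- Inspect which scan type the planner picks for the query.
EXPLAIN
SELECT user_id
FROM   events
WHERE  ts_range @> timestamp '2021-08-24 15:30:00'
   OR  ts_range @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;
```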

If you have to stick with a B-tree index for some reason, then this fixed UNION query shouldn't be too bad either:

SELECT user_id
FROM   events
WHERE  '2021-08-24T15:30:00' BETWEEN time_start AND time_end
UNION
SELECT user_id
FROM   events
WHERE  '2021-08-24T15:59:00' BETWEEN time_start AND time_end

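Notably, there is no GROUP BY: UNION already does all the work, since it removes duplicate user_id values.

The predicates are also simplified with BETWEEN (no effect on performance).

Also, you seem to have a wild mix of timestamp without time zone and timestamp with time zone. And naming the columns "time" adds to the confusion. Typically, timestamptz is the better choice.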
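If time_start and time_end are in fact timestamp with time zone (as the filter in the question's plan suggests, which is an assumption here), the matching range type is tstzrange rather than tsrange. A minimal sketch of the expression-index approach for that case follows; the index name events_tstz_range_idx is made up for illustration.

```sql
-- Assumption: time_start and time_end are timestamptz columns.
-- The index name is hypothetical.
CREATE INDEX events_tstz_range_idx
    ON events USING spgist (tstzrange(time_start, time_end, '[]'));

-- The query must repeat the index expression exactly and probe with timestamptz values.
SELECT user_id
FROM   events
WHERE  tstzrange(time_start, time_end, '[]') @> timestamptz '2021-08-24 15:30:00+00'
   OR  tstzrange(time_start, time_end, '[]') @> timestamptz '2021-08-24 15:59:00+00'
GROUP  BY user_id;
```

Keeping column type, range type, and literal type aligned avoids implicit casts between timestamp and timestamptz, which depend on the session time zone.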

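Last but not least, this indicates inaccurate column statistics, leading to a suboptimal query plan (45952 rows estimated, 64 rows actually returned):

->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4) (actual time=104.085..183.994 rows=64 loops=3)

Run:

ANALYZE events;

and retry. Your original query can use the plain B-tree index. It's just not as efficient as the suggested SP-GiST index.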

Then possibly tune your autovacuum and statistics settings to avoid inaccurate statistics in the future.
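As a rough sketch of what such tuning could look like for this table: the numbers below are illustrative assumptions, not recommendations, and should be chosen based on the table's size and write rate.

```sql
-- Trigger autoanalyze after about 1 % of rows change (illustrative value).
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.01);

-- Collect more detailed statistics for the two time columns (illustrative target).
ALTER TABLE events ALTER COLUMN time_start SET STATISTICS 1000;
ALTER TABLE events ALTER COLUMN time_end   SET STATISTICS 1000;

-- A new statistics target only takes effect at the next ANALYZE.
ANALYZE events;
```

Per-table and per-column settings like these leave the global configuration untouched, which keeps the change limited to the table that showed the bad estimate.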

Source: https://dba.stackexchange.com/questions/298625