PostgreSQL

Efficient query to get the greatest value per group from a big table

  • March 25, 2022

Given the table:

   Column    |            Type             
id           | integer                     
latitude     | numeric(9,6)                
longitude    | numeric(9,6)                
speed        | integer                     
equipment_id | integer                     
created_at   | timestamp without time zone
Indexes:
   "geoposition_records_pkey" PRIMARY KEY, btree (id)

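The table has 20 million records, which is not a particularly large number, relatively speaking. But it does make sequential scans slow.

How can I get the last record (max(created_at)) for each equipment_id?

I have tried both of the following queries, in several variants, after reading many answers on this topic: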

select max(created_at),equipment_id from geoposition_records group by equipment_id;

select distinct on (equipment_id) equipment_id,created_at 
 from geoposition_records order by equipment_id, created_at desc;

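I have also tried creating a btree index on (equipment_id, created_at), but Postgres finds that a seqscan is faster. Forcing enable_seqscan = off does not help either, since reading the index is as slow as the seq scan, probably worse.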
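For reference, such an index could be created along the following lines. This is only a sketch: the index name matches the one that appears in the EXPLAIN output further down, but the exact statement is an assumption, since the original command is not shown.

```sql
-- Assumed definition of the two-column index mentioned above.
-- The name is the one Postgres generates by default for this column list.
CREATE INDEX geoposition_records_equipment_id_created_at_idx
    ON geoposition_records (equipment_id, created_at);
```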

The query must run periodically and always return the latest record.

Using Postgres 9.3.

Explain/analyze (with 1.7 million records):

set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate  (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
"  ->  Seq Scan on geoposition_records  (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"

set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate  (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
"  ->  Index Scan using geoposition_records_equipment_id_created_at_idx on geoposition_records  (cost=0.00..2987673.75 rows=1651851 width=12) (actual time=0.062..10248.703 rows=1651851 loops=1)"
"Total runtime: 11305.161 ms"

Index

A plain multicolumn B-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

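Why DESC NULLS LAST? If created_at can be NULL, a plain descending sort puts the NULL values first, so the "latest" row could be one without a timestamp. NULLS LAST moves them to the end, and the index should use the same sort order as the ORDER BY of the queries below.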
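The effect of NULLS LAST can be seen with a small self-contained sketch. The literal values are made up for illustration and do not come from the real table:

```sql
-- Plain DESC: the NULL row sorts first, ahead of every real timestamp.
SELECT ts
FROM  (VALUES (timestamp '2022-03-24 10:00'),
              (timestamp '2022-03-25 10:00'),
              (NULL::timestamp)) AS t(ts)
ORDER  BY ts DESC;

-- DESC NULLS LAST: the newest real timestamp sorts first, NULL goes last.
SELECT ts
FROM  (VALUES (timestamp '2022-03-24 10:00'),
              (timestamp '2022-03-25 10:00'),
              (NULL::timestamp)) AS t(ts)
ORDER  BY ts DESC NULLS LAST;
```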

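Is it safe to assume that you have an equipment table? Then performance won't be a problem:

Correlated subquery

Based on this equipment table, run a simple correlated subquery to great effect: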

SELECT equipment_id
    , (SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1) AS latest
FROM   equipment eq;

For a small number of rows in the equipment table (57, judging from your EXPLAIN ANALYZE output), that's very fast.
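If no such table exists yet, it could be derived once from the big table. This is only a sketch: the table name equipment and its column equipment_id follow the queries in this answer, but the real schema is an assumption, and the initial population costs one full scan.

```sql
-- Hypothetical lookup table holding one row per device.
CREATE TABLE equipment (
    equipment_id integer PRIMARY KEY
);

-- One-time population from the existing data (a single sequential scan).
INSERT INTO equipment (equipment_id)
SELECT DISTINCT equipment_id
FROM   geoposition_records
WHERE  equipment_id IS NOT NULL;
```

New devices would then have to be added to this table as they appear, for example enforced by a foreign key from geoposition_records.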

LATERAL join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
  SELECT created_at
  FROM   geoposition_records
  WHERE  equipment_id = eq.equipment_id
  ORDER  BY created_at DESC NULLS LAST
  LIMIT  1
  ) r(latest) ON true;

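Performance is similar to the correlated subquery. LEFT JOIN LATERAL ... ON true keeps every row from equipment, even one without any matching record; latest is then NULL.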
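Since the question asks for the last record, the same LATERAL pattern can also return more columns of the latest row, which a scalar subquery cannot do. A sketch, using the column names from the table definition in the question:

```sql
SELECT eq.equipment_id, r.*
FROM   equipment eq
LEFT   JOIN LATERAL (
   -- Latest row for the current device, found via the multicolumn index.
   SELECT g.id, g.latitude, g.longitude, g.speed, g.created_at
   FROM   geoposition_records g
   WHERE  g.equipment_id = eq.equipment_id
   ORDER  BY g.created_at DESC NULLS LAST
   LIMIT  1
   ) r ON true;
```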

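Function

If you can't talk sense into the query planner (which shouldn't happen), a function looping through the equipment table is certain to do the trick. Looking up one equipment_id at a time uses the index.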

CREATE OR REPLACE FUNCTION f_latest_equip()
 RETURNS TABLE (equipment_id int, latest timestamp)
 LANGUAGE plpgsql STABLE AS
$func$
BEGIN
  FOR equipment_id IN
     SELECT e.equipment_id FROM equipment e ORDER BY 1
  LOOP
     SELECT g.created_at
     FROM   geoposition_records g
     WHERE  g.equipment_id = f_latest_equip.equipment_id
                          -- prepend function name to disambiguate
     ORDER  BY g.created_at DESC NULLS LAST
     LIMIT  1
     INTO   latest;

     RETURN NEXT;
  END LOOP;
END  
$func$;

Makes for a nice call, too:

SELECT * FROM f_latest_equip();


Attempt 1

If

  1. I have a separate equipment table, and
  2. I have an index on geoposition_records(equipment_id, created_at desc)

then the following works for me:

select id as equipment_id, (select max(created_at)
                           from geoposition_records
                           where equipment_id = equipment.id
                          ) as max_created_at
from equipment;
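To check whether each subquery is answered from the index rather than by a scan of the big table, the plan can be inspected. This is a sketch; no output is shown because it depends on the data and the server:

```sql
-- ANALYZE executes the query; BUFFERS adds block read counts to the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id AS equipment_id,
       (SELECT max(created_at)
        FROM   geoposition_records
        WHERE  equipment_id = equipment.id) AS max_created_at
FROM   equipment;
```

With the index in place, one would hope to see an index scan inside the subplan instead of a Seq Scan on geoposition_records.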

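I wasn't able to force PG to run a fast query that determines both the list of equipment_ids and the related max(created_at). But I'm going to try again tomorrow!

Attempt 2

I found this link: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values. Combining this technique with my query from attempt 1, I get: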

WITH RECURSIVE equipment(id) AS (
   SELECT MIN(equipment_id) FROM geoposition_records
 UNION
   SELECT (
     SELECT equipment_id
     FROM geoposition_records
     WHERE equipment_id > equipment.id
     ORDER BY equipment_id
     LIMIT 1
   )
   FROM equipment WHERE id IS NOT NULL
)
SELECT id AS equipment_id, (SELECT MAX(created_at)
                           FROM geoposition_records
                           WHERE equipment_id = equipment.id
                          ) AS max_created_at
FROM equipment
WHERE id IS NOT NULL;  -- skip the terminating NULL row from the last recursion step

This works FAST! But you need

  1. this ultra-twisted query form, and
  2. an index on geoposition_records(equipment_id, created_at desc)
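The recursive part may be easier to follow in isolation. The sketch below only lists the distinct equipment_id values by hopping from one value to the next through the index, which emulates a loose index scan. The CTE name ids is made up for illustration.

```sql
WITH RECURSIVE ids(id) AS (
   -- Anchor: the smallest equipment_id, read from the start of the index.
   (SELECT equipment_id
    FROM   geoposition_records
    ORDER  BY equipment_id
    LIMIT  1)
 UNION ALL
   -- Step: the next larger equipment_id, or NULL once none is left.
   SELECT (SELECT equipment_id
           FROM   geoposition_records
           WHERE  equipment_id > ids.id
           ORDER  BY equipment_id
           LIMIT  1)
   FROM   ids
   WHERE  ids.id IS NOT NULL
)
SELECT id AS equipment_id
FROM   ids
WHERE  id IS NOT NULL;   -- drop the NULL that ends the recursion
```

Each step is a single index probe, so the cost grows with the number of distinct devices rather than with the number of rows in geoposition_records.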

Quoted from: https://dba.stackexchange.com/questions/51895