Efficient query to get the greatest value per group from a big table
Given the table:

        Column    |            Type
    --------------+-----------------------------
     id           | integer
     latitude     | numeric(9,6)
     longitude    | numeric(9,6)
     speed        | integer
     equipment_id | integer
     created_at   | timestamp without time zone
    Indexes:
        "geoposition_records_pkey" PRIMARY KEY, btree (id)
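The table has 20 million records, which is not a particularly large number, relatively speaking. But it is enough to make sequential scans slow.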
How can I get the last record (max(created_at)) of each equipment_id? I have tried the following two queries, in several variants, after reading many answers on this topic:
    select max(created_at),equipment_id
    from geoposition_records
    group by equipment_id;

    select distinct on (equipment_id) equipment_id,created_at
    from geoposition_records
    order by equipment_id, created_at desc;
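I have also tried creating a btree index on (equipment_id, created_at), but Postgres finds a seq scan faster. Forcing enable_seqscan = off does not help either, since reading the index is as slow as the seq scan, probably worse. The query must run periodically and always return the latest record.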
Using Postgres 9.3.
Explain/analyze (with 1.7 million records):
    set enable_seqscan=true;
    explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
    "HashAggregate  (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
    "  ->  Seq Scan on geoposition_records  (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
    "Total runtime: 1935.632 ms"

    set enable_seqscan=false;
    explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
    "GroupAggregate  (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
    "  ->  Index Scan using geoposition_records_equipment_id_created_at_idx on geoposition_records  (cost=0.00..2987673.75 rows=1651851 width=12) (actual time=0.062..10248.703 rows=1651851 loops=1)"
    "Total runtime: 11305.161 ms"
Index
A plain multicolumn B-tree index should work after all:
CREATE INDEX foo_idx ON geoposition_records (equipment_id, created_at DESC NULLS LAST);
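Why DESC NULLS LAST? With a descending sort, Postgres puts NULL values first by default, so NULLS LAST keeps rows with an actual timestamp at the front in case created_at can be NULL. The queries below use the same ORDER BY so that they match the index.

Is it safe to assume you have an equipment table? Then performance won't be a problem:

Correlated subquery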
Based on this equipment table, run a lowly correlated subquery to great effect:

    SELECT equipment_id
         , (SELECT created_at
            FROM   geoposition_records
            WHERE  equipment_id = eq.equipment_id
            ORDER  BY created_at DESC NULLS LAST
            LIMIT  1) AS latest
    FROM   equipment eq;
For a small number of rows in the equipment table (57, judging from your EXPLAIN ANALYZE output), that's very fast.
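
To see why each lookup is cheap, the inner query can be run on its own for a single id under EXPLAIN. This is only a sketch: the id value 1 is a placeholder, not taken from the question, and the multicolumn index on (equipment_id, created_at DESC NULLS LAST) is assumed to exist.

```sql
-- Assumption: 1 stands in for any existing equipment_id.
-- Assumption: the index on (equipment_id, created_at DESC NULLS LAST) exists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT created_at
FROM   geoposition_records
WHERE  equipment_id = 1
ORDER  BY created_at DESC NULLS LAST  -- same sort order as the index
LIMIT  1;                             -- only the newest row is needed
```

If the plan shows a Limit node on top of an index scan over that index, each lookup reads only a few pages, and the total cost grows with the number of equipment rows rather than with the size of geoposition_records.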
LATERAL join in Postgres 9.3+

    SELECT eq.equipment_id, r.latest
    FROM   equipment eq
    LEFT   JOIN LATERAL (
       SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1
       ) r(latest) ON true;
Performance is similar to the correlated subquery.
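
Since the question asks for the last record rather than only its timestamp, a LATERAL subquery can return several columns at once, which a scalar correlated subquery cannot. A minimal sketch, assuming the same equipment table and index as above; the column list comes from the table definition in the question:

```sql
-- Sketch: fetch the whole latest row per equipment, not just created_at.
SELECT eq.equipment_id, r.id, r.latitude, r.longitude, r.speed, r.created_at
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT g.id, g.latitude, g.longitude, g.speed, g.created_at
   FROM   geoposition_records g
   WHERE  g.equipment_id = eq.equipment_id
   ORDER  BY g.created_at DESC NULLS LAST
   LIMIT  1
   ) r ON true;  -- LEFT JOIN ... ON true keeps equipment without any records
```

Equipment rows that have no matching records still appear in the result, with NULL in the columns coming from r.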
Function
If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table certainly does the trick. Looking up one equipment_id at a time uses the index.

    CREATE OR REPLACE FUNCTION f_latest_equip()
      RETURNS TABLE (equipment_id int, latest timestamp)
      LANGUAGE plpgsql STABLE AS
    $func$
    BEGIN
       FOR equipment_id IN
          SELECT e.equipment_id FROM equipment e ORDER BY 1
       LOOP
          SELECT g.created_at
          FROM   geoposition_records g
          WHERE  g.equipment_id = f_latest_equip.equipment_id  -- prepend function name to disambiguate
          ORDER  BY g.created_at DESC NULLS LAST
          LIMIT  1
          INTO   latest;

          RETURN NEXT;
       END LOOP;
    END
    $func$;
Makes for a nice call, too:
SELECT * FROM f_latest_equip();
Performance comparison:
db<>fiddle here
Attempt 1
If
- I have a separate equipment table, and
- I have an index on geoposition_records(equipment_id, created_at desc)
then the following works for me:
    select id as equipment_id,
           (select max(created_at)
            from geoposition_records
            where equipment_id = equipment.id
           ) as max_created_at
    from equipment;
I wasn't able to force PG to run a fast query that determines both the list of equipment_ids and the related max(created_at). But I'm going to try again tomorrow!
Attempt 2
I found this link: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values Combining this technique with my query from attempt 1, I get:
    WITH RECURSIVE equipment(id) AS (
        SELECT MIN(equipment_id) FROM geoposition_records
      UNION
        SELECT (
            SELECT equipment_id
            FROM geoposition_records
            WHERE equipment_id > equipment.id
            ORDER BY equipment_id
            LIMIT 1
        )
        FROM equipment WHERE id IS NOT NULL
    )
    SELECT id AS equipment_id,
           (SELECT MAX(created_at)
            FROM geoposition_records
            WHERE equipment_id = equipment.id
           ) AS max_created_at
    FROM equipment;
This works fast! But you need
- this super-convoluted query form, and
- an index on geoposition_records(equipment_id, created_at desc).
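
The recursive part of that query is easier to follow in isolation. The sketch below shows only the step that enumerates the distinct equipment_id values, a pattern often described as emulating a loose index scan. The CTE name ids is made up for illustration, and an index with equipment_id as its leading column is assumed.

```sql
-- Sketch: list distinct equipment_id values by hopping from one id to the next.
WITH RECURSIVE ids(equipment_id) AS (
   -- Anchor: the smallest id in the table.
   SELECT min(equipment_id) FROM geoposition_records
   UNION ALL
   -- Step: the next id greater than the current one (NULL once none is left).
   SELECT (SELECT g.equipment_id
           FROM   geoposition_records g
           WHERE  g.equipment_id > ids.equipment_id
           ORDER  BY g.equipment_id
           LIMIT  1)
   FROM   ids
   WHERE  ids.equipment_id IS NOT NULL  -- stops the recursion after the NULL row
)
SELECT equipment_id
FROM   ids
WHERE  equipment_id IS NOT NULL;         -- drop the trailing NULL marker row
```

Each step is a single probe on equipment_id, so the number of probes is about the number of distinct ids, not the number of rows in the table. The final filter removes the NULL row that marks the end of the recursion.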