在某個時間點查找 n 個加油站的目前價格

May 3, 2020

我有一個表，其中儲存了價格資訊，大約 1300 萬行儲存在 PostgreSQL 9.5 數據庫中。

CREATE TABLE public.de_tt_priceinfo (
 id integer NOT NULL DEFAULT nextval('priceinfo_id_seq'::regclass),
 station_id character varying(60),
 received timestamp with time zone NOT NULL DEFAULT now(),
 e5 numeric(4,3),
 e10 numeric(4,3),
 diesel numeric(4,3),
 CONSTRAINT de_tt_priceinfo_id_pkey PRIMARY KEY (id)
);

CREATE INDEX de_tt_priceinfo_received_station_id_idx
 ON public.de_tt_priceinfo (received, station_id COLLATE pg_catalog."default");

CREATE INDEX index_station_id
 ON public.de_tt_priceinfo (station_id COLLATE pg_catalog."default");

從這個表中，我需要以最高性能提取某個時間點的最新價格，因為我必須模擬 3200 萬查詢該表的通勤者（不是一次，而是仍然）。

我有一個有效的查詢！

SELECT station_id, e5, e10, diesel, received FROM de_tt_priceinfo a 
WHERE a.received = (SELECT MAX(received) FROM de_tt_priceinfo b 
 WHERE a.station_id = b.station_id
 AND  received &lt;= '2014-09-25 08:45:12'::TIMESTAMPTZ)
AND station_id IN('0C91A93A-a-b-c-d', '578C44BB-a-b-c-d', '6F2F48A8-a-b-c-d'
               , '9982BE74-a-b-c-d', 'A24C612B-a-b-c-d', 'BEC3EF55-a-b-c-d'
               , 'F5137488-a-b-c-d')

此查詢的性能不可用。執行時間在 900 毫秒左右變化。結果看起來像這樣

0C91A93A-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 08:17:50.000000”
578C44BB-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 08:00:09.000000”
6F2F48A8-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 07:08:57.000000”
9982BE74-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 08:29:55.000000”
A24C612B-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 08:00:09.000000”
BEC3EF55-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 06:53:49.000000”
F5137488-abcd，1.xxx，1.xxx，1.xxx，“2014-09-25 07:44:55.000000”

所以我搜尋了一下，發現了一些流行詞，比如遞歸 CTE、鬆散的索引掃描和一些 DBA 上的答案似乎非常接近，但我無法根據我的需要修改它們。

如果我理解正確，遞歸 CTE 將是查詢我想要的數據的最快方法。

到目前為止，我得到的是：

WITH RECURSIVE cte AS (
  (
  SELECT station_id, e5, e10, diesel, received
  FROM   de_tt_priceinfo
  WHERE  received &lt;= '2014-09-25 08:45:00'::TIMESTAMPTZ
    AND station_id IN('0C91A93A-a-b-c-d', '578C44BB-a-b-c-d', '6F2F48A8-a-b-c-d'
                    , '9982BE74-a-b-c-d', 'A24C612B-a-b-c-d', 'BEC3EF55-a-b-c-d'
                    , 'F5137488-a-b-c-d')
  ORDER  BY station_id, received DESC NULLS LAST
  LIMIT 1  
  )
  UNION ALL
  (
  SELECT u.station_id, u.e5, u.e10, u.diesel, u.received
  FROM   cte c
  JOIN de_tt_priceinfo u ON u.received &gt; c.received   
  WHERE  u.received &lt;= '2014-09-25 08:45:00'::TIMESTAMPTZ  -- repeat condition!
    AND u.station_id IN('0C91A93A-a-b-c-d', '578C44BB-a-b-c-d', '6F2F48A8-a-b-c-d'
                      , '9982BE74-a-b-c-d', 'A24C612B-a-b-c-d', 'BEC3EF55-a-b-c-d'
                      , 'F5137488-a-b-c-d')
  ORDER BY u.station_id, u.received DESC NULLS LAST LIMIT 1
  )
  )
SELECT * FROM cte;

但這只是返回以下兩行：

0C91A93A-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:17:50.000000"
9982BE74-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:29:55.000000"

更新：

SELECT Version();x86_64-pc-linux-gnu 上的 PostgreSQL 9.5devel，由 x86_64-pc-linux-gnu-gcc (Gentoo 4.8.3 p1.1, pie-0.5.9) 4.8.3, 64-bit 編譯
解釋分析：http ://explain.depesz.com/s/clrZ
XEON 1231v3、16 GB 記憶體、三星 840 PRO SSD
對預設 postgresql.conf 的更改

＃ 聯繫
聽地址 = '*'
最大連接數 = 16

# 記錄
log_destination = 'csvlog'
log_directory = 'pg_log'
logging_collector = on
log_filename = 'postgres-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 1GB
log_min_duration_statement = 500 毫秒
#log_checkpoints = 開啟
#log_connections = 開啟
#log_disconnections = 開啟
log_lock_waits = on
#log_temp_files = 0

＃ 記憶
shared_buffers = 1GB
臨時緩衝區 = 32MB
工作記憶體 = 256MB
維護工作記憶體 = 1GB
有效記憶體大小 = 8GB

# 檢查點（何時寫入磁碟）
wal_buffers = 16MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30 分鐘
checkpoint_segments = 32

random_page_cost = 1.1

# 只導入！
#autovacuum = 關閉
fsync = 關閉
同步送出 = 關閉
full_page_writes = 關閉

指數
首先，對於您的查詢類型，這是更好的索引：
CREATE INDEX de_tt_priceinfo_received_station_id_idx
 ON public.de_tt_priceinfo (**station_id, received**);  -- note the reversed order
由於組合應該是唯一的（我假設），我建議改為UNIQUE限制：(station_id, receved)
ALTER TABLE de_tt_priceinfo ADD CONSTRAINT de_tt_priceinfo_station_id_received
UNIQUE (station_id, received);
該索引index_station_id大部分已被取代，現在可能會被刪除。
該索引de_tt_priceinfo_received_station_id_idx可能仍然有用。
一定要理解這一切背後的邏輯：
多列索引和性能
複合索引是否也適用於第一個欄位的查詢？
詢問
我還會考慮基本DISTINCT ON查詢：
SELECT DISTINCT ON (station_id)
      station_id, e5, e10, diesel, received
FROM   de_tt_priceinfo
WHERE  received &lt;= '2014-09-25 08:45:12'::TIMESTAMPTZ
AND    station_id = ANY ('{0C91A93A-a-b-c-d, 578C44BB-a-b-c-d, 6F2F48A8-a-b-c-d
                        , 9982BE74-a-b-c-d, A24C612B-a-b-c-d, BEC3EF55-a-b-c-d
                        , F5137488-a-b-c-d}'::varchar[])
ORDER BY station_id, received DESC;
但是由於您似乎每個站有很多行，所以這不會發光。反而：
SELECT *
FROM  (
  VALUES
    ('0C91A93A-a-b-c-d'::varchar)
   , ('578C44BB-a-b-c-d')
   , ('6F2F48A8-a-b-c-d')
   , ('9982BE74-a-b-c-d')
   , ('A24C612B-a-b-c-d')
   , ('BEC3EF55-a-b-c-d')
   , ('F5137488-a-b-c-d')
  ) s(station_id)
LEFT JOIN LATERAL (
   SELECT e5, e10, diesel, received
   FROM   de_tt_priceinfo
   WHERE  station_id = s.station_id
   AND    received &lt;= '2014-09-25 08:45:12'::TIMESTAMPTZ
   ORDER  BY received DESC
   LIMIT  1
  )  p ON TRUE
這應該是與上述UNIQUE約束（或等效索引）相結合的炸藥。
詳細解釋：
優化 GROUP BY 查詢以檢索每個使用者的最新記錄
在每個 GROUP BY 組中選擇第一行？
表定義
對於具有數百萬行的表，優化儲存是值得的。讓一切變得更小更快。
我就是這樣設計的：
CREATE TABLE station (
  station_id serial PRIMARY KEY
, station    text
, CHECK (length(station) &lt; 61) -- ?? optional, you decide 
);

CREATE TABLE priceinfo (
  priceinfo_id serial PRIMARY KEY
, station_id   integer NOT NULL REFERENCES station ON UPDATE CASCADE
, received     timestamptz NOT NULL DEFAULT now(),
, e5           integer  -- price in 0.1 Cent
, e10          integer  -- price in 0.1 Cent
, diesel       integer  -- price in 0.1 Cent
, CONSTRAINT priceinfo_station_id_received UNIQUE (station_id, received)
);

CREATE INDEX priceinfo_received_idx ON public.priceinfo (received);
與原始表中的94 字節(24 + 66 + 4)相比，其中的行大小priceinfo為60 字節（24 堆元組標頭 + 空點陣圖；32 字節數據；4 字節項目標識符）。****假設您的範例中有 16 個字元的字元串。一切都將小 36 %（或更多？）並且速度更快。
每個索引元組的關鍵索引(station_id, received)減少到8 個字節的數據，而不是32 個字節甚至更多（！） - 每個都加上成本。此外，處理integer數字 forstation_id通常比處理帶有 a 的文本要快COLLATION。
細節：
為讀取性能配置 PostgreSQL
查詢將首先station_id從station表中獲取，這很便宜。
價格儲存為integer表示 0.1 美分的數字。（原始的4 字節而不是10 字節numeric(4,3)乘以 0.1 得到 Cent 或 0.001 得到 € 用於顯示。非常簡單和快速。
UUID
錯誤消息中的字元串看起來相當長，實際上就像一個正常UUID數字：
871828b4-37e5-419c-b7a5-cdbe1e1c0148
如果是這樣，請使用uuid數據類型。無論你採用我的保持你的舊的設計。至少切換到uuid數據類型，以便在各個方面都有很大的整體收益：
當所有值都是 36 個字元時，使用 char 與 varchar 進行索引查找會明顯更快嗎

引用自：https://dba.stackexchange.com/questions/93166

在某個時間點查找 n 個加油站的目前價格

指數

詢問

表定義

`UUID`

相關問答

尋找遞歸查詢的更簡單替代方案

PostgreSQL 遞歸公用表表達式的 (Big O) 計算複雜度是多少？

基於函式值的遞歸 CTE 在 Postgres 12 上明顯比在 11 上慢

可以使用第一個 CTE 作為第二個 CTE 的過濾器嗎？

使用來自許多表的列提高 order by 的性能

PostgreSQL 規劃器對分佈不均勻的列的錯誤決策