PostgreSQL

Storing the most recent related record for each group

  • May 17, 2020

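I have two tables, customers and purchases. Each customer has many (thousands of) purchases. I usually need only the most recent purchase for each customer, which is why I have a "latest_purchase_id" column and update it whenever I add a purchase.

I would rather not have to maintain "latest_purchase_id", so I have been testing queries that compute it instead. They all end up much slower, and I don't know why.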

customers:

      Column        |  Type    |                       Modifiers                        | Storage  | Stats target | Description
---------------------+----------+--------------------------------------------------------+----------+--------------+-------------
id                  | integer  | not null default nextval('customers_id_seq'::regclass) | plain    |              |
latest_purchase_id  | integer  |                                                        | plain    |              |
Indexes:
   "customers_pkey" PRIMARY KEY, btree (id)
   "customers_latest_purchase_id" btree (latest_purchase_id)
Foreign-key constraints:
   "customers_latest_purchase_fk" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
   TABLE "purchases" CONSTRAINT "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

purchases:

    Column   |  Type     |                        Modifiers                       | Storage  | Stats target | Description
--------------+-----------+--------------------------------------------------------+----------+--------------+-------------
id           | integer   | not null default nextval('purchases_id_seq'::regclass) | plain    |              |
customer_id  | integer   |                                                        | plain    |              |
Indexes:
   "purchases_pkey" PRIMARY KEY, btree (id)
   "purchases_id_customer_id" btree (id, customer_id)
   "purchases_customer_id" btree (customer_id)
Foreign-key constraints:
   "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
   TABLE "customers" CONSTRAINT "customers_latest_purchase_id" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

SELECT customers.id, purchases.id
FROM customers 
  JOIN purchases ON customers.latest_purchase_id = purchases.id;

48 ms

SELECT DISTINCT ON (customer_id) id, customer_id
FROM purchases
ORDER BY customer_id, id DESC;

1040 ms

SELECT customers.id, p.id
FROM customers INNER JOIN (
   SELECT RANK()
   OVER (PARTITION BY customer_id ORDER BY id DESC) r, *
   FROM purchases
) p
ON customers.id = p.customer_id
WHERE p.r = 1;

836 ms

SELECT customers.id, p1.id
FROM customers
JOIN purchases p1 ON customers.id = p1.customer_id
LEFT OUTER JOIN purchases p2 ON (customers.id = p2.customer_id and p1.id < p2.id)
WHERE p2.id IS NULL;

1833 ms

SELECT customers.id, p.id
FROM customers CROSS JOIN LATERAL (
   SELECT purchases.id, purchases.customer_id
   FROM purchases
   WHERE purchases.customer_id = customers.id
   ORDER BY purchases.id DESC
   LIMIT 1
) p;

23442 ms

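As you can see, "latest_purchase_id" is faster than anything else. The performance advantage is clearly a trade-off, because inserting a purchase takes about twice as long (I improved this significantly with the trigger below). The approach is also limited to the single most recent purchase: the query cannot be changed on the fly to find, say, the latest purchase with a particular transaction value.

Is there a reason the other queries are this slow, even with the indexes I have set up? I basically just need the maximum purchase ID for each customer ID, which the "purchases_id_customer_id" index should be able to handle easily.

Here is the EXPLAIN ANALYZE output for the first two queries: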

EXPLAIN ANALYZE SELECT customers.id, purchases.id FROM customers JOIN purchases ON customers.latest_purchase_id = purchases.id;
Nested Loop  (cost=0.42..11643.46 rows=3422 width=8) (actual time=0.961..72.014 rows=340 loops=1)
  ->  Seq Scan on customers  (cost=0.00..93.22 rows=3422 width=8) (actual time=0.010..1.239 rows=3420 loops=1)
  ->  Index Only Scan using purchases_pkey on purchases  (cost=0.42..3.38 rows=1 width=4) (actual time=0.020..0.020 rows=0 loops=3420)
        Index Cond: (id = d.latest_purchase_id)
        Heap Fetches: 137
Planning Time: 0.681 ms
Execution Time: 72.134 ms

EXPLAIN ANALYZE SELECT DISTINCT ON (customer_id) id, customer_id FROM purchases ORDER BY customer_id, id DESC;
Unique  (cost=78791.68..81715.56 rows=157 width=8) (actual time=1092.279..1434.771 rows=407 loops=1)
  ->  Sort  (cost=78791.68..80253.62 rows=584777 width=8) (actual time=1092.277..1291.642 rows=585790 loops=1)
        Sort Key: customer_id, id DESC
        Sort Method: external merge  Disk: 8304kB
        ->  Seq Scan on purchases  (cost=0.00..14779.77 rows=584777 width=8) (actual time=0.736..610.967 rows=585790 loops=1)
Planning Time: 0.098 ms
Execution Time: 1436.267 ms

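Edit: I corrected the index to (customer_id, id), but it is still slow. There is more data now, so the timings are not exactly comparable, but it is still nowhere near the trigger approach.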

EXPLAIN ANALYZE SELECT DISTINCT ON (customer_id) id, customer_id FROM purchases ORDER BY customer_id, id;
Result  (cost=0.43..162525.52 rows=381 width=8) (actual time=0.513..1461.147 rows=823 loops=1)
  ->  Unique  (cost=0.43..162525.52 rows=381 width=8) (actual time=0.510..1460.719 rows=823 loops=1)
        ->  Index Only Scan using purchases_customer_id_id_idx on purchases  (cost=0.43..157859.86 rows=1866267 width=8) (actual time=0.508..981.186 rows=1866213 loops=1)
              Heap Fetches: 1363609
Planning Time: 0.096 ms
Execution Time: 1461.359 ms
(6 rows)
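
The plan above reads all 1,866,213 index entries to return 823 rows, and "Heap Fetches: 1363609" shows that most of those entries still required a heap visit to check row visibility. The following is a sketch of two things that might help, not a tested fix for this data set. It assumes the (customer_id, id) index from the plan exists. VACUUM refreshes the visibility map so an index-only scan can skip more heap fetches, and the recursive query emulates a loose index scan by jumping from one customer_id to the next instead of reading every entry.

```sql
-- Refresh the visibility map and planner statistics.
VACUUM (ANALYZE) purchases;

-- Emulated loose index scan: one index probe per customer.
WITH RECURSIVE latest AS (
   -- Anchor: the highest customer_id and its highest purchase id.
   (SELECT customer_id, id
      FROM purchases
     WHERE customer_id IS NOT NULL
     ORDER BY customer_id DESC, id DESC
     LIMIT 1)
   UNION ALL
   -- Step: the next lower customer_id and its highest purchase id.
   SELECT p.customer_id, p.id
     FROM latest l
    CROSS JOIN LATERAL (
       SELECT customer_id, id
         FROM purchases
        WHERE customer_id < l.customer_id
        ORDER BY customer_id DESC, id DESC
        LIMIT 1
    ) p
)
SELECT customer_id, id
  FROM latest;
```

With only a few hundred distinct customers among millions of purchases, this form does a few hundred short index probes rather than one full scan. Whether it actually wins here would need to be checked with EXPLAIN ANALYZE.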

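After looking into it further, I discovered SQL triggers and worked out how to update latest_purchase_id with one. This removes much of the hassle and performance penalty during inserts, but I am still not sure why the other queries perform so badly.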

CREATE OR REPLACE FUNCTION latest_purchase_func() RETURNS trigger AS
$BODY$
DECLARE
   CustomerID INT;
   PurchaseID INT;
BEGIN
   SELECT n.id, n.customer_id INTO PurchaseID, CustomerID
       FROM new_table n ORDER BY n.id DESC LIMIT 1;
   UPDATE customers SET "latest_purchase_id" = PurchaseID
       WHERE "customers"."id" = CustomerID;
   RETURN NULL;
END
$BODY$
LANGUAGE plpgsql;

CREATE trigger latest_purchase_ins
AFTER INSERT ON purchases
REFERENCING NEW TABLE AS new_table
FOR EACH STATEMENT
EXECUTE FUNCTION latest_purchase_func();
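
One thing to watch with the statement-level trigger above: the function picks a single row from new_table, so an INSERT that adds purchases for several customers at once updates only one of them. A possible variant that handles every customer in the statement is sketched below. The function name latest_purchase_multi_func is made up for illustration, and the sketch assumes, as the question does, that a higher purchase id means a more recent purchase.

```sql
CREATE OR REPLACE FUNCTION latest_purchase_multi_func() RETURNS trigger AS
$BODY$
BEGIN
   -- One UPDATE covering every customer that appears in the inserted rows.
   UPDATE customers c
      SET latest_purchase_id = n.max_id
     FROM (SELECT customer_id, max(id) AS max_id
             FROM new_table
            GROUP BY customer_id) n
    WHERE c.id = n.customer_id
      -- Never move the pointer backwards to an older purchase.
      AND (c.latest_purchase_id IS NULL OR c.latest_purchase_id < n.max_id);
   RETURN NULL;
END
$BODY$
LANGUAGE plpgsql;
```

To use it, the existing trigger definition would point at this function instead, keeping the same REFERENCING NEW TABLE AS new_table clause.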

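Expression indexes cannot use subqueries or stable/volatile functions. An index cannot contain values that depend on values in other rows, because then any single change in the table could require changes to an unbounded number of index entries.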

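Therefore, you have to actually store the desired property somewhere: in foo, or as a boolean in bar (which would still work with a partial index), or in a separate table.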
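
As an illustration of the boolean-plus-partial-index option, here is a minimal sketch that assumes the flag lives on the purchases table. The column name is_latest and the index name purchases_is_latest_idx are hypothetical, and the flag is not maintained automatically: something such as a trigger would have to clear it on the previous latest row and set it on the new one.

```sql
-- Hypothetical flag marking each customer's most recent purchase.
ALTER TABLE purchases
   ADD COLUMN is_latest boolean NOT NULL DEFAULT false;

-- Partial index: contains only the flagged rows, so it stays small.
CREATE INDEX purchases_is_latest_idx
    ON purchases (customer_id, id)
 WHERE is_latest;

-- Latest purchase per customer, answerable from the partial index.
SELECT customer_id, id
  FROM purchases
 WHERE is_latest;
```

The trade-off is the same as with latest_purchase_id: reads become cheap because the answer is stored, and writes pay for keeping it current.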

The best index for finding a customer's latest purchase has the purchase ID after the customer ID, i.e. (customer_id, id).
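
A minimal sketch of that index and a lookup it serves well. The index name is taken from the plan in the question's edit, and the customer ID 42 is an arbitrary example value.

```sql
-- customer_id first, then id, so each customer's purchases sit together
-- in the index, ordered by id.
CREATE INDEX IF NOT EXISTS purchases_customer_id_id_idx
    ON purchases (customer_id, id);

-- Latest purchase for one customer: a short backward scan of the index.
SELECT id
  FROM purchases
 WHERE customer_id = 42
 ORDER BY id DESC
 LIMIT 1;

-- Equivalent formulation using an aggregate.
SELECT max(id)
  FROM purchases
 WHERE customer_id = 42;
```

With the column order reversed, as in the question's original (id, customer_id) index, the entries for one customer are scattered across the whole index, so it cannot be used to jump straight to that customer's highest id.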

Source: https://dba.stackexchange.com/questions/243942