JOIN 子句中 groupwise MAX 的性能問題
問題
我的應用程序中有一些資產不時以非同步方式更新。
我將在這裡使用的範例是
Vehicles
. 有兩個表:
- **
Vehicles
:**保存有關車輛本身的資訊- **
VehicleUpdates
:**保存有關該車輛發生的所有更新的資訊。表結構的相關部分是:
CREATE TABLE `Vehicles` ( `id` varchar(50) NOT NULL, `organizationId` varchar(50) NOT NULL, `plate` char(7) NOT NULL, `vehicleInfo` json DEFAULT NULL, `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE, KEY `Vehicles_createdAt_idx` (`createdAt`), ); CREATE TABLE `VehicleUpdates` ( `id` varchar(50) NOT NULL, `organizationId` varchar(50) NOT NULL, `vehiclePlate` char(7) NOT NULL, `status` varchar(15) NOT NULL, `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE );
現在我有一個新要求,我必須在車輛資訊本身的同時返回最新的更新資訊。
Groupwise MAX 解決方案
經過一番探勘,我找到了這篇部落格文章。然後我嘗試使用建議的“不相關子查詢”方法,因為它被認為是最好的方法:
不相關子查詢
SELECT vu1.* FROM VehicleUpdates AS vu1 JOIN ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt FROM VehicleUpdates GROUP BY organizationId, vehiclePlate ) AS vu2 USING (organizationId, vehiclePlate, createdAt);
275 ms
此查詢在我的生產數據庫中的平均執行時間為。我認為這太慢了,所以我決定試一試“LEFT JOIN”方法:
啞彈:左加入
SELECT vu1.* FROM VehicleUpdates AS vu1 LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate AND vu2.createdAt > vu1.createdAt WHERE vu2.id IS NULL;
這個表現更好,平均執行時間為
40 ms
. 對我來說已經足夠好了。然後我需要將此查詢作為對
Vehicles
錶的查詢的一部分執行。目前結果
以下查詢將滿足我的要求:
SELECT v.*, vu1.* FROM Vehicles AS v LEFT JOIN VehicleUpdates AS vu1 ON v.plate = vu1.vehiclePlate AND v.organizationId = vu1.organizationId LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId AND vu1.vehiclePlate = vu2.vehiclePlate AND vu2.createdAt > vu1.createdAt WHERE vu2.id IS NULL;
問題是它需要
20 s
(!)才能執行。大問題!但我從不對生產進行全表掃描。查詢總是被限制為單個
organizationId
並且它是分頁的,所以我每頁最多返回 100 行,所以我執行了以下查詢:SELECT v.*, vu1.* FROM Vehicles AS v LEFT JOIN VehicleUpdates AS vu1 ON v.plate = vu1.vehiclePlate AND v.organizationId = vu1.organizationId LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId AND vu1.vehiclePlate = vu2.vehiclePlate AND vu2.createdAt > vu1.createdAt WHERE vu2.id IS NULL and v.organizationId = '<some organization ID>' LIMIT 100;
現在它需要從
750 ms
到11 s
執行,具體取決於與多少車輛相關聯。還不夠好。執行
explain
上面的查詢讓我:"select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra" SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 | SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop) SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index
令我印象深刻的是該
vu1
表正在執行全表掃描,即使最左邊的表Vehicles
正在使用索引列進行過濾,該索引列organizationId
也在VehicleUpdates
.所以我決定再試一次“不相關的子查詢”並執行:
SELECT v.*, vu.* FROM Vehicles AS v LEFT JOIN ( SELECT vu1.* FROM VehicleUpdates AS vu1 JOIN ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt FROM VehicleUpdates GROUP BY organizationId, vehiclePlate ) AS vu2 USING (organizationId, vehiclePlate, createdAt) ) AS vu ON vu.organizationId = v.organizationId AND vu.vehiclePlate = v.plate WHERE v.organizationId = '<SOME ORGANIZATION ID>' LIMIT 100;
這次執行時間從
1.4 s
到不等,具體取決於給定13 s
的表中有多少條目。我的申請不能接受。Vehicles``organizationId
跑步
explain
讓我:| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra" | PRIMARY | v | ALL | | | | | 14456 | 100 | | PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where | PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where | DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by
目前結果 - 更新
我注意到添加特定
organizationId
子句可以提高性能。左連接
跑步:
SELECT v.*, vu1.* FROM Vehicles AS v LEFT JOIN VehicleUpdates AS vu1 ON v.plate = vu1.vehiclePlate AND v.organizationId = vu1.organizationId AND vu1.organizationId = '<SOME ORGANIZATION ID>' -- <-------- LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId AND vu1.vehiclePlate = vu2.vehiclePlate AND vu2.createdAt > vu1.createdAt WHERE vu2.id IS NULL and v.organizationId = '<SOME ORGANIZATION ID>' -- <----------- LIMIT 100;
65 ms
我得到的執行時間從(可接受)到(不可接受)不等2.5 s
。不相關查詢
organizationId = '<SOME ORGANIZATION ID>'
在“主”查詢和連接外部子查詢中放置一個子句:SELECT v.*, vu.* FROM Vehicles AS v LEFT JOIN ( SELECT vu1.* FROM VehicleUpdates AS vu1 JOIN ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt FROM VehicleUpdates GROUP BY organizationId, vehiclePlate ) AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate and vu1.createdAt = vu2.createdAt WHERE organizationId = '<SOME ORGANIZATION ID>' -- <-------- ) AS vu ON vu.organizationId = v.organizationId AND vu.vehiclePlate = v.plate where v.organizationId = '<SOME ORGANIZATION ID>' -- <--------- LIMIT 100;
450 ms
我得到的執行時間從(不可接受)到(不可接受)不等900 ms
。
organizationId = '<SOME ORGANIZATION ID>'
在“主”查詢和連接內部子查詢中放置一個子句:SELECT v.*, vu.* FROM Vehicles AS v LEFT JOIN ( SELECT vu1.* FROM VehicleUpdates AS vu1 JOIN ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt FROM VehicleUpdates WHERE organizationId = '<SOME ORGANIZATION ID>' -- <-------- GROUP BY organizationId, vehiclePlate ) AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate and vu1.createdAt = vu2.createdAt ) AS vu ON vu.organizationId = v.organizationId AND vu.vehiclePlate = v.plate where v.organizationId = '<SOME ORGANIZATION ID>' -- <--------- LIMIT 100;
225 ms
我得到的執行時間從(可接受)到(不可接受)不等500 ms
。有沒有更好的方法來處理這樣的查詢?
數據庫資訊
- MySQL
- 版本:5.7.23-log (Amazon RDS)
- 引擎:InnoDB
我覺得自己好傻!才發現問題。
出於某種原因,
Vehicles
並且在生產中VehicleUpdates
具有不同的字元集(utf8mb4
和utf8
,分別)。這就是
EXPLAIN
“不相關子查詢”方法的結果在其步驟之一中進行全表掃描的原因:| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra" | PRIMARY | v | ALL | | | | | 14456 | 100 | | PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where | PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where | DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by
轉換
VehicleUpdates
為後utf8mb4
,EXPLAIN
結果為:| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra" | PRIMARY | v | ref | Vehicles_orgId_status_idx | Vehicles_orgId_status_idx | "202" | const | 188 | 100 | | PRIMARY | <derived2> | ref | <auto_key1> | <auto_key1> | "230" | v.plate,v.organizationId | 10 | 100 | | PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "234" | v.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where | DERIVED | VehicleUpdates | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "202" | const | 24090 | 100 | Using where; Using index
同樣,“LEFT JOIN”方法執行計劃從:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra" | SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 | | SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop) | SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index
到:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra" | SIMPLE | v | ref | Vehicles_orgId_status_idx | Vehicles_orgId_status_idx | "202" | const | 188 | 100 | | SIMPLE | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230" | v.organizationId,v.plate | 9 | 100 | | SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230" | vu1.organizationId,vu1.vehiclePlate | 9 | 10 | Using where; Not exists; Using index
因此,現在不同查詢的性能是:
左連接
SELECT v.*, vu1.* FROM Vehicles AS v LEFT JOIN VehicleUpdates AS vu1 ON v.plate = vu1.vehiclePlate AND v.organizationId = vu1.organizationId LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId AND vu1.vehiclePlate = vu2.vehiclePlate AND vu2.createdAt > vu1.createdAt where v.organizationId = '<SOME ORGANIZATION ID>' AND vu2.id IS NULL LIMIT 100;
總是在下面跑
50 ms
。內部查詢中沒有 WHERE 的不相關子查詢:
SELECT v.*, vu1.* FROM Vehicles AS v LEFT JOIN ( VehicleUpdates AS vu1 JOIN ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt FROM VehicleUpdates GROUP BY organizationId, vehiclePlate ) AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate and vu1.createdAt = vu2.createdAt ) ON vu1.organizationId = v.organizationId AND vu1.vehiclePlate = v.plate where v.organizationId = '<SOME ORGANIZATION ID>' LIMIT 100;
平均在
300 ms
.內部查詢中帶有 WHERE 的不相關子查詢:
SELECT v.*, vu1.* FROM Vehicles AS v LEFT JOIN ( VehicleUpdates AS vu1 JOIN ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt FROM VehicleUpdates WHERE organizationId = '<SOME ORGANIZATION ID>' -- <-- HERE GROUP BY organizationId, vehiclePlate ) AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate and vu1.createdAt = vu2.createdAt ) ON vu1.organizationId = v.organizationId AND vu1.vehiclePlate = v.plate where v.organizationId = '<SOME ORGANIZATION ID>' LIMIT 100;
也總是在下面跑
50 ms
。我決定堅持使用“LEFT JOIN”方法,因為它允許我創建一個視圖來表示內部查詢,這樣我就可以簡化返輸入輛的查詢。
我不能用“不相關子查詢”來做到這一點,因為它需要
WHERE organizationId = '<ORGANIZATION ID>'
內部查詢中的子句,所以視圖不會那麼有效。