Mysql
改進查詢(使用 ctes 在範圍上進行令人討厭的自連接)
-- The CTE1 is just to create columns that dictate the bound of what is considered the same entry -- Also I do a dense rank by ACT_TIME, and a PARITION BY on ID1, ID2 -- so all ID1/ID2 combos are ranked by when they ran WITH cte1 AS (SELECT *, (X-2) AS X_START, (X+2) AS X_END, (Y-2) AS Y_START, (Y+2) AS Y_END, (Z*1.2) AS Z_MAX, DENSE_RANK() OVER (PARTITION BY ID1, ID2 ORDER BY ACT_TIME) AS DENSE_RANK FROM data ORDER BY data.ACT_TIME) ,cte2 AS ( -- Create new set of column as comparisons SELECT ID AS ID_COMP, ID1 AS ID1_COMP, ID2 AS ID2_COMP, X_START AS X_START_COMP, X_END AS X_END_COMP, Y_START AS Y_START_COMP, Y_END AS Y_END_COMP, Z AS Z_MAX_COMP, DENSE_RANK AS DENSE_RANK_COMP FROM cte1) , cte3 AS ( -- join cte1 on cte2 only when X and Y value from cte1 is between the limits of cte2 AND -- Z max value from cte 2 is larger than Z value from cte1 AND ID1/ID2 match -- The result will have an ID of a row that should be removed since their x and y was in between the limits -- Then remove any rows where rank from cte2 is higher than cte1 -- Remove any rows that were joined onto it self SELECT cte1.* , cte2.* FROM cte1 JOIN cte2 ON (( cte2.X_END_COMP >= cte1.X AND cte1.X >= cte2.X_START_COMP) AND (cte2.Y_END_COMP >= cte1.Y AND cte1.Y>= cte2.Y_START_COMP) AND (cte1.Z < cte2.Z_MAX_COMP) AND (cte2.ID2_COMP = cte1.ID2) AND (cte2.ID1_COMP = cte1.ID1)) WHERE cte1.ID <> cte2.ID_COMP AND cte2.DENSE_RANK_COMP <= cte1.DENSE_RANK) -- Any IDs that shows up in cte3 remove from the final result SELECT data.* FROM data WHERE ID NOT IN (SELECT DISTINCT ID FROM cte3) ORDER BY data.ACT_TIME
這是我的創建表
CREATE TABLE `data` ( `ID` INT(11) NOT NULL AUTO_INCREMENT, `ID1` VARCHAR(10) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci', `ID2` INT(11) NULL DEFAULT NULL, `ACT_TIME` TIMESTAMP NULL DEFAULT NULL, `X` FLOAT(12) NULL DEFAULT NULL, `Y` FLOAT(12) NULL DEFAULT NULL, `Z` FLOAT(12) NULL DEFAULT NULL, PRIMARY KEY (`ID`) USING BTREE, INDEX `ID1` (`ID1`) USING BTREE, INDEX `ACT_TIME` (`ACT_TIME`) USING BTREE );
以下是查詢的工作方式。
我想刪除稍後出現在 X 和 Y 介於 +-2 和小於 1.2*z 的 Z 之間的任何行(具有相同的 ID1 和 ID2)
這是一個範例輸入
範例輸出
這個查詢大約需要 5 分鐘,有 250 萬行。
我在 MariDB 10.5.5
任何和所有的幫助表示讚賞!
這是一個優化的、修復了錯誤的查詢。小提琴展示。
WITH cte1 AS ( SELECT *, (X-2) AS X_START, (X+2) AS X_END, (Y-2) AS Y_START, (Y+2) AS Y_END, (Z*1.2) AS Z_MAX, DENSE_RANK() OVER (PARTITION BY ID1, ID2 ORDER BY ACT_TIME) AS `DENSE_RANK` FROM data ), cte3 AS ( SELECT c.ID FROM cte1 JOIN cte1 c ON (cte1.ID <> c.ID and cte1.X_END >= c.X and c.X >= cte1.X_START and cte1.Y_END >= c.Y and c.Y >= cte1.Y_START and c.Z < cte1.Z*1.2 and c.`DENSE_RANK` > cte1.`DENSE_RANK`) ) SELECT data.* FROM data WHERE NOT EXISTS (SELECT 1 FROM cte3 where cte3.ID=data.ID) ORDER BY data.ACT_TIME
我解決的問題:
- 更改
WHERE NOT IN
為WHERE NOT EXISTS
,IN
列表很大是一個常見的性能問題。- 優化器注意到這
cte2
是完全沒有必要的,因為它是簡單的列別名,所以我消除了它並進行cte3
了嚴格的自連接。c
是候選行。Z_MAX
界限缺少 1.2 係數。由於缺少此內容,您的查詢從範例輸出中返回了 2 額外行。- 刪除了
ORDER BY
incte1
因為它是不必要的。這裡的目標是擺脫盡可能多的種類並減少表掃描的數量。如果您的伺服器的 I/O 性能特別差,則可能值得探索記憶體調整選項以防止排序溢出到磁碟。
PS:Rick James 注意到它
ACT_TIME
可以用作決勝局DENSE_RANK()
,如果這適用於您的數據,那麼它將擺脫另一種類型。
我認為您不需要 CTE。看看這樣的事情是否可行:
SELECT d1.*, d2.* FROM data d1 JOIN data d2 ON d2.x BETWEEN d1.x - 2 AND d1.x + 2 AND d2.y BETWEEN d1.y - 2 AND d1.y + 2 AND d2.z ... (I don't understand the criteria here) AND d2.act_time < d1.act_time
供優化器選擇的索引:
INDEX(x,y,z,act_time) INDEX(y,z,x,act_time) INDEX(z,x,y,act_time)
同時,
(12)
從FLOAT(12)
.我看不到
DENSE_RANK
需要乳清的解釋。