Mysql

改進查詢(使用 ctes 在範圍上進行令人討厭的自連接)

  • October 7, 2021
-- The CTE1 is just to create columns that dictate the bound of what is considered the same entry
-- Also I do a dense rank by ACT_TIME, and a PARITION BY on ID1, ID2
-- so all ID1/ID2 combos are ranked by when they ran
WITH 
cte1 AS (SELECT *, (X-2) AS X_START, (X+2) AS X_END, (Y-2) AS Y_START, (Y+2) AS Y_END, (Z*1.2) AS Z_MAX,
   DENSE_RANK() OVER (PARTITION BY ID1, ID2 ORDER BY ACT_TIME) AS DENSE_RANK
   FROM data
   ORDER BY data.ACT_TIME)

,cte2 AS (
-- Create new set of column as comparisons
SELECT 
ID AS ID_COMP,
ID1 AS ID1_COMP,
ID2 AS ID2_COMP,
X_START AS X_START_COMP,
X_END AS X_END_COMP,
Y_START AS Y_START_COMP,
Y_END AS Y_END_COMP,
Z AS Z_MAX_COMP,
DENSE_RANK AS DENSE_RANK_COMP
FROM cte1)
, cte3 AS (
-- join cte1 on cte2 only when X and Y value from cte1 is between the limits of cte2 AND 
-- Z max value from cte 2 is larger than Z value from cte1 AND ID1/ID2 match
-- The result will have an ID of a row that should be removed since their x and y was in between the limits 
-- Then remove any rows where rank from cte2 is higher than cte1
--  Remove any rows that were joined onto it self 
SELECT 
cte1.* ,
cte2.*
FROM cte1
JOIN cte2 ON (( cte2.X_END_COMP >= cte1.X  AND  cte1.X >= cte2.X_START_COMP) AND (cte2.Y_END_COMP >= cte1.Y AND cte1.Y>= cte2.Y_START_COMP) 
                   AND (cte1.Z < cte2.Z_MAX_COMP) AND (cte2.ID2_COMP = cte1.ID2) AND (cte2.ID1_COMP = cte1.ID1))
WHERE cte1.ID <> cte2.ID_COMP AND cte2.DENSE_RANK_COMP <= cte1.DENSE_RANK)
-- Any IDs that shows up in cte3 remove from the final result
SELECT data.*
FROM data
WHERE ID NOT IN (SELECT DISTINCT ID FROM cte3)
ORDER BY data.ACT_TIME

這是我的創建表

CREATE TABLE `data` (
   `ID` INT(11) NOT NULL AUTO_INCREMENT,
   `ID1` VARCHAR(10) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
   `ID2` INT(11) NULL DEFAULT NULL,
   `ACT_TIME` TIMESTAMP NULL DEFAULT NULL,
   `X` FLOAT(12) NULL DEFAULT NULL,
   `Y` FLOAT(12) NULL DEFAULT NULL,
   `Z` FLOAT(12) NULL DEFAULT NULL,
   PRIMARY KEY (`ID`) USING BTREE,
   INDEX `ID1` (`ID1`) USING BTREE,
   INDEX `ACT_TIME` (`ACT_TIME`) USING BTREE
);

這是 EXPLAIN {query} 結果 在此處輸入圖像描述

以下是查詢的工作方式。

我想刪除稍後出現在 X 和 Y 介於 +-2 和小於 1.2*z 的 Z 之間的任何行(具有相同的 ID1 和 ID2)

這是一個範例輸入

在此處輸入圖像描述

範例輸出

在此處輸入圖像描述

這個查詢大約需要 5 分鐘,有 250 萬行。

我在 MariDB 10.5.5

任何和所有的幫助表示讚賞!

編輯 Rick James 這是你的解釋 {query} 這是解釋結果 在此處輸入圖像描述

這是一個優化的、修復了錯誤的查詢。小提琴展示

WITH cte1 AS (
SELECT *, (X-2) AS X_START, (X+2) AS X_END, 
   (Y-2) AS Y_START, (Y+2) AS Y_END, (Z*1.2) AS Z_MAX,
   DENSE_RANK() OVER (PARTITION BY ID1, ID2 ORDER BY ACT_TIME) AS `DENSE_RANK`
   FROM data
), 
cte3 AS (
SELECT 
c.ID
FROM cte1
JOIN cte1 c ON (cte1.ID <> c.ID and cte1.X_END >= c.X and c.X >= cte1.X_START
and cte1.Y_END >= c.Y and c.Y >= cte1.Y_START and c.Z < cte1.Z*1.2 
and c.`DENSE_RANK` > cte1.`DENSE_RANK`)
)
SELECT data.*
FROM data
WHERE NOT EXISTS (SELECT 1 FROM cte3 where cte3.ID=data.ID)
ORDER BY data.ACT_TIME

我解決的問題:

  • 更改WHERE NOT INWHERE NOT EXISTS,IN列表很大是一個常見的性能問題。
  • 優化器注意到這cte2是完全沒有必要的,因為它是簡單的列別名,所以我消除了它並進行cte3了嚴格的自連接。c是候選行。
  • Z_MAX界限缺少 1.2 係數。由於缺少此內容,您的查詢從範例輸出中返回了 2 額外行。
  • 刪除了ORDER BYincte1因為它是不必要的。

這裡的目標是擺脫盡可能多的種類並減少表掃描的數量。如果您的伺服器的 I/O 性能特別差,則可能值得探索記憶體調整選項以防止排序溢出到磁碟。

PS:Rick James 注意到它ACT_TIME可以用作決勝局DENSE_RANK(),如果這適用於您的數據,那麼它將擺脫另一種類型。

我認為您不需要 CTE。看看這樣的事情是否可行:

SELECT d1.*, d2.*
   FROM data d1
   JOIN data d2  ON d2.x BETWEEN d1.x - 2 AND d1.x + 2
                AND d2.y BETWEEN d1.y - 2 AND d1.y + 2
                AND d2.z ... (I don't understand the criteria here)
                AND d2.act_time < d1.act_time

供優化器選擇的索引:

INDEX(x,y,z,act_time)
INDEX(y,z,x,act_time)
INDEX(z,x,y,act_time)

同時,(12)FLOAT(12).

我看不到DENSE_RANK需要乳清的解釋。

引用自:https://dba.stackexchange.com/questions/300748