可以優化 Group By 和 Count on MySQL 嗎?
我在 MySQL 上有一個 1 億行的表。我需要計算某些範圍內的行數,並且我有適當的索引來過濾行。假設 SELECT 語句返回 20000 個結果,但我需要的只是計數。除了我可以使用的索引之外,還有其他技術嗎?是否有另一種選擇,例如 Cassandra,可以更快地處理分組和計數?
這是表結構:
mysql> desc activity; +------------------------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------------------+---------+------+-----+---------+-------+ | Source | text | YES | MUL | NULL | | | Customer | text | YES | MUL | NULL | | | Month | int(11) | YES | | NULL | | | Day | int(11) | YES | | NULL | | | Year | int(11) | YES | MUL | NULL | | | Time | text | YES | | NULL | | | User | text | YES | | NULL | | | TimeStamp | date | YES | | NULL | | | EmailEventType | text | NO | | NULL | | +------------------------+---------+------+-----+---------+-------+
指數:
mysql> show index from activity; +----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | activity | 1 | idx_activity_customer | 1 | Customer | A | 180035 | 64 | NULL | YES | BTREE | | | | activity | 1 | customer_date | 1 | Customer | A | 202831 | 64 | NULL | YES | BTREE | | | | activity | 1 | customer_date | 2 | Year | A | 303263 | NULL | NULL | YES | BTREE | | | | activity | 1 | customer_date | 3 | Month | A | 307744 | NULL | NULL | YES | BTREE | | | | activity | 1 | customer_date | 4 | Day | A | 1388270 | NULL | NULL | YES | BTREE | | | | activity | 1 | dates | 1 | Year | A | 1286 | NULL | NULL | YES | BTREE | | | | activity | 1 | dates | 2 | Month | A | 20604 | NULL | NULL | YES | BTREE | | | | activity | 1 | dates | 3 | Day | A | 146993 | NULL | NULL | YES | BTREE | | | | activity | 1 | timestamp | 1 | Year | A | 1554 | NULL | NULL | YES | BTREE | | | | activity | 1 | timestamp | 2 | TimeStamp | A | 119908 | NULL | NULL | YES | BTREE | | | | activity | 1 | timestamp_customer | 1 | Customer | A | 188169 | 64 | NULL | YES | BTREE | | | | activity | 1 | timestamp_customer | 2 | Year | A | 261389 | NULL | NULL | YES | BTREE | | | | activity | 1 | timestamp_customer | 3 | TimeStamp | A | 743716 | NULL | NULL | YES | BTREE | | | +----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ 14 rows in set (0.01 sec)
詢問:
mysql> SELECT Customer, User, Source, Count(CASE WHEN ( Year = 2018 AND Week(TimeStamp) = 1 ) THEN Source ELSE NULL END) AS '2018-W1', Count(CASE WHEN ( Year = 2018 AND Week(TimeStamp) = 2 ) THEN Source ELSE NULL END) AS '2018-W2', Count(CASE WHEN ( Year = 2018 AND Week(TimeStamp) = 3 ) THEN Source ELSE NULL END) AS '2018-W3', Count(CASE WHEN ( Year = 2018 AND Week(TimeStamp) = 4 ) THEN Source ELSE NULL END) AS '2018-W4' FROM activity WHERE customer LIKE 'jones%' AND ( ( Year = 2018 AND Week(TimeStamp) = 1 ) OR ( Year = 2018 AND Week(TimeStamp) = 2 ) OR ( Year = 2018 AND Week(TimeStamp) = 3 ) OR ( Year = 2018 AND Week(TimeStamp) = 4 ) ) AND Source IN ( 'online', 'other' ) GROUP BY Source, User ORDER BY Customer, Source; +-------------------------+-----------------------------------+----------+---------+---------+---------+---------+ | Customer | User | Source | 2018-W1 | 2018-W2 | 2018-W3 | 2018-W4 | +-------------------------+-----------------------------------+----------+---------+---------+---------+---------+ | Jones corporation | 00000000@sample.com | OTHER | 87 | 51 | 75 | 20 | | Jones corporation | 000000000000@sample.com | OTHER | 125 | 98 | 115 | 62 | | Jones corporation | 000000000000000@sample.com | OTHER | 30 | 0 | 0 | 0 | ... Truncated ... | Jones cpa | 111111111111@sample.com | ONLINE | 0 | 0 | 0 | 18 | | Jones cpa | 1111111111@sample.com | ONLINE | 0 | 0 | 0 | 225 | +-------------------------+-----------------------------------+----------+---------+---------+---------+---------+ 241 rows in set (9 min 10.93 sec)
所以基本上優化器是選擇和索引,但是只計算241行需要很長時間。每個使用者對每種類型都有少量的行數。GROUP BY 欄位應該如何與索引相關聯?
基於這些大行數,我假設估計是足夠的。
您可以查看 filter 上的列或 rows 列,
EXPLAIN SELECT count(*) FROM tbl WHERE ref=value
以根據查詢計劃器採樣提供對總數的估計。參考:解釋輸出
GROUP BY Customer, Source, User
要獲得更正確的 SQL 語法,您需要在不使用聚合 ( )的情況下選擇客戶時將其添加到您的分組中。MySQL 可以讓您擺脫這種情況,但大多數 SQL 引擎不會,因為其含義可能不明確。以正確的順序覆蓋所有列的索引,您以可搜尋的方式進行過濾並對其進行分組/排序,並且還覆蓋了您在不可搜尋的管理器中輸出或過濾的那些,將允許它僅使用該索引. 所以在這種情況下,一個索引在
Customer, Source, User, Year, Timestamp
.
AND Source IN ( 'online', 'other' )
通過在“線上”和“其他”之間進行部分掃描並降低索引在源、使用者訂單中的有效性增益,甚至可能導致效率降低(因此它只是充分利用客戶訂單並僅使用其餘的以避免在主表結構中查找)。如果是這種情況,那麼您可能會發現兩個查詢,每個值一個,然後組合起來更有效。在沒有看到查詢計劃器做什麼的情況下,我們無法確定地告訴您,包括EXPLAIN
輸出以提供更多詳細資訊,如 danback 的回答所建議的那樣。從表面上看,在我看來,這似乎不是對複雜性或數據大小的查詢,如果適當索引,使用普通關係數據庫結構將不夠有效。您可能會發現它在使用某種形式的列儲存索引(SQL Server 中的非聚集壓縮列儲存索引)時表現更好,但在這里切換技術可能會過度使用。