How can I speed up queries on a table with more than 20 million rows?
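I have a query that fetches Internet traffic statistics for certain IP addresses. There are individual IP addresses, called `hosts`, and IP blocks, called `assignments`. The data is stored at 5-minute intervals. The query result is grouped on the time column, and the total SUM of bytes in and out for each 5-minute interval is used to plot a graph.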
The table is called `traffic` and contains (at the end of the month) about 21 million records.

SHOW CREATE TABLE traffic:

    CREATE TABLE `traffic` (
      `type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
      `type_id` int(11) unsigned NOT NULL,
      `time` int(32) unsigned NOT NULL,
      `bytesin` bigint(20) unsigned NOT NULL default '0',
      `bytesout` bigint(20) unsigned NOT NULL default '0',
      KEY `basic_select` (`type_id`,`time`,`type`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
The query:

    SELECT traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout)
    FROM traffic
    WHERE (
            ( traffic.type = 'v4_assignment' AND type_id IN (231, between 20 to 100 ids,265))
         OR ( traffic.type = 'v4_host' AND type_id IN (131, ... a lot of ids... ,1506))
          )
      AND traffic.time >= 1343772000
      AND traffic.time < 1346450399
    GROUP BY traffic.time
    ORDER BY traffic.time;
Here is the `EXPLAIN` output for the query above:

    +----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
    | id | select_type | table   | type  | possible_keys | key          | key_len | ref  | rows   | Extra                                        |
    +----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+
    |  1 | SIMPLE      | traffic | range | basic_select  | basic_select | 8       | NULL | 891319 | Using where; Using temporary; Using filesort |
    +----+-------------+---------+-------+---------------+--------------+---------+------+--------+----------------------------------------------+

show indexes from traffic;

    +---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
    | Table   | Non_unique | Key_name     | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
    +---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
    | traffic |          1 | basic_select |            1 | type_id     | A         |       13835 |     NULL | NULL   |      | BTREE      |         |
    | traffic |          1 | basic_select |            2 | time        | A         |    18470357 |     NULL | NULL   |      | BTREE      |         |
    | traffic |          1 | basic_select |            3 | type        | A         |    18470357 |     NULL | NULL   |      | BTREE      |         |
    +---------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
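This query takes anywhere from 30 seconds to 30 minutes to complete. I was hoping to improve things with a better index or a different query, but I can't figure it out.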
Update:
Following a helpful commenter's suggestion, I created a primary key and added the index `traffic_pk (time, type, type_id, id)`. Unfortunately, the cardinality of this new index turned out to be equal to or lower than that of my original index (`basic_select`), and MySQL still used my original key.

Update 2:
I dropped the original index `basic_select`, and now `EXPLAIN` shows a higher `rows` value but fewer steps in the `Extra` column. The query execution time has also dropped to under a minute! (Still a bit too slow, but a major improvement.)

    mysql> SHOW CREATE TABLE traffic_test \G;
    *************************** 1. row ***************************
           Table: traffic_test
    Create Table: CREATE TABLE `traffic_test` (
      `traffic_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `type` enum('v4_assignment','v4_host','v6_subnet','v6_assignment','v6_host') NOT NULL,
      `type_id` int(11) unsigned NOT NULL,
      `time` int(32) unsigned NOT NULL,
      `bytesin` bigint(20) unsigned NOT NULL DEFAULT '0',
      `bytesout` bigint(20) unsigned NOT NULL DEFAULT '0',
      PRIMARY KEY (`time`,`type`,`type_id`,`traffic_id`),
      KEY `traffic_id_IDX` (`traffic_id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=24545159 DEFAULT CHARSET=latin1
Indexes on the table:

    mysql> SHOW INDEX FROM traffic;
    +--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
    | Table        | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
    +--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
    | traffic_test |          0 | PRIMARY        |            1 | time        | A         |          18 |     NULL | NULL   |      | BTREE      |         |
    | traffic_test |          0 | PRIMARY        |            2 | type        | A         |       38412 |     NULL | NULL   |      | BTREE      |         |
    | traffic_test |          0 | PRIMARY        |            3 | type_id     | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
    | traffic_test |          0 | PRIMARY        |            4 | traffic_id  | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
    | traffic_test |          1 | traffic_id_IDX |            1 | traffic_id  | A         |    24545609 |     NULL | NULL   |      | BTREE      |         |
    +--------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
I also simplified the query by not using `OR`:

    SELECT SQL_NO_CACHE traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout)
    FROM traffic
    WHERE traffic.type LIKE 'v4_host'
      AND type_id IN (131,1974,1976,1514,1516,2767,2730,2731,2732,2733,2734,2769,2994,2709,
                      1,4613,4614,4615,4616,326,1520,2652,1518,1521,1522,1523,1524,1525,2203,
                      1515,1513,1467,1508,1973,1510,1975,1511,1475,1476,1468,1469,1470,1471,
                      1472,1473,1500,1507,1478,1480,1481,1482,1483,1484,1485,1479,1486,1487,
                      1488,1489,1490,1491,1495,1499,1494,2269,1474,1519,2204,2976,1922,1493,
                      1492,1497,1496,1498,1501,1502,1503,1526,1509,1506)
      AND traffic.time >= 1342181721
      AND traffic.time < 1343391321
    GROUP BY traffic.time ASC;
Old execution time for this query:
3980 rows in set (6 min 15.27 sec)
New execution time:
3980 rows in set (24.80 sec)
EXPLAIN output:

    +----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
    | id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows     | Extra       |
    +----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
    |  1 | SIMPLE      | traffic | range | PRIMARY       | PRIMARY | 4       | NULL | 12272804 | Using where |
    +----+-------------+---------+-------+---------------+---------+---------+------+----------+-------------+
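The `rows` value is still high. I suppose I could improve this by switching the order of `type` and `type_id` in the index, since there are only 4 possible types and many more `type_id` values. Is this a correct assumption?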
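1. Table partitioning
Because of the `AND traffic.time >= 1343772000 AND traffic.time < 1346450399` clause, I assume you either never delete data from this table or the table currently stores several months of data. The values in the `time` column appear to be Unix timestamps (1346450399 = Fri, 31 Aug 2012 21:59:59 GMT). Partition the table on the `time` column. This will speed up data retrieval, because the database will scan only the relevant partitions, which is much faster than scanning the whole table.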
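- A great tutorial on partitioning can be found here: http://www.arachna.com/roller/spidaman/entry/scaling_rails_with_mysql_table
- You will need to calculate the timestamp ranges for this, but that should not be hard to do.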
- For example: (1346450399 - 1343772000) / 60 / 60 / 24 ≈ 31 days. September has 30 days, so the upper bound of the partition holding September's data would be: 1346450399 + (30 * 24 * 60 * 60)
- A Unix timestamp to date converter can be found here: http://www.onlineconversion.com/unix_time.htm
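The partitioning idea above could look roughly like the following sketch. The partition names are hypothetical, and the boundaries are simply derived from the timestamps in the question (1346450400 is the query's upper bound plus one second, and 1349042400 adds 30 days of 86400 seconds). It assumes the original `traffic` table, which has no primary or unique key; MySQL requires every unique key, including the primary key, to contain the partitioning column. Repartitioning rebuilds the table, so on roughly 21 million rows it is best tried on a copy first.

```sql
-- Sketch only: monthly RANGE partitions on the Unix timestamp column.
-- Partition names are made up for illustration.
ALTER TABLE traffic
PARTITION BY RANGE (`time`) (
    PARTITION p_before_2012_08 VALUES LESS THAN (1343772000),  -- older data
    PARTITION p_2012_08        VALUES LESS THAN (1346450400),  -- the month queried in the question
    PARTITION p_2012_09        VALUES LESS THAN (1349042400),  -- 1346450400 + 30 * 86400
    PARTITION p_future         VALUES LESS THAN MAXVALUE       -- catch-all for newer rows
);

-- Check which partitions a query touches. EXPLAIN PARTITIONS exists in
-- MySQL 5.1 through 5.7; newer versions show a partitions column in plain EXPLAIN.
EXPLAIN PARTITIONS
SELECT `time`, SUM(bytesin), SUM(bytesout)
FROM traffic
WHERE `time` >= 1343772000 AND `time` < 1346450400
GROUP BY `time`;

-- Dropping a whole month later is a metadata operation rather than a row-by-row DELETE.
ALTER TABLE traffic DROP PARTITION p_before_2012_08;
```

If the pruning works as intended, the `partitions` column of the EXPLAIN output should list only `p_2012_08` for this date range.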
2. Rewrite the query
Because of the `OR` in the WHERE clause, the optimizer may choose not to use the defined index. Try splitting the query into 2 SELECTs and combining them with a UNION.

    SELECT traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout)
    FROM traffic
    WHERE traffic.type LIKE 'v4_assignment'
      AND type_id IN (1,2,3,4)
      AND traffic.time >= 1343772000
      AND traffic.time <= 1346450399
    GROUP BY traffic.time
    UNION
    SELECT traffic.time, SUM(traffic.bytesin), SUM(traffic.bytesout)
    FROM traffic
    WHERE traffic.type LIKE 'v4_host'
      AND type_id IN (5,6,7,8)
      AND traffic.time >= 1343772000
      AND traffic.time <= 1346450399
    GROUP BY traffic.time
    ORDER BY traffic.time
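One caveat with the UNION form: it returns up to two rows per `time` value (one per branch), whereas the original `OR` query returns a single summed row, and plain `UNION` also removes duplicate rows. A possible variant, shown here only as a sketch, keeps both branches with `UNION ALL` and adds them up again in an outer query. The ID lists are the same placeholders used above, the derived-table alias `t` is arbitrary, and the upper bound uses `<` as in the question.

```sql
-- Sketch: aggregate each branch separately, then re-aggregate per time value
-- so the result has one row per 5-minute interval, like the original query.
SELECT t.`time`,
       SUM(t.bytesin)  AS bytesin,
       SUM(t.bytesout) AS bytesout
FROM (
    SELECT `time`, SUM(bytesin) AS bytesin, SUM(bytesout) AS bytesout
    FROM traffic
    WHERE `type` = 'v4_assignment'
      AND type_id IN (1,2,3,4)          -- placeholder IDs
      AND `time` >= 1343772000
      AND `time` <  1346450399
    GROUP BY `time`

    UNION ALL                           -- keep both branches, no de-duplication

    SELECT `time`, SUM(bytesin) AS bytesin, SUM(bytesout) AS bytesout
    FROM traffic
    WHERE `type` = 'v4_host'
      AND type_id IN (5,6,7,8)          -- placeholder IDs
      AND `time` >= 1343772000
      AND `time` <  1346450399
    GROUP BY `time`
) AS t
GROUP BY t.`time`
ORDER BY t.`time`;
```

Whether this beats the single `OR` query depends on the available indexes, so it is worth comparing the EXPLAIN output of both.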
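3. New index based on data cardinality
From your EXPLAIN output, I don't see an index being used. That may be because the optimizer decided that a full table scan would be easier (cheaper) than following the index. Also, in your current index the first column has a lower cardinality than the next 2 columns. The first column in any index should be the one with the best (highest) cardinality.
Create a new index as:

    mysql> CREATE INDEX MTIhai_traffic_idx1 ON traffic(time, type, type_id);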
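Since the question also asks whether a different column order would help, one way to find out is to create the candidate index on the test copy and compare plans directly. The sketch below assumes the `traffic_test` table from the question; the index name is hypothetical and the ID list is shortened for illustration. The candidate order puts the equality and `IN` columns before the range column, a common guideline for B-tree indexes because columns placed after a range condition cannot narrow the scan further. It is a hypothesis to measure, not a guaranteed win.

```sql
-- Hypothetical candidate index: equality/IN columns first, range column last.
CREATE INDEX idx_type_typeid_time ON traffic_test (`type`, `type_id`, `time`);

-- Plan when forced onto the candidate index.
EXPLAIN
SELECT `time`, SUM(bytesin), SUM(bytesout)
FROM traffic_test FORCE INDEX (idx_type_typeid_time)
WHERE `type` = 'v4_host'
  AND type_id IN (131, 1974, 1976)      -- shortened ID list
  AND `time` >= 1342181721
  AND `time` <  1343391321
GROUP BY `time`;

-- Plan when forced onto the existing primary key (time, type, type_id, traffic_id).
EXPLAIN
SELECT `time`, SUM(bytesin), SUM(bytesout)
FROM traffic_test FORCE INDEX (PRIMARY)
WHERE `type` = 'v4_host'
  AND type_id IN (131, 1974, 1976)      -- shortened ID list
  AND `time` >= 1342181721
  AND `time` <  1343391321
GROUP BY `time`;
```

Comparing the `rows`, `key_len`, and `Extra` columns of the two plans, and then timing the real queries with `SQL_NO_CACHE`, shows which order suits this workload.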