Mysql

如何使用連接和分組優化非常慢的查詢?

  • January 30, 2015

我有這個 GTFS 數據庫:

數據庫架構

所有***_id列都被索引,並且表之間有外鍵(例如 FK on trips.route_idto routes.route_id)。

所以基本上停止和路線之間的連結是routestripsstop_timesstops

我創建並填寫了一張表格,以提及物理上彼此靠近的所有停靠點,填充stop_connections. 很簡單。

現在我想要的是從路線 A 獲得可以通過該路線的站點之一到達的所有其他路線。

所以基本上我想去:

A.route → A.trips (many) → A.trips.stop_times(many^2) → A.trips.stop_times.stops(many^3)

然後用於stop_connections獲取相關站點,然後向後獲取這些站點的路線:

connected_stops → connected_stops.stop_times → connected_stops.stop_times.trips → connected_stops.stop_times.trips.routes

我會得到一個類似這樣的表:

+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
|          0001 |        0002 |
|          0001 |        0004 |
|          0001 |        0005 |
|          0002 |        0001 |
|          0002 |        0005 |
+---------------+-------------+

這是我的查詢:

-- get from/to route IDs
select
   r.route_id as from_route_id,
   c_t.route_id as to_route_id

- start from route A
from routes r

-- going down on all trips on route A
left join trips t on t.route_id = r.route_id

-- going down on all stop_times for A's trips
left join stop_times st on st.trip_id = t.trip_id

-- going down on all stops use by A's trips/vehicles
left join stops s on s.stop_id = st.stop_id

-- get connected stops
inner join stop_connections c_s on c_s.from_stop_id = s.stop_id

-- going up to stop_times
inner join stop_times c_st on c_st.stop_id = c_s.to_stop_id

-- going up to trips
inner join trips c_t on c_t.trip_id = c_st.trip_id and c_t.route_id <> r.route_id

-- no need to actually go up to routes because we already have the route ID

-- I just want every from/to routes mentioned only once!
group by r.route_id, c_t.route_id -- this is my problem here

下面是解釋:

+----+-------------+-------+--------+------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| id | select_type | table | type   | possible_keys                            | key              | key_len | ref                              | rows | Extra                                        |
+----+-------------+-------+--------+------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | r     | const  | PRIMARY                                  | PRIMARY          | 302     | const                            |    1 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | t     | ref    | PRIMARY,trip_route_id                    | trip_route_id    | 302     | const                            | 1474 | Using where; Using index                     |
|  1 | SIMPLE      | st    | ref    | st_trip_id,st_stop_id                    | st_trip_id       | 302     | bicou_gtfs_rennes.t.trip_id      |   14 | Using where                                  |
|  1 | SIMPLE      | c_s   | ref    | from_to_stop_ids,from_stop_id,to_stop_id | from_to_stop_ids | 767     | bicou_gtfs_rennes.st.stop_id     |    1 | Using where; Using index                     |
|  1 | SIMPLE      | s     | eq_ref | PRIMARY                                  | PRIMARY          | 767     | bicou_gtfs_rennes.st.stop_id     |    1 | Using where; Using index                     |
|  1 | SIMPLE      | c_st  | ref    | st_trip_id,st_stop_id                    | st_stop_id       | 302     | bicou_gtfs_rennes.c_s.to_stop_id |  325 | Using where                                  |
|  1 | SIMPLE      | c_t   | eq_ref | PRIMARY,trip_route_id                    | PRIMARY          | 767     | bicou_gtfs_rennes.c_st.trip_id   |    1 | Using where                                  |
+----+-------------+-------+--------+------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+

我無法完成此查詢。限制為 500、5000、50k 甚至 100k,這是在 30 秒內完成的,但我不知道完整的行數,我預計數百萬(它是語句中所有的產品rowsEXPLAIN?我是相當新的帶有查詢優化)。

當我以限制執行時,SQL 引擎在它有行時停止,但這就是我得到的:

+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
| 0001          | 0150        |
+---------------+-------------+
10 rows in set (0.19 sec)

所以我想我會添加一個分組依據(它在我上面的查詢中),但現在 SQL 引擎必須遍歷所有行並對它們進行排序,這需要永遠。

我也嘗試了不同的join方法:先left join,然後inner join,還有STRAIGHT_JOIN. 這些都沒有幫助。

請注意,這樣做的目的是填充查找表,因此如果該查詢需要幾分鐘就可以了。但我想避免一天瘋狂地使用 CPU 和磁碟。我計劃每個月左右都需要這個查詢。

我正在使用 MySQL 5.1.66-0+squeeze1

好的,我最終添加了另一個查找表:

CREATE TABLE IF NOT EXISTS `stops_routes` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `stop_id` varchar(100) NOT NULL,
 `route_id` varchar(100) NOT NULL,
 PRIMARY KEY (`id`),
 UNIQUE KEY `stop_route` (`stop_id`,`route_id`),
 KEY `stop_id` (`stop_id`),
 KEY `route_id` (`route_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

填充它相當快:

mysql> insert into stops_routes (stop_id, route_id)
   ->
   -> select
   ->     s.stop_id,
   ->     r.route_id as from_route_id
   ->
   -> from routes r
   -> left join trips t on t.route_id = r.route_id
   -> left join stop_times st on st.trip_id = t.trip_id
   -> left join stops s on s.stop_id = st.stop_id
   -> group by s.stop_id, r.route_id;
Query OK, 3496 rows affected (8.38 sec)
Records: 3496  Duplicates: 0  Warnings: 0

使用它非常快:

mysql> select
   ->     r.route_id as from_route_id,
   ->     c_sr.route_id as to_route_id
   ->
   -> from routes r
   ->
   -> left join stops_routes sr on sr.route_id = r.route_id
   -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
   -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
   ->
   -> where r.route_id <> c_sr.route_id
   -> group by r.route_id, c_sr.route_id
   -> limit 10;
+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001          | 0002        |
| 0001          | 0003        |
| 0001          | 0004        |
| 0001          | 0005        |
| 0001          | 0006        |
| 0001          | 0008        |
| 0001          | 0009        |
| 0001          | 0011        |
| 0001          | 0014        |
| 0001          | 0031        |
+---------------+-------------+
10 rows in set (0.63 sec)

現在我可以填寫我的最後一個查找表(GTFS 網路上每條路由之間的連接集):

mysql> insert into route_connections (from_route_id, to_route_id)
   -> select
   ->     r.route_id as from_route_id,
   ->     c_sr.route_id as to_route_id
   ->
   -> from routes r
   ->
   -> left join stops_routes sr on sr.route_id = r.route_id
   -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
   -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
   ->
   -> where r.route_id <> c_sr.route_id
   -> group by r.route_id, c_sr.route_id;
Query OK, 2848 rows affected (0.31 sec)
Records: 2848  Duplicates: 0  Warnings: 0

快得驚人。我猜引擎無法分解優化這一點的步驟。

我仍然很想知道是否可以僅使用一個亞秒或亞分鐘查詢獲得相同的結果(從路由到路由連接表)。

引用自:https://dba.stackexchange.com/questions/77597