Question

I'm working with some GTFS data and would like to be able to create a list of all stops associated served by a route. I don't really understand how to do with with GTFS data.

Trips.txt comes in a format like this:

route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id 1,A20120610WKD,A20120610WKD_000800_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_002700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_004700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_006700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_008700_1..S03R,SOUTH FERRY,1,,1..S03R

I tried reading in the matching shape using the shape_id and then looking for stops with matching latitudes and longitudes but that doesn't seem to work reliably. Does anybody know how to do this?

Was it helpful?

Solution

As you've noticed, there isn't a direct relationship between routes and stops in GTFS. Instead, stops are associated with trips, where each trip represents a single "run" of a vehicle along a particular route. This reflects the fact a route does not necessarily serve every one of its stops at all times—on weekends it might skip stops outside a high school, for instance.

So getting a list of every stop served by a route involves combining several models:

  • routes.txt gives you the route ID for the route you're interested in.
  • trips.txt gives you a set of trip IDs for that route.
  • stop_times.txt gives you a set of stop IDs for the stops served on each of these trips.
  • stops.txt gives you information about each of these stops.

Assuming you're using an SQL database to store your GTFS data, you might use a query like this (once you've obtained the route ID):

SELECT stop_id, stop_name FROM stops WHERE stop_id IN (
  SELECT DISTINCT stop_id FROM stop_times WHERE trip_id IN (
    SELECT trip_id FROM trips WHERE route_id = <route_id>));

Remember, though, this will output a record for every stop that is ever served by the route. If you're generating schedule information for a rider you'll probably want to limit the query to only trips running today and only stop times with departures in, say, the next thirty minutes.


Update: I wrote the above SQL query the way I did as I felt it most simply illustrated the relationship between the GTFS models, but btse is correct (in his answer below) that a query like this would never actually be used in production. It's too slow. You would instead use table joins and indices to keep query times reasonable.

Here is an equivalent query, written in a way more suited to being copied and pasted into a real application:

SELECT DISTINCT stops.stop_id, stops.stop_name
  FROM trips
  INNER JOIN stop_times ON stop_times.trip_id = trips.trip_id
  INNER JOIN stops ON stops.stop_id = stop_times.stop_id
  WHERE route_id = <route_id>;

Typically you would also create an index for each column used in a JOIN or WHERE clause, which in this case would mean:

CREATE INDEX stop_times_trip_id_index ON stop_times(trip_id);

CREATE INDEX trips_route_id_index ON trips(route_id);

(Note that RDBMSes normally index each table by its primary key automatically, so there is no need to explicitly create an index on stops.stop_id.)

Many further optimizations are possible, depending on the specific DBMS in use and your willingness to sacrifice disk space for performance. But these commands will yield good performance on virtually any RDBMS without needlessly sacrificing clarity.

OTHER TIPS

I came across this post in my Google searches and I figured I would update it with a better answer in case anyone else stumbles upon it. The answer that Simon gave is 100% correct, however, the query he provided is quite slow for large GTFS feeds. Here is a query that does the same thing, but performs significantly faster.

Just to give you some anecdotal evidence, for a GTFS feed of about 50mb, Simon's query took anywhere from 10-25 seconds to complete. The statement below takes consistently < 0.2 seconds.

SELECT T3.stop_id, T3.stop_name 
FROM trips AS T1
JOIN
stop_times AS T2
ON T1.trip_id=T2.trip_id AND route_id = <routeid>
JOIN stops AS T3
ON T2.stop_id=T3.stop_id
GROUP BY T3.stop_id, T3.stop_name

UPDATE:

I realized I didn't mention this before, but of course you will want to have indexes where each of the tables are being joined.

If you GROUP BY shape_id when selecting from trips you can make the query even faster.

Using @btse's query to get the unique stops for two routes takes 1.147s.

My equivalent query takes 0.4s.

SELECT unique_stops.route_id, unique_stops.stop_id, stop_name, stop_desc, stop_lat, stop_lon
FROM
  stops,
  (SELECT stop_id, route_id
   FROM
     stop_times,
     (SELECT trip_id, route_id
      FROM trips
      WHERE route_id IN (801, 803)
      GROUP BY shape_id
     ) AS unique_trips
   WHERE stop_times.trip_id = unique_trips.trip_id
   GROUP BY stop_id) AS unique_stops
WHERE stops.stop_id = unique_stops.stop_id

If you're working in R you could do this to find routes that stop at your target destination X:

require(dplyr)

routesX <- routes %>%
  left_join(trips %>% select(trip_id, route_id, shape_id)) %>%
  left_join(stop_times %>% select(trip_id, stop_id)) %>%
  semi_join(stops %>% filter(grepl('X', stop_name, ignore.case = T)), by = c('stop_id' = 'stop_code')) %>%
  select(names(routes), shape_id) %>%
  unique 

If the direction of the stop is needed, a change in Lukmaan's answer should be done:

SELECT unique_stops.route_id, unique_stops.stop_id, stop_name, stop_desc, stop_lat, stop_lon, unique_stops.direction_id
FROM
  stops,
  (SELECT stop_id, route_id, direction_id
   FROM
     stop_times,
     (SELECT trip_id, route_id, direction_id
      FROM trips
      WHERE route_id IN (801, 803)
      GROUP BY direction_id
     ) AS unique_trips
   WHERE stop_times.trip_id = unique_trips.id
   GROUP BY stop_id, direction_id) AS unique_stops
WHERE stops.stop_id = unique_stops.stop_id

If you add also stop_times.stop_sequence the same way, and order by direction and stop_sequence, the stops will be sorted as they are in the trip.

If you use "onebusaway", there is a quick way to do this without touching GTFS

Lets say you want to know the bus stops for bus route "M1" in Manhattan, NYC

http://bustime.mta.info/api/where/stops-for-route/MTA%20NYCT_M1.json?key=yourapikey&includePolylines=false&version=2

will give you a json feed then you can extract bus stops for both direction on route M1.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top