Chapter 5 —
Transport Data

Logic will get you from A to B;
imagination will take you
everywhere.
― Anonymous.

Route planning applications can better target end-user needs when they have access to a higher quantity of datasets. The algorithms used by these apps need access to datasets that exceed one database and need query functionality that exceed standard query languages. We rethink the access to data by studying how route planning algorithms work. In this chapter, we describe the state of the art of route planning interfaces. Transport data today is published in or a datadump, or a web-service which answers the entire question on the server-side. If we want to advance the state of the art in sharing data for maximum reuse, we will have to establish a new trade-off between client and server effort.

The way travelers want route planning advice is diverse: from finding journeys that are accessible with a certain disability [1], to taking into account whether the traveler owns a (foldable) bike, a car or a public transit subscription, or even calculating journeys with the nicest pictures on social network sites [2]. However, when presented with a traditional route planning http api taking origin-destination queries, developers of, e.g., traveling tools are left with no flexibility to calculate journeys other than the functions provided by the server. As a consequence, developers that can afford a larger server infrasture, integrate data dumps of the timetables (and their real-time updates), into their own system. This way, they are in full control of the algorithm, allowing them to calculate journeys in their own manner, across data sources from multiple authorities.

For instance, trying to answer the question “how long do I have to walk from one point to another?” can take into account the geolocation of the streets, the weather conditions at that time of the day, the steepness of the road, whether or not there is a sidewalk, criminality reports to check whether it is safe to walk through these streets, the wheelchair accessibility or accessibility for the visual imparaired of the road, whether the street is blocked by works at that time, etc. We can imagine the complexities that arise if the user does not only want to walk, but that he also wants to get advice taking different transport modes into account. An open world approach is needed: a certain dataset should be queried with the assumption that it can only answer part of the question, and that a better answer can always be found by using more datasets. [3]

In this chapter, we first scratch the surface of publishing data for route planning on the road as well, by building a proof of concept in collaboration with the city of Ghent. We then discuss the state of the art in the field of public transit route planning, the focus of this book. Finally, we conclude with opportunities to evolve this state of the art with Linked Connections, described in the next chapter.

Data on the road

As we have discussed in the previous chapter, it is difficult to draw the line between transport data and non transport data. Even merely administrative datasets such as the alteration of a streetname, may at some point become useful for a route planning algorithm. We first discuss, within the field of Intelligent Transport Systems (its), what the constraints should be for an information system, in order to share transport data world-wide.

Constraints for a large-scale its data-sharing system

Within the domain of its, the its directive helped popularizing publicly sharing data. For example, with a delegated regulation elaborating on a European Access Point for Truck Parking Data, it regulates sharing the data through a national access point similar to the inspire directive, a directive for sharing data within the geo-spatial domain.

When creating a system to distribute data within our own domain – for the many years to come – an important requirement is that this data policy needs to scale up efficiently. When more datasets are added, when more difficult questions need to be answered, when more questions are asked, or when more user agents come into action, we want our system to work without architectural changes. Furthermore, the system should be able to evolve while maintaining backwards-compatibility, as our organization is always changing. Datasets that are published today, still have to work when accessed when new personnel is in place. Such a system should also have a low entry-barrier, as it needs to be adopted by both developers of user agents as well as data publishers.

A use case in Ghent

The mobility organization of Ghent was given the task to build a virtual traffic centre. The centre should inform Ghentians with the latest up to date information about mobility in the city. As the city is in a transition, banning cars from the city centre, this is a project with high expectations. Information to build this traffic centre comes from existing geospatial layers, containing the on-street parking zones with, among others, their tariffs, all off-street parking lots, all streets and addresses, and all traffic signs. Also real-time datasets are available, such as the sensor data from the induction loops, bicycle counters, thermal cameras, and the availability of the off-site parking lots. Third parties also contribute datasets, such as the public transit time schedules and their real-time updates, and traffic volumes in, to and leaving the city. In order to bring these datasets from various sources together, and analyse them, both the semantics as the queryability is lacking.

The city of Ghent allowed us to define a couple of uris for parking sites. In this city, a uri strategy is in place to negotiate identifiers since 2016. The uri strategy defines a base uri at “https://stad.gent/id/”. Using this strategy, the city introduced uris for each parking site, similar to the examples above. When we now would point our browser to https://stad.gent/id/parking/P1, we will be directed to a page about this parking space. Furthermore, this identifier is interoperable across different systems, as when we would get this uri from a computer program, I can negotiate its content type, and request an rdf representation of choice.

As there are only a couple of parking sites at the city of Ghent, describing this amount of parking sites easily fits into one information resource identified by one url, for instance, http://linked.open.gent/parking/. When the server does not expose the functionality to filter the parking sites on the basis of geolocation, all user agents that want to solve any question based on the location of parking sites, have to fetch the same resource. This puts the server at ease, as it can prepare the right document once each time the parking sites list is updated. Despite the fact that now all parking sites had to be transferred to the user agent, and thus consumed more bandwidth, also the user agent can benefit. When a similar question is executed from the same device, the dataset will already be present in the user agent’s cache, and now, no data at all will need to be transferred. This raises the user-perceived performance of a user interface. When now the number of end-users increases by a factor of thousand per second – not uncommon on the Web – it becomes easier for the server to keep delivering the same file for those user agents that do not have it in cache already. When it is not in the user agents own cache, it might already be in an intermediate cache on the Web, or in the server’s cache, resulting in less cpu time per user. Caching, another one of the rest constraints, thus has the potential to eliminate some network interactions and server load. When exploited, a better network efficiency, scalability, and user-perceived performance can be achieved.

Without the cors header, a resource is by default flagged as potentially containing private data, and cannot be requested by in-browser scripts at a different domain. More information at enable-cors.org In a test environment, we published a Linked Data document at http://linked.open.gent/parking/, by transforming the current real-time xml feeds using a php script. First, this script adds metadata to this document indicating this data has an open license, and thus becomes legally interoperable with other Web resources. Next, we added http headers, indicating that this document can be used for Cross Origin Resource Sharing (cors), as well as an http header that this document can be cached for 30 seconds. Finally, we also added a content negotiation for different rdf representations.

As we would like to solve Basic Graph Patterns (bgp) queries, we added the hypermedia controls requested by the Triple Pattern Fragments (tpf) specification, which details how to filter the triples in this document based on their subject and/or predicate and/or object [4]. As the number of data facts that need to be published is small enough, we directed all controls to the main document, and did not expose extra server functionality. The latest version of the Triple Pattern Fragments hypermedia specification can be found at https://www.hydra-cg.com/spec/latest/triple-pattern-fragments/.

Following these steps, we made the data queryable through clients that can exploit the Triple Pattern hypermedia controls, such as the Linked Data Fragments client available at http://client.linkeddatafragments.org. This client is able to query over multiple Triple Pattern Fragments interfaces at once, and thus answer federated queries by following hypermedia controls. The following query selects the name, the real-time availability, and the geolocation from Open Street Map (Linked Geo Data) of a parking lot in Ghent with more than 200 spaces available: http://bit.ly/2jUNnES. This demonstrates that complex questions can still be answered over simple interfaces. The overall publishing approach is cost-efficient and stays as close as possible to the http protocol as a uniform interface. The only part where an extra technical investment was needed, was in documenting the definitions of the new uris for the parking sites.

A demo showing that the demo also exposes historic data.

Public Transit Time Schedules

Computation of routes within road networks and public transport networks are remarkably different: while some speed up methods work well for the former, they do not for the latter [5]. In this chapter, we will focus on public transit route planning, further on referred to as route planning.

A public transit network is considered, in its most essential form, to consist of stops, connections, trips, and a transfer graph [6]:

A stop p is a physical location at which a vehicle can arrive, drop off passengers, pick up passengers and depart;
A connection c represents a vehicle driving from a departure stop at a departure time, without intermediate halt, to an arrival stop at an arrival time;
A trip is a collection of connections followed by a vehicle;
A transfer is when a passenger changes a vehicle at the same stop, or a nearby stop with a certain path in between.

A network can also optionally contain routes, which is the collection of trips that follow, to a certain extent, the same series of stops. As the number of routes is in many cases much smaller than the number of trips, the name of these routes are, for simplicity, commonly used to inform passengers. In some route planning applications such as Raptor [7] or Trip-based route planning [8], a clustering algorithm may re-cluster the trips into routes favorable for the algorithm.

For future reference, an arrival and a departure are two concepts defined by a location and a timestamp. A departure at a certain stop, linked to an arrival at a different stop can thus create a connection.

Route planning queries

We differentiate various types of queries over public transport networks [9]:

An Earliest Arrival Time (eat) q is a route planning question with a time of departure, a departure stop, and an arrival stop, expecting the earliest possible arrival time at the destination;
The Minimum Expected Arrival Time (meat) is similar to an eat, yet it takes into account the probability of delayed or canceled connections;
Given departure stop, a profile query computes for every other stop the set of all earliest arrival journeys, for every departure from the departure stop;
The multi-criteria profile query calculates a set of journeys that each are not outbettered in arrival time or number of transfers, also called a set of Pareto optimal journeys.

In route planning software, the eat problem is, among others, used for mobility studies, to calculate a latest arrival time for profile queries or for preprocessing to speed up other methods. For providing actual route planning advice, a profile query is what most end-user interfaces expect. The same problem solving algorithm can, with minimal adaptations, be used for the latest departure time problem, which requests the last possible time to leave in order to arrive on time at a certain stop.

Other queries

Within any of the given problems, a user may need extra personal features, such as the ability to request journeys that are wheelchair accessible, journeys providing the most “happy” route [2], journeys where the user will have a seat [1], or journeys with different transfer settings depending on, e.g., reachability by foldable bike or criminality rates in a station.

Unrelated to the previous problems is that end users also still want answers to other questions without a route planning algorithm to be in place, such as getting a list of the next departures and arrivals at a certain stop, all stops a certain trip contains, or a list of all stops in the area.

Route planning algorithms

Only in the last decade, algorithms have been built specifically for public transit route planning [10]. Two models were commonly used to represent such a network: a time-expanded model and a time-dependent model [11]. In a time-expanded model, a large graph is modeled with arrivals and departures as the nodes, and edges to connect a departure and an arrival together. The weights on these edges are constant. In a time-dependent model, a smaller graph is modeled in which vertices are physical stops and edges are transit connections between them. The weights on these edges change as a function of time. On both models, Dijkstra and Dijkstra-based algorithms can be used to calculate routes [11].

Raptor [7] was the first base algorithm to disregard Dijkstra-like graph algorithms and instead exploit the basic elements of a public transit network. It does this by studying the routes within a network. In each round, it computes the earliest arrival time for journeys with a per round increasing number of transfers. It looks at each route in the network at most once per round. Using simple pruning rules and parallelization with multiple cores, the algorithm can be made even faster. It is currently the algorithm behind software like Open Trip Planner, or Bliksemlabs’ RRRR software.

Trip-based route planning [8] uses the same ideas as Raptor and works with an array of trips. In a preprocessing phase, it links trips together when there is a possibility to transfer at a certain stop. During query execution, it can take into account the data generated during the preprocessing phase.

Transfer Patterns [10] preprocesses a timetable such that at execution time, it is known which transfers can be used at a certain departure stop and departure time. The preprocessing requires by design more cpu time and memory than the preprocessing of trip-based route planning. Scalable Transfer Patterns [12] enhances these results by reducing both the necessary time and space consumption by an order of magnitude. Within companies that have processing power of idle machines to their disposal, the latter is a desired approach, which makes it the current algorithm behind the Google Maps public transit route planner according to its authors.

Finally, the Connection Scan Algorithm (csa) is an approach for planning that models the timetable data as a directed acyclic graph [9]. By topologically sorting the graph by departure time, the shortest path algorithm only needs to scan through connections within a limited time window. csa can be extended to solving problems where it also keeps the number of transfers limited, as well as calculating routes with uncertainty. The ideas behind csa can scale up to large networks by using multi-overlay networks [6].

It is a challenge to compare route planning algorithms as they are often built for answering slightly different questions and are executed on top of different data models. When studying the state of the art in order to find a suitable data publishing model, we were looking for the smallest building block needed for both preprocessors as actual query execution algorithms (with and without preprocessed data).

Exchanging route planning data over the Web

The General Transit Feed Specification (gtfs) is a framework for exchanging data from a public transit agency to third parties. gtfs, at the time of writing, is the de-facto standard for describing and exchanging of transit schedules. It describes the headers of several csv files combined in a zip-file. Through a calendar.txt file, you are able to specify on which days of the week a certain entry from service.txt is going to take place during a part of the year. In a calendar_dates.txt file, you are able to specify exceptions on these calendar.txt rules for example indicating holidays or extra service days. Using these two files, periodic schedules can be described. When an aperiodic schedule needs to be described, mostly only the calendar_dates.txt file is used to indicate when a certain service is running. A gtfs:Service contains all rules on which a gtfs:Trip is taking place, and a gtfs:Trip is a periodic repetition of a trip as defined earlier. Trips in gtfs contain multiple gtfs:StopTimes and/or gtfs:Frequencies. The former – mind the difference with a connection – describes a periodic arrival time and a departure time at one certain gtfs:Stop. The latter describes the frequency at which a gtfs:Trip passes by a certain stop. gtfs also describes the geographic shape of trips, fare zones, accessiblity, information about an agency, and so forth.

Other specifications exist, such as the European cen specification Transmodel, the Dutch specification bison, the Belgian specification bltac and the specification for describing real-time services The Service Interface for Real Time Information (siri). They each specify a serialization specific format for exchanging data, and some specify the questions that should be exposed by a server.

Up to date, route planning solutions exist as services, such as Navitia.io or Plannerstack, in end-user applications such as CityMapper, Ally or Google Maps, or as open source, such as Open Trip Planner or Bliksemlabs RRRR. Other common practices include that an agency, such as a railway company exposing a route planner over http themself. Each of these route planners however have the disadvantage that they do not allow querying with an open world assumption: each response is final and is not supposed to be combined with other response documents.

We have given uris to the terms in the gtfs specification through the Linked gtfs (base uri: http://vocab.gtfs.org/terms#) vocabulary. The definitions of the above terms can be looked up by replacing gtfs: by the base uri.

Conclusion

If we may conclude anything from our first proof of concept within the city of Ghent, it is that the field of transport data is not much different than Open Data in general. If we want automated reuse of datasets within route planners, we still need to get some minimum requirements accepted in the entire transport domain. These requirements are situated on the same interoperability levels those we were already discussing. We have done this for both datex2 as gtfs. For the domain specific parts, we also need to get more specifications to become Web specifications and define uris for their terms.

When publishing data in fragments, a generic fragmentation strategy can be thought of for continuously updating data as well.

One challenge which we were able to overcome, is to publish real-time data over http. The approach was again not much different for static documents, only now we need to ensure the latency with the back-end system is minimized, while setting the right caching headers. Real-time data can be put in fragments as well, allowing multiple observations to be stored in a page. This can be annotated by using a named graph and the prov-o vocabulary. The dataset of parking lots, published as illustrated in Figure 1, is now available at https://linked.open.gent/parking.

The Linked Data Fragments (ldf) axis applied to public transit route planning: on both ends, centralization is key as the client or the server will centralize all data.

Finally, today one can see clear evidence of the two extremes on the ldf axis – applied to route planning in Figure 2 – within public transit timetables. In the next chapter, we will explore different trade-offs on this axis.

References

[1]: Colpaert, P., Ballieu, S., Verborgh, R., Mannens, E. (2016). The impact of an extra feature on the scalability of Linked Connections. In proceedings of iswc2016.
[2]: Quercia, D., Schifanella, R., M. Aiello, L. (2014). The shortest path to happiness: Recommending beautiful, quiet, and happy routes in the city. In proceedings of The 25th ACM conference on Hypertext and social media (pp. 116–125). ACM.
[3]: Colpaert, P. (2014). Route planning using Linked Open Data. In proceedings of European Semantic Web Conference (pp. 827–833).
[4]: Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P. (2016, March). Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web. Journal of Web Semantics.
[5]: Bast, H. (2009). Efficient Algorithms: Essays Dedicated to Kurt Mehlhorn on the Occasion of His 60th Birthday. (pp. 355–367). Springer Berlin Heidelberg.
[6]: Strasser, B., Wagner, D. (2014). Connection Scan Accelerated. Proceedings of the Meeting on Algorithm Engineering & Expermiments (pp. 125–137).
[7]: Delling, D., Pajor, T., Werneck, R. (2012). Round-Based Public Transit Routing. Proceedings of the 14th Meeting on Algorithm Engineering and Experiments (ALENEX’12).
[8]: Witt, S. (2016). Trip-Based Public Transit Routing Using Condensed Search Trees. Computing Research Repository (corr).
[9]: Dibbelt, J., Pajor, T., Strasser, B., Wagner, D. (2013). Intriguingly Simple and Fast Transit Routing. Experimental Algorithms (pp. 43–54). Springer.
[10]: Bast, H., Carlsson, E., Eigenwillig, A., Geisberger, R., Harrelson, C., Raychev, V., Viger, F. (2010). Fast routing in very large public transportation networks using transfer patterns. Algorithms – esa 2010 (pp. 290–301). Springer.
[11]: Pyrga, E., Schulz, F., Wagner, D., Zaroliagis, C. (2008). Efficient models for timetable information in public transportation systems. Journal of Experimental Algorithmics (JEA) (pp. 2–4). ACM.
[12]: Bast, H., Hertel, M., Storandt, S. (2016). Scalable Transfer Patterns. In proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (alenex). Society for Industrial and Applied Mathematics.
[13]: Corsar, D., Markovic, M., Edwards, P., Nelson, J.D. (2015). The Transport Disruption Ontology. In proceedings of the International Semantic Web Conference 2015 (pp. 329–336).