Summary
The kinds of route planning advice travelers want are diverse. To name a few: finding journeys that are accessible with a certain disability; combining different modes of transport; taking into account whether the traveler owns a (foldable) bike, a car, or a public transit subscription; or even calculating journeys along which the nicest pictures appear on social network sites. Rather than merely being a mathematical problem, route planning advice has become a data accessibility problem. Better route planning advice can only be given when more datasets can be used within the query evaluation.
Public administrations maintain datasets that may contribute to such route planning advice. Today, there is evidence of such datasets being published on Open Data Portals, yet the cost of adopting these datasets in end-user systems is still too high: there is no evidence yet of wide reuse of these simple datasets. In order to make these datasets more used and useful, how can we leverage Open Data publishing policies?
Publishing data for maximum reuse means pursuing a lower cost of adoption for your dataset. This is an automation challenge: ideally, software written to work with one dataset works just as well with datasets published by a different authority. We can lower the cost of adopting datasets and automate data reuse by raising the interoperability between all datasets published on the Web. Therefore, in this PhD we study the interoperability of data sources – each with their own heterogeneity problems – and introduce five data source interoperability levels. The (i) legal level puts forward the question of whether we are legally allowed to bring two datasets together. On the (ii) technical level, we can study whether there are technical difficulties in physically bringing the datasets together. The (iii) syntactic interoperability describes whether the serializations can be brought together. Moreover, the syntax should provide building blocks to document the identifiers used in the dataset, as well as the domain model used. This creates the basis for reaching a higher (iv) semantic interoperability, as identifiers for the same real-world objects can be aligned.
The goal of this PhD is to study how to raise the data source interoperability of public datasets, in order to lower the cost of adoption in route planning services.
As we study data sources published by multiple authorities, and as we still need to be able to evaluate queries over these datasets, we also added the (v) querying level. Even when the other four levels are fulfilled, we still cannot guarantee a cost-efficient way to evaluate queries. Today, two extremes exist to publish datasets on the Web: either the query evaluation happens entirely on the data publisher’s interface, or only a data dump is provided and the query evaluation happens entirely on the infrastructure of a reuser after replicating the entire dataset. The Linked Data Fragments (ldf) axis introduces a framework to study the effort done by clients vs. the effort done by servers, and tries to find new trade-offs by fragmenting datasets into a finite number of documents. By following hypermedia controls within these documents, user agents can discover fragments as they go along.
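To illustrate what such hypermedia-driven discovery could look like, the sketch below shows a client that follows a "next" control from fragment to fragment. The fragment url, the json property names ("data", "next") and the media type are assumptions for the sake of the example, not a fixed interface.

```typescript
// Minimal sketch of an ldf-style client that discovers fragments by
// following hypermedia "next" links. Property names are illustrative.
async function* traverseFragments(startUrl: string): AsyncGenerator<unknown> {
  let url: string | undefined = startUrl;
  while (url) {
    const response = await fetch(url, { headers: { accept: "application/json" } });
    const fragment = await response.json();
    for (const item of fragment.data ?? []) {
      yield item; // hand each data element to the query algorithm
    }
    url = fragment.next; // hypermedia control pointing to the next fragment
  }
}
```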
To each of these levels, we can map generic solutions for maximizing the potential reuse of a dataset. As we are working towards Open Data, the legal aspect is covered by the Open Definition. This definition requires the data to be accompanied by a public license that informs end-users about the restrictions that apply when reusing these datasets. The only restrictions that may apply are the obligation to always mention the source of the original document containing the data, and the restriction that when this document is changed, the resulting document needs to be published under the same license conditions. Legal interoperability rises when these reuse conditions themselves are also machine-interpretable and are, in turn, published for maximum reuse as well.
We use the Web as our worldwide information system. In order to ensure technical interoperability, the uniform interface – one of the rest architectural constraints – that we adopt is http. Also for the identifiers, we choose to use http identifiers, or uris. This way, the same identifier can be used for accessing different serializations and representations of the same object. It also means the identifiers become globally unique strings of characters, which avoids identifier conflicts. Furthermore, using the Resource Description Framework (rdf), different serializations can annotate each data element with these http uris as well, which enables identifier reuse and linking across independent data sources.
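As an illustration of how one identifier can yield different representations, the sketch below dereferences a made-up stop uri twice, asking for html and for an rdf serialization via content negotiation.

```typescript
// One (made-up) identifier, two representations: the Accept header asks the
// server for either an html page or an rdf serialization (Turtle).
const stop = "https://example.org/stops/1234"; // hypothetical stop uri

async function fetchRepresentations(): Promise<void> {
  const asHtml = await fetch(stop, { headers: { accept: "text/html" } });
  const asRdf = await fetch(stop, { headers: { accept: "text/turtle" } });
  console.log(asHtml.headers.get("content-type")); // e.g. text/html
  console.log(asRdf.headers.get("content-type")); // e.g. text/turtle
}
```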
In 2015, I had the opportunity to study the organizational challenges, together with communication scientists, at the Department of Transport and Public Works (dtpw) of the Flemish government. Three European directives (psi, inspire, and its), extended with the department’s own insights, created a clear willingness to publish data for maximum reuse. How to implement such an Open Data strategy in a large organization was, however, still unclear. After interviewing 27 data owners and directors, we arrived at a list of recommendations for next steps on all interoperability levels.
None of the common specifications for describing time schedules, road networks, disruptions, and road events – such as gtfs, transmodel, and datex2 – has an authoritative Linked Data approach. For the specific case of public transit time schedules, we used the gtfs specification and mapped the terms within its domain model to uris, so that they become usable in rdf datasets.
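As a concrete (and simplified) illustration of such a mapping, a row from a gtfs stops.txt file could be turned into rdf-style triples in which both the identifier and the properties are http uris. The stop uri and the vocabulary namespaces used below are assumptions for this sketch; the actual mapping may use different terms.

```typescript
// Illustrative mapping of one gtfs stops.txt row to rdf-style triples.
// The stop uri and the vocabulary namespaces are assumptions for the sketch.
const GTFS = "http://vocab.gtfs.org/terms#"; // illustrative gtfs namespace

// stops.txt: stop_id,stop_name,stop_lat,stop_lon
const row = { stop_id: "1234", stop_name: "Central Station", stop_lat: "51.035", stop_lon: "3.710" };

const subject = `https://example.org/stops/${row.stop_id}`;
const triples: [string, string, string][] = [
  [subject, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", `${GTFS}Stop`],
  [subject, "http://xmlns.com/foaf/0.1/name", row.stop_name],
  [subject, "http://www.w3.org/2003/01/geo/wgs84_pos#lat", row.stop_lat],
  [subject, "http://www.w3.org/2003/01/geo/wgs84_pos#long", row.stop_lon],
];
```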
For route planning over various sources, we studied the existing public transit route planning algorithms. The base algorithm to select – one on which other route planning algorithms can be built – needs to work on top of a data model that allows for an efficient fragmentation strategy. Our hypothesis was that this way a new trade-off could be established, putting forward a cost-efficient way of publishing – as governmental organizations cannot afford to evaluate all queries over all datasets on their servers – while leaving room for client-side flexibility. For this purpose, we found the Connection Scan Algorithm (csa) to be a good fit.
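The core idea of csa can be sketched in a few lines: scan the connections in order of departure time and let each connection relax the earliest arrival time of its arrival stop. The Connection shape below is illustrative, and the sketch assumes zero transfer time, no footpaths, and no journey extraction.

```typescript
// A minimal earliest-arrival sketch of the Connection Scan Algorithm (csa),
// assuming the connections are already sorted by departure time.
interface Connection {
  departureStop: string;
  departureTime: number; // e.g. a unix timestamp in seconds
  arrivalStop: string;
  arrivalTime: number;
}

function earliestArrival(
  connections: Connection[], // sorted by departureTime
  from: string,
  to: string,
  departAt: number
): number | undefined {
  const arrival = new Map<string, number>([[from, departAt]]);
  for (const c of connections) {
    const reachable = (arrival.get(c.departureStop) ?? Infinity) <= c.departureTime;
    if (reachable && c.arrivalTime < (arrival.get(c.arrivalStop) ?? Infinity)) {
      arrival.set(c.arrivalStop, c.arrivalTime); // relax this connection
    }
  }
  return arrival.get(to); // undefined when the target was not reached
}
```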
We introduced the Linked Connections (lc) framework. An lc server publishes an ordered list of connections – departure and arrival pairs – in fragments interlinked with next- and previous-page links. The csa algorithm can then be implemented on the side of the data consumer. Enabling the client to do the query execution comes with benefits: (i) off-loading the server, (ii) better user-perceived performance, (iii) the ability to take more datasets into account, and (iv) privacy by design, as the query itself is never sent to a server.
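To make the shape of such a page concrete, the sketch below shows one possible lc fragment as a plain object: an ordered list of connections plus hypermedia links to the previous and next pages. All identifiers, times, and property names are made up for illustration and simplified with respect to the actual lc vocabulary.

```typescript
// Illustrative example of a single lc fragment; all values are made up.
const fragment = {
  "@id": "https://example.org/connections?departureTime=2019-05-01T08:00",
  previous: "https://example.org/connections?departureTime=2019-05-01T07:50",
  next: "https://example.org/connections?departureTime=2019-05-01T08:10",
  connections: [
    {
      departureStop: "https://example.org/stops/1234",
      departureTime: "2019-05-01T08:01:00Z",
      arrivalStop: "https://example.org/stops/5678",
      arrivalTime: "2019-05-01T08:07:00Z",
    },
    // ... more connections, ordered by departure time
  ],
};
```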
The drawback of this publishing method is a higher bandwidth consumption, and when the client has not yet cached any resources, querying is slow – certainly over a slow network. However, the clients do not necessarily need to be the end-user devices: intermediary servers can also evaluate queries over the Web and give concise and timely answers to a smaller set of end-users. When studying whether an lc server should also expose wheelchair accessibility information, we found that both client and server had more work to process the data.
With lc, we designed a framework with a high potential interoperability on all five levels on the Web. We researched a new trade-off for publishing public transport data by evaluating its cost-efficiency. The trade-off chosen allows for flexibility on the client side, while offering a cost-efficient interface to data publishers.
In order to achieve a better Web ecosystem for sharing data, we propose a set of minimum extra requirements when using the http protocol. (i) Fragment your datasets and publish the documents over https. The way the fragments are chosen depends on the domain model. (ii) When you want to enable faster query answering, provide aggregated documents with appropriate links (useful for, for example, time series), or expose more fragments on the server side. For scalability, (iii) add caching headers to each document. For discoverability, (iv) add hypermedia descriptions in the document. (v) Use a Web address (uri) per object you describe, as well as http uris for the domain model. This way, each data element is documented and there is a framework to raise the semantic interoperability. For legal interoperability, (vi) add a link to a machine-readable open license in the document. (vii) Add a Cross-Origin Resource Sharing (cors) http header, enabling access from pages hosted on different origins. Finally, (viii) provide dcat-ap metadata for discoverability in Open Data Portals.
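A sketch of how a fragment could be served with several of these headers is given below, using Node’s built-in http module. The cache lifetime, license url, media type, and payload are placeholders rather than recommended values.

```typescript
// Sketch: serving one fragment with caching, licensing, and cors headers.
// The cache lifetime, license url, and payload are placeholders.
import { createServer } from "node:http";

const body = JSON.stringify({ connections: [] }); // placeholder fragment body

createServer((_req, res) => {
  res.setHeader("Content-Type", "application/ld+json");
  res.setHeader("Cache-Control", "public, max-age=30"); // (iii) cacheable for 30 s
  res.setHeader("Access-Control-Allow-Origin", "*"); // (vii) cors
  res.setHeader("Link", '<https://creativecommons.org/publicdomain/zero/1.0/>; rel="license"'); // (vi)
  res.end(body);
}).listen(8080);
```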
This approach is not limited to static data. The http protocol allows for caching resources for short amounts of time. Even when a document may change every couple of seconds, the resource can still be cached during that period, and a maximum load on the back-end system can be calculated.
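As a worked example with illustrative numbers: if a live document is served with Cache-Control: max-age=30 and at most 1,000 distinct fragment urls are in circulation, a shared cache in front of the server forwards at most 1,000 / 30 ≈ 34 requests per second to the back-end, regardless of how many clients are querying.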
Despite the old age of the Web – at least in terms of digital technology advances – there are still organizational challenges to overcome in order to build a global information system for the many. I hope this PhD can serve as input for standardization activities within the (public) transport domain, and as an inspiration for others to publish at Web-scale.