Chapter 7

The future is so bright, we will have to wear shades.

― Erik Mannens.

The cost to reuse a dataset can be lowered by raising its interoperability with other datasets. When a user agent can recognize common elements, a developer can automate the reuse of more datasets. In this dissertation, I studied how the data source interoperability of Open (Transport) Datasets can be raised. I introduced five layers of data source interoperability – legal, technical, syntactic, semantic, and querying – that form a framework to study these questions. On the one hand, these five layers were used for qualitative research studying public administrations in Flanders. On the other hand, they were used to design a public transport data publishing framework called Linked Connections (lc). With lc, I researched a new trade-off for publishing public transport data by evaluating its cost-efficiency. The chosen trade-off allows for flexibility on the client side with a user-perceived performance comparable to the state of the art. When publishing data in any domain – or when creating standards for publishing data – a similar exercise should be made. I summarize the key takeaways in 8 minimum requirements for designing your next http Open Data publishing interface.

The research question as discussed in Chapter 1 was “How can the data source interoperability of Open (Transport) Data be raised?”. In Chapter 2, I introduced the theoretical framework to study publishing data for maximum reuse in five data source interoperability layers. Chapter 3 then elaborated on how we can – and whether we should – measure the data source interoperability based on these layers, and discussed three possible approaches. In Chapter 4, I applied this to three projects carried out over the course of this PhD, and reasoned that a qualitative study of interoperability was, at that time, the best way to study maximizing reuse at the Flemish government. In Chapter 5, I then introduced the specifics of data in the transport domain, sketching the current state of the art. Finally, Chapter 6 introduced the Linked Connections framework, in which the conclusion contains strong arguments – supported by the evaluation – in favor of Linked Connections for publishing public transport data.

The upcoming http/2 standard pushes all Web communication towards secure connections (https). Using the http protocol today thus implies using the secured https scheme to identify Web resources. A redirect from an http url to an https url still enables older identifiers to persist. Lowering the cost of adoption for public datasets is a complex cross-cutting concern. Different parties across different organizations need to – just to name a few – align their vision on Open Data, create and accept domain models, agree upon legal conditions, and pick a Linked Data interface to make their data queryable. To that extent, we need to identify the minimum requirements that would lower the cost of adoption across the entire information system on all interoperability levels. In order to achieve a better Web ecosystem for sharing data in general, I summarized a minimum set of extra requirements when using the http protocol to publish Open Data.
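As a minimal sketch of how older identifiers can persist – assuming a hypothetical dataset url – upgrading an identifier can be as simple as rewriting the scheme and answering with a permanent redirect:

```python
# Sketch: keep old http identifiers dereferenceable by redirecting them to
# their https counterpart with a permanent redirect. The example url below
# is hypothetical.

def upgrade_identifier(url: str) -> tuple[int, dict]:
    """Return an http status code and headers for a requested identifier."""
    if url.startswith("http://"):
        target = "https://" + url[len("http://"):]
        # 301 Moved Permanently: clients and caches may remember the upgrade
        return 301, {"Location": target}
    return 200, {}  # already https: serve the resource itself

status, headers = upgrade_identifier("http://data.example.org/stops/59")
```

Because the redirect is permanent, a well-behaved client dereferences the old identifier once and keeps using the https resource afterwards.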

  1. Fragment your datasets and publish the documents over http. The way the fragments are chosen depends on the domain model.
  2. When you want to enable faster query answering, provide aggregated documents with appropriate links (useful for, e.g., time series), or expose more fragments on the server-side.
  3. For scalability, add caching headers to each document.
  4. For discoverability, add hypermedia descriptions in the document.
  5. Provide a web address (uri) for each object you describe, as well as http uris for the domain model. This way, each data element is documented and there is a framework to raise the semantic interoperability.
  6. For the legal interoperability, add a link to a machine-readable open license in the document.
  7. Add a Cross-Origin Resource Sharing (cors) http header, enabling access from pages hosted on different origins.
  8. Finally, provide dcat-ap metadata for discoverability in Open Data Portals.
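To make several of these requirements concrete, the sketch below builds the response headers and hypermedia skeleton of one published fragment, covering caching (3), hypermedia links (4), a uri per resource (5), a machine-readable license (6), and cors (7). The urls and vocabulary prefixes are illustrative assumptions, not prescriptions:

```python
# Sketch: headers and body skeleton for one dataset fragment. The base url
# and the hydra/dct prefixes are assumptions for illustration only.

def fragment_response(fragment_id: str, next_id: str, max_age: int = 3600):
    """Return (headers, body) for a single published fragment."""
    base = "https://data.example.org/connections/"   # hypothetical base url
    headers = {
        "Content-Type": "application/ld+json",
        # Requirement 3: allow any intermediate cache to store the document
        "Cache-Control": f"public, max-age={max_age}",
        # Requirement 7: allow reuse from pages on other origins
        "Access-Control-Allow-Origin": "*",
    }
    body = {
        "@id": base + fragment_id,    # requirement 5: a uri for this resource
        "hydra:next": base + next_id, # requirement 4: hypermedia link onwards
        # Requirement 6: machine-readable open license
        "dct:license": "http://creativecommons.org/publicdomain/zero/1.0/",
        "@graph": [],                 # the fragment's actual data elements
    }
    return headers, body

headers, body = fragment_response("2024-01-01T08:00", "2024-01-01T08:10")
```

A client that understands the hypermedia link can then crawl from fragment to fragment, while every intermediate cache is allowed to keep each document for an hour.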

This approach is of course not limited to static data. The http protocol allows for caching resources for smaller amounts of time. Even when a document may change every couple of seconds, the resource can still be cached during that period of time, and a maximum load on a back-end system can be calculated.
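This maximum load follows from simple arithmetic: behind one shared cache, the origin serves each resource at most once per caching window, regardless of the number of clients. A sketch, with hypothetical numbers:

```python
# Sketch: upper bound on origin traffic when every document is cacheable.
# With a shared cache in front of the server, each resource is fetched from
# the back-end at most once per max-age window, however many clients ask.

def max_backend_requests_per_second(resources: int, max_age_seconds: float) -> float:
    """Worst-case origin load behind one shared cache."""
    return resources / max_age_seconds

# e.g. 1000 live documents, each cacheable for 5 seconds:
load = max_backend_requests_per_second(1000, 5)  # at most 200 requests/s
```

The bound holds per shared cache; with a cdn in front, it applies per edge node rather than per end user, which is what makes the back-end load predictable.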

While economists promise a positive economic impact from Open Data, I have not yet seen proof of this impact to the extent promised. Over the next years, we will need to focus on lowering the cost of adoption if we want to see true economic impact. For data publishers, this entails raising the data source interoperability. For data reusers, this entails automating their clients to reuse these public datasets: when a new dataset becomes available and discoverable, a user agent can automatically benefit from it. Today, this is a manual process, in which the user agent has to store and integrate all data locally first, or has to rely merely on the expressiveness of a certain data publisher’s api. When fragmenting datasets in ways similar to Chapter 6, it becomes entirely up to the http client’s cache to decide what data to keep locally, and what data to download just in time.

There are still open research questions that we will tackle in the years to come. For one, I look forward to researching fragmentation strategies within the domain of geospatial data. Our hypothesis is that intermodal route planning – combining modes using road networks and public transport routing – can be achieved when a similar approach for routable tiles can be found. Moreover, solving full-text search by exposing fragments, inspired by how indexes and tree structures are built today, may afford a more optimized information architecture for federated full-text search. Finally, organizational problems also still need to be tackled: which interfaces need to be hosted by whom?
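One candidate fragmentation strategy for such routable tiles is the common web-map tiling scheme, which derives a stable identifier per tile from a coordinate and a zoom level. The url template below is a hypothetical illustration, not a proposed standard:

```python
import math

# Sketch: deriving a stable fragment identifier per geospatial tile, using
# the standard web-map (slippy map) tiling formulas. The url template is
# hypothetical.

def tile_for(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Map a WGS84 coordinate to its (x, y) tile at the given zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_url(lat: float, lon: float, zoom: int) -> str:
    x, y = tile_for(lat, lon, zoom)
    return f"https://tiles.example.org/{zoom}/{x}/{y}"  # hypothetical template
```

Because every tile has one stable url, a route planner can dereference only the tiles along its search frontier, and ordinary http caching does the rest.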

For the next couple of years, I look forward to building and expanding our IDLab team on Linked Data interfaces, and to working further on projects with the organizations that I got to know best. A new generation of PhD researchers is already working on follow-up projects, building a more queryable future for the many, not the few.