Chapter 1 – Introduction

If you want to go fast, go alone. If you want to go far, go together.
― African proverb.

How far do you live from work? Keep your answer in mind. Did you express it in minutes or in kilometers? Whenever I asked an audience this question, a significant number of people answered with a distance in kilometers, while others answered with a travel time. Now imagine a software program that has to calculate the travel time from one point to another for an end-user. Just imagine the number of datasets that could be used to come up with a good response to that question… For 4 years I have been working on projects that had one goal in common: sharing data for an unknown number of use cases and an unknown number of users.

In this chapter we first discuss the research question. Then we discuss in more detail the projects that contributed to this research and the structure of the rest of this book.

Research question

I studied lightweight interfaces for sharing open transport datasets; the term open itself is defined in the next chapter. The term Open Transport Data, on the one hand, entails the goal of maximizing the reuse of your transport datasets. Transport data is used as a focus, yet there is no clear distinction between transport data and other kinds of data. As an illustration, even datasets like criminality rates could at some point be used to provide a better route planning experience. Only in Chapters 5 and 6 will we dive into the specifics of the transport domain. One clear incentive to publish for maximum reuse is informing commuters better: a public transport agency wants to make sure the latest updates to its transit schedules are available in every possible end-user interface. Another clear incentive, for governmental organizations in particular, would be policy decisions: datasets maintained by public administrations should be published “once-only” and become as used and useful as possible, as part of their core task. Publishing data for maximum reuse is an automation challenge: ideally, software written to work with one dataset works equally well with datasets published by a different authority. We can lower the cost of adopting datasets and of automating data reuse when we raise the interoperability between all datasets published on the Web.

The term lightweight interfaces, on the other hand, entails that when publishing the data for maximum reuse – and thus when this data becomes widely adopted – the server interface does not come with an ever-growing publishing cost. We observe today that there are two common ways to share transport data. The first way is to provide an export of all facts in one dump, which reusers can use to answer any question, but slowly. The second way is to provide a data service, which reusers can use to answer a set of specific questions quickly. The goal of the Linked Connections framework, introduced in Chapter 6, is to experiment with the trade-offs between the effort required from reusers, the questions that reusers can answer quickly, and the cost-efficiency of the data publishing interfaces when reuse grows.
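The trade-off between these interfaces can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy connection list, the function names, and the one-hour page size are hypothetical, and the third function only hints at the idea of fragmenting a dataset; it is not the Linked Connections specification described in Chapter 6.

```python
from collections import defaultdict

# Toy dataset (made up for illustration):
# (departure_stop, arrival_stop, departure_minute, arrival_minute)
CONNECTIONS = [
    ("A", "B", 480, 495),
    ("B", "C", 500, 520),
    ("A", "C", 545, 580),
    ("C", "D", 590, 610),
]


def dump():
    """Data dump: one export of all facts.

    Cheap to host, but the reuser must download and process everything
    before answering any question.
    """
    return list(CONNECTIONS)


def service(origin, destination, after):
    """Data service: the server answers one specific question itself.

    Hypothetical question: the direct connection from origin to
    destination that departs at or after `after` and arrives first.
    Every new kind of question needs new server-side work.
    """
    candidates = [
        c for c in CONNECTIONS
        if c[0] == origin and c[1] == destination and c[2] >= after
    ]
    return min(candidates, key=lambda c: c[3], default=None)


def fragments(page_minutes=60):
    """In-between option: small pages of raw facts, grouped by departure time.

    The server only slices the data; the reuser fetches the pages it
    needs and runs its own algorithm on them.
    """
    pages = defaultdict(list)
    for c in sorted(CONNECTIONS, key=lambda c: c[2]):
        page_start = (c[2] // page_minutes) * page_minutes
        pages[page_start].append(c)
    return dict(pages)
```

In this sketch, a reuser interested in departures from minute 540 onward only has to fetch the page starting at 540, while the server never has to know which question is being asked.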

Hence, the research question of this PhD is: “how can the data source interoperability of public datasets be raised, in order to lower the cost for adoption in route planning services?”

In order to answer this question, this book will go broader than merely discussing the technical aspects of data sources. I worked in close collaboration with communication scientists to study the management and publishing of data sources qualitatively as well. This might make this research question atypical for a PhD within software engineering.

Assumptions and limitations

This work assumes HTTP is the uniform interface for Open Data publishing. Other protocols exist (e.g., FTP or e-mail), yet the scale of the adoption of HTTP servers for Open Data is irreversible. If you are reading this dissertation at a time in the future when the HTTP protocol is no longer used (which, at the time of writing, I would describe as “unlikely”), I would first of all have to admit that this assumption in my PhD is terribly wrong. However, the experiments described in the next chapters would also work for other protocols: an identifier strategy would still be needed over this new protocol (Chapter 3 and Chapter 4), as well as a way to fragment these datasets (Chapter 6).

A second assumption is that the datasets that can be published will not have any privacy constraints. In practice, from the moment a dataset may contain something that may identify people, it needs to go through a privacy check before it can be published publicly. As this is a different research area, we made the assumption that every dataset mentioned in this PhD either has no personal data, or has gone through the necessary checks in order to be disseminated.

A limitation of the current work – yet part of future work – is that we will be focusing on public transport routing and will not calculate routes over road networks. Calculating routes over a road network can, however, be solved in a similar fashion, following the principles described in the conclusion.

Finally, the last limitation is that we describe the cost for adoption, yet do not describe an economic model to calculate such a cost. Instead, we assume that by raising the interoperability, the cost for adoption will lower.

The chapters and publications

When I started research at what was back then still called the MultiMediaLab, I was handed a booklet called “Is This Really Science? The Semantic Webber’s Guide to Evaluating Research Contributions” [1] written by Abraham Bernstein and Natasha Noy. In that booklet, a quote from Ernest Rutherford was used: “All science is either physics or stamp collecting”. This bold statement illustrates a useful distinction between two types of research: one that studies a phenomenon and creates hypotheses about it and the other that catalogues and categorizes observations. The next three chapters will show how we see and categorize the domain of data publishing, in order to see more clearly how we can contribute to this domain. In Chapter 6, we introduce and evaluate Linked Connections, in order to prove that it is indeed more cost-efficient to host.

I based my dissertation on a collection of papers that I have (co-)authored. Yet, it also brings together findings from deliverables of the projects I have been part of, invited talks I have given, blog posts I have written, and the side projects I have been side-tracked by out of curiosity. A short overview of the chapters and what they are based on is given below:

Chapter 2 – Open Data and interoperability: This chapter is based on my explanation of Linked Open Data, which I have been teaching over the course of my research position.
Chapter 3 – Measuring interoperability: This chapter is based on the first journal paper I authored for the Computer journal [2]. In the paper, I tried to quantify the interoperability of governmental datasets, in order to find out which datasets on an Open Data Portal would need more work.
Chapter 4 – Raising interoperability of governmental datasets: This chapter describes three projects, each of which was valorized in a publication. One project was on creating better Open Data Portals [3], another on an Open Data policy for the DTPW [4], and a third project was on creating a Linked Data strategy for local council decisions as a way to simplify administrative tasks [5].
Chapter 5 – Transport data: This chapter comes in two parts: a part about data on the road, which is based on a publication on more recent work about real-time parking availabilities [6], and data about public transit route planning, which is based on the related work of the paper introducing Linked Connections [7].
Chapter 6 – Public Transit route planning over lightweight Linked Data interfaces: Finally, the last chapter before the conclusion was based on four papers that also nicely illustrate the thought process over time. My PhD Symposium paper written back in 2013 [8] illustrated that I wanted to be able to answer any kind of route planning question (sic) over the Web of data. In 2015 I published a demo paper [9] that would explain in a proof of concept, that I meant to give the client more freedom to calculate routes the way they like. In 2016, I extended this proof of concept with wheelchair accessibility features [10] and looked into what would be more efficient for the information as a whole. Finally, in 2017 I published the paper evaluating the cost-efficiency of this new lightweight interface [7].

Innovation in route planning applications

In 2016, everyone wanted an app. Not having an app would mean not being able to call yourself a digitally advanced transit agency, even if the agency already had a website with exactly the same functionality. Tim Berners-Lee, in 1989, concluded his proposal for the Web with the advice that we should focus on creating a better information system that works for anything and is portable, rather than once again working on fancy new graphic techniques and complex extra facilities that do not tackle the root of the problem. Today, his conclusion could not be more on topic.

We do not need a separate route planning app for each transit agency. If you asked smartphone users, they would tell you they only need one application that returns a route from one place to another. We can see evidence of public transit authorities that understand this need: public transit agencies share data among each other to include it in each of their own apps. However, instead of a solution, this now becomes a quadratic problem, in which each agency has to share its data with each other agency that it deems relevant. If we want an app that works world-wide, this approach will not scale. Moreover, we become dependent on the goodwill of the public transit agencies to implement features for specific use cases. Take for example the ability to take into account your specific set of subscriptions, the ability to take into account wheelchair accessibility, or the ability to assist you in planning your next multi-day international trip.
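The quadratic growth mentioned above can be made concrete with a small calculation. This is a sketch under a simplifying assumption that is not in the text: every agency considers every other agency relevant, so each ordered pair of agencies needs its own data exchange. The function names are hypothetical.

```python
def pairwise_exchanges(n_agencies: int) -> int:
    """Each agency delivers its data to every other agency: n * (n - 1) feeds."""
    return n_agencies * (n_agencies - 1)


def open_publications(n_agencies: int) -> int:
    """Each agency publishes its data once on the Web for any reuser: n feeds."""
    return n_agencies


for n in (2, 10, 100):
    print(n, pairwise_exchanges(n), open_publications(n))
# 2 2 2
# 10 90 10
# 100 9900 100
```

Under this assumption, going from 10 to 100 agencies multiplies the number of bilateral exchanges by 110, while the number of open publications only grows tenfold.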

At the time of writing, Google Maps is the most popular route planning application. Organically, we see that public transit companies understand this need, and share their data with Google Maps. While digital citizens regard this move as a long-overdue step in the right direction, it is still questionable whether only Google Maps should receive this data. Giving one company the monopoly on creating a route planning experience for 100% of the population is not the solution either. Instead, we advocate for Open Data: everyone should be able to create and integrate a route planner in their own service offering.

My TEDxGhent 2014 talk sketches these problems in three minutes.

How exactly such an open dataset should be published is the subject of this book. Datasets need to be integrated in various “views”, which all work on top of a similar route planning Application Programming Interface (API). However, we should also be able to create different route planning APIs ourselves for use cases that transit agencies did not have in mind when publishing the data. This entails that the raw data should be published, and not only the answers to advanced questions.

The projects

This book has been written while working on European, Flemish, and bilaterally funded research projects. In November 2012, my first month at the MultiMedia Lab (now Internet and Data Lab), I was tasked with the further development of The DataTank. The DataTank is open source software to open up datasets over HTTP, while also adding the right metadata to these datasets. The first version of The DataTank was further developed thanks to a project at Westtoer, which needed a single point of reference for its tourism datasets. After that, the project for EWI helped further shape this project into simple data portal software. The DataTank was initially created at Open Knowledge Belgium and iRail (for a background, read the preface) and was installed for the open data portals of, among others, Flanders, the region of Kortrijk, Antwerp, and Ghent. Today, research on this project has ceased and commercial support is available via third parties.

In the first two years, I was funded on projects in the domain of e-government, where there was a need for better data dissemination. In all the aforementioned projects, the 5 stars of Linked Open Data was used as a framework. However, we would never reach the 5 stars and would always be confronted with a glass ceiling: why would 3 stars not be enough? In these two years, we thus mainly described data in various formats, often without well-aligned data models, using the DCAT specification, which at the time was still being built.

Apps for Europe was a different kind of project. Its goal was to support governmental organizations with their first steps towards Open Data and to organize co-creation events. The project provided me with a travel budget to visit, among others, Berlin, Manchester, Amsterdam, Paris, and Switzerland. It provided me with an understanding of when developers would start to use governmental datasets, and gave me a first insight into how and why public administrations maintain certain datasets.

In the next projects, we created our own way of evaluating open data policies by introducing a framework to study the data source interoperability. This aligns with the goal to maximize reuse, and thus to provably raise the interoperability of data sources. Times were changing: instead of under e-government, Open Data would now be more eagerly funded under the umbrella of Smart Cities. The first project that was not linked to e-government was the ITS VIS project with ITS.be, a public-private partnership that works on the European directive regarding Intelligent Transport Systems (ITS). It set up a data portal at data.its.be, and also took steps in the direction of interoperable semantics by transforming the DATEX2 specification to a Linked Data vocabulary.

In 2015, I had the opportunity to study the organizational challenges, together with communication scientists, at the Department of Transport and Public Works (DTPW) of the Flemish government. Three European directives (PSI, INSPIRE, and ITS), extended with the department's own insights, created a clear willingness to publish data for maximum reuse. How to implement such an Open Data strategy in a large organization was still unclear. After interviewing 27 data owners and directors, we arrived at a list of recommendations for next steps on all interoperability levels. This project marked the start of many more projects that would need our help in publishing data for maximum reuse, such as the local council decisions as Linked Data project, the Smart Roeselare project, the reuse assessment for the Flemish Institute for the Sea, and finally the Smart Flanders program.

An overview of the projects I was part of from November 2012 until June 2017 when at Internet and Data Lab.

Name	Period	Description and impact
Westtoer tourism	2012	Westtoer advanced The DataTank. Tourism Open Data portal created http://datahub.westtoer.be
An experimental data publishing platform for EWI	2012–2013	Advanced The DataTank with a connection to the popular CKAN, created the DCAT-AP extension. Shaped the current features of The DataTank
Apps for Europe	2012–2014	Advanced The DataTank and was the first project to bring together various … Apps for Ghent is still organized each year
Open Transport Net	2014–2016	A project on DCAT, data portals and Open Transport Data.
Flemish Innovation Study for ITS.be	2014–2017	Setting up a data portal and advancing the DATEX2 specification towards a Linked Data ontology
An open data vision for the Department of Mobility and Public Works	2015	A first project to methodologically raise the impact and interoperability of datasets to be published at the Flemish government.
Local Decisions as Linked Data	2016	Publishing local council decisions as Linked Data for administrative simplification
The OASIS team	2016–2018	Sharing experience between Ghent and Madrid on publishing Linked Data about public services and transport data.
Smart Roeselare	2017	Supporting the City of Roeselare with their Smart City vision to make more reuse of their data.
Raising the reuse of datasets maintained by the Flemish institute for the Sea	2016–2017	Towards an Open Sea Data innovation lab by making their data used and useful.
Smart Flanders	2017–2020	Supporting the 13 Flemish center cities and Brussels to publish real-time open data.

Each project contributed in its own way to this dissertation. While some gave me access to research subjects, other projects gave me the freedom to explore research directions I thought were worth pursuing.

References

[1]: Bernstein, A., Noy, N. (2014). Is this really science? the semantic webber’s guide to evaluating research contributions. Technical report.
[2]: Colpaert, P., Van Compernolle, M., De Vocht, L., Dimou, A., Vander Sande, M., Verborgh, R., Mechant, P., Mannens, E. (2014, October). Quantifying the interoperability of open government datasets. Computer (pp. 50–56).
[3]: Colpaert, P., Joye, S., Mechant, P., Mannens, E., Van de Walle, R. (2013). The 5 Stars Of Open Data Portals. In proceedings of 7th international conference on methodologies, technologies and tools enabling e-Government (pp. 61–67).
[4]: Colpaert, P., Van Compernolle, M., Walravens, N., Mechant, P. (2017, April). Open Transport Data for maximizing reuse in multimodal route planners: a study in Flanders. IET Intelligent Transport Systems. Institution of Engineering and Technology.
[5]: Buyle, R., Colpaert, P., Van Compernolle, M., Mechant, P., Volders, V., Verborgh, R., Mannens, E. (2016, October). Local Council Decisions as Linked Data: a proof of concept. In proceedings of the 15th International Semantic Web Conference.
[7]: Colpaert, P., Verborgh, R., Mannens, E. (2017). Public Transit Route Planning through Lightweight Linked Data Interfaces. In proceedings of International Conference on Web Engineering.
[8]: Colpaert, P. (2014). Route planning using Linked Open Data. In proceedings of European Semantic Web Conference (pp. 827–833).
[9]: Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R. (2015). Intermodal Public Transit Routing using Linked Connections. In proceedings of International Semantic Web Conference (Posters & Demos).
[10]: Colpaert, P., Ballieu, S., Verborgh, R., Mannens, E. (2016). The impact of an extra feature on the scalability of Linked Connections. In proceedings of ISWC 2016.