Open data in action
Warning: After reading this post you may find yourself wanting to stay for a week in Finland for no apparent reason.
“Open data is a means to an end, not an end in itself,” notes Rufus Pollock from the Open Knowledge Foundation (source). The question is, then, how do we put this means into action to achieve the ends that it might help us achieve? How do we translate open data into action?
Only a few open datasets contain “actionable” data that provides grounds for action. The data that is released into the open may be flooded with irrelevant data that makes relevant data hard to find. Governments take the safe road of disclosing data that is not sensitive, does not shed any light on intra-government processes, and leaves governments as opaque as they were prior to their adoption of open data.
Even the data that contains valuable knowledge may be hard to put to use. Raw and often cryptic data formats used in the public sector may prove particularly difficult to work with for citizens who do not possess the insider knowledge and skills that civil servants have. Not only does the data require access to information and communication technologies, its use also depends on access to education through which users of the data may acquire the technical skills necessary for its effective use. So far, open data initiatives have focused on technically inclined citizens, savvy enough to make use of raw data. If open data is to gain sufficient traction to be transformative for society, we need to fix this flaw and address this challenge, a challenge of inclusiveness.
What we have seen up to now is a fairly limited range of uses of open data. The impact of open data is, in most cases, caught in an echo chamber of new media. Open data serves to “raise awareness”, incite discussions about government decisions, and improve the level of public discourse; all of which feeds back into a dissemination loop. The more data is released, the more discussion it generates, leading to the release of yet more data, while lacking real-world impact. We need data to be used to drive innovative businesses, inform evidence-based policy, or sentence criminals to jail based on findings yielded from the data.
“Open data in action” is the theme of the upcoming Open Knowledge Festival, which will take place during the week from 17th to 22nd September in Helsinki, Finland. If you think you know how to address the issues I have raised and you know how to put open data in action, consider letting others know. Open your knowledge to others, and let the #OKFest couple it with those vested with powers to implement it and ultimately put it into action!
Principled open data
There is a proliferation of principles of open data. Most of them share a similar core and seem to diverge from common predecessors. Principles of open data are usually used to get across the meaning of the concept of “open data”. However, there is no definition of “open data”; as is common with socially constructed concepts, its meaning is embedded in the open data community. In this post I have decided to try to summarize the key characteristics of this concept.
On the way to this goal I took a series of steps:
- Take some of the existing open data principles (sampling)
- Think about their relationships (interlinking)
- Group them (clustering)
- Re-arrange them according to their relationships (classification)
- Infer new principles (extrapolation)
Content
What should be in the data?
Primary: The data should be collected at the source with the highest possible level of granularity. Provide fine-grained data with high resolution and a high sampling rate. Provide raw, uninterpreted data instead of aggregated or derived forms. The public sector should not hold a monopoly on the interpretation of public sector data by providing them reduced to reports, thumbnails of the original data.
Complete: All public data should be made available, except direct and indirect identifiers of persons constituting personally identifiable information and data that need to be kept secret for reasons of national security. Complete datasets should be available for bulk download.
Timely: Release data in a timely fashion. All datasets are essentially snapshots of data streams capturing the current state of an observed phenomenon. Thus, the value of data can decrease over time (e.g., weather data). Also, the value of the methods used to capture the data decreases as the methods become obsolete. It is necessary to publish data as soon as possible to preserve their value. Preferably, the data should be released to the public at the time of their release for internal use. In this way, the data can help achieve real-time transparency and can be treated as a news source.
Conditions of use
What am I allowed to do with the data?
Licence: Unless there are legal instruments that enforce openness of data by default, there is a need for an explicit, open licence. The licence should state clearly what users are allowed to do with the data. An open licence should be non-discriminatory, enabling free reuse and redistribution of the licensed data. The licence should not discriminate against persons or groups, fields of endeavour, or any types of prospective use of the data. It should ignore the differences between users and their intentions. Therefore, the licence should permit any type of reuse, allowing modification and the creation of derivative data, and any type of redistribution, providing access to the data to others. At most, an open licence may require attribution to the original author and redistribution under the same or an analogous licence. In controlled settings, such as in government, establishing a single licence is encouraged to simplify conditions of use for combinations of multiple datasets.
Accessibility
How can I obtain the data?
Discoverability: In order to be able to access a dataset, you need to discover it. That is why data should be equipped with descriptive metadata, such as in a data catalogue. Another way is to make data accessible to machines, such as search engines, that will enable the data to be found.
Accessibility: Data should be available online, retrievable via HTTP GET requests. There should be both an access interface for human users, such as a web site or an application, and an access interface for machine users, such as an API or a downloadable data dump. There should not be any barriers obstructing access to data. There should be no financial cost associated with the use of data, although recovering reasonable marginal costs of data reproduction is OK in a limited number of cases. Users should not be required to register, although requesting users to apply for an API key is OK. There should be no password protection, no strict limits on the number of API calls, and no encryption hindering access to data.
Use
How can I use the data?
Non-proprietary data formats: Open data should use data formats over which no entity has exclusive control. Specifications of open data formats should be free for all to read and implement, subject to no fees or royalties. Using proprietary data formats excludes users of software that has not been allowed to implement support for the data format. Relying on proprietary data formats for storing data comes with the risk of them becoming obsolete, which may prevent archival access.
Standards: Open data should use open, community-owned standards, such as the World Wide Web Consortium's standards. Adhering to a set of common standards makes reuse easier as the data can be processed by a wide array of standards-compliant tools. Standards create expected behaviour, enable comparisons, and ultimately lead to greater interoperability. Standards, such as controlled vocabularies and common identifiers, provide better opportunities for combining disparate sources of data. Consistent use of standards leads to “informal” standards encoded in best practices.
Machine readability: Open data should be captured in a structured and formalized data format that enables automated processing by software. Users should be able not only to display the data, but also to perform other types of automated processing, such as full-text search, sorting or analysis. Machines are data users too, and thus providing data in machine-readable formats means not discriminating against machines. People view data through machines and machines help them with efficient processing of the data. Some people, such as people with disabilities, consume data only via machines, such as screen-readers for users with visual impairment.
Note: The term “machine-readable” is a bit misleading when interpreted strictly. Machines can “read” all digital information. However, some data formats leave open few ways in which the data may be used. For example, binary formats, such as images or executables, do not lend themselves to types of use other than display or execution and as such limit the possibilities of reuse. Therefore, open data should be stored in textual formats (e.g., CSV) with explicit and standard character encoding (e.g., UTF-8).
Safety: Use data formats that cannot contain executable content, which may carry malicious code harmful to the data users. Textual formats, which are recommended for disclosure of open data, are safe to use.
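To illustrate the note above, here is a minimal Python sketch of what a textual format with an explicit encoding makes possible. The CSV payload, its column names, and the figures in it are hypothetical, made up only for this example.

```python
import csv
import io

# Hypothetical sample: a tiny CSV payload as bytes, the way a download would deliver it.
raw_bytes = (
    "municipality,year,budget_czk\n"
    "Plzeň,2011,5200000\n"
    "Brno,2011,9800000\n"
).encode("utf-8")


def read_rows(data: bytes, encoding: str = "utf-8") -> list[dict]:
    """Decode the bytes with an explicit encoding and parse them as CSV."""
    text = data.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(raw_bytes)

# Because the data is structured, sorting and aggregation need no manual work.
by_budget = sorted(rows, key=lambda row: int(row["budget_czk"]), reverse=True)
total = sum(int(row["budget_czk"]) for row in rows)
```

Stating the encoding explicitly is what keeps a name such as Plzeň intact; a scanned image of the same table would allow none of these operations.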
Usability
How well can the data be used?
Presentation: A human-readable copy of data (e.g., a web page) should be available to address the issue of the data divide, alleviating the unequal levels of ability to work with data. Given the differing data literacy skills among users, an effort needs to be made to bring the greatest benefits from the data to the largest number of people and help them make “effective use” of it, as dubbed by Michael Gurstein.
Clarity: The data should communicate as clearly as possible, using plain and accurate language. The descriptions in data should be given in a neutral and unambiguous language that does not skew the interpretation of data. Data should employ meaningful scales that clearly convey the differences in data. To widen the reach of data, use a universal language (e.g., English) and avoid using jargon or technical language unless the terminology is well-defined and adds to the clarity of data. Data should not contain extraneous information and superfluous padding that might distract users from important data or confuse them.
Permanence: Open data should be available in the long term. To ensure the future accessibility of data, the URIs from which the data can be retrieved should be persistent. The sustainability and reliability of data access methods is important because of direct reuse of data, such as in applications built on top of data APIs, when the data cannot be copied or it is not efficient to do so.
Conclusion
When compiling principles of open data, it is difficult to separate data “openness” and data “quality”. The question that we can ask is which features of openness are non-essential and are actually features of a more general good design. I would expect that the importance of different attributes of data openness depends on the use case. Thus, I have not subjected the principles presented above to a coarse narrowing down to those that seemed the most important to me.
The other reason why the principles are presented in a comprehensive way is that they are meant to serve as a tool. Principles of open data describe what should be achieved. This needs to be linked to how it should be achieved. Goals need to be linked to implementations, so that it is straightforward to translate principles into action.
Open data principles should be distilled into policies and recommendations that provide direct guidance and specific steps for implementers. Recommendations should be accompanied with explanations and policies should be connected with the outcomes and benefits to offer motivating reasons to their users. The process of policy creation should be kept iterative, open, and transparent.
Finally, compliance with the policies based on open data principles should be reviewable. There should be tests and control mechanisms in place to put their implementers under scrutiny, because a policy without a way of enforcing it is just a shadow of a policy.
Sources
- Open definition by Open Knowledge Foundation
- 8 principles of open government
- Ten principles for opening up government information
- The three laws of open data by David Eaves
- ACM U.S. Public Policy Committee — recommendations on open government
- Characteristics of smart disclosure in Informing consumers through smart disclosure
- Disclosure and simplification as regulatory tools
- New public sector transparency board and public data transparency principles
- Open government executive directive
- An act relative to the use of open source software and open data formats by state agencies and relative to the adoption of a statewide information policy regarding open government data standards
- AusGOAL qualities of open data
- New Zealand data and information management principles
- Sixteen principles of open government in Open data is civic capital: best practices for open government data
- Key principles of government information by the American Library Association
- Plus my own post on technical openness of open data.
Opening contracted data in the public sector
Public sector sucks at making applications. Look at what applications it creates and look at what applications are created in the private sector, such as e-banking. The difference is huge. A common argument in favour of open government data follows this line of reasoning. Public sector is not able to create useful applications in a cost-efficient way, therefore it should openly publish its data and the applications will flow, produced by the members of public, for free.
See, the problem is that the public sector also sucks at making some data. Some types of data, such as geographical data or extensive surveys, are quite difficult to gather by the means available to the public sector. The solution is to sign a contract with a company that produces the requested data. By outsourcing the acquisition of some types of data the public sector gets what it needs for its functioning. No problem so far.
The problem starts to appear in cases when the companies (often unlike the public sector) see the possibilities for reuse of the data. The companies producing the data are well aware of the ways in which their product can be reused by businesses to generate revenue. It would be stupid of the company to provide the public sector with exclusive rights to the contracted data when it can be resold to other companies. For example, a company producing geospatial data for the cadaster may sell the same data to businesses producing maps. Of course, the public sector might want to get a licence permitting it to redistribute the data, but a contract containing such a requirement would get a much higher price from the supplier, because the supplier would be deprived of the additional income from reselling the data. Opening highly reusable data might be pricey.
It leaves me with a lot of questions, wondering what is the best answer for opening data acquired by the public sector from a commercial supplier that is conscious of the real value of data and reflects it in the price.
At the beginning of the #opendata film, Tom Steinberg from MySociety says:
Open Government Data is any information the Government collects, by and large for their own purposes, that it then makes available for other people to use for their purposes.
The definition of open government data concerns the data that are collected by the government. However, it is not clear whether it only covers data produced by the government itself or whether it includes data supplied to the government by a third party as well. Does the definition of open government data apply even to data that are a result of a public contract? If this is a correct interpretation, is it nevertheless the responsibility of the public sector to contract data in a way that allows the data to be released as open data, even though it might significantly raise the price of the contract? Spending the government's money on this would certainly lower the barrier for starting a business based on such data. And, given its financial constraints, can the public sector afford to contract data in this way?
Acknowledgements: Thanks to Jáchym Čepický for bringing this point to our seminar Open Data and Public Sector: applying Austrian experience in Czech Republic.
Liking is linking
Hello, ladies. Look at your interface for creating linked data:

Now back to an interface used for creating linked data at Facebook:

Now back at your interface. Sadly, it is not like the one from Facebook. Why is that?
The concept of linked data has its page on Facebook. It is identified by the URI http://graph.facebook.com/103761322995229, which is based on the identifier in the Facebook URL (103761322995229). The numeric identifier may be replaced by a “nice”, human-readable string, when the page reaches at least 20 likes. Given the concept URI, one can dereference it to retrieve its RDF representation:
curl -H "Accept:text/turtle" http://graph.facebook.com/103761322995229
Bearing in mind the linked data best practices, it would be even better if there was a redirect set up from the original Facebook URL to this URI. Nevertheless, the important thing is that one can reference Facebook resources from their data. Having Facebook resources equipped with dereferenceable URIs makes them linkable.
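The same content negotiation can be sketched in Python with the standard library. This is only an illustration of the curl call above: the helper name turtle_request is made up, and the request is built but not sent, since whether the endpoint still answers with Turtle is an assumption.

```python
from urllib.request import Request


def turtle_request(uri: str) -> Request:
    """Build a GET request that asks the server for a Turtle representation."""
    return Request(uri, headers={"Accept": "text/turtle"}, method="GET")


req = turtle_request("http://graph.facebook.com/103761322995229")

# Sending is left out on purpose; urllib.request.urlopen(req) would perform the call.
```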
Facebook's Open Graph Protocol features a property likes (URI: http://graph.facebook.com/schema/user#likes) that can be used for relating a resource to an object that the resource likes. The auto-generated, yet human-readable reference for the vocabulary can be found here.
Taking this into account, the act of clicking the Like button for the concept of linked data while being logged in as me (URI: http://graph.facebook.com/jindrich.mynarz) can be treated as equivalent to writing the following triple (in Turtle notation):
@prefix graph: <http://graph.facebook.com/> .
@prefix fbuser: <http://graph.facebook.com/schema/user#> .
graph:jindrich.mynarz fbuser:likes graph:103761322995229 .
Note: the dot character (“.”) is not allowed in local names (such as jindrich.mynarz) in the original Turtle specification; however, the newer version of the specification permits it.
Facebook likely stores its data in a different form; however, as can be seen in the case of Facebook pages and users, in some cases Facebook can surface the data in RDF. Such an assumption can be supported by the practice of using RDF as an exchange format.
What I wanted to show by this example is that by clicking a Like button, you are in fact creating links. Liking is an example of a speech act in which a subject expresses its relation to an object. The subject of the link is the agent (i.e., you, the person acting) and the object is the web page shown.
Using the Facebook Like button is an example of expressing how users feel about something. Facebook allows users to express various feelings about things. Apart from the best known one, liking, users can recommend, and thanks to the recently introduced Facebook actions there is an extensible mechanism for creating new types of relationships that users may describe. Facebook sees this functionality, and quite rightly so, as the “building blocks of Open Graph” (source).
What this reflects is the growing opportunity for crowdsourcing linking. The Facebook Like button serves as an example of an easy-to-use interface for creating linked data that is available to the masses. It shows the potential of adding more complex and machine-readable annotations via simple interfaces. It is a tool for growing the interconnected web of data, describing how the users of the Web relate to its contents. Not to forget that the users of the Web might be machines too. Imagine bots crawling the Web and clicking Like buttons, leaving their traces on the visited places, and you will start to see the possibilities of crawlers connecting the web of data.
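One way to sidestep the question of dots in local names is to write the triple with full URIs, as in the N-Triples syntax. The following Python sketch does that with plain string formatting; the function name like_triple is hypothetical and the output is an illustration, not something Facebook is known to produce.

```python
# Namespaces taken from the Turtle example above.
GRAPH = "http://graph.facebook.com/"
FBUSER = "http://graph.facebook.com/schema/user#"


def like_triple(user: str, page_id: str) -> str:
    """Return one N-Triples line stating that `user` likes `page_id`."""
    return f"<{GRAPH}{user}> <{FBUSER}likes> <{GRAPH}{page_id}> ."


line = like_triple("jindrich.mynarz", "103761322995229")
print(line)
```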
In search for the ontology
It is a common problem. When you want to create RDF data describing a domain that you have never described in RDF before, you need to find a proper RDF vocabulary or ontology that provides sufficient means of expression covering the domain in question. If you have enjoyed a triplification exercise, I am pretty sure you have encountered this obstacle. For instance, when you find a dataset about budgets of municipalities, the first thing you need to do is to find an ontology that you can use to describe budgets, the topic of the dataset. This ontology retrieval problem is not as straightforward as it may seem. It has earned a reputation of being a rather esoteric practice, and on the excellent W3C web site it was aptly named ontology dowsing.
Clearly, finding an ontology should be easier than inventing a new one. There are plenty of approaches to solving this problem, some of which work fairly well when complemented with others. The difficulty in finding a proper ontology poses a hurdle to reuse-oriented data modelling of linked data. Such a barrier leads to a situation where there are a lot of dataset-specific ontologies, the authors of which have taken the path of aligning their data modelling with the structure of legacy datasets instead of aligning it with the available RDF vocabularies and ontologies. In fact, to emphasize the importance of this question, if an ontology cannot be found, it is almost as if the ontology did not exist.
This problem is such a common source of frustration that it has motivated a number of solutions and given rise to lots of questions. Among the approaches taken to solve it, the one that is often mentioned is the use of semantic search engines, such as FalconS, Watson, or (probably the best known) Sindice. These tools usually search across both instance and schema-level data (ABox and TBox), even though FalconS offers a functionality to search only within ontologies. Semantic search engines rely solely on the ontologies being self-descriptive enough to be found. They take into account only the information that is in the ontologies themselves, which may constitute a significant drawback when searching for ontologies containing only a brief description. How can we ameliorate such a state of affairs? It may help to introduce more data.
With the increasing volume of the Linked Open Data Cloud, which contains several billion RDF triples, the possibility of gaining relevant insights about ontologies from instance data has become available. Essentially, by performing statistical analyses measuring how the ontologies are being used, one can get more information that may help in the search for the right ontology. An example of an application based on a survey of ontology adoption, made by crawling and analysing existing data, is vocab.cc, which processed the data from the Billion Triples Challenge 2011 dataset. Examination of a large amount of data seems to be a research practice that is growing in popularity. This statement may be supported by the recently launched LOD2's LODStats project that also shows the list of the most used RDF vocabularies and ontologies. In fact, this kind of data can be used to implement PageRank-like metrics supporting relevancy ranking in ontology search, a feature that might be used to distinguish poor-quality or difficult-to-use ontologies from the established and widely deployed ones.
The method that I believe has not been explored yet is to search for ontologies through datasets. Given the richer description of the data in the LOD Cloud, it could be easier to find datasets covering the domain you are interested in and then see what ontologies they use. Ultimately, if this approach proves to be fruitful, you could store the correspondences between datasets' topics and datasets' ontologies and skip the datasets during the search for ontologies.
Another source of information that powers the ontology search is metadata. Several projects, among which Schemapedia may be recognized as a good example, strive to provide a better ontology search through additional data organizing the ontologies. These registries record information such as the topical classification or provenance metadata that helps users to find the ontologies they need. Linked Open Vocabularies, another example of a project of this type, employs a simple classification for the ontologies that can be browsed following the classification's hierarchical structure. Ontologies are organized into vocabulary spaces (the project's Vocabulary of a Friend introduces the voaf:VocabularySpace class), and so, for example, the Public Contracts Ontology is sorted into the Contracts vocabulary space that belongs to the Market space. Linked Open Vocabularies also incorporates data of the previously mentioned type that are based on statistical analysis of instance data to compute vocabulary metrics such as popularity.
A common problem of the centralized registries is their patchy coverage. The registries require manual curation to update them with new ontologies, to mark ontologies that are no longer maintained, or to remove namespace URIs that no longer resolve. One approach to this issue may be to introduce user-generated content and let users add annotations to ontologies. Although Schemapedia leverages such content by allowing users to assign tags to the listed RDF vocabularies and ontologies, a more complete wiki-like approach used in the CKAN software for building data catalogues might be necessary. Nevertheless, a simpler solution, such as extending the namespace lookup service prefix.cc with user-generated tags that may be looked up as well, might turn out to be just as effective.
A potential bottleneck of all of the methods for finding relevant ontologies may be in the interfaces they expose to their users to support efficient query formulation. The user has an information need (to get an adequate ontology for his or her dataset) and is required to express it using the means of the search system. The easiest way is to specify the need in natural language; however, I think having a dictionary-like lookup providing direct translations from the user's phrases to the ontology's terms is quite a hard problem to solve. What seems to be more feasible is browsing that harnesses the added structure built from tags or controlled vocabularies providing subject indexing for ontologies. However, this type of approach raises the question of whether we will get lost in a rabbit hole of creating ontologies for describing ontologies as we add more and more information to help us find other information. We will see about that.
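A minimal sketch of the kind of usage statistic described above, in Python. It is not how vocab.cc or LODStats are implemented; it only counts how many triples use a predicate from each namespace, approximating the namespace by cutting the predicate URI after its last "#" or "/". The sample triples are made up.

```python
from collections import Counter


def namespace_of(uri: str) -> str:
    """Cut a URI after the last '#' or '/' to approximate its vocabulary namespace."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]


def vocabulary_usage(triples) -> Counter:
    """Count how many triples use a predicate from each namespace."""
    return Counter(namespace_of(predicate) for _, predicate, _ in triples)


# Hypothetical instance data as (subject, predicate, object) tuples.
triples = [
    ("ex:a", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("ex:a", "http://xmlns.com/foaf/0.1/knows", "ex:b"),
    ("ex:b", "http://purl.org/dc/terms/title", "Report"),
]

usage = vocabulary_usage(triples)
ranking = usage.most_common()  # most used vocabulary first
```

Such counts could feed a relevancy ranking, although a PageRank-like metric would also need to take links between vocabularies into account.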
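Storing the correspondences between topics and ontologies could look roughly like the following Python sketch. The catalogue records, their field names, and the example.org URIs are all hypothetical; a real data catalogue would expose this metadata in its own shape.

```python
from collections import defaultdict


def topic_to_ontologies(datasets) -> dict:
    """Build an index from each dataset topic to the ontologies used for it."""
    index = defaultdict(set)
    for dataset in datasets:
        for topic in dataset["topics"]:
            index[topic].update(dataset["ontologies"])
    return dict(index)


# Hypothetical catalogue entries.
catalogue = [
    {"topics": ["budgets"], "ontologies": ["http://example.org/ontology/budget"]},
    {"topics": ["budgets", "municipalities"], "ontologies": ["http://example.org/ontology/admin"]},
]

index = topic_to_ontologies(catalogue)
# Once the index exists, the datasets can be skipped during the search.
budget_ontologies = sorted(index["budgets"])
```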
Computing label specificity
This post has long been shelved in my head. Sometime around the summer of 2011 I started to think again about the problems that arise when you use labels (strings) as identifiers for information retrieval tasks. The ambiguity of labels used as identifiers without the necessary context is a common problem. Consider for example Wikipedia, which is trying to ameliorate this issue by providing disambiguation links to ambiguous label-based URIs. In this case the disambiguation is done by the user, who is provided with more contextual information describing the disambiguated resources.
Gradually I started to be interested not in label ambiguity, but in an inverse property of textual labels: label specificity. What particularly interested me was the notion of computing label specificity based on external linked data. At first, I thought that having an indicator of a label's specificity may be useful when such a label is to be treated as an identifier of a resource. Interlinking came to mind, in which the probability of a linkage between two resources is often computed based on labels' similarity. Another idea was to use label similarity in ontology design, to label parts of an ontology with unambiguous strings.
The more I delved into the topic, the more it started to look like a useless, academic exercise. I wrote a few scripts, did some tests, and thought the topic was hardly worth continuing. Then I stopped, leaving the work unfinished (as it led nowhere). I still cannot think of a real-world use for the approach I have chosen to investigate; however, I believe I have learnt something in the process, something that might stimulate further and more fruitful research.
Label ambiguity and specificity
Let's begin a hundred years ago, when Ferdinand de Saussure was writing about language. Language works through relations of difference, then, which place signs in opposition to one another, he wrote. And, a linguistic system is a series of differences of sound combined with a series of differences of ideas. However, these differences are not context-free as their resolution is context-dependent. Natural language is not as precise as DNS and its correct resolution requires reasoning with context, of which humans are more capable than computers. It raises the question of what would happen if computers were provided with contextual, background knowledge in a sufficiently structured form.
A label is a proxy to a concept and its resolution depends on the label's context. Depending on the context in which a label is used it can redirect to different concepts. Thus, addressing RDF resources with literal labels is a crude method of information retrieval. In most cases, labels alone are not sufficient for unique identification of concepts. The ambiguity of labels makes them unfit to be used as identifiers for resources. However, in some cases labels serve the purpose of identification, and this comes with consequences, the consequences of treating label properties as instances of owl:InverseFunctionalProperty.
From the linguistic perspective, ambiguous labels can be homonyms or polysemes. Homonyms are unrelated meanings that share a label with the same orthography, such as bat as an animal and as a club used in baseball. Polysemes, on the other hand, are related meanings grouped by the same label, such as mouth as a body part and as a place where a river enters the sea.
Computing label specificity then largely becomes a task of identification of ambiguous labels. Given the extensive language-related datasets available on the Web, such task seems feasible. For instance, by using the additional data from external sources, one can verify a link based on a rule specifying that the linked resources must match on an unambiguous label. And vice versa, every method of verification may be reversed and used for questioning the certainty of a given link.
Label specificity impacts the quality of interlinking based on matching labels of RDF resources. Harnessing string similarity for literal values of label properties is a common practice for linking data that is fast and easy to implement. Also, when the data about an RDF resource are very brief and there is virtually no other information describing the resource apart from its label, matching based on computing the similarity of labels may be the only possible method for discovering links in heterogeneous datasets.
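As a rough illustration of label-based interlinking, the Python sketch below pairs resources whose labels are similar enough. It uses difflib.SequenceMatcher from the standard library as one possible similarity measure, not necessarily the one any particular linking tool uses; the 0.9 threshold, the function names, and the sample resources are assumptions made for the example.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """Similarity of two labels in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def candidate_links(source: dict, target: dict, threshold: float = 0.9) -> list:
    """Pair resources whose labels are at least `threshold` similar."""
    return [
        (s_uri, t_uri)
        for s_uri, s_label in source.items()
        for t_uri, t_label in target.items()
        if label_similarity(s_label, t_label) >= threshold
    ]


# Hypothetical datasets mapping resource identifiers to labels.
source = {"ex:1": "Prague", "ex:2": "Bat"}
target = {"ey:1": "prague ", "ey:2": "Baseball bat"}

links = candidate_links(source, target)
```

Note that nothing in this procedure knows whether a matched label is ambiguous, which is exactly the gap that a specificity check is meant to fill.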
This approach to interlinking may work well if the matched labels are specific enough in the target domain to uniquely identify linked resources. In a well-defined and limited domain, such as medical terminology, it makes sense to treat label properties almost as instances of owl:InverseFunctionalProperty and use their values as strong identifiers of the labelled resources. However, in other cases the quality of results of this approach suffers from lexical ambiguity of resource labels.
In cases where the links between datasets are made by a procedure based on similarity matching of resource labels, checking the specificity of the matched labels may be used to find links that were created by matching ambiguous labels and therefore need further confirmation by other methods. Such methods include matching on the label in a different language, in cases where the RDF resources have multilingual descriptions, or even manual examination to verify the validity of the link.
So I thought label specificity could be used as a method of verification for straightforward but faulty interlinking based on resources' labels. I designed a couple of tests that were supposed to examine the tested label's specificity.
Label specificity tests
The simplest of the tests was a check of the label's character length. This test was based on the naïve intuition that a label's length correlates with its specificity: longer labels tend to be more specific because more letters add more context. In some cases, labels that exceed a specified threshold length may be treated as unambiguous.
Just a little bit more complex was the test taking into account whether the label is an abbreviation. If the label contains only capital letters, it is highly likely that it is an abbreviation. Abbreviations used as labels are short and less specific because they may refer to several meanings. For instance, the abbreviation RDF may be expanded to Resource Description Framework, Radio direction finder, or Rapid deployment force (define:RDF). Unless abbreviations are used in a limited context defining a topical domain or language, such as the context of semantic web technologies for RDF, it is difficult to guess correctly the meaning they refer to.
Another test based on an intuition assumed that if a single dataset contains more than one resource with the examined label, the label should be treated as ambiguous. It is likely that in a single dataset there are no duplicate resources and thus resources sharing the same label must refer to distinct concepts.
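The three heuristics above might be sketched in Python as follows. This is only a minimal illustration, not the original implementation: the function names, the threshold of 20 characters, and the case-insensitive comparison of labels are all assumptions, since the post gives no concrete values.

```python
from collections import Counter
from typing import Iterable, Set

# Hypothetical threshold; the post does not state a concrete value.
LENGTH_THRESHOLD = 20


def is_long_enough(label: str, threshold: int = LENGTH_THRESHOLD) -> bool:
    """Length test: treat labels at or above the threshold as unambiguous."""
    return len(label.strip()) >= threshold


def looks_like_abbreviation(label: str) -> bool:
    """Abbreviation test: every letter in the label is a capital.

    str.isupper() ignores digits and punctuation, so "R2D2" also passes.
    """
    return label.isupper()


def duplicated_labels(labels: Iterable[str]) -> Set[str]:
    """Duplicate test: labels carried by more than one resource in a dataset.

    Labels are compared case-insensitively, which is an assumption.
    """
    counts = Counter(label.strip().lower() for label in labels)
    return {label for label, count in counts.items() if count > 1}
```

A label such as RDF would be flagged by the abbreviation test, while bat would fail the length test and would need one of the checks against external data.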
I thought about a test based on a related supposition, that the number of translations of the tested label from one language to another also indicates its specificity. However, at the time I was thinking about this, I had not discovered any sufficient data sources. Google Translate had recently closed its API, and other dictionary sources either offered no machine-readable output or were not open enough. Not long ago I learnt about the LOD in Translation project, which queries multiple linked data sources and retrieves translations of strings based on multilingual values of the same properties for the same resources. The number of translations of bat (from English to English) it returns supports the inkling that it is not the most specific label.
Then I tried a number of tests based on data from external sources that explicitly provide data about the various meanings of the tested label. I started with the so-called Thesaurus, which offers a REST API for retrieving the set of categories to which a given label belongs. Thesaurus supports four languages: English, Italian, French, and Spanish. However, the best coverage is available for English.
The web service accepts a label and the code of its language, together with the API key and a specification of the requested output format (XML or JSON). For example, for the label bat it returns 10 categories representing different senses in which the label may be used.
Then I turned to the resources that provide data in RDF. Lexvo provides RDF data from various linguistic resources combined together, building on datasets such as Princeton WordNet, Wiktionary, or Library of Congress Subject Headings.
Lexvo offers a linked data interface and it is possible to access its resources also in RDF/XML via content negotiation. The Lexvo URIs follow a simple pattern http://lexvo.org/id/term/{language code}/{label}. In order to get the data to examine a label's specificity, retrieve the RDF representation of the given label, get the values of the property http://lexvo.org/ontology#means from the Lexvo ontology, and group the results by data provider (cluster them according to their base URI) to eliminate the same meanings coming from different datasets included in Lexvo. For example, the URI http://lexvo.org/id/term/eng/bat returns 10 meanings from WordNet 3.0 and 4 meanings from OpenCyc.
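The grouping step could look roughly like the sketch below. It assumes the values of the means property have already been extracted from the RDF as a list of URIs. The function names are hypothetical, the host name is used as a stand-in for the base URI, and taking the largest per-provider count as the number of senses is one possible reading of the deduplication, not necessarily the one used originally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def group_by_provider(meaning_uris: Iterable[str]) -> Dict[str, List[str]]:
    """Cluster meaning URIs by host name, used here as a stand-in for base URI."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for uri in meaning_uris:
        groups[urlsplit(uri).netloc].append(uri)
    return dict(groups)


def sense_count(meaning_uris: Iterable[str]) -> int:
    """Estimate the number of senses as the largest count from any one provider.

    Assumption: providers describe overlapping senses, so adding their
    counts together would overstate the label's ambiguity.
    """
    groups = group_by_provider(meaning_uris)
    return max((len(uris) for uris in groups.values()), default=0)


# Placeholder URIs for illustration only; these are not real Lexvo output.
example = [
    "http://wordnet.example/synset-1",
    "http://wordnet.example/synset-2",
    "http://cyc.example/concept-1",
]
print(group_by_provider(example))
print(sense_count(example))
```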
Of course I tried DBpedia, one of the biggest sources of structured data, as well. Unlike Lexvo, DBpedia provides a SPARQL endpoint. In this way, instead of requesting several resources and running multiple queries, the information about the ambiguity of a label based on Wikipedia's disambiguation links can be retrieved by a single SPARQL query, such as this one:
    PREFIX dbpont: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?resource ?label
    WHERE {
      ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
        rdfs:label "Bat"@en
      ] .
      ?resource rdfs:label ?label .
      FILTER(LANGMATCHES(LANG(?label), "en"))
    }
After implementing similar tests of label specificity, I was not able to find any practical use case for such functionality. I left this work, shifting my focus to things that looked more applicable in the real world. I still think there is something to it, and maybe I will return to it some day.
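For anyone who wants to pick this up, the DBpedia query above could be parameterised for arbitrary labels along the following lines. This is a sketch under assumptions: the endpoint URL https://dbpedia.org/sparql and the helper names are not taken from the original experiment, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Assumed public DBpedia endpoint; the post does not name the URL it used.
ENDPOINT = "https://dbpedia.org/sparql"

# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """
PREFIX dbpont: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?resource ?label
WHERE {{
  ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
    rdfs:label "{label}"@{lang}
  ] .
  ?resource rdfs:label ?label .
  FILTER(LANGMATCHES(LANG(?label), "{lang}"))
}}
"""


def build_query(label: str, lang: str = "en") -> str:
    """Fill the label and language tag into the query template."""
    # Escape backslashes and double quotes so the label stays one literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(label=escaped, lang=lang)


def build_request_url(label: str, lang: str = "en") -> str:
    """Build a GET URL using the standard 'query' parameter of the SPARQL protocol."""
    return ENDPOINT + "?" + urlencode({"query": build_query(label, lang)})


print(build_request_url("Bat"))
```

The number of distinct resources in the result could then serve as a rough ambiguity score for the label.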
Technical openness of open data
Apart from the legal requirements on open data, there are also aspects of technical openness. While the legal aspects are explicitly defined by the Open Definition, there is less understanding of the technical recommendations for making data open. Some principles of this side of openness are covered by the Three laws of open data by David Eaves, others are proposed in the Linked Open Data star scheme. An excellent resource that touches on both legal and technical requirements for open data is 8 open government data principles.
Data need to be formalized so that we can serialize them to representations that may be exchanged. However, there are different formalizations that may be used for communicating data, different formats that are more or less open. I think open technologies for representing data share a set of family resemblances. So, open data are:
- Non-exclusive
- Open data are not published exclusively for a particular application. No application has exclusive access to open data. Instead, they are available to be used by any application and thus support a wide range of uses.
- Non-proprietary
- No entity has exclusive control over non-proprietary data formats. Such formats have an open specification that may be implemented by anyone. Therefore, data in these formats are not tightly coupled with specific software that is able to read them.
- Standards-based
- The data are based on open, community-owned standards. This means the standards are developed in an open process that may be joined by anyone from the public (i.e., not Schema.org). Such standards prescribe a set of rules the data have to adhere to. Standardized data have an expected format, which ensures interoperability, and as such can be used by a plethora of standards-compliant tools.
- Machine-readable
- Open data are formalized enough so that machines are able to use them. Well-formalized data have a structure that enables their automated machine processing. For instance, unlike a scanned document stored as an image, which is one opaque blob, open data have a higher granularity because they are segmented into well-defined data items (e.g., rows, columns, triples).
- Findable
- Open data should be publicly available on the Web. This means having URLs that successfully return a representation of the data. Data should be directly accessible by resolving their URLs. Any technical barriers that prevent access to data, such as passwords or required registration, are unacceptable, as are any attempts to hide the data and achieve security through obscurity via anti-SEO techniques. As David Eaves puts it, if Google cannot find it, no one can.
- Linkable
- Elements of open data should be identified with URIs. In this way it is possible to link to them. This approach encourages re-use, data integration, and proper attribution of data used as a source.
- Linked
- If your open data are linked to other open data, users can follow these links to discover more. Being a part of the Web of data brings the benefits yielded by the network effects.
As you might have guessed from the previous points, I think that linked data is a very open technology. And, if you look at the 5 stars of linked data, its author Tim Berners-Lee thinks the same. So if you want to make your data more open, linked data is a step in the right direction.
Open bibliographic data checklist
I have decided to write a few points that might be of interest to those thinking about publishing open bibliographic data. The following is a fragment of an open bibliographic data checklist, or, how to release your library's data into the public without a lawyer holding your hand.
I have been interested in open bibliographic data for a couple of years now, and I try to promote them at the National Technical Library, where we have, so far, released only an authority dataset, the Polythematic Structured Subject Heading System. The following points are based on my experiences with this topic. What should you pay attention to when opening your bibliographic data then?
- Make sure you are the sole owner of the data or make arrangements with other owners. For instance, things may get complicated if the data were created collaboratively via shared cataloguing. If you are not in complete control of the data, then start by consulting the other proprietors that have a stake in the datasets.
- Check that the data you are about to release are not bound by contractual obligations. For example, you may publish a dataset under a Creative Commons licence, only to realize soon after that there are unresolved contracts with parties that helped fund the creation of that data years ago. Then you need to discuss this issue with the involved parties to find out whether making the data open is a problem.
- Read your country's legislation to get to know what you are able to do with your data. For instance, in the Czech Republic it is not possible to put data into the public domain intentionally. The only way public domain content is created is by the natural order of things, i.e., the author dies, leaves no heir, and after quite some time the work enters the public domain.
- See if the data are copyrightable. For instance, if the data do not fall into the scope of the copyright law of your country, they are not suitable to be licenced under Creative Commons, since this set of licences draws its legal force from copyright law; it is an extension of copyright and it builds on it. Facts are not copyrightable and most bibliographic records are made of facts. However, some contain creative content, for example, subject indexing or an abstract, and as such are appropriate for licencing based on copyright law. Your mileage may vary.
- Consult the database act. Check if your country has a specific law dealing with the use of databases that might add more requirements that need your attention. For example, in some legal regimes databases are protected on another level, as an aggregation of individual data elements.
- Different licencing options may be applicable for the content and the structure of a dataset, for instance when there are additional terms required by database law. You can opt for dual licensing and use two different licences, one for the dataset's content that is protected by copyright law (e.g., a Creative Commons licence), and one for the dataset's structure for which copyright protection may not apply (e.g., Public Domain Dedication and License).
- Choose a proper licence. A proper open licence is a licence that conforms with the Open Definition (and will not get you sued), so pick one of the OKD-Compliant licences. A good source of solid information about licences for open data is Open Data Commons.
- BONUS: Tell your friends. Create a record in the Data Hub (formerly CKAN) and add it to the bibliographic data group to let others know that your dataset exists.
Even if it may seem there are lots of things you need to check before releasing open bibliographic data, it is actually easy. It is a performative speech act: you only need to declare your data open to make it open.
<disclaimer>If you are unsure about some of the steps above, consult a lawyer. Note that the usual disclaimers apply to this post, i.e., IANAL.</disclaimer>
Turning off feed reader
Today I have decided to stop using my feed reader. My use of it has diminished over a long period of time and I no longer think it's an optimal tool for the way I like to discover information.
In my view, feeds, whether they're from blogs, news sites or of any other origin, contain just too much noise. You need to go through all of the items in your subscribed feeds yourself. It's information filtering on the client side. Feed readers don't allow for the fine-grained filtering I would like to be able to do, and thus, they are blunt instruments for information discovery.
Reading feeds may also lack serendipitous discovery. I'm rarely surprised when I read my feeds. On the other hand, on Twitter I get interesting pointers to various resources much more frequently due to the ways information spreads through the network of Twitter users before it finally reaches me (e.g., retweets).
Because of these shortcomings my primary platform for information acquisition is Twitter now. I don't read feeds, newspapers, magazines, watch TV news and the like. I have given up trying to achieve even near-complete coverage of the topics I'm interested in and instead I sample and skim-read my Twitter stream.
Twitter provides me with a manageable stream of highly relevant information resources that I'm usually able to process and digest. It offers me serendipitous discoveries I wouldn't have come across when using feed readers. Also, I like to sample from a wide range of resources on different topics and Twitter caters for that quite well.
I have changed my information consumption habits. In a sense, I have switched to probabilistic information retrieval. I know that I can't get complete coverage of the subject areas I'm interested in. I'm conscious that I miss something, but I'm fine with that. I believe that if the information is important enough, it will come back to me. If I don't catch something, I trust my network on Twitter to make me pay attention to it by mentioning it, re-tweeting it, and re-discovering it for me.
On Twitter my information filter is the network of the people I follow. The key difference is that when reading feeds you're using people as content creators, whereas on Twitter you're using them as content curators. It's filtering on a meta level: instead of filtering information yourself you filter the people that are filtering information for you. Your responsibility is to curate the list of Twitter users you follow. However, if you want to be an active member of the Twitter ecosystem you curate, share, and forward information for your followers.
On the Web there are many information channels and trying to follow all of them results in a fragmentation of one's attention. Reading lots of information resources is time-consuming, and content is often duplicated and therefore demands strenuous filtering. Context switching between different media is also expensive for one's cognitive abilities.
In an attention economy we decide how we spend our resources of attention. While marketing uses targeting to reach relevant audiences, we do reverse targeting when we expose ourselves as targets to media of our choice. Choosing a single, yet heterogeneous, information acquisition channel, such as Twitter, may lead to a defragmentation of our attention, and thus it may be a step towards more efficient allocation of one's attention.
The switch from feed readers may be a general trend. I think that information acquisition via feed readers was in part surpassed by social media and the ubiquitous sharing of content on the Web (tweets, likes, plus ones, recommendations, etc.). One of the questions asked by the media theorist Marshall McLuhan in his tetrad of media effects was "What does the medium make obsolete?" If we ask what Twitter makes obsolete, the answer may well be feed readers.
That said, I still think feeds are indispensable when it comes to information acquisition for machines, such as web applications and the like. Feeds are well suited for machines to exchange information. Unlike in humans, attention isn't a scarce resource for machines. Machines can read all items in feeds. But people need more human ways of discovering new information as they have limited resources of attention. I think Twitter delivers on that.
Spoonfeeding Google with RDF graphs packaged as trees
During a small side project I've found out that Google Rich Snippets Testing Tool doesn't treat RDFa as RDF (i.e., a graph) but rather as a simple hierarchical structure (i.e., a tree). It doesn't take into account links in RDFa, but only the way HTML elements are nested inside one another. More about the difference between the data models of graph and tree can be found in a blog post by Lin Clark.
I've created two documents that give the same RDF when you run RDFa distiller on them. Both contain GoodRelations product data, but the difference between them is that in the first document the HTML element describing the price specification (gr:UnitPriceSpecification) is not nested inside the HTML element describing the offering (gr:Offering); instead, the offering links to it via the gr:hasPriceSpecification property. In the second document the HTML element with the price specification is nested in the element about the offering.
Even though the documents contain the same data, Google Rich Snippets Testing Tool parses them differently and refuses to show a preview of the search result in the case of the first document, whereas the second document produces a preview. In the first case, the price information is not recognized because it's not nested inside the HTML element describing the offering and thus a warning is shown:
Warning: In order to generate a preview, either price or review or availability needs to be present.
This leads me to believe that Google Rich Snippets Testing Tool doesn't parse RDFa as RDF, but as a tree (much like a DOM tree), effectively the same way as HTML5 microdata, which is built on the tree model. Google doesn't use RDFa as RDF, but as microdata.
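The difference between the two readings can be illustrated with a toy model in Python. This is not how Google's tool or any RDFa parser is actually implemented; the dictionary layout, the identifiers #offering and #price, and the function names are made up for the illustration. Only the GoodRelations terms come from the documents described above.

```python
from typing import Dict, Optional

PRICE_TYPE = "gr:UnitPriceSpecification"
PRICE_LINK = "gr:hasPriceSpecification"


def find_price_as_tree(offering: dict) -> Optional[dict]:
    """Tree reading: only look at elements nested inside the offering."""
    for child in offering["children"]:
        if child["type"] == PRICE_TYPE:
            return child
    return None


def find_price_as_graph(offering: dict, by_id: Dict[str, dict]) -> Optional[dict]:
    """Graph reading: follow the link, wherever the target element sits."""
    return by_id.get(offering["links"].get(PRICE_LINK))


price = {"id": "#price", "type": PRICE_TYPE, "links": {}, "children": []}
by_id = {"#price": price}

# First document: the price is a sibling element, reachable only via the link.
linked = {"id": "#offering", "type": "gr:Offering",
          "links": {PRICE_LINK: "#price"}, "children": []}

# Second document: the same link, but the price element is also nested.
nested = {"id": "#offering", "type": "gr:Offering",
          "links": {PRICE_LINK: "#price"}, "children": [price]}

print(find_price_as_tree(linked) is None)            # tree reading misses the price
print(find_price_as_graph(linked, by_id) is price)   # graph reading finds it
print(find_price_as_tree(nested) is price)           # nesting satisfies both readings
```

In this model both documents describe the same graph, yet a consumer that only walks nested elements sees the price in the second document alone, which matches the behaviour observed in the testing tool.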
Eric Hellman wrote a blog post about spoonfeeding data to Google. Even though Google still accepts some RDF (e.g., GoodRelations) after the announcement of microdata-based Schema.org, it wants to be spoonfed RDF graphs packaged as microdata trees. Does it mean that if Google is your primary target consumer for your data, you shouldn't bother with packaging your RDF in trees, but rather directly provide your data as a tree in HTML5 microdata?


