
RDF and Linked Data Learning Resources

Charles Vardeman edited this page Apr 21, 2016 · 5 revisions


The hackers4peace group maintains a repository called "rdf-for-object-oriented-heads" as a scratch space for RDF and linked data information.

Online reading

A nice visualization of the linked data "cloud" is available from http://linkeddata.org/.

RDF

The Resource Description Framework (RDF) is a framework for representing and linking data within the World Wide Web infrastructure (RDF 1.1). It is a graph-based data model whose core structure, in the abstract syntax, is a set of triples, each consisting of a subject, a predicate, and an object. The idea is that we want relationship identifiers that form a "knowledge graph" and represent things, not strings.

There are several serialization syntaxes for storing and exchanging RDF, such as Turtle and [JSON-LD](https://www.w3.org/TR/json-ld/). When creating new RDF resources, Turtle is regarded as the more human-readable serialization, although JSON-LD has the advantage of being valid JSON. Servers should implement content negotiation in order to serve different serialization formats. Additionally, the W3C has created an extension of Turtle called TriG for publishing complete RDF Datasets that may contain multiple graphs. Because JSON-LD supports the @graph construct, RDF serialized in TriG should also be serializable as JSON-LD, though the result may not be as "human readable".
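As an illustration, here is a single triple (using hypothetical example.org identifiers) written first in Turtle:

```turtle
@prefix schema: <http://schema.org/> .

<http://example.org/book/moby-dick> schema:author <http://example.org/person/melville> .
```

and the same triple in JSON-LD:

```json
{
  "@context": {"schema": "http://schema.org/"},
  "@id": "http://example.org/book/moby-dick",
  "schema:author": {"@id": "http://example.org/person/melville"}
}
```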


Linked Data Principles

Tim Berners-Lee published a technical note outlining some basic principles for publishing data on the Web. These principles can be summarized as:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

Data that adopts these principles can be rated against Berners-Lee's "five star" scheme for linked data. The W3C continues to create recommendations to facilitate the linked data vision. The BBC has a nice introduction to linked data principles.

Query of Remote Linked Data

The W3C also provides a specification of a query language for RDF data called SPARQL, which facilitates searching for triples and constructing new subgraphs that match particular patterns.
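As a sketch (against hypothetical data using the Schema.org vocabulary), a SELECT query returns a table of bindings for the matched pattern:

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?book ?author
WHERE { ?book schema:author ?author . }
```

while a CONSTRUCT query emits a new subgraph built from the matched pattern instead of a table:

```sparql
PREFIX schema: <http://schema.org/>

CONSTRUCT { ?author schema:workExample ?book . }
WHERE { ?book schema:author ?author . }
```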

IBM developerWorks has a tutorial, Data integration at scale: Query RDF data with SPARQL, that provides concrete examples using the sparql command-line tool from the Apache Jena project.


Ontologies

To facilitate capturing concepts within RDF data, the W3C has created a specification called the Web Ontology Language (OWL). OWL is on its second revision, and a primer for OWL 2 is available as a W3C technical report. From this technical report:

"The W3C OWL 2 Web Ontology Language (OWL) is a Semantic Web language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL is a computational logic-based language such that knowledge expressed in OWL can be reasoned with by computer programs either to verify the consistency of that knowledge or to make implicit knowledge explicit. OWL documents, known as ontologies, can be published in the World Wide Web and may refer to or be referred from other OWL ontologies. OWL is part of the W3C's Semantic Web technology stack, which includes RDF and SPARQL."

Implicit relationships specified in an ontology are made explicit by creating materialized inferences. Since reasoners are not typically part of web development stacks, it is sometimes useful to materialize some inferences by statically publishing the inferred relations in the data set.
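A minimal sketch of materialization, using an RDFS-level entailment (the class names are hypothetical):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Asserted triples:
ex:Novel rdfs:subClassOf ex:Book .
ex:mobyDick rdf:type ex:Novel .

# A reasoner can infer the following triple; materializing means
# publishing it explicitly alongside the asserted data:
ex:mobyDick rdf:type ex:Book .
```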

Open World Assumption

The formal logic design underpinning OWL is based on the open world assumption. In a nutshell:

"...the open-world assumption is the assumption that the truth value of a statement may be true irrespective of whether or not it is known to be true."

"The absence of a particular statement within the web means, in principle, that the statement has not been made explicitly yet, irrespective of whether it would be true or not, and irrespective of whether we believe that it would be true or not. In essence, from the absence of a statement alone, a deductive reasoner cannot (and must not) infer that the statement is false."

Anyone can say Anything about Anything

The fact that Anyone can say Anything about Anything follows from the open-world design of the semantic web.

"In general, it is not assumed that all information about any topic is available. A consequence of this is that RDF cannot prevent anyone from making nonsensical or inconsistent assertions, and applications that build upon RDF must find ways to deal with conflicting sources of information."

This design choice for RDF is both a blessing and a curse, since it makes it difficult to validate data correctness. There are ongoing W3C efforts to build constraint checking into the RDF framework to facilitate data correctness, such as the RDF Data Shapes Working Group.
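For example, the working group's SHACL language (a draft at the time of writing) expresses constraints as "shapes" that target nodes in the data; a minimal sketch, using hypothetical identifiers:

```turtle
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

# Every node typed schema:Book must have exactly one schema:name.
ex:BookShape a sh:NodeShape ;
    sh:targetClass schema:Book ;
    sh:property [
        sh:path schema:name ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```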

Follow your nose

One key consequence of the linked data principle "when someone looks up a URI, provide useful information" is that machine agents can gain context about a particular resource by following links. One useful mechanism for providing additional links is the rdfs:seeAlso property, which can point to additional RDF documents. For a nice discussion, see the Linked Data Patterns book.
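A minimal Turtle sketch (the identifiers are hypothetical):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/book/moby-dick>
    rdfs:seeAlso <http://dbpedia.org/resource/Moby-Dick> .
```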

In general, it is useful to use existing vocabulary resources such as Schema.org, since JSON-LD embedded in HTML is crawled by Google. It is also useful to link to DBpedia and Wikidata resources.
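For example, a Schema.org description can be embedded in an HTML page as a JSON-LD script block (the book and its DBpedia link are illustrative):

```html
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Book",
  "name": "Moby-Dick",
  "sameAs": "http://dbpedia.org/resource/Moby-Dick"
}
</script>
```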

Five Star Linked Vocabularies

Since ontologies and vocabularies are themselves linked data, we should apply the five-star linked data principles to them as well. There are some additional concerns outlined in a Semantic Web journal blog post. The five stars of vocabularies are:

  • (no stars) Linked Data without any vocabulary.
  • * There is dereferenceable human-readable information about the used vocabulary.
  • ** The information is available as a machine-readable explicit axiomatization of the vocabulary.
  • *** The vocabulary is linked to other vocabularies.
  • **** Metadata about the vocabulary is available (in a dereferenceable and machine-readable form).
  • ***** The vocabulary is linked to by other vocabularies.

In general we aim for "4-star" vocabularies and assume that this will facilitate the "fifth-star".

How do I name URIs?


Opaque URIs

From Tim Berners-Lee's design issues document:

"The only thing you can use an identifier for is to refer to an object. When you are not dereferencing, you should not look at the contents of the URI string to gain other information." For the bulk of Web use, URIs are passed around without anyone looking at their internal contents, the content of the string itself. This is known as URI opacity. Software should be made to treat URIs as generally as possible, to allow the most reuse of existing or future schemes.

Trusty URIs


Cool URIs assume that the resource is dereferenceable, which may not be true of digital artifacts that are not web resources. To address this issue, the idea of Trusty URIs has been developed, published, and applied with respect to Nanopublications, an emerging methodology for scientific publishing that uses Semantic Web methodology to link provenance and metadata information together. An example of such a dataset is the "Linked Drug-Drug Interactions" (LIDDI) dataset. There are frameworks implementing Trusty URIs in different languages on GitHub. Trusty URI identifiers contain a cryptographic hash value calculated from the content of the represented digital artifact.

An example of a Trusty URI is http://example.org/np1#RAHGB0WzgQijR88g_rIwtPCmzYgyO4wRMT7M91ouhojsQ. Since Docker already generates container and image checksums that are verifiable from within Docker, instance URIs should be generated using these checksums. Since the resolver of a Docker container or image ID resource is NOT an HTTP URL, URIs generated from Docker resources should have the following syntax.
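The core idea can be sketched in Python. Note this is illustrative only: the actual Trusty URI specification defines module-specific normalization of the content before hashing, which this sketch omits by hashing raw bytes directly.

```python
import base64
import hashlib

def hash_suffix(content: bytes) -> str:
    """Illustrative sketch: hash an artifact's bytes and encode the
    digest so it can be embedded in a URI. The real Trusty URI spec
    normalizes content per module before hashing."""
    digest = hashlib.sha256(content).digest()
    # base64url keeps the digest safe to embed in a URI fragment.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

artifact = b"the bytes of some fixed digital artifact"
uri = "http://example.org/np1#" + hash_suffix(artifact)
print(uri)
```

Because the suffix is derived from the content, anyone holding the artifact can recompute the hash and verify that the URI really refers to those exact bytes.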


Practical Programming Notes

from https://github.com/valueflows/agent/pull/49#issuecomment-150507316

Something to always keep in mind: each instance can assert membership in any number of sets, e.g. `"@type": ["vf:Resource", "schema:Book", "foo:Novel", "gr:Product"]`. So one should never check `thing["@type"] === helper.expand("schema:Book")` in code, but always assume the instance can have any number of types (even 50 or 100, especially if it has materialized inferences) and instead check `thing["@type"].includes(helper.expand("schema:Book"))`. Most likely, libraries for particular programming languages define nicer idiomatic interfaces rather than operating directly on the raw JSON structures in their runtime.
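The same membership check can be sketched in Python; `expand` and `CONTEXT` here are illustrative stand-ins, not part of any particular JSON-LD library:

```python
# Hypothetical prefix map, standing in for a JSON-LD @context.
CONTEXT = {"schema": "http://schema.org/"}

def expand(curie: str) -> str:
    """Expand a compact IRI like 'schema:Book' against CONTEXT."""
    prefix, _, local = curie.partition(":")
    return CONTEXT[prefix] + local

thing = {"@type": ["http://schema.org/Book", "http://schema.org/CreativeWork"]}

# Wrong: assumes the node has exactly one type.
#   thing["@type"] == expand("schema:Book")

# Right: test for membership, since a node may carry many types.
is_book = expand("schema:Book") in thing["@type"]
print(is_book)  # True
```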