RDF Processing with rdflib and SPARQL¶
Remember that RDF uses a very simple data format: the triple, consisting of
- subject
- property
- object
The subject and the property are always non-literal resources, while the object can be either a resource or a literal, such as a string or a number.
We will continue our examples in N3. Our examples here are also valid Turtle, the RDF-only subset of N3.
The first part of our RDF continues over several lines. In Python we can assign multi-line strings using the triple quote notation; it means that the string continues until the matching triple quote at the end.
data = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@base <http://my.org/> .
@prefix : <#> .
"""
We start with some prefixes. These help us to keep our vocabulary well-defined and avoid name clashes.
Out of each vocabulary we list what we are using here:
- rdf: The RDF built-in vocabulary, defining basic things
- rdf:type
- rdfs: The RDF Schema vocabulary, defining more basic things
- rdfs:Class
- rdfs:subClassOf
- rdfs:domain
- rdfs:range
- foaf: The Friend of a Friend vocabulary, devoted to linking people and information
- foaf:name
- foaf:knows
- foaf:age
A few more definitions help keep this example simple:
- @base sets the base for all relative URIs, allowing us to keep them short
- @prefix : binds the empty prefix to <#>, which makes the code even shorter: we can write :tom instead of <#tom>, with no need for the angle brackets < and >
Now we add our ontology part. The += notation in Python lets us add more text to our RDF data.
data += """
:Person a rdfs:Class .
:Man a rdfs:Class ; rdfs:subClassOf :Person .
:Woman a rdfs:Class ; rdfs:subClassOf :Person .
"""
The last step is to add instance data. This part alone is sometimes referred to as the knowledge graph, although all triples together form a single RDF graph.
data += """
:tom a :Man ; foaf:name "Tom" .
:jane a :Woman ; foaf:name "Jane" .
:jack a :Man ; foaf:name "Jack" .
:tom foaf:knows :jane .
:jane foaf:knows :tom .
:jane foaf:knows :jack .
:jack foaf:knows :jane .
:jane foaf:age 28 .
"""
rdflib¶
Having defined our RDF data, we can now use the Python package rdflib to process the statements. We proceed in three steps:
- construct an empty graph
- add the data from our statements using the parse() function
- print the size of the graph; the len() function returns the number of triples.
# uncomment if needed
# %pip install rdflib
import rdflib
g = rdflib.Graph()
g.parse(data=data, format='turtle')
print(len(g))
16
Now we print the individual RDF triples, just to see what we have.
Each triple is made of three components, which we can state in the loop:
for s, p, o in g:
    print(s, p, o)
http://my.org/#Woman http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#tom http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#Woman http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jack http://xmlns.com/foaf/0.1/name Jack
http://my.org/#jack http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#jack
http://my.org/#tom http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#Person http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#jack http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#jane http://xmlns.com/foaf/0.1/name Jane
http://my.org/#Man http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#tom
http://my.org/#Man http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#tom http://xmlns.com/foaf/0.1/name Tom
http://my.org/#jane http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Woman
http://my.org/#jane http://xmlns.com/foaf/0.1/age 28
What is happening here?
- the empty prefix : stands for <#>, the concept of 'this document', so that :tom becomes the relative URI #tom
- the @base is prepended to these relative URIs, turning them into full URIs such as http://my.org/#tom
- the terms that are not our own are expanded with the namespace URI of their prefix; the keyword a is shorthand for rdf:type
- the literals are unchanged, such as the names and ages of people
We could now go on and write Python code to manipulate the RDF data, but that would be unwise: there is a query language specifically designed for RDF data, SPARQL.
SPARQL¶
Coming from SQL, the switch to SPARQL is easier in some respects but more confusing in others: keywords such as SELECT and WHERE look familiar, but they operate on triples rather than tables. Let us start by reproducing the result we obtained from rdflib directly, i.e. a list of all triples:
for s, p, o in g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o }"):
    print(s, p, o)
http://my.org/#Woman http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#tom http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#Woman http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jack http://xmlns.com/foaf/0.1/name Jack
http://my.org/#jack http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#jack
http://my.org/#tom http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#Person http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#jack http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#jane http://xmlns.com/foaf/0.1/name Jane
http://my.org/#Man http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#tom
http://my.org/#Man http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#tom http://xmlns.com/foaf/0.1/name Tom
http://my.org/#jane http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Woman
http://my.org/#jane http://xmlns.com/foaf/0.1/age 28
The result should not be surprising. The SPARQL syntax takes some getting used to.
- SELECT starts the query
- we continue with the variables we want to see in the result
- we name each variable as we wish, using the ?name format
- WHERE continues with the conditions that the result must satisfy
- in this example we only require that each ?s ?p ?o combination is a triple in the graph
Let's now ask the question: who knows Jack?
q = """
SELECT ?who
WHERE {
?who <http://xmlns.com/foaf/0.1/knows> <http://my.org/#jack>
}
"""
for row in g.query(q):
    print(row)
(rdflib.term.URIRef('http://my.org/#jane'),)
The result is promising, but we are getting some type information that we would rather not see at this point.
Improving the rdflib output¶
We can access the value itself by indexing the row with the name of the variable, which acts as the column name:
q = """
SELECT ?who
WHERE {
?who <http://xmlns.com/foaf/0.1/knows> <http://my.org/#jack>
}
"""
for row in g.query(q):
    print(row['who'])
http://my.org/#jane
We are getting somewhere, but we are not happy yet:
- the entity ID could be Q3741 or something similarly unreadable
- we want to see a property of the entity that has meaning for us, such as the name
q = """
SELECT ?name WHERE {
?s <http://xmlns.com/foaf/0.1/knows> <http://my.org/#jack> .
?s <http://xmlns.com/foaf/0.1/name> ?name
}
"""
for row in g.query(q):
    print(row['name'])
Jane
This is a useful result.
At this point it should become apparent how SPARQL queries work:
- we list the restrictions that describe our result, similar to the WHERE clause in SQL
- when we have more than one restriction we separate them with a dot, much like AND in SQL: all of them must hold
- we need to use the triple notation, much unlike SQL
Let's try something similar:
- who does Jane know?
- identify Jane by her name, not her ID
q = """
SELECT ?who WHERE {
?sub <http://xmlns.com/foaf/0.1/name> "Jane" .
?sub <http://xmlns.com/foaf/0.1/knows> ?obj .
?obj <http://xmlns.com/foaf/0.1/name> ?who
}
"""
for row in g.query(q):
    print(row['who'])
Tom
Jack
Traversing a variable number of graph edges¶
Type hierarchies are a very useful concept in RDF. Here is one way of extracting this information from our graph:
- list the names of all people
- i.e. all members of :Person or one of its subclasses
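People are not directly members of :Person; they are members of classes which are subclasses of :Person.
The prop1/prop2* notation is a property path: the slash chains two properties one after the other, and the star repeats the preceding property zero or more times. The prefixes rdfs:, foaf: and : need no PREFIX declaration here, because rdflib passes the prefixes that were bound while parsing the data on to the query: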
q = """
SELECT ?name WHERE {
?s a/rdfs:subClassOf* :Person .
?s foaf:name ?name
}
"""
for row in g.query(q):
    print(row['name'])
Tom
Jack
Jane
The expression
?s a/rdfs:subClassOf* :Person
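combines two steps:
- ?s is an instance of some class (a is shorthand for rdf:type)
- that class is zero or more subClassOf steps away from :Person; zero steps means that the class is :Person itself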
The last point is important: it means that we can traverse any number of steps for this relationship.
This notation works for any property, not just subClassOf.
Exercises¶
- create a few more small ontologies and knowledge graphs, modeling data for things such as
- a hobby that motivates you
- the sports club you joined or are a fan of
- or maybe even a business or economics subject:
- countries, their geographic and economic attributes
- companies, their attributes and product ranges
- a specific line of consumer electronics and its attributes
- once you have defined your own small graph you can run SPARQL queries on it
☆ Use this opportunity to delve into this interesting topic that is very high on the agenda of all the big players, not only in IT!