# RDF Processing in rdflib and Sparql

Remember that RDF uses a very simple data format: the **triple**, consisting of

- subject
- property
- object

The subject and the property are non-literal resources, while the object can be either a resource or a literal, 
such as a string or number.

We will continue our examples in N3. Our examples here are also valid **Turtle**, 
the RDF-only subset of N3.

The first part of our RDF spans several lines.
In Python we can assign multi-line strings using the triple quote notation: the string
continues until the matching triple quote at the end.


In [1]:
data = """
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@base <http://my.org/> .
@prefix : <#> .
"""


We start with some prefixes. These help us to keep our vocabulary well-defined and avoid name clashes.

Out of each vocabulary we list what we are using here:

- rdf: The RDF built-in vocabulary, defining basic things 
  - rdf:type
- rdfs: The RDF Schema vocabulary, defining more basic things  
  - rdfs:Class 
  - rdfs:subClassOf
  - rdfs:domain
  - rdfs:range
- foaf: The Friend of a Friend vocabulary, devoted to linking people and information
  - foaf:name
  - foaf:knows
  - foaf:age
  
A few more definitions help keep this example simple:
  - @base sets the base for all URIs, allowing us to keep them relative and thereby short
  - @prefix : makes the code even shorter, no need for angle brackets < and >
  
Now we add our **ontology** part. The += notation in Python lets us add more text to our RDF data.

In [2]:
data += """
:Person a rdfs:Class .
:Man a rdfs:Class ; rdfs:subClassOf :Person .
:Woman a rdfs:Class ; rdfs:subClassOf :Person .
"""

The last step is to add instance data. This part 
alone is sometimes referred to as the knowledge graph, although
all triples together are part of a single RDF graph. 

In [3]:
data += """
:tom a :Man ; foaf:name "Tom" .
:jane a :Woman ; foaf:name "Jane" .
:jack a :Man ; foaf:name "Jack" .

:tom foaf:knows :jane . 
:jane foaf:knows :tom .
:jane foaf:knows :jack .
:jack foaf:knows :jane .
:jane foaf:age 28 .
"""

## rdflib

Having defined our RDF data we can now use the Python package rdflib to process the statements. 

- construct an empty graph
- add the data from our statements using the parse() function
- print the size of the graph; the len() function returns the number of triples.

In [18]:
# uncomment if needed
# %pip install rdflib 

In [4]:
import rdflib

g = rdflib.Graph()

g.parse(data=data, format='turtle')
print(len(g))

16


Now we print the individual RDF triples, just to see what we have.

Each triple is made of three components, which we can unpack directly in the loop:

In [5]:
for s, p, o in g:
    print(s, p, o) 

http://my.org/#Woman http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#tom http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#Woman http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jack http://xmlns.com/foaf/0.1/name Jack
http://my.org/#jack http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#jack
http://my.org/#tom http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#Person http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#jack http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#jane http://xmlns.com/foaf/0.1/name Jane
http://my.org/#Man http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#tom
http://my.org/#Man http:

What is happening here?

- the : @prefix added the concept of 'this document' to all our *own* entity IDs, resulting e.g. in #tom
- the @base is added to our *own* resource IDs, turning them into full URIs
- the corresponding prefix from the other vocabularies is added to the terms
  that are *not* our own
- the literals are unchanged, such as the names and ages of people

We could now go on and write Python code to manipulate the RDF data, but that would be unwise, because there is a query
language specifically designed for RDF data: Sparql.

## Sparql

Coming from SQL, the switch to Sparql is somewhat easier but also more confusing. Let us start by reproducing
the result from using rdflib directly, i.e. a list of all triples:

In [6]:
for s, p, o in g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o }"):
    print(s, p, o)

http://my.org/#Woman http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#tom http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#Woman http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jack http://xmlns.com/foaf/0.1/name Jack
http://my.org/#jack http://xmlns.com/foaf/0.1/knows http://my.org/#jane
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#jack
http://my.org/#tom http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#Person http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2000/01/rdf-schema#Class
http://my.org/#jack http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://my.org/#Man
http://my.org/#jane http://xmlns.com/foaf/0.1/name Jane
http://my.org/#Man http://www.w3.org/2000/01/rdf-schema#subClassOf http://my.org/#Person
http://my.org/#jane http://xmlns.com/foaf/0.1/knows http://my.org/#tom
http://my.org/#Man http:

The result should not be surprising. The Sparql syntax takes some getting used to.

- SELECT starts the query
- we continue with the values we want to see in the result
- we name each value as we wish, using the ?name format
- WHERE continues with specifications for the result
- in the example we just want that each s-p-o combination is a triple in the graph

Let's now ask the question: who knows Jack?

In [7]:
q = """
SELECT ?who 
WHERE { 
  ?who <http://xmlns.com/foaf/0.1/knows> <http://my.org/#jack> 
  }
"""

for row in g.query(q):
    print(row)

(rdflib.term.URIRef('http://my.org/#jane'),)


The result is promising, but we are getting some type info 
that we would rather not see at this point.

### Improving the rdflib output

We can get the plain text of a value by accessing the result column by its name:

In [8]:
q = """
SELECT ?who 
WHERE { 
  ?who <http://xmlns.com/foaf/0.1/knows> <http://my.org/#jack> 
  }
"""

for row in g.query(q):
    print(row['who'])

http://my.org/#jane


We are getting somewhere, but we are not happy yet: 

- the entity ID could be Q3741 or something similarly unreadable
- we want to see a property of the entity that has meaning for us, such as the name

In [9]:
q = """
SELECT ?name WHERE { 
  ?s <http://xmlns.com/foaf/0.1/knows> <http://my.org/#jack> .
  ?s <http://xmlns.com/foaf/0.1/name> ?name 
  }
"""

for row in g.query(q):
    print(row['name'])

Jane


This is a useful result. 

At this point it should become apparent how Sparql queries work: 

- we list the restrictions that describe our result, similar to the WHERE clause in SQL
- when we have more than one restriction we use a dot to separate them, much like AND in SQL
- we need to use the triple notation, much unlike SQL
  
Let's try something similar: 

- who does Jane know?
- identify Jane by her name, not her ID

In [10]:
q = """
SELECT ?who WHERE { 
  ?sub <http://xmlns.com/foaf/0.1/name> "Jane" .
  ?sub <http://xmlns.com/foaf/0.1/knows> ?obj .
  ?obj <http://xmlns.com/foaf/0.1/name> ?who
  }
"""

for row in g.query(q):
    print(row['who'])

Tom
Jack


### Traversing a variable number of graph edges

Type hierarchies are a very useful concept in RDF. Here is one way of extracting this 
information from our graph:

- list the names of all people 
- i.e. all members of :Person or one of its subclasses

People are not directly members of :Person; they are members of classes which are subclasses of Person.

The prop1/prop2\* notation allows us to follow one prop1 link and then any number of prop2 links:

In [12]:
q = """
SELECT ?name WHERE { 
  ?s a/rdfs:subClassOf* :Person . 
  ?s foaf:name ?name 
  }
"""

for row in g.query(q):
    print(row['name'])  

Tom
Jack
Jane


The expression

?s a/rdfs:subClassOf\* :Person

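describes a path with two parts:

- a (short for rdf:type) takes one step from ?s to its class
- rdfs:subClassOf\* then takes zero or more subclass steps from that class to :Person

Zero steps covers direct members of :Person; one or more steps covers members of its subclasses, such as :Man and :Woman.

The second part is important: it means that we can traverse any number of steps for this relationship.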

This notation works for any property, not just subClassOf.

## Exercises
    
- create a few more small ontologies and knowledge graphs, modeling data for things such as
  - a hobby that motivates you
  - the sports club you joined or are a fan of
  - or maybe even a business or economics subject:
    - countries, their geographic and economic attributes
    - companies, their attributes and product ranges
    - a specific line of consumer electronics and its attributes
- once you have defined your own small graph you can run Sparql queries on it

&star; Use this opportunity to delve into this interesting topic that is very high on 
the agenda of all the big players, not only in IT!
