Knowledge Graphs, RDF, and N3¶
Since ancient times philosophers have wrestled with the question: What can we know?
We take a practical approach: knowledge is intimately tied to language; consider only the type of knowledge that can be put into words. Then, knowledge processing becomes word processing. We answer a query by putting words together, and if we do it well enough a human reader will get the impression that our system performs in an 'intelligent' fashion.
☆ Symbolic vs Connectionist AI¶
When we automate the process of question answering in a computer program we create a basic form of 'artificial intelligence' if the program responds to queries as we would expect from a human. There are basically two methods:
- symbolic AI, such as methods based on knowledge graphs as described here. The knowledge and the processing are encoded in an explicit fashion, and it is easy to follow the reasoning and see how the system derives its answers to queries. Collecting and coding large amounts of explicit knowledge is a huge challenge.
- connectionist AI, which uses artificial neural networks and huge amounts of training data, such as very large collections of existing natural language text. The knowledge is encoded in the connections of the network, and it is not easy to explain the reasoning. Connectionist AI based on large language models has seen spectacular successes recently, such as ChatGPT and similar systems.
Since both symbolic and connectionist AI have their strengths and weaknesses there is growing interest in hybrid systems that seek to combine the advantages of both approaches.
An ontology is a formal description of knowledge. It lists the types of things that exist and the properties that are used to describe them. Ontologies do this in a machine-readable way and are concerned with
- classes
- attributes
- relationships
- restrictions
- rules
Ontologies can provide a sharable and reusable knowledge representation and allow for adding new knowledge about the domain. An ontology is more easily created for a clearly defined and very limited domain.
An ontology is concerned with knowledge about classes of things; a knowledge graph adds data about individual objects belonging to those classes. Therefore, knowledge graphs tend to be much bigger than ontologies; e.g., a book ontology is concerned with general concepts about books and not individual instances. The ontology describes classes such as book, author, publisher, and their relationships, while a book knowledge graph contains data on individual books, such as their titles, authors, year of publication.
ontology + instance data = knowledge graph
Graphs¶
Both ontologies and instance data can be represented as graphs. A graph consists of nodes and edges and is convenient to view in an image, provided the graph is not very big. With increasing graph size images become less useful.
RDF, the Resource Description Framework, is commonly used as a method of specifying knowledge graphs. A collection of statements describes the graph, each with
- subject
- property
- object
Since there are three elements such statements are also called triples.
An example in the image above is the triple (France, capital, Paris).
- The subject is some sort of resource about which a statement is made
- The property is used to make that statement
- The object is another resource
In RDF everything is considered a resource, including literals, such as numbers and strings.
A resource can play different roles in different statements, e.g. in the triple (France, capital, Paris) the resource 'capital' acts as a property, but it can be subject or object in other triples.
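To make the idea of triples concrete, here is a minimal sketch in plain Python, not a real RDF library. Statements are stored as tuples and queried with a simple pattern match. The extra triples about 'Paris' and 'capital' and the helper name `match` are assumptions made for illustration only.

```python
# Each statement is a (subject, property, object) tuple.
# The names are illustrative placeholders, not real URIs.
triples = {
    ("France", "capital", "Paris"),
    ("Paris", "type", "City"),
    ("capital", "type", "Property"),  # 'capital' appears as a subject here
}

def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

capital_of_france = match(triples, s="France", p="capital")
about_capital = match(triples, s="capital")
```

The second query shows the resource 'capital' in the subject role, while in the first it acts as a property.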
RDF data is often stored in XML format (RDF/XML). However, there are other storage formats for RDF, and XML is used as a storage format for many types of data, not just RDF.
XML¶
XML is a well-established format. Many software tools exist for processing XML documents.
The essential requirements for XML documents are:
- all elements are closed, e.g. <a> ... </a>
- there is a single root element, everything else is nested in the root or other elements
- elements must not overlap, e.g. the following is not allowed: <a> ... <b> ... </a> ... </b>
XML is used for various types of content. XHTML, for example, is HTML written to conform to the XML rules.
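The requirements listed above can be checked with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module; the helper name `is_well_formed` and the tiny test strings are our own choices for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = is_well_formed("<a><b>text</b></a>")       # properly nested
unclosed = is_well_formed("<a><b>text</b>")         # <a> is never closed
overlapping = is_well_formed("<a><b>text</a></b>")  # elements overlap
two_roots = is_well_formed("<a></a><b></b>")        # more than one root
```

Only the first string passes; each of the others violates one of the three requirements.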
An XML version of the customer table in a relational DB can look like this:
<xml>
<table name="cust">
<row>
<id>BK</id>
<name>Buster Keaton</name>
</row>
<row>
<id>DF</id>
<name>Douglas Fairbanks</name>
</row>
</table>
</xml>
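As a sketch of how such a document might be processed, the following Python code reads the customer table with the standard xml.etree.ElementTree module and turns each row into a dictionary. The variable names are assumptions for illustration; the XML is the table shown above, only indented.

```python
import xml.etree.ElementTree as ET

doc = """
<xml>
  <table name="cust">
    <row><id>BK</id><name>Buster Keaton</name></row>
    <row><id>DF</id><name>Douglas Fairbanks</name></row>
  </table>
</xml>
"""

root = ET.fromstring(doc)
table = root.find("table")

# One dictionary per row: element name -> element text
rows = [{child.tag: child.text for child in row}
        for row in table.findall("row")]
```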
Here is an RDF document in XML format about a person identified by '#tom'; this person has the name 'Tom'.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person rdf:about="#tom">
<foaf:name>Tom</foaf:name>
</foaf:Person>
</rdf:RDF>
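The statements contained in such a document can be pulled out with a generic XML parser. The sketch below is not a full RDF/XML parser; it only handles this simple shape, and it assumes the person element is wrapped in an rdf:RDF root element that declares both the rdf and foaf namespaces. The helper name `uri` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

doc = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <foaf:Person rdf:about="#tom">
    <foaf:name>Tom</foaf:name>
  </foaf:Person>
</rdf:RDF>
"""

def uri(tag):
    """Turn ElementTree's '{namespace}local' tag form into a plain URI."""
    namespace, local = tag[1:].split("}")
    return namespace + local

triples = []
for node in ET.fromstring(doc):
    subject = node.get(f"{{{RDF}}}about")
    # The element name states the type of the subject ...
    triples.append((subject, RDF + "type", uri(node.tag)))
    # ... and each child element is one property with a literal value.
    for child in node:
        triples.append((subject, uri(child.tag), child.text))
```

The three lines of XML therefore carry two statements: '#tom' is a person, and '#tom' has the name 'Tom'.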
N3¶
XML is fine for automated processing but tedious for human readers.
The examples in the Semantic Web Primer are still very informative (written by Tim Berners-Lee, the inventor of the World Wide Web):
https://www.w3.org/2000/10/swap/Primer
It describes the more convenient N3 notation. In N3 we can write something like
<#tom> <#name> "Tom" .
We will use this notation here since it is much easier to read.
URI¶
RDF uses URIs (Uniform Resource Identifiers) to identify subjects, properties, and non-literal objects.
☆ There is also the IRI (Internationalized Resource Identifier) which permits a wider range of Unicode characters than the URI specification (subset of ASCII); however, like all 'international' characters (meaning non-English) it can cause many problems in practical applications. Use plain ASCII characters whenever you can; they are most likely to work everywhere.
URIs look very much like URLs. However, here that format is just used to identify something, such as
http://example.com/people#tom
The idea is that someone else writing N3 documents about a different 'tom' will use a different URI, such as
http://widgets4us.com/team#tom
In some cases a URI can also be used as a URL, i.e. there is some web content at the address; however, for the purposes of RDF this is not necessary. When used as a URL the example.com URI returns some HTML content, but the widgets URI results in an error (at least at the time of writing).
Sometimes a (hopefully) global identification is not necessary: leaving out everything before the hash sign (#) results in #tom, which identifies some 'tom' in the current document only.
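The relation between the short form #tom and a full URI can be illustrated with Python's standard urllib.parse module. This is only a sketch; the base address below is an assumption standing in for wherever the current document happens to live.

```python
from urllib.parse import urljoin, urldefrag

# Assumed address of the current document
base = "http://example.com/people"

# Resolving the local identifier against the document address
full = urljoin(base, "#tom")

# Splitting a full URI into the document part and the local part
document, fragment = urldefrag("http://widgets4us.com/team#tom")
```

Both URIs end in the same local name 'tom', but the document parts differ, so the two identifiers do not clash.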
Populating the Knowledge Graph¶
In this example we start by adding data into the knowledge graph.
In N3 we can write:
<#tom> <#knows> <#jane> .
<#jane> <#age> 28 .
While subject and property are stated in URIs, the third part of the statement can also be a literal, such as a string or number. Note that #age acts as a property, while #knows can also be understood as a relationship.
The meaning for the human reader is obvious. However, in terms of RDF processing the string identifying our 'tom' (within the current document) does not need to be human-readable. Some knowledge graphs use numbered identifiers for subjects and properties, e.g. Wikidata uses 'Q91' for Abraham Lincoln (the 16th president of the United States), while DBpedia uses 'Abraham_Lincoln'.
http://www.wikidata.org/entity/Q91
http://dbpedia.org/resource/Abraham_Lincoln
Since N3 was designed as a help for human readers it also contains some options for abbreviations:
<#jane> <#child> <#albert>, <#martha> ;
<#age> 28 ;
<#eyecolor> "blue" .
This avoids repetition and says that Jane has two children, Albert and Martha, her (Jane's) eye color is blue, and her age is 28. It is equivalent to the more tedious:
<#jane> <#child> <#albert> .
<#jane> <#child> <#martha> .
<#jane> <#age> 28 .
<#jane> <#eyecolor> "blue" .
Similarly we can provide a table-like version using:
<#albert> <#age> 2; <#eyecolor> "green" .
<#martha> <#age> 4; <#eyecolor> "blue" .
which says that Albert is 2 years old and has green eye color, while Martha is 4 years old and has blue eye color.
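The abbreviations are purely a matter of notation: each abbreviated statement stands for a set of full triples. The following Python sketch mimics the expansion with a dictionary per subject; the function name `expand` and the dictionary layout are assumptions for illustration, not part of N3.

```python
def expand(subject, properties):
    """Expand {property: object or list of objects} into full triples."""
    result = []
    for prop, objects in properties.items():
        if not isinstance(objects, list):
            objects = [objects]
        for obj in objects:
            result.append((subject, prop, obj))
    return result

jane = expand("#jane", {
    "#child": ["#albert", "#martha"],  # like ',': same subject and property
    "#age": 28,                        # like ';': same subject only
    "#eyecolor": "blue",
})

albert = expand("#albert", {"#age": 2, "#eyecolor": "green"})
```

Here `jane` holds the same four triples as the long form above, and `albert` holds the two triples from the first row of the table-like version.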
We can also make a statement about objects without identifiers: we only want to state that they exist and have certain properties. This can be done by using square brackets with the properties inside:
<#jane> <#child> [ <#age> 2 ] , [ <#age> 4 ] .
This says that Jane has two children aged 2 and 4.
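Internally, a processor still needs some node to hang the properties on. A common approach, sketched here in plain Python, is to generate a fresh label for every bracket expression. The `_:` prefix follows the usual notation for such blank nodes; the counter and the helper name `blank_node` are assumptions for illustration.

```python
import itertools

_counter = itertools.count(1)

def blank_node():
    """Return a fresh label that is only meaningful inside this graph."""
    return f"_:b{next(_counter)}"

triples = []
for age in (2, 4):
    child = blank_node()                    # one node per [ ... ] expression
    triples.append(("#jane", "#child", child))
    triples.append((child, "#age", age))
```

The two children are distinct nodes in the graph, but unlike <#albert> and <#martha> they cannot be referred to from outside.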
At this point let us again stress some concepts:
An identifier like <#jane> works very much like an employee ID - the letters do not specify someone whose name is "Jane".
We can add that information by saying something like
<#jane> <#name> "Jane" .
The same applies to the properties. They were chosen to provide a nice example; as far as the automated processing is concerned, we could have used <#P40> instead of <#child> (e.g. see Wikidata https://www.wikidata.org/wiki/Property:P40).
The identifiers we used work just fine in our own document, but when we process data from different sources there may well be a name clash: the same name is used in another source in a different way. On the other hand, we do not want to always write the full URI in our statements.
Introducing Namespaces¶
Namespaces solve the problem of name clashes. Suppose we save our little knowledge graph which we are building step by step in a file, and we want to give that document a title, very much like the title of a web page:
<> <#title> "Some N3 Examples".
The expression <> in N3 refers to this document i.e. the document it is written in.
In the following statement the meaning of the word 'title' is clear for fantasy fans since lotr commonly means Lord of the Rings in a fantasy context.
<#lotr> <#title> "Lord of the Rings" .
However, the next statement also uses the word 'title'
<#tom> <#title> "Managing Director" .
For the human reader the meaning of 'title' in this context is again clear. However, out of context even the innocent little word 'title' can refer to a number of things:
- the title of a document, book, film, piece of music
- an academic title or job title
- a sports prize, such as heavyweight boxing title
- a legal right, such as title to a property
To clarify our meaning of terms we can use a namespace, such as the Dublin Core, a small vocabulary to describe certain types of resources:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<> dc:title "Some N3 Examples" .
Let's add another prefix for the FOAF vocabulary (Friend of a Friend):
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#tom> foaf:name "Tom" .
<#jane> foaf:name "Jane" .
Now we are using pre-defined vocabularies identified by prefixes such as foaf. There are a number of well-known vocabularies; prefix statements typically include RDF, RDFS (RDF Schema), FOAF, and OWL (Web Ontology Language):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
The empty prefix refers to this document, which can be specified as
@prefix : <#> .
This makes the notation even shorter and easier to read.
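A prefix is just an abbreviation: each prefixed name stands for the namespace URI followed by the local name. The Python sketch below shows that expansion. It handles prefixed names only, and the address used for the empty prefix is an assumption, since it depends on where the document is stored.

```python
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "": "http://example.com/people#",  # assumed location of this document
}

def expand(name):
    """Replace the prefix of a name like 'foaf:name' by its namespace URI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local

foaf_name = expand("foaf:name")
dc_title = expand("dc:title")
tom = expand(":tom")
```

After expansion, dc:title is a globally distinct identifier, so it can no longer be confused with a job title defined in some other vocabulary.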
Defining an Ontology¶
In addition to providing a vocabulary an ontology usually defines a type hierarchy for classes and some rules or restrictions for properties. OWL provides means to define more rigorous data models, but we will use the simpler RDF and RDFS here.
First we want to specify types i.e. classes for things.
:Person rdf:type rdfs:Class .
Since this is so often done there is a special keyword in N3 acting as a shorthand for rdf:type, simply 'a':
:Person a rdfs:Class .
Now that we have defined a class for people we can add instances to that class:
:tom a :Person .
:jane a :Person .
RDFS provides a number of vocabulary elements to specify details for classes and properties, such as hierarchy:
:Man a rdfs:Class; rdfs:subClassOf :Person .
:Woman a rdfs:Class; rdfs:subClassOf :Person .
Now, when we say that
:martha a :Woman .
it follows logically that
:martha a :Person .
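A minimal sketch of this inference in plain Python, assuming the triples are held as tuples of strings: the rule is applied repeatedly until no new facts appear. This only illustrates the idea and is not how a production reasoner works; the function name `infer_types` is hypothetical.

```python
triples = {
    (":Man", "rdfs:subClassOf", ":Person"),
    (":Woman", "rdfs:subClassOf", ":Person"),
    (":martha", "a", ":Woman"),
}

def infer_types(triples):
    """Add (x a C2) whenever (x a C1) and (C1 rdfs:subClassOf C2) hold."""
    inferred = set(triples)
    changed = True
    while changed:                      # repeat until nothing new is found
        changed = False
        for x, p, c1 in list(inferred):
            if p != "a":
                continue
            for c, q, c2 in list(inferred):
                if (q == "rdfs:subClassOf" and c == c1
                        and (x, "a", c2) not in inferred):
                    inferred.add((x, "a", c2))
                    changed = True
    return inferred

new_facts = infer_types(triples) - triples
```

Because the loop runs until nothing changes, longer chains of subclasses are followed as well.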
When that logic is implemented we can automatically make such inferences. Similar logic can be defined for properties:
:brother a rdf:Property .
:sister a rdf:Property .
:brother rdfs:domain :Person; rdfs:range :Man .
:sister rdfs:domain :Person; rdfs:range :Woman .
These statements provide information on properties:
- rdfs:domain states the type of the resource on the left side (the subject)
- rdfs:range states the type of the resource on the right side (the object)
Given these definitions we now make the following statement:
:martha :brother :albert .
If the property 'brother' is used as defined i.e. with range 'Man' then this implies:
:albert a :Man .
This is clear to us human readers, but not to the machine. Just like the type hierarchy, the logic behind domain and range must also be implemented in software for these inferences to be made. This will be the subject of the section on reasoning.
Note that since the domain of rdfs:range and rdfs:domain is rdf:Property, it follows that :brother and :sister are both rdf:Property. However, stating it explicitly is not a problem since it does not introduce a contradiction.
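To give a first impression of what such an implementation might look like, here is a small Python sketch of the domain and range rules. The dictionary layout for the property definitions and the function name `infer_from_properties` are assumptions for illustration only.

```python
# Property definitions: domain = type of the subject, range = type of the object
schema = {
    ":brother": {"domain": ":Person", "range": ":Man"},
    ":sister": {"domain": ":Person", "range": ":Woman"},
}

def infer_from_properties(facts, schema):
    """Derive type statements from the domain and range of each property used."""
    inferred = set()
    for s, p, o in facts:
        rule = schema.get(p)
        if rule is None:        # property without a definition: nothing to infer
            continue
        inferred.add((s, "a", rule["domain"]))
        inferred.add((o, "a", rule["range"]))
    return inferred

facts = {(":martha", ":brother", ":albert")}
inferred = infer_from_properties(facts, schema)
```

From the single statement about Martha's brother the sketch derives that :albert is a :Man and, via the domain, that :martha is a :Person.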
EXERCISES:
- Continue to populate both the knowledge graph and the ontology part above
- Determine what can be logically inferred from the statements
- Find other applications for this graph notation; e.g. our database tables
☆ Define an ontology for some topic you are interested in, such as a sport or another hobby. Populate the graph with some facts and see how far you can go before encountering the limits of the approach.