Sparql Endpoints¶

While it is useful to query local data in RDF format with Sparql queries, there is another way of accessing data: Sparql endpoints. These are services that are identified by URLs and are capable of processing queries and returning results.

In this manner an organization can provide access to its data both interactively and, more importantly, in an automated manner.

An increasing number of endpoints is becoming available for a large array of applications. There are several sources that list available Sparql endpoints, such as

  • https://www.wikidata.org/wiki/Wikidata:Lists/SPARQL_endpoints
  • https://www.w3.org/wiki/SparqlEndpoints

DBpedia¶

The DBpedia project extracts structured data from Wikipedia and makes this data available in various forms:

  • on their website http://dbpedia.org
  • as downloads in formats such as Turtle
  • via their Sparql endpoint

Taking a look at a sample page such as

http://dbpedia.org/page/Italy

we find the various properties that are associated with countries.

In the following example we will query the endpoint to get a list of countries with their populations and capitals.

We use the SPARQLWrapper package to access the data.

In [24]:
# uncomment if needed
# %pip install SPARQLWrapper 
In [1]:
from SPARQLWrapper import SPARQLWrapper, JSON, CSV

The endpoint has a URL where the service is waiting for requests, such as ours.

In [2]:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
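To illustrate what "waiting for requests" means, here is a minimal sketch of the kind of HTTP request a client sends to an endpoint. It uses only the Python standard library and builds the request without sending it. The helper name build_sparql_request and the sample query are assumptions made for this illustration; SPARQLWrapper takes care of these details for us and may do so differently.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_sparql_request(endpoint, query):
    """Build (but do not send) an HTTP GET request for a Sparql endpoint."""
    # The Sparql protocol passes the query text in the 'query' URL parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header asks the service for results in the Sparql JSON format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

# Hypothetical sample query, used only to show the resulting URL.
req = build_sparql_request("http://dbpedia.org/sparql",
                           "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(req.get_method())
print(req.full_url)
```

The query text ends up URL-encoded after the endpoint address, which is why an endpoint can be used from a browser as well as from a program.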

Countries and Capitals¶

The query is a little more elaborate than our previous examples.

  • we use the dbo prefix for the property names of the DBpedia ontology
  • the places we are looking for are of type dbo:Country

Note that DBpedia triples are extracted from structured content in Wikipedia, which is a community effort. Entries are not always reliable or objective, particularly when it comes to political issues: Wikipedia editors are people, and people often favour their own agenda over unbiased information.

We are querying a public endpoint that receives a lot of requests. The LIMIT clause is meant to avoid huge data transfers in case our query does not work as expected.

Languages¶

Wikipedia and therefore DBpedia come in a number of languages.

Labels carry language tags; for example, the capital of Italy has the name 'Rome' in English but 'Roma' in Italian.

A FILTER clause together with the lang() function allows us to restrict the labels to English.
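The effect of such a language filter can be mimicked in plain Python. The sketch below uses a hand-written sample (not real endpoint output) in the style of the Sparql JSON results format, where each literal carries its language tag under the key xml:lang. The function name labels_in is an assumption for this illustration.

```python
# Hand-written sample: two labels of the same resource, each with a
# language tag, shaped like literals in the Sparql JSON results format.
labels = [
    {"type": "literal", "xml:lang": "en", "value": "Rome"},
    {"type": "literal", "xml:lang": "it", "value": "Roma"},
]

def labels_in(labels, lang):
    """Keep only the label values whose language tag equals lang."""
    return [label["value"] for label in labels if label.get("xml:lang") == lang]

print(labels_in(labels, "en"))  # ['Rome']
print(labels_in(labels, "it"))  # ['Roma']
```

Doing the filtering inside the query is preferable in practice, since the unwanted labels are then never transferred at all.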

In [13]:
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    
    SELECT ?country ?population ?capital
    WHERE { 
    ?place rdf:type dbo:Country .
    ?place rdfs:label ?country .
    filter(lang(?country) = 'en')
    ?place dbo:populationTotal ?population .
    ?place dbo:capital ?cap .
    ?cap rdfs:label ?capital
    filter(lang(?capital) = 'en')
    }
    LIMIT 20
""")

JSON is a commonly used data-interchange format that works with name/value pairs and ordered lists.

In Python the converted result is a nested dictionary; we access its parts by putting the keys in square brackets [].

In [14]:
sparql.setReturnFormat(JSON)
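To make the structure concrete, here is a small hand-written sample in the shape of a Sparql JSON result, with values taken from the Italy entry. It is a sketch, not actual endpoint output: the datatype URI in particular is only illustrative, and a real endpoint may report slightly different type information.

```python
# Hand-written sample shaped like a Sparql JSON result (illustrative only).
sample = {
    "head": {"vars": ["country", "population", "capital"]},
    "results": {
        "bindings": [
            {
                "country": {"type": "literal", "xml:lang": "en", "value": "Italy"},
                "population": {
                    "type": "literal",
                    # Illustrative datatype; the endpoint may use another one.
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "61095551",
                },
                "capital": {"type": "literal", "xml:lang": "en", "value": "Rome"},
            }
        ]
    },
}

# Each binding maps a variable name to a dictionary; the text is under "value".
first = sample["results"]["bindings"][0]
print(sample["head"]["vars"])
print(first["country"]["value"], first["population"]["value"], first["capital"]["value"])
```

Note that every value arrives as a string, including numbers such as the population.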

Now we are ready to send the query to the endpoint and process the result.

It usually takes a while to process, maybe 5-10 seconds or so.

We should try not to execute this step too often since this is a public endpoint with many users.

In [15]:
countries = sparql.query().convert()
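One way to avoid repeating the request is to keep the converted result in a local file. The following is a minimal sketch under that assumption; the helper name cached_result and the file name are hypothetical, and the cache is never refreshed automatically.

```python
import json
import os

def cached_result(path, fetch):
    """Return the JSON result stored at path; call fetch() only if it is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = fetch()  # the only place where the endpoint would be contacted
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

# In this notebook it could be used like this (not executed here):
# countries = cached_result("countries.json", lambda: sparql.query().convert())
```

Deleting the file forces a fresh request the next time the cell is run.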

Accessing the JSON Output¶

Now we can print the records in the result:

  • the JSON output contains the actual values in results/bindings
  • for each entry there is a corresponding name/value pair
In [16]:
for res in countries["results"]["bindings"]:
    print(res["country"]["value"], res["population"]["value"], res["capital"]["value"])
Bosnia and Herzegovina 3475000 Sarajevo
Republic of Sassari 15 Sassari
Venezuela 29789730 Caracas
Costa Rica 5204411 San José, Costa Rica
Saint Lucia 184961 Castries
Egypt 107770524 Cairo
El Salvador 6568745 San Salvador
Malaysia 33871431 Putrajaya
Czech Republic 10516707 Prague
North Korea 25955138 Pyongyang
Italy 61095551 Rome
Chile 18430408 Santiago
China 1410539758 Beijing
Lebanon 5296814 Beirut
Dominican Republic 10694700 Santo Domingo
South Africa 60604992 Cape Town
South Africa 60604992 Pretoria
Sovereign Military Order of Malta 3 Rome
Korea 77000000 Pyongyang
Kosovo 1806279 Pristina

Note that

  • not everything in the list corresponds to our idea of a country
  • the population numbers are obviously not exact
  • although we filtered the capital names for English we still get non-English characters
  • in other words, language != encoding; hopefully, UTF-8 is used everywhere in the process
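The output also lists South Africa twice, once per capital, and all values arrive as strings. The sketch below shows one possible way to convert the population to an integer and to collect the capitals per country. It works on a few records copied by hand from the output above; the function name group_capitals is an assumption, and int() would fail on values that are not plain integers.

```python
from collections import defaultdict

# Hand-copied records in the shape of countries["results"]["bindings"].
bindings = [
    {"country": {"value": "South Africa"}, "population": {"value": "60604992"},
     "capital": {"value": "Cape Town"}},
    {"country": {"value": "South Africa"}, "population": {"value": "60604992"},
     "capital": {"value": "Pretoria"}},
    {"country": {"value": "Italy"}, "population": {"value": "61095551"},
     "capital": {"value": "Rome"}},
]

def group_capitals(bindings):
    """Map each country to (population as int, list of its capitals)."""
    populations = {}
    capitals = defaultdict(list)
    for res in bindings:
        name = res["country"]["value"]
        populations[name] = int(res["population"]["value"])
        capitals[name].append(res["capital"]["value"])
    return {name: (populations[name], capitals[name]) for name in populations}

for name, (population, caps) in group_capitals(bindings).items():
    print(name, population, ", ".join(caps))
```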

What constitutes a country is often a disputed issue; here is another query for the member states of the UN:

In [17]:
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {
  ?x a <http://dbpedia.org/class/yago/WikicatMemberStatesOfTheUnitedNations> .
  ?x rdfs:label ?name . FILTER (lang(?name) = 'en') . 
  } LIMIT 20
""")
sparql.setReturnFormat(JSON)
countries = sparql.query().convert()
for res in countries["results"]["bindings"]:
    print(res["name"]["value"])
Cambodia
Cameroon
Canada
Cape Verde
Qatar
Romania
Samoa
San Marino
Saudi Arabia
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Botswana
Brazil
Democratic Republic of the Congo
Denmark
Honduras

Another Query: Soccer Players¶

As another example we look for soccer players, their teams, and the countries of their teams.

Obviously this list is far from complete, as it depends on community efforts in adding the data to Wikipedia.

  • We start with the usual prefixes
  • Naturally we want the labels, and we want them in English
  • We restrict the results to countries with a population above 100 million

Again we use a very low limit since we just want to see how this type of query works; we do not need the full data.

In [18]:
sparql.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
SELECT distinct ?playerl ?teaml ?countryl
WHERE { 
    ?player a dbo:SoccerPlayer .
    ?player dbo:team ?team .
    ?player rdfs:label ?playerl .
    ?team rdfs:label ?teaml .
    ?team dbo:ground ?countryOfTeam . 
    ?countryOfTeam a dbo:Country .
    ?countryOfTeam rdfs:label ?countryl .
    ?countryOfTeam dbo:populationTotal ?population .
    FILTER (?population > 100000000)
    FILTER (lang(?playerl) = 'en')
    FILTER (lang(?teaml) = 'en')
    FILTER (lang(?countryl) = 'en')
} ORDER BY ?playerl LIMIT 20
""")

players = sparql.query().convert()

For the output we use the IPython.display module of the Jupyter notebook to produce a nice-looking HTML table:

In [19]:
from IPython.display import HTML, display

tab = '<table>'
for res in players["results"]["bindings"]:
    tab += '<tr><td>%s<td>%s<td>%s' % (res["playerl"]["value"], 
                                       res["teaml"]["value"], 
                                       res["countryl"]["value"])
display(HTML(tab+'</table>'))
A Hoàng | Hoang Anh Gia Lai FC | Vietnam
Aarón Padilla | C.D. Veracruz | Mexico
Aarón Padilla | Atlante F.C. | Mexico
Aarón Padilla Gutiérrez | Atlante F.C. | Mexico
Abayomi Owonikoko Seun | FC Gagra | Georgia (country)
Abayomi Owonikoko Seun | FC Zestafoni | Georgia (country)
Abdel Rahman Magdy | Tersana SC | Egypt
Abdeljalil Medioub | FC Dinamo Tbilisi | Georgia (country)
Abdoul Njankou | Indios de Ciudad Juárez | Mexico
Abdoulaye Koffi | Grand Hotel FC | Egypt
Abdul Rachman | Bontang F.C. | Indonesia
Abdullahi Ibrahim Alhassan | Akwa United F.C. | Nigeria
Abdullahi Oyedele | ABS F.C. | Nigeria
Abdulrahman Bashir | ABS F.C. | Nigeria
Abdulrasaq Wuraola | Sharks F.C. | Nigeria
Abdulrasaq Wuraola | Niger Tornadoes F.C. | Nigeria
Abdulwaheed Afolabi | Niger Tornadoes F.C. | Nigeria
Abdulwaheed Afolabi | Plateau United F.C. | Nigeria
Abdulwasiu Showemimo | Zamfara United F.C. | Nigeria
Abduwahap Aniwar | Kunshan F.C. | China
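The table code above relies on the browser tolerating unclosed tags and inserts the labels unescaped. A slightly more careful variant, sketched here with the standard html module, closes every tag and escapes the values. The function name bindings_to_html is an assumption, and the sample record is copied by hand from the output above.

```python
from html import escape

def bindings_to_html(bindings, columns):
    """Render Sparql JSON bindings as an HTML table with escaped cell values."""
    rows = []
    for res in bindings:
        cells = "".join("<td>%s</td>" % escape(res[col]["value"]) for col in columns)
        rows.append("<tr>%s</tr>" % cells)
    return "<table>%s</table>" % "".join(rows)

# Hand-copied sample record in the shape of players["results"]["bindings"].
sample = [{"playerl": {"value": "Abduwahap Aniwar"},
           "teaml": {"value": "Kunshan F.C."},
           "countryl": {"value": "China"}}]
print(bindings_to_html(sample, ["playerl", "teaml", "countryl"]))
```

In the notebook the returned string could be passed to display(HTML(...)) in the same way as before.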

☆ Wikidata¶

Wikidata is another free and open knowledge graph, somewhat bigger than DBpedia. It identifies entities and properties by numbered IDs rather than human-readable strings. Here is a query for humans (Q5) with the label "Abraham Lincoln". Only one of the results, Q91, is the 16th US president. Note that we have to specify the language tag of the string (@en); otherwise the result will be empty.

In [20]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT distinct ?x  WHERE {  
  ?x rdfs:label "Abraham Lincoln"@en . 
  ?x wdt:P31 wd:Q5 . 
}""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
df = pd.json_normalize(results['results']['bindings'])
df['x.value']
Out[20]:
0     http://www.wikidata.org/entity/Q2821841
1    http://www.wikidata.org/entity/Q27807440
2          http://www.wikidata.org/entity/Q91
Name: x.value, dtype: object
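Since the identifier is the last part of the entity URI, it can be extracted with a simple string operation. This is a small sketch using the three URIs from the output above; the function name entity_id is an assumption.

```python
def entity_id(uri):
    """Return the trailing identifier of an entity URI, e.g. 'Q91'."""
    return uri.rstrip("/").rsplit("/", 1)[-1]

uris = [
    "http://www.wikidata.org/entity/Q2821841",
    "http://www.wikidata.org/entity/Q27807440",
    "http://www.wikidata.org/entity/Q91",
]
print([entity_id(u) for u in uris])  # ['Q2821841', 'Q27807440', 'Q91']
```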

Here is the same search for "Abraham Lincoln" on DBpedia:

In [21]:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""  
    SELECT ?x WHERE { 
    ?x rdf:type dbo:Person .
    ?x rdfs:label "Abraham Lincoln"@en .
}""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
df = pd.json_normalize(results['results']['bindings'])
df['x.value']
Out[21]:
0    http://dbpedia.org/resource/Abraham_Lincoln
Name: x.value, dtype: object

Exercises¶

  • find more examples of useful data in the DBpedia graph
  • construct some Sparql queries and run them on the endpoint
  • format the results in a nice fashion

Remember that DBpedia is based on Wikipedia which is a community effort, and not everything is perfect and complete.

Many interesting things can be found, but a little patience is required for this type of source.