Sparql Endpoints
While it is useful to query local data in RDF format with Sparql queries, there is another way of accessing data: Sparql endpoints. These are services that are identified by URLs and are capable of processing queries and returning results.
In this manner an organization can provide a service for accessing its data both interactively and, more importantly, in an automated manner.
An increasing number of endpoints are becoming available for a large array of applications, and several sources list the available Sparql endpoints.
DBpedia
The DBpedia project extracts structured data from Wikipedia and makes this data available in various formats:
- on their website http://dbpedia.org
- as downloads in formats such as Turtle
- via their Sparql endpoint
Taking a look at a sample page for a country, we find the various properties that are associated with countries.
In the following example we will query the endpoint to get a list of countries with their populations and capitals.
We use the SPARQLWrapper package to access the data.
# uncomment if needed
# %pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON, CSV
The endpoint has a URL where the service is waiting for requests, such as ours.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
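Under the hood, a query sent to an endpoint is an ordinary HTTP request. The following is a minimal sketch, assuming the SPARQL protocol convention of passing the query text in a `query` parameter; the helper name `build_request_url` is hypothetical, and no request is actually sent:

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # same endpoint URL as above


def build_request_url(endpoint, query):
    """Return a GET URL that carries `query` to `endpoint` (hypothetical helper)."""
    # urlencode percent-encodes spaces, braces and other special characters
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(url)

# Decoding the URL again recovers the original query text
print(parse_qs(urlparse(url).query)["query"][0])
```

The SPARQLWrapper package takes care of this encoding, so the sketch only illustrates what "waiting for requests" means in practice.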
Countries and Capitals
The query is a little more elaborate than our previous examples.
- we use the dbo prefix for DBpedia's class and property names
- the places we are looking for are of type dbo:Country
Note that DBpedia triples are extracted from structured content in Wikipedia, which is a community effort. Entries are not always reliable or objective, particularly when it comes to political issues: Wikipedia editors are people, and people often favour their own agenda over unbiased information.
We are querying a public endpoint that receives a lot of requests. The LIMIT clause is meant to avoid huge data transfers in case our query does not work as expected.
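If more rows are eventually needed, they can be requested in small pages rather than in one large transfer. The following is a minimal sketch, assuming SPARQL's OFFSET clause (not used elsewhere in this section) alongside LIMIT; `paged_queries` is a hypothetical helper that only builds query strings and sends nothing:

```python
def paged_queries(base_query, page_size, pages):
    """Yield `pages` copies of `base_query`, each restricted to one page of rows."""
    for page in range(pages):
        # LIMIT caps the rows per request; OFFSET skips the rows of earlier pages
        yield f"{base_query}\nLIMIT {page_size} OFFSET {page * page_size}"


base = ("SELECT ?place WHERE { ?place a <http://dbpedia.org/ontology/Country> } "
        "ORDER BY ?place")
queries = list(paged_queries(base, page_size=20, pages=3))
for q in queries:
    print(q.splitlines()[-1])
```

The ORDER BY in the base query is there so that consecutive pages follow a stable order; without it the endpoint is free to return rows in any order.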
Languages
Wikipedia, and therefore DBpedia, comes in a number of languages.
Labels are assigned language tags, e.g. the capital of Italy has the name 'Rome' in English but 'Roma' in Italian.
A FILTER clause allows us to restrict the labels to English by using the lang() function.
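To make the effect of the language filter concrete, here is a small local sketch. It assumes the SPARQL JSON results convention in which a language-tagged literal carries an `xml:lang` key; the two labels are hand-written for illustration and not fetched from DBpedia, and `labels_in` is a hypothetical helper:

```python
# Hand-written sample labels for one resource, in SPARQL JSON results style
labels = [
    {"type": "literal", "xml:lang": "en", "value": "Rome"},
    {"type": "literal", "xml:lang": "it", "value": "Roma"},
]


def labels_in(language, literals):
    """Keep only literals with the given language tag, like FILTER(lang(?x) = ...)."""
    return [lit["value"] for lit in literals if lit.get("xml:lang") == language]


print(labels_in("en", labels))  # ['Rome']
print(labels_in("it", labels))  # ['Roma']
```

On the endpoint the filtering happens on the server side, so only the matching labels are transferred.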
sparql.setQuery("""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?population ?capital
WHERE {
    ?place rdf:type dbo:Country .
    ?place rdfs:label ?country .
    FILTER(lang(?country) = 'en')
    ?place dbo:populationTotal ?population .
    ?place dbo:capital ?cap .
    ?cap rdfs:label ?capital .
    FILTER(lang(?capital) = 'en')
}
LIMIT 20
""")
JSON is a commonly used data-interchange format built on name/value pairs and ordered lists.
In Python the converted result is a nested dictionary, so we can access its various parts by giving the keys within square brackets [].
sparql.setReturnFormat(JSON)
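The layout of such a result can be sketched with a hand-written document. The sample below assumes the standard SPARQL JSON results layout (a `head` with the variable names and `results`/`bindings` with one dictionary per row); the values are illustrative, and entries returned by a real endpoint may carry additional keys such as a datatype:

```python
import json

# Hand-written sample in the SPARQL JSON results layout (illustrative values)
sample = json.loads("""
{
  "head": {"vars": ["country", "population", "capital"]},
  "results": {"bindings": [
    {"country":    {"type": "literal", "xml:lang": "en", "value": "Italy"},
     "population": {"type": "literal", "value": "61095551"},
     "capital":    {"type": "literal", "xml:lang": "en", "value": "Rome"}}
  ]}
}
""")

print(sample["head"]["vars"])            # names of the SELECT variables
row = sample["results"]["bindings"][0]   # first result row
print(row["country"]["value"], row["capital"]["value"])
```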
Now we are ready to send the query to the endpoint and process the result.
It usually takes a while to process, maybe 5-10 seconds or so.
We should try not to execute this step too often since this is a public endpoint with many users.
countries = sparql.query().convert()
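One way to avoid repeated requests while experimenting is to store the converted result locally. The following is a minimal sketch, assuming the result is a JSON-serializable dictionary; `cached_result`, the `fetch` callable and the file name are hypothetical, and `fake_fetch` stands in for `sparql.query().convert()` so that nothing is sent to the endpoint:

```python
import json
import os
import tempfile


def cached_result(path, fetch):
    """Return the result stored at `path`, calling `fetch()` only if it is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


calls = []


def fake_fetch():
    # Stand-in for: sparql.query().convert()
    calls.append(1)
    return {"results": {"bindings": []}}


cache_file = os.path.join(tempfile.mkdtemp(), "countries.json")
first = cached_result(cache_file, fake_fetch)   # calls fake_fetch and writes the file
second = cached_result(cache_file, fake_fetch)  # reads the stored file instead
print(len(calls))  # 1
```

The cached file has to be deleted by hand whenever the query changes, which is the price of this simple design.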
Accessing the JSON Output
Now we can print the records in the result:
- the JSON output contains the actual values in results/bindings
- for each entry there is a corresponding name/value pair
for res in countries["results"]["bindings"]:
    print(res["country"]["value"], res["population"]["value"], res["capital"]["value"])
Bosnia and Herzegovina 3475000 Sarajevo
Republic of Sassari 15 Sassari
Venezuela 29789730 Caracas
Costa Rica 5204411 San José, Costa Rica
Saint Lucia 184961 Castries
Egypt 107770524 Cairo
El Salvador 6568745 San Salvador
Malaysia 33871431 Putrajaya
Czech Republic 10516707 Prague
North Korea 25955138 Pyongyang
Italy 61095551 Rome
Chile 18430408 Santiago
China 1410539758 Beijing
Lebanon 5296814 Beirut
Dominican Republic 10694700 Santo Domingo
South Africa 60604992 Cape Town
South Africa 60604992 Pretoria
Sovereign Military Order of Malta 3 Rome
Korea 77000000 Pyongyang
Kosovo 1806279 Pristina
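All values in the bindings arrive as strings, and a country with several capitals appears once per capital. The following is a minimal sketch of some post-processing, using a few rows copied from the output above as plain tuples (an assumption made to keep the example offline; with real data they would be read from `countries["results"]["bindings"]`):

```python
from collections import defaultdict

# (country, population, capital) rows copied from the output above
rows = [
    ("Italy", "61095551", "Rome"),
    ("South Africa", "60604992", "Cape Town"),
    ("South Africa", "60604992", "Pretoria"),
    ("Saint Lucia", "184961", "Castries"),
]

population = {}
capitals = defaultdict(list)
for country, pop, capital in rows:
    population[country] = int(pop)      # convert the string to a number
    capitals[country].append(capital)   # collect all capitals of a country

# Print the countries by descending population, one line per country
for country in sorted(population, key=population.get, reverse=True):
    print(country, population[country], ", ".join(capitals[country]))
```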
Note that
- not everything in the list corresponds to our idea of a country
- the population numbers are obviously not exact
- although we filtered the capital names for English we still get non-English characters
- in other words, language != encoding; hopefully, UTF-8 is used everywhere in the process
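The distinction can be illustrated with one of the capital names above. In this minimal sketch the label is tagged as English, yet it contains a non-ASCII character whose byte representation depends on the encoding:

```python
# An English-tagged label from the output above; \u00e9 is the letter 'é'
label = "San Jos\u00e9, Costa Rica"

print(label.isascii())                 # False
print(len(label))                      # 20 characters
print(len(label.encode("utf-8")))      # 21 bytes: 'é' takes two bytes in UTF-8
print(len(label.encode("latin-1")))    # 20 bytes: 'é' takes one byte in Latin-1

# Decoding with the wrong encoding garbles the text
garbled = label.encode("utf-8").decode("latin-1")
print(garbled)
```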
What constitutes a country is often a disputed issue; here is another query for the member states of the UN:
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {
    ?x a <http://dbpedia.org/class/yago/WikicatMemberStatesOfTheUnitedNations> .
    ?x rdfs:label ?name . FILTER (lang(?name) = 'en') .
} LIMIT 20
""")
sparql.setReturnFormat(JSON)
countries = sparql.query().convert()
for res in countries["results"]["bindings"]:
    print(res["name"]["value"])
Cambodia
Cameroon
Canada
Cape Verde
Qatar
Romania
Samoa
San Marino
Saudi Arabia
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Botswana
Brazil
Democratic Republic of the Congo
Denmark
Honduras
Another Query: Soccer Players
As another example we look for soccer players, their teams, and the countries of their teams.
Obviously this list is far from complete, as it depends on community efforts in adding the data to Wikipedia.
- We start with the usual prefixes
- Naturally we want the labels, and we want them in English
- We restrict the results to countries with a very large population (more than 100 million)
Again we use a very low limit since we just want to see how this type of query works; we do not need the full data.
sparql.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT distinct ?playerl ?teaml ?countryl
WHERE {
?player a dbo:SoccerPlayer .
?player dbo:team ?team .
?player rdfs:label ?playerl .
?team rdfs:label ?teaml .
?team dbo:ground ?countryOfTeam .
?countryOfTeam a dbo:Country .
?countryOfTeam rdfs:label ?countryl .
?countryOfTeam dbo:populationTotal ?population .
FILTER (?population > 100000000)
FILTER (lang(?playerl) = 'en')
FILTER (lang(?teaml) = 'en')
FILTER (lang(?countryl) = 'en')
} ORDER BY ?playerl LIMIT 20
""")
players = sparql.query().convert()
For the output we use the IPython.display module, which is available in Jupyter notebooks, to produce a nice-looking HTML table:
from IPython.display import HTML, display
# build one table row per result, with properly closed cells and rows
tab = '<table>'
for res in players["results"]["bindings"]:
    tab += '<tr><td>%s</td><td>%s</td><td>%s</td></tr>' % (res["playerl"]["value"],
                                                           res["teaml"]["value"],
                                                           res["countryl"]["value"])
display(HTML(tab + '</table>'))
A Hoàng | Hoang Anh Gia Lai FC | Vietnam |
Aarón Padilla | C.D. Veracruz | Mexico |
Aarón Padilla | Atlante F.C. | Mexico |
Aarón Padilla Gutiérrez | Atlante F.C. | Mexico |
Abayomi Owonikoko Seun | FC Gagra | Georgia (country) |
Abayomi Owonikoko Seun | FC Zestafoni | Georgia (country) |
Abdel Rahman Magdy | Tersana SC | Egypt |
Abdeljalil Medioub | FC Dinamo Tbilisi | Georgia (country) |
Abdoul Njankou | Indios de Ciudad Juárez | Mexico |
Abdoulaye Koffi | Grand Hotel FC | Egypt |
Abdul Rachman | Bontang F.C. | Indonesia |
Abdullahi Ibrahim Alhassan | Akwa United F.C. | Nigeria |
Abdullahi Oyedele | ABS F.C. | Nigeria |
Abdulrahman Bashir | ABS F.C. | Nigeria |
Abdulrasaq Wuraola | Sharks F.C. | Nigeria |
Abdulrasaq Wuraola | Niger Tornadoes F.C. | Nigeria |
Abdulwaheed Afolabi | Niger Tornadoes F.C. | Nigeria |
Abdulwaheed Afolabi | Plateau United F.C. | Nigeria |
Abdulwasiu Showemimo | Zamfara United F.C. | Nigeria |
Abduwahap Aniwar | Kunshan F.C. | China |
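Labels from a community-edited source may contain characters that are special in HTML. A slightly more defensive sketch builds the same kind of table with escaped cell values; `html_table` is a hypothetical helper, the first sample row is copied from the output above, and the second row is made up to show the escaping:

```python
from html import escape


def html_table(rows):
    """Return an HTML table with one row per tuple and escaped cell values."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


rows = [
    ("Abdul Rachman", "Bontang F.C.", "Indonesia"),
    ("A <b> & C", "Test F.C.", "Nowhere"),  # made-up row to show escaping
]
table = html_table(rows)
print(table)
```

In a notebook the resulting string can be passed to `display(HTML(table))` as before.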
☆ Wikidata
Wikidata is another free and open knowledge graph, somewhat bigger than DBpedia. It uses numeric identifiers in its URIs rather than human-readable strings. Here is a query for humans (Q5) with the label "Abraham Lincoln"; only one of the results is the 16th US president (Q91). Note that we have to specify the language tag of the string (here @en), otherwise the result will be empty.
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT distinct ?x WHERE {
?x rdfs:label "Abraham Lincoln"@en .
?x wdt:P31 wd:Q5 .
}""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
df = pd.json_normalize(results['results']['bindings'])
df['x.value']
0     http://www.wikidata.org/entity/Q2821841
1    http://www.wikidata.org/entity/Q27807440
2          http://www.wikidata.org/entity/Q91
Name: x.value, dtype: object
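Since the Wikidata identifier is the last path segment of the entity URI, it can be extracted with plain string handling. The following is a minimal sketch using the three URIs from the result above; `entity_id` is a hypothetical helper:

```python
uris = [
    "http://www.wikidata.org/entity/Q2821841",
    "http://www.wikidata.org/entity/Q27807440",
    "http://www.wikidata.org/entity/Q91",
]


def entity_id(uri):
    """Return the identifier after the last slash of an entity URI."""
    return uri.rsplit("/", 1)[-1]


ids = [entity_id(u) for u in uris]
print(ids)  # ['Q2821841', 'Q27807440', 'Q91']
```

With the DataFrame above, something like `df['x.value'].str.rsplit('/', n=1).str[-1]` should give the same identifiers as a column.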
Here is the corresponding query for DBpedia:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
SELECT ?x WHERE {
?x rdf:type dbo:Person .
?x rdfs:label "Abraham Lincoln"@en .
}""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
df = pd.json_normalize(results['results']['bindings'])
df['x.value']
0    http://dbpedia.org/resource/Abraham_Lincoln
Name: x.value, dtype: object
Exercises
- find more examples for useful data in the DBpedia graph
- construct some Sparql queries and run them on the endpoint
- format the results in a nice fashion
Remember that DBpedia is based on Wikipedia which is a community effort, and not everything is perfect and complete.
Many interesting things can be found, but a little patience is required for this type of source.