Combining CSV and RDF Sources
We have seen both relational DB tables generated from CSV files and the results of SPARQL queries. Now we want to combine the two
sources.
There are several options; we will
- put both the CSV data and the results from SPARQL queries into relational DB tables, and then
- combine these sources simply by using a join operation on the DB tables.
We again use a Jupyter notebook to implement our method, and we start with the usual
%load_ext sql
With the extension loaded we connect to the Postgres DB:
%sql postgresql://username:password@postgresql.student-db.svc.cluster.local/username
Sunspots
First we get data on sunspots. There are many sources, here is a convenient one from the R archive:
Download CSV data
import requests
r = requests.get('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.year.csv')
f = open('sunspots.csv', 'wb')
f.write(r.content)
f.close()
The CSV file with the sunspot data is now in the current directory. We can check the content with our favorite
editor if we want to.
Import into DB
To put these data into a DB table in Postgres we use the Python package psycopg2 to connect to the DB:
import psycopg2
conn = psycopg2.connect(
"dbname='username' user='username' host='postgresql.student-db.svc.cluster.local' password='password'")
Before creating the table for the sunspot data we make sure it does not already exist:
%%sql
drop table if exists sunspots;
Now we use the Python package pandas to read the CSV file and convert it to a DB table:
import pandas
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://username:password@postgresql.student-db.svc.cluster.local/username')
pandas.read_csv('sunspots.csv').to_sql('sunspots', engine, index = False)
The conversion recognises valid integer and floating point values for the columns time and value, and performs
the data import. Of course we check the result:
%%sql
select * from sunspots limit 5;
There is a column Unnamed: 0, which we ignore.
Unnamed: 0 time value
1 1700 5.0
2 1701 11.0
3 1702 16.0
4 1703 23.0
5 1704 36.0
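The type inference performed by read_csv() can be seen on a small inline sample. This is only a sketch: the numbers are illustrative, and just the column layout matches sunspots.csv. Note how the empty first header becomes the column Unnamed: 0.

```python
import io
import pandas

# A small inline sample shaped like sunspots.csv (values illustrative);
# the empty first header in the file becomes the column "Unnamed: 0".
sample = io.StringIO('"","time","value"\n"1",1700,5.0\n"2",1701,11.0\n')
df = pandas.read_csv(sample)
print(df.dtypes)  # time is inferred as int64, value as float64
```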
Pirates
For the SPARQL query we need information on the properties used for entries on pirates. Simply look up your favorite
pirate in DBpedia and search for useful properties. We need to identify pirates, and we also need
the time when they were active.
Before putting the query into Python code it is recommended to experiment with it in the interactive DBpedia SPARQL endpoint.
This also gives us some idea of the type of results we will get.
Query RDF data
from SPARQLWrapper import SPARQLWrapper, JSON, CSV
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
SELECT distinct ?name ?year WHERE {
?x a yago:Pirate110435367 .
?x dbp:serviceyears ?year .
?x rdfs:label ?name .
FILTER (lang(?name) = 'en')
} LIMIT 100""")
sparql.setReturnFormat(JSON)
Make sure the following statement is not executed more often than necessary, in order not to overload the SPARQL endpoint.
pirates = sparql.query().convert()
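One simple way to avoid repeated queries is to cache the converted result on disk. The helper below is a sketch of our own (cached_query and the filename pirates.json are not part of SPARQLWrapper): the endpoint is only contacted when no cached result exists yet.

```python
import json
import os

def cached_query(sparql, cache='pirates.json'):
    """Return a cached SPARQL result from disk if available; otherwise
    run the query once and store the converted JSON in the cache file.
    (A sketch: cached_query and the cache filename are our own choices.)"""
    if os.path.exists(cache):
        with open(cache) as f:
            return json.load(f)
    result = sparql.query().convert()  # only executed on a cache miss
    with open(cache, 'w') as f:
        json.dump(result, f)
    return result
```

With this, `pirates = cached_query(sparql)` replaces the plain call above; deleting pirates.json forces a fresh query.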
Of course we again check the results. The expression [:10] here displays only the first 10 rows.
for res in pirates["results"]["bindings"][:10]:
print(res["name"]["value"],
res["year"]["value"])
As expected, some values for the year are not simply four-digit numbers. This will create problems later.
John Coxon (pirate) 1677
Sulayman Reis (pirate) c. 1607–c. 1620
Thomas Tew 1692
Cornelis Jol -1640.0
Henry Every 1694
Samuel Burgess 1690
Howell Davis --07-18
Pedro Gilbert 1832
Michiel Andrieszoon 1680.0
Mansel Alcantra 1820.0
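One could also normalise the year values before the import. The following is a rough heuristic of our own, not part of the original workflow: a regular expression pulls out the first four-digit year where one exists.

```python
import re

def first_year(text):
    """Extract the first plausible four-digit year, or None.
    A rough heuristic (our own addition): it handles values like
    'c. 1607-c. 1620' or '1680.0', but not fragments like '--07-18'."""
    m = re.search(r'\b(1[0-9]{3}|20[0-9]{2})\b', str(text))
    return int(m.group(1)) if m else None

print(first_year('c. 1607-c. 1620'))  # 1607
print(first_year('1680.0'))           # 1680
print(first_year('--07-18'))          # None
```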
Download to CSV
We write the result to a CSV file in the same manner as before. The columns name and year are sufficient.
f = open("pirates.csv", "w")
f.write("name,year\n")
for res in pirates["results"]["bindings"]:
f.write(res["name"]["value"] + "," + res["year"]["value"] + "\n")
f.close()
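Writing CSV by hand breaks as soon as a name contains a comma. A more robust sketch (the helper write_bindings_csv is our own addition) uses the csv module, which quotes such fields correctly:

```python
import csv

def write_bindings_csv(bindings, path):
    """Write name/year pairs from SPARQL JSON bindings to a CSV file.
    csv.writer quotes fields containing commas or quotes, which the
    manual string concatenation above does not handle."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'year'])
        for res in bindings:
            writer.writerow([res["name"]["value"], res["year"]["value"]])
```

It would be called as write_bindings_csv(pirates["results"]["bindings"], "pirates.csv").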
Import into DB
Now we can import it just as above with the sunspot data.
%%sql
drop table if exists pirates;
The import function to_sql() works in the same manner as above, but this time it has to use a text column for the year,
as some entries are not four-digit year values.
pandas.read_csv('pirates.csv').to_sql('pirates', engine, index = False)
Check the DB table:
%%sql
select * from pirates limit 5;
We find a variety of formats for the year of service:
name year
John Coxon (pirate) 1677
Sulayman Reis (pirate) c. 1607–c. 1620
Thomas Tew 1692
Cornelis Jol -1640.0
Henry Every 1694
Join
Almost done; we now have two DB tables: sunspots and pirates.
In the join we need to take care of the fact that the values for year in the pirates table are text.
The simple (and far from perfect) solution is to convert the sunspot years to text as well, since
the conversion from integer to text always works. The function cast() achieves this.
%%sql
select time, value, name
from sunspots join pirates on pirates.year = cast(sunspots.time as text);
The results look promising (mostly). Obviously our solution leaves a lot to be desired; however, we have a working
prototype.
time value name
1700 5.0 John Bowen (pirate)
1701 11.0 Miguel Enriquez (privateer)
1703 23.0 John Quelch (pirate)
1707 20.0 Nathaniel North
1713 2.0 Benjamin Hornigold
1716 47.0 Blackbeard
1716 47.0 Charles Vane
1717 63.0 Stede Bonnet
1718 60.0 Calico Jack
1718 60.0 Anne Bonny
1719 39.0 Bartholomew Roberts
1721 26.0 Edward Low
1724 21.0 John Gow
1726 78.0 William Fly
1781 68.1 Rachel Wall
1795 21.3 Kazimierz Lux
1795 21.3 Rais Hamidou
1805 42.2 Henri Caesar
1806 28.1 Charlotte Badger
1832 27.5 Pedro Gilbert
1845 40.1 Shap Ng-tsai
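A slightly more robust variant (still only a sketch) keeps the sunspot years numeric and instead extracts a plain four-digit year from the pirates table where possible. In Postgres, substring() with a POSIX pattern returns NULL for non-matching values, so rows like 'c. 1607–c. 1620' simply drop out of the join:

```sql
%%sql
select time, value, name
from sunspots
join pirates
  on sunspots.time = cast(substring(pirates.year from '^[0-9]{4}$') as integer);
```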
Exercises:
Find more examples of combining data from CSV and RDF sources. The possibilities are practically endless.