Combining CSV and RDF Sources
We have seen both relational DB tables generated from CSV files and the results of SPARQL queries. Now we want to combine the two
sources.
There are several options; we will
- put both the CSV data and the results from SPARQL queries into relational DB tables, and then
- combine these sources simply by using a join operation on the DB tables.
We again use a Jupyter notebook to implement our method, and we start with the usual
%load_ext sql
With the extension loaded we connect to the Postgres DB:
%sql postgresql://username:password@postgresql.student-db.svc.cluster.local/username
Sunspots
First we get data on sunspots. There are many sources, here is a convenient one from the R archive:
Download CSV data
import requests
r = requests.get('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.year.csv')
f = open('sunspots.csv', 'wb')
f.write(r.content)
f.close()
The CSV file with the sunspot data is now in the current directory. We can check the content with our favorite
editor if we want to.
Import into DB
To put these data into a DB table in Postgres we use the Python package psycopg2 to connect to the DB:
import psycopg2
conn = psycopg2.connect(
"dbname='username' user='username' host='postgresql.student-db.svc.cluster.local' password='password'")
Before creating the table for the sunspot data we make sure it does not already exist:
%%sql
drop table if exists sunspots;
Now we use the Python package pandas to read the CSV file and convert it to a DB table:
import pandas
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://username:password@postgresql.student-db.svc.cluster.local/username')
pandas.read_csv('sunspots.csv').to_sql('sunspots', engine, index = False)
The conversion recognises valid integer and floating point values for the columns time and value, and performs
the data import. Of course we check the result:
%%sql
select * from sunspots limit 5;
There is a column Unnamed: 0, which we ignore.
Unnamed: 0 time value
1 1700 5.0
2 1701 11.0
3 1702 16.0
4 1703 23.0
5 1704 36.0
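The type inference performed by read_csv() can be seen on a small inline sample. This is only a sketch: the numbers are illustrative, and just the column layout matches sunspots.csv. Note how the empty first header becomes the column Unnamed: 0.

```python
import io
import pandas

# A small inline sample shaped like sunspots.csv (values illustrative);
# the empty first header in the file becomes the column "Unnamed: 0".
sample = io.StringIO('"","time","value"\n"1",1700,5.0\n"2",1701,11.0\n')
df = pandas.read_csv(sample)
print(df.dtypes)  # time is inferred as int64, value as float64
```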
Pirates
For the SPARQL query we need information on the properties used for entries on pirates. Simply look up your favorite
pirate in DBpedia and search for useful properties. We need to identify pirates, and we also need
the time when they were active.
Before putting the query into Python code it is recommended to experiment with it in the interactive DBpedia SPARQL endpoint.
This also gives us some idea of the type of results we will get.
Query RDF data
from SPARQLWrapper import SPARQLWrapper, JSON, CSV
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
SELECT distinct ?name ?year WHERE {
?x a yago:Pirate110435367 .
?x dbp:serviceyears ?year .
?x rdfs:label ?name .
FILTER (lang(?name) = 'en')
} LIMIT 100""")
sparql.setReturnFormat(JSON)
Make sure the following statement is not executed more often than necessary, in order not to overload the SPARQL endpoint.
pirates = sparql.query().convert()
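One simple way to avoid repeated queries is to cache the converted result on disk. The helper below is a sketch of our own (cached_query and the filename pirates.json are not part of SPARQLWrapper): the endpoint is only contacted when no cached result exists yet.

```python
import json
import os

def cached_query(sparql, cache='pirates.json'):
    """Return a cached SPARQL result from disk if available; otherwise
    run the query once and store the converted JSON in the cache file.
    (A sketch: cached_query and the cache filename are our own choices.)"""
    if os.path.exists(cache):
        with open(cache) as f:
            return json.load(f)
    result = sparql.query().convert()  # only executed on a cache miss
    with open(cache, 'w') as f:
        json.dump(result, f)
    return result
```

With this, `pirates = cached_query(sparql)` replaces the plain call above; deleting pirates.json forces a fresh query.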
Of course we again check the results. The expression [:10] here displays only the first 10 rows.
for res in pirates["results"]["bindings"][:10]:
print(res["name"]["value"],
res["year"]["value"])
As expected, some values for the year are not simply four-digit numbers. This will create problems later.
John Coxon (pirate) 1677
Sulayman Reis (pirate) c. 1607–c. 1620
Thomas Tew 1692
Cornelis Jol -1640.0
Henry Every 1694
Samuel Burgess 1690
Howell Davis --07-18
Pedro Gilbert 1832
Michiel Andrieszoon 1680.0
Mansel Alcantra 1820.0
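One could also normalise the year values before the import. The following is a rough heuristic of our own, not part of the original workflow: a regular expression pulls out the first four-digit year where one exists.

```python
import re

def first_year(text):
    """Extract the first plausible four-digit year, or None.
    A rough heuristic (our own addition): it handles values like
    'c. 1607-c. 1620' or '1680.0', but not fragments like '--07-18'."""
    m = re.search(r'\b(1[0-9]{3}|20[0-9]{2})\b', str(text))
    return int(m.group(1)) if m else None

print(first_year('c. 1607-c. 1620'))  # 1607
print(first_year('1680.0'))           # 1680
print(first_year('--07-18'))          # None
```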
Download to CSV
We write the result to a CSV file in the same manner as before. The columns name and year are sufficient.
f = open("pirates.csv", "w")
f.write("name,year\n")
for res in pirates["results"]["bindings"]:
f.write(res["name"]["value"] + "," + res["year"]["value"] + "\n")
f.close()
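Writing CSV by hand breaks as soon as a name contains a comma. A more robust sketch (the helper write_bindings_csv is our own addition) uses the csv module, which quotes such fields correctly:

```python
import csv

def write_bindings_csv(bindings, path):
    """Write name/year pairs from SPARQL JSON bindings to a CSV file.
    csv.writer quotes fields containing commas or quotes, which the
    manual string concatenation above does not handle."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'year'])
        for res in bindings:
            writer.writerow([res["name"]["value"], res["year"]["value"]])
```

It would be called as write_bindings_csv(pirates["results"]["bindings"], "pirates.csv").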
Import into DB
Now we can import it just as above with the sunspot data.
%%sql
drop table if exists pirates;
The import function to_sql() works in the same manner as above, but this time it has to use a text column for the year,
as some entries are not four-digit year values.
pandas.read_csv('pirates.csv').to_sql('pirates', engine, index = False)
Check the DB table:
%%sql
select * from pirates limit 5;
We find a variety of formats for the year of service:
name year
John Coxon (pirate) 1677
Sulayman Reis (pirate) c. 1607–c. 1620
Thomas Tew 1692
Cornelis Jol -1640.0
Henry Every 1694
Join
Almost done; we now have two DB tables: sunspots and pirates.
In the join we need to take care of the fact that the values for year in the pirates table are text.
The simple (and far from perfect) solution is to convert the sunspot years to text as well, since
the conversion from integer to text always works. The function cast() achieves this.
%%sql
select time, value, name
from sunspots join pirates on pirates.year = cast(sunspots.time as text);
The results look promising (mostly). Obviously our solution leaves a lot to be desired; however, we have a working
prototype.
time value name
1700 5.0 John Bowen (pirate)
1701 11.0 Miguel Enriquez (privateer)
1703 23.0 John Quelch (pirate)
1707 20.0 Nathaniel North
1713 2.0 Benjamin Hornigold
1716 47.0 Blackbeard
1716 47.0 Charles Vane
1717 63.0 Stede Bonnet
1718 60.0 Calico Jack
1718 60.0 Anne Bonny
1719 39.0 Bartholomew Roberts
1721 26.0 Edward Low
1724 21.0 John Gow
1726 78.0 William Fly
1781 68.1 Rachel Wall
1795 21.3 Kazimierz Lux
1795 21.3 Rais Hamidou
1805 42.2 Henri Caesar
1806 28.1 Charlotte Badger
1832 27.5 Pedro Gilbert
1845 40.1 Shap Ng-tsai
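A slightly more robust variant (still only a sketch) keeps the sunspot years numeric and instead extracts a plain four-digit year from the pirates table where possible. In Postgres, substring() with a POSIX pattern returns NULL for non-matching values, so rows like 'c. 1607–c. 1620' simply drop out of the join:

```sql
%%sql
select time, value, name
from sunspots
join pirates
  on sunspots.time = cast(substring(pirates.year from '^[0-9]{4}$') as integer);
```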
Exercises:
Find more examples of combining data from CSV and RDF sources. The possibilities are practically endless.