Combining data from relational DB and RDF¶
Now that we have seen both the relational world, and the knowledge graph approach, we will look at some very simple options to combine those two in a single application.
- We will combine data about beer in the US..
- ..with data about peope born in the cities of the respective breweries.
- We are not aiming for a perfect solution, just a proof of concept.
- Most of the components for this little project are already in our repertoire, we just need to put them together.
Relational Part: CSV Import¶
Let's return to our CSV file for the top cities in the US:
https://github.com/plotly/datasets/raw/master/beers.csv
Download the file and put it into the current directory of the notebook or script.
Or, for a dynamic version we can use the requests package:
# uncomment if needed
# %pip install requests
import requests
import os.path
fn = "beers.csv"
if os.path.isfile(fn):
print("file exists.")
else:
url = 'https://github.com/plotly/datasets/raw/master/beers.csv'
r = requests.get(url)
f = open(fn, 'w')
f.write(r.text)
f.close()
print("downloaded.")
file exists.
We use our shop DB again because it already contains the table with the state names.
%load_ext sql
The sql extension is already loaded. To reload it, use: %reload_ext sql
%sql sqlite:///shop.db
import sqlite3
import pandas
conn = sqlite3.connect('shop.db')
Now we are ready to import the file, using the read_csv() function from the pandas package.
There are some problems with the column names, but pandas and sqlite should deal with them in a sensible manner, even if we get a UserWarning.
pandas.read_csv('beers.csv', encoding='utf8').to_sql('beer',
conn,
if_exists='replace',
index = False)
df = pandas.read_csv('beers.csv', encoding='utf8')
df[df.city == 'Black Mountain']
Unnamed: 0 | count.x | abv | ibu | id | beer | style | brewery_id | ounces | style2 | count.y | brewery | city | state | label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1659 | 1660 | 1660 | 0.069 | 51.0 | 1760 | GreyBeardâ„¢ IPA | American IPA | 324 | 12.0 | American IPA | 325 | Pisgah Brewing Company | Black Mountain | NC | GreyBeardâ„¢ IPA (Pisgah Brewing Company) |
1660 | 1661 | 1661 | 0.057 | 31.0 | 1759 | Pisgah Pale Ale | American Pale Ale (APA) | 324 | 12.0 | American Pale Ale (APA) | 325 | Pisgah Brewing Company | Black Mountain | NC | Pisgah Pale Ale (Pisgah Brewing Company) |
The states are in code here.
- Sadly, in the RDF source they will be in full name.
- Luckily, we already have the state code and names!
We imported this earlier from
https://github.com/jasonong/List-of-US-States/blob/master/states.csv
%sql select * from state limit 5;
* sqlite:///shop.db Done.
State | Abbreviation |
---|---|
Alabama | AL |
Alaska | AK |
Arizona | AZ |
Arkansas | AR |
California | CA |
This very useful building block helps us here again.
- We combine tables beer and state to get the state names for the state codes
- For convenience we define a view
- The concatenation operator || joins strings together in sqlite.
%%sql
drop view if exists beers;
create view beers as select beer, brewery, city || ", " || state.state as city
from beer join state on beer.state = state.abbreviation;
select * from beers limit 5;
* sqlite:///shop.db Done. Done. Done.
beer | brewery | city |
---|---|---|
Pub Beer | 10 Barrel Brewing Company | Bend, Oregon |
Devil's Cup | 18th Street Brewery | Gary, Indiana |
Rise of the Phoenix | 18th Street Brewery | Gary, Indiana |
Sinister | 18th Street Brewery | Gary, Indiana |
Sex and Candy | 18th Street Brewery | Gary, Indiana |
We are happy with this. There are many things to improve, but for now we turn to the Sparql part.
RDF Part: Sparql Query¶
We get some data about tennis players from the DBpedia sparql endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON, CSV
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
In the query set the LIMIT for the number of results, just to make sure we do not want to overload the endpoint. However, in this example this turns out not to be a problem.
sparql.setQuery("""
SELECT ?who ?city WHERE {
?x dbo:birthPlace ?p .
?x foaf:name ?who .
?p dbo:country dbr:United_States_of_America .
?p rdfs:label ?city .
FILTER (lang(?city) = 'en') .
} LIMIT 500""")
sparql.setReturnFormat(JSON)
people = sparql.query().convert()["results"]["bindings"]
for res in people[:30]:
print(res["who"]["value"],
res["city"]["value"])
print(len(people))
Carl Williams Belle Glade, Florida Carrot Top Rockledge, Florida Casey Printers DeSoto, Texas Roberta Flack Black Mountain, North Carolina Beau Taylor Rockledge, Florida Bill Butler Hyattsville, Maryland Bill Calhoun Rockmart, Georgia Brad Daugherty Black Mountain, North Carolina Brandon Morales Pharr, Texas Brianna Walle Delray Beach, Florida David Barrera San Juan, Texas DeLane Matthews Rockledge, Florida Jose Antonio Menendez San Juan, Texas Joy Cheek Hyattsville, Maryland Rhi Jeffrey Delray Beach, Florida Rhiannon Jeffrey Delray Beach, Florida Robert Johnson Delray Beach, Florida Jamon Dumas-Johnson Hyattsville, Maryland Melissa Witek Rockledge, Florida S. S. Cooke Hyattsville, Maryland Chris McCray Capitol Heights, Maryland Garrett Wittels Bay Harbor Islands, Florida Gary Antonio Russell Capitol Heights, Maryland Gary Antuanne Russell Capitol Heights, Maryland Goldie Cephus Preston, Maryland Mandy Freeman Royal Palm Beach, Florida Miranda Alexis Freeman Royal Palm Beach, Florida Stephen Clancy Hill Riverdale Park, Maryland Steve Crisafulli Rockledge, Florida Austen Rowland Hyattsville, Maryland 357
This is useable for a proof of concept.
Combining the Results¶
Now we hope to get at least some hits for beers in the respective birth cities of people.
- We do a little coding, effectively joining DB table and Sparql results.
- This version lets us do some formatting directly.
# we need this for the c.execute()
c = conn.cursor()
# go through the Sparql result
for res in people:
# city format: city name, state name
city = res["city"]["value"]
# force query result into a list, otherwise it will not be evaluated at this point
rows = list(c.execute("select distinct beer, brewery from beers where city = ?", (city,)))
# only output anything if there are matches; additional formatting can go here
if len(rows) > 0:
print(res["who"]["value"], "born in", city)
print("Beers:") # brewed in", city, ":")
# for layout, at least indent the beer labels
for row in rows:
print(" ", row[0], "by", row[1])
# leave an empty line before the next entry
print("")
Roberta Flack born in Black Mountain, North Carolina Beers: GreyBeardâ„¢ IPA by Pisgah Brewing Company Pisgah Pale Ale by Pisgah Brewing Company Brad Daugherty born in Black Mountain, North Carolina Beers: GreyBeardâ„¢ IPA by Pisgah Brewing Company Pisgah Pale Ale by Pisgah Brewing Company Barbara Gardner Proctor born in Black Mountain, North Carolina Beers: GreyBeardâ„¢ IPA by Pisgah Brewing Company Pisgah Pale Ale by Pisgah Brewing Company
Combine via CSV¶
If you are unhappy with the amount of Python code here is another version:
- dump Sparql results into a CSV file
- read CSV file into SQlite DB
- use SQL Join
Not very elegant but simple, and gives us full control over columns and their names.
Note that we cannot use the comma as a delimiter since it is used in the city plus state names.
f = open("people.csv", "w")
f.write("person;city\n")
for res in people:
f.write(res["who"]["value"] + ";" + res["city"]["value"] + "\n")
f.close()
pandas.read_csv('people.csv', delimiter=";").to_sql('people',
conn,
if_exists='replace',
index = False)
%%sql
select person, people.city, beer from beers join people
on beers.city = people.city limit 20;
* sqlite:///shop.db Done.
person | city | beer |
---|---|---|
Barbara Gardner Proctor | Black Mountain, North Carolina | GreyBeardâ„¢ IPA |
Brad Daugherty | Black Mountain, North Carolina | GreyBeardâ„¢ IPA |
Roberta Flack | Black Mountain, North Carolina | GreyBeardâ„¢ IPA |
Barbara Gardner Proctor | Black Mountain, North Carolina | Pisgah Pale Ale |
Brad Daugherty | Black Mountain, North Carolina | Pisgah Pale Ale |
Roberta Flack | Black Mountain, North Carolina | Pisgah Pale Ale |
pd.json_normalize¶
Another option for putting Sparql results into an SQL table. This one adds type and language data for each column.
import pandas as pd
pd.json_normalize(people).to_sql('people',
conn,
if_exists='replace',
index=False)
%sql select * from people limit 5;
* sqlite:///shop.db Done.
who.type | who.xml:lang | who.value | city.type | city.xml:lang | city.value |
---|---|---|---|---|---|
literal | en | Carl Williams | literal | en | Belle Glade, Florida |
literal | en | Carrot Top | literal | en | Rockledge, Florida |
literal | en | Casey Printers | literal | en | DeSoto, Texas |
literal | en | Roberta Flack | literal | en | Black Mountain, North Carolina |
literal | en | Beau Taylor | literal | en | Rockledge, Florida |
Create a view and rename the who column; use quotes for problematic column names:
%%sql
drop view if exists brew;
create view brew as
select beer, brewery, city, "who.value" as who from beers join people
on beers.city = people."city.value";
select * from brew limit 20;
* sqlite:///shop.db Done. Done. Done.
beer | brewery | city | who |
---|---|---|---|
GreyBeardâ„¢ IPA | Pisgah Brewing Company | Black Mountain, North Carolina | Barbara Gardner Proctor |
GreyBeardâ„¢ IPA | Pisgah Brewing Company | Black Mountain, North Carolina | Brad Daugherty |
GreyBeardâ„¢ IPA | Pisgah Brewing Company | Black Mountain, North Carolina | Roberta Flack |
Pisgah Pale Ale | Pisgah Brewing Company | Black Mountain, North Carolina | Barbara Gardner Proctor |
Pisgah Pale Ale | Pisgah Brewing Company | Black Mountain, North Carolina | Brad Daugherty |
Pisgah Pale Ale | Pisgah Brewing Company | Black Mountain, North Carolina | Roberta Flack |
Exercises¶
- elaborate on the example above; there are lots of things to improve
- find more CSV files and RDF data and combine them