Combining data from relational DB and RDF¶

Now that we have seen both the relational world, and the knowledge graph approach, we will look at some very simple options to combine those two in a single application.

  • We will combine data about beer in the US..
  • ..with data about peope born in the cities of the respective breweries.
  • We are not aiming for a perfect solution, just a proof of concept.
  • Most of the components for this little project are already in our repertoire, we just need to put them together.

Relational Part: CSV Import¶

Let's return to our CSV file for the top cities in the US:

https://github.com/plotly/datasets/raw/master/beers.csv

Download the file and put it into the current directory of the notebook or script.

Or, for a dynamic version we can use the requests package:

In [206]:
# uncomment if needed
# %pip install requests 
In [184]:
import requests
import os.path

fn = "beers.csv"
if os.path.isfile(fn):
    print("file exists.")
else:
    url = 'https://github.com/plotly/datasets/raw/master/beers.csv'
    r = requests.get(url)
    f = open(fn, 'w')
    f.write(r.text)
    f.close()
    print("downloaded.")
file exists.

We use our shop DB again because it already contains the table with the state names.

In [185]:
%load_ext sql
The sql extension is already loaded. To reload it, use:
  %reload_ext sql
In [186]:
%sql sqlite:///shop.db
In [187]:
import sqlite3 
import pandas  
conn = sqlite3.connect('shop.db') 

Now we are ready to import the file, using the read_csv() function from the pandas package.

There are some problems with the column names, but pandas and sqlite should deal with them in a sensible manner, even if we get a UserWarning.

In [203]:
pandas.read_csv('beers.csv', encoding='utf8').to_sql('beer', 
                                                     conn, 
                                                     if_exists='replace',
                                                     index = False)
In [204]:
df = pandas.read_csv('beers.csv', encoding='utf8')
df[df.city == 'Black Mountain']
Out[204]:
Unnamed: 0 count.x abv ibu id beer style brewery_id ounces style2 count.y brewery city state label
1659 1660 1660 0.069 51.0 1760 GreyBeardâ„¢ IPA American IPA 324 12.0 American IPA 325 Pisgah Brewing Company Black Mountain NC GreyBeardâ„¢ IPA (Pisgah Brewing Company)
1660 1661 1661 0.057 31.0 1759 Pisgah Pale Ale American Pale Ale (APA) 324 12.0 American Pale Ale (APA) 325 Pisgah Brewing Company Black Mountain NC Pisgah Pale Ale (Pisgah Brewing Company)

The states are in code here.

  • Sadly, in the RDF source they will be in full name.
  • Luckily, we already have the state code and names!

We imported this earlier from

https://github.com/jasonong/List-of-US-States/blob/master/states.csv

In [191]:
%sql select * from state limit 5;
 * sqlite:///shop.db
Done.
Out[191]:
State Abbreviation
Alabama AL
Alaska AK
Arizona AZ
Arkansas AR
California CA

This very useful building block helps us here again.

  • We combine tables beer and state to get the state names for the state codes
  • For convenience we define a view
  • The concatenation operator || joins strings together in sqlite.
In [192]:
%%sql 

drop view if exists beers;

create view beers as select beer, brewery, city || ", " || state.state as city
from beer join state on beer.state = state.abbreviation;

select * from beers limit 5;
 * sqlite:///shop.db
Done.
Done.
Done.
Out[192]:
beer brewery city
Pub Beer 10 Barrel Brewing Company Bend, Oregon
Devil's Cup 18th Street Brewery Gary, Indiana
Rise of the Phoenix 18th Street Brewery Gary, Indiana
Sinister 18th Street Brewery Gary, Indiana
Sex and Candy 18th Street Brewery Gary, Indiana

We are happy with this. There are many things to improve, but for now we turn to the Sparql part.

RDF Part: Sparql Query¶

We get some data about tennis players from the DBpedia sparql endpoint.

In [193]:
from SPARQLWrapper import SPARQLWrapper, JSON, CSV
In [194]:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

In the query set the LIMIT for the number of results, just to make sure we do not want to overload the endpoint. However, in this example this turns out not to be a problem.

In [195]:
sparql.setQuery("""
SELECT ?who ?city WHERE {
      ?x dbo:birthPlace ?p .
      ?x foaf:name ?who .
      ?p dbo:country dbr:United_States_of_America .
      ?p rdfs:label ?city .
 FILTER (lang(?city) = 'en') .
} LIMIT 500""") 

sparql.setReturnFormat(JSON)

people = sparql.query().convert()["results"]["bindings"]

for res in people[:30]:
    print(res["who"]["value"], 
          res["city"]["value"])
print(len(people))
Carl Williams Belle Glade, Florida
Carrot Top Rockledge, Florida
Casey Printers DeSoto, Texas
Roberta Flack Black Mountain, North Carolina
Beau Taylor Rockledge, Florida
Bill Butler Hyattsville, Maryland
Bill Calhoun Rockmart, Georgia
Brad Daugherty Black Mountain, North Carolina
Brandon Morales Pharr, Texas
Brianna Walle Delray Beach, Florida
David Barrera San Juan, Texas
DeLane Matthews Rockledge, Florida
Jose Antonio Menendez San Juan, Texas
Joy Cheek Hyattsville, Maryland
Rhi Jeffrey Delray Beach, Florida
Rhiannon Jeffrey Delray Beach, Florida
Robert Johnson Delray Beach, Florida
Jamon Dumas-Johnson Hyattsville, Maryland
Melissa Witek Rockledge, Florida
S. S. Cooke Hyattsville, Maryland
Chris McCray Capitol Heights, Maryland
Garrett Wittels Bay Harbor Islands, Florida
Gary Antonio Russell Capitol Heights, Maryland
Gary Antuanne Russell Capitol Heights, Maryland
Goldie Cephus Preston, Maryland
Mandy Freeman Royal Palm Beach, Florida
Miranda Alexis Freeman Royal Palm Beach, Florida
Stephen Clancy Hill Riverdale Park, Maryland
Steve Crisafulli Rockledge, Florida
Austen Rowland Hyattsville, Maryland
357

This is useable for a proof of concept.

Combining the Results¶

Now we hope to get at least some hits for beers in the respective birth cities of people.

  • We do a little coding, effectively joining DB table and Sparql results.
  • This version lets us do some formatting directly.
In [196]:
# we need this for the c.execute()
c = conn.cursor()
    
# go through the Sparql result
for res in people:
    # city format: city name, state name
    city = res["city"]["value"]
    # force query result into a list, otherwise it will not be evaluated at this point
    rows = list(c.execute("select distinct beer, brewery from beers where city = ?", (city,)))
    # only output anything if there are matches; additional formatting can go here
    if len(rows) > 0:
        print(res["who"]["value"], "born in", city)
        print("Beers:") #  brewed in", city, ":")
        # for layout, at least indent the beer labels
        for row in rows:
            print("    ", row[0], "by", row[1])
        # leave an empty line before the next entry
        print("")
        
Roberta Flack born in Black Mountain, North Carolina
Beers:
     GreyBeardâ„¢ IPA by Pisgah Brewing Company
     Pisgah Pale Ale by Pisgah Brewing Company

Brad Daugherty born in Black Mountain, North Carolina
Beers:
     GreyBeardâ„¢ IPA by Pisgah Brewing Company
     Pisgah Pale Ale by Pisgah Brewing Company

Barbara Gardner Proctor born in Black Mountain, North Carolina
Beers:
     GreyBeardâ„¢ IPA by Pisgah Brewing Company
     Pisgah Pale Ale by Pisgah Brewing Company

Combine via CSV¶

If you are unhappy with the amount of Python code here is another version:

  • dump Sparql results into a CSV file
  • read CSV file into SQlite DB
  • use SQL Join

Not very elegant but simple, and gives us full control over columns and their names.

Note that we cannot use the comma as a delimiter since it is used in the city plus state names.

In [197]:
f = open("people.csv", "w")
f.write("person;city\n")
for res in people:
    f.write(res["who"]["value"] + ";" + res["city"]["value"] + "\n")
f.close()

pandas.read_csv('people.csv', delimiter=";").to_sql('people', 
                                                    conn, 
                                                    if_exists='replace',
                                                    index = False)
In [198]:
%%sql

select person, people.city, beer from beers join people 
on beers.city = people.city limit 20;
 * sqlite:///shop.db
Done.
Out[198]:
person city beer
Barbara Gardner Proctor Black Mountain, North Carolina GreyBeardâ„¢ IPA
Brad Daugherty Black Mountain, North Carolina GreyBeardâ„¢ IPA
Roberta Flack Black Mountain, North Carolina GreyBeardâ„¢ IPA
Barbara Gardner Proctor Black Mountain, North Carolina Pisgah Pale Ale
Brad Daugherty Black Mountain, North Carolina Pisgah Pale Ale
Roberta Flack Black Mountain, North Carolina Pisgah Pale Ale

pd.json_normalize¶

Another option for putting Sparql results into an SQL table. This one adds type and language data for each column.

In [199]:
import pandas as pd
pd.json_normalize(people).to_sql('people', 
                                 conn, 
                                 if_exists='replace',
                                 index=False)
In [200]:
%sql select * from people limit 5;
 * sqlite:///shop.db
Done.
Out[200]:
who.type who.xml:lang who.value city.type city.xml:lang city.value
literal en Carl Williams literal en Belle Glade, Florida
literal en Carrot Top literal en Rockledge, Florida
literal en Casey Printers literal en DeSoto, Texas
literal en Roberta Flack literal en Black Mountain, North Carolina
literal en Beau Taylor literal en Rockledge, Florida

Create a view and rename the who column; use quotes for problematic column names:

In [201]:
%%sql

drop view if exists brew;

create view brew as
select beer, brewery, city, "who.value" as who from beers join people 
on beers.city = people."city.value";

select * from brew limit 20;
 * sqlite:///shop.db
Done.
Done.
Done.
Out[201]:
beer brewery city who
GreyBeardâ„¢ IPA Pisgah Brewing Company Black Mountain, North Carolina Barbara Gardner Proctor
GreyBeardâ„¢ IPA Pisgah Brewing Company Black Mountain, North Carolina Brad Daugherty
GreyBeardâ„¢ IPA Pisgah Brewing Company Black Mountain, North Carolina Roberta Flack
Pisgah Pale Ale Pisgah Brewing Company Black Mountain, North Carolina Barbara Gardner Proctor
Pisgah Pale Ale Pisgah Brewing Company Black Mountain, North Carolina Brad Daugherty
Pisgah Pale Ale Pisgah Brewing Company Black Mountain, North Carolina Roberta Flack

Exercises¶

  • elaborate on the example above; there are lots of things to improve
  • find more CSV files and RDF data and combine them