Database Applications¶
So far we have interacted with our databases by issuing individual SQL statements. However, this method of entering and retrieving data quickly becomes much too cumbersome as the amount of data grows and processes must be repeated on a regular schedule; hence the need for software programs that automate the management of data in a database: database applications.
There are many areas where DB applications can be used; here we will look at automating the tasks of
- importing data from external files to database tables
- exporting data from database tables to files
- generating reports in a specific format
Expanding foreign keys¶
Let us take a look at a different version of our supplier table; this one contains the city and state names:
%load_ext sql
%sql sqlite:///shop.db
%%sql
drop table if exists supp;
create table supp (
id varchar primary key,
name varchar(30),
city varchar(30),
state varchar(5),
status int);
insert into supp values ('S1', 'Smith', 'Detroit', 'MI', 20);
insert into supp values ('S2', 'Jones', 'Dallas', 'TX', 30);
insert into supp values ('S3', 'Brown', 'San Antonio', 'TX', 30);
select * from supp;
* sqlite:///shop.db Done. Done. 1 rows affected. 1 rows affected. 1 rows affected. Done.
id | name | city | state | status |
---|---|---|---|---|
S1 | Smith | Detroit | MI | 20 |
S2 | Jones | Dallas | TX | 30 |
S3 | Brown | San Antonio | TX | 30 |
Let's assume we want a listing showing the number of suppliers per state:
%sql select state, count(*) as suppliers from supp group by state;
* sqlite:///shop.db Done.
state | suppliers |
---|---|
MI | 1 |
TX | 2 |
This listing provides the state codes but not the state names. We could of course create a table and enter the names and codes ourselves; however, there is a better solution: we use our favorite Internet search engine to find a data file that already contains what we want, and import that file into our database.
Finding files with useful data is sometimes easy, as in the case of US state codes, and sometimes less easy. In this case we can use a number of sources, such as
https://github.com/jasonong/List-of-US-States/blob/master/states.csv
From the link above the file can be downloaded to your computer (use the Raw button). To use the file in the Jupyter notebook we put it into the same directory as the notebook.
- If you work with your own local Jupyter notebook server you can use the file manager.
- For a remote notebook server we have to upload the file: use the Upload button on the home tab of the notebook, titled Home Page in your browser.
Importing data from CSV files¶
A common file format for both text and numerical data is CSV, comma-separated values. There is no standard for CSV; a file can have the suffix .csv and still use tabs or semicolons as separators instead of commas. Some CSV files have headers. In some cases quotes are used for text.
It is therefore beneficial to use a ready-made solution that takes care of these problems. One of these is the Python package Pandas, which is already part of some distributions of Python. The name derives from panel data analysis (multi-dimensional data over time), not the bears.
The following command can check for the pandas package and install it if necessary. Do not uncomment it unless you run into an error when you import pandas. If you are using a conda Python distribution, use %conda instead. If this for some reason does not work for you, there is always the command line: pip install pandas
☆ pip, conda, and software dependencies¶
You may wonder why we do not leave the pip statements uncommented by default.
- One reason is that calling pip to check the existing installation takes quite some time, maybe longer than everything else in this notebook. There is no need to wait that long every time we 'Run All Cells'.
- Python comes in various distributions, and pip is not always the package manager, it could be e.g. conda. Your installation can get corrupted in subtle ways if you do not use the proper package manager.
- The problem is not so much the new package itself, in this case pandas, but all the other packages that it depends on: the dependencies
- the dependencies need to be installed or upgraded in order for the new package to work
- In an ideal world installing or upgrading software should never be a problem since the package manager takes care of all dependencies and their proper versions, and new versions of software packages are always fully backward compatible
- In the real world a new version of a dependency is not necessarily fully backward compatible with the installed older version; you can disable your existing and properly working software by upgrading dependencies!
- These types of problems may not be immediately obvious and therefore even harder to discover and fix, because it is not necessarily the application you are currently working on that is affected; they can be addressed e.g. with virtual environments.
→ If things already work, we do not touch them. Never change a running system!
# uncomment next line if needed
# %pip install pandas
Documentation for Pandas read_csv():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
We create the table from the CSV source using the headers in the first line of the file as column names.
The following Python code
- uses the sqlite package to connect to the database
- with the DB connection the pandas package can read the CSV file and create the database table
Note that the following cell is not SQL but Python code.
import sqlite3
import pandas
conn = sqlite3.connect('shop.db') # connection to database in current directory
# read_csv: create Pandas dataframe from file in current directory
# to_sql: create DB table from dataframe
pandas.read_csv('states.csv').to_sql('state',
                                     conn,
                                     if_exists='replace',
                                     index=False)
- index=False: avoid creating an additional index column
- if_exists='replace': overwrite any existing table, making a prior SQL drop table statement unnecessary
- This is not a problem here since this table will not be referenced by any foreign key statements.
Now the table exists in the database, and we can access it using SQL commands:
%sql select * from state limit 3;
* sqlite:///shop.db Done.
State | Abbreviation |
---|---|
Alabama | AL |
Alaska | AK |
Arizona | AZ |
The table provides the translation from state code to state name. We can use it to add the name of the states to our supplier report:
%%sql
select state.state as State, count(*) as Suppliers
from supp join state on supp.state = state.abbreviation group by state.state;
* sqlite:///shop.db Done.
State | Suppliers |
---|---|
Michigan | 1 |
Texas | 2 |
Combining Multiple Sources¶
Suppose we want to combine data on various cities across the US, such as economic and population data. There is always a chance of finding a single source that already has everything we want; more likely, however, we have to combine data from various sources.
In general this can be tricky since data found on different sites on the web will not comply with common conventions and will also often be incomplete or even erroneous. Here we have carefully chosen our data files to avoid such problems.
In this example we wish to combine data on store openings across cities with population data. After some time spent searching we finally find CSV files that contain the data.
Sources:
- https://github.com/plotly/datasets/blob/master/us-cities-top-1k.csv
- https://github.com/plotly/datasets/blob/master/1962_2006_walmart_store_openings.csv
- https://github.com/jasonong/List-of-US-States/blob/master/states.csv
On the github page use the Raw button to download the plain CSV data file.
Using our favorite text editor we find that
- Our store opening source contains city names and state codes
- Our population source contains city names and state names
In order to join the store opening and population data we need to translate the state codes into state names. Luckily we already have our state table which does exactly that.
%%sql
drop table if exists city;
drop table if exists store;
* sqlite:///shop.db Done. Done.
[]
pandas.read_csv('us-cities-top-1k.csv').to_sql('city', conn, index = False)
pandas.read_csv('1962_2006_walmart_store_openings.csv').to_sql('store', conn, index = False)
%sql select * from city limit 3;
* sqlite:///shop.db Done.
City | State | Population | lat | lon |
---|---|---|---|---|
Marysville | Washington | 63269 | 48.0517637 | -122.1770818 |
Perris | California | 72326 | 33.7825194 | -117.2286478 |
Cleveland | Ohio | 390113 | 41.49932 | -81.6943605 |
%sql select * from store limit 3;
* sqlite:///shop.db Done.
storenum | OPENDATE | date_super | conversion | st | county | STREETADDR | STRCITY | STRSTATE | ZIPCODE | type_store | LAT | LON | MONTH | DAY | YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7/1/62 | 3/1/97 | 1.0 | 5 | 7 | 2110 WEST WALNUT | Rogers | AR | 72756 | Supercenter | 36.342235 | -94.07141 | 7 | 1 | 1962 |
2 | 8/1/64 | 3/1/96 | 1.0 | 5 | 9 | 1417 HWY 62/65 N | Harrison | AR | 72601 | Supercenter | 36.236984 | -93.09345 | 8 | 1 | 1964 |
4 | 8/1/65 | 3/1/02 | 1.0 | 5 | 7 | 2901 HWY 412 EAST | Siloam Springs | AR | 72761 | Supercenter | 36.179905 | -94.50208 | 8 | 1 | 1965 |
SQL Join, Create a view¶
Our SQL statement to join the three tables will be somewhat complex; a view can be used to make that statement available later with less complexity. We can select from a view in just the same way as from a table.
%%sql
drop view if exists storepop;
create view storepop as select city.city, city.state,
max(city.population) as population, count(year) as stores
from city join store join state
where city.city = store.STRCITY and city.state = state.State
and state.abbreviation = store.STRSTATE
group by city.city, city.state;
select * from storepop limit 10;
* sqlite:///shop.db Done. Done. Done.
City | State | population | stores |
---|---|---|---|
Abilene | Texas | 120099 | 2 |
Addison | Illinois | 37385 | 1 |
Akron | Ohio | 198100 | 1 |
Albany | Georgia | 76185 | 1 |
Albany | New York | 98424 | 1 |
Albuquerque | New Mexico | 556495 | 5 |
Alexandria | Louisiana | 48426 | 1 |
Alexandria | Virginia | 148892 | 1 |
Allentown | Pennsylvania | 118577 | 1 |
Alpharetta | Georgia | 62298 | 1 |
Exporting CSV¶
Since we have already seen how to import CSV we now take a brief look at CSV as the output format. Executing the code below results in a file named storepop.csv being created in the current directory.
f = open('storepop.csv', 'w')
f.write('city,state,population,stores\n')
c = conn.cursor()
for row in c.execute('select city, state, population, stores from storepop'):
    f.write(row[0] + ',' + row[1] + ',' + str(row[2]) + ',' + str(row[3]) + '\n')
f.close()
This piece of code provides an example for the steps in creating a report:
- open a file for writing
- write the header line
- get a cursor from the database connection
- execute the SQL query and loop over the resulting rows
- write each row to the output file
- row[0] : name of city
- row[1] : name of state
- row[2] : population; this is an integer, so we must convert it to a string
- row[3] : stores; another integer
- finally we add a newline character, since we want each city on a separate line
- close the file i.e. finish processing it
Note carefully the quotes starting and ending text, and the plus signs gluing the parts of the output together. It is very easy to make mistakes here. The colors from syntax highlighting help in finding problems.
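As an alternative to gluing strings together ourselves, Python's standard csv module takes care of separators, quoting, and newlines. A minimal sketch, using an in-memory database with made-up sample rows in place of shop.db:

```python
import csv
import sqlite3

# Tiny in-memory stand-in for shop.db with hypothetical sample data
conn = sqlite3.connect(':memory:')
conn.execute('create table storepop (city text, state text, population int, stores int)')
conn.execute("insert into storepop values ('Abilene', 'Texas', 120099, 2)")
conn.execute("insert into storepop values ('Akron', 'Ohio', 198100, 1)")

# csv.writer handles commas, quoting, and line endings for us
with open('storepop.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['city', 'state', 'population', 'stores'])
    writer.writerows(conn.execute('select * from storepop'))
```

With this approach a city name containing a comma would be quoted automatically instead of silently breaking the file format.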
Generating Reports¶
A plain text file is useful in many situations, but often we want somewhat nicer formatting options. PDF and RTF are somewhat elaborate; the idea can be illustrated in HTML, a markup language that is already familiar to many people. We simply take the code from above and add some markup to achieve basic formatting for the title and the table.
f = open('report.html', 'w')
f.write('<h1>Cities, Population, and Store Openings</h1><table>\n')
c = conn.cursor()
for row in c.execute('select city, state, population, stores from storepop'):
    f.write('<tr><td>' + row[0] + '<td>' + row[1] + '<td>'
            + str(row[2]) + '<td>' + str(row[3]) + '\n')
f.write('</table>\n')
f.close()
Again, this code is somewhat tricky, and there are many possibilities for mistakes. Take care with the quotes and the plus operator.
The code above does not generate strictly valid HTML; web browsers nevertheless have no trouble displaying the result. As an exercise, add the bits that are missing to produce completely valid HTML code, i.e.
- there should be a top html tag
- as well as at least a body tag
- all tags must be closed
☆ DB Content on the Web¶
Many websites interact with databases, to construct HTML output from DB data, or to insert data from forms into DB tables.
For practical applications we would need a web server that can be reached from anywhere on the Internet, but for testing we can run everything on our own computer, completely offline.
Flask is one of the easier-to-use frameworks for web servers in Python. You probably need to pip install Flask.
Here is a basic application to access a table in the shop DB and return a listing in the browser:
%%file myapp.py
from flask import Flask
import sqlite3

app = Flask(__name__)

@app.route("/")
def myapp():
    return """<h1>Sample App</h1>
<ul>
<li><a href=states>States</a></li>
</ul>"""

@app.route('/states')
def states():
    conn = sqlite3.connect("shop.db")
    cur = conn.cursor()
    rows = cur.execute("select State, Abbreviation from state")
    html = "<h3>States</h3><table>\n"
    for row in rows:
        html += "<tr><td> %s <td> %s\n" % row
    conn.close()  # close before returning; statements after return are never executed
    return html + "</table>\n"

app.run(host='localhost', port=8080, debug=True, use_reloader=True)
Overwriting myapp.py
To run the web server go to the command line, cd to the directory with the file myapp.py, and enter
python3 myapp.py
On Windows use python or py instead of python3. Now you can open the web page in your browser at http://localhost:8080
Misc¶
Here are some useful little things for programming projects.
Problematic Headers¶
Sometimes CSV files use headers that are troublesome as column names in the database, e.g. two words with a blank in between, such as model year for cars.
If the import succeeds we can use quotes to access a column name:
select mpg, "model year" from autompg;
This raises the question of how we can access these columns in SQL statements within our Python scripts, where the statement itself is already a string. Fortunately, we can use the other kind of quote to solve that problem:
'select mpg, "model year" from autompg'
will work as a string in Python because the outer single quotes define the begin and end of the string; the inner double quotes are just characters within the string.
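A short sketch of this, using a hypothetical in-memory table in place of autompg:

```python
import sqlite3

# Hypothetical table with a blank in a column name, as imported from a CSV header
conn = sqlite3.connect(':memory:')
conn.execute('create table autompg (mpg real, "model year" int)')
conn.execute('insert into autompg values (18.0, 70)')

# Outer single quotes delimit the Python string;
# the inner double quotes are just characters inside it
rows = conn.execute('select mpg, "model year" from autompg').fetchall()
print(rows)  # → [(18.0, 70)]
```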
Programmatic Access to Files on the Web¶
Instead of downloading files interactively we can also access them dynamically within our Python code; e.g. to access the file mpg.csv on auto mileage from
https://gist.github.com/omarish/5687264
we can use the URL which we get from clicking on the Raw link with the right mouse button and selecting Copy Link Location.
url = 'https://gist.githubusercontent.com/omarish/5687264/raw/7e5c814ce6ef33e25d5259c1fe79463c190800d9/mpg.csv'
pandas.read_csv(url).to_sql('autompg',
                            conn,
                            if_exists='replace',
                            index=False)
A URL like the one above does not look like it will stay the same forever. Maybe you will already have to find the auto mpg data elsewhere.
Downloading files dynamically from the web is not always a good idea; it means that our code depends on something out of our control, like a web page maintained by someone else. It may be better to download them once manually and then leave them in the directory with our other files, so they are always there for us, even when the original source changes, or vanishes.
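One way to get the best of both worlds is a small helper (hypothetical, not part of any library) that downloads the file only when no local copy exists yet:

```python
import os
import urllib.request

def fetch_once(url, filename):
    """Download url to filename only if it is not already present locally."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename

# Hypothetical usage: after the first run, the local copy shields us
# from changes to, or the disappearance of, the remote file
# pandas.read_csv(fetch_once(url, 'mpg.csv')).to_sql('autompg', conn, index=False)
```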
%sql select * from autompg limit 5;
* sqlite:///shop.db Done.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|
18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
Dealing with missing data¶
Sqlite is very permissive where data types are concerned. This means that we have to take special measures when dealing with data that does not correspond to a given type, such as a numeric column with missing values.
Ideally, those missing entries would be set to NULL meaning that there is no value here; sadly, there is no standard for missing values in data files, so the import does not always work correctly.
In the autompg table there is a column horsepower that contains a number of such missing values, which are indicated in the CSV file with a question mark. Sqlite converts to numbers where possible and imports the rest as strings.
%sql select * from autompg where horsepower == '?';
* sqlite:///shop.db Done.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|
25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 | 1 | ford pinto |
21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 | 1 | ford maverick |
40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 | 2 | renault 18i |
23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 | 1 | amc concord dl |
This is not satisfactory. There are two options:
- use read_csv(url, na_values='?') to let pandas do the proper conversion to NULL
- set the values to NULL ourselves
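The first option can be sketched with a tiny inline CSV standing in for the real file (the sample rows are made up):

```python
import io
import pandas

# A small stand-in for the autompg CSV, with '?' marking a missing horsepower value
csv_text = "mpg,horsepower\n18.0,130\n25.0,?\n"

# na_values='?' tells pandas to treat '?' as missing (NaN),
# which to_sql would then store as NULL
df = pandas.read_csv(io.StringIO(csv_text), na_values='?')
print(df['horsepower'].isna().tolist())  # → [False, True]
```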
Obviously the first option is preferable. However, when we are stuck with a table already we can still improve things:
%sql update autompg set horsepower = null where horsepower == '?';
* sqlite:///shop.db 6 rows affected.
[]
It says 6 rows affected, that sounds good. Let's check anyway:
%sql select * from autompg where horsepower is null;
* sqlite:///shop.db Done.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|
25.0 | 4 | 98.0 | None | 2046 | 19.0 | 71 | 1 | ford pinto |
21.0 | 6 | 200.0 | None | 2875 | 17.0 | 74 | 1 | ford maverick |
40.9 | 4 | 85.0 | None | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
23.6 | 4 | 140.0 | None | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
34.5 | 4 | 100.0 | None | 2320 | 15.8 | 81 | 2 | renault 18i |
23.0 | 4 | 151.0 | None | 3035 | 20.5 | 82 | 1 | amc concord dl |
Note that NULL is not a value, so we cannot use the = sign here; we have to use the expression is null.
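The difference can be demonstrated with a small sketch against a throwaway in-memory table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table t (x int)')
conn.execute('insert into t values (1)')
conn.execute('insert into t values (null)')

# '= null' never matches: comparing anything with NULL yields NULL, which is not true
print(conn.execute('select count(*) from t where x = null').fetchone())   # → (0,)

# 'is null' is the correct test for missing values
print(conn.execute('select count(*) from t where x is null').fetchone())  # → (1,)
```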
☆ Leftovers¶
- df.merge(): similar to SQL Join, but for Pandas dataframes
- df.query(): select rows of dataframe by boolean expression
- cursor.execute("SELECT .. where X = ?", (var,)): construct a parameterized query; note the one-element tuple (var,), and beware of SQL injection when building queries from strings instead
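A minimal sketch of such a parameterized query, using a throwaway in-memory table with made-up data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table state (State text, Abbreviation text)')
conn.execute("insert into state values ('Texas', 'TX')")

# Binding the value via ? leaves quoting and escaping to the driver,
# which prevents SQL injection; note the one-element tuple (var,)
var = 'TX'
row = conn.execute('select State from state where Abbreviation = ?', (var,)).fetchone()
print(row)  # → ('Texas',)
```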
Exercises
Find more CSV files to practice on. There are hundreds of thousands of CSV data files to be found; a simple search using your favorite search engine for csv data files produces many hits. There are also portals that try to aggregate and facilitate access to that vast resource of data.
Import CSV files into your database and generate your own reports and analyses, and augment the data already in the tables in your databases, such as
- the birding club
- the chess data
As you find and import CSV files you will inevitably encounter problems. Read the error messages carefully, and study the files themselves with your text editor to spot potential problems. All these little technical obstacles can be overcome; it is just a matter of patience and discipline.