Database Applications¶
So far we have interacted with our databases by issuing individual SQL statements. However, this method of entering and retrieving data quickly becomes much too cumbersome as the amount of data grows and processes must be repeated on a regular schedule; hence the need for software programs that automate the management of data in a database: database applications.
There are many areas where DB applications can be used; here we will look at automating the tasks of
- importing data from external files to database tables
- exporting data from database tables to files
- generating reports in a specific format
Expanding foreign keys¶
Let us take a look at a different version of our supplier table; this one contains the city and state names:
%load_ext sql
%sql sqlite:///shop.db
%%sql
drop table if exists supp;
create table supp (
id varchar primary key,
name varchar(30),
city varchar(30),
state varchar(5),
status int);
insert into supp values ('S1', 'Smith', 'Detroit', 'MI', 20);
insert into supp values ('S2', 'Jones', 'Dallas', 'TX', 30);
insert into supp values ('S3', 'Brown', 'San Antonio', 'TX', 30);
select * from supp;
* sqlite:///shop.db Done. Done. 1 rows affected. 1 rows affected. 1 rows affected. Done.
id | name | city | state | status |
---|---|---|---|---|
S1 | Smith | Detroit | MI | 20 |
S2 | Jones | Dallas | TX | 30 |
S3 | Brown | San Antonio | TX | 30 |
Let's assume we want a listing showing the number of suppliers per state:
%sql select state, count(*) as suppliers from supp group by state;
* sqlite:///shop.db Done.
state | suppliers |
---|---|
MI | 1 |
TX | 2 |
This listing provides the state codes but not the state names. We could of course create a table and enter the names and codes ourselves; however, there is a better solution: we use our favorite Internet search engine to find a data file that already contains what we want, and import that file into our database.
Finding files with useful data is sometimes easy, as in the case of US state codes, and sometimes less easy. In this case we can use a number of sources, such as
https://github.com/jasonong/List-of-US-States/blob/master/states.csv
From the link above the file can be downloaded to your computer (use the Raw button). To use the file in the Jupyter notebook we put it into the same directory as the notebook.
- If you work with your own local Jupyter notebook server you can use the file manager.
- For a remote notebook server we have to upload the file: use the Upload button on the home tab of the notebook, titled Home Page in your browser.
Importing data from CSV files¶
A common file format for both text and numerical data is CSV, comma-separated values. There is no standard for CSV; a file can have the suffix .csv and still use tabs or semicolons as separators instead of commas. Some CSV files have headers. In some cases quotes are used for text.
It is therefore beneficial to use a ready-made solution that takes care of these problems. One of these is the Python package Pandas, which is already part of some distributions of Python. The name derives from panel data analysis (multi-dimensional data over time), not the bears.
The following command can check for the pandas package and install it if necessary. Do not uncomment it unless you run into an error when you import pandas. If you are using a conda Python distribution, use %conda instead. If this for some reason does not work for you, there is always the command line: pip install pandas
☆ pip, conda, and software dependencies¶
You may wonder why we do not leave the pip statements uncommented by default.
- One reason is that calling pip to check the existing installation takes quite some time, maybe longer than everything else in this notebook. There is no need to wait that long every time we 'Run All Cells'.
- Python comes in various distributions, and pip is not always the package manager, it could be e.g. conda. Your installation can get corrupted in subtle ways if you do not use the proper package manager.
- The problem is not so much the new package itself, in this case pandas, but all the other packages that it depends on: the dependencies
- the dependencies need to be installed or upgraded in order for the new package to work
- In an ideal world installing or upgrading software should never be a problem since the package manager takes care of all dependencies and their proper versions, and new versions of software packages are always fully backward compatible
- In the real world a new version of a dependency is not necessarily fully backward compatible with the installed older version; you can disable your existing and properly working software by upgrading dependencies!
- These types of problems may not be immediately obvious and therefore even harder to discover and fix, because it is not necessarily the application you are currently working on that is affected; they can be addressed e.g. with virtual environments.
→ If things already work, we do not touch them. Never change a running system!
# uncomment next line if needed
# %pip install pandas
Documentation for Pandas read_csv():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
We create the table from the CSV source using the headers in the first line of the file as column names.
The following Python code
- uses the sqlite package to connect to the database
- with the DB connection the pandas package can read the CSV file and create the database table
Note that the following cell is not SQL but Python code.
import sqlite3
import pandas
conn = sqlite3.connect('shop.db') # connection to database in current directory
# read_csv: create Pandas dataframe from file in current directory
# to_sql: create DB table from dataframe
pandas.read_csv('states.csv').to_sql('state',
                                     conn,
                                     if_exists='replace',
                                     index=False)
- index=False: avoid creating an additional index column
- if_exists='replace': overwrite any existing table, making a prior SQL drop table statement unnecessary
- This is not a problem here since this table will not be referenced by any foreign key statements.
Now the table exists in the database, and we can access it using SQL commands:
%sql select * from state limit 3;
* sqlite:///shop.db Done.
State | Abbreviation |
---|---|
Alabama | AL |
Alaska | AK |
Arizona | AZ |
The table provides the translation from state code to state name. We can use it to add the name of the states to our supplier report:
%%sql
select state.state as State, count(*) as Suppliers
from supp join state on supp.state = state.abbreviation group by state.state;
* sqlite:///shop.db Done.
State | Suppliers |
---|---|
Michigan | 1 |
Texas | 2 |
Combining Multiple Sources¶
Suppose we want to combine data on various cities across the US, such as economic and population data. There is always a chance of finding a single source that already has everything we want; more likely, however, we have to combine data from various sources.
In general this can be tricky since data found on different sites on the web will not comply with common conventions and will also often be incomplete or even erroneous. Here we have carefully chosen our data files to avoid such problems.
In this example we wish to combine data on store openings across cities with population data. After some time spent searching we finally find CSV files that contain the data.
Sources:
- https://github.com/plotly/datasets/blob/master/us-cities-top-1k.csv
- https://github.com/plotly/datasets/blob/master/1962_2006_walmart_store_openings.csv
- https://github.com/jasonong/List-of-US-States/blob/master/states.csv
On the github page use the Raw button to download the plain CSV data file.
Using our favorite text editor we find that
- Our store opening source contains city names and state codes
- Our population source contains city names and state names
In order to join the store opening and population data we need to translate the state codes into state names. Luckily we already have our state table which does exactly that.
%%sql
drop table if exists city;
drop table if exists store;
* sqlite:///shop.db Done. Done.
[]
pandas.read_csv('us-cities-top-1k.csv').to_sql('city', conn, index = False)
pandas.read_csv('1962_2006_walmart_store_openings.csv').to_sql('store', conn, index = False)
%sql select * from city limit 3;
* sqlite:///shop.db Done.
City | State | Population | lat | lon |
---|---|---|---|---|
Marysville | Washington | 63269 | 48.0517637 | -122.1770818 |
Perris | California | 72326 | 33.7825194 | -117.2286478 |
Cleveland | Ohio | 390113 | 41.49932 | -81.6943605 |
%sql select * from store limit 3;
* sqlite:///shop.db Done.
storenum | OPENDATE | date_super | conversion | st | county | STREETADDR | STRCITY | STRSTATE | ZIPCODE | type_store | LAT | LON | MONTH | DAY | YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7/1/62 | 3/1/97 | 1.0 | 5 | 7 | 2110 WEST WALNUT | Rogers | AR | 72756 | Supercenter | 36.342235 | -94.07141 | 7 | 1 | 1962 |
2 | 8/1/64 | 3/1/96 | 1.0 | 5 | 9 | 1417 HWY 62/65 N | Harrison | AR | 72601 | Supercenter | 36.236984 | -93.09345 | 8 | 1 | 1964 |
4 | 8/1/65 | 3/1/02 | 1.0 | 5 | 7 | 2901 HWY 412 EAST | Siloam Springs | AR | 72761 | Supercenter | 36.179905 | -94.50208 | 8 | 1 | 1965 |
SQL Join, Create a view¶
Our SQL statement to join the three tables will be somewhat complex; a view can be used to make that statement available later with less complexity. We can select from a view in just the same way as from a table.
%%sql
drop view if exists storepop;
create view storepop as select city.city, city.state,
max(city.population) as population, count(year) as stores
from city join store join state
where city.city = store.STRCITY and city.state = state.State
and state.abbreviation = store.STRSTATE
group by city.city, city.state;
select * from storepop limit 10;
* sqlite:///shop.db Done. Done. Done.
City | State | population | stores |
---|---|---|---|
Abilene | Texas | 120099 | 2 |
Addison | Illinois | 37385 | 1 |
Akron | Ohio | 198100 | 1 |
Albany | Georgia | 76185 | 1 |
Albany | New York | 98424 | 1 |
Albuquerque | New Mexico | 556495 | 5 |
Alexandria | Louisiana | 48426 | 1 |
Alexandria | Virginia | 148892 | 1 |
Allentown | Pennsylvania | 118577 | 1 |
Alpharetta | Georgia | 62298 | 1 |
Exporting CSV¶
Since we have already seen how to import CSV we now take a brief look at CSV as the output format. Executing the code below results in a file named storepop.csv being created in the current directory.
f = open('storepop.csv', 'w')
f.write('city,state,population,stores\n')
c = conn.cursor()
for row in c.execute('select city, state, population, stores from storepop'):
    f.write(row[0] + ',' + row[1] + ',' + str(row[2]) + ',' + str(row[3]) + '\n')
f.close()
This piece of code provides an example for the steps in creating a report:
- open a file for writing
- write the header line
- get a cursor from the database connection
- execute the SQL query and loop over the resulting rows
- write each row to the output file
- row[0] : name of city
- row[1] : name of state
- row[2] : population; this is an integer, so we must convert it to a string
- row[3] : stores; another integer
- finally we add a newline character, since we want each city on a separate line
- close the file i.e. finish processing it
Note carefully the quotes starting and ending text, and the plus signs gluing the parts of the output together. It is very easy to make mistakes here. The colors from syntax highlighting help in finding problems.
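As an alternative to gluing strings together ourselves, Python's standard csv module takes care of separators, quoting, and newlines. A minimal sketch, using an in-memory database with made-up sample rows in place of shop.db:

```python
import csv
import sqlite3

# Tiny in-memory stand-in for shop.db with hypothetical sample data
conn = sqlite3.connect(':memory:')
conn.execute('create table storepop (city text, state text, population int, stores int)')
conn.execute("insert into storepop values ('Abilene', 'Texas', 120099, 2)")
conn.execute("insert into storepop values ('Akron', 'Ohio', 198100, 1)")

# csv.writer handles commas, quoting, and line endings for us
with open('storepop.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['city', 'state', 'population', 'stores'])
    writer.writerows(conn.execute('select * from storepop'))
```

With this approach a city name containing a comma would be quoted automatically instead of silently breaking the file format.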
Generating Reports¶
A plain text file is useful in many situations, but often we want somewhat nicer formatting options. PDF and RTF are somewhat elaborate; the idea can be illustrated in HTML, a markup language that is already familiar to many people. We simply take the code from above and add some markup to achieve basic formatting for the title and the table.
f = open('report.html', 'w')
f.write('<h1>Cities, Population, and Store Openings</h1><table>\n')
c = conn.cursor()
for row in c.execute('select city, state, population, stores from storepop'):
    f.write('<tr><td>' + row[0] + '<td>' + row[1] + '<td>'
            + str(row[2]) + '<td>' + str(row[3]) + '\n')
f.write('</table>\n')
f.close()
Again, this code is somewhat tricky, and there are many possibilities for mistakes. Take care with the quotes and the plus operator.
The code above does not generate strictly valid HTML; web browsers nevertheless have no trouble displaying the result. As an exercise, add the bits that are missing to produce completely valid HTML code, i.e.
- there should be a top html tag
- as well as at least a body tag
- all tags must be closed
☆ DB Content on the Web¶
Many websites interact with databases, to construct HTML output from DB data, or to insert data from forms into DB tables.
For practical applications we would need a web server that can be reached from anywhere on the Internet, but for testing we can run everything on our own computer, completely offline.
Flask is one of the easier-to-use frameworks for web servers in Python. You probably need to pip install Flask.
Here is a basic application to access a table in the shop DB and return a listing in the browser:
%%file myapp.py
from flask import Flask
import sqlite3

app = Flask(__name__)

@app.route("/")
def myapp():
    return """<h1>Sample App</h1>
<ul>
<li><a href=states>States</a></li>
</ul>"""

@app.route('/states')
def states():
    conn = sqlite3.connect("shop.db")
    cur = conn.cursor()
    rows = cur.execute("select State, Abbreviation from state")
    html = "<h3>States</h3><table>\n"
    for row in rows:
        html += "<tr><td> %s <td> %s\n" % row
    conn.close()  # close before returning; statements after return are never executed
    return html + "</table>\n"

app.run(host='localhost', port=8080, debug=True, use_reloader=True)
Overwriting myapp.py
To run the web server go to the command line, cd to the directory with the file myapp.py, and enter
python3 myapp.py
On Windows use python or py instead of python3. Now you can open the web page in your browser at http://localhost:8080
Misc¶
Here are some useful little things for programming projects.
Problematic Headers¶
Sometimes CSV files use headers that are troublesome as column names in the database, e.g. two words with a blank in between, such as model year for cars.
If the import succeeds we can use quotes to access a column name:
select mpg, "model year" from autompg;
This raises the question of how we can access these columns in SQL statements within our Python scripts, where the statement itself is already a string. Fortunately, we can use the other kind of quote to solve that problem:
'select mpg, "model year" from autompg'
will work as a string in Python because the outer single quotes define the begin and end of the string; the inner double quotes are just characters within the string.
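A short sketch of this, using a hypothetical in-memory table in place of autompg:

```python
import sqlite3

# Hypothetical table with a blank in a column name, as imported from a CSV header
conn = sqlite3.connect(':memory:')
conn.execute('create table autompg (mpg real, "model year" int)')
conn.execute('insert into autompg values (18.0, 70)')

# Outer single quotes delimit the Python string;
# the inner double quotes are just characters inside it
rows = conn.execute('select mpg, "model year" from autompg').fetchall()
print(rows)  # → [(18.0, 70)]
```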
Programmatic Access to Files on the Web¶
Instead of downloading files interactively we can also access them dynamically within our Python code; e.g. to access the file mpg.csv on auto mileage from
https://gist.github.com/omarish/5687264
we can use the URL which we get from clicking on the Raw link with the right mouse button and selecting Copy Link Location.
url = 'https://gist.githubusercontent.com/omarish/5687264/raw/7e5c814ce6ef33e25d5259c1fe79463c190800d9/mpg.csv'
pandas.read_csv(url).to_sql('autompg',
                            conn,
                            if_exists='replace',
                            index=False)
A URL like the one above does not look like it will stay the same forever. Maybe you will already have to find the auto mpg data elsewhere.
Downloading files dynamically from the web is not always a good idea; it means that our code depends on something out of our control, like a web page maintained by someone else. It may be better to download them once manually and then leave them in the directory with our other files, so they are always there for us, even when the original source changes, or vanishes.
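One way to get the best of both worlds is a small helper (hypothetical, not part of any library) that downloads the file only when no local copy exists yet:

```python
import os
import urllib.request

def fetch_once(url, filename):
    """Download url to filename only if it is not already present locally."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename

# Hypothetical usage: after the first run, the local copy shields us
# from changes to, or the disappearance of, the remote file
# pandas.read_csv(fetch_once(url, 'mpg.csv')).to_sql('autompg', conn, index=False)
```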
%sql select * from autompg limit 5;
* sqlite:///shop.db Done.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|
18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
Dealing with missing data¶
Sqlite is very permissive where data types are concerned. This means that we have to take special measures when dealing with data that does not correspond to a given type, such as a numeric column with missing values.
Ideally, those missing entries would be set to NULL meaning that there is no value here; sadly, there is no standard for missing values in data files, so the import does not always work correctly.
In the autompg table there is a column horsepower that contains a number of such missing values, which are indicated in the CSV file with a question mark. Sqlite converts to numbers where possible and imports the rest as strings.
%sql select * from autompg where horsepower == '?';
* sqlite:///shop.db Done.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|
25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 | 1 | ford pinto |
21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 | 1 | ford maverick |
40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 | 2 | renault 18i |
23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 | 1 | amc concord dl |
This is not satisfactory. There are two options:
- use read_csv(url, na_values='?') to let pandas do the proper conversion to NULL
- set the values to NULL ourselves
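The first option can be sketched with a tiny inline CSV standing in for the real file (the sample rows are made up):

```python
import io
import pandas

# A small stand-in for the autompg CSV, with '?' marking a missing horsepower value
csv_text = "mpg,horsepower\n18.0,130\n25.0,?\n"

# na_values='?' tells pandas to treat '?' as missing (NaN),
# which to_sql would then store as NULL
df = pandas.read_csv(io.StringIO(csv_text), na_values='?')
print(df['horsepower'].isna().tolist())  # → [False, True]
```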
Obviously the first option is preferable. However, when we are stuck with a table already we can still improve things:
%sql update autompg set horsepower = null where horsepower == '?';
* sqlite:///shop.db 6 rows affected.
[]
It says 6 rows affected, that sounds good. Let's check anyway:
%sql select * from autompg where horsepower is null;
* sqlite:///shop.db Done.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|
25.0 | 4 | 98.0 | None | 2046 | 19.0 | 71 | 1 | ford pinto |
21.0 | 6 | 200.0 | None | 2875 | 17.0 | 74 | 1 | ford maverick |
40.9 | 4 | 85.0 | None | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
23.6 | 4 | 140.0 | None | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
34.5 | 4 | 100.0 | None | 2320 | 15.8 | 81 | 2 | renault 18i |
23.0 | 4 | 151.0 | None | 3035 | 20.5 | 82 | 1 | amc concord dl |
Note that NULL is not a value, so we cannot use the = sign here; we have to use the expression is null.
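The difference can be demonstrated with a small sketch against a throwaway in-memory table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table t (x int)')
conn.execute('insert into t values (1)')
conn.execute('insert into t values (null)')

# '= null' never matches: comparing anything with NULL yields NULL, which is not true
print(conn.execute('select count(*) from t where x = null').fetchone())   # → (0,)

# 'is null' is the correct test for missing values
print(conn.execute('select count(*) from t where x is null').fetchone())  # → (1,)
```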
☆ Leftovers¶
- df.merge(): similar to SQL Join, but for Pandas dataframes
- df.query(): select rows of dataframe by boolean expression
- cursor.execute("SELECT .. where X = ?", (var,)): construct a parameterized query; note the one-element tuple (var,), and beware of SQL injection when building queries from strings instead
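A minimal sketch of such a parameterized query, using a throwaway in-memory table with made-up data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table state (State text, Abbreviation text)')
conn.execute("insert into state values ('Texas', 'TX')")

# Binding the value via ? leaves quoting and escaping to the driver,
# which prevents SQL injection; note the one-element tuple (var,)
var = 'TX'
row = conn.execute('select State from state where Abbreviation = ?', (var,)).fetchone()
print(row)  # → ('Texas',)
```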
Exercises
Find more CSV files to practice on. There are hundreds of thousands of CSV data files to be found; a simple search using your favorite search engine for csv data files produces many hits. There are also portals that try to aggregate and facilitate access to that vast resource of data.
Import CSV files into your database and generate your own reports and analyses, and augment the data already in the tables in your databases, such as
- the birding club
- the chess data
As you find and import CSV files you will inevitably encounter problems. Read the error messages carefully, and study the files themselves with your text editor to spot potential problems. All these little technical obstacles can be overcome; it is just a matter of patience and discipline.