SQL by Example: Products, Customers, Orders¶
SQL - Structured Query Language
SQL (often pronounced sequel) is the query language supported by all major relational database management systems (DBMS). It includes a DDL (data definition language).
There is a lot of effort to standardize SQL; however, both commercial and free DBMS offer many convenient extensions. For any non-trivial application the choice of DBMS is still a long-term commitment.
In the following we cover some of the essential SQL commands and their main options; however, there is much more to the language, we can only scratch the surface.
- Customers place orders
- Each order contains one or more items
- Each item refers to exactly one product
After some more discussion with the key users we learn the essential attributes of the enitities:
- CUST (id, name, address) for data on customers
- PROD (id, name, price, stock) for data on products
- ORDR (id, cust, odate) for data on orders
- ITEM (prod, qty, ordr) for order items
We start with loading the SQL extensions and establishing the DB connection. We will also need foreign key support.
%load_ext sql
%sql sqlite:///shop.db
The sql extension is already loaded. To reload it, use: %reload_ext sql
%%sql
PRAGMA foreign_keys = ON;
* sqlite:///shop.db Done.
[]
CREATE AND DROP TABLES¶
The table is the basic object in a relational database. It consists of columns with names and datatypes.
As always, we ensure that all statements in this notebook can be executed from start to end repeatedly, therefore we drop existing tables before we issue the CREATE command.
This application is a little more tricky since we will have foreign key constraints, which means that the order in which we drop tables is important. We need to drop the referencing tables first, then the parent tables.
%%sql
drop table if exists item;
drop table if exists ordr;
drop table if exists cust;
drop table if exists prod;
* sqlite:///shop.db Done. Done. Done. Done.
[]
Now we can create our customer table. We choose to stick with a simple convention for table names: 4 letters, all lower case.
%%sql
create table cust (
id varchar primary key,
name varchar(30),
address varchar(60)
);
* sqlite:///shop.db Done.
[]
The PRIMARY KEY
- ensures unique values for the column ID
- creates an index for fast access
The datatype VARCHAR
- defines a character field with vaying size
- there is no significant performance gain by using VARCHAR(n) or fixed-length CHAR()
Table and column names:
Recommended characters in table names and column names are a-z, A-Z, and the digits 0-9 (not as the first character). Underscore _ should work as well.
Any other characters should not be used, even if they seem to work within the DB; otherwise, integration with components such as PHP and R can yield some nasty surprises.
Case sensitivity: do not rely on it, even if it seems to work in your DB. Choose a convention and stick to it, e.g., everything upper case OR everything lower case.
In some DBMS all identifiers that are not double-quoted are folded to lower case; you can enforce case with double quotes, but double-quoted names are one sure way to maintenance hell.
Do not use non-English characters, such as German umlauts in table names or column names! Even if your DB system seems to support them, sooner or later you will run into problems.
INSERT¶
Obviously this command insert data into tables.
Note that text must be enclosed in quotes.
If all columns are filled with values then the list of columns name before the VALUES keyword can be ommitted; however, it is good practise to use it anyway, to provide additional documentation on what is happening in this statement.
%%sql
insert into cust (id, name) values ('DF', 'Doulas Fairbanks');
insert into cust (id, name) values ('BK', 'Buster Keaton');
insert into cust (id, name) values ('RV', 'Rudolph Valentino');
insert into cust (id, name) values ('ML', 'Myrna Loy');
* sqlite:///shop.db 1 rows affected. 1 rows affected. 1 rows affected. 1 rows affected.
[]
If all values are supplied in the proper order then the INSERT statement works without the list of column names before the VALUES keyword, but this depends on the current state of the table: if the table is altered (columns added or dropped) the statement without the column names will no longer work. Interactively this is not much of a problem, since the error is immediately apparent and easily fixed, but in a programmatic environment (e.g. in Python or PHP scripts) always state the column names before the VALUES clause.
Missing values are set to NULL. In the example, all addresses are set to NULL, which is displayed by 'None' in SQLite.
Note carefully that NULL is not a value: it means that there is no value. This has consequences for how the data is handled, e.g. when it comes to counting: NULL is not a value, so it does not count.
%sql select * from cust;
* sqlite:///shop.db Done.
id | name | address |
---|---|---|
DF | Doulas Fairbanks | None |
BK | Buster Keaton | None |
RV | Rudolph Valentino | None |
ML | Myrna Loy | None |
Storage space - how much do we need?¶
The US Postal Service 'optimum' recommendation is 64 bytes for a delivery address; see https://about.usps.com/publications/pub28/28c2_001.htm
If all the people on the planet were our customers, could we store their names and addresses on a single hard disk?
- A typical desktop computer has at least 1 TB hard disk.
- Ignore space for index (fraction of main table).
- Assume that we use 100 bytes per record on average
- That means there are about 10 records in one KB
- 1 TB = 1024 GB, 1 GB = 1024 MB, and 1 MB = 1024 KB
- We can store about 10,000,000,000 records.
The answer is Yes!
Encodings¶
Are 100 characters on screen always 100 bytes on disk?
This depends on the encoding:
In ASCII each character is encoded in one byte. ASCII contains a total of 128 characters: those used in the English language, some commonly used punctuation marks and signs, and control characters.
ISO 8859-1 (Latin-1) also encodes each character in one byte, but in addition to the ASCII characters it also contains characters used in western European languages, such as French accents and German umlauts.
UTF-8 is a variable length encoding: the ASCII characters are encoded in one byte, the other characters in the Unicode set are encoded in two, three, or four bytes. The Unicode set contains over 100,000 characters.
What does this mean?
If a text contains only ASCII characters, the number of bytes is equal to the number of characters.
If the text contains non-ASCII characters and UTF-8 is used, the number of bytes is larger than the number of characters.
There is a general move towards UTF-8 but various applications still use other encodings which continues to cause problems.
Since the encoding of the ASCII characters is the same in all encodings mentioned above, a text only containing ASCII characters is usually safe from encoding problems.
Meaning: When we design a multi-part system, such as a website with DB connection that is accessed by browsers running on user desktops or mobile devices, we need to ensure that all parts use a compatible encoding; ideally, UTF8 everywhere.
SELECT¶
The select statement allows us to generate lists from out tables. Here are some examples:
Select all columns and all rows:
- select * from cust;
Select specific columns from all rows:
- select id, name from cust;
Select specific columns and rows:
- select name from cust where id = 'DF';
UPDATE¶
To change values in a table row we use the UPDATE command.
- update cust set name = 'newname' where id = 'someid';
Without WHERE clause all records are updated!
DELETE and DROP¶
The DELETE command removes rows from a table; the DROP command removes the whole table.
- delete from cust where id = 'someid';
Some records are deleted, but the table remains.
Note: without the WHERE clause all records are deleted!
- drop table sometable;
Table and content are deleted.
ALTER TABLE¶
With the ALTER TABLE command we can change the definition of a table. This is usually not a good idea. It is much better to invest sufficient time in a proper table design before issuing the first CREATE TABLE command.
- alter table x add column b char(20);
For our small sample applications it will be easier to drop a table and create it again as desired.
The ADD COLUMN feature is one of the few that are supported in SQLite; the other one is rename table.
The TechOnTheNet website https://www.techonthenet.com/sql/tables/alter_table.php is one the few that not only explains SQL commands for a particular DBMS but also shows which options work on other DBMS.
More Tables: Product, Order, Item¶
When creating tables with foreign keys we create the parent tables first, and then the child tables that depend on them (the parent/child concept is also known as master/detail).
Product Table¶
In our example both product and customer tables are parents (or masters); we start by creating them (in either order); then, we create the tables that depend on the parents.
%%sql
create table prod (
id varchar(80) primary key,
name varchar(80),
price numeric(8,2),
stock integer
);
* sqlite:///shop.db Done.
[]
Price is numeric with 2 decimal points and a total of 8 digits, since it is a currency amount.
Note that in SQLite we cannot enforce this; we will return to this topic later.
%%sql
insert into prod (id, name, price, stock) values ('HUB1', 'USB Hub 4x', 15.90, 5);
insert into prod (id, name, price, stock) values ('SD08', 'SDHC Card 8 GB', 5.80, 10);
insert into prod (id, name, price, stock) values ('HD05', 'HD IDE 500 GB', 55.90, 2);
insert into prod (id, name, price, stock) values ('MON1', 'LCD Monitor 19 XL', 65.90, 2);
* sqlite:///shop.db 1 rows affected. 1 rows affected. 1 rows affected. 1 rows affected.
[]
Order Table¶
The order table needs a different name, since ORDER is already an SQL keyword.
%%sql
create table ordr (
id integer primary key,
cust varchar(80) references cust(id),
odate date
);
* sqlite:///shop.db Done.
[]
- ORDER is a reserved word in SQL, as part of the ORDER BY clause.
- DATE is another reserved word for the datatype date.
- The Foreign key CUST refers to customer IDs
Foreign keys ensure data integrity, but they also restrict operations on the database; they can be a blessing or a curse. Decide carefully whether to use foreign keys for a particular table. In examples from books and lectures you will usually see them, as they incorporate the principles of clean design; unfortunately, the real world is often not clean at all, and in some situations foreign key clauses are deliberately avoided.
%%sql
insert into ordr (id, cust, odate) values (1, 'DF', '2014-01-13');
insert into ordr (id, cust, odate) values (2, 'RV', '2014-02-17');
insert into ordr (id, cust, odate) values (3, 'BK', '2014-03-12');
insert into ordr (id, cust, odate) values (4, 'DF', '2014-03-23');
* sqlite:///shop.db 1 rows affected. 1 rows affected. 1 rows affected. 1 rows affected.
[]
Item Table¶
The item table will store the individual order items.
%%sql
create table item (
ordr integer references ordr(id),
prod varchar(80) references prod(id),
qty integer
);
* sqlite:///shop.db Done.
[]
Let us fill this table with some order items:
%%sql
insert into item (prod, qty, ordr) values ('HUB1', 1, 1);
insert into item (prod, qty, ordr) values ('HD05', 2, 1);
insert into item (prod, qty, ordr) values ('MON1', 1, 2);
insert into item (prod, qty, ordr) values ('HD05', 1, 2);
insert into item (prod, qty, ordr) values ('HD05', 1, 3);
insert into item (prod, qty, ordr) values ('MON1', 1, 4);
* sqlite:///shop.db 1 rows affected. 1 rows affected. 1 rows affected. 1 rows affected. 1 rows affected. 1 rows affected.
[]
JOIN¶
With the JOIN command we can combine records from several tables with a condition, usually checking foreign key values:
%%sql
select name, odate
from cust join ordr on cust.id = ordr.cust;
* sqlite:///shop.db Done.
name | odate |
---|---|
Doulas Fairbanks | 2014-01-13 |
Rudolph Valentino | 2014-02-17 |
Buster Keaton | 2014-03-12 |
Doulas Fairbanks | 2014-03-23 |
Only customers with orders appear in the result.
Left Join¶
What can we do if we want all customers to appear in the results, even those without orders?
Use LEFT join for all records from the left table:
%%sql
select name, odate
from cust LEFT join ordr on cust.id = ordr.cust ;
* sqlite:///shop.db Done.
name | odate |
---|---|
Doulas Fairbanks | 2014-01-13 |
Doulas Fairbanks | 2014-03-23 |
Buster Keaton | 2014-03-12 |
Rudolph Valentino | 2014-02-17 |
Myrna Loy | None |
- All customers appear in the result.
- Missing ODATEs are set to NULL in the result.
- NULL is not a value; it means no value.
- Our Python interface to SQLite uses the term None for Null, other DBMS interfaces display nothing.
LEFT refers to the table on the left of the keyword JOIN in the SQL statement.
Aggregate Functions¶
The functions count(), sum(), min(), max(), avg() work on sets of records defined by the GROUP BY clause:
%%sql
select cust, count(*) from ordr group by cust;
* sqlite:///shop.db Done.
cust | count(*) |
---|---|
BK | 1 |
DF | 2 |
RV | 1 |
The expression count(*) counts the number of rows for each customer that appears in the ORDR table.
The second column is named after the function; we can rename it using the keyword AS. This is particularly useful when interfacing to the DB from programming languages like Python or R where we want column names that also serve as variable names.
%%sql
select cust, count(*) as orders from ordr group by cust;
* sqlite:///shop.db Done.
cust | orders |
---|---|
BK | 1 |
DF | 2 |
RV | 1 |
Assume we want to get a listing showing the number of orders for each customer and also the names of the customers. Here is a simple approach:
%%sql
select cust.name, cust.id, count(*) as orders
from cust join ordr on ordr.cust = cust.id
group by cust.name, cust.id ;
* sqlite:///shop.db Done.
name | id | orders |
---|---|---|
Buster Keaton | BK | 1 |
Doulas Fairbanks | DF | 2 |
Rudolph Valentino | RV | 1 |
- Count the number of records in the join for each customer.
- However, customers with no orders do not appear in the result.
- Therefore, the list is incomplete.
We can use LEFT JOIN to include customers without any orders:
%%sql
select cust.name, cust.id, count(*) as orders
from cust LEFT join ordr on ordr.cust = cust.id
group by cust.name, cust.id ;
* sqlite:///shop.db Done.
name | id | orders |
---|---|---|
Buster Keaton | BK | 1 |
Doulas Fairbanks | DF | 2 |
Myrna Loy | ML | 1 |
Rudolph Valentino | RV | 1 |
Is this the correct result?
- Count the number of records for each customer.
- Because of the LEFT join every customer appears in the result.
How does this affect the count() function?
- The LEFT JOIN returns all rows from the left table.
- Therefore, customers with no orders still get one record in the join.
- The count() function counts the number of records.
- This is still not the number of orders for each customer!
The solution is the parameter of the count() function -- what we count:
%%sql
select cust.name, cust.id, count(ordr.id) as orders
from cust left join ordr on ordr.cust = cust.id
group by cust.name, cust.id ;
* sqlite:///shop.db Done.
name | id | orders |
---|---|---|
Buster Keaton | BK | 1 |
Doulas Fairbanks | DF | 2 |
Myrna Loy | ML | 0 |
Rudolph Valentino | RV | 1 |
When we count the records using count(*) we get the wrong result. We need to count a value from the ordr table.
- Count the number of order IDs for each customer in the CUST table.
- NULLs do not count since NULL means no value.
- This is the number of orders for each customer.
We can always build an SQL statement incrementally; e.g. if we are not sure about how the statement above works, we can take a step back and leave out the aggregate functions:
%%sql
select *
from cust left join ordr on ordr.cust = cust.id
;
* sqlite:///shop.db Done.
id | name | address | id_1 | cust | odate |
---|---|---|---|---|---|
DF | Doulas Fairbanks | None | 1 | DF | 2014-01-13 |
DF | Doulas Fairbanks | None | 4 | DF | 2014-03-23 |
BK | Buster Keaton | None | 3 | BK | 2014-03-12 |
RV | Rudolph Valentino | None | 2 | RV | 2014-02-17 |
ML | Myrna Loy | None | None | None | None |
Having¶
When putting restrictions on the result of an aggregate statement we cannot use WHERE. Instead, the keyword HAVING serves the same purpose, but on the result of a GROUP BY operation.
If we want a specific order in the result we use ORDER BY. The option DESC specifies descending order.
To restrict the number of results we can use LIMIT.
To put it all together: who is our best customer?
%%sql
select cust.name, cust.id, count(ordr.id) as orders
from cust left join ordr on ordr.cust = cust.id
group by cust.name, cust.id
having orders > 0
order by orders desc
limit 1;
* sqlite:///shop.db Done.
name | id | orders |
---|---|---|
Doulas Fairbanks | DF | 2 |
Note the order of the keywords: GROUP BY ... HAVING ... ORDER ... LIMIT
Printing Currency Values¶
In business applications we usually want monetary values stored and printed with exactly two decimal places. However, in SQLite we cannot enforce exact datatypes for floating point numbers, such as decimal(10,2). Floating point values are approximate. Therefore, e.g. a Sales per Customer listing would look like this:
%%sql
select cust.id, cust.name , sum(price * qty) as sales
from cust left join ordr on cust.id = ordr.cust
left join item on item.ordr = ordr.id
left join prod on prod.id = item.prod
group by cust.id, cust.name;
* sqlite:///shop.db Done.
id | name | sales |
---|---|---|
BK | Buster Keaton | 55.9 |
DF | Doulas Fairbanks | 193.60000000000002 |
ML | Myrna Loy | None |
RV | Rudolph Valentino | 121.80000000000001 |
This is not acceptable:
- the approximate floating point values result in tiny errors
- 'None' is not wrong, but here we want 0.00
In SQLite we can produce the required format with the printf() function:
%%sql
select cust.id, cust.name , printf("%.2f", sum(price * qty)) as sales
from cust left join ordr on cust.id = ordr.cust
left join item on item.ordr = ordr.id
left join prod on prod.id = item.prod
group by cust.id, cust.name;
* sqlite:///shop.db Done.
id | name | sales |
---|---|---|
BK | Buster Keaton | 55.90 |
DF | Doulas Fairbanks | 193.60 |
ML | Myrna Loy | 0.00 |
RV | Rudolph Valentino | 121.80 |
☆ Performing math on floating-point values¶
"Performing math on floating-point values introduces error."
https://sqlite.org/floatingpoint.html
In a digital computer decimal numbers are converted to binary for processing, and converted back to decimal for output. Some values cannot be represented exactly in binary. The same is true for the decimal system, e.g. 1/3 has no exact decimal representation.
Some experiments:
%sql select 0.1, 0.1+0.1, 0.1+0.2, 0.3, 0.3+0.0, 0.3+0.1, 0.3+0.2, 0.3+0.6;
* sqlite:///shop.db Done.
0.1 | 0.1+0.1 | 0.1+0.2 | 0.3 | 0.3+0.0 | 0.3+0.1 | 0.3+0.2 | 0.3+0.6 |
---|---|---|---|---|---|---|---|
0.1 | 0.2 | 0.30000000000000004 | 0.3 | 0.3 | 0.4 | 0.5 | 0.8999999999999999 |
The tiny errors result from the use of a binary format for floating point numbers. They are not specific to SQLite, but other DBMS isolate us more from these problems, e.g. in Postgresql all return values in the statement above are exact.
SQLite will store and return exact values where possible, but it does not automatically perform any rounding, so when we have a value like 0.3 as a result of an operation we see the error. Note how adding zero to 0.3 does not produce it!
In many applications the errors resulting from approximate datatypes are so small that they can be ignored. However, tiny errors can accumulate to a problematic level. Even small inaccuracies are unacceptable in some areas, such as accounting.
For more on floating-point arithmetic, see e.g.
Exercises¶
For the shopping tables, create the following listings:
- Number of units sold per product
- Number of orders per product
- Sales per product
Create tables for student - enroll - course, insert some data, and produce the following listings:
- Number of courses enrolled per student, for all students
- Number of students per course, for all courses
- Students not enrolling in any course
- Courses with fewer than two students