Relational Database Systems: Basic Concepts¶

Data: from Latin given; given facts

Given fact: true proposition in logic
Database: collection of true propositions

Relational approach:

Data is represented by rows in tables, which can be interpreted as true propositions
Operators on rows in tables derive additional true propositions in response to queries

Example:

table containing soccer players and their clubs
additional facts in response to user queries:
- number of players per club
- number of clubs
- club with most players

Other types of databases e.g.

graph databases: store data as nodes and edges, understood as entities with properties and relationships
document databases: store data in formats such as JSON (fields and values) and XML (markup) which are also commonly used in data exchange between applications

Advantages of databases¶

over application-specific files such as spreadsheets:

Centralized control over data
Data can be shared
Redundancy can be reduced
Inconsistencies can be avoided
Transaction support
Integrity constraints
Security constraints
Data independence, immunity of applications to change in physical representation and access technique
Standards e.g. ISO SQL

Standards are important efforts, but

even those database management systems (DBMS) that adhere closely to the standard also offer convenient non-standard extensions
this tends to lead to lock-in for the customer
the practical reality is that the choice of a particular DBMS is a long-term one
switching to a different DBMS is costly and risky, and will be avoided if at all possible

Information Principle¶

The entire information content of the relational database is represented in only one way, explicit values in column positions in table rows.

Not all relational DBMS implement this principle.

The Relational Model¶

Relational Databases are based on the relational model.

relation = math. for table, NOT from ER-model

A relation has two parts:

heading: set of attributes A:T with name A and type T

denotes a predicate (truth-valued function)

body: set of tuples A:V with attribute A and value V i.e. a set of tuples that conforms to the heading

each tuple denotes a true proposition, i.e. instantiates the predicate

SQL¶

The Structured Query Language is the standard for data query and manipulation in relational database systems. It implements many (but not all) concepts of the relational model.

Table vs. Relation¶

The basic concept of the SQL implementation is the table. When going from the relation model to the SQL implementation relations become tables.

An SQL table is similar to a relation, with some important differences:

Doubles:
- A relation contains a set of tuples, which means that by definition there are no doubles.
- An SQL table and the result of an SQL query can contain several rows with identical values.
Order:
- Since there is no order defined for the elements of a set, any order is valid when listing the tuples of a relation.
- Results of SQL SELECT statements can be ordered at will with the ORDER BY clause.

In the following we will assume that

tables do not contain multiple identical rows
order of tuples and attributes are not crucial for our examples

which means that we can use the terms relation and table interchangeably, unless otherwise noted.

Data Types¶

The relational model is not concerned with technical implementation.
SQL DBMS must know in great detail the type of data stored in tables.

Numbers¶

precision: number of significant digits
scale: position of the least significant digit to the right of the decimal
e.g. 1234.56 has precision 6, scale 2, defined as NUMERIC(6,2)

SQL92 standard numeric data types:

INT (precision)
- exact numeric type
- scale 0
NUMERIC (precision, scale)
- exact numeric type i.e. rounding and truncating dictated by predefined formula
- decimal portion exactly scale
DECIMAL (precision, scale)
- exact numeric type
- decimal portion at least scale
REAL
- approximate numeric type
- binary precision based on manufacturer limit
- default precision must be less than DOUBLE PRECISION
DOUBLE PRECISION
- approximate numeric type
- binary precision based on manufacturer limit
- default precision must be greater than REAL
FLOAT (precision)
- approximate numeric type
- DOUBLE with given precision

Other Types¶

CHAR (size)
- fixed length
- unused space empty
VARCHAR (size)
- variable length up to size
- no unused space
- without size: not in SQL92 standard, but widely supported, e.g. in PostgreSQL; but not in MySQL
DATE
- Various input formats, some manufacturer dependent; e.g. '2012-04-21'
- Calendar functions such as length of time, weekday
BOOLEAN
- two values, true or false

These are standard definitions. Implementations vary considerably, e.g.

Postgres: DECIMAL(p, s) and NUMERIC(p, s) are equivalent
SQLite:
- CHAR and VARCHAR are both converted to TEXT with default max length of 1,000,000,000
- datatypes are not enforced, e.g. even in a column defined as INT we can enter a text like 'fluffy bunnies'

Time for studying the documentation of your DBMS is well spent. Otherwise, subtle errors will haunt you in years to come.

SQL in Jupyter Notebook¶

If you are not familiar with Jupyter notebooks: take a look at jupyter.org

This document itself was generated from a Jupyter notebook.

To run the examples on your own computer:

If necessary install Python first, from python.org; steps for Windows:
- download 64-bit installer for stable release (3.11.8 Feb 2024)
- start installer
- make sure 'Add Python to Path' is checked
- click 'Install Now'
To install packages, open a terminal window (command prompt in Windows, cmd) and enter:
- pip install jupyterlab ipython-sql
To start Jupyter Lab, enter:
- jupyter lab

We load the SQL extension into the current notebook:

In [1]:

%load_ext sql

This allows us to use SQL commands in code cells. We need to start the cell with

%sql for a single line of code
%%sql for multiple lines

If you want to comment inside a code cell: note the different comment characters in Python and SQL:

Python: # (hashtag)
SQL: -- (two hyphens)

SQLite¶

"SQLite does not compete with client/server databases. SQLite competes with fopen(). "

sqlite.org/whentouse.html

We choose SQLite as our relational DB software:

SQLite is limited in many respects, but powerful enough for our purposes
It is free and Open Source, and part of the Python standard library

A file with the suffix .db in the current directory stores our tables. For Jupyter notebooks, the current directory is the one where your notebook file is stored (the file with the .ipynb suffix).

SQLite is widely used as an embedded database engine in desktop software, smartphone apps, and many other places. There are much more sophisticated options for free DBMS, such as PostgreSQL; however, Postgres requires separate installation, a running DB server, and user management. For our purposes the cost/benefit ratio favours SQLite.

For us, the downsides in using SQLite are

datatypes are not enforced at all
SQL syntax is not enforced fully

However: If we are using correct SQL and proper datatypes our examples will work just like in PostgreSQL; with a few tiny differences that will be noted as we come across them.

SQLite is by far the most widely deployed database engine. See sqlite.org for the amazing statistics.

☆ Other commonly used relational DBMS¶

☆ sections offer additional information on a topic; not essential, but (hopefully) interesting

It is tricky to compare DBMS overall: how would you aggregate the many aspects into a single ranking? Here are some of the big players (see https://db-engines.com/en/ranking), with year of initial release:

Oracle: often regarded as the top commercial solution, 1980
DB2: another top commercial system, widely used on IBM hosts, 1983
SQL Server: a Microsoft product, 1989
PostgreSQL: very strong free and open source alternative to commercial products, 1989
MySQL: common choice for web databases, now owned by Oracle, but still free and open source, 1995

If you are interested in comparing e.g. SQLite and Postgres yourself:

install PostgreSQL from postgresql.org
configure (create database, create user, alter user, grant all)
in Jupyter notebook use connector postgresql://myuser:mypw@localhost/mydb
switch SQLite and Postgres by connector

Of course, the same goes for MySQL, and other free DBMS: install and compare them, if you have the time and motivation.

Sample DB: The Birding Club¶

Here is a verbal description of a situation where a DB solution is desired:

You local birding club decides to finally put some structure into all those reports from members and enter them into a database, to allow for automated analysis. The following things are reported by the birders:

common name of the bird

date of observation

location

type of terrain

nickname of birder

The birders are fairly learned in the avian ways, and the common names they use identify the species. No two birders use the same nickname. They always report the same type of terrain for a given location.

We decide on a prototype development method and create a very simple first solution from the description: a single table that holds all the data. Whether this solution is adequate depends on the application: it may be exactly what the birders need, or there may be room for improvement.

Creating a Database¶

We connect to a SQLite database in the current directory with the following statement; if the database does not exist it will be created.

In [2]:

%sql sqlite:///birds.db

Creating Tables¶

By creating an SQL table we define a structure to hold data:

a table named BIRDING is created
for each column we specify a name and a data type
the list of columns is enclosed in parenthesis
a semicolon terminates the statement

In [3]:

%%sql

drop table if exists BIRDING;
create table BIRDING (
  BIRDNAME varchar(30),
  OBSDATE date,
  LOCATION varchar(30),
  TERRAIN varchar(20),
  NICKNAME varchar(20)
);

 * sqlite:///birds.db
Done.
Done.

Out[3]:

[]

The DB software responds with the currently used connector and a message for each command.

Why do we first drop the table, if it exists?

We cannot create a table when another one with the same name already exists.

In Jupyter notebooks it is much more convenient to ensure that all code in a notebook can run from start to end repeatedly. It is possible to hand-pick the order of execution by running cells individually, but this gets out of hand very quickly.

Therefore, we make sure that an existing table is dropped, so our CREATE statement will not produce an error.

The statement could be written on a singe line, but it is more readable this way. Layout matters not to SQLite, but to the humans who may need to read and understand our statements at some point in the future.

Inserting Data¶

Now that we have the table structure, we can fill it with some data.

In this case we do not need to list the column names before the VALUES keyword, because we are supplying all values in the order they were defined
However, it is still good practise since it documents the statement
We could write several INSERT statements, but the INSERT command accepts several tuples after the VALUES keyword

In [4]:

%%sql

insert into BIRDING 
(BIRDNAME, OBSDATE, LOCATION, TERRAIN, NICKNAME)
values 
('House sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Tree sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Blackcap', '2018-06-14', 'Sherwood Forest', 'Woodland', 'mjones'),
('European robin', '2018-07-04', 'Little Winging', 'Garden', 'gmorgan'),
('Common gull', '2017-06-01', 'Cliffs of Moher', 'Beach', 'pmurphy'),
('Green woodpecker', '2018-06-15', 'Sherwood Forest', 'Woodland', 'mjones');

 * sqlite:///birds.db
6 rows affected.

Out[4]:

[]

Listing Data¶

To check if everything worked the way we expected, we generate a listing:

In [5]:

%%sql

select * from BIRDING;

 * sqlite:///birds.db
Done.

Out[5]:

BIRDNAME	OBSDATE	LOCATION	TERRAIN	NICKNAME
House sparrow	2018-06-12	Cold Striffen	Garden	jsmith
Tree sparrow	2018-06-12	Cold Striffen	Garden	jsmith
Blackcap	2018-06-14	Sherwood Forest	Woodland	mjones
European robin	2018-07-04	Little Winging	Garden	gmorgan
Common gull	2017-06-01	Cliffs of Moher	Beach	pmurphy
Green woodpecker	2018-06-15	Sherwood Forest	Woodland	mjones

Select statements can use conditions on records to be displayed, and state the columns, e.g.

In [6]:

%%sql

select location, nickname from birding where obsdate < '2018-01-01';

 * sqlite:///birds.db
Done.

Out[6]:

LOCATION	NICKNAME
Cliffs of Moher	pmurphy

☆ SQLite and Datatypes¶

"Flexible typing is a feature of SQLite, not a bug."

From https://sqlite.org/datatype3.html, a source worth studying if you intend to do any serious work with SQLite.

In SQLite datatypes are suggestions, not requirements. Column datatypes are in fact optional in an SQLite CREATE TABLE statement; if included, they are implemented as 'type affinity'; Values will be converted to something close to the suggested format, if possible. Internally (for values stored or manipulated) there are only the following 'storage classes':

NULL
INTEGER
REAL (8-byte IEEE floating point)
TEXT (using database encoding: UTF-8, UTF-16BE or UTF-16LE)
BLOB (stored exactly as it was input)

Some examples:

In [7]:

%%sql

drop table if exists bunnies;
create table bunnies (
    fluffy int,
    bunny varchar );

insert into bunnies values
(42, 0),
(43.1, 'Yes'),
('somewhat fluffy', TRUE),
('very fluffy', 10),
('slightly fluffy', '0.1');

select fluffy, typeof(fluffy), cast(fluffy as int), bunny, typeof(bunny) from bunnies;

 * sqlite:///birds.db
Done.
Done.
5 rows affected.
Done.

Out[7]:

fluffy	typeof(fluffy)	cast(fluffy as int)	bunny	typeof(bunny)
42	integer	42	0	text
43.1	real	43	Yes	text
somewhat fluffy	text	0	1	text
very fluffy	text	0	10	text
slightly fluffy	text	0	0.1	text

Values are converted as needed if possible, which can lead to surprising results:

In [8]:

%%sql

select * from bunnies where bunny;

 * sqlite:///birds.db
Done.

Out[8]:

fluffy	bunny
somewhat fluffy	1
very fluffy	10
slightly fluffy	0.1

In [9]:

%%sql

select fluffy, fluffy + 1, bunny, bunny + 1 from bunnies;

 * sqlite:///birds.db
Done.

Out[9]:

fluffy	fluffy + 1	bunny	bunny + 1
42	43	0	1
43.1	44.1	Yes	1
somewhat fluffy	1	1	2
very fluffy	1	10	11
slightly fluffy	1	0.1	1.1

Options for restricting datatypes:

There is a STRICT option for CREATE TABLE, but it only restricts to INT, REAL, TEXT, BLOB, ANY.
CHECK constraints go some way towards data integrity, but again they do not know about standard SQL datatypes.

There is no feasible way (or no way at all?) to enforce standard SQL datatypes in SQLite. We can use types like DECIMAL in the table creation; they are not checked, but as long as we populate the tables with suitable values, everything will be fine (in our examples). We will return to a similar subject in another ☆ section, on floating-point math.

We use SQLite in our examples because it comes with Python, and it serves our purpose. However, SQLite is limited in a number of ways. Other relational DBMS such as Postgresql offer full support for all standard SQL datatypes, and much more; but they also require more effort to install and run. The right tool for the right job; SQLite is not always a good choice. If your application demands strict checking of a variety of SQL types then SQLite as a DBMS is definitely not a good choice.

☆ Postgres Birds¶

Here is the birding example in Postgres (after server is installed, DB and user created, privileges granted, additional Python packages installed; see https://www.postgresql.org/). The obvious difference to SQLite is the connector:

In [10]:

%sql postgresql://myuser:mypw@localhost/mydb

The Jupyter SQL extension now prints two connectors and marks the currently active one:

In [11]:

%%sql

drop table if exists BIRDING;
create table BIRDING (
  BIRDNAME varchar,
  OBSDATE date,
  LOCATION varchar,
  TERRAIN varchar,
  NICKNAME varchar);

 * postgresql://myuser:***@localhost/mydb
   sqlite:///birds.db
Done.
Done.

Out[11]:

[]

In [12]:

%%sql

insert into BIRDING 
(BIRDNAME, OBSDATE, LOCATION, TERRAIN, NICKNAME)
values 
('House sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Tree sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Blackcap', '2018-06-14', 'Sherwood Forest', 'Woodland', 'mjones'),
('European robin', '2018-07-04', 'Little Winging', 'Garden', 'gmorgan'),
('Common gull', '2017-06-01', 'Cliffs of Moher', 'Beach', 'pmurphy'),
('Postgres woodpecker', '2018-06-15', 'Sherwood Forest', 'Woodland', 'mjones');

 * postgresql://myuser:***@localhost/mydb
   sqlite:///birds.db
6 rows affected.

Out[12]:

[]

In [13]:

%sql select * from BIRDING;

 * postgresql://myuser:***@localhost/mydb
   sqlite:///birds.db
6 rows affected.

Out[13]:

birdname	obsdate	location	terrain	nickname
House sparrow	2018-06-12	Cold Striffen	Garden	jsmith
Tree sparrow	2018-06-12	Cold Striffen	Garden	jsmith
Blackcap	2018-06-14	Sherwood Forest	Woodland	mjones
European robin	2018-07-04	Little Winging	Garden	gmorgan
Common gull	2017-06-01	Cliffs of Moher	Beach	pmurphy
Postgres woodpecker	2018-06-15	Sherwood Forest	Woodland	mjones

☆ Non-Standard Features¶

For MySQL (once server and client are installed, DB and user created, privileges granted, additional Python packages installed; see https://dev.mysql.com/downloads/) we can use a connector like this:

mysql+mysqldb://myuser:mypw@localhost/mydb

However, the CREATE statement above will not work.

Even here we have already used a feature that works in some DBMS but not in others: VARCHAR without length is fine in SQLite and PostgresQL, but in MySQL we need to include the maximum length, such as VARCHAR(30) for the bird names.

Once we use non-standard features we lock ourselves in: we can no longer easily switch e.g. to MySQL. On the other hand, sticking to standard features is not a sensible option in non-trivial applications, and switching to a different DBMS is never easy, not even for a tiny toy application like ours; in a real-world system with hundreds or thousands of tables and many thousands of application code lines depending on certain features it is no longer feasible to change the DBMS.

☆ DBMS on Linux¶

Linux is one of the top choices for DB server operating systems; the other being Windows Server in the Microsoft world. Just like with DBMS, comparisons are tricky, since there are so many aspects; Linux is usually praised for its stability and efficiency, requiring lower resources.

MySQL and PostgresQL can be installed with the Linux package manager which keeps track of dependencies and installs additional packages as needed. Database server installation and user management can be done on the command line; this allows for easy automation, and remote administration.

Exercises¶

Chess Club¶

Your local chess community is organized into several clubs; each club has a name and a home location where practise games are offered, usually once a week.

One function of a database solution would be to simply announce the various training evenings for the clubs, and their location.
Another very useful feature would be a list of participants for each event, so that tables and boards can be prepared in suitable number, and results can be recorded.
For each evening one of the clubs is hosting; but members from other clubs are also welcome.

Fill in information as you see fit to make this more interesting. Create the tables, insert some data, and write some select statements for useful listings.