Relational Database Systems: Basic Concepts¶
Data: from Latin given; given facts
- Given fact: true proposition in logic
- Database: collection of true propositions
Relational approach:
- Data is represented by rows in tables, which can be interpreted as true propositions
- Operators on rows in tables derive additional true propositions in response to queries
Example:
- table containing soccer players and their clubs
- additional facts in response to user queries:
- number of players per club
- number of clubs
- club with most players
Other types of databases e.g.
- graph databases: store data as nodes and edges, understood as entities with properties and relationships
- document databases: store data in formats such as JSON (fields and values) and XML (markup) which are also commonly used in data exchange between applications
Advantages of databases¶
over application-specific files such as spreadsheets:
- Centralized control over data
- Data can be shared
- Redundancy can be reduced
- Inconsistencies can be avoided
- Transaction support
- Integrity constraints
- Security constraints
- Data independence, immunity of applications to change in physical representation and access technique
- Standards e.g. ISO SQL
Standards are important efforts, but
- even those database management systems (DBMS) that adhere closely to the standard also offer convenient non-standard extensions
- this tends to lead to lock-in for the customer
- the practical reality is that the choice of a particular DBMS is a long-term one
- switching to a different DBMS is costly and risky, and will be avoided if at all possible
Information Principle¶
The entire information content of the relational database is represented in only one way, explicit values in column positions in table rows.
Not all relational DBMS implement this principle.
The Relational Model¶
Relational Databases are based on the relational model.
relation = math. for table, NOT from ER-model
A relation has two parts:
- heading: set of attributes A:T with name A and type T
denotes a predicate (truth-valued function)
- body: set of tuples A:V with attribute A and value V i.e. a set of tuples that conforms to the heading
each tuple denotes a true proposition, i.e. instantiates the predicate
SQL¶
The Structured Query Language is the standard for data query and manipulation in relational database systems. It implements many (but not all) concepts of the relational model.
Table vs. Relation¶
The basic concept of the SQL implementation is the table. When going from the relation model to the SQL implementation relations become tables.
An SQL table is similar to a relation, with some important differences:
Doubles:
- A relation contains a set of tuples, which means that by definition there are no doubles.
- An SQL table and the result of an SQL query can contain several rows with identical values.
Order:
- Since there is no order defined for the elements of a set, any order is valid when listing the tuples of a relation.
- Results of SQL SELECT statements can be ordered at will with the ORDER BY clause.
In the following we will assume that
- tables do not contain multiple identical rows
- order of tuples and attributes are not crucial for our examples
which means that we can use the terms relation and table interchangeably, unless otherwise noted.
Data Types¶
- The relational model is not concerned with technical implementation.
- SQL DBMS must know in great detail the type of data stored in tables.
Numbers¶
- precision: number of significant digits
- scale: position of the least significant digit to the right of the decimal
- e.g. 1234.56 has precision 6, scale 2, defined as NUMERIC(6,2)
SQL92 standard numeric data types:
- INT (precision)
- exact numeric type
- scale 0
- NUMERIC (precision, scale)
- exact numeric type i.e. rounding and truncating dictated by predefined formula
- decimal portion exactly scale
- DECIMAL (precision, scale)
- exact numeric type
- decimal portion at least scale
- REAL
- approximate numeric type
- binary precision based on manufacturer limit
- default precision must be less than DOUBLE PRECISION
- DOUBLE PRECISION
- approximate numeric type
- binary precision based on manufacturer limit
- default precision must be greater than REAL
- FLOAT (precision)
- approximate numeric type
- DOUBLE with given precision
Other Types¶
- CHAR (size)
- fixed length
- unused space empty
- VARCHAR (size)
- variable length up to size
- no unused space
- without size: not in SQL92 standard, but widely supported, e.g. in PostgreSQL; but not in MySQL
- DATE
- Various input formats, some manufacturer dependent; e.g. '2012-04-21'
- Calendar functions such as length of time, weekday
- BOOLEAN
- two values, true or false
These are standard definitions. Implementations vary considerably, e.g.
- Postgres: DECIMAL(p, s) and NUMERIC(p, s) are equivalent
- SQLite:
- CHAR and VARCHAR are both converted to TEXT with default max length of 1,000,000,000
- datatypes are not enforced, e.g. even in a column defined as INT we can enter a text like 'fluffy bunnies'
Time for studying the documentation of your DBMS is well spent. Otherwise, subtle errors will haunt you in years to come.
SQL in Jupyter Notebook¶
If you are not familiar with Jupyter notebooks: take a look at jupyter.org
This document itself was generated from a Jupyter notebook.
To run the examples on your own computer:
- If necessary install Python first, from
python.org; steps for Windows:
- download 64-bit installer for stable release (3.11.8 Feb 2024)
- start installer
- make sure 'Add Python to Path' is checked
- click 'Install Now'
- To install packages, open a terminal window (command prompt in Windows, cmd) and enter:
- pip install jupyterlab ipython-sql
- To start Jupyter Lab, enter:
- jupyter lab
We load the SQL extension into the current notebook:
%load_ext sql
This allows us to use SQL commands in code cells. We need to start the cell with
- %sql for a single line of code
- %%sql for multiple lines
If you want to comment inside a code cell: note the different comment characters in Python and SQL:
- Python: # (hashtag)
- SQL: -- (two hyphens)
SQLite¶
"SQLite does not compete with client/server databases. SQLite competes with fopen(). "
We choose SQLite as our relational DB software:
- SQLite is limited in many respects, but powerful enough for our purposes
- It is free and Open Source, and part of the Python standard library
A file with the suffix .db in the current directory stores our tables. For Jupyter notebooks, the current directory is the one where your notebook file is stored (the file with the .ipynb suffix).
SQLite is widely used as an embedded database engine in desktop software, smartphone apps, and many other places. There are much more sophisticated options for free DBMS, such as PostgreSQL; however, Postgres requires separate installation, a running DB server, and user management. For our purposes the cost/benefit ratio favours SQLite.
For us, the downsides in using SQLite are
- datatypes are not enforced at all
- SQL syntax is not enforced fully
However: If we are using correct SQL and proper datatypes our examples will work just like in PostgreSQL; with a few tiny differences that will be noted as we come across them.
SQLite is by far the most widely deployed database engine. See sqlite.org for the amazing statistics.
☆ Other commonly used relational DBMS¶
☆ sections offer additional information on a topic; not essential, but (hopefully) interesting
It is tricky to compare DBMS overall: how would you aggregate the many aspects into a single ranking? Here are some of the big players (see https://db-engines.com/en/ranking), with year of initial release:
- Oracle: often regarded as the top commercial solution, 1980
- DB2: another top commercial system, widely used on IBM hosts, 1983
- SQL Server: a Microsoft product, 1989
- PostgreSQL: very strong free and open source alternative to commercial products, 1989
- MySQL: common choice for web databases, now owned by Oracle, but still free and open source, 1995
If you are interested in comparing e.g. SQLite and Postgres yourself:
- install PostgreSQL from postgresql.org
- configure (create database, create user, alter user, grant all)
- in Jupyter notebook use connector postgresql://myuser:mypw@localhost/mydb
- switch SQLite and Postgres by connector
Of course, the same goes for MySQL, and other free DBMS: install and compare them, if you have the time and motivation.
Sample DB: The Birding Club¶
Here is a verbal description of a situation where a DB solution is desired:
You local birding club decides to finally put some structure into all those reports from members and enter them into a database, to allow for automated analysis. The following things are reported by the birders:- common name of the bird
- date of observation
- location
- type of terrain
- nickname of birder
The birders are fairly learned in the avian ways, and the common names they use identify the species. No two birders use the same nickname. They always report the same type of terrain for a given location.
We decide on a prototype development method and create a very simple first solution from the description: a single table that holds all the data. Whether this solution is adequate depends on the application: it may be exactly what the birders need, or there may be room for improvement.
Creating a Database¶
We connect to a SQLite database in the current directory with the following statement; if the database does not exist it will be created.
%sql sqlite:///birds.db
Creating Tables¶
By creating an SQL table we define a structure to hold data:
- a table named BIRDING is created
- for each column we specify a name and a data type
- the list of columns is enclosed in parenthesis
- a semicolon terminates the statement
%%sql
drop table if exists BIRDING;
create table BIRDING (
BIRDNAME varchar(30),
OBSDATE date,
LOCATION varchar(30),
TERRAIN varchar(20),
NICKNAME varchar(20)
);
* sqlite:///birds.db Done. Done.
[]
The DB software responds with the currently used connector and a message for each command.
Why do we first drop the table, if it exists?
We cannot create a table when another one with the same name already exists.
In Jupyter notebooks it is much more convenient to ensure that all code in a notebook can run from start to end repeatedly. It is possible to hand-pick the order of execution by running cells individually, but this gets out of hand very quickly.
Therefore, we make sure that an existing table is dropped, so our CREATE statement will not produce an error.
The statement could be written on a singe line, but it is more readable this way. Layout matters not to SQLite, but to the humans who may need to read and understand our statements at some point in the future.
Inserting Data¶
Now that we have the table structure, we can fill it with some data.
- In this case we do not need to list the column names before the VALUES keyword, because we are supplying all values in the order they were defined
- However, it is still good practise since it documents the statement
- We could write several INSERT statements, but the INSERT command accepts several tuples after the VALUES keyword
%%sql
insert into BIRDING
(BIRDNAME, OBSDATE, LOCATION, TERRAIN, NICKNAME)
values
('House sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Tree sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Blackcap', '2018-06-14', 'Sherwood Forest', 'Woodland', 'mjones'),
('European robin', '2018-07-04', 'Little Winging', 'Garden', 'gmorgan'),
('Common gull', '2017-06-01', 'Cliffs of Moher', 'Beach', 'pmurphy'),
('Green woodpecker', '2018-06-15', 'Sherwood Forest', 'Woodland', 'mjones');
* sqlite:///birds.db 6 rows affected.
[]
Listing Data¶
To check if everything worked the way we expected, we generate a listing:
%%sql
select * from BIRDING;
* sqlite:///birds.db Done.
BIRDNAME | OBSDATE | LOCATION | TERRAIN | NICKNAME |
---|---|---|---|---|
House sparrow | 2018-06-12 | Cold Striffen | Garden | jsmith |
Tree sparrow | 2018-06-12 | Cold Striffen | Garden | jsmith |
Blackcap | 2018-06-14 | Sherwood Forest | Woodland | mjones |
European robin | 2018-07-04 | Little Winging | Garden | gmorgan |
Common gull | 2017-06-01 | Cliffs of Moher | Beach | pmurphy |
Green woodpecker | 2018-06-15 | Sherwood Forest | Woodland | mjones |
Select statements can use conditions on records to be displayed, and state the columns, e.g.
%%sql
select location, nickname from birding where obsdate < '2018-01-01';
* sqlite:///birds.db Done.
LOCATION | NICKNAME |
---|---|
Cliffs of Moher | pmurphy |
☆ SQLite and Datatypes¶
"Flexible typing is a feature of SQLite, not a bug."
From https://sqlite.org/datatype3.html, a source worth studying if you intend to do any serious work with SQLite.
In SQLite datatypes are suggestions, not requirements. Column datatypes are in fact optional in an SQLite CREATE TABLE statement; if included, they are implemented as 'type affinity'; Values will be converted to something close to the suggested format, if possible. Internally (for values stored or manipulated) there are only the following 'storage classes':
- NULL
- INTEGER
- REAL (8-byte IEEE floating point)
- TEXT (using database encoding: UTF-8, UTF-16BE or UTF-16LE)
- BLOB (stored exactly as it was input)
Some examples:
%%sql
drop table if exists bunnies;
create table bunnies (
fluffy int,
bunny varchar );
insert into bunnies values
(42, 0),
(43.1, 'Yes'),
('somewhat fluffy', TRUE),
('very fluffy', 10),
('slightly fluffy', '0.1');
select fluffy, typeof(fluffy), cast(fluffy as int), bunny, typeof(bunny) from bunnies;
* sqlite:///birds.db Done. Done. 5 rows affected. Done.
fluffy | typeof(fluffy) | cast(fluffy as int) | bunny | typeof(bunny) |
---|---|---|---|---|
42 | integer | 42 | 0 | text |
43.1 | real | 43 | Yes | text |
somewhat fluffy | text | 0 | 1 | text |
very fluffy | text | 0 | 10 | text |
slightly fluffy | text | 0 | 0.1 | text |
Values are converted as needed if possible, which can lead to surprising results:
%%sql
select * from bunnies where bunny;
* sqlite:///birds.db Done.
fluffy | bunny |
---|---|
somewhat fluffy | 1 |
very fluffy | 10 |
slightly fluffy | 0.1 |
%%sql
select fluffy, fluffy + 1, bunny, bunny + 1 from bunnies;
* sqlite:///birds.db Done.
fluffy | fluffy + 1 | bunny | bunny + 1 |
---|---|---|---|
42 | 43 | 0 | 1 |
43.1 | 44.1 | Yes | 1 |
somewhat fluffy | 1 | 1 | 2 |
very fluffy | 1 | 10 | 11 |
slightly fluffy | 1 | 0.1 | 1.1 |
Options for restricting datatypes:
- There is a STRICT option for CREATE TABLE, but it only restricts to INT, REAL, TEXT, BLOB, ANY.
- CHECK constraints go some way towards data integrity, but again they do not know about standard SQL datatypes.
There is no feasible way (or no way at all?) to enforce standard SQL datatypes in SQLite. We can use types like DECIMAL in the table creation; they are not checked, but as long as we populate the tables with suitable values, everything will be fine (in our examples). We will return to a similar subject in another ☆ section, on floating-point math.
We use SQLite in our examples because it comes with Python, and it serves our purpose. However, SQLite is limited in a number of ways. Other relational DBMS such as Postgresql offer full support for all standard SQL datatypes, and much more; but they also require more effort to install and run. The right tool for the right job; SQLite is not always a good choice. If your application demands strict checking of a variety of SQL types then SQLite as a DBMS is definitely not a good choice.
☆ Postgres Birds¶
Here is the birding example in Postgres (after server is installed, DB and user created, privileges granted, additional Python packages installed; see https://www.postgresql.org/). The obvious difference to SQLite is the connector:
%sql postgresql://myuser:mypw@localhost/mydb
The Jupyter SQL extension now prints two connectors and marks the currently active one:
%%sql
drop table if exists BIRDING;
create table BIRDING (
BIRDNAME varchar,
OBSDATE date,
LOCATION varchar,
TERRAIN varchar,
NICKNAME varchar);
* postgresql://myuser:***@localhost/mydb sqlite:///birds.db Done. Done.
[]
%%sql
insert into BIRDING
(BIRDNAME, OBSDATE, LOCATION, TERRAIN, NICKNAME)
values
('House sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Tree sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith'),
('Blackcap', '2018-06-14', 'Sherwood Forest', 'Woodland', 'mjones'),
('European robin', '2018-07-04', 'Little Winging', 'Garden', 'gmorgan'),
('Common gull', '2017-06-01', 'Cliffs of Moher', 'Beach', 'pmurphy'),
('Postgres woodpecker', '2018-06-15', 'Sherwood Forest', 'Woodland', 'mjones');
* postgresql://myuser:***@localhost/mydb sqlite:///birds.db 6 rows affected.
[]
%sql select * from BIRDING;
* postgresql://myuser:***@localhost/mydb sqlite:///birds.db 6 rows affected.
birdname | obsdate | location | terrain | nickname |
---|---|---|---|---|
House sparrow | 2018-06-12 | Cold Striffen | Garden | jsmith |
Tree sparrow | 2018-06-12 | Cold Striffen | Garden | jsmith |
Blackcap | 2018-06-14 | Sherwood Forest | Woodland | mjones |
European robin | 2018-07-04 | Little Winging | Garden | gmorgan |
Common gull | 2017-06-01 | Cliffs of Moher | Beach | pmurphy |
Postgres woodpecker | 2018-06-15 | Sherwood Forest | Woodland | mjones |
☆ Non-Standard Features¶
For MySQL (once server and client are installed, DB and user created, privileges granted, additional Python packages installed; see https://dev.mysql.com/downloads/) we can use a connector like this:
mysql+mysqldb://myuser:mypw@localhost/mydb
However, the CREATE statement above will not work.
Even here we have already used a feature that works in some DBMS but not in others: VARCHAR without length is fine in SQLite and PostgresQL, but in MySQL we need to include the maximum length, such as VARCHAR(30) for the bird names.
Once we use non-standard features we lock ourselves in: we can no longer easily switch e.g. to MySQL. On the other hand, sticking to standard features is not a sensible option in non-trivial applications, and switching to a different DBMS is never easy, not even for a tiny toy application like ours; in a real-world system with hundreds or thousands of tables and many thousands of application code lines depending on certain features it is no longer feasible to change the DBMS.
☆ DBMS on Linux¶
Linux is one of the top choices for DB server operating systems; the other being Windows Server in the Microsoft world. Just like with DBMS, comparisons are tricky, since there are so many aspects; Linux is usually praised for its stability and efficiency, requiring lower resources.
MySQL and PostgresQL can be installed with the Linux package manager which keeps track of dependencies and installs additional packages as needed. Database server installation and user management can be done on the command line; this allows for easy automation, and remote administration.
Exercises¶
Chess Club¶
Your local chess community is organized into several clubs; each club has a name and a home location where practise games are offered, usually once a week.
- One function of a database solution would be to simply announce the various training evenings for the clubs, and their location.
- Another very useful feature would be a list of participants for each event, so that tables and boards can be prepared in suitable number, and results can be recorded.
- For each evening one of the clubs is hosting; but members from other clubs are also welcome.
Fill in information as you see fit to make this more interesting. Create the tables, insert some data, and write some select statements for useful listings.