Data: from Latin given; given facts, from which additional facts can be inferred (= response to a user query)
Given fact: true proposition in logic
Database: collection of true propositions
Relational approach:
Data is represented by rows in tables, which can be interpreted as true propositions
Operators are provided for operating on rows in tables, providing additional true propositions from the existing ones
Entity: any object about which we wish to record information
Relationship: linking entities (can themselves be interpreted as a type of entity)
Properties: used to record information about entities: numbers, dates, strings
Data - one of the enterprise's most valuable assets; should be:
Hardware; should have
Software
People
over application-specific files such as spreadsheets:
ISO SQL 2016 and later
Standards are important efforts, but
The entire information content of the database is represented in only one way, explicit values in column positions in table rows.
Relational Databases are based on the relational model. Objects and operations defined in this theoretical approach are found in a similar form in practical implementations i.e. working DB management systems.
relation = math. for table, NOT from ER-model
A relation has two parts:
denotes a predicate (truth-valued function)
each tuple denotes a true proposition, i.e. instantiates the predicate
When the context is clear the attribute names can be ommitted, e.g. the (unnamed) tuples of relation Age are { (John, 30), (Jane, 25) }
Relations are sets of tuples:
Operations on one or more relation always result in a relation (not some other kind of object).
Restrict A: tuples in A satisfying a condition
Project A: specified attributes of tuples in A
remove duplicates!
Product A,B: all combinations of tuples in A and B
Union A,B: all tuples appearing in either or both
remove duplicates!
Difference A,B: all tuples in A but not in B
Intersect A,B: all tuples appearing in both
Join: A,B: all combinations with common value(s) for common attribute(s)
Example:
Emp = { (Smith, Accounting), (Jones, Accounting), (Brown, Sales) }
Bonus = { (Accounting, 300), (Sales, 400), (Marketing, 0) }
Join Emp, Bonus = { (Smith, Accounting, 300), (Jones, Accounting, 300), (Brown, Sales, 400) }
Divide A,B,C: tuples in A that match via B with all tuples in C
Example:
Doctor = { (Semmelweiss), (Billroth), (Paracelsus) }
Treatment = { (Billroth, Smith), (Billroth, Jones), (Billroth, Brown), (Semmelweiss, Smith), (Paracelsus, Luther) }
Sick = { (Smith), (Brown) }
Divide Doctor, Treatment, Sick = { (Billroth) }
Interpretation: doctors who treat all patients
Supplier-parts example: suppliers (relation S) that supply (SP) all parts (P)
The Structured Query Language is the standard for data query and manipulation in relational database systems. It implements many (but not all) concepts of the relational model.
The basic concept of the SQL implementation is the table. When going from the relation model to the SQL implementation relations become tables.
An SQL table is similar to a relation, with some important differences:
Doubles:
Order:
SQL92 standard numeric data types:
INT (precision)
NUMERIC (precision, scale)
DECIMAL (precision, scale)
REAL
DOUBLE PRECISION
FLOAT (precision)
Other Types:
CHAR (size)
VARCHAR (size)
DATE
BOOLEAN
These are standard definitions. Implementations vary considerably, e.g.
Time for studying the documentation of your DBMS is well spent. Otherwise, subtle errors will haunt you in years to come.
define restrictions for valid operations, e.g. certain operations with part weights and quantities make sense, while others do not:
SQL92: named types in terms of primitive types, no operation restrictions
Postgres:
When the packages are installed we need to load the SQL extensions into the current notebook:
%load_ext sql
This allows us to use SQL commands in code cells. We need to start the cell with
First we connect to a database. We choose SQLite for convenience.
%sql sqlite:///sp.db
We ensure that foreign keys are enabled. This is specific to SQLite.
%%sql
PRAGMA foreign_keys = ON;
Now we are ready to create tables. However, since we probably want to run this code multiple times, and a table cannot be created with the same name twice, we must first make sure that existing tables are removed:
You local birding club decides to finally put some structure into all those reports from members and enter them into a database. The following things are reported by the birders:
The birders are fairly learned in the avian ways, and the common names they use identify the species. Location and nickname also identify their entities perfectly (e.g., no two birders use the same nickname). Birders always report the same type of terrain for a given location.
Basic SQL is quite intuitive. By looking at the following statement we can easily see what it does:
The statement could be written on a singe line, but it is more readable this way. Layout matters not to the computer, but to the human who may be forced to read and understand our statements at some point in the future.
Why do we first drop the table, if it exists? Due to the nature of Jupyter notebooks it is much more convenient to ensure that all code in a notebook can run from start to end repeatedly. It is possible to hand-pick the order of execution by running cells individually, but this gets out of hand very quickly.
%%sql
drop table if exists BIRDING;
create table BIRDING (
BIRDNAME varchar,
OBSDATE date,
LOCATION varchar,
TERRAIN varchar,
NICKNAME varchar);
Now that we have the table structure, we can fill it with some data:
%%sql
insert into BIRDING values ('House sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith');
insert into BIRDING values ('Tree sparrow', '2018-06-12', 'Cold Striffen', 'Garden', 'jsmith');
insert into BIRDING values ('Blackcap', '2018-06-14', 'Sherwood Forest', 'Woodland', 'mjones');
insert into BIRDING values ('European robin', '2018-07-04', 'Little Winging', 'Garden', 'gmorgan');
insert into BIRDING values ('Common gull', '2017-06-01', 'Cliffs of Moher', 'Beach', 'pmurphy');
insert into BIRDING values ('Green woodpecker', '2018-06-15', 'Sherwood Forest', 'Woodland', 'mjones');
To check if everything worked the way we expected, we generate a listing:
%%sql
select * from BIRDING;
Select statements can use conditions on records to be displayed, and state the columns, e.g.
%%sql
select location, nickname from birding where obsdate < '2018-01-01';
Get the SQL code for the BIRDING table to run in your own Jupyter notebook in the lab.
If you want a little more challenge, install the software on your own computer. You will probably need the following packages:
If you are using Linux (and why would you not??) then enter the following on the command line:
sudo pip3 install ipython-sql
or, if you are using a Conda distribution of Python then instead of pip3
sudo conda install ipython-sql
EXERCISE:
Your local chess community is organized into several clubs; each club has a name and a home location where practise games are offered, usually once a week, but subject to change. One function of a database solution would be to simply announce the various training evenings for the clubs, and their location. Another very useful feature would be a list of participants for each event, so that furniture and equipment can be prepared in suitable number. For each evenings one of the clubs is hosting; but members from other clubs are also welcome.
Fill in information as you see fit to make this more interesting. Create the tables and use some select statements for useful listings.
If for any given value of x there is only one value for y, or none, then y is functionally dependent on x.
In other words, a given value for x points to a certain value for y. This is written as
x → y
Note that different values of x can point to the same value y; but a given value of x cannot point to more than one value for y.
In the following table some data about suppliers is stored:
SID | Name | City | Status |
---|---|---|---|
S1 | Smith | New York | 30 |
S2 | Jones | New York | 30 |
S3 | Black | Los Angeles | 30 |
S4 | Brown | Denver | 20 |
S5 | White | Denver | 20 |
S6 | Green | Salem | 10 |
For the table above we have SID → NAME, STATUS, CITY and CITY → STATUS
When we look at functional dependencies, we distinguish between:
We are usually interested in (b).
Functional dependencies are important for practical applications. A good table design helps to automatically enforce functional dependencies.
Give a set of functional dependencies we can derive further FDs that must also hold, e.g. given
we can derive (among others):
Trivial Dependencies: e.g. SID → SID
A FD is trivial if the right-hand side is a subset of the left-hand side.
These axioms give us a set of rules to manipulate functional dependencies.
Let A, B, C arbitrary subsets of attributes of relation R, and denote AB the union of A and B.
First rule = definition of trivial FD
Rules are complete: all implied FDs can be derived
Rules are sound: no additional FDs (not implied by S) can be so derived
Several more rules can be derived to simplify the practical task of manipulating FDs:
Decomposition: if A → BC, then A → B and A → C
Union: if A → B and A → C, then A → BC
Composition: if A → B and C → D, then AC → BD
Self-determination: A → A
Example: A → BC, B → E, CD → EF (e.g. A EMPID, B DEPID, C MGR, D PROJID, E DEPNAME, F PERCTIME)
Show: AD → F
In other words, show that in this application the percentage of time worked on a project is determined by employee and project.
The functional dependency AD → F can be derived with a suitable choice of operations.
EXERCISE: bc → a, de → c, a → bcde
Show that bc → d
Ensure that the data is correct.
These rules restrict the data that can be stored. If an attempt is made to violate such a rule, the DB systems responds with an error message.
These type of constraints are widely supported by relational DB systems.
Type and values of individual Attributes, e.g.
Constraints on a Relation
Constraints on the whole Database, e.g.
Relation Predicate: logical AND of all relation constraints
Golden Rule: no update operation may leave any relation in a state violating its predicate
These type of constraints are typically not supported by the DB system out of the box, and have to be implemented by application logic i.e. we must program them.
Example: relationship status
Valid transitions
Not valid
When functional dependencies have been identified the logical next step is to define keys for tables.
A candidate key K for a given relation is a set of attributes with
Every relation has at least one candidate key: if we cannot find any subset of attributes that form a candidate key we can use the complete set of attributes.
Note that this does not hold for tables, since a table can contain multiple identical tuples.
In case of more than one candidate key: choose one as primary key, the others are alternate keys
A superkey is unique, but not necessarily irreducible
A foreign key is a set of attributes F in relation B, where
The value of F is a reference to the tuple containing C.
Example: value of ORDR.CUST must be equal to an existing value of CUST.ID
Ensure valid foreign key values. Default: record cannot be deleted while there are still references.
UNIQUE can be NULL i.e. without value, not practical for identifiers
REFERENCES A(C)
Example: Supplier and Parts
We want to keep track of our suppliers and record which supplier deliver which parts. We put the data about the suppliers in table S and the parts data in table P. Since we get certain parts from certain suppliers, we record that data in another table SP.
Again, to make sure that all the code in this notebook can be run repeatedly from top to bottom we drop the table if they already exist:
%%sql
drop table if exists SP;
drop table if exists S;
drop table if exists P;
Now we create out supplier table:
%%sql
create table S (
SID varchar primary key,
NAME varchar,
STATUS numeric(5),
CITY varchar);
The table exists and is empty. We fill it with some data.
%%sql
insert into S values ('S1', 'Smith', 30, 'New York');
insert into S values ('S2', 'Jones', 30, 'New York');
insert into S values ('S3', 'Black', 30, 'Los Angeles');
insert into S values ('S4', 'Brown', 20, 'Denver');
insert into S values ('S5', 'White', 20, 'Denver');
insert into S values ('S6', 'Green', 10, 'Salem');
To make sure that this actually worked we generate a list of the records in the table.
%%sql
select * from S;
In the same fashion we create a table for some parts. Again the ID identifies the part, therefore we define it as the primary key.
%%sql
create table P (
PID varchar primary key,
NAME varchar,
COLOR varchar,
WEIGHT numeric(5,1));
Again, insert some data and check the contents of the table:
%%sql
insert into P values ('P1', 'Nail', 'Gray', 12);
insert into P values ('P2', 'Bolt', 'Black', 17);
insert into P values ('P3', 'Screw', 'Gray', 17);
insert into P values ('P4', 'Screw', 'Silver', 14);
insert into P values ('P5', 'Hub', 'Silver', 12);
insert into P values ('P6', 'Cog', 'Gray', 19);
select * from P;
In this situation, we get parts from various suppliers. Not all suppliers deliver all parts. The following table allows us to define every possible situation:
Similar for the parts; if a part is delivered by no supplier, it has no entry in the SP table.
We want to ensure referential integrity:
Because of the REFERENCES definitions the DB will enforce these constraints.
%%sql
create table SP (
SID int references S(SID),
PID int references P(PID),
QTY numeric(6));
%%sql
insert into SP values ('S1', 'P1', 300);
insert into SP values ('S1', 'P2', 200);
insert into SP values ('S1', 'P3', 400);
insert into SP values ('S1', 'P4', 200);
insert into SP values ('S1', 'P5', 100);
insert into SP values ('S1', 'P6', 100);
insert into SP values ('S2', 'P1', 300);
insert into SP values ('S2', 'P2', 400);
insert into SP values ('S3', 'P2', 200);
insert into SP values ('S4', 'P2', 200);
insert into SP values ('S4', 'P4', 300);
insert into SP values ('S4', 'P5', 500);
select * from SP;
Answer the following questions by looking at the data.
Which of these questions could be answered easily by using an operation from relational algebra?
The beauty of Jupyter notebooks is that they support experimentation in a straight-forward fashion. Feel free to create more tables for subjects of your interest, and see what kind of information they can store.
A view is a named relational expression, from a practical point very similar to a defined function without parameters in programming languages.
%%sql
drop view if exists GS;
create view GS as select sid, status, city from S where status > 1;
The view can be used in select statements just like a table:
%%sql
select * from GS;
Multi-user DB systems typically let us define permissions for individual tables and views, which allows us to implement various schemes for security and user roles; views come in very handy in such an environment.
Logical data independence:
Immunity of users and programs to changes in the logical structure of the database (conceptual level)
Since everything in a relational DB system is stored in tables, the data on which users and tables exist is also stored in tables: the catalog tables.
With the help of these meta-tables the DB system allows for changing the structure of user tables, even when they already contain data, e.g. to add a column:
ALTER TABLE table_name ADD COLUMN column_definition ;
SQLite takes a fairly unusual approach to this topic by storing the SQL statement that created the object, rather than the structural information on columns and their data types. Because of this, the ALTER TABLE command is severly limited in SQLite.
%%sql
select * from sqlite_master;
For comparison, Postgres uses a separate catalog table to store information on the columns of user tables. This allows for the full range of ALTER TABLE commands, and makes it easy to e.g. query the names of all numeric columns:
Postgres:
SELECT table_name,column_name FROM information_schema.columns WHERE data_type = 'numeric';
table_name | column_name |
---|---|
p | weight |
s | status |
sp | qty |
EXERCISE: