Normal Forms¶

The process of converting a set of relations into a form that avoids certain problems is called normalisation. It involves decomposing relations into parts. Obviously, we do not want to introduce any errors in this process.

We will use curly braces for relations and underline for keys, e.g. relation R = {A, B, C } has attributes A, B, C and key A. If not otherwise noted, key means non-reducible i.e. candidate key.

We will also use the terms relation and table interchangeably to simplify the discussion when it comes to implementation in SQL. We will assume that tables do not contain multiple identical rows.

Let's look at our supplier table S = {SID, CITY, STATUS }:

SID	CITY	STATUS
S4	Denver	20
S5	Denver	20
S6	Salem	10

SID is the primary key, therefore all other attributes must be functionally dependent on SID. However, in this example let us also assume that the status is dependent on the city. The functional dependencies are

SID → CITY, STATUS
CITY → STATUS

The last FD means that this table design introduces a problem: the fact that Denver has status 20 is recorded multiple times.

recording a fact more than once is not necessary and a waste of resources
it can easily introduce inconsistencies, e.g. when the record for supplier S5 is updated with a status other than 20

Redundancy¶

For arbitrary attribute sets A and B of relation R:

If A → B and A is not a candidate key,

and the FD is not trivial
and A is not a superkey
and R contains at least 2 tuples

then R contains redundancies.

To remove the redundancy in the supplier table, we can decompose it into two separate tables. There are several possible decompositions, some of them more desirable than others. For practical reasons we want a decomposition that allows us to join the decomposed tables and get back exactly the original table (headers and content).

Nonloss Decomposition¶

Decomposition = projection
Recomposition = join
Reversibility: original relation is equal to the join of its projections

E.g. compare the following decompositions of S = {SID, CITY, STATUS}

lossless: {SID, CITY} and {SID, STATUS}
lossy: {SID, STATUS} and {STATUS, CITY}

For obvious practical reasons we are interested in lossless decompositions:

SID	CITY
S4	Denver
S5	Denver
S6	Salem

SID	STATUS
S4	20
S5	20
S6	10

Heath's Theorem¶

Let R = {A, B, C} a relation with A, B, C as sets of attributes
If A → B then R is equal to the join of its projections on {A, B} and {A, C}

Proof:

No tuple lost:
- (a,b,c) ∈ R.
- Then (a,b) ∈ R1 and (a,c) ∈ R2.
- Therefore (a,b,c) ∈ R1 JOIN R2.
No spurious tuples:
- (a,b,c) ∈ R1 JOIN R2.
- Therefore (a,b) ∈ R1 and (a,c) ∈ R2.
- There must be (a,x,c) ∈ R for some x to generate (a,c) ∈ R2.
- Therefore (a,x) ∈ R1.
- From (a,b) ∈ R1 and (a,x) ∈ R1 follows b = x because of A → B.
- So, (a,b,c) ∈ R.

Note that

The theorem says 'If', not 'If and only if'
A similar version of this idea is conveyed in the theorem by Delobel and Casey

First normal form¶

Every tuple contains exactly one value for each attribute.

The definition refers to one value of the appropriate type for the attribute; this is not necessarily a single word or number. What constitutes an appropriate type for an attribute depends on the application.
The often encountered definition by 'atomic values' (i.e. values must not be divisible) is problematic. Many types of values could be divided into smaller parts, but without benefit for the application.
The only meaningful and practical interpretation of the atomic definition is divisibility into several values of the attribute type i.e. if we can split the entry into several values for the type of the attribute (such as several dates for a date attribute, or several addresses for an address attribute) then the 1NF is violated.
Relations are by definition always in 1NF.
In bad implementations the 1NF of a table can be violated by e.g. writing several values into a field, separated by commas. People will have no problem reading and understanding this, but SQL statements will run into problems.
All tables we have seen in examples so far are in 1NF.

Update Anomalies¶

If a relation is only in 1NF we can have Update anomalies.

Example:

lump supplier table S and supplier-parts table SP together into FIRST = {SID, STATUS, CITY, PID, QTY}

with

(SID, PID) → QTY
SID → CITY, STATUS
CITY → STATUS

Suppose the table contains the following data:

sid	status	city	pid	qty
S1	20	London	P1	300
S1	20	London	P2	200
S1	20	London	P3	400
S4	30	Rome	P2	200
S4	30	Rome	P4	300
S3	10	Paris	P2	200

Anomalies:

DELETE: when we delete S3 we also loose the status and city
UPDATE: changing the city of S1 requires changes in multiple tuples

Second normal form¶

1NF and all non-key attributes irreducibly dependent on every candidate key

Left-irreducible Functional Dependency:¶

left-hand side is 'not too big', e.g. with SID → CITY

SID, PID → CITY left-hand side is reducible to SID
SID → CITY is left-irreducible

Procedure for 2NF:¶

take (nonloss) projections to eliminate reducible FDs

For relation FIRST: replace with SECOND = { SID, STATUS, CITY } and SP = { SID, PID, QTY }

Process is reversible by joining SECOND and SP, no information is lost (check with Heath!)
However, 2NF can contain information that 1NF could not, i.e. location of supplier without shipment

But relation SECOND still contains CITY → STATUS and therefore a transitive FD SID → CITY → STATUS

Suppose that SECOND contains the following data:

SID	STATUS	CITY
S1	20	London
S2	10	Paris
S3	10	Paris
S4	30	Rome

Anomalies:

DELETE S4: not only supplier information but also city-status information
UPDATE: city-status information appears many times, have to update every tuple

Third normal form¶

2NF and every non-key attribute is non-transitively dependent on every candidate key

Procedure: take (nonloss) projections to eliminate transitive FDs

Decomposition A:

project SECOND into SC = { SID, CITY } and CS = { CITY, STATUS }

Again, process is reversible by joining, no information lost (check with Heath!)
however, 3NF can contain more information about the real world, i.e. status of a city without supplier
SC and CS are independent:
- updates can be made to either one without regard for the other:
- SID → CITY and CITY → STATUS are automatically enforced,
- and therefore also SID → STATUS

From table SECOND above:

SID	CITY
S1	London
S2	Paris
S3	Paris
S4	Rome

CITY	STATUS
London	20
Paris	10
Rome	30

Dependency preservation¶

Decomposition A above is dependency-preserving: all FDs are preserved.

Not all decompositions are automatically dependency-preserving, even if they are lossless.

Decomposition B:

project SECOND into SC = { SID, CITY } and SS = { SID, STATUS }

lossless
3NF
however:
- FD CITY → STATUS is not preserved
- SC and SS are not independent

Example: if we update SS and set the status of S3 to 20 then we would violate CITY → Status, but we need to check in SC to find out about that.

We could use a database constraint to ensure CITY → STATUS, but there is a better way. From a practical point of view, we prefer decompositions that are dependency-preserving.

Theorem of Rissanen¶

Projections R1 and R2 of relation R are independent if and only if:

every FD in R is a logical consequence of those in R1 and R2
common attributes of R1 and R2 form a candidate key for at least one of the pair

Check decomposition A:

FDs in SECOND follow from SC and CS
SID → CITY, CITY → STATUS, and therefore SID → STATUS
Common attribute CITY is key in CS

Check decomposition B:

CITY → STATUS does not follow from FDs in SC and SS: SID → CITY, SID → STATUS

Atomic Relation¶

cannot be decomposed into independent relations

Examples:

SP = { SID, PID, QTY } with key (SID, PID) is atomic
SECOND from above is not: use Decomposition A to get independent relations SC and CS

Boyce/Codd Normal Form¶

A relation is in BCNF if and only if every non-trivial, left-irreducible FD has a candidate key as its determinant (left-hand side).

In a diagram: only FD arrows out of candidate keys.

Codd's original definition of 3NF did not adequately deal with a relation that

has two or more candidate keys such that
candidate keys are composite, and
they overlap

BCNF is a stronger definition, and therefore not equal to 3NF:

3NF and BCNF are equivalent when the three conditions above are not met.

Example: Student-Subject-Teacher

Students take several subjects
For each subject, each student is taught by only one teacher
Each teacher teaches only one subject, but each subject is taught by several teachers

Relation SJT with FDs S,J → T and T → J

S	J	T
Smith	Math	White
Smith	Physics	Green
Jones	Math	White
Jones	Physics	Brown

No description has been provided for this image

two overlapping candidate keys S,J and S,T
relation SJT is in 3NF (see Zaniolo's definition below)
T is determinant but not candidate key, therefore
relation SJT is not in BCNF
update anomalies, e.g. deleting 'Jones studies Physics' loses info that Brown teaches Physics

We can go to BCNF by projecting to ST = { S, T } and TJ = { T, J }

avoids anomaly above, but
ST and TJ are not independent, since
only T → J is represented, and
(S,J) → T cannot be deduced from T → J

E.g. insert Smith-Brown into ST

should be rejected since
Brown-Physics (Brown teaches Physics)
and we already have Smith-Physics-Green (Smith studies Physics with Green)
but need to check TJ to find out!

Note that the only candidate key for ST is (S,T).

S	T
Jones	Brown
Jones	White
Smith	Green
Smith	White

T	J
Brown	Physics
Green	Physics
White	Math

Check reversibility:

select S, J, T from ST natural join TJ;

S	J	T
Jones	Physics	Brown
Smith	Physics	Green
Jones	Math	White
Smith	Math	White

We observe that

SJT is atomic.
We cannot decompose every relation into components that are BCNF and independent.

☆ Zaniolo's definition of 3NF¶

X any subset of attributes
A any single attribute
3NF iff for every FD X → A at least one of the following is true:
- X contains A (trivial FD)
- X is a superkey
- A is contained in a candidate key

For a definition of BCNF drop condition 3.

Apply Zaniolo's definition to SJT:

Functional dependencies (S,J) → T and T → J
Candidate keys (S,J) and (S,T)
Functional dependency (S,J) → T
- X contains A i.e. S,J contains T: false
- X is superkey i.e. S,J is superkey: true
- A is contained in a candidate key i.e. T in (S,T) : true
OK for 3NF, OK for BCNF
Functional dependency T → J
- X contains A i.e. T contains J : false
- X is superkey i.e. T is superkey: false
- A is contained in a candidate key i.e. J in (S,J) : true
OK for 3NF, not OK for BCNF

Therefore, SJT is in 3NF but not in BCNF.

Algorithms for Normalization¶

Synthesis to 3NF¶

Go from relation R to set S of relations in 3NF:

start with S = empty
Find a minimal cover C of FDs in relation R
For every FD X → Y in C:
- Add relation {X, Y} to S
If no attribute set of any relation in S contains a key K for R:
- Add relation {K} to S

The resulting decomposition is always lossless and dependency-preserving.

The minimal cover of a set of functional dependencies is the smallest set of FDs implying all FDs in the original set; not necessarily unique.
Two relations with common keys can be combined, e.g., { A, B } and { A, C } can be combined to { A, B, C }.
In the result of this algorithm a relation can be dropped if all its attributes are contained in those of another relation.

Example¶

Start with R = { A, B, C, D } and FDs A → B and C → D.

If you prefer meaningful interpretations: let this be a table for recording participants in an event, such as an archery competition: archer A with birthdate B takes part in competion C with duration D. The essential function of the table is to record which archers participate in which event. Clearly the birthdate of the archer and the duration of the competition belong in separate tables -- which is exactly what the algorithm will achieve.

The minimal cover is A → B and C → D.
Create relation R1 = {A, B }
Create relation R2 = {C, D }
The key of the original relation is (A, C); B and D are redundant.
(A,C) is not in R1 or R2
Therefore, create another relation R3 = { A, C } with compound key (A,C)

Done. We now have R1 (the archers), R2 (the competitions), and R3 (the participations).

A relation with only key attributes may seem a little strange, but it is not uncommon. Take e.g. our SP table: if we remove the QTY attribute the table still tells us which supplier can deliver which part.

☆ Decomposition to BCNF¶

Go from relation R to set S of relations in BCNF:

S = { R }
While S has a relation R' that is not in BCNF do:
- Choose FD X → Y in R' that violates BCNF
- Add relation {X, Y} to S
- R' = R' - Y

Note that dependency preservation is not guaranteed.

Exercises¶

For the tables in our examples so far, and (hopefully) tables that you have created on your own, check if and how they can be decomposed in lossless and lossy fashions
Check if they are in 1NF (hopefully) and 2NF, 3NF, BCNF
If they are not in 3NF, apply the Synthesis algorithm
Find an example similar to SJT with patients and specialist doctors
- create the tables
- execute the SQL statements for decomposition
- check the reversibility