Normal Forms¶
The process of converting a set of relations into a form that avoids certain problems is called normalisation. It involves decomposing relations into parts. Obviously, we do not want to introduce any errors in this process.
We will use curly braces for relations and underline for keys, e.g. relation R = {A, B, C } has attributes A, B, C and key A. If not otherwise noted, key means non-reducible i.e. candidate key.
We will also use the terms relation and table interchangeably to simplify the discussion when it comes to implementation in SQL. We will assume that tables do not contain multiple identical rows.
Let's look at our supplier table S = {SID, CITY, STATUS }:
SID | CITY | STATUS |
---|---|---|
S4 | Denver | 20 |
S5 | Denver | 20 |
S6 | Salem | 10 |
SID is the primary key, therefore all other attributes must be functionally dependent on SID. However, in this example let us also assume that the status is dependent on the city. The functional dependencies are
- SID → CITY, STATUS
- CITY → STATUS
The last FD means that this table design introduces a problem: the fact that Denver has status 20 is recorded multiple times.
- recording a fact more than once is not necessary and a waste of resources
- it can easily introduce inconsistencies, e.g. when the record for supplier S5 is updated with a status other than 20
Redundancy¶
For arbitrary attribute sets A and B of relation R:
If A → B and A is not a candidate key,
- and the FD is not trivial
- and A is not a superkey
- and R contains at least 2 tuples
then R contains redundancies.
To remove the redundancy in the supplier table, we can decompose it into two separate tables. There are several possible decompositions, some of them more desirable than others. For practical reasons we want a decomposition that allows us to join the decomposed tables and get back exactly the original table (headers and content).
Nonloss Decomposition¶
- Decomposition = projection
- Recomposition = join
- Reversibility: original relation is equal to the join of its projections
E.g. compare the following decompositions of S = {SID, CITY, STATUS}
- lossless: {SID, CITY} and {SID, STATUS}
- lossy: {SID, STATUS} and {STATUS, CITY}
For obvious practical reasons we are interested in lossless decompositions:
SID | CITY |
---|---|
S4 | Denver |
S5 | Denver |
S6 | Salem |
SID | STATUS |
---|---|
S4 | 20 |
S5 | 20 |
S6 | 10 |
Heath's Theorem¶
- Let R = {A, B, C} a relation with A, B, C as sets of attributes
- If A → B then R is equal to the join of its projections on {A, B} and {A, C}
Proof:
- No tuple lost:
- (a,b,c) ∈ R.
- Then (a,b) ∈ R1 and (a,c) ∈ R2.
- Therefore (a,b,c) ∈ R1 JOIN R2.
- No spurious tuples:
- (a,b,c) ∈ R1 JOIN R2.
- Therefore (a,b) ∈ R1 and (a,c) ∈ R2.
- There must be (a,x,c) ∈ R for some x to generate (a,c) ∈ R2.
- Therefore (a,x) ∈ R1.
- From (a,b) ∈ R1 and (a,x) ∈ R1 follows b = x because of A → B.
- So, (a,b,c) ∈ R.
Note that
- The theorem says 'If', not 'If and only if'
- A similar version of this idea is conveyed in the theorem by Delobel and Casey
First normal form¶
Every tuple contains exactly one value for each attribute.
The definition refers to one value of the appropriate type for the attribute; this is not necessarily a single word or number. What constitutes an appropriate type for an attribute depends on the application.
The often encountered definition by 'atomic values' (i.e. values must not be divisible) is problematic. Many types of values could be divided into smaller parts, but without benefit for the application.
The only meaningful and practical interpretation of the atomic definition is divisibility into several values of the attribute type i.e. if we can split the entry into several values for the type of the attribute (such as several dates for a date attribute, or several addresses for an address attribute) then the 1NF is violated.
Relations are by definition always in 1NF.
In bad implementations the 1NF of a table can be violated by e.g. writing several values into a field, separated by commas. People will have no problem reading and understanding this, but SQL statements will run into problems.
All tables we have seen in examples so far are in 1NF.
Update Anomalies¶
If a relation is only in 1NF we can have Update anomalies.
Example:
lump supplier table S and supplier-parts table SP together into FIRST = {SID, STATUS, CITY, PID, QTY}
with
- (SID, PID) → QTY
- SID → CITY, STATUS
- CITY → STATUS
Suppose the table contains the following data:
sid | status | city | pid | qty |
---|---|---|---|---|
S1 | 20 | London | P1 | 300 |
S1 | 20 | London | P2 | 200 |
S1 | 20 | London | P3 | 400 |
S4 | 30 | Rome | P2 | 200 |
S4 | 30 | Rome | P4 | 300 |
S3 | 10 | Paris | P2 | 200 |
Anomalies:
- DELETE: when we delete S3 we also loose the status and city
- UPDATE: changing the city of S1 requires changes in multiple tuples
Procedure for 2NF:¶
take (nonloss) projections to eliminate reducible FDs
For relation FIRST: replace with SECOND = { SID, STATUS, CITY } and SP = { SID, PID, QTY }
- Process is reversible by joining SECOND and SP, no information is lost (check with Heath!)
- However, 2NF can contain information that 1NF could not, i.e. location of supplier without shipment
But relation SECOND still contains CITY → STATUS and therefore a transitive FD SID → CITY → STATUS
Suppose that SECOND contains the following data:
SID | STATUS | CITY |
---|---|---|
S1 | 20 | London |
S2 | 10 | Paris |
S3 | 10 | Paris |
S4 | 30 | Rome |
Anomalies:
- DELETE S4: not only supplier information but also city-status information
- UPDATE: city-status information appears many times, have to update every tuple
Third normal form¶
2NF and every non-key attribute is non-transitively dependent on every candidate key
Procedure: take (nonloss) projections to eliminate transitive FDs
Decomposition A:
project SECOND into SC = { SID, CITY } and CS = { CITY, STATUS }
- Again, process is reversible by joining, no information lost (check with Heath!)
- however, 3NF can contain more information about the real world, i.e. status of a city without supplier
- SC and CS are independent:
- updates can be made to either one without regard for the other:
- SID → CITY and CITY → STATUS are automatically enforced,
- and therefore also SID → STATUS
From table SECOND above:
SID | CITY |
---|---|
S1 | London |
S2 | Paris |
S3 | Paris |
S4 | Rome |
CITY | STATUS |
---|---|
London | 20 |
Paris | 10 |
Rome | 30 |
Dependency preservation¶
Decomposition A above is dependency-preserving: all FDs are preserved.
Not all decompositions are automatically dependency-preserving, even if they are lossless.
Decomposition B:
project SECOND into SC = { SID, CITY } and SS = { SID, STATUS }
- lossless
- 3NF
- however:
- FD CITY → STATUS is not preserved
- SC and SS are not independent
Example: if we update SS and set the status of S3 to 20 then we would violate CITY → Status, but we need to check in SC to find out about that.
We could use a database constraint to ensure CITY → STATUS, but there is a better way. From a practical point of view, we prefer decompositions that are dependency-preserving.
Theorem of Rissanen¶
Projections R1 and R2 of relation R are independent if and only if:
- every FD in R is a logical consequence of those in R1 and R2
- common attributes of R1 and R2 form a candidate key for at least one of the pair
Check decomposition A:
FDs in SECOND follow from SC and CS
SID → CITY, CITY → STATUS, and therefore SID → STATUS
Common attribute CITY is key in CS
Check decomposition B:
- CITY → STATUS does not follow from FDs in SC and SS: SID → CITY, SID → STATUS
Atomic Relation¶
cannot be decomposed into independent relations
Examples:
- SP = { SID, PID, QTY } with key (SID, PID) is atomic
- SECOND from above is not: use Decomposition A to get independent relations SC and CS
Boyce/Codd Normal Form¶
A relation is in BCNF if and only if every non-trivial, left-irreducible FD has a candidate key as its determinant (left-hand side).
In a diagram: only FD arrows out of candidate keys.
Codd's original definition of 3NF did not adequately deal with a relation that
- has two or more candidate keys such that
- candidate keys are composite, and
- they overlap
BCNF is a stronger definition, and therefore not equal to 3NF:
3NF and BCNF are equivalent when the three conditions above are not met.
Example: Student-Subject-Teacher
- Students take several subjects
- For each subject, each student is taught by only one teacher
- Each teacher teaches only one subject, but each subject is taught by several teachers
Relation SJT with FDs S,J → T and T → J
S | J | T |
---|---|---|
Smith | Math | White |
Smith | Physics | Green |
Jones | Math | White |
Jones | Physics | Brown |
- two overlapping candidate keys S,J and S,T
- relation SJT is in 3NF (see Zaniolo's definition below)
- T is determinant but not candidate key, therefore
- relation SJT is not in BCNF
- update anomalies, e.g. deleting 'Jones studies Physics' loses info that Brown teaches Physics
We can go to BCNF by projecting to ST = { S, T } and TJ = { T, J }
- avoids anomaly above, but
- ST and TJ are not independent, since
- only T → J is represented, and
- (S,J) → T cannot be deduced from T → J
E.g. insert Smith-Brown into ST
- should be rejected since
- Brown-Physics (Brown teaches Physics)
- and we already have Smith-Physics-Green (Smith studies Physics with Green)
- but need to check TJ to find out!
Note that the only candidate key for ST is (S,T).
S | T |
---|---|
Jones | Brown |
Jones | White |
Smith | Green |
Smith | White |
T | J |
---|---|
Brown | Physics |
Green | Physics |
White | Math |
Check reversibility:
select S, J, T from ST natural join TJ;
S | J | T |
---|---|---|
Jones | Physics | Brown |
Smith | Physics | Green |
Jones | Math | White |
Smith | Math | White |
We observe that
- SJT is atomic.
- We cannot decompose every relation into components that are BCNF and independent.
☆ Zaniolo's definition of 3NF¶
- X any subset of attributes
- A any single attribute
- 3NF iff for every FD X → A at least one of the following is true:
- X contains A (trivial FD)
- X is a superkey
- A is contained in a candidate key
For a definition of BCNF drop condition 3.
Apply Zaniolo's definition to SJT:
Functional dependencies (S,J) → T and T → J
Candidate keys (S,J) and (S,T)
Functional dependency (S,J) → T
- X contains A i.e. S,J contains T: false
- X is superkey i.e. S,J is superkey: true
- A is contained in a candidate key i.e. T in (S,T) : true
OK for 3NF, OK for BCNF
Functional dependency T → J
- X contains A i.e. T contains J : false
- X is superkey i.e. T is superkey: false
- A is contained in a candidate key i.e. J in (S,J) : true
OK for 3NF, not OK for BCNF
Therefore, SJT is in 3NF but not in BCNF.
Algorithms for Normalization¶
Synthesis to 3NF¶
Go from relation R to set S of relations in 3NF:
- start with S = empty
- Find a minimal cover C of FDs in relation R
- For every FD X → Y in C:
- Add relation {X, Y} to S
- If no attribute set of any relation in S contains a key K for R:
- Add relation {K} to S
The resulting decomposition is always lossless and dependency-preserving.
- The minimal cover of a set of functional dependencies is the smallest set of FDs implying all FDs in the original set; not necessarily unique.
- Two relations with common keys can be combined, e.g., { A, B } and { A, C } can be combined to { A, B, C }.
- In the result of this algorithm a relation can be dropped if all its attributes are contained in those of another relation.
Example¶
Start with R = { A, B, C, D } and FDs A → B and C → D.
If you prefer meaningful interpretations: let this be a table for recording participants in an event, such as an archery competition: archer A with birthdate B takes part in competion C with duration D. The essential function of the table is to record which archers participate in which event. Clearly the birthdate of the archer and the duration of the competition belong in separate tables -- which is exactly what the algorithm will achieve.
- The minimal cover is A → B and C → D.
- Create relation R1 = {A, B }
- Create relation R2 = {C, D }
- The key of the original relation is (A, C); B and D are redundant.
- (A,C) is not in R1 or R2
- Therefore, create another relation R3 = { A, C } with compound key (A,C)
Done. We now have R1 (the archers), R2 (the competitions), and R3 (the participations).
A relation with only key attributes may seem a little strange, but it is not uncommon. Take e.g. our SP table: if we remove the QTY attribute the table still tells us which supplier can deliver which part.
☆ Decomposition to BCNF¶
Go from relation R to set S of relations in BCNF:
- S = { R }
- While S has a relation R' that is not in BCNF do:
- Choose FD X → Y in R' that violates BCNF
- Add relation {X, Y} to S
- R' = R' - Y
Note that dependency preservation is not guaranteed.
Exercises¶
- For the tables in our examples so far, and (hopefully) tables that you have created on your own, check if and how they can be decomposed in lossless and lossy fashions
- Check if they are in 1NF (hopefully) and 2NF, 3NF, BCNF
- If they are not in 3NF, apply the Synthesis algorithm
- Find an example similar to SJT with patients and specialist doctors
- create the tables
- execute the SQL statements for decomposition
- check the reversibility