Database normalization

Normalization of a relational data schema ( table structure ) is the division of attributes ( columns ) in a number of relations ( tables) in accordance with the rules of normalization (see below), so that a shape is formed that contains no more redundancy avoidable.

A conceptual schema that contains data redundancies can lead to changes in the database are changed so that realized the data multiple times included not consistent, but only partially and incompletely, so they can be contradictory or obsolete. It is also said that anomalies may occur. In addition, multiple storage of the same data occupies unnecessary space. For these reasons, it is attempted to avoid such duplication by normalization.

There are several dimensions in which a database schema against anomalies may be immune. Depending speaks to the fact that it vorliege in the first, second, third, etc. normal form. These normal forms are defined by certain formal requirements of the scheme.

It takes a relational data schema into a normal form by progressively for them existing functional dependencies his relations decomposed into simpler basis until no further decomposition is possible. However, it may be lost on no account data. With the set of Delobel one can formally prove that he does not bring any data loss with it for a disassembly step.

Is normalized, especially in the phase of designing a relational database. For the normalization, there are algorithms ( algorithm synthesis ( 3NF ) decomposition algorithm ( BCNF ), etc. ) that can be automated.

The decomposition methodology follows the relational design theory.

  • 2.1 First Normal Form ( 1NF ) 2.1.1 Negative example; 1NF injured
  • 2.1.2 solution
  • 2.2.1 Negative example; 2NF injured
  • 2.2.2 solution
  • 2.3.1 Negative example; 3NF violated
  • 2.3.2 solution
  • 2.4.1 Negative example; BCNF injured
  • 2.4.2 solution
  • 2.4.3 decomposition algorithm
  • 2.4.4 Unlike the 3.NF
  • 2.5.1 Negative example; 4NF injured
  • 2.5.2 solution
  • 2.5.3 Note
  • 2.6.1 Negative example; 5NF injured
  • 2.6.2 solution
  • 2.6.3 Note

Procedure

Objective: To increase consistency by avoiding redundancy

First column ( synonymous terms: fields, attributes) Normalization in these areas are tables within these ranges ( data schemes ) divided into new columns, such as addresses in code, town and street. Secondly tables are divided, for example, a table tbl_AdressenAlles with the fields of company, street, postal code and place in these tables:

  • Tbl_PLZOrt with the fields Postal code and city

See picture to split the table tbl_AdressenAlles - the tbl_Adressen table nor the unique primary key AdressID receives. In this example it is assumed that there is only one available to each zip code a place name, but this is not always the case in Germany.

The normalization is to reduce the purpose of redundancy (multiple holding the same facts ) and thereby caused abnormalities (eg not due to a change in all places ) prevent so as to facilitate the updating of a database ( due to a change in only one place ) as well as to ensure data consistency.

Example

This is illustrated by a simple example: A database contains customers and their addresses and jobs that are assigned to the customer. Since there can be multiple orders from the same customer, a collection of customer data would (possibly with address data ) into the order table to the fact that they occur several times there, even though the customer is always only one set of valid data has (redundancy). For example, it may happen that in one order incorrect address data is entered to the customer, the next job, the correct data is captured. So it can - in this table or to other tables - with contradictory data. The data would then not consistent, you did not know which data is correct. Maybe even both addresses are not correct, because the customer has moved (solution below).

In a normalized database, there is only a single entry in the Customer table is linked with each order that customer (usually via the customer number ) for the customer data. In the case of the relocation of a customer ( another example is the change in value added tax) if there were indeed several entries in the corresponding table, but they are also distinguishable by specifying a validity period and are clearly addressed in the above customer example of the combination order Date / ID can.

Another advantage of redundancy freedom, still plays an important role in millions of records in a database, is the reduced memory requirements when the record in a table, for example tbl_Auftrag eg tbl_Kunde points to a record in another table, rather than the data itself to contain.

These are the recommendations that are given on the basis of the theory of normalization in database development to ensure above all consistency of the data and a unique selection of data. The purpose desired redundancy freedom is, however, in special applications in competition with the performance and / or to other destinations. It may therefore be advisable to refrain from normalizing this or to reverse it through denormalization to

  • To increase the performance ( processing speed ) or
  • To simplify queries and thus reduce the risk of errors or
  • Special processes are represented ( for example, business processes ).

In these cases, you should regularly be implemented automatic calibration routines to avoid inconsistencies. Alternatively, you can lock the data relating to amendments.

Normal Forms

Currently, commonly used normal forms are:

Firstly, they are used to assess the quality of an observed database schema, on the other hand, they help to avoid errors when creating new schemas.

In addition, can be obtained from non-relational sources using the normalization data structures that are formally correct in the sense of the normalization concept and the data ( for example, form data or spreadsheets ), record from their respective non-relational sources from which they originated.

Following the criteria of the respective normal forms are explained. It should be noted that any standard form including the criteria of the previous normal forms.

First Normal Form ( 1NF )

Each attribute of the relation must have an atomic value range, and the relation must be free of repeating groups. (Note: instead of " atomic" is also the term " atomic " is used. )

That is, composite, or set-valued nested ranges of values ​​( relationenwertige attribute value ranges ) are not allowed. This also repeating groups are not admitted. In short: No attribute value range of a relation in 1NF can be split into more ( meaningful ) subregions ( Example: The address must not be used as an attribute, but must - insofar as it requires the underlying process - are divided into postcode, city, street and house number ).

The fact that the relation must be free of repeating groups, means that attributes that contain the same or similar information that must be stored in another relation.

An example of a repeating group would be a column { phone } that contains multiple phone numbers or even a set of columns { Phone1, Phone2, Telefon3 }, where noted in the latter case is that it must be a case not necessarily be a repeating group (see alternative formulations ).

Practical benefits Query the database are facilitated by the 1NF and in the first place, if the attribute value ranges are atomic. For example, it is in a field that contains a full name string of title, first name and last name, difficult to impossible to sort by last name.

Alternative formulations All attributes contain atomic content and the relation has a fixed width. This formulation refers to the fact that it may never be necessary to need to raise additional attributes in the relation, because the number of repetitions of the repeating group is too small ( for example: it is with three attributes Telefon1 - 3 is a fourth telephone number for a person known). It is interesting insofar as it can help to decide if there is a repeating group: For example, although { .., Phone1, Phone2, Telefon3, ..} strongly implies the presence of a repeating group, it could at only other attribute names be clear that - of course, under the light of the application - which need not be so: { .., Telephone, Fax, Mobile, ..}

Another variation is caused by the following words: .. and the relation has a primary key. Although this formulation may not be found in Codd, is an extension that leads to pronounced practicable data structures.

Negative example; 1NF injured

  • The field contains the attribute value ranges album artist and album title.
  • The title list box contains a lot of titles.

This one has without splitting the following problems with queries:

  • To sort by the album title, the music box in artist and album title must be split.
  • The tracks can ( in simple terms ) only all be displayed simultaneously as the title list or not.

Solution

The attribute value ranges are split into atomic attribute value ranges:

  • The field album is split into the fields album title and artist.
  • The title list box is split into the fields and track titles and divided into multiple records.

Since now each attribute value range is atomic and the table has a unique primary key ( composite key from the columns CD_ID and track), the relation is in 1NF.

Second Normal Form ( 2NF )

A relation is in second normal form if the first normal form and no nonkey attribute is functionally dependent on a proper subset of a candidate key.

In other words, any non- primary attribute (not part of the key ) is dependent in each case from all around the key, not only for a part of a key. It is important that the non-key attributes really completely depend on all keys.

Thus is that relations in 1NF, the candidate key (s) are not assembled, but only of each of ( a ) single attribute (s ) which automatically satisfy the 2NF.

In a relation R (A, B), the attribute B of the attribute A is functionally dependent if exactly for each value of the attribute A has a value of the attribute B. In a ratio R (S1, S2, B) the attribute B in the key attributes of S1 and S2 is fully functionally dependent when B of the composite attributes ( S1, S2) is functionally dependent on, but not a single attribute S1 or S2.

This informal definition can be specified as follows:

A relation is exactly then in second normal form if it

  • Is part of a candidate key or
  • Is not dependent on a proper subset of a candidate key.

Is fully functionally dependent on each candidate key (the key candidates KC can also be formed by the combination of several attributes). The 2NF eliminates all partial functional dependencies, that is, functional dependencies of parts of the key candidates.

If a candidate key has two attributes, a maximum of three relations may arise from the decomposition into the 2NF. If a candidate has three key attributes that may arise from the decomposition into the 2NF maximum of seven relations. These are respectively the number of subsets of a given set minus 1 ( empty set) and corresponds to the number of elements of the power set () as an upper limit.

Practical benefits The 2NF enforces essential " monothematic " relations in the schema: each relation only modeled a situation.

Characterized redundancy and the associated risk of inconsistencies can be reduced. Only logically / factually related information can be found in a relation. This understanding of the data structures easier.

Negative example; 2NF injured

  • The primary key of the relation is composed of the CD_ID and track fields. ( Basically, a primary key may consist of multiple attributes, but it creates a conflict in the example. )
  • Fields album title, artist and year of the foundation are from CD_ID field -dependent, but not from the track field. This (point 2) violates the second normal form, since the three non-primary attributes not only from one part of the key ( here CD_ID ) may depend. If the key is not assembled ( see point 1), so this could not happen.

Problems that arise from this: The information from these two fields are, as can be seen on the example of CD Not That Kind, multiple times, that is redundant. Characterized there is a risk that the integrity of the data is infringed. So could the album title for the song Not That Kind change in I Do not Mind, but without changing the corresponding entries for the title and I'm Outta Love Cowboys & Kisses ( update anomaly ).

In which case, a state is reached, which is referred to as a data inconsistency. Viewed over the entire table, no longer "fit" together the data.

Solution

The data in the table is divided into two tables: CD and song. Table CD only contains fields that depend functionally full of CD_ID, so it has CD_ID as the primary key. Even the album title alone is unique, that is a candidate key. Since no other ( composite ) key candidates exist, the table automatically, putting it ahead in the second normal form. The table " song" finally contains only fields that depend functionally full of CD_ID and track, that is also in the second normal form. With the help of this lossless decomposition also mentioned redundancies of data are eliminated.

The CD_ID attribute from the table song is called foreign key that references the primary key of the table CD. At the same time the attributes CD_ID track and make the composite primary key of the table song dar.

Third Normal Form ( 3NF )

The third normal form is exactly achieved if the relation schema is in 2NF and no non-key attribute (light gray cells in the table) of a candidate key transitively depends.

An attribute depending on the key is transitive candidate when there is a set of attributes, and so.

This is a dependency in which an attribute of a set of attributes of a candidate key of the relation is a function ( without at the same time is also directly dependent on, that is a candidate key ). That is, if the set of attributes depends on the attribute set and attribute, then is transitive depending on. Formal words.

Simply put: A non-key attribute must not be dependent on a set of nonkey attributes. A non-key attribute may therefore only directly from a primary key (or a key candidates ) to be dependent.

See also: Transitive relation, synthesis algorithm

Practical benefits Transitive dependencies are immediately apparent without the need to understand the interrelationships of the data. They are represented by the structure of the relations.

In addition, remaining thematic By mixtures are resolved in the relation: after the 3NF relations of the scheme are reliable monothematic.

Alternative formulation The third normal form is reached when the relation schema is in 2NF and no non-key attribute (light gray cells in the table) is determinant.

Or: The third normal form is reached when the relation schema is in 2NF and no non-key attribute (light gray cells in the table ) from another non-key attribute is functionally dependent.

Negative example; 3NF violated

Obviously, can the artist on a CD from the CD_ID determine the year of foundation of the band / artist depends in turn the performer and thus by the transitive CD_ID from.

The problem is again data redundancy. For example, if a new CD with an existing performer introduced, the founding year is stored redundantly.

Solution

The relation is split, the two interdependent data be stored in a separate table. The key of the new table must be maintained as a foreign key in the old table.

At the Table " song", no changes in the transmission in 3rd normal form were made. It is only listed here for completeness.

Boyce - Codd Normal Form ( BCNF )

A relation schema is in BCNF if it is in 3NF and every determinant ( attribute set of the other attributes depend functionally ) a super key (or the dependence is trivial ).

The BCNF prevented ( by Raymond F. Boyce and Edgar F. Codd ) that parts of two composite of several fields key candidates are interdependent.

Although the conversion into BCNF is always possible without loss, but not always dependency preserving. The Boyce - Codd Normal Form was originally intended as a simplification of the 3NF, but led to a new normal form, which exacerbates this: A relation is automatically free of transitive dependencies if all determinants are candidate keys.

Negative example; BCNF injured

In this example, there is a simple database in which the club membership of athletes is stored. There are the following conditions apply:

  • Each club offers only a sport.
  • An athlete may play in different clubs, but only if these clubs operate different sports. This ensures that the athletes will never play against a club where he is a member.

From the above conditions it follows that the attribute is functionally dependent on the attribute sport club, ie club is a determinant. However, association is not a candidate key. Possible candidate keys are { name } and {name club, sport }. As you can see, each attribute occurs in a candidate key, so that the relation in 3NF, but not in BCNF is. However, a conversion into BCNF is possible by (name, club ) is used as the primary key and the relation is split:

Solution

Decomposition algorithm

There exists an algorithm that relational schemas ( decomposition german) by decomposition into Boyce - Codd Normal Form transferred. All schemes are doing as long as split, until no longer breaks the BCNF. Each division will be based on one that BCNF violating functional dependency. The attributes of the violating dependence form the first new scheme, and the remaining attributes plus the determinant another scheme. The two new schemes include the original functional dependencies only those which use only attributes of each of the schemes, the rest is lost.

The following pseudo code describes the decomposition algorithm:

Pass of the algorithm in the above example, (without showing all trivial dependencies):

Contrast to 3.NF

The BCNF normal form is stricter in terms of allowed functional dependencies: Relations in 3NF schemas in but some information may be duplicated in the BCNF not.

Fourth normal form ( 4NF )

A relation schema is then in the fourth normal form if it is in BCNF and only trivial multi-valued dependencies ( MWA) includes.

Simply put: It is not more 1 within a relationship: give n- relations with a key value, which thematically / content unrelated: n or m. Heard about a key value i times attribute a, but also independent of j times attribute b, the 4NF is violated.

To put it clearly: The 4NF examined n-ary relationships ( more than two tables are the same in relationship) and whether they have been correctly modeled.

Negative example; 4NF injured

There are several pets and vehicles to a person number. Pets and vehicles of a person but have basically nothing to do with each other; they say, they are " independent of each other ." As a primary key just a combination of all three attributes comes into question, thus, the table in 3NF. Person number → Pet is a multivalued dependency, person number → vehicle also. Since these two mwas are independent, the 4NF is violated.

The example graph shows the incorrect modeling of the multivalued dependencies and the correct solution. Exists between pet and vehicle no relationship, thus the relationship was "owns" incorrectly modeled.

Solution

Reference

The following relation schema satisfies the 4NF, although even here there are several mwas:

Person → person → Partners and child are indeed two mwas, but these two are interdependent: Partners → child. Such interdependent mwas be solved only in 5NF.

Fifth normal form ( 5NF )

A relation is in 5NF if it is in 4NF and not from simpler relations ( those that contain fewer attributes ) can be reconstructed by composite operations.

So the 5NF demands as far as possible simplified relations, from which, however, by projection and composite operations all information of the original relation must be recoverable. It is therefore kept very general and thus (for now) the last normal form. Such relationships can be divided into individual queries and are joined by subsequent operations again composite, wherein a portion of the so-called Cartesian product is created.

Negative example; 5NF injured

The following relation shows which can provide components to which the project which suppliers:

The relation can be further divided, is lost without that information.

Solution

To convert this relation into the fifth normal form, three relations must be created (supplier - part, part - project and supplier project).

  • What parts can supply which supplier?
  • Which parts of which project needed?
  • What projects can be supplied from which supplier?

Reference

Unlike the forming between the previous normal forms is expressed by this change by something other than before the new relations in the fourth normal form.

This can be seen easily if one combines the three relations from the example above again:

Project 1 - Müller - Brand: New is the tuple

Because Müller could theoretically deliver the project 1 with nails, because

  • He also supplies Project 2 with nails and
  • Project 1 also requires nails (which, however, have come from Maier ).

The conversion into 5NF is therefore only possible if one wants to express the possibilities of compounds from three relationships and do not want to have a physical link between the three.

Comments

Weaknesses in the data model due to lack of normalization can - besides the typical anomalies - mean higher costs at a later development. On the other hand, are deliberately avoided in database design from Performance considerations on normalization steps ( denormalization ). A typical example is the star schema in data warehouse.

The creation of a normalized schema is supported by automatic deduction from a conceptual data model; this is in practice an extended entity-relationship model ( ERM) or a class diagram of Unified Modeling Language (UML ) as a starting point. The derived from the conceptual design relational schema can then be checked with the help of normalizations; However, there are formalisms and algorithms that can already ensure this property.

Instead of the original by Peter Chen in 1976 developed ER model today extended ER models are used: the structured ERM (SERM ) that E3R model, the EER model and the SAP SERM used by SAP AG.

A relation schema is not in the 1NF, it is called this form and Non -First -Normal - form ( NF ²) or Unnormalized form ( UNF).

The process of normalization and decomposition of a relation into 1NF, 2NF and 3NF must receive the recoverability of the original relation, that is, the decomposition must be composite faithful and loyal dependence.

Mnemonic

As a reminder of the degree of dependence on the key in the first three normal forms is in a joking reference to the U.S. Court of silk ( "The truth, the whole truth and nothing but the truth so help me God. " ) Like to call the following saying:

"The key, the whole key, and nothing but the key. So help me Codd! "

"The key, the whole key and nothing but the key. So help me Codd! "

This means:

  • All ( implied: atomic ) values ​​refer to the key - 1 NF
  • With composite keys they relate to the whole key - 2 NF
  • The values ​​depend only on the key, and not by other values ​​- 3 NF

(Note: this mnemonic with respect to " the key " is slightly shortened, in effect, you have the normal forms 1-3 apply to all candidate keys, otherwise the Boyce - Codd normal form gives no sense and leads to completely senseless Table Plans In! however practice have very many tables except the primary key no other candidate keys on, and the formulation brings the essence very memorable to the point. )

Rules to remember

11976
de