Denormalization

Denormalization is the deliberate reversal of a normalization step for the purpose of improving the runtime behavior of a database application. From the perspective of the ANSI-SPARC architecture, the conceptual level of a data model is designed fully normalized. The internal level, independently of this, can be deliberately designed in denormalized form. Denormalization therefore takes place exclusively at the internal level and does not remove the requirement to normalize the conceptual level first.

A logically ideal ("normalized") data model is completely free of redundancy, apart from the technically necessary multiple storage of foreign keys that implement primary key/foreign key relationships.

Denormalization can often achieve much greater performance improvements than tuning the database installation.

In addition to improving runtime behavior, denormalization is also applied to reduce the complexity of a system or to ease the maintainability of the stored data.

Areas of denormalization

Reversing the first normal form

Violations of the first normal form are usually accepted in order to avoid unnecessary complication of the data to be stored and of the related system components.

The first normal form requires atomic storage of data, meaning that an attribute (a database field) may only contain atomic (indivisible) information. Example: defining a 100-character data field to hold one or more telephone numbers violates the first normal form. To satisfy it, a separate table would have to be created for storing the telephone numbers; any number of telephone numbers could then be stored per person. In practice, however, placing one or more telephone numbers in a single data field is often perfectly adequate and reduces the complexity of the system.
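As an illustration, the two designs might look as follows in SQL (table and column names are invented for this sketch):

    -- Normalized design (first normal form): one row per telephone number
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      VARCHAR(100)
    );

    CREATE TABLE phone_number (
        person_id INTEGER REFERENCES person (person_id),
        number    VARCHAR(30),
        PRIMARY KEY (person_id, number)
    );

    -- Denormalized design: all numbers in one field, violating 1NF
    CREATE TABLE person_denorm (
        person_id     INTEGER PRIMARY KEY,
        name          VARCHAR(100),
        phone_numbers VARCHAR(100)  -- e.g. '030 1234567, 0171 7654321'
    );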

Another example of a practically useful violation of the first normal form is storing title, first name, and last name in a single data field. As long as the system never needs to access the individual components of the name, keeping them in a single text field is a simple way to reduce the complexity of the system.

In most database systems, the street and the house number (possibly with a supplementary letter) are stored in a single data field, although strictly speaking this violates the first normal form.

Reversing the second or third normal form

The second and third normal forms demand that all dependent attributes depend on the candidate keys alone. All relations that do not meet this requirement must be split, which creates many new, smaller tables. To access the data, the rows of these individual tables must be brought back together using SQL statements with joins. Executing a join is usually more costly for the DBMS than accessing a single table.

The second or third normal form is often reversed with the aim of avoiding a join. This typically affects two tables in an N:1 relationship, for example employees and departments. If many performance-critical read accesses need employee data together with the department name, it can be useful to store the department name redundantly in each row of the employee table. Such requests can then be served from the employee table alone; the additional access to the department table is no longer required.
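A minimal SQL sketch of this employee/department example (table and column names assumed):

    -- Normalized: the department name must be fetched via a join
    SELECT e.emp_name, d.dept_name
    FROM   employee e
    JOIN   department d ON d.dept_id = e.dept_id;

    -- Denormalized: the department name is stored redundantly
    ALTER TABLE employee ADD COLUMN dept_name VARCHAR(100);

    UPDATE employee e
    SET    dept_name = (SELECT d.dept_name
                        FROM   department d
                        WHERE  d.dept_id = e.dept_id);

    -- The performance-critical read now needs no join
    SELECT emp_name, dept_name FROM employee;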

This type of denormalization is perfected in the modeling of dimension tables for a data warehouse. If the dimension tables are designed fully normalized, the result is a large number of individual tables connected by foreign-key relationships; the data model looks like a snowflake, hence the term snowflake schema. Accessing the data often requires joining the many individual tables produced by normalization. In the star schema, by contrast, the dimension tables are designed denormalized: the fact table depends directly on the individual dimension tables, and there are no dependencies that have to be resolved across several foreign-key relationships. The number of tables is smaller, and fewer joins are required to access the data. The dimension tables do, however, contain redundant data. Since the performance of data access is usually better in the star schema, it is usually the scheme chosen in practice.
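A sketch of the two variants for a product dimension (schema invented for illustration):

    -- Snowflake schema: normalized dimension, resolved over two joins
    SELECT f.revenue, p.product_name, g.group_name
    FROM   fact_sales f
    JOIN   dim_product p       ON p.product_id = f.product_id
    JOIN   dim_product_group g ON g.group_id   = p.group_id;

    -- Star schema: the group name is stored redundantly in the dimension
    CREATE TABLE dim_product_star (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        group_name   VARCHAR(100)   -- redundant, but saves one join
    );

    SELECT f.revenue, p.product_name, p.group_name
    FROM   fact_sales f
    JOIN   dim_product_star p ON p.product_id = f.product_id;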

Precomputed aggregation

Queries often require extensive aggregations to be performed; this is especially the case in OLAP systems. If the response times of these queries are no longer acceptable, the aggregates can be calculated and stored in advance. This is ideal for systems that are updated only overnight: after the actual data update, all aggregations that may be needed are computed and stored. When a user then requests a key figure (KPI) during the day, all the necessary aggregations are already available and the figure can be delivered within seconds.
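One way to precompute such aggregates is a summary table that is rebuilt after the nightly load (a sketch with invented names; in systems that support them, materialized views achieve the same effect):

    -- Summary table, rebuilt after the nightly load
    CREATE TABLE sales_per_day (
        sales_date  DATE PRIMARY KEY,
        total_sales DECIMAL(15,2)
    );

    DELETE FROM sales_per_day;

    INSERT INTO sales_per_day (sales_date, total_sales)
    SELECT sales_date, SUM(amount)
    FROM   sales
    GROUP  BY sales_date;

    -- The daytime query reads the precomputed figure directly
    SELECT total_sales FROM sales_per_day WHERE sales_date = DATE '2024-01-15';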

Fragmentation

A distinction is made between horizontal and vertical fragmentation.

In horizontal fragmentation (also known as sharding), the set of all rows of a relation is split across several tables. If these tables reside on the same server, this is usually called partitioning. The individual tables can, however, also be located on different servers. For example, the data for business with the United States can be stored on a server in the U.S., and the data for business with Europe on a server in Germany. This division is also referred to as regionalization.
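A sketch of such regionalization with one table per region (names invented); a view can reunite the fragments for global queries:

    -- One fragment per region, with identical structure
    CREATE TABLE orders_us (
        order_id INTEGER PRIMARY KEY,
        customer VARCHAR(100),
        amount   DECIMAL(10,2)
    );
    CREATE TABLE orders_eu (
        order_id INTEGER PRIMARY KEY,
        customer VARCHAR(100),
        amount   DECIMAL(10,2)
    );

    -- A view reunites the horizontal fragments for global queries
    CREATE VIEW orders_all AS
        SELECT 'US' AS region, order_id, customer, amount FROM orders_us
        UNION ALL
        SELECT 'EU' AS region, order_id, customer, amount FROM orders_eu;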

Horizontal fragmentation creates no redundancy in the stored data, but it does in the structures: if the relation has to be changed, it is not just one table that must be modified, but every table across which the data of the relation is distributed. This carries the risk of anomalies in the data structures.

In vertical fragmentation, the dependent attributes (non-key attributes) of a table are divided into two or more groups. Each group becomes its own table, supplemented with all the key attributes of the original table. This can be useful if the attributes of a relation produce rows with a very large record length. If, additionally, the accesses usually involve only a small number of attributes, the few frequently accessed attributes can be combined in one group and the rest placed in a second group. The frequently performed accesses become faster because a smaller amount of data has to be read from the hard disk. The rarely performed accesses on the remaining attributes become no faster, but no slower either.
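A sketch of a vertical split of a customer table: the frequently accessed attributes stay in one table, while the rarely accessed, bulky ones move to a second table with the same primary key (schema and types assumed for illustration):

    -- Frequently accessed, short attributes
    CREATE TABLE customer_core (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100),
        city        VARCHAR(50)
    );

    -- Rarely accessed, long attributes; same key as the original table
    CREATE TABLE customer_extra (
        customer_id INTEGER PRIMARY KEY REFERENCES customer_core (customer_id),
        notes       VARCHAR(4000),
        photo       BLOB
    );

    -- The frequent access touches only the small table
    SELECT name, city FROM customer_core WHERE customer_id = 42;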

The record length at which splitting into several smaller tables becomes useful also depends on the database system. Many database systems store data in blocks of 4 KiB, 8 KiB, or 16 KiB. If the average record length is slightly more than 50% of a block, a lot of storage space remains unused. If the average record length is greater than the block size used, data accesses become costly. If BLOBs occur in a relation together with other attributes, vertical fragmentation is almost always beneficial.

Partitioning

Partitioning is a special case of horizontal fragmentation.

Large data sets can be administered more easily when the data of a relation is divided into several small parts (partitions) that are stored separately. While one partition of a table is being updated, other partitions of the table can be reorganized at the same time. If an error is discovered in one partition, that single partition can be restored from a backup while programs continue to access the other partitions. Most established database vendors offer partitioning; see, for example, partitioning in DB2 and partitioning in MySQL.
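In MySQL, for example, range partitioning of an orders table by year might look like this (table and partition names invented for this sketch):

    CREATE TABLE orders (
        order_id   INT  NOT NULL,
        order_date DATE NOT NULL
    )
    PARTITION BY RANGE (YEAR(order_date)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );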

Most database systems offer the option of either addressing individual partitions directly or addressing all partitions under a single table name.

Partitioning can speed up data accesses, but its essential advantage is the easier manageability of the table as a whole.

Index

The creation of an index is a form of redundant data storage and thus, strictly speaking, a denormalization. Most database management systems update an index automatically when the data in the base table changes. Indexes are particularly well suited to improving performance when the data is changed rarely but read very often.
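For example, an index on a frequently searched column is created with a single statement and maintained automatically by the DBMS from then on (names invented):

    -- Redundant, automatically maintained access path
    CREATE INDEX idx_employee_name ON employee (emp_name);

    -- The DBMS can now resolve this lookup via the index
    SELECT * FROM employee WHERE emp_name = 'Meier';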

Disadvantages

The disadvantage is the extra effort that must often be expended to keep the redundant data consistent. There is a risk of data anomalies due to the redundant storage.

This risk can be eliminated if the updating of the redundantly stored data can be delegated to the database management system.

Database vendors provide various functions for automatically synchronizing redundantly stored data:

  • That an index is updated automatically is so self-evident that one would not expect otherwise.
  • Precomputed aggregations are supported by materialized views, which update these aggregations automatically as required.
  • Furthermore, triggers can be used to update redundantly stored data automatically (see the sketch after this list).
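As a sketch, a trigger that keeps the redundant dept_name column from the employee/department example above in sync could look like this (MySQL syntax, names assumed):

    -- After a department is renamed, propagate the new name
    -- into the redundant column of the employee table
    CREATE TRIGGER department_rename
    AFTER UPDATE ON department
    FOR EACH ROW
        UPDATE employee
        SET    dept_name = NEW.dept_name
        WHERE  dept_id   = NEW.dept_id;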

If the database management system takes over such tasks, the update of a single record may be slowed only slightly. Mass data processing, however, can become significantly slower through the use of such functions.

Denormalization usually comes at the cost of additional storage space. Often, however, this cost is willingly accepted in exchange for an improvement in performance. In each individual case it must be weighed whether the benefits justify accepting the associated disadvantages.
