Star schema

The star schema is a special form of a data model, whose goal is not the normalization, but optimization for efficient read operations. Main field of application data warehouse and OLAP applications.

The name star schema is because the tables are arranged in a star shape: In the center is a fact table, around which group multiple dimension tables.

A star schema is in the rule denormalized. Possible abnormalities and a higher memory requirement is generally accepted for performance reasons in purchasing. An improvement is possible by the star schema related snowflake schema. There, however, multi-level dimension tables must be joined via join queries.

Definition

As a logical database schema for data warehouse applications, the so-called star schema has prevailed. This scheme is composed of a fact table and several dimension tables, which are query nicely arranged radially around a fact table and relate this scheme to exactly one fact table. Hence the name of this scheme derives from the star-shaped arrangement of the dimension tables from the lying in the center fact table. The mentioned fact table has information-bearing attributes, such as Sales, periods, costs, etc., and as a primary key a composite key from the primary keys of the dimension tables involved.

Each dimension table is in a 1: n relationship to a fact table. The 1 -to-many relationship is mediated by a key of the dimension table and a foreign key in the fact table. The fact table integrated m: n relationships implicit in a single table and includes, for this reason a lot of redundancy. The key of the fact table consists of the primary key of each dimension table as a foreign key.

The star schema allows for the selection, navigation and summary of measured values ​​or facts. The dimension tables are usually not normalized and therefore are denormalized ago: There are functional dependencies between non-key attributes, so that the third normal form ( 3NF ) is violated. This injury is, however, taken into account in this scheme, because the data structure enables better processing speed at the expense of data integrity and storage space.

Fact and dimension tables

The data to be managed are called facts; they are typically continuously stored in the fact table. Other names for the facts are metrics, measurements or metrics. Fact tables can be very large, which is a data warehouse forces to compress the data by and by ( aggregate ) and finally delete after a holding period or outsource ( archiving). The tables contain identification or earnings figures, which can be derived from ongoing operations and reflect economic performance, such as Profitability, cost, performance / revenue, expenses, revenues, expenses, income, etc. However, only if these figures are placed in a context, they also result in a sense. An example is that revenues are compared in a particular area with specified products in a defined period of time, which reflects dimensions in which the economic benefits are evaluated and analyzed.

In contrast, the dimensional table contains the " descriptive " data. The fact table contains foreign keys to the dimension entries defining their meaning. Typically, the total amount of foreign keys to the dimension tables at the same time the primary key in the fact table Represents the implies that there can be any entry to a combination of dimensions only once. Dimension tables are relatively static and are usually considerably smaller than fact tables. The term " dimension " is because each dimension table represents one dimension of a multidimensional OLAP cube.

Due to the existence of functional dependencies between non-key attributes of the third normal form is deliberately injured in the dimension tables. In order to satisfy the 3NF, the dimension table in question would have to be broken down into individual hierarchical tables, but for performance reasons you can see from in the star schema of a normalization of the dimension tables and hereby accepts the redundancy occurring.

Advantage of the separation of facts and dimensions is that the facts can be analyzed generic and independently by each dimension. An OLAP application requires no "knowledge" about the importance of a dimension. The interpretation is left to the user alone.

The size of dimension tables should, however, be observed. To fact tables in a star schema often more than 10 million records include. While dimension tables are smaller, but can take a considerable size in each dimension. In order to reduce such large data sets and the associated shortened access time single, very large dimension tables can be converted by normalization in a snowflake schema.

Slowly Changing Dimensions

A problem of the star schema is that data over a long period of time have a reference to data in the fact tables in the dimension tables. But may also be necessary to change the dimension of data over time. However, these changes may not affect data prior to the change in the rule. If changes such as the seller for a product group, then the respective entry in the dimension table can not be simply overwritten. Instead, a new entry must be generated, otherwise the sales figures for the previous vendor would no longer detectable. An approach to avoid such conflicts are slowly changing dimensions. This concept summarizes the data warehousing together methods to detect changes in dimension tables and if necessary to document historically.

Properties of star schema

  • Dimension tables Primary key for the identification of the dimension values
  • Figure the dimension hierarchy by attributes
  • Denormalized, ie non-normalized dimension tables
  • Fact table Foreign key to the dimension tables, that is, the lowest level of each dimension is taken as the key in the fact table
  • Foreign key to the dimensions form composite primary key for the facts

Pros & Cons star schema

Benefits

  • Fast query processing: Analytical queries are typically located at higher levels of aggregation and by the lack of normalization of the dimension tables are joins saved. Furthermore, a special join ( star join) are well optimized.
  • Data volume: Dimension tables are very small compared to fact tables. The additional data volume by a denormalization of the dimension table are not case sensitive.
  • Modification anomalies can be easily controlled because no changes take place in classifications.
  • Simple, intuitive data model: The star schema has much less relations as a converging snowflake schema and the JOIN depth is not greater than 1
  • Intelligibility and comprehensibility: a modernization of the reporting system is possible by the star schema. For data collections can be created for trend recognition and data mining.

Disadvantages

  • Deteriorating response time in case of frequent queries of very large dimension tables ( browsing functionality )
  • Redundancy within a dimension table by multiple store identical values ​​or facts.
  • Aggregation formation is difficult

Star schema vs.. Snowflake schema ( normalized)

  • User-friendly query (aggregate access; simple, intuitive data model)
  • Minimize redundancy by normalizing
  • Efficient transaction processing
  • Simple, local and standardized data model
  • A fact table and a few dimension tables
  • Complex and specific schema
  • Many entities and relationships in large data models

Requirement diagram the star schema

First collect Operational requirements: Building a star schema begins with the question

  • What facts are interested by what criteria?
  • Available data required evaluations and table contents

Create second requirement diagram: Defined specifications can be subsumed in a request graph

  • Required indicators: Attributes, evaluate the result of a business unit
  • Question: How good?
  • Dimensions: Attributes along which the indicators are measured
  • Questions: What? When? Where?
  • Categories: Ranges of values ​​of a dimension
  • Questions: How exactly?

Star Join

The star schema leads for typical queries called star joins, which look like this:

SELECT fact or dimension attribute FROM fact or dimension tables WHERE condition GROUP BY fact or dimension attribute ORDER BY fact or dimension attribute example

For example, the query selects several relevant measures from the fact table, which fact rows linked by means of the spare key with one or more dimensions, demonstrates the business columns of the dimension tables with filter predicates, grouped according to one or more business columns and finally aggregates the data retrieved from the fact table measure over a certain period of time. In the following example, the sum of sales of a product over a defined period are presented.

SELECT ProductAlternateKey, Calendar Year, SUM ( SalesAmount ) FROM FactInternetSales Factsheet   JOIN DimProduct ON DimProduct.ProductKey = Fact.ProductKey   JOIN DimTime ON Fact.OrderDateKey = TimeKey   WHERE ProductAlternateKey LIKE ' XYZ %'        AND Calendar Year BETWEEN 2008 AND 2009   GROUP BY ProductAlternateKey, Calendar Year The number of joins that used (there are 2 in this example joins) are the star schema in contrast to the snowflake schema, regardless of the length of the aggregation paths.

240842
de