Information integration

Information integration refers to the merging of information from different databases (data sources), which generally have different data structures, into a common, unified data structure.

Above all, heterogeneous sources are to be merged as completely and efficiently as possible into a structured unit that can be used more effectively than would be possible with direct access to the individual sources. Information integration is especially necessary where several systems that have grown over time are to be connected, for example when merging companies, workflows and applications, or when searching for information on the Internet.

The integration of complex systems moved into the focus of computer science research only in the 1990s and is therefore still evolving.

Methods

The integration of heterogeneous information from various sources concerns both the concrete data and the structures (schemas) in which they are present. First, the local schemas usually have to be integrated (schema integration), for which (partially) automatic methods can be used (schema matching). The subsequent integration of the data requires methods of data fusion and duplicate detection.
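As a rough illustration of (partially) automatic schema matching, the following sketch proposes attribute correspondences between two local schemas based purely on name similarity. The two schemas, the attribute names and the threshold of 0.6 are assumptions made for this example; practical matchers typically also consider data types, instance values and schema structure.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity of two attribute names in [0, 1], ignoring case and underscores."""
    def norm(s: str) -> str:
        return s.lower().replace("_", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def match_schemas(source, target, threshold=0.6):
    """For each source attribute, propose the most similar target attribute.

    Attributes whose best score stays below the threshold are left unmatched
    and would need manual review.
    """
    matches = {}
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            matches[s] = best
    return matches


# Hypothetical local schemas of two sources.
phone_list = ["full_name", "phone_no", "room"]
staff_directory = ["FullName", "Phone", "Department"]

proposed = match_schemas(phone_list, staff_directory)
```

In this example "room" has no sufficiently similar counterpart and stays unmatched, which is why such methods are only partially automatic.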

Opportunities and goals

Where data from different sources are redundant (extensional redundancy), records that belong together can be determined partly automatically and used to complete records (data fusion). For example, the entries of a phone list and a staff directory can be combined on the basis of personal names. Because more information about individual objects is then available, this is also called densification.
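The phone list example can be sketched as follows. The records, the field names and the very crude name normalization are assumptions for illustration; real duplicate detection uses more robust similarity measures, and real data fusion needs an explicit strategy for conflicting values.

```python
def normalize_name(name: str) -> str:
    """Crude match key: collapse whitespace, turn 'Last, First' into 'first last'."""
    name = " ".join(name.split())
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name.lower()


def fuse(phone_list, staff_directory):
    """Merge records that refer to the same person into one more complete record."""
    fused = {}
    for record in phone_list + staff_directory:
        merged = fused.setdefault(normalize_name(record["name"]), {})
        for attribute, value in record.items():
            # Keep the first non-empty value seen for each attribute.
            if value is not None and merged.get(attribute) is None:
                merged[attribute] = value
    return list(fused.values())


# Hypothetical source data.
phone_list = [
    {"name": "Meier, Anna", "phone": "555-0101"},
    {"name": "Ben Schulz", "phone": "555-0102"},
]
staff_directory = [
    {"name": "Anna Meier", "department": "Sales"},
    {"name": "Carla Weber", "department": "IT"},
]

people = fuse(phone_list, staff_directory)
```

Only Anna Meier occurs in both sources, so her fused record carries both a phone number and a department, while the other two records stay as they were.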

The goal of integration is to enable a consistent global view of all data sources. Redundant data sources can be used for verification. Combining intensionally redundant sources leads to higher coverage, and completing records where sources are extensionally redundant leads to higher density.
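To make the two measures concrete, the sketch below uses simple working definitions that are assumptions of this example, not taken from the text: coverage is the share of a known set of real-world objects that appear in the data, and density is the share of attribute values that are filled in. The records and the universe size of four people are hypothetical.

```python
def coverage(records, universe_size):
    """Share of the (assumed known) real-world objects that appear in the records."""
    return len({r["name"] for r in records}) / universe_size


def density(records, attributes):
    """Share of attribute values that are actually filled in."""
    filled = sum(1 for r in records for a in attributes if r.get(a) is not None)
    return filled / (len(records) * len(attributes))


attributes = ["name", "phone", "department"]

# Two hypothetical sources describing people from a universe of four.
source_a = [{"name": "Anna", "phone": "555-0101"}, {"name": "Ben", "phone": "555-0102"}]
source_b = [{"name": "Anna", "department": "Sales"}, {"name": "Carla", "department": "IT"}]

# Plain union of the records versus the result after fusing the two "Anna" records.
union = source_a + source_b
fused = [
    {"name": "Anna", "phone": "555-0101", "department": "Sales"},
    {"name": "Ben", "phone": "555-0102"},
    {"name": "Carla", "department": "IT"},
]

coverage_single = coverage(source_a, universe_size=4)  # 2 of 4 people
coverage_fused = coverage(fused, universe_size=4)      # 3 of 4 people
density_union = density(union, attributes)             # 8 of 12 values filled
density_fused = density(fused, attributes)             # 7 of 9 values filled
```

Under these definitions, adding the second source raises coverage, and fusing the overlapping records raises density compared with the plain union.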

Materialized vs. virtual integration

In general, two types of integration can be distinguished:

  • Materialized or physical integration: Data from different data sources, which generally have different data structures, are transformed into the target structure and copied to a central database, where they are available for evaluation. This principle is found, for example, in data warehouses or in the data exchange project of the Open Archives Initiative.
  • Virtual or logical integration: The data remain in the different sources, and integration takes place only when a request is made (federated information system).

In comparison, the two approaches have the following advantages and disadvantages:

  • Timeliness: In materialized integration, the timeliness of the data depends on the interval at which the data are updated from the sources; a virtually integrated system, by contrast, is always up to date, since the data are integrated at request time.
  • Response time: Since all data in a materialized system are stored centrally, storage can be optimized for fast response times. In virtual integration, the response time depends heavily on the availability of the data management systems, the speed of access to the source data, the transmission paths, and additional tasks such as data transformation (mapping) and data cleansing.
  • Flexibility: As large data stores, materialized systems are usually more difficult to maintain than virtually integrated systems, in which maintenance of the data is the responsibility of the sources. Furthermore, adding a source can affect the entire integration (global-as-view), whereas in virtual integration adding, removing or changing a source affects only its mapping to the global schema (local-as-view).
  • Autonomy of data sources: Neither materialized nor virtual data integration directly affects the data sources; their structure, for example, remains unchanged. However, because access is required, the demands placed on the sources, such as availability and performance, can change. Virtual data integration appears to have the stronger influence here, since in physical integration access could, for example, be scheduled for times of generally lower load.
  • Hardware requirements: Materialized integration usually requires the purchase of dedicated hardware.
  • Data quality: Materialized integration generally allows more time for transforming the data, so more complex analyses are possible than with virtual data integration; the achievable data quality is therefore higher.
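The trade-off between timeliness and source access can be illustrated with a minimal in-memory sketch. The classes `Source`, `MaterializedIntegration` and `VirtualIntegration` and the sample rows are hypothetical names invented for this example; they stand in for real source systems and integration layers.

```python
class Source:
    """Stand-in for an autonomous data source (in-memory, for illustration only)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.reads = 0  # how often the integration system accessed this source

    def fetch(self):
        self.reads += 1
        return list(self.rows)


class MaterializedIntegration:
    """Copies source data into a central store; queries never touch the sources."""

    def __init__(self, sources):
        self.sources = sources
        self.store = []
        self.refresh()

    def refresh(self):
        self.store = [row for s in self.sources for row in s.fetch()]

    def query(self, predicate):
        return [row for row in self.store if predicate(row)]


class VirtualIntegration:
    """Keeps no copy; every query reads the sources at request time."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [row for s in self.sources for row in s.fetch() if predicate(row)]


crm = Source([{"customer": "Acme", "city": "Berlin"}])
materialized = MaterializedIntegration([crm])
virtual = VirtualIntegration([crm])

# The source changes after the materialized system has loaded its copy.
crm.rows.append({"customer": "Globex", "city": "Hamburg"})


def everything(row):
    return True


stale = materialized.query(everything)      # snapshot from load time
current = virtual.query(everything)         # reflects the new row
materialized.refresh()
refreshed = materialized.query(everything)  # up to date again after the refresh
```

The materialized system answers from its copy and is only as current as its last refresh, whereas the virtual system is current on every query but accesses the source each time.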

Integration architectures

Materialized integration architectures

In materialized systems, data are imported from the sources, cleansed and stored centrally. The data in the source systems are generally not changed.

  • Data warehouse (DWH): Data warehouses are the most important representatives of materialized database systems. The data required for a company's information needs are stored persistently in a central data warehouse to enable a global, unified view of the relevant data. To integrate the source data into the DWH database, an integration layer (ETL) must be implemented.
  • Operational data stores (ODS): While data warehouse systems are primarily geared to the requirements of corporate management and thus provide information for strategic decision-making, operational data stores make the integrated data available to operational business processes. This already implies that the centrally stored data are used operationally, i.e. after integration (import, cleansing, storage) is complete, the data are subject to change. The focus of ODS systems is therefore not on historical but primarily on current data. This is a further key difference from a DWH, since synchronization with the source data has to take place either at query time or at least at frequent, regular intervals. ODS are generally used by companies in business areas where the timeliness of the data plays an important role, such as customer and supplier communication and warehouse management processes. With the trend toward real-time data warehouses and more powerful database management systems, the operational data store is likely to be absorbed into the data warehouse.
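An integration layer of the ETL kind can be sketched in a few lines using Python's built-in sqlite3 module as a stand-in for the central database. The two source extracts, their formats and the target table `orders` are assumptions made for this example.

```python
import sqlite3

# Hypothetical extracts from two source systems with different structures.
shop_orders = [("2024-01-05", "ACME", "19.90"), ("2024-01-06", "Globex", "5.00")]
erp_orders = [{"day": "06.01.2024", "client": "acme", "total_cents": 2500}]


def transform_shop(row):
    """Shop rows already use ISO dates; amounts arrive as strings."""
    date, customer, amount = row
    return (date, customer.upper(), float(amount))


def transform_erp(row):
    """ERP rows use DD.MM.YYYY dates and amounts in cents."""
    day, month, year = row["day"].split(".")
    return (f"{year}-{month}-{day}", row["client"].upper(), row["total_cents"] / 100)


# In-memory database standing in for the central data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_date TEXT, customer TEXT, amount REAL)")

# Extract -> transform -> load into the unified target structure.
rows = [transform_shop(r) for r in shop_orders] + [transform_erp(r) for r in erp_orders]
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
warehouse.commit()

# A query against the unified view, independent of the source formats.
totals = warehouse.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
```

Once loaded, the query sees a single customer "ACME" with orders from both systems, although the sources spelled the name and formatted dates and amounts differently.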

Virtual integration architectures

Unlike in materialized systems, in virtual systems the data are not stored in the integrated system itself but remain physically in the data sources and are loaded into the integration system only when queries are made (virtual data storage).
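One way to picture loading data only at query time is a mediator that forwards each request to one wrapper per source, an architecture described in the list below. The following is a minimal sketch under stated assumptions: the class names, the source formats and the global schema with the fields name, phone and department are all hypothetical.

```python
class PhoneListWrapper:
    """Wraps a source that stores (name, phone) tuples."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, name):
        # Translate the source format into the global schema.
        return [{"name": n, "phone": p} for n, p in self._rows if n == name]


class StaffDirectoryWrapper:
    """Wraps a source that stores dicts with differently named fields."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, name):
        return [
            {"name": r["employee"], "department": r["dept"]}
            for r in self._rows
            if r["employee"] == name
        ]


class Mediator:
    """Answers read-only queries against the global schema by asking every wrapper."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def lookup(self, name):
        result = {}
        for wrapper in self._wrappers:
            for record in wrapper.query(name):
                result.update(record)  # later sources overwrite earlier values
        return result


mediator = Mediator([
    PhoneListWrapper([("Anna Meier", "555-0101")]),
    StaffDirectoryWrapper([{"employee": "Anna Meier", "dept": "Sales"}]),
])
answer = mediator.lookup("Anna Meier")
```

The mediator stores nothing itself; each wrapper hides the structure of its source, and the combined answer exists only for the duration of the request.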

  • Federated database systems (FDBS): At the center of a federated database system is a global conceptual (canonical) schema. This schema represents the interface to the local, distributed databases and their local schemas on the one hand, and on the other hand offers requesting applications an integrated global view of the federated source data by means of appropriate services. FDBS usually arise from the combination of several database systems (multi-database systems) with the aim of central (federated) coordination of joint tasks.
  • Mediator-based integration systems and wrappers (MBS): Mediators serve as intermediaries between data sources and applications. The mediator accepts requests from the application and answers them by communicating with the relevant data sources. This presupposes extensive knowledge of the structure of all federated data sources, their schemas and possible inconsistencies among the connected entities. Unlike federated database systems, however, mediator-based information systems provide only read access to the integrated systems. Mediator-based systems combined with wrappers already constitute a concrete software form of middleware. In principle, mediators can also be used as part of a materialized information system, for example as an intermediary between the integration layer (or the central data warehouse) and the connected source systems, in order to overcome their heterogeneity. However, since mediator-based systems lack the essential characteristic of materialized systems, a data warehouse at their center, they are assigned to the virtual integration architectures.
  • Peer data management systems (PDMS): The last integration architecture relevant in practice that is presented here is the peer data management system. The internal structure of a peer component is defined as follows:

Related topics

Information integration overlaps with and is related to, among others, the following topics:

  • Data Mining
  • Knowledge Management