Fault-tolerant system

In engineering, and particularly in data processing, fault tolerance (from the Latin tolerare, 'to suffer', 'to endure') is the property of a technical system to maintain its operation even when unforeseen inputs or errors occur in the hardware or software.

Fault tolerance increases the reliability of a system, as is required, for example, in medical technology or in aerospace engineering. Fault tolerance is also a prerequisite for high availability, which plays an important role in telecommunications in particular.

Approaches at different levels

Fault tolerance may be achieved at different levels. Depending on the application (PC, medical technology, space technology, etc.), different approaches are appropriate, and combinations of them are often useful.

Fault tolerance in hardware

Hardware, i.e. an electronic circuit, can be made fault-tolerant, for example, by adding redundancy.

If, for example, two implementations of a circuit run in parallel (dual modular redundancy, DMR), a decision unit can detect an error by comparing the outputs of the two components, but it cannot correct the error.

If a further instance of the component is added (triple modular redundancy, TMR), a decision unit can also correct an error. If the faulty unit is then marked as defective, a further error can still be detected (as in DMR).
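The difference between the two decision units can be illustrated in software. The following is only a minimal sketch, assuming that the outputs of the redundant components are available as hashable Python values; the function names dmr_agree and tmr_vote are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def dmr_agree(out_a: Hashable, out_b: Hashable) -> bool:
    """DMR: report whether the two units agree.

    A mismatch reveals that an error occurred, but not which unit
    produced it, so no correction is possible.
    """
    return out_a == out_b


def tmr_vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """TMR: return the majority value and the indices of disagreeing units."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("all three units disagree; no majority exists")
    # Units that deviate from the majority can be marked as defective.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects
```

Once tmr_vote has named a suspect unit, the two remaining outputs can still be compared with dmr_agree, which corresponds to the fallback to DMR behavior described above.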

Fault tolerance in software

At the software level, fault tolerance can be achieved by the following measures:

  • Design diversity: different implementations of an algorithm run in parallel
  • Data diversity: the input data are processed several times in slightly modified form (e.g. good against rounding errors)
  • Temporal diversity: an algorithm is called several times with the same data (e.g. good against short-term hardware failures)
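Design diversity and temporal diversity can both be sketched as a majority vote over several results. The example below is an illustration under assumptions, not a reference implementation: the helper names (majority, design_diverse, temporally_diverse) and the three integer square root variants are hypothetical, the variants run one after another rather than truly in parallel, and one variant is made faulty on purpose.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Sequence


def majority(results: Iterable[int]) -> int:
    """Return the value produced by more than half of the runs."""
    collected = list(results)
    value, count = Counter(collected).most_common(1)[0]
    if count * 2 <= len(collected):
        raise RuntimeError("no majority among results")
    return value


def design_diverse(variants: Sequence[Callable[[int], int]], x: int) -> int:
    """Design diversity: different implementations, same input."""
    return majority(variant(x) for variant in variants)


def temporally_diverse(func: Callable[[int], int], x: int, runs: int = 3) -> int:
    """Temporal diversity: same implementation, same input, several calls."""
    return majority(func(x) for _ in range(runs))


def isqrt_builtin(n: int) -> int:
    return math.isqrt(n)


def isqrt_newton(n: int) -> int:
    """Integer square root by Newton iteration (independent implementation)."""
    if n < 2:
        return n
    x = n
    while True:
        y = (x + n // x) // 2
        if y >= x:
            return x
        x = y


def isqrt_faulty(n: int) -> int:
    """Deliberately wrong for n == 16, to stand in for a design error."""
    return math.isqrt(n) + (1 if n == 16 else 0)
```

With these definitions, design_diverse([isqrt_builtin, isqrt_newton, isqrt_faulty], 16) outvotes the faulty variant. Data diversity would follow the same pattern, with slightly modified inputs instead of different implementations.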

Fault tolerance in user interfaces

Erroneous user input, i.e. human error, frequently causes abnormal operating states. Error tolerance is therefore one of the dialogue design principles according to EN ISO 9241, Part 110 (Dialogue principles). A dialogue is error-tolerant if the intended result can be achieved with no or minimal corrective effort despite evidently erroneous input by the user:

  • Support in detecting and avoiding input errors (plausibility check)
  • No system crashes or undefined system states
  • Error explanations for correction purposes
  • Additional presentation effort to help locate errors
  • Automatic error correction, with the user being informed
  • Deferrable error handling
  • Additional explanations on request
  • Checking and confirmation before execution
  • Error correction without a change in the state of the dialogue
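Some of these points (plausibility check, error explanations, automatic correction with information, no undefined states) can be illustrated with a small form check. This is a minimal sketch under assumptions: the form fields "name" and "age", the age range of 0 to 150, and the names CheckResult and check_form are hypothetical and not taken from the standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CheckResult:
    values: Dict[str, str]                                  # possibly corrected input
    errors: Dict[str, str] = field(default_factory=dict)   # field -> explanation
    notes: List[str] = field(default_factory=list)         # automatic corrections


def check_form(raw: Dict[str, str]) -> CheckResult:
    """Plausibility check that never raises and never alters `raw`."""
    result = CheckResult(values=dict(raw))

    # Automatic correction, reported to the user instead of applied silently.
    name = raw.get("name", "")
    if name != name.strip():
        result.values["name"] = name.strip()
        result.notes.append("name: surrounding spaces were removed")
    if not result.values.get("name"):
        result.errors["name"] = "Please enter a name."

    # Plausibility check with an explanation that helps to correct the entry.
    age = raw.get("age", "")
    try:
        age_number = int(age)
    except (TypeError, ValueError):
        result.errors["age"] = f"'{age}' is not a whole number."
    else:
        if not 0 <= age_number <= 150:
            result.errors["age"] = "Age must be between 0 and 150."
    return result
```

Because the function returns its findings instead of raising an exception, the calling dialogue can stay in its current state, display the explanations next to the affected fields, and let the user correct the entries at a later point.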

The potential errors that users cause or encounter can be classified as follows:

Avoidable errors occur because too little attention was paid to user behavior; they could be prevented by careful analysis of the target group and its typical usage patterns. Typical avoidable user errors on websites are navigation errors or erroneous entries in forms. With extensive testing before the launch of a website or application, many of these errors could be avoided.

Not all known errors can be avoided. Slipping on the keyboard or accidentally submitting a form that has not yet been completed are just two examples of errors that must be expected because they cannot be ruled out. For all foreseeable errors there must be simple and clearly identifiable ways of correcting them.

Unforeseeable errors are all those that arise from unexpected user behavior or are caused by programming errors that are hard to identify. Most of these errors lead to opaque application behavior that the user cannot understand.

Levels of fault tolerance

In general, the following levels of fault tolerance can be distinguished:

Reaction and correction of errors

In reacting to or correcting errors, a distinction is made between the two principles of forward and backward error correction.

Forward error correction

In forward error correction, the system tries to carry on as if no error had occurred, for example by compensating for incorrect input values with empirical values from the past or with input values from properly functioning input interfaces, or by immediately continuing to operate with properly functioning replacement systems the moment an error occurs. With forward error correction, errors usually remain invisible to users.
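The compensation strategies described above can be sketched for a single input value. The following is only an illustration under assumptions: the class name ForwardRecoveringInput, the idea of a plausible range given by low and high, and the two source callables are hypothetical.

```python
from typing import Callable


class ForwardRecoveringInput:
    """Keeps delivering a value even if the input sources fail."""

    def __init__(self, primary: Callable[[], float], backup: Callable[[], float],
                 low: float, high: float, initial: float) -> None:
        self.primary = primary
        self.backup = backup
        self.low = low
        self.high = high
        self.last_good = initial  # empirical value from the past

    def read(self) -> float:
        # Try the primary source first, then the replacement source.
        for source in (self.primary, self.backup):
            try:
                value = source()
            except Exception:
                continue  # source failed; carry on with the next one
            if self.low <= value <= self.high:
                self.last_good = value
                return value
        # No source delivered a plausible value: compensate with the
        # last known good value so that operation can continue.
        return self.last_good
```

The caller of read() never sees the failure, which matches the observation that errors usually remain invisible with this approach. A real system would typically also log the event so that the fault can be repaired.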

Backward error correction

In backward error correction, when an error occurs the system attempts to return to a state before the occurrence, for example to the state immediately before a faulty calculation in order to perform that calculation again. Likewise, a change to an emergency state or, for example, a system restart is possible. If a faulty calculation can be repeated successfully, the error remains invisible to the user with backward error correction as well. Often, however, operation can only continue with reduced performance or limited functionality, and the error is therefore visible.
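Returning to an earlier state and repeating a calculation can be sketched with a checkpoint. This is a minimal sketch, assuming that the system state fits into a dictionary and that a failed calculation signals its failure by raising an exception; the name run_with_rollback and the parameter max_attempts are hypothetical.

```python
import copy
from typing import Any, Callable, Dict


def run_with_rollback(state: Dict[str, Any],
                      step: Callable[[Dict[str, Any]], None],
                      max_attempts: int = 3) -> Dict[str, Any]:
    """Run `step` on `state`; on failure restore the checkpoint and retry."""
    checkpoint = copy.deepcopy(state)  # state just before the calculation
    for _ in range(max_attempts):
        try:
            step(state)  # may modify `state` partially before failing
            return state
        except Exception:
            # Roll back: discard partial changes and restore the checkpoint.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("step failed after all attempts; state was restored")
```

If the repetition succeeds, the caller only sees the correct final state. If all attempts fail, the state is back at the checkpoint and the error becomes visible, at which point a change to an emergency state or a restart would be the next option.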

External correction

In space technology, error correction is performed by system experts at the ground station, who evaluate the satellite's telemetry data and switch functions by remote command. With the enormous progress in on-board data processing (fast processors, large data storage, intelligent software concepts), data evaluation and switching for error correction are increasingly performed autonomously by the satellite system itself.

Because of the extensive verification measures required for a complex switching system and the associated time and cost, increased autonomy in error correction is being realized only in small steps, since, unlike with systems on Earth, an incorrect error correction can lead to the complete loss of a satellite.
