Split-brain (computing)

In computer science, split brain is an undesired state of a computer cluster in which all interconnects between the cluster members are interrupted at the same time.

Forms

A basic distinction is made between:

  • The separation of a single node; the extreme example of this is the division of a two-node cluster,
  • The separation of a multi-node cluster (> 2 nodes) into unequal parts,
  • The separation of a multi-node cluster (> 2 nodes) into equal parts,
  • The separation of a multi-node cluster (> 2 nodes) into several parts. This last situation, however, tends to be treated as several individual split-brain scenarios.
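
The forms listed above can be told apart from the sizes of the resulting fragments alone. The following is a minimal sketch; the function name `classify_partition` and the returned labels are made up for illustration and do not come from any cluster product.

```python
def classify_partition(fragments):
    """Name the form of a split from the sizes of the resulting fragments.

    `fragments` is a list of sets of node names, one set per isolated part.
    """
    sizes = sorted(len(fragment) for fragment in fragments)
    if len(sizes) < 2:
        return "no split"
    if len(sizes) > 2:
        # Usually treated as several individual split-brain scenarios.
        return "several parts"
    if sizes[0] == 1:
        # Includes the extreme case of a two-node cluster splitting 1 + 1.
        return "single node separated"
    if sizes[0] == sizes[1]:
        return "equal parts"
    return "unequal parts"


two_node = classify_partition([{"n1"}, {"n2"}])
even_split = classify_partition([{"n1", "n2"}, {"n3", "n4"}])
uneven_split = classify_partition([{"n1", "n2"}, {"n3", "n4", "n5"}])
```

A two-node cluster falls into the first category by construction, which matches its description as the extreme example of a single node being cut off.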

Course of events

To coordinate transactions in the cluster, a cluster interconnect or a quorum is usually used, depending on the technology. If the connection between one or more parts of the cluster is interrupted, none of the parts can tell whether it is facing a partial failure or a separation. All of the now isolated cluster fragments keep working on their own in order to maintain the service. Since the connection to the public network (i.e. toward the user) normally still works, problems arise:

Effects

The basic problem of split brain is that at least two parts are still working, but no coordination between them is possible. While this does not immediately seem problematic for pure read access, write access leads to massive conflicts: the write operations are spread over the parts of the cluster, which are functioning but isolated from each other, while neither the logic layer (middle tier) nor the user notices anything unusual; from the user's point of view the cluster behaves as in normal operation. However, the blocks written by the nodes of part A cannot be read by the nodes of part B because of the broken interconnect, and vice versa.

The data sets therefore diverge, and the consistency of the data is no longer guaranteed. Recovery from this situation is normally feasible only with an unacceptable expenditure of time, or is completely impossible.
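
The divergence can be made concrete with a toy model. This is only a sketch under simplifying assumptions: each isolated part is reduced to a dictionary of block contents, and the class `Fragment`, the helper `conflicting_blocks`, and the block numbers and values are all invented for the example.

```python
class Fragment:
    """One isolated part of the cluster with its own view of the data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> content written in this part

    def write(self, block_id, content):
        # Succeeds locally; the other part never learns about it.
        self.blocks[block_id] = content


def conflicting_blocks(a, b):
    """Return ids of blocks that both parts wrote with different content."""
    common = a.blocks.keys() & b.blocks.keys()
    return sorted(k for k in common if a.blocks[k] != b.blocks[k])


part_a = Fragment("A")
part_b = Fragment("B")

# Both parts keep serving users and accept writes to the same block.
part_a.write(7, "balance=100")
part_b.write(7, "balance=250")
part_b.write(9, "order=open")  # exists only in part B

conflicts = conflicting_blocks(part_a, part_b)
only_in_b = sorted(part_b.blocks.keys() - part_a.blocks.keys())
```

Block 7 now has two competing versions and block 9 is unknown to part A; neither part has the information needed to decide which state is the correct one.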

Countermeasures

The basis of all countermeasures is the simultaneous use of quorum and cluster interconnect: keeping the two coordination paths separate makes it possible to distinguish between a separation and a partial failure.
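
How two independent paths allow this distinction can be sketched as follows. The sketch assumes, purely for illustration, that every node periodically refreshes a heartbeat on the quorum device in addition to its interconnect heartbeat; the names `Diagnosis` and `diagnose` are hypothetical and the logic is deliberately simplified to a single peer.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "healthy"
    PEER_FAILED = "peer failed"
    PARTITION = "partition"


def diagnose(interconnect_ok, peer_quorum_heartbeat_fresh):
    """Judge the state of a peer from two independent observations.

    interconnect_ok: the peer still answers over the cluster interconnect.
    peer_quorum_heartbeat_fresh: the peer still refreshes its heartbeat on
        the quorum device (an assumed mechanism, see the lead-in).
    """
    if interconnect_ok:
        return Diagnosis.HEALTHY
    if peer_quorum_heartbeat_fresh:
        # The peer is alive but unreachable: the cluster has been separated.
        return Diagnosis.PARTITION
    # Silent on both paths: the peer itself has most likely failed.
    return Diagnosis.PEER_FAILED


separated = diagnose(interconnect_ok=False, peer_quorum_heartbeat_fresh=True)
crashed = diagnose(interconnect_ok=False, peer_quorum_heartbeat_fresh=False)
```

With the interconnect alone, both of the last two cases would look identical to the observing node.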

Covering parallel failures (the simultaneous loss of several mission-critical components) increases the complexity enormously. In split-brain prevention, for example, the use of multiple quorums and of a parallelized or bonded interconnect absorbs the failure of interconnect and storage.

The interplay between quorums and interconnect requires reliable automated decision-making. In Oracle Clusterware, for example, the decision is made as follows:

After the loss of the interconnect, the following survives (order matters):

In order not to reintroduce the very problem that several quorums were supposed to solve (I see two quorums, you see two quorums, but we see two different pairs!), Oracle uses an odd number of these quorums. All nodes that meet the quorum must also see each other over the interconnect. If this is not the case, the load and topology information in the voting disk decides over the life and death of the node. The above-mentioned decision list is extended accordingly:
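
Leaving the decision list aside, the reason for choosing an odd number of quorums can be checked by brute force. The sketch below assumes that a node meets the quorum when it sees at least half of the quorum devices; this threshold and the function names are assumptions made for illustration, not Oracle's documented rule or implementation.

```python
from itertools import combinations


def views_of_at_least_half(total):
    """All smallest sets of quorum devices that cover at least half of them."""
    need = (total + 1) // 2  # half, rounded up
    return [frozenset(c) for c in combinations(range(total), need)]


def disjoint_views_possible(total):
    """True if two nodes can each see half the devices but share none.

    Only the smallest qualifying views are checked; larger views can only
    overlap more.
    """
    views = views_of_at_least_half(total)
    return any(a.isdisjoint(b) for a, b in combinations(views, 2))


# Four devices: "I see two, you see two, but we see two different pairs".
even_case = disjoint_views_possible(4)
# Three or five devices: any two qualifying views share at least one device.
odd_cases = [disjoint_views_possible(n) for n in (3, 5)]
```

With an even number of devices two fragments can both believe they hold the quorum, while with an odd number any two qualifying views necessarily share a device on which the conflict becomes visible.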
