Quorum (distributed computing)

A quorum disk, or voting disk, is a component of the cluster manager of a computer cluster that protects data integrity when part of the cluster fails. If the cluster interconnect (the connection between the cluster nodes) fails, the system risks splitting into units that act autonomously, which almost always threatens data integrity (the split-brain problem). When the interconnect is interrupted, the nodes write alternately or competitively to the logical structure of the voting disk, and the outcome decides which part of the cluster survives. The voting disk resides on shared storage.

In Oracle RAC, for example, the surviving part is:

  • In an asymmetric split (e.g. 2:3 nodes), the larger part
  • In an even split (e.g. 2:2 nodes), the part with the greater workload.
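
The two rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic as described here, not Oracle's actual implementation; the `Partition` type, the `workload` metric, and the function name `surviving_partition` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One side of a split cluster (hypothetical model)."""
    nodes: tuple       # names of the nodes in this partition
    workload: float    # assumed load metric; how it is measured is not specified


def surviving_partition(a: Partition, b: Partition) -> Partition:
    """Pick the survivor: larger node count first, then greater workload."""
    if len(a.nodes) != len(b.nodes):
        # Asymmetric split: the larger part survives.
        return a if len(a.nodes) > len(b.nodes) else b
    # Even split: the part with the greater workload survives.
    # On an exact workload tie this sketch arbitrarily keeps `a`.
    return a if a.workload >= b.workload else b


# Asymmetric 2:3 split
small = Partition(nodes=("n1", "n2"), workload=0.9)
large = Partition(nodes=("n3", "n4", "n5"), workload=0.1)
print(surviving_partition(small, large).nodes)

# Even 2:2 split
idle = Partition(nodes=("n1", "n2"), workload=0.2)
busy = Partition(nodes=("n3", "n4"), workload=0.7)
print(surviving_partition(idle, busy).nodes)
```

In a real cluster the two partitions cannot call a shared function, because the interconnect is down; the comparison would have to be derived from what each side writes to the voting disk.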

Such a distinction would be impossible without a "vote" on mass storage, since the failed interconnect is no longer available as a communication channel. Because almost all cluster managers react to an interconnect failure by restarting at least one node, persistently storing the cluster state on the voting disk is an advantage: it takes over a good part of the renegotiation of availability and master status. Without a persistent voting disk, these negotiations often require multiple reboots. Using a quorum therefore increases the availability of the individual nodes, because these reboot cycles are avoided.

Problems and Solutions

As soon as it is in use, the voting disk is itself an integral part of the cluster. If a previously available quorum suddenly becomes unreachable during cluster operation, the entire system fails. The same applies if a single shared storage device fails. Avoiding the resulting single point of failure is currently a goal of all clusterware vendors.

The common approach to solving these problems is to mirror the voting disk on several physical media. This, however, raises new difficulties:

  • The voting disks must be guaranteed to be consistent and must be accessible with the lowest possible latency.
  • In a split-brain scenario, a distribution of the voting disks among the potentially autonomous subunits would also be counterproductive. Oracle Clusterware, for example, resolves this conflict by using an odd number of quorums.
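
Why an odd number helps can be illustrated with a strict-majority check. The sketch below assumes that a partition may continue only if it can reach more than half of the configured voting disks; the function name `has_disk_majority` is hypothetical, and the rule is a simplification rather than a description of any vendor's exact protocol.

```python
def has_disk_majority(reachable: int, total: int) -> bool:
    """Return True if `reachable` is a strict majority of `total` voting disks."""
    if not 0 <= reachable <= total:
        raise ValueError("reachable must be between 0 and total")
    return reachable > total // 2


# Three voting disks split 2:1 between two partitions: exactly one side wins.
print(has_disk_majority(2, 3), has_disk_majority(1, 3))

# Four voting disks split 2:2: neither side holds a strict majority.
print(has_disk_majority(2, 4), has_disk_majority(2, 4))
```

With an odd total, the disks can never be divided evenly, so at most one partition satisfies the check. With an even total, a 2:2 division leaves no partition with a majority, so the extra disk adds cost without making a decision any more certain.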

The ideal solution to the consistency, latency, and availability problems is storage-side replication in the storage area network (SAN), which can be very expensive. It presents a single replicated device transparently to all cluster members and thus relieves the clusterware, the cluster members, and the administrators.
