IBM High Availability Cluster Multiprocessing

The cluster manager for AIX is called HACMP (High Availability Cluster Multi-Processing). It is used for applications that require high availability. These are usually mission-critical applications, such as the system that settles securities transactions at a bank.

With version 6.1, HACMP was renamed PowerHA. Although newer versions of the software no longer carry the old name, HACMP is still the common term among professionals.

Version 7.1 introduced Smart Assist agents, which automatically detect various applications and configure them as HA solutions.

Operation

The machines that take part in an HACMP cluster are called nodes. These nodes run so-called resource groups (RGs), which are the central concept in HACMP. An RG is the logical combination of:

  • One or more file systems
  • One or more IP addresses
  • One or more processes and the associated start/stop scripts

When a resource group is activated on a cluster node, the associated file systems are mounted first. The processes of the RG are then started using the start/stop scripts stored in the RG definition. Finally, the IP address (the so-called service IP) is configured as an IP alias on a particular interface.

If the resource group is moved to another cluster node (takeover), the application is terminated with the stop script, the file systems are unmounted, and the IP alias carrying the service IP is deleted. The activation routine (see above) is then run on the target node. The client sees only a brief interruption (the time required for the switch) until the service is available again under the same IP address. The client does not notice that this IP address now belongs to a different machine.
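The activation and takeover order described above can be sketched as a small model. This is only an illustration in Python, not how HACMP is implemented (the product uses Korn shell scripts); the class names, script paths, file systems, and the IP address are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    """Hypothetical model of an RG: file systems, service IP, start/stop scripts."""
    name: str
    filesystems: list[str]
    service_ip: str
    start_script: str
    stop_script: str


@dataclass
class Node:
    """Hypothetical cluster node that records the steps it would perform."""
    name: str
    log: list[str] = field(default_factory=list)

    def activate(self, rg: ResourceGroup) -> None:
        # Order as described in the text: mount, start processes, add the IP alias.
        for fs in rg.filesystems:
            self.log.append(f"mount {fs}")
        self.log.append(f"run {rg.start_script}")
        self.log.append(f"add alias {rg.service_ip}")

    def release(self, rg: ResourceGroup) -> None:
        # Stop the application, unmount the file systems, delete the IP alias.
        self.log.append(f"run {rg.stop_script}")
        for fs in reversed(rg.filesystems):
            self.log.append(f"umount {fs}")
        self.log.append(f"remove alias {rg.service_ip}")


def takeover(rg: ResourceGroup, source: Node, target: Node) -> None:
    """Move an RG: release it on the source node, then activate it on the target."""
    source.release(rg)
    target.activate(rg)


# Example values are made up for illustration.
rg = ResourceGroup("billing_rg", ["/data", "/data/logs"], "10.0.0.50",
                   "/usr/local/ha/start_billing", "/usr/local/ha/stop_billing")
node_a, node_b = Node("node_a"), Node("node_b")
node_a.activate(rg)
takeover(rg, node_a, node_b)
```

After the takeover, `node_b.log` holds the same activation steps that `node_a` performed originally, which is why the client sees the service reappear under the same IP address.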

Most of the functions in HACMP or PowerHA are implemented as scripts (in Korn shell). Only a small kernel patch (the so-called dead-man switch) intervenes directly in the underlying operating system. This open architecture makes HACMP very flexible.

The biggest problem that cluster software has to solve is the so-called split-brain condition: both nodes believe that they are, or must become, the active node. In HACMP/PowerHA, the cluster configuration defines several communication paths over which the cluster nodes inform each other that they are still working. This is called the heartbeat, and it can be carried over:

  • Dedicated IP interfaces
  • The disks of the resource groups, which both nodes must be able to access anyway
  • Serial lines (the classic method, mandatory up to HACMP 4.4)

If a node concludes from missing heartbeats that it can no longer communicate with its partner or the outside world, the dead-man switch is triggered and the node, depending on the configuration, either shuts itself down or reboots. The active node also checks whether communication with the clients is still possible before it shuts down so that the standby node can take over.
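The heartbeat decision can be illustrated with a short Python sketch. It assumes that a node treats itself as isolated only when every heartbeat path has been silent for longer than a timeout; that rule, the function name, and the path names are assumptions for illustration and are not taken from the HACMP implementation.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    HALT = "halt"
    REBOOT = "reboot"


def dead_man_switch(last_seen: dict[str, float], now: float,
                    timeout: float, on_isolation: Action) -> Action:
    """Decide what a node does, given when each path last delivered a heartbeat.

    last_seen maps a heartbeat path (IP network, shared disk, serial line)
    to the timestamp of the last heartbeat received on that path.
    on_isolation is the configured reaction: HALT or REBOOT.
    """
    # Assumption: the node is isolated only if every path has been silent too long.
    isolated = all(now - seen > timeout for seen in last_seen.values())
    return on_isolation if isolated else Action.CONTINUE


# All three paths delivered a heartbeat at t=100.
paths = {"ip_net": 100.0, "shared_disk": 100.0, "serial": 100.0}
still_ok = dead_man_switch(paths, now=105.0, timeout=10.0, on_isolation=Action.HALT)
cut_off = dead_man_switch(paths, now=120.0, timeout=10.0, on_isolation=Action.HALT)

# One path that is still alive is enough to keep the node running.
one_alive = dict(paths, serial=115.0)
partial = dead_man_switch(one_alive, now=120.0, timeout=10.0, on_isolation=Action.HALT)
```

Using several independent paths is what makes a false isolation verdict, and with it a split-brain situation, less likely.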

Typical configurations

HACMP/PowerHA supports a variety of cluster configurations. By far the most common are active/passive clusters (called rotating clusters in HACMP jargon) and active/active clusters (cascading clusters).

Rotating cluster

The resource group runs on one of usually two nodes (more are possible if required); the other node runs only the operating system and the cluster manager. If the active node fails, the other node performs a takeover. The mode is called rotating because the resource group is moved back and forth among the nodes, "rotated" so to speak.

This mode is usually used for mission-critical systems. Its advantage is that it is easy to plan and has relatively low complexity. The disadvantage is that a significant part of the capacity (that of the standby node) goes unused most of the time.
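A minimal Python sketch of the rotating behaviour follows. It assumes that the resource group moves to the next node in list order when the active node fails and then stays there; the function and node names are hypothetical, and the real selection policy is set in the cluster configuration.

```python
def rotating_failover(nodes: list[str], active: str, failed: str) -> str:
    """Return the node that hosts the resource group after `failed` goes down."""
    if failed != active:
        # A standby node failed; the resource group does not move.
        return active
    if len(nodes) < 2:
        raise RuntimeError("no surviving node can take over")
    # Assumption: the next node in list order takes over.
    idx = nodes.index(failed)
    return nodes[(idx + 1) % len(nodes)]


nodes = ["node_a", "node_b"]
first = rotating_failover(nodes, active="node_a", failed="node_a")
# Later the new active node fails and the group rotates back.
second = rotating_failover(nodes, active=first, failed=first)
```

In this sketch the resource group does not return to its original node on its own once that node is repaired; it moves again only at the next failure.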

Cascading cluster

The resource group with the main application runs on one node, while another node runs resource groups that can be shut down if required. In the event of a failure, the standby node first executes the stop scripts of its own resource groups and then takes over the RG of the main application.

This mode is typical of systems in which a production instance is accompanied by one or more test and development instances, such as SAP ERP or large databases. As long as no failure occurs, the test instances run on the standby node; in the event of a failure they are unavailable for some time.
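The order of operations on the standby node in this configuration can be sketched in Python as follows. The function and the resource group names are hypothetical and serve only to show the sequence described above.

```python
def cascading_takeover(standby_rgs: list[str], main_rg: str) -> list[str]:
    """Return the ordered steps the standby node performs when the main node fails."""
    # First stop the lower-priority resource groups running on the standby node.
    steps = [f"stop {rg}" for rg in standby_rgs]
    # Then activate the resource group of the main application.
    steps.append(f"takeover {main_rg}")
    return steps


steps = cascading_takeover(["test_rg", "dev_rg"], "prod_rg")
```

Stopping the test and development groups first frees the standby node's capacity for the production instance.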
