A highly available (HA) cluster needs reliable service nodes, networking, and power management. No amount of software can overcome major hardware deficiencies, although software can provide some redundancy where none existed before. In general, a majority of the service nodes must be able to communicate with one another to coordinate restarting services when some nodes fail or become unreachable.
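The majority requirement is a quorum rule: with n voting nodes, a partition may coordinate restarts only if it contains at least floor(n/2) + 1 of them, so that two partitions can never both act. A minimal sketch in Python, assuming one vote per node (the function names are illustrative):

```python
def quorum(total_nodes: int) -> int:
    """Smallest majority of voting nodes: floor(n/2) + 1."""
    return total_nodes // 2 + 1

def has_quorum(reachable: int, total_nodes: int) -> bool:
    """A partition may coordinate service restarts only with a majority."""
    return reachable >= quorum(total_nodes)

# In a 5-node cluster, a partition of 3 nodes has quorum; one of 2 does not.
```

Note that an even-sized cluster gains nothing from its last node: a 4-node cluster still needs 3 reachable nodes, which is one reason odd cluster sizes are preferred.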
Many HA clusters require shared storage disks. This may be implemented with a storage area network (SAN), typically connected via Fibre Channel, iSCSI, or an SAS switch. The disks or logical units (LUNs) of the storage must be accessible to each of the nodes that implement the shared storage system. If actual shared storage is not available, a networked disk-mirroring system such as DRBD can provide it.
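Where no SAN exists, DRBD mirrors a local block device between two nodes over the network, presenting each node with what behaves like a shared disk. A minimal resource definition might look like the following sketch (the device paths, host names, and addresses are illustrative assumptions, not values from the text):

```
resource r0 {
  protocol C;              # synchronous replication: writes complete on both nodes
  device    /dev/drbd0;    # mirrored block device presented to the cluster
  disk      /dev/sdb1;     # local backing disk on each node
  meta-disk internal;
  on node1 {
    address 192.168.10.1:7789;
  }
  on node2 {
    address 192.168.10.2:7789;
  }
}
```

Protocol C (fully synchronous) is the usual choice for HA, since it guarantees that an acknowledged write survives the failure of either node.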
HA clusters with shared storage also need to be able to fence nodes. Fencing is the separation of a node from the storage. When the HA cluster determines that a node is no longer responsive, it needs to make sure that the node truly is no longer able to write to the shared storage. This could be done by reconfiguring the SAN, but that will not help in recovering the node, which probably needs to be rebooted. Most HA clusters implement STONITH fencing (“shoot the other node in the head”), which works by forcibly and immediately shutting down the node to be fenced. This can be done by issuing commands to a power distribution unit or to a management controller for the fenced node (such as a blade chassis controller or an IPMI-capable management card).
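As an illustration, a STONITH agent fencing through an IPMI management card might issue an immediate hard power-off with ipmitool. The sketch below only assembles the command line and defaults to a dry run; the BMC host name and credentials are hypothetical, and a real agent would also verify the resulting power state before declaring the node fenced:

```python
import subprocess

def build_fence_command(bmc_host: str, user: str, password: str) -> list[str]:
    """Assemble an ipmitool invocation that tells the node's BMC to cut power.
    'chassis power off' is an immediate hard power-off, as STONITH requires
    (not a graceful OS shutdown, which an unresponsive node might ignore)."""
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI-over-LAN v2.0 interface
        "-H", bmc_host,    # address of the fenced node's management controller
        "-U", user,
        "-P", password,
        "chassis", "power", "off",
    ]

def fence_node(bmc_host: str, user: str, password: str,
               dry_run: bool = True) -> list[str]:
    """Fence a node via its BMC; with dry_run=True, only return the command."""
    cmd = build_fence_command(bmc_host, user, password)
    if not dry_run:
        subprocess.run(cmd, check=True)  # would actually power the node off
    return cmd
```

Because the command goes to the management controller rather than to the node's operating system, it succeeds even when the node itself is hung.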
Fencing must work even in the case of a network failure. This might require alternate network routes for controlling the fencing devices. If fencing is implemented with a power distribution unit, that unit must allow each node to be fenced individually.