Lindsay Todd

Hardware for an HA cluster

May 03 2011
 

A highly available (HA) cluster needs reliable service nodes, networking, and power management. No amount of software can overcome major deficiencies in hardware, although software can provide some redundancy when none was possible before. Generally, the majority of the different service nodes must be able to communicate with each other to coordinate restarting services when some nodes fail or become unreachable.
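The majority requirement can be illustrated with a small check. This is only a sketch: the function name and the node names are made up for illustration, and real cluster managers implement quorum with more nuance (tie-breakers, weighted votes, and so on).

```python
def has_quorum(reachable_nodes, all_nodes):
    """Return True if the reachable nodes form a strict majority of the cluster.

    Both arguments are collections of node names (hypothetical identifiers).
    """
    members = set(all_nodes)
    # Ignore any names that are not actually cluster members.
    reachable = set(reachable_nodes) & members
    # A strict majority is required, so an even split does not count.
    return len(reachable) > len(members) // 2


cluster = {"node1", "node2", "node3"}
print(has_quorum({"node1", "node2"}, cluster))
print(has_quorum({"node3"}, cluster))
```

With a strict majority rule, at most one partition of a split cluster can hold quorum, so at most one side will try to restart services.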

Shared storage

Many HA clusters require shared storage. This may be implemented using some sort of storage area network (SAN), typically connected with fibre channel, iSCSI, or a SAS switch. Disks or logical units (LUNs) of the storage will need to be accessible to several of the nodes that will be implementing the shared storage system. If actual shared storage is not available, one could use a networked disk mirroring system such as DRBD to implement the shared storage.

Fencing

HA clusters with shared storage also need to be able to fence nodes. Fencing is the separation of a node from the storage. When the HA cluster determines that a node is no longer responsive, it needs to make sure that the node truly is no longer able to write to the shared storage. This could be done by reconfiguring the SAN, but that will not help in recovering the node, which probably needs to be rebooted. Most HA clusters implement STONITH fencing (“shoot the other node in the head”), which is implemented by forcibly and immediately shutting down the node to be fenced. This can be done using commands to a power distribution unit, or to some management controller for the fenced node (such as to a blade chassis controller, or to an IPMI-capable management card).
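The important point is the ordering: the failed node must be confirmed fenced before anything else touches its storage. A minimal sketch of that ordering follows, assuming two hypothetical callbacks: `power_off` stands in for whatever talks to the power distribution unit or management controller, and `take_over_storage` stands in for the cluster's resource manager. Neither is a real library API.

```python
from typing import Callable


def recover_from_failed_node(
    node: str,
    power_off: Callable[[str], bool],
    take_over_storage: Callable[[str], None],
) -> bool:
    """Fence `node`, then take over its storage only if fencing is confirmed.

    `power_off` (hypothetical) returns True only when the node is known to be
    powered off. `take_over_storage` (hypothetical) starts the node's storage
    services on a surviving node.
    """
    if not power_off(node):
        # Fencing is unconfirmed: the node might still be writing to the
        # shared storage, so it is not safe to start its services elsewhere.
        return False
    take_over_storage(node)
    return True
```

If fencing cannot be confirmed, the sketch deliberately does nothing further; taking over storage while the old node may still be writing is exactly the situation fencing exists to prevent.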

Fencing must work even in the case of a network failure. This might require alternate network routes for controlling fencing devices. If fencing is implemented by a power distribution unit, it must be the case that each node can be fenced individually.
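The per-node requirement for power distribution units can be checked against a wiring plan before the cluster goes into service. The sketch below assumes a hypothetical plan expressed as a dictionary from outlet name to the set of nodes that outlet powers; the outlet and node names are invented for illustration.

```python
def nodes_not_individually_fenceable(outlets, nodes):
    """Return the sorted names of nodes that cannot be fenced on their own.

    `outlets` maps an outlet name to the set of node names it powers
    (a hypothetical wiring plan). A node is a problem if no switchable
    outlet feeds it, or if any outlet feeding it also feeds another node.
    """
    problems = set()
    for node in nodes:
        feeding = [powered for powered in outlets.values() if node in powered]
        if not feeding:
            # No controllable outlet: the node cannot be powered off at all.
            problems.add(node)
        elif any(len(powered) > 1 for powered in feeding):
            # Switching this outlet would take down another node as well.
            problems.add(node)
    return sorted(problems)


plan = {"pdu1/1": {"node1"}, "pdu1/2": {"node2", "node3"}}
print(nodes_not_individually_fenceable(plan, ["node1", "node2", "node3", "node4"]))
```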

High availability clustering for scientific computing

May 02 2011
 

Clustering is a term you will find in a lot of marketing literature, both for hardware configured as a “cluster” and for software that enables management and use of a “cluster”. What can be misleading is that there are two distinctly different types of clusters in common use, performing very different functions. Both types of clustering are useful for scientific computing.

Choosing an Enterprise Linux Distribution

Mar 23 2011
 

At the supercomputing site where I work, we have been using Red Hat Enterprise Linux on our compute nodes. But now the subscription to updates has expired, and it is extremely difficult to figure out how to renew the subscription. It is not clear which “flavor” of RHEL we need, and the fact that we started with RHEL 4, and RHEL 6 is current, only complicates matters. Pricing information I can find suggests this is going to be pretty expensive – assuming we can ever find out how to order a new subscription! Mind you, we are not against paying something to Red Hat. We very much depend on their product and want to see them remain in business!

Jun 11 2005
 

We are interested in the problems relating to synchronizing data. For example, I have a couple of laptop computers, a computer at home, and a computer at work. There is a large portion of data that I wish to have accessible to me from any of these computers, and changes must propagate to each system. A distributed file system might seem like the obvious solution. But my home computer does not have a fast Internet connection, nor is my laptop always connected to a network. Additionally, today’s distributed file systems share everything, but this is not always desirable.

What we need to be able to do is to synchronize our data. In many cases, the data to synchronize consists of whole files that have been either added or deleted at one or more computers, or a common file that has changed at one computer. File system synchronizers such as Unison can detect which files have been created, deleted, or changed and propagate these changes between computers. In other cases, a file has changed differently on more than one computer. These files need to be merged, that is, brought to a consistent state incorporating the changes made on each system.
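The two cases can be sketched by comparing snapshots of each computer against the state at the last synchronization. This is an illustrative sketch only, not a description of how Unison works internally. It assumes each snapshot is a dictionary from file path to a content hash; the function names are hypothetical.

```python
def classify_changes(previous, current):
    """Compare two snapshots (dicts of path -> content hash) of one computer.

    Returns three sorted lists: created, deleted, and changed paths.
    """
    created = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    changed = sorted(
        path for path in set(previous) & set(current)
        if previous[path] != current[path]
    )
    return created, deleted, changed


def find_conflicts(base, replica_a, replica_b):
    """Return paths that changed differently on both replicas since `base`.

    `base` is the snapshot from the last synchronization. A missing path is
    treated as hash None, so deletions and creations are covered as well.
    """
    conflicts = []
    for path in sorted(set(base) | set(replica_a) | set(replica_b)):
        old = base.get(path)
        a = replica_a.get(path)
        b = replica_b.get(path)
        # Both sides differ from the base and from each other: needs a merge.
        if a != old and b != old and a != b:
            conflicts.append(path)
    return conflicts
```

Paths reported by `classify_changes` on only one computer can simply be propagated to the others; paths reported by `find_conflicts` are the ones that need merging.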