High availability (HA) systems often rely on HA hardware such as clustered computers. When a running computer encounters a hardware fault, a standby computer is activated to take over. HA seems easy in concept, but in practice it is not. The main difficulty is not the HA hardware but the HA software, which you have to write yourself if you cannot buy it off the shelf.
Here are several guidelines that I have gathered while designing HA software.
- Do not use multi-threading.
Reason: In a multi-threaded application, if one thread hangs, it is hard to debug.
Solution: Use multiple processes instead. If a process hangs, you can attach a debugger to it and choose Debug -> "Break All" (in Visual Studio) to find out where it hangs. This also lowers the coupling in the software: the other processes can continue running without being greatly affected.
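As a rough illustration of the isolation that separate processes give, here is a minimal Python sketch (the guidelines themselves are written with .NET in mind). The helper name `run_worker` and the timeout value are assumptions made for this example. The parent starts a worker as a separate process and gives up on it after a timeout, so a hung worker does not hang the parent.

```python
import subprocess
import sys


def run_worker(code, timeout_s):
    """Run a snippet of Python in a separate process (hypothetical helper).

    Returns the worker's exit code, or None if the worker did not finish
    within timeout_s seconds and had to be killed.
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The worker is hung (or just slow); the parent itself is unaffected.
        proc.kill()
        proc.wait()  # reap the killed process
        return None
```

In a real system you would probably attach a debugger to the hung process before killing it, as described above; the sketch only shows that the parent keeps running either way.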
- Do not use blocking calls.
E.g. Sending a socket message could block forever if there is a problem with the network (no fault of the sender).
Solution: Use a non-blocking send() or set a send timeout on the socket.
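The timeout approach could look roughly like the following Python sketch. The function name `send_with_timeout` and the default timeout are assumptions for illustration; in .NET you would set the corresponding timeout on the socket instead.

```python
import socket


def send_with_timeout(sock, payload, timeout_s=0.5):
    """Send payload, but give up after timeout_s seconds (hypothetical helper).

    Returns True if everything was sent, False if the send timed out.
    After a timeout, part of the payload may already have been sent.
    """
    sock.settimeout(timeout_s)  # applies to subsequent blocking calls
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # The peer is not draining its receive buffer; do not wait forever.
        return False
```

The caller decides what a `False` result means: drop the message, retry later, or treat the peer as failed and trigger a fail-over.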
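- Do not use messaging mechanisms that require tight coupling between senders and receivers.
E.g. Sending messages to a ZMQ push-pull socket can block forever if the recipient hangs, because a PUSH socket blocks once its send queue is full (no fault of the sender).
Solution: Decouple senders and receivers using the ZMQ publisher-subscriber pattern (PUB-SUB sockets). A publisher does not wait for a slow or hung subscriber; it drops messages for that subscriber instead of blocking.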
- Do not forget to catch exceptions.
E.g. Sending a message to a UDP socket could fail if the queue is full, or for other reasons (no fault of the sender). If you do not catch SocketException, your application will crash.
Solution: Catch SocketException so that the application continues running. You then have two options: ignore the failure, or wait a while and re-send (hoping that the queue will eventually clear by itself). The first option can be viable for UDP, since the protocol does not guarantee delivery anyway.
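A minimal Python sketch of the "wait a while and re-send" option follows. In Python, socket failures surface as `OSError` rather than SocketException; the function name `send_datagram`, the retry count, and the delay are assumptions for this example.

```python
import socket
import time


def send_datagram(sock, data, addr, retries=3, delay_s=0.05):
    """Send one UDP datagram, retrying a few times on failure (hypothetical helper).

    Returns True if a send succeeded, False if every attempt failed.
    The caller may simply ignore a False result, since UDP is lossy anyway.
    """
    for attempt in range(retries + 1):
        try:
            sock.sendto(data, addr)
            return True
        except OSError:
            # Do not crash; wait briefly and hope the queue has cleared.
            if attempt < retries:
                time.sleep(delay_s)
    return False
```

Setting `retries=0` gives the "ignore the failure" option: one attempt, no crash, and a `False` return that the caller is free to discard.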
- Log sufficient information in the exceptions.
E.g. If you catch only the generic Exception and not SocketException, you lose the WinSock error code.
Solution: Catch SocketException specifically so that you can log the error code. Do the same for ZmqException. If you want to log more information than SocketException provides, you can wrap it inside another exception (as its InnerException).
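The same idea, sketched in Python: catch the specific socket error, log its error code, and wrap it in a richer exception while keeping the original attached. Here `raise ... from exc` plays the role of InnerException. The class `SendFailed`, the function `send_or_wrap`, and the logger name are hypothetical names chosen for this example.

```python
import logging

log = logging.getLogger("ha.sender")  # logger name is an assumption


class SendFailed(Exception):
    """Hypothetical wrapper that carries extra context about a failed send."""

    def __init__(self, message, peer):
        super().__init__(message)
        self.peer = peer


def send_or_wrap(sock, data, peer):
    """Send data to peer; on a socket error, log the code and re-raise wrapped."""
    try:
        sock.sendto(data, peer)
    except OSError as exc:
        # Log the OS-level error code before it is lost.
        log.error("send to %s failed: errno=%s (%s)", peer, exc.errno, exc.strerror)
        # Keep the original exception reachable via __cause__.
        raise SendFailed(f"could not send {len(data)} bytes to {peer}", peer) from exc
```

A handler further up can then log both levels: the wrapper's message and peer, and the original error code through `__cause__.errno`.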
- Use a reliable logging framework, e.g. Log4Net.
- Add trace statements, so that the log shows what the application was doing before a failure.
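As a small illustration of the last two guidelines, here is a Python sketch using the standard `logging` module in place of Log4Net. The names `make_logger` and `process_message`, and the log format, are assumptions made for this example.

```python
import logging


def make_logger(name, stream):
    """Create a logger that writes timestamped lines to stream (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


def process_message(logger, msg):
    """Example unit of work with trace statements on entry and exit."""
    logger.debug("enter process_message, size=%d", len(msg))
    result = msg.upper()  # stand-in for the real work
    logger.debug("leave process_message")
    return result
```

If the process hangs between the two trace lines, the log shows an "enter" without a matching "leave", which narrows down where to look.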