[lug] Setting up failover in Linux?

Sean Reifschneider jafo at tummy.com
Sun Apr 29 16:30:52 MDT 2012


Linux-HA setups are typically a collection of pieces that you then have to
design your high availability around.  They aren't like a RAID controller,
which you can just drop into a system so that if one component fails the
system keeps running seamlessly, while behaving almost exactly the same
way as a system without the RAID controller...

Here are some thoughts I had while reading your message:

   Putting /usr on DRBD is almost certainly not what you want.  You may
   think it is, but it's not.  :-)  You really, REALLY, need to think about
   exactly which components you need replicated, and only those should
   exist on DRBD (see the sketch just below).
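
   For example, rather than replicating /usr, you might put only the
   application's data on the DRBD-backed filesystem.  A minimal sketch,
   assuming a hypothetical DRBD device /dev/drbd0 and a made-up mount
   point /srv/appdata:

      # On whichever node is currently DRBD Primary:
      mkfs.ext4 /dev/drbd0            # one time, when first creating the resource
      mkdir -p /srv/appdata
      mount /dev/drbd0 /srv/appdata   # only the application's data lives here
      # The OS, /usr, and the application binaries stay on each node's own
      # local disks, so they can be upgraded independently.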

   If you want the ability to upgrade software on each system independently
   (whether OS or application software), you probably need to keep that
   software outside of the DRBD device.

   Remember that the DRBD device is probably only mounted on the active
   machine, so any upgrades you do of things on the DRBD only happen on the
   active system.
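
   A quick way to see which node is currently active for a DRBD resource is
   to look at /proc/drbd; roughly (the output below is only illustrative):

      cat /proc/drbd
      #  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate ...
      # "Primary" means this node holds the device; upgrades of anything
      # living on the DRBD filesystem have to happen here, where it is
      # mounted.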

   DRBD can be run in a number of modes.  If your systems are separated by
   a large physical distance, you either have to be content with the remote
   system potentially being in an incomplete state, or you have to run in
   one of the modes that ensures that when flushes are done locally, the
   remote end has either received them or committed them to disk.  In that
   case, you have to be content with dramatically increased I/O latency
   (depending on your latency to the remote location).
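
   Concretely, that trade-off is the "protocol" setting in the DRBD resource
   configuration: protocol A is asynchronous, B completes a write once the
   peer has received it, and C only once the peer has committed it to disk.
   A sketch of a resource stanza (device, disk, hostnames, and addresses
   are all made up):

      resource r0 {
          # Protocol C: synchronous, safest, highest write latency
          protocol C;
          device    /dev/drbd0;
          disk      /dev/sdb1;
          meta-disk internal;
          on nodea { address 10.0.0.1:7789; }
          on nodeb { address 10.0.0.2:7789; }
      }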

   If you are doing maintenance and your HA design can't handle a fail-over
   event during it, you would probably shut down the fail-over system for
   the duration; if there were a hardware failure or the like during the
   maintenance, you would then need to resolve it manually.
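
   With Pacemaker, for instance, that can be as simple as putting the
   cluster into maintenance mode (or standing the passive node by) before
   you start and undoing it afterwards; a rough sketch with the crm shell,
   using a made-up node name:

      crm configure property maintenance-mode=true   # cluster stops reacting
      # ...do the maintenance...
      crm configure property maintenance-mode=false

      # or take just the passive node out of service:
      crm node standby nodeb
      # ...do the maintenance...
      crm node online nodeb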

If you want something that acts like a more robust system, without having
to put a lot of time into the HA design of your own applications, you can
consider something like Proxmox or VMware with HA.  These are
virtualization environments that provide fail-over at the virtual system
level.  So, you treat the virtual machine as just a regular system, and the
virtualization HA will fail it over in the event of hardware problems, etc...

In most cases, that kind of fail-over is the same as a reboot, so it's not
as fast as designing your application for the HA system itself.  For
example, we have a router setup that will fail over without losing any
packets...
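
As a general illustration only (not a description of our setup): one common
building block for fast router fail-over is VRRP, e.g. via keepalived,
where a virtual IP address moves to the backup router within about a second
of the master disappearing.  A minimal sketch, with made-up interface,
router-id, and address:

    vrrp_instance uplink {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        # seconds between VRRP advertisements
        advert_int 1
        virtual_ipaddress {
            # the address that clients and routes actually point at
            192.0.2.1/24
        }
    }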

However, you are still going to have to deal with replication delays if
your secondary system is off-site -- if you want things committed at the
transaction level and consistent across locations, you have to deal with
the latency between those locations...

I believe VMware also has a Fault Tolerance feature which will run multiple
virtuals in lock-step.  I've never used that, but my understanding is that
it increases your latency for all network operations by something like
50ms on average, because of the way they do it.  I believe you're also
looking at $30K+ to implement such a system, between licensing and SANs
capable of multiple-site disaster recovery.

Sean


