[lug] Setting up failover in Linux?

Sean Reifschneider jafo at tummy.com
Mon Apr 30 13:07:28 MDT 2012


On 04/29/2012 06:17 PM, Rob Nagler wrote:
> That's the odd part to me.  I understand it, but it means I need to
> design around this.  I don't have a problem with it (we're mostly
> done), but it does make me suspicious of all the hype around HA from
> vendors like Amazon.

I'm pretty sure that Amazon is not hyping Linux-HA, and I honestly can't
think of what you have in mind when you say Amazon is hyping HA.  In fact,
their whitepapers are quite clear that you have to design your applications
to work within their infrastructure and withstand failures of individual
components.  This is well documented in the Netflix case study about "Chaos
Monkey", which goes around killing components to ensure that the
applications are designed to withstand component failures.

Amazon is 100% saying that you have to design specifically for their
infrastructure, just like with the Linux-HA and DRBD projects.

> I have a hard time believing that these systems actually work when you
> really need them.  I'm sure they are fine if there are external events

I guess you're either going to have to trust my 17 years of doing HA
deployments, or develop your own experience...  :-)

> it happened.  I do want it to be like RAID in that it handles the real
> life event that disks fail.  In the application space, you have the

My point was that RAID handles it seamlessly, as if the hardware had not
failed, without consideration for the lower levels.  With Linux-HA systems,
you have to design the applications and their start/stop to account for
restarting as part of the failure recovery.
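
As a rough illustration of what "designing start/stop for restarts" can
mean, here is a minimal sketch of an idempotent start/stop pair.  The
ManagedService class and its lock flag are hypothetical stand-ins, not
anything Linux-HA provides; the point is only that start has to cope with
leftovers from a crash and that repeated calls have to be harmless.

```python
class ManagedService:
    """Toy stand-in for an application under cluster-manager control.

    Hypothetical class for illustration only; not part of Linux-HA.
    """

    def __init__(self):
        self.running = False
        self.lock_present = False  # simulates a lock/pid file on disk

    def crash(self):
        """Simulate the node dying: the process is gone, the lock remains."""
        self.running = False

    def status(self):
        return "running" if self.running else "stopped"

    def start(self):
        """Idempotent start: safe to repeat, and safe after a crash."""
        if self.running:
            return "already running"
        result = "started"
        if self.lock_present:
            # Stale lock from an unclean shutdown: clean up before starting.
            self.lock_present = False
            result = "started after cleanup"
        self.running = True
        self.lock_present = True
        return result

    def stop(self):
        """Idempotent stop: stopping a stopped service is not an error."""
        was_running = self.running
        self.running = False
        self.lock_present = False
        return "stopped" if was_running else "already stopped"
```

A cluster manager may call start or stop more than once during recovery,
so each call reports success whenever the service ends up in the requested
state.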

> really want automatically failover, because I've been trying to figure
> out the network partition problem for 30 years, and dammit, I still
> can't figure it out.  :)

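Many systems handle it via fencing, for example STONITH ("Shoot The Other
Node In The Head"): the surviving node forcibly powers off or isolates its
peer before taking over the resources.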
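
The core rule behind fencing can be sketched in a few lines.  This is a
simplified illustration under assumed names (decide_failover and the
fence_peer callback are hypothetical, not a Linux-HA API): a node that
loses contact with its peer only takes over once it has confirmed that the
peer has been fenced.

```python
def decide_failover(peer_heartbeat_ok, fence_peer):
    """Decide what a standby node should do.

    peer_heartbeat_ok: True if the peer is still answering heartbeats.
    fence_peer: hypothetical callable that tries to power off or isolate
                the peer and returns True only on confirmed success.
    """
    if peer_heartbeat_ok:
        return "stay standby"
    # The peer is silent, but it may only be partitioned away from us and
    # still be writing to shared data.  Fence it before touching anything.
    if fence_peer():
        return "take over"
    # Fencing unconfirmed: taking over now risks two active nodes.
    return "do nothing"
```

If fencing cannot be confirmed, the safe answer is to stay passive, since
a missed failover is usually cheaper than two nodes writing the same data.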

> BTW, that's what we had at Tandem, but of course, that was ages ago,
> and we forgot all about how to build reliable transaction systems.

Actually, Tandem found that hardware was getting more reliable and that
NonStop didn't make as much sense as it once did.  The additional
complexity of NonStop introduced opportunities for operator error that
weren't there in a more traditional system.  My source for that is a
mid-90s whitepaper from Tandem, but I don't have a link handy...

> Seriously, I do think the problem is solvable if you solve it globally
> with a transaction manager, and that's sort of the solution we are

Linux-HA and DRBD do not solve that problem...  They ensure that system
failure is detected, resources are started/restarted whenever possible, and
that data is replicated between hosts.
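
For the failure-detection part, a minimal sketch of heartbeat-timeout
logic looks like the following.  HeartbeatMonitor and its method names are
assumptions made for illustration and do not reflect how Linux-HA is
implemented; timestamps are passed in explicitly to keep the example
deterministic.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per node and report nodes that timed out.

    Hypothetical helper for illustration only.
    """

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node name -> time of last heartbeat

    def record_heartbeat(self, node, now):
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A node that shows up in failed_nodes() is only suspected dead; deciding
whether it is safe to restart its resources elsewhere is a separate step.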

Sean


