[lug] Clustering for Load-Balancing and Fault-Tolerance??

Wed Jan 30 07:18:57 MST 2002

Hi Alan,

Some thoughts on DNS... which I've done far too much of in my short life.
(GRIN)

> DNS has minimal useful fault tolerance.  When a server goes down, sites and
> clients that have the dead server IP cached get the shaft.  People who use
> this solution typically dislike the result.  Microsoft clients are
> notoriously slow to bypass a dead server to go to the next one.  Delays of a
> minute or more are not uncommon.  That's not very fault tolerant.

DNS is perfectly fault-tolerant when built correctly.  Your first comment
says that clients are left high and dry when a server goes down... yes, but
they switch to another NS-record listed server.  The problem is that most
clients on most OSes do not switch "quickly" enough for some people's
tastes.  However, a properly written resolver *can* switch quickly, if
the client-side application supports that.  Most desktop OSes don't switch
quickly.  This is the root-cause problem.

The delays are inherent in the CLIENT, not the servers.  There is usually
another DNS server sitting ready and waiting to take the traffic in almost
every correctly-built DNS setup out there -- but the client waits out some
arbitrary timeout first.  Clients could be written better than they are on
modern multi-threaded systems.  (Notice that hey... that nameserver's
taking longer than it "usually" does, so I'll grab that answer from my
secondary DNS server and see if the primary ever answers... if not I'll
switch...)
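
A rough sketch of that "don't wait for the full timeout" idea, in Python.
This is only an illustration, not how any particular OS resolver works.
The names hedged_lookup, simulated_query, "ns-slow" and "ns-ok" are made
up for the example, and the query function is a placeholder you would
swap for real DNS code:

```python
import concurrent.futures as cf
import time


def hedged_lookup(name, servers, query, hedge_after=0.2, timeout=2.0):
    """Return the first successful answer for `name`.

    Servers are asked in order, but the next one is also asked as soon as
    `hedge_after` seconds pass without an answer (or the previous query
    fails outright), instead of waiting out a full timeout.
    `query(server, name)` is a caller-supplied placeholder, not a real
    resolver API.
    """
    deadline = time.monotonic() + timeout
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(servers)))
    pending = set()
    try:
        for server in servers:
            pending.add(pool.submit(query, server, name))
            wait_for = max(0.0, min(hedge_after, deadline - time.monotonic()))
            done, pending = cf.wait(pending, timeout=wait_for,
                                    return_when=cf.FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    return fut.result()
        # Every server has been asked; wait out the rest of the budget.
        while pending and time.monotonic() < deadline:
            remaining = max(0.0, deadline - time.monotonic())
            done, pending = cf.wait(pending, timeout=remaining,
                                    return_when=cf.FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    return fut.result()
        raise TimeoutError(f"no server answered for {name!r}")
    finally:
        pool.shutdown(wait=False)  # do not block on the slow servers


def simulated_query(server, name):
    """Stand-in for a real DNS query: 'ns-slow' stalls, then fails."""
    if server == "ns-slow":
        time.sleep(0.5)
        raise OSError("no answer")
    return f"{name} answered by {server}"


print(hedged_lookup("www.example.com", ["ns-slow", "ns-ok"],
                    simulated_query, hedge_after=0.05))
# prints: www.example.com answered by ns-ok
```

The primary still gets first crack at every lookup, so a healthy setup
sees no extra traffic.  The second server only gets asked when the first
one is slower than hedge_after, which is the "usual" response time you
would have to pick or measure yourself.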

> If you're my ISP and I run microsoft clients, and you just let DNS fault
> tolerance take care of it, I'm going to be very testy about these failures.
> If you use load balancing or IP failover, I won't even know.

Not 100% true.  Load balancing and IP failover can't rescue a half-finished
(sent but not ACK'ed) DNS query at all.  If your clients do a lot of DNS
lookups and there is X percentage of them requesting a DNS entry every
minute, during the minute the IP failover kicks in those clients are still
going to have to time out and ask again...

Worse, if they're configured to only know about *one* DNS server behind that
load-balancer, they're going to report "destination host unreachable" if the
request timeout has been reached on that single load-balanced IP, because
they didn't get an answer.

Yes, there are some retries in there that might rescue them... but the point
is that, compared with the roughly one minute it takes to fail over by normal
DNS means, IP failover and/or a load-balancer doesn't buy you much for the
cost.

And if you load-balance DNS you have to do it to at least two "logical" DNS
servers... so the dumber clients have something to "fail" to during that
switchover time.
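
To put rough numbers on that, here is a toy model in Python of a single
lookup during a switchover.  Everything in it is an assumption for
illustration: the names "vip-a" and "vip-b", the 60-second outage, the
5-second timeout and 2 attempts, and the simplification that a live
server answers instantly while a dead one costs a full timeout.

```python
def stub_lookup(servers, down_until, start, timeout=5.0, attempts=2):
    """Model one lookup by a simple stub resolver.

    servers    : resolver addresses the client is configured with, in order
    down_until : {server: time in seconds at which it answers again}
    start      : time in seconds at which the client sends its first query
    Returns (server_that_answered_or_None, seconds_the_client_waited).
    Assumes a server that is up answers instantly and a query to a dead
    server costs one full timeout.
    """
    now = start
    for _ in range(attempts):
        for server in servers:
            if now >= down_until.get(server, 0.0):
                return server, now - start
            now += timeout  # query went unanswered; the client waited
    return None, now - start


# Hypothetical outage: "vip-a" stops answering from t=0 until t=60.
outage = {"vip-a": 60.0}
print(stub_lookup(["vip-a"], outage, start=10.0))           # (None, 10.0)
print(stub_lookup(["vip-a", "vip-b"], outage, start=10.0))  # ('vip-b', 5.0)
print(stub_lookup(["vip-a"], outage, start=56.0))           # ('vip-a', 5.0)
```

With only one address the lookup fails outright unless the client happens
to retry after the switchover finishes.  With a second address it loses
one timeout and carries on.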

And in practice, relatively slow hardware can handle MASSIVE amounts of DNS
traffic without hiccuping.

So the redundancy that's really needed, and the best bang-for-the-buck (it's
also the part that goes down much, much more often), is in the network
leading TO the DNS servers.  Redundancy of DNS servers via IP failover or a
load-balancer only becomes useful in that you only have to publish a couple
of IP addresses instead of keeping a big list of DNS servers and pointing
different clients at different machines.

Nate Duehr, nate at natetech.com