[lug] Server clock losing serious time

Sat Aug 14 10:34:25 MDT 2004

Hey all--

I just built two new web servers (Dell PowerEdge 750s) and I'm having 
some serious issues with the system clock losing time.  In the span of 
seven hours, the clock lost an hour and a half of time.  That's a 
frightening rate of roughly *13 seconds per minute*.
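
As a sanity check on that rate, here is a minimal Python sketch of the arithmetic. The function name `drift_per_minute` is made up for illustration, and the seven-hour span is only approximate, so treat the result as a ballpark figure in the same range as the rough rate quoted above.

```python
def drift_per_minute(lost_seconds: float, elapsed_hours: float) -> float:
    """Return seconds of clock time lost per minute of real time."""
    elapsed_minutes = elapsed_hours * 60
    return lost_seconds / elapsed_minutes


# "An hour and a half" lost over roughly seven hours.
print(f"{drift_per_minute(90 * 60, 7):.1f} s lost per minute")

# Same span, using the 6007.325720 s offset that ntpdate reports further down.
print(f"{drift_per_minute(6007.325720, 7):.1f} s lost per minute")
```

Either way the loss is well over ten seconds per minute, far beyond anything ntpd could be expected to slew away.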

What's more, I'm running NTP on both servers.  I can reset the time just 
fine via ntpdate:

# ntpdate -d  utcnist.colorado.edu
14 Aug 08:43:56 ntpdate[1500]: ntpdate 4.2.0@1.1161-r Sat Jul 17 15:12:33 MDT 2004 (1)
Looking for host utcnist.colorado.edu and service ntp
host found : india.Colorado.EDU
<lots of debugging messages snipped>
14 Aug 08:43:56 ntpdate[1500]: step time server 128.138.140.44 offset 6007.325720 sec

Wow, off by 6007 seconds (1hr 40min).  Ouch.

I start the NTP daemon (ntpd), but then the clock drifts alarmingly.  
I've verified that NTP is running (netstat -nl | grep 123), but if I 
trace the time servers I get this odd result:

# ntptrace
localhost: stratum 16, offset 0.000000, root distance 0.000000

I believe that indicates ntpd can't synchronize with a time server, so 
it's essentially not doing what it should.  I compared configuration 
files with another (working) server, and they're identical.  The ntpd 
executable is also identical.
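
To make the stratum reading concrete: in NTP, stratum 16 is the value a host reports when it is not synchronized to any source, while strata 1 through 15 indicate a synchronized clock. The sketch below checks an `ntptrace` line for that condition. The helper names `parse_ntptrace_line` and `is_synchronized` are hypothetical, and the regular expression assumes the output format shown above.

```python
import re

# Matches lines such as:
#   localhost: stratum 16, offset 0.000000, root distance 0.000000
NTPTRACE_RE = re.compile(
    r"(?P<host>\S+): stratum (?P<stratum>\d+), offset (?P<offset>-?\d+\.\d+)"
)


def parse_ntptrace_line(line: str) -> dict:
    """Extract host, stratum and offset from one line of ntptrace output."""
    match = NTPTRACE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized ntptrace line: {line!r}")
    return {
        "host": match["host"],
        "stratum": int(match["stratum"]),
        "offset": float(match["offset"]),
    }


def is_synchronized(stratum: int) -> bool:
    """Strata 1-15 mean synchronized; 16 means unsynchronized."""
    return 1 <= stratum <= 15


result = parse_ntptrace_line(
    "localhost: stratum 16, offset 0.000000, root distance 0.000000"
)
print(result["host"], "synchronized:", is_synchronized(result["stratum"]))
```

For the line above this reports that localhost is not synchronized, which matches my reading of the output.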

In any case, I figured I'd just shut down NTP and let the clock run on 
its own, perhaps resetting it every few days via 'ntpdate'.  I kill 
ntpd and the clock continues to lose time at the same rate as above.  
So it seems that NTP doesn't help (or hinder) the problem.

Out of curiosity, I checked the hardware clock.  It appears to be just 
fine; even after the system (software) clock drifted an hour and a 
half, the hardware clock was still dead-on:

# hwclock ; date
Sat Aug 14 10:28:06 2004  -0.052861 seconds
Sat Aug 14 08:43:56 MDT 2004
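
The gap between those two readings can be worked out with a short Python sketch. It assumes the `hwclock` and `date` output formats shown above, and that the hardware clock is displayed in the same local timezone as the system clock; the helper names `parse_hwclock` and `parse_date` are hypothetical.

```python
from datetime import datetime

STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_hwclock(line: str) -> datetime:
    """Parse e.g. 'Sat Aug 14 10:28:06 2004  -0.052861 seconds'."""
    fields = line.split()
    # Keep weekday, month, day, time, year; drop the trailing correction.
    return datetime.strptime(" ".join(fields[:5]), STAMP_FORMAT)


def parse_date(line: str) -> datetime:
    """Parse e.g. 'Sat Aug 14 08:43:56 MDT 2004', ignoring the zone name."""
    fields = line.split()
    # Drop the timezone token (fields[4]) so both stamps share one format.
    return datetime.strptime(" ".join(fields[:4] + fields[5:6]), STAMP_FORMAT)


hw = parse_hwclock("Sat Aug 14 10:28:06 2004  -0.052861 seconds")
sw = parse_date("Sat Aug 14 08:43:56 MDT 2004")
gap = hw - sw
print(f"system clock is behind the hardware clock by {gap.total_seconds():.0f} s")
```

For the two lines above the gap comes to 6250 seconds, or 1 hour 44 minutes 10 seconds.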

I have a hard time believing it's hardware-related, but I've never seen 
this kind of clock problem before.  NTP has always been rock-solid for 
me, so the 'ntptrace' problem above is a concern.  But even without NTP 
running, I shouldn't see this sort of thing.

Oh, and as a final note, the kernel is 2.6.7 and the software is 
absolutely identical to an installation on another PE750 server on the 
same LAN that works fine.  The only thing that's special about these 
two servers is that their second Ethernet ports (eth1) are connected to 
one another via a crossover cable.

Any ideas?

TIA,
Jeff