[lug] advice on a problem

Steve A Hart shart at colorado.edu
Fri Jul 30 21:13:12 MDT 2010


Here's the setup:  I've got 52 RHEL 5 32-bit clients attached to a RHEL 
5 32-bit server which shares out home directories and /usr/local to the 
clients over NFS.  /usr/local on the server is where programs like 
Matlab are installed.  One side effect of this setup is that if NFS is 
interrupted in any way, the clients lock up.  That is not my problem, 
just the background to my situation.

Out of the 52 clients, I have 1 system that is frequently locking up for 
no apparent reason.  Roughly 1-3 times per week this system just goes 
belly up and locks as if it had lost the NFS connection to the server. 
The kicker is that this system sits next to an identical system (same 
hardware and software setup) which behaves normally and does not lock 
up.  The logs on the problem system show no errors of any kind.  Also, 
both systems are plugged into the same switch, so it's unlikely to be a 
network issue.
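
One way to check whether the lock-ups coincide with the NFS mounts going 
unresponsive is to probe them on a timer and log the result to local 
disk.  The following is a minimal sketch, not a tested fix.  It assumes 
Python 3 is available (the stock Python on RHEL 5 is older), and the 
function names and the 5-second timeout are illustrative.  The stat call 
runs in a daemon thread because a stat on a hung, hard-mounted NFS path 
can block indefinitely.

```python
import os
import threading
import time


def probe_path(path, timeout=5.0):
    """Stat `path` in a helper thread; return (status, seconds).

    status is "ok", "error" (stat raised OSError), or "timeout"
    (stat had not returned after `timeout` seconds).
    """
    outcome = {}

    def _stat():
        try:
            os.stat(path)
            outcome["status"] = "ok"
        except OSError:
            outcome["status"] = "error"

    worker = threading.Thread(target=_stat, daemon=True)
    start = time.monotonic()
    worker.start()
    worker.join(timeout)  # stop waiting, but leave the stuck thread behind
    elapsed = time.monotonic() - start
    if worker.is_alive():
        return "timeout", elapsed
    return outcome["status"], elapsed


def log_probes(paths, log_file, timeout=5.0):
    """Append one timestamped line per path to a log kept on LOCAL disk."""
    with open(log_file, "a") as log:
        for path in paths:
            status, elapsed = probe_path(path, timeout)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {path} {status} {elapsed:.2f}s\n")
        log.flush()
        os.fsync(log.fileno())  # push the lines to disk before a possible hang
```

Calling log_probes(["/usr/local"], "/var/tmp/nfs-probe.log") once a 
minute, for example from cron, would leave a record of whether the mount 
stopped answering before the machine locked.  The log path here is an 
assumption; any directory on a local disk would do.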

I'm trying to figure out if this system has a hardware issue or if 
something else is going on.  I've replaced the installed Crucial memory 
and tested it with Memtest86; it tests fine.  I have two internal SATA 
hard drives and they both test fine using the WD drive tester.

The only trend I see is that when the system locks, the user is running 
a heavy Matlab script that displays 20+ plots one right after the other 
so that all 20+ plots are visible in 20+ different windows.  This script 
runs successfully multiple times in a week and then, at some apparently 
random point, the system locks.  It should be noted that if the same 
user runs the same code on the identical system, it runs fine every 
time.

Also, I've reloaded this system a couple of times now, and this last 
time I ran a multi-day test where I worked the system hard by doing the 
following all at the same time:

1.  Ran multiple glxgears instances
2.  Ran multiple Flash-heavy websites (thought it might be a Flash issue)
3.  Had Matlab open but not running code (I didn't have any usable 
Matlab code to run)

I ran the above for three days straight without a single hiccup from 
the system.  I gave it back to the user, and within one week it locked 
up.
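
A combined load test like the one above could be scripted so that it is 
repeatable and records which program, if any, dies before the run ends. 
This is a minimal sketch, assuming Python 3 is available on the test 
machine.  run_concurrently is a hypothetical helper, and the commands 
passed to it are placeholders for whatever workload is being tested.

```python
import subprocess
import time


def run_concurrently(commands, duration, poll_interval=0.1):
    """Start every command at once and keep them running for `duration` seconds.

    Returns {index: returncode} for commands that exited on their own
    before the deadline; the rest are terminated at the end.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    exited_early = {}
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for index, proc in enumerate(procs):
            if index not in exited_early and proc.poll() is not None:
                exited_early[index] = proc.returncode
        time.sleep(poll_interval)
    # Stop whatever is still running once the test window is over.
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()
            proc.wait()
    return exited_early
```

For example, run_concurrently([["glxgears"], ["glxgears"]], 3 * 24 * 3600) 
would keep two glxgears instances running for three days.  A script that 
opens many plot windows in quick succession would be a closer match to 
the failing workload than glxgears alone.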

I've got the latest NVIDIA driver loaded and running and the system is 
fully updated.  Here's some of the system info:

* GIGABYTE GA-EP45T-UD3LR motherboard
* Intel Core 2 Quad Q9550 Yorkfield 2.83GHz
* 8GB Crucial 240-Pin DDR3 SDRAM DDR3 1333 (PC3 10600)
* GIGABYTE GV-N95TOC-1GI GeForce 9500 GT video card
* 3Com Corporation 3c905C-TX/TX-M PCI card (wanted to make sure the 
onboard NIC was not the culprit)
* 750W power supply
* 16GB of swap

I'm open to any and all ideas on this.  I'm frankly out of ideas and the 
system owners are getting frustrated.  My only thought now is to replace 
all the hardware and see if that does the trick, but that seems an 
extreme measure for an undiagnosed problem.  I'd kill for any error 
message that would give me a clue as to what's happening.
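
When the logs are silent, one low-cost way to get at least a timestamp 
and a rough picture of the machine's state just before a lock-up is a 
heartbeat written to local disk.  This is a minimal sketch, assuming 
Python 3 and a Linux /proc filesystem.  The function names and the 
choice of fields are illustrative, not part of the setup described here.

```python
import os
import time


def read_meminfo(fields=("MemFree", "SwapFree"), path="/proc/meminfo"):
    """Return {field: kB} for the requested fields of a meminfo-style file."""
    values = {}
    try:
        with open(path) as meminfo:
            for line in meminfo:
                name, _, rest = line.partition(":")
                if name in fields:
                    values[name] = int(rest.split()[0])
    except OSError:
        pass  # file unavailable: the heartbeat then reports load only
    return values


def heartbeat_line():
    """Build one line: timestamp, 1-minute load average, free memory and swap."""
    load1, _load5, _load15 = os.getloadavg()
    parts = [time.strftime("%Y-%m-%d %H:%M:%S"), f"load1={load1:.2f}"]
    for name, kb in sorted(read_meminfo().items()):
        parts.append(f"{name}={kb}kB")
    return " ".join(parts)


def write_heartbeat(log_file):
    """Append a heartbeat line and force it to disk right away."""
    with open(log_file, "a") as log:
        log.write(heartbeat_line() + "\n")
        log.flush()
        os.fsync(log.fileno())
```

Run once a minute (from cron, say) with a log file on a local disk, the 
last line written narrows down when the hang began, and a steady drop in 
MemFree or SwapFree beforehand would point toward memory exhaustion 
rather than NFS.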

Any thoughts would be appreciated.

cheers

Steve Hart

-- 
Steve Hart
Systems Administrator
Colorado Center for Astrodynamics Research
University of Colorado Boulder
shart at colorado.edu
(303)492-8109


