[lug] advice on a problem
Steve A Hart
shart at colorado.edu
Fri Jul 30 21:13:12 MDT 2010
Here's the setup: I've got 52 RHEL 5 32-bit clients attached to a RHEL
5 32-bit server, which shares out home directories and /usr/local to the
clients. /usr/local on the server is where programs like Matlab are
installed. One side effect of this setup is that if NFS is interrupted
in any way, the clients lock up. That is not my problem, just the setup
to my situation.
Out of the 52 clients, I have 1 system that is frequently locking up for
no apparent reason at all. Roughly 1-3 times per week this system just
goes belly up and locks like it lost the NFS connection to the server.
The kicker to this is that this system sits next to an exactly identical
system (hardware and software setup) which acts completely normal and
does not lock up. All logs on the problem system show no errors of any
kind. Also, both systems are plugged into the same switch so it's not a
network issue.
I'm trying to figure out if this system has a hardware issue or if
something else is going on. I've replaced and tested the installed Crucial
memory and it tests fine with Memtest86. I have two internal SATA
hard drives and they both test fine using the WD drive tester.
The only commonality/trend I see is that when the system locks, the user
is running a heavy Matlab script that displays 20+ plots one right after
the other so that all 20+ plots are visible in 20+ different windows.
This script would run successfully multiple times in a week and then at
some apparently random point, it locks. It should be noted that if this
same user runs the same code on the identical system, it runs fine
every time.
Also, I've reloaded this system a couple of times now, and this last time I
ran a multi-day test where I worked the system hard by doing the
following all at the same time:
1. Ran multiple glxgears instances
2. Ran multiple Flash-heavy websites (thought it might be a Flash issue)
3. Had Matlab open but not running code (I didn't have any usable
Matlab code to run)
I ran the above for three days straight and not a single hiccup from the
system. Gave it back to the user to use and within one week it locked up.
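
A burn-in like the one above could be scripted so that it is repeatable on both the problem system and its twin. The following is only a sketch, assuming a Python 3 interpreter is available (RHEL 5 does not ship one by default). The function name run_concurrently and its parameters are invented here for illustration, and glxgears is simply the example taken from the list.

```python
import subprocess
import sys
import time


def run_concurrently(cmd, copies, duration):
    """Start `copies` instances of `cmd`, let them run for `duration`
    seconds, then stop any that are still running.

    Returns one exit status per instance. On POSIX a negative status
    means the process was ended by a signal.
    """
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    print("started %d x %s" % (copies, cmd[0]), file=sys.stderr)
    try:
        time.sleep(duration)
    finally:
        for p in procs:
            if p.poll() is None:  # still running, so ask it to stop
                p.terminate()
    return [p.wait() for p in procs]

# Hypothetical usage: four glxgears instances for one hour.
# run_concurrently(["glxgears"], copies=4, duration=3600)
```

Note that this only automates item 1; it does not reproduce the 20+ plot windows from the user's Matlab script, which is the one workload seen at every lockup.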
I've got the latest NVIDIA driver loaded and running and the system is
fully updated. Here's some of the system info:
* GIGABYTE GA-EP45T-UD3LR motherboard
* Intel Core 2 Quad Q9550 Yorkfield 2.83GHz
* 8GB Crucial 240-Pin DDR3 SDRAM DDR3 1333 (PC3 10600)
* GIGABYTE GV-N95TOC-1GI GeForce 9500 GT video card
* 3Com Corporation 3c905C-TX/TX-M PCI card (wanted to make sure the
onboard NIC was not the culprit)
* 750W power supply
* 16GB of swap
I'm open to any and all ideas on this. I'm frankly out of ideas and the
system owners are getting frustrated. My only thought now is to replace
all hardware and see if that does the trick, but that seems to be an
extreme measure for an unknown cause. I'd kill for any error message that
would give me a clue as to what's happening.
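
Since the lockups leave nothing in the logs, one possible way to get at least some record is a small heartbeat that appends a timestamped line to a local disk (not the NFS-mounted home directory) and asks the kernel to flush it with fsync. The last line written would then bracket the time of the freeze and show load and free memory shortly before it. This is a sketch only, assuming a Python 3 interpreter (not RHEL 5's stock Python) and a Linux /proc. The function names, the log path, and the interval are made up for illustration.

```python
import os
import time


def snapshot():
    """Return one line: local time, load averages, and MemFree from /proc."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        load = "%.2f %.2f %.2f" % os.getloadavg()
    except (OSError, AttributeError):
        load = "n/a"
    mem_free = "n/a"
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemFree:"):
                    mem_free = " ".join(line.split()[1:])
                    break
    except OSError:
        pass  # not Linux, or /proc unavailable
    return "%s load=%s memfree=%s" % (stamp, load, mem_free)


def write_heartbeat(path):
    """Append one snapshot line, then flush and fsync it to disk."""
    with open(path, "a") as f:
        f.write(snapshot() + "\n")
        f.flush()
        os.fsync(f.fileno())


def run(path, interval=10):
    """Write a heartbeat every `interval` seconds until interrupted."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)

# Hypothetical usage: run("/var/tmp/heartbeat.log", interval=10)
```

Comparing the last heartbeat line on the problem system with the same log on its twin might show whether memory or load looked different just before a freeze, though it cannot by itself tell a hardware fault from a software one.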
Any thoughts would be appreciated.
cheers
Steve Hart
--
Steve Hart
Systems Administrator
Colorado Center for Astrodynamics Research
University of Colorado Boulder
shart at colorado.edu
(303)492-8109