[lug] advice on a problem

Maxwell Spangler maxlists at maxwellspangler.com
Fri Jul 30 23:55:38 MDT 2010


On Fri, 2010-07-30 at 21:13 -0600, Steve A Hart wrote:
> Out of the 52 clients, I have 1 system that is frequently locking up for 
> no apparent reason at all.  Roughly 1-3 times per week this system just 
> goes belly up and locks like it lost the NFS connection to the server. 
> The kicker to this is that this system sits next to an exactly identical 
> system (hardware and software setup) which acts completely normal and 
> does not lock up.  All logs on the problem system show no errors of any 
> kind.  Also, both systems are plugged into the same switch so it's not a 
> network issue.

> My only commonality/trend I see is that when the system locks, the user 
> is running a heavy matlab script that displays 20+ plots one right after 
> the other so that all 20+ plots are visible in 20+ different windows. 
> This script would run successfully multiple times in a week and then at 
> some apparently random point, it locks.  It should be noted that if this 
> same users runs the same code on the identical system, it runs fine 
> every time.
> 
> Also, I've reloaded this system a couple times now and this last time I 
> ran a multi day test where I worked the system hard by doing the 
> following all at the same time:
> 
> 1.  multiple glxgears
> 2.  Ran multiple flash-heavy websites (thought it might be a flash issue)
> 3.  had Matlab open but not running code (I didn't have any usable 
> matlab codes to run)

What exactly happens?  You said "belly up" yet suggested "as if it's lost
connection to the NFS server."  If the NFS connection is down, can't you
still log in locally as root, or keep a shell open and switch to it for
diagnostics?  Or is the whole system locked, with nothing responding?
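
If you can't be at the console when it happens, a small heartbeat logger
writing to the local disk might help tell those two cases apart after the
fact.  This is only a rough sketch: the function names, the 5 second
timeout and the paths in the usage note are my assumptions, not anything
from your setup.  If the log keeps ticking but says "NO RESPONSE", NFS
hung; if the log simply stops, the whole machine locked.

```python
import os
import subprocess
import sys
import time


def mount_responds(path, timeout=5.0):
    """Return True if a stat() of `path` finishes within `timeout` seconds.

    The stat runs in a child process so that a hung NFS mount blocks the
    child, not this logger.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Do not wait for the child here: a process stuck on a hard NFS
        # mount may not die until the server comes back.
        child.kill()
        return False


def heartbeat_line(path, timeout=5.0, now=None):
    """Build one log line: local timestamp plus the mount status."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    status = "ok" if mount_responds(path, timeout) else "NO RESPONSE"
    return f"{stamp} nfs={status}"


def log_heartbeats(mount, log_path, interval=60.0, count=None):
    """Append a heartbeat line every `interval` seconds.

    count=None means run until killed.
    """
    written = 0
    while count is None or written < count:
        with open(log_path, "a") as log:
            log.write(heartbeat_line(mount) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a lockup
        written += 1
        time.sleep(interval)
```

You would leave something like log_heartbeats("/home", "/var/tmp/heartbeat.log")
running in the background.  Both paths are hypothetical; the one thing
that matters is that the log lives on local disk, not on the NFS mount.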

Could be environment:

You tested it in your lab but not at the user's site?
*	Could the user have bad power causing brownouts or spikes?
*	Could the user have heat problems causing overheating?
*	Bad wall wiring interfering with network comm and causing glitches?
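
For the heat question, one cheap check is to record temperatures at the
user's site and see what they looked like just before a lockup.  A
minimal sketch, assuming the machine exposes the Linux thermal zones
under /sys/class/thermal (not every box does, and the function name is
made up for illustration):

```python
import glob
import os


def read_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone.

    The kernel reports each zone's `temp` file in millidegrees Celsius.
    """
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable; skip it
        temps[os.path.basename(zone)] = millideg / 1000.0
    return temps
```

Calling read_temps() from cron every minute and appending the result to
a local file would give a temperature history to look at after the next
lockup.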

Could be software:

*	You tested it in a variety of ways, but it sounds like you didn't test
it with Matlab actually running code.  Could there be some small thing
that makes this installation (combination of software) different from
the others?
*	Test it at your clean site (without potential power/heat/etc. issues),
but with his software and test suite, to see if the software tests out.
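
One way to hunt for that "small thing" is to dump the package list on
both machines and diff them.  A rough sketch, assuming you have saved
each list to a text file with one package per line (for example the
output of rpm -qa on an RPM-based distro; I don't know what you run).
The file names in the usage note and the function names are hypothetical:

```python
def read_package_list(path):
    """Read one package name per line, ignoring blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def compare_package_lists(good, bad):
    """Return (only_on_good, only_on_bad) as sorted lists."""
    return sorted(good - bad), sorted(bad - good)


def report(good, bad):
    """Format the differences as text, one package per line."""
    only_good, only_bad = compare_package_lists(good, bad)
    lines = [f"only on good machine: {pkg}" for pkg in only_good]
    lines += [f"only on bad machine:  {pkg}" for pkg in only_bad]
    return "\n".join(lines) if lines else "package lists are identical"
```

Usage would be along the lines of
print(report(read_package_list("good.txt"), read_package_list("bad.txt"))).
Identical package lists don't prove the installs match (config files can
still differ), but a difference here is a quick lead.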

To eliminate software, how about pulling a hard drive from an identical
machine (or cloning it) and running its working installation of
OS/apps on the questionable machine?  If problems still show up, it
should be hardware.

Could be the user:

* Does this user want to switch to Macintosh or something and is just
causing problems? :-)



Whatever you do, don't spill water on it, it could multiply.

-- 
Maxwell Spangler
========================================================================
        Linux, Unix and Database Administration
        Currently: Boulder, Colorado
        LinkedIn: http://www.linkedin.com/in/maxwellspangler

        



