[lug] advice on a problem

Steve A Hart shart at colorado.edu
Sat Jul 31 08:24:16 MDT 2010


David,

Sorry, should have also said that I did swap out the cat5 cable for a 
known good one and the same thing happened.  Both systems are identical 
because I bought them for the same professor to use in her lab.  Each 
one usually has a different user working on it, but when the problem 
system went down, the user switched to the identical good system and ran 
things normally for at least a week.

Dan,

ethtool report looks normal:

Settings for eth0:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	Advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 24
	Transceiver: internal
	Auto-negotiation: on
	Current message level: 0x00000001 (1)
	Link detected: yes
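
Since this report is the kind of thing worth comparing between the problem 
box and the good one, here is a minimal sketch of how that check could be 
scripted. It assumes Python is available on the clients and that ethtool 
keeps the "Key: value" layout shown above. The names parse_ethtool and 
link_looks_healthy are hypothetical helpers, not part of ethtool, and the 
expected speed and duplex are simply the values from the report above.

```python
# Sketch only: parse the text printed by `ethtool eth0` into a dict.
# Assumes the "Key: value" layout shown in the report above.

SAMPLE = """Settings for eth0:
    Supported ports: [ TP MII ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
    Speed: 100Mb/s
    Duplex: Full
    PHYAD: 24
    Auto-negotiation: on
    Link detected: yes
"""


def parse_ethtool(text):
    """Return a dict of the single-line 'Key: value' fields.

    Continuation lines (the second row of link modes) have no colon
    and are dropped, so multi-line fields keep only their first row.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line or line.startswith("Settings for"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def link_looks_healthy(fields, speed="100Mb/s", duplex="Full"):
    """True if the link is up at the expected speed and duplex."""
    return (
        fields.get("Link detected") == "yes"
        and fields.get("Speed") == speed
        and fields.get("Duplex") == duplex
    )


if __name__ == "__main__":
    print(link_looks_healthy(parse_ethtool(SAMPLE)))
```

Feeding it the saved output from both machines would show quickly whether 
the negotiated settings ever differ, for example a fallback to half duplex.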

Jose,

Like I said, I've got 52 systems set up exactly the same way but with 
different hardware.  The other thing I forgot to mention is that besides 
the identical system, there is another system with the exact same 
motherboard, same amount of memory (might be different type), and 
similar Intel quad core processor, and that machine does continuous 
heavy processing without an issue.  I may ask the user to try his crazy 
Matlab code on this other system and not just the identical one.

D. Stimits,

I will look at possible heat causes.  I know the CPU is not running hot 
but I have not checked the video card or power supply.  Will do that asap!
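
One way to watch for heat over the days between lockups would be to log 
temperatures periodically and see what the last reading was before a 
freeze. A rough sketch, assuming the kernel exposes thermal zones under 
/sys/class/thermal with values in millidegrees Celsius. Whether any zone 
exists, and what it measures, depends on the hardware and drivers. The 
video card and power supply usually do not show up there, so this may only 
cover the CPU and board. read_thermal_zones is a hypothetical helper name.

```python
# Sketch only: read whatever thermal zones the kernel exposes in sysfs.
# Assumes /sys/class/thermal/thermal_zone*/temp holds millidegrees C.
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            # Zone missing, unreadable, or not reporting a number: skip it.
            continue
        temps[os.path.basename(zone)] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print("%s %.1f C" % (name, celsius))
```

Run from cron every minute or so and appended to a file on local disk (not 
the NFS mount), this would leave a record that survives the lockup.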

Ken,

The video cards in the two systems are the exact same model.  This could 
be a case of a bad video card that is overheating.  I will take a look 
at that!  I did swap out the video card at one point but the one I 
replaced it with was a lower end card and might not have been able to 
handle the graphics load.

Thanks for the great ideas everyone!  I really want to put this problem 
to bed and never hear about it ever again.

cheers

Steve


On 07/30/2010 10:10 PM, David L. Anselmi wrote:
> Steve A Hart wrote:
>> Out of the 52 clients, I have 1 system that is frequently locking up for
>> no apparent reason at all. Roughly 1-3 times per week this system just
>> goes belly up and locks like it lost the NFS connection to the server.
>> The kicker to this is that this system sits next to an exactly identical
>> system (hardware and software setup) which acts completely normal and
>> does not lock up. All logs on the problem system show no errors of any
>> kind. Also, both systems are plugged into the same switch so it's not a
>> network issue.
>
> If you swap the wires where they plug into the 2 machines you'll know
> the problem is in the box. Otherwise it could be in the cable or switch
> port.
>
>> My only commonality/trend I see is that when the system locks, the user
>> is running a heavy matlab script that displays 20+ plots one right after
>> the other so that all 20+ plots are visible in 20+ different windows.
>> This script would run successfully multiple times in a week and then at
>> some apparently random point, it locks. It should be noted that if this
>> same user runs the same code on the identical system, it runs fine
>> every time.
>
> Does it run on an identical system for weeks on end? Perhaps it's
> related to this particular workload when there's an inopportune burst of
> network traffic.
>
> If an identical system works consistently you could at least swap them
> and solve this user's problem.
>
> Probably there's some debugging you could turn on but I haven't done
> that. I'd be curious what's happening on the network when it locks.
>
> Dave

-- 
Steve Hart
Systems Administrator
Colorado Center for Astrodynamics Research
University of Colorado Boulder
shart at colorado.edu
(303)492-8109


