[lug] advice on a problem

Lee Woodworth blug-mail at duboulder.com
Sat Jul 31 09:11:23 MDT 2010


On 07/31/10 08:24, Steve A Hart wrote:
> David,
> 
> Sorry, should have also said that I did swap out the cat5 cable for a 
> known good one and the same thing happened.  Both systems are identical 
> because I bought them for the same professor to use in her lab.  Each 
> one usually has a different user working off it but when the problem 
> system went down, the user switched to the identical good system and ran 
> things normally for at least a week.

Did you keep the same network cables but swap the switch ports? I've had
a single port on a switch go flaky with no error messages at all showing
on the affected system. If the Matlab scripts cause lots of NFS activity
via disk ops, that might be how a network issue shows up under load.

Does the switch itself have logging facilities? Maybe it has info about the port.
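
If the switch has nothing useful, one way to catch a flaky port from the
client side is to sample the interface counters yourself. Below is a rough
Python sketch, not something I've run on your setup. It assumes a Linux
client where /proc/net/dev has the usual 16 counter columns; the function
names, the eth0 default, and the 60 second interval are just placeholders
I made up for illustration.

```python
import time

# Column order of /proc/net/dev after the "iface:" prefix (assumed: the
# usual Linux layout of 8 receive counters followed by 8 transmit counters).
COLUMNS = (
    "rx_bytes", "rx_packets", "rx_errs", "rx_drop", "rx_fifo", "rx_frame",
    "rx_compressed", "rx_multicast",
    "tx_bytes", "tx_packets", "tx_errs", "tx_drop", "tx_fifo", "tx_colls",
    "tx_carrier", "tx_compressed",
)

# Counters that should stay flat on a healthy full-duplex link.
ERROR_FIELDS = ("rx_errs", "rx_drop", "rx_frame",
                "tx_errs", "tx_drop", "tx_colls", "tx_carrier")


def parse_proc_net_dev(text, iface):
    """Return a dict of counters for iface, or None if it is not listed."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            values = [int(v) for v in rest.split()]
            return dict(zip(COLUMNS, values))
    return None


def error_deltas(before, after):
    """Return only the error-type counters that increased between samples."""
    return {k: after[k] - before[k]
            for k in ERROR_FIELDS if after[k] > before[k]}


def read_counters(iface, path="/proc/net/dev"):
    """Read the current counters for iface, failing loudly if it is absent."""
    with open(path) as f:
        counters = parse_proc_net_dev(f.read(), iface)
    if counters is None:
        raise ValueError("interface %s not found in %s" % (iface, path))
    return counters


def watch(iface="eth0", interval=60):
    """Print a timestamped line whenever an error counter increases."""
    previous = read_counters(iface)
    while True:
        time.sleep(interval)
        current = read_counters(iface)
        changed = error_deltas(previous, current)
        if changed:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), iface, changed,
                  flush=True)
        previous = current
```

Calling watch("eth0") from a small script and redirecting the output to a
file on local disk (not on the NFS mount) should leave the last few lines
readable after a lockup. Running it on the good system too would give you a
baseline to compare against.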

> 
> Dan,
> 
> ethtool report looks normal:
> 
> Settings for eth0:
> 	Supported ports: [ TP MII ]
> 	Supported link modes:   10baseT/Half 10baseT/Full
> 	                        100baseT/Half 100baseT/Full
> 	Supports auto-negotiation: Yes
> 	Advertised link modes:  10baseT/Half 10baseT/Full
> 	                        100baseT/Half 100baseT/Full
> 	Advertised auto-negotiation: Yes
> 	Speed: 100Mb/s
> 	Duplex: Full
> 	Port: MII
> 	PHYAD: 24
> 	Transceiver: internal
> 	Auto-negotiation: on
> 	Current message level: 0x00000001 (1)
> 	Link detected: yes
> 
> Jose,
> 
> Like I said, I've got 52 systems set up the same exact way but with 
> different hardware.  The other thing I forgot to mention is that besides 
> the identical system, there is another system with the exact same 
> motherboard, same amount of memory (might be different type), and 
> similar Intel quad core processor, and that machine does continuous 
> heavy processing without an issue.  I may ask the user to try his crazy 
> Matlab code on this other system and not just the identical one.
> 
> D. Stimits,
> 
> I will look at possible heat causes.  I know the CPU is not running hot 
> but I have not checked the video card or power supply.  Will do that asap!
> 
> Ken,
> 
> The video cards in the two systems are the exact same model.  This could 
> be a case of a bad video card that is overheating.  I will take a look 
> at that!  I did swap out the video card at one point but the one I 
> replaced it with was a lower end card and might not have been able to 
> handle the graphics load.
> 
> Thanks for the great ideas everyone!  I really want to put this problem 
> to bed and never hear about it ever again.
> 
> cheers
> 
> Steve
> 
> 
> On 07/30/2010 10:10 PM, David L. Anselmi wrote:
>> Steve A Hart wrote:
>>> Out of the 52 clients, I have 1 system that is frequently locking up for
>>> no apparent reason at all. Roughly 1-3 times per week this system just
>>> goes belly up and locks like it lost the NFS connection to the server.
>>> The kicker to this is that this system sits next to an exactly identical
>>> system (hardware and software setup) which acts completely normal and
>>> does not lock up. All logs on the problem system show no errors of any
>>> kind. Also, both systems are plugged into the same switch so it's not a
>>> network issue.
>>
>> If you swap the wires where they plug into the 2 machines you'll know
>> the problem is in the box. Otherwise it could be in the cable or switch
>> port.
>>
>>> My only commonality/trend I see is that when the system locks, the user
>>> is running a heavy matlab script that displays 20+ plots one right after
>>> the other so that all 20+ plots are visible in 20+ different windows.
>>> This script would run successfully multiple times in a week and then at
>>> some apparently random point, it locks. It should be noted that if this
>>> same user runs the same code on the identical system, it runs fine
>>> every time.
>>
>> Does it run on an identical system for weeks on end? Perhaps it's
>> related to this particular workload when there's an inopportune burst of
>> network traffic.
>>
>> If an identical system works consistently you could at least swap them
>> and solve this user's problem.
>>
>> Probably there's some debugging you could turn on but I haven't done
>> that. I'd be curious what's happening on the network when it locks.
>>
>> Dave
> 



