[lug] advice on a problem

Steve A Hart shart at colorado.edu
Sat Jul 31 09:16:13 MDT 2010


Lee,

When I swapped the cable, I also moved it to a different port on the 
same switch.  Something interesting Dan Ferris had me check was the RX 
packets in ifconfig:

For my problem system:
          RX packets:243655 errors:0 dropped:0 overruns:104 frame:0
          TX packets:153291 errors:0 dropped:0 overruns:0 carrier:0

For the good identical system on the same switch:
          RX packets:4903982 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4154286 errors:0 dropped:0 overruns:0 carrier:0

Not sure yet what's causing the overruns but it might finally lead to 
the root of this problem!  Any sort of network issue on these systems 
will cause them to lock up.
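
For anyone wanting to run the same comparison across many clients, here is a minimal sketch that pulls the counters out of ifconfig-style lines and reports any that are nonzero. It assumes the "name:value" layout shown above (newer ifconfig versions print a different format), and the function names are hypothetical, not part of any tool mentioned in this thread.

```python
import re

def parse_counters(line):
    """Split an ifconfig-style 'RX packets:...' or 'TX packets:...' line
    into its direction ("RX"/"TX") and a dict of integer counters."""
    direction = line.split()[0]
    fields = {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}
    return direction, fields

def nonzero_problems(line):
    """Return every counter other than 'packets' that is greater than zero."""
    _, fields = parse_counters(line)
    return {name: value for name, value in fields.items()
            if name != "packets" and value > 0}

# The two RX lines quoted above.
problem_rx = "RX packets:243655 errors:0 dropped:0 overruns:104 frame:0"
good_rx = "RX packets:4903982 errors:0 dropped:0 overruns:0 frame:0"

print(nonzero_problems(problem_rx))  # {'overruns': 104}
print(nonzero_problems(good_rx))     # {}
```

Note that 104 overruns out of 243655 received packets is a small fraction, so this only flags that the counter is nonzero; it says nothing about the cause.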

Steve

On 07/31/2010 09:11 AM, Lee Woodworth wrote:
> On 07/31/10 08:24, Steve A Hart wrote:
>> David,
>>
>> Sorry, should have also said that I did swap out the cat5 cable for a
>> known good one and the same thing happened.  Both systems are identical
>> because I bought them for the same professor to use in her lab.  Each
>> one usually has a different user working off it but when the problem
>> system went down, the user switched to the identical good system and ran
>> things normally for at least a week.
>
> Did you keep the same network cables but swap the switch ports? I've had
> a single port on a switch go flaky with no error messages at all showing
> on the affected system. If the matlab scripts cause lots of NFS activity
> via disk ops that might be how a network issue shows up under load.
>
> Does the switch itself have logging facilities? Maybe it has info about the port.
>
>>
>> Dan,
>>
>> ethtool report looks normal:
>>
>> Settings for eth0:
>> 	Supported ports: [ TP MII ]
>> 	Supported link modes:   10baseT/Half 10baseT/Full
>> 	                        100baseT/Half 100baseT/Full
>> 	Supports auto-negotiation: Yes
>> 	Advertised link modes:  10baseT/Half 10baseT/Full
>> 	                        100baseT/Half 100baseT/Full
>> 	Advertised auto-negotiation: Yes
>> 	Speed: 100Mb/s
>> 	Duplex: Full
>> 	Port: MII
>> 	PHYAD: 24
>> 	Transceiver: internal
>> 	Auto-negotiation: on
>> 	Current message level: 0x00000001 (1)
>> 	Link detected: yes
>>
>> Jose,
>>
>> Like I said, I've got 52 systems set up the same exact way but with
>> different hardware.  The other thing I forgot to mention is that besides
>> the identical system, there is another system with the exact same
>> motherboard, same amount of memory (might be different type), and
>> similar Intel quad core processor, and that machine does continuous
>> heavy processing without an issue.  I may ask the user to try his crazy
>> Matlab code on this other system and not just the identical one.
>>
>> D. Stimits,
>>
>> I will look at possible heat causes.  I know the CPU is not running hot
>> but I have not checked the video card or power supply.  Will do that asap!
>>
>> Ken,
>>
>> The video cards in the two systems are the exact same model.  This could
>> be a case of a bad video card that is overheating.  I will take a look
>> at that!  I did swap out the video card at one point but the one I
>> replaced it with was a lower end card and might not have been able to
>> handle the graphics load.
>>
>> Thanks for the great ideas everyone!  I really want to put this problem
>> to bed and never hear about it ever again.
>>
>> cheers
>>
>> Steve
>>
>>
>> On 07/30/2010 10:10 PM, David L. Anselmi wrote:
>>> Steve A Hart wrote:
>>>> Out of the 52 clients, I have 1 system that is frequently locking up for
>>>> no apparent reason at all. Roughly 1-3 times per week this system just
>>>> goes belly up and locks like it lost the NFS connection to the server.
>>>> The kicker to this is that this system sits next to an exactly identical
>>>> system (hardware and software setup) which acts completely normal and
>>>> does not lock up. All logs on the problem system show no errors of any
>>>> kind. Also, both systems are plugged into the same switch so it's not a
>>>> network issue.
>>>
>>> If you swap the wires where they plug into the 2 machines you'll know
>>> the problem is in the box. Otherwise it could be in the cable or switch
>>> port.
>>>
>>>> My only commonality/trend I see is that when the system locks, the user
>>>> is running a heavy matlab script that displays 20+ plots one right after
>>>> the other so that all 20+ plots are visible in 20+ different windows.
>>>> This script would run successfully multiple times in a week and then at
>>>> some apparently random point, it locks. It should be noted that if this
>>>> same user runs the same code on the identical system, it runs fine
>>>> every time.
>>>
>>> Does it run on an identical system for weeks on end? Perhaps it's
>>> related to this particular workload when there's an inopportune burst of
>>> network traffic.
>>>
>>> If an identical system works consistently you could at least swap them
>>> and solve this user's problem.
>>>
>>> Probably there's some debugging you could turn on but I haven't done
>>> that. I'd be curious what's happening on the network when it locks.
>>>
>>> Dave
>>

-- 
Steve Hart
Systems Administrator
Colorado Center for Astrodynamics Research
University of Colorado Boulder
shart at colorado.edu
(303)492-8109


