[lug] advice on a problem
Lee Woodworth
blug-mail at duboulder.com
Sat Jul 31 09:11:23 MDT 2010
On 07/31/10 08:24, Steve A Hart wrote:
> David,
>
> Sorry, should have also said that I did swap out the cat5 cable for a
> known good one and the same thing happened. Both systems are identical
> because I bought them for the same professor to use in her lab. Each
> one usually has a different user working off it but when the problem
> system went down, the user switched to the identical good system and ran
> things normally for at least a week.
Did you keep the same network cables but swap the switch ports? I've had
a single port on a switch go flaky with no error messages at all showing
on the affected system. If the matlab scripts cause lots of NFS activity
via disk ops, that might be how a network issue shows up under load.
Does the switch itself have logging facilities? Maybe it has info about the port.
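
If the switch has no logging, one way to catch a flaky port from the client side is to watch the interface error counters the Linux kernel exposes under /sys/class/net/<iface>/statistics. Below is a rough sketch, not something from this thread: the counter selection, the function names, and the 60-second interval are my own assumptions, and a bad port can still misbehave without bumping any of these counters.

```python
import time
from pathlib import Path

# Counters that tend to rise when a cable, NIC, or switch port misbehaves.
# The selection is an assumption; the kernel exposes many more.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped",
            "rx_crc_errors", "collisions")


def read_counters(iface="eth0", root="/sys/class/net"):
    """Return {counter_name: value} for one interface, skipping missing files."""
    stats_dir = Path(root) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def changed_counters(before, after):
    """Return only the counters that increased between two snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] > before[name]}


def watch(iface="eth0", interval=60):
    """Print a timestamped line whenever any error counter increases."""
    previous = read_counters(iface)
    while True:
        time.sleep(interval)
        current = read_counters(iface)
        delta = changed_counters(previous, current)
        if delta:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), iface, delta, flush=True)
        previous = current
```

Calling watch("eth0") on both the problem box and the good one, with output redirected to a local (non-NFS) file, would give timestamps to compare against the next lockup.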
>
> Dan,
>
> ethtool report looks normal:
>
> Settings for eth0:
> Supported ports: [ TP MII ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 100Mb/s
> Duplex: Full
> Port: MII
> PHYAD: 24
> Transceiver: internal
> Auto-negotiation: on
> Current message level: 0x00000001 (1)
> Link detected: yes
>
> Jose,
>
> Like I said, I've got 52 systems set up the exact same way but with
> different hardware. The other thing I forgot to mention is that besides
> the identical system, there is another system with the exact same
> motherboard, same amount of memory (might be different type), and
> similar Intel quad core processor, and that machine does continuous
> heavy processing without an issue. I may ask the user to try his crazy
> Matlab code on this other system and not just the identical one.
>
> D. Stimits,
>
> I will look at possible heat causes. I know the CPU is not running hot
> but I have not checked the video card or power supply. Will do that asap!
>
> Ken,
>
> The video cards in the two systems are the exact same model. This could
> be a case of a bad video card that is overheating. I will take a look
> at that! I did swap out the video card at one point but the one I
> replaced it with was a lower end card and might not have been able to
> handle the graphics load.
>
> Thanks for the great ideas everyone! I really want to put this problem
> to bed and never hear about it ever again.
>
> cheers
>
> Steve
>
>
> On 07/30/2010 10:10 PM, David L. Anselmi wrote:
>> Steve A Hart wrote:
>>> Out of the 52 clients, I have 1 system that is frequently locking up for
>>> no apparent reason at all. Roughly 1-3 times per week this system just
>>> goes belly up and locks like it lost the NFS connection to the server.
>>> The kicker to this is that this system sits next to an exactly identical
>>> system (hardware and software setup) which acts completely normal and
>>> does not lock up. All logs on the problem system show no errors of any
>>> kind. Also, both systems are plugged into the same switch so it's not a
>>> network issue.
>>
>> If you swap the wires where they plug into the 2 machines you'll know
>> the problem is in the box. Otherwise it could be in the cable or switch
>> port.
>>
>>> My only commonality/trend I see is that when the system locks, the user
>>> is running a heavy matlab script that displays 20+ plots one right after
>>> the other so that all 20+ plots are visible in 20+ different windows.
>>> This script would run successfully multiple times in a week and then at
>>> some apparently random point, it locks. It should be noted that if this
>>> same user runs the same code on the identical system, it runs fine
>>> every time.
>>
>> Does it run on an identical system for weeks on end? Perhaps it's
>> related to this particular workload when there's an inopportune burst of
>> network traffic.
>>
>> If an identical system works consistently you could at least swap them
>> and solve this user's problem.
>>
>> Probably there's some debugging you could turn on but I haven't done
>> that. I'd be curious what's happening on the network when it locks.
>>
>> Dave
>