[lug] advice on a problem
Steve A Hart
shart at colorado.edu
Sat Jul 31 08:24:16 MDT 2010
David,
Sorry, should have also said that I did swap out the cat5 cable for a
known good one and the same thing happened. Both systems are identical
because I bought them for the same professor to use in her lab. Each
one usually has a different user working off it, but when the problem
system went down, the user switched to the identical good system and ran
things normally for at least a week.
Dan,
ethtool report looks normal:
Settings for eth0:
    Supported ports: [ TP MII ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: MII
    PHYAD: 24
    Transceiver: internal
    Auto-negotiation: on
    Current message level: 0x00000001 (1)
    Link detected: yes
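
A report that "looks normal" is easier to trust when it is compared field
by field against the same report from the good system. A minimal sketch of
that comparison, assuming the ethtool output from each machine has been
saved as text; parse_ethtool and diff_settings are hypothetical helper
names, not part of ethtool:

```python
def parse_ethtool(text):
    """Parse the text of an `ethtool <iface>` report into {field: value}."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Settings for"):
            continue
        if ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # Continuation line, e.g. the second row of link modes.
            fields[last_key] = (fields[last_key] + " " + line).strip()
    return fields


def diff_settings(good, bad):
    """Return {field: (good_value, bad_value)} for every field that differs."""
    keys = sorted(set(good) | set(bad))
    return {k: (good.get(k), bad.get(k)) for k in keys
            if good.get(k) != bad.get(k)}
```

An empty result from diff_settings would only mean the negotiated link
settings match; it says nothing about errors on the wire or in the driver.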
Jose,
Like I said, I've got 52 systems set up the exact same way but with
different hardware. The other thing I forgot to mention is that besides
the identical system, there is another system with the exact same
motherboard, same amount of memory (might be different type), and
similar Intel quad core processor, and that machine does continuous
heavy processing without an issue. I may ask the user to try his crazy
Matlab code on this other system and not just the identical one.
D. Stimits,
I will look at possible heat causes. I know the CPU is not running hot
but I have not checked the video card or power supply. Will do that asap!
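
Because the lockups are intermittent, a timestamped temperature log that
is flushed to local disk could show whether heat builds up before a hang.
A minimal sketch, assuming the kernel exposes sensors under
/sys/class/thermal (not every board or kernel does, and video card or
power supply temperatures usually need other tools); the function names
and the log file name are hypothetical:

```python
import glob
import os
import time


def read_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone."""
    temps = {}
    pattern = os.path.join(base, "thermal_zone*", "temp")
    for path in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as fh:
                # The kernel reports millidegrees Celsius.
                temps[zone] = int(fh.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # Skip zones that cannot be read or parsed.
    return temps


def log_forever(logfile="temps.log", interval=10):
    """Append one timestamped reading every `interval` seconds."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as fh:
            fh.write("%s %s\n" % (stamp, read_temps()))
            fh.flush()
            os.fsync(fh.fileno())  # Push the line to disk before a possible hang.
        time.sleep(interval)
```

Calling log_forever() with the log on a local disk rather than an NFS
mount keeps the last readings available after a lockup.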
Ken,
The video cards in the two systems are the exact same model. This could
be a case of a bad video card that is overheating. I will take a look
at that! I did swap out the video card at one point but the one I
replaced it with was a lower end card and might not have been able to
handle the graphics load.
Thanks for the great ideas everyone! I really want to put this problem
to bed and never hear about it ever again.
cheers
Steve
On 07/30/2010 10:10 PM, David L. Anselmi wrote:
> Steve A Hart wrote:
>> Out of the 52 clients, I have 1 system that is frequently locking up for
>> no apparent reason at all. Roughly 1-3 times per week this system just
>> goes belly up and locks like it lost the NFS connection to the server.
>> The kicker to this is that this system sits next to an exactly identical
>> system (hardware and software setup) which acts completely normal and
>> does not lock up. All logs on the problem system show no errors of any
>> kind. Also, both systems are plugged into the same switch so it's not a
>> network issue.
>
> If you swap the wires where they plug into the 2 machines you'll know
> the problem is in the box. Otherwise it could be in the cable or switch
> port.
>
>> My only commonality/trend I see is that when the system locks, the user
>> is running a heavy matlab script that displays 20+ plots one right after
>> the other so that all 20+ plots are visible in 20+ different windows.
>> This script would run successfully multiple times in a week and then at
>> some apparently random point, it locks. It should be noted that if this
>> same user runs the same code on the identical system, it runs fine
>> every time.
>
> Does it run on an identical system for weeks on end? Perhaps it's
> related to this particular workload when there's an inopportune burst of
> network traffic.
>
> If an identical system works consistently you could at least swap them
> and solve this user's problem.
>
> Probably there's some debugging you could turn on but I haven't done
> that. I'd be curious what's happening on the network when it locks.
>
> Dave
--
Steve Hart
Systems Administrator
Colorado Center for Astrodynamics Research
University of Colorado Boulder
shart at colorado.edu
(303)492-8109