[lug] How bad is this?

Nate Duehr nate at natetech.com
Thu Jan 23 13:17:24 MST 2003


I recently fought some similar issues with bad add-on ECC RAM from
SimpleTech in a Compaq DL360.

We didn't even get the nice errors you got... just random lockups, and
finally, during one of the lockups, a "received NMI" (non-maskable
interrupt) message on a console that hinted the problem might be RAM.

Going back to the original 512MB of RAM from Compaq in that box straightened
it right out.

Hardware issues are always.... um, "entertaining"...

Nate Duehr, nate at natetech.com
----- Original Message -----
From: "Ferdinand Schmid" <fschmid at archenergy.com>
To: <lug at lug.boulder.co.us>
Sent: Wednesday, January 22, 2003 3:50 PM
Subject: Re: [lug] How bad is this?


> Almost two weeks later...  sorry for the late reply, I am buried in
> work.  But I have seen this error recently on a Dual Athlon system
> using ECC memory.  The final answer to the problem was a bad memory
> slot on the motherboard.
>
> ECC memory seems very efficient at recovering from errors (hence the
> name).  So most of the time our system ran fine - just slower than it
> should since it had to do lots of error correction.  Sometimes,
> however, it would croak under heavy load (both CPUs 100% busy for
> anywhere from several minutes to 60 hours).
>
> Successful troubleshooting meant removing one memory module (DIMM) at
> a time and running on 3GB in our case.  E.g. remove memory from slot 1
> and test, re-insert into slot 1 and remove from slot 2, ...  Of course
> the last slot on the board was bad.  We determined whether it was the
> slot or the memory (DIMM) by inserting known good memory into the slot
> in question and testing: the board crashed again.
>
> The little information I found regarding this issue came from the
> Linux kernel list.  But since I couldn't find a straight answer on
> Google when I was looking for this info myself, I decided to answer
> this post despite the long delay.
>
> Ferdinand
>
> --On Thursday, January 09, 2003 03:11:52 PM -0700 Gary Hodges
> <Gary.Hodges at noaa.gov> wrote:
>
> > Jan  9 12:27:06 space kernel: CPU 0: Machine Check Exception: 0000000000000004
> > Jan  9 12:27:06 space kernel: Bank 0: c436c00000000833 at 0000000005933c40
> > Jan  9 12:27:06 space kernel: Bank 1: f600200000000853 at 0000000001f531c0
> > Jan  9 12:27:06 space kernel: Kernel panic: CPU context corrupt
> >
> > I've been taxing my system today and it locked up with the above
> > messages displayed in all terminals.  I've done some Google searching
> > on "Machine Check Exception" and I'm reading conflicting causes.
> > Some say CPU problems, some say cache, others indicate an ECC error,
> > and even others say it can't be an ECC crash.
> >
> > I have ECC memory in the PC and I guess I'm hoping it was simply a one
> > time anomalous flipped bit.  I have the same program running now and
> > it is keeping the PC at a load average around 1.75.  It crashed after
> > running this program a couple hours.  For now I guess I'll just wait
> > and see what happens, but I'd be interested to hear any comments.
> >
> > Athlon 1.4 (not XP) with retail HSF
> > Gigabyte GA-7DX MB
> > 512 MB (2x256) ECC Corsair DDR
> > Big steel case with extra fans and beefy power supply from PC Power
> > and Cooling
> >
> > This machine has been running for at least 15 months without trouble.
> >
> > Cheers,
> > Gary
> > _______________________________________________
> > Web Page:  http://lug.boulder.co.us
> > Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> > Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
>
>
>
> --
> Ferdinand Schmid
> Architectural Energy Corporation
> Celebrating 20 Years of Improving Building Energy Performance
> http://www.archenergy.com
>
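
For anyone who wants to keep track of the slot-by-slot test Ferdinand
describes above, here is a minimal Python sketch of the bookkeeping.  The
slot labels, the function names, and the example results are all made up
for illustration; the actual stress test still has to be run by hand
after each swap.

```python
from typing import Dict, List, Optional


def removal_plan(slots: List[str]) -> List[List[str]]:
    """For each slot in turn, list the slots left populated when it is emptied."""
    return [[s for s in slots if s != removed] for removed in slots]


def find_suspect(stable_without: Dict[str, bool]) -> Optional[str]:
    """Return the single slot whose removal made the load test pass.

    stable_without maps "slot that was emptied" -> "did the box stay up?".
    Returns None when zero or several removals helped (result is ambiguous).
    """
    suspects = [slot for slot, stable in stable_without.items() if stable]
    return suspects[0] if len(suspects) == 1 else None


def slot_or_dimm(crashes_with_known_good_dimm: bool) -> str:
    """If a known-good DIMM still crashes in the suspect slot, blame the slot."""
    return "slot" if crashes_with_known_good_dimm else "dimm"


if __name__ == "__main__":
    slots = ["slot1", "slot2", "slot3", "slot4"]  # hypothetical labels
    for populated in removal_plan(slots):
        print("test with:", populated)

    # Hypothetical outcome: only the run with slot4 empty stayed up.
    results = {"slot1": False, "slot2": False, "slot3": False, "slot4": True}
    suspect = find_suspect(results)
    print("suspect:", suspect, "-> faulty part:", slot_or_dimm(True))
```

The last step mirrors the known-good-memory check: if good memory in the
suspect slot still crashes, the slot is the likely culprit rather than
the DIMM.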

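The "Bank N:" values in Gary's log are 64-bit machine-check status words.
Below is a rough Python sketch that pulls out the top flag bits, assuming
the Athlon uses the usual x86 machine-check layout (VAL=63, OVER=62,
UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, error code in the low 16 bits).
That layout is my assumption here, and the function name is made up, so
check it against the CPU vendor's documentation before relying on it.

```python
# Assumed bit positions of the architectural flags in a bank status word.
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds extra info
    58: "ADDRV",  # the address register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status_hex: str) -> dict:
    """Split a hex status word into its set flags and low 16-bit error code."""
    value = int(status_hex, 16)
    flags = [name for bit, name in FLAGS.items() if (value >> bit) & 1]
    return {"flags": flags, "error_code": value & 0xFFFF}


if __name__ == "__main__":
    for bank, status in (("Bank 0", "c436c00000000833"),
                         ("Bank 1", "f600200000000853")):
        decoded = decode_status(status)
        print(bank, decoded["flags"], hex(decoded["error_code"]))
```

Under that assumption, Bank 1 has UC and PCC set, which is consistent
with the "CPU context corrupt" panic in the log.  The flags alone do not
say whether RAM, cache, or the CPU is at fault.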


