[lug] How bad is this?

Ferdinand Schmid fschmid at archenergy.com
Wed Jan 22 15:50:35 MST 2003


Sorry for the late reply, almost two weeks on; I am buried in work.
I saw this same error recently on a dual Athlon system using ECC
memory.  The cause turned out to be a bad memory slot on the
motherboard.

ECC memory seems very good at recovering from errors (hence the
name), so most of the time our system ran fine, just slower than it
should, presumably because it was doing a lot of error correction.
Sometimes, however, it would crash under heavy load (both CPUs 100%
busy), after anywhere from several minutes to 60 hours.

Successful troubleshooting meant removing one memory module (DIMM) at
a time, which left us running on 3GB in our case.  E.g. remove the
memory from slot 1 and test, re-insert it into slot 1 and remove the
memory from slot 2, and so on.  Of course the last slot on the board
was the bad one.  To determine whether the fault was in the slot or
in the DIMM, we inserted known good memory into the slot in question
and tested again.  The board crashed again, so it was the slot.
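
The procedure above can be sketched in a few lines of Python.  This is
only an illustration of the elimination logic, not a tool: the
function names, the four-slot layout and the simulated load test are
all assumptions made for the example.  On real hardware the "test"
step is a long run under heavy load.

```python
def find_suspect_slots(slots, is_stable):
    """Remove one slot's memory at a time and re-run the load test.

    A slot is a suspect if the machine stays up once that slot is empty.
    `is_stable` is a hypothetical callback standing in for the real test.
    """
    suspects = []
    for removed in slots:
        populated = [s for s in slots if s != removed]
        if is_stable(populated):
            suspects.append(removed)
    return suspects


def slot_or_dimm(crashes_with_known_good_dimm):
    """Second step: put known good memory into the suspect slot and retest."""
    return "slot" if crashes_with_known_good_dimm else "DIMM"


# Simulated machine (assumption): four slots, slot 4 is faulty, so any
# configuration that populates slot 4 eventually crashes.
BAD_SLOT = 4

def simulated_load_test(populated):
    return BAD_SLOT not in populated

suspects = find_suspect_slots([1, 2, 3, 4], simulated_load_test)
print("suspect slots:", suspects)
print("fault is in the", slot_or_dimm(crashes_with_known_good_dimm=True))
```

Because a crash could take up to 60 hours to appear, a run that
passes is weaker evidence than a run that fails, so each test needs
to run long enough to be convincing.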

The little information I found regarding this issue came from the
Linux kernel list.  But since I couldn't find a straight answer on
Google when I was looking for this info myself I decided to answer
this post despite the long delay.

Ferdinand

--On Thursday, January 09, 2003 03:11:52 PM -0700 Gary Hodges 
<Gary.Hodges at noaa.gov> wrote:

> Jan  9 12:27:06 space kernel: CPU 0: Machine Check Exception: 0000000000000004
> Jan  9 12:27:06 space kernel: Bank 0: c436c00000000833 at 0000000005933c40
> Jan  9 12:27:06 space kernel: Bank 1: f600200000000853 at 0000000001f531c0
> Jan  9 12:27:06 space kernel: Kernel panic: CPU context corrupt
>
> I've been taxing my system today and it locked up with the above
> messages displayed in all terminals.  I've done some google searching
> on "Machine Check Exception" and I'm reading conflicting causes.
> Some say CPU problems, some say cache, others indicate an ECC error,
> and even others say it can't be an ECC crash.
>
> I have ECC memory in the PC and I guess I'm hoping it was simply a one
> time anomalous flipped bit.  I have the same program running now and
> it is keeping the PC at a load average around 1.75.  It crashed after
> running this program a couple hours.  For now I guess I'll just wait
> and see what happens, but I'd be interested to hear any comments.
>
> Athlon 1.4 (not XP) with retail HSF
> Gigabyte GA-7DX MB
> 512 MB (2x256) ECC Corsair DDR
> Big steel case with extra fans and beefy power supply from PC Power
> and Cooling
>
> This machine has been running for at least 15 months without trouble.
>
> Cheers,
> Gary
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
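
For anyone trying to read the quoted "Bank" lines, here is a minimal
Python sketch.  It assumes that the first hex value on each line is
the raw 64-bit machine-check status register and that its top bits
follow the usual x86 layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in
bits 63 down to 57, with the error code in the low 16 bits).  The
function name decode_status is made up for the example, and
vendor-specific fields in the middle of the word are ignored.

```python
# Flag bits at the top of a 64-bit machine-check status word.
# Assumption: the value printed after "Bank N:" is that raw register.
FLAGS = [
    (63, "VAL"),    # the register holds a valid error record
    (62, "OVER"),   # an earlier error was overwritten (overflow)
    (61, "UC"),     # the error was not corrected by hardware
    (60, "EN"),     # reporting of this error was enabled
    (59, "MISCV"),  # the companion MISC register is valid
    (58, "ADDRV"),  # the companion address register is valid
    (57, "PCC"),    # processor context may be corrupt
]


def decode_status(hex_status):
    """Split a status word into its flag names and low 16-bit error code."""
    value = int(hex_status, 16)
    flags = [name for bit, name in FLAGS if (value >> bit) & 1]
    return {"flags": flags, "error_code": value & 0xFFFF}


# The two status values from the quoted log.
for bank, status in [(0, "c436c00000000833"), (1, "f600200000000853")]:
    decoded = decode_status(status)
    print("Bank %d: flags=%s error_code=0x%04x"
          % (bank, ",".join(decoded["flags"]), decoded["error_code"]))
```

Under these assumptions Bank 1 carries the UC and PCC flags, which
would be consistent with the "CPU context corrupt" panic.  The flags
alone do not say whether a slot, a DIMM or the CPU is at fault.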



--
Ferdinand Schmid
Architectural Energy Corporation
Celebrating 20 Years of Improving Building Energy Performance
http://www.archenergy.com



