[lug] How bad is this?

Gary Hodges Gary.Hodges at noaa.gov
Wed Jan 22 17:17:46 MST 2003


Ferdinand Schmid wrote:
> 
> Almost two weeks later...  sorry for the late reply, I am buried in
> work.  But I have seen this error recently on a dual Athlon system
> using ECC memory.  The cause turned out to be a bad memory slot on
> the motherboard.
> 
> ECC memory seems very good at recovering from errors (hence the
> name).  So most of the time our system ran fine, just slower than it
> should, since it had to do lots of error correction.  Sometimes,
> however, it would croak under heavy load (both CPUs 100% busy), after
> anywhere from several minutes to 60 hours.
> 
> Successful troubleshooting meant removing one memory module (DIMM) at
> a time and running on 3GB in our case, e.g. remove the memory from
> slot 1 and test, re-insert it into slot 1 and remove the memory from
> slot 2, ...  Of course the last slot on the board was the bad one.  To
> determine whether the slot or the DIMM was at fault, we inserted
> known-good memory into the slot in question and tested again; the
> board crashed again, so it was the slot.
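
The swap procedure described above can be sketched in a few lines of
Python.  This is only an illustration of the logic: the function names
(isolation_plan, classify_fault) and the slot labels are hypothetical,
and the actual stress testing is a manual step the script does not
perform.

```python
def isolation_plan(slots):
    """Return one test configuration per slot, each leaving that slot empty."""
    plan = []
    for removed in slots:
        populated = [s for s in slots if s != removed]
        plan.append({"empty": removed, "populated": populated})
    return plan


def classify_fault(crashes_with_original_dimm, crashes_with_known_good_dimm):
    """Decide whether the evidence for a suspect slot points at slot or DIMM.

    Both arguments are booleans recorded by hand after a stress test with
    the suspect slot populated.
    """
    if not crashes_with_original_dimm:
        return "no fault seen"
    if crashes_with_known_good_dimm:
        # A known-good DIMM still crashes in this slot, so blame the slot.
        return "bad slot"
    return "bad DIMM"


if __name__ == "__main__":
    # Hypothetical four-slot board; substitute the real slot labels.
    for step in isolation_plan(["slot1", "slot2", "slot3", "slot4"]):
        print("leave", step["empty"], "empty; stress-test with", step["populated"])
    print(classify_fault(True, True))
```

A configuration that stays stable under load identifies the suspect
slot (the one left empty); classify_fault then covers the final check
with known-good memory.
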
> 
> The little information I found regarding this issue came from the
> Linux kernel list.  But since I couldn't find a straight answer on
> Google when I was looking for this info myself, I decided to answer
> this post despite the long delay.

Thanks for the follow-up!  I haven't had this happen to me again.  The
only change I made was to the ECC setting in the BIOS.  It was
initially set to Check Error (check only) and I changed it to Correct
Error.  The four possible settings are:

1. Disabled -- Disable DRAM ECC function.
2. Check Error -- Set DRAM ECC setting to Check Only.  Enable DRAM error
   checking function.
3. Correct Error -- Set DRAM ECC setting to Correct Errors.  Enable DRAM
   1 bit error checking and correcting in CPU/AGP/PCI.
4. Correct + Scrub -- Set DRAM ECC setting to Correct+Scrub.  Enable
   DRAM 1 bit error checking and correcting in CPU/AGP/PCI and DRAM.
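
To make "1 bit error checking and correcting" concrete, here is a toy
Python sketch of a single-error-correcting Hamming(7,4) code.  It is
an illustration only: it is an assumption that the chipset works on
the same principle, real ECC DIMMs use a wider code (8 check bits per
64 data bits), and the function names are made up.  A nonzero syndrome
corresponds to "an error was detected"; flipping the indicated bit is
the "correct" step.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(codeword):
    """Return (corrected codeword, syndrome).

    The syndrome is the 1-based position of a single flipped bit,
    or 0 if no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return c, syndrome


def decode(codeword):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    damaged = list(word)
    damaged[4] ^= 1  # simulate one flipped bit at position 5
    fixed, position = correct(damaged)
    print("flipped bit at position", position, "data", decode(fixed))
```

In these terms, a check-only mode would stop at computing the
syndrome, while a correcting mode would also apply the fix before the
data is used.
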

Now that I am reading this again, I'm wondering whether I am actually
checking for DRAM errors, since I have it set to Correct Error.  I'd be
happy to hear any suggestions on the preferred setting.  I don't really
have time to explore the original problem now, but when I do, maybe
I'll set it back to Check Error or Disabled and see what happens.  If
it locks up again, I'll start switching the RAM around like you did.

Cheers,
Gary



> --On Thursday, January 09, 2003 03:11:52 PM -0700 Gary Hodges
> <Gary.Hodges at noaa.gov> wrote:
> 
> > Jan  9 12:27:06 space kernel: CPU 0: Machine Check Exception: 0000000000000004
> > Jan  9 12:27:06 space kernel: Bank 0: c436c00000000833 at 0000000005933c40
> > Jan  9 12:27:06 space kernel: Bank 1: f600200000000853 at 0000000001f531c0
> > Jan  9 12:27:06 space kernel: Kernel panic: CPU context corrupt
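
For anyone trying to read the hex values in the log above, here is a
minimal Python sketch that pulls out the flag bits.  It assumes the
generic x86 machine-check register layout (status flags in bits 63
down to 57 of each bank status word, error code in the low 16 bits);
it has not been checked against Athlon-specific documentation, and the
function names are made up for illustration.

```python
# Flag bits of the global machine-check status word.
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

# Flag bits of a per-bank status word (assumed generic x86 layout).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # extra info register is valid
    58: "ADDRV",  # the reported address is valid
    57: "PCC",    # processor context corrupt
}


def set_flags(value, names):
    """List the names of all flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank(status):
    """Split a bank status word into its flags and low 16-bit error code."""
    return {"flags": set_flags(status, MCI_STATUS_FLAGS),
            "error_code": status & 0xFFFF}


if __name__ == "__main__":
    print("global:", set_flags(0x0000000000000004, MCG_STATUS_FLAGS))
    for bank, status in enumerate([0xC436C00000000833, 0xF600200000000853]):
        info = decode_bank(status)
        print("bank", bank, info["flags"], hex(info["error_code"]))
```

For the values in the log this reports MCIP for the global status,
VAL/OVER/ADDRV for bank 0, and VAL/OVER/UC/EN/ADDRV/PCC for bank 1.
The PCC flag in bank 1 lines up with the "CPU context corrupt" panic
message.
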
> >
> > I've been taxing my system today and it locked up with the above
> > messages displayed in all terminals.  I've done some Google searching
> > on "Machine Check Exception" and I'm finding conflicting explanations.
> > Some say CPU problems, some say cache, others indicate an ECC error,
> > and still others say it can't be an ECC crash.
> >
> > I have ECC memory in the PC and I guess I'm hoping it was simply a
> > one-time anomalous flipped bit.  I have the same program running again
> > now and it is keeping the PC at a load average around 1.75.  It
> > crashed after running this program for a couple of hours.  For now I
> > guess I'll just wait and see what happens, but I'd be interested to
> > hear any comments.
> >
> > Athlon 1.4 (not XP) with retail HSF
> > Gigabyte GA-7DX MB
> > 512 MB (2x256) ECC Corsair DDR
> > Big steel case with extra fans and beefy power supply from PC Power
> > and Cooling
> >
> > This machine has been running for at least 15 months without trouble.
> >
> > Cheers,
> > Gary
> > _______________________________________________
> > Web Page:  http://lug.boulder.co.us
> > Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> > Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
> 
> --
> Ferdinand Schmid
> Architectural Energy Corporation
> Celebrating 20 Years of Improving Building Energy Performance
> http://www.archenergy.com


