[lug] How bad is this?

Timothy C. Klein teece at silverklein.net
Thu Jan 9 16:00:55 MST 2003


* Gary Hodges (Gary.Hodges at noaa.gov) wrote:
> Jan  9 12:27:06 space kernel: CPU 0: Machine Check Exception:
> 0000000000000004
> Jan  9 12:27:06 space kernel: Bank 0: c436c00000000833 at
> 0000000005933c40
> Jan  9 12:27:06 space kernel: Bank 1: f600200000000853 at
> 0000000001f531c0
> Jan  9 12:27:06 space kernel: Kernel panic: CPU context corrupt
> 
> I've been taxing my system today and it locked up with the above
> messages displayed in all terminals.  I've done some google searching on
> "Machine Check Exception" and I'm reading conflicting causes.  Some say
> CPU problems, some say cache, others indicate an ECC error, and even
> others say it can't be an ECC crash.
> 
> I have ECC memory in the PC and I guess I'm hoping it was simply a one
> time anomalous flipped bit.  I have the same program running now and it
> is keeping the PC at a load average around 1.75.  It crashed after
> running this program a couple hours.  For now I guess I'll just wait and
> see what happens, but I'd be interested to hear any comments.
> 
> Athlon 1.4 (not XP) with retail HSF
> Gigabyte GA-7DX MB
> 512 MB (2x256) ECC Corsair DDR
> Big steel case with extra fans and beefy power supply from PC Power and
> Cooling
> 
> This machine has been running for at least 15 months without trouble.
> 

I've never done real research, but I think Machine Check Exception is a
way for the hardware (x86 only, I presume) to tell the kernel that
something has gone wrong in hardware.
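
The "Bank" values in the quoted log look like per-bank machine-check status
words. As a rough sketch only, assuming the standard x86 MCi_STATUS bit layout
applies to this Athlon (I have not checked that against AMD's documentation),
the top flag bits could be picked apart like this. The name decode_status is
just made up for the illustration:

```python
# Flag bits of an x86 machine-check bank status word (MCi_STATUS).
# Assumption: the standard architectural layout applies to this CPU.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Return the flag bits and the low 16-bit error code of a status word."""
    decoded = {name: bool((status >> bit) & 1)
               for name, bit in STATUS_FLAGS.items()}
    decoded["error_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    # The two bank values from the log quoted above.
    for bank, value in enumerate(("c436c00000000833", "f600200000000853")):
        info = decode_status(int(value, 16))
        flags = [name for name in STATUS_FLAGS if info[name]]
        print("Bank %d: flags=%s error_code=0x%04x"
              % (bank, ",".join(flags), info["error_code"]))
```

Under that assumption, Bank 1 decodes with the PCC bit set and Bank 0 does
not, which would at least be consistent with the "CPU context corrupt" panic.
It does not say which component actually failed.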

If I enable MCE in the kernel, and then overclock my Athlon, I will get
MCE errors.  They might even have been related to cache, but I don't
remember for sure.  Sometimes, they would not bring the system down.
Sometimes they immediately preceded a crash.

Is it bad?  I don't know.  I wouldn't personally worry unless it
happened again.

Tim

PS -> This is what the kernel help file says about MCE:

  * Machine Check Exception support allows the processor to notify the      *
  * kernel if it detects a problem (e.g. overheating, component failure).   *
  * The action the kernel takes depends on the severity of the problem,     *
  * ranging from a warning message on the console, to halting the machine.  *
  * You can safely select this on machines that do not support this feature *
  *                                                                         *
  * For pentium machines the mce support defaults to off as the mainboard   *
  * support is not always present. You must activate it as a boot option.   *
--
==============================================
==  Timothy Klein || teece at silverklein.net  ==
==  http://i148.denver.dsl.forethought.net  ==
== ---------------------------------------- ==
== "Hello, World" 17 Errors, 31 Warnings... ==
==============================================