[lug] How bad is this?
Timothy C. Klein
teece at silverklein.net
Thu Jan 9 16:00:55 MST 2003
* Gary Hodges (Gary.Hodges at noaa.gov) wrote:
> Jan 9 12:27:06 space kernel: CPU 0: Machine Check Exception: 0000000000000004
> Jan 9 12:27:06 space kernel: Bank 0: c436c00000000833 at 0000000005933c40
> Jan 9 12:27:06 space kernel: Bank 1: f600200000000853 at 0000000001f531c0
> Jan 9 12:27:06 space kernel: Kernel panic: CPU context corrupt
>
> I've been taxing my system today and it locked up with the above
> messages displayed in all terminals. I've done some google searching on
> "Machine Check Exception" and I'm reading conflicting causes. Some say
> CPU problems, some say cache, others indicate an ECC error, and even
> others say it can't be an ECC crash.
>
> I have ECC memory in the PC and I guess I'm hoping it was simply a one
> time anomalous flipped bit. I have the same program running now and it
> is keeping the PC at a load average around 1.75. It crashed after
> running this program a couple hours. For now I guess I'll just wait and
> see what happens, but I'd be interested to hear any comments.
>
> Athlon 1.4 (not XP) with retail HSF
> Gigabyte GA-7DX MB
> 512 MB (2x256) ECC Corsair DDR
> Big steel case with extra fans and beefy power supply from PC Power and
> Cooling
>
> This machine has been running for at least 15 months without trouble.
>
I've never done real research, but I think Machine Check Exception is a
way for the processor (x86 only, I presume) to tell the kernel that it
has detected a hardware fault.
When I enabled MCE in the kernel and then overclocked my Athlon, I got
MCE errors. They might have been cache-related, but I don't remember
for sure. Sometimes they did not bring the system down; sometimes they
immediately preceded a crash.
Is it bad? I don't know. I wouldn't personally worry unless it
happened again.
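For what it's worth, the numbers in the log can be picked apart by hand.
The sketch below assumes the first value is the x86 global machine-check
status word and the "Bank" values are per-bank status words, and it only
decodes the high-order flag bits plus the low 16-bit error code. The
names decode_flags and error_code are made up for this example, and the
model-specific bits are deliberately left uninterpreted.

```python
# Flag bits in a per-bank machine-check status word (assumed layout:
# the architectural x86 high-order bits only).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # another error arrived while this one was still logged
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # companion MISC register holds extra information
    58: "ADDRV",  # companion ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}

# Flag bits in the global machine-check status word.
MCG_STATUS_FLAGS = {
    2: "MCIP",  # machine check in progress
    1: "EIPV",  # saved instruction pointer relates to the error
    0: "RIPV",  # execution can be restarted at the saved pointer
}


def decode_flags(value, flags):
    """Return the names of the flags set in value, highest bit first."""
    return [name
            for bit, name in sorted(flags.items(), reverse=True)
            if (value >> bit) & 1]


def error_code(status):
    """Return the low 16 bits of a bank status word (the error code)."""
    return status & 0xFFFF


if __name__ == "__main__":
    # Values copied from the log in the quoted message.
    print("global:", decode_flags(0x0000000000000004, MCG_STATUS_FLAGS))
    for bank, status in enumerate((0xC436C00000000833, 0xF600200000000853)):
        print("bank", bank, decode_flags(status, STATUS_FLAGS),
              hex(error_code(status)))
```

Under those assumptions, Bank 0 decodes to VAL, OVER, ADDRV and Bank 1
to VAL, OVER, UC, EN, ADDRV, PCC. The PCC flag in Bank 1 lines up with
the "CPU context corrupt" panic message, and the global word has only
MCIP set, so the restart flag RIPV is clear.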
Tim
PS -> This is what the kernel help file says about MCE:
* Machine Check Exception support allows the processor to notify the *
* kernel if it detects a problem (e.g. overheating, component failure). *
* The action the kernel takes depends on the severity of the problem, *
* ranging from a warning message on the console, to halting the machine. *
* You can safely select this on machines that do not support this feature. *
* *
* For pentium machines the mce support defaults to off as the mainboard *
* support is not always present. You must activate it as a boot option. *
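
Whether a given processor advertises machine-check support can be read
from the "flags" line of /proc/cpuinfo on Linux, where it shows up as
"mce" (and "mca" for the bank architecture). A minimal sketch, with
cpu_flags and mce_support being hypothetical helper names of my own:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def mce_support(cpuinfo_text):
    """Report whether the 'mce' and 'mca' flags are present."""
    flags = cpu_flags(cpuinfo_text)
    return {"mce": "mce" in flags, "mca": "mca" in flags}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(mce_support(f.read()))
    except OSError:
        # Not a Linux system, or /proc is not mounted.
        print("could not read /proc/cpuinfo")
```

This only says what the CPU reports; it does not show whether the
running kernel was built with MCE support or has it switched on.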
--
==============================================
== Timothy Klein || teece at silverklein.net ==
== http://i148.denver.dsl.forethought.net ==
== ---------------------------------------- ==
== "Hello, World" 17 Errors, 31 Warnings... ==
==============================================