[lug] strange kernel happenings...

Tue Aug 13 18:44:51 MDT 2002

Nate Duehr wrote:
> On Mon, 2002-08-12 at 22:05, D. Stimits wrote:
> 
>>A general comment on such strange behaviors, that might or might not 
>>have anything to do with your situation. It is not unusual for heat 
>>buildup over time to do strange things. It is not unusual for marginal 
>>power supplies or marginal power line voltage (brown-out) to do this 
>>(have a voltmeter? if you have an UPS, hopefully it will output close to 
>>120 VAC, and never dip down to 110). Marginal memory often does such 
>>strange things as well; try running memtest86 on it for half a day:
>>  http://www.memtest86.com/
> 
> 
> Yeah, had thought about this, but don't want that server down for long
> enough to run a long-running memtest86 test on it... hmmm... might have
> to.

If you can afford to load it down (both CPU and filesystem), you might be 
able to test it that way instead: run an automated batch of kernel builds, 
each followed by a clean, one after the other, for about 5 kernels. If a 
build dies with a signal 11, or shows some similar behavior, you can put 
95% faith in bad RAM or some other overheating component.
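
A rough sketch of that kind of batch run, in Python, in case it is useful. 
Nothing here is specific to any kernel version: the make targets and source 
path in the usage comment are assumptions, and so is the list of crash 
markers, which is only a guess at what a crashing compiler tends to print.

```python
import subprocess

# Strings that often show up in build output when a compiler process crashes.
# This list is an assumption for illustration, not an exhaustive catalogue.
CRASH_MARKERS = ("signal 11", "segmentation fault", "internal compiler error")


def classify(returncode, output):
    """Return 'ok', 'crash', or 'error' for one finished command."""
    if returncode < 0:
        # subprocess reports death-by-signal as a negative return code.
        return "crash"
    lowered = output.lower()
    if any(marker in lowered for marker in CRASH_MARKERS):
        # make itself usually survives; the crash shows up in its output.
        return "crash"
    return "ok" if returncode == 0 else "error"


def run_cycles(commands, cycles=5, cwd=None):
    """Run each command in order, `cycles` times; stop at the first crash."""
    results = []
    for cycle in range(1, cycles + 1):
        for cmd in commands:
            proc = subprocess.run(
                cmd,
                cwd=cwd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                errors="replace",
            )
            verdict = classify(proc.returncode, proc.stdout)
            results.append((cycle, cmd, verdict))
            if verdict == "crash":
                return results
    return results


# Example use (hypothetical path and targets, adjust for your tree):
#
#     build = [["make", "clean"], ["make", "bzImage"]]
#     for cycle, cmd, verdict in run_cycles(build, cycles=5, cwd="/usr/src/linux"):
#         print(cycle, " ".join(cmd), verdict)
```

A plain "error" verdict (a non-zero exit with no crash marker) more likely 
points at a configuration problem than at hardware, so the sketch keeps the 
two cases apart.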

> 
> 
>>In terms of software, I think a triple fault (triple exception) in the 
>>kernel will cause an instant reboot, but not defunct processes.
> 
> 
> Interesting.  I'm not a kernel hack at all, so this is interesting info.
> 
> 
>>To find out more about what is going on, you can compile with the kernel 
>>debugger option, or kdb. I do not know if all kernels have that, but I 
>>think the redhat kernels do, and all of the SGI kernels I use do 
>>(naturally, kdb was written mainly by an SGI employee). kdb can give you 
>>a list of processes, and allow you to get a stack dump of any process. 
>>It will even work most of the time if the kernel is locked up hard. The 
>>problem with kdb is that it does not play nicely with X11, you mostly 
>>need either a real console or a serial console to another machine. [I 
>>have not looked, but I would bet that the kdb docs in the kernel source 
>>Documentation/ directory name oss.sgi.com; if not, likely oss.sgi.com 
>>has tons of docs on kdb]
> 
> 
> I think I'll avoid debugging kernels... it might be detrimental to my
> health. :)  (LOL...)

Ahh, but the trick is that you don't have to. You just get a backtrace 
and send it to the kernel devel list; they might ask for other details. 
In fact, you could simply ask the kernel devel list directly what would 
be the best thing to do via kdb in this situation, to find out what is 
required without knowing kernel hacking. If you use a serial console to 
another machine, you can copy the data with the mouse or any other way.
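
If you would rather have the second machine log everything than copy it by 
hand, something like the following might do. It is only a sketch: the 
device name and log file name in the usage comment are hypothetical, and it 
assumes the serial port has already been set to the right speed (for 
example with stty).

```python
import time


def capture(source, sink, clock=time.time):
    """Copy lines from `source` to `sink`, prefixing each with a timestamp.

    `source` is any iterable of bytes lines (for example a device opened in
    binary mode); `sink` is a text file object. Returns the number of lines.
    """
    count = 0
    for raw in source:
        # Console output may contain stray bytes; decode without failing.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        sink.write("%.3f %s\n" % (clock(), text))
        sink.flush()  # keep the log current in case this machine is interrupted
        count += 1
    return count


# Example use (hypothetical device and file names):
#
#     with open("/dev/ttyS0", "rb", buffering=0) as port, \
#          open("console-session.log", "a") as log:
#         capture(port, log)
```

The timestamps are there so that the captured backtrace can be lined up 
with whatever else was happening on the network at the time.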

D. Stimits, stimits AT idcomm.com

> 
> 
>>Also, I'd run tail -f on /var/log/messages and always keep it visible, 
>>preferably via ssh from another machine. Although you said it does not 
>>show anything in the log, it might matter what the last message was, 
>>especially if the same message is always the last message. Maybe 
>>experiment with manually running it to init 2, and then back to init 3 
>>or init 5, see if that does anything (if it gives an oops, you are in 
>>luck...if you can type it in accurately).
> 
> 
> Good idea... I'll set that up on the console.
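
Once you have a few of those captures saved, checking whether the same 
message is always the last one can be automated. A minimal sketch, assuming 
the classic "Mon DD HH:MM:SS host ..." syslog line layout; the function 
names are made up for illustration.

```python
from collections import Counter


def message_body(line):
    """Drop the 'Mon DD HH:MM:SS host' prefix of a classic syslog line."""
    parts = line.strip().split(None, 4)
    # Lines with fewer than five fields are kept whole rather than guessed at.
    return parts[4] if len(parts) == 5 else line.strip()


def last_message(lines):
    """Return the body of the final non-blank line, or None if there is none."""
    for line in reversed(list(lines)):
        if line.strip():
            return message_body(line)
    return None


def recurring_last_messages(captures):
    """Count how often each message was the last one across several captures."""
    counts = Counter()
    for lines in captures:
        body = last_message(lines)
        if body is not None:
            counts[body] += 1
    return counts.most_common()
```

Each capture is just the list of lines saved from one incident; a message 
that tops the result with a count close to the number of incidents is the 
one worth reporting.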
> 
> Nate, nate at natetech.com