[lug] strange kernel happenings...

D. Stimits stimits at idcomm.com
Mon Aug 12 22:05:06 MDT 2002


A general comment on such strange behavior, which may or may not have 
anything to do with your situation. Heat buildup over time often does 
strange things. So do marginal power supplies and marginal power-line 
voltage (brown-outs). Do you have a voltmeter? If you have a UPS, it 
should output close to 120 VAC and never dip down to 110. Marginal 
memory often does such strange things as well; run memtest86 on it for 
half a day:
  http://www.memtest86.com/

In terms of software, I think a triple fault in the kernel will cause 
an instant reboot, but it would not explain defunct processes.

To find out more about what is going on, you can compile the kernel 
with the kernel debugger option, kdb. I do not know if all kernels have 
that, but I think the Red Hat kernels do, and all of the SGI kernels I 
use do (naturally, since kdb was written mainly by an SGI employee). kdb 
can give you a list of processes and let you get a stack dump of any 
process. It will even work most of the time if the kernel is locked up 
hard. The problem with kdb is that it does not play nicely with X11; you 
mostly need either a real console or a serial console to another 
machine. [I have not looked, but I would bet that the kdb docs in the 
kernel source Documentation/ directory mention oss.sgi.com; if not, 
oss.sgi.com likely has tons of docs on kdb.]
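
If kdb is not available, a much cruder userspace look at which processes 
are stuck can be had from /proc. The sketch below is only an 
illustration: the function names are made up, it assumes the usual 
"pid (name) state ..." layout of /proc/<pid>/stat, and on a box in the 
state described it may well hang the same way ps and top do. It lists 
processes in state D (uninterruptible sleep) or Z (zombie, i.e. defunct).

```python
import os

def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, command name, state letter)."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so find it by the first "(" and the last ")".
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state

def stuck_processes(proc_root="/proc"):
    """List processes in state D (uninterruptible sleep) or Z (zombie)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not as assumed
        if state in ("D", "Z"):
            found.append((pid, comm, state))
    return sorted(found)

if __name__ == "__main__":
    for pid, comm, state in stuck_processes():
        print(pid, state, comm)
```

Run it from a spare login before things go wrong and again afterwards; 
a growing pile of D-state processes would at least hint at what they 
are all waiting on.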

Also, I'd run tail -f on /var/log/messages and always keep it visible, 
preferably via ssh from another machine. Although you said nothing 
shows up in the log, it might matter what the last message was, 
especially if the same message is always the last one. Maybe experiment 
with manually switching the box to runlevel 2 (init 2) and then back to 
init 3 or init 5, and see if that does anything (if it gives an oops, 
you are in luck... if you can copy it down accurately).
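
A rough sketch of the "is it always the same last message" check, in 
case it helps: save the log text captured before each hang, then compare 
the final lines. The function names are made up for illustration, and 
the code assumes the traditional syslog line layout ("Mon DD HH:MM:SS 
host message"), which your syslog setup may or may not use.

```python
import re
from collections import Counter

# Assumed traditional syslog prefix, e.g. "Aug 12 22:05:06 hostname "
_PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+")

def last_message(log_text):
    """Return the last non-blank line with the timestamp/host prefix removed."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    if not lines:
        return None
    return _PREFIX.sub("", lines[-1]).strip()

def common_last_messages(log_texts):
    """Count how often each final message occurs across several saved logs."""
    finals = (last_message(text) for text in log_texts)
    return Counter(msg for msg in finals if msg is not None)
```

Stripping the timestamp and hostname is what lets identical messages 
from different days be counted together; if one message dominates the 
count, that is the one worth chasing.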

D. Stimits, stimits AT idcomm.com

Nate Duehr wrote:
> Anyone have any ideas on this...?
> 
> Seeing some really "Bad Things(TM)" on one of my boxes.  Started after
> an upgrade to 2.4.18 from Debian.  Have tried both 2.4.18-bf24 and
> 2.4.18-k6 flavors...
> 
> Basically, what's happening is suddenly and without warning (no logs, no
> idea that it even is occurring) the kernel stops processing SIGTERM,
> SIGKILL, etc.  Also, programs like top and ps (anything that views a
> process list) lock up the terminal session (both after displaying a
> partial list).
> 
> I'm actually typing this from that machine -- the machine stays running,
> daemons act semi-normally, and in general everything works except
> anything that processes signals to the kernel.
> 
> Yesterday for the first time I noticed that keventd had gone defunct
> during this, but usually that doesn't happen.
> 
> What are some of the things I could check on such a goofy system to help
> find and correct the problem permanently or help someone much smarter than
> I know what might be causing this?
> As I said, no logs from syslog show anything useful... not even an
> interrupt conflict or IDE hiccup... nothing... it just starts acting
> strangely.
> (One of the interesting side-effects of this bug/situation is that the
> system won't process signals, so "shutdown" and "init" don't work...
> can't even reboot it without hitting The Big Red Switch.)
> 
> Oh and even stranger... I waited about eight hours and logged into the box
> again and a "reboot" command INSTANTLY dropped it... no warning message,
> and earlier today it wasn't responding at all to any kind of signals...
> wow.





