[lug] Unable to cleanly reboot

D. Stimits stimits at idcomm.com
Thu Oct 18 15:17:04 MDT 2001


Michael Deck wrote:
> 
> On Thursday 18 October 2001 02:59 pm, you wrote:
> > Michael Deck wrote:
> > > Thanks. I'm learning something here but, unfortunately, I haven't yet
> > > collared the problem.
> > >
> > > First I did "init 3" which sent me back to a command-line prompt. At that
> > > run level a lot of stuff was still in use but there were a number of
> > > services that I saw I didn't want or need on this machine so I killed
> > > them and took them out of chkconfig. Then I did "init 2" and that killed
> > > a few more services. I did lsof on all the partitions again and, after
> > > killing some services manually, saw that only 'login', 'bash', and 'lsof'
> > > were using files on /dev/hda6. I assumed that was OK and did "init 1",
> > > which froze the system. It told me it had stopped the random service and
> > > eth0 successfully, then "no more processes in this runlevel" and it's
> > > once again time for a hard restart and fsck.
> >
> > I wonder if possibly it is eth0 that killed things. This was the last
> > thing it did before apparently locking up? While in init 2 or init 1,
> 
> I can't get to init 1. When in init 2, if I type "init 1" it hangs. I can
> kill eth0 without apparent problem. When I ctrl-alt-del it says "initlevel 6"
> and hangs.

Wayde had mentioned looking for rc[0-6].d directory contents. In this
case, the processes shutting down when transitioning from init 2 to
init 1 are contained by sym links starting with "K" in /etc/rc.d/rc1.d/.
Each sym link will be named "K", then a priority for order of run, then
the name of the script it points at in /etc/rc.d/init.d/. The list is
long in terms of the number of processes that are told to be killed at
init 1, since it is possible for a system to go straight from higher
init levels directly to init 1. Most of them will already be stopped.
Likely one of two conditions prevail:
(1) something being started (sym links starting with "S") is locking up
or failing to run something that must be run; my system starts "single"
and "keytable" in runlevel 1. Hard to test if they locked up.
(2) one of the stopped services isn't really stopped, but is locking up
(more likely scenario); run each "K" script with argument "status". Go
into rc1.d and run each of the "K" scripts (while already in init 2),
and see which ones are running. Example:
cd /etc/rc.d/rc1.d/
./K88syslog status
(if it is running, write it down, it is a credible lockup possibility)
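If typing each one by hand gets tedious, a loop can do the same thing. This is only a rough sketch, assuming the Red Hat-style /etc/rc.d/rc1.d layout described above; the function name check_kill_scripts is my own invention, and what "status" prints or returns is up to each init script.

```shell
#!/bin/sh
# Sketch: ask every kill ("K") script in a runlevel directory for its status.
# check_kill_scripts is a made-up helper name, not a standard tool.
# The default directory assumes the /etc/rc.d/rc1.d layout; pass another
# directory as the first argument if your system differs.
check_kill_scripts() {
    dir="${1:-/etc/rc.d/rc1.d}"
    for script in "$dir"/K*; do
        # Skip the unexpanded glob (no matches) and anything not executable.
        [ -x "$script" ] || continue
        name=$(basename "$script")
        printf '== %s ==\n' "$name"
        # Each init script decides what "status" reports, so show both the
        # text and the exit status; write down anything still running.
        "$script" status
        printf '(exit status %s)\n' "$?"
    done
}
```

Run "check_kill_scripts" while still in init 2, and note every service that reports itself as running.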

For the most part, concentrate your efforts on finding anything that is
running in init 2 and is set to be stopped in init 1. Get as
much data as you can on processes that should be stopped. You might also
want to search the /var/log/messages file for anything related to those
services.
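
For the log search, something along these lines might save some typing. It is a sketch under assumptions: the function name search_log_for_services is made up, it takes the service name to be whatever follows the "K" and the priority digits in the link name, and it assumes the log lines mention that name, which will not hold for every service.

```shell
#!/bin/sh
# Sketch: for each "K" link in a runlevel directory, show the most recent
# log lines that mention the service it points at.
# search_log_for_services is a made-up helper name. The defaults assume
# the /etc/rc.d/rc1.d layout and the /var/log/messages log file.
search_log_for_services() {
    dir="${1:-/etc/rc.d/rc1.d}"
    log="${2:-/var/log/messages}"
    for script in "$dir"/K*; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$script" ] || continue
        # "K88syslog" becomes "syslog": drop the K and the priority digits.
        service=$(basename "$script" | sed 's/^K[0-9]*//')
        printf '== %s ==\n' "$service"
        # -F matches the name as a plain string; tail keeps the last 5 hits.
        grep -F -- "$service" "$log" | tail -n 5
    done
}
```

A service with no output under its heading simply had no matching lines, which by itself proves nothing; the interesting ones are those that logged errors around the time of the hang.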

D. Stimits, stimits at idcomm.com

> 
> > can you do "init 6" and have it reboot? Or "init 0" and have it
> > correctly halt? Also, for your normal shutdown, instead of calling
> > "reboot", try:
> > shutdown -r now
> > OR:
> > shutdown -h now
> 
> Doesn't matter, they all have the same effect.
> 
> >
> > If those work right, maybe the problem is just that the "reboot" script is
> > messed up.
> >
> > > What else should I be looking for? Or was there something I missed in all
> > > that?
> >
> > Look at the output of "ps aux" for more hints of what is still running.
> > Look at "chkconfig --list" and do "/etc/rc.d/init.d/whatever status" on
> > everything that is supposed to be running, see if it is really running,
> > and not failed but subsystem locked (PostgreSQL does this if the system
> > isn't cleanly rebooted). Maybe show a list of "lsof /dev/whatever" for
> > all partitions that are relevant, just before you get to a stage where
> > it locks up.
> 
> >
> > And a very minor thing, when doing your last init or shutdown command,
> > first cd to "/". I say this because it is possible that a subdirectory
> > can be considered in use if someone has done a cd to that directory. In
> > theory it should kill your shell and not lock it in use.
> 
> That sounds promising. I'll try it.
> 
> >
> > D. Stimits, stimits at idcomm.com
> >
> > > -Mike
> > >
> > > On Thursday 18 October 2001 12:53 pm, you wrote:
> > > > Michael Deck wrote:
> > > > > Help! I'm unable to reboot my Linux box without a hard reset (and
> > > > > fsck all drives). When, as root, I type "reboot" it goes through the
> > > > > steps of stopping relevant services and then says, "No more processes
> > > > > in this runlevel" but then it just hangs. Unlike it did before (or on
> > > > > my other Linux boxen) where it pauses and then the md recovery thread
> > > > > gets woken up which powers off or reboots the box. This is a huge
> > > > > drag. I really need some advice on how to fix this because it's 15
> > > > > minutes to reboot otherwise. I'm not (presently) running X so the
> > > > > default runlevel is 3.
> > > > >
> > > > > Here's what has been happening on this box, in case it's of use.
> > > > >
> > > > > I've been trying to replace Mandrake 7.2 with KRUD 9-01 for the past
> > > > > 3 days. First, I couldn't get initrd.img to boot and the hardware was
> > > > > suspected. So I put a newer CDROM drive in and tried again.
> > > > >
> > > > > Booting worked, but I was getting sporadic failures to find rpm files
> > > > > during the actual install. Each of these install attempts left the
> > > > > system in a more or less unusable state. I did go back and re-install
> > > > > Mandrake from CD successfully but it has many older RPMs and
> > > > > (unfortunately) at some point I trashed /usr so my various patches
> > > > > and updates were lost. Sigh.
> > > > >
> > > > > This morning I figured out how to do a hard-disk based install and
> > > > > tried that. This particular box can't successfully copy the CD's but
> > > > > I was able to use another Linux box to copy them into ISO images and
> > > > > then upload them to the target system. Voila!, I thought, and I (once
> > > > > again) commenced the KRUD installation. Text mode, but it got done.
> > > > >
> > > > > But it still has this same ugly problem of not shutting down cleanly.
> > > > >
> > > > > Your suggestions appreciated!
> > > > >
> > > > > -Mike
> > > > >
> > > > > _______________________________________________
> > > > > Web Page:  http://lug.boulder.co.us
> > > > > Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> > > >
> > > > A couple of general tools. The main one being lsof. It lists the users
> > > > of a filesystem resource. For example, if you have hard drive hda, and
> > > > partition hda1, and a program is using a file on hda1, then you can do
> > > > "lsof /dev/hda1" and it'll list the users.
> > > >
> > > > Second tool, you can go to /etc/rc.d/init.d and use "./whatever stop"
> > > > to stop a given service in the same way that a shutdown would do (it'll
> > > > be back after reboot or "./whatever start").
> > > >
> > > > I believe you will also find that runlevel 2 is single user with
> > > > networking, and runlevel 1 is single user without networking. Manually
> > > > run "init 2" to drop to runlevel 2. If nothing was hung up, run "init
> > > > 1" and drop to runlevel 1. If nothing hung up, then you have basically
> > > > the minimum of services running before hang. If you want to go back to
> > > > runlevel 3, just "init 3".
> > > >
> > > > While in your lowest runlevel (runlevel 0 is halt, runlevel 6 reboots,
> > > > runlevel 1 is the lowest interactive level), run lsof against the hard
> > > > drives that are mounted. You will have to run it against each
> > > > partition, not just the drive (I once had a similar hang because of a
> > > > bug with a sym link from a man page causing the partition to think it
> > > > was still in use, had to delete the sym and relink it once a newer
> > > > kernel version was in). Before you bother with panicking over a large
> > > > number of users of a partition, run "chkconfig --list" and view
> > > > services that are running from your current runlevel (or rather, for
> > > > services that are supposed to be running). If you see something
> > > > optional that will complicate your search, go to /etc/rc.d/init.d/ and
> > > > run the "./whatever status" to see if it really runs (maybe it'll say
> > > > "service is stopped but subsystem is locked" instead); then run
> > > > "./whatever stop" to stop the service. Just be careful not to stop
> > > > something you need. Eventually you can
> > > > investigate the partition with lsof and decide exactly what processes
> > > > are candidates for the lockup, and attempt to work on each in turn. If
> > > > for example you saw that netscape was still locking, you know damn well
> > > > you found your problem. Maybe it'll be like the problem I found long
> > > > ago, and a sym link will be mistaken for an open file.
> > > >
> > > > D. Stimits, stimits at idcomm.com


