[lug] Unable to cleanly reboot

D. Stimits stimits at idcomm.com
Thu Oct 18 14:59:50 MDT 2001


Michael Deck wrote:
> 
> Thanks. I'm learning something here but, unfortunately, I haven't yet
> collared the problem.
> 
> First I did "init 3" which sent me back to a command-line prompt. At that run
> level a lot of stuff was still in use but there were a number of services
> that I saw I didn't want or need on this machine so I killed them and took
> them out of chkconfig. Then I did "init 2" and that killed a few more
> services. I did lsof on all the partitions again and, after killing some
> services manually, saw that only 'login', 'bash', and 'lsof' were using files
> on /dev/hda6. I assumed that was OK and did "init 1", which froze the system.
> It told me it had stopped the random service and eth0 successfully, then "no
> more processes in this runlevel", and it's once again time for a hard restart
> and fsck.

I wonder if it is eth0 that killed things, since stopping it was the last
thing the system reported before apparently locking up. While in init 2 or init 1,
can you do "init 6" and have it reboot? Or "init 0" and have it
correctly halt? Also, for your normal shutdown, instead of calling
"reboot", try:
shutdown -r now
OR:
shutdown -h now

If those work right, maybe the problem is just that the "reboot" script is
messed up.

> 
> What else should I be looking for? Or was there something I missed in all
> that?

Look at the output of "ps aux" for more hints of what is still running.
Look at "chkconfig --list" and do "/etc/rc.d/init.d/whatever status" on
everything that is supposed to be running, to see whether it is really
running or has instead died and left its subsystem locked (PostgreSQL does
this if the system isn't cleanly rebooted). Maybe show the output of
"lsof /dev/whatever" for all partitions that are relevant, taken just
before you get to the stage where it locks up.
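
One way to collect all of that in a single pass, just before the init or
shutdown command that hangs, is sketched below. It is only a sketch: the
function names and the /root/shutdown-debug.txt log path are made up for
this example, and it assumes a Linux-style mounts table (/proc/mounts by
default) plus whichever of ps, chkconfig and lsof happen to be installed.

```shell
#!/bin/sh
# Sketch only: function names and the default log path are placeholders.

# Print each /dev/* device listed in a mounts table (default: /proc/mounts),
# one per line, without duplicates.
mounted_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }' "${1:-/proc/mounts}" | sort -u
}

# Append a command's output to a log under a header line. If the command
# is not installed, say so in the log instead of failing.
log_cmd() {
    lc_log=$1; shift
    echo "=== $* ===" >> "$lc_log"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$lc_log" 2>&1
    else
        echo "(command not found: $1)" >> "$lc_log"
    fi
}

# Record what is running and which files are open on each mounted partition.
# $1 = log file (optional), $2 = mounts table to read (optional).
snapshot_before_shutdown() {
    snap_log=${1:-/root/shutdown-debug.txt}
    date > "$snap_log"
    log_cmd "$snap_log" ps aux
    log_cmd "$snap_log" chkconfig --list
    for dev in $(mounted_devices "${2:-/proc/mounts}"); do
        log_cmd "$snap_log" lsof "$dev"
    done
    sync   # flush the log to disk in case the next step hangs
}
```

After sourcing this as root, calling snapshot_before_shutdown right before
"init 1" should leave something to read after the hard restart. If the
root filesystem shows up in /proc/mounts under a name lsof does not accept,
passing /etc/mtab as the second argument may work better.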

And a very minor thing: before running your last init or shutdown command,
first cd to "/". I say this because it is possible for a directory to be
considered in use if someone's shell has done a cd into it. In theory the
shutdown should kill your shell and release the directory rather than
leave it held in use.
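
To see whether any process is parked in a directory on the partition in
question, something like the following could be used. It is a sketch that
assumes a Linux /proc where each /proc/PID/cwd is a symlink to that
process's current directory; the function name is made up, and reading
other users' entries normally needs root.

```shell
#!/bin/sh
# Sketch only: procs_with_cwd_under is a hypothetical helper name.

# List "PID directory" for every process whose current working directory
# is DIR or somewhere below it.
# $1 = directory (e.g. a mount point), $2 = proc root (default: /proc).
procs_with_cwd_under() {
    dir=${1%/}                 # drop a trailing slash; "/" becomes ""
    proc_root=${2:-/proc}
    for link in "$proc_root"/[0-9]*/cwd; do
        # Skip entries we cannot read or that vanished in the meantime.
        cwd=$(readlink "$link" 2>/dev/null) || continue
        case "$cwd/" in
            "$dir"/*)
                p=${link%/cwd}
                echo "${p##*/} $cwd"
                ;;
        esac
    done
}
```

For example, "procs_with_cwd_under /home" would list the shells and other
processes sitting somewhere under /home; an empty result suggests that a
stray cd is not what is keeping that partition busy.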

D. Stimits, stimits at idcomm.com

> 
> -Mike
> 
> On Thursday 18 October 2001 12:53 pm, you wrote:
> > Michael Deck wrote:
> > > Help! I'm unable to reboot my Linux box without a hard reset (and fsck
> > > all drives). When, as root, I type "reboot" it goes through the steps of
> > > stopping relevant services and then says, "No more processes in this
> > > runlevel" but then it just hangs. Before (and on my other Linux boxen)
> > > it would pause, the md recovery thread would get woken up, and then
> > > the box would power off or reboot. This is a huge drag. I really
> > > need some advice on how to fix this because it's 15 minutes to reboot
> > > otherwise. I'm not (presently) running X so the default runlevel is 3.
> > >
> > > Here's what has been happening on this box, in case it's of use.
> > >
> > > I've been trying to replace Mandrake 7.2 with KRUD 9-01 for the past 3
> > > days. First, I couldn't get initrd.img to boot and the hardware was
> > > suspected. So I put a newer CDROM drive in and tried again.
> > >
> > > Booting worked, but I was getting sporadic failures to find rpm files
> > > during the actual install. Each of these install attempts left the system
> > > in a more or less unusable state. I did go back and re-install Mandrake
> > > from CD successfully but it has many older RPMs and (unfortunately) at
> > > some point I trashed /usr so my various patches and updates were lost.
> > > Sigh.
> > >
> > > This morning I figured out how to do a hard-disk based install and tried
> > > that. This particular box can't successfully copy the CDs but I was able
> > > to use another Linux box to copy them into ISO images and then upload them
> > > to the target system. Voila!, I thought, and I (once again) commenced the
> > > KRUD installation. Text mode, but it got done.
> > >
> > > But it still has this same ugly problem of not shutting down cleanly.
> > >
> > > Your suggestions appreciated!
> > >
> > > -Mike
> > >
> > > _______________________________________________
> > > Web Page:  http://lug.boulder.co.us
> > > Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> >
> > A couple of general tools. The main one being lsof. It lists the users
> > of a filesystem resource. For example, if you have hard drive hda, and
> > partition hda1, and a program is using a file on hda1, then you can do
> > "lsof /dev/hda1" and it'll list the users.
> >
> > Second tool, you can go to /etc/rc.d/init.d and use "./whatever stop" to
> > stop a given service in the same way that a shutdown would do (it'll be
> > back after reboot or "./whatever start").
> >
> > I believe you will also find that runlevel 2 is single user with
> > networking, and runlevel 1 is single user without networking. Manually
> > run "init 2" to drop to runlevel 2. If nothing was hung up, run "init 1"
> > and drop to runlevel 1. If nothing hung up, then you have basically the
> > minimum of services running before hang. If you want to go back to
> > runlevel 3, just "init 3".
> >
> > While in your lowest runlevel (runlevel 0 is halt, runlevel 6 reboots,
> > runlevel 1 is the lowest interactive level), run lsof against the hard
> > drives that are mounted. You will have to run it against each partition,
> > not just the drive (I once had a similar hang because of a bug with a
> > sym link from a man page causing the partition to think it was still in
> > use, had to delete the sym and relink it once a newer kernel version was
> > in). Before you bother with panicking over a large number of users of a
> > partition, run "chkconfig --list" and view services that are running
> > from your current runlevel (or rather, for services that are supposed to
> > be running). If you see something optional that will complicate your
> > search, go to /etc/rc.d/init.d/ and run "./whatever status" to see
> > if it really runs (maybe it'll say "service is stopped but subsystem is
> > locked" instead); then run "./whatever stop" to stop the service. Just
> > be careful not to stop something you need. Eventually you can
> > investigate the partition with lsof and decide exactly what processes
> > are candidates for the lockup, and attempt to work on each in turn. If
> > for example you saw that netscape was still locking, you know damn well
> > you found your problem. Maybe it'll be like the problem I found long
> > ago, and a sym link will be mistaken for an open file.
> >
> > D. Stimits, stimits at idcomm.com


