
halt, apm, and all that



After tonight's fun with sick root disks on imap machines, I thought
I'd take another look at the "halt" problem.

Basic review of the problem: after the machine boots, it runs fsck,
first on the root disk, then on everything else.  If fsck fails and
indicates that a human needs to look at things, then the rc scripts
call "halt".  This is supposed to stop the machine in some obvious
fashion that a human will notice and hopefully correct -- like in the
rom monitor.  Since intel machines don't have a rom monitor, the lfs
scripts for "halt" actually do a "poweroff" (halt -p = poweroff).
Since this doesn't currently work on our systems, we have another hack,
which I had previously recommended, that leaves people at a '# '
prompt.  The only safe thing you can do at this prompt is to turn the
power off ("halt nosync").  This is certainly not particularly convenient
for people located at home.  Presumably, these people find other things
to do at that prompt, and I suspect (although I cannot prove) that this
may be why we got corrupted root filesystems on the imap servers.

For the short-term, people who are at home will probably want to
remember this:
	reboot -n -f
(-n = nosync, -f = don't call shutdown, just do it.)  This is the
closest equivalent there is to "halt nosync", and I believe it is
the *only* safe thing you can do at this point, unless you really
*really* know what you're doing.

There are 2 things we can do in the next kernel to make this better:

#1 - build "magic sysrq" support in.  With this built in, the
kernel responds to a break followed by one of the following
letters:
	r - turn off raw keyboard access (useless for us)
	k - secure access mode (probably not useful)
	b - boot without sync or unmount (BINGO--WE WANT THIS)
	s - attempt to sync filesystems (also useful on not quite dead systems)
	u - attempt to mount filesystems ro
	p - dump registers & flags to console
	t - dump current tasks & info
	m - dump current meminfo
	0-9 - set console log level
	e - send sigterm to everything but init
	i - send sigkill to everything but init
	l - send sigkill to everything including init(!)
	h - help...
This (and more) is all described in the kernel source, in
	Documentation/sysrq.txt
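
For what it's worth, here is a minimal sketch of how one might check
whether magic sysrq is actually usable on a running kernel.  It assumes
the /proc/sys/kernel/sysrq switch described in Documentation/sysrq.txt;
the function name "sysrq_state" is made up for illustration, and I have
not tried this on our server kernel.

```shell
#!/bin/sh
# sysrq_state [FILE]
# Report whether magic sysrq looks usable, based on the contents of FILE
# (defaults to /proc/sys/kernel/sysrq).  "sysrq_state" is a hypothetical
# helper name, not something shipped with the kernel or with lfs.
sysrq_state() {
    file=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$file" ]; then
        # Assumption: a missing file means the kernel was probably built
        # without CONFIG_MAGIC_SYSRQ (or /proc is not mounted).
        echo "unavailable"
        return 1
    fi
    value=$(cat "$file")
    if [ "$value" = "0" ]; then
        echo "disabled"    # compiled in, but switched off at run time
    else
        echo "enabled"     # any non-zero value turns it on
    fi
}
```

Running "sysrq_state" with no argument looks at the live kernel.  If it
reports "disabled", then "echo 1 > /proc/sys/kernel/sysrq" as root should
turn it back on.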
The code appears to support one more option which is oddly enough
not documented:
	o - turn machine power off.
This only works with acpi or:

#2 - build "apm" support into the kernel, modify lilo.conf to
	supply "apm=power-off", and modify /etc/rc.d/init.d/halt
	to remove the #.

	Apm support is normally disabled on SMP machines, which covers all
	of our production equipment.  The apm theoretical machine model is
	too simple to describe SMP hardware, so it is not very useful at
	best, and positively dangerous at worst.  That's why the apm
	kernel code doesn't like to run on our hardware.  However,
	"power-off" is (usually) an exception to this, and by popular
	demand, the latest linux kernels support this even on SMP
	machines -- but *only* with the right options, which for some
	reason are not the default.

	Since our current server kernel doesn't support this,
	I tried this with a "generic" 2.4.26 kernel (with lots
	of options) -- by adding this to /etc/modules.conf:
		add options apm debug=1 power_off=1
	and doing "modprobe apm" I was then able to do
		halt -d -f -i -p
	and have the right thing happen.  Note that option
	processing for built-in modules is slightly different
	and we probably don't need debug output - hence my
	slightly different recommendations above for lilo.conf
	vs. what I actually tested.
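
Since the lilo.conf variant is untested, a quick sanity check after
rebooting may be worthwhile.  The sketch below assumes the option was
passed via lilo's append= line (e.g. append="apm=power-off") and that it
therefore shows up in /proc/cmdline.  The function name
"has_apm_poweroff" is hypothetical, and it only looks for the literal
form "apm=power-off", not for any other spelling of the option.

```shell
#!/bin/sh
# has_apm_poweroff [FILE]
# Succeed (exit 0) if the kernel command line in FILE (defaults to
# /proc/cmdline) contains the exact option "apm=power-off".
# "has_apm_poweroff" is a made-up helper name for illustration.
has_apm_poweroff() {
    file=${1:-/proc/cmdline}
    # Put each option on its own line, then require a whole-line match
    # so that e.g. "xapm=power-off" is not mistaken for the real option.
    tr ' ' '\n' < "$file" | grep -qx -- 'apm=power-off'
}
```

Usage would be something like:
	has_apm_poweroff && echo "apm=power-off was passed at boot"
This only confirms that the option reached the kernel, not that the
machine will actually power off.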

Of course, this raises another question - how do you turn a machine on
from home?  But -- you see, option 1 gives you choices, and option 2,
depending on what you choose at boot time, allows you to choose
between various different dangerous behaviors.

				-Marcus