(Longish) Anecdotal solution Re: [ale] Reboot wedging problems
John Mills
johnmills at speakeasy.net
Thu Jun 19 10:50:55 EDT 2003
Dow, Geoffrey, Robert, ALErs -
Thanks for the comments and suggestions. I _think_ I have improved the
situation, but time will tell. Here's the smut:
On Wed, 18 Jun 2003, Dow Hurst wrote:
> A big fat UPS with nut running on the server for autoshutdown on low
> battery and you won't have this problem ever.
That certainly sounds like a _Good_Thing(TM)_. Someday ...
> This also indicates that even though you are using a journaling
> filesystem the meta data doesn't seem to be written to the journal
> immediately and is being lost. There may be a parameter for ext3 that
> says don't wait to write metadata. XFS always writes metadata
> immediately in the SGI implementation. You can lose real data still in
> RAM but at least the file system integrity won't be compromised. A UPS
> based shutdown normally just runs the "shutdown -h now" command, so
> flushes data to disk. The journal should always be uptodate anyway.
> Dow
I didn't find such a parameter to set. Boot logs suggest I have a
five-second journal commit interval. 'xfs_admin' seems to be an SGI app -
at least I didn't find an x86 RPM for it.
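For reference, a minimal sketch of how the journaling settings mentioned above could be read back from the kernel log. The function name 'ext3_journal_info' and the log path in the example are hypothetical. The ext3 documentation describes 'data=' mount modes ('journal', 'ordered', 'writeback') and, on kernels that support it, a 'commit=' interval in seconds; whether a given RH-7.3 kernel accepts them is an assumption to verify before relying on it.

```shell
#!/bin/sh
# Print the kernel messages that mention the ext3 journal.
# $1 = optional saved log file; with no argument, read the live
#      kernel ring buffer via dmesg.
ext3_journal_info() {
    if [ -n "$1" ]; then
        grep -i -E 'kjournald|EXT3-fs' "$1"
    else
        dmesg | grep -i -E 'kjournald|EXT3-fs'
    fi
}

# Example (path is an assumption; adjust for the system at hand):
#   ext3_journal_info /var/log/dmesg
```

The matching lines normally report the commit interval and the data mode in use, which is where a "five second" figure would come from.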
> Geoffrey wrote:
> > Tom's r/b has badblocks command. Boot with it and run it against your
> > / filesystem?
I did that ("# badblocks -n -o <filename>") and the partition passed
cleanly. I also changed the cabling and jumpering on the disks in case I
had problems with bad reads or writes, and/or was interrupting retry
sequences. I also went back to the HDD I had originally installed RH-7.3
upon, and found the same *(&^@!! behavior. (I had used RH-7.3 and ext3
filesystems on other boxen and been particularly pleased with their
recovery from power loss, but this was a recent installation followed by a
disk upgrade.)
I noticed some clues: 1) the inodes in question were _not_ repeaters,
though they were in nearby parts of the disk; 2) all lockup problems
occurred during the initial check of the root partition; 3) boot logs
suggested the 'ext3' root partition was mounted as 'ext2' at that point;
4) the segments in 'lost+found' were things like mail-spool fragments; and
5) some complaints seemed to be about socket mount-points.
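Point 3 can be checked on a running system rather than inferred from boot logs. The sketch below is one way to do it, assuming the kernel exposes its mount table at /proc/mounts; the function name 'mounted_fstype' is hypothetical. It prints the type a mount point is actually mounted with, which can then be compared against the type listed in /etc/fstab.

```shell
#!/bin/sh
# Print the filesystem type a mount point is currently mounted with.
# $1 = mount point (e.g. /)
# $2 = mount table to read (defaults to /proc/mounts)
# If the mount point is listed more than once (e.g. rootfs and the
# real root device), the last entry wins.
mounted_fstype() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

# Example:
#   mounted_fstype /
```

If this reports 'ext2' for a partition that /etc/fstab lists as 'ext3', the journal is not being used for that mount.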
Accordingly I moved /var/spool, /var/log, and /tmp to another 'ext3'
partition (which was fsck'ed on boot _as_ 'ext3') and soft-linked them
back into '/var' and '/'. I shut down in an orderly way, rebooted, and
then cut power OFF and back ON.
I was rewarded with a successful boot: all problems were fixable with no
complaints about fs integrity. Much more satisfying!
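The relocation described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the exact commands used: the name 'relocate_dir' and the destination '/data' in the example are hypothetical, and on a live system the move would normally be done as root with the services that write to these directories stopped.

```shell
#!/bin/sh
# Move a directory onto another partition and leave a soft link behind.
# $1 = directory to move (e.g. /var/spool)
# $2 = existing directory on the other partition (hypothetical, e.g. /data)
relocate_dir() {
    src=$1
    dest_parent=$2
    name=$(basename "$src")

    # Only act on a real directory, never on an existing symlink.
    if [ ! -d "$src" ] || [ -L "$src" ]; then
        return 1
    fi
    # Refuse to overwrite anything already at the destination.
    if [ -e "$dest_parent/$name" ]; then
        return 1
    fi

    # mv copies across filesystems and removes the original;
    # the link is created only if the move succeeded.
    mv "$src" "$dest_parent/$name" && ln -s "$dest_parent/$name" "$src"
}

# Example:
#   relocate_dir /var/spool /data
#   relocate_dir /var/log   /data
#   relocate_dir /tmp       /data
```

The function refuses to run twice on the same directory, since after the first run the source is a symlink.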
TENTATIVE CONCLUSION: The problem most likely came from the specifics of
my hardware and disk usage, and _maybe_ I've reduced the impact by putting
the most volatile material where it is checked more robustly on reboot.
If this suggests any other points or measures I would be happy to hear
them. Meanwhile stay tuned and we'll see if this was the right track.
Always a pain to troubleshoot an installation-specific problem like this!!
Regards.
John Mills
john.m.mills at alum.mit.edu