(Longish) Anecdotal solution Re: [ale] Reboot wedging problems

Thu Jun 19 10:50:55 EDT 2003

Dow, Geoffrey, Robert, ALErs -

Thanks for the comments and suggestions. I _think_ I have improved the 
situation, but time will tell. Here's the smut:

On Wed, 18 Jun 2003, Dow Hurst wrote:

> A big fat UPS with nut running on the server for autoshutdown on low 
> battery and you won't have this problem ever.

That certainly sounds like a _Good_Thing(TM)_. Someday ...

> This also indicates that even though you are using a journaling 
> filesystem the meta data doesn't seem to be written to the journal 
> immediately and is being lost.  There may be a parameter for ext3 that 
> says don't wait to write metadata.  XFS always writes metadata 
> immediately in the SGI implementation.  You can lose real data still in 
> RAM but at least the file system integrity won't be compromised.  A UPS 
> based shutdown normally just runs the "shutdown -h now" command, so 
> flushes data to disk.  The journal should always be uptodate anyway.
> Dow

I didn't find such a parameter to set. Boot logs suggest I have a five
second journaling interval. 'xfs_admin' seems to be an SGI app - at least
I didn't find an x86 RPM for it.

> Geoffrey wrote:

> > Tom's r/b has badblocks command.  Boot with it and run it against your 
> > / filesystem?

I did that ("# badblocks -n -o <filename>") and the partition passed
cleanly. I also changed the cabling and jumpering on the disks in case I
had problems with bad reads or writes, and/or was interrupting retry
sequences. I also went back to the HDD I had originally installed RH-7.3
upon, and found the same *(&^@!! behavior. (I had used RH-7.3 and ext3
filesystems on other boxen and been particularly pleased with their
recovery from power loss, but this was a recent installation followed by a
disk upgrade.)

I noticed some traces: 1) the inodes in question did _not_ repeat from one
boot to the next, though they sat in nearby parts of the disk; 2) all lockup
problems occurred during the initial check of the root partition; 3) boot
logs suggested the 'ext3' root partition was mounted as 'ext2' at that
point; 4) the segments in 'lost+found' were things like mail-spool
fragments; and 5) some complaints seemed to be about socket mount-points.
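
For anyone wanting to check point 3 on their own box, here is a minimal
sketch. It assumes a Linux-style mount table in /proc/mounts; the helper
name 'fstype_of' is made up for illustration. It only reports the type at
the moment it is run, not what the boot scripts saw during the initial
fsck, so treat it as a cross-check against the boot log.

```shell
#!/bin/sh
# Hypothetical helper: print the filesystem type a mount point is
# currently mounted as. The second argument (mount table path) is optional
# and defaults to /proc/mounts, which is assumed to exist (Linux).
fstype_of() {
    mount_point=$1
    table=${2:-/proc/mounts}
    # Fields per line: device, mount point, type, options, dump, pass.
    # Keep the last match, since a later entry shadows an earlier one.
    awk -v mp="$mount_point" \
        '$2 == mp { t = $3 } END { if (t != "") print t }' "$table"
}

# Usage (prints nothing if the mount point is not in the table):
#   fstype_of /
```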

Accordingly I moved /var/spool, /var/log, and /tmp to another 'ext3'
partition (which was fsck'ed on boot _as_ 'ext3') and soft-linked them
back into '/var' and '/'. I shut down in an orderly way, rebooted, and
then cycled power OFF and ON.
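
The move itself can be sketched roughly as below. This is an illustration
under assumptions, not a record of the exact commands: the helper name
'relocate_dir' and the mount point /mnt/data are hypothetical, and 'cp -a'
is assumed to be available (GNU cp). It should be run with the relevant
daemons stopped, e.g. in single-user mode, since syslog and the mailer
keep files open under these directories.

```shell
#!/bin/sh
# Hypothetical helper: copy a directory onto another partition, remove the
# original, and leave a symlink in its place.
relocate_dir() {
    src=$1         # directory to move, e.g. /var/spool
    new_root=$2    # mount point of the other partition (assumed name)
    dest="$new_root/$(basename "$src")"

    if [ -L "$src" ]; then
        echo "already a symlink: $src" >&2
        return 1
    fi
    if [ ! -d "$src" ]; then
        echo "not a directory: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "destination exists: $dest" >&2
        return 1
    fi

    cp -a "$src" "$dest" || return 1   # keep owners, modes, timestamps
    rm -rf "$src" || return 1          # only reached if the copy succeeded
    ln -s "$dest" "$src"               # old path now points at the new home
}

# Usage, with /mnt/data standing in for the second ext3 partition:
#   relocate_dir /var/spool /mnt/data
#   relocate_dir /var/log   /mnt/data
#   relocate_dir /tmp       /mnt/data
```

The original is removed only after the copy succeeds, so a failed copy
leaves the directory where it was.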

I was rewarded with a successful boot: all problems were fixable with no
complaints about fs integrity. Much more satisfying!

TENTATIVE CONCLUSION: The problem most likely came from the specifics of
my hardware and disk usage, and _maybe_ I've reduced the impact by putting
the most volatile material where it is checked more robustly on reboot.

If this suggests any other points or measures I would be happy to hear 
them.  Meanwhile stay tuned and we'll see if this was the right track. 
Always a pain to troubleshoot an installation-specific problem like this!!

Regards.

 John Mills
 john.m.mills at alum.mit.edu
