[ale] RAID mirror boot nightmare

Phil Turmel philip at turmel.org
Wed Jul 11 08:37:17 EDT 2012


On 07/11/2012 02:27 AM, Bob Toxen wrote:
> All,
> 
> PROBLEM SOLVED!  Phil's suggestion of the initrd being wrong was
> correct.  I was starting to suspect this.  See below for details on how
> I got into this mess and how I got out.
> 
> Phil: please send me private email with your name as it should appear on
> your $50 check and your mailing address.  I was deadly serious about
> the reward as I was desperate as this system is for a client and I want
> to give a real thanks!

Done!  Thanks!

Some notes for the future...  I was going to recommend
linux-raid at vger.kernel.org if you continued having trouble.  I'm a
regular responder there, but there are many others (world-wide, too, for
those late evenings).  The author and maintainer of linux MD and mdadm,
Neil Brown, listens there as well.

If you are doing these sorts of things for customers, lurking on
linux-raid will also give you an early warning of kernel bugs, a feel
for common problems, and the best practices to avoid them.
The public assistance there is free.

> The command to rebuild the initrd under CentOS is mkinitrd.

Ahh.  I've been using dracut in Gentoo... quite different.  It comes from
the Fedora project, though, so it will eventually find its way to CentOS.

> In the md superblock there is a field called "Preferred Minor", i.e.,
> preferred minor device for the md device that is created.  There seems
> to be no command and option to just update this field; apparently one
> must use mdadm with --create or --assemble to update the md superblock
> on the underlying real disk devices.

This field is ignored by modern initramfs setups.  The field doesn't
even exist any more in MD metadata v1.0 and up.  It's only used by
kernel auto-assembly now, and that is deprecated.  Minors are now
assigned from 127 counting down as arrays are encountered, except for
arrays explicitly declared in mdadm.conf.   The AUTO statement in
mdadm.conf can selectively disable this behaviour.  See the man-page
for mdadm.conf.
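
A minimal sketch of what such an mdadm.conf could look like, printed by a
small shell function so nothing under /etc is touched.  The function name
is mine and the UUID is a made-up placeholder; "AUTO -all" is just one
possible policy (assemble only declared arrays), not a recommendation for
every setup.

```shell
# Sketch only: prints an example mdadm.conf to stdout; it does not write /etc.
print_sample_mdadm_conf() {
    cat <<'EOF'
# Assemble only what is declared below; skip everything else.
AUTO -all
# Placeholder UUID: substitute the value reported by "mdadm --detail /dev/md4".
ARRAY /dev/md4 UUID=00000000:00000000:00000000:00000000
EOF
}

print_sample_mdadm_conf
```

With an ARRAY line like this the array keeps its declared node no matter
what discovery order the kernel happens to use.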

Member partitions of arrays assembled under initramfs control should
*not* be marked with partition type 0xfd (Linux raid autodetect).
Instead, use 0x83 (Linux) or 0xda (non-FS data).

> Due to the "brilliance" of whoever wrote that md code, on first write
> when any md device is activated, that md device's minor device is written
> into the superblock stored under each underlying device, e.g., /dev/sda6
> and /dev/sdb6.  When I used the CD Rescue code, it generated md devices
> of the form /dev/md123, 4, 5, 6, 7.  Probably, when I then ran "fsck -f"
> (or just read a file which causes the file's access time to be updated)
> under the CD Rescue CD's Linux, it changed the preferred minor device
> in each underlying disk device, precipitating this nightmare.

I don't think so.  Your earlier report said it was assembling /dev/md0.
That had to be either kernel autoassembly (bad to mix with initrds) or
an array called out in the initramfs' copy of your mdadm.conf file.

That copy of your mdadm.conf undoubtedly had the uuid of your original
/dev/md4.  You re-created that array, which would have assigned a new
uuid.  After that, /dev/md4 as understood by the initramfs would not
be found in your system.  Depending on the AUTO setting, the array would
either get /dev/md127 or not be assembled at all.
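
To make that UUID mismatch concrete, here is a rough sketch that compares
the UUID declared in a copy of mdadm.conf against the UUID the live array
reports.  The function names are hypothetical helpers of mine, and the
caller is assumed to supply the live UUID by hand (for example, read off
"mdadm --detail /dev/md4"); the sketch itself never runs mdadm.

```shell
# Print the UUID that an mdadm.conf copy declares for one array node.
conf_uuid_for() {
    # $1 = path to an mdadm.conf copy, $2 = array node such as /dev/md4
    awk -v node="$2" '
        toupper($1) == "ARRAY" && $2 == node {
            for (i = 3; i <= NF; i++)
                if (tolower($i) ~ /^uuid=/) { print substr($i, 6); exit }
        }' "$1"
}

# Report whether the declared UUID agrees with the one the array reports.
uuid_status() {
    # $1 = mdadm.conf copy, $2 = array node, $3 = UUID of the live array
    declared=$(conf_uuid_for "$1" "$2" | tr 'A-F' 'a-f')
    live=$(printf '%s' "$3" | tr 'A-F' 'a-f')
    if [ -z "$declared" ]; then
        echo "undeclared"   # no ARRAY line for this node in the file
    elif [ "$declared" = "$live" ]; then
        echo "match"
    else
        echo "stale"        # the file still names the old array
    fi
}

# Example call (paths and UUID are placeholders):
#   uuid_status /etc/mdadm.conf /dev/md4 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
```

A "stale" answer for the copy inside the initramfs is the situation
described above: the initramfs is looking for an array that no longer exists.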

If your kernel command line in grub.conf used the syntax root=LABEL=...
your system would probably have still booted.  I strongly recommend
that in /etc/fstab as well.
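
As a rough illustration of the LABEL= approach: the label names "rootfs"
and "bootfs" below are hypothetical, and the grub.conf and fstab lines in
the comments are sketches whose kernel path and mount options you would
adapt to your own system.  The small function, also just a sketch, lists
the fstab entries that still point at raw /dev/mdN nodes.

```shell
# Label the filesystems once (e2label is part of e2fsprogs):
#   e2label /dev/md4 rootfs
#   e2label /dev/md0 bootfs
# Then refer to them by label, not by md node:
#   grub.conf:  kernel <your kernel image> ro root=LABEL=rootfs
#   fstab:      LABEL=rootfs  /      ext3  defaults  1 1
#               LABEL=bootfs  /boot  ext3  defaults  1 2

# List fstab entries whose device field is a raw /dev/mdN node.
raw_md_mounts() {
    # $1 = path to an fstab file
    awk '$1 ~ /^\/dev\/md[0-9]+$/ { print $1, $2 }' "$1"
}

# Example call:
#   raw_md_mounts /etc/fstab
```

Any line this prints will break if the array comes up under a different
minor number; lines using LABEL= are unaffected.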

> Unfortunately, on boot the kernel fails to give useful info on what
> device it was trying to mount or why it failed -- very UN-Linux-like.

Most modern initramfs implementations offer some form of debugging
shell, made available with a kernel option.  For dracut:
http://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html#using-the-dracut-shell

> I booted from a different non-RAID partition, mounted the md partitions
> now called /dev/md126 and /dev/md0 as /mnt2 and /mnt2/boot and issued
> the command
> 
>   mkinitrd --fstab=/mnt2/etc/fstab /boot/initrd_126 `uname -r`
> 
> and then edited my /boot/grub/grub.conf to change the initrd field to
> initrd_126.
> 
> I then edited /mnt2/etc/fstab to specify /dev/md126 as my / device
> and /dev/md0 as /boot.  The kernel doesn't really seem to care about
> /etc/mdadm.conf as the newer Linux RAID stores the info in the
> "md superblock" (not to be confused with the ext[234] *nix superblock).

You've got this backward, due to a slight misunderstanding.  The copy
of mdadm.conf that gets inserted into your initramfs is the controlling
copy.  If it is missing, all discovered devices and partitions will be
examined for initramfs assembly, to nodes /dev/md127 and down.  The
assignment of these nodes is very dependent on discovery order and
timing, and has been known to change between kernel versions.

The copy of mdadm.conf in /etc is only used *after* boot, or to update
your initramfs.

> The kernel also seemed to ignore on the grub line:
> 
>   md=4,/dev/sda6,/dev/sdb6

This syntax is supported only for kernel autoassembly of v0.9 arrays,
and most modern distributions don't include this deprecated feature.

See what you get from:

zcat /proc/config.gz |grep MD_AUTODETECT
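
Not every kernel exposes /proc/config.gz.  A hedged variant, assuming your
distribution also ships the kernel config as a plain file under /boot (many
do, but check), could look like this; the function name is mine.

```shell
# Print the MD_AUTODETECT line from a kernel config file.
md_autodetect_setting() {
    # $1 = kernel config file, plain text or gzip-compressed (*.gz)
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep 'CONFIG_MD_AUTODETECT'
}

# Typical use: try /proc/config.gz first, then the copy under /boot.
#   for f in /proc/config.gz "/boot/config-$(uname -r)"; do
#       [ -r "$f" ] && md_autodetect_setting "$f" && break
#   done
```

A line ending in "=y" means kernel autoassembly is compiled in; "is not
set" (or no output at all) means the md= syntax will be ignored.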

> I then booted successfully.  I then copied this new grub.conf and
> initrd_126 to my new raid / and /boot partitions for redundancy.
> 
> The above got my md devices running under the new names but with only
> sda as I had not yet installed my replacement second disk.
> 
> 
> To recover with my new empty disk, I installed it as sdb and did:
>   sfdisk -d /dev/sda | sfdisk /dev/sdb
> 
> 
> NOTE 1: Save your partition tables to a file thusly (and then to another
>         system):
> 
>           sfdisk -d /dev/sda > partition_table_sda
>           sfdisk -d /dev/sdb > partition_table_sdb
> 
>         To later recover (DANGEROUS):
> 
>           sfdisk /dev/sda < partition_table_sda
> 
> 
> NOTE 2: For those that do full backups with tar, rsync, etc., which
>         do NOT save inode numbers, it is very important to back up
>         the inode numbers.  Thus, when you eventually suffer disk
>         corruption (possible under ext3 occasionally with unclean
>         shutdown), when fsck asks about inode 235255 you can grep for it
>         in your backup and know which file may be corrupted and in need
>         of a restore.  Also, if a directory file gets trashed you will
>         know where to restore the orphan files that ended up in
>         /lost+found.
> 
>         One way to capture inodes (prior to backup) is the following,
>         except prune to skip /proc and other fake file systems:
> 
>           find / -ls > /root/inodes.list
> 
> 
> To change them back to my original preferred names I then did:
> 
>   1. Booted to my primary non-raid sda5.
> 
>   2. Ensured that no md devices were mounted with the following
>      (better than "mount" because it doesn't babble about /dev, /proc,
>      etc.):
> 
>        df -h
> 
>   3. Deactivated the "wrong" md device:
> 
>        mdadm -S /dev/md126
> 
>   4. Created the "right" md device (this worked because after
>      replacing my failed sdb disk I allowed RAID automatically to sync
>      over several hours):
> 
>        mdadm -A /dev/md4 -v -U super-minor /dev/sd[ab]6
> 
>      Verify that md4 was created successfully:
>        cat /proc/mdstat
>        mdadm -D /dev/md4
>        mdadm -E /dev/sd[ab]6
> 
>      Alternatively (if there was an out-of-date file system on /dev/sdb6,
>      such as if the replacement disk had been used previously),
>      I first would have had to scribble over both the md superblock and
>      the ext3 superblock with:
> 
>        mdadm --zero-superblock /dev/sdb6            # Dangerous
>        dd bs=512 count=1 if=/dev/zero of=/dev/sdb6  # Dangerous
> 
> Then update /boot/grub/grub.conf, /etc/fstab, and /etc/mdadm.conf.

I think you've created a great deal of work for yourself that could
all be avoided if you used LABEL= syntax in grub.conf and /etc/fstab.

You could then use an empty mdadm.conf and it would all "just work"
when you change things around.  FWIW, SystemRescueCD has an empty
mdadm.conf.  That's why it always assembled your arrays.
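
On the inode list in your NOTE 2: a minimal sketch of the pruned version,
assuming a find that supports -ls (GNU and BSD find do).  The function
name and the choice of /proc, /sys and /dev as the trees to skip are mine;
extend the list for whatever other pseudo file systems you have mounted.

```shell
# Write a "find -ls" listing of a tree, skipping pseudo file systems.
list_inodes() {
    # $1 = tree root (normally /), $2 = output file (keep it outside odd trees)
    root=${1%/}     # strip one trailing slash so "/" becomes ""
    find "${root:-/}" \
        \( -path "$root/proc" -o -path "$root/sys" -o -path "$root/dev" \) -prune \
        -o -ls > "$2"
}

# Example call (run as root before the backup):
#   list_inodes / /root/inodes.list
```

The inode number is the first column of each line, so a later
"grep -w 235255 /root/inodes.list" narrows down which file fsck is asking about.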

HTH,

Phil

