[ale] One for the archives

Mark Wright mpwright at speedfactory.net
Sun Mar 4 00:19:43 EST 2007


WOW.  I haven't heard a story like that in a long time.

Mark
On Mar 3, 2007, at 9:27 PM, James P. Kinney III wrote:

> A server got hosed because of the following series of failures. Since
> the final step was a major "GOTCHA", I am sharing it here now so that
> others can avoid the pain later.
>
> Background:
>
> Main SOHO server with a SCSI card for tape backup (old DLT 7000) and
> four 200GB SATA drives in a software RAID setup. The main data storage
> area (a big samba share spot) was stored across all 4 drives in a
> RAID 5 array.
>
> System hiccups and reports a failed drive (it won't spin up at all).
> No problem. Not a hot-swap system so it is taken down, the drive
> replaced and the system rebooted to run-level 1. A console tail of
> /proc/mdstat shows the system is doing a drive recovery/repair onto
> the new hard drive. Everything looks good.
>
> After some period of time (approximately 10-20 minutes) the system is
> seen REBOOTING!
>
> It was assumed that all was OK as after the reboot, no forced
> filesystem checks occurred. It was quite odd that the server would
> shut down like that. About 2-3 minutes later, it rebooted itself again.
>
> At this time it was determined that the power supply was failing.
>
> It was replaced.
>
> Later, it was determined that almost all of the files in the samba
> share section were scrambled. And the backup application (bacula) had
> lost all of its config files and the backup catalog.
>
> Then the database failed to start.
>
> Panic begins to creep in. The power blink during the hard drive
> recovery had apparently caused massive damage to the storage systems.
>
> A new drive and fresh OS were installed. The old RAID arrays were
> mounted in order to extract what was usable from the samba shares.
> Email files recovered OK as well as home directories. But the samba
> shares were still screwball as well as all the backup system catalog
> and database.
>
> So the process was begun to extract the backup catalog off the tapes.
> Searching for the catalog files is a painfully laborious task on a
> poky-slow tape drive when there are 21 tapes to sift through.
>
> While the backups were being hunted down, calendar time continues on
> and several weeks go by with no working backups (only one tape drive
> and it spent all day "collecting its thoughts" for recovery). A file
> from the samba share was discovered to be clearly scrambled and
> worthless (an installation disk for an application that had been
> stored with an md5 checksum). So it was deleted since the disk was
> available and it would need to be recopied anyway.
>
> The delete took a long time to return from.
>
> The entire filesystem had been deleted.
>
> Everything. All files.
>
> The file was deleted from within its containing directory using the
> command rm <filename> and then answering "yes" to the "are you sure"
> prompt.
>
> As far as can be discerned, the file corruption was bad enough that
> the delete process was redirected to another point in the filesystem
> where massive deletion occurred.
>
> The moral of this story is three-fold:
>
> 1. Bare-metal recovery of the backup system is both hard and more
> important than air.
>
> 2. Any filesystem that becomes corrupted because of a RAID 5
> malfunction should not be trusted at all under any circumstances. It
> should be removed from the system and overwritten immediately and the
> contents recovered from backups.
>
> 3. Any time a drive fails in a RAID system, go ahead and replace the
> power supply for safety reasons. Unless it is a redundant power supply
> (this was not) it will certainly cost less than the antacid bill on
> this.
>
> -- 
> James P. Kinney III
> CEO & Director of Engineering
> Local Net Solutions,LLC
> 770-493-8244
> http://www.localnetsolutions.com
>
> GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
> <jkinney at localnetsolutions.com>
> Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale



