[ale] One for the archives

H. A. Story adrin at bellsouth.net
Wed Mar 7 20:42:02 EST 2007


I agree DLT drives are very slow and my experience is that they have a 
low MTTF.

I tried Bacula and didn't care for it too much.  I just wanted a simple 
backup program that verifies, and it wanted to catalog and do a full 
system backup first.  I didn't have time to argue with it then.
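
To be concrete about what "a simple backup program that verifies" could 
mean, here is a minimal Python sketch. It is not how Bacula or any other 
real tool works; the function names (sha256_of, backup_and_verify) and 
the directory arguments are made up for illustration. It copies a tree, 
then re-reads both sides and compares checksums.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source_dir, dest_dir):
    """Copy source_dir into dest_dir, then return the relative paths
    of any files whose copy is missing or hashes differently."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(source, dest, dirs_exist_ok=True)

    mismatches = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(source)
        copy = dest / relative
        if not copy.is_file() or sha256_of(src_file) != sha256_of(copy):
            mismatches.append(str(relative))
    return mismatches
```

An empty list back means every file read back identically. That only 
shows the copy matches the source at that moment; it says nothing about 
whether the source was already damaged before the copy was made.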

Veritas: Well, I would rather not say.  I would like to hear what others 
have used for backups and, more importantly, bare metal recovery, though 
this may not be the place for it.

Adrin


James P. Kinney III wrote:
> On Mon, 2007-03-05 at 07:50 -0500, Jeff Lightner wrote:
>   
>> Bacula doesn't have catalog backups?   In NetBackup this is an important
>> thing.  You can later reinstall the software then restore the most
>> recent catalog backup and voila all your tapes are there.
>>     
>
> Bacula does have catalog backups. They are on tape. The screwup on my
> end was not having them also somewhere else. DLT7000 is painfully slow
> to do a bscan recovery from.
>   
snip
>> -----Original Message-----
>> From: ale-bounces at ale.org [mailto:ale-bounces at ale.org] On Behalf Of
>> James P. Kinney III
>> Sent: Saturday, March 03, 2007 9:27 PM
>> To: Atlanta Linux Enthusiasts
>> Subject: [ale] One for the archives
>>
>> A server got hosed because of the following series of failures. Since
>> the final step was a major "GOTCHA", I am sharing it here now so that
>> others can avoid the pain later.
>>
>> Background:
>>
>> Main SOHO server with SCSI card for tape backup (old DLT 7000) and 4x
>> 200GB SATA drives in a software RAID setup. The main data storage area
>> (a big samba share spot) was stored across all 4 drives in a RAID 5 array.
>>
>> System hiccups and reports a failed drive (it won't spin up at all). No
>> problem. Not a hot-swap system so it is taken down, the drive replaced
>> and the system rebooted to run-level 1. A look at /proc/mdstat on the
>> console shows the system is doing a drive recovery/repair onto the
>> new hard drive. Everything looks good.
>>
>> After some period of time (approximately 10-20 minutes) the system is
>> seen REBOOTING!
>>
>> It was assumed that all was OK as after the reboot, no forced filesystem
>> checks occurred. It was quite odd that the server would shut down like
>> that. About 2-3 minutes later, it rebooted itself again.
>>
>> At this time it was determined that the power supply was failing.
>>
>> It was replaced.
>>
>> Later, it was determined that almost all of the files in the samba share
>> section were scrambled. And the backup application (bacula) had lost all
>> of its config files and the backup catalog.
>>
>> Then the database failed to start.
>>
>> Panic begins to creep in. The power blink during the hard drive recovery
>> had apparently caused massive damage to the storage systems.
>>
>> A new drive and fresh OS were installed. The old RAID arrays were mounted
>> in order to extract what was usable from the samba shares. Email files
>> recovered OK as well as home directories. But the samba shares were
>> still screwball as well as all the backup system catalog and database.
>>
>> So the process was begun to extract the backup catalog off the tapes.
>> Searching for the catalog files is a painfully laborious task on a
>> poky-slow tape drive when there are 21 tapes to sift through.
>>
>> While the backups were being hunted down, calendar time continues on and
>> several weeks go by with no working backups (only one tape drive and it
>> spent all day "collecting its thoughts" for recovery). A file from the
>> samba share was discovered to be clearly scrambled and worthless (an
>> installation disk for an application that had been stored with an md5
>> checksum). So it was deleted since the disk was available and it would
>> need to be recopied anyway. 
>>
>> The delete took a long time to return from.
>>
>> The entire filesystem had been deleted.
>>
>> Everything. All files. 
>>
>> The file was deleted from within its containing directory using the command
>> rm <filename> and then answering "yes" to the "are you sure" prompt.
>>
>> As far as can be discerned, the file corruption was bad enough that the
>> delete process was redirected to another point in the filesystem where
>> massive deletion occurred.
>>
>> The moral of this story is three-fold:
>>
>> 1. Bare-metal recovery of the backup system is both hard and more
>> important than air.
>>
>> 2. Any filesystem that becomes corrupted because of a RAID 5 malfunction
>> should not be trusted at all under any circumstances. It should be
>> removed from the system and overwritten immediately and the contents
>> recovered from backups.
>>
>> 3. Any time a drive fails in a RAID system, go ahead and replace the
>> power supply for safety reasons. Unless it is a redundant power supply
>> (this was not) it will certainly cost less than the antacid bill on
>> this.
>>
>> -- 
>> James P. Kinney III          
>> CEO & Director of Engineering 
>> Local Net Solutions,LLC        
>> 770-493-8244                    
>> http://www.localnetsolutions.com
>>
>> GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
>> <jkinney at localnetsolutions.com>
>> Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7
>> _______________________________________________
>> Ale mailing list
>> Ale at ale.org
>> http://www.ale.org/mailman/listinfo/ale


