[ale] read after write verify?, data scrubbing procedures

Phil Turmel philip at turmel.org
Fri Oct 26 16:21:16 EDT 2012


On 10/26/2012 02:26 PM, mike at trausch.us wrote:
> On 10/26/2012 09:16 AM, Phil Turmel wrote:
>> The catch that some people encounter is that some of the metadata space
>> is wasted, and never read or written.  If a URE develops in that area,
>> no amount of raid scrubbing will fix it, leaving the sysadmin scratching
>> their head.
> 
> Eh, yeah, but I pull the member first and ask questions later.  The way 
> that I see it, if a drive in a RAID has failed, I don't have time to 
> scratch my head and find out why it failed, I have only the time to 
> replace it.  The questions come later, when I dig around logs (both the 
> system and the drive) and usually the answer is clear from the drive 
> logs alone...

UREs by themselves are *not* signs the drive has failed.  On modern
drives, spec'ed at one error per 1x10^14 bits read, they happen all too
often.  (Four complete read passes through a 3T drive ~= 1x10^14 bits.)
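That arithmetic is easy to sanity-check (a sketch using shell
arithmetic; the 3 TB size and 1x10^14-bit spec are the figures above):

```shell
# Bits read in four full passes of a 3 TB drive, versus the
# one-URE-per-1x10^14-bits spec quoted above.
bits_per_pass=$((3 * 10**12 * 8))   # 3 TB = 3e12 bytes = 2.4e13 bits
echo $((4 * bits_per_pass))         # ~9.6e13 bits, right at the 1e14 spec
```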

I scrub my drives every week, and I'm not replacing the 3T drives every
month.  Nor will the manufacturer take them on warranty for UREs.

An uncorrectable read error simply needs to be rewritten.  If that
succeeds in place, there's nothing (yet) wrong with the drive.
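That rewrite is exactly what an md scrub does.  One can be kicked off
by hand (a sketch; md0 is a placeholder for your array, and this needs
root -- Debian-based systems ship an mdadm "checkarray" cron job that
does the same thing on a schedule):

```shell
# Ask md to read every sector of the array and fix what it can.
# "check" only reads and counts mismatches; "repair" also rewrites.
echo repair > /sys/block/md0/md/sync_action

# Watch progress, then see how many mismatches were found.
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
```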

> I did have it happen once, though: about 10 seconds that seemed like 
> an eternity.  I had a RAID 6 that was rebuilding because of a 
> single-drive failure.  It was about 98% finished rebuilding the array 
> when another drive failed.  Oh, and this was my first failed disk 
> ever in a RAID 6.  :-)
> 
> The rebuild finished, and then ANOTHER drive failed about 20 seconds 
> later, as I was getting ready to shut down the system to replace failed 
> drive #2.
> 
> Those 30 seconds are 30 seconds I will not forget.
> 
> Fortunately, the three drives were the last ones out of the original 
> set.  It was known that they were going to fail.  But the swap-out 
> schedule got held up for some reason I no longer recall and the 
> drives---which were supposed to all be replaced within one year of 
> deployment---had lasted about 19 months.  (They were horrible choices 
> for a RAID, but they were cheap.  "Re-manufactured", "green" drives.)

This scenario is extremely common with cheap drives, due to a mismatch
between controller timeouts and the drives' internal error recovery
timeouts.  Standard desktop drives have extremely long error recovery
algorithms, on the order of two or three minutes, while the Linux block
layer has a default command timeout of 30 seconds.  The following
sequence happens when creating an array from "green" drives:

1) Drive A experiences a URE and tries to recover it,
2) Controller for A times out after 30 seconds and reports the error,
3) MD raid reconstructs that data from the other drives (mirror or parity) and
4a) Supplies it to the caller,
4b) Tries to write the data back to A,
5) Drive A is still busy recovering and fails to respond to the write,
6) Drive A is kicked out of the array as "failed", array is degraded
7) Spare drive Z is added to the array and a rebuild started,
8) Drive B experiences a URE and tries to recover it,
9a) On raid5 or single mirrors, rebuild stops, data is lost.
9b) On raid6 or multi mirrors, #2-#6 repeat
10) (raid6 or mirrors+) Drive C experiences a URE...

Mind you, this happens with *good* drives that just happen to have UREs
within the span of a rebuild.  With modern drive sizes, this is very likely.

Users of green drives in raid who never knew to scrub their arrays are
often burned by this, as after months of operation most drives have at
least one weak spot.  Then they scrub, or have a real failure on one
drive, and all hell breaks loose.
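You can check whether a given system has this mismatch (a sketch; sda
is a placeholder device, and smartctl comes from smartmontools):

```shell
# The kernel's per-command timeout for the device, in seconds
# (default 30) -- the value step 2 above runs into.
cat /sys/block/sda/device/timeout

# The drive's own error-recovery limit, if it supports SCT ERC.
# Values are reported in tenths of a second; "disabled" or an
# unsupported-command error means the drive may retry for minutes.
smartctl -l scterc /dev/sda
```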

> In case anyone's curious, the original plan was to swap out 1 drive per 
> quarter, except for the last two drives which were to be swapped out a 
> month apart.  12 months was supposed to be the longest any of them were 
> there...

Except for one early death, all of my non-green drives have lasted 3-5
years of 24/7 duty with weekly scrubs.

I researched the above scenario after I replaced a couple of old
Seagate drives with newer, larger ones that happened not to offer the
timeout adjustment the old ones had (SCTERC).  A few months later they
both dropped out of a raid6 during a scrub.

The only manufacturer of consumer/desktop drives that still supports
SCTERC is Hitachi, FWIW.
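The usual workarounds, where the hardware allows them (a sketch; sda is
a placeholder, and 70 tenths of a second = 7.0 s is just a common
choice -- note the SCTERC setting does not survive a power cycle, so it
belongs in a boot script):

```shell
# Preferred: cap the drive's internal recovery at 7 seconds so it
# answers the controller well before the 30-second kernel timeout.
smartctl -l scterc,70,70 /dev/sda

# Fallback for drives without SCTERC support: raise the kernel
# timeout past the drive's worst-case recovery time instead.
echo 180 > /sys/block/sda/device/timeout
```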

>>> ... are inversely proportional to just how much you actually attempt to
>>> protect your data from failure.  :-)  And being that I have backups in
>>> place, I'm not terribly worried about that.  Drive fails?  Replace it.
>>> Two drives fail?  Replace them.  Three or more drives fail?  Recover it.
>>>    I get a much larger paycheck that week, then.
>>
>> :-)  I'm self-employed.  I get a much *smaller* paycheck when I spend
>> too much time on this.
> 
> Hrm.  Bill hourly!

?  Bill myself hourly?  I'm an engineer, not an IT contractor.  I do IT
for myself and my own small business.  Time spent on IT is time *not*
spent on engineering.

> Flat-rate is high-risk, and I'll only do it for insane values of "flat 
> rate".  Pay me $25,000 per month, and I'll become your dedicated support 
> dude, no questions asked, and assign all my other work to someone else. 
>   That's about the smallest flat rate I'd take.  :-)

We're getting a bit OT here, but I arrange for most of my engineering
work to be paid a fixed fee per project.

>>>> par2 is much better than md5sums, as it can reconstruct the bad spots
>>>> from the Reed-Solomon recovery files.
>>>
>>> Interesting.  Though it looks like it wouldn't work for my applications
>>> at the moment.  Something that can scale to, oh, something on the order
>>> of two to four terabytes would be useful, though.  :-)
>>
>> I find it works very well keeping archives of ISOs intact.  The larger
>> the files involved, the more convenient par2 becomes.
> 
> My reading of the Wikipedia article implied that wasn't really 
> possible.  I'm guessing that it's subtly inaccurate somehow---my 
> understanding was that Par2 is limited to 32,768 blocks of recovery 
> data.  That doesn't sound like it'd scale to 1 TB or so unless the 
> block size is 32 MB or larger.

Works fine with 8+ GB isos.  I didn't read the wiki article.
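For scale: 32,768 blocks over 1 TB works out to roughly 30 MB per
block, which par2 copes with fine.  A typical round trip looks like
this (a sketch; big.iso is a placeholder file, and par2 here is the
par2cmdline tool):

```shell
# Create recovery files with 10% redundancy alongside the archive.
par2 create -r10 big.iso

# Later: verify, and repair if the file has picked up bad spots.
par2 verify big.iso.par2
par2 repair big.iso.par2
```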

>>> I'll keep an eye on the third version of that spec, too.  Learn (about)
>>> something new every day!
>>>
>>>> Indeed, raid5 cannot be trusted with large devices.  But raid6 *can* be
>>>> trusted.  And is very resilient to UREs if the arrays are scrubbed
>>>> regularly.
>>>
>>> Well, that depends.  The level of trust in each comes from the number of
>>> drives.  For example, would you trust a bank of 24 drives to RAID 6?
>>> Only if you're nuts, I suspect.
>>
>> For near-line light-duty high-capacity storage, I would certainly set up
>> such a raid6.  Configuring 24 drives as 22 in raid6 w/ two hot spares
>> would be more robust than a pair of twelve-drive raid6 arrays concatenated.
>>
>> Same capacity, higher unattended fault tolerance, but significantly
>> lower performance.  Everything is a tradeoff.
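For the curious, the 22-in-raid6-plus-two-hot-spares layout above would
look something like this with mdadm (a sketch; the /dev/sd[b-y] names
are placeholders for the 24 members):

```shell
# 24 members: 22 active in the raid6 (20 data + 2 parity),
# plus 2 hot spares that md will pull in automatically on failure.
mdadm --create /dev/md0 --level=6 \
      --raid-devices=22 --spare-devices=2 /dev/sd[b-y]
```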
>>
>>> I'd use RAID 5 for a 3(2)-drive array.  I'd use RAID 6 up to probably
>>> 7(5), tops.  If I needed to do anything more than that, I'd start
>>> stacking RAID levels depending on the application's requirements.
>>
>> I don't use raid5 at all nowadays.  Triple mirror on three devices is my
>> minimum setup.  Raid10,f3 or raid6 for anything larger.
> 
> I don't use RAID for almost anything: my desktop has a single drive in 
> it, and I perform backups of everything that isn't under git, very 
> regularly.
> 
> Everything that is under git exists in at least 3 places for the 
> private/internal projects, and usually dozens for the public ones.  So I 
> don't worry about those so much.

+1 for git

> Also, we don't have hundreds of GB of data ourselves to back up: our 
> whole history fits on a CD at the moment, making backup relatively 
> convenient for the time being (and for the foreseeable future).  In fact, 
> we're looking at using M-Disc for the annual archive disks.

My media server at home (MythTV) generates about 8G per hour per
tuner.  :-)  Not super-critical, of course, and not backed up.  Raid is
for uptime, not backup.  I do not want to disappoint
she-who-must-be-obeyed.

Phil

