[ale] read after write verify?, data scrubbing procedures

Thu Oct 25 17:58:21 EDT 2012

On 10/25/2012 04:43 PM, Ron Frazier (ALE) wrote:
> Hi all,
> 
> In another thread, I mentioned that I do a manual data scrubbing
> procedure on my hard drives which involves reading, inverting, writing,
> reading, inverting, and writing all sectors on the hard drive periodically
> with Spinrite.  As an alternative, I use the Linux badblocks
> non-destructive read-write mode.  This process is certainly effective in
> reverifying the integrity of the drive barring any mechanical
> problems or controller problems.  As also mentioned in another
> thread, once some drives develop bad sectors, and based partly on
> advice here, I replace them.

Writing inverted data sounds dangerous to me.  What happens if that tool
is interrupted?  How big a chunk is at risk at any point?

> I basically want to have as much evidence as I can that the drive is
> reliable, periodically, still understanding that spontaneous or
> semi-spontaneous failures are possible.
> 
> I am considering methods of reducing the maintenance procedures while
> still maintaining high confidence in the drives.  Mike T mentioned
> some automation and RAID procedures that I may be able to implement
> on some machines at a future date.  In the meantime, I'm considering
> altering my procedures without giving up too much functionality.
> 
> My first question is, does a HDD do a read after write verify by
> default?  So, if the application or OS sends a command to write a
> sector, will a verify operation automatically be done after the write
> process, and, if the write failed, will the data be automatically
> relocated to a spare sector?  I'm assuming that this is the case, but
> would like to verify that.

No.

This came up on the linux-raid list recently, where a user had
hand-mirrored data onto a used drive (IIRC), placed it into mirror duty,
and was surprised to discover that the writes had not been verified by
the drive.

If you do a simple streaming read vs. streaming write speed test, you'll
find the write performance of typical drives is close to the read
performance.  That clearly indicates that the drive firmware is *not*
moving the heads back to re-read the freshly written data.  If it were,
the write speed would be no better than half the read speed.
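
To put rough numbers on that argument, here is a back-of-the-envelope
sketch in Python.  The 120 MB/s figures are assumed for illustration, not
measurements from any particular drive, and the model simply charges one
extra read pass for every block written.

```python
def write_speed_with_verify(read_mb_s: float, write_mb_s: float) -> float:
    """Effective write throughput if every written block were re-read once.

    Time per MB = time to write it + time to read it back.  This ignores
    the rotational delay of waiting for the sector to come around again,
    so a real drive doing this would be slower still.
    """
    seconds_per_mb = 1.0 / write_mb_s + 1.0 / read_mb_s
    return 1.0 / seconds_per_mb


# Assumed example figures (hypothetical drive).
read_rate = 120.0
raw_write_rate = 120.0
effective = write_speed_with_verify(read_rate, raw_write_rate)
print(f"effective write rate with verify: {effective:.1f} MB/s")
print(f"fraction of read rate: {effective / read_rate:.2f}")
```

With equal raw read and write rates the result is half the read rate,
which is why a measured write speed close to the read speed rules out a
verify pass on every write.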

As I understand it, only sectors that have had read errors (marked
"pending") will be rechecked after write.  I also understand that reads
that succeed, but require more than some threshold of error correction
in the drive, trigger a silent rewrite attempt/relocation.
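
The pending and relocated counts are usually visible in the drive's SMART
attribute table.  As a minimal sketch, assuming the text comes from
"smartctl -A" and that the drive reports the attribute names smartmontools
commonly uses (vendors vary, so treat the names below as assumptions), the
raw counters can be pulled out like this:

```python
# Attribute names as smartmontools commonly labels them; some drives
# use different names or omit these attributes entirely (assumption).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def sector_counters(smartctl_text: str) -> dict[str, int]:
    """Extract raw values of the watched attributes from `smartctl -A` text."""
    counters = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value
        # in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                counters[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value not a plain integer; skip it
    return counters
```

The input text could be captured with something like
subprocess.run(["smartctl", "-A", "/dev/sdX"], capture_output=True,
text=True).stdout, run as root, where /dev/sdX is a placeholder for the
real device.  A pending count that rises between runs is the thing to
watch for.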

> Assuming the read after write verify happens, then, if I write data
> to every sector of a drive during a backup operation, as I mentioned
> in the clonezilla thread, then, effectively, I have written and read
> every sector on the target drive.  Perhaps, then, a successful backup
> operation can take the place of my data scrubbing activity on the
> target drive.
> 
> Then, there is the question of the data scrubbing on the source
> drive.  In this case, once I've completed a backup, I will have read
> each sector on the source drive.  Assuming there are no read errors
> (if there were, I would have to get out the big guns), then this has
> accomplished 1/2 of what my scrubbing does, the read half.
> 
> So, the question arises as to what about the write test part of my
> scrubbing operation for the source drive, which didn't get done.
> Normally, my scrubbing operation would have written to each sector.
> If the drive automatically detects write problems when a write is
> attempted, perhaps the write scrubbing operation is not necessary, or
> is less necessary.
> 
> Hope this makes any sense whatsoever.  Any opinions?

Writing over a sector that might be "weaker" than another will hide its
condition from the drive firmware.  When it is finally unreadable,
you've lost it.  A spot that degrades over time but is read regularly
will likely be caught by the ECC threshold and relocated before it is lost.

A SMART background self-test (long) on a regular basis is the best
medicine for solo storage, preferably with Reed-Solomon reconstruction
files available (such as those created by "par2").
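
As a rough sketch of what that routine could look like, the Python below
only builds and prints the command lines rather than running them.  The
device /dev/sdX, the file names, and the 10% redundancy level are all
placeholders chosen for illustration, not recommendations.

```python
import shlex
from typing import List, Sequence


def long_selftest_cmd(device: str) -> List[str]:
    """Start a background long self-test on the drive."""
    return ["smartctl", "-t", "long", device]


def selftest_log_cmd(device: str) -> List[str]:
    """Show the self-test log once the test has had time to finish."""
    return ["smartctl", "-l", "selftest", device]


def par2_create_cmd(par2_file: str, files: Sequence[str],
                    redundancy_percent: int = 10) -> List[str]:
    """Create Reed-Solomon recovery files covering the given files."""
    return ["par2", "create", f"-r{redundancy_percent}", par2_file, *files]


def par2_verify_cmd(par2_file: str) -> List[str]:
    """Check the protected files against the recovery set."""
    return ["par2", "verify", par2_file]


if __name__ == "__main__":
    # Hypothetical device and file names; substitute real ones.
    for cmd in (
        long_selftest_cmd("/dev/sdX"),
        selftest_log_cmd("/dev/sdX"),
        par2_create_cmd("photos.par2", ["photos.tar"]),
        par2_verify_cmd("photos.par2"),
    ):
        print(shlex.join(cmd))
```

If "par2 verify" reports damage, "par2 repair" with the same .par2 file
attempts the reconstruction.  The self-test commands need root.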

For RAID arrays, add a regular "check" scrub.  Use a "repair" scrub only
when a "check" scrub finds mismatches.
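
For Linux md arrays, that policy can be sketched against the sysfs
interface.  This is a minimal illustration, assuming the array's control
files live in a directory such as /sys/block/md0/md (md0 is a placeholder)
and that mismatch_cnt is read only after a "check" pass has finished.

```python
from pathlib import Path


def start_scrub(md_dir: Path, action: str) -> None:
    """Request a scrub by writing 'check' or 'repair' to sync_action."""
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported scrub action: {action!r}")
    (md_dir / "sync_action").write_text(action + "\n")


def scrub_idle(md_dir: Path) -> bool:
    """True when no check, repair, or resync is currently running."""
    return (md_dir / "sync_action").read_text().strip() == "idle"


def repair_needed(md_dir: Path) -> bool:
    """True if the last completed check counted any mismatches."""
    return int((md_dir / "mismatch_cnt").read_text().strip()) > 0


def follow_up(md_dir: Path) -> str:
    """After a check: start a repair only if mismatches were found."""
    if not scrub_idle(md_dir):
        return "busy"
    if repair_needed(md_dir):
        start_scrub(md_dir, "repair")
        return "repair started"
    return "clean"
```

A cron job might call start_scrub(Path("/sys/block/md0/md"), "check") on a
schedule and follow_up() some hours later, both as root.  Many
distributions already ship a periodic check job for md arrays, so it is
worth looking for one before adding another.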

HTH,

Phil