[ale] Which large capacity drives are you having the best luck with?

Wed Jan 5 20:42:07 EST 2011

On Wed, 5 Jan 2011 19:23:10 -0500
Greg Freemyer <greg.freemyer at gmail.com> wrote:

> The first thing I look at is POH (Power on Hours).  In this case
> 27,871.  This field has been pretty reliable in my experience to be
> exactly what it says.  So my drive is not exactly new.

That's a pretty good specimen, 3 years powered on :)

> Then look at Reallocated_Sector_Ct.  Mine is zero.  That's cool.

I'm probably not telling you anything new, but when I see this number
start to move away from 0 I start ordering a replacement drive :)

Interestingly, the X25-M in my laptop is showing 1 reallocated sector.

> But Hardware_ECC_Recovered is 140,010,573.  That may sound large, but
> remember, the reads succeeded because of the ECC data, so there is no
> data loss.  I tend to agree with you that as magnetism fades for a
> sector, checksum failures increase and ECC recovery is needed.
> Spinrite used as you describe may keep that value lower.

It may sound large, but what does it really mean?  Is it literally the
number of ECC errors?  How large is a single ECC block?  How many ECC
blocks are involved in the read of a single sector?  Could every bad
ECC in a sector result in the count going up?  5000 per hour sounds
like a lot of recovered reads to me, assuming it means sectors.
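
For anyone who wants to check the arithmetic, here is a rough sketch
using only the two raw values quoted above.  It assumes the raw
Hardware_ECC_Recovered value is a plain event counter, which is exactly
the open question, so treat the results as illustrative.

```python
# Raw values quoted earlier in this thread.
power_on_hours = 27_871
ecc_recovered = 140_010_573

HOURS_PER_YEAR = 24 * 365

# Convert power-on hours to years, and the ECC count to a rate.
# ASSUMPTION: the raw ECC value is a simple count of recovery events.
years_powered_on = power_on_hours / HOURS_PER_YEAR
recoveries_per_hour = ecc_recovered / power_on_hours
recoveries_per_second = recoveries_per_hour / 3600

print(f"{years_powered_on:.2f} years powered on")
print(f"{recoveries_per_hour:.0f} recoveries per power-on hour")
print(f"{recoveries_per_second:.2f} recoveries per second")
```

That works out to a bit over 3 years powered on and a little more than
one recovery per second, if the counter means what it appears to mean.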

> But I don't think spinrite tries to detect sectors that have been ECC
> recovered.  So it doesn't really know the details.

I would agree.  The drives in my laptop don't report ECC errors via
smart.

> A smart long self test has the ability to know that an ECC recovery is
> needed for a sector.  What it does with the knowledge, I don't know.
> But it certainly has more knowledge to work with than spinrite.

How thorough do you think a long smart test is?  I've had drives die
within a day or three of passing a long smart test.  They also don't
take all that long, and they sure don't cause the drive to make any
noise :)

> fyi: hdparm has a long read capability that allows a full physical
> sector to be read with no error correction!  So spinrite could in
> theory read all of the sectors with CRC verification disabled and
> check the CRC itself.  The trouble is that the drive manufacturers
> implement proprietary CRC / ECC solutions, so spinrite has no way to
> actually delve into the details of the sector's data accuracy.

I doubt that spinrite does this.  

> fyi: hdparm has a way to force a write to a Pending Sector and put new
> good data on it.  Thus spinrite could do this if it wanted to as well.
>  I certainly hope it is not doing so.

My understanding is that spinrite attempts to read every sector and
(eventually) write them back to the disk.  If it fails to read
correctly it will start reading similarly to dd_rescue.

> > It also doesn't mean that the sector has been reallocated.
> 
> You imply a sector can be moved without it being reallocated.  I think
> that is wrong.  The only way to move the sector is to allocate a spare
> and use it instead of the original.

I think he's implying that the data in the sector can be moved by
software.  That can't be done without Spinrite understanding the file
system and making the appropriate changes.  It definitely doesn't do
this.

> > This forces the drive's firmware to evaluate the performance at
> > that point, and forces the surface to absorb both a 1 and 0 in turn
> > at that point.  Also, I believe that the magnetic fields
> > deteriorate over time.  I could probably corroborate that with some
> > extensive research.
> 
> Agreed, but I often store hard drives offline for extended periods.
> We rarely see read failures for drives we put back on line.  So the
> deterioration is very slow and not likely to be an issue.

How do you know?  Sun published a lot of numbers related to silent data
corruption.  ECC is pretty fallible.  Especially when it has to be used
to correct data 150 million times.

> fyi: The DOD uses thermite in the drive platter area to heat the media
> to several hundred degrees.  When this happens the magnetism is
> released and the data is gone.

The magic smoke gets out!

> Especially with laptop drives, you get physical damage as the flying
> head hits the platters from time to time.  To protect the platters,
> they are often actually coated with a fine coat of diamond dust.
> That's one reason laptop drives cost more.

Laptop drives are rated for higher g forces than desktop drives.
Having taken both apart, I wouldn't have guessed that; it must have
something to do with the inertia of lighter parts.

> > The read invert write read invert write cycle, if nothing else,
> > will ensure that all the magnetic bits are good and strong since
> > they are all ultimately rewritten.
> 
> True, but I think normal degradation is much slower than you imply.

I agree.  

> For a drive you've treated with spinrite, what's your ECC_Recovered /
> POH ratio?
> 
> ie. Mine is 5000 recoveries per power on hour.  And I don't do
> anything to "maintain" it.  This is just my desktop machine.

I wish we had old smart numbers for your drive.  I wonder if that ratio
has been increasing over time and if so, by how much.
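
With even one older reading, the comparison could be sketched like
this.  The current snapshot uses the numbers from this thread; the
earlier snapshot is entirely HYPOTHETICAL (no old numbers exist for
this drive), and `SmartSnapshot` and `recovery_rate_between` are names
made up for the illustration.  It also assumes the raw ECC value is a
monotonic counter, which may not hold for every vendor.

```python
from typing import NamedTuple


class SmartSnapshot(NamedTuple):
    """Two raw SMART values recorded at the same moment."""
    power_on_hours: int
    ecc_recovered: int


def recovery_rate_between(old: SmartSnapshot, new: SmartSnapshot) -> float:
    """ECC recoveries per power-on hour over the interval old -> new."""
    hours = new.power_on_hours - old.power_on_hours
    if hours <= 0:
        raise ValueError("new snapshot must have more power-on hours")
    return (new.ecc_recovered - old.ecc_recovered) / hours


# Numbers quoted in this thread.
current = SmartSnapshot(power_on_hours=27_871, ecc_recovered=140_010_573)

# HYPOTHETICAL earlier reading, invented purely for illustration.
earlier = SmartSnapshot(power_on_hours=20_000, ecc_recovered=90_000_000)

lifetime_rate = current.ecc_recovered / current.power_on_hours
recent_rate = recovery_rate_between(earlier, current)

print(f"lifetime: {lifetime_rate:.0f} recoveries/hour")
print(f"recent:   {recent_rate:.0f} recoveries/hour")
if recent_rate > lifetime_rate:
    print("rate is climbing")
```

A recent rate well above the lifetime average would suggest the ratio
is rising; logging the SMART attributes periodically would provide the
snapshots.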

> I believe a smart long self test will read all of the sectors and
> identify those that are not ECC Recoverable.  I don't think it will
> actually reallocate them.

I don't believe a long smart self test touches every sector.  A long
test runs much too fast for that, at least on the drives I've paid
attention to.

> What spinrite likely does is read the sector in various ways.  ie many
> data recovery tools can read the sectors in reverse order.  This
> causes the drive to align the head slightly differently I believe.
> Due to that slight change, some bad sectors can be read.  So I
> actually do think spinrite could have some logic to do this that
> normal read logic would not have.

BACKUPS, BACKUPS, BACKUPS.  And then more backups.  Data recovery is
for people who don't have backups :).

> > Again, this may or
> > may not trigger sector reallocation.
> 
> I surely hope writing to a sector that previously had read failures
> not handle-able via ECC recovery triggers a reallocate.

If it doesn't then you're out of spare sectors and the drive is ready
for the scrap heap.  This is also one of the reasons why you want to
image a bad drive onto a good drive.  Once you start writing you can
really screw things up even more.

> >  Spinrite will report these data
> > areas as recovered or unrecovered as appropriate.  The drive itself
> > may still be fully usable, if, for example, the data error was
> > caused by a power failure, but the drive was not damaged.  If
> > sectors start getting reallocated, I would agree that it's time to
> > consider changing the drive out, as I did with one of mine last
> > night.
> 
> I'm not so sure I agree.  A lot of reallocates are just physical
> platter issues.  It used to be that drives shipped new with lots of
> reallocated sectors.
> 
> Admittedly, new ones tend to have zero anymore.

Drives are cheap.  RMAing a drive is cheap.  If a drive starts acting
up and doesn't want to stay in one of my RAIDs it is time to replace
it.  I saw a 5900 rpm seagate 2 TB drive on sale for $100 shipped
today.

I could probably also argue that I don't completely trust a drive fresh
out of the box, either :)

Pat