[ale] Which large capacity drives are you having the best luck with?

Ron Frazier atllinuxenthinfo at c3energy.com
Tue Jan 4 23:03:37 EST 2011


Greg,

Thanks for the note.  I see from your sig that you're into computer
forensics, so this should be right up your alley.  I listened to your IT
Autopsy audio file you have posted on your site.  That's pretty cool.

I know what you're saying.  It does seem hard to believe.  Your note
piqued my interest, and sent me on a tangent most of the day researching
hard drive failure modes.  I thought I'd share my experience with
everyone.

Before getting to that, I misspoke regarding my startup procedure for
new drives.  Actually, I first wipe them with random data then run
Spinrite's deep analysis 6 times.  That way, every sector and every
binary bit of the drive is read, inverted, and rewritten with both 1's
and 0's, 12 times in total.  This takes a couple of weeks, but
hopefully it minimizes the chance of "infant
mortality", a proven phenomenon where drives tend to fail early in life.
It also thoroughly thrashes (not trashes) the drive and gives its
firmware a substantial chance to discover problems.
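For anyone who wants to approximate that burn-in idea without Spinrite, the core of one pass is just "write a pattern, read it back, compare."  Here's a minimal Python sketch of a single write/read/verify pass.  Note this demo runs against a scratch file; pointing it at a raw device node would be destructive, so I'm deliberately not showing that.

```python
import os
import tempfile

def burn_in_pass(path, size, chunk=1 << 20):
    """One write/read/verify pass: fill with random data, read it back, compare.
    Returns True if every byte read back matches what was written."""
    written = []
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            block = os.urandom(min(chunk, remaining))
            f.write(block)
            written.append(block)       # keep a copy for the verify phase
            remaining -= len(block)
    with open(path, "rb") as f:
        for block in written:
            if f.read(len(block)) != block:
                return False            # mismatch: the medium failed to hold data
    return True

# Demo on a scratch file, NOT a real drive:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    target = tmp.name
print("pass ok:", burn_in_pass(target, 4 * 1024 * 1024))  # -> pass ok: True
os.remove(target)
```

A real burn-in would run this directly over the device and repeat it with inverted patterns, but the read-back-and-compare loop is the essential part.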

Based on your suggestion, I installed smartmontools (which contains
smartctl) and gsmartcontrol (a graphical interface to smartctl) on my
Ubuntu computer.  The man page for smartctl is over 1000 lines and is
too overwhelming to read completely.  I scanned it for info on the long
test.  I never could find out exactly what it does.  I searched Google
and found various interesting things but no details on the test.

Note that if your drive or BIOS is not providing access to the smart
subsystem, you cannot run these tests.  Using gsmartcontrol, I activated
a long test and let it run.  It gives almost NO indication that it's
running.  The progress indicator appears to go to 10% and freeze, the
hard drive light is not on, and the drive doesn't even make much seek
noise.  However, I just waited; eventually the progress bar ticked to
20% and crept along until the test finished about two hours later, with
no errors reported.  I later found that you can run the same test from
the Disk Utility under the Ubuntu system administration menu.
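Since the long test gives so little feedback, one way to watch it is to poll smartctl yourself: `smartctl -c` prints a "Self-test execution status" line with a "X% of test remaining" figure while a test runs.  Here's a small Python sketch that extracts that number (the exact wording of the status line can vary a bit by drive, so treat the regex as an assumption):

```python
import re
import subprocess

STATUS_RE = re.compile(r"(\d+)% of test remaining")

def percent_remaining(smartctl_output):
    """Return the % of the self-test still to run, or None if no test is in progress."""
    m = STATUS_RE.search(smartctl_output)
    return int(m.group(1)) if m else None

def poll_device(dev="/dev/sda"):
    # Requires root; 'smartctl -c' includes the self-test execution status.
    out = subprocess.run(["smartctl", "-c", dev],
                         capture_output=True, text=True).stdout
    return percent_remaining(out)

# Example against captured output rather than live hardware:
sample = ("Self-test execution status:      ( 249) Self-test routine in "
          "progress...\n        90% of test remaining.")
print(percent_remaining(sample))  # -> 90
```

Running `poll_device()` in a loop every few minutes gives you the progress indicator the GUI tools mostly fail to provide.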

Running the test this way gives a friendlier, more accessible list of
the results, with warnings about items of concern.  However, there is
still little indication that it's running other than a very slowly moving
progress bar.  Again, the test itself passed with no errors.  However,
the utility posted a warning and said this drive has a couple of bad
sectors and that the reallocated sector count is 2.  I guess the drive
found this acceptable, but I'm not sure I do.
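Checking for this condition is easy to script, too: the attribute table from `smartctl -A` puts the raw count in the last column.  A sketch (the sample text is abbreviated from what my drive reports, and the parsing assumes the usual 10-column table layout):

```python
def raw_attribute(smartctl_a_output, name):
    """Pull the RAW_VALUE column for a named SMART attribute, or None if absent."""
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[9])  # RAW_VALUE is the 10th column
    return None

sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
count = raw_attribute(sample, "Reallocated_Sector_Ct")
print("reallocated sectors:", count)  # -> reallocated sectors: 2
if count:
    print("warning: drive has reallocated sectors; consider replacing it")
```

Feed it the live output of `smartctl -A /dev/sdX` and you have a one-line health check you can drop into a cron job.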

Google did an extensive study of hard drive failures in their own data
centers a few years ago.

http://labs.google.com/papers/disk_failures.html

https://encrypted.google.com/search?q=google+hard+drive+study&btnG=Search&hl=en&sa=2

The document is extremely technical and somewhat hard to read.  Here are
some summary comments by various writers on the Internet.

From: http://www.webmasterworld.com/goog/3257036.htm  (one of the blog
posts from AlexK)

snapshots: 
      * Young drives (less than 2 years) prefer it hot (above 35C),
        whilst older drives like it mild (below 40C). 
      * Get rid after the first scan error (39 times more likely to fail
        within 60 days than if none). 
      * Get rid after the first sector reallocation (14 times more
        likely to fail within 60 days than if none) (21 times for
        offline reallocations). 
      * Get rid after the first sector Probational Count (16 times more
        likely to fail within 60 days than if none). 
      * More than 56% of failed drives do not have either SMART-reported
        scan errors, sector reallocations, offline reallocations nor
        sector Probational Counts. 
      * More than 72% of all drives report seek errors. 
      * 36% of failed drives have no error signals of any kind.

From:
http://gizmodo.com/237980/google-teaches-us-five-things-about-hard-drive-death

•First of all, Mean Time Between Failure rates mean nothing.
•Secondly, SMART hardware monitoring missed 36% of all uh-ohs.
•Third, overworked drives fail similarly to standard drives after the
first year.
•Fourth, Hard drive age means less than you think.
•Fifth, failure does not go up when temperatures are higher than usual
(unless super high.)

From: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/

The StorageMojo take
There is a lot here and the implications may surprise.

     1. Disk MTBF numbers significantly understate failure rates. If you
        plan on AFRs that are 50% higher than MTBFs suggest, you’ll be
        better prepared. 
     2. For us SOHO users, consider replacing 3 year old disks, or at
        least get serious about back up. 
     3. Enterprise disk purchasers should demand real data to back up
        the claimed MTBFs – typically 1 million hours plus – for those
        costly and now much less studied drives. 
     4. SMART will alert you to some issues, but not most, so the
        industry should get cracking and come up with something more
        useful. 
     5. Workload numbers call into question the utility of
        architectures, like MAID, that rely on turning off disks to
        extend life. The Googlers didn’t study that application, but if
        I were marketing MAID I’d get ready for some hard questions. 
     6. Folks who plan and sell cooling should also get ready for tough
        questions. Maybe cooler isn’t always better. But it sure is a
        lot more expensive. 
     7. This validates the use of “consumer” drives in data centers
        because for the first time we have a large-scale population
        study that we’ve never seen for enterprise drives. 

This is Ron talking again, no longer quoting.

Based on the comments above about reallocation errors, I decided to
replace this hard drive.  I'm going to try to make a warranty claim with
Seagate since the drive is only a year old.

I don't even know what scan errors and sector probational counts are.
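My best guess at the mapping (treat this as an assumption, not gospel) is that "sector reallocations" are SMART attribute 5 (Reallocated_Sector_Ct), and the "probational count" is attribute 197 (Current_Pending_Sector), i.e. sectors on probation awaiting reallocation.  Either way, the study's replace-on-first-event heuristics are simple enough to write down as code:

```python
def should_replace(scan_errors=0, reallocated=0, offline_realloc=0, pending=0):
    """Apply the Google-study heuristics quoted above: the first occurrence of
    any of these events sharply raises the odds of failure within 60 days.
    Returns the list of reasons to replace the drive (empty = no red flags)."""
    reasons = []
    if scan_errors:
        reasons.append("scan error (39x more likely to fail in 60 days)")
    if reallocated:
        reasons.append("reallocated sector (14x)")
    if offline_realloc:
        reasons.append("offline reallocation (21x)")
    if pending:
        reasons.append("probational/pending sector (16x)")
    return reasons

print(should_replace(reallocated=2))  # -> ['reallocated sector (14x)']
```

An empty result isn't a clean bill of health, of course; per the same study, over a third of failed drives showed no warning signs at all.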

Also note that in 36% of the cases, the smart system did not detect
imminent drive failures.

I tried to search for the term "bit rot" on Google to find out about
failure modes, but I couldn't find anything conclusive.

For my part, I am totally convinced that data on hard drives deteriorates
over time.  I also believe that refreshing the data periodically with
something like Spinrite can dramatically prolong the life of the data on
the drive.

I had a Windows drive that was failing once.  It wouldn't even boot.  I
ran Spinrite on it.  It immediately warned me that the smart system said
the drive was on its way out.  I continued the diagnostic.  Afterwards
I was able to boot the machine well enough to get my data backed up.  I
then retired the drive, because I think it would have failed completely.
I have no problem believing the Spinrite testimonials cited on the
podcast, which are many.

I'm glad you mentioned smartctl to me, as it helped me find the bad
sectors on this drive.  However, I don't think it does what Spinrite
does.  Spinrite specifically turns off the smart system so it can do a
very detailed low level analysis of the disk surface.  I never could
find out what the long smart analysis does.  However, it only took 2
hours, whereas Spinrite would take about 48 hours on a 2 TB drive.  I
must assume Spinrite is much more exhaustive.  Also, Spinrite can do
intensive data recovery at a sub-sector level.  If the data cannot be
read initially, it tries up to 2000 times per item to read the data,
using advanced statistical methods which even involve altering the
head's trajectory over the track and the direction of the seek to the track.
I think this is how he's able to resurrect so many dead systems.
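I don't know Spinrite's actual algorithms (they're proprietary), but the general idea of statistical recovery from repeated marginal reads can be sketched as a majority vote: each retry of a flaky sector may come back with different bytes wrong, and voting per byte across many attempts recovers the likeliest original.  A toy illustration with a simulated flaky sector:

```python
import random
from collections import Counter

def majority_vote_read(read_once, attempts=2000):
    """Re-read a flaky sector many times and take a per-byte majority vote.
    `read_once` is any callable returning one bytes-valued read attempt."""
    reads = [read_once() for _ in range(attempts)]
    recovered = bytearray(len(reads[0]))
    for i in range(len(recovered)):
        # Most common value seen at this byte position wins.
        recovered[i] = Counter(r[i] for r in reads).most_common(1)[0][0]
    return bytes(recovered)

# Simulated flaky sector: each read corrupts one random byte of the true data.
truth = bytes(range(16))
def flaky_read():
    b = bytearray(truth)
    b[random.randrange(len(b))] ^= 0xFF
    return bytes(b)

print(majority_vote_read(flaky_read, attempts=101) == truth)  # -> True
```

The real thing presumably works below the byte level and steers the head as described, but the statistics are the same: enough noisy samples of the same sector and the true data dominates.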

For my own purposes, I prefer to just have my drives never fail, which
is why I'm somewhat obsessive about running maintenance on them.  I
usually have good backups for the Windows side of my dual boot systems,
but they are anywhere from 2 weeks to 2 months old.  I have online
backups for data that changes.  I can use the images for full system
recovery, but repatching the computer and making the configuration
changes I lost after a failure is always a tremendous pain and time
sink.  I'm still looking for a good Linux imaging solution.  I could use
my Acronis TrueImage for Windows to backup the Linux partition sector by
sector if I want to.  I never want to reinstall the system, all the
apps, and all the data as that's a week long pain.
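Until I find a dedicated Linux imaging tool, a sector-by-sector image is conceptually just a chunked raw copy plus a checksum for later verification, which is what dd piped through sha256sum buys you.  A toy sketch of that copy-and-verify loop (demonstrated on ordinary files; on a real system the source would be an unmounted partition device, which I'm not showing here):

```python
import hashlib

def image_copy(src, dst, chunk=1 << 20):
    """Chunked raw copy, returning the SHA-256 of the source for later checks."""
    h = hashlib.sha256()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(chunk)
            if not block:
                break
            h.update(block)
            fout.write(block)
    return h.hexdigest()

def verify_image(path, expected_digest, chunk=1 << 20):
    """Re-hash an image file and compare against the digest taken at copy time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest() == expected_digest

# Demo on ordinary files, NOT a live partition:
import os, tempfile
src = tempfile.NamedTemporaryFile(delete=False)
src.write(os.urandom(4096)); src.close()
dst = src.name + ".img"
digest = image_copy(src.name, dst)
print("image verified:", verify_image(dst, digest))  # -> image verified: True
os.remove(src.name); os.remove(dst)
```

A restore is the same copy in the other direction; the digest is what tells you the image you restored from is the one you made.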

By the way, I think power protection and vibration protection are key
elements to drive longevity.

Here are some links I uncovered in my research.

http://sourceforge.net/apps/trac/smartmontools/wiki - makers of
smartctl, also contains info about smart system data

http://home.clara.net/drdsl/OpenSource/smart.html - article about
smartctl

http://karuppuswamy.com/wordpress/2010/05/19/how-to-predict-hard-disk-failure-in-ubuntu-with-3-clicks/ - using the Ubuntu disk utility

http://joshua14.homelinux.org/blog/?p=236 - another technical article

http://www.linuxjournal.com/content/know-when-your-drives-are-failing-smartd - an article in Linux Journal

http://www.linuxjournal.com/magazine - Linux Journal archives, delayed a
couple of months.  Web page format, doesn't look like PDF is available.

http://lists.ufl.edu/cgi-bin/wa?A0=RECMGMT-L - records management list -
thought you'd like this due to your background in computer forensics

Sincerely,

Ron

On Tue, 2011-01-04 at 08:33 -0500, Greg Freemyer wrote: 
> Spinrite is one of those tools whose claims live on the edge of reality.
> 
> Many of their claims about recovery look like bunk to me, but much of
> what you said is real.
> 
> But a smart long selftest (man smartctl) is probably just as good as
> spinrite.  (All new drives support smart).  Use it just as you
> currently use spinrite and I'd bet you get all the same effect.
> 
> Greg
> 
> 
> On 1/3/11, Ron Frazier <atllinuxenthinfo at c3energy.com> wrote:
> > Ryan,
> >
> > I realize it's after Christmas, but I just saw this.  I don't have any
> > 1.5 TB or 2 TB drives yet.  However, I've always liked the Seagate brand
> > and have had good performance and reliability with them.  They also have
> > a 5 year warranty.  I have a couple of 1 TB Seagate drives in my desktop
> > box that I built and they seem to do fine.  I also have a Hitachi 500 GB
> > drive that works well also.
> >
> > No matter what you buy, I would get a copy of Spinrite ( $ 90 )
> > ( http://www.grc.com/sr/spinrite.htm ).  This is one of the world's most
> > advanced disk drive analysis and recovery programs.  The installer is a
> > Windows program, but once you run that, you can create a bootable CD
> > image which boots on it's own and has it's own clone of DOS.  It has to
> > run own its own without the normal OS running.
> >
> > This will allow you to run an exhaustive read, invert, write, read,
> > invert, write analysis on the drive.  Now, while you don't need to
> > recover data on a new drive, this reads and writes every bit in every
> > sector twice (to 1 and 0), and forces the drive's firmware to make a
> > very thorough analysis of what's good and what's not and map out any bad
> > areas.  I do this to ALL new drives I get, and all new computers (it's
> > non destructive).  (Actually, on new drives, I run one sweep of a drive
> > wipe program to put random data on the drive, rather than having all 1's
> > or 0's.  Then I run Spinrite.)  I also try to run Spinrite on each drive
> > 2 - 3 times per year.  Doing this over time reads weak sectors while
> > they're still readable and maps them out if they're getting flaky.  If
> > you do this, barring mechanical, power, or controller problems, you can
> > keep a drive functioning flawlessly for many years.
> >
> > The creator of Spinrite is an expert on computer security.  He has an
> > excellent podcast on the topic at the address below.  Also, on almost
> > every show, he cites an example of a testimonial of a dying drive that
> > has been recovered by this software.  I really think it's worth a look.
> > I have no financial interest in the product, but am a happy user.
> >
> > http://www.grc.com/securitynow.htm
> > http://www.twit.tv/sn
> >
> > If you decide to get and use the product, I'll be glad to explain how to
> > run it.  On drives of the size you're talking about, expect the
> > procedure to take about 48 hours to run.  However, it can be stopped and
> > resumed.
> >
> > Sincerely,
> >
> > Ron
> >
> > On Tue, 2010-12-21 at 22:39 -0500, Matty wrote:
> >> I want to play around with openfiler and freenas over Xmas, and I've
> >> been reading tons of 1.5TB and 2TB disk drive reviews tonight. From
> >> what I've gathered so far, the reliability of large capacity drives
> >> sucks and you need to factor this in when picking your RAID levels
> >> (I'm going with RAID6). Amazon currently has 1.5TB drives for $60:
> >>
> >> http://www.amazon.com/gp/product/B002ZCXJZE?ie=UTF8&tag=tp6708-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=B002ZCXJZE"
> >>
> >> So I'm thinking about picking up a few of these for my project. Anyone
> >> have any thoughts on these drives or any of the other 1.5TB - 2TB
> >> drive manufacturers? Trying to find something that performs reasonably
> >> well and is relatively reliable.
> >>
> >> Thanks for any feedback,
> >> - Ryan
> >> --
> >> http://prefetch.net

-- 

(PS - If you email me and don't get a quick response, you might want to
call on the phone.  I get about 300 emails per day from alternate energy
mailing lists and such.  I don't always see new messages very quickly.)

Ron Frazier

770-205-9422 (O)   Leave a message.
linuxdude AT c3energy.com






