[ale] HDD failure modes, why your drive might NOT read or write your data
Ron Frazier (ALE)
atllinuxenthinfo at techstarship.com
Fri Dec 28 14:22:28 EST 2012
Hi guys,
I have a recurring interest in the reliability of the HDDs I own, as well
as some I maintain for family members. I recently had to replace two 1 TB
drives that started throwing reallocated sector errors at about the same
time, after 3 years of operation. It was only a coincidence that I was
doing my twice-yearly extensive hard drive maintenance at the same time
they started throwing a tantrum. I do believe in regular backups, but it
is very hard to keep all my hard drives backed up at all times. I do have
online backups every 6 hours for new data on most machines, but not for
data that takes huge amounts of space. I'm interested to know what you
guys do on a personal level to monitor your hard drives' health and,
perhaps, preemptively replace them when they are starting to fail. I have
implemented a utility on the PC where those two 1 TB drives started to
fail which monitors reallocated sector counts and a couple of other
things. That utility must run in administrative mode (in Windows),
however, so I can't run it on Dad's machine since he always runs with a
standard user login.
I recently read an analogy of how precision tolerances affect a modern
HDD. Imagine the platter is 3 miles wide. Each track would be 0.4" wide.
The read/write head would be a go-kart flying above the platter at the
width of a human hair, and the platter would be spinning at 3.6 million
MPH. Obviously, this is just an analogy, but it's amazing they work at
all.
I've discovered some interesting data about how parts of the drive that
were writable can become unwritable later and how data that was written
correctly can become unreadable later. These are called latent defects,
or sometimes grown defects. They are not discovered until a read or
write error occurs, at which time the data may be unrecoverable.
Here's a link to a cool article.
http://entertainmentstorage.org/articles/Hard%20Disk%20Drives_%20The%20Good,%20The%20Bad%20and%20The%20Ugly.pdf
It's a bit dated, but has good info.
Note that the head fly height is 0.3 µin or less. If my conversions are
right, this equates to 0.0076 µm, 7.6 nm, or 76 angstroms. No matter
how you say it, it's a VERY small space.
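If you want to double-check that conversion yourself, it's just arithmetic
(1 inch = 25.4 mm exactly):

```python
# Convert a 0.3 microinch fly height into metric units.
MM_PER_INCH = 25.4

fly_height_uin = 0.3                                       # microinches
fly_height_m = fly_height_uin * 1e-6 * MM_PER_INCH * 1e-3  # inches -> mm -> meters

fly_height_nm = fly_height_m * 1e9          # nanometers
fly_height_angstrom = fly_height_m * 1e10   # angstroms

print(f"{fly_height_nm:.2f} nm")                # 7.62 nm
print(f"{fly_height_angstrom:.1f} angstroms")   # 76.2 angstroms
```

So 7.6 nm was a slight round-down; the exact figure is 7.62 nm.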
Note also that the drive is made of many dissimilar metals, which are
machined, and which have dissimilar hardnesses and thermal expansion
coefficients.
So, if there are any particles or aberrations larger than about 7.6 nm,
they are liable to get trapped under the read/write head. The article
points out that removing all particles this small is very difficult.
So, how could a defect appear or grow?
The article lists several ways.
1) Any vibration such as bumping the unit, walking across the floor, or
even sound can alter the head position just enough to cause transient
errors during reading or writing. As we've discussed during a previous
thread, the OS or drive controller doesn't generally do a read after
write verify. So, your software may be happily humming along writing
data and not even know that it didn't get written properly.
For some really interesting reading, google "don't shout at your hard
drive". That turns up some interesting new research on how even sound
can screw up hard drives.
2) The head's fly height can be raised by the accumulation of lubricants.
3) from the article: "Media imperfections such as voids (pits),
scratches, hydrocarbon contamination (various oils), and smeared soft
particles can not only cause errors during writing, but also corrupt
data after it has been written."
The types of problems listed in 3) can occur after the drive has been
put into service. If a particle that is softer than the platter's media
coating gets trapped by the head, it can get smeared along the surface.
If a particle that is harder than the surface gets trapped, it can
scratch or gouge the surface. Either one potentially ruins data that is
there or prevents future writes in that area.
4) from the article: "Data can become corrupted any time the disks are
spinning, even when data is not being written to or read from the disk.
Common causes for erasure include thermal asperities, corrosion, and
scratches or smears."
"Thermal asperities are instances of high heat for a short duration
caused by head-disk contact. This is usually the result of heads hitting
small "bumps" created by particles that remain embedded in the media
surface even after burnishing and polishing. The heat generated on a
single contact can be high enough to erase data. Even if not on the
first contact, cumulative effects of numerous contacts may be sufficient
to thermally erase data or mechanically destroy the media coatings and
erase data."
5) from the article: "Another problem associated with PMR [perpendicular
magnetic recording] is side-track erasure. Changing the direction of the
magnetic grains also changes the direction of the magnetic fields. PMR
has a return field that is close to the adjacent tracks and can
potentially erase data in those tracks. In general, the track spacing is
wide enough to mitigate this mechanism, but if a particular track is
written repeatedly, the probability of side-track erasure increases."
6) From a prior thread here, someone mentioned impending spindle bearing
failures as a cause of poor or anomalous head/track alignment.
So, latent defects can occur because of vibrations, lubricants, fly
height, pits, scratches, smears, thermal asperities, side track erasure,
corrosion, and spindle bearings. And, that assumes that all the major
parts of the drive are working normally.
So, two things are apparent: A) you cannot always assume that your data
was written properly, and B) you cannot always assume that data which
was written properly can be read properly.
So, my questions are these:
What do you do, on your personal equipment, where you may have less
resources than at work, to monitor for drive errors before they become
catastrophic and catch them?
What do you do at work?
What could be done better?
Is there a way to force the OS, either Windows or Linux, to do verifies
after each write operation?
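I don't know of an OS-wide switch for this, but at the application level
you can approximate a read-after-write verify: flush, fsync, then read the
data back and compare checksums. Here's a minimal Python sketch (my own,
not from the article). The big caveat is that without O_DIRECT or dropping
the page cache first, the read-back may be served from RAM rather than the
platters, so it only proves the kernel has the right data:

```python
import hashlib
import os

def write_and_verify(path, data):
    """Write data to path, force it toward the device, then read it
    back and compare SHA-256 checksums. Returns True on a match."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push it to the drive
    # Caveat: this read may be satisfied from the page cache, not the
    # platters, unless you use O_DIRECT or drop the cache first.
    with open(path, "rb") as f:
        readback = f.read()
    return hashlib.sha256(data).digest() == hashlib.sha256(readback).digest()
```

On Linux you can make the read-back more honest by dropping the page cache
first (echo 3 > /proc/sys/vm/drop_caches, as root) or by opening the file
with os.O_DIRECT, though O_DIRECT has alignment requirements.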
I believe, now more than ever, that doing a full read/write surface
analysis on a drive a couple of times a year is a good idea, and then
rewriting the data back after any NEW latent defects have been
identified by the drive's controller.
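On Linux, badblocks -n does a non-destructive read-write surface test, and
the read half of the scan is simple enough to script yourself. Here's a
rough sketch (my own, run as root against something like /dev/sdb): read
the raw device in large chunks and record the offset of any chunk the
drive can't return. Rewriting a failed region is what gives the firmware
the chance to remap the sector:

```python
import os

def scan_device(path, chunk_size=1024 * 1024):
    """Read every byte of a block device (or file) sequentially and
    return the byte offsets of any chunks that could not be read."""
    bad_offsets = []
    with open(path, "rb", buffering=0) as dev:
        offset = 0
        while True:
            try:
                chunk = dev.read(chunk_size)
            except OSError:                # media error in this region
                bad_offsets.append(offset)
                offset += chunk_size
                dev.seek(offset)           # skip past the bad region
                continue
            if not chunk:                  # end of device
                break
            offset += len(chunk)
    return bad_offsets
```

A clean drive returns an empty list; any offsets it does return are the
regions you'd want to rewrite (from backup) so the controller can
reallocate them.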
The utility I found for Windows monitors reallocated sectors, pending
sectors, and uncorrectable sectors (SMART attributes 05, C5, and C6), as
well as temperature, and sends me an email if they get high. Are those
good indicators of pending failures?
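For what it's worth, on Linux the same attributes can be pulled with
smartmontools (smartctl -A /dev/sda), and checking them is a few lines of
parsing. A rough sketch of my approach; the sample output below is
illustrative, and the alert thresholds are my own guesses, not vendor
limits (the conventional wisdom is that any nonzero value on these three
is worth watching):

```python
# Check a few SMART attributes parsed from `smartctl -A` output.
# IDs: 5 = Reallocated_Sector_Ct, 197 (C5) = Current_Pending_Sector,
# 198 (C6) = Offline_Uncorrectable.  Thresholds are illustrative only.
WATCHED = {5: 0, 197: 0, 198: 0}   # alert if RAW_VALUE exceeds these

def failing_attributes(smartctl_output):
    """Return {attr_id: raw_value} for watched attributes over threshold."""
    alerts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue                       # skip headers and blank lines
        attr_id = int(fields[0])
        if attr_id in WATCHED:
            raw = int(fields[-1])          # RAW_VALUE is the last column
            if raw > WATCHED[attr_id]:
                alerts[attr_id] = raw
    return alerts

sample = """\
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
print(failing_attributes(sample))   # {5: 120}
```

A cron job wrapping something like this around the real smartctl output,
plus a mail call on a nonempty result, would roughly duplicate what the
Windows utility does for me.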
Would there be anything else I should do to detect failures before
they're serious enough to compromise my data?
Sincerely,
Ron
--
(To whom it may concern. My email address has changed. Replying to former
messages prior to 03/31/12 with my personal address will go to the wrong
address. Please send all personal correspondence to the new address.)
(PS - If you email me and don't get a quick response, you might want to
call on the phone. I get about 300 emails per day from alternate energy
mailing lists and such. I don't always see new email messages very quickly.)
Ron Frazier
770-205-9422 (O) Leave a message.
linuxdude AT techstarship.com