[ale] disk drive diagnostics nirvana - NOT - I have questions
Ron Frazier (ALE)
atllinuxenthinfo at techstarship.com
Mon Oct 22 22:05:29 EDT 2012
Hi Phil,
Thanks for the note. Comments inline.
On 10/22/2012 9:26 PM, Phil Turmel wrote:
> Hi Ron,
>
> On 10/22/2012 06:12 PM, Ron Frazier (ALE) wrote:
>
>> Hi all,
>>
>> I've spent the last couple of days doing disk diagnostics on all my hard
>> drives, which I do periodically, and learning more than I really wanted
>> to know about sector errors. I'll try to share more details later, but
>> for now, I'm just going to post the minimum. As you may know, an HDD
>> that works perfectly from the factory may develop problems over time and
>> show either bad (reallocated) sectors or bad blocks. Since the HDD
>> controller can usually only discover read / write problems when you
>> actually access the sector, I've developed a practice over the years to
>> read and write every sector on the hard drive a few times per year. I
>> usually use Spinrite, which can operate on Windows or Linux drives. It
>> boots as a free-standing executable. In the mode I use, it reads,
>> inverts and rewrites every sector on the disk, then does it again. This
>> forces the drive's controller to find and remap any weak sectors to
>> somewhere else while they can still be read properly. If the sector
>> doesn't read, Spinrite uses advanced statistical algorithms to try up to
>> 2000 times to recover the data. You can also do something similar with
>> badblocks -nsv in Linux, except for the bad sector recovery, although I
>> don't know exactly what algorithm it uses. On a large drive, these
>> tests take days to complete. Once they're done, I know that the drive
>> can absolutely read or write any sector reliably, or if it couldn't,
>> those questionable sectors should have been reallocated to other areas
>> by the controller. The first thing I do when I get a new drive is write
>> it with random data then Spinrite it about 6 times to thoroughly burn it
>> in. I then follow up with one such procedure every 4 - 6 months.
>>
> I only run Windows in VMs nowadays, and my critical servers (home media
> server and my small business server) are 24/7 linux, so spinrite isn't
> for me.
>
>
If you were so inclined, SpinRite can be used on any HDD that the
computer's BIOS can see and properly control. It boots as a standalone
executable with the OS not running. I use it on my Linux partitions the
same as I do on my Windows partitions. SpinRite works at the sector
level; it doesn't care what's on the sectors. One of my computers has a
BIOS so old (2002) that it cannot see all of the 320 GB drive I have in
it. I use badblocks on that instead. Once Linux is booted, it doesn't
care about the limits of the BIOS. By the way, in recent podcasts,
Steve Gibson, inventor of SpinRite, has indicated that using it on
SSDs, in read-only mode, can help the SSD be more reliable and
sometimes recover finicky data. Apparently, the scrubbing has benefits
there too.
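For what it's worth, the badblocks pass on that old machine goes roughly
like this (device names are just examples; -n refuses to run against a
mounted file system, so everything on the drive gets unmounted first,
and smartctl is the command-line cousin of GSmartControl):

    # non-destructive read-write test: reads each block, writes test
    # patterns, then restores the original data
    umount /dev/sdb1
    badblocks -nsv /dev/sdb

    # afterward, see whether the drive reallocated anything or has
    # sectors pending reallocation
    smartctl -A /dev/sdb | grep -i -e reallocated -e pending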
> My critical servers all use linux software raid in various combinations,
> and all of the raid arrays are scrubbed weekly. By scrubbed, I mean a
> cron job instructs the kernel to read every sector on every member
> device in the background, compute parity as appropriate, and report any
> inconsistencies. Any read errors trigger the corresponding recovery and
> rewrite functions that would normally occur if an application
> encountered the sectors. Any unsuccessful write kicks that device out
> of the array as usual.
>
> I have been doing this for about ten years now, with about seven or
> eight drive failures in that time. Never lost any data, though I've
> been nervous a few times when waiting for a replacement disk for a raid5
> array. Everything is now raid6 or triple-mirrored, so I sleep well.
>
>
I like the scrubbing idea. It's essentially what I'm doing manually.
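If I understand the mechanism, that weekly cron job is presumably doing
something along these lines for each array (md0 is just a placeholder);
I'm jotting it down mostly for my own reference:

    # start a background scrub of one md array (run as root)
    echo check > /sys/block/md0/md/sync_action

    # when it finishes, see how many inconsistencies were found
    cat /sys/block/md0/md/mismatch_cnt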
>> What usually happens is:
>>
>> * Run file system check. No problems, or minor problems fixed.
>> * Run Spinrite or badblocks. No read write errors.
>> * Follow up by checking SMART data using Disk Utility or GSmartControl.
>> (PS, Disk Utility will not show SMART data on a USB drive due to a bug,
>> but GSmartControl can.) No bad sectors and no pending reallocations.
>>
>> I have two 1 TB drives that I use for backup. I backup to one then
>> mirror it to the other. I recently had occasion to completely read one
>> and write the other in a mirroring process. As far as I know, there are
>> no read or write errors. When I ran the SMART check, I found that one
>> of these has 12 bad or reallocated sectors and the other has 120. This
>> prompted me to start the Spinrite process on one, which I haven't
>> finished, to read, invert, write, read, invert, write the data. I could
>> have used badblocks as well. I've finished 72% of one drive, and, thus
>> far, have had no read or write failures or bad blocks reported.
>>
>> So, the $600,000 question is this. Assuming every active sector on the
>> drives can be successfully read and written, should I be concerned about
>> 12 or 120 bad reallocated sectors? I find a wide variety of opinion on
>> the net ranging from not a problem all the way to replace the drives
>> immediately. Note that these are my backup drives for this PC, so I
>> REALLY don't want them to fail. The drives may be more than 5 years
>> old. I'd have to dig through receipts. However, they're showing a
>> powered on time of 2.1 years.
>>
>> Let me know what you think.
>>
> All of the drives that failed on me had fewer than 100 relocated
> sectors. None of them had fewer than 20 relocated sectors. Mostly
> 30,000+ hours of operation. This seems to correspond well to the
> reports I read on the linux-raid mailing list. I tolerate drives with
> single-digit relocation counts, but I recheck them every week. After
> that, they're outa there.
>
> Some of the research on the topic suggests that climbing relocation
> counts are most often caused by approaching spindle bearing failure,
> where the wobble causes head tracking errors. Whatever the underlying
> reason, that's my red line.
>
> HTH,
>
> Phil
>
This, along with other reading I've been doing, is making me nervous. I
may have to bite the bullet and replace these. I found at least one
receipt, and they're 3 years old with 2.1 years of runtime, which
amounts to about 18,000 hours. I still feel that running all the time
is potentially less damaging than cycling power once or twice per day.
I wonder if I can get Seagate to RMA them? The catch is that a
read/write test passes with flying colors.
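If I do pursue an RMA, I suppose the SMART report itself is the
evidence to send them, since attribute 5 (Reallocated_Sector_Ct) is
where the 12 and 120 show up. Something like this would capture it
(device name is just an example):

    # save the full SMART report for the drive, including attribute 5
    smartctl -a /dev/sdb > smart_report_sdb.txt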
I've always wondered why the surface of a drive that hasn't crashed
would deteriorate, and even why the servo mechanism would go wonky.
But what you're saying about the spindle bearing makes some sense. I
can see how that could cause errors. These drives have been very
lightly used, except for running 24/7 when the weather is good. As I
said in my original post, as far as I can tell, I can read and write to
all the remaining sectors. However, with a bearing spinning at 7200
RPM, even a slight intermittent problem could quickly degenerate into a
catastrophic failure.
This idea of mechanical failure reminds me of a contact I had years ago
with someone in the field of industrial equipment reliability. They had
this really cool test system where they could measure the ultrasonic
signature of a motor in a factory and predict failure months in advance,
allowing preemptive replacement. I wonder if you could do such a thing
with hard drives reliably.
By the way, how could I do background read-only scrubbing in Windows and
Linux, such that each sector is read at least every 1-3 months while the
OS is in use? None of my drives are in RAID arrays.
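Maybe a periodic SMART extended self-test would count? As I understand
it, that runs inside the drive itself and reads the whole surface in
the background while the machine stays in use, and smartmontools runs
on both Linux and Windows (device name is just an example):

    # ask the drive to run an extended (long) self-test in the background
    smartctl -t long /dev/sda

    # check the outcome later; a 1 TB drive takes a few hours
    smartctl -l selftest /dev/sda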
Thanks for the information you shared.
Sincerely,
Ron
--
(To whom it may concern. My email address has changed. Replying to former
messages prior to 03/31/12 with my personal address will go to the wrong
address. Please send all personal correspondence to the new address.)
(PS - If you email me and don't get a quick response, you might want to
call on the phone. I get about 300 emails per day from alternate energy
mailing lists and such. I don't always see new email messages very quickly.)
Ron Frazier
770-205-9422 (O) Leave a message.
linuxdude AT techstarship.com