[ale] Bad SATA interactions

Sun Nov 4 17:56:38 EST 2012

On Sun, Nov 4, 2012 at 9:38 AM, Michael Trausch <mike at trausch.us> wrote:
> So I had an interesting few days... Aside from the fact that I have been
> sick, it turns out I have had an interesting problem appear.
>
> I changed motherboards recently, to test UEFI and so forth out. When I did
> so I started having some problems that traditionally scream "memory errors",
> except my RAM was just fine.
>
> I hadn't immediately thought to check the drive's SMART log because I am
> used to distributions signaling via the UI when such events happen. Well, it
> turns out that Fedora doesn't do SMART monitoring by default!
>
> I had an apparently bad SATA cable (am running tests now to see if the new
> cable is actually the solution here). The symptom was UDMA CRC error counts
> through the roof: the drive detected the errors and then aborted the
> corresponding commands.
>
> I mention this as we recently had a thread on silent corruption.
>
> So, to the question part: even with smartctl and friends not installed and
> running, shouldn't modern file systems be storing checksums to catch this
> sort of thing without obscure errors? I thought that ext4 had such support,
> but I would appear to be incorrect there.

I'm not sure how a filesystem-level checksum would help with corruption on
the wire, other than by catching bad blocks when they are read back.  During
the write, you're pretty much trusting that what reaches the disk is what
you sent, unless you want to read the data back and verify the checksum
immediately, in which case you're talking about a seek immediately after
each block write.  Good luck with "performance" on that.  As you pointed
out, the UDMA CRC was catching this problem.  Do you think any data was
actually corrupted due to this bad cable?
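
To make the read-back-and-verify idea concrete, here is a rough sketch in
Python.  It is only an illustration, not how ext4 or any other filesystem
actually does it: the function name, the 4096-byte block size, and the
choice of CRC32 are all assumptions of mine.  Note also that without
O_DIRECT the read-back is usually served from the page cache, so it would
not really exercise the cable at all; a real check would have to bypass
the cache, which is where the extra seek per write comes from.

```python
import os
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def write_block_verified(fd, offset, data):
    """Write data at offset, read it back, and compare CRC32 checksums.

    Hypothetical helper: returns True if the read-back matches what was
    written, False otherwise.
    """
    expected = zlib.crc32(data)

    os.pwrite(fd, data, offset)
    os.fsync(fd)  # ask the kernel to flush the data to the device

    # Caveat: without O_DIRECT this read normally comes from the page
    # cache, not from the disk, so it does not test the SATA link.
    readback = os.pread(fd, len(data), offset)

    return zlib.crc32(readback) == expected
```

Every write now costs a flush plus a read of the same block, which is the
performance problem in a nutshell.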

-- 
David Tomaschik
OpenPGP: 0x5DEA789B
http://systemoverlord.com
david at systemoverlord.com