[ale] Bad SATA interactions

Sun Nov 4 21:35:38 EST 2012

Doesn't it? Especially when nearly 200 of them occur. I would at least
expect a small percentage to get back.

This board only does AHCI mode, so I am at a loss... Sad thing is that I
have an early model BD-RE burner that doesn't work in AHCI mode (locks up
hard until power is removed and the device is talked to in "old style"
mode) so I had to remove it.

If I knew enough to debug, I would. But I am completely unfamiliar with the
workings of the bus, and don't have time to figure out the why, sadly, if
the behavior is out of spec...

Isn't the world of technology grand? :-)

On Nov 4, 2012 8:15 PM, "David Tomaschik" <david at systemoverlord.com> wrote:

> On Sun, Nov 4, 2012 at 4:23 PM, mike at trausch.us <mike at trausch.us> wrote:
> > On 11/04/2012 05:56 PM, David Tomaschik wrote:
> >>
> >> Not sure how a filesystem-level checksum would help with corruption on
> >> the wire, other than to prevent reading back bad blocks.  During the
> >> write, you're pretty much trusting what's there, unless you want to
> >> read back the data and verify the checksum immediately, in which case
> >> you're talking about a seek immediately after each block write.  Good
> >> luck with "performance" on that.  As you pointed out, the UDMA CRC was
> >> catching this problem.  Do you think any data was corrupted due to
> >> this bad cable?
> >
> >
> > Know so.
> >
> > Yesterday, I ran a series of tests.  Even decompressed the data from its
> > origin drive, which worked.  See, at first I thought maybe it was a
> > software problem.  So I compiled gzip statically on an Ubuntu system
> > that could decompress the original data set, scp'd that to my box here,
> > and it still had the error.  Okay, software problems/bugs are now
> > eliminated.
> >
> > Next thing I figured was, since gzip doesn't use LOTS of memory, it might
> > have just had the misfortune of landing on bad RAM every time it loaded,
> > so I ran a memtest.  Nope, nada.
> >
> > During the copy TO my internal drive, the internal drive found and
> > flagged errors internally, but never returned an error status to the
> > operating system.  WTF is the point of that behavior?  It just chugged
> > right along.  So at this point, I'm thinking that I have a bad drive.
> > (At this point, I hadn't checked SMART yet, either, because I was
> > operating under the erroneous assumption that all modern distros do so
> > for you.)  But since I had no other conclusions, I thought I would check
> > it manually.  Went to run smartctl and... got a command not found error
> > message.
> >
> > Well, that explained a fair bit!
> >
> > So I installed that stuff and ran it, and its error log was full (5
> > entries is all it holds) and so I ran a full self test and went to bed.
> > Self test and surface scan were perfectly fine.  So, I concluded then
> > that it must be the cable.
> >
> > Swapped the cable, and the UDMA error count stopped increasing, two
> > short of what the drive firmware considers "dying".  Heh.
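
For anyone who wants to watch that counter without eyeballing the whole
report, here is a minimal Python sketch. It assumes the usual column layout
of `smartctl -A` output for ATA drives and the attribute name
UDMA_CRC_Error_Count (ID 199 on many drives, though vendors vary). The
sample text and its numbers are invented for illustration and do not come
from the drive discussed in this thread.

```python
# Hypothetical sample in the layout `smartctl -A` prints for ATA drives.
# The values are made up; they are not taken from the thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       198
"""


def raw_value(report, attribute):
    """Return the raw value of a SMART attribute as an int, or None if absent.

    Assumes the raw value starts with a plain integer, which holds for
    simple counters but not for every vendor-specific attribute.
    """
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attribute:
            return int(fields[9])
    return None


def still_climbing(before, after, attribute="UDMA_CRC_Error_Count"):
    """True if the counter grew between two reports (e.g. around a cable swap)."""
    old, new = raw_value(before, attribute), raw_value(after, attribute)
    return old is not None and new is not None and new > old
```

Feeding it two captures taken some I/O apart shows whether link-level CRC
errors are still accumulating; a counter that stays flat after a cable swap
points at the cable rather than the drive.
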
> >
> > At that point I tried decompressing the data, and still had the same
> > problem.
> >
> > Solution?
> >
> > # touch *
> > # rsync --inplace --no-whole-file -av /path/to/orig /path/to/corrupt
> >
> > ... which corrected all the errors and then I was finally able to
> > decompress.
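
A rough picture of why that pair of commands works: rsync normally skips
files whose size and modification time match, so the touch presumably
forces it to look at every file; --no-whole-file turns on the delta
algorithm even for a local copy; and --inplace writes the changed regions
straight into the existing destination file. The Python sketch below is a
simplified stand-in, not rsync's actual algorithm (which uses rolling
checksums); it compares two local files block by block and rewrites only
the blocks that differ. The function name and the 4096-byte block size are
assumptions made for this example.

```python
BLOCK = 4096  # assumed block size for this sketch; rsync chooses its own


def repair_in_place(good_path, bad_path, block=BLOCK):
    """Overwrite only the blocks of bad_path that differ from good_path.

    Returns the list of block indexes that were rewritten.
    """
    fixed = []
    with open(good_path, "rb") as good, open(bad_path, "r+b") as bad:
        index = 0
        while True:
            want = good.read(block)
            if not want:
                break
            bad.seek(index * block)
            have = bad.read(len(want))
            if have != want:
                # Rewrite just this block, leaving the rest of the file alone.
                bad.seek(index * block)
                bad.write(want)
                fixed.append(index)
            index += 1
        # Drop any trailing bytes so both files end at the same offset.
        bad.truncate(good.tell())
    return fixed
```

The returned list doubles as a damage report: it tells you which blocks of
the copy had been corrupted in transit.
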
> >
> > I would have decompressed it to my drive to work around the problem,
> > except that would have just created new ones.  ;-)
> >
> > Really, there are two things that would have made this better: (a) the
> > drive should have reported error status back to the operating system
> > during the write in which it detected the error, because then I would
> > have known IMMEDIATELY that something was wrong.  (b) When reading it
> > back, a checksum would have said "hey your data is corrupt" instead of
> > the drive saying "all good" and gzip going "format violated".
> >
> > I know that checksums wouldn't help at write time, but they would sure
> > clarify the errors at read time.  I'm still confused, though, as to why
> > the drive didn't yell loudly.  Why didn't I get an I/O error abort if
> > the drive bloody well knew that it got corrupted data?
> >
> >         --- Mike
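
Point (b) can be approximated in user space even without a checksumming
filesystem. The sketch below is only an illustration under stated
assumptions: the function names are invented here, SHA-256 is an arbitrary
choice of digest, and the digests must be recorded on the known-good side
(before the data crosses the suspect cable), since hashing an already
corrupted copy would simply bless the corruption.

```python
import hashlib


def sha256_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB pieces so large archives need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(chunk), b""):
            digest.update(piece)
    return digest.hexdigest()


def record(paths):
    """Build a {path: digest} manifest from known-good files."""
    return {path: sha256_of(path) for path in paths}


def verify(manifest):
    """Return the paths whose current contents no longer match the manifest."""
    return [path for path, digest in manifest.items()
            if sha256_of(path) != digest]
```

A non-empty result from verify() is the "hey, your data is corrupt" message,
delivered before gzip ever sees the file.
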
>
>
> Erm, yeah, silently dropping corrupt commands is kinda crappy.  Of
> course, then you start to run into the two generals problem:[1]  how
> can the drive be sure error messages are getting back to the
> controller?
>
> Actually, doesn't SATA require some sort of ACK from the drive?
> There's an error register specifically in AHCI mode[2] that should
> report back CRC failures.
>
> I'm wondering if it's a case of crappy drive firmware, but it seems
> odd that it would update SMART registers and not report back to the
> OS...
>
>
>
> [1] https://en.wikipedia.org/wiki/Two_Generals'_Problem
> [2] http://wiki.osdev.org/AHCI
> --
> David Tomaschik
> OpenPGP: 0x5DEA789B
> http://systemoverlord.com
> david at systemoverlord.com