[ale] Many thanks to Phil Turmel

Mon Aug 6 14:26:06 EDT 2018

Hi Derek,

On 08/06/2018 01:46 PM, Derek Atkins wrote:
> Phil Turmel via Ale <ale at ale.org> writes:
> 
>> You're welcome, Malcolm.
>>
>> Very interesting and unusual bit of corruption, on all but the first
>> superblock, and precisely on the single 512-byte sectors of those other
>> superblocks.  Never seen anything like it.
> 
> So how did you debug it?   And how did you fix it?

I used xfs_db, based on a clue from an old mailing list entry with a
similar error message.

Within xfs_db, "sb 0" would move the cursor to the first superblock,
which I could then "print", report the block # with "fsb", and report
the sector number with "daddr".  Repeat with "sb 1", "sb 2", and "sb 3".
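
As a cross-check on the "daddr" values, the expected sector can also be
computed from the agblocks and blocksize fields that "print" shows, since
each allocation group begins with its own superblock copy.  A minimal
sketch; the function name and the sample geometry are made up for
illustration and are not the numbers from the damaged filesystem:

```shell
# Hypothetical helper: 512-byte sector at which allocation group $1 starts,
# given agblocks ($2) and blocksize in bytes ($3) as printed for "sb 0".
ag_sb_sector() {
    echo $(( $1 * $2 * $3 / 512 ))
}

# Made-up geometry: 4 AGs of 1000000 blocks each, 4096-byte blocks.
for agno in 0 1 2 3; do
    echo "sb $agno: sector $(ag_sb_sector "$agno" 1000000 4096)"
done
```

If the result disagrees with what "daddr" reports, trust xfs_db and
recheck the inputs.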

With the sector numbers, I could get hex for the superblock and
surrounding sectors with:

dd if=/dev/whatever bs=512 skip=sector count=16 | hexdump -C

That showed me scrambled data in exactly one sector at each of the other
superblocks, with proper data structures following.
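
A quicker test than reading the hex by eye is to look for the XFS
superblock magic, the four bytes "XFSB", at the start of the sector.
This is only a sketch of that check, not what I ran at the time; the
function name is hypothetical, and it reads the device without writing:

```shell
# Hypothetical helper: succeeds if the 512-byte sector $2 of device or
# image $1 begins with the XFS superblock magic "XFSB" (hex 58 46 53 42).
has_xfs_magic() {
    magic=$(dd if="$1" bs=512 skip="$2" count=1 2>/dev/null \
            | od -An -tx1 -N4 | tr -d ' \n')
    [ "$magic" = "58465342" ]
}

# Usage, with "sector" standing for a number reported by daddr:
#   has_xfs_magic /dev/whatever sector && echo "magic present"
```

A sector that passes can still be damaged further in, so this only
narrows down where to point hexdump.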

I then used dd to extract the good superblock:

dd if=/dev/whatever bs=512 count=1 of=tempsb.dat

And write it to the other locations:

dd if=tempsb.dat bs=512 count=1 seek=sector of=/dev/whatever
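
For anyone wanting to rehearse those two dd steps before touching a real
device, here is a sketch that runs them against a scratch image file.
The file names, image size, and sector number are all hypothetical.  It
adds conv=notrunc, which matters for a regular file (dd would otherwise
truncate it at the seek point), and checks the result by comparing the
rewritten sector with the extracted copy:

```shell
img=$(mktemp)     # scratch stand-in for /dev/whatever
sb=$(mktemp)      # plays the role of tempsb.dat
sector=2048       # made-up location of a damaged secondary superblock

# Build a 4096-sector image with recognizable data in sector 0
# and junk at the "damaged" sector.
dd if=/dev/zero of="$img" bs=512 count=4096 2>/dev/null
printf 'XFSB-good-superblock' | dd of="$img" bs=512 conv=notrunc 2>/dev/null
printf 'garbage' | dd of="$img" bs=512 seek="$sector" conv=notrunc 2>/dev/null

# Extract the good sector, then write it over the damaged one.
dd if="$img" of="$sb" bs=512 count=1 2>/dev/null
dd if="$sb" of="$img" bs=512 count=1 seek="$sector" conv=notrunc 2>/dev/null

# Verify: the rewritten sector must match the extracted copy,
# and the image must still be its original size.
if dd if="$img" bs=512 skip="$sector" count=1 2>/dev/null | cmp -s - "$sb"; then
    copy_ok=yes
else
    copy_ok=no
fi
img_size=$(( $(wc -c < "$img") ))
echo "copy_ok=$copy_ok size=$img_size"
rm -f "$img" "$sb"
```

On the real device, taking a full image first is the safer order of
operations, since a wrong seek value overwrites live data.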

xfs_repair then worked, though it made a handful of corrections because
the filesystem could not be mounted to replay the log.

> If it's that regular a pattern it could be anything from a rotary issue
> in the HDD to a failed memory stick.

The original failing device was an M.2 mini-PCIe SSD.  It was already
failing, and later gave up the ghost completely.

I have no idea what failure mode made it possible to write just the one
scrambled 512-byte sector to the beginning of each allocation group,
except the first.  Smells like an offset calculation bug to me.

Phil