[ale] Testing a SATA HDD for physical/electrical hardware faults

Wed Aug 24 20:56:01 EDT 2022

Hey ALE, I have a hard drive that I'm planning to discard for the reasons described below. Before I do, I'd like to know whether there's a way to run an in-depth test of its physical hardware while it's connected to an old desktop tower, e.g. any Linux packages capable of hardware testing beyond what smartctl can do. I presume the kind of testing I have in mind isn't feasible through a standard motherboard's SATA port and would need specialized hardware, but I wanted to check.

Anyway, story time for those who are interested: this 16TB SATA drive arrived in a set of 20 along with a server that contains an AVAGO 3108 MegaRAID card and more than enough bays to hold all of the drives, with spares for later expansion. After the system had worked fine for over a year, it began reporting that its RAID array was degraded due to a PHY failure (i.e. the slot on the drive backplane stopped working), which caused the drive to disappear from the array. Moving the drive to a spare slot in the server brought it back online; the RAID card detected the drive and rebuilt the array. smartctl was used to run a S.M.A.R.T. test on the drive just in case, and it reported no problems. The original slot was noted as defective and no further troubleshooting was performed: the system was back in full operation, a single slot failure wasn't too concerning with plenty of spare slots available, and there was no IT staff time to dedicate to investigating further.

A few months later, the system reported a degraded RAID array again. Looking into it, I found the exact same type of error (megaraid_sas 0000:3d:00.0: 19793 (678503133s/0x0004/CRIT) - Enclosure PD 00(c Port 0 - 3/p1) phy bad for slot 20), and again it involved the same drive I'd moved out of the previous slot months earlier.

None of the other drives have had any issues since this server was first put into operation, but this one drive has now had both backplane slots it was plugged into become completely unresponsive. As far as I can tell, the system doesn't even detect the slots no matter what's plugged into them. I haven't been able to power-cycle the server to confirm whether that would bring them back online, because of maintenance window timing: a power-cycle could require extended downtime if there are issues.
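For anyone who wants to tally these events from the kernel log rather than read them by eye, here is a minimal Python sketch. It assumes the message format matches the single megaraid_sas line quoted above; the pattern and the function name `bad_phy_slots` are my own illustration, not anything provided by the driver, and other firmware or driver versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted above (an assumption; other
# driver or firmware versions may format this message differently).
PHY_BAD = re.compile(
    r"megaraid_sas\s+(?P<pci>[0-9a-fA-F:.]+):.*phy bad for slot (?P<slot>\d+)"
)


def bad_phy_slots(lines):
    """Count 'phy bad' events per (PCI address, slot number) pair."""
    counts = Counter()
    for line in lines:
        match = PHY_BAD.search(line)
        if match:
            counts[(match.group("pci"), int(match.group("slot")))] += 1
    return counts
```

Feeding it the lines of saved `dmesg` or `journalctl -k` output should give a per-slot count, which makes it easier to see whether the failures follow a slot or follow whichever drive was moved into it.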

With a single failure, or with two failures involving different drives, I'd chalk it up to the backplane being bad. Since both of these failures occurred with the same drive, though, I have to consider that the drive itself may be causing the problem rather than the backplane. I mainly want to test this out of curiosity and an interest in learning what could cause backplane slots to fail if the connected drive is at fault. The results won't change anything operationally: this drive isn't going back into service, and I've installed a new drive in the system to restore the array.

Thanks for your time and your input,

-Robert