[ale] Testing a SATA HDD for physical/electrical hardware faults

Thu Aug 25 08:04:57 EDT 2022

Holy crap!

The only thing I can think of is that either the current draw of the drive is
too high or the voltage tolerance of both is too tight. Either way, it's
unlikely a failed drive can destroy a backplane slot. It's highly likely it
overloaded a tolerance that the controller read as "no slot device". Unless
that drive was really wonky and died with a strong oscillating current draw
that could damage a trim capacitor or an inductor on the slot power, it will
likely recover on a power cycle.
Yeah, maintenance window challenge.

Might be able to run the drive with some current measurements to see if it
is running out of spec. Will need to cut the power lines and splice in
measurement gear.
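
Once readings are captured, checking them could look roughly like the
Python sketch below. Everything numeric in it is an assumption: the +/-5%
voltage window and the current ceilings are placeholders, not values from
this drive's datasheet, and the names (RailSpec, out_of_spec,
current_swing) are made up for illustration. Samples are (volts, amps)
pairs from whatever meter or scope gets spliced in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RailSpec:
    """Assumed limits for one SATA power rail (placeholders, not datasheet values)."""
    nominal_v: float
    tolerance: float      # fractional window, e.g. 0.05 means +/-5%
    max_current_a: float  # hypothetical ceiling; replace with the datasheet figure


# Assumption: nominal 5 V and 12 V rails with a +/-5% window.
RAILS = {
    "5V": RailSpec(nominal_v=5.0, tolerance=0.05, max_current_a=1.0),
    "12V": RailSpec(nominal_v=12.0, tolerance=0.05, max_current_a=2.0),
}


def out_of_spec(rail, samples):
    """Return the (volts, amps) samples that violate the assumed limits."""
    spec = RAILS[rail]
    low = spec.nominal_v * (1 - spec.tolerance)
    high = spec.nominal_v * (1 + spec.tolerance)
    return [
        (volts, amps)
        for volts, amps in samples
        if not (low <= volts <= high) or amps > spec.max_current_a
    ]


def current_swing(samples):
    """Peak-to-peak current, a rough indicator of an oscillating draw."""
    amps = [a for _, a in samples]
    return max(amps) - min(amps) if amps else 0.0
```

A large current_swing on an otherwise idle drive would be the thing to
look at first, since that is the failure mode suspected above.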

It might be possible to make the controller re-read the backplane.
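
One generic thing to try, sketched in Python: Linux exposes a "scan" file
per SCSI host under /sys/class/scsi_host, and writing "- - -" to it asks
the kernel to rescan all channels, targets and LUNs on that host. This is
only an assumption about what might help here. It is not known whether the
MegaRAID firmware will re-enumerate a slot it has already flagged as bad,
and the controller's own management tool may be needed instead. The
function name is made up; it needs root to do anything on a real system.

```python
from pathlib import Path


def rescan_scsi_hosts(sysfs_root="/sys/class/scsi_host"):
    """Write the wildcard "- - -" to each host's scan file.

    Returns the names of the hosts that were asked to rescan.
    """
    touched = []
    for host in sorted(Path(sysfs_root).glob("host*")):
        scan = host / "scan"
        if scan.exists():
            # "- - -" means: all channels, all targets, all LUNs.
            scan.write_text("- - -\n")
            touched.append(host.name)
    return touched


if __name__ == "__main__":
    print(rescan_scsi_hosts())
```

Check dmesg afterwards to see whether the controller reported anything
new for the dead slots.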

On Wed, Aug 24, 2022, 8:56 PM Robert Tweedy via Ale <ale at ale.org> wrote:

> Hey ALE, I have a hard drive that I'm planning to discard due to what I'm
> about to describe below, but before I do that I'm interested in seeing if
> anyone knows if there's some way to do an in-depth test of its physical
> hardware if it's connected to an old desktop tower, like any specialized
> Linux packages specifically capable of doing advanced hardware testing
> beyond what's achievable by smartctl; for the type of testing I'd like to
> do I presume that it's probably not feasible if it's connected to a
> standard motherboard's SATA slot and I'd need some specialized hardware to
> test it, but I just wanted to check to confirm this.
>
> Anyway, story time for those who'd be interested: this 16TB SATA drive
> arrived in a set of 20 along with a server that contains an AVAGO 3108
> MegaRAID card & more than enough bays to hold all of the drives with spares
> for later expansion. After working fine for over a year, the system began
> notifying that its RAID array was degraded due to a PHY failure (i.e. the
> slot on the drive backplane stopped working) causing the drive to disappear
> from the array. Moving the drive to another spare slot in the server
> brought it back online & the RAID card happily detected the drive & rebuilt
> the array; smartctl was used to run a S.M.A.R.T. test on the drive just in
> case and it reported no problems, so the slot it came from was noted as
> defective and no further troubleshooting was performed since the system was
> now back in full operation & a single slot failure wasn't too concerning
> since there were plenty of spare slots available & a lack of available time
> to dedicate IT staff resources to investigating further. A few months
> later, the system began notifying of a
> degraded RAID array again and looking into it I found the exact same type
> of error being reported (megaraid_sas 0000:3d:00.0: 19793
> (678503133s/0x0004/CRIT) - Enclosure PD 00(c Port 0 - 3/p1) phy bad for
> slot 20) and again it was the same drive I'd moved out of the previous slot
> months earlier. All the other drives have had no issues since this server
> was first put into operation, but this one drive has now had both backplane
> slots it was plugged into become completely unresponsive (as far as I can
> tell the system doesn't even detect them no matter what's plugged into
> them; I've not been able to power-cycle the server to confirm if that would
> bring them back online or not due to maintenance window timing for the
> extended downtime a power-cycle could possibly require if there are issues).
>
> In a single failure or a double-failure with different drives I'd chalk it
> up to the backplane being bad, but since both of these failures have
> occurred with the same drive I have to consider that the drive itself is
> potentially causing the problem rather than the backplane being faulty. I
> mainly want to test this out of curiosity and an interest in learning what
> could cause the backplane slots to fail if it is a fault of the drive that
> was connected to them, as the results aren't going to change things
> operations-wise (this drive's not being put back in service again & I've
> installed a new drive in the system to restore the array).
>
> Thanks for your time and your input,
>
> -Robert
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> https://mail.ale.org/mailman/listinfo/ale
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
>