[ale] Testing a SATA HDD for physical/electrical hardware faults

Thu Aug 25 08:40:58 EDT 2022

A lead whisker in the right place in the drive (or similar 
not-quite-direct-short) could definitely produce a port-killer.

I'd crush that drive and get on with life.

On 8/25/22 08:04, Jim Kinney via Ale wrote:
> Holy crap!
> 
> The only thing I can think of is that either the current draw of the drive 
> is too high or the voltage tolerance of both is too tight. Either way, 
> it's unlikely a failed drive can destroy a backplane slot. It's highly 
> likely it overloaded a tolerance that the controller read as "no slot 
> device". Unless that drive was really wonky and died with a strong 
> oscillating current draw that could damage a trim capacitor or an 
> inductor on the slot power, it will likely recover on a power cycle.
> Yeah, maintenance window challenge.
> 
> Might be able to run the drive with some current measurements to see if 
> it can be running out of spec. Will need to cut power lines and splice 
> in gear.
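
For the current-measurement idea, the arithmetic is just Ohm's law across whatever shunt gets spliced into each power line. A minimal sketch, assuming a low-value shunt resistor in series with the 5V and 12V lines; the `RailReading` and `check_rail` names, the 5% voltage tolerance, and every number in the usage note are hypothetical placeholders, so substitute the limits printed on the drive label or in its datasheet:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RailReading:
    """One measurement of a drive power rail (all fields are user-supplied)."""
    name: str                # e.g. "12V"
    nominal_volts: float     # rated rail voltage
    measured_volts: float    # voltage measured at the drive connector
    shunt_ohms: float        # resistance of the spliced-in shunt
    shunt_drop_volts: float  # voltage measured across the shunt
    max_amps: float          # limit taken from the drive label or datasheet

    @property
    def amps(self) -> float:
        # Ohm's law: I = V / R across the shunt.
        return self.shunt_drop_volts / self.shunt_ohms


def check_rail(reading: RailReading, tolerance: float = 0.05) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    low = reading.nominal_volts * (1 - tolerance)
    high = reading.nominal_volts * (1 + tolerance)
    if not low <= reading.measured_volts <= high:
        problems.append(
            f"{reading.name}: {reading.measured_volts:.2f} V is outside "
            f"{low:.2f}-{high:.2f} V"
        )
    if reading.amps > reading.max_amps:
        problems.append(
            f"{reading.name}: {reading.amps:.2f} A exceeds {reading.max_amps:.2f} A"
        )
    return problems
```

For example, `check_rail(RailReading("12V", 12.0, 11.9, 0.1, 0.25, 2.0))` flags the current only, since a 0.25 V drop across 0.1 ohm works out to 2.5 A against a made-up 2.0 A limit. A multimeter reading like this only catches steady-state draw; an oscillating draw would need a scope across the shunt.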
> 
> It might be possible to make the controller re-read the backplane.
> 
> On Wed, Aug 24, 2022, 8:56 PM Robert Tweedy via Ale <ale at ale.org> wrote:
> 
>     Hey ALE, I have a hard drive that I'm planning to discard due to
>     what I'm about to describe below, but before I do that I'm
>     interested in seeing if anyone knows if there's some way to do an
>     in-depth test of its physical hardware if it's connected to an old
>     desktop tower, like any specialized Linux packages specifically
>     capable of doing advanced hardware testing beyond what's achievable
>     by smartctl; for the type of testing I'd like to do I presume that
>     it's probably not feasible if it's connected to a standard
>     motherboard's SATA slot and I'd need some specialized hardware to
>     test it, but I just wanted to check to confirm this.
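
Before reaching for specialized hardware, it may be worth pulling the interface-level counters that smartctl already exposes once the drive is on the desktop's SATA port. A small sketch, assuming smartmontools 7.0 or later (for `--json`) and assuming the drive reports attribute 199, which is commonly UDMA_CRC_Error_Count; `read_smart_json` and `link_error_counters` are hypothetical helper names. A clean counter would not rule out a power-side fault, it only covers the data link:

```python
import json
import subprocess

# Attribute IDs that commonly relate to the SATA link rather than the platters.
# Assumption: vendors differ in which attributes they report and how.
LINK_ATTRIBUTE_IDS = {199}  # 199 is commonly UDMA_CRC_Error_Count


def read_smart_json(device: str) -> dict:
    """Run `smartctl --json -A` on the device and return the parsed report."""
    # smartctl's exit status is a bitmask, so a nonzero code is not treated
    # as fatal here.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)


def link_error_counters(report: dict) -> dict:
    """Map attribute name -> raw value for the link-related attributes present."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        row["name"]: row["raw"]["value"]
        for row in table
        if row["id"] in LINK_ATTRIBUTE_IDS
    }
```

Usage would be along the lines of `link_error_counters(read_smart_json("/dev/sdX"))`, run as root, with `/dev/sdX` standing in for wherever the drive shows up.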
> 
>     Anyway, story time for those who'd be interested: this 16TB SATA
>     drive arrived in a set of 20 along with a server that contains an
>     AVAGO 3108 MegaRAID card & more than enough bays to hold all of the
>     drives with spares for later expansion. After working fine for over
>     a year, the system began notifying that its RAID array was degraded
>     due to a PHY failure (i.e. the slot on the drive backplane stopped
>     working) causing the drive to disappear from the array. Moving the
>     drive to another spare slot in the server brought it back online &
>     the RAID card happily detected the drive & rebuilt the array;
>     smartctl was used to run a S.M.A.R.T. test on the drive just in case
>     and it reported no problems, so the slot it came from was noted as
>     defective and no further troubleshooting was performed since the
>     system was now back in full operation & a single slot failure wasn't
>     too concerning since there were plenty of spare slots available & a
>     lack of available time to dedicate IT staff resources to
>     investigating further. A few months later, the system began notifying of
>     a degraded RAID array again and looking into it I found the exact
>     same type of error being reported (megaraid_sas 0000:3d:00.0: 19793
>     (678503133s/0x0004/CRIT) - Enclosure PD 00(c Port 0 - 3/p1) phy bad
>     for slot 20) and again it was the same drive I'd moved out of the
>     previous slot months earlier. All the other drives have had no
>     issues since this server was first put into operation, but this one
>     drive has now had both backplane slots it was plugged into become
>     completely unresponsive (as far as I can tell the system doesn't
>     even detect them no matter what's plugged into them; I've not been
>     able to power-cycle the server to confirm if that would bring them
>     back online or not due to maintenance window timing for the extended
>     downtime a power-cycle could possibly require if there are issues).
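
To see at a glance which slots the controller has flagged, the kernel log can be filtered for this message. A rough sketch, assuming each event sits on a single line shaped like the one quoted above; the regular expression is inferred from that one sample, and `PhyEvent`, `parse_phy_events` and `slots_flagged` are made-up names, not part of any MegaRAID tooling:

```python
import re
from collections import Counter
from typing import Iterable, Iterator, NamedTuple

# The one log line quoted in this thread, rejoined onto a single line.
SAMPLE = (
    "megaraid_sas 0000:3d:00.0: 19793 (678503133s/0x0004/CRIT) - "
    "Enclosure PD 00(c Port 0 - 3/p1) phy bad for slot 20"
)

# Pattern inferred from SAMPLE only; other firmware versions may format
# the message differently.
EVENT_RE = re.compile(
    r"megaraid_sas (?P<pci>[0-9a-fA-F:.]+): (?P<seq>\d+) "
    r"\((?P<stamp>\d+)s/(?P<code>0x[0-9a-fA-F]+)/(?P<level>\w+)\) - "
    r".*phy bad for slot (?P<slot>\d+)"
)


class PhyEvent(NamedTuple):
    pci: str    # PCI address of the controller
    seq: int    # event sequence number
    level: str  # severity string, e.g. "CRIT"
    slot: int   # backplane slot named in the message


def parse_phy_events(lines: Iterable[str]) -> Iterator[PhyEvent]:
    """Yield one PhyEvent per line that matches the 'phy bad' pattern."""
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            yield PhyEvent(
                match["pci"], int(match["seq"]), match["level"], int(match["slot"])
            )


def slots_flagged(lines: Iterable[str]) -> Counter:
    """Count 'phy bad' events per slot number."""
    return Counter(event.slot for event in parse_phy_events(lines))
```

Feeding it the lines from `journalctl -k` or `dmesg` should give a per-slot tally, which makes it easy to line the failures up against which drive sat in which slot.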
> 
>     In a single failure or a double-failure with different drives I'd
>     chalk it up to the backplane being bad, but since both of these
>     failures have occurred with the same drive I have to consider that
>     the drive itself is potentially causing the problem rather than the
>     backplane being faulty. I mainly want to test this out of curiosity
>     and an interest in learning what could cause the backplane slots to
>     fail if it is a fault of the drive that was connected to them, as
>     the results aren't going to change things operations-wise (this
>     drive's not being put back in service again & I've installed a new
>     drive in the system to restore the array).
> 
>     Thanks for your time and your input,
> 
>     -Robert
>     _______________________________________________
>     Ale mailing list
>     Ale at ale.org
>     https://mail.ale.org/mailman/listinfo/ale
>     See JOBS, ANNOUNCE and SCHOOLS lists at
>     http://mail.ale.org/mailman/listinfo
> 
> 