[ale] Testing a SATA HDD for physical/electrical hardware faults
Phil Turmel
philip at turmel.org
Thu Aug 25 08:40:58 EDT 2022
A lead whisker in the right place in the drive (or similar
not-quite-direct-short) could definitely produce a port-killer.
I'd crush that drive and get on with life.
On 8/25/22 08:04, Jim Kinney via Ale wrote:
> Holy crap!
>
> The only thing I can think of is either the current draw of the drive
> is too high or the voltage tolerance of both is too tight. Either way
> it's unlikely a failed drive can destroy a backplane slot. It's highly
> likely it overloaded a tolerance that the controller read as "no slot
> device". Unless that drive was really wonky and died with a strong
> oscillating current draw that could damage a trim capacitor or an
> inductor on the slot power, it will likely recover on a power cycle.
> Yeah, maintenance window challenge.
>
> Might be able to run the drive with some current measurements to see if
> it can be running out of spec. Will need to cut power lines and splice
> in gear.
>
> It might be possible to make the controller re-read the backplane.
>
> On Wed, Aug 24, 2022, 8:56 PM Robert Tweedy via Ale <ale at ale.org>
> wrote:
>
> Hey ALE, I have a hard drive that I'm planning to discard for the
> reasons described below. Before I do, I'd like to know whether there's
> a way to run an in-depth test of its physical hardware while it's
> connected to an old desktop tower, e.g. specialized Linux packages
> capable of advanced hardware testing beyond what smartctl can do. I
> presume this kind of testing isn't feasible through a standard
> motherboard's SATA port and would need specialized hardware, but I
> wanted to check.
>
> Anyway, story time for those who'd be interested: this 16TB SATA
> drive arrived in a set of 20 along with a server that contains an
> AVAGO 3108 MegaRAID card & more than enough bays to hold all of the
> drives with spares for later expansion. After working fine for over
> a year, the system began notifying that its RAID array was degraded
> due to a PHY failure (i.e., the slot on the drive backplane stopped
> working) causing the drive to disappear from the array. Moving the
> drive to another spare slot in the server brought it back online &
> the RAID card happily detected the drive & rebuilt the array;
> smartctl was used to run a S.M.A.R.T. test on the drive just in case
> and it reported no problems, so the slot it came from was noted as
> defective and no further troubleshooting was performed since the
> system was now back in full operation & a single slot failure wasn't
> too concerning since there were plenty of spare slots available & a
> lack of available time to dedicate IT staff resources to
> investigating further. A few months later, the system began notifying of
> a degraded RAID array again and looking into it I found the exact
> same type of error being reported (megaraid_sas 0000:3d:00.0: 19793
> (678503133s/0x0004/CRIT) - Enclosure PD 00(c Port 0 - 3/p1) phy bad
> for slot 20) and again it was the same drive I'd moved out of the
> previous slot months earlier. All the other drives have had no
> issues since this server was first put into operation, but this one
> drive has now had both backplane slots it was plugged into become
> completely unresponsive (as far as I can tell the system doesn't
> even detect them no matter what's plugged into them; I've not been
> able to power-cycle the server to confirm if that would bring them
> back online or not due to maintenance window timing for the extended
> downtime a power-cycle could possibly require if there are issues).
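
For anyone who wants to tally these events from the kernel log before deciding which component to blame, here is a minimal sketch in Python. It assumes the message always has the shape of the single megaraid_sas line quoted above; other firmware or driver versions may word it differently. The function name find_phy_bad and the field names are made up for illustration and are not part of any MegaRAID tooling.

```python
import re

# Pattern modelled on the one log line quoted in this thread (assumption:
# other driver/firmware versions may format the event differently).
PHY_BAD = re.compile(
    r"megaraid_sas (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):"
    r".*?\((?P<seconds>\d+)s/0x[0-9a-fA-F]+/(?P<level>\w+)\)"
    r".*?phy bad for slot (?P<slot>\d+)"
)


def find_phy_bad(lines):
    """Return one dict per 'phy bad' event found in an iterable of log lines."""
    events = []
    for line in lines:
        match = PHY_BAD.search(line)
        if match:
            events.append(
                {
                    "pci": match["pci"],                # controller PCI address
                    "seconds": int(match["seconds"]),   # raw counter from the event
                    "level": match["level"],            # severity tag, e.g. CRIT
                    "slot": int(match["slot"]),         # backplane slot number
                }
            )
    return events


if __name__ == "__main__":
    sample = (
        "megaraid_sas 0000:3d:00.0: 19793 (678503133s/0x0004/CRIT) - "
        "Enclosure PD 00(c Port 0 - 3/p1) phy bad for slot 20"
    )
    for event in find_phy_bad([sample]):
        print(event)
```

Feeding it saved kernel log lines gives a per-slot record that can be lined up against the dates a given drive occupied each slot.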
>
> In a single failure or a double-failure with different drives I'd
> chalk it up to the backplane being bad, but since both of these
> failures have occurred with the same drive I have to consider that
> the drive itself is potentially causing the problem rather than the
> backplane being faulty. I mainly want to test this out of curiosity
> and an interest in learning what could cause the backplane slots to
> fail if it is a fault of the drive that was connected to them, as
> the results aren't going to change things operations-wise (this
> drive's not being put back in service again & I've installed a new
> drive in the system to restore the array).
>
> Thanks for your time and your input,
>
> -Robert
> _______________________________________________
> Ale mailing list
> Ale at ale.org <mailto:Ale at ale.org>
> https://mail.ale.org/mailman/listinfo/ale
> <https://mail.ale.org/mailman/listinfo/ale>
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
> <http://mail.ale.org/mailman/listinfo>