[ale] Testing a SATA HDD for physical/electrical hardware faults

Thu Aug 25 08:42:57 EDT 2022

And lead-free solders are more apt to whisker.

On Thu, Aug 25, 2022, 8:41 AM Phil Turmel via Ale <ale at ale.org> wrote:

> A lead whisker in the right place in the drive (or similar
> not-quite-direct-short) could definitely produce a port-killer.
>
> I'd crush that drive and get on with life.
>
> On 8/25/22 08:04, Jim Kinney via Ale wrote:
> > Holy crap!
> >
> > The only thing I can think of is either the current draw of the drive
> > is too high or the voltage tolerance of both is too tight. Either way
> > it's unlikely a failed drive can destroy a backplane slot. It's highly
> > likely it overloaded a tolerance that the controller read as "no slot
> > device". Unless that drive was really wonky and died with a strong
> > oscillating current draw that could damage a trim capacitor or an
> > inductor on the slot power it will likely recover on a power cycle.
> > Yeah, maintenance window challenge.
> >
> > Might be able to run the drive with some current measurements to see if
> > it can be running out of spec. Will need to cut power lines and splice
> > in gear.
> >
> > It might be possible to make the controller re-read the backplane.
> >
> > On Wed, Aug 24, 2022, 8:56 PM Robert Tweedy via Ale <ale at ale.org> wrote:
> >
> >     Hey ALE, I have a hard drive that I'm planning to discard due to
> >     what I'm about to describe below, but before I do that I'm
> >     interested in seeing if anyone knows if there's some way to do an
> >     in-depth test of its physical hardware if it's connected to an old
> >     desktop tower, like any specialized Linux packages specifically
> >     capable of doing advanced hardware testing beyond what's achievable
> >     by smartctl; for the type of testing I'd like to do I presume that
> >     it's probably not feasible if it's connected to a standard
> >     motherboard's SATA slot and I'd need some specialized hardware to
> >     test it, but I just wanted to check to confirm this.
> >
> >     Anyway, story time for those who'd be interested: this 16TB SATA
> >     drive arrived in a set of 20 along with a server that contains an
> >     AVAGO 3108 MegaRAID card & more than enough bays to hold all of the
> >     drives with spares for later expansion. After working fine for over
> >     a year, the system began notifying that its RAID array was degraded
> >     due to a PHY failure (i.e., the slot on the drive backplane stopped
> >     working) causing the drive to disappear from the array. Moving the
> >     drive to another spare slot in the server brought it back online &
> >     the RAID card happily detected the drive & rebuilt the array;
> >     smartctl was used to run a S.M.A.R.T. test on the drive just in case
> >     and it reported no problems, so the slot it came from was noted as
> >     defective and no further troubleshooting was performed since the
> >     system was now back in full operation & a single slot failure wasn't
> >     too concerning since there were plenty of spare slots available & a
> >     lack of available time to dedicate IT staff resources to
> >     investigating further. A few months later, the system began notifying of
> >     a degraded RAID array again and looking into it I found the exact
> >     same type of error being reported (megaraid_sas 0000:3d:00.0: 19793
> >     (678503133s/0x0004/CRIT) - Enclosure PD 00(c Port 0 - 3/p1) phy bad
> >     for slot 20) and again it was the same drive I'd moved out of the
> >     previous slot months earlier. All the other drives have had no
> >     issues since this server was first put into operation, but this one
> >     drive has now had both backplane slots it was plugged into become
> >     completely unresponsive (as far as I can tell the system doesn't
> >     even detect them no matter what's plugged into them; I've not been
> >     able to power-cycle the server to confirm if that would bring them
> >     back online or not due to maintenance window timing for the extended
> >     downtime a power-cycle could possibly require if there are issues).
> >
> >     In a single failure or a double-failure with different drives I'd
> >     chalk it up to the backplane being bad, but since both of these
> >     failures have occurred with the same drive I have to consider that
> >     the drive itself is potentially causing the problem rather than the
> >     backplane being faulty. I mainly want to test this out of curiosity
> >     and an interest in learning what could cause the backplane slots to
> >     fail if it is a fault of the drive that was connected to them, as
> >     the results aren't going to change things operations-wise (this
> >     drive's not being put back in service again & I've installed a new
> >     drive in the system to restore the array).
> >
> >     Thanks for your time and your input,
> >
> >     -Robert
> >     _______________________________________________
> >     Ale mailing list
> >     Ale at ale.org <mailto:Ale at ale.org>
> >     https://mail.ale.org/mailman/listinfo/ale
> >     <https://mail.ale.org/mailman/listinfo/ale>
> >     See JOBS, ANNOUNCE and SCHOOLS lists at
> >     http://mail.ale.org/mailman/listinfo
> >     <http://mail.ale.org/mailman/listinfo>
> >
>
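
The megaraid_sas message quoted in Robert's original post names the failed
slot directly ("phy bad for slot 20"), so one cheap way to keep track of
which slots a suspect drive has taken out is to scan the kernel log for that
message. Below is a minimal Python sketch of that idea. It assumes the log
line looks exactly like the one quoted in the thread (rejoined onto one
line); the function name, the regular expression, and the group names are
illustrative and do not come from the driver's documentation, so other
firmware or driver versions may word the message differently.

```python
import re

# Pattern modeled only on the single log line quoted in the thread.
# Group names (pci, level, slot) are illustrative labels, not driver terms.
PHY_BAD = re.compile(
    r"megaraid_sas (?P<pci>\S+): .*?/(?P<level>[A-Z]+)\) - "
    r".*phy bad for slot (?P<slot>\d+)"
)

# The message from the original post, rejoined onto one line.
SAMPLE = (
    "megaraid_sas 0000:3d:00.0: 19793 (678503133s/0x0004/CRIT) - "
    "Enclosure PD 00(c Port 0 - 3/p1) phy bad for slot 20"
)


def find_bad_phy_slots(lines):
    """Return a dict mapping slot number -> count of 'phy bad' messages."""
    counts = {}
    for line in lines:
        match = PHY_BAD.search(line)
        if match:
            slot = int(match.group("slot"))
            counts[slot] = counts.get(slot, 0) + 1
    return counts
```

To use it, pass any iterable of log lines, for example the lines of saved
dmesg or journal output opened as a text file. Comparing the reported slots
against a record of where the drive has been installed would show whether
the failures follow the drive, which is the pattern described in the thread.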