[ale] Testing a SATA HDD for physical/electrical hardware faults
Jim Kinney
jim.kinney at gmail.com
Thu Aug 25 08:42:57 EDT 2022
And lead-free solders are more apt to whisker.
On Thu, Aug 25, 2022, 8:41 AM Phil Turmel via Ale <ale at ale.org> wrote:
> A lead whisker in the right place in the drive (or similar
> not-quite-direct-short) could definitely produce a port-killer.
>
> I'd crush that drive and get on with life.
>
> On 8/25/22 08:04, Jim Kinney via Ale wrote:
> > Holy crap!
> >
> > The only thing I can think of is either the current draw of the drive
> > is too high or the voltage tolerance of both is too tight. Either way
> > it's unlikely a failed drive can destroy a backplane slot. It's highly
> > likely it overloaded a tolerance that the controller read as "no slot
> > device". Unless that drive was really wonky and died with a strong
> > oscillating current draw that could damage a trim capacitor or an
> > inductor on the slot power it will likely recover on a power cycle.
> > Yeah, maintenance window challenge.
> >
> > Might be able to run the drive with some current measurements to see if
> > it can be running out of spec. Will need to cut power lines and splice
> > in gear.
> >
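If you do splice in a meter, something like the rough Python sketch below could flag readings that fall outside the drive's rating. The RailLimit fields, the tolerance, the current limit and the sample readings are all made-up placeholders; take the real limits from the drive label or datasheet.

```python
from dataclasses import dataclass


@dataclass
class RailLimit:
    nominal_v: float      # nominal rail voltage
    tolerance: float      # allowed fractional deviation, e.g. 0.10 for 10%
    max_current_a: float  # rated maximum current, from the drive label


def check_rail(samples, limit):
    """Summarize a list of (volts, amps) readings taken on one power rail."""
    volts = [v for v, _ in samples]
    amps = [a for _, a in samples]
    lo = limit.nominal_v * (1 - limit.tolerance)
    hi = limit.nominal_v * (1 + limit.tolerance)
    return {
        "voltage_ok": all(lo <= v <= hi for v in volts),
        "peak_current_a": max(amps),
        "current_ok": max(amps) <= limit.max_current_a,
        # A large swing hints at the oscillating draw mentioned above.
        "current_swing_a": max(amps) - min(amps),
    }


# Hypothetical limits and readings, for illustration only.
rail_12v = RailLimit(nominal_v=12.0, tolerance=0.10, max_current_a=1.0)
readings = [(12.1, 0.4), (11.9, 0.5), (11.8, 1.5), (12.0, 0.5)]
print(check_rail(readings, rail_12v))
```

A handheld meter will miss fast transients, so a clean result here would not rule the drive out.
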
> > It might be possible to make the controller re-read the backplane.
> >
> > On Wed, Aug 24, 2022, 8:56 PM Robert Tweedy via Ale <ale at ale.org> wrote:
> >
> > Hey ALE, I have a hard drive that I'm planning to discard due to
> > what I'm about to describe below, but before I do that I'm
> > interested in seeing if anyone knows if there's some way to do an
> > in-depth test of its physical hardware if it's connected to an old
> > desktop tower, like any specialized Linux packages specifically
> > capable of doing advanced hardware testing beyond what's achievable
> > by smartctl; for the type of testing I'd like to do I presume that
> > it's probably not feasible if it's connected to a standard
> > motherboard's SATA slot and I'd need some specialized hardware to
> > test it, but I just wanted to check to confirm this.
> >
> > Anyway, story time for those who'd be interested: this 16TB SATA
> > drive arrived in a set of 20 along with a server that contains an
> > AVAGO 3108 MegaRAID card & more than enough bays to hold all of the
> > drives with spares for later expansion. After working fine for over
> > a year, the system began notifying that its RAID array was degraded
> > due to a PHY failure (i.e., the slot on the drive backplane stopped
> > working) causing the drive to disappear from the array. Moving the
> > drive to another spare slot in the server brought it back online &
> > the RAID card happily detected the drive & rebuilt the array;
> > smartctl was used to run a S.M.A.R.T. test on the drive just in case
> > and it reported no problems, so the slot it came from was noted as
> > defective and no further troubleshooting was performed since the
> > system was now back in full operation & a single slot failure wasn't
> > too concerning since there were plenty of spare slots available & a
> > lack of available time to dedicate IT staff resources to
> > investigating further. A few months later, the system began notifying of
> > a degraded RAID array again and looking into it I found the exact
> > same type of error being reported (megaraid_sas 0000:3d:00.0: 19793
> > (678503133s/0x0004/CRIT) - Enclosure PD 00(c Port 0 - 3/p1) phy bad
> > for slot 20) and again it was the same drive I'd moved out of the
> > previous slot months earlier. All the other drives have had no
> > issues since this server was first put into operation, but this one
> > drive has now had both backplane slots it was plugged into become
> > completely unresponsive (as far as I can tell the system doesn't
> > even detect them no matter what's plugged into them; I've not been
> > able to power-cycle the server to confirm if that would bring them
> > back online or not due to maintenance window timing for the extended
> > downtime a power-cycle could possibly require if there are issues).
> >
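To keep track of which slots have thrown this error over time, a small Python sketch like the one below could tally the events from saved kernel logs. The regular expression is modelled only on the one log line quoted above, so other firmware or driver versions may word the event differently. The function name phy_bad_slots is just a placeholder.

```python
import re
from collections import Counter

# Pattern modelled on the single megaraid_sas line quoted above; the exact
# wording on other firmware or driver versions is an assumption.
PHY_BAD = re.compile(
    r"megaraid_sas (?P<pci>\S+): (?P<seq>\d+) "
    r"\((?P<secs>\d+)s/(?P<cls>0x[0-9a-fA-F]+)/(?P<level>\w+)\) - "
    r".*phy bad for slot (?P<slot>\d+)"
)


def phy_bad_slots(lines):
    """Count 'phy bad' events per backplane slot number."""
    counts = Counter()
    for line in lines:
        match = PHY_BAD.search(line)
        if match:
            counts[int(match.group("slot"))] += 1
    return counts


sample = [
    "megaraid_sas 0000:3d:00.0: 19793 (678503133s/0x0004/CRIT) - "
    "Enclosure PD 00(c Port 0 - 3/p1) phy bad for slot 20",
    "some unrelated kernel message",
]
print(phy_bad_slots(sample))
```

Feeding it the lines from dmesg output or an archived kernel log would show whether any slot other than the two this drive occupied has ever reported the fault.
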
> > In a single failure or a double-failure with different drives I'd
> > chalk it up to the backplane being bad, but since both of these
> > failures have occurred with the same drive I have to consider that
> > the drive itself is potentially causing the problem rather than the
> > backplane being faulty. I mainly want to test this out of curiosity
> > and an interest in learning what could cause the backplane slots to
> > fail if it is a fault of the drive that was connected to them, as
> > the results aren't going to change things operations-wise (this
> > drive's not being put back in service again & I've installed a new
> > drive in the system to restore the array).
> >
> > Thanks for your time and your input,
> >
> > -Robert
> > _______________________________________________
> > Ale mailing list
> > Ale at ale.org
> > https://mail.ale.org/mailman/listinfo/ale
> > See JOBS, ANNOUNCE and SCHOOLS lists at
> > http://mail.ale.org/mailman/listinfo