[ale] lsof and a hung system

Jim Kinney jkinney at jimkinney.us
Tue Oct 20 12:25:15 EDT 2015


Yep. The 10G card driver had oopsed all over itself and wouldn't keep a
connection up. I first tried stopping the network, unloading the module,
reloading it, and starting the network again, but even that failed to
reset the card completely. I needed to add a sleep 20 between unloading
and reloading the module. Once the connection was actually working, the
system was cleanly rebooted to lop off the zombies, and things were
happily OK.

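
For anyone hitting the same thing, the reset sequence above can be sketched
roughly as follows. This is only an illustration, not the exact commands I
ran: the module name "my_10g_driver" is a made-up placeholder, and the use of
"service network stop/start" assumes a SysV-style network service, which may
not match your distro. It defaults to a dry run that only prints the steps.

```python
import subprocess
import time


def reset_nic_commands(module, settle_seconds=20):
    """Return the ordered reset steps as (kind, value) tuples.

    'module' is whatever kernel module drives the card; the value used
    in the example at the bottom is a hypothetical placeholder.
    """
    return [
        ("run", ["service", "network", "stop"]),   # assumed SysV-style service
        ("run", ["modprobe", "-r", module]),       # unload the driver
        ("sleep", settle_seconds),                 # give the card time to settle
        ("run", ["modprobe", module]),             # load the driver again
        ("run", ["service", "network", "start"]),
    ]


def reset_nic(module, settle_seconds=20, dry_run=True):
    """Print the steps (dry_run=True) or actually execute them as root."""
    for kind, value in reset_nic_commands(module, settle_seconds):
        if kind == "sleep":
            if dry_run:
                print(f"sleep {value}")
            else:
                time.sleep(value)
        elif dry_run:
            print(" ".join(value))
        else:
            # check=True stops the sequence at the first failing command
            subprocess.run(value, check=True)


if __name__ == "__main__":
    # Hypothetical module name; substitute the real driver for your card.
    reset_nic("my_10g_driver")
```

Run with dry_run=False (as root) only after checking that the printed
commands make sense for your system. The 20 second pause is simply the value
that worked here, not a known minimum.
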
On Tue, 2015-10-20 at 11:32 -0400, Ed Cashin wrote:
> On Mon, Oct 19, 2015 at 10:58 PM, Jim Kinney <jim.kinney at gmail.com>
> wrote:
> ... 
> > Other system with same nfs mounted storage is fine. Storage server
> > is connected to both number crunchers by dedicated, unswitched
> > 10Gbps fiber ethernet. 
>
> You mean with direct connections?  In that case, the other number
> cruncher's connection could be fine while the affected system is
> unable to reach the NFS server over the network (for some as yet
> undetermined reason), which could result in the behavior you describe
> if the NFS mount is "hard".
> 
> -- 
>   Ed Cashin <ecashin at noserose.net>