[ale] Lab Workstation Mystery
Jim Kinney
jim.kinney at gmail.com
Mon Mar 28 12:34:36 EDT 2016
The root dir is NOT NFS mounted, so the fact that you can't mount /home
later is a red herring. If /var is not writable, the system will hang
because it can't log any more. Mounting requires a log entry.
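
One way to test the "can't log" theory on a machine that is still
reachable over ssh is to try creating a file where the logs live. This
is only a rough sketch, assuming Python 3 is installed on the
workstation; the function name can_write and the list of directories
are my own choices for illustration, not anything from the original
setup.

```python
import errno
import os
import tempfile


def can_write(directory):
    """Try to create and sync a temp file in `directory`.

    Returns (True, None) on success, or (False, errno_name) on failure,
    e.g. (False, "EROFS") on a filesystem that was remounted read-only.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())  # force the write down to the disk
        return True, None
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    # Hypothetical list of places worth probing; adjust to the local layout.
    for directory in ("/var/log", "/var/tmp", "/tmp"):
        ok, reason = can_write(directory)
        status = "writable" if ok else "NOT writable (%s)" % reason
        print(directory, status)
```

Run it as root so that ordinary permission errors don't mask a
read-only or failing filesystem.
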
Since it's not happening to all the machines at once, it really smells
like a local machine problem. Verify that the drive is not full. Check
whether the affected machines are on the same power circuit. Is it the
same 2-3 each time? If so, run memtest and badblocks. If swap gets
corrupted, Linux systems lock up.
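
The "drive full" and "remounted read-only" checks can be scripted so
they can be run from cron or over ssh on every lab machine. The
following is a sketch under assumptions: Python 3 on a Linux box with
/proc/mounts available, and a watch list of mount points that I picked
for illustration. The function names are hypothetical.

```python
import os
import shutil


def read_only_mounts(mounts_text, watch=("/", "/var", "/home")):
    """Return watched mount points whose mount options include 'ro'.

    `mounts_text` is the content of /proc/mounts. The options field is
    split on commas, so 'errors=remount-ro' alone does not count as 'ro'.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point, options = fields[1], fields[3].split(",")
        if mount_point in watch and "ro" in options:
            flagged.append(mount_point)
    return flagged


def percent_used(path):
    """Approximate percentage of the filesystem at `path` that is in use.

    This may differ slightly from df, which accounts for reserved blocks.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            print("read-only:", read_only_mounts(handle.read()))
    for path in ("/", "/var"):
        if os.path.exists(path):
            print(path, "%.1f%% used" % percent_used(path))
```

Logging the output to another host (or mailing it) matters here, since
the local syslog stops being written once the problem starts.
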
On Mon, 2016-03-28 at 10:54 -0500, Todor Fassl wrote:
> I have a mysterious problem with workstations in a shared use
> environment. There are 2 labs in different buildings, one with 6
> workstations and one with 8. These workstations are used by a group
> of about 30 grad student TAs. All are running Ubuntu 15.10.
> Authentication is via LDAP and home directories are mounted via NFS.
> Every day, 2 or 3 of the machines go down. The earliest symptom I can
> find is that the root filesystem is remounted read-only. Soon they
> stop responding to ssh and snmp and they are essentially locked up.
> They still respond to pings though.
>
> I've caught the machines in the period where the root system is
> read-only but I can still ssh to them. I've found that I cannot nfs
> mount home directories on our file server. I can mount nfs shares on
> other servers. And I can mount the same home directories if I go to
> another workstation. Restarting nfs on the file server has no effect.
>
> When I try to mount a home directory on an affected machine, the
> mount just hangs. I ran it with strace and it just showed it was
> waiting -- for what, I'm not sure and I don't have a screen cap
> available at the moment. I put a packet sniffer on the server and it
> showed it received a single packet from the client and that's it.
>
> There is nothing in the logs on the client. In fact, they simply stop
> at some point in the process. At first I attributed this to the root
> filesystem being read-only but it continued after I moved /var to a
> separate file system. At some point it just stops writing records to
> the syslog but I don't know if it's before or after the root
> filesystem is remounted read-only.
>
> Many of the TAs also have identical workstations in their offices.
> None of those machines seem to have this problem. The TAs do tend to
> walk away from the workstations w/o logging out. But I wrote a script
> to kill off their sessions and it didn't help. I had it send me an
> email whenever it killed somebody's session and it doesn't seem to be
> correlated with that. In other words, sometimes machines go down even
> if everyone who has used it has remembered to log out.
>
> I'm pretty desperate. Any ideas?
--
James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain
http://heretothereideas.blogspot.com/