[ale] shared research server help

Thu Oct 5 10:25:07 EDT 2017

It doesn't just happen with students.

A few years ago I worked at a big network gear maker. We had multiple test/dev, staging and production environments. Right after I started, one of the first things they assigned to me was determining which of two environment groups was using more resources on their shared server. Since it was HP-UX, I was able to set up data captures per environment in Glance/MeasureWare. A day later I was able to send graphs showing that one environment group was using 95% of the resources.

Graphs impress the untrained far more than a detailed analysis or you simply telling them what the problem is. Being able to quickly give them an answer to what had apparently been a long-running argument was one of the many things that made them ask the headhunter for a person just like me when I left to return to Atlanta.

One thing that occurred to me on your original question was the idea of giving students their own virtual machines. You can assign vCPUs, storage and RAM to each virtual machine so that students can't exceed what has been assigned. Of course, I've not worked with Slurm or other resource-limiting tools on Linux (other than ulimits, as mentioned by someone else).

-----Original Message-----
From: ale-bounces at ale.org [mailto:ale-bounces at ale.org] On Behalf Of Todor Fassl
Sent: Thursday, October 05, 2017 9:27 AM
To: Jim Kinney; Atlanta Linux Enthusiasts
Subject: Re: [ale] shared research server help

Right, Jim, another aspect of this problem is that most of the students don't even realize they need to be careful, much less how to be careful. "What? Is there a problem with me asking for 500 gigabytes of RAM?" Well, the machine has only 256. But I'm just the IT guy and it's not my place to demand that these students demonstrate a basic understanding of sharing resources before getting started. The instructors would never go for that. I am pretty much stuck providing that informally on a one-to-one basis. But I think it would be valuable for me to work on automating that somehow. Pointers to the wiki, stuff like that.

Somebody emailed me off list and made a really good point. The key, I think, is information. Well, that and peer pressure. I know Nagios can trigger an alert when a machine runs low on RAM or CPU cycles. It might even be able to determine who is running the procs that are causing it. I can at least put all the users in a Nagios group and send them alerts when a research server is near an OOM event. I'll have to see what kind of granularity I can get out of Nagios and experiment with who gets notified. I can do things like keep widening the group that gets notified of an event if the original setup turns out to be ineffective.
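
For the "who is running the procs" part, here is a rough sketch of the per-user accounting, not anything Nagios ships with. It assumes a Linux box where `ps -e -o user= -o rss=` prints one "user rss-in-KiB" pair per process; the function names `rss_by_user` and `current_top_users` are made up for illustration, and wiring the result into a Nagios check or notification is left out.

```python
import subprocess
from collections import defaultdict


def rss_by_user(ps_output):
    """Sum resident memory (KiB) per user from 'user rss' lines.

    Returns a list of (user, total_kib) pairs, largest consumer first.
    Lines that do not look like 'name number' are skipped.
    """
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        user, rss_kib = parts
        totals[user] += int(rss_kib)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def current_top_users(limit=5):
    """Hypothetical helper: run ps on this machine and rank users by RSS."""
    out = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "rss="],
        capture_output=True, text=True, check=True,
    ).stdout
    return rss_by_user(out)[:limit]


if __name__ == "__main__":
    for user, kib in current_top_users():
        print(f"{user}\t{kib / 1024:.0f} MiB")
```

Summed RSS counts shared pages more than once, so treat the numbers as a ranking of who to talk to first rather than an exact bill.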

This list has really come through for me again just with ideas I can bounce around. I'll have to tread lightly though. About a year ago, I configured the machines in our shared labs to log someone off after 15 minutes of inactivity. Believe it or not, that was controversial. Not with the faculty but with the students using the labs. It was an easy win for me but some of the students went to the faculty with complaints. Wait, you're actually defending your right to walk away from a workstation in a public place still logged in? In a way that's not such a bad thing. This is a university and the students should run the place. But they need a referee.