[ale] shared research server help

Jim Kinney jim.kinney at gmail.com
Thu Oct 5 07:52:15 EDT 2017


Back to the original issue:

A tool like torque or slurm is really your best solution for heavily contended shared resources. It keeps 2 big jobs from eating the same machine, and it also encourages users to write code that manages resources better so they can run more jobs.

I have the same problem. One heavy GPU machine (4 Tesla P100s) has only 64G of RAM. A student tried to load 200+G of data into RAM.

A few crashes later he can run 2 jobs at once, each eating only 30G of RAM and one P100.
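
To make that concrete, here is a rough Python sketch of the kind of admission check a scheduler does, using the numbers from my GPU box (64G of RAM, 4 GPUs, 30G jobs). The Job and Machine classes and the try_start method are names I made up for illustration. This is not how torque or slurm are actually implemented, and a real scheduler would queue a job that doesn't fit rather than just refuse it.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    ram_gb: int   # memory the job asks for
    gpus: int     # GPUs the job asks for


@dataclass
class Machine:
    ram_gb: int
    gpus: int
    running: list = field(default_factory=list)

    def free_ram_gb(self) -> int:
        # RAM not yet promised to a running job
        return self.ram_gb - sum(j.ram_gb for j in self.running)

    def free_gpus(self) -> int:
        # GPUs not yet promised to a running job
        return self.gpus - sum(j.gpus for j in self.running)

    def try_start(self, job: Job) -> bool:
        """Start the job only if its request fits in what is still free."""
        if job.ram_gb <= self.free_ram_gb() and job.gpus <= self.free_gpus():
            self.running.append(job)
            return True
        return False  # hypothetical: a real scheduler would queue it instead


gpu_box = Machine(ram_gb=64, gpus=4)
print(gpu_box.try_start(Job("load-everything", ram_gb=200, gpus=1)))  # False
print(gpu_box.try_start(Job("run-a", ram_gb=30, gpus=1)))             # True
print(gpu_box.try_start(Job("run-b", ram_gb=30, gpus=1)))             # True
print(gpu_box.try_start(Job("run-c", ram_gb=30, gpus=1)))             # False
```

The 200G request never starts, so it can't take the machine down, and the third 30G job is held back because only 4G is left.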

On October 4, 2017 6:32:32 PM EDT, Todor Fassl <fassl.tod at gmail.com> wrote:
>I manage a group of research servers for grad students at a university.
>
>The grad students use these machines to do the research for their Ph.D.
>theses. The problem is that they pretty regularly kill off each other's
>programs by using up all the ram. Most of the machines have 256G of ram.
>One kid uses 200GB and another 100GB and one or the other, often both,
>die. Sometimes they bring the machines down by hogging the cpu or using
>up all the ram. Well, the machines never crash but they might as well
>be down.
>
>We really, really don't want to force them to use a scheduling system
>like slurm. They are just learning and they might run the same piece of
>code 20 times in an hour.
>
>Is there a way to set a limit on the amount of ram all of a user's
>processes can use? If so, we were thinking of setting it at 50% of the
>on-board ram. Then it would take 3 students together to trash a machine.
>It might still happen but it would be a lot more infrequent.
>
>Any other suggestions? Anything at all? Just keep in mind that we really
>want to keep it easy for the students to play around.
>
>
>-- 
>Todd

-- 
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.