[ale] Ram, sigstop, swap, pids, etc

Mon Feb 8 15:51:56 EST 2021

I've got job priority already. It's effectively nice levels set at job start time. That won't get the scheduler to launch Mary's job onto the busy node. Once there, yes, I can nice -19 Mary and Bob does basically nothing, though not exactly zero.

I'm beginning to see the scheduler as the sticking point. It needs to overload a node (or 20), since the other jobs will get nice +20 and Mary gets nice -19.
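
For reference, a minimal Python sketch of the renice step once both jobs share a node. The helper names clamp_nice and renice are made up for illustration; os.setpriority and os.getpriority are the standard library calls. Linux niceness runs from -20 to 19, so a request for +20 ends up as 19. Raising a value needs no privilege, but lowering it (Mary's -19) needs root or CAP_SYS_NICE.

```python
import os

NICE_MIN, NICE_MAX = -20, 19  # Linux niceness range


def clamp_nice(value: int) -> int:
    """Clamp a requested niceness into the range the kernel accepts."""
    return max(NICE_MIN, min(NICE_MAX, value))


def renice(pid: int, value: int) -> int:
    """Set the niceness of one process and return what the kernel now reports.

    Hypothetical helper: raises PermissionError when lowering the value
    without sufficient privilege.
    """
    os.setpriority(os.PRIO_PROCESS, pid, clamp_nice(value))
    return os.getpriority(os.PRIO_PROCESS, pid)
```

Something like renice(bob_pid, 20) and renice(mary_pid, -19) would cover the case above, with bob_pid and mary_pid standing in for the real job pids. None of this helps with getting the scheduler to place Mary on the node in the first place.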

On February 8, 2021 3:47:01 PM EST, Chuck Payne <terrorpup at gmail.com> wrote:
>Is this where nice would come into play? Or using CPULimit on a job?
>
>On Mon, Feb 8, 2021, 3:36 PM Jim Kinney via Ale <ale at ale.org> wrote:
>
>> I've been looking at criu. My use case is HPC.
>>
>> On the performance issues: since Bob is not running while it gets paged to
>> swap, Mary will only slow a bit for the page-out time. Bob can suffer,
>> since Mary owns the hardware.
>>
>> The thing criu does that I can't see a way to work with is the pid change
>> on restore.
>>
>> In sge and variants, there's a shepherd process that manages the job
>> process tree that's run on the hpc nodes. Criu would have to pause the
>> shepherd process for each job, which breaks the node daemon, or pause the
>> job, which breaks the shepherd.
>>
>> Granted, I'm still in theory land with no practical testing yet.
>>
>> If only this hpc process actually worked with cgroups as is claimed....
>>
>> On February 8, 2021 3:09:29 PM EST, Solomon Peachy via Ale <ale at ale.org>
>> wrote:
>>>
>>> On Mon, Feb 08, 2021 at 02:13:55PM -0500, Jim Kinney via Ale wrote:
>>>
>>>> Will the kernel move Bob's process from ram to swap and back if it
>>>> sits in STOP for a while (hours to days)? Unknown how long after Mary
>>>> starts that it eats all the RAM.
>>>>
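For anyone wanting to try the STOP part by hand, a rough Python sketch of stopping and resuming a job. The function names are made up for illustration. It assumes the caller is the direct parent of the job, since os.waitpid only reports on children; in sge that role belongs to the shepherd, so this is only a stand-in for testing.

```python
import os
import signal
import subprocess
import sys


def stop_and_confirm(pid: int) -> bool:
    """Send SIGSTOP to a direct child and wait until it is reported stopped."""
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid return for stopped children, not only exited ones.
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume(pid: int) -> None:
    """Let a stopped process run again."""
    os.kill(pid, signal.SIGCONT)


def demo() -> bool:
    """Stop and resume a throwaway child; returns True if the stop was seen."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        stopped = stop_and_confirm(child.pid)
        resume(child.pid)
        return stopped
    finally:
        child.kill()
        child.wait()
```

A stopped process keeps its pages until something else needs the RAM, so this on its own frees nothing.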
>>>
>>> It won't automatically move Bob's process to swap in one fell swoop;
>>> instead as Mary's process needs more RAM, Bob's will get incrementally
>>> paged out as it's not actively being accessed.
>>>
>>> And when Mary's is finished, once Bob's is allowed to resume, it will
>>> get incrementally paged back in as its components are needed. (There's
>>> probably a tunable or other mechanism to "encourage" it to page back in
>>> more quickly, beyond running swapoff and forcing everything back.)
>>>
>>> Performance is going to suffer while the paging is happening.
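One way to watch the incremental paging described above is to poll the VmRSS and VmSwap lines of /proc/<pid>/status on Linux. A small Python sketch follows; the function names are made up for illustration, and kernel threads have no Vm* lines, so the result may be empty.

```python
import re


def parse_vm_fields(status_text: str) -> dict:
    """Extract VmRSS and VmSwap (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for name in ("VmRSS", "VmSwap"):
        match = re.search(rf"^{name}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
        if match:
            fields[name] = int(match.group(1))
    return fields


def read_vm_fields(pid: int) -> dict:
    """Read resident and swapped sizes for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_fields(handle.read())
```

Calling read_vm_fields on Bob's pid every few minutes while Mary ramps up should show VmRSS falling and VmSwap rising, and the reverse after Bob resumes.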
>>>
>>> Perhaps a better option is the explicit checkpoint/restore mechanism
>>> using the criu tool.
>>>
>>>  - Solomon
>>>
>>>
>> --
>> Computers amplify human error
>> Super computers are really cool

-- 
Computers amplify human error
Super computers are really cool