[ale] Supporting Linux on super computers?

Jim Kinney jim.kinney at gmail.com
Tue Jun 4 08:20:40 EDT 2024


The tuning does involve heavy use of kernel params and a solid understanding
of the programs being run.

Add to this the interconnect: Ethernet vs. InfiniBand or
Slingshot (IPoIB with dynamic adaptive routing).

Now couple that with file storage, its bandwidth and IO performance limits, and
the horrors of "tiny files".

Those systems all have params that can be tweaked. The simple test is total
time to run the code. What can be done to speed it up? Getting that data
involves many, many runs of the same dataset with tiny changes in settings.
There are threshold events that act as an inflection point, where a tiny param
change can have a big impact on performance.
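
As a rough sketch of that kind of sweep (not a tool I'm claiming anyone
actually uses), something like the following times one command per candidate
value and points at where the runtime moved the most. The function names and
the idea of taking the median of a few repeats are my own assumptions.

```python
import statistics
import subprocess
import time


def time_command(cmd, repeats=3):
    """Run cmd several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def sweep(values, make_cmd, repeats=3):
    """Time one command per candidate value; return {value: median_seconds}."""
    return {v: time_command(make_cmd(v), repeats) for v in values}


def biggest_jump(results):
    """Return the adjacent pair of values with the largest change in runtime.

    A large jump between two neighbouring settings is a hint that a
    threshold sits somewhere between them and deserves a finer sweep.
    """
    ordered = sorted(results)
    pairs = zip(ordered, ordered[1:])
    return max(pairs, key=lambda p: abs(results[p[1]] - results[p[0]]))
```

Usage would be along the lines of
sweep([1, 2, 4, 8], lambda v: ["./run_job.sh", str(v)]), where run_job.sh is a
hypothetical wrapper that applies the setting and runs the same dataset, and
then biggest_jump() on the result.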

Data collection. Sar is a key data source for evaluating system use. But a
check every 10 minutes is useless. So it gets turned up to every minute or
even more often. Now that tool is competing with the application for CPU time.
So dedicate a core to perf data collection. Now the application might be
running "unbalanced". Some code runs better on an even number of cores per
CPU. Some code doesn't care.
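
A minimal sketch of the core-reservation idea, assuming Linux and Python's
os.sched_setaffinity. The helper names are made up for illustration, and
reserving the highest-numbered core is an arbitrary choice.

```python
import os


def split_cores(available, reserved=1):
    """Split core ids into (monitor_cores, app_cores).

    The highest-numbered cores go to monitoring; that choice is arbitrary.
    """
    cores = sorted(available)
    if reserved < 1 or reserved >= len(cores):
        raise ValueError("need at least one core on each side")
    return set(cores[-reserved:]), set(cores[:-reserved])


def is_balanced(app_cores):
    """True when the application is left with an even number of cores."""
    return len(app_cores) % 2 == 0


def pin_self(cores):
    """Pin the calling process to the given cores (Linux only).

    Pid 0 means "this process"; a collector would call this at startup.
    """
    os.sched_setaffinity(0, cores)
```

On an 8-core node, split_cores(range(8)) leaves the application with 7 cores,
which is exactly the "unbalanced" case. Reserving 2 gets back to an even
count at the cost of another core.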

So now tweak, wiggle, poke, and cuss until each application runs as fast as
possible. Then compare the best settings between the different
applications and try to keep as many as possible while down-tuning some parts
to get a best fit.
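
One way to put a number on "best fit" is to pick the shared setting whose
worst-case slowdown, relative to each application's own best run, is smallest.
That minimax rule is my assumption for the sketch, not the only reasonable
choice, and the function names are made up.

```python
def slowdown(runtimes, setting):
    """Runtime under `setting` relative to this application's own best runtime."""
    return runtimes[setting] / min(runtimes.values())


def best_fit(results):
    """Pick the shared setting with the smallest worst-case slowdown.

    results maps application name -> {setting: runtime_seconds}.
    Only settings measured for every application are considered.
    """
    shared = set.intersection(*(set(r) for r in results.values()))
    return min(shared, key=lambda s: max(slowdown(r, s) for r in results.values()))
```

With invented numbers such as
{"app1": {"a": 100, "b": 110, "c": 150}, "app2": {"a": 90, "b": 62, "c": 60}},
neither application's favorite wins; "b" does, because nobody runs more
than 10% slower than their own best.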

Sometimes it's possible to do param changes as part of the job startup.
Some params require a reboot.
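
For the params that can change at runtime, a job-startup hook can save the
old values, apply the new ones, and hand the old ones back for cleanup. The
sketch below assumes sysctl-style settings exposed under /proc/sys and root
privileges. The function names are hypothetical, and the simple split on dots
does not handle names whose components themselves contain dots. Anything set
on the kernel command line is out of scope here, which is the reboot case.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as vm.swappiness to its file path."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a parameter as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def write_sysctl(name, value, root="/proc/sys"):
    """Set a runtime-tunable parameter. Needs root on a real system."""
    sysctl_path(name, root).write_text(f"{value}\n")


def apply_for_job(settings, root="/proc/sys"):
    """Apply settings; return the previous values so they can be restored."""
    previous = {name: read_sysctl(name, root) for name in settings}
    for name, value in settings.items():
        write_sysctl(name, value, root)
    return previous
```

A prolog would call apply_for_job({"vm.swappiness": 10}) and stash the return
value; the epilog would pass that same dict back to apply_for_job to restore
the node for the next job.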

Practice juggling and brush up on the diplomacy skills.

On Tue, Jun 4, 2024, 7:43 AM Leam Hall via Ale <ale at ale.org> wrote:

> Jim,
>
> Can you talk a little about how to learn performance tuning and scaling
> data collection? Most of my work over the years has just been getting
> things to work, or fixing things after the developers did a "nothing major"
> change. Performance and throughput observation and tuning are important,
> but I've not really done a lot of it.
>
> Is this an area where kernel parameters are used a lot? Are there kernel
> re-compiles? Is this a job space where dusting off my C would be useful?
> Are other programming languages heavily used?
>
> Thanks!
>
> Leam
>
> On 6/3/24 19:47, Jim Kinney wrote:
> > To make it work there are tools that multiply a command across many nodes.
> > Nodes are often PXE booted from a common point, then get an IP and name
> > assigned.
> > Depending on maker and tools, the systems are usually batch processors with
> > a manager like Slurm (or PBS if life hands out lemons 😞).
> >
> > Monitoring tools are numerous. Some old ones still work. Some new ones have
> > problems scaling.
> >
> > The hard part is figuring out performance tuning and scaling data
> > collection to not overrun system usage.
> >
> > On Mon, Jun 3, 2024, 4:05 PM Leam Hall via Ale <ale at ale.org> wrote:
> >
> >> For those of you who know, what's different about supporting Linux on
> >> supercomputers?
> >>
> >> Thanks!
> >>
> >> Leam
>
> --
> DevSecOps Engineer         (reuel.net/resume)
> Scribe: The Domici War     (domiciwar.net)
> General Ne'er-do-well      (github.com/LeamHall)