[ale] System Load Summary Script?
Jeff Hubbs
jhubbslist at att.net
Wed Jun 26 15:29:47 EDT 2019
On 6/26/19 1:58 PM, Todor Fassl via Ale wrote:
> Right, but that is my point. If I run uptime and I see the load on a
> system is high, I still have to manually figure out if it is cpu
> bound, memory bound, or disk IO bound, or network IO bound. If you
> google for tutorials on diagnosing load problems, they all say
> something like "First run top and look at column 10. Then run iotop
> and look at column 23. Then run netstat and ..." I don't think I
> should have to do that in 2019.
Maybe just go to lunch?
I'm only half-joking. Well, not even half.
At A Previous Employer (tm) the network operations group forced the
issue of running Nagios to monitor everything. I complied and put a
Nagios client on the Gentoo Linux file server I'd designed, built, and
managed for the entire company's use. Every night this machine made
Nagios absolutely explode with warnings. Of course it would, I told
them: it's running mksquashfs on all the Samba share volumes to make
backups, and in so doing it lights up every core in the box because the
RAID1+0 is insanely fast on reads and it's writing to a completely
different set of spindles on a completely different controller.
Moreover, it would do the same thing whenever ClamAV ran because ClamAV
was nicely multithreaded and would read at over 200MiB/s. It was
expected, normal, and intended. The "problem," plainly speaking, was
Nagios.
The point of this graybeard parable is that machines turning into
hairdryers is not a bad thing on its face. It's different if, e.g., a) a
machine can't complete something in the time it has per line-of-business
requirements, b) you're limited on electrical or cooling plant power, or
c) your computers are doing something with no utility or value. Otherwise,
just let the things glow red and go to lunch.
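
That said, if you really do want the one-shot "what is this box bound on" answer from the quoted question, here is a rough sketch of one way to get it. It assumes a Linux kernel new enough (4.20 or later) to expose pressure stall information under /proc/pressure, and that PSI is enabled. The function names and the 10% threshold are mine and purely illustrative, and PSI covers only CPU, memory, and disk IO, so network IO would still need a separate check.

```python
#!/usr/bin/env python3
"""Rough load summary from Linux pressure stall information (PSI)."""
from pathlib import Path

# Assumption: PSI is available here (Linux 4.20+, built and booted with PSI on).
PSI_DIR = Path("/proc/pressure")


def parse_psi(text):
    """Parse PSI text into e.g. {"some": {"avg10": 1.5, ...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def summarize(pressures, threshold=10.0):
    """List (resource, percent) pairs at or above threshold, worst first.

    The percentage is the "some" avg10 figure: the share of the last ten
    seconds in which at least one task was stalled on that resource.
    The default threshold is an arbitrary illustrative value.
    """
    hot = []
    for resource, data in pressures.items():
        pct = data.get("some", {}).get("avg10", 0.0)
        if pct >= threshold:
            hot.append((resource, pct))
    return sorted(hot, key=lambda item: item[1], reverse=True)


def read_pressures(psi_dir=PSI_DIR):
    """Read whichever of the cpu, memory, and io PSI files are readable."""
    pressures = {}
    for resource in ("cpu", "memory", "io"):
        try:
            pressures[resource] = parse_psi((psi_dir / resource).read_text())
        except OSError:
            # File missing, or PSI compiled in but disabled at boot.
            continue
    return pressures


if __name__ == "__main__":
    hot = summarize(read_pressures())
    if not hot:
        print("No resource is under notable pressure.")
    for resource, pct in hot:
        print(f"{resource}: tasks stalled {pct:.1f}% of the last 10 seconds")
```

Note that this reports stalls, not utilization: a box with every core pegged but nothing waiting shows little pressure, which fits the point above that busy is not the same as broken.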