[ale] System Load Summary Script?

Wed Jun 26 15:29:47 EDT 2019

On 6/26/19 1:58 PM, Todor Fassl via Ale wrote:
> Right, but that is my point. If I run uptime and I see the load on a 
> system is high, I still have to manually figure out if it is cpu 
> bound, memory bound, or disk IO bound, or network IO bound. If you 
> google for tutorials on diagnosing load problems, they all say 
> something like "First run top and look at column 10. Then run iotop 
> and look at column 23. Then run netstat and ..." I don't think I 
> should have to do that in 2019.

Maybe just go to lunch?

I'm only half-joking. Well, not even half.

At A Previous Employer (tm) the network operations group forced the 
issue of running Nagios to monitor everything. I complied and put a 
Nagios client on the Gentoo Linux file server I'd designed, built, and 
managed for the entire company's use. Every night this machine made 
Nagios absolutely explode with warnings. Of course it would, I told 
them: it's running mksquashfs on all the Samba share volumes to make 
backups, and it lights up every core in the box in so doing, because 
the RAID1+0 is insanely fast at reads and it's writing to a completely 
different set of spindles on a completely different controller. 
Moreover, it would do the same thing whenever ClamAV ran because ClamAV 
was nicely multithreaded and would read at over 200MiB/s. It was 
expected, normal, and intended. The "problem," plainly speaking, was 
Nagios.

The point of this graybeard parable is that machines turning into 
hairdryers is not a bad thing on its face. It's different if, e.g., 
a) it can't complete something in the time it has per line-of-business 
requirements, b) you're limited on electrical or cooling plant power, 
or c) your computers are doing something with no utility or value. 
Otherwise, just let the things glow red and go to lunch.
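
For the cases where it does matter, the manual top/iotop walk-through 
from the quoted message can be roughed out in a few lines. What follows 
is a minimal sketch, not a finished tool: it assumes Linux, samples the 
aggregate "cpu" line of /proc/stat twice, and guesses between CPU-bound 
and I/O-wait-bound. The function names and both thresholds are made up 
for illustration, and memory and network pressure are not covered at all.

```python
#!/usr/bin/env python3
"""Rough one-shot load summary. A sketch only; assumes Linux /proc."""
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Pad with zeros in case a kernel reports fewer columns.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def guess_bottleneck(shares, busy_threshold=50.0, iowait_threshold=20.0):
    """Crude classification; both thresholds are arbitrary assumptions."""
    if shares["iowait"] >= iowait_threshold:
        return "io"
    if shares["user"] + shares["nice"] + shares["system"] >= busy_threshold:
        return "cpu"
    return "idle"


def read_file(path):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    first = parse_cpu_line(read_file("/proc/stat"))
    time.sleep(1.0)  # sampling interval, chosen arbitrarily
    second = parse_cpu_line(read_file("/proc/stat"))
    shares = cpu_shares(first, second)
    print("loadavg:", read_file("/proc/loadavg").strip())
    for name in FIELDS:
        print(f"{name:>8}: {shares[name]:5.1f}%")
    print("guess:", guess_bottleneck(shares))
```

Even then, a "cpu" or "io" verdict only says where the time went, not 
whether that is a problem; that still comes down to a), b), and c) above.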