Shell script to check cpu load on many servers

umi Member
edited June 2020 in Tutorials

Pardon my noobness: what are the tools to prevent runaway processes from eating 100% CPU, and how can I do that without running something constantly in the target server's memory?

Here is a shell script that, if put into cron, will periodically check servers via ssh:

#!/bin/bash
#to crontab for checks every 15 mins:
#*/15 * * * * /path/to/this/script
#must be run once to suppress the ssh login banner:
#touch ~/.hushlogin
#put your servers' IPs here
declare -a ips=("10.0.0.1" "10.0.0.2" "10.0.0.3" "10.0.0.4" "10.0.0.5")

for i in "${ips[@]}"
do
    #grab the load averages; the result can be parsed and some alerts issued or action taken
    result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
    dt=$(date +'%Y-%m-%d %H:%M')
    #append a timestamped line to this server's log file
    echo "$dt $result" >> /user/name/cpumon/"$i"
done
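
As the comment in the loop says, the result can be parsed and alerts raised. A minimal sketch of that, meant to sit inside the loop right after the ssh call (the threshold of 4 is an arbitrary example; anything printed to stdout gets mailed to the crontab owner by cron):

#alert if the 1-minute load average is above the threshold
threshold=4
load1=$(echo "$result" | awk '{print $1}')
if awk -v l="$load1" -v t="$threshold" 'BEGIN {exit !(l+0 > t)}'; then
    echo "ALERT: $i load is $load1 at $dt"
fi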

Comments

  • @umi said: Pardon my noobness: what are the tools to prevent runaway processes from eating 100% CPU, and how can I do that without running something constantly in the server's memory?

    You don't. Daemons cost resources.

    The challenge with your script is that when everything grinds to a halt, it only takes one server hanging its SSH connection for all your monitoring to be stuck on that one bad node.

    The best architectures run lightweight metric exporters on each node and a separate system to gather the metrics, analyse, alert, graph, etc. (e.g. Prometheus, Grafana).

    Thanked by: umi, raindog308, Pwner
  • umi Member
    edited June 2020

    I.e. some kind of tiny kernel module that sends alerting UDP packets to the monitoring servers when things are about to go south? It looks like something has to be running on the server, so the idea is to make it as light as possible.

  • raindog308 Administrator, Veteran

    @danielhm said: The challenge with your script is that when everything grinds to a halt, it only takes one server hanging its SSH connection for all your monitoring to be stuck on that one bad node.

    Yeah, you need to have something running "above" the ssh client that has a timeout alarm. When things reach that level of complexity, I'd rather work in perl or python, but it can be done in shell:

    https://www.cyberciti.biz/faq/shell-scripting-run-command-under-alarmclock/
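
    A shell-only sketch of that idea, replacing the ssh line inside the loop: the coreutils timeout command plus ssh's own connection options keep a wedged node from stalling the whole run (the 20s/5s values are just examples):

    #give up on the whole check after 20 seconds and on the connection attempt after 5
    result=$(timeout 20 ssh -o ConnectTimeout=5 -o BatchMode=yes -T "$i" "cat /proc/loadavg" 2>&1)
    if [ $? -eq 124 ]; then
        #GNU timeout exits with status 124 when it had to kill the command
        echo "ALERT: $i did not answer within 20 seconds"
    fi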

    Thanked by: umi
  • rcxb Member

    A process using 100% of the CPU is no big deal. It just gets slowed down a little when something else wants time, and both processes will simply be slow. What really makes a system unresponsive is running out of memory (and a few other, less common things). You can set ulimits to automatically kill a process once it has used a certain amount of CPU time or memory, but usually you put a network/server monitoring system in place that wakes you up when things are getting dangerous, before the box becomes unresponsive.
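
    A rough sketch of the ulimit route (the limits apply to the current shell and everything it starts afterwards; the values and the job name are made-up examples):

    ulimit -t 300        #kill the process after 300 seconds of CPU time
    ulimit -v 1048576    #cap the address space at ~1 GiB (value is in KiB)
    ./possibly-runaway-job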

    Thanked by: umi
  • umi Member
    edited June 2020

    To address memory allocation, I changed the crazy default overcommit sysctl settings to something more meaningful:
    vm.swappiness = 5
    kernel.panic = 5
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 120
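
    To make settings like these survive a reboot, one common way is to drop them into a file under /etc/sysctl.d/ and reload (run as root; the file name here is arbitrary):

    printf '%s\n' 'vm.swappiness = 5' 'kernel.panic = 5' \
        'vm.overcommit_memory = 2' 'vm.overcommit_ratio = 120' \
        > /etc/sysctl.d/90-overcommit.conf
    sysctl --system    #re-reads /etc/sysctl.d/*.conf and applies the values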

    So far I am sticking with the idea of a tiny kernel module that monitors load and sends UDP packets to the control centre when the situation goes out of the normal range. That's because the kernel keeps running a little longer even when all of userspace is dead, and that is enough time to send the telemetry needed to remove this node from production. In addition, the node will be pulled from production if it misses 2 consecutive "I'm alive" packets.
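
    As a userspace stand-in for the same heartbeat idea (no kernel module; just a sketch assuming a hypothetical collector listening on 10.0.0.100:9999), bash can write UDP datagrams directly via /dev/udp, and the collector still catches a dead node because the packets simply stop arriving:

    #send a tiny "I'm alive" datagram with the current load averages every 10 seconds
    while sleep 10; do
        echo "$(hostname) $(cat /proc/loadavg)" > /dev/udp/10.0.0.100/9999
    done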

  • Daniel15 Veteran
    edited June 2020

    to crontab for checks every 15 mins

    Only collecting once every 15 minutes is not going to give you sufficient granularity for debugging most issues.

    I'd recommend using Netdata, which collects data once per second, including per-process CPU and memory usage. You can aggregate multiple servers using Netdata Cloud or Prometheus (Prometheus is useful if you want to run queries across all the data).

    Thanked by: umi, vimalware
  • umi Member
    edited June 2020

    Netdata is shiny! The installation and compilation process is a masterpiece, but it has committed 380 MB of memory and is using 50 MB, which is a lot for a 500 MB RAM VPS. I'll put it on a VPS with 1.5 GB RAM to study it more.

  • LittleCreek Member, Patron Provider

    @danielhm said:
    The challenge with your script is that when everything grinds to a halt, it only takes one server hanging its SSH connection for all your monitoring to be stuck on that one bad node.

    I don't know much about shell, but I did something similar in perl just to check http. It forks a process for every check, so no single server can hold up the rest of the script from checking the other servers. That is also a possibility.

  • umi Member

    Same here: one can append "&" to a command to run it as a background job and continue with the script at the same time.
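
    Applied to the script above, that could look something like this sketch: each check runs as a background job and wait blocks until all of them have finished, so a slow server no longer delays the others (a hung ssh still needs a timeout, as discussed earlier):

    for i in "${ips[@]}"
    do
        (
            result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
            echo "$(date +'%Y-%m-%d %H:%M') $result" >> /user/name/cpumon/"$i"
        ) &
    done
    #wait for every background check before the script (and the cron job) exits
    wait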

  • LittleCreek Member, Patron Provider

    @umi said:
    Same here: one can append "&" to a command to run it as a background job and continue with the script at the same time.

    Oh yeah. I forgot about &. I just know perl so much more.

  • umi Member
    edited June 2020

    External ssh monitoring is okay for infrequent tasks, like checking something once every 15 minutes. For more frequent monitoring it is quite clumsy, unless one keeps an ssh tunnel or a persistent connection open to avoid a handshake on every check. A light agent running inside the monitored system is preferable in that case.
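
    One way to reuse a single SSH connection for frequent checks is connection multiplexing; a sketch of the relevant options (they can also live in ~/.ssh/config, and the ControlPath name is arbitrary):

    #the first call opens a master connection and keeps it alive for 10 minutes;
    #later calls reuse it and skip the handshake entirely
    ssh -o ControlMaster=auto \
        -o ControlPath=~/.ssh/cm-%r@%h:%p \
        -o ControlPersist=10m \
        -T "$i" "cat /proc/loadavg"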

  • NanoG6 Member

    Why not use hetrixtools?

    Thanked by: umi
  • umi Member
    edited June 2020

    Wow! Looks nice. UptimeRobot is also nice, but it checks less frequently ;) UptimeRobot reveals that the ping to my server in San Jose is 40 ms at night and 49 ms during the daytime, when traffic load is higher. This is common behaviour, but in Asia the day/night difference can be twice as bad.

  • NanoG6 Member
    edited June 2020

    Unless I'm missing something, hetrixtools is all you need :)

    Thanked by: umi
  • CConner Member, Host Rep

    What you can do is assign the processes you suspect might eat up your CPU to a cgroup. This will scale the amount of CPU time they get down to a set level when CPU time is contested between processes.
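
    On a systemd distro, the quickest way to get that kind of cgroup limit is a transient unit; a sketch (the quota, the weight and the job name are just examples):

    #hard-cap the job at half of one CPU core
    systemd-run --scope -p CPUQuota=50% ./possibly-runaway-job
    #or only deprioritise it when CPU time is contested, as described above
    systemd-run --scope -p CPUWeight=10 ./possibly-runaway-job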

    Thanked by: umi