Shell script to check cpu load on many servers

umiumi Member
edited June 13 in Tutorials

Pardon my noobness: what are the tools to prevent runaway processes from eating 100% CPU, and how do I do that without running something constantly in the target server's memory?

Here is a shell script that, if put into cron, will periodically check servers via SSH:

#!/bin/bash
# crontab entry for checks every 15 minutes:
# */15 * * * * /path/to/this/script
# run this once first to suppress the ssh welcome banner:
# touch ~/.hushlogin
# put your servers' IPs here
declare -a ips=("10.0.0.1" "10.0.0.2" "10.0.0.3" "10.0.0.4" "10.0.0.5")

for i in "${ips[@]}"
do
    result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
    # the result can be parsed and alerts issued or action taken
    dt=$(date +'%Y-%m-%d %H:%M')
    # append it to the per-host log file
    echo "$dt $result" >> "/user/name/cpumon/$i"
done
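The "result can be parsed" step above can be sketched like this; the function name and the ALERT/OK wording are illustrative, not part of the original script:

```shell
# Pull the 1-minute load out of a /proc/loadavg line and compare it to a
# threshold. The function name and messages are hypothetical examples.
check_load() {
    # $1 = one /proc/loadavg line, $2 = 1-minute load threshold
    local load1
    load1=$(echo "$1" | awk '{print $1}')
    # awk does the floating-point comparison that plain bash cannot
    if awk -v l="$load1" -v t="$2" 'BEGIN { exit !(l > t) }'; then
        echo "ALERT load=$load1"
    else
        echo "OK load=$load1"
    fi
}

check_load "4.20 1.00 0.50 1/123 4567" 2.0   # prints "ALERT load=4.20"
```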

Comments

  • danielhmdanielhm Member

    Pardon my noobness: what are the tools to prevent runaway processes from eating 100% CPU, and how do I do that without running something constantly in the server's memory?

    You don't. Daemons cost resources.

    The challenge with your script is that when everything grinds to a halt, it takes just one server hanging its SSH connection for all your monitoring to be stuck on that one bad node.

    The best architectures run lightweight metric exporters plus a separate system to gather the metrics, analyse, alert, graph, etc. (e.g. Prometheus, Grafana).

    Thanked by 3: umi, raindog308, Pwner
  • umiumi Member
    edited June 13

    i.e. some kind of tiny kernel module that sends alerting UDP packets to the monitoring servers when things are about to go south? It looks like something must always be running on the server, so the idea is to make it as light as possible.

  • raindog308raindog308 Moderator

    @danielhm said: The challenge with your script is that when everything grinds to a halt, it takes just one server hanging its SSH connection for all your monitoring to be stuck on that one bad node.

    Yeah, you need to have something running "above" the ssh client that has a timeout alarm. When things reach that level of complexity, I'd rather work in perl or python, but it can be done in shell:

    https://www.cyberciti.biz/faq/shell-scripting-run-command-under-alarmclock/
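    The alarm-clock idea can also be done in shell with coreutils `timeout` around each probe, so one hung node cannot stall the loop; the 10-second budget and the function name are assumptions, not from the thread:

```shell
# Bound each ssh probe with a hard deadline; this is a sketch, not the
# original script. 124 is the exit code `timeout` uses when it fires.
probe() {
    local host=$1 out
    out=$(timeout 10 ssh -o ConnectTimeout=5 -T "$host" "cat /proc/loadavg" 2>&1)
    if [ $? -eq 124 ]; then      # 124 = deadline expired, ssh was killed
        out="TIMEOUT after 10s"
    fi
    echo "$out"
}
```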

    Thanked by 1: umi


  • rcxbrcxb Member

    A process using 100% of CPU is no big deal. It will just get slowed down a bit when something else wants time, and both processes will simply run slower. What really makes a system unresponsive is running out of memory (and a few less common things). You can set ulimits to automatically kill a process after it uses a certain amount of CPU or memory, but usually you put a network/server monitoring system in place that wakes you up when things are getting dangerous, before the box becomes unresponsive.
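    The ulimit approach mentioned above can be sketched as follows; the limit values are illustrative, not recommendations:

```shell
#!/bin/bash
# Cap resources for whatever this shell launches next (values illustrative).
ulimit -t 60        # deliver SIGXCPU after 60 seconds of CPU time
ulimit -v 524288    # cap virtual memory at 512 MB (units are KB)
# every command started from this shell now inherits both limits, e.g.:
# ./some-suspect-job    # hypothetical workload, runs under both caps
```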

    Thanked by 1: umi
  • umiumi Member
    edited June 14

    To address memory allocation, I changed the default (overly permissive) overcommit sysctl settings to something more meaningful:
    vm.swappiness = 5
    kernel.panic = 5
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 120
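    To make these settings survive a reboot, they can go into a sysctl.d drop-in; the filename below is an assumption (any name under `/etc/sysctl.d/` works), the values are the ones above:

```shell
# as root; write the settings to a sysctl.d drop-in and reload
cat > /etc/sysctl.d/90-overcommit.conf <<'EOF'
vm.swappiness = 5
kernel.panic = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 120
EOF
sysctl --system    # reload every sysctl configuration file
```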

    So far I am sticking with the idea of a tiny kernel module that monitors load and sends UDP packets to the control center when the situation is abnormal. That's because the kernel will keep running a bit longer even if all of userspace is dead, and that will be enough to send the telemetry needed to remove this node from production. In addition, the node will be pulled from production if it misses 2 consecutive "I'm alive" packets.

  • Daniel15Daniel15 Member
    edited June 14

    to crontab for checks every 15 mins

    Only collecting once every 15 minutes is not going to give you sufficient granularity for debugging most issues.

    I'd recommend using Netdata, which collects data once per second, including per-process CPU and memory usage. You can aggregate multiple servers using Netdata Cloud or Prometheus (Prometheus is useful if you want to run queries across all the data).

    Thanked by 1: umi
  • umiumi Member
    edited June 14

    Netdata is shiny! Installation & compilation are a masterpiece, but it has committed 380 MB of memory and uses 50 MB, which is a lot for a 500 MB RAM VPS. I'll put it on a VPS with 1.5 GB RAM to study it more.

  • LittleCreekLittleCreek Member, Provider

    @danielhm said:
    The challenge with your script is that when everything grinds to a halt, it takes just one server hanging its SSH connection for all your monitoring to be stuck on that one bad node.

    I don't know much about shell, but I did something similar in Perl just to check HTTP. It forks every check so that no single server can hold up the rest of the script from checking the other servers, so that is also a possibility.

    Floyd Morrissette - DirectAdmin Expert
    LittleCreekHosting.com - VPS specials

  • umiumi Member

    Same here: one can add "&" to a command to run it as a parallel task while the script continues.
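    The "&" approach can be sketched as below; the function name is illustrative, and the log path follows the script at the top of the thread:

```shell
#!/bin/bash
# Run every probe as a background job; `wait` collects them all, so a
# slow host delays only its own job, not the whole loop.
probe_all() {
    local i
    for i in "$@"; do
        {
            result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
            echo "$(date +'%Y-%m-%d %H:%M') $result" >> "/user/name/cpumon/$i"
        } &
    done
    wait    # block until every background probe has finished
}
# usage: probe_all "10.0.0.1" "10.0.0.2"
```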

  • LittleCreekLittleCreek Member, Provider

    @umi said:
    Same here: one can add "&" to a command to run it as a parallel task while the script continues.

    Oh yeah. I forgot about &. I just know perl so much more.


  • umiumi Member
    edited June 14

    Outer SSH monitoring is fine for infrequent tasks, like checking something once every 15 minutes. For more frequent monitoring it is quite clumsy, unless one keeps an SSH tunnel open to avoid repeated handshakes. A light agent running inside the monitored system is preferable in that case.
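    One way to avoid the repeated handshakes is OpenSSH connection multiplexing: the first ssh per host opens a master connection and later probes reuse its socket. The directives are standard ssh_config options; the 600-second persistence window and host pattern are assumptions:

```shell
# Append multiplexing options to ~/.ssh/config (shown as a heredoc).
cat >> ~/.ssh/config <<'EOF'
Host 10.0.0.*
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 600
EOF
```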

  • NanoG6NanoG6 Member

    Why not use HetrixTools?

    Thanked by 1: umi
  • umiumi Member
    edited June 15

    Wow! Looks nice. UptimeRobot is also nice but checks less frequently ;) UptimeRobot reveals that ping to the server in San Jose is 40 ms at night and 49 ms during the daytime when traffic load is higher. This is common behaviour, but in Asia the day/night difference can be twice as bad.

  • NanoG6NanoG6 Member
    edited June 15

    Unless I'm missing something, HetrixTools is all you need :)

    Thanked by 1: umi
  • CConnerCConner Member, Provider

    What you can do is assign the processes you suspect might eat up your CPU to a cgroup. This will scale down the amount of CPU time they get to a set level when CPU time is contested between processes.
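    On a systemd host the cgroup assignment can be done with a transient unit; the 50% quota and the command name below are illustrative, not from the thread:

```shell
# Run a command inside its own transient cgroup with a CPU ceiling.
# When CPU is contested, the scope is throttled to half of one core.
systemd-run --scope -p CPUQuota=50% some-suspect-job   # hypothetical job
# Plain cgroup v2 equivalent (as root, standard mount point):
#   mkdir /sys/fs/cgroup/capped
#   echo "50000 100000" > /sys/fs/cgroup/capped/cpu.max   # 50ms per 100ms
#   echo "$PID" > /sys/fs/cgroup/capped/cgroup.procs
```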

    Thanked by 1: umi

    GameDash, an AIO solution uniting billing, support & game server management platform.
    Visit our website or join our Discord to find out more.
