Software for self-hosted distributed monitoring system?

NeoXiD · August 2015

Hey guys!

I'm looking for suggestions of products which I could use to provide my own self-hosted distributed monitoring system, spread across several LEBs. In terms of requirements, this would mean:

1. Ping/TCP/HTTP/SMTP/... checks
2. Graphs/statistics from ALL locations/check servers
3. Settings like "Send alarm if at least 3 servers/locations detect a failure"
4. (Additional goodie: some kind of poller, SNMP for example, to monitor CPU/memory/disk...)

I've set up Icinga2 already (and a few other monitoring tools as well), only to discover that just one location executes each check => requirement 2 not fulfilled. To make sure that everyone understands what I want, here's an example:

28th August 2015, 21:00
Ping 8.8.8.8: DE: Online (18ms), FR: Online (22ms), US1: Offline (- ms) (=> no alarm)
HTTP google.com: DE: Offline, FR: Offline, US1: Offline (=> alarm)

28th August 2015, 21:01
Ping 8.8.8.8: DE: Online (18ms), FR: Online (22ms), US1: Online (82 ms)
HTTP google.com: DE: Online (200), FR: Online (200), US1: Online (200) (=> mark as solved)
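
The alarm rule in the example above can be sketched in a few lines of Python. This is only an illustration of the "at least 3 locations detect a failure" setting, not the behaviour of any particular monitoring product; the names `Result` and `should_alarm` and the `min_failures` parameter are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One check result as reported by a single location (hypothetical type)."""
    location: str
    online: bool


def should_alarm(results, min_failures=3):
    """Return True when at least `min_failures` locations report a failure."""
    failures = sum(1 for r in results if not r.online)
    return failures >= min_failures


# The 21:00 results from the example above.
ping = [Result("DE", True), Result("FR", True), Result("US1", False)]
http = [Result("DE", False), Result("FR", False), Result("US1", False)]

print(should_alarm(ping))  # False: only one location failed, so no alarm
print(should_alarm(http))  # True: three locations failed, so alarm
```

Keeping the per-location results around, instead of collapsing them into a single state, is what would also make per-location graphs (requirement 2) possible.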

Does any kind of free software exist for that purpose? So far I've found commercial software which would require me to pay a lot of $$$ or software which doesn't execute the check from all locations. I heard something about Nagios, but don't have any experience with that and heard that it's a PITA, so I would be glad about any kind of further information.

Thanks in advance for every reply and enjoy your weekend.

Best regards, NeoXiD

pavs · August 2015

Look at nagios and smokeping.

Nagios does email alerts, smokeping will give you a history of latency and outages.

cassa · August 2015

LibreNMS
http://docs.librenms.org/Extensions/Distributed-Poller/

Jun · August 2015

I used Nagios + Cacti for 2 years. It works very well for my needs.
I'm considering switching, though.
LibreNMS is my weekend project for this week.

perennate · August 2015

pybearmon should satisfy your first and third goals. It implements basic checks, and on every "check interval" (configured from database entry) it will execute the check on one server. If a server finds the check in a state that does not match previous state (database says online but check fails, or database says offline but check succeeds), then the system will have N-1 additional servers confirm the state change (where N is in config.py) and send an alert (email or Twilio SMS/voice or HTTP target).
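
A rough Python sketch of the confirmation step described above. It is not taken from pybearmon's source; the function name `confirm_state_change`, its parameters, and the choice to require that all N-1 confirming servers agree are assumptions made for illustration.

```python
def confirm_state_change(previous_online, first_result, confirmers, n):
    """Decide whether a detected state change should be accepted.

    previous_online: state currently stored (True = online).
    first_result:    result of the check on the server that ran it first.
    confirmers:      callables that re-run the check on other servers and
                     return True (online) or False (offline). Hypothetical.
    n:               total number of servers that must agree, so n - 1
                     additional confirmations are needed.

    Returns (new_state, send_alert).
    """
    if first_result == previous_online:
        # Nothing changed, so no confirmation and no alert.
        return previous_online, False

    needed = n - 1
    if len(confirmers) < needed:
        # Not enough servers to confirm; keep the old state (assumption).
        return previous_online, False

    for run_check in confirmers[:needed]:
        if run_check() != first_result:
            # One server disagrees: treat the change as unconfirmed.
            return previous_online, False

    # All n servers agree on the new state.
    return first_result, True


# Stored state says online, the first server sees a failure, two others confirm.
print(confirm_state_change(True, False, [lambda: False, lambda: False], n=3))
```

With `n=3` this matches the "3 servers detect a failure" requirement, but only for state changes: the extra checks run when the first server sees a mismatch, not on every interval.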

The "bear" in pybearmon is for "bare", since the goal of the system is to remain as simple and minimal as possible. This means there are no fancy graphs. There is however a table check_events that stores the timestamp and type of state changes (times when check goes online/offline after the confirmations), from which graphs can be generated.

NeoXiD · August 2015

@cassa said: LibreNMS

We're using that at work as an additional tool to create a few graphs for our 10-40Gbps backbone switches, but I don't think it is really suitable for my needs. Seems more focussed on monitoring actual network equipment.

@pavs said: Nagios and SmokePing

Do you have any kind of setup like that up and running? Any recommended docs/howtos to get that working? I've taken a sneak peek at Nagios so far and it seemed really complicated to get into. Also, does Nagios support alerts like "if this check fails from 3 out of 7 locations, send an alert, otherwise just display a warning"?

@Jun said: Nagios and Cacti

I don't really like Cacti, but it seems like I should definitely look into Nagios. It's just a pity that none of those solutions has such a nice user experience as Icinga2 Web.

@perennate said: pybearmon

Looks interesting, thanks for the hint. Based on the functionality you described, it does all the things I'd need, and a few things like graphs could easily be added by myself. Will give it a try. Also, it doesn't seem to be that complicated, so I could probably create something similar based on that in Rust (I always prefer compiled stuff). Nice work if it was done by you (I just assume so based on your signature).

patrick7 · August 2015

Nagios/Icinga (v1) looks more complicated than it is. Smokeping is very easy to set up: https://www.howtoforge.com/monitoring_network_latency_smokeping_debian_etch

mpkossen · August 2015

Icinga 2 should be able to do this.

NeoXiD · August 2015

@mpkossen said:
Icinga 2 should be able to do this.

I've set up Icinga2 and I was able to set up distributed monitoring; however, only one satellite/cluster node ever reported back whether the system was online or not. I searched forever on the internet for a way to change that, but it seems like Icinga 2 is hardcoded to always show only one result? I'd rather have an overview where I can clearly see which locations failed and/or succeeded, for each test.

mpkossen · August 2015

@NeoXiD said:
I've set up Icinga2 and I was able to set up distributed monitoring; however, only one satellite/cluster node ever reported back whether the system was online or not. I searched forever on the internet for a way to change that, but it seems like Icinga 2 is hardcoded to always show only one result? I'd rather have an overview where I can clearly see which locations failed and/or succeeded, for each test.

I think this depends on the way the satellite is set up. If it's one with a local configuration it should be able to report to multiple masters.

NeoXiD · August 2015

@mpkossen said:

Ah, I think I get what you mean now. Your solution would require multiple masters, right? I'd prefer to have one user interface showing the results of all locations for each check. Basically it should be somewhat similar to solutions like NodePing, just self-hosted instead.

mpkossen · August 2015

@NeoXiD said:

I think you can have one Icinga Web for multiple masters, but I'm not sure.

perennate · August 2015

@NeoXiD said: Also, it doesn't seem to be that complicated, so I could probably create something similar based on that in Rust (I always prefer compiled stuff).

Yeah, all communication is channelled through MySQL, and it assumes no one will need to hold a lock for more than a minute, which makes things pretty simple. Here's a new one I've been working on in Go. It solves some issues with the old one (failing instead of reconnecting when the database goes away, and sending false positives if the database and worker are on the same server). It is also intended to handle a much larger volume of checks (state is stored in memory instead of in MySQL, with fault tolerance achieved via a view server): https://github.com/LunaNode/gobearmon
