Software for self-hosted distributed monitoring system?
Hey guys!
I'm looking for suggestions for software I could use to build my own self-hosted distributed monitoring system, spread across several LEBs. In terms of requirements, this would mean:
- Ping/TCP/HTTP/SMTP/... checks
- Graphs/Statistics from ALL locations/check servers
- Settings like "Send alarm if at least 3 servers/locations detect a failure"
- (additional goodie: some kind of poller, SNMP for example, to monitor CPU/memory/disk...)
I've already set up Icinga2 (and a few other monitoring tools as well), only to discover that just one location executes each check => requirement 2 (graphs/statistics from ALL locations) is not fulfilled. Just to make sure that everyone understands what I want, here's an example:
28th August 2015, 21:00
Ping 8.8.8.8: DE: Online (18ms), FR: Online (22ms), US1: Offline (- ms) (=> no alarm)
HTTP google.com: DE: Offline, FR: Offline, US1: Offline (=> alarm)
28th August 2015, 21:01
Ping 8.8.8.8: DE: Online (18ms), FR: Online (22ms), US1: Online (82 ms)
HTTP google.com: DE: Online (200), FR: Online (200), US1: Online (200) (=> mark as solved)
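The alarm rule from the example above can be sketched in a few lines of Python. This is only an illustration of the "at least N locations must see a failure" idea, not the behaviour of any particular product; ProbeResult, should_alarm and min_failures are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ProbeResult:
    """One check result as reported by one location (hypothetical structure)."""
    location: str
    online: bool
    latency_ms: Optional[float] = None


def should_alarm(results: List[ProbeResult], min_failures: int = 3) -> Tuple[bool, List[str]]:
    """Alarm only if at least `min_failures` locations report the target as offline.

    Returns the alarm decision together with the list of failing locations,
    so an overview can still show which locations failed.
    """
    failed = [r.location for r in results if not r.online]
    return len(failed) >= min_failures, failed


# Results from the 21:00 example in the post
ping = [ProbeResult("DE", True, 18), ProbeResult("FR", True, 22), ProbeResult("US1", False)]
http = [ProbeResult("DE", False), ProbeResult("FR", False), ProbeResult("US1", False)]

print(should_alarm(ping))  # only US1 failed, below the threshold
print(should_alarm(http))  # all three locations failed
```

Since every location's result is kept rather than collapsed into one status, the same data could also feed per-location graphs.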
Does any kind of free software exist for that purpose? So far I've only found commercial software that would cost me a lot of $$$, or software that doesn't execute the check from all locations. I've heard something about Nagios, but I don't have any experience with it and heard that it's a PITA, so I'd be glad for any further information.
Thanks in advance for every reply and enjoy your weekend.
Best regards, NeoXiD
Comments
Look at nagios and smokeping.
Nagios does email alerts, smokeping will give you a history of latency and outages.
LibreNMS
http://docs.librenms.org/Extensions/Distributed-Poller/
I used Nagios + Cacti for 2 years. It works very well for my needs.
I'm considering switching, though.
LibreNMS is my weekend project for this week.
pybearmon should satisfy your first and third goals. It implements basic checks, and on every "check interval" (configured from database entry) it will execute the check on one server. If a server finds the check in a state that does not match previous state (database says online but check fails, or database says offline but check succeeds), then the system will have N-1 additional servers confirm the state change (where N is in config.py) and send an alert (email or Twilio SMS/voice or HTTP target).
The "bear" in pybearmon is for "bare", since the goal of the system is to remain as simple and minimal as possible. This means there are no fancy graphs. There is however a table check_events that stores the timestamp and type of state changes (times when check goes online/offline after the confirmations), from which graphs can be generated.
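The confirmation step described above could look roughly like the sketch below. It is not taken from the pybearmon source: confirm_state_change and its parameters are hypothetical names, and it assumes that all N-1 additional servers must agree before the state change is accepted.

```python
from typing import Callable, Sequence


def confirm_state_change(
    previous_online: bool,
    first_result: bool,
    confirmers: Sequence[Callable[[], bool]],
    n: int,
) -> bool:
    """Return the state to store after one check round (True = online).

    previous_online: state currently recorded in the database
    first_result:    result seen by the server that ran the scheduled check
    confirmers:      callables that re-run the check on other servers
    n:               total number of servers that must agree (N in the post)
    """
    if first_result == previous_online:
        return previous_online  # nothing changed, no confirmation needed

    needed = n - 1
    if len(confirmers) < needed:
        return previous_online  # not enough servers available to confirm

    for run_check in confirmers[:needed]:
        if run_check() != first_result:
            return previous_online  # one server disagrees: treat as a blip

    return first_result  # confirmed; this is where an alert would be sent


# Database says online, the first server sees a failure, N = 3
new_state = confirm_state_change(True, False, [lambda: False, lambda: False], 3)
print(new_state)
```

A disagreeing confirmer simply leaves the stored state unchanged, which is what keeps a single flaky location from triggering an alert.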
We're using that at work as an additional tool to create a few graphs for our 10-40Gbps backbone switches, but I don't think it is really suitable for my needs. Seems more focussed on monitoring actual network equipment.
Do you have any kind of setup like that up and running? Any recommended docs/howtos to get that working? I've taken a quick look at Nagios so far, and it seemed really complicated to get into. Also, does Nagios support alerts like "if this check fails from 3 out of 7 locations, send an alert, otherwise just display a warning"?
I don't really like Cacti, but it seems like I should definitely look into Nagios. It's just a pity that none of those solutions has as nice a user experience as Icinga2 Web.
Looks interesting, thanks for the hint. Based on the functionality you described, it does everything I'd need, and a few things like graphs I could easily add myself. Will give it a try. Also, it doesn't seem to be that complicated, so I could probably create something similar in Rust based on it (I always prefer compiled stuff). Nice work if it was done by you (I just assume so based on your signature).
Nagios/Icinga (v1) looks more complicated than it is. Smokeping is very easy to set up: https://www.howtoforge.com/monitoring_network_latency_smokeping_debian_etch
Icinga 2 should be able to do this.
I've set up Icinga2 and was able to set up distributed monitoring, but only one satellite/cluster node ever reported back whether the system was online or not. I searched the internet forever for a way to change that, but it seems like Icinga 2 is hardcoded to always show only one result? I'd rather have an overview where I can clearly see which locations failed and/or succeeded, for each test.
I think this depends on the way the satellite is set up. If it's one with a local configuration it should be able to report to multiple masters.
Ah, I think I get what you mean now. Your solution would require multiple masters, right? I'd prefer to have one user interface showing the result of all locations for each check. Basically it should be somewhat similar to solutions like NodePing, just self-hosted instead.
I think you can have one Icinga Web for multiple masters, but I'm not sure.
Yeah, all communication is channelled through MySQL, and it assumes no one will need to hold a lock for more than a minute, which makes things pretty simple. Here's a new one I've been working on in Go. It solves some issues with the old one (failing instead of reconnecting when the database goes away, and sending false positives if the database and worker are on the same server) and is also intended to handle a much larger volume of checks (state is stored in-memory instead of in MySQL, with fault tolerance achieved via a view server): https://github.com/LunaNode/gobearmon