Software for self-hosted distributed monitoring system?

NeoXiD Member

Hey guys!

I'm looking for suggestions of products which I could use to provide my own self-hosted distributed monitoring system, spread across several LEBs. In terms of requirements, this would mean:

  • Ping/TCP/HTTP/SMTP/... checks
  • Graphs/Statistics from ALL locations/check servers
  • Settings like "Send alarm if at least 3 servers/locations detect a failure"
  • (additional goodie: some kind of poller, SNMP for example, to monitor CPU/memory/disk...)

I've set up Icinga2 already (and a few other monitoring tools as well), only to discover that just one location executes each check => requirement 2 not fulfilled. Just to make sure that everyone understands what I want, here's an example:

28th August 2015, 21:00
Ping 8.8.8.8: DE: Online (18ms), FR: Online (22ms), US1: Offline (- ms) (=> no alarm)
HTTP google.com: DE: Offline, FR: Offline, US1: Offline (=> alarm)

28th August 2015, 21:01
Ping 8.8.8.8: DE: Online (18ms), FR: Online (22ms), US1: Online (82ms)
HTTP google.com: DE: Online (200), FR: Online (200), US1: Online (200) (=> mark as solved)
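
The alarm rule in the example above can be sketched in a few lines of Python. This is only an illustration of the "alarm if at least N locations fail" idea, not taken from any existing monitoring tool; the names `Result` and `evaluate` and the "warning" state for partial failures are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Result:
    location: str  # hypothetical location label, e.g. "DE", "FR", "US1"
    online: bool   # True if the check succeeded from that location


def evaluate(results, alarm_threshold=3):
    """Return 'ok', 'warning' or 'alarm' for one check across all locations."""
    failures = sum(1 for r in results if not r.online)
    if failures >= alarm_threshold:
        return "alarm"    # enough locations agree that the target is down
    if failures > 0:
        return "warning"  # some locations fail, but below the threshold
    return "ok"


# The 21:00 sample from the example above.
ping = [Result("DE", True), Result("FR", True), Result("US1", False)]
http = [Result("DE", False), Result("FR", False), Result("US1", False)]
print(evaluate(ping))  # one failure out of three: no alarm
print(evaluate(http))  # three failures: alarm
```

With a threshold of 3, the ping check stays below the alarm level while the HTTP check triggers it, matching the two lines of the 21:00 sample.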

Does any kind of free software exist for that purpose? So far I've found commercial software which would require me to pay a lot of $$$ or software which doesn't execute the check from all locations. I heard something about Nagios, but don't have any experience with that and heard that it's a PITA, so I would be glad about any kind of further information.

Thanks in advance for every reply and enjoy your weekend.

Best regards, NeoXiD

Comments

  • pavs Member

    Look at nagios and smokeping.

    Nagios does email alerts, smokeping will give you a history of latency and outages.

    Thanked by 2ehab NeoXiD
  • Jun Member

    I used Nagios + Cacti for 2 years. It works very well for my needs.
    I'm considering switching, though.
    LibreNMS is my weekend project for this week.

    Thanked by 2ehab NeoXiD
  • perennate Member, Host Rep

    pybearmon should satisfy your first and third goals. It implements basic checks, and on every "check interval" (configured from database entry) it will execute the check on one server. If a server finds the check in a state that does not match previous state (database says online but check fails, or database says offline but check succeeds), then the system will have N-1 additional servers confirm the state change (where N is in config.py) and send an alert (email or Twilio SMS/voice or HTTP target).
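
A rough sketch of the confirmation flow described above, for illustration only. It is not pybearmon's actual code: the function name, the idea of passing workers as callables, and the return shape are all assumptions.

```python
def confirm_state_change(check, stored_online, workers, n):
    """Decide whether a check's stored state should flip.

    workers: list of callables (hypothetical) that run the check from one
             server and return True if the target is online.
    n:       total number of servers that must agree on a state change.
    Returns (new_state, alert), where alert is True only on a confirmed change.
    """
    first, *others = workers
    observed = first(check)
    if observed == stored_online:
        return stored_online, False  # result matches stored state: nothing to do

    confirmers = others[: n - 1]
    if len(confirmers) < n - 1:
        return stored_online, False  # not enough servers to confirm

    # N-1 additional servers must see the same new state.
    if all(worker(check) == observed for worker in confirmers):
        return observed, True        # confirmed: update state and send alert
    return stored_online, False      # disagreement: treat as a local glitch


def up(check):
    return True


def down(check):
    return False


print(confirm_state_change("http google.com", True, [down, down, down], 3))
print(confirm_state_change("http google.com", True, [down, up, down], 3))
```

The first call flips the state and raises an alert because all three servers see the failure; the second keeps the stored state because one confirming server still sees the target online.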

    The "bear" in pybearmon is for "bare", since the goal of the system is to remain as simple and minimal as possible. This means there are no fancy graphs. There is however a table check_events that stores the timestamp and type of state changes (times when check goes online/offline after the confirmations), from which graphs can be generated.
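
As an example of what could be derived from such state-change events, here is a small sketch that turns a list of (timestamp, type) pairs into total downtime for a time window. The event format, the "online"/"offline" type strings, and the function name are assumptions; the real columns of check_events may differ.

```python
from datetime import datetime


def downtime_seconds(events, window_start, window_end, initially_online=True):
    """Sum the seconds spent offline between window_start and window_end.

    events: iterable of (timestamp, kind) with kind "online" or "offline"
            (assumed format). Events outside the window are ignored.
    initially_online: state of the check at window_start.
    """
    down = 0.0
    online = initially_online
    cursor = window_start
    for ts, kind in sorted(events):
        if ts < window_start or ts > window_end:
            continue
        if not online:
            down += (ts - cursor).total_seconds()  # close the offline interval
        online = kind == "online"
        cursor = ts
    if not online:
        down += (window_end - cursor).total_seconds()  # still offline at the end
    return down


events = [
    (datetime(2015, 8, 28, 21, 0), "offline"),
    (datetime(2015, 8, 28, 21, 1), "online"),
]
start = datetime(2015, 8, 28, 21, 0)
end = datetime(2015, 8, 28, 22, 0)
print(downtime_seconds(events, start, end))
```

Dividing the result by the window length gives an availability percentage that can be plotted per check.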

    Thanked by 2ehab NeoXiD
  • NeoXiDNeoXiD Member
    edited August 2015

    @cassa said: LibreNMS

    We're using that at work as an additional tool to create a few graphs for our 10-40Gbps backbone switches, but I don't think it is really suitable for my needs. Seems more focussed on monitoring actual network equipment.

    @pavs said: Nagios and SmokePing

    Do you have any kind of setup like that up and running? Any recommended docs/howtos to get that working? I've taken a quick peek at Nagios so far and it seemed really complicated to get into. Also, does Nagios support rules like "if this check fails from 3 out of 7 locations, send an alert, otherwise just display a warning"?

    @Jun said: Nagios and Cacti

    I don't really like Cacti, but it seems like I should definitely look into Nagios. It's just a pity that none of those solutions has as nice a user experience as Icinga2 Web.

    @perennate said: pybearmon

    Looks interesting, thanks for the hint. Based on the functionality you described, it does all the things I'd need, and I could easily add a few things like graphs myself. Will give it a try. Also, it doesn't seem to be that complicated, so I could probably create something similar based on that in Rust (I always prefer compiled stuff). Nice work if it was done by you (I just assume so based on your signature).

  • patrick7 Member, LIR

    Nagios/Icinga (v1) looks more complicated than it is. Smokeping is very easy to set up: https://www.howtoforge.com/monitoring_network_latency_smokeping_debian_etch

  • Icinga 2 should be able to do this.

  • @mpkossen said:
    Icinga 2 should be able to do this.

    I've set up Icinga2 and I was able to set up distributed monitoring, but only ever one satellite/cluster node reported back whether the system was online or not. I searched the internet forever for a way to change that, but it seems like Icinga 2 is hardcoded to always show only one result? I'd rather have an overview where I can clearly see which locations failed and/or succeeded, for each test.

  • @NeoXiD said:
    I've set up Icinga2 and I was able to set up distributed monitoring, but only ever one satellite/cluster node reported back whether the system was online or not. I searched the internet forever for a way to change that, but it seems like Icinga 2 is hardcoded to always show only one result? I'd rather have an overview where I can clearly see which locations failed and/or succeeded, for each test.

    I think this depends on the way the satellite is set up. If it's one with a local configuration it should be able to report to multiple masters.

  • @mpkossen said:

    Ah, I think I get what you mean now. Your solution would require multiple masters, right? I'd prefer to have one user interface showing the results from all locations for each check. Basically it should be somewhat similar to solutions like NodePing, just self-hosted instead.

  • @NeoXiD said:

    I think you can have one Icinga Web for multiple masters, but I'm not sure.

  • perennate Member, Host Rep
    edited August 2015

    NeoXiD said: Also, it doesn't seem to be that complicated, so I could probably create something similar based on that in Rust (I always prefer compiled stuff).

    Yeah, all communication is channelled through MySQL, and it assumes no one will need to hold a lock for more than a minute, which makes things pretty simple. Here's a new one I've been working on in Go. It solves some issues with the old one: failing instead of reconnecting when the database goes away, and sending false positives if the database and worker are on the same server. It's also intended to handle a much larger volume of checks (state is stored in memory instead of in MySQL, with fault tolerance achieved via a view server): https://github.com/LunaNode/gobearmon
