What do you look for in an external monitoring service? Thoughts on my approach welcome.

MasonR Community Contributor
edited December 2017 in General

Howdy all,

I've started working on a distributed monitoring service in my spare time to sharpen some of my Python skills. Why a monitoring service? Mainly because I'm not really satisfied with the options that currently exist (some lack HTTP response code checks or HTTPS verification, some can't specify a nonstandard TCP port, etc.). Right now, I'm just trying to brainstorm all of the features that I will move forward with implementing.

Edit: Note that this will be an external monitoring system, meaning that the code will not be deployed to the nodes that you want monitored and thus won't support any resource or log-based analysis/alerting. Think of this as an open-source, distributed, Python-based Uptime Robot or phpservermon clone.

Current list of service checks to implement:

  • ping checks
  • http response codes (ex. non-200 = error)
  • https cert verification warnings (ex. cert expires in 7 days, etc.)
  • TCP port listening for connections
  • Steam game servers (via python-valve)
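
For illustration only (this is a sketch, not code from the project): the TCP, HTTP status, and certificate-expiry checks in the list above could be approximated with the Python standard library. The function names, the 3-second timeout, and the 7-day warning default are assumptions made for the example.

```python
import socket
import ssl
import time
from typing import Optional


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def http_status_ok(status_code: int) -> bool:
    """Treat any response code other than 200 as an error."""
    return status_code == 200


def cert_days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining until a certificate's notAfter timestamp.

    not_after uses the "Jan  5 09:34:43 2018 GMT" style of timestamp.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def cert_expiry_warning(not_after: str, warn_days: int = 7,
                        now: Optional[float] = None) -> bool:
    """Return True if the certificate expires within warn_days days."""
    return cert_days_left(not_after, now) <= warn_days
```

A call such as `tcp_port_open("example.com", 8443)` would cover a nonstandard port. The `not_after` string is in the format that `ssl.SSLSocket.getpeercert()` returns under the `notAfter` key.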

The service will be deployed in a distributed fashion. There will be one central server where users register the services they wish to monitor and a desired check frequency (1 min, 5 min, 20 min, etc.). For each individual service check, the job will be sent to three satellite monitoring nodes, chosen in a round-robin manner, via a RESTful API. The responses determine whether a service is offline (2+ negative responses).
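
A rough sketch of the dispatch rule described above, assuming at least three satellite nodes and a hypothetical `probe(node, target)` callable that stands in for the REST call to a satellite and returns True when that node sees the service as up. All names here are made up for the example.

```python
import itertools
from typing import Callable, Iterator, List, Sequence


def round_robin(nodes: Sequence[str]) -> Iterator[str]:
    """Endless rotation over the registered satellite nodes."""
    return itertools.cycle(nodes)


def next_nodes(rotation: Iterator[str], count: int = 3) -> List[str]:
    """Take the next `count` nodes from the rotation for one check."""
    return list(itertools.islice(rotation, count))


def is_offline(results: Sequence[bool], min_failures: int = 2) -> bool:
    """A service counts as offline when 2+ nodes report it as down."""
    failures = sum(1 for seen_up in results if not seen_up)
    return failures >= min_failures


def check_is_offline(target: str, rotation: Iterator[str],
                     probe: Callable[[str, str], bool]) -> bool:
    """Send one check to three nodes and apply the 2-of-3 rule."""
    results = [probe(node, target) for node in next_nodes(rotation)]
    return is_offline(results)
```

With four nodes registered, consecutive checks would go to nodes 1-3, then 4, 1, 2, and so on, so load spreads evenly without the main node tracking per-node state beyond the rotation itself.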

I've only just started development on the satellite node stack/API, and I'm holding off on the main node until the interface is in place. The satellite nodes will be firewalled to only allow jobs from the main node's IP and use an nginx + gunicorn + Flask stack. I'm not planning on monetizing this project at all and will post the code on GitHub as I make progress, though I am planning on standing up a free service and giving each user a quota of 50ish services.

Any thoughts and recommendations are welcome.
Cheers!
-Mason

Comments

  • subscribed. ;-)

    long version: I think your goals are quite clearly defined already, and you know that I agree with all of the things listed.

    I really like having a lightweight alerting service like night-sky, but expanding the kinds of checks a bit and eventually sending more warnings/notifications makes perfect sense.

    The remaining SPOF in that design, to me, is the question of what happens if the main node goes down, and whether there is a fallback for the main node that can at least take over the monitoring itself in case of an incident...

    if this gets production ready, I am willing to help provide this as a free service for the community.

    Thanked by 2MasonR vimalware
  • MasonR Community Contributor

    @Falzo said:
    subscribed. ;-)

    long version: I think your goals are quite clearly defined already, and you know that I agree with all of the things listed.

    I really like having a lightweight alerting service like night-sky, but expanding the kinds of checks a bit and eventually sending more warnings/notifications makes perfect sense.

    Yeah, initially I was going to try and make some improvements to night-sky. But my knowledge of PHP, and my desire to learn it, are extremely limited :P I'd like to make this as pluggable as possible, so that adding another type of check only requires creating a new REST endpoint and adding another Python file with the logic to the node.

    The remaining SPOF in that design, to me, is the question of what happens if the main node goes down, and whether there is a fallback for the main node that can at least take over the monitoring itself in case of an incident...

    Definitely agreed here. I have zero experience with HA so any recommendations to overcome this would be helpful. Considering possibly having a cluster of main nodes where one could take over if a server goes down or under too much load. Will have to investigate this further.

    if this gets production ready, I am willing to help provide this as a free service for the community.

    Cheers! :) Initially I'll be trying to use my ridiculous number of NAT servers as the satellite nodes. But obviously, if this takes off and accumulates a large user base, then it would need to be scaled up to beefier nodes :P

    Thanked by 1vimalware
  • The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Thanked by 1MasonR
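
One way to picture the heartbeat idea from the comment above, as a minimal sketch and not anything proposed in the thread: the standby master records each heartbeat it receives from the active master and claims the shared IP only after the peer has been silent for longer than a timeout. The class name, the 15-second default, and the injectable clock are assumptions for the example; the actual IP cutover mechanism is out of scope here.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks how recently the peer master was heard from."""

    def __init__(self, timeout_seconds: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        # Assume the peer is alive at startup to avoid an instant takeover.
        self._last_seen = clock()

    def record_heartbeat(self) -> None:
        """Call this whenever a heartbeat arrives from the peer."""
        self._last_seen = self._clock()

    def peer_is_alive(self) -> bool:
        """True while the last heartbeat is within the timeout."""
        return (self._clock() - self._last_seen) <= self.timeout_seconds


def should_take_over(monitor: HeartbeatMonitor, i_am_active: bool) -> bool:
    """The standby takes over only when the active peer has gone silent."""
    return (not i_am_active) and (not monitor.peer_is_alive())
```

Because both masters would sit behind the same IP after a cutover, the satellite nodes' firewall rules and job handling would not need to change, which matches the "cleaner and simpler on the monitoring nodes" point.
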
  • MasonR Community Contributor

    @WSS said:
    The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Hmm... yeah that could work. I guess the databases with the users/services to monitor would need to be replicated between the two master nodes too.

  • @MasonR said:

    @WSS said:
    The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Hmm... yeah that could work. I guess the databases with the users/services to monitor would need to be replicated between the two master nodes too.

    Why would those be part of the master node?

  • MasonR Community Contributor

    @WSS said:

    @MasonR said:

    @WSS said:
    The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Hmm... yeah that could work. I guess the databases with the users/services to monitor would need to be replicated between the two master nodes too.

    Why would those be part of the master node?

    I was imagining the master node to be the front end of the system (i.e. user logs in and sets up their monitors) and the dispatcher of the jobs (i.e. every x minutes dispatch the jobs to the satellite nodes). User logins + saved monitors would probably be stored in a MySQL database living on the server. What did you have in mind?

  • @MasonR said:

    What did you have in mind?

    I'd completely separate the user interface from anything which actually does anything, because people are dickheads and your system can't tell you when it goes offline due to some asshole throwing a botnet at it for fun.

    The fact you're already working on an API is good; I'm wondering just how to give multiple locales access to the rundown of the queue to process. It won't be difficult, and would probably just work like a circular FIFO in execution, but again, I've put about 3 seconds of thought into this.

    Thanked by 2MasonR vimalware
  • MasonR Community Contributor

    @WSS said:

    @MasonR said:

    What did you have in mind?

    I'd completely separate the user interface from anything which actually does anything, because people are dickheads and your system can't tell you when it goes offline due to some asshole throwing a botnet at it for fun.

    Yeah you've got a point there... dicks. I'll definitely keep this in mind as I move towards implementing the actual master node and user interface.

  • Really, the next concern is whether you have "runners" which queue up jobs, send each node its set of what to do, and wait for a response (which would waste sockets and add a bit of overhead), or a queue that all nodes read through and respond to accordingly. The only good part about the second design is that if the network gets fucked, those nodes will still have the current set of rules. However, if those stop talking to the parent, that can also go poorly.

    Thanked by 1MasonR
  • Notifications via Windows notifications (most likely through a browser you have open) and SMS via the numerous SMS providers that support most countries at cheap prices.

    Thanked by 2MasonR MrH
  • @6ixth said:
    Notifications via Windows notifications (most likely through a browser you have open) and SMS via the numerous SMS providers that support most countries at cheap prices.

    Hi. Welcome to the thread. Now what the fuck are you on about?

  • Falzo said: service like night-sky

    I've searched Google for that "night-sky" in the past and again today. I can't find it. I remember a discussion about it closing shop, but I couldn't even find the link to their old site or software. Can you point me in the right direction?

    I know PHP and I want to give it a look.

  • MasonR Community Contributor

    @WSS said:
    Really, the next concern is whether you have "runners" which queue up jobs, send each node its set of what to do, and wait for a response (which would waste sockets and add a bit of overhead), or a queue that all nodes read through and respond to accordingly. The only good part about the second design is that if the network gets fucked, those nodes will still have the current set of rules. However, if those stop talking to the parent, that can also go poorly.

    I think I've already decided on the former approach, since I'd like the satellite nodes to be as stateless as possible, simply listening for jobs and executing them as they come in. gunicorn lets you run multiple workers on a single node, so they should be able to take and serve a high volume of requests. I'll definitely need to do some load tests to make sure the deployed infrastructure can handle what's being thrown at it.

    But since the system will be scalable, if load picks up, new nodes can be spawned and register with the master node to start accepting jobs.

    Thanked by 1WSS
  • @WSS said:

    @6ixth said:
    Notifications via Windows notifications (most likely through a browser you have open) and SMS via the numerous SMS providers that support most countries at cheap prices.

    Hi. Welcome to the thread. Now what the fuck are you on about?

    Well if you read the title of the thread, and the thread itself you shall find it asks for recommendations and suggestions for the service itself. Then read my post again and you will know what the fuck I am on about :)

  • WSS Member
    edited December 2017

    @Harzem said:
    I know PHP and I want to give it a look.

    https://github.com/Ne00n/Night-Sky

    @6ixth said:
    Well if you read the title of the thread, and the thread itself you shall find it asks for recommendations and suggestions for the service itself. Then read my post again and you will know what the fuck I am on about :)

    Suggesting SMS is neat. We're still talking about the architecture. You can wait out in the hallway.

    Thanked by 3MasonR Harzem vimalware
  • Install Nagios, load your favorite checks, deploy NRPE on clients, done.
    Nagios can notify by SMS, mobile push, email, and many more channels.

    You can implement your own checks very easily if you're not happy with the available ones. For example, I implemented Python checks for S.M.A.R.T. status on disks and wear-out on SSDs.

    Thanked by 1MasonR
  • nagios

  • Guys.. he wants to build one, not extend one.

    Thanked by 2MasonR vimalware
  • MasonR Community Contributor

    Correct me if I'm wrong, but nagios requires you to install their shit on all the servers you want to monitor. I'm aiming to not have to make any changes to your setup.

    Thanked by 1vimalware
  • Timtimo13 Member
    edited December 2017

    @MasonR said:
    Correct me if I'm wrong, but nagios requires you to install their shit on all the servers you want to monitor. I'm aiming to not have to make any changes to your setup.

    This depends on the checks you want to perform.
    If you don't want to check things like CPU usage, you can run checks externally from the Nagios server without any problem.

    This would work out fine for

    Web content check

    SSL certificate validation

    etc.

    Thanked by 2MasonR WSS
  • Neoon Community Contributor, Veteran

    Well, theoretically you can do it with PHP, just put your software on Galera, and make sure (with timestamps, for example) that only the right server is running the cronjobs.

    You could put like 5 nodes into that cluster, nearly immortal.

    Still, if you outsource the jobs to the satellites, make sure your main node is HA, otherwise users will not be able to control the jobs.

    Thanked by 1MasonR
  • MasonR Community Contributor

    @Neoon said: Well, theoretically you can do it with PHP, just put your software on Galera

    Galera looks nice. PHP on the other hand...

  • Timtimo13 Member
    edited December 2017

    @MasonR
    I know that feeling well. My first backup solution was self-written in Python, so do it as long as you have fun :-)

    In my spare time, I'm more into scripts for harvesting content from e.g. the DMAX online library :-P

    Thanked by 1MasonR
  • Neoon Community Contributor, Veteran
    edited December 2017

    @MasonR said:

    @Neoon said: Well, theoretically you can do it with PHP, just put your software on Galera

    Galera looks nice. PHP on the other hand...

    PHP may not be the best solution, but it works fine. On the other hand, there is other stuff out there that is even worse than PHP.

  • You could (as one of the last features, maybe) implement notification groups / times for hosts and host groups.

    You'd handle mission-critical services like this:
    Notify all admins by email (24/7) and only the mission-critical administrator by SMS (24/7).

    That's the way it works with my Nagios configuration.

    Thanked by 1MasonR
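
The rule in the comment above (email to every admin, SMS only to the mission-critical administrator) could be sketched as a small routing function. This is an illustration under assumptions, not Nagios configuration and not project code: the `Contact` fields and the `notifications_for` name are invented for the example, and notification time windows are left out because both channels are 24/7 in the comment.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Contact:
    name: str
    email: str
    phone: str = ""
    critical_oncall: bool = False  # receives SMS for critical services


def notifications_for(contacts: Sequence[Contact],
                      service_is_critical: bool) -> List[Tuple[str, str]]:
    """Return (channel, address) pairs to notify for one outage."""
    targets: List[Tuple[str, str]] = []
    for contact in contacts:
        # Every admin gets an email, whatever the service.
        targets.append(("email", contact.email))
        # SMS only for critical services and only for on-call contacts.
        if service_is_critical and contact.critical_oncall and contact.phone:
            targets.append(("sms", contact.phone))
    return targets
```

Keeping the routing decision separate from the code that actually sends email or SMS would make it easy to add other channels later without touching the rule itself.
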
  • MasonR Community Contributor

    @Timtimo13 said:
    You could (as one of the last features, maybe) implement notification groups / times for hosts and host groups.

    You'd handle mission-critical services like this:
    Notify all admins by email (24/7) and only the mission-critical administrator by SMS (24/7).

    That's the way it works with my Nagios configuration.

    That'd be interesting. I'll keep that in mind when I get to developing the web/notification portion of the project.

  • Maybe implement SNMP ? hmm

    There are far too many options here

  • MasonR Community Contributor
    edited December 2017

    @Timtimo13 said:
    Maybe implement SNMP ? hmm

    There are far too many options here

    I don't think I'll stray that far. A simple "is it up or down" is what I'm after. I don't care about CPU load, network traffic, disk space, etc. There's enough software that already provides that, I think, but I want a solution that doesn't require you to modify the services or servers you want to monitor at all.

    Thanked by 1vimalware
  • WSS Member
    edited December 2017

    @MasonR said:

    @Timtimo13 said:
    Maybe implement SNMP ? hmm

    There are far too many options here

    I don't think I'll stray that far. A simple "is it up or down" is what I'm after.

    So you time how long it takes the server to respond:

    SYN 
    SYN-ACK
    ACK
    

    #dicks

    FIN
    ACK
    
    Thanked by 1MasonR
  • consul || (grafana && (telegraf || collectd))
