What do you look for in an external monitoring service? Thoughts on my approach welcome.

MasonR Member
edited December 7 in General

Howdy all,

I've started working on a distributed monitoring service in my spare time to sharpen some of my Python skills. Why a monitoring service? Mainly because I'm not really satisfied with the options that currently exist (some don't have HTTP response code checks or HTTPS cert verification, some can't specify a nonstandard TCP port, etc.). Right now, I'm just trying to brainstorm all of the features that I will move forward with implementing.

Edit: Note that this will be an external monitoring system, meaning that the code will not be deployed to the nodes that you want monitored and thus won't support any resource or log-based analysis/alerting. Think of this as an open-source, distributed, Python-based Uptime Robot or phpservermon clone.

Current list of service checks to implement:

  • ping checks
  • http response codes (ex. non-200 = error)
  • https cert verification warnings (ex. cert expires in 7 days, etc.)
  • TCP port listening for connections
  • Steam game servers (via python-valve)
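
As a rough illustration of the cert-expiry check in the list above, here is a minimal sketch using only the Python standard library. The function names (`days_until_expiry`, `expiry_warning`, `fetch_not_after`) and the 7-day default are assumptions made for the example, not code from the project.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a cert's notAfter timestamp (negative if expired).

    `not_after` uses the format found in ssl's getpeercert() output,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def expiry_warning(not_after, warn_days=7, now=None):
    """Return True when the cert expires within `warn_days` (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect over TLS and return the peer cert's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A satellite node could call `expiry_warning(fetch_not_after("example.com"))` and report the boolean back; how the result is serialized is left open here.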

The service will be deployed in a distributed fashion. There will be one central server where a user would register their services they wish to monitor and a desired frequency (1 min, 5 min, 20 min, etc.). During an individual service check, the job will be sent to three satellite monitoring nodes via a RESTful API in a round-robin manner. The responses determine if a service is offline (2+ negative responses).
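
The dispatch rule described above (three satellites picked round-robin, service considered offline on 2+ negative responses) could be sketched roughly like this. `pick_nodes` and `is_offline` are hypothetical names, and the actual REST calls to the satellites are left out.

```python
from itertools import cycle, islice


def pick_nodes(nodes, start, count=3):
    """Pick `count` nodes round-robin, beginning at index `start`."""
    if count > len(nodes):
        raise ValueError("not enough satellite nodes")
    first = start % len(nodes)
    return list(islice(cycle(nodes), first, first + count))


def is_offline(responses, threshold=2):
    """A service counts as offline when `threshold` or more nodes report failure.

    `responses` is an iterable of booleans: True means the check passed.
    """
    negatives = sum(1 for ok in responses if not ok)
    return negatives >= threshold
```

The main node would advance `start` by one after each job so the load spreads evenly across all registered satellites.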

I've just started development on the satellite nodes stack/api so far -- holding off on the main node until the interface is in place. The satellite nodes will be firewalled to only allow jobs from the main node's IP and are using an nginx + gunicorn + flask stack. I'm not planning on monetizing this project at all and will post the code on GitHub as I make progress, though I was planning on standing up a free service and giving each user a quota of 50ish services.

Any thoughts and recommendations are welcome.
Cheers!
-Mason

Comments

  • Falzo Member

    subscribed. ;-)

    Long version: I think your goals are quite clearly defined already, and you know that I agree with all of the things listed.

    I really like having a lightweight alarming service like night-sky, but expanding the kinds of checks a bit and eventually sending more warnings/notifications makes perfect sense.

    The remaining SPOF in that design, to me, is what happens if the main node goes down. Is there a way to have a fallback for the main node that can at least take over the monitoring itself in case of an incident?

    if this gets production ready I am willing to help providing this as a free service for the community.


    Thanked by 2MasonR vimalware
  • @Falzo said: subscribed. ;-)

    Long version: I think your goals are quite clearly defined already, and you know that I agree with all of the things listed.

    I really like having a lightweight alarming service like night-sky, but expanding the kinds of checks a bit and eventually sending more warnings/notifications makes perfect sense.

    Yeah, initially I was going to try and make some improvements to night-sky. But my knowledge and desire to know PHP is extremely limited :P Would like to make this as pluggable as possible to where if you want to add another type of check all you have to do is create a new REST endpoint and add another python file with the logic to the node.
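
One way the "add a Python file with the logic" idea could look is a small registry that maps a check name to a function. This is only a sketch: `CHECKS`, `register_check`, and `run_check` are made-up names, and the REST endpoint that would call `run_check` is not shown.

```python
CHECKS = {}


def register_check(name):
    """Decorator that registers a check function under `name`."""
    def decorator(func):
        if name in CHECKS:
            raise ValueError(f"check {name!r} already registered")
        CHECKS[name] = func
        return func
    return decorator


@register_check("http_status")
def http_status_ok(status_code):
    """Treat any non-200 response code as an error."""
    return status_code == 200


def run_check(name, *args, **kwargs):
    """Look up a registered check by name and run it."""
    try:
        check = CHECKS[name]
    except KeyError:
        raise LookupError(f"unknown check type: {name}") from None
    return check(*args, **kwargs)
```

Adding a new check type then means writing one decorated function; nothing else on the node has to change.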

    The remaining SPOF in that design, to me, is what happens if the main node goes down. Is there a way to have a fallback for the main node that can at least take over the monitoring itself in case of an incident?

    Definitely agreed here. I have zero experience with HA so any recommendations to overcome this would be helpful. Considering possibly having a cluster of main nodes where one could take over if a server goes down or under too much load. Will have to investigate this further.

    if this gets production ready I am willing to help providing this as a free service for the community.

    Cheers! :) Initially I'll be trying to use my ridiculous amount of NAT servers to be the satellite nodes. But obviously, if this takes off and accumulates a large user-base then it would need to be scaled up to more beefy nodes :P

    Thanked by 1vimalware
  • WSS Member

    The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..
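
A minimal sketch of the heartbeat side of that idea, assuming the standby master simply tracks when it last heard from the active one. `HeartbeatMonitor` and the 15-second default are assumptions for illustration; the IP cutover itself depends on the hosting setup and is not shown.

```python
import time


class HeartbeatMonitor:
    """Tracks the peer master's heartbeats and decides when to take over."""

    def __init__(self, timeout=15.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = clock()

    def beat(self):
        """Record that a heartbeat from the peer just arrived."""
        self.last_seen = self.clock()

    def peer_is_down(self):
        """True when no heartbeat has arrived within `timeout` seconds."""
        return self.clock() - self.last_seen > self.timeout
```

The standby would call `beat()` whenever the active master checks in and start dispatching jobs itself once `peer_is_down()` returns True.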

    Thanked by 1MasonR
  • @WSS said: The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Hmm... yeah that could work. I guess the databases with the users/services to monitor would need to be replicated between the two master nodes too.

  • WSS Member

    @MasonR said:

    @WSS said: The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Hmm... yeah that could work. I guess the databases with the users/services to monitor would need to be replicated between the two master nodes too.

    Why would those be part of the master node?

  • @WSS said:

    @MasonR said:

    @WSS said: The easiest way to handle this would probably be to have two "master" services which have a heartbeat monitor between them, then do an IP cutover to keep things cleaner and simpler on the monitoring nodes.

    Of course, I put like 3 seconds of thought into this, so there's probably plenty of more complicated ways of doing it using Node..

    Hmm... yeah that could work. I guess the databases with the users/services to monitor would need to be replicated between the two master nodes too.

    Why would those be part of the master node?

    I was imagining the master node to be the front end of the system (i.e. user logs in and sets up their monitors) and the dispatcher of the jobs (i.e. every x minutes dispatch the jobs to the satellite nodes). User logins + saved monitors would probably be stored to a mysql database living on the server. What did you have in mind?

  • WSS Member

    @MasonR said: What did you have in mind?

    I'd completely separate the user interface from anything which actually does anything, because people are dickheads and your system can't tell you when it goes offline due to some asshole throwing a botnet at it for fun.

    The fact you're already working on an API is good; I'm wondering just how to keep multiple locales having access to the rundown of the queue to process. It won't be difficult, and would probably just work like a circular fifo in execution- but again, I've put about 3 seconds thought into this.

    Thanked by 2MasonR vimalware
  • @WSS said:

    @MasonR said: What did you have in mind?

    I'd completely separate the user interface from anything which actually does anything, because people are dickheads and your system can't tell you when it goes offline due to some asshole throwing a botnet at it for fun.

    Yeah you've got a point there... dicks. I'll definitely keep this in mind as I move towards implementing the actual master node and user interface.

  • WSS Member

    Really, now the next concern would be if you have "runners" which queue up and send the nodes the set of what to do and wait for a response, which would waste sockets and a bit of overhead- or to have a queue that all nodes read through and respond accordingly. The only good part about the secondary design is that if the network gets fucked, those nodes will still have the current set of rules. However, if those stop talking to the parent, that can also go poorly.

    Thanked by 1MasonR
  • 6ixth Member

    Notifications via Windows notifications (most likely through a browser you have open) and SMS via one of the numerous SMS providers that support most countries at cheap prices.

    Thanked by 1MasonR
  • WSS Member

    @6ixth said: Notifications via Windows notifications (most likely through a browser you have open) and SMS via one of the numerous SMS providers that support most countries at cheap prices.

    Hi. Welcome to the thread. Now what the fuck are you on about?

  • Falzo said: service like night-sky

    I've searched Google for that "night-sky" in the past and again today. I can't find it. I remember a discussion about it closing shop, but I couldn't even find the link to their old site or software. Can you point me in the right direction?

    I know PHP and I want to give it a look.

    I got an "I like you" from @WSS. Am I one of you now?


  • @WSS said: Really, now the next concern would be if you have "runners" which queue up and send the nodes the set of what to do and wait for a response, which would waste sockets and a bit of overhead- or to have a queue that all nodes read through and respond accordingly. The only good part about the secondary design is that if the network gets fucked, those nodes will still have the current set of rules. However, if those stop talking to the parent, that can also go poorly.

    I think I've already decided on the former approach, since I'd like the satellite nodes to be as stateless as possible, simply listening for jobs and executing them as they come in. gunicorn lets you run multiple workers on a single node, so they should be able to serve a high volume of requests. Will definitely need to do some load tests to make sure the deployed infrastructure can handle what's being thrown at it.

    But since the system will be scalable, if load picks up, new nodes can be spawned and register with the master node to start accepting jobs.

    Thanked by 1WSS
  • 6ixth Member

    @WSS said:

    @6ixth said: Notifications via Windows notifications (most likely through a browser you have open) and SMS via one of the numerous SMS providers that support most countries at cheap prices.

    Hi. Welcome to the thread. Now what the fuck are you on about?

    Well if you read the title of the thread, and the thread itself you shall find it asks for recommendations and suggestions for the service itself. Then read my post again and you will know what the fuck I am on about :)

  • WSS Member
    edited December 6

    @Harzem said: I know PHP and I want to give it a look.

    https://github.com/Ne00n/Night-Sky

    @6ixth said: Well if you read the title of the thread, and the thread itself you shall find it asks for recommendations and suggestions for the service itself. Then read my post again and you will know what the fuck I am on about :)

    Suggesting SMS is neat. We're still talking about the architecture. You can wait out in the hallway.

    Thanked by 2MasonR Harzem
  • Install Nagios, load your favorite checks, deploy NRPE on the clients, done. Nagios can notify by SMS, mobile push, email and many more.

    You can implement your own checks very easily if you're not happy with the available ones. I implemented Python checks for S.M.A.R.T. status on disks and wear-out on SSDs, for example.

    Thanked by 1MasonR
  • Corey Member, Provider

    nagios

  • WSS Member

    Guys.. he wants to build one, not extend one.

    Thanked by 2MasonR vimalware
  • Correct me if I'm wrong, but nagios requires you to install their shit on all the servers you want to monitor. I'm aiming to not have to make any changes to your setup.

    Thanked by 1vimalware
  • Timtimo13 Member
    edited December 6

    @MasonR said: Correct me if I'm wrong, but nagios requires you to install their shit on all the servers you want to monitor. I'm aiming to not have to make any changes to your setup.

    This depends on the checks you want to perform. If you don't need to check things like CPU usage, you can run checks externally from the Nagios server without any problem.

    This would work out fine for web content checks, SSL certificate validation, etc.

    Thanked by 2MasonR WSS
  • Neoon Member

    Well, theoretically you can do it with PHP: just put your software on Galera and make sure, for example with timestamps, that only the right server is running the cronjobs.

    You could put like 5 nodes into that cluster, nearly immortal.

    Still, if you outsource the jobs to the satellites, make sure your main node is HA, otherwise users will not be able to control the jobs.

    Thanked by 1MasonR
  • @Neoon said: Well, theoretically you can do it with PHP, just put your software on Galera

    Galera looks nice. PHP on the other hand...

  • Timtimo13 Member
    edited December 6

    @MasonR I know that feeling well, my first backup solution was self written with python, so do it as long as you have fun :-)

    In my spare time, I'm more into scripts for harvesting contents from eg DMAX online library :-P

    Thanked by 1MasonR
  • Neoon Member
    edited December 6

    @MasonR said:

    @Neoon said: Well, theoretically you can do it with PHP, just put your software on Galera

    Galera looks nice. PHP on the other hand...

    PHP may not be the best solution, but it works fine. On the other hand, there is other stuff that is even worse than PHP.

  • You could (one of the last features maybe) implement notification groups / times for hosts and host groups.

    You'd serve mission-critical services like this: notify all admins by email (24/7) and only the mission-critical administrators by SMS (24/7).

    That's the way it works with my Nagios configuration.
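
To make the notification-group idea concrete, here is a rough sketch of routing one alert to contacts by group and channel. `Contact`, `route_alert`, and the group names are hypothetical, and time windows (the 24/7 part) are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    groups: frozenset  # e.g. frozenset({"admins", "critical"})
    email: str = ""
    phone: str = ""


def route_alert(contacts, rules):
    """Return (channel, address) pairs for one alert.

    `rules` maps a channel ("email" or "sms") to the group that receives it.
    """
    deliveries = []
    for channel, group in rules.items():
        for contact in contacts:
            address = contact.email if channel == "email" else contact.phone
            if group in contact.groups and address:
                deliveries.append((channel, address))
    return deliveries
```

With `rules = {"email": "admins", "sms": "critical"}`, every admin gets an email and only contacts in the critical group get an SMS.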

    Thanked by 1MasonR
  • @Timtimo13 said: You could (one of the last features maybe) implement notification groups / times for hosts and host groups.

    You'd serve mission-critical services like this: notify all admins by email (24/7) and only the mission-critical administrators by SMS (24/7).

    That's the way it works with my Nagios configuration.

    That'd be interesting. I'll keep that in mind when I get to developing the web/notification portion of the project.

  • Maybe implement SNMP ? hmm

    There are far too many options here

  • MasonR Member
    edited December 6

    @Timtimo13 said: Maybe implement SNMP ? hmm

    There are far too many options here

    I don't think I'll stray that far. A simple is it up or down is what I'm after. Don't care about cpu load, network traffic, disk space, etc. There's enough software that already provides that, I think, but I want a solution that doesn't require you to modify the services or servers you want to monitor at all.

    Thanked by 1vimalware
  • WSS Member
    edited December 6

    @MasonR said:

    @Timtimo13 said: Maybe implement SNMP ? hmm

    There are far too many options here

    I don't think I'll stray that far. A simple is it up or down is what I'm after.

    So you time how long it takes the server to respond:

    SYN 
    SYN-ACK
    ACK
    

    #dicks

    FIN
    ACK
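
Timing the handshake from the client side can be approximated by timing a plain TCP connect, since `connect` returns once the SYN / SYN-ACK exchange has completed. A minimal sketch; `tcp_connect_time` is a made-up name and the 5-second timeout is an arbitrary default.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Return seconds taken to complete a TCP connect, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None
```

A `None` result maps to "down"; a number gives both "up" and a rough latency figure for the same probe.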
    
    Thanked by 1MasonR
  • zilch Member

    consul || (grafana && (telegraf || collectd))

    What are you hashtags?

  • in 20 minutes.

  • Hmm. System checks like CPU, RAM usages and an application which will keep playing a sound when system is down so I can wake up and deal with the issue if I'm sleeping and providing 24/7 support.


  • Timtimo13 Member
    edited December 7

    @CreatePrivateServer said: Hmm. System checks like CPU, RAM usages and an application which will keep playing a sound when system is down so I can wake up and deal with the issue if I'm sleeping and providing 24/7 support.

    Depending on your notification settings, you would get a notification when your webserver is down, which is mostly the same thing. As Mason doesn't want to work with local clients, you won't be able to check CPU / RAM / disk (...) usage without changes on the client being checked and support for that client from Mason.

    If you need these checks, check out Nagios.

  • This looks exactly like what I was thinking of building for myself (for the same reasons: Python + Flask skills): external service monitoring from a quorum of POPs.

    Not nitpicking, but I would have gone with PostgreSQL as the RDBMS for a greenfield project in 2017.

    PM a url to git or architecture wiki to see if it makes sense to contribute rather than build my own.
    All the Best! :)

    Thanked by 1MasonR
  • @vimalware said: This looks exactly like what I was thinking of building for myself (for the same reasons: Python + Flask skills): external service monitoring from a quorum of POPs.

    Not nitpicking, but I would have gone with PostgreSQL as the RDBMS for a greenfield project in 2017.

    Haven't decided on a particular database for the main node quite yet and wouldn't mind using psql as I use it quite extensively for a couple projects at work. The choice will probably come down between MariaDB and PostgreSQL.

    PM a url to git or architecture wiki to see if it makes sense to contribute rather than build my own.
    All the Best! :)

    Cheers! Will do when I get the ball rolling more. So far the only things decided are that the monitoring nodes will have an nginx -> gunicorn -> flask setup and that I'll try to make them as pluggable as possible so new extensions can be added easily. Got the basic skeleton in place, but wanted to get a couple of the modules banged out first (probably ping + http response) before adding to git.

  • mksh Member

    @Timtimo13 said: Maybe implement SNMP ? hmm

    Yes, sometimes there is little choice besides SNMP when it comes to monitoring stuff but voluntarily working with this abomination of a protocol? Please tell me you are joking.

    To everyone shouting Nagios: I think I've seen enough of Nagios (Icinga) to say he is better off designing his own solution from the ground up. Lots of room for a cleaner and nicer implementation there. Sure, he won't be able to avoid running some kind of software on the targets to monitor certain things, but IMO it won't be hard to come up with something that single-handedly beats nsclient.

  • Can you pull the CPU and ram from something like htop?


  • @AuroraZ said: Can you pull the CPU and ram from something like htop?

    psutil would be able to grab that info to keep everything in Pythonland. Though for this project, I'd rather stay away from user agents and the like and just focus on an external monitoring system.

    Thanked by 1vimalware
  • As long as the 'runners' follow an HTTP-based API, it paves the way for replacing the Python bits with a Go binary, if anyone feels like it.

    Thanked by 1MasonR
  • @vimalware said: As long as the 'runners' follow an HTTP-based API, it paves the way for replacing the Python bits with a Go binary, if anyone feels like it.

    That's a good point. There'd be nothing preventing someone from implementing their own monitor, even in a different language, as long as all the restful interfaces are defined.

  • @MasonR said:

    @AuroraZ said: Can you pull the CPU and ram from something like htop?

    psutil would be able to grab that info to keep everything in Pythonland. Though for this project, I'd rather stay away from user agents and the like and just focus on an external monitoring system.

    I was just thinking most if not all Admins install it so the info might be easy to pull. Still have it as an outside monitor because you wouldn't need to install anything special. Was just an idea.


    Thanked by 1MasonR
  • AFAI understand, original objective (and mine) was a Blackbox monitoring system in something other than PHP.

    For whitebox monitoring, lots of solutions exist.

    Thanked by 1MasonR
  • @vimalware said: AFAI understand, original objective (and mine) was a Blackbox monitoring system in something other than PHP.

    For whitebox monitoring, lots of solutions exist.

    Precisely. Basically an open-source python-based uptimerobot.

    Thanked by 1vimalware
  • if you can combine with log analysis & alert, would be awesome.

    there is loggly, logentry, etc. but they don't have uptime / ping monitoring.

    OOT: monitoring ladies bathroom

  • @kassle said: if you can combine with log analysis & alert, would be awesome.

    there is loggly, logentry, etc. but they don't have uptime / ping monitoring.

    Perhaps zabbix

    Thanked by 1kassle
  • @kassle said: if you can combine with log analysis & alert, would be awesome.

    graylog2 maybe?

    MasonR has a vision for a blackbox monitoring platform.
    I'd rather see a tool that does one thing very well.

    Thanked by 2MasonR kassle
  • @kassle said: if you can combine with log analysis & alert, would be awesome.

    there is loggly, logentry, etc. but they don't have uptime / ping monitoring.

    OOT: monitoring ladies bathroom

    Unfortunately, that's outside of the scope that this aims to accomplish. The code that is produced here wouldn't be deployed to the machines that you want monitored.

    Thanked by 1kassle
  • @MasonR said:

    @kassle said: if you can combine with log analysis & alert, would be awesome.

    there is loggly, logentry, etc. but they don't have uptime / ping monitoring.

    OOT: monitoring ladies bathroom

    Unfortunately, that's outside of the scope that this aims to accomplish. The code that is produced here wouldn't be deployed to the machines that you want monitored.

    I see, but with rsyslog (which the major Linux distros support) there's no need to install an extra application, just extra config :)

  • If you don't mind me chiming in then I would suggest you:

    • Use Sanic instead of Flask; it's basically a Flask-like framework with async support.

    • For HA, try to use the Zookeeper library, trust me it does wonders. It is hard to use at first, but it will go farther than what you have described. I got a lot of help from the Netflix zookeeper recipes when I started using it.

    • Use Celery to distribute your workload across multiple workers, and don't use Redis as the broker; go for RabbitMQ.

    • Last but not least, I would look into using Go instead of Python. I know you want to sharpen your Python + Flask skills. However, in 2017 (almost 2018) Go is the king of the hill for these kinds of apps.

    Good luck bro, I hope you succeed and I will be waiting to take a look at that source code.

    If you need help, I'm always here to help you. Just pm me with anything technical!

    Thanked by 1MasonR
  • @IAlwaysBeCoding said: If you don't mind me chiming in then I would suggest you:

    • Use Sanic instead of Flask; it's basically a Flask-like framework with async support.

    Sanic looks nice and might eliminate the need for gunicorn since you can spawn multiple workers. Async is definitely a huge plus.

    • For HA, try to use the Zookeeper library, trust me it does wonders. It is hard to use at first, but it will go farther than what you have described. I got a lot of help from the Netflix zookeeper recipes when I started using it.

    I'll definitely look into Zookeeper as well -- being a complete noob to HA, I'll probably have to fiddle with a few different options out there.

    • Use Celery to distribute your workload across multiple workers, and don't use Redis as the broker; go for RabbitMQ.

    Added to the list of what to look into :)

    • Last but not least, I would look into using Go instead of Python. I know you want to sharpen your Python + Flask skills. However, in 2017 (almost 2018) Go is the king of the hill for these kinds of apps.

    Yeah, not a bad idea. I think my initial pass (at least for the monitoring nodes) will be to use Python as that's what I'm more comfortable with. But since it'll all be API driven, as a Go exercise, I may rewrite the monitor in Go once things are up and running.

    Good luck bro, I hope you succeed and I will be waiting to take a look at that source code.

    Cheers, I really appreciate your input!

    Thanked by 1IAlwaysBeCoding