What do you look for in an external monitoring service? Thoughts on my approach welcome.
I've started working on a distributed monitoring service in my spare time to sharpen my Python skills. Why a monitoring service? Mainly because I'm not really satisfied with the options that currently exist (some lack HTTP response code checks or HTTPS certificate verification, some can't specify a nonstandard TCP port, etc.). Right now I'm just trying to brainstorm all of the features I'll move forward with implementing.
Edit: Note that this will be an external monitoring system, meaning the code will not be deployed to the nodes you want monitored and thus won't support any resource or log-based analysis/alerting. Think of this as an open-source, distributed, Python-based Uptime Robot or phpservermon clone.
Current list of service checks to implement:
- ping checks
- http response codes (ex. non-200 = error)
- https cert verification warnings (ex. cert expires in 7 days, etc.)
- TCP port listening for connections
- Steam game servers (via python-valve)
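For anyone curious what a couple of these checks might look like, here is a rough standard-library sketch of the TCP port check and the cert expiry warning. The function names (`tcp_port_open`, `days_until_expiry`, `cert_days_remaining`) and the 3-second timeout are just placeholders for illustration, not the project's actual code.

```python
import socket
import ssl
import time
from typing import Optional


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a cert's notAfter value (format used by getpeercert())."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def cert_days_remaining(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Fetch the server certificate and return the days left until it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])
```

A check could then warn when `cert_days_remaining(host)` drops to 7 or below. Since the default context verifies the chain, an already invalid cert raises `ssl.SSLError` instead of returning a number, so that case would need its own handling.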
The service will be deployed in a distributed fashion. There will be one central server where a user registers the services they wish to monitor along with a desired check frequency (1 min, 5 min, 20 min, etc.). For each individual service check, the job will be sent via a RESTful API to three satellite monitoring nodes, chosen in a round-robin manner. Their responses determine whether the service is considered offline (2+ negative responses).
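To make the dispatch and voting logic concrete, here is a minimal sketch of how I picture it. The node names, `pick_nodes`, `is_offline`, and `run_check` are all hypothetical, and `probe` stands in for the real REST call to a satellite (it just returns True when that node saw the service as up).

```python
from itertools import cycle, islice
from typing import Callable, Iterator, List

NODES_PER_CHECK = 3        # each job goes to three satellites
FAILURES_FOR_OFFLINE = 2   # 2+ negative responses means offline


def pick_nodes(rotation: Iterator[str], count: int = NODES_PER_CHECK) -> List[str]:
    """Take the next `count` satellites from a shared round-robin rotation."""
    return list(islice(rotation, count))


def is_offline(results: List[bool], threshold: int = FAILURES_FOR_OFFLINE) -> bool:
    """Apply the vote: offline once enough satellites report a failure."""
    failures = sum(1 for ok in results if not ok)
    return failures >= threshold


def run_check(rotation: Iterator[str], probe: Callable[[str], bool]) -> bool:
    """Send one job to the next satellites and return True if the service is offline."""
    nodes = pick_nodes(rotation)
    return is_offline([probe(node) for node in nodes])


# Hypothetical pool of four satellites shared by all jobs.
satellites = cycle(["sat-a", "sat-b", "sat-c", "sat-d"])
```

With four satellites in the pool, the first job would go to sat-a, sat-b, and sat-c, and the next one to sat-d, sat-a, and sat-b, so the load spreads evenly. One open question this leaves out is what to do when a satellite itself is unreachable: counting that as a negative response could produce false alarms.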
So far I've only started development on the satellite nodes' stack/API -- I'm holding off on the main node until the interface is in place. The satellite nodes will be firewalled to only allow jobs from the main node's IP and use an nginx + gunicorn + Flask stack. I'm not planning on monetizing this project at all and will post the code on GitHub as I make progress, though I was planning on standing up a free service and giving each user a quota of 50ish services.
Any thoughts and recommendations are welcome.