
Zabbix Tutorials

SplitIce Member, Host Rep
edited November 2013 in Tutorials

I have been posting these to VPSB so I suppose I should CC LET :P

--

#1 Scalability

--

This is the first of what I hope will be a series of posts on the topic of Zabbix. For those not in the know, Zabbix is a free and open source piece of monitoring software with all the features of an enterprise solution. All this advice should be taken with a grain of salt and is based on our experiences over the past year. Your mileage may vary.

When should I consider the scalability of my monitoring system?

Well, ideally you should plan for the future from the start, but since this rarely happens you should begin planning no later than 100-200 values per second. Performance depends on many factors, including the number of checks you are performing on proxies (or agents) versus the number of simple checks. We use many simple checks, so we hit our first issues at 100 items/s.
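To get a rough idea of where you sit relative to that threshold, you can add up what each group of items contributes: an item polled every N seconds produces 1/N values per second. The sketch below does that arithmetic. The item counts and intervals are made-up numbers for illustration, not our real workload.

```python
def new_values_per_second(item_groups):
    """Estimate the values per second the server has to process.

    item_groups is an iterable of (item_count, interval_seconds) pairs.
    Each item contributes 1 / interval values per second.
    """
    return sum(count / interval for count, interval in item_groups)


# Hypothetical workload: 3000 items polled every 60s, 600 items every 10s.
example = [(3000, 60), (600, 10)]
print(new_values_per_second(example))  # 50 + 60 = 110.0
```

At 110 values per second this hypothetical install would already be inside the range where planning should start.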

What hardware should I be looking at?

The best thing you can do for this software is to ensure its database is stored on an SSD. This alone will increase your performance more than you would believe. We use a 60GB plan from DigitalOcean and have been very impressed with the IOPS.
 
At around 150-200 items a second you should hit a point where the housekeeper cannot delete records fast enough while competing with the inserts for the same mutexes (lock contention). At this point you will need to introduce partitioning on your history tables. You can either partition and keep the existing housekeeper, or write your own housekeeper that works by dropping whole partitions. If you have items with varying history storage periods you will most likely need to choose the first solution.
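For the second option (a housekeeper that drops partitions), here is a minimal sketch of the idea. It is not what we run: it assumes a RANGE-partitioned table with one partition per day, named pYYYYMMDD after the day it holds. The naming scheme, the function name and the 30 day retention are all assumptions for illustration. It only builds the SQL, it does not execute anything.

```python
from datetime import date, datetime, timedelta


def drop_statements(table, partition_names, today, keep_days):
    """Build ALTER TABLE statements for daily partitions older than keep_days.

    Assumes partitions are named pYYYYMMDD after the day whose rows they hold
    (a hypothetical naming scheme, not something Zabbix creates for you).
    """
    cutoff = today - timedelta(days=keep_days)
    statements = []
    for name in sorted(partition_names):
        day = datetime.strptime(name, "p%Y%m%d").date()
        if day < cutoff:
            statements.append(f"ALTER TABLE `{table}` DROP PARTITION {name};")
    return statements


# Hypothetical partitions on history_uint with 30 days of retention.
for stmt in drop_statements(
    "history_uint",
    ["p20131001", "p20131020", "p20131101"],
    today=date(2013, 11, 5),
    keep_days=30,
):
    print(stmt)
```

Dropping a partition removes all its rows at once instead of deleting them row by row, which is why this only works when every item on the table shares the same storage period.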
 
An example of what we use is below:

CREATE TABLE `history_uint` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` bigint(20) unsigned NOT NULL DEFAULT '0',
  `ns` int(11) NOT NULL DEFAULT '0',
  KEY `history_uint_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (clock DIV 86400)
PARTITIONS 40 */

40 partitions were chosen as most of our data is stored for 30-40 days. Since each day of data lands in its own partition, this ensures data is always being inserted far away from where the housekeeping process is purging rows. You should partition all the tables you use extensively, including trends and events as applicable.
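If the mapping is not obvious: MySQL HASH partitioning places a row in partition number MOD(expression, partition count), so with the definition above every day of data goes to partition (clock DIV 86400) MOD 40. The sketch below repeats that arithmetic in Python so you can see which partition is being written to and which one a 30 day purge is touching. The timestamp is just an example (2013-11-01 00:00 UTC).

```python
SECONDS_PER_DAY = 86400
PARTITIONS = 40


def partition_for(clock, partitions=PARTITIONS):
    """Partition number MySQL picks for HASH (clock DIV 86400)."""
    return (clock // SECONDS_PER_DAY) % partitions


now = 1383264000                          # example: 2013-11-01 00:00:00 UTC
purge_edge = now - 30 * SECONDS_PER_DAY   # oldest rows kept with 30 days of history

print(partition_for(now))         # partition receiving inserts: 10
print(partition_for(purge_edge))  # partition being purged: 20
```

Note that with a retention of exactly 40 days the two numbers would be the same, so if your longest storage period is right at 40 days a slightly higher partition count is probably the safer choice.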
 
InnoDB does not reclaim unused table space, so creating too many partitions will result in wasted disk space (I assume that, like any sane person, you are using file per table).
 
Be aware that adding partitioning will take many hours on multi-GB tables. Factor this into your plan if applicable.
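For tables that already exist, the same scheme can be applied with ALTER TABLE rather than recreating the table. A small sketch that generates the statements follows. The table list is only an example, so adjust it to the tables you actually use heavily. Keep in mind MySQL requires every primary or unique key on a partitioned table to include the partitioning column, so a table whose unique keys do not contain clock cannot be converted as-is.

```python
def partition_statements(tables, partitions=40):
    """Build ALTER TABLE statements applying the daily HASH scheme."""
    return [
        f"ALTER TABLE `{table}` PARTITION BY HASH (clock DIV 86400) "
        f"PARTITIONS {partitions};"
        for table in tables
    ]


# Example table list (an assumption, pick your own).
for stmt in partition_statements(["history", "history_uint", "trends", "trends_uint"]):
    print(stmt)
```

Each of these statements rebuilds the whole table, which is where the hours go, so running them one at a time during a quiet period is probably wise.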

What software should I be looking at?

If possible check out the Zabbix 2.1 (or soon 2.2) branch. It's currently in beta but the performance improvements are exceptional. From our experience 2.1 (Beta 2) is bug free, or at least the features we are using are.
 
For maximum performance we run Percona MySQL 5.5.
If you heavily utilize the API, ensure that you have an opcode cache such as APC set up.

So how far can this scale?

Who knows? This new setup has us sitting with a load average of below 1.0. The old setup was over 15 (i7, 4GB RAM, 2x500GB RAID 1 spinning rust bucket) and well and truly overloaded.
 
 

Comments

  • SplitIce Member, Host Rep

    #0 - An Introduction to Zabbix

    What is Zabbix?

    Free, enterprise quality monitoring software. Based on the Server / Agent model (with optional Proxy), it can operate in either Passive (the server connects to the agent) or Active (the agent connects to the server) mode. It also integrates with SNMP, IPMI and JMX for legacy applications and for cases where it is not possible to use the agent.

    Why Zabbix and how does it compare to Cacti or Nagios?

    Zabbix includes all the features of Cacti and Nagios in one package, plus a lot more. I am yet to see something I could do with Cacti that I am unable to do with Zabbix.

    In addition to this what does Zabbix offer?

    • Auto Discovery (Low Level and Network) - You can define custom scripts to automatically set up hosts and the items / triggers / graphs inside each host based on the output from those scripts. No need to integrate with the API and manage hosts through software (although that is possible too).

    • Web Checks - Perform browser actions as a sequence of steps to test your website. Useful features include extraction of content from the page into variables and validation of web page steps via regex.

    • Interface - A VERY good interface

    • Visualization - Can visualize any monitored item, no need to create a graph for one-off queries

    • API - Full API for interacting with the data stored in the MySQL (or PostgreSQL) database. This can be used to create hosts, graphs, items or triggers as well.

    • Define complex template hierarchies with inheritance that you implement with your hosts. Never duplicate work setting up hosts.

    • Screens - Define custom screens from the interface that display only the data you want to see for specific tasks.

    • Variable monitoring interval per item

    • Complex trigger evaluation. Define triggers in a high level language (can feature data from multiple triggers, multiple hosts etc). Really flexible. Can include flapping detection etc (not automatic with Zabbix).

    • Real-time SLA reporting with custom fault trees.

    • And much more that I can't think of at the time of writing....
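To give a feel for the API mentioned in the list above: it speaks JSON-RPC 2.0 over HTTP against api_jsonrpc.php in the frontend directory. The sketch below only builds a request body for host.get and does not send anything. The token value is a placeholder, and the auth-in-body style shown here is how the 2.x API works, so check the documentation for your version.

```python
import json


def build_request(method, params, auth=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for the Zabbix API."""
    payload = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }
    if auth is not None:
        # Session token previously returned by user.login (placeholder here).
        payload["auth"] = auth
    return json.dumps(payload)


# Ask for all properties of all hosts visible to the (placeholder) session.
body = build_request("host.get", {"output": "extend"}, auth="PLACEHOLDER_TOKEN")
print(body)
```

You would POST this body with a Content-Type of application/json-rpc using whatever HTTP client you prefer.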

    What options for extending Zabbix are there?

    Once you get to play with it you will come to realize that almost anything can be done using external scripts as checks. The need to extend Zabbix (except where performance is needed) is rare. 2.1.x / 2.2.x includes a modular plugin architecture for defining custom checks. This could be used, for example, to check the uptime of a Minecraft server or to get the number of players for use in an item. The same could easily be done with a script called through a custom user parameter or a system.run check, although it depends on your performance requirements.
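As a rough idea of what the script route looks like, here is a minimal check that prints 1 if a TCP port accepts connections and 0 otherwise. It is only a sketch of the shape of such a script: Zabbix already has a built-in check for this, and a real Minecraft player count would need to speak the game's own protocol. The script path and item key in the comment are made up.

```python
import socket
import sys


def port_is_open(host, port, timeout=3.0):
    """Return 1 if a TCP connection to host:port succeeds, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        return 0


# Hypothetical agent config line wiring this in as a user parameter:
#   UserParameter=custom.tcp.open[*],/usr/local/bin/check_port.py $1 $2
if __name__ == "__main__" and len(sys.argv) == 3:
    # Zabbix reads the item value from standard output.
    print(port_is_open(sys.argv[1], int(sys.argv[2])))
```

An item with the key custom.tcp.open[127.0.0.1,25565] would then report whether the default Minecraft port is reachable from the agent. Every poll forks a new process, which is the performance cost mentioned above.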

    I will cover the development modules later in this series I hope.

    NOTE: Nagios here refers to Nagios Core; Nagios XI is pretty damn expensive.
