Pandora: Documentation en: Capacity planning


Go back to Pandora FMS documentation index

1 Capacity study

1.1 Introduction

Pandora FMS is a fairly complex distributed application with several key elements, any of which can become a bottleneck if it is not sized and configured correctly. The main aim of this study is to detail the scalability of Pandora FMS with regard to a specific set of parameters, so that you can know the requirements it would have to reach a given capacity.

Load tests were carried out in a first phase, aimed at a cluster-based system with a single Pandora FMS server working against a centralized database cluster. The load tests are also useful to observe the maximum capacity per server. In the current architecture model (v3.0 or later), with N independent servers and one Metaconsole, scalability tends to be linear, whereas scalability based on centralized models does not (it follows a curve like the one shown in the following graph).

Estudio vol0.png

1.1.1 Data Storage and Compaction

The fact that Pandora FMS compacts data in real time is very important when calculating the size that the data will occupy. An initial study was done comparing the way a classic system stores data with the "asynchronous" way Pandora FMS stores data. This can be seen in the schema included in this section.

Estudio vol01.png

In a conventional system

For one check, with an average of 20 repetitions per day, we have a total of 5 MB per year of occupied space. With 50 checks per agent, this means 250 MB per year per agent.

In a non-conventional, asynchronous system such as Pandora FMS

For one check, with an average of 0.1 variations per day, we have a total of 12.3 KB per year of occupied space. With 50 checks per agent, this results in 615 KB per year per agent.
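
As a quick sketch of how these per-check figures scale, the per-agent yearly totals above can simply be multiplied by the number of agents (the 1000-agent figure below is just an illustrative assumption):

#!/bin/bash
# Scale the per-agent yearly storage figures above to a whole deployment.
# 250 MB/year per agent (conventional) vs. 615 KB/year per agent (Pandora FMS).
AGENTS=1000   # hypothetical deployment size
echo "Conventional storage:  $(( AGENTS * 250 / 1024 )) GB per year"   # ~244 GB
echo "Pandora FMS (async):   $(( AGENTS * 615 / 1024 )) MB per year"   # ~600 MB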

1.1.2 Specific Terminology

A glossary of terms specific to this study is described next, for a better understanding.

  • Fragmentation of the information: the information managed by Pandora FMS can behave in different ways: it can change constantly (e.g. a CPU usage meter) or be very static (for example, the state of a service). Since Pandora FMS exploits this to "compact" the information in the database, it is a critical factor for performance and for the capacity study: the more fragmentation, the more database space and the more processing capacity are needed to handle the same information.
  • Module: the basic piece of information collected for monitoring. In some environments it is known as an Event.
  • Interval: the amount of time that passes between two collections of information for a module.
  • Alert: the notification that Pandora FMS executes when a piece of data goes out of the defined margins or changes its state to CRITICAL or WARNING.

1.2 Example of Capacity Study

1.2.1 Definition of the Scope

The study has been designed around a deployment divided into three main stages:

  • Stage 1: Deployment of 500 agents.
  • Stage 2: Deployment of 3000 agents.
  • Stage 3: Deployment of 6000 agents.

In order to determine exactly the requirements of Pandora FMS in deployments of this data volume, you need to know very well what kind of monitoring you want to do. For the following study we have specifically taken into account the environment characteristics of a fictitious customer named "QUASAR TECNOLOGIES", which can be summarized in the following points:

  • Monitoring based 90% on software agents.
  • Homogeneous systems with a series of characteristics grouped into technologies/policies.
  • Very variable intervals between the different modules/events to monitor.
  • A large quantity of asynchronous information (events, log elements).
  • A lot of information about process states with a low probability of change.
  • Little performance information compared to the total.

After doing an exhaustive study of all the technologies and determining the scope of the implementation (identifying the systems and their monitoring profiles), we reached the following conclusions:

  • There is an average of 40 modules/events per system.
  • The average monitoring interval is 1200 seconds (20 min).
  • There are modules that report information every 5 minutes and modules that do it once per week.
  • From the whole set of modules (240,000), it has been determined that the probability of change of each event for each sample is 25%.
  • It has been determined that the alert rate per module is 1.3 (that is, 1.3 alerts per module/event).
  • It is estimated (in this case based on our experience) that each alert has a 1% probability of being fired.

These conclusions are the basis for preparing the estimation, and they are encoded in the spreadsheet used to do the study:

Estudio vol1.png

With these starting data, and applying the necessary calculations, we can estimate the database size, the number of modules per second that must be processed and other essential parameters:


Estudio vol2.png
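
The following shell sketch reproduces the basic arithmetic for stage 3 (6000 agents), using only the figures stated above; the actual spreadsheet may apply additional correction factors:

#!/bin/bash
# Rough stage 3 estimates from the conclusions above (assumed inputs:
# 6000 agents, 40 modules/agent, 1200 s average interval, 1.3 alerts/module).
AGENTS=6000
MODULES_PER_AGENT=40
AVG_INTERVAL=1200    # seconds

TOTAL_MODULES=$(( AGENTS * MODULES_PER_AGENT ))              # 240000
MODULES_PER_SECOND=$(( TOTAL_MODULES / AVG_INTERVAL ))       # 200
SAMPLES_PER_DAY=$(( TOTAL_MODULES * 86400 / AVG_INTERVAL ))  # 17280000
TOTAL_ALERTS=$(( TOTAL_MODULES * 13 / 10 ))                  # 312000 (1.3 per module)

echo "Total modules:      $TOTAL_MODULES"
echo "Modules per second: $MODULES_PER_SECOND"
echo "Samples per day:    $SAMPLES_PER_DAY"
echo "Alerts defined:     $TOTAL_ALERTS"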

1.2.2 Capacity Study

Once we know the basic requirements for the implementation in each stage (modules/second rate, total number of alerts, modules per day and MB/month), we are going to do a real stress test on a server quite similar to the production systems (it was not possible to run the test on hardware identical to the production systems).

These stress tests will tell us the processing capacity of a single Pandora FMS server and its degradation level over time. This is useful for the following aims:

  1. Through an extrapolation, to know whether the final volume of the project is feasible with the hardware assigned to it.
  2. To know the "online" storage limits and the breakpoints from which the information should be moved to the history database.
  3. To know the response margins for processing peaks caused by problems that may appear (service stops, planned downtime), during which the information waiting to be processed piles up.
  4. To know the impact on performance derived from the different quality (% of change) of the monitoring information.
  5. To know the impact of alert processing on large volumes.

The tests have been done on a DELL PowerEdge T100 server with a 2.4 GHz Intel Core Duo processor and 2 GB of RAM. This server, running Ubuntu Server 8.04, has been the basis of our study for the tests on High Availability environments. The tests have been done on agent configurations quite similar to those of the QUASAR TECHNOLOGIES project; since we could not have the same hardware available, we replicated a high availability environment similar to the QUASAR TECHNOLOGIES one, in order to evaluate the impact on performance as time goes on and to detect other problems (mainly usability ones) derived from managing big data volumes.

Estudio vol3.png

The obtained results are very positive: the system, though heavily overloaded, was able to process a quite remarkable volume of information (180,000 modules, 6000 agents, 120,000 alerts). The conclusions obtained from this study are:

1. You should move the "real time" information to the history database within a maximum of 15 days, ideally moving any data older than one week. This guarantees a quicker operation.

2. The maneuvering margin in the best case is nearly 50% of the processing capacity, higher than expected considering this volume of information.

3. The fragmentation rate of the information is vital to determine the performance and the necessary capacity for the environment where we want to deploy the system.

1.3 Methodology in detail

The previous chapter was a "quick" study based only on modules of the "dataserver" type. In this section we present a more complete way of analyzing Pandora FMS capacity.

As a starting point, in all cases we assume the worst-case scenario whenever we can choose. When we cannot choose, we fall back to the "common case" philosophy. Nothing is ever estimated for the "best case", because that philosophy does not work.

Next we are going to see how to calculate the system capacity, by monitoring type or based on the origin of the information.


1.3.1 Data Server

Based on the achievement of certain targets, as we have seen in the previous point, we assume that the estimated target is to see how the system behaves with a load of 100,000 modules, distributed among a total of 3000 agents, that is, an average of 33 modules per agent.

A pandora_xmlstress task will be created (executed through cron or a manual script) with 33 modules, distributed with a configuration similar to this one:

  • 1 module of type string.
  • 17 modules of type generic_proc.
  • 15 modules of type generic_data.

We will configure the 17 modules of generic_proc type this way:

module_begin
module_name Process Status X
module_type generic_proc
module_description Status of my super-important daemon / service / process
module_exec type=RANDOM;variation=1;min=0;max=100
module_end

In the 15 modules of generic_data type, we need to define thresholds. The procedure is as follows:

First, we configure the 15 modules of generic_data type so that they generate data like this:

module_exec type=SCATTER;prob=20;avg=10;min=0;max=100

Then, we configure the thresholds for these 15 modules so that they follow this pattern:


0-49 normal
50-74 warning
75 or above critical

We add to the configuration file of our pandora_xml_stress some new tokens, so that the thresholds can be defined from the generated XML. PLEASE NOTE that Pandora FMS only "adopts" the threshold definition when the module is created, not when it is updated with new data.


module_min_critical 75
module_min_warning 50
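
Putting the pieces together, each of the 15 generic_data modules in the pandora_xml_stress configuration would look roughly like this (a sketch; the module name and description are just examples):

module_begin
module_name Scattered Value X
module_type generic_data
module_description Numeric value with a ~20% probability of variation
module_exec type=SCATTER;prob=20;avg=10;min=0;max=100
module_min_warning 50
module_min_critical 75
module_end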

We run pandora_xml_stress.

We should let it run for at least 48 hours without any kind of interruption, and we should monitor (with a Pandora FMS agent) the following parameters:

Nº of queued packages:

 find /var/spool/pandora/data_in | wc -l

pandora_server CPU:

 ps aux | grep "/usr/bin/pandora_server" | grep -v grep | awk '{print $3}'

pandora_server Total Memory:

  ps aux | grep "/usr/bin/pandora_server" | grep -v grep | awk '{print $4}'

mysqld CPU (check the syntax of the command; it depends on the MySQL distribution):

 ps aux | grep "sbin/mysqld" | grep -v grep | awk '{print $3}'

Pandora FMS database average response time:

 /usr/share/pandora_server/util/pandora_database_check.pl /etc/pandora/pandora_server.conf
Nº of monitors in unknown status:


 echo "select SUM(unknown_count) FROM tagente;" | mysql -u pandora -pxxxxxx -D pandora | tail -1


(where xxxxxx is written, use the password of the database user "pandora")
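
A minimal sketch of how some of these checks could be wrapped as modules in the configuration of the monitoring Pandora FMS agent (the module names are just examples; adjust paths and the database password to your environment):

# Queued XML data files waiting to be processed
module_begin
module_name Queued_XML_Files
module_type generic_data
module_exec find /var/spool/pandora/data_in | wc -l
module_end

# pandora_server CPU usage (%)
module_begin
module_name Pandora_Server_CPU
module_type generic_data
module_exec ps aux | grep "/usr/bin/pandora_server" | grep -v grep | awk '{print $3}'
module_end

# mysqld CPU usage (%)
module_begin
module_name MySQL_CPU
module_type generic_data
module_exec ps aux | grep "sbin/mysqld" | grep -v grep | awk '{print $3}'
module_end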

The first executions should be used to "tune" the server and the MySQL configuration.

We use the script /usr/share/pandora_server/util/pandora_count.sh to measure (while there are XML files pending to be processed) the package processing rate. The aim is to ensure that all the packages generated (3000) can be processed within an interval below 80% of the limit time (5 minutes). This implies that 3000 packages should be processed in 4 minutes, so:


3000 / (4 x 60) = 12.5

We should reach a processing rate of at least 12.5 packages per second to be reasonably sure that Pandora FMS can process this information.
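
If pandora_count.sh is not at hand, a minimal sketch to approximate the processing rate is to sample the spool directory twice (this measures the net queue drain, so it underestimates the real rate while the generator is still adding files):

#!/bin/bash
# Approximate the XML processing rate (files/second) by sampling the
# incoming spool directory twice. Assumes the default data_in path.
SPOOL=/var/spool/pandora/data_in
INTERVAL=60

BEFORE=$(find "$SPOOL" -type f | wc -l)
sleep "$INTERVAL"
AFTER=$(find "$SPOOL" -type f | wc -l)

echo "Net processing rate: $(( (BEFORE - AFTER) / INTERVAL )) files/second"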

Things to work on: Nº of threads, maximum Nº of items in the intermediate queue (max_queue_files) and, of course, all the applicable MySQL parameters (very important).

Just one comment about the importance of this: a Pandora FMS installed "by default" on a powerful Linux machine may not exceed 5-6 packages per second, while the same powerful machine, well "optimized" and "tuned", can perfectly reach 30-40 packages per second. It also depends a lot on the number of modules in each agent.

Then we configure the system so that the database maintenance script at /usr/share/pandora_server/util/pandora_db.pl is executed every hour instead of every day:


mv /etc/cron.daily/pandora_db /etc/cron.hourly

We leave the system working, with the package generator, for a minimum of 48 hours. Once this time has passed, we evaluate the following points:

  1. Is the system stable? Did it crash? If there are problems, check the logs and the graphs of the metrics we have collected (mainly memory).
  2. Evaluate the trend over time of the metric "Nº of monitors in unknown status". There should be no significant trends or peaks. They should be the exception: if they happen with a regularity of one hour, it is because there are problems with the concurrency of the database management process.
  3. Evaluate the metric "Pandora FMS database average response time". It should not increase over time but remain constant.
  4. Evaluate the metric "pandora_server CPU": it may have many peaks, but with a constant trend, not a rising one.
  5. Evaluate the metric "MySQL server CPU": it should be constant with many peaks, but with a constant trend, not a rising one.

1.3.1.1 Evaluation of the Alert Impact

If everything went well, we should now evaluate the performance impact of alert execution.

We apply one alert to five specific modules of each agent (of generic_data type), for the CRITICAL condition. Use something lightweight, like creating an event or writing to syslog (to avoid the impact that a high-latency action, such as sending an email, would have).
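
As a sketch of such a lightweight alert action, the alert command could simply write a line to syslog; _agent_ and _module_ are the usual Pandora FMS alert macros (treat the exact macro names as an assumption to verify in your version):

# Example alert command: write a line to syslog with the logger utility.
# _agent_ and _module_ are Pandora FMS alert macros (assumed names).
logger -t pandora_alert "CRITICAL: _agent_ / _module_ is in critical status"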

We can optionally create one event correlation alert to generate one alert for any critical condition of any agent with one of these five modules.

We leave the system operating for 12 hours with this setup and evaluate the impact, following the previous criteria.


1.3.1.2 Evaluating the Purging/Transfer of Data

Supposing that the data storage policy is:

  • Deletion of events older than 72 hours.
  • Moving data older than 7 days to the history database.

We should leave the system working on its own for at least 10 days to evaluate the long-term performance. We might see a "peak" 7 days later, due to the moving of data to the history database. This degradation is IMPORTANT to consider. If you cannot have that much time available, you can replicate it (with less "realism") by changing the purging interval to 2 days for events and 2 days for moving data to the history database, and evaluate that impact.

1.3.2 ICMP Server (Enterprise)

Here we talk specifically about the ICMP Enterprise network server. If you are doing the tests for the open network server, please see the corresponding section of the (generic) network server.

Assuming that you already have the server working and configured, we are going to explain some key parameters for its performance:

block_size X

It defines the number of "pings" that the system will do for any execution. If the majority of pings are going to take the same time, you can increase the number to considerably high numberm i.e: 50 or 70


If, on the contrary, the park of ping modules is heterogeneous and they are in very different networks, with different latency times, it is not convenient to set a high number, because the test will take as long as the slowest ping takes; in that case use a fairly low number, such as 15 or 20.


icmp_threads X 

Obviously, the more threads it has, the more checks it can execute. If you add up all the threads that Pandora FMS executes, they should not exceed 30-40. You should not use more than 10 threads here, though it depends a lot on the kind of hardware and Linux version that you are using.
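
A minimal sketch of how these parameters might look in pandora_server.conf (the icmpserver token for enabling the Enterprise ICMP server is an assumption; check the token names against your server version):

# /etc/pandora/pandora_server.conf (fragment, assumed token names)
icmpserver 1        # enable the Enterprise ICMP server
block_size 20       # pings per execution (use a lower value for heterogeneous networks)
icmp_threads 10     # keep the total thread count of the server moderate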

Now, we should "create" a fictitious number of modules ping type to test. We assume that you are going to test a total of 3000 modules of ping type. To do this, the best option is to choose a system in the network that would be able to support all pings (any Linux server would do it)

Using the Pandora FMS CSV importer (available in the Enterprise version), create a file with the following format:


(Agent name, IP, os_id, Interval, Group_id)

You can use this shell script to generate the file (changing the destination IP and the group ID, and redirecting the output to a file):

#!/bin/bash
# Generate 3000 CSV lines with the format: agent name, IP, os_id, interval, group_id.
# Redirect the output to a file and import it with the CSV importer.
A=3000
while [ $A -gt 0 ]
do
	echo "AGENT_$A,192.168.50.1,1,300,10"
	A=`expr $A - 1`
done


Before doing anything else, we should have Pandora FMS itself monitored, measuring the metrics we saw in the previous point: CPU consumption (pandora_server and mysqld), Nº of modules in unknown status and other interesting monitors.

We import the CSV to create the 3000 agents (it will take a few minutes). Then we go to the first agent (AGENT_3000) and create a module of type PING.

Then we go to the massive operations tool and copy that module to the other 2999 agents.

Pandora FMS should then start processing those modules. We measure with the same metrics as in the previous case and see how it goes. The objective is to end up with a workable system for the required number of ICMP modules without any of them reaching unknown status.

1.3.3 SNMP Server (Enterprise)

We are going to look here at the SNMP Enterprise network server. If you are doing the tests for the open network server, please see the corresponding section for the (generic) network server.

Assuming that you already have the server working and configured, we are going to explain some key parameters for its operation:

block_size X

It defines the number of SNMP requests the system will do per execution. Keep in mind that the server groups them by destination IP, so this block size is only indicative. It is recommended that it not be large (30-40 maximum). When an item of the block fails, an internal counter makes the Enterprise server retry it, and if after x attempts it still does not work, it will be passed to the open server.

snmp_threads X 

Obviously, the more threads it has, the more checks it can execute. If you add up all the threads that Pandora FMS executes, they should not reach 30-40. You should not use more than 10 threads here, though it depends on the kind of hardware and Linux version that you use.

The Enterprise SNMP server does not support version 3. Modules using SNMP v3 will be executed by the open server.

The fastest way to test is with a real SNMP device, applying to all of its interfaces the whole series of "basic" monitoring modules. This is done through the SNMP Explorer (Agent -> Administration mode -> SNMP Explorer). Identify the interfaces and apply all the metrics to each interface. On a 24-port switch, this generates 650 modules.

If you create another agent with a different name but the same IP, you will get another 650 modules. Another option is to copy all the modules to a series of agents that all have the same IP, so that the copied modules work against the same switch.

Another option is to use an SNMP emulator, such as the Jalasoft SNMP Device Simulator.

The objective of this point is to monitor an SNMP module pool in a constant way for at least 48 hours, monitoring the infrastructure to make sure that the modules/second monitoring rate is constant and that there are no periods of time where the server leaves modules in unknown status. This situation can occur because of:

  • Lack of resources (memory, CPU). It would be possible to see a continuously rising trend in these metrics, which is a bad sign.
  • Occasional problems: the daily restart of the server (for log rotation), the execution of the scheduled database maintenance, or other scripts run on the server or on the database server.
  • Network problems, due to unrelated processes (e.g. the backup of a server on the network) that affect the speed/availability of the network.

1.3.4 Plugins, Network (open) and HTTP Server

The same concept as above applies here, but in a more simplified way. You should check:

  • Nº of threads.
  • Timeouts (to calculate the impact in the worst case).
  • Average check time.

Scale a test group using these data and check that the server capacity remains constant over time.

1.3.5 Traps Reception

Here the case is simpler: we assume that the system will not receive traps constantly; instead, the aim is to evaluate the response to a trap flood, some of whose traps will generate alerts.

To do this, you only need a simple script that generates traps in a controlled way and at high speed:

#!/bin/bash
TARGET=192.168.1.1
while [ 1 ]
do
   snmptrap -v 1 -c public $TARGET .1.3.6.1.4.1.2789.2005 192.168.5.2 6 666 1233433 .1.3.6.1.4.1.2789.2005.1 s "$RANDOM"
done

NOTE: stop it with CTRL-C after a few seconds, as it will generate hundreds of traps in a few seconds.

Once the environment is set up, we need to validate the following things:

  1. Trap injection at a constant rate (just add a sleep 1 inside the while loop of the previous script to generate 1 trap/sec). Leave the system operating for 48 hours and evaluate the impact on the server.
  2. Trap storm. Evaluate the moments before and during a trap storm, and the recovery after it.
  3. Effects on the system of a huge trap table (>50,000 entries). This includes the effect of running the database maintenance.

1.3.6 Events

In a similar way to SNMP traps, we will evaluate the system in two scenarios:


1. Normal rate of event reception. This has already been tested in the data server, since an event is generated with each status change.

2. Event generation storm. To do this, we force the generation of events via the CLI, using the following command:

/usr/share/pandora_server/util/pandora_manage.pl /etc/pandora/pandora_server.conf --create_event "Prueba de evento" system Pruebas

Note: this assumes that a group named "Pruebas" exists.

This command, used in a loop like the one used to generate traps, can generate tens of events per second. It can be parallelized in one script with several instances to obtain a higher number of insertions. This is useful to simulate the behaviour of the system during an event storm, so we can check the system before, during and after the event storm.
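
A minimal sketch of such a loop (stop it with CTRL-C; the event text and the "Pruebas" group are just the examples used above):

#!/bin/bash
# Generate events in a tight loop through the Pandora FMS CLI.
# Run several instances in parallel to increase the insertion rate.
CONF=/etc/pandora/pandora_server.conf
while true
do
   /usr/share/pandora_server/util/pandora_manage.pl "$CONF" \
       --create_event "Event storm test $RANDOM" system Pruebas
done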

1.3.7 User Concurrency

For this, we should use another server, independent of Pandora FMS, using the WEB monitoring functionality. In a user session we carry out the following tasks in this order and measure how long they take:

  1. Login into the console.
  2. See events.
  3. Go to the group view.
  4. Go to the agent detail view.
  5. Visualize a report (in HTML). The report should contain a couple of graphs and a couple of modules with report type SUM or AVERAGE. The interval of each item should be one week or five days.
  6. Visualize a combined graph (24 hours).
  7. Generate a report in PDF (a different report).

This test is done with at least three different users. The task can be parallelized so that it executes every minute; this way, with 5 tasks (each one with its own user) we would be simulating the navigation of 5 simultaneous users. Once the environment is set up, we should consider the following:

  1. The average speed of each step is relevant to identify "bottlenecks" related to other parallel activities, such as the execution of the maintenance script, etc.
  2. The CPU/memory impact on the server will be measured for each concurrent session.
  3. The impact of each simulated user session on the average time of the rest of the sessions will be measured. That is, you should estimate how many seconds of delay each additional simultaneous session adds.


Go back to Pandora FMS documentation index