Pandora: Documentation en: Intro Monitoring
- 1 Introduction to Monitoring
- 2 Monitoring by Software Agent vs. Remote Monitoring
- 3 Agents on Pandora FMS
- 4 Status Monitoring
- 5 Other Common Monitoring Parameters
1 Introduction to Monitoring
All user interaction with Pandora FMS takes place through the web console. Because the console runs in a browser, no heavy client applications need to be installed, and the system can be managed from any computer with a browser.
Monitoring is the execution of processes on all types of systems to collect and store information, take action and make decisions based on such data.
Pandora FMS is a scalable monitoring system that has multiple functionalities to extend the scope and volume of information collected practically without limits.
2 Monitoring by Software Agent vs. Remote Monitoring
Monitoring can be divided into two large groups based on how we collect the information: monitoring based on software agents and remote monitoring.
Agent-based monitoring consists of installing a small piece of software that keeps running on the system and obtains information locally by executing commands and scripts.
Remote monitoring is the use of the network to run remote checks on systems, without the need to install any additional components on the equipment to be monitored.
In short, agent-based monitoring obtains information through local checks, while remote monitoring obtains it through network checks run from the Pandora FMS server.
With Pandora FMS, monitoring can be carried out either way, or both ways combined, which results in mixed monitoring.
3 Agents on Pandora FMS
All monitoring done by Pandora FMS is managed through a generic entity called 'Agent', which belongs to a broader block called 'Group'. Each agent corresponds to one of the computers, devices, websites or applications being monitored.
The agents defined in the Pandora FMS console can present local information gathered through a software agent, remote information collected through network checks, or both. Therefore, it is worth highlighting the difference between agents as an organizational unit in the Pandora FMS console, and software agents as local data collection services.
4 Status Monitoring
When monitoring, we obtain values from a system, be it memory, CPU, chassis temperature, number of connected users, orders on an e-commerce website or any other numerical value. Sometimes we are only interested in the data, but generally we want to associate a STATUS with these values, so that when they cross a "THRESHOLD" the status changes and tells us whether something is right or wrong. That is why any discussion of monitoring has to introduce the concept of STATUS.
Pandora FMS allows you to define thresholds that determine the status of a check based on the data it returns. The three possible statuses are NORMAL, WARNING and CRITICAL. A threshold is the value at which a module moves from one status to another. The status of a module depends on these thresholds, which are specified by the following parameters in the configuration of each module:
- Warning status - Min. Max.: lower and upper limits for the warning status. If the numerical value of the module is in this range, the module will go to warning status. If no upper limit is specified, it will be infinite (all values above the lower limit).
- Warning status - Str.: regular expression for alphanumeric modules (string). If any matches are found, the module will go to warning status.
- Critical status - Min. Max.: lower and upper limits for the critical status. If the numerical value of the module is in this range, the module will go to critical status. If no upper limit is specified, it will be infinite (all values above the lower limit).
- Critical status - Str.: regular expression for alphanumeric modules (string). If any matches are found, the module will go to critical status.
- Inverse interval: present for both the warning and critical thresholds. If enabled, the module will change status when its values are outside the range specified in the thresholds. It also works for alphanumeric modules (string): if the text strings do NOT match the Warning/Critical Str., the module will change its status.
If the "warning" and "critical" thresholds overlap in any range, the "critical" threshold always prevails.
4.1 Numerical thresholds - Case study 1
We have a CPU usage percentage module that will always be green in the agent status, as it simply reports a value between 0% and 100%. If we want the CPU usage module to go into warning status (yellow) when it reaches 70% and into critical status (red) when it reaches 90%, we must configure the thresholds as follows:
- Warning status Min.: 70
- Critical status Min.: 90
Thus, when the value reaches 90 the module will appear in red (CRITICAL), between 70 and 89.99 it will appear in yellow (WARNING), and below 70 in green (NORMAL).
Because of the way thresholds work, it is not necessary to set the upper limits in cases such as this. If only the lower limit is set, the upper limit is treated as "no limit", so any value above the lower limit falls within the threshold. In addition, where the thresholds overlap, the critical threshold prevails over the warning threshold, resulting in the graph of thresholds shown in the previous screenshot.
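The threshold rules described in this section can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not Pandora FMS code: the function names are hypothetical, and treating both range limits as inclusive is an assumption.

```python
from typing import Optional


def threshold_hit(value: float, low: Optional[float],
                  high: Optional[float], inverse: bool = False) -> bool:
    """Return True when the value triggers the threshold.

    A missing lower limit means the threshold is not configured.
    A missing upper limit means "no limit" (everything above `low`).
    Assumption: both limits are treated as inclusive.
    """
    if low is None:
        return False
    inside = value >= low and (high is None or value <= high)
    # With "Inverse interval" enabled, values OUTSIDE the range trigger it.
    return inside != inverse


def module_status(value: float,
                  warn_min: Optional[float] = None,
                  warn_max: Optional[float] = None,
                  crit_min: Optional[float] = None,
                  crit_max: Optional[float] = None,
                  warn_inverse: bool = False,
                  crit_inverse: bool = False) -> str:
    """Critical is checked first, so it prevails where ranges overlap."""
    if threshold_hit(value, crit_min, crit_max, crit_inverse):
        return "CRITICAL"
    if threshold_hit(value, warn_min, warn_max, warn_inverse):
        return "WARNING"
    return "NORMAL"


# CPU example from this case study: Warning Min. 70, Critical Min. 90.
for cpu in (50, 70, 89.99, 90, 100):
    print(cpu, module_status(cpu, warn_min=70, crit_min=90))
```

With only the two lower limits set, values of 90 and above fall inside both ranges, and the order of the checks is what makes CRITICAL win.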
4.2 Text thresholds - Case study 2
If we have a string type module, we can configure the status using regular expressions in the Str. fields of the Warning Status and Critical Status parameters. In this case we have a module that can return "OK", "ERROR connection fail" or "BUSY too many devices", depending on the result of the query.
To configure the WARNING and CRITICAL states of the text module we will use the following regular expressions:
Warning Status: .*BUSY.*
Critical Status: .*ERROR.*
With this configuration, the module will have WARNING status when the data contains the string BUSY, and CRITICAL status when the data contains the string ERROR. Note that regular expressions are case sensitive.
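The effect of these two expressions can be tried out with a short Python sketch. Python's `re` module stands in here for whatever regular expression engine Pandora FMS uses, and the function name `string_status` is hypothetical.

```python
import re

# Patterns taken from the case study above.
WARNING_RE = re.compile(r".*BUSY.*")
CRITICAL_RE = re.compile(r".*ERROR.*")


def string_status(data: str) -> str:
    """Classify a string module value; critical prevails over warning."""
    if CRITICAL_RE.search(data):
        return "CRITICAL"
    if WARNING_RE.search(data):
        return "WARNING"
    return "NORMAL"


for reply in ("OK", "ERROR connection fail", "BUSY too many devices", "busy"):
    print(repr(reply), "->", string_status(reply))
```

The last value, "busy" in lowercase, stays NORMAL because the match is case sensitive.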
4.3 Dynamic monitoring (Automatic thresholds)
Dynamic monitoring consists of automatically and dynamically adjusting the status thresholds of the modules in an intelligent and predictive way. The procedure consists of collecting the values for a given period and calculating an average and a standard deviation, which are used to establish the corresponding thresholds.
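As a rough illustration of the idea, the sketch below derives thresholds from the average and standard deviation of a set of collected values. The exact formula Pandora FMS applies is not given in this section, so placing the thresholds at `k` standard deviations from the average is an assumption, and the function name, the multiplier `k` and the sample values are all hypothetical.

```python
import statistics
from typing import Optional, Sequence, Tuple


def sketch_dynamic_thresholds(values: Sequence[float], k: float = 2.0,
                              two_tailed: bool = False
                              ) -> Tuple[float, float, Optional[float]]:
    """Return (average, upper threshold, lower threshold or None).

    Assumption: thresholds sit k standard deviations from the average.
    This is NOT the documented Pandora FMS formula, only an illustration.
    """
    average = statistics.mean(values)
    deviation = statistics.stdev(values)  # sample standard deviation
    upper = average + k * deviation
    # A lower threshold only exists when both tails are watched.
    lower = average - k * deviation if two_tailed else None
    return average, upper, lower


history = [28, 30, 32, 30, 30]  # hypothetical values collected over the period
print(sketch_dynamic_thresholds(history))
print(sketch_dynamic_thresholds(history, two_tailed=True))
```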
The configuration is done at module level, and the possible parameters are:
- Dynamic Threshold Interval: time interval to be considered for the calculation of thresholds. If we choose 1 month, the system will take all existing data from the last month and build the thresholds based on that data.
- Dynamic Threshold Two Tailed: if activated, the dynamic threshold system will also set thresholds below the average. If unchecked (default) only thresholds with values above the average will be set.
- Dynamic Threshold Max.: allows you to raise the upper limit by the indicated percentage. E.g.: if the average values are around 60 and the critical threshold has been set from 80 on, setting Dynamic Threshold Max. to 10 raises this critical threshold by 10%, to 88.
- Dynamic Threshold Min.: only applies if the Dynamic Threshold Two Tailed parameter is active. Allows the lower limit to be reduced by the percentage indicated. E.g.: if the average values are around 60 and the lower critical threshold has been set to 40, if we set the value Dynamic Threshold Min: 10, we will reduce this critical threshold by 10%, so it would be 36.
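The percentage adjustments made by Dynamic Threshold Max. and Dynamic Threshold Min. are simple arithmetic. The minimal sketch below reproduces the two examples from the list above; the function names are hypothetical.

```python
def raise_upper_limit(threshold: float, dynamic_max_pct: float) -> float:
    """Raise an upper threshold by the given percentage."""
    return threshold + threshold * dynamic_max_pct / 100


def reduce_lower_limit(threshold: float, dynamic_min_pct: float) -> float:
    """Reduce a lower threshold by the given percentage."""
    return threshold - threshold * dynamic_min_pct / 100


print(raise_upper_limit(80, 10))   # upper critical threshold of 80, Max. = 10
print(reduce_lower_limit(40, 10))  # lower critical threshold of 40, Min. = 10
```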
There are also several additional configuration parameters in the file pandora_server.conf.
- dynamic_updates: this parameter determines how many times the thresholds are recalculated during the time period set in Dynamic Threshold Interval. If we set "Dynamic Threshold Interval" to a value of 1 week, by default the data is collected from one week backwards and the calculation is done only once, repeating the process again after one week. If we modify the parameter "dynamic_updates", we could increase this frequency. For example, setting the parameter to a value of 3 will cause the thresholds to be recalculated up to three times during the period of a week (or the period that we have set in "Dynamic Threshold Interval"). Its default value is 5.
- dynamic_warning: percentage of difference between the warning and critical thresholds. Its default value is 25.
- dynamic_constant: determines the deviation from the average that is used to establish the thresholds. Higher values place the thresholds further from the average values. Its default value is 10.
In the following example, the calculated average value is at the red line (approx. 30):
When the dynamic thresholds are activated, the upper threshold (approx. 45 and above) is set like this:
Here the parameter Dynamic Threshold Two Tailed has been activated, which means a critical threshold has also been set below the average values (approx. 15 and lower):
Here the parameters "Dynamic Threshold Min." and "Dynamic Threshold Max." have been set to 20 and 30 respectively, which widens the thresholds and makes them slightly more permissive:
4.3.1 Case study 1
We start from a web latency module. The basic settings we have used take into account a week interval:
When saving changes, after running pandora_db, the thresholds have been set in this way:
The module will therefore switch to warning status when the value is greater than 0.33 seconds, and to critical when it is greater than 0.37 seconds. The graph will be shown as follows:
The threshold is considered somewhat permissive, so we use the parameter Dynamic Threshold Min. to lower the minimum thresholds. In this case the thresholds have no maximum values, because everything above a certain value is considered incorrect, so we do not use Dynamic Threshold Max. The modification would look like this:
After applying changes and executing the pandora_db, the thresholds are set as follows:
And the graph will look like this:
4.3.2 Case study 2
In this example we are monitoring the temperature of a control room or data center (CPD). The graph shows values with little variation:
In this situation it is essential that the temperature remains stable and reaches neither much higher nor much lower values, so we will use the parameter "Dynamic Threshold Two Tailed" to set thresholds both above and below. The configuration is as follows:
The automatically generated thresholds have been these:
And the graph will look like this:
In this way, all values between 23.10 and 26 will be considered normal, since that is the acceptable temperature range in our data center or control room. If necessary, the parameters "Dynamic Threshold Min." and "Dynamic Threshold Max." can again be used to adjust the thresholds.
5 Other Common Monitoring Parameters
5.1 Monitoring Interval
Every synchronous monitoring system needs to define how often a test is performed. It can be every 5 seconds, 5 minutes or 5 hours, but it has to be defined. The shorter the interval, the larger the volume of information stored. A module processed every 5 minutes will have 12x24=288 samples at the end of the day, while a module processed every 30 seconds will have 2880 samples. This concept, which in Pandora FMS is simply called "interval", is key to understanding how much information the monitoring system will handle.
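The relationship between interval and data volume can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds: int) -> int:
    """Number of samples a module stores in one day at the given interval."""
    return SECONDS_PER_DAY // interval_seconds


print(samples_per_day(5 * 60))  # 5-minute interval
print(samples_per_day(30))      # 30-second interval
```

Multiplying the result by the number of modules gives a first estimate of how many values the system will store each day.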
5.2 Historical Data
Pandora FMS lets you choose, for each individual data set, whether its history is saved. By default, all modules keep a history, so that graphs can be generated and included in historical or trend reports. However, in a very large deployment with a lot of data to monitor, you may not need to keep the history for some modules, which makes it possible to use fewer resources.
This option deactivates the history for the modules where it is not needed. Even with the history deactivated, alerts continue to work in exactly the same way, as do event generation and the view of the current state of the monitor.
5.3 FF Threshold
FlipFlop (FF) is the name given to a common phenomenon in monitoring: a value fluctuates frequently between alternative states (RIGHT/WRONG), which makes it difficult to interpret. To deal with this, a "threshold" is usually applied: for something to be considered as having changed status, it has to stay in the new state for more than X intervals without changing. In Pandora FMS terminology this is called "FF Threshold".
The FF Threshold parameter (FF=FlipFlop) is used to filter continuous changes of state when creating events and statuses. In Pandora FMS, you can indicate that an element will not be considered as changed until it has reported the same new status at least X times after leaving its original status. A typical example is a ping to a host where there is packet loss. In an environment like this, it is possible to receive the following results:
1 1 0 1 1 0 1 1 1
However, the host is alive in all cases. What we really want to tell Pandora FMS is: do not show the host as down until it has reported down at least three times in a row. In the previous case it would never be shown as down; it would only be shown as down in a case like this:
1 1 0 1 0 0 0
From this point it will be shown as down - but not before that.
Flip-flop protection is therefore useful for avoiding disruptive fluctuations. All modules implement it. Its purpose is to prevent changes of status (determined by the defined or automatic limits, as in the case of *proc modules).
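The filtering described in this section can be sketched as follows. This is only an illustration of the behaviour as described here, assuming a status change is accepted after the new value has been seen a given number of consecutive times; it is not Pandora FMS code, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def apply_ff_threshold(samples: Sequence[int], required: int,
                       initial: int) -> List[int]:
    """Return the status reported after each sample.

    The reported status only changes once a different value has been
    seen `required` times in a row.
    """
    reported = initial
    candidate = None  # new value currently being counted
    streak = 0        # how many times in a row it has been seen
    history = []
    for sample in samples:
        if sample == reported:
            candidate, streak = None, 0  # fluctuation ended, reset counter
        elif sample == candidate:
            streak += 1
        else:
            candidate, streak = sample, 1
        if candidate is not None and streak >= required:
            reported = candidate
            candidate, streak = None, 0
        history.append(reported)
    return history


# The two ping sequences from the text (1 = up, 0 = down), three repeats required.
print(apply_ff_threshold([1, 1, 0, 1, 1, 0, 1, 1, 1], required=3, initial=1))
print(apply_ff_threshold([1, 1, 0, 1, 0, 0, 0], required=3, initial=1))
```

In the first sequence the isolated zeros never reach three in a row, so the host stays up throughout. In the second, the status only switches to down on the final sample.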