Introduction to Monitoring
Monitoring with Pandora FMS
All user interaction with Pandora FMS takes place through the web console. The console follows current web standards and technologies, so it requires a modern browser; the use of Flash is optional. Firefox 2.x or higher is recommended. Internet Explorer 8 or higher also works, although its peculiar handling of some web controls makes for an uncomfortable user experience.
Generally speaking, monitoring consists of running processes (through modules) on a system and sending the resulting data to a server. The server processes this data, and the front end (the web console) displays it to the user.
Pandora FMS is a scalable monitoring tool. A single server can monitor about 1200 to 1500 agents, and with the correct architecture (Meta Console) the number of monitoring processes can grow without restriction.
Monitoring by Software Agent vs. Remote Monitoring
There are two main ways to monitor with Pandora FMS: software agent based (local) monitoring and remote monitoring.
Software agent based monitoring places a piece of software on the monitored system, which runs the checks there, e.g. measuring the percentage of CPU usage on that system. Remote monitoring is done through network tests that need no software on the monitored system, e.g. checking whether a certain host is active or not.
The main difference between the two is where the check runs: software agent based monitoring is executed on the monitored system, whereas remote monitoring is executed from the Pandora FMS Server against the target system.
Agents on Pandora FMS
All monitoring done by Pandora FMS is managed through a generic entity called an 'Agent', which belongs to a broader block called a 'Group'. An agent can belong to only one group.
The information is arranged logically in a hierarchy of groups, agents, module groups and modules. Some Agents rely solely on the information supplied by a software agent installed on the system. Others hold only network information, i.e. information that does not come from a software agent: no software needs to be installed, because the Pandora FMS Network Servers execute the network monitoring tasks.
There are also agents that combine network information with information obtained through software agents.
The information is collected in modules, which are logically assigned to Pandora FMS agents in the console. It is important to distinguish Agents (which hold the modules containing the collected information) from Software Agents, which run on remote systems.
Status and Event Monitoring
Pandora FMS 3.0 added an important new feature that changed the way Pandora FMS had worked until then: the user can set thresholds that classify any data into three possible states:
'NORMAL', 'WARNING' and 'CRITICAL'.
By default, every module of the 'proc' kind is 'NORMAL' if its value is '1' or greater, and 'CRITICAL' if its value is lower than '1' ('0' or a negative value).
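The default rule for 'proc' modules can be written as a one-line check. This is only a minimal sketch of the rule as described here; the function name proc_status is hypothetical and is not part of Pandora FMS.

```python
def proc_status(value):
    """Default status of a 'proc' module, following the rule in the text:
    NORMAL for a value of 1 or greater, CRITICAL for anything lower."""
    return "NORMAL" if value >= 1 else "CRITICAL"


if __name__ == "__main__":
    for value in (1, 5, 0, -1):
        print(value, proc_status(value))
```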
But what about a CPU usage value? How can the system know whether it is a 'NORMAL', 'WARNING' or 'CRITICAL' value? By default it cannot: it only receives a numeric value, and if nothing has been defined for it, every value counts as 'NORMAL'.
The agent configuration has two status fields that haven't been mentioned so far:
- Warning status
- Critical status
Each of these two fields takes two values: a minimum and a maximum. Configured correctly, they cause some values to put a module in 'WARNING' status and others in 'CRITICAL' status.
An example makes these options clearer. By default, the CPU module always appears in 'green' in the agent status: it simply reports a value between 0% and 100%. To show the CPU usage module in yellow ('WARNING') once it reaches e.g. 70%, and in red ('CRITICAL') once it reaches e.g. 90%, configure:
- Warning status: 70
- Critical status: 90
With these settings, the module is shown in red ('CRITICAL') once it reaches 90%, in yellow ('WARNING') between 70% and 89.99%, and in green ('NORMAL') below 70%.
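The CPU example can be sketched as a small classification function. This is an illustration under stated assumptions, not Pandora FMS code: the name numeric_status and its parameters are hypothetical, and only the lower bound of each status field is modeled, although the fields also accept a maximum.

```python
def numeric_status(value, warning_min, critical_min):
    """Classify a numeric reading using only the lower bounds.

    The critical check comes first, so a value that satisfies
    both limits is reported as CRITICAL.
    """
    if value >= critical_min:
        return "CRITICAL"
    if value >= warning_min:
        return "WARNING"
    return "NORMAL"


if __name__ == "__main__":
    # Settings from the example: Warning status 70, Critical status 90.
    for cpu in (35, 70, 89.99, 90, 100):
        print(cpu, numeric_status(cpu, warning_min=70, critical_min=90))
```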
For a module of string type, the status can be configured with a regular expression in the Str fields of the 'Warning' and 'Critical' status parameters. Take e.g. a module that returns OK, ERROR: Connection fail or BUSY: Too much devices, depending on the query result.
To configure the 'WARNING' and 'CRITICAL' statuses of this module, use the following regular expressions:
- Warning status: .*BUSY.*
- Critical status: .*ERROR.*
Be careful here, because these regular expressions are case sensitive. With this module configuration, the status is 'WARNING' if the data contains the string BUSY, and it jumps to 'CRITICAL' if the data contains the string ERROR.
If both states happen to be configured with the same values, the 'Critical' value always takes precedence. In that case the 'Warning' state is unreachable, because the 'Critical' state is more important.
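The string example and the precedence rule can be illustrated with a short sketch. It uses Python's re module as a stand-in; the regular expression engine that Pandora FMS itself uses may behave differently in detail, and the name string_status is hypothetical.

```python
import re


def string_status(data, warning_regex, critical_regex):
    """Classify string data with two case-sensitive regular expressions.

    CRITICAL is tested first, so it wins whenever both patterns match.
    """
    if re.search(critical_regex, data):
        return "CRITICAL"
    if re.search(warning_regex, data):
        return "WARNING"
    return "NORMAL"


if __name__ == "__main__":
    samples = ["OK", "ERROR: Connection fail", "BUSY: Too much devices", "busy"]
    for sample in samples:
        print(repr(sample), string_status(sample, r".*BUSY.*", r".*ERROR.*"))
```

Because matching is case sensitive, the lowercase sample "busy" does not match .*BUSY.* and stays 'NORMAL'. Passing the same pattern for both states always yields 'CRITICAL' on a match.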
[Screenshot: example of modules in each of the three states.]
Obviously, these fields make no sense for modules that only return boolean values ('1' or '0').
These values are shown on the main screen of the monitor view, where a quick look tells you how many checks are in the 'Normal', 'Warning' or 'Critical' state.
Other Common Monitoring Parameters
Pandora FMS lets you decide individually, per module, whether to keep a data history. By default, all modules keep a history, so that they can generate graphs and be included in historical reports. In a very large deployment that monitors a lot of data, you may not need the history of some modules; dropping it uses far fewer resources.
This option deactivates the history of the modules for which you don't need one. Even with the history deactivated, alerts keep working in exactly the same way, as do event generation and the view of the monitor's current state.
The FF Threshold parameter (FF = FlipFlop) is used to 'filter' continuous state changes when events / statuses are generated. In Pandora FMS, you can specify that an element is not considered to have changed until it has held the new status at least X times in a row after leaving its original status. Let's look at a classic example: a ping to a host with packet loss. In an environment like this, results such as these are possible:
1 1 0 1 1 0 1 1 1
However, the host is alive in all cases. What we really want to tell Pandora is: do not show the host as down until it has reported down at least three times in a row. In the previous case it would never be shown as down; it would be only in a case like this:
1 1 0 1 0 0 0
From this point on it is shown as down, but not before.
The 'Flip_Flop' protection is therefore very useful for avoiding disruptive fluctuations. All modules implement it. Its purpose is to suppress spurious status changes (where the status is determined by the defined or automatic limits, as in the case of 'proc' modules).
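The ping example can be reproduced with a small simulation. This is a sketch of the filtering idea only, not the Pandora FMS implementation: the function apply_flip_flop and the parameter confirmations_needed are hypothetical, and confirmations_needed is simply the number of consecutive readings from the example (three), which is not necessarily how the FF threshold field itself is numbered.

```python
def apply_flip_flop(readings, initial_status, confirmations_needed):
    """Return the status shown after each reading.

    A new status replaces the shown one only after it has been
    observed `confirmations_needed` times in a row.
    """
    shown = initial_status
    candidate = None  # status currently trying to replace `shown`
    streak = 0        # consecutive readings equal to `candidate`
    history = []
    for reading in readings:
        if reading == shown:
            # Back to the shown status: discard the pending change.
            candidate, streak = None, 0
        elif reading == candidate:
            streak += 1
        else:
            candidate, streak = reading, 1
        if candidate is not None and streak >= confirmations_needed:
            shown, candidate, streak = candidate, None, 0
        history.append(shown)
    return history


if __name__ == "__main__":
    print(apply_flip_flop([1, 1, 0, 1, 1, 0, 1, 1, 1], 1, 3))
    print(apply_flip_flop([1, 1, 0, 1, 0, 0, 0], 1, 3))
```

In the first sequence the host is never shown as down; in the second, it is shown as down only at the last reading, after three zeros in a row.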
From version 5.1 on, the FF threshold has two modes:
- All state changing: the same value is used for every state change (to normal, warning and critical).
- Each state changing: a different value can be set for each state change (to normal, warning and critical).
In async modules, a timeout (FF timeout) can also be set. It is useful if you want to fire an alert only when the data server receives several critical/warning data within a short period of time. When the interval between data arrivals exceeds the timeout, the FF threshold counter is reset.
For example, to fire an alert only when the agent sends critical data twice within 5 minutes (and not when the data arrival interval exceeds 5 minutes), set the FF threshold to 1 and the FF timeout to 300.
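The interplay of FF threshold and FF timeout in this example can be sketched as follows. The sketch rests on assumptions that the text only implies: a threshold of N is taken to mean that the status is confirmed on reading N + 1 (so a threshold of 1 needs two critical data), the timeout is measured in seconds between consecutive data arrivals, and the function name first_critical_alert is hypothetical.

```python
def first_critical_alert(arrivals, ff_threshold, ff_timeout):
    """Return the timestamp at which CRITICAL is confirmed, or None.

    arrivals: list of (timestamp_in_seconds, status) tuples in time order.
    Assumption: the counter must exceed `ff_threshold`, and it is reset
    when two consecutive arrivals are more than `ff_timeout` apart.
    """
    counter = 0
    last_time = None
    for timestamp, status in arrivals:
        if last_time is not None and timestamp - last_time > ff_timeout:
            counter = 0  # data arrived too late: start counting again
        last_time = timestamp
        if status == "CRITICAL":
            counter += 1
            if counter > ff_threshold:
                return timestamp
        else:
            counter = 0
    return None


if __name__ == "__main__":
    # Two critical data 2 minutes apart: confirmed at the second one.
    print(first_critical_alert([(0, "CRITICAL"), (120, "CRITICAL")], 1, 300))
    # Two critical data more than 5 minutes apart: no alert.
    print(first_critical_alert([(0, "CRITICAL"), (400, "CRITICAL")], 1, 300))
```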