Pandora FMS Documentation: Pandora FMS Engineering


Go back to documentation index

1 Pandora FMS Engineering Details

In this annex we explain some of the design principles and particularities of Pandora FMS. It is only an introduction, since some technical aspects, such as the Pandora FMS database schema, would require a much longer explanation.

1.1 Pandora FMS Database Design

The first versions of Pandora FMS, from 0.83 to 1.1, were based on a very simple idea: one piece of data, one insertion in the database. This was very easy to develop and allowed very simple searches, insertions and other operations.

This approach had many advantages and one big problem: scalability. The system had a hard limit on the maximum number of modules it could support without implementing complex clustering mechanisms to handle more load, and even then, beyond a certain amount of data (more than 5 million elements) it became slow.

Solutions based on MySQL cluster are not easy to implement, always introduce minor problems, and do not offer a long-term solution either.

The current version of Pandora FMS implements real-time data compression on each insertion. It also allows data compression based on interpolation and, as in previous versions, automatic deletion of data after a certain period of time.

The new data processing system keeps only «new» data: if a duplicated value enters the system, it is not stored in the database. This is very useful to keep the database small. It works for all Pandora FMS module types: numeric, incremental, boolean and string. For boolean data the compression ratio is very high, since this kind of data rarely changes. Nevertheless, «index» values are kept every 24 hours, so there is a minimum amount of information that serves as a reference when compacting the data.

This solves part of the scalability problem by reducing database usage by 40%-70%. There is another answer to the scalability problem as well: the complete separation of Pandora FMS components, which makes it possible to balance the processing of data files and the execution of network modules across different servers. It is now possible to have several Pandora FMS servers (network, data or SNMP), several Pandora FMS Web consoles, and also a standalone database or a high-performance cluster (with MySQL 5).

These adjustments imply big changes in the way data is read and interpreted. The graph engine has been redesigned and implemented from scratch so that data can be rendered quickly with the new storage model. With the new version, if an agent cannot communicate with Pandora FMS and the Pandora FMS server receives no data from it, this absence of data cannot be represented graphically; the module graph simply shows no change.

The result is a graph with a perfectly horizontal line: if Pandora FMS does not receive new values, it assumes there are none and everything stays as it was at the last notification, similar to the behaviour of MRTG, for example.

As a graphical example, the following image shows the changes for each data sample, received every 180 seconds.

Module graph full.jpg

This would be the equivalent graph for the same data, except for a connection failure from approximately 05:55 to 15:29.

Module graph peak.jpg

Pandora FMS 1.3 introduces a new general graph for each agent. It shows the agent's connectivity and the access rate of its modules. This graph complements the other graphs that are shown when the agent is active and receiving data. This is an example of an agent that connects regularly to the server:

Access graph full.jpg

If this graph shows dips, there may be problems or slow connections between the Pandora FMS agent and the Pandora FMS server. It is also possible that there are connectivity problems on the network server side.

From Pandora FMS version 5 onwards, a new feature makes it possible to cross-reference "unknown module" events with the graphs, so that the part of the data series that is in unknown status is highlighted on the graph, making it easier to interpret. For example:


Grafica-dsconocido.jpg


1.1.1 Improvements in Indexes and Other Technical Aspects of the Database

We have implemented small improvements in the relational model of the Pandora FMS database. One of the changes introduced is indexing by module type, which makes access to the information faster, since the Pandora FMS logical agent, which is used to "upload" all the monitoring information, is split into small pieces of information that can come from very different sources. In the new version of Pandora FMS we have planned up to four new kinds of specific servers that will offer a wider variety of data types to process.

Module distribution.jpg

In later versions, table partitioning (by time stamps) is recommended to further improve data history access performance.

We have also added features such as the numerical representation of timestamps (UNIX format, i.e. the number of seconds since 1 January 1970), which speeds up searches over date ranges, comparisons between them, etc. This work has brought a considerable improvement in search and insertion times.

1.1.2 Database Main Tables

Below are an ER diagram and a detailed description of the main tables of the Pandora FMS database. The remaining tables are also briefly described.

Pandora db eer.png

  • taddress: Contains additional agent addresses.
  • taddress_agent: Addresses associated with an agent (rel. taddress/tagente).
  • tagente: Contains the information of Pandora FMS agents.
    • id_agente: Agent unique identifier.
    • nombre: Name of the agent (case sensitive).
    • direccion: Agent address. It is possible to assign additional addresses through taddress.
    • comentarios: Free text.
    • id_grupo: Identifier of the group the agent belongs to (ref. tgrupo).
    • ultimo_contacto: Last date of contact with the agent, either through a software agent or through a remote module.
    • modo: Mode in which the agent runs: 0 normal, 1 training.
    • intervalo: Agent execution interval. Depending on this interval, the agent may be shown as out of limits.
    • id_os: Agent OS identifier (ref. tconfig_os).
    • os_version: OS version (free text).
    • agent_version: Agent version (free text). Updated by the software agents.
    • ultimo_contacto_remoto: Date of the last contact reported by the agent. For software agents, unlike ultimo_contacto, this date is sent by the agent itself.
    • disabled: Agent status, enabled (0) or disabled (1).
    • id_parent: Identifier of the agent parent (ref. tagente).
    • custom_id: Agent customized identifier. Useful to interact with other tools.
    • server_name: Name of the server the agent is assigned to.
    • cascade_protection: Cascade protection. Disabled at 0. When set to 1, it prevents the alerts associated with the agent from being fired if a critical alert of the agent's parent has already been fired. For more information, see the Alerts section.
  • tagente_datos: Data received for each module. If, for the same module, the last value received is the same as the immediately previous one, it is not added (but tagente_estado is updated). Incremental and string data are kept in separate tables.
  • tagente_datos_inc: Incremental type data.
  • tagente_datos_string: String type data.
  • tagente_estado: Information of the current status of each module.
    • id_agente_estado: Identifier.
    • id_agente_modulo: Module identifier (ref. tagente_modulo).
    • datos: Value of the last received data.
    • timestamp: Date of the last data received (it may come from the agent).
    • estado: Module status: 0 NORMAL, 1 CRITICAL, 2 WARNING, 3 UNKNOWN.
    • id_agente: Agent identifier associated with the module (ref. tagente).
    • last_try: Date of the module's last successful execution.
    • utimestamp: Date of the module's last execution, in UNIX format.
    • current_interval: Module execution interval in seconds.
    • running_by: Name of the server that executed the module.
    • last_execution_try: Date of the last module execution attempt. The execution may have failed.
    • status_changes: Number of status changes that have occurred. It is used to avoid continuous status flapping. For more information, see the Operation section.
    • last_status: Module previous status.
  • tagente_modulo: Module configuration.
    • id_agente_modulo: Module unique identifier.
    • id_agente: Agent identifier associated with the module (ref. tagente).
    • id_tipo_modulo: Kind of module (ref. ttipo_modulo).
    • descripcion: Free text.
    • nombre: Module name.
    • max: Module maximum value. Values higher than this will be considered invalid.
    • min: Module minimum value. Values lower than this will be considered invalid.
    • module_interval: Module execution interval in seconds.
    • tcp_port: Destination TCP port in network and plugin modules. Name of the column to read in WMI modules.
    • tcp_send: Data to send in network modules. Namespace in WMI modules.
    • tcp_rcv: Expected answer in network modules.
    • snmp_community: SNMP community in network modules. Filter in WMI modules.
    • snmp_oid: OID in network modules. WQL query in WMI modules.
    • ip_target: Destination address in network, plugin and WMI modules.
    • id_module_group: Identifier of the group the module belongs to (ref. tmodule_group).
    • flag: Forced execution flag. If set to 1, the module will be executed even if its interval has not yet elapsed.
    • id_modulo: Identifier for modules that cannot be recognized by their id_tipo_modulo: 6 for WMI modules, 7 for WEB modules.
    • disabled: Module status, 0 enabled, 1 disabled.
    • id_export: Identifier of the export server associated to the module (ref. tserver).
    • plugin_user: User name in plugin and WMI modules, user-agent in Web modules.
    • plugin_pass: Password in plugin and WMI modules, number of retries in Web modules.
    • plugin_parameter: Additional parameters in plugin modules, configuration of the Goliat task in Web modules.
    • id_plugin: Identifier of the plugin associated with the module in plugin modules (ref. tplugin).
    • post_process: Value the module data will be multiplied by before being stored.
    • prediction_module: 1 if it is a prediction module, 0 in any other case.
    • max_timeout: Time to wait, in seconds, in plugin modules.
    • custom_id: Module customized identifier. Useful to interact with other tools.
    • history_data: If set to 0, module data will not be stored in tagente_datos*; only tagente_estado will be updated.
    • min_warning: Minimum value that activates the WARNING status.
    • max_warning: Maximum value that activates the WARNING status.
    • min_critical: Minimum value that activates the CRITICAL status.
    • max_critical: Maximum value that activates the CRITICAL status.
    • min_ff_event: Number of times the status-change condition must occur before the change actually takes place. It is related to tagente_estado.status_changes.
    • delete_pending: If set to 1, the module will be deleted by the pandora_db.pl database maintenance script.
    • custom_integer_1: When prediction_module = 1, this field is the id of the module used for the prediction. When prediction_module = 2, this field is the id of the service assigned to the module.
    • custom_integer_2:
    • custom_string_1:
    • custom_string_2:
    • custom_string_3:
  • tagent_access: A new entry is inserted each time data is received from an agent by any of the servers, but never more than one per minute, to avoid overloading the database. It can be disabled by setting agentaccess to 0 in the pandora_server.conf configuration file.
  • talert_snmp: Configuration of SNMP alerts.
  • talert_commands: Commands that can be executed by actions associated with an alert (e.g. send email).
  • talert_actions: Instance of a command associated with an alert (e.g. send email to the administrator).
  • talert_templates: Alert templates.
    • id: Template unique identifier.
    • name: Template name.
    • description: Description.
    • id_alert_action: Identifier of the default action associated to the template.
    • field1: Customized field 1 (free text).
    • field2: Customized field 2 (free text).
    • field3: Customized field 3 (free text).
    • type: Kind of alert depending on the firing condition ('regex', 'max_min', 'max', 'min', 'equal', 'not_equal', 'warning', 'critical').
    • value: Value for regex-type alerts (free text).
    • matches_value: When set to 1, it inverts the logic of the firing condition.
    • max_value: Maximum value for max_min and max alerts.
    • min_value: Minimum value for max_min and min alerts.
    • time_threshold: Alert interval.
    • max_alerts: Maximum number of times an alert will be fired during an interval.
    • min_alerts: Minimum number of times the firing condition must occur during an interval for the alert to be fired.
    • time_from: Time from which the alert will be active.
    • time_to: Time to which the alert will be active.
    • monday: When set to 1, the alert is active on Mondays.
    • tuesday: When set to 1, the alert is active on Tuesdays.
    • wednesday: When set to 1, the alert is active on Wednesdays.
    • thursday: When set to 1, the alert is active on Thursdays.
    • friday: When set to 1, the alert is active on Fridays.
    • saturday: When set to 1, the alert is active on Saturdays.
    • sunday: When set to 1, the alert is active on Sundays.
    • recovery_notify: When set to 1, alert recovery is enabled.
    • field2_recovery: Customized field 2 for alert recovery (free text).
    • field3_recovery: Customized field 3 for alert recovery (free text).
    • priority: Alert priority: 0 Maintenance, 1 Informational, 2 Normal, 3 Warning, 4 Critical.
    • id_group: Identifier of the group the template belongs to (ref. tgrupo).
  • talert_template_modules: Instance of an alert template associated to a module.
    • id: Alert unique identifier.
    • id_agent_module: Identifier of the module associated to the alert (ref. tagente_modulo).
    • id_alert_template: Identifier of the template associated with the alert (ref. talert_templates).
    • internal_counter: Number of times the alert firing condition has occurred.
    • last_fired: Last time the alert was fired (Unix time).
    • last_reference: Start of the current interval (Unix time).
    • times_fired: Number of times the alert has been fired (may differ from internal_counter).
    • disabled: When set to 1, the alert is disabled.
    • priority: Alert priority: 0 Maintenance, 1 Informational, 2 Normal, 3 Warning, 4 Critical.
    • force_execution: When set to 1, the alert action will be executed even though the alert has not been fired. It is used for manual alert execution.
  • talert_template_module_actions: Instance of an action associated with an alert (ref. talert_template_modules).
  • talert_compound: Compound alerts. Its columns are similar to those of talert_templates.
  • talert_compound_elements: Simple alerts associated with a compound alert, each with its corresponding logical operation (ref. talert_template_modules).
  • talert_compound_actions: Actions associated with a compound alert (ref. talert_compound).
  • tattachment: Attachments associated with an incident.
  • tconfig: Console configuration.
  • tconfig_os: Valid operating systems in Pandora FMS.
  • tevento: Event entries. The severity values are the same as for alerts.
  • tgrupo: Groups defined in Pandora FMS.
  • tincidencia: Incident entries.
  • tlanguage: Available languages in Pandora FMS.
  • tlink: Links shown at the bottom of the console menu.
  • tnetwork_component: Network components. These are modules associated with a network profile used by the Recon Server. They eventually result in entries in tagente_modulo, so the columns of both tables are similar.
  • tnetwork_component_group: Groups to classify the network components.
  • tnetwork_profile: Network profiles. Groups of network components that are assigned to Recon Server discovery tasks. The network components associated with the profile result in modules in the created agents.
  • tnetwork_profile_component: Network components associated with a network profile (rel. tnetwork_component/tnetwork_profile).
  • tnota: Notes associated with an incident.
  • torigen: Possible origins of an incident.
  • tperfil: User profiles defined at the console.
  • tserver: Registered servers.
  • tsesion: Information on the actions that took place during a user session, for audit and statistics purposes.
  • ttipo_modulo: Types of modules, depending on their origin and data type.
  • ttrap: SNMP traps received by the SNMP console.
  • tusuario: Registered users at the console.
  • tusuario_perfil: Profiles associated with a user (rel. tusuario/tperfil).
  • tnews: News shown in the console.
  • tgraph: Customized graphs created in the console.
  • tgraph_source: Modules associated with a graph (rel. tgraph/tagente_modulo).
  • treport: Customized reports created in the console.
  • treport_content: Elements associated with a report.
  • treport_content_sla_combined: Components of an SLA element associated with a report.
  • tlayout: Customized maps created in the console.
  • tlayout_data: Elements associated with a map.
  • tplugin: Plugin definitions for the Plugin Server.
  • tmodule: Kinds of modules (Network, Plugin, WMI...).
  • tserver_export_data: Data to export, associated with a destination.
  • tplanned_downtime: Scheduled downtimes.
  • tplanned_downtime_agents: Agents associated with a scheduled downtime (rel. tplanned_downtime/tagente).

1.1.3 Data Compression in Real Time

To avoid overloading the database, the server performs a simple compression at insertion time. A value is not stored in the database unless it is different from the previous one, or unless more than 24 hours have passed between the two.

For example, assuming an interval of approximately 1 hour, the sequence 0,1,0,0,0,0,0,0,1,1,0,0 is stored in the database as 0,1,0,1,0. Another consecutive 0 will not be stored unless 24 hours have passed.
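
A minimal sketch of this insertion rule in Python (illustrative only, not the actual server code; the 24-hour reference period and the hourly sample times are taken from the example above):

from datetime import datetime, timedelta

REFERENCE_PERIOD = timedelta(hours=24)   # an "index" value is always kept after 24 hours

def should_store(last_stored, new_value, new_time):
    """Decide whether a sample must be inserted, following the compression rule above."""
    if last_stored is None:
        return True                                  # first sample for the module
    last_value, last_time = last_stored
    if new_value != last_value:
        return True                                  # value changed: always store
    return new_time - last_time >= REFERENCE_PERIOD  # duplicate: only as a 24 h index value

# The example sequence, one sample per hour
samples = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
start, stored, last = datetime(2024, 1, 1), [], None
for i, value in enumerate(samples):
    t = start + timedelta(hours=i)
    if should_store(last, value, t):
        stored.append(value)
        last = (value, t)
print(stored)   # -> [0, 1, 0, 1, 0], as in the example above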

The graph shown below has been drawn from the data of the previous example. Only the values in red have actually been inserted into the database.

Data compression 01.png

Compression affects the data processing algorithms, both for metrics and for graphs, and it is important to bear in mind that the gaps caused by the compression must be filled in.

With all of the above in mind, in order to work with the data of a given module over a given interval and start date, you should follow these steps (a sketch of this procedure follows the list):

  • Search for the last value prior to the given interval and date. If it exists, place it at the beginning of the range. If it does not exist, there was simply no previous data.
  • Search for the next value after the given range and date, up to a maximum of one module interval. If it exists, place it at the end of the interval. If not, extend the last available value to the end of the interval.
  • Process all values considering that a value remains valid until another value is received.
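
A sketch of this reconstruction procedure, under the assumption that the stored data is a sorted list of (utimestamp, value) pairs (illustrative Python; the function and variable names are hypothetical, not part of Pandora FMS):

def expand_range(stored, start, end, module_interval):
    """Expand a compressed series of (utimestamp, value) rows over [start, end].

    Every stored value is considered valid until the next stored value appears."""
    # Step 1: the last value before the start of the range, if any, opens the series
    previous = [v for t, v in stored if t <= start]
    current = previous[-1] if previous else None     # None -> no data before the range
    inside = sorted((t, v) for t, v in stored if start < t <= end)
    expanded, i, t = [], 0, start
    while t <= end:
        # Steps 2 and 3: advance through stored values, carrying the last known one forward
        while i < len(inside) and inside[i][0] <= t:
            current = inside[i][1]
            i += 1
        expanded.append((t, current))
        t += module_interval
    return expanded

# Usage: hourly module (3600 s) whose compressed data only contains the changes
rows = [(0, 0), (3600, 1), (7200, 0)]
print(expand_range(rows, 0, 6 * 3600, 3600))
# -> the 0 stored at 7200 is extended until the end of the requested range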


1.1.4 Data compaction

Pandora FMS includes a system to "compact" database information. This system is aimed at small and mid-size deployments (250-500 agents, < 100,000 modules) that want to keep a long data history while "losing" some resolution.

The Pandora FMS database maintenance process, which is executed every day, scans old data eligible for compaction. This compaction is done using simple linear interpolation; that means that if you have 10,000 data points in a day, the interpolation process will replace those 10,000 points with 1,000 points.

This obviously "loses" information, because it is an interpolation, but it also saves database storage, and long-term graphs (monthly, yearly) look mostly the same. In big databases this behaviour can be "costly" in terms of database performance; in that case it should be disabled and the history database model should be used instead.
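
The idea can be sketched as follows (illustrative Python; the actual compaction is performed by the pandora_db.pl maintenance script, and its exact algorithm may differ from this simple linear interpolation over evenly spaced target timestamps):

def compact(points, target_count):
    """Reduce a series of (utimestamp, value) points to 'target_count' points
    using simple linear interpolation over evenly spaced timestamps."""
    points = sorted(points)
    if len(points) <= target_count:
        return points
    t_first, t_last = points[0][0], points[-1][0]
    step = (t_last - t_first) / (target_count - 1)
    result, j = [], 0
    for k in range(target_count):
        t = t_first + k * step
        # Move to the pair of original samples that surrounds t
        while j < len(points) - 2 and points[j + 1][0] < t:
            j += 1
        (t0, v0), (t1, v1) = points[j], points[j + 1]
        value = v0 if t1 == t0 else v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        result.append((t, value))
    return result

# 10,000 samples of one day compacted into 1,000 points
day = [(i * 8.64, float(i % 100)) for i in range(10_000)]
print(len(compact(day, 1_000)))   # -> 1000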

Sample of non-compacted data

Compact1.png

The same graph after compaction

Compact2.png

1.1.5 History database

This is an Enterprise feature used to store information older than a given point in time, for example data older than one month, in a different database. This database must be on a different physical server (no virtualization here, please!). When you request a data graph for one year, Pandora FMS automatically looks up the first XX days in the "realtime/main" database and the remaining information in the history database. In this way you avoid performance penalties when you store a huge amount of information in your system.

To configure this, you need to manually set up a history database on another server (importing the Pandora FMS DB schema into it, without data) and set up permissions to allow access to it from the main Pandora FMS server.

Go to Setup -> History database and configure the settings needed to access the history database.



Bbddhist.png



Some settings are worth explaining (a conceptual sketch of the transfer loop follows the list):

  • Days: Maximum number of days that information is kept in the main database. After that, data is moved to the history database. 30 days is a good default.
  • Step: This acts like a buffer size: the database maintenance script takes this many records from the main database, inserts them into the history database and deletes them from the main database. This is time-consuming and the right size depends on your setup; 1000 is a good default value.
  • Delay: After each block of Step records, the script waits this many seconds. Useful if your database performance is poor, to avoid locks. Use only values between 1 and 5.
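
Conceptually, the maintenance script processes the data in blocks, roughly as in the following sketch. This is not the real pandora_db.pl code (which is written in Perl); it only illustrates how Days, Step and Delay interact, using sqlite3 in-memory databases as a stand-in for the main and history MySQL servers and the tagente_datos columns described earlier:

import sqlite3, time

def move_old_data(main_db, history_db, days=30, step=1000, delay=2):
    """Conceptual sketch of the Days/Step/Delay behaviour described above.
    The real work is done by pandora_db.pl; this only illustrates the batching."""
    cutoff = int(time.time()) - days * 86400
    while True:
        # Take at most 'step' records older than 'days' days from the main database
        rows = main_db.execute(
            "SELECT id_agente_modulo, datos, utimestamp FROM tagente_datos "
            "WHERE utimestamp < ? LIMIT ?", (cutoff, step)).fetchall()
        if not rows:
            break
        # Insert them into the history database and remove them from the main one
        history_db.executemany(
            "INSERT INTO tagente_datos (id_agente_modulo, datos, utimestamp) "
            "VALUES (?, ?, ?)", rows)
        main_db.executemany(
            "DELETE FROM tagente_datos "
            "WHERE id_agente_modulo = ? AND datos = ? AND utimestamp = ?", rows)
        main_db.commit(); history_db.commit()
        time.sleep(delay)   # pause between blocks to avoid locks on a busy database

# Minimal stand-in setup: two in-memory sqlite3 databases instead of the MySQL servers
schema = ("CREATE TABLE tagente_datos "
          "(id_agente_modulo INTEGER, datos REAL, utimestamp INTEGER)")
main, hist = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
main.execute(schema); hist.execute(schema)
move_old_data(main, hist, days=30, step=1000, delay=1)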

The default configuration of Pandora FMS does NOT transfer string-type data to the history database. However, if this configuration has been changed and the history database is receiving this type of information, it is essential to configure its purging; otherwise it will end up taking too much space, causing serious problems as well as a negative impact on performance.

To configure this parameter, we must run a query directly against the database to set the number of days after which this information is purged. The table of interest is tconfig and the token is string_purge. If, for example, we wanted to set 30 days for purging this type of information, we would run the following query directly on the history database:

UPDATE tconfig SET value = 30 WHERE token = "string_purge";

A good way to test this is to run the database maintenance script manually:

/usr/share/pandora_server/util/pandora_db.pl /etc/pandora/pandora_server.conf

There shouldn't be any reported error.

1.2 Status of the Modules in Pandora FMS

In Pandora FMS, modules can have different statuses: Unknown, Normal, Warning, Critical or Fired Alerts.

1.2.1 When is Each Status Set?

Each module has Warning and Critical thresholds set in its configuration. These thresholds define the data values for which those statuses are activated. If the module returns data outside these thresholds, it is considered to be in Normal status.

Each module also has a time interval that sets the frequency with which data is collected. The console takes this interval into account: if twice the module's interval passes without new data, the module is considered to be in Unknown status.

Finally, if the module has alerts configured and any of them has been fired and not yet validated, the module will have the corresponding Fired Alert status.
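
Putting these rules together, the status of a single module can be sketched as follows (illustrative Python; the exact boundary handling of the thresholds and the status names are assumptions, not taken from the Pandora FMS source):

import time

def module_status(value, last_data_time, interval, thresholds, has_unvalidated_alert,
                  now=None):
    """Return the status of a module following the rules described above (sketch)."""
    now = time.time() if now is None else now
    if has_unvalidated_alert:
        return "alert_fired"
    if now - last_data_time > 2 * interval:      # twice the interval without data
        return "unknown"
    min_c, max_c = thresholds["critical"]
    min_w, max_w = thresholds["warning"]
    if min_c <= value <= max_c:
        return "critical"
    if min_w <= value <= max_w:
        return "warning"
    return "normal"                              # data outside both threshold ranges

# Example: a module reporting every 300 s, warning between 80-90, critical between 90-100
thresholds = {"warning": (80, 90), "critical": (90, 100)}
print(module_status(95, time.time(), 300, thresholds, False))   # -> critical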

1.2.2 Spreading and Priority

In the organization of Pandora FMS, some elements depend on others, for example the modules of an agent or the agents of a group. The same applies to the Pandora FMS Enterprise policies, which have agents assigned to them and modules that are considered associated with each of those agents.

This structure is especially useful to evaluate the status of the modules easily. This is achieved by propagating the statuses up this hierarchy, thereby giving a status to agents, groups and policies as well.


1.2.2.1 Which status will an Agent have?

An agent will have the worst status of its modules. Recursively, a group will have the worst status of the agents that belong to it, and the same applies to policies, which will have the worst status of their assigned agents.

This way, when we see a group in critical status, for example, we know that at least one of its agents has that same status. Once we locate it, we can go down another level to reach the module or modules that caused the critical status to propagate upwards.


1.2.2.2 Which should be the Priority of the status?

When we say that the worst status is propagated, we need to be clear about which statuses are the most important. For this there is a priority list, where the first status has the highest priority over the others and the last one has the lowest; the latter is shown only when all elements have it.

  1. Fired Alerts
  2. Critical status
  3. Warning status
  4. Unknown status
  5. Normal status


We can see that when a module has fired alerts, its status takes priority over the rest, so the agent it belongs to will have this status, and so will the group that agent belongs to. On the other hand, for a group to be in normal status, all of its agents must have this status, which implies that all the modules of that group are in normal status.
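
A minimal sketch of this propagation rule, using the priority order listed above (illustrative Python; the names are hypothetical):

# Priority order from highest to lowest, as listed above
PRIORITY = ["alert_fired", "critical", "warning", "unknown", "normal"]

def worst_status(statuses):
    """Return the highest-priority status of a collection (Normal if empty)."""
    statuses = list(statuses)
    return min(statuses, key=PRIORITY.index) if statuses else "normal"

def agent_status(module_statuses):
    return worst_status(module_statuses)

def group_status(agents):
    """'agents' maps an agent name to the list of its module statuses."""
    return worst_status(agent_status(modules) for modules in agents.values())

group = {
    "web-01":  ["normal", "warning"],
    "db-01":   ["normal", "normal"],
    "mail-01": ["critical", "normal"],
}
print(agent_status(group["web-01"]))   # -> warning
print(group_status(group))             # -> critical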

1.2.3 Color Code

Each of the statuses discussed has a color assigned so that, in the network maps, you can see at a glance when something is not working properly.

Orange status.png Fired alerts status
Red status.png Critical status
Yellow status.png Warning status
Grey status.png Unknown status
Green status.png Normal status

1.3 Pandora FMS graphs

Graphs are one of the most complex parts of Pandora FMS, because they gather information from the database in real time and no external system (rrdtool or similar) is used.

The graphs behave differently depending on the type of data:

  • Asynchronous modules. No data compaction is assumed: the data stored in the DB are all real samples (therefore, no compaction). This produces more "exact" graphs, with no possible misinterpretation.
  • Text string modules. They show the rate of the gathered data.
  • Numerical modules. Most modules report such data.
  • Boolean modules. These are numerical data from *PROC modules: for instance, ping checks, interface status, etc. 0 means wrong, 1 means "normal". They raise events automatically when they change status.

1.3.1 Compression

Compression affects how graphs are rendered. When two consecutive values are the same, Pandora FMS does not store the second one, but interprets that the last known value can be used for the present time as long as no new value arrives. When drawing a graph, if there is no reference value right at the start of the graph, Pandora FMS searches up to 48 hours back in time to find the last known value to use as a reference. If it finds nothing, the graph starts from 0.

For asynchronous modules, although there is no compression, the backwards search algorithm behaves similarly.
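
The backwards search for a reference value can be sketched like this (illustrative Python; the 48-hour window is the one mentioned above):

LOOKBACK = 48 * 3600   # seconds to search back for a reference value

def reference_value(stored, graph_start):
    """Value to use at the left edge of a graph ('stored' is sorted (utimestamp, value) rows)."""
    candidates = [v for t, v in stored if graph_start - LOOKBACK <= t <= graph_start]
    return candidates[-1] if candidates else 0   # nothing within 48 h -> start from 0

rows = [(1_000, 7.5), (90_000, 8.0)]
print(reference_value(rows, 100_000))   # -> 8.0 (last value inside the 48 h window)
print(reference_value(rows, 400_000))   # -> 0   (no value within the last 48 h)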

1.3.2 Interpolation

When composing a graph, Pandora FMS takes 50xN samples, where N is the graph resolution factor (this value can be configured in the setup). A monitor that gathers data every 300 seconds (5 minutes) has 12 samples per hour, and 12x24 = 288 samples in a day. So when we request a one-day graph, we do not plot 288 values; we "compress" or interpolate the graph using only 50x3 = 150 samples (by default, the graph resolution in Pandora FMS is 3).

This means we lose some resolution, and the more samples there are, the more we lose. When we have many values, for instance the 2016 samples of a week or the 8400 samples of a month, we must compress them into the 150 samples of a graph. This is why we sometimes lose detail and cannot see certain features; that is also why graphs can be queried with different intervals and zoomed in or out.

Graph-explain.png

In normal graphs, interpolation is implemented in a simple way: if within an interval we have two samples (e.g. interval B of the example), we take the average and draw that value.

In boolean graphs, if within an interval we have several values (which can only be 1 or 0), we take the pessimistic approach and draw 0. This helps to visualize failures within an interval, giving priority to showing the problem rather than the normal status.

In both cases, if within an interval we have no data at all (because it was compressed or because it is missing), we use the last known value of the previous interval, as interval E of the above example shows.
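
A simplified sketch of this bucketing (illustrative Python; the real graph engine is part of the console and works against the database, but the 50xN bucket count, the averaging, the pessimistic boolean rule and the reuse of the previous value come from the description above). The same buckets also provide the average, maximum and minimum series discussed in the next section:

def build_graph(samples, start, end, resolution=3, boolean=False):
    """Reduce raw (utimestamp, value) samples to 50 * resolution graph buckets.

    Numeric buckets keep (avg, max, min); boolean buckets take the pessimistic
    value; empty buckets reuse the previous bucket's point."""
    n_buckets = 50 * resolution
    width = (end - start) / n_buckets
    buckets, previous = [], None
    for k in range(n_buckets):
        lo, hi = start + k * width, start + (k + 1) * width
        values = [v for t, v in samples if lo <= t < hi]
        if not values:
            point = previous                       # interval E case: carry last known point
        elif boolean:
            v = min(values)                        # any 0 in the bucket draws 0
            point = (v, v, v)
        else:
            point = (sum(values) / len(values), max(values), min(values))
        buckets.append(point)
        previous = point
    return buckets

# One day of 5-minute samples (288 values) reduced to the 150 buckets of a default graph
day = [(i * 300, i % 20) for i in range(288)]
print(len(build_graph(day, 0, 288 * 300)))   # -> 150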

1.3.3 Avg/Max/Min

Grafica avg max min.png

By default, graphs show the average, maximum and minimum values. Because an interpolation bucket (see Interpolation above) can contain several values, we show their average, maximum and minimum. The more interpolation is needed (the longer the period being visualized and the more data there is), the higher the interpolation level, and therefore the greater the difference between the maximum and minimum values. The shorter the range of the graph (an hour or so), the less interpolation there is, or none at all, so we see the data at its real resolution and the three series are identical.


Go back to documentation index