Mar 26, 2010

Performance management is about polling data, aggregating it, applying thresholds to it, and reporting on performance parameters. In this article, I will concentrate on the polling and data retention side of it.

Performance managers deal with lots of data from many resources. This mass of data directly drives the disk space requirements. Since disk space is not unlimited, we need to tune some parameters to limit it according to the customer’s requirements.

One of the most important parameters to ask the customer about is the retention period of the data. This defines how long the system keeps the data before purging it. I have used several different retention periods in the projects I have been involved in. This parameter depends heavily on the customer’s requirements. Some customers may want to see the daily KPIs for a year, while others may only need a month.

The second important parameter is the polling period. We always tend to poll less frequently, that is, to set longer polling periods. However, polling too infrequently can lead to problems. Here are some examples:

Suppose you are polling a device interface (via SNMP GET) to get the Inbound Octets KPI. The SNMP object you are polling is a 32-bit counter. To get the number of octets that passed, you subtract the previous poll’s counter value from the current one. So far, so good. The problem arises when you “wait” too long. If your polling period is 15 minutes, for example, and this is a highly utilized interface, the counter reaches its maximum value (2^32 - 1) and wraps around to zero. In some cases, it even wraps multiple times within those 15 minutes. The result is wrong, misleading information in the reports. There are some formulas I have been using to cope with this situation, but solving it via formulas is not the best way, as they are not reliable in all cases. The best way to deal with this particular case is to use 64-bit counters (if available in the MIB) or to reduce the polling period.
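To make the counter problem concrete, here is a minimal Python sketch of the kind of formula involved. The function names are my own for illustration and are not taken from any particular product. It computes the delta between two polls with modular arithmetic, which is only correct if the counter wrapped at most once, and estimates how quickly a 32-bit octet counter can wrap at full line rate.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter holds values 0 .. 2^32 - 1


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets counted between two polls.

    Correct only if the counter wrapped at most once between the polls.
    Each additional wrap silently loses another `modulus` octets.
    """
    return (current - previous) % modulus


def seconds_to_wrap(link_bps, modulus=COUNTER32_MODULUS):
    """Shortest time in which an octet counter can wrap at full line rate."""
    octets_per_second = link_bps / 8
    return modulus / octets_per_second


# The counter wrapped once: it went from near the top back to a small value.
print(counter_delta(4_294_967_000, 704))         # 1000
print(round(seconds_to_wrap(100_000_000)))       # 344 (seconds at 100 Mbit/s)
print(round(seconds_to_wrap(1_000_000_000), 1))  # 34.4 (seconds at 1 Gbit/s)
```

At 100 Mbit/s the counter can wrap in under six minutes, so a 15-minute polling period cannot distinguish one wrap from two. In the standard IF-MIB, the 64-bit counterpart of ifInOctets is ifHCInOctets.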

Another example comes from the SQM domain. Suppose you are polling every 5 minutes and forwarding the KPIs to an SQM system. The SQM system collects the data and runs thresholds on it to do the service impact analysis. On your first poll, the SQM system finds that the data received violates the threshold limits. It then marks the service as down and starts calculating the downtime. To detect the next service status, the system has to wait at least 5 minutes. As a result, service downtimes appear in the reports as multiples of 5 minutes: 5, 10, 15… minutes. If you have committed to 99.9999% availability for a customer, this granularity is simply not enough. What should you do? Reduce the polling period again.
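The arithmetic behind this can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the availability window is assumed to be a 365-day year, and reported downtime is modeled simply as the number of failed polls multiplied by the polling period.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day measurement window


def allowed_downtime_seconds(availability_percent, window_seconds=SECONDS_PER_YEAR):
    """Downtime budget implied by an availability commitment."""
    return window_seconds * (100 - availability_percent) / 100


def reported_downtime_seconds(failed_polls, polling_period_seconds):
    """Downtime as the reports see it: whole polling periods only."""
    return failed_polls * polling_period_seconds


budget = allowed_downtime_seconds(99.9999)
one_bad_poll = reported_downtime_seconds(1, 300)

print(round(budget, 3))       # 31.536 seconds per year
print(one_bad_poll)           # 300 seconds for a single failed 5-minute poll
print(one_bad_poll > budget)  # True
```

Under these assumptions, a single failed poll at a 5-minute period reports almost ten times the yearly downtime budget, no matter how short the real outage was.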

Reducing the polling period is not very straightforward. When you reduce the polling period, you must ensure that all polls and their processing finish within that period. Suppose you have a poller that polls 1000 resources every hour. If we assume that each poll takes 2 seconds (including processing delays, propagation delays, etc.) and the polls run one after another, a full cycle takes 2000 seconds. An hour has 3600 seconds, so there is no problem here. But this is the sunny day scenario. What happens if 10% of the resources cannot respond within 2 seconds, or are not available at all? Obviously there is a risk of using up the 3600 seconds before the polling finishes. What can we do? Well, if we cannot increase the polling period, we can increase the poller count. Instead of using 1 poller, we can use 5 pollers and poll in parallel. However, additional pollers bring additional license costs and introduce additional machines to the infrastructure.
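The capacity check above can be written down as a small Python sketch. It uses the numbers from the example (1000 resources at 2 seconds each, measured against a 3600-second window). The 20-second timeout for slow or unreachable resources is my own assumption, and the function names are hypothetical. Polls are assumed to run strictly one after another on each poller.

```python
import math


def sequential_cycle_seconds(resources, seconds_per_poll,
                             slow_fraction=0.0, timeout_seconds=0):
    """Time for one poller to poll every resource once, one after another.

    `slow_fraction` of the resources are assumed to take `timeout_seconds`
    (for example an SNMP timeout) instead of `seconds_per_poll`.
    """
    slow_count = round(resources * slow_fraction)
    healthy_count = resources - slow_count
    return healthy_count * seconds_per_poll + slow_count * timeout_seconds


def pollers_needed(cycle_seconds, polling_period_seconds):
    """Minimum number of parallel pollers to finish within one period."""
    return math.ceil(cycle_seconds / polling_period_seconds)


sunny_day = sequential_cycle_seconds(1000, 2)
rainy_day = sequential_cycle_seconds(1000, 2, slow_fraction=0.10,
                                     timeout_seconds=20)  # assumed timeout

print(sunny_day, pollers_needed(sunny_day, 3600))  # 2000 1
print(rainy_day, pollers_needed(rainy_day, 3600))  # 3800 2
print(pollers_needed(sunny_day, 60))               # 34
```

The last line shows how quickly the poller count grows when the period shrinks: the same sunny-day cycle squeezed into a 60-second period would need 34 sequential pollers.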

Polling periods and retention periods heavily impact the disk space requirements of performance management solutions. They should be studied carefully before the rollout of any PM implementation.
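As a rough way to study this before a rollout, the following Python sketch estimates how many raw samples a system has to retain for a given polling period and retention period. All the input figures (resource count, KPIs per resource, bytes per sample) are hypothetical and should be replaced with values from the actual product and customer. Aggregated data and database overhead are ignored.

```python
SECONDS_PER_DAY = 86_400


def samples_retained(resources, kpis_per_resource,
                     polling_period_seconds, retention_days):
    """Number of raw samples on disk once the retention window is full."""
    polls_per_day = SECONDS_PER_DAY // polling_period_seconds
    return resources * kpis_per_resource * polls_per_day * retention_days


def storage_gib(samples, bytes_per_sample):
    """Raw storage in GiB, ignoring indexes and other overhead."""
    return samples * bytes_per_sample / 2 ** 30


# Hypothetical deployment: 1000 resources, 10 KPIs each, kept for one year.
five_minute = samples_retained(1000, 10, 300, 365)
one_minute = samples_retained(1000, 10, 60, 365)

print(five_minute)  # 1051200000
print(one_minute)   # 5256000000
print(round(storage_gib(five_minute, bytes_per_sample=16), 1))  # assumed 16 bytes/sample
```

Going from a 5-minute to a 1-minute polling period multiplies the raw sample count by five, which is why the two parameters need to be agreed on together.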