Apr 06, 2010

Thresholds are a way to detect anomalies in performance data. They are proactive tools that let operators act early on a degradation, before it causes a fault and possible service loss.

There are two types of thresholds:

1) Static or Burst thresholds
2) Dynamic or Baseline thresholds

A static threshold triggers an action whenever the specified value is crossed. Once violated, it does not trigger the same action again for each new data point that is above the threshold. (Most products count the occurrences until the value returns to normal limits.)
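The behavior described above can be illustrated with a minimal sketch. The class name `StaticThreshold`, its fields, and the sample values are hypothetical and not taken from any particular PM product; the sketch only assumes that the action fires on the first crossing and that later violations are counted until the value returns to normal.

```python
from dataclasses import dataclass


@dataclass
class StaticThreshold:
    """Hypothetical static threshold: fires once, then only counts."""
    limit: float
    in_violation: bool = False
    occurrences: int = 0

    def should_trigger(self, value: float) -> bool:
        if value <= self.limit:
            # Back within normal limits: re-arm the threshold.
            self.in_violation = False
            self.occurrences = 0
            return False
        # Above the limit: always count, but fire only on the first crossing.
        self.occurrences += 1
        first_crossing = not self.in_violation
        self.in_violation = True
        return first_crossing


cpu = StaticThreshold(limit=80.0)
samples = [70, 85, 90, 88, 75, 95]  # made-up utilization values in percent
fired = [cpu.should_trigger(v) for v in samples]
print(fired)
```

Only the samples 85 and 95 trigger the action; 90 and 88 are counted but suppressed because the threshold is already in violation.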

The second type takes baseline (historical) information into account. These thresholds compare new data with the baseline to detect variations from it. For example, suppose a router’s interface utilization is generally around 1% on Sunday mornings. If it reaches 10% on a Sunday morning, this should be considered a variation from the baseline.
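A rough sketch of the Sunday-morning example follows. It assumes a simple "mean plus or minus n standard deviations" rule, which is only one possible way to define a variation; real products use their own baseline algorithms. The function name, the `n_sigmas` parameter, and the history values are made up for illustration.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, n_sigmas=3.0):
    """Return True if `current` is far from the historical values.

    `history` holds past measurements for the same time slot
    (for example, previous Sunday mornings). Assumption: a value is
    a variation when it is more than `n_sigmas` standard deviations
    away from the historical mean.
    """
    baseline = mean(history)
    spread = stdev(history)
    return abs(current - baseline) > n_sigmas * spread


# Hypothetical interface utilization (%) on previous Sunday mornings.
sunday_mornings = [0.9, 1.1, 1.0, 1.2, 0.8]

print(deviates_from_baseline(sunday_mornings, 10.0))
print(deviates_from_baseline(sunday_mornings, 1.1))
```

A value of 10% would be far below any sensible static limit, yet it is flagged here because it is measured against what is normal for that time slot.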

We attach actions to the thresholds, and normally these are SNMP traps. A trap has to carry some attributes such as severity, additional text, managed object, probable cause, etc.
Thresholds should also define a clear value, which initiates a clear trap. The clear trap causes the alarm to be cleared on the FM side. (In order for a system to send an SNMP trap, it has to comply with a MIB. Most of the time, this MIB is enterprise specific. If you want to integrate your FM with PM over SNMP traps, you will need the FM vendor’s MIB.)
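The raise/clear pair can be sketched as follows. This is not an SNMP implementation: the events are plain dictionaries that stand in for trap contents, and the class name, field names, and values are hypothetical. A real integration would map these attributes to the varbinds defined in the relevant MIB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdWithClear:
    """Hypothetical threshold with separate raise and clear values."""
    managed_object: str
    raise_value: float
    clear_value: float
    severity: str = "major"
    probable_cause: str = "thresholdCrossed"
    active: bool = False

    def evaluate(self, value: float) -> Optional[dict]:
        if not self.active and value >= self.raise_value:
            self.active = True
            return self._event(self.severity, f"value {value} >= {self.raise_value}")
        if self.active and value <= self.clear_value:
            self.active = False
            return self._event("cleared", f"value {value} <= {self.clear_value}")
        return None  # no state change, nothing to send

    def _event(self, severity: str, text: str) -> dict:
        # Stand-in for the attributes a trap would carry.
        return {
            "managed_object": self.managed_object,
            "severity": severity,
            "probable_cause": self.probable_cause,
            "additional_text": text,
        }


th = ThresholdWithClear("router1/if3", raise_value=80.0, clear_value=70.0)
events = [e for v in [60, 85, 78, 90, 65] if (e := th.evaluate(v)) is not None]
for e in events:
    print(e["severity"], e["additional_text"])
```

Keeping the clear value below the raise value means a measurement that hovers around 80 does not produce a stream of raise and clear events.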

Thresholds can be applied to raw data or to aggregated data. Some products do not support thresholds on aggregated data, so you have to check this important detail with your vendor.
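To see why the distinction matters, here is a small sketch that applies the same limit to raw samples and to window averages. The `aggregate` helper, the window size, and the sample values are assumptions made for the example.

```python
from statistics import mean


def aggregate(samples, window):
    """Average consecutive samples in fixed-size windows."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


limit = 80
raw = [20, 25, 95, 30, 22, 24]  # made-up raw utilization samples (%)

raw_violations = [v for v in raw if v > limit]
agg = aggregate(raw, window=3)
agg_violations = [v for v in agg if v > limit]

print(raw_violations)  # the short spike is visible in raw data
print(agg_violations)  # but averaged away in the aggregated data
```

A threshold on raw data catches short bursts, while a threshold on aggregated data reacts only to sustained load, so the two are not interchangeable.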

Another important detail is the number of thresholds you deploy on your PM. Deploying thousands of thresholds will eventually impact the performance of your PM. You should check with your PM vendor to understand the impact of thresholds on the performance of the product. If it has limits (and it should), you are better off grouping and classifying your resources to reduce the number of thresholds needed. In some cases you may even want to rely on RMON thresholds (thresholds applied to and managed by the resources themselves).
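One way to picture the grouping idea is to define a threshold per resource class instead of per resource. The class names, resource names, and limits below are invented for illustration; how a given PM product models groups will differ.

```python
# Hypothetical: one limit per resource class, not one per resource.
CLASS_LIMITS = {
    "core-router": 70.0,
    "access-switch": 85.0,
}

# Hypothetical classification of resources.
RESOURCE_CLASS = {
    "rtr-01": "core-router",
    "rtr-02": "core-router",
    "sw-01": "access-switch",
    "sw-02": "access-switch",
    "sw-03": "access-switch",
}


def limit_for(resource):
    """Look up the limit through the resource's class."""
    return CLASS_LIMITS[RESOURCE_CLASS[resource]]


def violations(utilization):
    """Return the resources whose value exceeds their class limit."""
    return sorted(r for r, v in utilization.items() if v > limit_for(r))


current = {"rtr-01": 75, "rtr-02": 60, "sw-01": 80, "sw-02": 90, "sw-03": 50}
print(violations(current))
```

Here two threshold definitions cover five resources, and adding another access switch requires only a classification entry, not a new threshold.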

Setting the right thresholds is another practice. It is organization specific and requires domain knowledge. If we don’t have PM baseline data (we may be delivering the FM and PM in the same project), we should apply best-practice thresholds and “fine-tune” them to the organization by trial and error.