Mar 22, 2010

Whenever a fault occurs on a resource, it may impact multiple other resources (physical or logical) in the infrastructure. The impacted resources may then generate alarms of their own.

For the operational staff to react to incidents more effectively, the root cause of these alarms should be identified. Once the root alarm is identified, all the child alarms can be linked to it and “hidden” from the operator console.

In fault management, root cause analysis is done by applying correlation algorithms to a set of alarms. Fault managers use expert systems to run these correlations.

Fault management systems implement two types of correlation:

• Rule based correlation
• Topology based correlation

Rule based correlation is the simplest type of correlation. The main idea is to collect the alarms that arrive within a predefined time period (window) and apply some rules (if statements) to those alarms to find their root cause (sometimes called the mother alarm).
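
To make the window-plus-rules idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Alarm` fields, the 60 second window, the alarm kinds and the single "LINK_DOWN wins" rule are hypothetical and do not come from any particular fault manager.

```python
from dataclasses import dataclass

@dataclass
class Alarm:
    resource: str
    kind: str          # hypothetical alarm type, e.g. "LINK_DOWN"
    timestamp: float   # arrival time in seconds

WINDOW_SECONDS = 60.0  # assumed correlation window

def group_by_window(alarms, window=WINDOW_SECONDS):
    """Batch alarms so each batch spans at most `window` seconds from its first alarm."""
    batches = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        if batches and alarm.timestamp - batches[-1][0].timestamp <= window:
            batches[-1].append(alarm)
        else:
            batches.append([alarm])
    return batches

def find_mother_alarm(batch):
    """Hypothetical rule: a LINK_DOWN in the batch is the root cause of the rest."""
    for alarm in batch:
        if alarm.kind == "LINK_DOWN":
            return alarm
    return None

def correlate(alarms):
    """Return (root, children) pairs; alarms that match no rule stay as their own root."""
    results = []
    for batch in group_by_window(alarms):
        mother = find_mother_alarm(batch)
        if mother is None:
            results.extend((alarm, []) for alarm in batch)
        else:
            results.append((mother, [a for a in batch if a is not mother]))
    return results
```

With this sketch, the operator console would show only the first element of each pair, and the children would stay linked to it in the background. A real rule set would have many more rules than the single one shown here.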

Topology based correlation is harder to implement. In this type of correlation, the resource models must be imported into the expert systems. These are hierarchical service models that describe the mother-child relationships between the resources. A simple example would be port->interface->sub interface. Generally, these expert systems do not have rich user interfaces, because they deal with thousands of alarms and complex resource models.
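
The port->interface->sub interface example can be sketched as follows. This is only an illustration under stated assumptions: the `PARENT` map, the resource names and the "topmost alarmed ancestor is the root" policy are hypothetical, and a real expert system would load its model from an inventory rather than a hard-coded dictionary.

```python
# Hypothetical resource model: each child resource points to its mother.
# The model is assumed to be a strict hierarchy (no cycles).
PARENT = {
    "port1/if1": "port1",
    "port1/if1.100": "port1/if1",
    "port1/if1.200": "port1/if1",
    "port2/if1": "port2",
}

def ancestors(resource, parent=PARENT):
    """Yield the mothers of a resource, nearest first."""
    while resource in parent:
        resource = parent[resource]
        yield resource

def correlate_by_topology(alarmed, parent=PARENT):
    """Map each root resource to the alarmed resources below it.

    Assumed policy: the root of an alarmed resource is its topmost
    ancestor that is also alarmed, or the resource itself if none is.
    """
    alarmed = set(alarmed)
    roots = {}
    for resource in alarmed:
        alarmed_ancestors = [a for a in ancestors(resource, parent) if a in alarmed]
        root = alarmed_ancestors[-1] if alarmed_ancestors else resource
        roots.setdefault(root, [])
        if root != resource:
            roots[root].append(resource)
    return {root: sorted(children) for root, children in roots.items()}
```

For example, if alarms arrive from `port1`, `port1/if1`, `port1/if1.100` and `port2/if1`, the sketch treats `port1` as the root of the first three, while `port2/if1` stays a root on its own because `port2` itself raised no alarm.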

The resource models are similar to SQM models, so some customers decide to do the service impact analysis on their fault managers. This is a perfectly legitimate approach, and sometimes the right one, especially if you don’t have a scalable SQM tool. However, if there is an SQM tool in the environment, the responsibility for service impact analysis should be given to that tool.