Jun 14, 2010
 

Fault Managers (FM) collect alarm and event information from network elements. They do so through several types of interface, known as the northbound interfaces (NBI) of the alarm source. An alarm source may be a network element or, more often, an element management system (EMS). A Fault Manager can itself be an alarm source for another OSS system such as SQM, SLA Management, or another Fault Manager (the manager-of-managers scenario).

The most popular NBIs are SNMP based: they use SNMP traps to deliver fault and event information to the target NMS. TL1 and CORBA interfaces are also common, but they are increasingly regarded as legacy. JMS is gaining popularity among the NBIs on the market.

Fault manager implementations are rather straightforward.

First, you need to identify the type of NBI you will connect to; the product vendor can tell you this. Second, you need to collect the connection parameters: security credentials, port numbers (CORBA and RMI may use dynamic ports), IP addresses, and so on. Most EMS systems let you select which types of alarms and events are forwarded to the NBI. If you have no EMS and interface directly with the devices, you will have to configure alarm forwarding on the devices themselves.
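
To make the checklist concrete, here is a minimal sketch of how the collected parameters might be recorded and sanity-checked before any connection is attempted. Everything in it is an assumption made for illustration: the `NbiConnection` class, its field names, the set of NBI types, and the rule that only CORBA may omit a fixed port. It is not taken from any particular FM product.

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_NBI_TYPES = {"snmp", "tl1", "corba", "jms"}
DYNAMIC_PORT_TYPES = {"corba"}  # assumption: only these may omit a fixed port


@dataclass(frozen=True)
class NbiConnection:
    """Connection details for one alarm source (hypothetical field names)."""
    source_name: str                 # the EMS or device this NBI belongs to
    nbi_type: str                    # one of KNOWN_NBI_TYPES
    host: str                        # IP address or host name
    port: Optional[int] = None       # None when the port is assigned dynamically
    credentials_ref: str = ""        # pointer into a secret store, not the secret
    forwarded_severities: tuple = ("critical", "major")

    def problems(self) -> list:
        """Return human-readable problems; an empty list means usable."""
        found = []
        if self.nbi_type not in KNOWN_NBI_TYPES:
            found.append(f"unknown NBI type: {self.nbi_type}")
        if self.port is None:
            if self.nbi_type not in DYNAMIC_PORT_TYPES:
                found.append("port is required for this NBI type")
        elif not 0 < self.port < 65536:
            found.append(f"port out of range: {self.port}")
        return found


ems = NbiConnection("ems-1", "snmp", "192.0.2.10", port=162)
print(ems.problems())  # an empty list: nothing to fix
```

Checking the parameters up front means a missing port or a mistyped interface type surfaces during configuration rather than as a silent lack of alarms later.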

Fault Management products collect alarms in their mediation layers, which contain modules that know how to collect alarms from a specific source (device or interface type). These modules are also responsible for resynchronizing alarm information.
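
One way to picture such a module is as a small adapter with two duties: translating source-specific events into the FM's common alarm fields, and fetching the source's active alarms for resynchronization. The sketch below assumes a hypothetical `Collector` interface and an imaginary EMS that reports `ne`, `sev` and `msg` keys; real products define their own interfaces.

```python
from abc import ABC, abstractmethod


class Collector(ABC):
    """One mediation module per source type (hypothetical interface)."""

    @abstractmethod
    def normalize(self, raw: dict) -> dict:
        """Map a source-specific event to the FM's common alarm fields."""

    @abstractmethod
    def fetch_active_alarms(self) -> list:
        """Return the source's active alarms, used for resynchronization."""


class DemoEmsCollector(Collector):
    """Toy collector for an imaginary EMS reporting 'ne', 'sev' and 'msg'."""

    def __init__(self, active):
        self._active = active  # stands in for a real query over the NBI

    def normalize(self, raw):
        return {
            "device": raw["ne"],
            "severity": raw["sev"].lower(),
            "text": raw.get("msg", ""),
        }

    def fetch_active_alarms(self):
        return [self.normalize(r) for r in self._active]


# Registry: the mediation layer picks a collector class by source type.
COLLECTORS = {"demo-ems": DemoEmsCollector}

collector = COLLECTORS["demo-ems"]([{"ne": "r1", "sev": "MAJOR"}])
print(collector.fetch_active_alarms())
```

Keeping normalization inside the module means the rest of the platform only ever sees one alarm shape, whatever the source.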

Resynchronization is an important concept in Fault Management. Devices and EMS systems forward their alarms to the fault management system without checking whether they were received. (This is especially true of UDP-based SNMP traps.) Thus, if network connectivity between the FM and the EMS is lost, all new alarms and updates to existing alarms (such as clears) are lost too. Resynchronization is the process of recovering those alarms after a connectivity issue. How does it happen? Simple. The EMS should maintain an active alarm list and be able to provide it to an OSS system through its NBI (most probably via a method call or by setting an SNMP OID value). The OSS system that retrieves the active alarm list then applies a diff algorithm to find the deltas and apply them to its own repository. If the EMS has no “active alarm list” feature, you are out of luck!
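
The diff step can be sketched in a few lines. This is only an illustration under stated assumptions: alarms are plain dictionaries, each carries a unique `id` that is stable on both sides, and the FM repository is an in-memory dictionary. The `resync` function and its return format are made up for the example.

```python
def resync(local: dict, active_list: list) -> dict:
    """Reconcile the FM repository with the EMS active alarm list.

    local:       alarm id -> alarm dict held by the FM (updated in place)
    active_list: alarm dicts reported by the EMS, each with an "id" key
    Returns the deltas that were applied.
    """
    remote = {alarm["id"]: alarm for alarm in active_list}

    # Raised while the link was down: known to the EMS, unknown to the FM.
    missed = [remote[i] for i in remote if i not in local]
    # Cleared while the link was down: held by the FM, no longer active.
    cleared = [i for i in local if i not in remote]
    # Present on both sides but with different attributes (e.g. severity).
    changed = [remote[i] for i in remote if i in local and local[i] != remote[i]]

    for alarm in missed + changed:
        local[alarm["id"]] = alarm
    for alarm_id in cleared:
        del local[alarm_id]
    return {"missed": missed, "cleared": cleared, "changed": changed}


repository = {
    "a1": {"id": "a1", "severity": "major"},
    "a2": {"id": "a2", "severity": "minor"},
}
ems_active = [
    {"id": "a1", "severity": "critical"},  # severity changed during the outage
    {"id": "a3", "severity": "major"},     # raised during the outage
]
deltas = resync(repository, ems_active)
print(deltas["cleared"])  # ['a2']
```

Note that an alarm missing from the active list is treated as cleared, which is only safe when the list is complete. A partial or paged response would need extra care.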

After the raw alarms arrive at the FM platform, the filtering phase starts. There may be thousands of active alarms even on small-scale networks, and it is impossible for a NOC to track and manage them all, so filtering is an essential step. Based on the customer's requirements, we put filters on the alarm flow to pass only the alarms we need. Aside from simple pass/no-pass filters, there can be other types of filter that handle specific fault scenarios such as link flaps (a link repeatedly going down and up). Let me try to explain. When the link goes down, the EMS sends an alarm. About half a second later the link comes back up and the EMS emits a clear for the previous alarm. These conditions should be filtered out, as the upper layers have no need to act on them (although a notification could be sent if the flapping continues). A filter can “wait” for a clear event for a specific period of time before sending the alarm to the upper layer, which prevents flapping from generating an alarm flood in the platform.
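
A “waiting” filter of this kind could look roughly like the sketch below. It rests on a few assumptions of mine: every event carries a `key` identifying the alarm instance and a `type` of either `raise` or `clear`, and the caller passes the current time explicitly, which keeps the example deterministic. The `HoldOffFilter` name is made up.

```python
class HoldOffFilter:
    """Hold 'raise' events; drop them if a matching 'clear' arrives in time."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self._pending = {}  # alarm key -> (time of raise, raise event)

    def on_event(self, event: dict, now: float) -> list:
        """Feed one event; return the events released to the upper layer."""
        released = self.flush(now)
        key = event["key"]
        if event["type"] == "raise":
            # Keep the earliest raise while it is still being held.
            self._pending.setdefault(key, (now, event))
        elif key in self._pending:
            # Clear arrived inside the window: suppress raise and clear alike.
            del self._pending[key]
        else:
            # Clear for an alarm that was already forwarded: pass it on.
            released.append(event)
        return released

    def flush(self, now: float) -> list:
        """Release raises that have outlived the hold window."""
        due = [k for k, (t, _) in self._pending.items()
               if now - t >= self.hold_seconds]
        return [self._pending.pop(k)[1] for k in due]


flt = HoldOffFilter(hold_seconds=5)
print(flt.on_event({"key": "link1", "type": "raise"}, now=0.0))  # [] (held)
print(flt.on_event({"key": "link1", "type": "clear"}, now=0.5))  # [] (flap dropped)
```

The price of this design is latency: every genuine alarm reaches the operator `hold_seconds` late, so the window should stay short. In a real platform `flush` would also be called from a timer, so that held alarms are released even when no further events arrive.
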

The next phase after filtering is enrichment. In this phase we enrich the alarm information using external data sources. Most of the time, raw alarm values are meaningless to the NOC operator. To start corrective action on an alarm, the operator needs quick, usable information from it. For example, if the alarm has a field named Device whose value is an IP address, the NOC operator has to follow a manual procedure to find the host name and region of that device before sending the alarm to the correct back office. Such time-consuming manual processes should be automated in the Fault Manager. Enrichments are generally applied via custom scripts that use the API of the FM platform.
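
For the Device example, an enrichment script boils down to a lookup against an external source. In this sketch the source is a hard-coded dictionary standing in for an inventory system, and the `Hostname` and `Region` field names are my own choice. A real script would go through the FM platform's API, which is not shown here.

```python
# Hypothetical inventory extract keyed by IP address.
INVENTORY = {
    "192.0.2.10": {"hostname": "core-rtr-01", "region": "north"},
}


def enrich(alarm: dict, inventory: dict = INVENTORY) -> dict:
    """Return a copy of the alarm with host name and region added."""
    enriched = dict(alarm)  # leave the raw alarm untouched
    record = inventory.get(alarm.get("Device"), {})
    # Fall back to "unknown" so a missing inventory entry is visible.
    enriched["Hostname"] = record.get("hostname", "unknown")
    enriched["Region"] = record.get("region", "unknown")
    return enriched


raw = {"Device": "192.0.2.10", "severity": "major"}
print(enrich(raw)["Hostname"])  # core-rtr-01
```

Returning a copy keeps the raw alarm available, for instance for auditing, and the explicit "unknown" fallback turns gaps in the inventory into something the NOC can see and report.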

All the filtered and enriched alarms are now ready for correlation. Correlation is the process of grouping similar alarms together to increase the efficiency of the NOC and the assurance process. You may want to look at my previous post on this topic.
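
In its very simplest form, grouping can be sketched as bucketing alarms by a shared key. Treating alarms as “similar” when they share a region and an alarm type is purely an assumption for this example. Real correlation engines use much richer rules, such as topology and time windows, which this sketch does not attempt.

```python
from collections import defaultdict


def correlate(alarms, key=lambda a: (a["region"], a["type"])):
    """Group alarms that share the same correlation key."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[key(alarm)].append(alarm)
    return dict(groups)


alarms = [
    {"id": 1, "region": "north", "type": "link-down"},
    {"id": 2, "region": "north", "type": "link-down"},
    {"id": 3, "region": "south", "type": "power"},
]
for group_key, members in correlate(alarms).items():
    print(group_key, [a["id"] for a in members])
```

The operator can then work on one group instead of its individual members, and the `key` argument leaves room to swap in a different notion of similarity.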

The last important concept to mention is expert rules. Expert rules are automatic actions that run in specific cases. An expert rule could be triggered whenever the system receives a severe alarm or detects a specific text in the AdditionalText attribute. The actions could be sending an e-mail or SMS, creating a trouble ticket, or just manipulating the alarm fields (such as changing the state to Acknowledged).
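
An expert rule is essentially a condition paired with an action. The sketch below assumes alarms are dictionaries with `severity`, `state` and `AdditionalText` fields. The `ExpertRule` class is hypothetical, and the e-mail action is reduced to recording the intent on the alarm, since actually sending mail is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExpertRule:
    name: str
    condition: Callable[[dict], bool]  # when should the rule fire?
    action: Callable[[dict], None]     # what should it do to the alarm?


def run_rules(alarm: dict, rules: list) -> list:
    """Apply every matching rule in order; return the names that fired."""
    fired = []
    for rule in rules:
        if rule.condition(alarm):
            rule.action(alarm)
            fired.append(rule.name)
    return fired


def acknowledge(alarm):
    alarm["state"] = "Acknowledged"


def queue_email(alarm):
    # Placeholder: only records that a notification should go out.
    alarm.setdefault("notifications", []).append("email")


RULES = [
    ExpertRule("notify-critical",
               lambda a: a.get("severity") == "critical", queue_email),
    ExpertRule("ack-maintenance",
               lambda a: "maintenance" in a.get("AdditionalText", "").lower(),
               acknowledge),
]

alarm = {"severity": "critical", "AdditionalText": "Planned MAINTENANCE window"}
print(run_rules(alarm, RULES))  # ['notify-critical', 'ack-maintenance']
```

Because the rules are plain data, an action that opens a trouble ticket would be just one more entry in the list.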

All Fault Management systems have similar alarm interfaces, where you will see a data grid with the alarms inside. They also employ fancy network maps, which are not usable at all.

Fault Managers interact with several other OSS systems, such as Trouble Ticket, Workforce Management, SQM, Performance Managers, Inventory Managers, and so on.

The most important integration, and usually the first to be implemented, is trouble ticketing. Faults should be tracked and solved quickly, and trouble tickets (TTs) are the instrument for that. TTs can be opened manually by the NOC operator or automatically by an expert rule.

Fault Managers are must-have OSS systems. Their basic functionality is not very hard to implement; however, advanced features such as correlation can lead to time- and resource-consuming implementations.