Living Knowledgebase Platforms for Automated Healing

Configuration Management, Fault Management, Inventory Management
Dec 11, 2011

In an operational telecom environment, each fault or quality degradation is handled by the NOC engineers and repaired by following the necessary steps. These steps are written down in knowledge base management systems or live in people’s heads, based on past experience.

Each time a new problem occurs on the network, it is detected by the network management platforms and, if implemented, automatic trouble ticket generation is initiated for the root cause alarms. NOC engineers handle each trouble ticket separately. During the troubleshooting process, a separate knowledge base system may also be consulted. However, because keeping such a system current adds operational cost, knowledge base systems often turn out to be not so efficient.

A self-healing method could be used to automate these knowledge base systems. In this approach, each reconfiguration activity on the configuration management platform is logged for further reference. In the meantime, alarm information is also logged in the trouble management platform. The alarms, along with the configuration management logs, are fed into a database platform where they can be further correlated. The node id field (IP address, MAC address etc.) along with other inventory-related configuration information (such as card id, slot id) can be used as the primary key for this correlation.
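To make this concrete, here is a minimal Python sketch of that correlation step. The record shapes and field names (node_id, card_id, slot_id) are hypothetical; a real platform would read both streams from its own database schema.

```python
from collections import defaultdict

# Hypothetical sample records; real platforms expose much richer schemas.
config_logs = [
    {"node_id": "10.0.0.5", "card_id": "C1", "slot_id": 3,
     "change": "set ospf cost 20", "ts": "2011-12-01T10:02:00"},
]
alarms = [
    {"node_id": "10.0.0.5", "card_id": "C1", "slot_id": 3,
     "type": "LINK_DOWN", "ts": "2011-12-01T10:01:30"},
]

def correlation_key(record):
    # Node id plus inventory fields acts as the composite primary key.
    return (record["node_id"], record["card_id"], record["slot_id"])

def correlate(alarms, config_logs):
    # Index the configuration logs by key, then join the alarms against them.
    by_key = defaultdict(list)
    for log in config_logs:
        by_key[correlation_key(log)].append(log)
    return [(alarm, by_key[correlation_key(alarm)]) for alarm in alarms]
```

With the sample data above, the LINK_DOWN alarm gets paired with the configuration change made on the same node, card and slot.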

During day-to-day operation, when a new root cause alarm occurs on the network, the RCA type is looked up in the knowledge base for the best match to a configuration template. If a match is found, the configuration template can be populated to create the self-healing reconfiguration to be applied to the faulty device.
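A toy version of that lookup-and-populate step might look as follows. The template store, the RCA type names, and the CLI snippets are all made up for illustration.

```python
# Hypothetical knowledge base: RCA type -> configuration template.
templates = {
    "LINK_DOWN": "interface {port}\n shutdown\n no shutdown",
    "BGP_PEER_DOWN": "router bgp {asn}\n neighbor {peer} restart",
}

def build_healing_config(rca_type, context):
    """Look up the best-matching template and populate it from the alarm context."""
    template = templates.get(rca_type)
    if template is None:
        return None  # no match: fall back to the manual incident flow
    return template.format(**context)

cfg = build_healing_config("LINK_DOWN", {"port": "GigabitEthernet0/1"})
```

Returning `None` on a miss is what lets the incident flow decide whether self healing applies at all.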

This way, a fully automated repair could be run without running an end-to-end incident management process. An incident process can and should still be triggered, as these configuration activities will not be finalized in a second and customers may already have experienced the service degradation or outages. However, the first task in the incident flow could be checking the alarm to identify whether it is applicable to the automated self-healing process. If self healing does not apply to the scenario at hand, the incident flow continues on its way. Again, each configuration task that is performed on the configuration management platform will continue to feed the self-healing system with new profiles. More data in the system leads to better results from the template matching algorithm.

Fault Management

Fault Management
Jun 14, 2010

Fault Managers (FM) collect alarm and event information from the network elements. There are several interface types that fault managers use to collect this data. These interfaces are called the northbound interfaces (NBI) of the given alarm source. Alarm sources can be network elements or, more often, element management systems (EMS). A Fault Manager could also be an alarm source for another OSS system such as SQM, SLA Management or another Fault Manager (in a manager-of-managers context).

The most popular NBIs are SNMP-based. These NBIs use SNMP traps to deliver the fault/event information to the target NMS system. TL1 and CORBA interfaces are also popular but have started to be considered legacy. JMS is gaining popularity among the NBIs on the market.

Fault manager implementations are rather straightforward.

First, you need to identify the type of NBI you will connect to. You can learn this from the product vendor. Second, you need to collect the necessary connection parameters such as security credentials, port numbers (CORBA/RMI may use dynamic ports), IP addresses etc. Most EMS systems will allow you to select the types of alarms/events that will be forwarded over the NBI. If you do not have an EMS and you are interfacing directly with the devices, you will have to configure the devices themselves for alarm forwarding.

Fault Management products collect the alarms in their mediation layers, where they have modules that know how to collect alarms from a specific source (device/interface type). These modules are also responsible for the resynchronization of alarm information.

Resynchronization is an important concept in the Fault Management area. Devices/EMS systems forward their alarms to the fault management systems and do not care whether they have been received or not (this is especially the case for UDP-based SNMP traps). Thus, if the network connectivity between the FM and the EMS is lost, all new alarms or updates to previous alarms (clear alarms) will be lost too. Resynchronization is the process of recovering the alarms after a connectivity issue. How does this happen? Simple: the EMS should maintain an active alarm list and should be able to provide this list to an OSS system through its NBI (most probably via a method call, or by setting an SNMP OID value). The OSS system that retrieves the active alarm list then applies a diff algorithm to find the deltas and apply them to its own repository. If the EMS does not have an “active alarm list” feature, then there is no luck!
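The diff step is simple enough to sketch. Assuming each alarm carries a unique id (an assumption; real NBIs use vendor-specific correlation keys), the delta is just two set differences:

```python
def resync(local_alarms, ems_active_alarms):
    """Diff the EMS active-alarm list against the FM's local repository.

    Returns (missed, cleared):
      missed  - alarms raised while connectivity was down (add locally)
      cleared - alarms cleared while connectivity was down (clear locally)
    """
    local = {a["id"]: a for a in local_alarms}
    remote = {a["id"]: a for a in ems_active_alarms}
    missed = [remote[i] for i in remote.keys() - local.keys()]
    cleared = [local[i] for i in local.keys() - remote.keys()]
    return missed, cleared
```

Alarms present on both sides may still need a field-by-field comparison (severity changes, for instance), but the id-level diff covers the raise/clear cases.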

After the raw alarms arrive at the FM platform, the filtering phase starts. There may be thousands of active alarms even on small-scale networks, and it is impossible for a NOC to track and manage them all. Thus, filtering becomes an essential step. Based on the customer requirements, we put filters on the alarm flow to pass only the alarms we need. Aside from the simple pass/no-pass filters, there can be other types of filters that handle specific fault scenarios such as link flaps (a link going up and down repeatedly). Let me try to explain this. When the link goes down, the EMS sends an alarm. A second or two later, the link comes back up and the EMS emits a clear alarm for the previous one. These conditions should be filtered out, as there is no need to take an action on the upper layers (though a notification could be sent if the flaps continue). A filter could “wait” for a clear event for a specific period of time before sending the alarm to the upper layer. This prevents flapping links from generating an alarm flood in the platform.
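That “wait for a clear” filter boils down to a hold-down timer per alarm key. Here is a minimal sketch; the class name, hold period and injectable clock are my own choices, not any vendor’s API:

```python
import time

class FlapFilter:
    """Hold a 'down' alarm for `hold` seconds; drop it if a clear arrives first."""

    def __init__(self, hold=5.0, clock=time.monotonic):
        self.hold = hold
        self.clock = clock          # injectable for testing
        self.pending = {}           # alarm key -> time the raise was seen

    def on_raise(self, key):
        # Do not forward yet; start the hold-down window.
        self.pending[key] = self.clock()

    def on_clear(self, key):
        # Clear arrived inside the window: suppress both events.
        return self.pending.pop(key, None) is not None

    def flush(self):
        """Forward raises whose hold window expired without a clear."""
        now = self.clock()
        expired = [k for k, t in self.pending.items() if now - t >= self.hold]
        for k in expired:
            del self.pending[k]
        return expired
```

A flap-counting extension (notify if the same key raises N times in a window) would sit naturally in `on_raise`.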
The next phase after filtering is the enrichment phase. In this phase we enrich the alarm information using external data sources. Most of the time, raw alarm values are meaningless to the NOC operator. For the operator to start corrective actions on an alarm instance, he/she needs quick and usable information from the alarm. For example, if the alarm has a field named Device and its value is an IP address, the NOC operator would have to follow a manual procedure to find the host name and region of that device before dispatching to the correct back office. Such time-consuming manual processes should be automated on the Fault Manager. Enrichments are generally applied via custom scripts that use the API of the FM platform.
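In script form the device example is a one-line merge. The inventory table below is a stand-in; in practice it would be a query against the inventory manager or CMDB:

```python
# Hypothetical inventory source; in practice this would be a CMDB/inventory query.
inventory = {
    "192.0.2.10": {"hostname": "core-rtr-01", "region": "Central"},
}

def enrich(alarm):
    """Merge inventory attributes into the alarm, keyed on the Device IP."""
    extra = inventory.get(alarm.get("device"), {})
    return {**alarm, **extra}
```

Unknown devices pass through unchanged, so a missing inventory entry never blocks the alarm flow.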

All the filtered/enriched alarms are now ready for correlation. Correlation is the process of grouping similar alarms together to increase the efficiency of the NOC and the assurance process. You may have a look at my previous post on this topic.

The last important concept to mention is expert rules. Expert rules are automatic actions that run in specific cases. An expert rule could be triggered whenever a severe alarm is received by the system or a specific text in the AdditionalText attribute is detected. The actions could be sending e-mail or SMS, creating trouble tickets, or simply manipulating the alarm fields (such as changing the state to Acknowledged).
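Conceptually an expert rule is just a condition plus an action callback. A sketch, with invented field names and a hypothetical `open_ticket` callback:

```python
def run_expert_rules(alarm, actions):
    """Apply simple automatic actions; `actions` holds hypothetical callbacks."""
    # Rule 1: critical severity -> open a trouble ticket automatically.
    if alarm.get("severity") == "critical":
        actions["open_ticket"](alarm)
    # Rule 2: a known text in AdditionalText -> acknowledge without human action.
    if "FAN FAILURE" in alarm.get("additional_text", ""):
        alarm["state"] = "Acknowledged"
    return alarm
```

Real FM products express these rules in their own scripting or rule languages, but the trigger/action shape is the same.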

All Fault Management systems have similar alarm interfaces, where you will see a data grid with the alarms inside. They also employ fancy network maps, which are not usable at all.

Fault Managers have several interactions with other OSS systems such as Trouble Ticket, Workforce Management, SQM, Performance Managers, Inventory Managers etc.

The most important integration which is usually implemented first is the trouble ticketing integration. The faults should be tracked and solved quickly and trouble tickets are the instruments for that. TTs could be opened manually by the NOC operator or automatically by an expert rule.

Fault Managers are must-have OSS systems. Their basic functionality is not very hard to implement; however, advanced features such as correlation can lead to time- and resource-consuming implementations.

Root Cause Analysis

Fault Management
Mar 22, 2010

Whenever a fault occurs on a resource, it may impact multiple other resources (physical or logical) in the infrastructure. This impact may in turn lead to alarm generation from those resources.

For the operational staff to react to incidents more effectively, the root cause of these alarms should be identified. After the root alarm is identified, all the child alarms can be linked to it and “hidden” from the operator console.

In fault management, root cause analysis is done by applying correlation algorithms to a set of alarms. Fault managers use expert systems to run these correlations.

There are two types of correlation implemented by fault management systems:

• Rule based correlation
• Topology based correlation

Rule based correlation is the simplest type. The main idea is to collect the alarms that arrive within a predefined time period (window) and apply some rules (if statements) to those alarms to find their root cause (sometimes called the mother alarm).
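A stripped-down version of windowing plus one rule might look like this. The 60-second window, the timestamp field and the “LINK_DOWN wins” rule are illustrative assumptions:

```python
def window_correlate(alarms, window=60):
    """Group alarms into time windows, then apply a rule to pick the mother alarm."""
    alarms = sorted(alarms, key=lambda a: a["ts"])
    groups, current = [], []
    for a in alarms:
        # Start a new group once the window since the group's first alarm expires.
        if current and a["ts"] - current[0]["ts"] > window:
            groups.append(current)
            current = []
        current.append(a)
    if current:
        groups.append(current)

    results = []
    for g in groups:
        # Rule: a LINK_DOWN in the group is the mother; everything else is a child.
        mother = next((a for a in g if a["type"] == "LINK_DOWN"), g[0])
        results.append((mother, [a for a in g if a is not mother]))
    return results
```

Production rule engines chain many such if-style rules, but each one reduces to picking a mother alarm inside a window like this.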

Topology based correlation is harder to implement. In this type of correlation, the resource models must be imported into the expert system. These are hierarchical service models that describe the mother-child relationships between the resources. A simple example would be port -> interface -> sub-interface. Generally these expert systems do not have rich user interfaces, because they deal with thousands of alarms and complex resource models.

The resource models are similar to SQM models, so some customers decide to do service impact analysis on their fault managers. This is a perfectly valid approach, and sometimes the right one, especially if you don’t have a scalable SQM tool. However, if there is an SQM tool in the environment, the service impact analysis responsibility should be given to that tool.