Importance of Data Profiling in Synthetic Monitoring

 Fault Management, Performance Management
Apr 09, 2017

Monitoring web services and sites via synthetic monitoring systems is a common practice implemented by most companies. Cloud-based monitoring platforms give us the ability to watch our services from the Internet, through the eyes of our customers.

However, the alarms emitted by these platforms are limited, most of the time, to outright faults. They tell us our web service is not reachable at all, which is the worst text message a system admin could receive at 3 AM.

Can these systems be used for proactive monitoring? Yes, and that is what this article is about.

Depending on your cloud-based monitoring provider's capabilities, you can apply several proactive monitoring practices. But before that, we need to elaborate on the only KPI we have for an Internet-initiated synthetic service check: Response Time.

Response Time is affected by several factors: network delays, DNS resolution delays, and service (HTTP/FTP, etc.) delays.

These delays can be measured by most cloud-based monitoring platforms and can provide some insight into an outage before it actually occurs.

How these KPIs can be used for proactive monitoring, though, depends on the capabilities of your monitoring platform.

Monitoring platforms will usually allow you to set threshold values for the response time types mentioned above. Whenever the response time exceeds a threshold, these systems will emit alarms on several channels to alert the system administrators. Response times rise in several situations. These are:

· DNS issues (less likely but can happen)

· Network congestion/saturation

· HTTP Server issues (too many connections, load balancer problems, backend slowness such as RDBMS or external service provider calls)

But how will we know the threshold values? We need to do some investigation for that. The easiest way is to look at the daily/weekly/monthly reports to figure out our traffic profile ourselves. With some simple calculations, we can come up with approximate response time numbers.
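To make those "simple calculations" concrete, here is a minimal sketch in Python that derives a static threshold from a handful of hypothetical response time samples taken from such reports. The sample values and the mean-plus-three-standard-deviations rule are assumptions for illustration, not something prescribed by any particular monitoring platform; a high percentile is shown as an alternative that is less sensitive to a single outlier.

```python
import statistics

# Hypothetical response time samples (in seconds) collected from daily/weekly reports.
samples = [0.42, 0.45, 0.39, 0.47, 0.44, 0.52, 0.41, 0.48, 0.46, 0.95, 0.43, 0.40]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)

# Simple static threshold: mean plus three standard deviations.
threshold = mean + 3 * stdev
print(f"mean={mean:.3f}s stdev={stdev:.3f}s threshold={threshold:.3f}s")

# Alternative: a high percentile, so one bad sample does not inflate the limit.
p95 = statistics.quantiles(samples, n=100)[94]
print(f"95th percentile threshold={p95:.3f}s")
```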

The problem with manual profiling is its tendency to lead to errors, especially if your monitoring points are on the Internet. The Internet is a very dynamic environment, and you have no control over it. This dynamic nature results in fluctuating response times, which makes the thresholds hard to predict. The other factor to consider is our own business: we have peak times and maintenance periods. During these times our response times may also rise, and this is expected behavior.

How can we set the correct thresholds in this very dynamic environment? There are certain steps to increase the predictability:

1-) Get Closer

If your service resides in Europe, and your customers are in Europe, there is no point in monitoring it from Hong Kong. Getting closer to your target means fewer routers along the way and reduced fluctuations in monitored response times.

2-) Create an Automated Baseline

Creating a baseline should be automatic so that it reflects your business. Say you take backups at 2 AM every Monday, or your traffic goes through the roof every Friday between 5 PM and 8 PM. You are prepared for these events, but if your monitoring system is not, you will receive false-positive alarms indicating "slowness" in your services. Your monitoring provider should offer automatic profiling based on hours and/or weekdays. The profiling should be done at least on an hourly basis: if I select the weekly profiling option, the system should calculate the baseline for each weekday-hour pair using an algorithm (mean, median, etc.). This way, the system will "capture" our weekly backup event and will not emit any alarms for it.
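A minimal sketch of such an automated baseline, assuming the monitoring platform can export a history of (timestamp, response time) samples, might group the samples by weekday-hour and use the median of each slot as its baseline. The history values, the tolerance factor, and the helper name is_slow are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical history exported from the monitoring platform: (timestamp, response time in seconds).
history = [
    (datetime(2017, 4, 3, 2, 15), 1.80),    # Monday 02:00 backup window, slow but expected
    (datetime(2017, 4, 3, 9, 5), 0.45),
    (datetime(2017, 4, 10, 2, 20), 1.75),
    (datetime(2017, 4, 10, 9, 10), 0.43),
]

# Group samples by (weekday, hour) and take the median as the baseline for that slot.
buckets = defaultdict(list)
for ts, rt in history:
    buckets[(ts.weekday(), ts.hour)].append(rt)
baseline = {slot: median(values) for slot, values in buckets.items()}

def is_slow(ts, rt, tolerance=1.5):
    """Flag a sample only if it exceeds the baseline of its own weekday-hour slot."""
    base = baseline.get((ts.weekday(), ts.hour))
    return base is not None and rt > base * tolerance

print(is_slow(datetime(2017, 4, 17, 2, 10), 1.9))   # backup window: not flagged
print(is_slow(datetime(2017, 4, 17, 9, 0), 1.9))    # Monday morning: flagged
```

With per-slot baselines, the Monday 2 AM backup no longer trips an alarm, while the same response time during business hours does.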

The following screenshot was taken from a monitoring platform called PingTurk. This platform localizes monitoring down to the city level to increase the effectiveness of synthetic monitoring. The table shows the calculated baseline response times per day-hour pair for each city in Turkey.

This example follows the two steps mentioned above (getting closer and baselining automatically) and will reduce the number of false-positive alarms during operation.

The effectiveness can be improved even further by introducing a local monitoring point inside the customer infrastructure (say, in the DMZ). This way, we can tell whether the slowness is caused by the Internet or by our own infrastructure.
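A sketch of that decision logic, with hypothetical function and parameter names, could compare the two probes against their own baselines:

```python
def classify_slowness(external_rt, internal_rt,
                      baseline_external, baseline_internal, tolerance=1.5):
    """Hypothetical decision logic combining the cloud probe with a DMZ probe."""
    external_slow = external_rt > baseline_external * tolerance
    internal_slow = internal_rt > baseline_internal * tolerance
    if external_slow and internal_slow:
        return "own infrastructure"   # both probes see it: the problem is behind the DMZ
    if external_slow:
        return "Internet path"        # only the outside probe sees it
    return "no slowness"

print(classify_slowness(external_rt=2.4, internal_rt=0.3,
                        baseline_external=0.8, baseline_internal=0.2))
```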

Living Knowledgebase Platforms for Automated Healing

 Configuration Management, Fault Management, Inventory Management
Dec 11, 2011

In an operational telecom environment, each fault or quality degradation is handled by the NOC engineers and repaired by following the necessary steps. These steps are written down in knowledge base management systems or kept in people's heads, based on past experience.

Each time a new problem occurs on the network, it is detected by the network management platforms and, if implemented, automatic trouble ticket generation is initiated for the root cause alarms. NOC engineers handle each trouble ticket separately. During the troubleshooting process, a separate knowledge base system may also be consulted. However, due to the added operational costs, knowledge base systems do not, most of the time, prove very efficient.

A self-healing method could be used to automate these knowledge base systems. In this approach, each reconfiguration activity performed through the configuration management platform is logged for future reference. In the meantime, alarm information is also logged in the trouble management platform. The alarms, along with the configuration management logs, are fed into a database platform where they can be further correlated. The node id (IP address, MAC address, etc.) field, along with other inventory-related configuration information (such as card id or slot id), can be used as the primary key for this correlation.
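As a sketch of this correlation, assuming both platforms can export their logs as simple records, the two logs could be joined on the node id (plus slot) whenever a configuration change follows an alarm within some time window. The record fields, the window length, and the correlate helper are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical exports from the trouble management and configuration management platforms.
alarms = [
    {"node_id": "10.0.0.5", "slot": "3", "time": datetime(2011, 12, 1, 10, 2), "type": "LOS"},
]
config_logs = [
    {"node_id": "10.0.0.5", "slot": "3", "time": datetime(2011, 12, 1, 10, 20),
     "action": "replace_sfp_and_reset_port"},
]

def correlate(alarms, config_logs, window=timedelta(hours=1)):
    """Join the two logs on (node_id, slot) when the fix follows the alarm within the window."""
    pairs = []
    for alarm in alarms:
        for change in config_logs:
            same_resource = (alarm["node_id"], alarm["slot"]) == (change["node_id"], change["slot"])
            if same_resource and alarm["time"] <= change["time"] <= alarm["time"] + window:
                pairs.append((alarm["type"], change["action"]))
    return pairs

print(correlate(alarms, config_logs))   # [('LOS', 'replace_sfp_and_reset_port')]
```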

During day-to-day operation, when a new root cause alarm occurs on the network, the RCA type is looked up in the knowledge base for the best match to a configuration template. If a match is found, the configuration template can be populated to create the self-healing reconfiguration information to be applied to the faulty device.
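A minimal sketch of this lookup-and-populate step, with a made-up template format and field names, might look like the following; the actual templates would come from the knowledge base built up by the correlation described above.

```python
# Hypothetical knowledge base: root cause alarm type -> configuration template.
templates = {
    "LOS": "interface {port}\n shutdown\n no shutdown",
    "BGP_DOWN": "router bgp {asn}\n neighbor {peer} shutdown\n no neighbor {peer} shutdown",
}

def build_healing_config(rca_type, alarm_fields):
    """Look up the best-matching template and populate it from the alarm's fields."""
    template = templates.get(rca_type)
    if template is None:
        return None   # no match: fall back to the normal incident flow
    return template.format(**alarm_fields)

print(build_healing_config("LOS", {"port": "GigabitEthernet0/1"}))
```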

This way, a fully automated repair could be run without executing an end-to-end incident management process. An incident process can and should still be triggered, as these configuration activities will not be finalized in a second and the service degradation or outages may already have been experienced by the customers. However, the first task in the incident flow could be checking the alarm to identify whether it is applicable to the automated self-healing process. If self-healing does not apply to the scenario at hand, the incident flow can continue on its way. Again, each configuration task performed through the configuration management platform will continue to feed the self-healing system with new profiles. More data in the system will lead to better results from the template matching algorithm.

Fault Management

 Fault Management
Jun 14, 2010

Fault Managers (FM) collect alarm and event information from the network elements. There are several interface types that fault managers use in order to collect this data. These interfaces are called the northbound interfaces (NBI) of the given alarm source. Alarm sources can be network elements or, more often, element management systems (EMS). A Fault Manager can also be an alarm source for another OSS system such as SQM, SLA Management, or another Fault Manager (in a manager-of-managers context).

The most popular NBIs are SNMP-based. They use SNMP traps to deliver the fault/event information to the target NMS system. TL1 and CORBA interfaces are also popular but are starting to be considered legacy. JMS is gaining popularity among the NBIs on the market.
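For a feel of what the SNMP side looks like, here is a minimal trap listener sketch using the open-source pysnmp library (the module paths below follow the pysnmp 4.x examples and may differ in newer releases; listening on port 162 typically requires elevated privileges). It simply prints each var-bind of an incoming notification, which in a real FM would be handed to the mediation layer.

```python
from pysnmp.entity import engine, config
from pysnmp.carrier.asyncore.dgram import udp
from pysnmp.entity.rfc3413 import ntfrcv

snmp_engine = engine.SnmpEngine()

# Listen for traps on UDP/162 and accept the community string "public".
config.addTransport(snmp_engine, udp.domainName,
                    udp.UdpTransport().openServerMode(("0.0.0.0", 162)))
config.addV1System(snmp_engine, "my-area", "public")

def on_notification(snmp_engine, state_ref, context_engine_id, context_name, var_binds, cb_ctx):
    # Each var-bind carries one attribute of the alarm/event.
    for name, value in var_binds:
        print(f"{name.prettyPrint()} = {value.prettyPrint()}")

ntfrcv.NotificationReceiver(snmp_engine, on_notification)

snmp_engine.transportDispatcher.jobStarted(1)
try:
    snmp_engine.transportDispatcher.runDispatcher()
finally:
    snmp_engine.transportDispatcher.closeDispatcher()
```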

Fault manager implementations are rather straightforward.

First, you need to identify the type of NBI you will connect to; you can learn this from the product vendor. Second, you need to collect the necessary connection parameters such as security settings, port numbers (CORBA/RMI may use dynamic ports), and IP addresses. Most EMS systems will allow you to select the types of alarms/events that will be forwarded to the NBI. If you do not have an EMS and you are interfacing directly with the devices, you will have to configure the devices for alarm forwarding.

Fault Management products collect the alarms in their mediation layers, where they have modules that know how to collect alarms from a specific source (device/interface type). These modules are also responsible for resynchronization of alarm information.

Resynchronization is an important concept in the Fault Management area. Devices/EMS systems forward their alarms to the fault management systems and do not care whether they have been received or not (this is especially the case for UDP-based SNMP traps). Thus, if the network connectivity between the FM and the EMS is lost, all new alarms or updates to previous alarms (clear alarms) will be lost too. Resynchronization is the process of recovering the alarms after a connectivity issue. How does this happen? Simple: the EMS should maintain an active alarm list. It should also be able to provide this list to an OSS system through its NBI (most probably via a method call, or by setting an SNMP OID value). The OSS system that receives the active alarm list then applies a diff algorithm to find and apply the deltas to its own repository. If the EMS does not have an "active alarm list" feature, then you are out of luck!
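The diff itself is straightforward once the EMS can hand over its active alarm list; a sketch, assuming both sides key their alarms by a unique alarm id, could be:

```python
def resync(local_alarms, ems_active_alarms):
    """Diff the FM's alarm cache against the active alarm list fetched from the EMS NBI.

    Both arguments are dicts keyed by a unique alarm id.
    """
    missed = {aid: a for aid, a in ems_active_alarms.items() if aid not in local_alarms}
    cleared = {aid: a for aid, a in local_alarms.items() if aid not in ems_active_alarms}
    return missed, cleared

local = {"a1": {"severity": "major"}, "a2": {"severity": "minor"}}
ems = {"a1": {"severity": "major"}, "a3": {"severity": "critical"}}

missed, cleared = resync(local, ems)
print("raise locally:", missed)    # a3 arrived while the connection was down
print("clear locally:", cleared)   # a2 was cleared while the connection was down
```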

After the raw alarms have arrived at the FM platform, the filtering phase starts. There may be thousands of active alarms even on small-scale networks, and it is impossible for a NOC to track and manage them all; thus, filtering becomes an essential step. Based on the customer requirements, we put filters on the alarm flow to pass only the alarms we need. Aside from simple pass/no-pass filters, there can be other types of filters that handle specific fault scenarios such as link flaps (a link going up and down). Let me try to explain this. When the link goes down, the EMS sends an alarm. Half a second later the link comes back up, and the EMS emits a clear alarm for the previous one. These conditions should be filtered out, as there is no need to take action on them in the upper layers (a notification could be sent if the flaps continue). A filter could "wait" for a clear event for a specific period of time before sending the alarm to the upper layer. This prevents the flapping from generating an alarm flood in the platform.
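A sketch of such a "wait" filter, using a plain timer and hypothetical callback names, could hold the down alarm for a grace period and drop it if the matching clear arrives in time:

```python
import threading

class FlapFilter:
    """Hold a link-down alarm for a grace period; suppress it if the clear arrives in time."""

    def __init__(self, forward, hold_seconds=5.0):
        self.forward = forward          # callback to the next processing stage
        self.hold_seconds = hold_seconds
        self.pending = {}               # alarm key -> running timer

    def on_alarm(self, key, alarm):
        timer = threading.Timer(self.hold_seconds, self._flush, args=(key, alarm))
        self.pending[key] = timer
        timer.start()

    def on_clear(self, key):
        timer = self.pending.pop(key, None)
        if timer:
            timer.cancel()              # down and up within the window: a flap, suppress it

    def _flush(self, key, alarm):
        self.pending.pop(key, None)
        self.forward(alarm)             # still down after the grace period: a real alarm

f = FlapFilter(forward=lambda alarm: print("forwarding", alarm), hold_seconds=2.0)
f.on_alarm("port-1/3", {"type": "LINK_DOWN"})
f.on_clear("port-1/3")                  # nothing is forwarded
```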
The next phase after filtering is the enrichment phase. In this phase we enrich the alarm information by using external data sources. Most of the time, raw alarm values are meaningless to the NOC operator. In order for the operator to start corrective actions on an alarm instance, he/she needs quick and usable information from the alarm. For example, if the alarm has a field named Device whose value is an IP address, the NOC operator would have to follow a manual procedure to find the host name and region of that device before dispatching the alarm to the correct back office. These time-consuming manual processes should be automated on the Fault Manager. Enrichments are generally applied via custom scripts that use the API of the FM platform.
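A typical enrichment step, sketched here with a hypothetical in-memory inventory table (in practice this would be a CMDB, DNS, or inventory manager lookup behind the FM's API), just merges the looked-up fields into the alarm:

```python
# Hypothetical inventory table; in reality this would be a CMDB, DNS or inventory manager query.
inventory = {
    "10.1.2.3": {"hostname": "ankara-core-01", "region": "Ankara", "backoffice": "transport-team"},
}

def enrich(alarm):
    """Add hostname/region/back-office fields so the NOC operator can act immediately."""
    extra = inventory.get(alarm.get("device"), {})
    return {**alarm, **extra}

print(enrich({"device": "10.1.2.3", "severity": "major", "type": "LOS"}))
```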

All the filtered/enriched alarms are now ready for correlation. Correlation is the process of grouping similar alarms together to increase the efficiency of the NOC and the assurance process. You may have a look at my previous post on this topic.

The last important concept to mention is expert rules. Expert rules are automatic actions that are run in specific cases. An expert rule could be triggered whenever a severe alarm is received by the system or when a specific text is detected in the AdditionalText attribute. The actions could be sending e-mail or SMS, creating trouble tickets, or simply manipulating alarm fields (such as changing the state to Acknowledged).
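Expert rules boil down to condition/action pairs evaluated against each incoming alarm; a minimal sketch, with made-up actions standing in for the FM platform's real integrations, is shown below.

```python
# Hypothetical actions standing in for e-mail, SMS and trouble ticket integrations.
def send_sms(alarm):
    print("SMS to on-call:", alarm["type"])

def open_ticket(alarm):
    print("Opening trouble ticket for", alarm["device"])

# Each rule is a (condition, action) pair; both receive the alarm dict.
rules = [
    (lambda a: a["severity"] == "critical", send_sms),
    (lambda a: "power failure" in a.get("additional_text", "").lower(), open_ticket),
    (lambda a: a["severity"] == "warning", lambda a: a.update(state="Acknowledged")),
]

def apply_rules(alarm):
    for condition, action in rules:
        if condition(alarm):
            action(alarm)
    return alarm

apply_rules({"device": "10.1.2.3", "severity": "critical",
             "additional_text": "Power failure on shelf 2", "type": "ENV"})
```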

All Fault Management systems have similar alarm interfaces: a data grid with the alarms inside. They also employ fancy network maps, which are not usable at all.

Fault Managers have several interactions with other OSS systems such as Trouble Ticket, Workforce Management, SQM, Performance Managers, Inventory Managers etc.

The most important integration, which is usually implemented first, is the trouble ticketing integration. Faults should be tracked and solved quickly, and trouble tickets are the instruments for that. TTs can be opened manually by the NOC operator or automatically by an expert rule.

Fault Managers are must-have OSS systems. Their basic functionality is not very hard to implement; however, advanced features such as correlation can lead to time- and resource-consuming implementations.

Root Cause Analysis

 Fault Management
Mar 22, 2010

Whenever a fault occurs on a resource, this may impact multiple resources (physical or logical) in the infrastructure. This impact may also lead to alarm generation from those items.

In order for the operational staff to react to incidents more effectively, the root cause of these alarms should be identified. After identifying the root alarm, all the child alarms can be linked to it and "hidden" from the operator console.

In fault management, root cause analysis is done by applying correlation algorithms to a set of alarms. Fault managers use expert systems to run these correlations.

There are two types of correlation implemented by fault management systems:

• Rule-based correlation
• Topology-based correlation

Rule-based correlation is the simplest type. The main idea is to collect the alarms that arrive within a predefined time period (a window) and apply a set of rules (if statements) to those alarms to find their root cause (sometimes called the mother alarm).
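A sketch of rule-based correlation, using a made-up rule that a CARD_FAIL alarm is the mother of any PORT_DOWN alarms arriving in the same window, could look like this:

```python
from datetime import datetime, timedelta

def correlate_window(alarms, window=timedelta(seconds=30)):
    """Group alarms arriving close together, then apply a rule to pick the mother alarm."""
    alarms = sorted(alarms, key=lambda a: a["time"])
    groups = []
    for alarm in alarms:
        if groups and alarm["time"] - groups[-1][0]["time"] <= window:
            groups[-1].append(alarm)
        else:
            groups.append([alarm])

    results = []
    for group in groups:
        # Rule: a CARD_FAIL in the window is the root; otherwise fall back to the first alarm.
        root = next((a for a in group if a["type"] == "CARD_FAIL"), group[0])
        children = [a for a in group if a is not root]
        results.append((root, children))
    return results

alarms = [
    {"type": "PORT_DOWN", "port": "1/1", "time": datetime(2010, 3, 22, 9, 0, 1)},
    {"type": "CARD_FAIL", "card": "1", "time": datetime(2010, 3, 22, 9, 0, 3)},
    {"type": "PORT_DOWN", "port": "1/2", "time": datetime(2010, 3, 22, 9, 0, 5)},
]
for root, children in correlate_window(alarms):
    print("root:", root["type"], "-> children:", [c["type"] for c in children])
```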

Topology-based correlation is harder to implement. In this type of correlation, the resource models must be imported into the expert system. These are hierarchical service models that describe the mother-child relationships between resources; a simple example would be port -> interface -> sub-interface. Generally, these expert systems do not have rich user interfaces, because they deal with thousands of alarms and complex resource models.
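The sketch below, with a hypothetical containment model (sub-interface under interface under port under card), groups alarms by the topmost resource they roll up to and treats the alarm raised on that resource as the root:

```python
# Hypothetical containment model: child resource -> parent resource.
topology = {
    "sub-if 1/1.100": "if 1/1",
    "if 1/1": "port 1/1",
    "port 1/1": "card 1",
}

def top_ancestor(resource):
    """Walk up the containment hierarchy to the topmost resource."""
    while resource in topology:
        resource = topology[resource]
    return resource

def correlate_by_topology(alarms):
    """Group alarms whose resources share the same topmost ancestor."""
    groups = {}
    for alarm in alarms:
        groups.setdefault(top_ancestor(alarm["resource"]), []).append(alarm)
    return groups

alarms = [
    {"resource": "card 1", "type": "CARD_FAIL"},
    {"resource": "port 1/1", "type": "PORT_DOWN"},
    {"resource": "sub-if 1/1.100", "type": "IF_DOWN"},
]
for root, group in correlate_by_topology(alarms).items():
    print(root, "->", [a["type"] for a in group])
```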

The resource models are similar to SQM models, so some customers decide to do the service impact analysis in their fault managers. This is a completely legitimate approach, and sometimes the right one, especially if you don't have a scalable SQM tool. However, if there is an SQM tool in the environment, the service impact analysis responsibility should be given to that tool.