IP Probes

Mar 24, 2010

To get end-to-end network performance data, we should install probes in the infrastructure. Probes are the most effective way to measure the end-to-end performance as it is perceived by the end user. An alternative approach is to correlate multiple node-to-node performance measurements to derive the end-to-end performance metrics. This is very hard to maintain and may produce misleading results, so we should use probes wherever possible.

There are two types of probes: active and passive.

Passive probes passively monitor the packets that pass through them. At the entrance point, a probe inspects the packet header to identify the source address, the destination address and some other information. The remote probe at the exit point does the same. Both probes send this information, along with timestamps, to the management system, where the records are correlated and converted to performance metrics. Passive probing is a costly solution. Its main benefit is that the probes do not generate any traffic, so maximum throughput is maintained.
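As a rough illustration of the correlation step, the sketch below matches the records reported by an entrance probe and an exit probe and derives one-way delay and packet loss. The record layout (a packet key plus a timestamp) and the function name are assumptions made for this example, not a vendor format, and the sketch assumes that the two probes have synchronized clocks.

```python
from statistics import mean

def correlate(entry_records, exit_records):
    """Match packets seen at both probes and derive delay and loss.

    Each record is a (packet_key, timestamp_seconds) tuple. The packet key
    stands in for whatever header fields identify a packet (hypothetical).
    """
    exit_times = dict(exit_records)
    # One-way delay for every packet that was seen at both points.
    delays = [exit_times[key] - ts for key, ts in entry_records if key in exit_times]
    # Packets seen at the entrance but never at the exit count as lost.
    lost = sum(1 for key, _ in entry_records if key not in exit_times)
    return {
        "avg_delay_ms": mean(delays) * 1000 if delays else None,
        "loss_ratio": lost / len(entry_records) if entry_records else None,
    }

entry = [("p1", 10.000), ("p2", 10.010), ("p3", 10.020), ("p4", 10.030)]
exit_ = [("p1", 10.025), ("p2", 10.040), ("p4", 10.050)]
metrics = correlate(entry, exit_)
```

In this made-up sample, packet p3 never reaches the exit probe, so one packet in four is reported as lost.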

Active probes, on the other hand, generate special traffic between each other to measure the end-to-end performance. Some probe vendors use ICMP PDUs for this purpose. Other vendors, such as Cisco, prefer to send special PDUs that carry additional parameters. Active probes are cost-effective compared to passive ones. They should be the preferred approach for measuring IP-based traffic.
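To give an idea of what an active probe pair reports, here is a minimal sketch that summarizes the round-trip times of a batch of probe packets. The input format and the jitter definition (mean absolute difference between consecutive answered probes) are assumptions for this example. Real probes may calculate these values differently.

```python
def summarize_probes(rtts_ms):
    """Summarize round-trip times from a batch of active probe packets.

    rtts_ms holds one entry per probe packet sent; None marks a packet
    that never came back.
    """
    if not rtts_ms:
        raise ValueError("at least one probe packet is required")
    answered = [r for r in rtts_ms if r is not None]
    # Differences between consecutive answered probes, used as a simple jitter measure.
    diffs = [abs(b - a) for a, b in zip(answered, answered[1:])]
    return {
        "loss_pct": 100 * (len(rtts_ms) - len(answered)) / len(rtts_ms),
        "avg_rtt_ms": sum(answered) / len(answered) if answered else None,
        "jitter_ms": sum(diffs) / len(diffs) if diffs else 0.0,
    }

summary = summarize_probes([20, 22, None, 21, 25])
```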

Probes can be external (hardware-based) or internal (software-based). Software-based probes are easier to deploy and maintain. The most popular software-based internal probes are Cisco SAA probes (IP SLA).

Probes provide granular data. Typically this data is collected and further aggregated by the performance management systems and forwarded to other systems such as SQM.

Mar 24, 2010

An SLA (Service Level Agreement) is a contract between the service provider and the customer. This contract makes commitments about the quality of the service as perceived by the customer.

There are two main types of SLAs: customer and operational. Customer SLAs are the ones that are sold to the customers. Operational SLAs are in turn of two types: OLAs and underpinning SLAs. OLAs are internal to the provider; for example, they are the commitments agreed between two departments. Underpinning SLAs, on the other hand, are signed with the suppliers or partners of the provider.

SLA Management starts with the business commitment. As SLAs need the full support of the organization involving multiple functional groups, there should be a common understanding about the vocabulary, methodologies and consequences.

A typical SLA Management process includes the following 5 steps:

1-) Create the SLA Template

SLA templates define the SLAs that will be negotiated with the customers. They include the service level indicators (SLIs) that will be monitored, such as availability (downtime), reliability (e.g. MTBF) and maintainability (e.g. MTTR).
Typically, SLA templates also include service levels, such as gold, silver and bronze, that indicate the acceptable, offered SLI values. A gold service level may specify 99.99% availability and a 1-hour MTTR, while a silver one may commit to 99.98% availability with a 2-hour MTTR. Service levels should be decided jointly by the marketing and the operational teams. I have seen cases in which the service levels were decided by the marketing teams (most probably using the values that the competitors were committing to) and mandated to the operational teams. The operational teams, however, complained that those values were almost impossible to maintain.
SLAs should be designed with care, as they are a direct interface with the customers and have financial impacts. Service levels also limit “expectation creep” and set the acceptable targets.
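A quick calculation shows why such values can be hard to maintain. The sketch below converts an availability target into the downtime it allows. The 30-day reporting period is an assumption for this example. The actual period comes from the calendar defined in the SLA.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Downtime budget implied by an availability target over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)

gold = allowed_downtime_minutes(99.99)    # roughly 4.3 minutes per 30 days
silver = allowed_downtime_minutes(99.98)  # roughly 8.6 minutes per 30 days
```

With these numbers, a single incident that takes the full 1-hour MTTR to repair already exceeds the gold downtime budget for the month many times over. This is the kind of mismatch that the operational teams should catch before the template is published.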

SLA templates have other parameters as well: terms and conditions, penalties, roles and responsibilities, and the calendar, to name a few.

2-) Negotiate the SLA

This step mainly belongs to the sales area. In this step, the provider and the customer work on the SLA templates to construct the “customized” SLA that aligns with the customer’s business. The customer, hopefully, selects a service level that suits their needs. However, customers may (and generally do) want some SLI commitments that do not match any service level in the template. There can be several reasons. For example, the customer may be running a business-critical service for which the committed SLI values are not sufficient. Another example is the case of aligning OLAs/underpinning SLAs with customer SLAs. (I will explain this in a later post.)

Sales may be tempted to agree to any values in order to win the customer. We should avoid this situation. All custom service levels should be negotiated with operations before the contract is signed by the customer.

3-) Monitor the SLAs

SLAs should be monitored, and violations and degradations should trigger notifications. After the contract is signed, an SLA instance is created within the SLA Management tool (and in the SQM tool if it is separate). This is the service quality monitoring step, and it is mainly targeted at the operational teams of the provider. There may be a customer interface that shows the current accumulated downtime and incident records, but this data is real time and most providers choose not to expose it to the customer.

4-) Report the SLAs

SLA reports should be generated at the end of each reporting period. The reports should not be sent directly from the tool to the customer, as they may have financial impacts. There should be a control mechanism on the provider side before they are “published”. If the tool supports it, the customer should be given an interface to the SLA tool to see previous SLA reports.

5-) Review the SLAs
SLAs and their parameters should be reviewed occasionally and service levels should be fine-tuned.

SLA Management is a complex process that involves multiple tools, organizational units and the customers. There is a lot more to say about it. I will continue writing about SLA Management to explore specific areas in more detail.

Root Cause Analysis

Mar 22, 2010

Whenever a fault occurs on a resource, this may impact multiple resources (physical or logical) in the infrastructure. This impact may also lead to alarm generation from those items.

For the operational staff to react to incidents more effectively, the root cause of these alarms should be identified. After the root alarm is identified, all the child alarms can be linked to that alarm and hidden from the operator console.

In fault management, root cause analysis is done by applying correlation algorithms to a set of alarms. Fault managers use expert systems to run these correlations.

There are two types of correlation that are implemented by fault management systems.

• Rule based correlation
• Topology based correlation

Rule based correlation is the simplest type of correlation. The main idea is to collect the alarms that arrive within a predefined time period (window) and apply some rules (if statements) to those alarms to find their root cause (sometimes called the mother alarm).
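The idea can be sketched in a few lines. In the example below, the alarm fields, the alarm types and the rule format are all hypothetical. A rule simply says "if both of these alarm types appear in the same window, the first one is the mother alarm". Real expert systems use much richer rule languages.

```python
def correlate_window(alarms, window_seconds, rules):
    """Group alarms into time windows and apply rules to pick a mother alarm.

    alarms: list of dicts with "time" (seconds), "resource" and "type" keys.
    rules: list of (root_type, child_type) pairs.
    """
    groups = []
    ordered = sorted(alarms, key=lambda a: a["time"])
    i = 0
    while i < len(ordered):
        start = ordered[i]["time"]
        # Collect every alarm that arrived within the window opened by alarm i.
        window = [a for a in ordered[i:] if a["time"] - start <= window_seconds]
        i += len(window)
        for root_type, child_type in rules:
            roots = [a for a in window if a["type"] == root_type]
            children = [a for a in window if a["type"] == child_type]
            if roots and children:
                groups.append({"root": roots[0], "children": children})
    return groups

alarms = [
    {"time": 100, "resource": "router1:ge-0/0", "type": "LINK_DOWN"},
    {"time": 102, "resource": "router1", "type": "BGP_PEER_DOWN"},
    {"time": 103, "resource": "router2", "type": "BGP_PEER_DOWN"},
    {"time": 500, "resource": "router3", "type": "BGP_PEER_DOWN"},
]
groups = correlate_window(alarms, window_seconds=60, rules=[("LINK_DOWN", "BGP_PEER_DOWN")])
```

The alarm at time 500 falls into a later window that has no LINK_DOWN alarm, so it is left uncorrelated and stays visible to the operator.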

Topology based correlation is harder to implement. In this type of correlation, the resource models should be imported into the expert systems. These are hierarchical service models that describe the mother-child relationships between the resources. A simple example would be port->interface->sub-interface. Generally these expert systems do not have rich user interfaces. This is because they deal with thousands of alarms and complex resource models.
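A minimal sketch of this approach, using the port->interface->sub-interface example, could look like the following. The resource names and the child-to-parent dictionary are assumptions for this example. A real resource model would be imported from an inventory system.

```python
# Hypothetical resource model: each child resource points to its mother resource.
PARENT = {
    "subif-1/1.100": "if-1/1",
    "subif-1/1.200": "if-1/1",
    "if-1/1": "port-1",
    "if-1/2": "port-1",
}

def find_root_alarms(alarmed_resources):
    """Link each alarmed resource to its topmost alarmed ancestor.

    Returns a dict that maps every root resource to the list of alarmed
    resources below it (the child alarms that can be hidden).
    """
    alarmed = set(alarmed_resources)
    roots = {}
    for resource in alarmed:
        root = resource
        node = resource
        # Walk up the hierarchy and remember the highest ancestor that is in alarm.
        while node in PARENT:
            node = PARENT[node]
            if node in alarmed:
                root = node
        roots.setdefault(root, [])
        if root != resource:
            roots[root].append(resource)
    return {root: sorted(children) for root, children in roots.items()}

result = find_root_alarms(["if-1/1", "subif-1/1.100", "subif-1/1.200", "if-1/2"])
```

Here the two sub-interface alarms are linked to the alarm on if-1/1, while if-1/2 remains a root alarm of its own because port-1 is not in alarm.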

The resource models are similar to SQM models, so some customers decide to do the service impact analysis on their fault managers. This is a completely legitimate approach, and sometimes the right one, especially if you don’t have a scalable SQM tool. However, if there is an SQM tool in the environment, the responsibility for service impact analysis should be given to that tool.

Implementing Service Quality Management

Mar 22, 2010

As customers came to depend more on service providers, they started demanding commitments on the quality of the services that they receive. The SQM (Service Quality Management) concept came into practice to fulfill this demand. SQM is a set of practices to ensure that the customers receive the service quality that they need. The goal is to manage (negotiate, monitor, report, predict/review, take action on) the quality/level of the services that are provided to the customers.

Service Quality Management is implemented in several steps. I try to apply the following 8-step methodology in the SQM projects that I am involved in.

1-) Decide on the services to be monitored

In this step, we decide which services will be monitored by the SQM process. These services are mainly customer facing services that are perceived by the customers. We may also want to monitor the services that we receive from 3PPs.

2-) Design the service model

Services have components, and the components have parameters. Service components also have dependencies between them. All of these aspects make up the “model” of the service. A service model is a hierarchical structure of elements. That’s why it is sometimes called the service tree. Service models are the backbone of SQM. They use the concept of status propagation to find the impact of a lower-level service component on the service that resides at the topmost level. This analysis through status propagation is also called service impact analysis.
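Status propagation can be sketched as a simple walk over the service tree. The tree, the status names and the "worst status wins" rule below are assumptions for this example. SQM tools may offer other propagation rules.

```python
SEVERITY = {"ok": 0, "degraded": 1, "violated": 2}

# Hypothetical service tree: each node lists its child components.
TREE = {
    "IPTV service": ["Access network", "Video head-end"],
    "Access network": ["DSLAM", "Aggregation switch"],
    "Video head-end": [],
    "DSLAM": [],
    "Aggregation switch": [],
}

def propagate(node, own_status):
    """Return the status of a node after propagation.

    own_status maps component names to their locally measured status;
    components that are not listed are treated as "ok". The propagated
    status is the worst of the node's own status and its children's.
    """
    statuses = [own_status.get(node, "ok")]
    statuses += [propagate(child, own_status) for child in TREE[node]]
    return max(statuses, key=SEVERITY.get)

status = propagate("IPTV service", {"DSLAM": "degraded"})
```

In this sketch a degradation on the DSLAM travels up through the access network and marks the whole IPTV service as degraded, which is the service impact analysis described above.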

Designing the service model is the most important step of an SQM implementation. It requires domain knowledge and should therefore involve all the functional units that are responsible for managing the different components of a service. Before designing the service model, we should decide on the granularity of the service. This is a very important part of the service design. The granularity of a service defines the number of service components in the service model. As this number grows, we end up with more complex service models. Service models that have lots of service components become unmanageable and may lead to scalability problems in the SQM tools.

After deciding on the level of granularity, we can start “drawing” the model. Most SQM tools provide drag-and-drop service designers for creating new models.

3-) Decide on the monitoring points and data sources

After designing the service model, the next step is to “feed” it with data collected from several points in the infrastructure. The data sources for SQM will typically be other OSS tools, mainly Performance Management, Fault Management and Trouble Ticketing systems. Active or passive probes that provide end-to-end network and application performance data should also be introduced where available.

The data sources will provide us with raw data, which should be mapped to KPIs. From the KPIs, we may choose to develop secondary parameters (KQIs). Those KPIs/KQIs should then be mapped to services/service components in the service model.
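The chain from raw data to KPIs, then to KQIs and finally to the service model can be illustrated as follows. All names, the component mapping and especially the KQI formula are made up for this example. A real KQI definition would come from the service design.

```python
def packet_loss_kpi(sent, received):
    """KPI: packet loss percentage derived from raw counters."""
    return 100 * (sent - received) / sent

def latency_kpi(samples_ms):
    """KPI: average latency derived from raw latency samples."""
    return sum(samples_ms) / len(samples_ms)

def voice_quality_kqi(loss_pct, latency_ms):
    """Hypothetical KQI: a 0-100 score that drops with loss and high latency."""
    score = 100 - 10 * loss_pct - 0.1 * max(0, latency_ms - 100)
    return max(0.0, min(100.0, score))

# Raw data as it might arrive from a performance management system (made up).
raw = {"sent": 1000, "received": 990, "latency_ms": [90, 110, 130]}

kpis = {
    "packet_loss_pct": packet_loss_kpi(raw["sent"], raw["received"]),
    "avg_latency_ms": latency_kpi(raw["latency_ms"]),
}
kqis = {"voice_quality": voice_quality_kqi(kpis["packet_loss_pct"], kpis["avg_latency_ms"])}

# Hypothetical mapping of KPIs/KQIs to components in the service model.
MAPPING = {
    "Access network": ["packet_loss_pct", "avg_latency_ms"],
    "VoIP service": ["voice_quality"],
}
```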

4-) Design the data collection

After we define the mappings from the raw data to the model, the next step is to define the rules for polling, aggregation and the no-data policy. Polling intervals define the granularity of the downtime data, so the intervals should be as small as possible. However, polling too frequently will lead to performance problems.
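The sketch below shows how the choice of a no-data policy changes the aggregated result. The policy names and the input format are assumptions for this example. Note also that with a 5-minute polling interval, for instance, a single failed poll stands for 5 minutes of downtime, which is why the interval defines the granularity.

```python
def aggregate(samples, no_data_policy="ignore"):
    """Aggregate one period of polled availability samples (1 = up, 0 = down).

    None marks a poll that returned no data. The policy names are hypothetical:
    "ignore" drops missing polls, "assume_down" counts them as downtime and
    "assume_up" counts them as uptime.
    """
    fill = {"assume_down": 0, "assume_up": 1}
    if no_data_policy == "ignore":
        values = [s for s in samples if s is not None]
    else:
        values = [fill[no_data_policy] if s is None else s for s in samples]
    if not values:
        return None
    # Availability percentage for the aggregation period.
    return 100 * sum(values) / len(values)

polls = [1, 1, None, 0, 1, 1]
ignored = aggregate(polls)
pessimistic = aggregate(polls, "assume_down")
optimistic = aggregate(polls, "assume_up")
```

The same six polls give 80% availability when the missing poll is ignored, about 67% when it is treated as downtime and about 83% when it is treated as uptime, so the policy should be agreed before the data feeds any SLA calculation.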

5-) Design the thresholds and the actions that will be applied on those thresholds

To play a proactive role, the provider should be notified of service level degradations and violations. Thresholds are the tools that enable proactive monitoring. Setting the correct thresholds requires domain and application knowledge. The thresholds should also support the SLAs that will be given to the customers. In SQM, a specific service component parameter may be assigned several levels of thresholds for the same KPI. Typically there will be one violation threshold and multiple degradation thresholds. The thresholds change the status of the service components, which in turn propagates up to the service level.

Whenever a threshold is breached, an action should be triggered. This action could be very simple such as sending a notification via email or complex such as triggering a traffic engineering script. It is a good practice to open trouble tickets (performance degradation reports in eTOM terms) in the TT systems to trigger a process that will take corrective actions on the service quality degradations.
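A multi-level threshold with actions can be sketched like this. The KPI (latency in milliseconds), the threshold values, the status names and the status-to-action mapping are all assumptions for this example. The actions are plain strings standing in for real integrations.

```python
# Hypothetical thresholds for one KPI (latency in ms), ordered from worst to mildest.
THRESHOLDS = [
    (200, "violated"),
    (150, "degraded-major"),
    (100, "degraded-minor"),
]

# Hypothetical action for each status, taken from the kinds of actions named above.
ACTIONS = {
    "degraded-minor": "send email notification",
    "degraded-major": "open trouble ticket",
    "violated": "trigger traffic engineering script",
}

def evaluate(value):
    """Return the status for a KPI value: the worst threshold it reaches."""
    for limit, status in THRESHOLDS:
        if value >= limit:
            return status
    return "ok"

def check(value):
    """Return the status and the action (if any) for a KPI value."""
    status = evaluate(value)
    return status, ACTIONS.get(status)

minor = check(120)
violation = check(250)
healthy = check(50)
```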

6-) Create Customer SLAs and/or OLAs

SLAs are the drivers of SQM, and for a complete solution we should introduce customer SLAs or OLAs. SLAs and OLAs are negotiated with the customers, suppliers, partners or internal departments. SLA violations will incur penalties. To prevent SLA violations, proactive thresholds should also be put on the SLA parameters. Customer SLAs should be supported by OLAs. Therefore, assigning the right thresholds to the OLAs is very important.

The SLA and OLA parameters are sometimes called service level indicators, and they are negotiated and listed in the SLA/OLA contracts.

7-) Design the Service Quality Reporting

The service quality data that is created by the SQM systems is distributed through service quality reports. Different reports should be designed for users that have different perspectives. For example, an executive summary report may include some statistics about total SLA violations and degradations. While the summary reports may provide enough detail for upper management, different departments in the organization may be interested in more detailed, technical reports.
The reports should be distributed automatically on a daily, weekly or monthly basis. The automatic report distribution should be internal, and the SQM reports should not be sent to the customer directly without a control mechanism.

8-) Get the commitment from the organization for SQM

SQM should be implemented by separate functional groups, namely service operating centers (SOCs). The SOCs have end-to-end visibility of the services and they are customer-aware. The SOC constantly monitors the services, just as the NOC monitors the resources. Whenever a service problem or degradation is detected, the SOC should take responsibility and coordinate the necessary actions. This may involve communicating with the NOC (or IT) to resolve the resource problems.

SQM is beneficial, but it also brings more stress and extra work for the operational staff. This will lead to resistance to change. That’s why SQM requires commitment from the upper levels of the organization. SQM also brings additional vocabulary, so the necessary training should be provided to the operational teams to avoid confusion.

There are several products on the market that support SQM. Each of them has strengths and weaknesses. As we can see from the 8-step methodology, rolling out an SQM implementation can be a very time-consuming process. Therefore the tools that support the SQM process should be flexible, scalable and easy to implement. For example, one tool may require you to “compile” the service model before you start monitoring it. This will require either vendor involvement or more skilled staff. From the scalability perspective, a tool may provide a very rich service quality monitoring dashboard but lack scalability in the total number of managed objects in the service tree or in the number of service instances. Selecting the right product is the key to a successful implementation.