Are you ready for SLA Management?

Feb 28, 2013
 
This should be one of the top questions for mature operators that provide corporate services to their customers. Customers are demanding higher quality and less downtime, while network capacity is under pressure from smartphone and Facebook traffic, M2M transactions, and the like. In such circumstances it is hard to commit to a bit rate or a downtime percentage. Of course you can say that you will have at most 10 days of downtime, but nobody will buy that. From the other direction, if we commit that we won't have more than 1 minute of downtime in a month, we will most probably fail and fall into a penalty situation whose cost will not be covered by the extra gains from our next-generation SLA offering.

So, if we want to provide SLA management, we need to measure first. After that, we can predict. The data we need to measure is most probably already at hand: we have plenty of KPIs in our PM platform, lots of resource- and service-impact alarms in our FM platform, and nicely enriched tickets in the TT platform.

After defining the KQI set that will be the basis for the given SLAs, we need to identify the KPIs that will feed those KQIs. Before moving to the nature of the KPI data itself, we should measure the health of our OSS data flows. That is to say, if I collect KPI data from the data source every 5 minutes, how many times a day do I encounter "no data in the data source" or "TCP connect failure" scenarios? Questions like these reveal my OSS performance, and OSS performance is very important. (That is why some SQM systems monitor the OSS and its collection processes.) If we run short on performance in the collection layer, we should fix that first before moving on to the KPI values themselves.
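As a toy illustration of measuring collection health, the sketch below counts how many scheduled polls per data source actually delivered data. The record format, source names, and status strings are all hypothetical, not taken from any particular PM product.

```python
from collections import Counter

# Hypothetical collection log: one record per scheduled 5-minute poll.
polls = [
    {"source": "pm-core",  "status": "ok"},
    {"source": "pm-core",  "status": "no_data"},
    {"source": "pm-core",  "status": "ok"},
    {"source": "pm-radio", "status": "tcp_connect_failure"},
    {"source": "pm-radio", "status": "ok"},
]

def collection_health(polls):
    """Share of successful polls per data source over the period."""
    totals, ok = Counter(), Counter()
    for p in polls:
        totals[p["source"]] += 1
        if p["status"] == "ok":
            ok[p["source"]] += 1
    return {src: ok[src] / totals[src] for src in totals}

print(collection_health(polls))
```

A source whose success ratio drops below an agreed floor points to an OSS collection problem that should be fixed before the KPI values themselves are trusted.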

If the collection is running fine, then we need to start baselining the KPI data. Baselining is provided by most PM tools in the form of off-the-shelf reports; we can look at these to see how well we can perform throughout a reporting period. This manual process is effective, but an automatic one would be better. We can push the PM vendors to provide a mechanism to export that prediction data somewhere so that we can use it as a KPI to compare against our currently delivered thresholds. If the system finds that we are approaching the red zone, it can open a change / TT to trigger a capacity upgrade process for the resources involved in the SLA delivery chain.
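An automated version of that comparison could be as simple as the sketch below. The baseline here is just the historical peak, and the 90% "red zone" margin is an assumption for illustration, not a product default.

```python
def red_zone(history, current, margin=0.9):
    """Return True when the current KPI value approaches its baselined peak.

    history: past KPI samples (e.g. link utilization in percent)
    margin:  fraction of the baseline peak treated as the red zone (assumed)
    """
    threshold = margin * max(history)  # simplest possible baseline
    return current >= threshold

utilization_history = [61.0, 64.5, 70.2, 68.9, 72.4]
if red_zone(utilization_history, current=66.0):
    print("open a change/TT to trigger a capacity upgrade")
```

A real baseline would come from the PM tool's prediction export rather than a raw maximum, and the trigger would open a ticket through the TT system rather than print a message.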

This is the technical side of the story. There’s another side, which seems to me a little bit harder to deal with: the business.
You need to train the business people on the concepts: what is an SLA, a KPI, a KQI, baselining, a product offering, CFS, RFS, and so on. They probably won't cooperate much in your SLA initiative if they do not understand any of it. They should be able to calculate the risks, convert the risks to dollar amounts, and create an SLA offering that accounts for these amounts. SLA offerings should also cover the internal SOC organization's support costs. The sales people should never have the option to play with the parameters in the offerings; these should be fixed against the baselines and issued as SLA templates.

The analysis work also involves third parties. If we rely on any third parties along our service delivery path, we should baseline their performance as well, and we should sign the necessary underpinning contracts with them, taking into account our SLA objectives.

SLA Management should be a bottom-up process. Top-down approaches are too risky, and most decision makers will not approve such a project. A well-planned SLA Management process can bring additional revenue as well as a huge competitive advantage.

From SQM to Customer SLA Management

Nov 10, 2011
 
Customers will expect you to deliver the quality of service you committed to in the presales phase. Most operators, however, fall short of this expectation, and service outages keep occurring. Operators who do not trust their current network tend not to implement any customer SLA management process. Lacking an end-to-end view, these operators just accept trouble tickets coming from the customer side, and since most telecommunications regulatory bodies enforce customer SLA management processes, such CSPs calculate SLAs based only on customer-facing trouble-ticket information.

This reactive, primitive approach is no longer welcomed by today’s corporate customers. These customers expect more reporting capabilities that will also include outages that they are not aware of.

If the CSP has an existing SQM process, it can be an enabler for an effective customer SLA management system. Whenever a process name includes "customer", you should be careful: if you show an outage to the customer in a report, you give them the option to claim a rebate. Therefore we, as CSPs, need the ability to adjust outage records after the outage has already occurred.

Most SQM systems will not allow you to edit outage events that have already occurred. Therefore, these events should be exported to another system (NW DWH, EDWH) for further correlation with force majeure outage information.

If the CSP has not invested in an SQM solution before, it could utilize a Network Analytics (NW DWH) solution, but this will not provide all the benefits that an SQM system brings. The best combination will always be SQM and Customer SLA Management solutions working together in a fully managed reporting environment.

OK, but how can we correlate force majeure with service outage information coming from, for example, a network probe? Or rather, how can we identify force majeure in the first place? The key component here could be the service management platform where the problem management processes have been implemented. SQM systems auto-populate resource-facing incidents for proactive recovery. These incidents can afterwards be inspected for a common root cause, and if the root cause of the outage is not an operator fault, the outage can be added to the force majeure list and excluded from the SLA calculation.
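Put together, the exclusion step might look like the sketch below, where outage records exported from the SQM carry a force majeure flag set during root-cause inspection. The field names and figures are illustrative only.

```python
# Outage records as exported from SQM to the DWH; the force_majeure flag
# is assumed to be set during root-cause inspection in problem management.
outages = [
    {"id": "OUT-1", "minutes": 30,  "force_majeure": False},
    {"id": "OUT-2", "minutes": 120, "force_majeure": True},   # e.g. storm damage
    {"id": "OUT-3", "minutes": 15,  "force_majeure": False},
]

def chargeable_downtime(outages):
    """Downtime counted against the customer SLA, excluding force majeure."""
    return sum(o["minutes"] for o in outages if not o["force_majeure"])

print(chargeable_downtime(outages))  # 45 of the 165 recorded minutes
```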

Jan 15, 2011
 

As their level of maturity increases, operators focus more on improving the quality of the services they deliver. As I explained in my previous posts, SQM (sometimes called BSM, Business Service Management) is the key to measuring the quality of the overall service. In SQM we model the service, and the service is composed of multiple service components. These components can be HLR, Charging, Core Network, Packet Network, Applications, Databases, etc., and each of them is managed by a functional unit in the organization.
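The idea can be sketched with a minimal service model in which a service's status rolls up from its components. The service name and the worst-status-wins rule are assumptions for illustration, not how any particular SQM product behaves.

```python
# Severity ordering for component statuses (assumed scale).
SEVERITY = {"ok": 0, "degraded": 1, "outage": 2}

# A hypothetical service composed of components from the text.
service_model = {
    "CorporateVPN": ["Core Network", "Packet Network", "HLR", "Charging"],
}

component_status = {
    "Core Network":   "ok",
    "Packet Network": "degraded",
    "HLR":            "ok",
    "Charging":       "ok",
}

def service_status(service):
    """Worst component status wins, which also pinpoints the culprit."""
    statuses = [component_status[c] for c in service_model[service]]
    return max(statuses, key=SEVERITY.get)

print(service_status("CorporateVPN"))  # degraded
```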

In an SQM-less scenario, following the top-down approach, when a service problem is identified at the very top level, the source of the problem has to be hunted down. Identifying the root source may be time consuming (the departments will most probably blame each other). SQM can pinpoint the source so that the necessary actions can be taken quickly, which increases effectiveness. But who runs the SQM?

According to eTOM, SQM belongs to the Service Management & Operations functional grouping, so as a best practice it is advisable to assign a department to this process. (eTOM does not mandate that it be separate; this could be a role assigned to an existing department. But as we will see, consolidating functions won't be effective.)

A new term, the Service Operation Center, arises with the introduction of the service quality management concept. A Service Operation Center, or SOC, is an organizational department that monitors the quality of the overall service and takes the necessary actions in case of service degradations and outages. The main data source for the SOC screens will be the SQM. The operators in the SOC will continuously monitor SQM and coordinate with other departments to decrease the MTTR of service outages.

A typical operator has a Network Operation Center, or NOC, inside its organization. The NOC manages the NMS systems, monitors faults and events, tracks the performance of the network, and troubleshoots problems at first hand (L1 support). However, as the name implies, the main purpose of the NOC is to manage the network.

The network is only one part of the service. There are other components from IT, and there may even be components from outside the organization, such as content. The NOC's primary responsibility is to deal with complex network-specific problems; they should not be the ones communicating with a content provider to resolve a problem.

Because a network service provider's main product is the network, NOCs have, up to now, been sufficient for the overall assurance activities. But as services get more diverse and complex, the SOC concept becomes much more logical.

As the SOC deals with cross-functional teams, it should be sponsored by an upper-level organizational entity to be effective. The SOC should also have the necessary interfaces to the other units (most likely the TT system), to which strict OLAs apply. The people in the SOC should include experts with skills in networking, IT, and the other topics needed to streamline troubleshooting activities.

Mar 24, 2010
 

An SLA (Service Level Agreement) is a contract between the service provider and the customer. This contract makes commitments about the quality of the service as perceived by the customer.

There are two main types of SLAs: customer and operational. Customer SLAs are the ones sold to customers. Operational SLAs come in two types: OLAs and underpinning SLAs. OLAs are internal to the provider; for example, they are commitments agreed between two departments. Underpinning SLAs, on the other hand, are signed with the provider's suppliers and partners.

SLA Management starts with the business commitment. As SLAs need the full support of the organization involving multiple functional groups, there should be a common understanding about the vocabulary, methodologies and consequences.

A typical SLA Management process includes the following 5 steps:

1) Creating the SLA Template

SLA templates define the SLAs that will be negotiated with the customers. They include the service level indicators (SLIs) that will be monitored, such as availability (downtime), reliability (e.g. MTBF), and maintainability (e.g. MTTR).
Typically, SLA templates also include service levels such as gold, silver, and bronze that indicate the acceptable, offered SLI values. A gold service level may promise 99.99% availability and a 1-hour MTTR, while a silver one may commit to 99.98% availability with a 2-hour MTTR. Service levels SHOULD be decided in cooperation between the marketing and operational teams. I have seen examples in which the service levels were decided by the marketing teams (most probably using values the competitors were committing to) and mandated to the operational teams, who then complained that those values were almost impossible to maintain.
SLAs should be designed with care as they have a direct interface with the customers and have financial impacts. Service levels also limit the “expectation creep” and set the acceptable targets.
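Those availability percentages translate directly into a downtime budget; a quick sketch of the arithmetic, assuming a 30-day reporting month:

```python
def downtime_budget_minutes(availability_pct, days=30):
    """Allowed downtime per reporting period for a given availability target."""
    period_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return period_minutes * (1 - availability_pct / 100)

for level, pct in [("gold", 99.99), ("silver", 99.98)]:
    print(f"{level}: {downtime_budget_minutes(pct):.2f} min/month")
# gold allows about 4.32 minutes of downtime, silver about 8.64 minutes
```

Seeing the commitment as "4 minutes a month" rather than "99.99%" makes the financial risk of a service level much more tangible to the business side.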

SLA templates have other parameters as well: terms and conditions, penalties, roles and responsibilities, and the calendar, to name a few.

2) Negotiate the SLA

This step mainly belongs to the sales area. Here the provider and the customer work on the SLA templates to construct the "customized" SLA that aligns with the customer's business. In this step the customer, hopefully, selects a service level that suits their needs. However, customers may (and generally do) want SLI commitments that do not match any service level in the template. The reasons can be several: the customer may be running a business-critical service for which the committed SLI values are not satisfactory, or there may be a need to align OLAs/underpinning SLAs with customer SLAs. (I will explain this in a later post.)

Sales can be tempted to agree on any values just to win the customer. We should avoid this situation: all custom service levels should be negotiated with operations before the contract is signed by the customer.

3) Monitor the SLAs

SLAs should be monitored, and violations and degradations should be notified. After the contract is signed, an SLA instance is created within the SLA Management tool (and in the SQM tool if it is separate). This is the service quality monitoring step, and it is mainly targeted at the operational teams of the provider. There may be a customer interface for viewing the currently accumulated downtime and incident records, but this is real-time data, and most providers choose not to expose it to the customer.
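The monitoring step essentially compares the accumulated downtime against the downtime allowed by the committed service level. The sketch below does exactly that; the 80% warning threshold is an operational assumption, not part of any contract.

```python
def sla_state(accumulated_minutes, budget_minutes, warn_ratio=0.8):
    """Classify an SLA instance during the reporting period.

    warn_ratio is an assumed operational threshold for raising a
    degradation warning before an actual violation occurs.
    """
    if accumulated_minutes >= budget_minutes:
        return "violated"
    if accumulated_minutes >= warn_ratio * budget_minutes:
        return "degraded"
    return "ok"

# A silver-level instance with 8.64 minutes of allowed monthly downtime.
print(sla_state(accumulated_minutes=7.5, budget_minutes=8.64))  # degraded
```

A "degraded" result is the point where the SOC should step in, well before the violation turns into a penalty.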

4) Report the SLAs

SLA reports should be generated at the end of each reporting period. The reports should not be sent to the customer directly from the tool, as they may have financial impacts; there should be a control mechanism on the provider side before they are "published". The customer should be given an interface to the SLA tool to see their previous SLA reports (if the tool supports this).

5) Review the SLAs
SLAs and their parameters should be reviewed occasionally and service levels should be fine-tuned.

SLA Management is a complex process that involves multiple tools, organizational units, and the customers. There is a lot to talk about in SLA Management; I will continue writing to explore specific areas in more detail.