This should be one of the top questions for mature operators that provide corporate services to their customers. Customers are demanding higher quality and less downtime, while network capacity is under pressure from smartphone Facebook traffic, M2M transactions and the like. In such circumstances, it is hard to commit to a bit rate or a downtime percentage. Of course you can say that you will have ten days of downtime at most, but nobody will buy that. From the other direction, if we commit to no more than one minute of downtime per month, we will most probably fail and fall into a penalty situation whose cost will not be covered by the extra gains from our next-generation SLA offering.
So, if we want to provide SLA management, we need to measure first. After that, we can predict. What we need to measure is most probably in our hands already: we have plenty of KPIs in our PM platform, lots of resource and service impact alarms in our FM, and nicely enriched tickets in the TT platform.
After defining the KQI set that will be the base for the given SLAs, we need to identify the KPIs that will feed those KQIs. But before looking at the nature of the KPI data itself, we should measure the health of our OSS data flows. That is to say, if I collect KPI data from a datasource every 5 minutes, how many times a day do I encounter a “no data in the datasource” or “TCP connect failure” type of scenario? These kinds of questions reveal my OSS performance, and OSS performance is very important. (That’s why some SQM systems monitor the OSS and its collection processes.) If we run short on performance in the collection layer, we should fix that first before moving on to the KPI values themselves, and a health check like the sketch below is a good starting point.
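Here is a minimal sketch of such an OSS collection-health check, assuming each collection attempt is logged as a record with a datasource name and a status; the field names and status values are hypothetical, not taken from any specific PM product.

```python
from collections import Counter

# Example poll log: one entry per 5-minute collection cycle (hypothetical format).
poll_log = [
    {"datasource": "ran_counters", "status": "ok"},
    {"datasource": "ran_counters", "status": "no_data"},
    {"datasource": "core_counters", "status": "tcp_connect_failure"},
    {"datasource": "ran_counters", "status": "ok"},
]

def collection_health(log):
    """Return the share of successful polls per datasource over a day."""
    totals, failures = Counter(), Counter()
    for entry in log:
        totals[entry["datasource"]] += 1
        if entry["status"] != "ok":
            failures[entry["datasource"]] += 1
    return {ds: 1 - failures[ds] / totals[ds] for ds in totals}

for ds, health in collection_health(poll_log).items():
    # With 5-minute polling there are 288 cycles per day; anything
    # noticeably below 100% here is an OSS problem, not a network
    # problem, and should be fixed before trusting the KPI values.
    print(f"{ds}: {health:.1%} successful collections")
```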
If the collection is running fine, then we need to start baselining the KPI data. Baselining is provided by most PM tools in the form of off-the-shelf (OTS) reports, and we can look at these to see how well we can perform throughout a reporting period. This manual process is effective, but an automatic one is better. We can push the PM vendors to provide a mechanism to export that prediction data somewhere, so that we can use it as a KPI and compare it against the thresholds we currently deliver. If the system finds that we are approaching the red zone, it can open a change/TT to trigger a capacity upgrade for the resources involved in the SLA delivery chain, along the lines of the sketch below.
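A rough sketch of that automated comparison: baseline a KPI from its history, then flag when the current value drifts within normal variation of the committed threshold. The KPI values, the two-sigma margin, and the ticket-opening step are all illustrative assumptions, not part of any particular PM or TT product.

```python
import statistics

def approaching_red_zone(current, committed, stdev, margin=2.0):
    """True when the KPI is within `margin` standard deviations of the
    committed ceiling, i.e. normal variation could already breach it."""
    return current + margin * stdev >= committed

# Hypothetical utilisation KPI (%) collected over the last reporting period.
history = [61, 63, 60, 65, 68, 70, 72, 74]
stdev = statistics.stdev(history)
current, committed = history[-1], 80  # SLA commits utilisation below 80%

if approaching_red_zone(current, committed, stdev):
    # In a real deployment this branch would open a change/TT to trigger
    # a capacity upgrade for the affected resources in the SLA chain.
    print(f"Approaching red zone: {current}% vs committed {committed}%")
```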
This is the technical side of the story. There’s another side, which seems to me a little bit harder to deal with: the business.
You need to train the business people on the concepts: what an SLA, KPI, KQI, baseline, Product Offering, CFS, or RFS is. They probably won’t cooperate much in your SLA initiative if they do not understand any of it. They should be able to calculate the risks, convert the risks to dollar amounts, and create an SLA offering that takes care of these amounts. SLA offerings should also cover the internal SOC organization’s support costs. The sales people should never have the option to play with the parameters in the offerings; these should be fixed against the baselines and issued as SLA templates, as in the sketch below.
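One way to make “sales can’t play with the parameters” concrete is to issue SLA templates as immutable records derived from the baselines. The fields and figures below are illustrative, not a standard SLA schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: parameters cannot be changed after issue
class SLATemplate:
    name: str
    max_downtime_minutes_per_month: int  # taken from the baseline, not negotiated
    committed_bit_rate_mbps: int
    penalty_per_breach_usd: int          # covers risk plus SOC support costs

gold = SLATemplate("Gold", max_downtime_minutes_per_month=45,
                   committed_bit_rate_mbps=100, penalty_per_breach_usd=5000)

# gold.max_downtime_minutes_per_month = 0  # would raise FrozenInstanceError
```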
The analysis work also involves third parties. If we rely on any third parties along our service delivery path, we should baseline their performance as well, and sign the necessary underpinning contracts with them, taking our SLA objectives into account.
SLA Management should be a bottom-up process. A top-down approach is too risky, and most decision makers will not approve your project that way. A well-planned SLA Management process can bring additional revenue as well as a huge competitive advantage.