Living Inventory Managers

 Inventory Management, Performance Management
Sep 18, 2012

Inventory management systems are a must for most of the root cause analysis and service impact analysis that we rely on.

One of the other primary benefits of a Network Inventory Manager (NIM) is that it gives you a holistic view of your infrastructure. This way, you can use the infrastructure more effectively and reduce your CAPEX. You can pinpoint idle resources and assign the next work package to them rather than procuring new ones. Seeing the processing capacity, the planning department can assign more work to less utilized devices.

The problem that may arise here is the static nature of the NIM. The data you enter manually or import into the system is static. That is to say, the system does not go and fetch data by itself. You define the devices, you define the locations, you define the IP addresses. All static. There's nothing wrong with this, as the NIM lives like this without a problem. A healthy process architecture can keep the NIM data accurate and up-to-date.

But wouldn't it be nice if the NIM became more active? For example, say I need a virtual machine installed on my architecture. I have a look at my devices and see that they are currently at capacity.
So I need to procure a new device? Not necessarily. Looking at performance management, I see that device X is CPU-loaded only around midnight, for a 2-hour period. At other times its utilization is only 1%. This VM can surely be installed on that machine, provided the application on the VM will not use much CPU at that time.

If my NIM could somehow fetch this load data and reflect it in the provisioning process, the admin or the expert system that assigns the VM to a machine could choose to install it on this device rather than starting a new device procurement process.

Of course, this VM example can be extended to virtual routers or TDM resources. The approach will save resources and promote re-usability while reducing CAPEX.

Are current NIM vendors ready for such a change, and are they willing to make it?

Mar 22, 2012

Today, I want to talk about a new trend that seems to have popped up in the SQM/CEM field: Mobile Device Agents.

Mobile device agents are software components that reside on user devices and collect statistics about the quality of the user experience, enabling the operator to act upon service degradation. The operator can also correlate the same data with service quality data to plan future service improvements.

The device agent concept is fairly new to the mobile industry. However, this is not the case for fixed line, where operators have been collecting metrics about the end-to-end service for years. These metrics are collected from CPE (Customer Premises Equipment) devices (mostly routers and L2 switches) that reside on the customer's premises. Ideally, but not necessarily, these devices are also managed by the operator, taking the name Managed CPE. By utilizing data coming from these CPEs, operators are able to measure not only the core network health, but also the access side.

In order to increase user-perceived quality, service providers continuously seek new data sources that will give clues about the customer's service perception. Customer usage data can be collected in several places:

- Probe systems
- DPI systems
- Device Agent systems

Probe and DPI systems can provide statistics such as the most visited URLs and throughput/speed figures that give clues about service usage. Probe systems can additionally provide call drop statistics and catch device configuration errors.

Device agents can do both. But they also provide device-related information such as signal strength or battery status. They can even tell which software, and which versions, are installed on the phone.

If we collect all this data (usage + device + signalling) and correlate it successfully, we can do lots of customer-experience-related analysis with it. We can detect a drop in usage of a specific service from the DPI data and correlate it with mobile phone configuration errors. Dropped calls can be correlated with device battery information to see whether the drop occurred because of a device problem. In cases where the operator has not invested in DPI and probe systems, the device agent system alone can provide all of that data.

But why are device agents not so popular? The first answer is privacy. Most people do not want agents on their phones sending their usage patterns somewhere else. There are not many regulations around this yet, but we should expect to see them soon.

The second answer is more technical. Agents consume processing power and drain the battery quickly. To mitigate this, agents should not always be online, and collected statistics should be uploaded at relatively long intervals (a couple of hours). Such late data cannot be utilized by SQM systems, so it can only be used for after-the-fact correlation and planning purposes.
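The batching trade-off described above can be sketched as follows. The class and its methods are invented for illustration, not a real agent API; a real agent would persist the buffer and piggyback uploads on existing radio activity:

```python
# Minimal sketch of an agent-side stats buffer: measurements are collected
# locally and only uploaded in bulk once a long interval has elapsed, to save
# battery. All names here are illustrative, not from any real agent framework.
import json
import time

class StatsBuffer:
    def __init__(self, upload_interval_s: float):
        self.upload_interval_s = upload_interval_s
        self._buffer = []
        self._last_upload = time.monotonic()

    def record(self, metric: str, value) -> None:
        """Buffer a measurement locally (cheap: no radio activity)."""
        self._buffer.append({"t": time.time(), "metric": metric, "value": value})

    def maybe_upload(self, send) -> bool:
        """Upload the whole batch only when the interval has elapsed."""
        if time.monotonic() - self._last_upload < self.upload_interval_s:
            return False
        send(json.dumps(self._buffer))  # one bulk transfer instead of many
        self._buffer.clear()
        self._last_upload = time.monotonic()
        return True

buf = StatsBuffer(upload_interval_s=0.0)  # 0 for the demo; hours in practice
buf.record("signal_strength_dbm", -85)
print(buf.maybe_upload(send=lambda payload: None))  # True: batch uploaded
```

The interval is exactly the latency that keeps this data out of real-time SQM use.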

Device agents use a push mechanism and upload their statistics to a central server, where further correlation and reporting functions can be executed. However, for the reasons above, they cannot be the real-time data sources that most SQM/CEM systems require.

Application Performance Managers

 Performance Management
Jun 11, 2010

In the performance management area, I have talked about network and device performance management. I should also mention application performance management to complete the picture.

Application performance managers (APMs) track how long an application is available and how well it meets its expected functionality. APMs come in different types: Transactional Monitors, User Monitors, Application Server Monitors and Database Monitors.

Transactional Monitors:

Typically, if you want to monitor an application, you should first decide on the use cases that the monitor will implement. A use case is a set of activities or business processes that the monitor should perform. A use case example could be:

– Open the application URL
– Enter username and password and press submit
– After successful login, click on the report 1 and wait for the report to be displayed.
– Logoff

APMs run these activities (transactions) one by one and record the response times. They also check whether any errors occurred during the process.
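The loop a transactional monitor runs can be sketched like this. The step functions are stand-ins for real HTTP or UI actions; the step names echo the use case above but are otherwise invented:

```python
# Sketch of a transactional monitor: run named steps in order, record each
# step's response time, and stop on the first error (later steps depend on
# earlier ones). The sleep-based steps are placeholders for real actions.
import time

def run_transaction(steps):
    """Run (name, action) steps in order; return (name, seconds, error) per step."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        error = None
        try:
            action()
        except Exception as exc:
            error = str(exc)
        results.append((name, time.perf_counter() - start, error))
        if error:
            break  # a failed login makes the report step meaningless
    return results

steps = [
    ("open login page", lambda: time.sleep(0.01)),
    ("submit credentials", lambda: time.sleep(0.01)),
    ("open report 1", lambda: time.sleep(0.02)),
    ("logoff", lambda: None),
]
for name, seconds, error in run_transaction(steps):
    print(f"{name}: {seconds:.3f}s" + (f" ERROR: {error}" if error else ""))
```

The per-step response times and error flags are exactly the KPIs such a monitor would export.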

User Monitors:

Most APMs simulate user behavior, but some of them also "sniff" real user actions. An agent program installed on the user's machine tracks user actions in the form of KPIs. One of the most important KPIs is the "think time", which represents the end-user's thinking period (between an action's result and the next action).

Application Server Monitors:

APMs are also able to track the performance of application servers such as WebSphere, Tomcat, etc. These APMs report the application server's performance and monitor the process stacks to find the most time-consuming method calls. These statistics are extremely important if you are dealing with an in-house application and trying to pinpoint performance degradation points.

Database Monitors:

Database-specific APMs monitor the well-known databases (Oracle, DB2, MS SQL Server, etc.) and their performance.

APM statistics should be correlated with server (OS-level) statistics. An end-to-end view will also require network statistics and customer experience management statistics (from active probes). At the end of each monitoring run, a set of KPIs is exposed. These KPIs are fed to performance management and SQM systems for further analysis.

Performance Management – Thresholds

 Performance Management
Apr 06, 2010

Thresholds are the way to detect anomalies in performance data. They act as proactive tools that let operators take early action on degradations before they cause a fault and possible service loss.

There are two types of thresholds:

1) Static or burst thresholds
2) Dynamic or baseline thresholds

The static type triggers an action whenever the specified value is crossed. Once violated, these thresholds do not trigger the same action for each new data point above the threshold. (Most products count the occurrences until the value returns to normal limits.)

The second type takes baseline (historical) information into account. These thresholds look at the baseline data to detect variations from it. For example, suppose a router's interface utilization is generally around 1% on Sunday mornings. When it becomes 10% on any Sunday morning, this should be considered a variation from the baseline.
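The difference between the two types can be sketched with the Sunday-morning example. The sample data and the 5x deviation factor are invented; real products use richer baselines (per hour-of-week, with standard deviations):

```python
# Sketch of static vs. baseline thresholds. The baseline check flags values
# that deviate strongly from the historical norm even when they are far below
# any static limit. Sample data and the factor are illustrative only.
def static_violation(value: float, limit: float) -> bool:
    """Static/burst threshold: fire whenever the fixed limit is crossed."""
    return value > limit

def baseline_violation(value: float, baseline: list[float],
                       factor: float = 5.0) -> bool:
    """Baseline threshold: fire when the value is far above the historical average."""
    avg = sum(baseline) / len(baseline)
    return value > avg * factor

sunday_mornings = [1.0, 1.2, 0.9, 1.1]  # past utilization, in percent
print(static_violation(10.0, limit=80.0))         # False: 10% is well under 80%
print(baseline_violation(10.0, sunday_mornings))  # True: ~10x the usual ~1%
```

This is why baseline thresholds catch anomalies that a static limit would silently pass.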

We attach actions to thresholds, and normally these are SNMP traps. The trap has to carry attributes such as severity, additional text, managed object, probable cause, etc.
Thresholds should also define a clear value, which initiates a clear trap. The clear trap causes the alarm to be cleared on the FM side. (In order for a system to send an SNMP trap, it has to comply with a MIB. Most of the time, this MIB is enterprise-specific. If you want to integrate your FM with your PM over SNMP traps, you will need the FM vendor's MIB.)

Thresholds can be applied to raw data or aggregated data. Some products do not support the aggregated kind, so you should check this important detail with your vendor.

Another important detail is the number of thresholds you deploy on your PM. Deploying thousands of thresholds will eventually impact the performance of your PM. You should check with your PM vendor to understand the impact of thresholds on the performance of the product. If it has limits (it should), you had better group and classify your resources to reduce the number of thresholds you need. In some cases you may even want to rely on RMON thresholds (thresholds applied to and managed by the resources themselves).

Setting the right thresholds is a practice in itself. It is organization-specific and requires domain knowledge. If we don't have PM baseline data (we may be delivering FM and PM in the same project), we should apply best-practice thresholds and then fine-tune them to the organization by trial and error.

Mar 26, 2010

Performance management is about polling data, aggregating it, running thresholds on it and reporting on performance parameters. In this article, I will concentrate on the polling and data retention side of it.

Performance managers deal with lots of data from many resources. This massive amount of data directly impacts disk space requirements. Since disk space is not limitless, we should tune some parameters to limit it based on the customer's requirements.

One of the most important parameters to ask the customer about is the retention period of the data. This dictates how long the system waits before it purges the data. I have used several different retention periods in the projects I have been involved in. This parameter depends highly on customer requirements: some customers may want to see daily KPIs for a year, while others may require only a month.

The second important parameter is the polling period. We always tend toward shorter polling periods, but the polling period must be chosen carefully, as both long and short periods can lead to problems. Here are some examples:

Suppose you are polling (via SNMP GET) a device interface to get the Inbound Octets KPI. The SNMP object you are polling is a 32-bit counter. To get the octets passed, you subtract the previous poll's counter value from the current one. This is fine. The problem arises when you "wait" too long. If your polling period is 15 minutes, for example, and this is a highly utilized interface, the counter wraps back to zero once it reaches 2^32. In some cases it even wraps multiple times within those 15 minutes. The result is wrong, misleading information on the reports. There are some formulas I have been using to cope with this situation, but solving it via formulas is not the best way, as they are not reliable at all times. The best way to deal with this particular case is to use 64-bit counters (if available in the MIB) or to reduce the polling period.
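The usual delta formula is sketched below. It assumes at most one wrap between polls, which is exactly what a long polling period on a busy interface cannot guarantee; when two or more wraps occur, the result is silently wrong:

```python
# Delta calculation for a 32-bit SNMP counter (e.g. ifInOctets), assuming at
# most one wrap between polls. Multiple wraps within one period are
# undetectable from the counter values alone.
COUNTER32_MAX = 2**32

def octets_delta(previous: int, current: int) -> int:
    """Octets transferred between two polls of a 32-bit counter."""
    if current >= previous:
        return current - previous
    # Counter wrapped (once): count what was left before the wrap plus what
    # accumulated after it.
    return (COUNTER32_MAX - previous) + current

print(octets_delta(1_000, 5_000))               # normal case: 4000
print(octets_delta(COUNTER32_MAX - 100, 400))   # wrapped once: 500
```

With the 64-bit `ifHCInOctets` counter the same formula applies, but wraps are practically impossible at realistic line rates.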

Another example comes from the SQM domain. Suppose you are polling every 5 minutes and forwarding the KPIs to an SQM system. The SQM system collects the data and runs thresholds on it to do service impact analysis. On the first poll, the SQM finds that the received data violates the threshold limits. The system then marks the service status as down and starts calculating the downtime. To detect the next service status, it has to wait at least 5 minutes. This causes service downtimes to appear as multiples of 5 in the reports: 5, 10, 15... minutes. If you commit to 99.9999% availability for a customer, this granularity is simply not enough. What should you do? Reduce the polling period again.
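A quick back-of-the-envelope calculation (my own arithmetic, not from the original post) shows why 5-minute granularity cannot support such a commitment:

```python
# How much downtime does a 99.9999% yearly availability commitment allow,
# versus the smallest outage a 5-minute polling period can report?
seconds_per_year = 365 * 24 * 3600          # 31,536,000 s
allowed_downtime = seconds_per_year * (1 - 0.999999)
print(round(allowed_downtime, 1))            # ~31.5 seconds per year
print(5 * 60)                                # smallest reportable outage: 300 s
```

A single threshold violation already books ten times the yearly downtime budget, so the report can never distinguish a 30-second blip from a 5-minute outage.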

Reducing the polling period is not straightforward. When you reduce it, you must ensure that all the polls and their processing finish within that period. Suppose you have a poller which polls 1,000 resources every hour. If we assume that each poll takes 2 seconds (including processing delays, propagation delays, etc.), this makes 2,000 seconds. An hour has 3,600 seconds, so there is no problem here. But this is the sunny-day scenario. What happens if 10% of my resources cannot respond in 2 seconds, or are not available at all? Obviously there is a risk of consuming the 3,600 seconds before finishing the polling. What can we do? Well, if we cannot increase the polling period, we can increase the poller count. Instead of using 1 poller, we can use 5 pollers and poll in parallel. Additional pollers bring additional license costs and introduce additional machines to the infrastructure.
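The sizing logic above can be captured in a small helper. The function, its parameters, and the 30-second timeout are illustrative assumptions, not from any product:

```python
# Back-of-the-envelope poller sizing: how many parallel pollers are needed so
# that all (sequential) polls finish within the polling period? The timeout
# parameters model resources that respond slowly or not at all.
import math

def pollers_needed(resources: int, seconds_per_poll: float,
                   period_seconds: float, timeout_fraction: float = 0.0,
                   timeout_seconds: float = 0.0) -> int:
    """Minimum number of parallel pollers to finish within the period."""
    slow = int(resources * timeout_fraction)
    total = (resources - slow) * seconds_per_poll + slow * timeout_seconds
    return math.ceil(total / period_seconds)

# 1,000 resources, 2 s each, hourly period: one poller is enough.
print(pollers_needed(1000, 2, 3600))  # 1
# If 10% of resources hang until a 30 s timeout, one poller no longer suffices.
print(pollers_needed(1000, 2, 3600, timeout_fraction=0.10,
                     timeout_seconds=30))  # 2
```

Timeouts dominate the budget quickly, which is why unreachable resources, not the happy path, should drive poller sizing.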

Polling periods and retention periods heavily impact the disk size requirements for the performance management solutions. They should be studied carefully before the roll out of any PM implementation.