Importance of Parallel and Local Measurements in Web Monitoring

Apr 17, 2015

Every company that relies on web business should invest in a web monitoring platform. Web monitoring platforms connect to a web site and measure KPIs such as DNS resolution time, page download time and page consistency (checking a specific header or content value). These synthetic transactions, which are run by probes, help to identify server reachability.

The term “reachability” is important here, as it is not the same as “availability”. There may be cases where your web application and its dependent infrastructure (web server, DB server, application server etc.) seem to be running smoothly from your side but not from the customer’s. This is usually due to routing problems on the network or problems on the remote DNS server.

It is important to know about these downtime scenarios when supporting your customers. In some situations you may even take corrective action, such as guiding users to change their DNS server settings or opening a ticket with the remote ISP for investigation.

There are 2 important selection criteria to consider when investing in a web monitoring service.

First, the service should have local probes. If your business resides in Istanbul/Turkey but your probe resides in Philadelphia/US, the response times or availability calculations may not reflect what your users experience. Suppose the country has a problem reaching the international Internet. Your remote probe will notify you about a downtime, yet most of your local users will still be able to reach you.

Second, the service should take parallel measurements. This covers load balancer scenarios. Load balancers typically work in a round robin fashion to distribute the load across a web server farm. So, if you measure only once and the web server that is currently next in the queue does not have any problems, you will measure the service as “up”.

However, the next server in the pool may suffer from performance problems or even downtime. If you make at least 3 measurements at roughly the same time, you are much more likely to catch an individual server problem within a pool. This is a very important feature to look for when deciding on a web monitoring tool.
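The parallel measurement idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`http_probe`, `parallel_check`), the 3-sample default and the "up/degraded/down" labels are hypothetical and do not come from any particular monitoring product. Whether simultaneous requests really land on different pool members also depends on how the load balancer is configured (sticky sessions, for example, would defeat this).

```python
from concurrent.futures import ThreadPoolExecutor
import time
import urllib.request


def http_probe(url, timeout=5.0):
    """Fetch the URL once; return (ok, seconds). Any error counts as a failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 400
    except Exception:
        ok = False
    return ok, time.monotonic() - start


def parallel_check(url, probe=http_probe, samples=3):
    """Run `samples` probes at roughly the same time and summarize the results."""
    with ThreadPoolExecutor(max_workers=samples) as pool:
        results = list(pool.map(probe, [url] * samples))
    failures = sum(1 for ok, _ in results if not ok)
    if failures == 0:
        status = "up"
    elif failures < samples:
        status = "degraded"  # at least one pool member answered, at least one did not
    else:
        status = "down"
    return {
        "samples": samples,
        "failures": failures,
        "slowest": max(seconds for _, seconds in results),
        "status": status,
    }
```

A single measurement would report either "up" or "down" depending on which server happened to answer; the "degraded" state only becomes visible when several measurements are compared.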

Local and parallel measurements will help you identify web server problems and troubleshoot them more quickly.

Raw Data Sizing

Jan 01, 2014

I have been asked multiple times how much disk space a performance management (PM) system will take for a given set of business requirements.

Well, this depends first on the implementation of the data schema. One system might persist just the name-value pairs in the raw data file, while another introduces extra columns such as last poll time, unit etc. and asks for more space.

Most PM systems maintain the raw records for a period of time in order to summarize data. If, for example, the first summarization is at the hourly level, 1 hour of raw data must be maintained. After the summarization, the raw data can be purged. Each higher-level summarization then uses the summarization one step below it.
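As a rough illustration of the first summarization step, the sketch below rolls raw samples up to the hourly level. The sample layout `(timestamp, kpi_id, value)` and the choice of min/avg/max as aggregates are assumptions made for the example; real PM systems define their own schema and aggregation functions.

```python
from collections import defaultdict
from datetime import datetime


def summarize_hourly(raw_samples):
    """Group (timestamp, kpi_id, value) samples by hour and KPI, then aggregate."""
    buckets = defaultdict(list)
    for timestamp, kpi_id, value in raw_samples:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, kpi_id)].append(value)
    return {
        key: {
            "min": min(values),
            "avg": sum(values) / len(values),
            "max": max(values),
            "count": len(values),
        }
        for key, values in buckets.items()
    }


samples = [
    (datetime(2014, 1, 1, 10, 0), 1, 10.0),
    (datetime(2014, 1, 1, 10, 30), 1, 30.0),
    (datetime(2014, 1, 1, 11, 0), 1, 50.0),
]
hourly = summarize_hourly(samples)
```

Once `hourly` has been stored, the raw samples for those hours are no longer needed for the daily rollup, which can be computed from the hourly records instead.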

Because of late data arrival or manual data insertion, PM implementations maintain the raw data for at least one day, since the raw data is needed to recalculate the summarizations. Regulatory requirements may also dictate the retention period of the raw data.

But how much disk space will be occupied if we retain 1 day of raw data? The math is simple and I will try to explain it below. However, you have to take some side factors into account. For example, the solution can utilize file compression, which can reduce the required size down to 20%.

A very rough sizing excluding the compression factor is below:

The requirements of the customer are: 100 devices on the network. Each device has at least 3 interfaces. 1 month of raw data retention.

So let’s begin:

1 KPI must occupy at least 2 columns in the database/flat file, excluding metadata:
KPI Id Column: Integer: 4 bytes
KPI Value Column: Double: 8 bytes
Total: 12 bytes per KPI

Device Based KPI Count: 10 (CPU Utilization, Memory Utilization, Uptime etc.)
Interface Count: 3
Interface Based KPI Count: 10 (Throughput, Utilization, Speed, Packet Loss, Delay, Queue etc.)
Device Count: 100

10 + 3*10 = 40 KPIs per device.

40 KPIs (12 bytes each) take 480 bytes per device poll.
For 100 devices: 48000 bytes for the whole network poll = 48 KByte.

Assuming one poll per minute (60 * 24 = 1440 polls per day):
48 KByte * 60 * 24 = 69120 KByte =~ 70 MByte per day =~ 2.1 GByte per month (raw data)
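The same arithmetic can be wrapped in a small Python function so that the inputs are easy to change. The function and parameter names are hypothetical; the defaults (12 bytes per KPI, one poll per minute) simply mirror the figures used above.

```python
def raw_data_bytes(devices, interfaces_per_device, device_kpis, interface_kpis,
                   bytes_per_kpi=12, polls_per_day=60 * 24, days=1):
    """Minimum raw data size in bytes, excluding metadata and compression."""
    kpis_per_device = device_kpis + interfaces_per_device * interface_kpis
    bytes_per_poll = devices * kpis_per_device * bytes_per_kpi
    return bytes_per_poll * polls_per_day * days


per_day = raw_data_bytes(100, 3, 10, 10)
per_month = raw_data_bytes(100, 3, 10, 10, days=30)
print(per_day)    # 69120000 bytes, roughly 70 MByte
print(per_month)  # 2073600000 bytes, roughly 2.1 GByte
```

Changing `polls_per_day` to 288 (a 5-minute polling interval) or multiplying the result by a compression factor gives a quick feel for how sensitive the sizing is to each input.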

Please note this would be the minimum. The PM system may need much more space in order to maintain the retention mechanism. You should be in close contact with your vendor for the correct sizing. But I advise you to do the quick math yourself and compare the results with the vendor’s before sending the purchase order for the disk arrays.

Justifying Your OSS Investment

Nov 30, 2013

OSS has been seen as a cost center since it was born. That’s why the business has been less motivated to invest in it.
We, as OSS professionals, have struggled to justify the investments in terms of operational excellence, improved quality and increased productivity. These arguments have been used many times, but I assure you they no longer interest the sponsors.

To justify our OSS, we need to convert it to a product. We have to convert OSS functions to revenue generating functions and start selling them either as standalone products or add-ons.

Here are some areas where you can collect revenue from your OSS investment.

– Advanced Reporting Platform (OSS)
– Pay as You Grow Services (BSS)
– Advanced Notification Services (OSS)
– Customer based correlation (business) rules (OSS & BSS)
– Customer SLA Management (OSS & BSS)

Other ideas? Please reply on this topic or join the LinkedIn discussion. Your valuable comments are always appreciated.


SIEM

Nov 30, 2013

SIEM stands for “Security Information and Event Management” and it is a well-known OSS system in the security world. It is not very visible to other domains because it has been used mainly for internal purposes.

Today, I am going to talk about what SIEM is, and elaborate on possible uses of it.

Every system produces logs: servers (VM, hypervisor, physical), routers, switches, access points, firewalls and IPSs. These logs can be huge. To process all of them we would have to talk in big data terms, but this is not today’s topic. So let’s narrow the scope to the access log level.

The access logs on a system (login/logout requests, password change requests, firewall-dropped access attempts) should be collected for security purposes. Who connected where? Who got an exception in the login process? Who is sweeping the IP addresses in the network? Who was dropped on the firewall/IPS? (The “who” portion can be the real identity of the user, via integration with AD/LDAP.)

The SIEM system’s first goal is to store these logs in the most effective way. Since the log data can be high in terms of volume and velocity, the archiving system should be a specialized one and utilize fast disks and/or in-memory databases.

After the log collection, the SIEM’s second goal can be achieved: Log correlation.

In the log correlation phase, the SIEM system correlates the logs from multiple sources and combines them under a single incident. The correlation can be rule based or topology based. For example, the SIEM system can take a connect event and look up the destination IP address in a blacklist database (a database of C&C centers etc.). If there is a match, an alarm is created in the system, which can be forwarded to TT or FM systems. Or it can directly generate an alarm in the case of a login failure on a critical system. These are good candidates for rule based correlation.
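The two rule based cases described here (a blacklisted destination and a login failure on a critical system) could look roughly like the following. Everything in this sketch is hypothetical: the event field names, the IP addresses (taken from documentation ranges), the host name and the severity labels are placeholders, not the format of any real SIEM product.

```python
# Hypothetical lookup data; a real deployment would load these from a threat
# feed and from the inventory.
BLACKLISTED_IPS = {"203.0.113.10", "198.51.100.7"}
CRITICAL_SYSTEMS = {"billing-db"}


def correlate(event):
    """Return an alarm dict if a rule matches the event, otherwise None."""
    if event.get("type") == "connect" and event.get("dst_ip") in BLACKLISTED_IPS:
        return {
            "rule": "blacklisted-destination",
            "severity": "critical",
            "source": event.get("src_ip"),
        }
    if event.get("type") == "login_failure" and event.get("host") in CRITICAL_SYSTEMS:
        return {
            "rule": "critical-login-failure",
            "severity": "major",
            "source": event.get("host"),
        }
    return None


alarm = correlate({"type": "connect", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.10"})
```

Each rule looks at a single event in isolation, which is what makes these cases simple compared to the stateful patterns.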

A topology based example could be: If 3 login failure alarms are received from the same system, then assign a score to this server. If the same system had a firewall block on the same day, adjust the score further. If more than 2 of the servers in a specific server farm have dropped to low scores, then generate an alarm. This is a simple example of a stateful pattern that can be identified by a SIEM.
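The scoring pattern can be sketched as follows. The text does not say how the score is calculated, so this example assumes a health score that starts at 100 and is reduced by penalties; the penalty values, the threshold and the event field names are all hypothetical.

```python
from collections import Counter

# Hypothetical weights and threshold, chosen only for illustration.
LOGIN_FAILURE_PENALTY = 30
FIREWALL_BLOCK_PENALTY = 30
LOW_SCORE_THRESHOLD = 50


def server_scores(events, servers):
    """Compute a per-server health score (100 = healthy) from one day's events."""
    login_failures = Counter(e["host"] for e in events if e["type"] == "login_failure")
    firewall_blocked = {e["host"] for e in events if e["type"] == "firewall_block"}
    scores = {}
    for host in servers:
        score = 100
        if login_failures[host] >= 3:
            score -= LOGIN_FAILURE_PENALTY
            # The firewall block only matters for a server that is already suspicious.
            if host in firewall_blocked:
                score -= FIREWALL_BLOCK_PENALTY
        scores[host] = score
    return scores


def farm_alarm(events, farm):
    """Raise an alarm if more than 2 servers in the farm have a low score."""
    scores = server_scores(events, farm)
    low = [host for host, score in scores.items() if score < LOW_SCORE_THRESHOLD]
    return len(low) > 2
```

Unlike a single-event rule, this needs state: the events of the whole day and the membership of the server farm have to be available at evaluation time.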

(By the way, some operators have no desire to generate security related alarms. The main driver for a SIEM could also be reducing the logs to a human-manageable level for further review by the security staff.)

The possibilities are limitless, but managing and maintaining this mechanism could be a burden for the security department.

I see 2 problems with SIEM investments. The first one is maintenance. Security personnel are generally not familiar with OSS and their integrations. They can provide the rules but will not be able to (or have time to) implement them on the SIEM.
So, they will rely on the OSS department (if any), which will not know anything about security. The miscommunication may lead to problems and underutilization of the investment. A solution to this problem could be outsourcing the implementation to the vendor. However, vendor solutions are generally “empty” in terms of pre-defined rules, so each rule has to be built from scratch. The cost of the implementation could grow dramatically.

The second problem is overlapping functions. For example, suppose your goal is to be notified when a system tries to connect to a well-known bot center. This requirement can be met by a SIEM but also by other, “cheaper” security components. Or if you have an SQM, why not consider using it if your topology based correlation requirements are modest?

When investing in a SIEM you should evaluate whether you will be able to fully utilize the system, as this OSS component is generally not a cheap one.

Auto Discovery and Reconciliation

Sep 23, 2013

We have inventory databases (either NIM or CMDB) that we use to manage our network operations. We open up tickets based on this data, we plan our strategies based on this data, we make procurements based on this data.

However, if this data is wrong, we will be losing money or quality of service. There will be customers who are not given the correct service, there will be service-consuming subscribers who are no longer in the billing system, there will be wrongly configured services or orphan devices that were deployed and forgotten.

The aim of Auto Discovery and Reconciliation tools (which are generally sold as an add-on to the NIM or CMDB) is to collect the real world information and correct any discrepancy between what we see and what we know.

These tools aim to discover network and service topology from the operational devices. The discovery process starts with the input of destination networks (or IPs), and the tools then discover the devices via different types of protocols (SNMP, CORBA, SOAP etc.). The tools employ “adapters” for different types of devices and protocols. (Here the customers should choose the vendor wisely, as some vendors are strong in discovering IT assets while others concentrate more on network assets.)

After the discovery, the data is generally converted into a generalized model. The model could be SID based or proprietary.

A second set of adapters collects data from the master database. This is the data source where our authorized, real world, operational assets are recorded. This data should also be converted to the tool’s model, and then the reconciliation process can start. (This central standardized model employed by the tools is not a must, but it is the preferred way, as the data models that are subject to reconciliation may not be the same.)

The discovery and reconciliation is triggered either manually or by a scheduled process. The schedule period may vary from minutes to days, depending heavily on the nature of the data sources. If the data source changes frequently or is tied to revenue, correctness becomes more important and smaller intervals are preferred. For network infrastructure, however, a 1-day interval would be enough.

One important thing to mention is that we do not discover everything about each network element. In the design phase, we typically decide on the element types (routers, switches, firewalls etc.) and their attributes (hostname, interfaces etc.) that will be used in the discovery and reconciliation process. What we discover depends heavily on the contents of our main, master data source.

The results of the reconciliation are the deltas between the real world and the “known” world. If there are any deltas, that means the data source we rely on for our operations is wrong. We should take action to correct any mistakes. Some actions can be triggered automatically by the tool itself. For example, a naming convention failure on a device can be automatically synced by a provisioning tool. However, some actions require manual intervention or a human eye. Examples are unknown devices on the network, unknown interfaces or even unknown customers. In such cases a work order task should be automatically assigned to the owner of the element, who reviews it and takes action. The actions that would be taken by this functional group could be:

– Removal of the element from the network.
– Triggering a corresponding change action that would update the main data source.
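The core of the reconciliation step, computing the deltas between the discovered data and the master data source, can be sketched like this. It assumes both sides have already been converted into the same simple model (a dictionary keyed by element ID); the function name, the key choice and the attribute list are hypothetical.

```python
def reconcile(discovered, master, attributes=("hostname", "interfaces")):
    """Compare discovered elements with the master inventory, keyed by element ID."""
    deltas = {
        # Present on the network but not in the inventory.
        "unknown": sorted(discovered.keys() - master.keys()),
        # Present in the inventory but not found on the network.
        "missing": sorted(master.keys() - discovered.keys()),
        # Present on both sides with differing attribute values.
        "mismatched": {},
    }
    for element_id in discovered.keys() & master.keys():
        diffs = {
            attr: {
                "discovered": discovered[element_id].get(attr),
                "master": master[element_id].get(attr),
            }
            for attr in attributes
            if discovered[element_id].get(attr) != master[element_id].get(attr)
        }
        if diffs:
            deltas["mismatched"][element_id] = diffs
    return deltas


discovered = {
    "10.0.0.1": {"hostname": "core-rtr-1", "interfaces": ["ge-0/0/0"]},
    "10.0.0.9": {"hostname": "unlabeled", "interfaces": []},
}
master = {
    "10.0.0.1": {"hostname": "CORE-RTR-01", "interfaces": ["ge-0/0/0"]},
    "10.0.0.2": {"hostname": "CORE-RTR-02", "interfaces": ["ge-0/0/0"]},
}
deltas = reconcile(discovered, master)
```

In this example the hostname mismatch on `10.0.0.1` is the kind of delta that could be fixed automatically, while the unknown element `10.0.0.9` would typically become a work order for manual review.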

To repeat, the ultimate goal is to keep the inventory up to date. For an organization whose processes work perfectly, the tool should never find any discrepancy. But for non-mature organizations the tool would be a pain in the neck and place a burden on operations. That’s why I always recommend that my customers concentrate on the processes first, then the data, as the data is manipulated by the processes.