Importance of Data Profiling in Synthetic Monitoring

Apr 09, 2017

Monitoring web services and sites with synthetic monitoring systems is a common practice at most companies. Cloud-based monitoring platforms give us the ability to watch our services from the Internet, through the eyes of our customers.

However, the alarms emitted by these platforms are, most of the time, limited to outright faults. They tell us our web service is not reachable at all, which is the worst text message a system admin can receive at 3 AM.

Can these systems be used for proactive monitoring? Yes, and this article is about this practice.

Depending on your cloud-based monitoring provider's capabilities, you can apply several proactive monitoring practices. Before that, though, we need to elaborate on the only KPI we have for an Internet-initiated synthetic service check: response time.

Response time is affected by several factors: network delays, DNS resolution delays and service (HTTP/FTP, etc.) delays.

These delays can be measured by most cloud-based monitoring platforms and can provide early insight into an upcoming outage.

How these KPIs can be used for proactive monitoring, though, depends on the capabilities of your monitoring platform.

Monitoring platforms will usually allow you to set threshold values for the response time types mentioned above. Whenever a response time exceeds its threshold, these systems emit alarms on several channels to alert the system administrators. Response times rise on several occasions, such as the ones below (a minimal threshold-check sketch follows the list):

· DNS issues (less likely but can happen)

· Network congestion/saturation

· HTTP Server issues (too many connections, load balancer problems, backend slowness such as RDBMS or external service provider calls)
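
To make the idea concrete, here is a minimal sketch of such a threshold check in Python; the component names and millisecond values are illustrative, not figures from any particular platform.

```python
# Minimal sketch of a per-component response time check. The threshold values
# (in milliseconds) are illustrative; real values come from your own profiling.
THRESHOLDS_MS = {"dns": 200, "connect": 300, "total": 1500}

def breached(sample):
    """sample: dict of measured times, e.g. {"dns": 40, "connect": 120, "total": 900}."""
    return {k: v for k, v in sample.items()
            if k in THRESHOLDS_MS and v > THRESHOLDS_MS[k]}

# Any non-empty result would be turned into an alarm on the configured channels.
print(breached({"dns": 450, "connect": 120, "total": 900}))  # -> {'dns': 450}
```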

But how will we know the threshold values? We need to do some investigation for that. The easiest way is to look at daily/weekly/monthly reports and figure out our traffic profile ourselves. With simple calculations, we can come up with approximate response time numbers.

The problem with manual profiling is its tendency to lead to errors, especially if your monitoring points are on the Internet. The Internet is a very dynamic environment and you have no control over it. This dynamic nature results in fluctuating response times, which makes the thresholds hard to predict. The other factor to consider is our own business: we have peak times and maintenance periods, during which our response times may also rise, and this is expected behavior.

How can we set the correct thresholds in this very dynamic environment? There are certain steps to increase the predictability:

1-) Get Closer

If your service resides in Europe and your customers are in Europe, there is no point in monitoring it from Hong Kong. Getting closer to your target means fewer routers along the way and less fluctuation in the monitored response times.

2-) Create an Automated Baseline

Creating a baseline should be automatic so that it reflects your business. Say you take backups at 2 AM every Monday, or your traffic goes sky high every Friday between 5 PM and 8 PM. You are prepared for these events, but if your monitoring system is not, you will receive false-positive alarms indicating "slowness" in your services. Your monitoring provider should offer automatic profiling based on hours and/or weekdays, and the profiling should be done at least on an hourly basis. So, if I select the weekly profiling option, the system should calculate the expected duration with an algorithm (mean, median, etc.) for each weekday-hour tuple. This way, the system will "capture" our weekly backup event and will not emit any alarms for it.
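
A minimal sketch of such weekday-hour profiling, using only the Python standard library, could look like the following; the 1.5x alarm margin is an assumption for illustration.

```python
# Minimal weekday-hour baselining sketch. Samples are (datetime, response_ms)
# pairs; the 1.5x alarm margin is an illustrative assumption.
from collections import defaultdict
from statistics import median

def build_baseline(samples):
    """Group measurements by (weekday, hour) and keep one expected value per bucket."""
    buckets = defaultdict(list)
    for ts, rt in samples:
        buckets[(ts.weekday(), ts.hour)].append(rt)
    # Median resists outliers better than the mean.
    return {key: median(values) for key, values in buckets.items()}

def is_slow(ts, response_ms, baseline, margin=1.5):
    """Alarm only when the measurement exceeds the learned profile by the margin."""
    expected = baseline.get((ts.weekday(), ts.hour))
    return expected is not None and response_ms > expected * margin
```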

The following screenshot was taken from a monitoring platform, PingTurk. This platform localizes monitoring to the city level to increase the effectiveness of synthetic monitoring. The table shows the calculated baseline response times for day-hour tuples per city in Turkey.

This example follows the two steps mentioned above (getting closer and baselining automatically) and will reduce the number of false-positive alarms during operation.

Effectiveness can be improved further by introducing a local monitoring point inside the customer infrastructure (say, in the DMZ). This way, we can tell whether the slowness is caused by the Internet or by our own infrastructure.
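
As a rough illustration, a check that combines an external probe with a DMZ probe could be as simple as the following sketch; the millisecond limit is an arbitrary example value.

```python
# Minimal sketch of combining an external (Internet) probe with a local (DMZ)
# probe to decide where the slowness lives. The limit is an illustrative value.
def localize_slowness(external_ms, internal_ms, internal_limit_ms=200):
    """Compare the same check as seen from the Internet and from the DMZ."""
    if internal_ms > internal_limit_ms:
        return "own infrastructure"   # slow even from inside the DMZ
    if external_ms - internal_ms > internal_limit_ms:
        return "Internet path"        # fine locally, slow from outside
    return "ok"

print(localize_slowness(external_ms=900, internal_ms=80))   # -> Internet path
print(localize_slowness(external_ms=900, internal_ms=700))  # -> own infrastructure
```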

Aug 30, 2016

In OSS, we often use polling to pull statistics and configuration data from devices. If the devices we deal with implement pull-based protocols such as SNMP or FTP, we cannot get rid of it.

All polling processes come with a polling period. If I have 100 routers and a polling period of 5 minutes, every 5 minutes I have to connect to each device and pull the necessary KPIs to be injected into my DataMart.

If you look at the CPU and memory utilization of a performance management server (poller) during the process, you will see high peaks at the start of each polling period. Following the 5-minute example above, the peaks appear at minutes 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 and 55. With a 5-minute polling period, you have 5 minutes to finish the job; if it takes longer, you run into data consistency issues. As the node and KPI counts increase, you have to throw more hardware at the problem to finish in time. (For each device connection, we will most probably want to open a separate thread, until we hit the point of diminishing returns.)

Since the collection process does not occupy the whole 5-minute period, the server spends the remaining time idle. And because the hardware configuration was sized for the peaks, our server remains "expensive".

Assigning a specific polling time to each node is the key to this problem. In this approach, we divide the polling period into sub-periods. So, if the polling period is 5 minutes, we can divide it like this:

10 nodes at the 0th second of the first minute, 10 nodes at the 30th second of the first minute, 10 nodes at the 0th second of the second minute, 10 nodes at the 30th second of the second minute, and so on.

Here we put 10 nodes into each 30-second timeframe, so that polling of all 100 nodes finishes within 5 minutes.

We also need to consider the speed of these nodes. Some nodes suffer from performance problems due to weak hardware or high load, and their response times may exceed the 30-second timeframe.

To cope with this, we should consider putting the slowest-responding nodes into the earliest sub-frames. This way, a node's polling can "spill over" into the next sub-frame and still finish within the given 5 minutes. This, of course, requires you to maintain a continuous baseline of node response times on the server side.
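
A minimal scheduling sketch along these lines, with made-up node names and baselines, could look like this:

```python
# Minimal sketch of spreading nodes across sub-periods of a 5-minute polling
# cycle, slowest nodes first. Node names and response times are illustrative.
POLL_PERIOD_S = 300
SUBFRAME_S = 30
NODES_PER_SUBFRAME = 10

def build_schedule(node_response_times):
    """node_response_times: dict of node -> baseline response time in seconds."""
    # Slowest first, so a long poll can spill into the next sub-frame and
    # still finish inside the 5-minute period.
    ordered = sorted(node_response_times, key=node_response_times.get, reverse=True)
    schedule = {}
    for i, node in enumerate(ordered):
        offset = (i // NODES_PER_SUBFRAME) * SUBFRAME_S
        schedule[node] = offset % POLL_PERIOD_S  # start offset within the cycle
    return schedule

# Example: 100 nodes named n0..n99 with made-up baselines.
nodes = {f"n{i}": (i % 7) + 1 for i in range(100)}
print(build_schedule(nodes)["n6"])  # the slowest nodes get the earliest offsets
```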

Splitting the polling period and distributing the nodes wisely across the sub-periods will help you reduce your hardware costs.

Aug 15, 2016

Today's topic is network sweeping and how it can be optimized. As you may know from previous posts, sweeping means searching a subnet by attempting to connect to each and every possible IP address in it. Usually, the initial protocol is ICMP due to its low overhead (in that case, the sweep is called a ping sweep). SNMP and even HTTP interfaces are also used as sweep protocols.

Sweeping is used in different domains, such as;

  • Security
  • Inventory Management
  • Performance Management
  • Configuration Management

Sweeping can be time- and resource-consuming (on both the sender and the receiver side). That is why, for most enterprise customers, it is normally done daily.

For large networks, it may take hours to complete a sweep. Consider the scenario of sweeping a class C subnet (254 usable IP addresses) in which only 10 devices exist, using ICMP for discovery. That is a simple ping request, and I need to send at least 2 ICMP packets to be sure a device is there (50% packet loss still means the remote side is up).

For the reachable devices, the round-trip ping time should not exceed 5 ms. With 2 ICMP packets, that is 10 ms per check. For 10 devices, it would take around 100 ms, well below 1 second. That is great performance if you only consider pinging the "up" devices. But what about the remaining 244 down ones?

The ICMP timeout kicks in when dealing with dead devices or vacant IP addresses. It is the duration the ping software will wait for an ICMP echo reply to arrive; if the packet does not arrive within that period, the address is reported as "down". The default ICMP timeout in Cisco routers is 2 seconds, so with the defaults and 2 packets per test, you wait 4 seconds per vacant address. Doing the math, the total wait time for the 244 vacant addresses in the class C subnet at hand would be 976 seconds, roughly 16 minutes. Organizations that rely on sweeping normally have much bigger subnets with thousands of possible IP addresses, and the sweeping process can take hours in such networks.

Luckily, we can tweak this process so it will take less time.

1: Use of Parallel Measurements:

This is the first thing we need to do: open multiple ICMP threads at the same time. How about opening 1,000 threads? The sweep would finish in 4 seconds. Isn't that great? Not really; it has some consequences.

  • Increased LAN traffic: Sending 1,000 ICMP packets in the same second generates a lot of traffic on your LAN/WAN (around 70 bytes per packet * 1,000 threads = 70,000 bytes/sec = 560,000 bits/sec = 560 Kbps of one-way traffic). Considering there would be replies to these requests, the total bandwidth consumption can easily reach 1 Mbps.
  • CPU cycles: Each thread consumes CPU and memory resources, and the source machine must be able to cope with this.

And this is just the sweeping part. In real-world scenarios, no inventory or security tool stops once it has discovered a live IP address; it goes ahead and tries to fetch more information. So both of these costs can balloon if you open too many threads.
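
For illustration, a bounded-thread sweep could look like the sketch below; it shells out to a Linux-style ping binary, and the subnet and worker count are example values.

```python
# Minimal ping-sweep sketch with a bounded thread pool (assumes a Linux-style
# `ping` binary; the subnet and worker count are illustrative values).
import ipaddress
import subprocess
from concurrent.futures import ThreadPoolExecutor

def is_alive(ip, timeout_s=1):
    """Send two echo requests; treat the address as up if ping succeeds."""
    result = subprocess.run(
        ["ping", "-c", "2", "-W", str(timeout_s), str(ip)],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def sweep(subnet="192.0.2.0/24", workers=50):
    hosts = list(ipaddress.ip_network(subnet).hosts())
    # A bounded pool (50 here) caps LAN traffic and CPU usage instead of
    # launching one thread per address.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(is_alive, hosts)
    return [str(ip) for ip, up in zip(hosts, results) if up]

if __name__ == "__main__":
    print(sweep())
```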

2: Optimize your ICMP Packet Timeout

I mentioned that the default ICMP timeout is 2 seconds. Luckily, this is configurable. Go ahead and send some pings to the destination IP addresses and, for the "live" ones, capture the round-trip time. This is the network delay (plus the processing delay of the remote NIC). That delay does not change much on LAN links and may change slightly on WAN links. Baseline it: if it is 100 ms, you can safely set a timeout of 300 ms. That is three times the baseline but still well below the 2-second default.
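
As a tiny sketch of that rule of thumb (baseline times a safety factor), with made-up RTT samples:

```python
# Minimal sketch of deriving an ICMP timeout from baselined round-trip times.
# The 3x safety factor and the sample RTTs are illustrative.
from statistics import median

def derive_timeout_ms(rtt_samples_ms, factor=3):
    """Three times the typical round trip stays well below the 2-second default."""
    return median(rtt_samples_ms) * factor

print(derive_timeout_ms([80, 90, 100, 95, 110, 105, 85, 120]))  # -> 292.5 ms
```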

Keep in mind that ICMP is one of the lowest-overhead protocols. Layer 7 protocols like SNMP and HTTP have much more overhead, so the suggestions above may bring even greater value there.

Long sweep times can also result in inconsistencies between sweep periods. Suppose you started sweeping 10.1.1.0/24 and found that 10.1.1.1 was vacant. You continue your sweep, and 10 seconds later 10.1.1.1 comes up. If you sweep once a day, your inventory (and other dependent OSS systems) will not know about this device until the next day (assuming you don't have a change process in place for it). That is why there should be a mechanism to listen for new IP address activity during the sweep window. DHCP logs are a good option for networks that use DHCP for IP addressing; a costlier solution is listening to Syslog events or switch SPAN ports.

Importance of Parallel and Local Measurements in Web Monitoring

Apr 17, 2015

Every company that relies on web business should invest in web monitoring platforms. These platforms connect to a web site and measure KPIs such as DNS resolution time, page download time and page consistency (checking a specific header or content value). These synthetic transactions, run by the probes, help identify server reachability.

The term "reachability" is important here, as it is not the same as "availability". There are cases where your web application and its dependent infrastructure (web server, DB server, application server, etc.) seem to be running smoothly from your side but not from the customer's. This is usually due to routing problems on the network or problems with the remote DNS server.

It is important to know about these downtime scenarios when supporting your customers. In some situations you can even take corrective action, such as guiding users to change their DNS server settings or opening a ticket with the remote ISP for investigation.

There are two important selection criteria to consider when investing in a web monitoring service.

First, the service should have local probes. If your business resides in Istanbul, Turkey but your probe resides in Philadelphia, US, the response times and availability calculations may not reflect the truth. Suppose the country has a problem reaching the Internet: your probe will notify you of a downtime, yet most of your local users will still be able to reach you.

Second, the service should do parallel measurements. This covers load balancer scenarios. Load balancers typically work in a round-robin fashion to distribute load across a web server farm, so if you measure only once and the web server currently at the head of the rotation has no problems, you will report the service as "up".

However, the next server in the pool may suffer from performance problems or even downtime. If you make at least three measurements at roughly the same time, you can catch an individual server problem within the pool. This is a very important feature to look for when deciding on a web monitoring tool.
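
To illustrate, a sketch of firing several probes at the same URL at roughly the same moment might look like this; the URL and probe count are example values.

```python
# Minimal sketch of parallel web checks against the same URL, intended to hit
# different servers behind a round-robin load balancer. The URL and the number
# of probes are illustrative.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def check(url, timeout_s=5):
    """Return (status, elapsed_ms) for one synthetic request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except Exception:
        status = None  # treat network/timeout errors as a failed check
    return status, (time.monotonic() - start) * 1000

def parallel_check(url="https://www.example.com/", probes=3):
    with ThreadPoolExecutor(max_workers=probes) as pool:
        # If any of the parallel probes fails or is slow, at least one server
        # in the pool is likely unhealthy even though the others answered fine.
        return list(pool.map(check, [url] * probes))

if __name__ == "__main__":
    for status, ms in parallel_check():
        print(status, round(ms), "ms")
```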

Local and parallel measurements will help you identify web server problems and troubleshoot them more quickly.

Living Inventory Managers

Sep 18, 2012

Inventory management systems are a must for most of the root cause analysis and service impact analysis that we rely on.

Another primary benefit of a network inventory manager (NIM) is that it gives you a holistic view of your infrastructure. This way, you can use the infrastructure more effectively and reduce your CAPEX: you can pinpoint idle resources and assign the next work package to them rather than procuring new ones. Seeing the available processing capacity, the planning department can assign more workloads to less utilized devices.

The problem that may arise here is the static nature of the NIM. The data you enter manually or import into the system is static; that is to say, the system does not go and fetch data by itself. You define the devices, the locations, the IP addresses. All static. There is nothing wrong with this; the NIM lives like this without a problem, and a healthy process architecture can keep the NIM data accurate and up to date.

But wouldn't it be nice if the NIM became more active? For example, say I need a virtual machine installed on my infrastructure. I look at my devices and see that they are currently at capacity. So do I need to procure a new device? Not necessarily. Looking at performance management data, I see that device X is CPU-loaded only at midnight, for a two-hour period; the rest of the time its utilization is only 1%. This VM can certainly be installed on that machine, as long as the application on the VM will not need much CPU at that time.

If my NIM can somehow fetch this load data and feed it into the provisioning process, the admin or the expert system that assigns the VM to a machine can choose to install it on this device rather than starting a new procurement process.
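
As a sketch of what such a placement check might look like, with hard-coded utilization numbers standing in for the data the NIM would pull from performance management:

```python
# Minimal sketch of a "living" NIM placement check. The utilization numbers
# would normally come from the performance management system; here they are
# hard-coded, illustrative values.
HOURLY_CPU = {
    # 24 hourly CPU averages per device (index 0 = 00:00)
    "deviceX": [95, 90] + [1] * 22,   # loaded only around midnight
    "deviceY": [60] * 24,             # steadily loaded all day
}

def pick_host(vm_busy_hours, cpu_limit_pct=50):
    """Return the first device whose CPU stays under the limit in the VM's busy hours."""
    for device, hourly in HOURLY_CPU.items():
        if all(hourly[h] <= cpu_limit_pct for h in vm_busy_hours):
            return device
    return None  # no fit: fall back to the procurement process

# A VM busy 09:00-17:00 fits on deviceX even though deviceX peaks at midnight.
print(pick_host(vm_busy_hours=range(9, 17)))  # -> deviceX
```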

Of course, this VM example can be extended to virtual routers or TDM resources. The approach saves resources and promotes reusability while reducing CAPEX.

Are current NIM vendors ready for such a change, and are they willing to make it?