Monitoring web services and sites via synthetic monitoring systems is a common practice implemented by most companies. Cloud-based monitoring platforms give us the ability to watch our services from the Internet, through the eyes of our customers.
However, the alarms emitted by these platforms are limited, most of the time, to outright faults. They tell us our web service is not reachable at all, which is the worst text message a system administrator can receive at 3 AM.
Can these systems be used for proactive monitoring? Yes, and this article is about that practice.
Depending on your cloud-based monitoring provider's capabilities, you can apply several proactive monitoring practices. But before that, we need to elaborate on the only KPI we have for an Internet-initiated synthetic service check: Response Time.
Response Time is affected by several factors: network delays, DNS resolution delays, and service (HTTP/FTP, etc.) delays. Most cloud-based monitoring platforms can measure these delays separately, and the measurements can provide early insight into an outage before it occurs.
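As an illustration, here is a minimal sketch of timing those components separately from a probe in Python. The target host is hypothetical, and the HTTP step performs its own resolve and connect internally, so the split is approximate:

import socket
import time
import urllib.request

HOST = "www.example.com"   # hypothetical target
URL = f"https://{HOST}/"

t0 = time.perf_counter()
addr = socket.getaddrinfo(HOST, 443)[0][4][0]            # DNS resolution delay
t1 = time.perf_counter()

with socket.create_connection((addr, 443), timeout=5):   # network/TCP connect delay
    pass
t2 = time.perf_counter()

urllib.request.urlopen(URL, timeout=10).read()           # full HTTP exchange (re-resolves and reconnects)
t3 = time.perf_counter()

print(f"DNS:     {(t1 - t0) * 1000:7.1f} ms")
print(f"Connect: {(t2 - t1) * 1000:7.1f} ms")
print(f"HTTP:    {(t3 - t2) * 1000:7.1f} ms")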
How these KPIs can be used for proactive monitoring, though, depends on the capabilities of your monitoring platform.
Monitoring platforms will usually let you set threshold values for the response time types mentioned above. Whenever a response time exceeds its threshold, these systems emit alarms on several channels to alert the system administrators; a minimal sketch of such a check follows the list below. Response times rise on several occasions:
· DNS issues (less likely, but they can happen)
· Network congestion/saturation
· HTTP server issues (too many connections, load balancer problems, backend slowness such as RDBMS or external service provider calls)
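To make the threshold mechanism concrete, here is a minimal sketch of such a check. The URL and the 800 ms threshold are hypothetical, and a real platform would fan the alarm out to e-mail/SMS/chat channels rather than print it:

import time
import urllib.request

URL = "https://www.example.com/"   # hypothetical target
THRESHOLD_MS = 800                 # hypothetical threshold from our traffic profile

def response_time_ms(url: str) -> float:
    # Time a full synthetic HTTP check.
    t0 = time.perf_counter()
    urllib.request.urlopen(url, timeout=10).read()
    return (time.perf_counter() - t0) * 1000

rt = response_time_ms(URL)
if rt > THRESHOLD_MS:
    # Stand-in for the platform's alert channels.
    print(f"ALARM: {URL} answered in {rt:.0f} ms (threshold {THRESHOLD_MS} ms)")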
But how do we determine the threshold values? Well, we need to do some investigation. The easiest way is to look at the daily/weekly/monthly reports and work out our traffic profile ourselves. With some simple calculations, we can come up with approximate response time numbers, as sketched below.
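As one version of those simple calculations, the snippet below derives an approximate threshold from historical response times using a mean-plus-three-standard-deviations rule of thumb; the sample values are made up for illustration:

import statistics

# Response times (ms) pulled from, say, a week of daily reports.
samples = [412, 388, 450, 395, 520, 610, 430, 405, 398, 470]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)

# Rule of thumb: alert when slower than mean + 3 standard deviations.
threshold_ms = mean + 3 * stdev
print(f"mean={mean:.0f} ms, stdev={stdev:.0f} ms, threshold={threshold_ms:.0f} ms")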
The problem with manual profiling is its tendency to lead to errors, especially if your monitoring points are on the Internet. The Internet is a very dynamic environment, and you have no control over it. This dynamic nature results in fluctuating response times, which makes thresholds hard to predict. The other factor to consider is our own business: we have peak times and maintenance periods, during which response times may also rise, and that is expected behavior.
How can we set correct thresholds in such a dynamic environment? There are certain steps that increase predictability:
1-) Get Closer
If your service resides in Europe and your customers are in Europe, there is no point in monitoring it from Hong Kong. Getting closer to your target means fewer routers along the way and smaller fluctuations in the monitored response times.
2-) Create an Automated Baseline
Creating a baseline should be automatic so that it reflects your business. Say you take backups at 2 AM every Monday, or your traffic spikes every Friday between 5 PM and 8 PM. You are prepared for these events, but if your monitoring system is not, you will receive false-positive alarms indicating “slowness” in your services. Your monitoring provider should offer an automatic profiling option based on hours and/or weekdays, and the profiling should be done at least on an hourly basis. So, if I select the weekly profiling option, the system should calculate the expected duration with some algorithm (mean, median, etc.) for that weekday-hour pair. This way, the system will “capture” our weekly backup event and will not emit any alarms for it.
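A minimal sketch of such a weekday-hour baseline follows; the history, timestamps, and the 1.5× anomaly factor are all hypothetical, and the median stands in for whatever algorithm the platform actually uses:

import statistics
from collections import defaultdict
from datetime import datetime

# Hypothetical history: (ISO timestamp, response time in ms).
history = [
    ("2023-05-01T02:00:00", 1450),  # Monday 2 AM backup window: slow is normal
    ("2023-05-01T02:30:00", 1380),
    ("2023-05-01T10:00:00", 420),
    ("2023-05-08T02:10:00", 1502),
    ("2023-05-08T10:05:00", 395),
]

# Group samples into (weekday, hour) buckets.
buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
for ts, rt in history:
    dt = datetime.fromisoformat(ts)
    buckets[(dt.weekday(), dt.hour)].append(rt)

# Median per (weekday, hour) tuple; the backup hour gets its own baseline,
# so the weekly backup no longer counts as "slowness".
baseline = {k: statistics.median(v) for k, v in buckets.items()}

def is_anomalous(ts: str, rt_ms: float, factor: float = 1.5) -> bool:
    dt = datetime.fromisoformat(ts)
    expected = baseline.get((dt.weekday(), dt.hour))
    return expected is not None and rt_ms > expected * factor

print(is_anomalous("2023-05-15T02:05:00", 1400))  # False: normal for the backup hour
print(is_anomalous("2023-05-15T10:00:00", 1400))  # True: far above the 10 AM baseline

The median makes the baseline robust against occasional outliers; a provider could equally use the mean or a high percentile.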
The following screenshot was taken from a monitoring platform, PingTurk, which localizes monitoring down to the city level to increase the effectiveness of synthetic monitoring. The table shows the calculated baseline response times for day-hour tuples per city in Turkey.
This example follows the two steps mentioned above (monitoring from nearby and baselining automatically) and will reduce the number of false-positive alarms during operation.
The effectiveness can be improved even further by introducing a local monitoring point inside the customer infrastructure (say, in the DMZ). This way, we can tell whether slowness is caused by the Internet or by our own infrastructure.
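A minimal sketch of that comparison, assuming both probes report response times collected the same way; all the figures, including the 500 ms network-delay budget, are hypothetical:

# External (Internet) probe vs. local (DMZ) probe; numbers are hypothetical.
external_ms = 950            # measured by the cloud platform's probe
internal_ms = 120            # measured by the probe in the DMZ
baseline_internal_ms = 110   # baseline for the local probe

if internal_ms > baseline_internal_ms * 1.5:
    print("Slowness originates in our own infrastructure.")
elif external_ms - internal_ms > 500:   # hypothetical network-delay budget
    print("Slowness originates on the Internet path.")
else:
    print("Response times are within the expected range.")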