In OSS, we often use polling to pull statistics and configuration data from devices. If the devices we manage only support pull-based protocols such as SNMP or FTP, there is no way around it.
Every polling process comes with a polling period. If I have 100 routers and a polling period of 5 minutes, then every 5 minutes I have to connect to each device and pull the necessary KPIs to inject into my DataMart.
If you look at the CPU and memory utilization of a performance management server (the poller) during this process, you will see high peaks at the start of each polling period. Following the 5-minute example above, the peaks appear at minutes 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 and 55. With a 5-minute polling period, you have 5 minutes to finish the job; if collection runs longer than that, you run into data consistency issues. As the node and KPI counts grow, you have to throw more hardware at the problem to finish in time. (For each device connection, we will most likely want to open a separate thread, up to the point of diminishing returns.)
Since the whole collection process does not occupy the full 5-minute period, the server spends the rest of it waiting. And because the hardware was sized for those peaks, the server stays “expensive”.
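A minimal sketch of the “burst” behaviour described above, in Python. The poll_device() function, the thread count and the 5-minute constant are all illustrative assumptions, not a prescribed implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor

POLL_PERIOD = 300          # 5-minute polling period, in seconds
MAX_WORKERS = 32           # thread count before diminishing returns (assumed)

def poll_device(node):
    """Placeholder: connect to one node (e.g. via SNMP) and pull its KPIs."""
    ...

def run_naive_poller(nodes):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        while True:
            period_start = time.time()
            # All nodes are submitted at once: CPU and network usage peak
            # right here, then the server sits idle until the next boundary.
            futures = [pool.submit(poll_device, n) for n in nodes]
            for f in futures:
                f.result()
            # Sleep away whatever is left of the 5-minute window.
            elapsed = time.time() - period_start
            time.sleep(max(0, POLL_PERIOD - elapsed))
```

The waste is visible in the last two lines: everything the hardware was sized for happens in the first seconds of the loop, and the rest of the period is spent sleeping.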
Assigning a specific polling time to each node is the key to this problem. In this approach, we divide the polling period into sub-periods. So, if the polling period is 5 minutes, we can divide it like this:
10 nodes at second 0 of minute 1, 10 nodes at second 30 of minute 1, 10 nodes at second 0 of minute 2, 10 nodes at second 30 of minute 2, and so on.
Here we put 10 nodes into each 30-second timeframe, so polling of all 100 nodes is completed within 5 minutes.
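A sketch of that staggered schedule, using the example figures from the text (30-second sub-frames, 10 nodes per frame); the function name and data shape are assumptions:

```python
SUB_FRAME = 30             # seconds per sub-frame -> 10 sub-frames in 5 minutes
NODES_PER_FRAME = 10

def build_schedule(nodes):
    """Map each sub-frame offset (0, 30, 60, ...) to its batch of nodes."""
    schedule = {}
    for i in range(0, len(nodes), NODES_PER_FRAME):
        offset = (i // NODES_PER_FRAME) * SUB_FRAME
        schedule[offset] = nodes[i:i + NODES_PER_FRAME]
    return schedule

# Example: nodes "node00" .. "node99" land on offsets 0, 30, 60, ..., 270.
schedule = build_schedule([f"node{i:02d}" for i in range(100)])
```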
We also need to consider the speed of these nodes. Some nodes will suffer performance problems due to weak hardware or high load, and their response times may exceed the 30-second timeframe.
To cope with this, we should place the slowest-responding nodes into the earliest sub-frames. That way, a node’s polling can “extend” into the next sub-frame and still finish within the 5-minute period. This, of course, requires maintaining a continuous baseline of node response times on the server side.
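One possible way to keep that baseline and order the schedule, again as a hedged sketch: a simple per-node moving average of response times, with the slowest nodes assigned first. The smoothing factor and helper names are illustrative, and build_schedule() is the function sketched above:

```python
ALPHA = 0.2                               # weight of the newest sample (assumed)

baseline = {}                             # node -> smoothed response time (s)

def record_response_time(node, seconds):
    """Update the rolling baseline after every poll of the node."""
    prev = baseline.get(node, seconds)
    baseline[node] = ALPHA * seconds + (1 - ALPHA) * prev

def build_schedule_slowest_first(nodes):
    """Put the slowest responders into the earliest sub-frames so an
    overrun can spill into the next sub-frame and still finish in time."""
    ordered = sorted(nodes, key=lambda n: baseline.get(n, 0.0), reverse=True)
    return build_schedule(ordered)
```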
Splitting the polling period and distributing the nodes wisely across the sub-periods will help you reduce your hardware costs.