Aug 30 2016
 

In OSS, we often use polling to pull statistics and configuration data from devices. If the devices we deal with implement pull-based protocols such as SNMP or FTP, we cannot avoid it.

Every polling process comes with a polling period. If I have 100 routers and a polling period of 5 minutes, then every 5 minutes I have to connect to each device and pull the necessary KPIs to be injected into my DataMart.

If you look at the CPU and memory utilization of a performance management server (poller) during this process, you will see high peaks at the start of each polling period. In the 5-minute example above, the peaks fall at minutes 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 and 55. If your polling period is 5 minutes, you have 5 minutes to finish the job. If polling exceeds that period, you will run into data consistency issues. As the node and KPI counts increase, you have to throw more hardware at the problem to finish in time. (For each device connection, we will most probably want to open a separate thread until we hit the point of diminishing returns.)

Since the whole collection process does not occupy the full 5-minute period, the server spends the remaining time idle. Because the hardware was sized for the peak times, our server remains “expensive”.

Assigning a polling time to each node is the key to this problem. In this approach, we divide the polling period into sub-periods. So, if the polling period is 5 minutes, we can divide it like this:

10 nodes at the zeroth second of the first minute, 10 nodes at the thirtieth second of the first minute, 10 nodes at the zeroth second of the second minute, 10 nodes at the thirtieth second of the second minute…

Here we put 10 nodes into each 30-second timeframe, so that polling of all 100 nodes is finished within 5 minutes.
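The split above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `build_schedule`, the router names and the "fill the slots in order" rule are hypothetical, not taken from any particular poller.

```python
import math

def build_schedule(nodes, period_s=300, slot_s=30):
    """Spread nodes evenly over the sub-periods of one polling period.

    Returns a dict that maps each sub-period start offset (in seconds)
    to the list of nodes polled at that offset.
    """
    slot_count = period_s // slot_s
    # Nodes per sub-period, rounded up so every node gets a slot.
    per_slot = max(1, math.ceil(len(nodes) / slot_count))
    schedule = {i * slot_s: [] for i in range(slot_count)}
    for index, node in enumerate(nodes):
        offset = (index // per_slot) * slot_s
        schedule[offset].append(node)
    return schedule

# Hypothetical inventory of 100 routers.
routers = [f"router-{n:03d}" for n in range(1, 101)]
schedule = build_schedule(routers)
for offset in sorted(schedule)[:3]:
    print(offset, len(schedule[offset]))
```

With 100 nodes, a 300-second period and 30-second sub-periods, each of the 10 offsets (0, 30, 60 and so on) receives 10 nodes.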

We also need to consider the speed of these nodes. Some nodes will suffer performance problems due to weak hardware or high load, and their response time may exceed the 30-second timeframe.

To cope with this problem, we should also consider putting the slowest responding nodes into the earliest sub-periods. This way, a node’s polling can “extend” into the next sub-period and still be finalized within the given 5 minutes. This, of course, requires you to maintain a continuous baseline of node response times on the server side.
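As a rough sketch of this idea, the nodes can be sorted by their baseline response time before they are placed into the sub-periods. The baseline dictionary, the node names and the function name below are hypothetical; a real poller would feed the baseline from its own measurements.

```python
def schedule_slowest_first(baseline_s, slot_count=10, slot_s=30):
    """Place the slowest nodes into the earliest sub-periods.

    baseline_s maps node name to its baseline response time in seconds.
    Returns a dict that maps sub-period offset (seconds) to node names.
    """
    ordered = sorted(baseline_s, key=baseline_s.get, reverse=True)
    per_slot = max(1, -(-len(ordered) // slot_count))  # ceiling division
    slots = {}
    for index, node in enumerate(ordered):
        offset = (index // per_slot) * slot_s
        slots.setdefault(offset, []).append(node)
    return slots

# Hypothetical baseline: edge-7 is the slowest node, edge-9 the fastest.
baseline = {"core-1": 2.0, "edge-7": 41.5, "edge-9": 0.8, "agg-3": 12.3}
slots = schedule_slowest_first(baseline, slot_count=2)
print(slots)
```

Here edge-7, whose baseline is longer than one 30-second sub-period, lands in the first sub-period and can spill over into the second one without missing the deadline.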

Splitting the polling period and distributing the nodes wisely across the sub-periods will help you reduce your hardware costs.

Aug 15 2016
 

Today’s topic is network sweeping and how it can be optimized. As you may know from the previous topics, sweeping means searching a subnet by attempting to connect to each and every possible IP address in it. Usually, the initial protocol is ICMP due to its low overhead. (In that case, the sweep is called a ping sweep.) SNMP and even HTTP interfaces are also used as sweep protocols.

Sweeping is used in different domains, such as:

  • Security
  • Inventory Management
  • Performance Management
  • Configuration Management

Sweeping can be time and resource consuming (for both the sender and the receiver side). That’s why most enterprise customers normally run it only once a day.

For large networks, it may take hours to complete a sweeping process. Consider the scenario of sweeping a class C IP subnet. (It has 254 usable host addresses.) Also, suppose that only 10 devices exist in that subnet, and that I use ICMP for discovery. That is the simple ping request, and I need to send at least 2 ICMP packets to be sure whether there is a device at an address. (50% packet loss still means the remote side is up.)

For the reachable devices, the round-trip ping time should not exceed 5ms. Considering we have 2 ICMP packets, it would be 10ms per check. We have 10 devices, so it would take around 100ms, which is well below 1 second. That’s great performance if you only consider pinging the “up” devices. But what about the remaining 244 down ones?

The ICMP timeout kicks in when dealing with dead devices or vacant IP addresses. The ICMP timeout is the duration, in milliseconds, that the ping software will wait for an ICMP echo reply packet to arrive. If the packet does not arrive within that period, the address is reported as “down”. The default ICMP timeout in Cisco routers is 2 seconds. So, with the default 2-second timeout and 2 packets per test, you will have to wait 4 seconds per test. If we do the math, the total wait time for the class C subnet at hand would be 244 * 4 = 976 seconds, roughly 16 minutes. Organizations that rely on sweeping normally have much bigger subnets with thousands of possible IP addresses. The sweeping process would take hours in such networks.
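The arithmetic above fits into a small helper, which makes it easy to try other subnet sizes or timeouts. The function name and parameter names are my own; the numbers are the ones from the example.

```python
def sweep_wait_seconds(total_addresses, live_devices,
                       packets=2, timeout_s=2.0, rtt_s=0.005):
    """Estimate the time spent on live and on dead addresses in a
    sequential sweep (one address after the other)."""
    dead_addresses = total_addresses - live_devices
    live_time = live_devices * packets * rtt_s       # answered pings
    dead_time = dead_addresses * packets * timeout_s  # pings that time out
    return live_time, dead_time

live_time, dead_time = sweep_wait_seconds(254, 10)
print(f"live: {live_time:.1f} s, dead: {dead_time:.0f} s "
      f"(about {dead_time / 60:.0f} minutes)")
```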

Luckily, we can tweak this process so it will take less time.

1: Use of Parallel Measurements:

This is the first thing we need to do: open multiple threads of ICMP operation at the same time. How about opening up 1000 threads? The sweep will be finished in 4 seconds. Isn’t it great? Not really, as it has some consequences.

  • Increased LAN traffic: Sending 1000 ICMP packets in the same second will generate a lot of traffic in your LAN/WAN (around 70 bytes per packet * 1000 threads = 70000 bytes/sec = 560000 bits/sec = 560Kbps of one-way traffic). Considering there would be replies to these requests, the total bandwidth consumption can easily reach 1Mbps.
  • CPU Cycles: Each thread will consume CPU and memory resources. The source machine should be able to cope with this.

This is just the sweeping part of it. In real-world scenarios, no inventory or security tool will stop after it has discovered a live IP address. It will go ahead and try to fetch more information. So both the traffic and the resource consumption can grow quickly if you open up too many threads.
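One way to get the benefit of parallelism while keeping traffic and CPU usage under control is a bounded worker pool. The sketch below uses Python's standard `concurrent.futures` and `ipaddress` modules. The `probe` function is a stand-in that I made up for the example: a real probe would send ICMP echo requests (which needs raw sockets or the system ping command), and the pool size of 16 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import ipaddress

def sweep(subnet, probe, max_workers=32):
    """Probe every host address of subnet with at most max_workers
    probes in flight, and return the addresses that answered.

    probe is a caller-supplied function that takes an IP address
    string and returns True when the address is alive.
    """
    hosts = [str(ip) for ip in ipaddress.ip_network(subnet).hosts()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(probe, hosts))
    return [host for host, alive in zip(hosts, results) if alive]

# Stand-in probe for the example: pretend two addresses are alive.
PRETEND_LIVE = {"10.1.1.10", "10.1.1.20"}

def fake_probe(address):
    return address in PRETEND_LIVE

live_hosts = sweep("10.1.1.0/24", fake_probe, max_workers=16)
print(live_hosts)
```

The `max_workers` value is the knob to tune: it caps how many packets leave the source machine at the same moment.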

2: Optimize your ICMP Packet Timeout

I mentioned that the default ICMP timeout is 2 seconds. Luckily, this is configurable. Go ahead and send some pings to those destination IP addresses. For the “live” ones, capture the round-trip time. This is the network delay (plus the processing delay of the remote NIC). That delay will not change much on LAN links and may change slightly on WAN links. Baseline this. If the baseline is 100 msec, you can safely set a timeout of 300 msec. This is 3 times the baseline but still well below the 2-second default.
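A small sketch of that calculation follows. Taking the median of the samples as the baseline is my own assumption (the mean or a high percentile would work too), and the sample values are made up.

```python
import statistics

def tuned_timeout_ms(rtt_samples_ms, factor=3):
    """Derive a ping timeout from measured round-trip times:
    baseline (median of the samples) multiplied by a safety factor."""
    baseline = statistics.median(rtt_samples_ms)
    return baseline * factor

# Hypothetical round-trip times (msec) measured against live devices.
timeout_ms = tuned_timeout_ms([95, 100, 105])

# Wait time for the 244 vacant addresses with 2 packets each.
dead_wait_s = 244 * 2 * timeout_ms / 1000
print(timeout_ms, dead_wait_s)
```

With a 300 msec timeout, the wait for the 244 vacant addresses of the earlier example drops from 976 seconds to about 146 seconds, even before any parallelism is applied.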

Keep in mind that ICMP is one of the protocols with the lowest overhead. Layer 7 protocols like SNMP and HTTP have much more overhead, so the suggestions above may bring even greater value with them.

Long sweep times can also result in inconsistencies between sweep periods. Suppose you started sweeping 10.1.1.0/24 and found that 10.1.1.1 is vacant. You continue your sweep, and 10 seconds later 10.1.1.1 comes up. If you sweep every day, your inventory (and other dependent OSS systems) will not know about this device until the next day (unless you have a change process in place for this device). That’s why there should be a mechanism to listen for new IP address activity during the sweep time. DHCP logs could be a good alternative for networks that use DHCP for IP addressing. A costlier solution could be listening for Syslog events or on switch span ports.

Living Knowledgebase Platforms for Automated Healing

 Configuration Management, Fault Management, Inventory Management
Dec 11 2011
 

In an operational telecom environment, each fault or quality degradation is handled by the NOC engineers and repaired by following the necessary steps. These steps are written in knowledgebase management systems or live in people’s heads, based on past experience.

Each time a new problem occurs on the network, it is detected by the network management platforms and, if implemented, automatic trouble ticket generation is initiated for the root cause alarms. NOC engineers handle each trouble ticket separately. During the troubleshooting process, a separate knowledge base system may also be consulted. However, due to the added operational costs, knowledge base systems often turn out not to be very efficient.

A self-healing method could be used to automate these knowledge base systems. In this approach, each reconfiguration activity over the configuration management platform is logged for further reference. In the meantime, alarm information is also logged in the trouble management platform. The alarms, along with the configuration management logs, are fed into a database platform where they can be further correlated. The node id field (IP address, MAC address etc.), along with other inventory-related configuration information (such as card id, slot id), can be used as the primary key for this correlation.
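To make the correlation step more concrete, here is a minimal sketch that joins alarms and configuration changes on the node id. Everything in it is an assumption made for illustration: the record fields, the one-hour time window and the rule "a change on the same node shortly after the alarm is probably its fix" are mine, and a production system would need a more careful matching logic.

```python
from collections import defaultdict

def correlate(alarms, config_changes, window_s=3600):
    """Pair each alarm with the configuration changes made on the same
    node within window_s seconds after the alarm was raised."""
    changes_by_node = defaultdict(list)
    for change in config_changes:
        changes_by_node[change["node_id"]].append(change)

    pairs = []
    for alarm in alarms:
        for change in changes_by_node[alarm["node_id"]]:
            delay = change["time"] - alarm["time"]
            if 0 <= delay <= window_s:
                pairs.append((alarm["rca_type"], change["template"]))
    return pairs

# Hypothetical log records; times are seconds since some epoch.
alarms = [{"node_id": "10.1.1.5", "rca_type": "LINK_DOWN", "time": 1000}]
config_changes = [
    {"node_id": "10.1.1.5", "template": "reset-interface", "time": 1600},
    {"node_id": "10.1.1.9", "template": "update-acl", "time": 1700},
]
pairs = correlate(alarms, config_changes)
print(pairs)
```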

During day-to-day operation, when a new root cause alarm occurs on the network, the RCA type is looked up in the knowledge base for the best match to a configuration template. If a match is found, the configuration template can be populated to create the self-healing reconfiguration to be applied to the faulty device.
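The lookup and the template population could look like the sketch below. It simplifies "best match" to an exact match on the RCA type, and the knowledge base content, the alarm fields and the RCA type name are hypothetical examples.

```python
from string import Template

# Hypothetical knowledge base: RCA type -> configuration template.
KNOWLEDGE_BASE = {
    "INTERFACE_ADMIN_DOWN": Template("interface $interface\n no shutdown"),
}

def build_healing_config(alarm):
    """Return the populated re-configuration for an alarm, or None
    when the knowledge base has no template for its RCA type."""
    template = KNOWLEDGE_BASE.get(alarm["rca_type"])
    if template is None:
        return None  # no match: continue with the normal incident flow
    return template.substitute(alarm["params"])

alarm = {
    "rca_type": "INTERFACE_ADMIN_DOWN",
    "params": {"interface": "GigabitEthernet0/1"},
}
healing_config = build_healing_config(alarm)
print(healing_config)
```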

This way, fully automated healing could be run without running an end-to-end incident management process. An incident process can and should still be triggered, as these configuration activities will not be finalized in a second and customers may already have experienced service degradation or outages. However, the first task in the incident flow could be checking the alarm to identify whether it is applicable for the automated self-healing process. If the self-healing process does not apply to the scenario at hand, the incident flow can continue on its way. Again, each configuration task done over the configuration management platform will continue to feed the self-healing system with new profiles. The more data there is in the system, the better the results of the template matching algorithm will be.

Workforce Management Systems

 Configuration Management, Order management
Oct 04 2011
 

Most of the processes that we see in the OSS/BSS environment require some kind of manual activity. These activities range from configuring a server over a telnet session to a truck roll to the customer site to install a piece of equipment.

In the telco world, we call these manual tasks work orders. Work orders can be assigned to people from our internal organization or to suppliers/partners. Most operators use the outsourcing model and rely on suppliers/partners to handle the installation and maintenance tasks for their networking equipment. Usually, as the covered geography grows, multiple suppliers/partners enter the picture.

When work order counts start to increase and multiple suppliers/partners are in charge of specific regions, it becomes hard to manage and coordinate all these transactions.

This is where Workforce Management Systems come in. Workforce management systems manage the work orders and their coordination between different parties.

Other than coordination, these systems have another very important task: Scheduling and Dispatching.

Scheduling means maintaining the calendars of all the parties involved in the work order chain. This way, allocated and free time periods can be tracked and the best time to accomplish a task can be determined. The dispatching function can then auto-dispatch the work order to a resource that has free time in their calendar. Dispatching can also take resource skills into account. Once the resource skills are entered into the system, it can dispatch the task to a person who has the necessary skills.
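A very simplified dispatching rule could look like this. The data model (skills as sets, calendars as sets of free time slots) and the "first suitable resource wins" rule are assumptions for the sketch; real systems also weigh travel time, priorities and SLAs.

```python
def dispatch(work_order, resources):
    """Assign the work order to the first resource that has all the
    required skills and is free in the requested time slot.

    Returns the resource name, or None when nobody qualifies.
    """
    for resource in resources:
        has_skills = work_order["skills"] <= resource["skills"]  # subset test
        is_free = work_order["slot"] in resource["free_slots"]
        if has_skills and is_free:
            # Book the slot so it cannot be given away twice.
            resource["free_slots"].remove(work_order["slot"])
            return resource["name"]
    return None

# Hypothetical field engineers and work order.
resources = [
    {"name": "alice", "skills": {"fiber"}, "free_slots": {"Mon-09", "Mon-13"}},
    {"name": "bob", "skills": {"fiber", "dslam"}, "free_slots": {"Mon-13"}},
]
order = {"skills": {"fiber", "dslam"}, "slot": "Mon-13"}
assigned = dispatch(order, resources)
print(assigned)
```

Alice is free but lacks the DSLAM skill, so the order goes to Bob, and his Monday 13:00 slot is no longer free afterwards.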

Unfortunately, it is hard to find many Telco workforce management systems nowadays. Some OSS/BSS vendors claim that they can cover this function via their trouble ticket systems; however, these solutions fall short on some of the workforce management challenges, such as skill tracking. Nevertheless, there are also some ISVs in the sector that offer promising solutions.

Carefully designed workforce management platforms help CSPs increase efficiency in their repair as well as order management activities.

Introduction to Configuration Management

 Configuration Management
Apr 28 2010
 

Configuration management (CM) is the process that manages resource configurations. The term “provisioning” is sometimes used interchangeably with CM.

The first feature of configuration managers is to provide abstraction over the lower layer devices, which leads to business process consolidation, fewer silos and increased enterprise effectiveness.

Let me give an example. Suppose I have a customer order manager (OM) which manages the e2e order-to-cash business process. In a typical scenario, the OM will trigger a service activator (SA) to activate the service. The SA will then go to the element managers/NEs to configure them. However, each specific device may have different provisioning steps. For example, if the device is Cisco IOS based, you would probably ssh first, enter the password, enter privileged exec mode, and then enter the commands. (I remember the times I was using “expect” scripts to achieve this.) For another type of device, it could be done by SNMP SET requests. The examples differ. With a CM, I can issue the same “enable VLAN 1” command via the CM’s GUI or NBI, and the CM will apply it to multiple types of devices.
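The abstraction can be pictured as one common interface with a driver per device type. The sketch below does not connect to anything; each driver only returns the steps it would perform. The class names are hypothetical, the IOS-style lines are an example of what a CLI driver might send, and the SNMP line uses a placeholder instead of a real OID.

```python
from abc import ABC, abstractmethod

class DeviceDriver(ABC):
    """Common interface that the CM exposes for every device type."""

    @abstractmethod
    def enable_vlan(self, vlan_id):
        """Return the device-specific steps that enable the VLAN."""

class IosCliDriver(DeviceDriver):
    # Example of a CLI-based device: the steps are command lines.
    def enable_vlan(self, vlan_id):
        return ["configure terminal", f"vlan {vlan_id}", "end"]

class SnmpDriver(DeviceDriver):
    # Example of an SNMP-managed device; the OID is a placeholder.
    def enable_vlan(self, vlan_id):
        return [f"SNMP SET <vlan-status-oid>.{vlan_id} = active"]

def enable_vlan_everywhere(drivers, vlan_id):
    """One request from the OM/SA, translated per device by the CM."""
    return {name: driver.enable_vlan(vlan_id) for name, driver in drivers.items()}

plan = enable_vlan_everywhere(
    {"router-a": IosCliDriver(), "switch-b": SnmpDriver()}, vlan_id=1
)
print(plan)
```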

The second feature of CMs is configuration tracking. CMs connect to the devices at scheduled intervals, pull the configuration data and put it in their repository. This behavior enables us to roll back to a previous configuration in case of a failure.

CMs also provide a user-friendly diff that highlights which parts of the configuration have changed.
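Such a diff between two stored configuration snapshots can be produced with Python's standard `difflib` module. The two configuration snippets below are made up for the example.

```python
import difflib

def config_diff(previous, current):
    """Return the unified diff between two configuration snapshots."""
    return list(difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    ))

# Hypothetical snapshots pulled on two consecutive polls.
previous = "hostname edge-1\nip http server\n"
current = "hostname edge-1\nno ip http server\n"

diff_lines = config_diff(previous, current)
print("\n".join(diff_lines))
```

Lines that start with "-" exist only in the previous snapshot and lines that start with "+" only in the current one. An empty diff means that nothing has changed since the last poll.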

The third feature is policy enforcement. Organizations are obligated to conform to specific policies, and those should be reflected in the configurations. For example, a requirement that telnet access be disabled on routers (in favor of ssh) can be implemented on the CM. In this case, the CM will not allow a request that violates the policy to be passed to the devices, providing a control mechanism.
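A policy check of this kind can be as simple as screening the requested configuration lines against forbidden patterns before anything is sent to the device. The single pattern below (an IOS-style line that allows telnet on the terminal lines) is only an illustration, and the function name is hypothetical. A real policy set would contain many more rules.

```python
import re

# Illustrative policy: no configuration line may allow telnet access.
FORBIDDEN_PATTERNS = [
    re.compile(r"^\s*transport input\b.*\btelnet\b"),
]

def policy_violations(config_lines):
    """Return the requested lines that violate a policy rule."""
    return [
        line for line in config_lines
        if any(pattern.search(line) for pattern in FORBIDDEN_PATTERNS)
    ]

request = ["line vty 0 4", " transport input telnet ssh"]
violations = policy_violations(request)
if violations:
    print("Request rejected:", violations)
```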

Individuals may bypass the CM and configure the systems directly. This should be considered an extraordinary situation. However, the CM will also help you detect such discrepancies in the configuration data at its next configuration poll and generate notifications (via trap, email etc.).

We see 3 types of configuration managers in today’s telecommunications environment:

– Server Configuration Managers – manage the servers/PCs. They install OS, install applications, and apply patches.

– Network Configuration Managers – manage firewalls, IPS/IDS, routers and switches: their configurations, IOS updates, bulk configurations etc.

– Storage Configuration Managers – manage the storage systems (SAN, NAS etc.)

Most devices have element management systems that handle the configuration management function, and they are very successful at it. However, these element management systems can only manage specific types of elements. So, if your infrastructure is a multi-vendor environment involving multiple types of devices, investing in a configuration manager will definitely bring efficiency to your organization.