Complex Systems & Autonomic Networks, Part 4: Catastrophic Problems

Catastrophic Problems

“There are only two types of networks, those which have failed and those which have yet to fail.”
- Anonymous bar joke among network and OSS designers

This is another historical example. I am recounting it to the best of my recollection after a long passage of time, and I did not have an active role in the events, so some details may be wrong.


Problem: Once upon a time, long long ago in the nineties, a major service provider’s frame relay network went unstable. Customer endpoints started becoming unreachable. Each time, repair operators followed a NOC procedure: a list of probable causes with do-this lists of tests, actions, and scripts, all without result. Rebooting the router, an action well down on the list, restored service for the complaining customer. More and more of these isolations started occurring. Early on, it was difficult to see any causal linkage or pattern because new isolations occurred in parts of the network distant from the original isolation. Finally, the problems started moving in waves across the network, with customers’ endpoints becoming isolated and then spontaneously fixing themselves. More and more customers were affected until a big part of the customer base experienced this issue. The trade journal press got hold of the problem and lots of bad publicity resulted, with the story eventually making headlines in the major press.

Response: Let’s break down the service provider’s response to this outbreak. A few weeks in, once it was realized that a significant network-wide problem was occurring, Tiger Teams were established to find the problem. [A Tiger Team is a collection of experts, usually spanning several organizations like Operations, Engineering, and IT, who are removed from ordinary job activities to work closely together on a single problem.] The original network activity logs were reviewed, and the team identified that a new router/switch software load had been installed about a week before the problem started. The Lab’s network tests of the current ‘problem’ load were reviewed and then repeated, but resulted in “no problem found.” Nevertheless, with no other course of action apparent, reversion to the old network software load was recommended. A trial program was started to return a subset of the routers in one zone to the prior load. The problem did not stop.

This was very difficult for management to accept, and the Tiger Team was reorganized. The enhanced team recognized the problem as “routing storms” floating through the network. The problem did not disappear until all of the several hundred switches were returned to the prior load. Restoration eventually fixed the problem, but several weeks of outages angered customers, and the press bashing lasted much longer than the actual network problems.

Vendor/Provider Interactions: During this entire time, as is standard practice, the problem had been escalated to the switch/router vendor for isolation. Their lab tests were also inconclusive, so eventually a line-by-line review of all software changes in the routing code was called for. Finally, it was found that a line of the OSPF routing code had been replaced with a debug line that was mistakenly never removed and was carried into the network release. Major finger pointing started between the vendor and the provider, and long-established trust went down the tubes.

Post Mortem: What had happened was this. To test routing update propagation, the vendor’s switch programmers had changed the timing of OSPF updates from a longish period to a rather short period, so that they could observe the routing changes propagating and stabilizing. As everything was working in the test network of a dozen or so switches, a vendor supervisor OK’ed distribution of the update code (without the debug line being removed). When it got to the service provider and was placed on their test bed of a dozen or so routers, all the script tests passed. The software was scheduled for loading in the network, a few switches every night. This was a good, cautious approach to testing, qualifying, and deploying the new code by the service provider. Eventually the network rollout was complete.

Problems in the real network began to be observed about a week later. But in retrospect, a post mortem of network and OSS logs found that problems began occurring when something like half the switches had been converted. The seemingly random, unrelated nature of the problems let them go ‘under the radar’, as each was fixed in isolation by the NOC “reboot the router card” procedure. Routing tables are the switch control structures that determine on which of the many outbound links to place each packet. In routing code, whenever a change or reboot occurs, routing update messages are sent out. These cause each switch to respond with its routing table “access to” entries, and the return messages are then received back at the rebooted router, which computes its new routing table. Only the timing, and thereby the frequency, of update messages was changed. So what happened?

The problems never occurred in the test bed lab networks because they were too small and the switches were too close together. An update message was sent whenever a change occurred in the router. Each update message caused a response from connected switches and resulted in the message sender recalculating its routing tables. In a smaller network, the number of responses was small, so the time to re-compute the table was short, and network behavior was stable. This problem simply could not be observed as behavior in a smaller network. Even debugging code inserted specifically to watch for it would only reveal whether the frequency and timing of updates were in line with expectations.

My Post Mortem: The problem manifested when two things occurred: (1) the network became large enough, in number of connected routers, that the computation times were very long; and (2) a return request for a routing update from another switch arrived before the re-computation from the earlier request was complete. This restarted the computation and sometimes, but not always, isolated the links which had not been updated. The problem became major with the NOC responses. It turned out that the NOC action of fixing the isolated links, for which no apparent problem could be found, by rebooting the router was a major contributor to the network failure. Each reboot caused a new update to be sent out, isolating more links. A positive feedback loop started.
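The two conditions above can be put into a toy calculation. The sketch below is not the vendor's OSPF code. It assumes, purely for illustration, that table re-computation time grows linearly with the number of routers, and every number in it (the per-router cost, both update intervals, and the network sizes) is hypothetical. It only shows how the same short debug interval can pass in a dozen-switch lab and fail in a network of several hundred switches.

```python
# Toy model, not real OSPF: all constants below are illustrative assumptions.
SECONDS_PER_ROUTER = 0.05   # hypothetical cost of processing one router's response
NORMAL_INTERVAL_S = 1800.0  # hypothetical "longish" update period
DEBUG_INTERVAL_S = 10.0     # hypothetical "rather short" debug period


def recompute_time(num_routers: int) -> float:
    """Assumed linear growth of table re-computation time with network size."""
    return num_routers * SECONDS_PER_ROUTER


def recompute_completes(num_routers: int, update_interval_s: float) -> bool:
    """True if a re-computation finishes before the next update can restart it."""
    return recompute_time(num_routers) < update_interval_s


if __name__ == "__main__":
    intervals = [("normal", NORMAL_INTERVAL_S), ("debug", DEBUG_INTERVAL_S)]
    for label, size in [("lab test bed", 12), ("production network", 300)]:
        for name, interval in intervals:
            done = recompute_completes(size, interval)
            status = "completes" if done else "is restarted"
            print(f"{label} ({size} routers), {name} interval: re-computation {status}")
```

Under these assumed numbers, only the combination of the large network and the short interval fails to complete, which is consistent with neither test bed being able to reproduce the fault.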

Every attempt to fix the problem caused more problems to occur, at random, and distant from the fix. Eventually, the number of request messages and re-computation starts was so large that a new stability point was engendered in the network. This network was sufficiently large in scope and extent that the routing storm became self-sustaining, and customers came in and out of service as it raged across the network. [A new Markov Mixing stability point had occurred - more on this later.] This is why restoring one section of the network to the old ‘good’ software load did not work: the old load was subjected to these same forces.

The original, simple change in the frequency of update requests in debug code had potentially destabilized the network. This change in update timings caused occasional link drops in distant switches inside a large network with significant geographical distance between switches. Left alone, a few problems would have occurred continually and then fixed themselves, but not a large number, because routing code is written to suppress this kind of behavior. But the NOC response procedures had fed the problem by causing many reboots of the switches, thereby pushing the number of routing update requests above a threshold and causing the catastrophic response.
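The threshold behavior described above can be illustrated with a small feedback model. This is a sketch under stated assumptions, not a reconstruction of the actual network: the function names, the per-step rates, and the idea that each reboot isolates a fixed average number of other links are all hypothetical simplifications, chosen only to show the shape of the loop.

```python
def growth_factor(heal_fraction: float, reboot_fraction: float,
                  isolations_per_reboot: float) -> float:
    """Multiplier applied to the isolated-link count each step (assumed model).

    Below 1.0 the count settles to a small steady state; above 1.0 it grows.
    """
    self_healing = (1.0 - heal_fraction) * (1.0 - reboot_fraction)
    return self_healing + isolations_per_reboot * reboot_fraction


def simulate(steps: int, base_rate: float, heal_fraction: float,
             reboot_fraction: float, isolations_per_reboot: float,
             total_links: int) -> list[float]:
    """Track isolated links per step under the hypothetical feedback model."""
    isolated = 0.0
    history = []
    for _ in range(steps):
        # The NOC reboots routers for some fraction of the isolated links.
        reboots = reboot_fraction * isolated
        # Links not rebooted may heal on their own (routing code suppression).
        remaining = (1.0 - heal_fraction) * (isolated - reboots)
        # Each reboot restores one link but its updates isolate others.
        isolated = base_rate + remaining + isolations_per_reboot * reboots
        isolated = min(isolated, total_links)  # cannot exceed the network size
        history.append(isolated)
    return history


if __name__ == "__main__":
    # All parameter values are made up for illustration.
    left_alone = simulate(50, 2.0, 0.5, 0.0, 3.0, 1000)
    with_reboots = simulate(50, 2.0, 0.5, 0.8, 3.0, 1000)
    print("left alone, final isolated links:", round(left_alone[-1], 2))
    print("with NOC reboots, final isolated links:", round(with_reboots[-1], 2))
```

In this model the reboot procedure matters only through the growth factor: with no reboots the isolated-link count settles at a small value, while an aggressive reboot policy pushes the factor above one and the count climbs until it saturates the network.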

At the time of the problem and solution, even getting the NOC to acknowledge that they participated in the problem was difficult. The assessment presented here was not accepted officially. Finger pointing continued between vendor, engineering and operations.

Costs: This was one example, but in the nineties, every major Frame Relay service provider experienced some form of catastrophic network systems behavior. The costs of these failures were enormous. Outages lasted for days and goodwill was lost. Customers switched networks, and sometimes entire accounts, with all product groups, were affected. If you include the costs of the failures to the customers, whose internal operations were affected, the costs increase by orders of magnitude. If you include these external costs with the costs of operating a network, then a truer cost of OSS design and network design can be reached. But these catastrophic and customer-externalized costs are never quantified in the “efficiency computation factors” of design and purchase decisions. “Lean and mean” decisions that ignore them will result in suboptimal cost allocations and losses in the network.

Complex Systems: This Frame Relay routing update failure example is presented as a lead-in to the next section. Networks are now so complex and so large that they exhibit specific system-level effects. These effects are not evident in the micro behavior of the individual network components. Also, we must be broad in our inclusion of “what is the network”. The actions of the NOC and OSS, with their procedures designed around failure incidents, helped cause the problem. So the NOC and OSS must both be included as actors in the complex system if we are to understand network behavior.

It is hoped that these blog articles will begin to change this. In the next section, we will look at the formal study of complex systems and emergent behaviors.

Details


Author: Wedge Greene
Published: 06/21/2006

Wedge Greene is a consultant for LTC International.
