What happened to the Internet on Friday

On Friday, a large disruption of Internet traffic made the news as an experiment gone awry. What actually happened? It's a good lesson in how fragile and insecure the Internet's routing protocol can be.

There was indeed a major event on Friday. A plot by Earl Zmijewski of Renesys shows that at the moment the experiment started — 8:41 GMT — about 3,000 IP prefixes became unstable. That is, the routes to these prefixes were quickly changing or being advertised and withdrawn. (An IP prefix is a chunk of destination IP addresses, the basic unit on which Internet routing operates.) Since there are roughly 300,000 prefixes announced globally, this is about 1% of the prefixes on the entire Internet.

We can also observe the effects by looking at the total amount of "chatter" in the Internet's global routing protocol, BGP. I created the following graphs based on raw data from the Route Views project.

This plot shows the rate of BGP messages received by one particular router, located at the London Internet Exchange. Routers are continually exchanging messages about new, changing, or unavailable routes to destinations all around the world. However, as you can see, the event in question vastly increased the rate of routing updates, exceeding the "background radiation" of messages by about a factor of 6.
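
As a rough illustration of how a rate plot like this can be derived, here is a minimal sketch that bins message arrival times into one-minute buckets and compares the busiest bucket to the median one. The function names and the synthetic timestamps are my own assumptions for the example; this is not the script used for the graphs, and it does not parse real Route Views data.

```python
from collections import Counter
from statistics import median


def updates_per_bin(timestamps, bin_seconds=60):
    """Count messages per time bin; returns {bin_start_second: count}.

    Bins with no messages are simply absent from the result.
    """
    counts = Counter((int(t) // bin_seconds) * bin_seconds for t in timestamps)
    return dict(sorted(counts.items()))


def peak_over_baseline(counts):
    """Ratio of the busiest bin to the median bin (a crude 'background' level)."""
    values = list(counts.values())
    return max(values) / median(values)


# Synthetic data (assumption): 10 messages per minute for five minutes,
# except for a burst of 60 messages in the third minute.
timestamps = []
for minute in range(5):
    n = 60 if minute == 2 else 10
    timestamps.extend(minute * 60 + i * (60 / n) for i in range(n))

rates = updates_per_bin(timestamps)
print(rates)
print(peak_over_baseline(rates))
```

With this made-up input the burst minute stands out at six times the median rate, which is the kind of ratio described above. Real feeds are much noisier, so the median is only one of several reasonable baselines.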

The event was visible globally. Here is the same plot, for a router at Equinix in Ashburn, VA.

How can a disruption of this magnitude happen? Based on a note from RIPE, it went something like this:

  1. Researchers at RIPE and Duke create a BGP announcement message, which advertises the availability of an IP prefix under their control. The message uses an unusual format, but one which complies with the BGP protocol format.
  2. The message begins to propagate from router to router on the Internet, as normal, until...
  3. ...it reaches some router running the Cisco IOS XR software. These routers have (or had) buggy software which, upon receipt of the unusual message, corrupted the message before propagating it to other neighboring routers.
  4. A neighboring router (call it N) has now received a malformed message from the Cisco router (call it C). N then follows the BGP protocol specifications, which require that N terminate its BGP connection to C. This disrupts traffic to any destination which N reached via C (and vice versa), not just traffic to the prefix originally announced by RIPE!
  5. It is likely (depending on router configuration, so I'm not sure how common this is) that either C or N then attempts to re-establish the BGP connection. In this case, C re-advertises every route it knows about to N, perhaps all 300,000 of them. One of these would presumably be the corrupted message, causing the connection to be terminated again, and the process to repeat indefinitely.
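
Steps 4 and 5 can be sketched as a toy simulation. This is only a model under simplifying assumptions: `Update`, `run_session`, and the example prefixes are hypothetical names made up for illustration, a "malformed" flag stands in for the corrupted bytes, and real routers use timers and backoff that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    prefix: str
    malformed: bool = False  # stands in for "C corrupted this message"


def run_session(table, max_attempts=3):
    """Simulate N repeatedly re-establishing its session with C.

    On each attempt C re-sends its whole table. If N sees a malformed
    update it resets the session, dropping every route learned from C.
    Returns (attempts_made, prefixes_installed_at_the_end).
    """
    installed = set()
    for attempt in range(1, max_attempts + 1):
        installed = set()  # a new session starts with nothing learned from C
        for update in table:
            if update.malformed:
                installed.clear()  # session reset: all routes via C are lost
                break
            installed.add(update.prefix)
        else:
            return attempt, installed  # whole table accepted; session stays up
    return max_attempts, installed


clean = [Update("192.0.2.0/24"), Update("198.51.100.0/24")]
poisoned = clean + [Update("203.0.113.0/24", malformed=True)]

print(run_session(clean))
print(run_session(poisoned))  # (3, set())
```

Note that in the poisoned case the two perfectly good prefixes are lost as well, and raising `max_attempts` changes nothing: as long as the corrupted route is in C's table, every new session ends the same way.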

It's always a good idea to isolate security problems to contain damage. And many BGP problems can be isolated close to the origin of the bad announcement. This event, on the other hand, apparently caused (brief) widespread damage for two reasons.

  • It spread geographically because the original announcement message was entirely valid, and was handled correctly by many routers. Thus, the message could reach buggy routers anywhere on the planet.
  • It spread to many IP prefixes beyond the original announced prefix, because the BGP protocol spec asserts that if a router sends a bad message for one prefix, it's unsafe to communicate with that router for destinations in any prefix.

Similar events have occurred in the past.

Despite the headlines ("Research experiment disrupts Internet") on Slashdot and Network World and the Renesys post's point of view, I find it hard to place blame on the researchers. I assume it was not their intent to stress-test the live Internet. Clearly, one problem was the software bug, which Cisco quickly acknowledged and fixed.

But we can also think about the protocol design. One way to better isolate the damage might have been for the router N to discard only the single malformed message from C, logging an error message but not terminating the entire BGP session between N and C. The counter-argument is that receipt of one malformed announcement raises the probability that other announcements are malformed, too. Indeed, in this bug, C apparently declared an incorrect header length; something similar to this could plausibly confuse N's parsing of all subsequent messages from C.
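
To make the counter-argument concrete, here is a small sketch of how one wrong length field can derail everything that follows it on a stream. The framing below (a one-byte length followed by that many payload bytes) is a toy format I made up for the example, not BGP's actual wire format, and `frame` and `parse_stream` are hypothetical helpers.

```python
def frame(payload: bytes, declared_len=None) -> bytes:
    """Toy framing: one length byte, then the payload.

    Passing declared_len simulates a sender that writes the wrong length.
    """
    n = len(payload) if declared_len is None else declared_len
    return bytes([n]) + payload


def parse_stream(data: bytes):
    """Split a byte stream into payloads, trusting each length byte."""
    messages = []
    i = 0
    while i < len(data):
        n = data[i]
        messages.append(data[i + 1:i + 1 + n])
        i += 1 + n
    return messages


good = frame(b"AAAA") + frame(b"BBBB") + frame(b"CCCC")
bad = frame(b"AAAA") + frame(b"BBBB", declared_len=2) + frame(b"CCCC")

print(parse_stream(good))  # [b'AAAA', b'BBBB', b'CCCC']
print(parse_stream(bad))   # [b'AAAA', b'BB', b'B\x04CCCC']
```

In the second stream the receiver never recovers the third message, even though the sender framed it correctly: the leftover bytes of the second message are misread as a new length field. Skipping just the one bad message is only safe when the receiver can still tell where that message ends.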

Looking into the much more distant future, a very different approach would be to base routing decisions on end-to-end observable behavior (do my packets actually get through to the destination along this path, or not?) rather than on relatively uninformative and attack-prone control plane announcements. This robustness is one potential benefit of designs like our pathlet routing and Xiaowei Yang's NIRA.
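
The basic idea of judging a path by what actually gets through can be sketched in a few lines. To be clear, this is not how pathlet routing or NIRA work; `best_path`, the path names, and the probe results are all hypothetical, and the sketch only shows the decision being driven by observed delivery rather than by announcements.

```python
def best_path(probe_results):
    """Pick the path with the highest observed delivery rate.

    probe_results maps a path name to a list of booleans,
    where True means a probe packet reached the destination.
    """
    def delivery_rate(path):
        results = probe_results[path]
        return sum(results) / len(results) if results else 0.0

    return max(probe_results, key=delivery_rate)


# Made-up observations: the path via C is announced but mostly fails.
probes = {
    "via_C": [False, False, True, False],
    "via_D": [True, True, True, False],
}
print(best_path(probes))  # via_D
```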
