People have been designing communication network topologies for more than 150 years. But usually their structure is quite constrained. Building a wide area network, one has to follow the locations of cities or railroads. Building a supercomputer, a regular structure enables simple, deadlock-free routing. Building a traditional data center network, one might use a tree-like design amenable to the Spanning Tree routing protocol. Even state-of-the-art high-capacity data center networks might use a multi-rooted tree like a fat-tree.
3-level fat-tree · 432 servers, 180 switches, degree 12
Jellyfish random graph · 432 servers, 180 switches, degree 12
This project, the work of Ph.D students Ankit Singla and Chi-Yao Hong, along with Lucian Popa of HP Labs and myself, has two goals. First, high bandwidth helps servers avoid bottlenecks while streaming big data across the network, and gives cloud operators the agility to place virtual machines on any physical host without worrying about bandwidth constraints between hosts.
Second, we want a network that is incrementally expandable. Cloud service providers continually expand and modify their networks to meet increasing demand and to reduce up-front capital expenditure. However, existing high-capacity network interconnects have relatively rigid structure that interferes with incremental modification. The fat-tree, hypercube, butterfly, and other proposed networks can only be built from a limited menu of fixed sizes and are difficult to expand by, for example, adding a rack of servers at a time. Of course, there are some workarounds: one can replace some switches with ones of larger port-count or oversubscribe them, but this can make network bandwidth constrained or uneven across the servers. One could leave ports free for future network connections but this wastes investment while they sit idle.
Our solution is to simply give up on structure, and build a random network among the network routers. This sloppiness yields significantly more flexibility than past designs: a few random link swaps is all it takes to incorporate additional components, making a new random network. It can naturally support varying port-counts, and scales to arbitrary sizes rather than limiting the network to coarse design points. In fact, we show in the paper that Jellyfish reduces the cost of incremental expansion quite substantially over a past expansion heuristic for fat-tree-like (Clos) networks. Intuitively, Jellyfish makes network capacity less like a structured solid and more like a fluid. Coincidentally, it also looks like a jellyfish.
At this point, one natural reaction is that a completely random network must be the product of a half-deranged intellect, somewhere between 'perpetual motion machine' and 'deep-fried butter on a stick'. Won't network capacity decrease, due to the sloppy interconnections? How does one route packets through a completely unstructured network? Isn't it possible to randomly pick a bad network, or for failures to cause problems? How do you physically cable up a network that bears more than a passing resemblance to a bowl of spaghetti? How could it possibly work?
The first surprise is that rather than sacrificing bandwidth, Jellyfish supports roughly 25% higher capacity than a fat-tree of equal cost. That is, a completely random network makes more efficient use of resources than a carefully-structured one. The intuition is that Jellyfish's diverse connections — in theoretical terms, it is a good expander graph — give it low average path length, which in turn means that sending each packet across the network takes less work. In fact, there is reason to believe that the random graph is pretty close to the best possible network for maximizing throughput, perhaps within 10%.
Routing is another interesting question. Chi-Yao and Ankit found that load-balanced routing works well as long as the routing protocol provides sufficiently high path diversity, as can be obtained with OpenFlow, MPLS, or other recent proposals that go beyond STP or simple shortest path routing. In addition, it turns out that the questions of consistent performance, resilience, and cabling have favorable, and we believe reasonably practical, answers. There are some very interesting theoretical questions that come up as well, which we're now looking into.
Jellyfish is sufficiently unlike past designs that implementation challenges will certainly arise. But so far it seems like an unstructured network just might work. And happily, rather than running into a tricky tradeoff, the two design goals of high bandwidth and incremental expansion appear to be satisfied by one network.
Congratulations to Ankit and Chi-Yao who have done (and are continuing to do) great work on this project. Don't miss Chi-Yao's talk in the Thursday afternoon data center networking session! Finally, thanks to NSF for supporting this research.