A paper by John Ousterhout (PDF), which discusses data centre (DC) network burdens and protocols, has generated some ‘noise’ online again, with a lively discussion on Hacker News, a quite direct rebuttal (and its inevitable comment tree), and a considered review, which also attracted some comment.
Below is the abstract of Ousterhout’s paper:
“In spite of its long and successful history, TCP is a poor transport protocol for modern datacenters. Every significant element of TCP, from its stream orientation to its expectation of in-order packet delivery, is wrong for the datacenter. It is time to recognize that TCP’s problems are too fundamental and interrelated to be fixed; the only way to harness the full performance potential of modern networks is to introduce a new transport protocol into the datacenter. Homa demonstrates that it is possible to create a transport protocol that avoids all of TCP’s problems. Although Homa is not API-compatible with TCP, it should be possible to bring it into widespread usage by integrating it with RPC frameworks.”
It’s Time to Replace TCP in the Datacenter
Ousterhout is currently at Stanford, having had a long career at the University of California, Berkeley, and Sun Microsystems (now Oracle). He is a recipient of the Grace Hopper Award for his contributions to computer science.
For those who don’t know the origins of the modern Internet, the Berkeley Software Distribution (BSD) of UNIX is the platform that introduced ‘sockets’: implemented in the C language, shipped as a library with the operating system, and now defined by standards committees as an Application Programming Interface (API).
Sockets were transformative, first available on the platforms of choice in tertiary education, then building out the Internet worldwide. The explosion of the Internet across the 1980s and 90s tracks the deployment of BSD-based UNIX systems and AT&T’s System V-aligned operating systems (which eventually adopted the socket model).
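The socket calls that BSD introduced are still the shape of network programming today. As a minimal sketch, using Python’s socket module (which mirrors the original C calls: socket, bind, listen, accept, connect), a TCP echo over loopback looks like this:

```python
import socket
import threading

# A minimal sketch of the BSD sockets API in Python, whose socket module
# mirrors the original C calls. Runs entirely over loopback.
def run_echo_once(host="127.0.0.1"):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))          # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()   # corresponds to accept(2)
        data = conn.recv(1024)      # corresponds to recv(2)
        conn.sendall(data)          # echo the bytes back
        conn.close()

    t = threading.Thread(target=serve)
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((host, port))    # corresponds to connect(2)
    client.sendall(b"hello")
    reply = client.recv(1024)
    client.close()
    t.join()
    server.close()
    return reply
```

The same sequence of calls, give or take error handling, is what every sockets-era network program has been built from.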
If you’ve managed remote systems in times past, you’ve probably used the ‘Expect’ package, which was used to drive scripted connections over the network to configure routers and switches. Expect was a software system implemented in Tcl by Don Libes that ‘expected’ to see specific patterns of text and sent back the equivalent of keystroked commands, so a user could drive an interactive command-and-control system from a shell script or program. It, too, was transformative, making it possible to automate, for example, configuration backup and deployment across routing and switching systems.
Ousterhout has written on the applicability of the Transmission Control Protocol (TCP), which sits on top of IP, in the data centre context. TCP provides a streaming model, unlike the User Datagram Protocol (UDP), which doesn’t require a continuous flow of packets and so doesn’t establish state to manage an ongoing stream. The important qualities of a stream in networks include ‘flow control’ (how much data it sends, how quickly, and how it shares the available bandwidth) and reliable end-to-end transport.
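The distinction shows up directly in the sockets API. A UDP exchange needs no connection set-up and carries no stream state; as a sketch (Python, over loopback):

```python
import socket

# A sketch of UDP's connectionless model: no handshake, no stream state,
# just individual datagrams sent with sendto()/recvfrom().
def udp_round_trip(payload=b"ping"):
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))    # port 0: OS picks a free port
    addr = receiver.getsockname()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(payload, addr)       # no connect(), no stream set-up
    data, _ = receiver.recvfrom(1024)  # a datagram arrives whole, or not at all
    sender.close()
    receiver.close()
    return data
```

There is no flow control and no retransmission here; each datagram stands alone, which is exactly what TCP’s stream machinery exists to improve on.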
In a global Internet, these stream behaviours are vital. Ousterhout focuses on the DC context. In many cases, this is almost a purely switched data fabric. The important distinction is that in switching, decisions on packet forwarding are made almost entirely below the IP layer: if IP is the network layer, Layer 3 in the 7-layer model of protocols (a descriptive approach to talking about the roles of the network ‘stack’), then this is the Link layer, or Layer 2. Switches are often ‘crossbar’ switches with almost no interior routing, and no visible end-to-end routing role between the communicating hosts. The fabric presents as if the two hosts talk to each other directly, and thus face no imposed barrier to full-speed, error-free delivery (assuming the hosts don’t have problems with their mutual speed and behaviours).
If the fabric is error-free, and can send and receive between hosts at interface speed with no buffering or delay, a case can be made for a different kind of stream protocol, which implies perhaps less overhead per host to manage that stream of data. And this is essentially why Ousterhout argues that TCP may be a poor fit for the DC context.
But the case here is not so clear-cut. Firstly, this isn’t the first time people have argued for a distinct protocol in the ‘local context’; the origins of the Internet actually lie in this separation of local and remote contexts. Many protocols have been designed to suit purely local networks, such as the Digital Equipment ‘LAT’ protocol, which was used for many years. It had few overheads and worked well for terminal and device communications between Ethernet nodes, but it reached its limits when more complex switch-to-switch bridged networks emerged, and it was totally unsuited to routed networks.
This introduces the next issue — the DC is not a single flat switching fabric any more, but is a large, complex and often internally-routed construct. There are applications of BGP to the interior routing model inside the DC and an emerging need to construct complex mesh and variant networks on a case-by-case basis.
It is likely that buffering and speed mismatches still occur, and flows suffer packet loss — the precise conditions that a stream protocol with adaptive ‘window’ buffering is designed to address.
Note: The adaptive window is the amount of data allowed to be in flight, that is, sent without yet being acknowledged as received. It sets a limit on the amount of retransmission and buffering that might be required to maintain the stream under packet loss.
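As a toy illustration (not real TCP code), the window can be thought of as a cap on unacknowledged bytes: the sender may transmit only while the data in flight stays below the window.

```python
# A toy illustration (not a real TCP implementation) of the window as a
# cap on unacknowledged data: in_flight = sent - acked, and the sender
# may only transmit while in_flight stays below the window.
def sendable_now(bytes_sent, bytes_acked, window):
    """Bytes the sender may still put on the wire right now."""
    in_flight = bytes_sent - bytes_acked
    return max(0, window - in_flight)
```

When acknowledgements stall (say, under packet loss), `bytes_acked` stops advancing, `sendable_now` drops to zero, and the sender is forced to pause, which is the self-limiting behaviour the note above describes.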
Some protocols are designed to work in the DC. The BBR flow control model is often said to be a reaction to the intersection of high-speed interfaces and low-cost chips in switching fabrics and is designed to minimize the buffer burdens along the path. BBR is a flow control model ‘inside’ TCP (like a sub-protocol within the TCP model) and is applicable to other stream protocols like QUIC. It is arguable that advances in TCP flow control such as BBR are capable of rectifying perceived mismatches for the DC context.
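The quantity BBR tries to keep in flight is roughly the bandwidth-delay product (BDP), the amount of data ‘the pipe’ itself holds; staying near the BDP fills the path without queuing in switch buffers. A back-of-the-envelope calculation (the figures are illustrative, not from any measured link):

```python
# A back-of-the-envelope sketch of the quantity BBR paces around: the
# bandwidth-delay product (BDP). Integer inputs keep the arithmetic exact.
def bdp_bytes(bottleneck_gbps, min_rtt_us):
    bits_in_flight = bottleneck_gbps * min_rtt_us * 1000  # Gb/s x us = 10^3 bits
    return bits_in_flight // 8

# Example: a 25 Gb/s DC link with a 20 microsecond RTT holds about
# 62,500 bytes in flight -- far less than a long-haul path at the same speed.
```

The point for the DC argument is that BDPs inside a fabric are tiny compared with long-haul paths, so a loss-probing congestion controller that fills buffers before backing off is a poor fit, which is the mismatch BBR-style pacing addresses.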
Higher protocols, layered at or above TCP logically (such as the Transport Layer Security (TLS) model), do incur overheads and burdens, particularly in initializing the connection (often this requires several back-and-forth exchanges). Adjustments to the behaviour of the initial packet exchange to reduce start-up delay are important, but add complexity. So, a case can be made for considering protocol models that achieve dataflow within one round-trip time (RTT), to reduce this initial delay burden.
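A rough tally of the round trips before application data flows makes the stakes concrete (this is a simplification; real stacks, and options such as TCP Fast Open or 0-RTT session resumption, vary these counts):

```python
# A rough tally (a simplification; real stacks vary) of round trips
# spent on set-up before application data flows.
SETUP_RTTS = {
    "TCP alone":               1,  # SYN / SYN-ACK, then data
    "TCP + TLS 1.2":           3,  # TCP handshake + 2-RTT TLS handshake
    "TCP + TLS 1.3":           2,  # TCP handshake + 1-RTT TLS handshake
    "QUIC (TLS 1.3 built in)": 1,  # transport and crypto set-up combined
}

def setup_delay_ms(stack, rtt_ms):
    """Time spent on connection set-up before the first byte of data."""
    return SETUP_RTTS[stack] * rtt_ms
```

On a 50 ms wide-area path the difference between one and three set-up RTTs is 100 ms per connection; inside a DC with microsecond RTTs the absolute cost is small, but for short RPC-sized exchanges it can still dominate the transfer itself.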
Ousterhout argues that the Remote Procedure Call (RPC), a higher protocol layer, is a better interface abstraction to focus on when designing the DC communications path. There are some merits to this case.
The RPC represents a point of abstraction of a real-world problem: the user’s code ‘knows’ it needs the result from a discrete, autonomous, asynchronously running service reached over the network (presumably; it could in fact be co-hosted on the user’s own machine) via an RPC call. ‘Remote’ here is contextual; it doesn’t actually have to be anywhere else, it’s just that it isn’t inside the local code’s running space. It’s dependent on something else to do work and respond.
Logically, this point is unambiguous: a user can either send lots of these RPC calls and collect the responses one by one (parallel/asynchronous), or wait for each to complete before sending another (serial/synchronous). Why not leverage this knowledge of the code’s intent when designing how the underlying protocols send and receive the data? Why not look for a better fit for the RPC in the transport layer?
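The two dispatch patterns can be sketched in a few lines (Python, with `time.sleep` standing in for a hypothetical remote service; the function names are illustrative):

```python
import concurrent.futures
import time

# A sketch of the two RPC dispatch patterns: serial/synchronous versus
# parallel/asynchronous. fake_rpc is a stand-in for a remote service,
# with time.sleep simulating network and service time.
def fake_rpc(x, delay=0.05):
    time.sleep(delay)   # pretend network round trip + service work
    return x * 2

def serial(calls):
    # wait for each response before issuing the next request
    return [fake_rpc(x) for x in calls]

def parallel(calls):
    # issue all requests at once, then collect the responses in order
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return list(pool.map(fake_rpc, calls))
```

The serial form pays the full round-trip delay per call; the parallel form overlaps them. A transport that knows which pattern the application intends, which is Ousterhout’s point, can schedule and pace traffic accordingly rather than treating everything as one opaque byte stream.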
Many still seek forward planning models through path or channel reservation. Arguably, inside a complex DC with many tenancies sharing a data fabric, this would be as important as on the long haul. Not all DC tenants need the same bandwidth guarantees, and protocols that align with their specific needs might permit a DC to route around congestion and balance delay and load to match ‘market expectations’ inside the DC.
Finally, there’s the barrier to widespread deployment. Mike Lesk of Bell Labs observed that the time it takes to deploy a protocol into a logistical ‘vacuum’ is far shorter than the time it takes to ‘push matter aside’ and replace an existing protocol with a new one. He made this observation in the context of his own ‘UUCP’ protocol, which was instrumental in bootstrapping the pre-Internet dial-up universe between hosts. It’s just as applicable in the context of replacing TCP. The QUIC protocol is one example: despite its different overheads, reduced connection delays, and benefits relative to TCP, and although it is being actively deployed, it is not a simple overnight replacement for TCP. There are substantial barriers to overcome in replacing a ubiquitous transport layer.
John Ousterhout’s contribution to computer science and networking is secure, no matter what the current community makes of this proposal to think about replacing TCP. It’s a lively discussion, worth reading and following. Ousterhout may have over-sold his case, and the most likely outcome is a discussion and perhaps some change in the use of TCP and other protocols.
However, replacing TCP in the data centre seems unlikely if only for Mike Lesk’s stated reason — it’s harder to push matter aside than fill a vacuum, and TCP is fully occupying that space.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.