With the move to cloud computing, data centre operators need to manage the growing tension created by co-locating multiple tenants on shared resources. This has given rise to many tenant isolation methods that provide resource guarantees, often expressed in terms of Service Level Objectives (SLOs). Below are a few illustrative examples, from small to global scale:
- CPUs provide isolation to prevent information leakage across processes.
- Hypervisors guarantee that virtual machines share the computational power of underlying hardware in a fair way.
- Bandwidth in private Wide Area Networks (WANs) is carefully allocated to flows within and across data centres.
Unfortunately, the effective range of SLOs ends at the WAN’s edge. For traffic traversing into the open Internet, no guarantees are available, and ‘best-effort’ is the only existing delivery paradigm. Businesses requiring stronger assurances for their communications — and, crucially, availability guarantees under Denial of Service (DoS) attacks — have no choice but to look at much more expensive and inflexible solutions, such as leased lines or MPLS.
Can we provide SLOs in an open Internet in a scalable manner, even under the threat of DoS attacks? Our team at ETH Zurich tackled this research question in a recent publication and proposed Colibri, a system to bootstrap and enforce delivery guarantees for Internet communication.
In this post, we describe the challenges we faced in designing Colibri, and show that Colibri can be an effective solution to bridge the gap between end hosts and the cloud, finally moving SLOs beyond the cloud’s edge.
SLOs through bandwidth reservations
In Colibri, delivery guarantees are achieved through end-to-end bandwidth reservations, which ensure that all traffic below a certain rate is neither dropped nor buffered with high probability, irrespective of congestion on the path. In practice, a reservation takes the form of a token embedded in the packet header, which enables the attribution of each packet to an existing reservation, and thus its prioritization and monitoring along the path.
As outlined above, centralized domains, such as private WANs and Autonomous Systems (ASes), are able to offer bandwidth reservations internally. We leverage this to increase Colibri’s scalability by delegating bandwidth provisioning and reservation to individual ASes. The complex challenge, then, is coordinating these AS-internal reservations across domains to ‘stitch’ them into a single end-to-end protected bandwidth tunnel.
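To make the ‘stitching’ idea concrete, here is a minimal sketch under a simplifying assumption that is ours, not the paper's: each AS on the path independently grants some bandwidth, and the end-to-end tunnel can only carry what the most constrained AS granted. The names `SegmentGrant` and `stitch` are hypothetical and do not come from the Colibri code base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SegmentGrant:
    """Bandwidth one AS agreed to reserve internally (hypothetical type)."""
    as_id: str
    granted_mbps: int


def stitch(grants: List[SegmentGrant], requested_mbps: int) -> Optional[int]:
    """Return the end-to-end rate every AS on the path can honour.

    Assumption: the tunnel rate is the minimum of the request and all
    per-AS grants. Returns None if some AS granted nothing.
    """
    if not grants:
        raise ValueError("path must contain at least one AS")
    rate = min(requested_mbps, min(g.granted_mbps for g in grants))
    return rate if rate > 0 else None
```

For a path where three ASes grant 500, 400, and 1000 Mbps, a 450 Mbps request would be stitched into a 400 Mbps tunnel under this assumption.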
Lessons learnt
This approach to bandwidth reservations isn’t new, and yet none of the systems proposed in the literature ever saw widespread adoption. Ultimately, they were dismissed because of their complexity or lack of strong security guarantees. In the design of Colibri, we had to incorporate all the lessons learnt from the shortcomings of past attempts (Table 1).
Challenge | Enabling technology |
Per-flow state in the fast path | Packet-carried forwarding state |
Re-convergence changes reservation path | Path stability |
Path hijack invalidates reservation | Path stability |
No reservation space available on path | Path choice |
On-path adversary | Path choice |
Large number of reservations | Isolation domains and segment types |
Framing attack with spoofed packets | Per-packet source authentication |
Authentication overhead of signatures | Symmetric-key authentication |
Framing or DoS through packet replay | Duplicate suppression |
Overuse of legitimate reservation | Probabilistic monitoring |
Difficult admission decisions | Reservation hierarchy |
On the control plane, the setup of reservations on a path must be carried out quickly and efficiently to increase scalability and avoid DoS attacks. Moreover, reservation requests must be authenticated to avoid impersonation attacks. We also have to be wary of adversaries trying to reserve so much bandwidth that none is left for legitimate users (for example, through a Sybil attack). This calls for a mechanism to ensure a ‘fair sharing’ of reservable bandwidth, which our team also had to develop from scratch.
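Colibri's own fair-sharing mechanism is described in the paper and is not reproduced here. As a rough illustration of what ‘fair sharing’ of a limited resource can mean, the sketch below uses textbook max-min fairness (water-filling), which is a stand-in technique chosen for illustration. The function name and the requester identifiers are hypothetical.

```python
from typing import Dict


def max_min_share(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split `capacity` among requesters using max-min fairness.

    Small demands are satisfied in full; the leftover is divided equally
    among requesters that asked for more than their equal share.
    """
    alloc = {k: 0.0 for k in demands}
    remaining = float(capacity)
    active = {k for k, d in demands.items() if d > 0}
    while active and remaining > 1e-9:
        share = remaining / len(active)
        # Requesters whose unmet demand fits within an equal share.
        satisfied = {k for k in active if demands[k] - alloc[k] <= share}
        if not satisfied:
            # Everyone wants more than an equal share: split evenly and stop.
            for k in active:
                alloc[k] += share
            break
        for k in satisfied:
            remaining -= demands[k] - alloc[k]
            alloc[k] = float(demands[k])
        active -= satisfied
    return alloc
```

With 100 units of reservable bandwidth and demands of 20, 50, and 1000, the greedy requester cannot starve the others: the allocation is 20, 40, and 40. Note that any per-requester scheme like this one only resists Sybil attacks if requester identities are expensive to multiply, which is a separate design problem.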
On the data plane, we must ensure that flows protected by reservations remain unaffected by congestion. This requires, broadly, that reservations cannot be forged, stolen, or overused. In turn, this involves authenticating and monitoring the bandwidth usage of each flow in the reservation to ensure that adversaries cannot maliciously inject traffic on behalf of others. This is a particularly difficult challenge to solve, as adversaries have many avenues at their disposal to try and spoof the source and pretend to be using the reservation, or even replay packets from a legitimate reserved flow, thus executing a framing attack.
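To illustrate what detecting overuse of a reservation involves, the following is a plain deterministic token bucket. It is a simplification: Table 1 lists probabilistic monitoring as Colibri's approach, precisely because keeping one bucket per flow on the fast path does not scale. The class name and parameters are hypothetical.

```python
class TokenBucket:
    """Deterministic rate monitor for a single reservation (illustrative only)."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8   # reserved rate in bytes/s
        self.capacity = float(burst_bytes)     # maximum burst allowed
        self.tokens = float(burst_bytes)       # bucket starts full
        self.last = 0.0                        # time of the last update (s)

    def allow(self, now: float, packet_bytes: int) -> bool:
        """Return True if the packet fits within the reserved rate."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last = max(self.last, now)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # sender exceeds its reservation
```

A packet that fails the check is over the reserved rate; whether a router drops it or demotes it to best-effort is a policy decision this sketch does not model.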
One final major problem is that routing instability undermines any attempt at achieving robust bandwidth reservations: if traffic is shifted away from the reservation path, the reservation becomes useless. Unfortunately, routing in today’s Internet is inherently unstable and routes between hosts change frequently, leading to hundreds of weekly outages. Even worse, Internet paths are targeted daily by hijacking attacks, further sinking any hope for routing stability in adversarial conditions.
SCION, a solid foundation
With Colibri, we believe we have found the first comprehensive solution to all of the problems mentioned above. This was possible thanks to several recent advances in networking, enabled by the SCION next-generation Internet architecture (to learn more about SCION, see a previous post on this blog).
Routing instability and hijacking attacks are solved by SCION’s cryptographically protected routing. The bandwidth monitoring and replay suppression systems — necessary to thwart framing and overuse attacks — have already been built on top of SCION’s packet-carried forwarding state.
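The idea behind replay suppression can be sketched as follows: a router remembers the identifiers of recently seen packets and rejects both repeats and packets whose timestamp falls outside a short freshness window. This sketch uses an exact in-memory dictionary for clarity; a line-rate system would need a far more compact structure. The class `ReplayFilter` and its parameters are hypothetical.

```python
from typing import Dict


class ReplayFilter:
    """Reject duplicate and stale packets within a time window (illustrative)."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.seen: Dict[bytes, float] = {}  # packet id -> packet timestamp

    def accept(self, packet_id: bytes, packet_ts: float, now: float) -> bool:
        # Packets outside the freshness window are rejected outright, so
        # identifiers older than the window never need to be remembered.
        if abs(now - packet_ts) > self.window_s:
            return False
        # Forget identifiers that can no longer pass the freshness check.
        cutoff = now - self.window_s
        self.seen = {p: ts for p, ts in self.seen.items() if ts >= cutoff}
        if packet_id in self.seen:
            return False  # replayed packet
        self.seen[packet_id] = packet_ts
        return True
```

The freshness window is what keeps the state bounded: a replayed packet is either still in the table or too old to be accepted.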
One of the biggest outstanding challenges we solved with Colibri is how to authenticate reservations at line rate. To achieve this, we started from DRKey, a system to share symmetric keys among ASes. Colibri uses these keys to authenticate the cryptographic tokens that contain information on the source AS, the reserved bandwidth, and a packet-specific timestamp. Once a packet reaches a border router, the tokens can be authenticated within tens of nanoseconds, and the packet can then be forwarded with strict priority over best-effort traffic.
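The general shape of symmetric-key token authentication can be sketched with Python's standard library. This is an illustration only: the field layout, the 16-byte tag length, and the use of HMAC-SHA256 are assumptions made for the example and are not Colibri's wire format or MAC construction, and the key here is simply passed in rather than derived through DRKey.

```python
import hashlib
import hmac
import struct
from typing import Optional, Tuple

FIELDS_FMT = "!QIQ"   # source AS, reserved bandwidth (kbps), timestamp (ns)
FIELDS_LEN = struct.calcsize(FIELDS_FMT)
TAG_LEN = 16          # truncated MAC length (assumption)


def make_token(key: bytes, src_as: int, bw_kbps: int, ts_ns: int) -> bytes:
    """Serialize the reservation fields and append a MAC over them."""
    fields = struct.pack(FIELDS_FMT, src_as, bw_kbps, ts_ns)
    tag = hmac.new(key, fields, hashlib.sha256).digest()[:TAG_LEN]
    return fields + tag


def verify_token(key: bytes, token: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (src_as, bw_kbps, ts_ns) if the MAC is valid, else None."""
    if len(token) != FIELDS_LEN + TAG_LEN:
        return None
    fields, tag = token[:FIELDS_LEN], token[FIELDS_LEN:]
    expected = hmac.new(key, fields, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        return None
    return struct.unpack(FIELDS_FMT, fields)
```

Because verification is a single keyed-hash computation over a few bytes, it needs no per-flow state and no public-key operation. That is the property that makes symmetric-key authentication attractive on the fast path.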
This procedure is so efficient that our implementation can authenticate and forward more than 34 million packets per second of reservation traffic (more than 270 Gbps with 1 KB packets) on a 16-core commodity server. Further, we showed that a 400 Mbps Colibri reservation does not experience a single dropped packet even when a 40 Gbps link to the server is completely overrun by a DoS attack.
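The throughput figure follows directly from the packet rate. The short calculation below reproduces it, and also derives a per-core time budget under the assumption (ours, not stated in the post) that load is spread evenly across all 16 cores.

```python
def throughput_gbps(packets_per_s: float, packet_bytes: int) -> float:
    """Convert a packet rate and packet size into gigabits per second."""
    return packets_per_s * packet_bytes * 8 / 1e9


def per_core_budget_ns(packets_per_s: float, cores: int) -> float:
    """Average time available per packet on each core, in nanoseconds,
    assuming packets are distributed evenly across cores."""
    return cores / packets_per_s * 1e9
```

`throughput_gbps(34_000_000, 1000)` gives 272 Gbps, matching the "more than 270 Gbps" figure, and `per_core_budget_ns(34_000_000, 16)` gives roughly 470 ns per packet per core for all processing, of which authentication takes only tens of nanoseconds.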
Please read our paper if you are interested in more results on Colibri’s scalability and security.
Beyond reservations
Many researchers contributed to the development of Colibri over multiple years, and yet we have only just scratched the surface of this entire research area. With Colibri as an enabler for Internet-wide SLOs, we can now start addressing more fundamental questions on how reservations can be used to improve the efficiency, speed, and fairness of Internet communication.
We hope that these research directions will kindle the interest of other researchers and inspire new solutions. For this reason, Colibri is open source and is already deployed on the SCIONLab global research network, where anyone can design and deploy applications that use Colibri bandwidth reservations.
Giacomo Giuliari is a PhD student at the Network Security Group at ETH Zurich.