As networks grow in size and complexity, operators need an increasing array of tools to manage the flow of traffic. The requirements they must satisfy can be very wide-ranging, with competing internal interests, regulatory requirements, and hardware capabilities all needing to be balanced.
Finding solutions
Operators will tend to start off with basic solutions. Adding bandwidth is generally the easiest, and almost always helpful — there are very few network operators that will complain they have too much bandwidth or too much fibre, although financial considerations push against this.
The next consideration is a solid Quality of Service (QoS) design and implementation. While other technologies can help, having queueing, traffic priorities and the like clearly defined and consistently deployed is a cornerstone for the network as a whole. Static routes are often the first quick and easy fix for very specific problems, but documenting them and managing their lifetime become issues.
If an operator’s problems can be solved with actions based solely on destination address, additional options for dynamically adjusting routing are available, including Interior Gateway Protocol (IGP) metric manipulations at the advertisement point, IGP as Multi-Exit Discriminator (MED) and Accumulated IGP (AIGP) if BGP is in use. These allow a more dynamic environment than static routes, and potentially make understanding the routing path easier, as opposed to seeing a number of static routes pinned on different routers.
Figure 1: Using IGP as MED to address sub-optimal routing.
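As a rough illustration of the IGP-as-MED idea, the sketch below models two border routers advertising the same prefix to a neighbouring network, each setting MED to its own IGP cost towards the destination. The class, function, and router names are hypothetical, and the model is deliberately simplified: it assumes both advertisements come from the same neighbour AS and tie on every attribute that BGP compares before MED.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    border_router: str  # hypothetical name of the advertising border router
    igp_metric: int     # IGP cost from that router to the destination


def med_from_igp(adv: Advertisement) -> int:
    # "IGP as MED": the advertised MED simply mirrors the internal IGP cost.
    return adv.igp_metric


def preferred_entry(advs: list[Advertisement]) -> Advertisement:
    # Assumption: all earlier best-path steps tie, so the lowest MED wins.
    return min(advs, key=med_from_igp)


advertisements = [
    Advertisement("border-a", igp_metric=50),
    Advertisement("border-b", igp_metric=10),
]
print(preferred_entry(advertisements).border_router)
```

Because the MED follows the IGP cost, a metric change inside the network moves the preferred entry point without anyone touching a static route.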
Once beyond the capabilities of those options, going up another rung in complexity are options that can handle a nearly limitless set of requirements. Resource Reservation Protocol – Traffic Engineering (RSVP-TE) has been the go-to for over a decade, and continues to be used and supported in a number of vendor platforms. RSVP-TE requires an MPLS underlay to be available, and its own protocol to be enabled.
Segment Routing Traffic Engineering (SR-TE) is a newer alternative, although that newness presently limits its real-world usability: existing hardware generally will not support it well, a Path Computation Element (PCE) needs to be deployed for most use cases, and it generally requires BGP Link-State (BGP-LS) to be enabled for its own operation. While most focus is on using SR with the same MPLS underlay as RSVP-TE, a version is being developed that leverages extensions to the IPv6 header (SRv6), although support for this in hardware is still immature.
Identifying the problem(s)
As operators investigate these options, they normally run into a similar set of problems. The first question to resolve is ‘What actually are the requirements?’. Defining exactly what the problems are can be illuminating. A problem may turn out to be a need for raw bandwidth, QoS prioritization, or a level of isolation that can be handled more easily with an L3VPN or EVPN-based deployment. Whether multicast is needed, and at what scale, can restrict options: while both RSVP-TE and SR-TE support multicast, each has restrictions (increased label scale with RSVP-TE; potentially limited support today and limits on the number of receivers with SR-TE). Label scaling can also be a problem in the core for unicast traffic with RSVP-TE in very large networks, although work is underway in the IETF on proposals to reduce the severity of that problem.
The next common problem is one of ‘technical debt’, or having legacy hardware still in use. While serviceable, these platforms need to be carefully investigated to see whether their older hardware can support the new design. Platforms have limits on how far they can look into a packet in order to perform L2 (Link Aggregation Group (LAG)) or L3 (Equal Cost Multi-Path (ECMP)) multipath hashing, and on how many labels they can push onto the stack at once. Care needs to be taken here: vendors may claim higher capabilities, but those may come at the cost of overall throughput due to reflow, a mechanism by which the Application-Specific Integrated Circuit (ASIC) does part of the operation and then sends the packet back through again to finish.
Figure 2: Complexities arising from legacy hardware limitations.
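One way to make the label-push question concrete during a hardware audit is to record, per platform, both the depth it can push in a single pass and the vendor-claimed maximum, then classify each planned label stack against them. The sketch below does that. The platform names and depth values are invented for illustration, and the idea that any depth beyond the single-pass figure implies reflow is an assumption that would need confirming with the vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str                # hypothetical platform name
    native_push_depth: int   # labels pushable in one pass through the ASIC
    claimed_push_depth: int  # vendor-claimed maximum (assumed to rely on reflow)


def classify(platform: Platform, labels_needed: int) -> str:
    """Classify how a platform would handle a label stack of the given depth."""
    if labels_needed <= platform.native_push_depth:
        return "single-pass"
    if labels_needed <= platform.claimed_push_depth:
        return "reflow"  # works, but likely at reduced throughput
    return "unsupported"


# Illustrative inventory; these values are not taken from any real datasheet.
inventory = [
    Platform("legacy-edge", native_push_depth=3, claimed_push_depth=6),
    Platform("new-core", native_push_depth=10, claimed_push_depth=10),
]

for platform in inventory:
    print(platform.name, classify(platform, labels_needed=5))
```

Keeping the two depths separate makes it harder for a claimed capability to hide a throughput penalty when the design is reviewed.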
Another problem to consider is operational complexity, and being able to quickly troubleshoot problems when they arise. SR-TE’s requirement of a controller is a new paradigm for operators to contend with, and operators should take care to plan not just the initial deployment, but also the operational and support changes that will arise from the shift in methodologies. RSVP-TE and SR-TE both put traffic into various tunnels, which require learning a new set of commands to diagnose and understand. Consider questions such as ‘How much traffic is going through this tunnel?’, ‘Where is this traffic actually going?’, ‘Where did this traffic come from?’, and ‘How can I divert this traffic around a node that is about to go out-of-service?’ How do you determine the answers to these at the head-end? On an intermediate node? On the egress node? Which of these questions actually need to be answerable? The answers need to be documented and well-tested; a 2 a.m. outage is not the time or place to discover the answers (or the lack thereof).
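The questions above, crossed with the three node roles, form a small matrix that can be tracked explicitly. The sketch below is one possible way to flag the cells that still have no documented procedure. The `runbook` mapping and its contents are hypothetical; an entry could hold a tested procedure or an explicit decision that the question does not need to be answerable at that role.

```python
from itertools import product

QUESTIONS = (
    "How much traffic is going through this tunnel?",
    "Where is this traffic actually going?",
    "Where did this traffic come from?",
    "How can I divert this traffic around a node that is about to go out-of-service?",
)
ROLES = ("head-end", "intermediate", "egress")


def undocumented(runbook: dict) -> list:
    """Return every (question, role) pair with no entry in the runbook."""
    return [
        (question, role)
        for question, role in product(QUESTIONS, ROLES)
        if not runbook.get((question, role))
    ]


# Hypothetical runbook with a single documented cell.
runbook = {
    (QUESTIONS[0], "head-end"): "Procedure documented and tested in the lab",
}

for question, role in undocumented(runbook):
    print(f"Missing: {role}: {question}")
```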
There are a lot of options and technologies out there to investigate, and there will always be something new being worked on. A careful understanding of the problems, understanding where your network is today, and keeping in mind the need to support and maintain what is built are, as always, critical to the successful engineering of networks.
Andrew Gray is a Principal Engineer at Charter Communications.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.