Making Segment Routing user-friendly

6 Dec 2024

Category: Tech matters

Segment Routing (SR) was supposed to make Multiprotocol Label Switching (MPLS) easier and give more power to network operators. Sadly, vendors decided to make it harder by selling weird protocols and over-engineered controller bloatware.

MPLS is actually great

Despite some anti-MPLS marketing from SD-WAN vendors and the like, as a transport technology, there is no real alternative to MPLS [1]. MPLS provides:

  • A low-overhead underlay for services, so only provider edge (PE) routers need to carry full Internet tables and VRFs for L3VPN.
  • Traffic engineering, which evenly load-balances traffic in any topology, provides a low-latency path for services that need it, a disjoint path for A/B feeds, and so on.
  • Fast reroute, which enables sub-50ms convergence after link or node failures.

However, throughout its evolution, MPLS has become unnecessarily complex. Let’s try to understand why.

Traditional MPLS stack

Normally the MPLS network would run IS-IS or OSPF for IP routing, Label Distribution Protocol (LDP) or Resource Reservation Protocol (RSVP) for label distribution, and Border Gateway Protocol (BGP) with its many address families for MPLS services. Something like Figure 1.

Figure 1 — Traditional MPLS overview.

There are three problems with the traditional MPLS stack: Operational complexity, protocol limitations, and interoperability.

Operational complexity

LDP is fairly simple to configure and operate but doesn’t support Traffic Engineering and has only limited support for fast reroute using remote Loop-Free Alternate (LFA).

RSVP supports Traffic Engineering but requires a full mesh of point-to-point tunnels, which limits scalability.

In practice, ‘LDP or RSVP’ often becomes ‘LDP and RSVP’. LDP within each PoP and RSVP on WAN links connecting different PoPs, with targeted LDP sessions over RSVP links. This further increases complexity as now you have to operate two label protocols. Something like Figure 2.

Figure 2 — An example of targeted LDP topology.

Protocol limitations

These are fundamental and apply to all implementations:

  • Poor ECMP support: RSVP doesn’t support Equal-Cost Multipath (ECMP) at all. LDP supports it but can run into limitations (1, 2). ECMP and leaf / spine designs are very common in modern networks.
  • LDP-IGP sync: While its purpose is to prevent traffic blackholing during reconvergence, it’s easy to shoot yourself in the foot and create a situation where Interior Gateway Protocol (IGP) sessions get stuck and never come up.
  • Traffic engineering with RSVP is non-deterministic: Since every router signals its LSPs independently of the others, the ultimate state of routing depends on the sequence of events and can be different every time the network reconverges.
  • Poor scalability: Ironically, independent LSP signalling doesn’t mean better scalability. On the contrary, every router must maintain state for all transit LSPs going through it, so RSVP doesn’t scale well.

Interoperability issues

Since we’re dealing with a stack of at least three protocols and many extensions to them, building an MPLS network using hardware from different vendors becomes a big pain:

  • Not all implementations support all features and extensions. With the traditional MPLS architecture, this often means that if one router doesn’t support a certain feature, you can’t use it at all.
  • Different vendors sometimes interpret RFCs differently, so both claim to support a certain standard, but when you connect their boxes together, it just doesn’t work.

Back in TAC, I spent many days troubleshooting things like RSVP refresh reduction or fast reroute between two big mainstream vendors, or even between two different operating systems from the same vendor! Imagine what happens when you introduce lesser-known implementations.

In practice, all of this means that the only viable way to build a traditional MPLS network is to buy all routers from one of the big vendors and follow their validated design. This results in vendor lock-in and makes MPLS inaccessible for many smaller networks that cannot afford to spend fortunes on network hardware.


Segment Routing basics

SR throws away all the label distribution garbage and adds a few extensions to IS-IS and OSPF to advertise Segment IDs (SIDs) along with links and prefixes. There is no more ‘label switching’ [2]; the ingress Label Edge Router (LER) just pushes the SID of the egress LER onto the packet, and transit Label Switching Routers (LSRs) don’t change that label.

Since SR is IP-based and has no circuit-switching roots like RSVP, it natively supports ECMP and anycast. With good network design, SR scales very well.

Traffic engineering with SR

So shortest-path forwarding works great; now what about traffic engineering? Unlike RSVP, SR doesn’t signal LSPs; the ingress router just adds multiple SIDs to forward the packet along the traffic-engineered path.

Figure 3 — An example of basic SR traffic engineering (SR-TE) topology.

In the topology shown in Figure 3, to forward a packet from R1 to R6 via blue links, R1 pushes three labels:

  • Node SID of R3 -> forwards the packet to R3 using the shortest path.
  • Adjacency SID (Adj-SID) of R3 towards R5 -> uses that specific link.
  • Node SID of R6 -> delivers the packet to the actual destination (R6).
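
To make the label arithmetic concrete, here is a minimal Python sketch of the stack R1 builds. The SRGB base and SID values are my own assumptions for illustration; in a real network they come from the IGP:

# Sketch: the label stack R1 pushes for the blue path in Figure 3.
# SRGB base and SID values are hypothetical; real values come from the IGP.
SRGB_BASE = 16000                      # start of the Segment Routing Global Block

node_sid_index = {"R3": 3, "R6": 6}    # global prefix-SID indices
adj_sid = {("R3", "R5"): 24005}        # locally significant Adj-SID label

def node_label(router):
    # A node SID is an index into the SRGB: label = SRGB base + index
    return SRGB_BASE + node_sid_index[router]

# Top of stack first: shortest path to R3, then the R3->R5 link, then on to R6
label_stack = [node_label("R3"), adj_sid[("R3", "R5")], node_label("R6")]
print(label_stack)                     # [16003, 24005, 16006]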

Here is the catch: Since SR is stateless, proper traffic engineering requires a controller. Without a controller, it’s impossible to use bandwidth reservations, and while you can do Constrained Shortest Path First (CSPF) with just affinity or an explicit path, that requires extra functionality on the routers, making the implementation more complex.
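
For intuition, CSPF with an affinity constraint is just shortest-path routing over the subset of links that match the constraint. Here is a minimal sketch using Python and networkx, with a hypothetical four-router topology and link colours:

import networkx as nx

# Hypothetical topology: each link carries an IGP metric and a set of colours
G = nx.Graph()
G.add_edge("R1", "R2", metric=10, colours={"blue"})
G.add_edge("R2", "R6", metric=10, colours={"blue"})
G.add_edge("R1", "R4", metric=5, colours={"red"})
G.add_edge("R4", "R6", metric=5, colours={"red"})

def cspf(graph, src, dst, include_colour):
    # Constrained SPF: prune links that lack the required affinity,
    # then run a plain shortest-path computation on what is left.
    pruned = nx.Graph()
    for u, v, data in graph.edges(data=True):
        if include_colour in data["colours"]:
            pruned.add_edge(u, v, metric=data["metric"])
    return nx.shortest_path(pruned, src, dst, weight="metric")

print(cspf(G, "R1", "R6", "blue"))     # ['R1', 'R2', 'R6'] despite the higher metric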

Lowering the barrier to entry into MPLS

Since we’re outsourcing SR-TE policy calculation to the controller anyway, this makes SR implementation on the router very simple:

  1. IGP extensions to advertise the Segment Routing Global Block (SRGB), Segment Routing Local Block (SRLB), prefix SIDs and Adj-SIDs (just a few new Type-Length-Values (TLVs)).
  2. BGP-LS to advertise link-state topology to the SR controller.
  3. BGP-SRTE or PCEP to receive SR-TE policies from the controller.

Actually, (2) and (3) are optional, and so are other features like Topology-Independent Loop-Free Alternate (TI-LFA), Flex Algo, and so on. The very minimal implementation of SR is just a couple of new IS-IS TLVs, and that’s it. You can then use a single router from another vendor to export the IGP topology to BGP-LS; for receiving policies, good old BGP Labelled Unicast (BGP-LU) serves as a nice workaround for routers that don’t support BGP-SRTE or PCEP.
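
To give a sense of how small these extensions are, here is a sketch that packs an IS-IS Prefix-SID sub-TLV (type 3, as defined in RFC 8667); the flag and index values are just for illustration:

import struct

def prefix_sid_subtlv(sid_index, algorithm=0, flags=0x40):
    # RFC 8667 Prefix-SID sub-TLV: Type=3, Length, Flags, Algorithm,
    # then a 4-octet SID index (when the V and L flags are clear).
    # flags=0x40 sets the N (node) flag; chosen here for illustration.
    value = struct.pack("!BBI", flags, algorithm, sid_index)
    return struct.pack("!BB", 3, len(value)) + value

print(prefix_sid_subtlv(6).hex())      # 0306400000000006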

Unlike RSVP with its many extensions that took many years for vendors to implement and refine, basic SR routers can be built by small companies or open source projects. Network operators therefore have a much wider choice of available hardware or can build something on their own using open source software like Free Range Routing (FRR) and Vector Packet Processing (VPP).

Easier interoperability

With SR there is a much lower risk of interoperability issues compared to the traditional MPLS stack. Since there is no RSVP signalling, and no LDP-IGP sync, interoperability problems can happen pretty much only on the IGP level (for example, different timers or TLV format), but those happen in LDP / RSVP setups as well.

Perhaps the only annoying thing is the SRGB — every vendor decided to use a different default range, but at least you can change it when required.
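
The consequence of mismatched SRGBs is easy to see: the same SID index maps to different labels depending on the advertising router’s SRGB base. The bases below are made up for illustration:

# Same SID index, different label, depending on the advertising router's SRGB.
# The bases are illustrative; check your vendors' actual defaults.
sid_index = 100
for srgb_base in (16000, 80000):
    print(f"SRGB base {srgb_base}: SID index {sid_index} -> label {srgb_base + sid_index}")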

SR also makes a hybrid approach viable — hardware from big vendors for the network core, with something cheaper for aggregation and access.

What went wrong with SR

While SR has its roots in Cisco, the technology is an open standard so everyone can implement it. As I explained above, implementing basic SR is actually very easy and there are several implementations supporting it to some extent.

Selling a simple network design that everyone can replicate was unacceptable for big vendors. So they came up with ‘best practice’ designs using traffic engineering with Path Computation Element Protocol (PCEP) (with its many extensions) and over-engineered controllers that lock you in with their SR implementation.

An SR-TE controller is just a router with some extra functionality like processing BGP-LS and calculating policies with CSPF. It should be even easier than your average MPLS-TE implementation since there is no need for LSP signalling.

Yet what you see in actual controller implementations is bloatware that needs a supercomputer to run and does all kinds of things like network monitoring, automation, NetFlow collection and OSS/BSS functionality. That’s all great, but who asked for any of it on a routing platform?

Self-defeating paradigm

Of course, the real business reasons for this over-engineering are:

  • Vendor lock-in (as explained above). Since implementing SR is much easier than implementing RSVP with its many extensions, vendors moved proprietary magic to the controller.
  • Selling services. If the controller is so difficult to deploy and operate, network operators will have to buy expertise from vendors.

The second point is ironic. If the product is purposely made too complicated to deploy and operate for network operators, it will also be too complicated for engineers working for a vendor.

Recently I was reading ‘Introduction to SRv6’ from Juniper, and the chapter about traffic engineering has a beautiful sentence:

As we don’t have a controller in our setup, we will not demonstrate the PCEP provisioning from the controller.

Introduction to SRv6

Juniper sells its own SR-TE controller! Yet it is so complex that even Juniper engineers writing a book about SR-TE couldn’t set it up in their lab. This is not a complaint about the book (which is actually good and I recommend reading it), just an illustration of the point I made above. The same applies to other SR controller vendors, not just Juniper.

Dumbest network design ideas

Quote from the book ‘Segment Routing, Part 1’:

…the always-on RSVP-TE full-mesh model is way too complex because it creates continuous pain for no gain as, in most of the cases, the network is just fine routing along the IGP shortest-path. A tactical TE approach is more appealing. Remember the analogy of the raincoat. One should only need to wear the raincoat when it actually rains.

Segment Routing, Part 1

The engineers who invented SR based their research on decades of industry experience, and the feedback they collected from many MPLS network operators, so I think we should listen to their advice on network design.

What they recommend is to use IGP shortest-path routing whenever possible and then deploy some SR-TE policies for traffic that needs to be forwarded via a different path. This might not always be possible, but in either case, regular IGP routing (with SR extensions) should be the baseline the network can always fall back to should the SR-TE controller fail.

Despite this common sense advice, some people have produced designs that make even basic end-to-end connectivity depend on the SR-TE controller, since this somehow makes the network ‘software-defined’ and ‘programmable’ [3]. I haven’t been able to understand how it’s more ‘programmable’ than the normal design where the controller is deployed on top of basic IGP routing. The only real difference is that now controller failure leads to a catastrophic network outage, which should not be the case in a good SR design.

The real-world consequence of those designs is making company execs scared of any network automation and Software Defined Networking (SDN) as in their minds it now equates to fragility.

Building a user-friendly SR-TE controller

Considering all of the above, a good SR-TE controller should be:

  • A pure routing platform. Collect routing information and calculate policies — that’s it. Provide an API for automation and a CLI for troubleshooting, but don’t attempt to combine all network management platforms in one.
  • Easy to deploy, configure and operate. Industry-standard CLI has been working great for routers and switches; there is no reason it shouldn’t work for an SR controller. Some people really love to hate CLI, but in practice a networking product without a good CLI is unusable.
Figure 4 — The SR-TE controller belongs strictly to the control plane. Turning the controller into a management platform was a mistake.
  • Supporting basic routers: It’s great to support extra features, but the controller should also work with a minimal SR implementation that doesn’t support PCEP, On-Demand Nexthop (ODN) and other complex stuff. Just SR extensions for the IGP and BGP-LU to install policies – pretty much everyone supports that.
  • Lightweight in basic setup: It’s very handy to just deploy the controller as a Docker container running on a router, so there is no need to maintain an extra server in a remote data centre, set up redundant connections and so forth.
  • Natively supporting SR designs: Using ECMP and anycast SIDs in policies, Egress Peer Engineering (EPE), null endpoints, and so on. It’s not enough to take CSPF algorithms from traditional MPLS-TE and replace RSVP with SR — a proper SR controller must natively use SR capabilities.

To illustrate the last point, consider a typical leaf-spine topology.

Figure 5 — An example of typical leaf-spine topology.

Normally, traffic from L1 to L6 will ECMP via all spines. As I pointed out earlier in this article, SR already gives multiple advantages over LDP in this topology, especially as we try to scale it (1, 2).

Now what if, for traffic engineering reasons, we want traffic from L1 to L6 to go strictly via S1 and S2? There are two ways this can be done:

  • Use link affinities (also known as admin groups or colours).
  • Use an explicit path.

If we configure a loopback with anycast IP on S1 and S2 and use that IP as an explicit path loose hop, the controller should resolve it via both routers and use ECMP. This was not possible with RSVP.

Now, what should the segment list in this policy be? Using two segment lists, <S1, L6> and <S2, L6>, will consume more Forwarding Equivalence Class (FEC) entries in hardware. If S1 and S2 share an anycast SID, the controller should figure that out and use the anycast SID in the segment list [4].
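
Here is a sketch of that optimization, assuming the controller knows which routers share an anycast SID (names and SIDs are hypothetical):

# Hypothetical anycast group: S1 and S2 share one anycast SID.
anycast_groups = {frozenset({"S1", "S2"}): "ANYCAST-S12"}

def compress(segment_lists):
    # If the candidate lists differ only in their first segment, and those
    # first segments share an anycast SID, emit a single list instead.
    heads = frozenset(sl[0] for sl in segment_lists)
    tails = {tuple(sl[1:]) for sl in segment_lists}
    if len(tails) == 1 and heads in anycast_groups:
        return [[anycast_groups[heads], *tails.pop()]]
    return segment_lists

print(compress([["S1", "L6"], ["S2", "L6"]]))   # [['ANYCAST-S12', 'L6']]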

Egress Peer Engineering

EPE is a way for an ingress router to forward traffic to a specific egress peer of a specific egress router. An egress router should allocate an MPLS label (BGP Peer SID or BGP-LU) per egress peer and advertise it to the controller, so the controller can program a policy instructing the ingress router how to forward traffic.

Figure 6 — A basic example of Egress Peer Engineering topology.

It’s easy to integrate SR with EPE as we can just add an EPE label to the policy.
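
In label-stack terms, it really is that simple; here is a sketch with hypothetical label values:

# Hypothetical labels: node SID of the egress router plus its BGP Peer SID.
SRGB_BASE = 16000
egress_node_sid_index = 4      # the egress router's node SID index
peer_sid_label = 24008         # label the egress router allocated for one peer

# The ingress router pushes the path to the egress router, then the EPE label
# that tells the egress router which external peer gets the packet.
label_stack = [SRGB_BASE + egress_node_sid_index, peer_sid_label]
print(label_stack)             # [16004, 24008]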

Bandwidth and affinity constraints for EPE

There is no RFC to advertise TE extensions with a BGP Peer SID, similar to RFC 3630 / RFC 5305 / RFC 5329 for IGPs. Instead, we can configure the constraints on the controller, which will correlate them with the BGP Peer SIDs it receives from the egress routers.

Null endpoint

It is possible to configure SR-TE policies with Null endpoint (0.0.0.0 or ::). This is perfect when we want to send traffic to the closest egress peer matching the constraint. In network design, this is also known as hot potato routing.
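
From the controller’s point of view, resolving a null endpoint amounts to picking the candidate egress with the lowest IGP cost from the head-end; a sketch with invented costs and Peer SIDs:

# Invented IGP costs from the head-end to each candidate egress router,
# plus the peers (with their EPE labels) that satisfy the policy constraints.
igp_cost = {"PE1": 30, "PE2": 10}
candidate_peers = [("PE1", 24001), ("PE2", 24002)]   # (egress router, Peer SID)

# Null endpoint: no destination router is given, so pick the closest egress
# that has a matching peer; this is hot potato routing.
egress, peer_sid = min(candidate_peers, key=lambda p: igp_cost[p[0]])
print(egress, peer_sid)                              # PE2 24002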

Variable endpoint

An SR-TE policy can also change from a regular (node) endpoint to an egress peer endpoint. Consider the topology in Figure 7.

Figure 7 — An example of a variable endpoint across two sites.

In Figure 7, site 1 and site 2 both advertise their prefixes to the Internet, but the preferred method of communication is over the dark fibre links. If one of the links fails, and the remaining link doesn’t have enough bandwidth to accommodate all traffic, the SR-TE policy can be rerouted to the null endpoint — that is, to the closest egress peer. In other words, cold potato routing changes to hot potato routing.
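
The controller-side logic is a straightforward fallback; here is a sketch with invented bandwidth numbers:

# Invented numbers: demand on the policy vs. bandwidth left on the fibre path.
policy_demand_mbps = 40_000
fibre_path_available_mbps = 25_000   # one dark fibre link has failed

if policy_demand_mbps <= fibre_path_available_mbps:
    endpoint = "site2-PE"   # cold potato: keep the traffic on the dark fibre
else:
    endpoint = "0.0.0.0"    # null endpoint: hand off to the closest egress peer
print(endpoint)             # 0.0.0.0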

Poor man’s Automated Steering

Automated Steering (AS) is a powerful way to map service routes (IP or VPN prefixes) to SR-TE policies by aligning the route’s colour in the extended community (extcommunity) with the SR-TE policy’s colour.
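
Conceptually, the steering decision is a lookup on (colour, next hop); here is a minimal sketch with hypothetical colours and endpoints:

# SR-TE policies installed on the ingress router, keyed by (colour, endpoint).
policies = {(100, "10.0.0.6"): "POLICY-BLUE", (200, "10.0.0.6"): "POLICY-RED"}

def steer(route_nexthop, route_colour):
    # Automated Steering: a service route whose colour extcommunity and BGP
    # next hop match a policy's (colour, endpoint) is steered into that policy.
    return policies.get((route_colour, route_nexthop), "regular IGP/SR path")

print(steer("10.0.0.6", 100))   # POLICY-BLUE
print(steer("10.0.0.6", 300))   # regular IGP/SR path (no matching policy)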

What if the router doesn’t support BGP SR-TE or PCEP? Earlier, I made the point that a good controller should work even with the most basic SR implementation. Sure, we can use BGP-LU to advertise policies from the controller, but then there is no way to map different services to different policies… Actually, you could:

  1. Configure a separate loopback that is NOT advertised in IGP.
  2. Advertise this loopback in BGP-LU with low LOCAL_PREF.
  3. Change the next hop of the service routes to this loopback.
  4. When the SR-TE controller uses BGP-LU to send SR-TE policies, it advertises this ‘service-loopback’ rather than the actual policy endpoint.

This works almost as well as Automated Steering! It just needs a bit more configuration, but that is the price to pay for using cheap routers.
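
Modelled as data, the trick looks like this; addresses, labels and LOCAL_PREF values are invented. The service route resolves via the service loopback, and the controller’s BGP-LU route for that loopback carries the policy’s label stack:

# The service loopback is NOT in the IGP, so service routes can only
# resolve via BGP-LU. All values below are invented for illustration.
service_loopback = "192.0.2.100"

# Step 3: service routes point at the service loopback, not the real PE.
vpn_route = {"prefix": "10.1.0.0/16", "nexthop": service_loopback}

# Step 2 vs step 4: the egress router's own BGP-LU route has low LOCAL_PREF;
# the controller's route, carrying the policy label stack, has a higher one.
bgp_lu_rib = [
    {"prefix": service_loopback, "local_pref": 50,  "labels": [16006]},
    {"prefix": service_loopback, "local_pref": 200, "labels": [16003, 24005, 16006]},
]

best = max(bgp_lu_rib, key=lambda r: r["local_pref"])
print(vpn_route["prefix"], "->", best["labels"])   # traffic follows the policy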

Introducing Traffic Dictator

I developed Traffic Dictator as a minimalistic, user-friendly SR-TE controller. It is a routing platform, with configuration resembling a router’s, so any network engineer familiar with SR and BGP can intuitively figure out how to use it.

root@TD1:/# tdcli
### Welcome to the Traffic Dictator CLI! ###
TD1#conf
TD1(config)#traffic-eng policies
TD1(config-traffic-eng-policies)#policy ?
<POLICY_NAME> Traffic-eng policy name

It can run in a Docker container (even on a router that can run containers). It is SR-native, so it supports ECMP, anycast SIDs, mixed IPv4/IPv6 SIDs, policies through IS-IS / OSPF / BGP domains, EPE, null endpoints and so on. Although BGP SR-TE with Automated Steering is preferred, the controller can also work with very basic SR implementations, using BGP-LU with a service loopback to install policies.

You can download Traffic Dictator from the Vegvisir website and follow the documentation to install it. Also, check out the white paper. Alternatively, try Traffic Dictator in a pre-configured Containerlab setup with Cisco XRd or Arista cEOS.

I will be posting more technical articles about SR, Egress Peer Engineering, network design and automation. Whether you agree or disagree with my ideas, have any suggestions or want to defend the big vendor approach, please leave a comment or write me an email.

Notes

  1. Some networks may use VXLAN for L3 connectivity simply because they found cheaper VXLAN-capable switches at the time.
  2. At least in theory. Practical implementations may still just swap the label with the same label.
  3. The inspiration for those designs is RFC 8604, which is just a hypothetical concept illustrating that SR allows building a network with more routers than the 20-bit MPLS label space can address. This never made any sense in actual ISP network design with hundreds or thousands of routers.
  4. Yes, I know, in this topology the spine SID wouldn’t actually be pushed on the wire; it would just be used to resolve the next hop. In more complex topologies, anycast SIDs become more relevant.