Part one of this series discussed the fundamentals of prefix and length, and explored how they help routing do its job because everyone maintains lists. As was said in that post:
“The magic technology the Internet uses for routing, is lists of numbers. The routing problem is simply a list problem.”
From: Routing concepts you may have forgotten, part 1: Prefixes
These lists say how to find everything, and they’re called ‘tables’ or ‘information bases’. The thing is, everyone makes their own list. They’re all trying to describe the same global Internet, but they’re all looking at the Internet from their own little corner of it. Although they’re describing the same thing, they’re still different perspectives of it, and this gives rise to some interesting possibilities.
This brings us to the topic for today: Border Gateway Protocol (BGP). This post will explore how BGP harnesses these different lists of Internet addresses.
So how are these lists made?
The lists are made a couple of ways. They can be learned locally, such as how to find each machine on the local network. These things were learned, sure, but they weren’t learned by speaking BGP.
Your home router did learn some things locally, such as the list of how to find each machine on the local network. It learns this from each machine’s manufacturer-assigned Media Access Control (MAC) address, which is usually a unique 48-bit address one layer below the network layer (below the Internet Protocol). That layer might be your house Wi-Fi, Bluetooth (if you do Internet over Bluetooth), or a cable to your home router if, like me, you prefer your Internet to be delivered over wires.
There are a lot of ‘war stories’ between network engineers about passing on things you learned inside your own network to BGP. Want automatic, smart, learning networks? Be careful what you wish for!
BGP is how routing lists are learned
The basis of BGP is how you can start with zero knowledge except what you, as an Autonomous System (AS), know locally (every BGP speaker is an AS).
Read more: What, exactly, are ASNs?
You announce to your fellow BGP speakers what you know. You shout to the global Internet, “I can originate this prefix!” and you identify yourself by your Autonomous System Number (ASN).
You also listen to what everyone else is saying. What do they tell you? They tell you what they originate, and they tell you what they have heard. There’s a cacophony of BGP speakers, all chattering like gossipy neighbours, all the time.
Their chatter includes them announcing their own ASN, as well as the hops through different ASNs to get to you. What I hear from my neighbour isn’t just my neighbour’s ASN. It’s my neighbour telling me a list of ASNs where the message came from, and adding their own ASN at the end. I hear each hop along the chain, and that’s just a message from that neighbour.
I’m also probably listening to the neighbour across the street, who is gossiping away about a chain of ASes that they heard from.
So what do I do?
I take in all that information, and I start my own ‘gossip’ about what I’m hearing. I start gossiping about my own chain of ASNs. That said, if you don’t want people to find your prefixes, you don’t have to shout them out. You can speak BGP while also staying silent about what you have.
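The gossip described above can be sketched in a few lines of Python. This is only an illustration, not how a real BGP implementation stores things: the function names, the dictionary layout, and the ASNs and prefixes (taken from the ranges reserved for documentation) are all assumptions made for the example. The sketch writes the newest hop first and the origin last.

```python
# Hypothetical sketch of AS-path gossip. ASNs and prefixes are documentation values.
MY_ASN = 64500  # assumption: our own AS number


def originate(prefix, my_asn):
    """Announce a prefix we hold ourselves: the chain contains only our ASN."""
    return {"prefix": prefix, "as_path": [my_asn]}


def pass_on(heard, my_asn):
    """Re-announce what a neighbour told us, adding our own ASN to the chain."""
    # Newest hop first, origin last (a representation chosen for this sketch).
    return {"prefix": heard["prefix"], "as_path": [my_asn] + heard["as_path"]}


# Something a neighbour (AS 64501) told us, which they heard from AS 64502.
heard = {"prefix": "192.0.2.0/24", "as_path": [64501, 64502]}

ours = originate("198.51.100.0/24", MY_ASN)
relayed = pass_on(heard, MY_ASN)

print(ours["as_path"])     # [64500]
print(relayed["as_path"])  # [64500, 64501, 64502]
```

Staying silent about a prefix simply means never calling `originate` for it.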
What goes in the gossip?
But why bother participating in this gossip session? What’s actually in these chains of ASes that we’re talking about?
This is all about finding the optimal route to a destination. As you listen to these different chains of ASes, you’re trying to figure out the shortest way to get from point A to B. By seeing some common patterns in the ASNs, you can possibly shave some length off certain routes. It may be possible that to go from A to E, you don’t need B and D, because one piece of gossip you heard managed to do it in just A, C, and E.
You may also hear loops that you can cross off your list. If some other neighbour mentioned they heard the route A, B, C, B, A, B, D, then you would probably just focus on the A, B, D and cut out the rest. When you do your own gossiping, you’re not going to include that terrible route.
In fact, the path really exists to help detect if you have a loop. If somebody is telling you things you heard before about this prefix, the path is now not growing usefully, because you see a loop in it. Loops are bad. They don’t get you where you want to go.
As you learn things about prefixes you are learning them in terms of which Origin-AS they are announced from, and which path of ASNs you heard that origin from. They’re the two fundamental qualities about a BGP path — they have an origin, and a path of ASes that you see to get to that origin.
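As a rough sketch of those two qualities, the helpers below pull the origin out of a path and check it for a loop. The helper names and ASNs are made up for illustration, and the path is assumed to be a list with the origin as its last element. The loop check shown is the common one: if your own ASN is already in the path, the announcement has come back around to you.

```python
# Hypothetical helpers. The path is a list of ASNs with the origin last.
MY_ASN = 64500  # assumption: our own AS number


def origin_as(as_path):
    """The origin is the AS that first announced the prefix."""
    return as_path[-1]


def has_loop(as_path, my_asn):
    """A path that already contains our ASN has looped back to us."""
    return my_asn in as_path


def accept(as_path, my_asn):
    """Keep a path only if it does not loop through us."""
    return not has_loop(as_path, my_asn)


clean = [64501, 64502, 64510]
looped = [64501, 64500, 64502, 64510]

print(origin_as(clean))       # 64510
print(accept(clean, MY_ASN))  # True
print(accept(looped, MY_ASN)) # False
```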
Are the shortest paths always the ‘best’ paths?
BGP’s job is to ‘optimize’ the routes.
If the path is taking more AS hops, then without some other reason to use it, it’s not your best path. You can keep it, but decide not to use it right now.
This reflects a reality, that when most networks are built using the same technology, and noting the real world limits of the speed of light, things will take the same amount of time if they cover the same distance. Of course, in reality, one might have more data loss, more traffic, or less bandwidth, therefore things can be slower or faster for reasons that have nothing to do with BGP.
But setting this aside, imagine that two networks are identical in every other respect and can process your data without loss or delay. Each AS in the path for your prefix has to look up a table and decide how to forward the packet, so the path with fewer AS hops has less of this work to do and is going to be faster.
Well, you hope it will be faster. The real world outcome of actual paths is very different to the BGP hypothetical ‘best’ path, because nothing is perfect. Some people simply forward packets better than others; some systems are more or less loaded. That’s what non-path based routing policy is for — to work out how to optimize things to take account of preferences that don’t line up with the shortest path.
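Leaving policy aside, the shortest-path preference can be sketched like this. The neighbour labels and ASNs are invented for the example, and a real router weighs other attributes as well as path length.

```python
# Hypothetical paths to the same prefix, as heard from two neighbours.
candidates = {
    "neighbour_1": [64501, 64502, 64503, 64510],  # four AS hops
    "neighbour_2": [64504, 64510],                # two AS hops
}


def best_neighbour(paths):
    """Pick the neighbour whose AS path is shortest."""
    return min(paths, key=lambda name: len(paths[name]))


choice = best_neighbour(candidates)
print(choice)              # neighbour_2
print(candidates[choice])  # [64504, 64510]

# The longer path is not thrown away; it just is not used right now.
print(sorted(candidates))  # ['neighbour_1', 'neighbour_2']
```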
The second simplification about ‘best’ path, is that if you have two paths, and one refers to a larger block of addresses, the one that refers more narrowly to a smaller block of addresses is held to be ‘better’.
This is a longest match rule. The longest matching prefix refers to fewer things and is more specific, therefore is better than the wider, encompassing match. The problems come from accidentally hearing about a longer prefix than you really should want. You will prefer this longer match, sometimes to your detriment. That’s what BGP hijacking is often about.
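A minimal sketch of the longest match rule, using Python’s standard `ipaddress` module. The table contents and the `longest_match` helper are assumptions for illustration. Real routers use much faster data structures than a linear scan.

```python
import ipaddress

# Hypothetical table: prefix -> where to send matching traffic.
table = {
    "10.0.0.0/8": "neighbour_1",
    "10.1.0.0/16": "neighbour_2",
    "10.1.2.0/24": "neighbour_3",
}


def longest_match(table, address):
    """Return the most specific prefix containing the address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in table
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The longest prefix length covers the fewest addresses, so it wins.
    return str(max(matches, key=lambda net: net.prefixlen))


print(longest_match(table, "10.1.2.3"))   # 10.1.2.0/24
print(longest_match(table, "10.1.9.9"))   # 10.1.0.0/16
print(longest_match(table, "192.0.2.1"))  # None
```

All three prefixes contain 10.1.2.3, but only the /24 is chosen. That is exactly why an unexpected, more specific announcement can pull traffic away from where it should go.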
Why is BGP keeping those other, longer paths?
BGP wants to keep the less good paths, because the best path can disappear. By hanging on to the other paths, BGP can update how it forwards things as paths come and go.
Paths disappear when you are told they have been withdrawn. Either the origin says “I no longer want to announce this prefix”, or a link breaks or something else along the path causes somebody to announce they can no longer ‘see’ the path. This is why holding on to the other paths is good: if you lose your ‘best’ path, you should hopefully have some alternate paths, and can pick the best (shortest) one to be your new best path.
When BGP breaks, often what you see is a sequence of withdraws. Those withdraw messages are successive BGP speakers telling everyone else they can’t see the best path they had. And, as they meander through the system, what is often happening is the path is getting longer and longer, until eventually nobody can see the Origin-AS behind the prefix. It no longer has a best path and it’s gone.
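Here is a small sketch of that behaviour, assuming a simple dictionary of the paths heard for one prefix. The names, prefix, and ASNs are illustrative only. Each withdraw removes one path; the best path gets longer, and then there is none left.

```python
# Hypothetical store: prefix -> {neighbour: AS path heard from that neighbour}.
paths = {
    "192.0.2.0/24": {
        "neighbour_1": [64501, 64510],
        "neighbour_2": [64502, 64503, 64510],
    }
}


def best(prefix):
    """Return the neighbour with the shortest remaining path, or None."""
    known = paths.get(prefix, {})
    if not known:
        return None
    return min(known, key=lambda name: len(known[name]))


def withdraw(prefix, neighbour):
    """Forget the path a neighbour told us about, if we had one."""
    paths.get(prefix, {}).pop(neighbour, None)


history = [best("192.0.2.0/24")]
withdraw("192.0.2.0/24", "neighbour_1")
history.append(best("192.0.2.0/24"))
withdraw("192.0.2.0/24", "neighbour_2")
history.append(best("192.0.2.0/24"))

print(history)  # ['neighbour_1', 'neighbour_2', None]
```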
What can go wrong?
One of the nice things about BGP is how simple it is. The shortest path wins and the longest match wins. But, what about when things go wrong?
An example of this was the Pakistan YouTube incident, which resulted from a policy decision in Pakistan to restrict access to YouTube because of local regulations. The method was to declare the prefix and Origin-AS for YouTube as ‘not acceptable’ inside Pakistan. This was done in BGP by announcing a single prefix that was more specific than the one YouTube (Google) announced for the addresses that served its content.
It was the longest match, and longer prefixes are more specific and thus preferred. This made it more preferred than the real announcement YouTube (Google) was authorized to make.
The problem was that the ISP in question also announced this prefix outside Pakistan, to its upstream transit provider, which caused nearly the entire BGP-speaking world to think that all YouTube content was best served from this new origin and path.
The traffic for YouTube all went to Pakistan, crashing servers. The mitigation? YouTube was re-announced by Google, coming from two /25 (longest-match) prefixes, so that the rest of the world saw these in preference to the ‘hijacked’ /24.
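The three stages of the incident can be replayed with longest match. The prefixes and ASNs below are stand-ins chosen for the example, not the real ones involved, and `origin_for` is a hypothetical helper.

```python
import ipaddress

LEGIT_AS = 64500   # assumption: stands in for the real origin
HIJACK_AS = 64510  # assumption: stands in for the ISP that leaked the route


def origin_for(table, address):
    """Return the Origin-AS of the most specific prefix covering the address."""
    addr = ipaddress.ip_address(address)
    matches = [p for p in table if addr in ipaddress.ip_network(p)]
    best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
    return table[best]


# Stage 1: only the legitimate, wider announcement exists.
table = {"10.0.0.0/22": LEGIT_AS}
before = origin_for(table, "10.0.1.10")

# Stage 2: a more specific /24 leaks out and wins on longest match.
table["10.0.1.0/24"] = HIJACK_AS
during = origin_for(table, "10.0.1.10")

# Stage 3: the real origin announces two /25s that cover the same /24.
table["10.0.1.0/25"] = LEGIT_AS
table["10.0.1.128/25"] = LEGIT_AS
after = origin_for(table, "10.0.1.10")

print(before, during, after)  # 64500 64510 64500
```

The hijacked /24 is still in the table at stage 3. It simply loses to the /25s, which is why the mitigation worked without the bad announcement being withdrawn first.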
Gossip isn’t always true
Sometimes, BGP doesn’t manage to learn something has gone away. Maybe a withdraw message gets dropped or a BGP system decides not to pass something on. Because BGP doesn’t send periodic updates but relies on the last update or withdraw being canonical (the truth about the world), it’s possible that you get told a route, and then never hear the route has gone.
These ‘zombie routes’ can persist basically forever, until your BGP speaking router is rebooted.
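A toy model shows how a zombie route comes about. The message format and the `apply` function are invented for this sketch. Two routers are sent the same updates, but one never receives the withdraw, and nothing in the table ever expires on its own.

```python
def apply(table, message):
    """Apply one update to a routing table; the last word is taken as truth."""
    kind, prefix, as_path = message
    if kind == "announce":
        table[prefix] = as_path
    elif kind == "withdraw":
        table.pop(prefix, None)


sent = [
    ("announce", "192.0.2.0/24", [64501, 64510]),
    ("withdraw", "192.0.2.0/24", None),
]

healthy = {}
for message in sent:
    apply(healthy, message)

unlucky = {}
for message in sent[:1]:  # the withdraw was dropped on the way
    apply(unlucky, message)

print(healthy)  # {}
print(unlucky)  # {'192.0.2.0/24': [64501, 64510]}
```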
Sometimes BGP accidentally announces routes to things it shouldn’t, like the ‘default’ prefix 0/0, and the consequence of this is what is sometimes called ‘accidentally acquired default routing,’ which means that there’s a BGP route to some neighbour AS, for every address in the global Internet address space.
This helps explain why packets can arrive at your door for routes that you shouldn’t see, and why your BGP speakers seem happy to pass on packets that you think you told them not to forward.
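To see why an accidental default route matters, note that 0/0 contains every IPv4 address. A quick check with Python’s `ipaddress` module (the sample addresses are arbitrary documentation addresses):

```python
import ipaddress

default = ipaddress.ip_network("0.0.0.0/0")
samples = ["192.0.2.1", "198.51.100.77", "203.0.113.200"]

# Every address matches, so a table holding 0/0 never fails a lookup.
covered = [ipaddress.ip_address(a) in default for a in samples]

print(default.prefixlen)  # 0
print(covered)            # [True, True, True]
```

With a prefix length of zero, the default loses to any more specific match, but it catches everything that has no other route.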
Why believe what you hear in BGP?
Trust has to come from checking. And that will come next in the series.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC.