This is the third post in a series covering questions I have been asked about Border Gateway Protocol (BGP) routing security, and my attempts to answer them.
Why don’t we outsource validation?
Domain Name System Security Extensions (DNSSEC) has been deployed in a rather odd fashion.
End users, or stub resolvers, don’t generally validate DNS responses. Instead, they rely on recursive resolvers to perform DNSSEC validation. The recursive resolver passes the response back to the stub resolver in an unencrypted DNS response with a single bit set to indicate that the recursive resolver has performed DNSSEC due diligence on the answer. And the stub resolver believes it! If we believe that this rather insubstantial veneer of security is good enough for the DNS, then surely it’s good enough for BGP!
Read: DNSSEC validation revisited
We’ve taken steps in this direction with RFC 8097. We just tag all BGP updates with the extended community ‘validation state’ with a value of 0 and everything is valid! Right?
OK, that was a low shot, and yes, this particular RFC is clear in saying that this is a non-transitive attribute that should not leak outside of an Autonomous System (AS). But what about an exchange point operator? As part of their service as a trusted broker of routing, what’s the problem with an exchange not only validating the BGP routes being passed across it, but also marking these routes with a community attribute to show exchange peers that the exchange has validated them?
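For reference, the community that RFC 8097 defines is small enough to sketch in a few lines. The Python below is only an illustration of the 8-octet encoding as I read the RFC (type 0x43, sub-type 0x00, five reserved octets, then the validation state, where 0 is valid, 1 is not found and 2 is invalid). The function names are my own and are not taken from any BGP implementation.

```python
import struct

# Validation state values as listed in RFC 8097.
VALID, NOT_FOUND, INVALID = 0, 1, 2

# "!BB5xB": type octet, sub-type octet, 5 reserved (zero) octets, state octet.
_LAYOUT = "!BB5xB"


def encode_validation_state(state: int) -> bytes:
    """Build the 8-octet origin validation state extended community."""
    if state not in (VALID, NOT_FOUND, INVALID):
        raise ValueError("unknown validation state")
    # 0x43 marks a non-transitive opaque extended community, 0x00 is the sub-type.
    return struct.pack(_LAYOUT, 0x43, 0x00, state)


def decode_validation_state(community: bytes) -> int:
    """Return the validation state carried in an encoded community."""
    kind, subtype, state = struct.unpack(_LAYOUT, community)
    if (kind, subtype) != (0x43, 0x00):
        raise ValueError("not an origin validation state community")
    return state


if __name__ == "__main__":
    print(encode_validation_state(VALID).hex())
```

Nothing in those eight octets is signed or otherwise protected, which is why the value only means something when you already trust whoever set it.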
I have often heard the observations that ‘outsourced security’ is a contradiction in terms and ‘transitive trust’ is equally a misnomer! For the same reason that outsourced DNSSEC validation is a somewhat inappropriate leap of faith, my view is that outsourced routing validation is a leap of faith. But perhaps there is more to it than this and we all are sucked into outsourced validation in Resource Public Key Infrastructure (RPKI), like it or not.
In DNSSEC, a client can say in a query “Please don’t do validation on my behalf, and just tell me what you know, irrespective of whether the data passed or failed your validation”. The Checking Disabled (CD) flag in the DNS header is an interesting breakout of the inferred model of outsourced validation.
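To make that concrete, here is a minimal sketch of a DNS query with the Checking Disabled bit set, built by hand using only the Python standard library. The CD bit sits in the flags word of the DNS header (mask 0x0010). The function name, the fixed query ID and the example name are arbitrary choices for illustration, and the sketch only builds the packet; it does not send it.

```python
import struct

FLAG_RD = 0x0100  # Recursion Desired
FLAG_CD = 0x0010  # Checking Disabled


def build_query(name: str, qtype: int = 1, checking_disabled: bool = True,
                query_id: int = 0x1234) -> bytes:
    """Build a single-question DNS query (class IN) as raw bytes."""
    flags = FLAG_RD | (FLAG_CD if checking_disabled else 0)
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero octet.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question


if __name__ == "__main__":
    query = build_query("example.com")
    (flags,) = struct.unpack("!H", query[2:4])
    print(f"flags=0x{flags:04x} CD set: {bool(flags & FLAG_CD)}")
```

A stub that sends this and then validates the answers itself is no longer relying on the recursive resolver’s single-bit assurance.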
What about BGP? Can a BGP speaker say to its neighbour “Please send me what you have, Route Origin Validation (ROV) valid or not”, allowing the local BGP speaker to perform its local validation without reference to the validation outcomes of its peer? Well, no, it can’t do that. A router that performs RPKI validation these days typically drops invalid routes. This ‘invalid drop’ mechanism essentially imposes an outsourced validation model on its peers by not even telling them of routes that failed validation. If it told them of these invalid routes, then what is it supposed to do with the consequent packet flow?
Shouldn’t we view this with the same level of scepticism that we use for outsourced DNSSEC validation? What’s the difference?
The answer lies in the very nature of the routing process itself. BGP speakers make their neighbors a promise: ‘If you pass me a packet destined to this address I promise to pass it onward’. If the BGP speaker’s RPKI validation process results in a dropped route then the BGP speaker cannot convey such a promise to its BGP neighbors, as the dropped route says that it will not pass any such packets onward. In so many ways routing is a cooperative undertaking that relies on trust in what neighbors tell each other in the first place.
I would say that in the action of dropping invalid routes, each BGP speaker is not only making a local decision but is also making a decision on behalf of its BGP neighbors. The Drop Invalid local decision is, in fact, an implementation of outsourced security already.
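The drop-invalid behaviour described above can be sketched as follows. This is a simplified reading of the origin validation procedure in RFC 6811, not a copy of any router’s implementation: the ROA tuple, the function names, and the documentation prefixes and AS numbers are all assumptions made for the example.

```python
import ipaddress
from typing import List, NamedTuple, Tuple


class ROA(NamedTuple):
    prefix: str       # prefix the ROA covers
    max_length: int   # longest prefix length the ROA authorises
    asn: int          # authorised origin AS


def validate_origin(route_prefix: str, origin_asn: int, roas: List[ROA]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if route.prefixlen <= roa.max_length and roa.asn == origin_asn:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"


def routes_to_advertise(routes: List[Tuple[str, int]], roas: List[ROA]):
    """Drop-invalid policy: neighbours never hear about invalid routes."""
    return [r for r in routes if validate_origin(r[0], r[1], roas) != "invalid"]


if __name__ == "__main__":
    roas = [ROA("192.0.2.0/24", 24, 64500)]
    routes = [("192.0.2.0/24", 64500), ("192.0.2.0/25", 64500),
              ("192.0.2.0/24", 64501), ("198.51.100.0/24", 64501)]
    for prefix, asn in routes:
        print(prefix, asn, validate_origin(prefix, asn, roas))
    print(routes_to_advertise(routes, roas))
```

The list returned by routes_to_advertise is all a neighbour ever sees. It has no way to ask for the routes that were filtered out, which is the sense in which the validation decision has been made on its behalf.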
Why should I use a hosted RPKI service?
As part of a program to encourage the deployment of RPKI, several folk, including the Regional Internet Registries, offer a so-called ‘hosted service’ for RPKI. As part of this service, they operate the RPKI publication point for the client and perform all the certificate management services on the client’s behalf, offering a functional interface that hides the inner details of the RPKI certificate system.
It seemed like a good idea at the time.
But this is another case of adding an external point of vulnerability to the system. While all of our experience points to the futility of expecting comprehensive perfection, our innate optimism often triumphs over the disappointments of past experiences. Pushing this function to another party does not necessarily mean that the function will be performed to a higher level of performance, and at the same time, you lose control over the service itself. If this is critical to your online service then perhaps this is not what you want to do.
At the same time, the tools available to host your own repository are improving, and current hosting tools offer a similar level of functional interface to hosted solutions. In many ways, the two approaches are now similar in terms of operational complexity to the network operator. Does that mean we should all host our own RPKI publication points so as to assume greater control over our own security environment? Well, maybe not.
The RPKI publication model is very rudimentary, and it strikes me as having a lot in common with the web publication models before content distribution networks just took over the web world. Having distinct RPKI publication points for each network means that in a few years we would be looking at some 100,000 distinct RPKI publication points. And having achieved the goal of universal general adoption (!) then each of these 100,000 networks would also be sweeping across all of these publication points every two minutes. Right? So that’s 100,000 sweepers passing across 100,000 publication points every 120 seconds. Yes, computers are fast and networks are big, but that seems a little too demanding.
Perhaps we should all use hosted services, so these 100,000 clients need to sweep across 10 or so publication points every 120 seconds. Surely that’s an easier target? Well yes, but the implications of having one of these hosting services drop out would have a dramatic impact on routing. After all, in a universal adoption model the absence of credentials is as bad as fake credentials as everything that is good is signed and everything else is bad. That’s what universal adoption really means. So, our tolerance for operational mishaps in the routing security infrastructure now drops to zero, and we are now faced with the challenge of making a large system not just operationally robust, but operationally perfect!
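The arithmetic behind these two scenarios is worth a quick back-of-the-envelope check. The sketch below uses the figures from the text (100,000 clients, a 120-second sweep, and either 100,000 or 10 publication points) and assumes, purely for simplicity, that each client makes exactly one fetch per publication point per sweep, spread evenly over the interval.

```python
def sweep_load(clients: int, publication_points: int, interval_seconds: int = 120):
    """Return (fetches/second across the whole system, fetches/second per point).

    Assumes one fetch per client per publication point per sweep interval.
    """
    total_per_second = clients * publication_points / interval_seconds
    per_point_per_second = clients / interval_seconds
    return total_per_second, per_point_per_second


if __name__ == "__main__":
    for label, points in (("self-hosted", 100_000), ("hosted", 10)):
        total, per_point = sweep_load(100_000, points)
        print(f"{label:12s} total={total:,.0f}/s per-point={per_point:,.0f}/s")
```

Under that assumption the self-hosted case comes to roughly 83 million fetches per second across the system, against roughly 8,300 per second in the hosted case. Each individual publication point sees about 833 fetches per second either way. What changes is that each client contacts 10 servers per sweep rather than 100,000, and that the load lands on a handful of operators whose failure affects everyone.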
But let’s not forget that we are dealing with signed objects, and as long as the object is validated through your trust anchor then you know it’s authentic and cannot be repudiated. Who cares whether you get the object from an elegant (and expensive) certificate boutique in the fashionable part of town or pick it out of the gutter? Authenticity is not about where you found it, but whether you are willing to trust its contents. If it validates, it’s good!
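A full RPKI validator checks signatures and the certificate chain up to the trust anchor, which is more than a short example can carry. As a stand-in, the sketch below assumes a manifest whose own signature has already been validated, and uses the SHA-256 file hashes it lists to accept or reject an object regardless of where the bytes were fetched from. The file name, the object contents and the function name are all made up for the illustration.

```python
import hashlib
from typing import Dict


def accept_object(name: str, data: bytes, trusted_manifest: Dict[str, str]) -> bool:
    """Accept an object only if its SHA-256 hash matches the trusted manifest.

    The source of `data` plays no part in the decision.
    """
    expected = trusted_manifest.get(name)
    return expected is not None and hashlib.sha256(data).hexdigest() == expected


if __name__ == "__main__":
    original = b"example signed object"
    # Hypothetical manifest, assumed to be already validated via the trust anchor.
    manifest = {"example.roa": hashlib.sha256(original).hexdigest()}

    from_boutique = original          # fetched from the official publication point
    from_gutter = bytes(original)     # identical bytes from some arbitrary cache
    tampered = original + b"!"        # altered somewhere along the way

    for label, blob in (("boutique", from_boutique), ("gutter", from_gutter),
                        ("tampered", tampered)):
        print(label, accept_object("example.roa", blob, manifest))
```

The boutique copy and the gutter copy are treated identically, and only the altered copy is rejected, which is the property that makes third-party distribution safe to consider.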
So why don’t we get over all this hosted or distributed publication point issue and just toss it all into the Content Delivery Network (CDN) world? For the CDN world these numbers of the order of millions of objects with millions of clients are not only comfortably within current CDN capabilities, but so comfortably within their capability parameters that the incremental load is invisible to them. As has been observed in recent years in the conversation on the death of transit, who needs an Internet at all when you have CDNs?
Read: The death of transit?
Any more questions?
I hope this post, and the series as a whole, has shed some light on the design trade-offs behind the RPKI work so far, and points to some directions for further efforts that would shift the needle with some tangible improvements in the routing space.
As always, I welcome your questions in the comments below, which I may add to and address in this series in time.