Internet hall-of-famer Randy Bush has a unique writing style. Short, pithy, sometimes abrupt. But it’s unquestionably his own. Some people hate it, others are confused by it. I love it, because it’s direct and easy to work with. So when he had some recent advice on a NANOG mailing list, I was keen to dive in and explore it.
Bear in mind that Randy wrote the quotes here, but I wrote everything that follows. If you disagree with something, you’re arguing with me, not him, because this is my interpretation.
For context, this is from a discussion about deploying more recent Internet Routing Registry Daemon (IRRd) code, which can filter Routing Policy Specification Language (RPSL) IRR objects against Resource Public Key Infrastructure Route Origin Authorizations (RPKI ROAs). A ROA is a ‘stronger’ cryptographic proof of intent, so it can be used to filter and contextualize RPSL statements.
Randy said this:
“Get your ROAs correct and monitor their correctness. Use an external service which ensures your ROA data is propagating correctly globally.
Use reliable and correct relying party software; there is crap out there
Monitor your routers to ensure they are getting current relying party data
Familiarize your NOC and engineers with a toolset to provide assurance of and debug all of the above”
“Get your ROAs correct and monitor their correctness. Use an external service which ensures your ROA data is propagating correctly globally.”
This is great advice. It’s one thing to make a ROA, but quite another to keep it correct. Creating a ROA and then walking away from it isn’t helpful.
A ROA has a validity period, and its contents (prefix, maximum length and origin AS) might not reflect what you really announce, so it can be ‘adrift’ from your real BGP routing intent.
How will you know? You need to check it, and you need to check it from the point of view of somebody outside your own service, so you don’t fall into the pit of believing what your own assets tell you. Global BGP depends on how other people see you. Propagation is a significant problem in the global network: you can make assertions, but they can still sit behind a publication service nobody else can fetch from (perhaps you can reach it, but an Access Control List blocks everyone else).
Don’t just make a ROA. Make a commitment to checking it, and managing your ROA and BGP together.
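To make the idea of checking ROA and BGP together a little more concrete, here is a minimal Python sketch of the route origin validation check described in RFC 6811, applied to your own announcements. The `VRP` tuple, the function name, and the documentation prefixes and AS numbers are my own illustrative choices, not anything from Randy or from a particular validator. It assumes you have already obtained the validated payloads from somewhere outside your own network.

```python
from ipaddress import ip_network
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """One Validated ROA Payload (hypothetical container for this sketch)."""
    prefix: str      # e.g. "192.0.2.0/24"
    max_length: int  # longest prefix length the ROA permits
    asn: int         # authorized origin AS


def origin_validation_state(route_prefix: str, origin_asn: int,
                            vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid' or 'not-found' for one announcement."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        if vrp_net.version != route.version:
            continue  # never compare IPv4 with IPv6
        # A VRP covers the route if the route is equal to, or more
        # specific than, the VRP prefix.
        if not route.subnet_of(vrp_net):
            continue
        covered = True
        # A match needs the same origin AS (AS0 never matches) and a
        # route no longer than the permitted maximum length.
        if vrp.asn != 0 and vrp.asn == origin_asn \
                and route.prefixlen <= vrp.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    vrps = [VRP("192.0.2.0/24", 24, 64500)]
    for prefix, asn in [("192.0.2.0/24", 64500), ("192.0.2.0/25", 64500)]:
        print(prefix, asn, origin_validation_state(prefix, asn, vrps))
```

Running something like this on a schedule against the routes you actually announce is one way to notice when a ROA has drifted from your routing intent. An ‘invalid’ result for your own prefix is the case to alert on.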
“Use reliable and correct relying party software; there is crap out there”
This is blunt, but it is wise advice derived from the real world. RPKI is a work in progress. We have a mixture of older and newer code, written in C, C++, Perl, Python, Java, Haskell and Rust. We also have embedded code from different router vendors, which may be in their test or early release train rather than their core engineering train. Some earlier versions of code use incorrect validation algorithms (Routinator before 0.8.1 for instance), have memory leaks and runtime issues, or have even been deprecated by their authors (RIPE v2 and v3 validator).
You need to think very hard about the Service Level Agreement your BGP is delivering to your users. Deploying software you can’t maintain, or which isn’t working well, is a critically bad decision if you have high-availability goals. I’m not going to recommend specific systems; that’s best left to other people in the community at large. The best thing to do is to talk to your peers (in the human sense) in the operations communities.
Have a chat and canvass opinions. Read up on what’s worked and what hasn’t.
“Monitor your routers to ensure that they are getting current relying party data”
This feels like a corollary of checking your ROAs but is actually a distinct observation. Most BGP-speaking systems don’t directly compute and hold validation states, and aren’t ‘relying parties’ in the cryptographic sense: ROA validation is performed outside the BGP engine, and the results come into BGP via the RPKI-RTR protocol.
The emerging world of secure BGP involves a different cryptographic exercise: validating the AS-path information passed in BGP. That might not seem relevant here, but it’s similar in that ongoing monitoring is key.
It is vital to make sure you can demonstrate RPKI-RTR is working properly and is sending current, correct information to your routers. I experimented with a BGP monitoring system and found that using the externally visible ‘beacons’ was useful because they generate both BGP visible state changes, and RPKI visible state changes, which permitted me to test these questions:
- Have I seen a fresh ROA state for this?
- Have I seen a fresh RPKI-RTR state for this?
Answering these questions helps ensure the full lifecycle is working.
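The two questions above boil down to a staleness check. Here is a minimal sketch in Python, assuming your monitoring already records when it last saw a ROA state change and when it last saw an RPKI-RTR update reach the router. The function names, the dictionary keys and the 24-hour threshold in the usage example are hypothetical; pick a threshold that suits how often your beacon (or other test object) is expected to change.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def is_fresh(last_seen: Optional[datetime], max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True if the event was observed no longer than max_age ago."""
    if last_seen is None:
        return False  # never observed counts as stale
    now = now or datetime.now(timezone.utc)
    return now - last_seen <= max_age


def lifecycle_report(last_roa_change: Optional[datetime],
                     last_rtr_update: Optional[datetime],
                     max_age: timedelta,
                     now: Optional[datetime] = None) -> Dict[str, bool]:
    """Answer both freshness questions for one monitored prefix."""
    return {
        "roa_fresh": is_fresh(last_roa_change, max_age, now),
        "rtr_fresh": is_fresh(last_rtr_update, max_age, now),
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical observations: ROA change seen 1 hour ago, but the
    # router last received RPKI-RTR data 30 hours ago.
    print(lifecycle_report(now - timedelta(hours=1),
                           now - timedelta(hours=30),
                           max_age=timedelta(hours=24)))
```

A fresh ROA state combined with a stale RPKI-RTR state points at the link between the relying party software and the router, which is exactly the part that checking ROAs alone won’t show you.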
“Familiarize your NOC and engineers with a toolset to provide assurance of and debug all of the above”
It’s remarkably easy to use your awesome powers to deploy something like RPKI and start being active in ROAs. APNIC encourages that and makes it simple, and there’s plenty of code, including Docker images, out there to run things.
But the sad reality is, networks don’t run on awesome powers, they run on dedicated staff in a position to monitor and manage complex, distributed systems.
These systems need love and attention, but don’t forget that the people who do this monitoring also need love and attention: they need to know what the ROA generation method is. They need to know how to regenerate it, how to test it, how to disable and reenable it, and how to handle the unexpected. The playbooks here will be your own, in your own language and methodology, but they need to exist. If you have a good Network Operations Centre (NOC), and good engineering, they’re going to be people who push back on ‘awesome’ ideas and instead ask for a process that respects the necessary changes to the routing infrastructure.
Help them along. Be kind to your NOC. Be kind to your engineers.
But even if we’re respecting our people, we don’t necessarily have all the tools we need. There are things the wider software development and devops community could work on:
- How does a NOC ask their validator and other relying-party software to show the input ROAs that led to specific Validated ROA Payloads (VRPs)? How do they identify specific network validation inputs? Tools like JDR and RPKIViz help, but this should be integrated with the RP codebase you use in the NOC.
- NOC and Operations staff shouldn’t have to check documentation to find out how to ask the router system what it’s doing with RPKI. Vendors should be required to support RFC 6945, which defines an SNMP Management Information Base (MIB) for the RPKI-to-Router protocol, so that all the NOC tools can use the data.
- Deploy tools like NTT’s BGPalerter and feed their signals directly into your network management and alerting systems.
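As a small illustration of the first item, here is a Python sketch of the kind of helper a NOC could keep on hand: given a validator’s JSON export, list every VRP that covers a prefix, along with the trust anchor it came from. The export shape (a top-level `roas` list whose entries have `asn`, `prefix`, `maxLength` and `ta` fields) is an assumption on my part, so check what your own relying party software actually emits. The function name and sample data are hypothetical. This only gets as far as the VRP and its trust anchor, not the signed ROA object behind it, which is the gap the list above is pointing at.

```python
import json
from ipaddress import ip_network
from typing import Any, Dict, List


def covering_vrps(export_text: str, prefix: str) -> List[Dict[str, Any]]:
    """Return the exported VRP entries whose prefix covers `prefix`."""
    target = ip_network(prefix)
    matches = []
    # Assumed export shape: {"roas": [{"asn", "prefix", "maxLength", "ta"}]}
    for entry in json.loads(export_text).get("roas", []):
        net = ip_network(entry["prefix"])
        # Same address family, and the target is equal to or more
        # specific than the VRP prefix.
        if net.version == target.version and target.subnet_of(net):
            matches.append(entry)
    return matches


if __name__ == "__main__":
    # Hypothetical export using documentation prefixes and AS numbers.
    sample = json.dumps({"roas": [
        {"asn": "AS64500", "prefix": "192.0.2.0/24",
         "maxLength": 24, "ta": "example-ta"},
        {"asn": "AS64501", "prefix": "198.51.100.0/24",
         "maxLength": 24, "ta": "example-ta"},
    ]})
    for vrp in covering_vrps(sample, "192.0.2.128/25"):
        print(vrp["prefix"], vrp["maxLength"], vrp["asn"], vrp["ta"])
```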
Is it really that simple?
Randy’s pithy words are often aimed at putting a lid on the rambling conversation in the mailing list. It’s a version of the <microphone drop> from another day.
The point is that none of these suggestions should be a surprise, and modern network operators are encouraged to look at words like these, nod, and agree.
Randy didn’t actually shut down this mail thread, but he could have. There isn’t a lot of value in dressing this up in pompous language, and Randy is anything but pompous here — he’s direct and simple. These are wise words for how to make your BGP and ROA decisions work in the real world.