2020 brought many logistical opportunities and challenges to ISPs worldwide. Among the biggest challenges during the global pandemic were managing capacity as people worked from home and keeping the Internet as secure as possible.
Over the course of the year, we at Vocus were rolling out RPKI across our network. This involved testing several validators as well as ensuring we could implement this cleanly across our multi-vendor environment.
Initially we chose two validators, the RIPE NCC RPKI Validator and Routinator. However, in late 2020, the RIPE NCC announced it was stopping development and support of its validator. Our design decision was to always run two validators, both to reduce the risk of hitting a bug in a single implementation and because each validator interprets the RFCs slightly differently.
After testing a number of options and having discussions with other carriers, we decided to use rpki-client with Cloudflare’s GoRTR in front. Splitting the validation and RTR functions will enable us, in the future, to deploy GoRTR internationally while pointing back to a single validation cache. This keeps the RTR data close to the routers and reduces the risk posed by routers that only support plain-text transport.
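One practical way to see how two validators differ is to compare the validated ROA payloads (VRPs) each one exports. The following is a minimal sketch, not the tooling used at Vocus. It assumes both validators can write a JSON export with a top-level `roas` list whose entries carry `asn`, `prefix`, and `maxLength` fields, and that the ASN may appear either as a string such as `AS64496` or as an integer. The function names are hypothetical.

```python
import json


def normalize_vrps(export_text):
    """Return a set of (asn, prefix, max_length) tuples from a JSON VRP export."""
    vrps = set()
    for roa in json.loads(export_text)["roas"]:
        asn = roa["asn"]
        # Assumption: the ASN is either "AS64496"-style text or a plain integer.
        if isinstance(asn, str):
            if asn.upper().startswith("AS"):
                asn = asn[2:]
            asn = int(asn)
        vrps.add((asn, roa["prefix"], int(roa["maxLength"])))
    return vrps


def compare_validators(export_a, export_b):
    """Return the VRPs that only one of the two validators produced."""
    a = normalize_vrps(export_a)
    b = normalize_vrps(export_b)
    return {"only_a": sorted(a - b), "only_b": sorted(b - a)}
```

A non-empty difference that persists across several validation runs is worth investigating, since it may point to a bug or to a differing interpretation of an RFC.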
The first piece in our RPKI rollout at Vocus involved signing all of our routes from all of our ASNs. We chose to do this via APNIC rather than roll out our own certificate authority (CA). APNIC has a great blog post on how to do this.
Signing our routes was incredibly easy. We are currently investigating running our own CA long term; software such as Krill makes this easy. This software could be tied with internal automation to automatically create ROAs as required and reduce the number of hands touching the APNIC portal.
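As an illustration of what such automation might compute, the sketch below compares intended announcements with existing ROAs and lists the ROAs that would still need to be created. It is only a planning step under stated assumptions: the input tuples and the `missing_roas` name are hypothetical, and it does not call Krill or the APNIC portal. The proposed ROAs use a maxLength equal to the announced prefix length, which is the tightest possible choice.

```python
import ipaddress


def missing_roas(announcements, existing_roas):
    """List ROAs that would need creating so every announcement is covered.

    announcements: iterable of (prefix, origin_asn)
    existing_roas: iterable of (prefix, max_length, origin_asn)
    """
    needed = []
    for prefix, asn in announcements:
        net = ipaddress.ip_network(prefix)
        covered = any(
            roa_asn == asn
            and net.version == ipaddress.ip_network(roa_prefix).version
            and net.subnet_of(ipaddress.ip_network(roa_prefix))
            and net.prefixlen <= max_len
            for roa_prefix, max_len, roa_asn in existing_roas
        )
        if not covered:
            # Propose the tightest ROA: maxLength equals the announced length.
            needed.append(
                {"asn": asn, "prefix": str(net), "maxLength": net.prefixlen}
            )
    return needed
```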
To help monitor our routes we implemented BGPalerter to watch over our ASNs and IP address space. This tool, written by the team at NTT, is simple and effective, and we highly recommend it for monitoring your assets. In our setup it runs in a Docker container and sends alerts to Slack. You can also use alternative software such as ARTEMIS.
Our rollout at Vocus included tagging all routes from upstreams and downstreams with BGP communities that record their validation state. With these communities in place, we were able to easily monitor any invalid routes we received from our customers and work with them as required to have these resolved.
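To show how community tagging makes invalid routes easy to track, here is a small sketch that counts invalid routes per neighbouring ASN. The community values are hypothetical placeholders (the real values are not listed in this post), and the route records are an assumed structure with a `communities` list and an `as_path` list.

```python
from collections import Counter

# Hypothetical community values; substitute the ones your network uses.
STATE_BY_COMMUNITY = {
    "64496:100": "valid",
    "64496:101": "unknown",
    "64496:102": "invalid",
}


def validation_state(communities):
    """Return the validation state encoded in a route's communities."""
    for community in communities:
        if community in STATE_BY_COMMUNITY:
            return STATE_BY_COMMUNITY[community]
    return "untagged"


def invalids_per_neighbor(routes):
    """Count invalid routes per neighbouring ASN (first hop in the AS path)."""
    counts = Counter()
    for route in routes:
        if validation_state(route["communities"]) == "invalid":
            counts[route["as_path"][0]] += 1
    return counts
```

Sorting the result by count gives a natural order in which to contact customers.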
We created a simple script to report on invalids daily and graph them over time. We wanted to ensure that when we started to drop invalid routes, it would be seamless for our customers. The most common cause of an invalid route was an invalid length, where the announced prefix is more specific than the maximum length permitted by the ROA. An analysis of what could be contributing and how to avoid such invalids is available in this blog post.
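The difference between an invalid length and an invalid origin can be made concrete with a short sketch of route origin validation in the style of RFC 6811. The RFC itself defines only three states (valid, invalid, and not found); splitting invalid into `invalid-length` and `invalid-asn` is a reporting convenience assumed here, and the function name and tuple layout are hypothetical.

```python
import ipaddress


def validate_origin(prefix, origin_asn, vrps):
    """Classify a route as valid, invalid-asn, invalid-length, or not-found.

    vrps: iterable of (roa_prefix, max_length, asn) tuples.
    """
    net = ipaddress.ip_network(prefix)
    covered = False
    asn_matched = False
    for roa_prefix, max_length, asn in vrps:
        roa_net = ipaddress.ip_network(roa_prefix)
        # A VRP covers the route if the route lies inside the ROA prefix.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if asn == origin_asn:
            if net.prefixlen <= max_length:
                return "valid"
            # Right origin, but the route is more specific than allowed.
            asn_matched = True
    if not covered:
        return "not-found"
    return "invalid-length" if asn_matched else "invalid-asn"
```

For example, with a single ROA for 192.0.2.0/24 with maxLength 24 and origin AS64496, an announcement of 192.0.2.0/25 from AS64496 is classified as `invalid-length`.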
Table 1 — The Vocus Looking Glass used these communities to label routes as valid, unknown, or invalid. This provided an easy tool for our customers to check their routes.
Customer communication is the key to a successful rollout. During our lengthy testing stage, my team and I reached out to impacted customers directly to help resolve issues. Formal customer communications were also sent to all transit customers in early 2021 informing them of the change. We wanted to encourage every customer to sign their own routes.
Rolling out RPKI is not as scary as it seems, so please do not worry. With the majority of Tier 1 carriers globally (GTT, Level3, Lumen, HE, and Cogent to name a few) already filtering prefixes, the chance of blocking a valid prefix is very slim.
During our testing and rollout, we discovered several limitations from our different network vendors.
- Cisco released a feature in 6.7.1 that specifies the source address for RPKI sessions. Prior to this, the source would be the egress interface IP of the router, which makes securing RTR servers harder because it increases the number of IPs that must be permitted in the relevant firewall filters.
- Junos currently only supports RPKI sessions over plain text, with no plans to implement them over SSH. However, we were able to mitigate this by keeping the RTR servers as close as possible to our equipment, and by implementing strict ACLs on our validators as well as our routers.
Vocus is one of the few providers in Australia to offer an extensive list of BGP communities to help customers steer traffic. One of these, 4826:666, lets our transit customers blackhole traffic for prefixes as specific as a /32 (an alternative is to use the SLURM option in RPKI validators). In our implementation of ROA validation, this community is processed before any ROA validation, as these routes would generally not have a valid ROA attached. We tested this in our lab to ensure the route policies across our multiple platforms worked as expected.
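The ordering matters because a /32 blackhole route would normally exceed the maxLength of the covering ROA and be rejected as invalid. The sketch below models that evaluation order in plain Python. It is a simplified illustration rather than a router policy: the function name, the state labels, and the `from_customer` flag are assumptions, and a real policy would also check that the customer is authorised to announce the prefix.

```python
BLACKHOLE_COMMUNITY = "4826:666"  # community named in the text above


def import_policy(communities, rov_state, from_customer):
    """Return the action for a received route.

    rov_state is 'valid', 'invalid', or 'unknown' (hypothetical labels).
    Returns 'blackhole', 'reject', or 'accept'.
    """
    # Step 1: customer blackhole requests are matched before validation,
    # because such routes would generally not have a valid ROA.
    if from_customer and BLACKHOLE_COMMUNITY in communities:
        return "blackhole"
    # Step 2: every other route is subject to origin validation.
    if rov_state == "invalid":
        return "reject"
    return "accept"
```

Swapping the two steps would cause the blackhole route to be rejected before the community is ever examined.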
Monitoring the validation sessions towards the caches is important. During our implementation we noticed a number of routers losing connectivity and, unfortunately, there is no SNMP monitoring for this. Both Routinator and rpki-client have methods to monitor the number of clients connected. We monitor both platforms in Grafana and alert when clients drop off, to prevent routers from running with stale databases.
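A client-count check of this kind can be sketched as follows, assuming the cache exposes its metrics in the Prometheus text format. The metric name used in the usage example, `rtr_clients`, is a placeholder assumption, so check the metrics page of your own RTR server for the actual name. The expected client count is simply the number of routers that should be connected.

```python
def read_gauge(metrics_text, metric_name):
    """Sum all samples of one metric from Prometheus text-format output.

    Returns None if the metric does not appear at all.
    """
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, _, rest = line.partition("{")
            rest = rest.rpartition("}")[2]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name == metric_name and fields:
            total += float(fields[0])
            found = True
    return total if found else None


def should_alert(current_clients, expected_clients):
    """Alert when the metric is missing or fewer clients than expected connect."""
    return current_clients is None or current_clients < expected_clients
```

For instance, `should_alert(read_gauge(text, "rtr_clients"), 14)` would flag a cache that reports only 12 connected routers when 14 are expected.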
During our rollout of RPKI, the decision was made to drop invalid routes on our retail network, AS9443, which services Primus and Dodo. This network is downstream of AS4826. This was implemented in August 2020 and we are pleased to say there have been no customer complaints.
From 16 March 2021, AS4826 dropped all invalid routes from both upstream and downstream customers.
Vocus has now successfully eliminated route mis-origination.
Phil Mawson is a Senior IP Network Engineer at Vocus AS4826 — one of Australia’s leading service providers. He has over 15 years’ experience in the Internet Service Provider industry and is passionate about providing the greatest service to the end user.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.