Resilience in the RPKI

By Geoff Huston on 9 May 2025

In this post, I’ll examine how operators of the Resource Public Key Infrastructure (RPKI) have deployed this infrastructure to maximize its availability and performance. I’ll also explore the measures taken to harden the infrastructure against potential service disruptions. In other words, it’s an examination of the resilience of the RPKI infrastructure.

For those who came in late, the RPKI is a hierarchically structured set of X.509 public key certificates, each of which binds a public key value to an enumerated set of number resources (IPv4 address prefixes, IPv6 address prefixes and Autonomous System Numbers (ASNs)). It allows a key holder to attest to their control over a collection of number resources by signing with their private key, and allows others to validate such an attestation of control using the matching public key certificate.

RPKI is used in the area of routing security by attaching verifiable digital signatures to certain routing protocol transactions to provide assurance about their authenticity. Given its role in the global routing system, there are natural concerns relating to the resilience of the RPKI infrastructure, as there is no desire to add to the operational risk profile of the Internet’s global routing system by adding a security framework that impairs the resilience of this system.

Just to be pedantic here, the RPKI system is not a single hierarchy, but rather uses five such hierarchies, each rooted in a Trust Anchor (TA) that is published by each of the five Regional Internet Registries (RIRs).

When considering the topic of resilience of the RPKI infrastructure, it is important to appreciate several distinctions between this infrastructure and other distributed data systems used in the infrastructure of the Internet, such as the Domain Name System (DNS).

DNS operation

The DNS uses on-demand interrogation in providing its service of mapping names to associated data, such as IP addresses. Local clients are primed with the IP addresses of the nameservers that serve the root zone of the DNS. The task of resolving a DNS name starts with a top-down discovery process to find the set of authoritative nameservers that serve the zone in which the name to be resolved is defined, followed by a query to one of these nameservers to retrieve the required data attribute.

Notionally, every name resolution task commences with a query to a root zone server as the first query in the discovery process. This is a notional concept rather than a coded behaviour, because name resolvers normally cache the responses they receive for potential reuse during a specified cache lifetime.

Notwithstanding this caching behaviour, all DNS resolvers must have access to one or more root nameservers all the time. If such access is prevented (such as when a local DNS resolver is isolated), it will continue to function for the duration of the remaining cache residence lifetimes in its local cache, but once these records have expired, the resolver will be unable to resolve names.
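The effect of cache lifetimes on an isolated resolver can be illustrated with a toy model. This is a minimal sketch, not a real resolver: the `TtlCache` class, the `resolve` function, the name `www.example.com` and the TTL values are all hypothetical, and time is passed in explicitly as a number of seconds.

```python
from dataclasses import dataclass, field


@dataclass
class TtlCache:
    """Toy resolver cache: each entry is kept until its expiry time (seconds)."""
    entries: dict = field(default_factory=dict)  # name -> (value, expires_at)

    def store(self, name, value, ttl, now):
        self.entries[name] = (value, now + ttl)

    def lookup(self, name, now):
        """Return the cached value, or None if absent or past its lifetime."""
        hit = self.entries.get(name)
        if hit is None or now >= hit[1]:
            return None
        return hit[0]


def resolve(cache, name, now, upstream_reachable):
    """Answer from the cache if possible; otherwise an upstream query is needed."""
    cached = cache.lookup(name, now)
    if cached is not None:
        return cached
    if not upstream_reachable:
        return None  # isolated and the cached record has expired
    raise NotImplementedError("a real resolver would query the DNS here")


cache = TtlCache()
cache.store("www.example.com", "192.0.2.1", ttl=300, now=0)
during_outage = resolve(cache, "www.example.com", now=100, upstream_reachable=False)
after_expiry = resolve(cache, "www.example.com", now=400, upstream_reachable=False)
```

In this model the isolated resolver still answers at 100 seconds, but has nothing to offer at 400 seconds, once the 300-second lifetime has passed.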

For this reason, and for reasons of improved query performance, much effort has been placed in operating a global network of root zone nameservers, attempting to ensure that root service is available to all DNS resolver clients all the time.

See ‘The root of the DNS’ for more details.

RPKI operation

The RPKI does not operate in the on-demand query/response mode used by the DNS. Instead, the RPKI uses a mode of pre-provisioning, where local tools assemble and locally validate credentials, and then load the results into RPKI-secured Border Gateway Protocol (BGP) speakers (routers) in the form of filter lists, which are then applied to received (and sent) BGP updates.

Each network operates one (or potentially more than one) local RPKI client instance (strictly speaking, a ‘local relying party service’ instance), which maintains a local cache of all currently valid RPKI certificates and signed objects. This allows the client to assemble a list of authorized address prefixes and their associated originating ASes.

This list is converted into a router filter list, and the client service uses a dedicated protocol to pass the changes to this filter list to a collection of managed routers. The routers can then apply this filter list to all incoming BGP announcements (and potentially to all outgoing BGP advertisements as well).
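The 'changes to this filter list' can be thought of as the difference between two successive validated lists. The following is a minimal sketch of that idea only; it does not implement the router protocol itself. The `(prefix, max_length, origin_asn)` tuple layout, the function name `filter_list_changes` and the example prefixes and AS numbers are assumptions made for illustration.

```python
def filter_list_changes(previous, current):
    """Return (announce, withdraw) needed to turn `previous` into `current`.

    Each entry is a hypothetical (prefix, max_length, origin_asn) tuple.
    """
    previous, current = set(previous), set(current)
    announce = sorted(current - previous)   # entries new in this sweep
    withdraw = sorted(previous - current)   # entries that have disappeared
    return announce, withdraw


old = [("192.0.2.0/24", 24, 64500), ("198.51.100.0/24", 24, 64501)]
new = [("192.0.2.0/24", 24, 64500), ("203.0.113.0/24", 24, 64502)]
announce, withdraw = filter_list_changes(old, new)
```

Only the two entries that differ are passed on; the unchanged entry for 192.0.2.0/24 generates no traffic to the routers.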

Each RPKI client periodically sweeps across all the RPKI publication points to ensure that its local copy of all RPKI objects is complete and current. The IETF standards do not specify how often an RPKI client should comb through the collection of published RPKI objects (certificates, manifests and signed objects) to detect changes. Many network operators use a client configuration that performs such a sweep every 10 minutes. Others use shorter intervals, and some use longer intervals.

The implication of this behaviour is that all RPKI publication points should ideally be accessible to all RPKI clients at all times. However, the system is more tolerant of operational interruptions than this suggests. An RPKI signed object remains valid for the duration specified in the date fields of the public key certificate used to sign it.

Once loaded and cached by an RPKI client, the object remains valid until expiry — unless the certificate is listed on a revocation list or omitted from the relevant publication point manifest. This means that if a publication point becomes temporarily inaccessible, previously cached and still-valid objects will continue to be treated as current. This behaviour is roughly equivalent to the DNS cache lifetime directive and makes the RPKI system somewhat tolerant of short-term interruptions in connectivity.
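The three conditions described here (validity dates, revocation, and presence in the manifest) can be sketched as a single check. This is an illustration under stated assumptions, not a validator: real RPKI validation also verifies signatures, certificate chains and manifest hashes. The `CachedObject` type, its field names and the `still_usable` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CachedObject:
    name: str              # file name as listed in the manifest
    cert_serial: int       # serial number of the signing certificate
    not_before: datetime   # start of the certificate's validity period
    not_after: datetime    # end of the certificate's validity period


def still_usable(obj, now, revoked_serials, manifest_names):
    """Apply the three conditions from the text to a cached object."""
    if not (obj.not_before <= now <= obj.not_after):
        return False  # outside the certificate's validity dates
    if obj.cert_serial in revoked_serials:
        return False  # listed on the revocation list
    if obj.name not in manifest_names:
        return False  # omitted from the publication point manifest
    return True


utc = timezone.utc
roa = CachedObject("example.roa", 42,
                   datetime(2025, 1, 1, tzinfo=utc),
                   datetime(2026, 1, 1, tzinfo=utc))
now = datetime(2025, 5, 9, tzinfo=utc)
usable = still_usable(roa, now, revoked_serials=set(), manifest_names={"example.roa"})
```

If the publication point is unreachable, the revocation list and manifest consulted are simply the last copies that were cached, so the outcome of the check does not change until the validity dates pass.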

The DNS is effectively a two-state system, where the queried data either exists or does not exist. The RPKI is a tri-state system, where a route object is ‘valid’ (accepted, because a validated RPKI object matches it), ‘invalid’ (a candidate for rejection, because existing validated RPKI objects contradict it), or ‘unknown’ (no valid RPKI objects are relevant to it).

Conventional operational practice is to construct route filters that accept route advertisements for ‘valid’ and ‘unknown’ route objects, and discard advertisements for ‘invalid’ routes. If a section of the RPKI remains inaccessible to a local RPKI client for a prolonged period and the associated validity timers expire, the router’s behaviour would not materially change: previously ‘valid’ objects would transition to ‘unknown’ status but generally would not be treated as ‘invalid’.
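The three states and the conventional filter policy can be sketched in a few lines. The classification follows the route origin validation procedure of RFC 6811 in simplified form; the `(prefix, max_length, origin_asn)` tuple layout, the function names and the example prefixes and AS numbers are assumptions made for illustration.

```python
import ipaddress

ACCEPTED_STATES = {"valid", "unknown"}  # the conventional policy described above


def classify(route_prefix, origin_asn, vrps):
    """Return 'valid', 'invalid' or 'unknown' for a route against a list of
    hypothetical (prefix, max_length, origin_asn) tuples."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp_prefix, max_length, vrp_asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # Only entries whose prefix covers the route are relevant.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if vrp_asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered but never matched: contradicted. Not covered at all: unknown.
    return "invalid" if covered else "unknown"


def accept(route_prefix, origin_asn, vrps):
    """Accept 'valid' and 'unknown' routes, discard 'invalid' ones."""
    return classify(route_prefix, origin_asn, vrps) in ACCEPTED_STATES


vrps = [("192.0.2.0/24", 24, 64500)]
states = {
    "matching origin": classify("192.0.2.0/24", 64500, vrps),
    "wrong origin": classify("192.0.2.0/24", 64501, vrps),
    "too specific": classify("192.0.2.128/25", 64500, vrps),
    "no coverage": classify("198.51.100.0/24", 64500, vrps),
    "cache expired": classify("192.0.2.0/24", 64500, []),
}
```

The last case models an expired cache as an empty list: the route that was ‘valid’ becomes ‘unknown’ and is still accepted under this policy.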

Resilience in RPKI object publication

Notwithstanding the RPKI system’s tolerance for interruptions in the availability of public RPKI objects, it remains desirable to ensure that the RPKI material, including the Trust Anchor Locators (TALs) and the content published at each RPKI Certificate Authority’s (CA’s) publication point, is managed as resiliently as possible.

Trust Anchor Locators

These objects are the notional equivalent of the root zone servers of the DNS, but at that point, the similarity ends. These TAL objects are not used in the validation of RPKI objects but are pointers to where the TA objects (self-signed certificates) can be found.

Each RIR publishes its own TAL object at a URL on its website.

These URLs are all published from unicast servers and do not make use of a Content Distribution Network (CDN). Why not take it a step further and load these objects into a CDN, enabling replicated publication that is both more resilient and faster to retrieve?

The reason lies in the observation that this information is not retrieved by RPKI client systems during the validation of RPKI-signed objects. Rather, it is configuration data, loaded at client start-up. The information in these TAL objects is sufficiently static that several package distributors bundle them with the RPKI software suite, for example, Debian’s rpki-trust-anchors package. As a result, no online query is performed at runtime in any case.
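To show how little a client needs from a TAL at start-up, here is a minimal sketch of reading one, assuming the layout defined in RFC 8630: optional comment lines beginning with `#`, one or more URIs, a blank line, and then a Base64-encoded public key. The sample TAL, its `rpki.example.net` URIs and the placeholder key are hypothetical and do not correspond to any RIR.

```python
import base64


def parse_tal(text):
    """Split a TAL into its list of URIs and the decoded public key bytes."""
    lines = [ln.strip() for ln in text.splitlines()]
    body = [ln for ln in lines if not ln.startswith("#")]  # drop comments
    split = body.index("")            # the blank line separates URIs from the key
    uris = [ln for ln in body[:split] if ln]
    key_b64 = "".join(body[split + 1:])  # the key may be wrapped over lines
    return uris, base64.b64decode(key_b64)


# Placeholder bytes standing in for a real subjectPublicKeyInfo structure.
placeholder_key = base64.b64encode(b"not a real public key").decode()
sample_tal = (
    "# Hypothetical TAL, for illustration only\n"
    "https://rpki.example.net/ta/root.cer\n"
    "rsync://rpki.example.net/ta/root.cer\n"
    "\n"
    f"{placeholder_key}\n"
)
uris, key = parse_tal(sample_tal)
```

The client reads this once, fetches the TA certificate from one of the listed URIs, and checks it against the key; nothing in the TAL itself is consulted again during validation.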

It would appear that there is little to be gained in terms of operational resilience by any alteration of these arrangements.

RPKI publication points

It’s a slightly different story for the information available at the RPKI publication points provided by RPKI publishers.

However, this information (the certificates issued by the RPKI CA, all RPKI-signed objects signed by the CA, and the manifest) does not form part of any real-time dependency used by routers when processing BGP updates or switching packets. The assembly of RPKI signed objects by client software and the construction of filter lists to pass to RPKI-aware routers is a background task that is offloaded to an RPKI engine. The time taken to retrieve the objects in an RPKI publication point is therefore not necessarily critical to performance.

There are two access protocols used by RPKI client tools. The original protocol, RSYNC, is not readily amenable to publication platforms that use duplicated instances of the content and anycast transport. The alternative, namely the RPKI Repository Delta Protocol (RRDP), uses conventional HTTP URI syntax and can be supported using CDNs and replicated publication that use either anycast or DNS steering of clients to the nearest server.

RRDP is designed to scale much better than RSYNC. In particular, RRDP is designed to allow use of an HTTPS caching infrastructure to reduce load on primary repository servers and increase resilience against denial-of-service attacks on the RPKI publication service.
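RRDP, as specified in RFC 8182, has the client fetch a small notification file that names the publisher's current session and serial number and points to a full snapshot and a set of deltas. Because these are ordinary files fetched over HTTPS, they can be served from caches. The sketch below models only the client's decision about what to fetch; it does no network access or XML parsing. The `Notification` type, the `plan_fetch` function and the `rrdp.example.net` URIs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    session_id: str
    serial: int
    snapshot_uri: str
    deltas: dict  # serial number -> delta URI


def plan_fetch(local_session, local_serial, note):
    """Return the list of URIs to fetch to catch up with the publisher."""
    if local_session != note.session_id:
        return [note.snapshot_uri]   # different session: start from the snapshot
    if local_serial == note.serial:
        return []                    # already current, nothing to fetch
    needed = range(local_serial + 1, note.serial + 1)
    if local_serial > note.serial or any(s not in note.deltas for s in needed):
        return [note.snapshot_uri]   # the deltas do not bridge the gap
    return [note.deltas[s] for s in needed]


base = "https://rrdp.example.net/s1"
note = Notification(
    session_id="s1",
    serial=5,
    snapshot_uri=f"{base}/5/snapshot.xml",
    deltas={4: f"{base}/4/delta.xml", 5: f"{base}/5/delta.xml"},
)
catch_up = plan_fetch("s1", 3, note)
```

A client that is two serials behind fetches two small deltas, while a client that is current fetches only the notification file, which is what keeps the load on the primary repository low.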

RPKI publishers, clients and operational resilience

The RPKI environment differs from other security applications that validate the authenticity of information. Typically, clients perform on-demand validation of a single object, with the upper-layer transaction being blocked until the validation function returns a result. This is seen in the validation of credentials during a TLS handshake or in the validation of a signed DNS response in DNSSEC.

The goal of the routing system is to flood all routing information to all parts of the network at all times. The aim of the RPKI framework is to provide an efficient means for all clients to validate this routing information using the credentials published by all RPKI publishers.

For a publisher, resilience implies that all RPKI clients need to be able to reliably access the information published by all RPKI publishers at all times. Similarly, all RPKI publishers need to publish their information in a way that all clients can efficiently access this published information.

In other realms, such as the DNS or web content, a push has emerged in some quarters for service self-sufficiency at the level of an economy or a region. For example, all DNS nameservers for the Top-Level Domain (TLD) of ‘.xx’ should also use names drawn from ‘.xx’ and be physically located in the geography defined by the domain ‘.xx’, and preferably local copies of the DNS root zone would also be located within the same geography. The idea is that, even if all forms of external connectivity were disrupted, sufficient local services would remain available to allow local DNS operations to continue functioning, or so the reasoning goes.

Do the same considerations apply to the RPKI? Not really. In the extreme scenario of comprehensive failure of external connectivity, the inability to access various RPKI publication points would presumably cause local RPKI client caches to expire. This, in turn, would cause RPKI validation to fail. However, as noted above, such a failure does not lead the validation process to mark routes as ‘invalid’. Instead, the failure mode reverts routes to ‘unknown’ status, which implies that local RPKI-aware routers would continue to accept whatever BGP routes were being announced under such circumstances. There is no underlying objective of achieving self-sufficiency or enhanced operational resilience by attempting to localize instances of RPKI credentials and TAs.

Conclusions

While the RPKI is an instance of a distributed data framework, the considerations relating to its resilience and efficiency of operation differ from those of other such frameworks.

The best current approach to ensuring the broad and continuous availability of published RPKI credential material appears to be the use of RRDP in conjunction with CDNs. This leverages the CDN’s ability to replicate content across a large-scale distribution network, while anycast or DNS-based steering directs the client’s URL retrieval requests to the nearest instance. This strategy maximizes availability for the diverse set of RPKI client instances.

The design of the RPKI system was shaped by a firm stance taken by vendors of BGP router implementations during its inception. The system was to operate as an overlay to the existing BGP framework, and BGP functionality was not to be changed. As a result, the RPKI credential system, where publishers must distribute their material to all clients, cannot take advantage of one of the most efficient flooding protocols available — BGP. Instead, we have had to revert to a less efficient system where clients need to periodically poll publishers to detect if the publisher has changed its published material.

I suspect that the current design and operational practice of RPKI credential distribution has reached the limits of what can be achieved within these constraints. If we want to pursue goals such as greater scalability, faster responses, and greater operational resilience, then we may need to question this very basic constraint of being unable to use BGP to perform this security credential flooding function.

So far, RRDP and the selective use of CDNs have reached a generally acceptable level of operational performance and resilience. However, if we change our expectations or impose new roles on this RPKI framework, it will doubtless be necessary to re-evaluate just how far we can push the current model and when it is worth embarking on a completely different approach.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.