Disaster recovery with DNSSEC

By Stefan Ubbink on 14 Feb 2022

Category: Tech matters

In 2021, we at SIDN replaced our Hardware Security Modules (HSMs). During the changeover project, we wanted to better understand what to do in an emergency, so we created some ‘what if’ scenarios. One of those scenarios was:

What if we lost all the keys in the HSMs?

Our first thought was that we could recover from this scenario by restoring the keys from a backup stored at a different location. That works for the Key Signing Key (KSK), which has a longer lifetime and is included in every backup we make. The Zone Signing Key (ZSK) is the problem: unless a backup is made for every new ZSK, the active ZSK might be missing from the most recent backup.

Our setup

Figure 1 shows our signing setup with two data centres. Both data centres have a chain of systems leading to a signed zone that is published.

Figure 1— Signing setup with HSMs.

The most relevant parts are the signers (*-signp1) and HSMs (*-hsmp1). The signers run in an active-standby setup and the HSMs are set up as a high-availability (HA) cluster. This means that both HSMs are available to create signatures and both are updated with new keys.

Impact of lost keys

If the keys were lost, the impact would start with a loss of service, because no updates for the zone could be published until keys were available again. If no action were taken after losing the keys, domains would start to fail as their Resource Record Signatures (RRSIGs) expired. The last RRSIG to expire would be the one covering the Start of Authority (SOA) record, so a check that only queries the SOA record would be the last to fail.
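To make the expiry timeline concrete, here is a minimal Python sketch that reads the expiration field from an RRSIG record in presentation format and reports how much validity remains. The record, key tag, signature and timestamps below are made up for illustration and are not taken from the .nl zone. The sketch only handles the YYYYMMDDHHmmSS timestamp form.

```python
from datetime import datetime, timezone

def rrsig_expiration(rdata: str) -> datetime:
    """Return the expiration time from RRSIG RDATA in presentation format."""
    fields = rdata.split()
    # RDATA field order: type covered, algorithm, labels, original TTL,
    # expiration, inception, key tag, signer name, signature.
    return datetime.strptime(fields[4], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def time_left(rdata: str, now: datetime):
    """Return how long the signature remains valid, measured from 'now'."""
    return rrsig_expiration(rdata) - now

# Made-up example record (illustrative values only).
example = "SOA 13 1 3600 20220228120000 20220214110000 12345 nl. c2lnbmF0dXJl"
now = datetime(2022, 2, 14, 12, 0, 0, tzinfo=timezone.utc)

print(time_left(example, now))  # 14 days, 0:00:00
```

Running a check like this against several record types, not only the SOA, gives an earlier warning, since the SOA signature is the last one to expire.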

In our previous setup, the KSK was held in every backup made, so it could be restored, but the active ZSK might not be. If we couldn’t restore the ZSK and had to introduce a new one, validating resolvers that still had the old DNSKEY set or old RRSIGs cached would be unable to validate the newly signed data, so all DNSSEC-signed zones would be unavailable to them for the TTL of the DNSKEY and/or RRSIGs. For .nl, this would result in an outage of at least one hour. Because 60% of the resolvers used in the Netherlands validate (see Figure 2), a lot of people would notice, making for a very bad news headline.

Figure 2 — Percentage of validating resolvers in The Netherlands.

Prevention

It’s possible to argue that removing the Delegation Signer (DS) record from the parent could be a solution to this problem. It does make it possible to publish a new (unsigned) zone after the DS TTL has expired at the parent, but this is only a last resort because every zone under .nl would go insecure. As a result, mechanisms that depend on DNSSEC, such as DANE and SSHFP, would no longer work for any .nl domain.

To be able to restore every key from a backup, a backup would have to be made every time a ZSK is created. For .nl, this is every 90 days, and because we also operate other TLDs, that would mean even more backups. Because the backups are stored offline, this would take extraordinary effort.

Improvements made

In OpenDNSSEC, we enabled the RequireBackup setting to ensure keys are only used after OpenDNSSEC knows they have been included in a backup. We also adjusted the backup procedure so that OpenDNSSEC is told when a backup has been made.

Because the default AutomaticKeyGenerationPeriod is one year, OpenDNSSEC will generate all the keys needed for a year in advance, so those keys can be included in the backup. We tested a configuration in which the ZSKs would roll frequently, resulting in 4,000 keys in our HSM. Having that number of keys in the HSM had a significant impact on some of OpenDNSSEC’s tools, such as the ods-hsmutil list command, which became unusably slow.

As a result, we had to wipe that HSM partition and change the AutomaticKeyGenerationPeriod in the test configuration to a more sensible value of roughly four hours. For the production setup of .nl, we use an AutomaticKeyGenerationPeriod of one year, which is manageable because the ZSK is only changed every 90 days.
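The relationship between the generation period and the number of pre-generated keys can be sketched with some simple arithmetic. This is a rough estimate only: it assumes one key type for one zone and simply divides the generation period by the key lifetime, which is not necessarily how OpenDNSSEC computes it internally. The hourly ZSK lifetime is a hypothetical value chosen for illustration, not the one used in our test.

```python
import math
from datetime import timedelta

def keys_to_pregenerate(generation_period: timedelta, key_lifetime: timedelta) -> int:
    """Rough count of keys of one type needed to cover the generation period."""
    return math.ceil(generation_period / key_lifetime)

one_year = timedelta(days=365)

# Production-like case: ZSK lifetime of 90 days, keys generated a year ahead.
print(keys_to_pregenerate(one_year, timedelta(days=90)))             # 5

# Hypothetical fast-rolling test: hourly ZSK rollover, keys generated a year ahead.
print(keys_to_pregenerate(one_year, timedelta(hours=1)))             # 8760

# Same hypothetical rollover with a four-hour generation period.
print(keys_to_pregenerate(timedelta(hours=4), timedelta(hours=1)))   # 4
```

With a long key lifetime, a one-year generation period yields only a handful of keys per zone, while a short lifetime combined with the same period quickly fills the HSM.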

We also created scheduled tickets in our ticketing system to ensure backups of the new ZSKs are created every six months.

Figure 3 — Times related to the signing process.

As shown in Figure 3, we changed to a longer refresh period of eight days and a shorter re-sign period of five hours. During a lost-keys incident, these changes now allow at least seven days to restore the keys (although action should be taken sooner than that, because of the loss of service).
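The seven-day figure follows from the two periods. As a simplified sketch, assume the signer runs once every re-sign period and replaces any signature whose remaining validity has dropped below the refresh period. In the worst case, a signature was just above the refresh threshold at the last run, so its remaining validity is at least the refresh period minus the re-sign period. Jitter and other policy settings are ignored here.

```python
from datetime import timedelta

def minimum_remaining_validity(refresh: timedelta, resign: timedelta) -> timedelta:
    """Worst-case signature validity left at the moment the keys are lost."""
    return refresh - resign

refresh = timedelta(days=8)   # refresh period from the text
resign = timedelta(hours=5)   # re-sign period from the text

print(minimum_remaining_validity(refresh, resign))  # 7 days, 19:00:00
```

Under these assumptions the worst case is seven days and 19 hours, which is consistent with the "at least seven days" stated above.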

Final thoughts

Working through this ‘what if’ scenario, we found a suitable way to lower the impact of lost keys by adjusting our configuration and procedures. With these changes, we were able to considerably improve our DNSSEC setup, keeping .nl secure and available.

Stefan Ubbink is a DNS and System Engineer at SIDN.

NLnet Labs’ Berry van Halderen contributed to this work.
