The Domain Name System Security Extensions (DNSSEC) is a suite of specifications for securing certain kinds of information provided by the Domain Name System (DNS).
Although DNSSEC is relatively straightforward for an enterprise of any size to deploy, there are challenges, especially for enterprises with large zones and frequently changing records.
At DNS-OARC 31, my colleague Allison Mankin and I gave a presentation on our experience deploying DNSSEC at Salesforce on very large, dynamic, live production zones, in the hope that the lessons will help other organizations taking the same road.
Deploying DNSSEC at Salesforce
How to find a preferred DNS provider
At Salesforce, we started our DNSSEC deployment by analysing our zones and finding third-party providers that satisfied our DNSSEC requirements.
There are many managed DNS providers. Each offers provider-specific features, has its own policies and infrastructure, and may be limited in the extent of its DNSSEC support.
To find the right provider for us, my colleagues, Pallavi Aras and Shumon Huque, performed functional testing on several DNS providers from various perspectives, including zone signing algorithms, NSEC3 support, and zone size support. Besides the functional testing, they also simulated our company’s DNS provisioning workloads in both off-peak and peak hours to test the performance of provisioning and resolution at these DNS providers, including API use, server utilization, and update propagation delays.
From all this, we found that no single provider satisfied all our requirements. As a result, we categorized our zones into two groups based on size, update frequency and key management policy, and applied two different DNSSEC models, as described below.
The benefit of having multiple DNS providers
Shumon has previously spoken about the ideal way to deploy DNSSEC at multiple DNS providers.
In his models, multiple providers each independently sign and serve the same zone, which requires each provider to import at least the public Zone Signing Keys (ZSKs) of all other providers into their DNSKEY RRsets. Because not enough providers had implemented these models at the time of our research, we applied two other models.
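Before turning to those, the key-import requirement of the multi-signer models can be made concrete with a small sketch. This is only an illustration in Python: the provider names and key labels are hypothetical placeholders, and the function is not part of any provider's API.

```python
from typing import Dict, Set


def build_dnskey_rrsets(own_keys: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Return the DNSKEY RRset each provider would publish.

    own_keys maps a provider name to {"ksk": {...}, "zsk": {...}}.
    Each provider publishes its own keys plus the public ZSKs of every
    other provider, so a validator can check signatures no matter
    which provider answered the query.
    """
    rrsets: Dict[str, Set[str]] = {}
    for provider, keys in own_keys.items():
        rrset = set(keys["ksk"]) | set(keys["zsk"])
        for other, other_keys in own_keys.items():
            if other != provider:
                rrset |= set(other_keys["zsk"])  # import the other provider's ZSKs
        rrsets[provider] = rrset
    return rrsets


# Hypothetical providers and key labels, for illustration only.
providers = {
    "provider_a": {"ksk": {"KSK-A"}, "zsk": {"ZSK-A"}},
    "provider_b": {"ksk": {"KSK-B"}, "zsk": {"ZSK-B"}},
}
rrsets = build_dnskey_rrsets(providers)
```

In this sketch each provider ends up publishing its own KSK and ZSK together with the other provider's ZSK, which is the minimum the models require.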
The first is a traditional ‘hidden signing primary’ model that can be run on an in-house primary server (with a REST API provisioning service). The primary maintains and signs the zone data. The zone is then propagated to two providers via DNS zone transfer, and those providers serve the zone, as shown in Figure 1.
The second model is a ‘third-party signing provider’ model where we use the provider’s APIs to update zone content, and then the update is pushed out to two totally separate infrastructures of this DNS provider (Figure 2).
Overcoming the challenge of using two models
Neither model alone could fully satisfy the requirements of both groups, so we applied each model to the group it suited. As a result, all our zones are now served from either two providers or two totally separate infrastructures.
However, using these two models required us to migrate the zones from the old DNS providers to the new ones. Such a zone migration can be fragile for live, frequently updated zones. A hard cut-over is not an option because of the complexity of synchronizing the NS change with the provisioning activity. For example, a change to the REST API calls and a change to the NS records cannot take effect at exactly the same time, because of our code release cadence and the TTL of the NS records.
Therefore, to seamlessly migrate the zones without introducing any impact, we applied the following steps:
- Bootstrap: Install the zone on the new provider while our provisioning only sends the DNS updates to the current provider via REST API calls.
- DNSSEC signing: Sign the zone on the new provider. Note that the DS records are not published at this stage.
- Zone content sync 1: As the new provider does not receive any of the updates made during the zone installation process, we must first handle the inconsistent records between the current and new providers by using the records on the current provider as the source of truth.
- Active/Passive mode: Change our provisioning to send the DNS updates to both the current provider in active mode and the new provider in passive mode, meaning that any failed updates to the new provider do not impact the updates to the current provider.
- Zone content sync 2: Handle the inconsistent records between the current and new providers to fix any updates that failed while the new provider was in passive mode (the Active/Passive step above).
- Active/Active mode: Change our provisioning to send the DNS updates to both the current and new providers in active mode, meaning that if the REST API call to one fails, the update to the other is reverted. From this point on, the zones on the two providers should be fully in sync.
- NS update: Change the NS records to point to the new provider. Wait for twice the NS TTL and ensure the zone is correctly resolved via the new provider.
- Publish DS records in parent: Now publish the DS records so a DNSSEC chain of trust exists and the zone can be validated.
- Active mode on the new provider: Change our provisioning to only send DNS updates to the new provider.
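The content sync and dual-write steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not our provisioning code: the `send_current`, `send_new` and `revert_current` callables stand in for hypothetical provider REST API wrappers, and `ProviderError` is an invented exception type.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]   # (owner name, record type)
Zone = Dict[Record, str]   # record -> rdata


class ProviderError(Exception):
    """Raised by a (hypothetical) provider API wrapper when an update fails."""


def diff_zones(current: Zone, new: Zone) -> Dict[str, Zone]:
    """Zone content sync: changes that make `new` match `current`,
    treating the current provider as the source of truth."""
    upsert = {k: v for k, v in current.items() if new.get(k) != v}
    delete = {k: v for k, v in new.items() if k not in current}
    return {"upsert": upsert, "delete": delete}


def update_active_passive(update: str,
                          send_current: Callable[[str], None],
                          send_new: Callable[[str], None],
                          failed: List[str]) -> None:
    """Current provider is active, new provider is passive: a failure on
    the new provider is only recorded for the next content sync."""
    send_current(update)           # a failure here fails the whole update
    try:
        send_new(update)
    except ProviderError:
        failed.append(update)      # does not affect the current provider


def update_active_active(update: str,
                         send_current: Callable[[str], None],
                         send_new: Callable[[str], None],
                         revert_current: Callable[[str], None]) -> None:
    """Both providers are active: if the second call fails, the first is
    reverted so the two zones stay in sync."""
    send_current(update)           # if this fails, nothing was applied
    try:
        send_new(update)
    except ProviderError:
        revert_current(update)
        raise
```

The list of failed updates collected in passive mode is what the second content sync has to reconcile before switching to Active/Active mode.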
Beware of historical parent and child zones
Although the above procedure seamlessly migrates the zones, there are other details that must also be taken into account.
Since zones migrated with this process remain on the old provider until after the full migration is completed, care must be taken if both a parent and a child zone are on the same provider. For example, if ONLY the child is migrated (via an NS change), then the old provider is likely to still serve the child zone after the NS change, because many DNS implementations serve locally available zones in preference to honouring a delegation in the parent.
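The behaviour described above can be modelled with a few lines of Python. This is a simplified sketch of how an authoritative server might pick the zone to answer from, not the code of any specific DNS implementation; the zone names are examples only.

```python
from typing import Iterable, Optional


def serving_zone(qname: str, local_zones: Iterable[str]) -> Optional[str]:
    """Return the most specific locally hosted zone that contains qname,
    or None if the server hosts no enclosing zone."""
    qlabels = qname.rstrip(".").lower().split(".")
    best: Optional[str] = None
    best_len = 0
    for zone in local_zones:
        zlabels = zone.rstrip(".").lower().split(".")
        # The zone encloses qname if its labels are a suffix of qname's labels.
        if qlabels[-len(zlabels):] == zlabels and len(zlabels) > best_len:
            best, best_len = zone, len(zlabels)
    return best


# Old provider still hosts both zones: it answers from its stale child copy.
both = serving_zone("www.child.example.com.", ["example.com.", "child.example.com."])

# Only once the child zone is removed does the server fall back to the
# parent zone, where it finds the delegation to the new provider.
parent_only = serving_zone("www.child.example.com.", ["example.com."])
```

Here `both` is the child zone and `parent_only` is the parent zone, which is why a child-only NS change can leave the old provider answering with outdated data.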
To solve this problem, we decided to migrate both parent and child zones at the same time because this doesn’t introduce any downtime of provisioning, doesn’t introduce outdated records, and is easier to roll back compared to the other solutions.
This solution, however, comes at a price: a rollback of even the child zone may take up to 48 hours to take effect. For example, if the parent is a second-level domain then the TTL of the NS records there is constrained to be 48 hours.
Fortunately, based on our observations as well as the experiments done by Wes Hardaker, many resolvers use the child’s TTL. The best practice is therefore to decrease the NS records’ TTL to 30 seconds on the child authoritative servers before changing the NS records, so that a rollback can take effect more quickly.
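The timing trade-off can be worked through with a little arithmetic. The sketch below is only illustrative: the function names are made up, and the previous child TTL in the example (one hour) is an assumed value, not a figure from our zones.

```python
PARENT_NS_TTL = 48 * 3600  # 48 hours in seconds, as cited above


def rollback_window(child_ns_ttl: int, parent_ns_ttl: int = PARENT_NS_TTL) -> dict:
    """Worst-case seconds before a resolver drops its cached NS RRset,
    depending on whose TTL the resolver honours."""
    return {
        "child_centric": child_ns_ttl,
        "parent_centric": parent_ns_ttl,
    }


def earliest_ns_change(ttl_lowered_at: int, previous_child_ttl: int) -> int:
    """The lowered TTL is only fully effective once copies cached under
    the previous TTL have expired, so wait at least that long."""
    return ttl_lowered_at + previous_child_ttl


window = rollback_window(child_ns_ttl=30)
# Assumed example: TTL lowered at t=0 from a previous TTL of 3600 seconds.
change_at = earliest_ns_change(ttl_lowered_at=0, previous_child_ttl=3600)
```

For resolvers that use the child’s TTL, the worst case drops from 172,800 seconds to 30 seconds, provided the NS change is made only after the previously cached TTL has run out.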
Remember: hidden primary models need to be re-signed periodically
As mentioned earlier, we use a hidden primary model for some of the zones.
The records in a dynamically updated zone on a hidden primary need to be re-signed periodically. If the re-signing of all the records in a zone occurs at the same time, there will be a lot of XFRs to the secondaries in a short period of time, which may introduce some performance issues, particularly for very large zones.
The stack vendor for our hidden primary implemented jittering for dynamically updated zones to minimize this issue. The distributions of the RRSIG expiration times for a test zone before and after applying the jittering are shown in Figures 3 and 4.
As we can see in the figures, with jittering the expiration times are spread evenly across the re-signing interval, which helps to distribute the XFR load to the secondaries more evenly.
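The general idea of signature jitter can be sketched as follows. This is a generic illustration in Python, not the vendor’s actual algorithm (which we do not describe here); the function name and parameters are hypothetical.

```python
import random
from typing import List, Optional


def jittered_expirations(n_rrsets: int, now: int, validity: int,
                         jitter: int, seed: Optional[int] = None) -> List[int]:
    """Assign each RRSIG an expiration time between
    now + validity - jitter and now + validity (inclusive).

    Without jitter every signature expires at the same instant, so all
    records are re-signed (and transferred) together. Subtracting a
    random offset per RRset spreads the expirations, and therefore the
    re-signing work, across the jitter window.
    """
    rng = random.Random(seed)
    return [now + validity - rng.randint(0, jitter) for _ in range(n_rrsets)]


DAY = 86400
no_jitter = jittered_expirations(1000, now=0, validity=30 * DAY, jitter=0)
with_jitter = jittered_expirations(1000, now=0, validity=30 * DAY, jitter=7 * DAY, seed=1)
```

With `jitter=0` all 1,000 signatures share one expiration time, while with a seven-day jitter window they fall at many different times within the last week of the validity period.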
Test all features
Besides the above re-signing issue, we also found that NSEC3 support can be awkward to use and hard to debug. For example, we discovered a bug that resulted in incorrect negative proofs provided by a specific DNS implementation.
We also found that one public recursive resolver was unable to properly authenticate responses for records that involved certain complex configurations: CNAMEs that crossed zone boundaries, wildcard synthesis, and a combination of secure and insecure delegations in the chain.
The overall lesson we learned was to carefully test all the features and capabilities we planned to provide, and to test not only the authoritative name servers but also the public recursive resolvers.
Another set of challenges came from deploying DNSSEC on hardware load balancers, specifically F5 BIG-IP. For example, the option to specify a signature inception offset in the configuration was missing, and some HA deployments are only supported in a specific Active/Active GTM mode. More details about this issue can be found in the talk given by Neda Kianpour and Tyler Shaw at RIPE 79.
And monitor
Monitoring is an important part of operating DNSSEC.
Because DNSSEC introduces more data for XFRs, the first thing we monitor is whether zones are in sync across the multiple name servers (at the multiple providers) and whether any significant propagation lag has been introduced.
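A very simple form of such a sync check compares the SOA serial seen on each name server. The sketch below assumes the serials have already been collected (for example by querying each server for the zone’s SOA record) and that the zone is distributed by zone transfer so that serials are comparable. The server names are hypothetical, and for brevity the comparison ignores serial number wraparound.

```python
from typing import Dict, List


def lagging_servers(serials: Dict[str, int]) -> List[str]:
    """Return the name servers whose SOA serial is behind the highest
    serial observed across all servers (plain integer comparison)."""
    if not serials:
        return []
    newest = max(serials.values())
    return sorted(server for server, serial in serials.items() if serial != newest)


# Hypothetical observation: one secondary has not yet received the latest XFR.
observed = {
    "ns1.provider-a.example.": 2024010102,
    "ns2.provider-a.example.": 2024010102,
    "ns1.provider-b.example.": 2024010101,
}
behind = lagging_servers(observed)
```

Running this check periodically and recording how long a server stays in the `behind` list gives a rough measure of propagation lag.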
Another thing we monitor is the DNSSEC correctness of the zones. We use DNSViz to produce data for a dashboard and to trigger pages when any errors are found.
We use our own open-source platform Refocus to visualize the monitoring results, and we have given a talk about our monitoring at DNS-OARC 27.
Teamwork is the key to a successful deployment
Granted, DNSSEC deployment may not be as simple as pushing a single button. It needs preparation, migrations, and monitoring, in addition to the DNSSEC-specific tasks.
There are challenges and surprises in deploying DNSSEC in a large enterprise, but it can be driven to success as we’ve found at Salesforce. The key for us was having a great team. Everyone on the team is an author of this work, including Pallavi Aras, Sara Dickinson, Shumon Huque, Neda Kianpour, Allison Mankin, Tim Wicinski, Baula Xu, and Han Zhang. We would also like to thank the engineers and product managers at the providers we’ve worked with.
Han Zhang is a software engineer on the DNS team at Salesforce.