Earlier (but not ‘early’) in my career I had a hand in developing the Domain Name System Security Extensions (DNSSEC).
Ordinarily, when developing something, you start with a set of requirements or goals. But DNSSEC was a research project, so in place of requirements, developers set expectations of what needed to be done and what could be done to solve the DNS security problem. With DNSSEC, this meant making lots of assumptions about how operations staff would deploy it. These assumptions were made by developers, about operators.
As with many other extensions, DNSSEC deployment has been rather slow. Some thought this was due to a lack of education, assuming operators just needed to know more about DNSSEC and they’d fall in love with it. After rounds of education, workshops, tutorials, and presentations, deployment was still slow. Even developing documents of recommendations for deploying DNSSEC didn’t move the deployment needle.
It became clear that what was needed was a better appreciation of what operators do. Perhaps the underlying issue lies in ‘assumptions made by developers about operators’. To help figure this out, a long-term project, now running for 10 years, began to observe and note what operators of top-level domain (TLD) name registries did regarding DNSSEC. By noting the records published in a zone related to DNSSEC, a lot could be deduced about an operator’s approach.
Why TLD registries? Two basic reasons:
- There is an assumption that these would be examples of well-run DNS operations. If they weren’t, nothing below the TLD would work.
- The group is well-defined. To be a TLD one has to be listed in the DNS root zone.
There are many other well-run DNS operators but drawing an objective line around them is difficult. A crucial component of any long-term study is a stable roster of items being observed, in this case, operations.
Looking at the data, you could see the choices made by deploying operators varied in relation to the developer-recommended practices. Roughly speaking, operators followed developer recommendations when it came to choices and processes that were entirely internal but deviated when it came to time-related parameters and processes that involved interactions with other operators. This divide becomes evident when looking at DNSSEC key life cycles.
What can we learn from DNSSEC key life cycles?
One of the foundational assumptions about DNSSEC is that the cryptographic keys used to generate digital signatures will have a limited lifetime (a private key can eventually be guessed or broken). Therefore, keys will have to change, and a change of keys is visible in the zone: not only the coming and going of a key, but also how it is used.
Complicating the simple use of digital signatures in DNSSEC is the use of caching in the DNS. Caching means that requesting servers will get some information from an authoritative source and hold on to it for some period. The information may be changed at the source, but the requesters will still rely on the old information for some time. Fortunately, the DNS anticipates this fuzziness and can manage it (more or less, there are always weird implementations that break the rules).
Combining the need to change keys and handle the fuzziness, DNSSEC keys experience different states in their life cycle. There is:
- Pre-publishing, when the goal is to get the public key into all necessary requester caches.
- Active use, when the goal is to rely upon the key for DNSSEC’s security goals.
- Retirement, when the goal is to make sure that a cached copy of data can still be validated between the key’s removal from active service and the expiration of the data itself.
In reality, the life cycle is a bit more complicated than this, but these three states capture the general idea.
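The three states above can be sketched as a small classifier. This is only an illustration, not the method used for the visualizations described in this post: it assumes a hypothetical series of observations for one key, each recording whether the key was published in the zone and whether any signatures made by it were seen. The names `KeyState` and `classify` are made up for this example.

```python
from enum import Enum


class KeyState(Enum):
    ABSENT = "absent"
    PRE_PUBLISHED = "pre-published"
    ACTIVE = "active"
    RETIRED = "retired"


def classify(observations):
    """Map time-ordered (published, signing) pairs to life cycle states.

    Assumption: a published key that is not signing is pre-published if it
    has never signed before, and retired if it has.
    """
    states = []
    has_signed = False
    for published, signing in observations:
        if not published:
            state = KeyState.ABSENT
        elif signing:
            state = KeyState.ACTIVE
            has_signed = True
        elif has_signed:
            state = KeyState.RETIRED
        else:
            state = KeyState.PRE_PUBLISHED
        states.append(state)
    return states


# One hypothetical key observed at five points in time.
timeline = [(False, False), (True, False), (True, True), (True, False), (False, False)]
for state in classify(timeline):
    print(state.value)
```

A real zone produces messier series than this, which is exactly where the exceptions discussed later come from.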
Visualizing these states and presenting them on a timeline has been a personal long-term goal. It took time to develop a means of doing the visualization, and over those extra years more and more interesting cases appeared. One rule about operations remains true: Not everything goes according to plan. Exceptions prove to be the most interesting events to study and visualize in operations, but also the most complicated.
Finally, I’ve found time to develop the visualizations and plot them in charts that show how TLD operators have managed their keys since mid-2011. It’s interesting to see the differences between the regular cadence of highly automated operators and that of manually run operations, and to see when operations changed hands. We’ve also begun to see how operators re-factor their operations based on gained experience, with regular cadences ending and then shifting to another cadence.
Further examination reveals ‘scars’ of incidents when operators faced a situation and adjusted to overcome it. The life cycle visualization does not reveal the why’s and how’s of the incidents and recoveries, but the visualization does open the door to conversations and lesson sharing.
What have the visualizations uncovered?
I’ve included a sample of one visualization (Figure 2) to give some colour to this blog post. The sample shows DNSSEC keys involved in the signing of the DNS root zone from 15 March 2021 until 15 July 2021. This time span shows the lifetime of a Zone Signing Key (ZSK) of the root zone.
What’s a ZSK? There are at least two kinds of DNSSEC keys, a Key Signing Key (KSK) and a ZSK. The keys differ by what they do and how they are managed, thus the two different visualizations.
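For readers who want to tell the two kinds of keys apart in a zone, here is a minimal sketch based on the flags field of a DNSKEY record. By convention a KSK is published with flags 257 and a ZSK with flags 256; the difference is the Secure Entry Point (SEP) bit, which is only a hint and is not used in validation. The function name `key_role` is hypothetical, and the sketch assumes the operator follows the convention.

```python
ZONE_KEY = 0x0100  # set on keys that sign zone data
REVOKE = 0x0080    # set when a key is revoked
SEP = 0x0001       # Secure Entry Point hint, conventionally set on a KSK


def key_role(flags: int) -> str:
    """Guess a key's role from its DNSKEY flags, assuming conventional use."""
    if not flags & ZONE_KEY:
        return "not a zone key"
    role = "KSK" if flags & SEP else "ZSK"
    if flags & REVOKE:
        role += " (revoked)"
    return role


for flags in (257, 256, 385):
    print(flags, key_role(flags))
```

An operator is free to use a single key for both purposes, so the flags alone do not settle how a key is actually used.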
The KSK for the root zone, shown in a blue rectangle, does not change its state during the time span. It’s labelled ‘Chained Active KSK’ meaning it is trusted by validators.
Knowing a little about how a ZSK is managed in the root zone helps when understanding the sample visualization, and why this time span is featured. For the root zone, the calendar is first divided into annual quarters and then again into 10-day units. (As quarters of the year may be 90, 91, or 92 days, the final unit could be 10 – 12 days.) The life cycle of a ZSK includes being in pre-publication for the last 10-day (up to 12) unit of a quarter, active for the next quarter, and then in retirement for the first 10-day unit of the following quarter.
In the sample visualization, there is a light green rectangle labelled 14631 (the number ‘naming’ the key) from 22 March until 31 March 2021. This represents the pre-publication state. From 1 April until 30 June, a medium green rectangle labelled ‘Actively Signing Zone’ denotes the active phase. And from 1 July to 10 July a dark green rectangle, with no label, is the retirement stage. Other keys, the previous ZSK (42531) and the following ZSK (26838), show how the stages of one key overlap the other.
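The calendar arithmetic behind those dates can be sketched in a few lines. This is an illustration of the schedule as described in this post, not the root zone operators' tooling. It assumes each quarter holds nine units, so the final unit begins 80 days after the quarter starts and absorbs any leftover days. The function names are made up for this example.

```python
from datetime import date, timedelta


def quarter_start(year: int, quarter: int) -> date:
    """First day of a calendar quarter (quarter is 1 to 4)."""
    return date(year, 3 * (quarter - 1) + 1, 1)


def zsk_phases(year: int, quarter: int) -> dict:
    """Phases of the root zone ZSK that is active during the given quarter."""
    prev_year, prev_quarter = (year - 1, 4) if quarter == 1 else (year, quarter - 1)
    next_year, next_quarter = (year + 1, 1) if quarter == 4 else (year, quarter + 1)

    active_start = quarter_start(year, quarter)
    retire_start = quarter_start(next_year, next_quarter)
    # Assumption: the final unit of a quarter starts after eight 10-day units.
    prepub_start = quarter_start(prev_year, prev_quarter) + timedelta(days=80)

    one_day = timedelta(days=1)
    return {
        "pre-publication": (prepub_start, active_start - one_day),
        "active": (active_start, retire_start - one_day),
        "retirement": (retire_start, retire_start + timedelta(days=9)),
    }


# The ZSK active in the second quarter of 2021 (key 14631 in the sample).
for phase, (first, last) in zsk_phases(2021, 2).items():
    print(f"{phase}: {first} to {last}")
```

For the second quarter of 2021 this gives 22 March to 31 March, 1 April to 30 June, and 1 July to 10 July, matching the three rectangles for key 14631.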
There’s still further to go with these visualizations
Interesting states of keys are still being discovered by looking through the visualizations. Basic questions such as ‘what should a key’s lifetime be?’ remain to be answered. It’s not as simple as categorizing operators by how often they change a key, as many operators have varied their approach as time goes by. And if an operator has not changed a key in more than 10 years, it is impossible to know if they ever plan to make the change or what their approach would be once that change is necessary.
Edward Lewis is a Senior Technologist in ICANN’s Office of the CTO.