There is no consensus on how to choose DNS time-to-live (TTL) values for domain names. Yet, TTLs are incredibly important, given that they indirectly control how long resolvers cache records, directly influencing user experience.
We at SIDN Labs and USC/ISI carried out a measurement study to understand how different TTL value choices affect operational networks, with the goal of helping operators to make informed choices on TTL values for their scenarios.
We notified eight ccTLD operators that had short TTL values. As a result, Uruguay’s .uy increased the TTL of its NS records, which significantly improved latency for its users.
The following post is a summary of the findings and recommendations set out in our research paper, which we will present at the upcoming ACM IMC 2019 conference in Amsterdam.
Key Points:
- DNS TTLs indirectly control caching, thus affecting user experience.
- Eight ccTLDs were notified that the TTL values of their NS records were considered to be too low; three increased them in response.
- Uruguay’s .uy achieved a significant performance gain by changing its NS TTL from 5 minutes to 1 day: median latency fell from 28ms to 8ms, and 75th percentile latency fell from 183ms to 21ms. Real users of .uy can expect similar performance gains, as can users of the other two ccTLDs.
- The use of anycast with a short TTL on the authoritative server-side cannot match the gains of longer TTLs.
New domain, what TTL?
Say you are about to register a new domain name either for yourself or for your employer. You must choose a time-to-live (TTL) value for its associated records. For example, Figure 1 below shows the dashboard of an actual registrar we use to manage the records of the domain cachetest.net. Observe TTL values set to 1 hour (1 Uur in Dutch) and 1 day (1 Dag).
Figure 1 — Dashboard for cachetest.net and its respective DNS TTLs.
DNS TTL values may vary from 0 seconds to about 24,855 days (2^31 - 1 seconds, roughly 68 years). Given such a large range of possible values, it’s difficult to know what values to choose. Understanding exactly how TTL value choices affect operational networks is challenging, due to interactions across the distributed DNS service.
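The upper bound is easy to check with a few lines of arithmetic. The sketch below only converts the maximum TTL of 2^31 - 1 seconds into days and years; the constant names are ours, chosen for illustration.

```python
# Maximum TTL value: 2^31 - 1 seconds.
MAX_TTL_SECONDS = 2**31 - 1
SECONDS_PER_DAY = 24 * 60 * 60

# Whole days and approximate years covered by the maximum TTL.
max_ttl_days = MAX_TTL_SECONDS // SECONDS_PER_DAY
max_ttl_years = MAX_TTL_SECONDS / (365.25 * SECONDS_PER_DAY)

print(MAX_TTL_SECONDS)          # 2147483647
print(max_ttl_days)             # 24855
print(round(max_ttl_years, 1))  # 68.0
```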
What are DNS TTLs actually used for?
TTLs indirectly control resolver caching. Caching, in turn, is the cornerstone of DNS performance: a 15ms answer is fast, but a 1ms cached answer is far faster.
To resolve a domain name, a user typically connects to a DNS resolver (as shown in Figure 2 below). That resolver will then forward the user’s query to the DNS authoritative server, which, in turn, sends back a response to the query with an accompanying TTL value (1h in this figure). That means that the resolver can cache the retrieved answer for up to 1 hour.
Then, when other users ask the same query within the one hour window (as shown by blue and green arrows), instead of having to query the authoritative server again, the queries can be answered by referring directly to the DNS resolver’s cache, which is typically far faster.
In that sense, caching can be seen as a form of ephemeral replication on the resolver side (we have shown in a previous study how caching protects users when authoritative servers are under DDoS attack).
Figure 2 — Clients, resolver with caching, and authoritative servers.
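The caching behaviour described above can be sketched in a few lines of Python. This is a toy model, not how any particular resolver is implemented: the `TtlCache` class, the `resolve` function, and the `ask_authoritative` callback are all hypothetical names, and the clock is injectable so that the example does not depend on real time.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (domain name, record type)


class TtlCache:
    """Toy resolver cache: an answer is kept until its TTL expires."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Maps (name, rtype) to (answer, absolute expiry time).
        self._entries: Dict[Key, Tuple[str, float]] = {}

    def put(self, name: str, rtype: str, answer: str, ttl: float) -> None:
        self._entries[(name, rtype)] = (answer, self._clock() + ttl)

    def get(self, name: str, rtype: str) -> Optional[str]:
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None
        answer, expires_at = entry
        if self._clock() >= expires_at:
            # The TTL has run out, so the answer may no longer be used.
            del self._entries[(name, rtype)]
            return None
        return answer


def resolve(
    cache: TtlCache,
    name: str,
    rtype: str,
    ask_authoritative: Callable[[str, str], Tuple[str, float]],
) -> Tuple[str, bool]:
    """Return (answer, served_from_cache)."""
    cached = cache.get(name, rtype)
    if cached is not None:
        return cached, True
    # Cache miss: query the authoritative server and cache for `ttl` seconds.
    answer, ttl = ask_authoritative(name, rtype)
    cache.put(name, rtype, answer, ttl)
    return answer, False
```

With a 1 hour TTL, the first query for a name goes to the authoritative server, every query in the following hour is served from the cache, and the first query after the hour has passed goes back to the authoritative server.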
If there is no consensus, what values are used in the wild?
Even though there is no consensus as to what the best TTL values are, operators must set them when configuring their domain names.
To understand what values are typically used in the wild, we crawled five data sources:
- One country-code top-level domain (.nl, the Netherlands), which has 5.8 million domain names
- Three top lists (Alexa, Majestic, and Umbrella)
- The Root DNS zone
For each domain, we asked for NS, A, AAAA, MX, DNSKEY records from the child authoritative servers.
Figure 3 shows the results for NS and A records for the domains in question (the other records can be found in our research paper). We see that:
- TTLs show a large variation in values, from 1 minute to 48 hours, for all lists and record types.
- The top-level domains in the root zone (root in Figure 3) are far more conservative than the top lists (Alexa, Majestic, Umbrella): most TTLs for both NS and A are at least 24 hours long (remember, these are child delegation TTLs).
- For all lists, NS records tend to have longer TTLs than A records, but, again, there is no consensus on how long or short.
- Umbrella contains many fully qualified domain names, such as those used by CDNs, including wp-0f21050000000000.id.cdn.upcbroadband.com. That’s one of the reasons we see lower A record TTLs on Umbrella than on the other lists, which comprise second-level domains (such as example.nl and example.co.uk): CDNs are well known for using short TTL values, partly for load balancing.
Figure 3 — CDF of TTLs for NS and A records.
Uruguay’s .uy latency boost from longer TTL
While crawling the root zone (top-level domains), we found 34 top-level domains (TLDs) with TTL values for NS records under 30 minutes, which is very short compared with the other TLDs. We contacted eight country-code TLDs among these 34 and notified them of our observation. We received answers from five; three increased the TTLs of their NS records after our initial contact. These were:
- Uruguay’s .uy, whose NS TTL was changed from 300s to 86400s (1 day)
- An African ccTLD and a Middle Eastern ccTLD, which increased their NS TTLs from 480s and 30s, respectively, to 86400s as well.
By chance, we had carried out DNS measurements on the .uy ccTLD before the TTL change: 10,000 RIPE Atlas probes were used to ask for NS and A records. We repeated the measurement after the change to see how it affected users’ experience (Figure 4 shows the results).
Figure 4 shows that the median latency was reduced from 28ms to 8ms, and the 75th percentile latency was reduced from 183ms to 21ms — just by changing one parameter. In other words, a median user of .uy noticed a 20ms change in response time simply as a result of the TTL change. And a user on the 75th percentile will have experienced an improvement of more than 160ms.
Our results also show the latency gains experienced at the Atlas probe vantage points according to their geographical region: a performance gain was experienced by all regions. It is important to note that these are significant performance improvements that only required one parameter change and no change in the .uy infrastructure at all.
That is no small feat: DNS operators are constantly striving to improve latency. IP anycast is also frequently used to place more authoritative servers close to resolvers in order to improve performance. But, as we show in our paper (section 6.2), caching with longer TTLs is even faster than anycast with shorter TTLs.
Figure 4 — RTT from RIPE Atlas VPs for NS .uy queries before and after changing the TTL of the NS records. Top — VPs combined. Bottom — median and quantiles of RTT per region.
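A back-of-the-envelope calculation helps to show why a change from 300s to 86400s matters so much. The sketch below is a simplified model, not a result from our measurements: it assumes a single resolver cache that honours the TTL exactly, is queried continuously for the record, and does no prefetching or early eviction. The function name `refetches_per_day` is hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def refetches_per_day(ttl_seconds: int) -> int:
    """Approximate number of times per day one busy cache must go back to
    the authoritative server for a single record (simplified model)."""
    if ttl_seconds <= 0:
        raise ValueError("a TTL of 0 or less means the answer is not cached")
    return math.ceil(SECONDS_PER_DAY / ttl_seconds)


for ttl in (300, 3_600, 86_400):
    print(ttl, refetches_per_day(ttl))
# 300 288
# 3600 24
# 86400 1
```

Under these assumptions, each refetch is a query that pays the full round trip to an authoritative server, while all the other queries are answered from the cache.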
Reasons for choosing long or short TTLs
There are many reasons why network operators choose long or short TTLs:
- Longer caching results in faster responses: a longer TTL enables caching for longer periods, and cache hits are far faster than retrieving answers from authoritative servers, as the .uy experience illustrates. We designed several experiments to investigate this, which are described in our paper. The results show that longer caching improves results even more than having a large anycast network.
- Longer caching results in lower DNS traffic: authoritative operators may be interested in setting higher TTLs because caching reduces the number of queries they receive. That is especially important if the DNS service is metered.
- Longer caching is more robust to DDoS attacks on the authoritative DNS server: DDoS attacks on a DNS service provider have harmed several prominent websites. Recent work has shown that DNS caching can greatly reduce the effects of DDoS on the DNS, provided caches last longer than the attack.
- Shorter caching facilitates operational changes: an easy way to transition from an old server to a new one is to change the DNS records. Since there is no way of removing cached DNS records, the TTL duration represents the transition delay necessary to fully migrate to a new server. Therefore low TTLs allow for more rapid transition. However, when deployments are planned further in advance than the length of the TTL, TTLs can be lowered just before a major operational change and raised again once the change is effected.
- Shorter caching can help with a DNS-based response to DDoS attacks: some DDoS scrubbing services use the DNS to redirect traffic during an attack. Since DDoS attacks arrive unannounced, DNS-based traffic redirection requires the TTL be kept quite low at all times to be ready to respond to a potential attack.
- Shorter caching helps DNS-based load balancing: many large services use DNS-based load balancing. Each incoming DNS request provides an opportunity to adjust the load, so short TTLs may be desirable for rapid response to traffic dynamics (although many recursive resolvers have minimum caching times of tens of seconds, placing a limit on agility).
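The practice of lowering the TTL just before a planned change can be turned into a simple timeline. The sketch below assumes that resolvers honour TTLs as published; the function names and the example dates are hypothetical and serve only as an illustration.

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(change_at: datetime, current_ttl_seconds: int) -> datetime:
    """Latest moment to publish the lowered TTL so that every cache has
    dropped the long-TTL copy before the change is made."""
    return change_at - timedelta(seconds=current_ttl_seconds)


def old_answers_expire_by(change_at: datetime, lowered_ttl_seconds: int) -> datetime:
    """Moment after which no cache should still hold the old record,
    given that the lowered TTL was in effect at the time of the change."""
    return change_at + timedelta(seconds=lowered_ttl_seconds)


# Hypothetical migration: the record currently has a 1 day TTL and is
# lowered to 5 minutes ahead of the change.
change = datetime(2019, 10, 21, 12, 0)
print(latest_time_to_lower_ttl(change, 86_400))  # 2019-10-20 12:00:00
print(old_answers_expire_by(change, 300))        # 2019-10-21 12:05:00
```

Once the old answers have expired, the TTL can be raised again to regain the benefits of longer caching.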
Recommendations
While our analysis does not suggest one ideal TTL value, it does clarify the trade-offs, enabling us to make the following recommendations for different situations:
TTL duration: the choice of a TTL value depends in part on external factors, so no single recommendation is appropriate for all networks or network types.
- For general zone owners: we recommend longer TTLs, at least one hour, and ideally 4, 8, or 24 hours. Assuming planned maintenance can be scheduled in advance, long TTLs have little cost.
- For TLD and other registry operators: DNS operators that allow public registration of domains (such as most ccTLDs, .com, .net, .org and many SLDs) allow clients to duplicate the TTLs in their zone files for client NS records (and glue records if in-bailiwick). In Section 3.3 of our paper, we show that most resolvers use TTL values from the child delegation and some use the parent’s TTL. We therefore recommend longer TTLs for both parent and child NS records (at least one hour, preferably more).
- Users of DNS-based load balancing or DDoS-prevention may require short TTLs: TTLs may be as short as 5 minutes, although 15 minutes may provide sufficient agility for many operators. Shorter TTLs here help with agility; they are an exception to our first recommendation of longer TTLs.
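As a rough illustration, the recommendations above could be encoded as a small check for a zone-review script. This is a sketch only: the `review_ttl` function, its return labels, and the decision to exempt every DNS-based load balancing or DDoS-redirection setup are our assumptions for the example, and the one hour floor is the minimum from the first recommendation.

```python
ONE_HOUR = 3_600  # minimum recommended for general zone owners


def review_ttl(ttl_seconds: int, uses_dns_agility: bool = False) -> str:
    """Classify a TTL against the recommendations (illustrative labels)."""
    if ttl_seconds < 0:
        raise ValueError("TTL cannot be negative")
    if uses_dns_agility:
        # DNS-based load balancing or DDoS redirection: short TTLs
        # are a deliberate exception to the longer-TTL advice.
        return "exception"
    if ttl_seconds < ONE_HOUR:
        return "raise"
    return "ok"


print(review_ttl(300))                         # raise
print(review_ttl(86_400))                      # ok
print(review_ttl(300, uses_dns_agility=True))  # exception
```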
Contributors: John Heidemann, Ricardo Schmidt, Wesley Hardaker
Giovane Moura is a Data Scientist with SIDN Labs (research arm of SIDN, the .nl registry) and Guest Researcher at TU Delft, in the Netherlands.