Notes from ICANN’s DNS Resolver Symposium

By Geoff Huston on 23 Dec 2021

ICANN hosted a Resolver Operator Forum in mid-December and the session had several interesting presentations that I would like to comment on here.

DNS resolver evolution

The first presentation in this forum was from Paul Mockapetris. He pointed to the original published academic paper, Development of the Domain Name System, by Paul Mockapetris and Kevin Dunlap, which appeared in the proceedings of ACM SIGCOMM’88. The paper noted that by 1983 it was obvious that the shared HOSTS.TXT file was not a scalable solution, particularly as shared mainframes were giving way to personal computers at the time, and it described the initial efforts to change this name-to-address mapping function into a distributed database that, it was hoped, would allow the system to scale. The paper adopted the approach of identifying the notable successes, failures, and surprises in the process.

DNS in the late 80s

It is certainly interesting to look back more than thirty years and identify what was seen at the time as a readily mutable implementation setting becoming cast in concrete over the intervening years. For example, the paper notes: “Labels are limited to 63 octets and names are restricted to 256 octets total as an aid to implementation, but this limit could be easily changed if the need arose.” This limitation is firmly in place today.
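
The two limits in that quote are simple to state in code. The following is a minimal sketch, not taken from the paper: the function name check_name is hypothetical, names are assumed to be plain ASCII, and the total is measured in wire format against the 255-octet figure given in RFC 1035 (the paper's text says 256).

```python
MAX_LABEL_OCTETS = 63   # per-label limit quoted in the paper
MAX_NAME_OCTETS = 255   # wire-format limit in RFC 1035 (the paper's text says 256)


def check_name(name: str) -> list:
    """Return a list of limit violations for a dot-separated ASCII name."""
    problems = []
    stripped = name.rstrip(".")
    labels = stripped.split(".") if stripped else []
    for label in labels:
        size = len(label.encode("ascii"))
        if size == 0:
            problems.append("empty label")
        elif size > MAX_LABEL_OCTETS:
            problems.append(f"label of {size} octets exceeds {MAX_LABEL_OCTETS}")
    # Wire format: one length octet per label, plus the terminating root octet.
    wire_length = sum(len(label.encode("ascii")) + 1 for label in labels) + 1
    if wire_length > MAX_NAME_OCTETS:
        problems.append(f"name of {wire_length} octets exceeds {MAX_NAME_OCTETS}")
    return problems
```

A name built from four 63-octet labels passes every per-label check but still fails the total, which shows that the two limits are independent constraints.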

It has been observed for some time that the proportion of queries seen at the root servers for non-existent domain names is persistently high, at more than 70% in some cases. The paper voiced an entirely different expectation: “We expected that the negative responses would decrease, and perhaps vanish, as hosts converted their names to domain-name format”. This expectation is yet to be achieved.

On the issue of caching: “Our conclusion is that any naming system that relies on caching for performance may need caching for negative results as well.” And further notes: “This feature will probably become standard in the future.” And this subsequently happened.
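
As an illustration of what caching negative results means in practice, here is a toy sketch of a cache that remembers 'no such name' alongside ordinary answers. The class and method names are hypothetical, the TTL handling is deliberately simplistic, and it is not modelled on any particular resolver implementation.

```python
import time


class ResolverCache:
    """Toy cache that stores both answers and 'no such name' results."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # lower-cased name -> (expiry time, answer or None)

    def put(self, name, answer, ttl):
        # answer=None records a negative result (the name does not exist).
        self._entries[name.lower()] = (self._clock() + ttl, answer)

    def get(self, name):
        """Return ('hit', answer), ('negative', None) or ('miss', None)."""
        key = name.lower()
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            # Unknown or expired: the resolver would have to query again.
            self._entries.pop(key, None)
            return ("miss", None)
        if entry[1] is None:
            return ("negative", None)
        return ("hit", entry[1])
```

A 'negative' result lets the resolver answer repeated queries for a non-existent name without going back to the authoritative servers until the entry expires.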

On the choice of UDP: “The use of datagrams as the preferred method for accessing name servers was successful and probably was essential, given the unexpectedly bad performance of the DARPA Internet.” The use of UDP today is seen as a critical factor in scaling the DNS, allowing servers to handle large query loads without excessive consumption of system resources. The issue with UDP’s lack of implicit acknowledgement did create some concerns with retransmission behaviours, and the paper noted: “Much unnecessary traffic is generated by resolvers that were developed to the point of working, but whose authors lost interest before tuning.”

The issue of UDP packet fragmentation was not seen as an issue at the time: “The restriction to approximately 512 bytes of data turns out not to be a problem.” It should be noted that the predominant MTU size in the Internet in the mid-1980s was 576 octets.

One observation described as a success was that caching works extremely well: “The caching discipline of the DNS works well, and given the unexpectedly bad performance of the Internet, was essential to the success of the system.”

A failure in the original model, which Paul asserts continues through to today, is that data types are hard to create (although the recent experience with the SVCB and HTTPS data types appears to point to a different conclusion, that data types are relatively easy to create!). As Paul suggests, perhaps it was the overly complex IETF registration procedures that were pushing DNS users to overload the TXT record as an alternative to creating more data types.

Some original ‘features’ of the DNS protocol were never implemented. Queries have a 16-bit query count field, and there was provision to stack multiple queries into the same DNS protocol data unit, theoretically allowing for a set of queries in a single transaction. The query count value has always remained at 1.

Perhaps the biggest failure at the time that persists today was the decision to use a restricted ASCII character set, with equivalence of upper and lower-case alpha characters. In retrospect, it appears that the protocol would have benefitted from an exact binary match algorithm, leaving character equivalence as an application function rather than an intrinsic protocol attribute. Paul had thought that the label itself might, in some way, dictate the equivalence algorithm to use over DNS labels, while the system itself would transparently handle binary data. The result has been that we are attempting to retrofit Unicode binary labels into this restricted character set, with some surprising challenges.

Fast forward to 2010

The major shift in the DNS architecture in the intervening two decades was the step of interceding a network between the client and the resolver. DNS recursive resolvers no longer reside on the same platform as the applications that call for DNS resolution, and they are connected not by any form of Remote Procedure Call (RPC), but by the DNS protocol itself. This has been both an amazing success and a critical vulnerability. It was a success because it allowed the DNS infrastructure to continue to scale to meet ever-increasing demands, without being constrained by other infrastructure investments. The DNS was not an intrinsic part of the underlying network, nor was it an intrinsic part of the server or client platforms. The intrinsic elements of failure in this model are based on the use of a simple open transaction protocol in the DNS, which was available for inspection by any well-positioned observer and could also be manipulated by active intermediaries.

The underlying vulnerability was that the DNS query structure is both recursive and stateless. If a resolver cannot directly respond to a query from its local cache, it will make its own queries; these ‘triggered’ queries are made without any reference to the original query and include no details of the original querier. When the resolver responds to the original querier there is no indication how the resolver gathered the information, whether by use of the local cache or by performing further queries. This makes the DNS opaque, in the sense that the original client has no ability to trace the queries being made on the client’s behalf and has no way to understand how the resolver has assembled the information to generate a response.

This leads to what Paul described as a notable failure of the time: the exercise to retrofit verifiable authenticity to the DNS in the form of DNSSEC. It should be stressed that DNSSEC is not a failure in terms of the ability of a user to validate a response if the original DNS zone has been DNSSEC-signed and there is a clear linkage of signed zones from the target zone all the way back to the root zone. The failure has been in the manner of its integration into the DNS. It has been bolted onto the DNS without any changes to the base DNS protocol. More information needs to be passed through the DNS, which adds to the size of DNS responses and increases the number of queries required to complete validation. Both behaviours are anathema to the efficiency of the DNS, and the result is that the adoption of DNSSEC-signed zones and the adoption of validation, both at the recursive resolver level and at the end client stub resolver level, have been at best sluggish. In summary, depending on your point of view, there is either too much DNSSEC or not enough of it, and we can’t agree which is the ‘right’ answer!

What has been surprising is that the DNS has been used as a policy control point. To counter the spread of malware, spam, and similar, it has been effective to simply block the resolution of the associated DNS names in the DNS. This started as a simple denylist for a resolver to consult before attempting resolution and has been subsequently standardized as a policy construct in Response Policy Zones (RPZ). These DNS controls are not only commonplace but have been taken up by many actors from governments to open DNS resolvers.

Another surprising development in resolvers is that some of the original simple principles of DNS resolution that we knew in the 1980s are too often ignored in recent implementations. A case in point is the so-called ‘TsuNAME’ vulnerability in the DNS, where a circular chain of references in the DNS may cause a resolver to chase the loop for an extended time. This class of vulnerability has been known since the inception of the DNS, and the simple principle from the 1980s to defend against such situations is to limit the total sum of resources that a resolver will use to work on any query.
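
The principle of bounding the work done per query can be sketched in a few lines. This is a toy model under stated assumptions, not how any real resolver is written: the record table, the "REF" and "A" record kinds, the resolve function and the default budget of 20 lookups are all hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a single resolution uses up its lookup budget."""


def resolve(name, records, budget=20):
    """Follow references in a toy record table, spending one unit per lookup."""
    steps = 0
    current = name
    while True:
        if steps >= budget:
            raise BudgetExceeded(f"gave up on {name!r} after {steps} lookups")
        steps += 1
        kind, value = records[current]
        if kind == "A":
            return value, steps
        current = value  # a reference: chase the name it points at


RECORDS = {
    "www.example": ("REF", "host.example"),
    "host.example": ("A", "192.0.2.10"),
    # A circular chain: each name can only be resolved via the other.
    "ns.a.example": ("REF", "ns.b.example"),
    "ns.b.example": ("REF", "ns.a.example"),
}
```

Resolving "www.example" completes in two lookups, while "ns.a.example" stops with BudgetExceeded instead of looping indefinitely. The budget caps the total work without the resolver needing to detect the cycle explicitly.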

Fast forward to today

The increasing distance between clients and their recursive resolvers has opened the DNS to all kinds of privacy incursions. The notable development in the past decade has been the standardization of encryption of DNS queries and responses. Some browsers have been quick to fold encrypted DNS functions into the browser itself, creating a privacy envelope that excludes both the operating system platform and the network from either observing or manipulating DNS queries and responses. Viewed at one level, this has the potential to be a clear success for the DNS, effectively countering many forms of covert DNS manipulation. However, it’s unclear if this is an unqualified success. Pushing the DNS from an infrastructure function to an application function can encourage several fragmentary pressures in the DNS, where each application can place the user into a DNS policy environment of their own devising.

In the larger issue of so-called ‘apex predators’ in today’s surveillance-dominated Internet, DNS privacy has changed nothing. While the DNS could represent a rich vein of information about the user, the existing techniques of user profiling from these large-scale advertising brokers do not rely on access to the DNS, and the IETF’s moves to support encryption of the DNS appear to be actively encouraged by these larger enterprises, perhaps as a move to entrench their position and shut out the entry of potential competitors.

Centralization of service in the DNS is a dominant feature of the DNS. There are dominant actors in the name registration business, the DNS registry business, the name hosting business, and in the name resolution business. At this point in time, these are all different enterprises who dominate each sector, but in the ebb and flow of business mergers and acquisitions this may be just a temporary aberration in a larger trajectory to absolute control by a single entity.

Even today this centrality has its associated vulnerabilities, where an outage of a single provider can cause a very broad outage. The 2016 Dyn attack caused large-scale outages in the US. The more recent Akamai outage similarly illustrates the level of criticality of these apex DNS actors. The DNS appears to have been swept up into the increasing centralization of content distribution, and recent studies indicate a highly dominant position of Akamai and Cloudflare in DNS hosting, due largely to the way content hosting these days relies on DNS redirection. We are left with an ongoing situation of fewer bigger failures!

What do users want from the DNS?

Paul pointed out that, at times, his DNS service has been hijacked by his security provider for the gain of his security provider. His queries have been filtered by his network provider, his typos have been trapped and redirected to advertisements, and he has contributed to a shared DNS history database, all without his express permission or even knowledge in most cases. Paul’s experience parallels that of every other Internet user. What we have is a far cry from most users’ expectations.

What users actually want is perhaps the key question here. At one level, users don’t want to be in the situation where this question is even pertinent! Why should we be asked this question at all? Perhaps it is more informative to ask what users assume about the DNS, if they ever think about the DNS at all. They want it to work, to be fast, and completely unseen and untouched by them. Most end users never change the settings of their ISP, their operating system platform, and their applications.

Do users want diversity of providers in the DNS? Probably not, although not for the reasons that are commonly offered. Warren Buffett has said that diversification is silly if you know what you are doing, and a rational choice if you don’t. Most users would claim no familiarity whatsoever with the DNS but, as we’ve seen, diversity would not help in any case. What users want is a rational default configuration. They would prefer that configuration to not be hostile to their interests. Beyond that, they really have no further preference. However, it is not clear that the commercial interests that dominate the DNS have the intention to produce outcomes that reflect even a benign attitude to users. The outcomes we see in the commercial DNS tend towards a commercial self-interest that is unremittingly exploitative of users, to the exclusion of any other consideration.

If you ask what various DNS providers want, then you would get an entirely different set of responses. Depending on their particular role, they would like to observe DNS queries and responses, or selectively alter responses or redirect queries. Some would like to assemble profiles of users, others would like to assemble profiles of malware activity, while others want to intercede in the operation of malware and disrupt it.

All of this is supposedly their perception of user interest, or the interests of the network itself. It’s challenging to predict where this is heading, and even more challenging to maintain an optimistic outlook that this will be resolved in ways that enhance user privacy and network resilience.

The curious court case of Quad9

Quad9 is an open DNS resolver service, started in 2016 with the support of IBM, the Global Cyber Alliance, and Packet Clearing House. The service is a so-called ‘clean-feed’ service, with blocklists assembled from more than 20 security intelligence providers, intended to disrupt various forms of criminal abuse of Internet users through DNS blocking. Quad9 is based in Switzerland and operates under the provisions of Europe’s General Data Protection Regulation (GDPR) as well as Swiss Data Protection measures. The resolvers are located in 90 economies, with 180 points of presence. It is operated as a free service.

As John Todd observed, DNS filtering is an effective way for cooperating users to be protected against certain threats they wish to avoid. On the other hand, it is entirely counter-productive in attempting to coerce unwilling users into not connecting to certain proscribed services. Many economies have such blocklists of proscribed services, and these measures are commonly implemented by service providers as DNS blocks applied to users within that economy.

The global open DNS recursive resolver providers pose a challenge to such national measures. The general intention is to provide the same service to all queries irrespective of the supposed location of the querier. Given that most national blocking measures are unclear about who should maintain such blocks, and how, these distributed open resolvers have managed to stay one step removed from such measures. However, Intellectual Property Rights (IPR) interests associated with Sony Music Germany brought a suit against Quad9 in a German court, which ruled that Quad9 must block resolution of the domain name of a website in Ukraine that does not itself hold copyright-infringing material, but instead contains pointers to another website that is reported to hold alleged copyright infringements.

Quad9’s interpretation of this ruling is that queries for this particular domain name will generate a SERVFAIL response when they come from IP addresses that can be geolocated to Germany and are sent to Quad9 resolver instances located in Germany. The implication here is that this imposes an additional cost on the DNS service, as it needs to perform a geo-lookup and invoke a policy rule in the case of a geo-match.
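
To make that additional per-query cost concrete, here is a rough sketch of a geo-match policy check. It is purely illustrative and says nothing about how Quad9 actually implements the rule: the prefix table uses documentation address ranges as a stand-in for a real geolocation database, and blocked.example, locate and policy_rcode are hypothetical names.

```python
import ipaddress

# Hypothetical data: documentation prefixes stand in for a geolocation database.
GEO_PREFIXES = {
    ipaddress.ip_network("192.0.2.0/24"): "DE",
    ipaddress.ip_network("198.51.100.0/24"): "FR",
}
BLOCKED = {("DE", "blocked.example")}  # (client economy, query name)


def locate(client_ip):
    """Map a client address to an economy code, or None if unknown."""
    addr = ipaddress.ip_address(client_ip)
    for prefix, economy in GEO_PREFIXES.items():
        if addr in prefix:
            return economy
    return None


def policy_rcode(client_ip, qname, instance_economy):
    """Return 'SERVFAIL' when the geo-match rule applies, else 'RESOLVE'."""
    name = qname.rstrip(".").lower()
    # The rule applies only on resolver instances located in Germany,
    # and only to clients that geolocate to Germany.
    if instance_economy == "DE" and (locate(client_ip), name) in BLOCKED:
        return "SERVFAIL"
    return "RESOLVE"
```

Even in this toy form, every query reaching a German instance pays for a geolocation lookup and a policy table check before normal resolution can begin.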

There are a number of curious aspects of this situation. It appears that the other significant open DNS resolver providers (Google, Cloudflare, and Cisco’s OpenDNS) have not been similarly targeted by Sony’s legal action in Germany. Perhaps the Swiss domicile of Quad9 made them a more appealing target for German legal action. Or Quad9’s small size made them a vulnerable first target for Sony and the IPR industry. However, it’s hard to see the overall rationale for this move in a larger context of geopolitical presence in the Internet.

We see emerging disquiet in the European Union (EU) over the dominant position of US corporate interests in almost every aspect of the Internet, and the DNS space is no exception. Largely, the EU is being treated in the same way as the bygone imperial empires treated their colonies, and for the EU it’s a novel and deeply discomforting place to be. The EU is trying hard to position EU enterprises in direct competition to these US-based Internet giants. The DNS has been caught up in these efforts and there are some recent EU initiatives, such as the DNS4EU program, which is intended to create some EU-based competitive positions. However, this German court decision has the opposite effect. If this makes DNS operators domiciled in Europe more at risk from expensive-to-implement regulatory measures that are not imposed on foreign DNS providers, then this entire EU initiative is probably going to go nowhere useful!

Quad9’s preferred response, aside from getting this particular court decision overturned with a legal challenge, is to encourage policymakers to specifically mention recursive DNS services as exempt from mandatory censorship requirements. At a minimum, they would like to see recursive DNS-based models of content control optional. As far as I can tell, in the words of that immortal Australian film classic, The Castle, ‘They’re dreaming!’

Jio and DNS tunnelling

Some two thirds of the world’s Internet users direct their DNS queries through ISP-provided recursive resolvers. This means that some of the larger resolvers are located in some of the larger retail ISPs, and these are located in the most populous economies. In India, the largest retail provider is Reliance Jio, and they have some 430M subscribers. Their DNS queries are directed to one of 23 DNS resolver farms, and they use 265 DNS resolver engines. The peak query rate per resolver is some 300,000 queries per second (QPS) and the aggregate peak is some 15.9M QPS. It’s a large-scale DNS deployment.

They have been asked by the national regulator to block DNS tunnels, as these mechanisms bypass the existing national DNS censorship measures. If you want to hide your DNS queries there are two basic options: You can encrypt the DNS packet itself, or you can leave the DNS packet in the clear and hide the query inside the query label. In this case, the technique uses the latter option, and it’s very similar to the oblivious DNS technique. The query is encoded in Base64, and the new query name is directed to a cooperating decoder, which then performs the resolution on the user’s behalf like any normal recursive resolver. The response is encoded again using Base64 and passed back as a TXT response to the original query.
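
The encoding step can be sketched as follows. This is an illustration under stated assumptions rather than a description of any deployed tunnel: decoder.example is a hypothetical decoder zone, the URL-safe Base64 alphabet with padding stripped is an assumption made to keep the labels free of '+', '/' and '=', and the payload is split to respect the 63-octet label limit. It also assumes that the case of the query name is preserved in transit, because Base64 is case-sensitive.

```python
import base64

DECODER_ZONE = "decoder.example"  # hypothetical cooperating decoder


def encode_query(inner_name):
    """Wrap a query name as Base64 labels under the decoder's zone."""
    raw = base64.urlsafe_b64encode(inner_name.encode("ascii"))
    payload = raw.decode("ascii").rstrip("=")
    # Split the payload so that no single label exceeds 63 octets.
    labels = [payload[i:i + 63] for i in range(0, len(payload), 63)]
    return ".".join(labels + [DECODER_ZONE])


def decode_query(outer_name):
    """Recover the original query name, as the decoder would."""
    suffix = "." + DECODER_ZONE
    if not outer_name.endswith(suffix):
        raise ValueError("not addressed to the decoder zone")
    payload = outer_name[:-len(suffix)].replace(".", "")
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return base64.urlsafe_b64decode(payload).decode("ascii")
```

To a resolver on the path, the outer name is just an ordinary query for a long name within the decoder's zone.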

The problem in automated detection of such DNS tunnels is that long, seemingly random query names are used in several legitimate contexts, including many content hosting configurations. They believe that they have now deployed an effective tunnel detector and blocker. Interestingly, DNS tunnel traffic was observed to be as high as 2% of the total DNS traffic in their network. If this is an indicator of the level of user demand for bypassing these national blocks, then it’s likely that they are now engaged in a somewhat unproductive escalation process of move and countermove with no clear end in sight. The sheer size of the Jio DNS environment tends to suggest that the advantage lies in devising and promulgating the active bypass techniques in the DNS, while the large-scale deployment of DNS resolvers weighs down the agility of the deployed defensive mechanisms.
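
The presentation did not describe how the detector works. One commonly discussed heuristic, sketched here purely as an assumption, flags query names whose longest label is both long and has a high character entropy. The function names and thresholds are illustrative, and a rule like this will also flag exactly the kind of legitimate long random names mentioned above.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Bits per character of the character distribution in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_tunnel(qname, min_length=40, min_entropy=4.0):
    """Flag names whose longest label is long and high-entropy.

    The thresholds are illustrative assumptions, not measured values.
    """
    labels = qname.rstrip(".").split(".")
    longest = max(labels, key=len)
    return len(longest) >= min_length and shannon_entropy(longest) >= min_entropy
```

A practical detector would need more context than a single query name, such as query rates per client and per destination zone, to keep false positives manageable.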

Google’s public DNS service

The largest DNS resolution environment on the Internet is operated by Google. This is their open DNS recursive resolver, which responds to queries passed to 8.8.8.8. This project is now 12 years old (it was launched on 3 December 2009).

Google has been very active in implementing new DNS standards once published in a stable form. Their service, with DNSSEC validation launched in March 2013, is now the largest DNSSEC-validating DNS resolver system on the Internet. In recent years, Google DNS has introduced support for DNS over TLS (DoT) and DNS over HTTPS (DoH) (both in 2019), and in 2020, Google supported Query Name Minimization (with minimization being performed for up to three name levels, as I understand). Google is also using aggressive NSEC caching, cache poisoning protection via 0x20 bit munging, nonce prepending, and DNS cookies. They are experimenting with DoT to authoritative servers with a few selected authoritative operators. Perhaps more controversially, Google supports Extension Mechanisms for DNS (EDNS) Client Subnet. This includes some client information in the query passed onward from the recursive resolver to the authoritative server. It allows the authoritative server to provide a geo-targeted answer based on the assumed location of the client, but at the cost of client privacy and cache efficiency.
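
Of the techniques in that list, 0x20 bit munging is the easiest to illustrate. The idea is that the resolver randomizes the case of the letters in the query name and accepts a response only if the name is echoed back with exactly the same case pattern, which adds entropy that an off-path attacker has to guess. The sketch below shows the idea only and makes no claim about Google's implementation. The function names are hypothetical and names are assumed to be ASCII.

```python
import random


def randomize_case(qname, rng):
    """Return qname with the case of each ASCII letter chosen at random."""
    return "".join(
        ch.upper() if ch.isalpha() and rng.random() < 0.5 else ch.lower()
        for ch in qname
    )


def response_matches(sent_qname, echoed_qname):
    """Accept a response only if the question name is echoed with identical case."""
    return sent_qname == echoed_qname


# Usage: a production resolver would draw from a cryptographically strong source.
rng = random.SystemRandom()
sent = randomize_case("www.example.com", rng)
```

Each letter contributes one bit, so a name such as www.example.com with 13 letters has 2^13 possible case patterns. This works only because DNS name matching is case-insensitive while servers conventionally copy the question name into the response unchanged.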

The use of Google’s service continues to grow, and the number of users passing queries through Google’s resolution service has grown by 50% over the past 15 months. Within this there are some other notable trends.

The number of queries for IPv6 addresses (AAAA queries) has risen from 2% of queries at the end of 2019 to a current proportion of 7% of all queries (or a little more than triple the proportion of queries over this period). This is curious in that the measurements of the population of IPv6 users have risen from 24% to 30%, or a relative growth of 25%. A possible explanation is that the query traffic seen by Google’s DNS encompasses more than user queries, and Google’s traffic profile may include a sizeable proportion of other query traffic, such as transaction log data analysis and even query log replay. The query volume for DoT and DoH has increased five-fold, from 2.5% of queries to 12% of queries. The split between DoT and DoH is approximately even these days. This makes a lot of sense for clients of Google’s service if its clients are required to reach Google over an Internet path. However, the context for those ISPs who use Google via a forwarding arrangement (of which there appear to be many) is not so clear.
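
The comparisons in that paragraph mix two measures: the ratio of the new proportion to the old one, and the relative growth. The arithmetic, using only the figures quoted above, can be checked with two small helper functions (the function names are mine).

```python
def ratio(before, after):
    """How many times larger the new value is than the old one."""
    return after / before


def relative_growth(before, after):
    """Growth expressed as a fraction of the starting value."""
    return (after - before) / before


aaaa_ratio = ratio(2.0, 7.0)               # AAAA share of queries: 2% to 7%
ipv6_growth = relative_growth(24.0, 30.0)  # IPv6 user population: 24% to 30%
encrypted_ratio = ratio(2.5, 12.0)         # DoT and DoH share: 2.5% to 12%
```

These give 3.5 for the AAAA query share (a little more than triple), 0.25 for the IPv6 user population (25% relative growth), and 4.8 for the encrypted query share (roughly five-fold).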

Google uses a multi-tier Internet architecture. The frontends are simple caching-only resolver engines that handle various encapsulation protocols as well as the DNS protocol with the client. If the query cannot be answered from the local cache, the query is passed to a backend resolver engine, which will either answer from its local cache or pass it on for resolution, with proxy query servers handling individual queries to authoritative servers. For reasons of simplicity, there is no shared cache in this architecture.

For Google there is an increasing level of integration of the DNS into their existing web delivery infrastructure and the DNS, even without the added impetus of DoH, is being treated in a similar manner to another component of web-based infrastructure with common service elements for TLS termination, HTTP transports, DoS protection and OAM management. This allows the Google DNS service to leverage existing Google service modules where available. For example, the DNS service uses a larger Google DoS blocking service and does not need to separately operate its own DoS detection and blocking service.

The system is intended to operate in a largely autonomous manner which, except for very large DoS attacks and outages of certain TLDs, has been successful so far. The DoS defence mechanisms are tuned to detect certain forms of query attacks and not pass these queries on to the authoritative servers. The system also rate limits the capacity of reflection attacks that attempt to use Google’s DNS.

At this point, Google maintains that they perform no DNS filtering or censorship of any form in any of their points of presence. Obviously, the issues behind the German court decision relating to Quad9 are not going away any time soon.

Summary

The DNS appears to operate in a style that is like other collaborative group projects. The group appears to pick up a theme and run with it intensively for a while and then apparently lose interest and move on to the next theme.

For a while, the range of defensive techniques to use against DNS-based DoS attacks was a consistent theme of DNS conversations. We then moved on to the issues of the increasingly baroque ornamentation of the DNS protocol in the ‘DNS Camel’ conversations. The past couple of years have seen what could be called an obsessive interest in channel encryption for the DNS. More recently, we’ve been looking at service binding (SVCB) and HTTPS records, and ways the DNS can be augmented to provide an application-level rendezvous function, complete with a set of application-level protocol parameters, distinct from basic name-to-address mapping.

There is a larger tension at play as well between scalability and functionality of DNS responses. The initial efforts in the DNS were directed to provide consistency and uniformity of responses. The positive benefit of this was caching, which in turn, was a major reason why the DNS was able to scale so readily. Caching pushed DNS traffic out to the edge of the network, which reduced the traffic impacts on authoritative servers. This, in turn, has allowed individual zones to bloat in size, most notably the .com and .net zones, where .com now contains more than 145M name registrations and .net contains a little over 13M names.

At the same time, we have been centralizing increasing amounts of DNS infrastructure. Within the ISP, the DNS recursive function is being operated by outsourced providers, such as Secure64 or Akamai’s AnswerX. This is complemented by the rise of a small set of open DNS resolvers, notably Google’s DNS service. Over on the authoritative server side there is similar centralization, where a small number of large enterprises operate much of the DNS server infrastructure, including Amazon’s Route53, Akamai’s Edge DNS, and Google’s Cloud DNS.

This means that we are more confident of the ability of the DNS infrastructure to scale to ever larger levels. In the tension between scalability and customization within the DNS, we are now less concerned about scalability. These days we are confident in running very popular domain names with short TTLs (Facebook’s outage was triggered by short TTLs) and using the DNS to perform content steering and rendezvous (such as what can be achieved by a combination of EDNS Client Subnet and HTTPS records). We are evidently prepared to reduce the levels of caching and place greater stress on the DNS as a result. The use of Chrome’s sensing queries and the use of nonces in discovery queries to prevent certain forms of DNS poisoning also deliberately bypass local DNS caches.

It’s not just downplaying the benefits of caching. We are now also prepared to contemplate ditching UDP and not only head towards DNS over TCP, but to also introduce encryption into the process. It was long believed that the use of UDP was not just a simplification for the DNS, but a core element of the efficiency of the DNS protocol.

Currently, the tension between scalability and functionality in the DNS favours functionality at the expense of cache effectiveness and scalability. How long we can sustain this stance and allow the DNS to continue to chase down various forms of functional bloat is, at this stage, anyone’s guess!

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.