DNS-OARC 30: Bad news for DANE

By Geoff Huston on 27 May 2019

Tags: DANE, DNS, DNS-OARC, DNSSEC, measurement, opinion, security

DNS-OARC held its 30th meeting in Bangkok from 12 to 13 May. Here’s what attracted my interest from two full days of DNS presentations and conversations, together with a summary of the other material that was presented at this workshop.

See the event website for the full agenda and presentation materials.

Some bad news for DANE (and DNSSEC)

For many years the Domain Name X.509 certification system, or WebPKI, has been the weak point of Internet security. By ‘weak point’ you could as easily substitute ‘festering, rancid, underbelly’ and you would still be pretty much right on the mark! The massively distributed trust system has proved to be unmanageable in terms of integrity and there is a regular flow of stories of falsely issued certificates that have been used to perform intrusion attacks, eavesdrop on users, corrupt data and many other forms of malicious behaviours.

The efforts of the CAB Forum to instil some level of additional trust in the system appear to be about as effective as sticking one’s fingers into a leaking dam. The number of trusted CAs has extended conventional credibility well beyond the normal boundaries and has pushed the unsuspecting user into a fragile state of credulity.

Efforts to improve this mess have fared little better. Extended Validation (EV) certificates have gained no traction with users, who are largely oblivious to subtle changes in the content and colours of the browser’s navigation bars, and certificate transparency logs appear to be completely ineffective in catching CA-related name hijack events in real time.

The effort to define DANE (DNS-based Authentication of Named Entities) was an attempt to provide a different mechanism of name-based assurance, by using the DNS to convey credentials to the user rather than a third party-operated X.509 PKI. DNSSEC provided a way to allow any entity to directly assure itself that the response it had received from the DNS, relating to a record held in the DNS, was indeed precisely that DNS record at that time. If the entire objective of the Web PKI and all these domain name certificate issuers was, in the end, to associate the control of a key pair with the control of a delegated domain name in the DNS, then DANE would cut out this morass of intermediaries and allow the domain holder to store the name operator’s public key in the DNS in a manner that would be hard for attackers to corrupt.
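To make the mechanism a little more concrete: DANE publishes its key material in a TLSA record (RFC 6698) at an owner name such as _443._tcp.www.example.com. The following is a minimal Python sketch of the simplest case, a SHA-256 digest of the full certificate (selector 0, matching type 1). The function names and the example host are illustrative assumptions and were not part of the workshop material.

```python
import hashlib


def tlsa_owner_name(host: str, port: int = 443, proto: str = "tcp") -> str:
    """Owner name at which the TLSA record for a service is published."""
    return f"_{port}._{proto}.{host.rstrip('.')}."


def tlsa_rdata(cert_der: bytes, usage: int = 3) -> str:
    """TLSA RDATA for selector 0 (full certificate), matching type 1 (SHA-256).

    Usage 3 means the record pins the end-entity certificate directly.
    """
    digest = hashlib.sha256(cert_der).hexdigest()
    return f"{usage} 0 1 {digest}"


def matches(cert_der: bytes, rdata: str) -> bool:
    """Check a presented certificate against a 'usage 0 1 <hex>' TLSA record."""
    _usage, selector, mtype, hexdigest = rdata.split()
    if (selector, mtype) != ("0", "1"):
        raise ValueError("this sketch only handles selector 0, matching type 1")
    return hashlib.sha256(cert_der).hexdigest() == hexdigest.lower()
```

With a real server, the DER bytes could be obtained with ssl.PEM_cert_to_DER_cert(ssl.get_server_certificate((host, 443))). The sketch deliberately omits the part that matters most: the client has to fetch the TLSA record through a DNSSEC-validated lookup, otherwise the comparison proves nothing.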

Shumon Huque is of the view that DANE was the reason why DNSSEC was worth the effort (and I agree with him!). This was the way to finally bring some robust security into the use of the name system and allow applications to assure themselves that they are indeed connecting to the genuine named service.

A basic sticking point for browsers has been the extended time taken to perform DNSSEC validation. There was a concerted effort to address this through a mechanism called DNSSEC Chain Extension that was to be ‘stapled’ to the TLS material in the credential exchange. The document describing this approach was initially approved in the IETF’s TLS Working Group as a candidate standard track document in March 2018, and an implementation was funded and planned for Mozilla. But this effort crumbled in what was described by Shumon as a “huge fight” in the working group, and the draft was abandoned. The result is that DANE is effectively dead for browsers for the time being.

It is incredibly frustrating to see these developments. The intentions behind both DANE and free domain name certificates were laudable, as affordable high-quality security for all was the intended result. But what we are left with is no better than before, and possibly worse. Truly reliable, robust security is even more of a luxury good than ever.

The modality of mortality of domain names

Are all new domain name registrations basically junk? Do they live Hobbesian lives that are ‘nasty, brutish and short’? How are these short-lived names destroyed? Farsight’s Paul Vixie reported on a study of the mortality of domain names that exist in the DNS for less than one week.

The study observed a creation rate of around two per second, or 150,000 new domain delegations per day and the creation of new hostnames at a far higher rate of some 300 per second, or some 12 million names per day. They took a six-month window and studied some 23.8M newly delegated domain names. A little under 10% of these names had died within seven days. And of these, most die within the first five hours, and 60% of these short-lived delegations die within 24 hours.

The major cause of this early demise is blacklisting of the domain name. Blacklisting is a very rapid response, with some 80% of blacklisted domains entered into the lists within 24 hours of the time of first use of the name. More than 30% of blacklisting occurs within one hour.

A second cause is the removal of the delegation record. This form of name removal takes longer, with a median of some two days. Only 20% of the names that are removed in this way are removed in under one day. Another cause is the removal of the name’s authoritative name servers, and here while one-quarter of the name removal events occur within one day, the median time of death by this cause is some four days. The longer time here may be an artefact of credit card transaction clawback or similar.

The majority of the short-lived names were observed in the gTLD space, and here blacklisting is the primary cause of name death. This was also observed in those ccTLDs that are used as generic TLDs. Overall, some 8% of new names die within seven days.

The observation from this study is that we appear to be spending a huge amount of resources to remove names that should never have existed in the first place. If further rounds of new gTLDs turn out to be little more than an exercise to offer more choices for spammers, then why are we doing this to ourselves?

Hyper-hyper-local roots

RFC 7706 describes how a recursive resolver can configure a local copy of the root zone and use this local copy as a fast alternative to performing queries directed to a root server.

Ray Bellis described the approach as being too prescriptive. There is no need to put the root into every recursive resolver, and if a network operator wanted to go down this path, a local root resolver should be capable of supporting many recursive resolver clients. To illustrate this, Ray used the ldns DNS library to implement a fast root server on a tiny hardware platform. He pre-compiled the answers, including pre-computed name compression offsets, and used raw sockets and stateless TCP to speed up the server’s TCP performance. It’s blindingly fast on small processors: Ray achieved 15,000 queries per second on a Raspberry Pi 3B, with a very economical 13MB RAM footprint.
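The pre-compiled answer idea can be sketched in a few lines of Python. This is not Ray's implementation (which is built on ldns): it is a toy, under the assumption that every response is built once as wire-format bytes, with the owner name already compressed to a pointer at offset 12, so that serving a query is a table lookup plus a two-byte ID patch. All function names here are hypothetical.

```python
import socket
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels (no compression)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(qid: int, name: str, qtype: int) -> bytes:
    """Minimal query: 12-byte header, one question, class IN."""
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", qtype, 1)


def parse_question(packet: bytes):
    """Return (lowercased wire-format qname, qtype) from a DNS packet."""
    pos = 12
    while packet[pos] != 0:
        pos += 1 + packet[pos]
    pos += 1
    # Label lengths are at most 63, so lower() only touches ASCII letters.
    qname = packet[12:pos].lower()
    qtype, _qclass = struct.unpack("!HH", packet[pos:pos + 4])
    return qname, qtype


def precompile_a(name: str, ipv4: str, ttl: int = 3600) -> bytes:
    """Build a complete A-record response once, with ID 0 as a placeholder."""
    question = encode_name(name).lower() + struct.pack("!HH", 1, 1)
    header = struct.pack("!HHHHHH", 0, 0x8400, 1, 1, 0, 0)  # QR=1, AA=1
    # 0xC00C is a compression pointer back to the qname at offset 12.
    answer = b"\xc0\x0c" + struct.pack("!HHIH", 1, 1, ttl, 4)
    return header + question + answer + socket.inet_aton(ipv4)


def answer(table: dict, query: bytes):
    """Serve a query from the table, copying only the 16-bit query ID."""
    response = table.get(parse_question(query))
    if response is None:
        return None
    return query[:2] + response[2:]
```

A real server would also need to echo the RD bit and the original case of the query name, and to handle referrals, negative answers and EDNS; the point of the sketch is only that the per-query work can be reduced to very little.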

More generally, it’s possible to generalize this approach and take relatively small zones and use this technique to tune them to offer very high-performance DNS servers on extremely small devices.

Deploying authoritative servers

What’s the best way to set up a zone’s authoritative name servers? Are many better than just one or two? Is anycast useful for authoritative name servers? Is the design of the root zone server infrastructure with 13 named servers and associated anycast services something that we should all copy? Or should something less ambitious be entirely adequate for the job?

In many ways, the design of an authoritative server system represents the outcome of balancing several factors. There is a consideration of server availability, server performance for both positive and negative answers, and the behaviour of recursive name servers.

The IETF’s standards point to a strong preference for zones to have at least two authoritative name servers and preferably disperse them so they do not fate share. They justify this preference as being robust in the face of individual failures. As a result, many zones including those considered critical to many enterprises operate with a large number of NS records per zone.

If a zone is served by a number of name servers in the form of multiple NS records, how do recursive resolvers choose a name server to query? There is a widely held belief that a recursive resolver will regularly sample the time to query each authoritative name server and then use the fastest server for the next sample period. Work by Akamai’s Kyle Schomp looking at queries to an Akamai zone largely bears this out, but with a few important caveats. The issue is that the concentration of use of resolvers is highly skewed, and while a small subset of these resolvers perform a high volume of queries that allows them to cache responsiveness per zone per server, the rest have a far lower overall query volume and the server selection algorithm gives inconsistent results in such circumstances.
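The commonly assumed selection behaviour can be sketched as a smoothed round-trip time (SRTT) table. This Python sketch is a generic illustration, not the algorithm of any particular resolver implementation; the smoothing factor, the exploration probability and the class name are all assumptions.

```python
import random


class ServerSelector:
    """Pick the authoritative server with the lowest smoothed RTT."""

    def __init__(self, servers, alpha=0.3, explore=0.05, rng=None):
        self.srtt = {s: None for s in servers}  # None means never measured
        self.alpha = alpha        # weight given to the newest sample
        self.explore = explore    # chance of re-sampling a random server
        self.rng = rng or random.Random()

    def record(self, server, rtt_ms):
        """Fold a new RTT sample into the moving average for this server."""
        old = self.srtt[server]
        if old is None:
            self.srtt[server] = float(rtt_ms)
        else:
            self.srtt[server] = (1 - self.alpha) * old + self.alpha * rtt_ms

    def choose(self):
        """Try unmeasured servers first, then mostly use the fastest one."""
        untried = [s for s, v in self.srtt.items() if v is None]
        if untried:
            return untried[0]
        if self.rng.random() < self.explore:
            return self.rng.choice(list(self.srtt))
        return min(self.srtt, key=self.srtt.get)
```

A resolver that sends thousands of queries to a zone keeps this table fresh. A resolver that sends a handful of queries is working from few or stale samples, which is consistent with the inconsistent selection that Kyle observed for low-volume resolvers.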

If you thought that many distributed authoritative name servers for a zone give faster overall name resolution performance, then this work challenges that assumption, to some extent. A large name server collection will work well for some resolvers that will make the best choice from the available set, but not for many others. Perhaps anycast is a better approach for optimizing the server set in terms of query times and at the same time offering a line of defence against DoS attacks.

Short notes

The following are some short notes on a number of other interesting presentations over the two days.

Resolver testbed

Paul Hoffman of ICANN reported on an effort to build a test framework using a VirtualBox VM filled with resolvers, a simulated root server and a mechanism to simulate particular resolver-to-root query interactions. He has made the code available for those interested.

DNS security

Ralf Weber presented a historical perspective of security issues in the DNS, including efforts to corrupt the DNS via cache poisoning, and later by the Kaminsky attack. There were DNS DoS amplification attacks, DNSChanger, and random subdomain name attacks on authoritative servers. These days we are seeing orchestrated multi-part attacks that exploit weaknesses in domain name registrar systems to hijack a domain name.

DNS interception

The rise of open DNS resolvers as an alternative to the ISP-provided resolvers has been a prominent feature of recent years. Such resolvers have been around for some decades, such as the DNS service behind 4.2.2.2 and that operated by OpenDNS.

The use of open resolvers gained more attention with the launch of Google’s service, which has been promoted as a fast and ‘honest’ service, in that it does not filter or alter responses, and does not perform NXDOMAIN substitution.

But such moves to bypass ISP-provided DNS resolvers have inevitably provoked a reaction. We hear of ISPs advertising the anycast IP addresses of these open servers in order to intercept such DNS queries and redirect them back to the original resolution environment. Other ISPs perform DNS interception, where all UDP (and most times TCP) traffic to port 53 is passed to a local DNS resolver irrespective of the IP packet’s destination address.

How prevalent is this practice? This presentation described an experiment that attempted to measure the extent to which DNS interception is taking place. It is a challenging measurement to perform at scale, and while various probe-based test platforms (such as RIPE Atlas probes) can perform these DNS tests, these particular platforms suffer from limited scale and selection bias.

So yes, DNS interception happens but it’s not clear how many users in the entire Internet have their DNS intercepted in this manner.

Multi-signer DNSSEC management

The attack on the Dyn service in October 2016 in the US had a number of consequences. One of these was the realization that using a single service provider to run your authoritative DNS name service is not necessarily a good idea.

However, outsourcing the serving of a DNSSEC-signed zone to multiple service providers can present some challenges. If the service providers also perform various customized responses (what is often called ‘stupid DNS tricks’), and use their own keypair and perform on-the-fly signing, then a multi-provider DNS service model can be made to work.

Shumon Huque’s presentation explored how various permutations of shared and per-provider KSK and ZSK keys can be made to work in a reliable manner.

Unsupported DNSSEC algorithms

The world of cryptography is one of constant change. New algorithms appear and existing algorithms are deprecated. What happens with DNSSEC tools when unsupported algorithms are used in the various parts of the zone signing, serving and validation processes?

Matthijs Mekking reported on the results of testing a number of widely used DNS signers, servers and resolvers to investigate their behaviour when unsupported algorithms are encountered. In general, the tools work as expected, treating unsupported algorithms in the same manner as unsigned data in general. Some tool crashes and anomalous behaviours were observed in some cases.

Offline KSK in Knot

Jaromír Talíř reported on how the .CZ domain was signed in the past and how the introduction of a Knot DNS signer has allowed the use of an offline KSK in the zone signing process.

DNS Flag Day

When the extension mechanisms for DNS (EDNS(0)) were introduced, DNS resolvers adopted a conservative stance. If a query containing EDNS(0) options did not elicit a response from the authoritative server, the resolver used a number of workarounds, re-querying without EDNS(0) options and re-querying using TCP.

DNS Flag Day was the ‘stop day’ when resolvers no longer supported this workaround behaviour, and authoritative servers needed to correctly respond to EDNS(0) queries. The flag day was largely deemed to be a success.
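The difference between the two behaviours can be sketched as a retry ladder. In this Python sketch the order of the legacy retries is an assumption (the text above does not specify it), and send is a hypothetical transport callable that returns a response, or None on timeout.

```python
# Each rung is (use_edns, use_tcp). The legacy ordering is an assumption.
LEGACY_LADDER = [(True, False), (False, False), (True, True)]
STRICT_LADDER = [(True, False)]  # after the flag day: no workarounds


def resolve(send, qname, ladder):
    """Try each rung in turn and return (response, rungs_tried).

    `send(qname, edns=..., tcp=...)` is a hypothetical callable supplied
    by the caller; it returns None when the server does not answer.
    """
    tried = []
    for edns, tcp in ladder:
        tried.append((edns, tcp))
        response = send(qname, edns=edns, tcp=tcp)
        if response is not None:
            return response, tried
    return None, tried
```

Against a server that silently drops EDNS(0) queries, the legacy ladder succeeds on its second rung, at the cost of a timeout on every lookup. The strict ladder simply fails, which is what puts the onus on the operator of the authoritative server to fix it.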

This has prompted consideration of another of these flag days for the future as a means of improving the robustness of the DNS. For the next DNS Flag Day, the objective is evidently somewhat more ambitious in scope, as the plan is to address the current issues with large DNS responses over UDP and the problems with the reliability of IP fragmentation of the large UDP packet, particularly in the case of IPv6.

The ultimate stub resolver

As Olafur Gudmundsson explained, the original stub resolver was implemented as a simple call into an operating system module that performed DNS resolution with a limited query repertoire. This model was refined with language-specific libraries, such as DNSjava, DNSpython and similar. These modules were ‘unpacked’ into DNS libraries and APIs to provide an application with greater flexibility and control over the DNS resolution function.

Cloudflare’s experience with DNSdist is interesting. The tool is positioned as a DNS load balancer across multiple DNS servers, whereas it is actually a highly effective traffic steering device with caching. All stub resolvers should have this level of functionality!

The same approach works for the so-called recursive resolver farms, where traffic steering by query name, coupled with caching, allows each member of the farm to operate exclusively across a set of query names and query types, eliminating the need to share query responses across the entire farm.
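Steering by query name can be as simple as hashing the name and type onto the list of farm members. The Python sketch below is an illustration of the idea only and makes no claim about how DNSdist implements its policies; the function name and the backend labels are assumptions.

```python
import hashlib


def pick_backend(qname: str, qtype: str, backends: list) -> str:
    """Map a (query name, query type) pair onto one member of the farm.

    Names are case-insensitive in the DNS, so the key is normalized first.
    """
    key = f"{qname.rstrip('.').lower()}/{qtype.upper()}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

Because a given name and type always lands on the same member, that member's cache is the only one that ever needs to hold the answer. Simple modulo hashing does reshuffle most names when the size of the farm changes; a consistent hashing scheme would limit that disruption.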

OpenINTEL

In the same way that search engines repeatedly crawl the web space to assemble their index data, it is possible to crawl parts of the DNS space. This project does precisely that, repeating the crawl on a daily basis, and the resultant data set becomes, in effect, a long-term record of the name space and its evolution. They query some 216 million domains per day, collecting some 2.3 billion DNS records per day in the process.

One illustration of this tool’s use was an analysis of authoritative servers before and after the October 2016 DNS attack. Many key customers of DNS providers switched from using one provider to multiple providers in the aftermath of the attack.

DNSKEY queries and the KSK Roll

Ray Bellis of ISC reported on a look at the volume of DNSKEY queries and RFC 8145 queries seen at E and F root servers over the KSK roll.

The installation of the KSK-2017 into the DNSKEY record did not generate a visible change in DNSKEY query levels seen by these root server clusters. The KSK roll itself did generate a 3x increase in observed DNSKEY queries. The absolute level of queries was not a concern, but the reasons for the higher query rate were not clear.

The revocation of KSK-2010 in January 2019 saw a further 5x increase in query levels. The removal of the revocation entry saw the query levels revert to the post-roll level, and subsequent investigation pointed to a bug in earlier versions of the BIND resolver that caused query repetition. But we have yet to see query levels drop to the levels seen before the KSK roll, and the reasons for this are again unclear.

What part of ‘NO’ is so hard to understand?

I presented on the queries seen when the server’s response is ‘no such domain’ (or NXDOMAIN). Instead of a single query, we observe an average of 2.4 queries seen by the zone’s authoritative server when the domain name itself does not exist.

The presentation attempts to explain this, looking at Happy Eyeballs, DNSSEC signed vs unsigned, the impact of DNS resolver farms and the curious observation that NXDOMAIN elicits more queries than a positive response. The overall behaviour of the DNS is sometimes rather difficult to fully explain given the interaction between various independent timers and various resolver architectures.

Incentivizing the adoption of new standards

It was reported that the reason behind the large number of DNSSEC-signed zones in .se was a financial incentive to registrars, where a signed zone was charged a lower registration fee. A similar program was used in .nl and this has been extremely successful. They are now using financial incentives to promote the adoption of IPv6 DNS servers and of tools that support secure mail, namely DMARC and STARTTLS.

DNS fragment attack

Kazunori Fujiwara of JPRS described a cache poisoning attack using IP fragment substitution. His presentation described how spoofed PMTU ICMP messages can prompt fragmentation and how an attacker can then attempt to insert fragments into the DNS response. He proposed some techniques to protect against this attack vector.

Flamethrower – DNS load and functional testing

An alternative tool to DNSperf with realistic query rate patterns. Code is available.

Respdiff – Regression and interoperability testing

A tool to generate and send queries to many name server instances and compare the responses. Code is available.

9 Comments

William May 27, 2019 at 1:59 pm

Hi there,

Your opening paragraph describes x.509: “By ‘weak point’ you could as easily substitute ‘festering, rancid, underbelly’ and you would still be pretty much right on the mark!”

Not only is this a highly inappropriate writing style, but it disrespects all the people who work on x509 and the work that has actually gone into making x509 secure. Today, because of efforts by many companies for certificate transparency in issuance, hsts for dynamic pinning, pinning of other roots, and more, x509 protects more people every day than any other technology on the planet.

I think this is really deceptive, misleading and inappropriate, and would ask you to change it please.

Geoff Huston May 27, 2019 at 3:39 pm

Perhaps the entire issue of the abuses of X.509 certificates should be an article in its own right. There are hundreds of Certificate Authorities and every end client, such as my browser, trusts all of them all of the time. Any such diverse and dispersed trust system will have its problems, and the WebPKI CA system is no exception. The issue with the WebPKI framework is that the end client does not necessarily know which CA has issued the legitimate certificate for any given domain name. So a compromise of any single CA can compromise any domain name – a sad case of “weakest link” security.

It seems that we have all but given up on making all CAs work with absolute integrity all of the time, which is both a realistic assessment and a sad one at the same time. The problem is that the ‘pinning problem’ (informing the end client as to the ‘correct’ CA) remains unsolved. Hopes for the CAA record in the DNS and HTTP Public Key Pinning (HPKP) have not eventuated, and all we are left with today is Certificate Transparency where all issued certificates are logged. The ill-fated Symantec certified the domain name “example.com” and it was evidently not noticed in the CT logs for 6 months! In a world where malicious domain hijacks need only last a very small number of hours it is beyond my ability to understand how Certificate Transparency helps me.

Part of the problem with the WebPKI is the state of denial on the part of the CA industry. Assertions that it “protects more people every day than any other technology on the planet” fail to acknowledge that while it protects most users most of the time, there is a steady stream of failures and these failures become the point of exploitation. Our demands of a security framework for the Internet are incredibly exacting: it has to be robust for all users all of the time. The WebPKI and the CA system repeatedly fail to achieve this demanding standard.

Is the problem getting worse? It appears that name hijacks are fueling a growing industry, so the answer appears to be: Yes. Are we getting any better in defending the integrity of the CA system? No. Should end users express their frustration with a system that is obviously failing to deliver universal protection for all users all of the time? I think so.

My comment is not a reflection on the diligence of the folk who work in the CA industry, but a sad reflection that the diverse and dispersed WebPKI system itself is just not good enough to defend itself against consistent attack all of the time, largely due to the lack of secure and reliable pinning. DANE looked like a promising response, but it appears that some browser folk don’t believe that DANE can improve the situation. So DANE looks like it’s not going to happen as a result. We are left with the unsettling “mostly works most of the time” outcome and, yes, that’s frustrating and incredibly disappointing.

But it’s not going to get any better until we are prepared to acknowledge that the WebPKI has issues and these issues need fixing. Denial is just not a useful response in this situation. So I’ll stand by my comment.

anon May 28, 2019 at 11:21 am

Re Dane:

I don’t understand why browser vendors don’t simply ship ldns or unbound and ditch the idea of using non-local recursive resolvers altogether. The additional latency from running a local unbound is quite negligible on non-mobile platforms, as is the memory, network and CPU footprint (compared to the default of using dhcp / router / ISP provided recursive resolvers). Sure, my local unbound has a smaller and colder cache and a worse network connection than my ISP’s big recursive resolvers, but that’s not a big deal if you don’t flush the cache all the time (e.g. browsers could occasionally commit their DNS cache to disk). Sure, this would need a better user experience for local domains, and would need some twiddling to make MTU / udp issues idiot-proof. But browser vendors are really good at making things idiot-proof.

Is there a link where browser people commented more on their reasoning?

But then, I don’t understand why we did not spec that the queries with +adflag +nocdflag +dnssec +recurse SHOULD return a complete certificate chain for local validation (aka: dear recursive resolver, I don’t trust you, but nevertheless please hunt down the name I need, and put all the info I need for local verification into the additional section).

Furthermore, I don’t understand why we did not spec a non-TLD like “X.”, with the protocol-level stipulation that delegations from root “.” to “X.” are invalid. In such a world, “foo.bar.X.” could only be valid via lookaside validation: If I use DANE for a local PKI and namespace as subdomain of “X.”, then I cannot get owned by a compromise of the global DNS root (realistically, one would worry about institutional compromise due to government pressure).

William May 28, 2019 at 11:37 am

https://bugs.chromium.org/p/chromium/issues/detail?id=50874#c22

Quote:
“DNSSEC and DANE (types 2/3) do not measurably raise the bar for security compared to alternatives, and can be negative for security.
DNSSEC+DANE (types 0/1) can be accomplished via HTTP Public Key Pinning to the same effect, and with a much more reliable and consistent delivery mechanism.”

https://sockpuppet.org/blog/2015/01/15/against-dnssec/

Quote: “Why? Because there are two DNS security problems. The first is somewhat esoteric, and allows servers to exploit the way DNS records refer to one another in order to add a bogus record for VICTIM.ORG to a lookup for EVIL.NET. The latter is obvious and allows any attacker with access to the network to simply spoof entire DNS responses. The committee designing DNSSEC split the baby and chose the former problem over the latter.

Even if DNSSEC was deployed universally and every zone on the Internet was signed, a coffee-shop attacker would still be able to use DNS to man-in-the-middle browsers.”

1. anon May 28, 2019 at 9:53 pm
  
  Hi William,
  
  >Even if DNSSEC was deployed universally and every zone on the Internet was signed, a coffee-shop attacker would still be able to use DNS to man-in-the-middle browsers.”
  
  Correct deployment of DNSSEC is to run a validating recursive resolver/cache on localhost. The overhead for running e.g. unbound is quite negligible, once the cache is warm.
  
  Even better deployment would be for localhost to forward queries to the coffee shop’s resolver, instead of hunting down glue records: Simply query for foo.bar.com first, and then move up towards root, requesting DS / RRSIG / DNSKEY, until we either hit something that is already cached and validated or a validation error. There are at most 4 steps in this chain. In case of validation errors, try the usual next: go down from root to foo.bar.com, following glue to resolve the A/AAAA record of NS, as unbound usually does. The latter fallback is necessary in the case of misconfigured coffee shops.
  
  If something like this was integrated into the browser, one could even save more trouble by accepting invalid (+cdflag) A/AAAA records for http requests, and only insisting on +nocdflag on A/AAAA and DANE information for https (if my ISP/coffee shop wants to screw with me, I am already ****** with http, hence no security gain from dnssec).
  
  The primary problem with the x.509 tire fire is that there are hundreds of entities that can **** me when connecting to some host foo.bar.com (each CA in my browser’s root cert store, and each jurisdiction hosting one of these CAs). With DNSSEC/DANE, I can get ****** by root, com, bar.com, and foo.bar.com.
  
  The secondary problem with the x.509 tire fire is the complexity of the entire thing. Writing a secure parser / validator for DNSSEC certificate chains is relatively easy; doing so for x.509 is a nightmare.
  
tialaramex May 28, 2019 at 6:50 pm

Hi Geoff, you mention

“a regular flow of stories of falsely issued certificates that have been used to perform intrusion attacks”

If you find REAL examples of “falsely issued certificates” you can and should tell the issuer and us at m.d.s.policy about them. But a certificate is not “falsely issued” just because you don’t like it, or don’t like who it was issued to. The passport issued to a murderer is not “falsely issued” just because of their heinous crime, it’s a real passport, it was properly issued, we just don’t want them to use it to flee justice.

Stories tend to conflate this because it makes for a better headline. Nobody wants to read an article entitled, “Using the password pass1234 for the domain registrar was probably a bad idea for my top 10 e-commerce company” or indeed “Phishing is still a thing” but either story can be dressed up as about “falsely issued” certificates if you turn a blind eye to the fact that the certificates were fine and the problem is elsewhere.

Does the fact that you fell for these stories indicate laziness, or is this something on which we need to better educate otherwise technical people?

Geoff Huston May 28, 2019 at 7:38 pm

You ask for cases. Here’s a recent one: a presentation by Bill Woodcock at a recent ICANN Symposium can be found at https://icann.box.com/shared/static/gcv5mu7vwr3jh3rubhsx7q7nwv3nih3n.key

There obviously is a problem in the WebPKI. A problem where the attacker needs only a few minutes to execute their scripts to obtain a domain name certificate under entirely nebulous circumstances and a CA system that proudly says that falsely issued certificates will be detected … eventually.

Users quite reasonably expect a secure system that is secure ALL the time. That’s not what we have, and, as I said already, denial is just not a useful response in this situation.

Ray Hunter June 6, 2019 at 12:25 am

> It is incredibly frustrating to see these developments. The intentions behind both DANE and free domain name certificates were laudable, as affordable high-quality security for all was the intended result.

> But what we are left with is no better than before, and possibly worse.

Intentions aside, there’s no Internet Police, or even effective real Police, as I learned after ordering from a legitimate web site operated by a fraudulent company (it was an otherwise legitimate business where the directors declared serial bankruptcy on purpose before delivering any goods).

I think we could do well to lower the expectations of what “trust” and “high quality security” can be delivered via technology alone.

A green browser bar means nothing if there’s no legal enforcement or financial insurance for the eyeballs behind the browser.

But on the other hand, simply knowing that your computer is talking to the device you think you’re talking to, and that no one else can trivially intercept or alter the communication, would be a worthwhile start.

That level of technical checking is far better anchored in DNS than in certificates IMHO (at least for guaranteeing “common names”).

It is also orthogonal to whether someone has done any checks on the “legitimacy” of the site operator. CACert also initially fell into this trap of over-reaching in the early days.

bk December 13, 2019 at 3:34 am

Also check “Better mail security with DANE for SMTP”, https://blog.apnic.net/2019/11/20/better-mail-security-with-dane-for-smtp/
