Scoring the DNS Root Server System

By Geoff Huston on 15 Nov 2016

The process of rolling the Key Signing Key (KSK) of the DNS root zone has now started. During this process, there will be a period where the root zone servers’ response to a DNS query for the DNSKEY resource record of the root zone will grow from the current value of 864 octets to 1,425 octets. Does this present a problem?

Let’s look at the DNS Root Server system and score it on how well it can cope with large responses. It seems that awarding stars is the current Internet way, so let’s see how many stars we’ll give to the Root Server System for their handling of large responses.

Packets and Networks

What is it about large responses that is an issue here?

There are a number of persistent themes in packet networking that appear to be unresolved despite many decades of experience. One of these is the handling of packet sizes.

Packet-switched networks dispensed with the constant time base used in time-switched networks. Instead, they allow individual packets to be sized according to the needs of the application as well as the needs of the network.

Smaller packets have a higher packet header to payload ratio and are consequently less efficient in data carriage. On the other hand, within a packet switching system the smaller packet can be dispatched faster, reducing the level of head-of-line blocking in the internal queues within a packet switch and potentially reducing network-imposed jitter as a result.

Larger packets allow larger data payloads, which in turn allows greater carriage efficiency. Larger payload per packet also allows a higher internal switch capacity when measured in terms of data throughput. But larger packets take longer to be dispatched and this can be a cause of increased jitter.
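The trade-off between the last two paragraphs can be put in rough numbers. The following is a minimal sketch, assuming a fixed 28-octet overhead per packet (a 20-octet IPv4 header plus an 8-octet UDP header) and a 10 Mbps link, both chosen purely for illustration. The function names are hypothetical.

```python
# Per-packet overhead assumed for illustration: IPv4 header (20) + UDP header (8).
HEADER_OCTETS = 28


def carriage_efficiency(packet_octets, header_octets=HEADER_OCTETS):
    """Fraction of the packet that is payload rather than header."""
    if packet_octets <= header_octets:
        raise ValueError("packet is too small to carry any payload")
    return (packet_octets - header_octets) / packet_octets


def serialization_delay_us(packet_octets, link_bps):
    """Time taken to clock one packet onto a link, in microseconds."""
    return packet_octets * 8 / link_bps * 1_000_000


# Compare a minimum-sized Ethernet packet with progressively larger ones.
for size in (64, 576, 1500, 4478):
    efficiency = carriage_efficiency(size)
    delay = serialization_delay_us(size, 10_000_000)
    print(f"{size:5d} octets: efficiency {efficiency:.3f}, {delay:.1f} us on the wire")
```

Under these assumptions, a 64 octet packet is a little over half payload, while a 1,500 octet packet is about 98% payload but occupies the link for more than twenty times as long.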

Some packet network designs, notably ATM, used a constant-sized packet, replicating many of the properties of the time-switched systems. Others deferred the decision over packet size to the next layer up in the protocol stack and supported variable packet sizes.

Ethernet, designed in the mid-1970s, adopted a variable packet size, with supported packet sizes of between 64 and 1,500 octets. FDDI, a fibre ring local network, used a variable packet size of up to 4,478 octets. Frame Relay used a variable packet size of between 46 and 4,470 octets.

The choice of a variable-sized packet allows applications to refine their behaviour. Jitter and delay-sensitive applications, such as digitized voice, may prefer to use a stream of smaller packets to attempt to minimize jitter, while reliable bulk data transfer may choose a larger packet size to increase the carriage efficiency.

The nature of the medium may also have a bearing on this choice. If there is a high Bit Error Rate (BER) probability, then reducing the packet size minimizes the impact of sporadic errors within the data stream, which may increase throughput.

The real issues surface when you impose an overlay end-to-end network transport design on top of these various packet delivery media. What should an Internet router do with a 4,478 octet IP packet received on an FDDI interface when the next hop is an Ethernet segment with a maximum packet size of 1,500 octets?

The answer varies according to the IP version. In IPv4, as long as the DON’T FRAGMENT bit in the IP packet header is clear, the router is permitted to split the payload across several IP packets, fragmenting the packet to match the next hop maximum message size, replicating the IP header in all the fragments (aside from the fragmentation control header fields, of course). The IPv6 behaviour is similar to that of IPv4 when the packet header has the DON’T FRAGMENT field set to “one”. The router is not permitted to fragment the packet, and as it cannot forward the packet onward, an ICMP diagnostic message (containing the leading octets of the to-be-discarded packet) is sent back to the source address in the original packet, and the packet is then dropped.
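To illustrate the IPv4 case, here is a rough sketch of how a router might split the 4,478 octet packet from the FDDI example for a 1,500 octet next hop. It assumes a 20-octet IPv4 header with no options, and the function name `fragment_ipv4` is hypothetical. The one firm constraint it models is that fragment offsets are expressed in units of 8 octets, so every fragment except the last carries a multiple of 8 payload octets.

```python
IPV4_HEADER = 20  # octets, assuming no IP options


def fragment_ipv4(total_length, mtu, header=IPV4_HEADER):
    """Split an IPv4 packet for a smaller next-hop MTU.

    Returns one tuple per fragment:
    (fragment offset in 8-octet units, fragment total length, more-fragments flag).
    """
    if total_length <= mtu:
        return [(0, total_length, False)]
    # Largest payload per fragment, rounded down to a multiple of 8 octets.
    max_data = (mtu - header) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU is too small to carry any payload")
    payload = total_length - header
    fragments = []
    offset = 0
    while offset < payload:
        data = min(max_data, payload - offset)
        more = offset + data < payload
        # Each fragment repeats the IP header in front of its share of the payload.
        fragments.append((offset // 8, header + data, more))
        offset += data
    return fragments


for frag in fragment_ipv4(4478, 1500):
    print(frag)
```

With these assumptions the packet becomes four fragments: three of 1,500 octets and a final one of 38 octets.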

In the case of IPv4 router-fragmentation, nothing more needs to be done. Fragmentation is handled at the IP layer and the reassembled complete packet is delivered to the upper layer transport protocol at the other end. But in the other cases, namely IPv6, and IPv4 when the DON’T FRAGMENT field is set, the path packet size issue needs to be handled at the transport layer.

For TCP, the intended response by the sender is to pass the ICMP diagnostic message to the TCP session and have the session reduce its MSS value to match the reduced size. TCP will then assume control for repairing the data gap caused by the dropped packet and the session should continue.

For UDP, it’s a little trickier. UDP has no “memory” so the received ICMP diagnostic message has no logical delivery point within the local host. Ultimately, this becomes a problem at the application layer, and the application using UDP has to detect packet loss and take into account packet size mismatch as a potential cause in its recovery behaviour. If the application is lucky, the host will lend a hand here and place a host entry in its local forwarding table that records the original destination address and the maximum packet size that can be sent to that address, based on the size field contained in the ICMP ‘packet too big’ diagnostic message.

Why is all this relevant?

Because DNS.

DNS

The DNS is a UDP application, and in the context of the Internet it’s a critical application. Pretty much every transaction across the Internet starts with a name-based rendezvous, and the first step is to resolve the name to an IP address. For this we use the DNS.

The original design of the DNS limited packet payloads using UDP to 512 octets (RFC1035). Interestingly, the motivation behind this appears to have been not to avoid packet fragmentation per se, but to avoid a different packet size issue: the largest packet that an IPv4 host is assured to be able to reassemble is 576 octets (RFC791).

This limitation has some interesting side-effects. For example, this size limitation means that the number of authoritative name servers listed in response to a DNS priming query was limited to 13 entries, as long as you only wanted to know the IPv4 addresses of these servers. A 14th entry would push the DNS response to list the root zone’s name servers over 512 octets in length. From this came the limitation of 13 distinct root name servers for the DNS.

Some things have changed over the intervening years, but interestingly the 512 octet limit to DNS payloads hasn’t, in theory. This is despite the observation that a new informal standard packet MTU size has been adopted by the Internet: these days a 1,500 octet packet stands a strong chance of being passed through the Internet unscathed. But while 1,500 octet packets stand a good chance of making it through, there is a difference between a probabilistic estimate and certainty.

IPv4 actually provides no certainty in this space. While the maximum size of a reassembled IPv4 packet was specified at 576 octets, the minimum unfragmented size was not. The IPv6 specification defines a minimum unfragmented IP packet length of 1,280 octets. That is, any IPv6 packet with a size equal to or less than 1,280 octets will not be fragmented by an IPv6 carriage network, and will be accepted by the intended host.

Why 1,280 octets? It seems like such an arbitrary number. The answer I’ve been given is that 1,280 is what you get when you add 1,024 and 256! This somewhat meaningless piece of maths was intended to ensure that an IPv6 packet could transit an underlying Internet fabric that presumably supported 1,500 octet packets, and also admit the possibility of a number of levels of IP-in-foo encapsulation. Personally, I find this arbitrary choice one of the major design flaws of IPv6 and the cause of much brokenness in the IPv6 Internet!

While the 512 octet limit still applies to the DNS in theory, it’s not uncommon to see larger DNS responses being pushed around the Internet with apparent impunity. This is particularly the case when DNSSEC is added, and the response contains the digital signature as well as the requested data.

To cope with this, the DNS protocol now uses an extension mechanism, EDNS(0), defined in RFC6891, that allows a querier to specify the largest UDP response it is willing to receive. If this number exceeds 1,500 octets (and it is commonly set to 4,096 in many resolvers), then a response that makes use of the larger size is highly likely to be fragmented, and the querier will need to reassemble the IP fragments in order to assemble the DNS response. If the response would be larger than the offered EDNS(0) buffer size, then the response will necessarily be truncated to fit within the specified payload size. If the querier wants the complete answer, it will either need to re-query with a larger EDNS(0) buffer size, or, more commonly, re-query using TCP.
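To make the mechanism concrete, here is a minimal sketch of a DNS query that carries an EDNS(0) OPT pseudo-record, built with nothing more than Python's `struct` module. The helper names (`encode_name`, `build_query`, `is_truncated`) and the fixed query ID are illustrative assumptions, and the sketch only constructs and inspects bytes; it sends nothing. The advertised UDP buffer size travels in the CLASS field of the OPT record, and the "DNSSEC OK" flag sits in what would otherwise be the TTL field.

```python
import struct

TYPE_OPT = 41      # EDNS(0) pseudo-record type
TYPE_DNSKEY = 48
CLASS_IN = 1
DO_BIT = 0x8000    # "DNSSEC OK" flag within the OPT record's flags field
TC_BIT = 0x0200    # "truncated" flag within the DNS header flags


def encode_name(name):
    """Encode a domain name in DNS wire format; '.' is the root."""
    wire = b""
    for label in name.strip(".").split("."):
        if label:
            wire += bytes([len(label)]) + label.encode("ascii")
    return wire + b"\x00"


def build_query(name, qtype, udp_size=4096, dnssec_ok=True, query_id=0x1234):
    """Build a query with one question and one OPT record in the additional section."""
    # Header: ID, flags (plain query), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, CLASS_IN)
    # OPT record: root owner name, TYPE=OPT, CLASS=UDP buffer size,
    # TTL=(extended RCODE, version, flags), RDLENGTH=0.
    flags = DO_BIT if dnssec_ok else 0
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, udp_size, flags, 0)
    return header + question + opt


def is_truncated(response):
    """True if the TC bit is set, signalling that a retry over TCP is needed."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_BIT)


query = build_query(".", TYPE_DNSKEY)
print(len(query), query.hex())
```

A querier would send these bytes over UDP and, if `is_truncated` reports that the TC bit is set in the reply, repeat the same question over TCP.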

Again, why is this relevant?

Because DNS, DNSSEC and the forthcoming roll of the KSK of the root zone.

DNS Large Responses

DNS resolvers that perform DNSSEC validation will, from time to time, query one of the root zone name servers for the signed value of the root zone’s DNSKEY records. When there is no key roll happening, the response contains one KSK, one ZSK and one RRSIG signature signed by the KSK. Now that the ZSK is 2,048 bits in size, the total size of this DNS response is 864 octets in length.

During the planned roll of the KSK of the DNS Root zone there will be a period when two KSKs (old and new) are in the root zone at the same time, and the DNSKEY record will be signed by both of these KSKs. The signed response to a query for the root zone’s DNSKEY record will inflate from 864 octets to 1,425 octets at this point in time. In the current plan this will occur on 11 January 2018 and last for 20 days (see slide below).

As far as I am aware, this is the first time a “normal” DNS response from the root servers will exceed 1,232 octets in length, and the IPv6 UDP packet will exceed 1,280 octets in length.
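The relationship between DNS payload size and IP packet size is simple addition, sketched below. The sketch assumes an 8-octet UDP header, a 20-octet IPv4 header without options and a 40-octet IPv6 header without extension headers; the function name `ip_packet_size` is hypothetical.

```python
UDP_HEADER = 8
IP_HEADER = {4: 20, 6: 40}  # assuming no IPv4 options and no IPv6 extension headers


def ip_packet_size(dns_payload, ip_version):
    """Size of the unfragmented IP packet that carries a DNS payload over UDP."""
    return dns_payload + UDP_HEADER + IP_HEADER[ip_version]


# The current DNSKEY response, the 1,232 octet threshold, and the response during the roll.
for payload in (864, 1232, 1425):
    print(payload, ip_packet_size(payload, 4), ip_packet_size(payload, 6))
```

Under these assumptions a 1,232 octet payload exactly fills a 1,280 octet IPv6 packet, while the 1,425 octet response needs a 1,473 octet IPv6 packet and a 1,453 octet IPv4 packet.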

Now in theory this should not present a problem, but theory and practice often tend to diverge.

DNS Root Servers and Large DNS Responses

How will the root servers deliver this response?

These days, with anycast constellations, any question about root server behaviour is not a simple question. Let’s simplify this a bit and ask what we can see of the root servers from one particular vantage point.

By crafting a relatively long query name for a non-existent domain name, we can get a root server to generate a response where the DNS payload is 1,268 octets in length. In this case, the query used EDNS(0) and specified a UDP buffer size of 4,096, and requested DNSSEC signatures to be attached to the response. An IPv6 UDP datagram containing that response is 1,316 octets long and the IPv4 UDP datagram is 1,296 octets long. What we see from each root server is shown in Table 1.

Root	IPv4			IPv6		
	Truncate	Fragment	TCP MSS	Truncate	Fragment	TCP MSS
A	N	N	1460	1280	N	1440
B	1280	N	1460	1280	N	1440
C	N	N	1460	N	N	1440
D	N	N	1460	N	N	1440
E	N	N	1460	N	N	1440
F	N	N	1460	N	1280	1440
G	1280	N	1460	1280	N	1440
H	N	N	1460	N	N	1440
I	N	N	1460	N	N	1440
J	N	N	1460	1280	N	1440
K	N	N	1460	N	N	1440
L	N	N	1460	N	N	1440
M	N	N	1460	N	1280	1440

Table 1 – Root Server Response Profile to a large DNS response

Table 1 shows that in IPv4, 11 of the 13 root servers send the 1,296 octet UDP response packet directly to the querier. The other two root servers, B and G, elect to respond with a shortened (truncated) response.

Some experimentation with varying response lengths shows that this truncation occurs when the DNS response is 1,252 octets or larger. The querier is implicitly directed to retry using TCP through this response. This would tend to suggest that the root server is attempting to limit its responses to be no more than 1,280 octets in length, even in the case of IPv4 responses.

However, on both B and G, a TCP session offers an MSS of 1,460, indicating that both root servers appear to have a local MTU setting of 1,500 octets. For IPv4, this is entirely unexpected behaviour, and it is unclear why B and G have chosen to configure their IPv4 behaviour to perform response truncation in this manner, as it seems to be inviting extraneous TCP sessions to be set up.

For IPv6, 7 of the 13 root servers send the 1,316 octet UDP response packet without fragmentation. F and M elect to fragment the response, fragmenting the packet so it fits within a 1,280 octet limit. A, B, G and J elect to truncate the response instead. When opening a TCP session all four root servers offer an MSS of 1,440 octets, indicating that they are using a 1,500 octet MTU for TCP over IPv6. This behaviour seems to be more than a little odd.
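The counts quoted above can be checked mechanically. The sketch below transcribes the UDP columns of Table 1 into a small dictionary and tallies them; the labels `direct`, `truncate` and `fragment` are shorthand introduced here rather than terms used in the table itself.

```python
from collections import Counter

# (IPv4 UDP behaviour, IPv6 UDP behaviour) per root server, transcribed from Table 1.
# "direct" means the large response was sent whole, neither truncated nor fragmented.
TABLE_1 = {
    "A": ("direct", "truncate"),
    "B": ("truncate", "truncate"),
    "C": ("direct", "direct"),
    "D": ("direct", "direct"),
    "E": ("direct", "direct"),
    "F": ("direct", "fragment"),
    "G": ("truncate", "truncate"),
    "H": ("direct", "direct"),
    "I": ("direct", "direct"),
    "J": ("direct", "truncate"),
    "K": ("direct", "direct"),
    "L": ("direct", "direct"),
    "M": ("direct", "fragment"),
}


def tally(column):
    """Count behaviours in one column: 0 for IPv4, 1 for IPv6."""
    return Counter(row[column] for row in TABLE_1.values())


print("IPv4:", dict(tally(0)))
print("IPv6:", dict(tally(1)))
```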

The aim of the root servers is to maximize the likelihood that the response will be received by the recursive resolver, and, in so doing, they need to chart a careful course between the various operational pitfalls that we are aware of.

Let’s look at UDP first.

In IPv4 there is a problem with firewalls allowing fragments through, and some recursive resolvers live behind firewalls that discard trailing fragments of a fragmented packet. For this reason, it makes a lot of sense to use a 1,500 octet value for the maximum IP packet size for UDP over IPv4, so as to avoid gratuitously fragmenting outbound IPv4 UDP packets at the source.

A similar line of reasoning holds in IPv6, but the problems with fragmentation are now twofold. Not only are firewalls prone to discard trailing IPv6 fragments, but certain routers are prone to discard all fragmented IPv6 packets. This is due to the use of an extension header in IPv6 to carry the IP fragmentation control fields. Some commonly deployed routers discard all IPv6 packets that contain IPv6 extension headers, including fragmentation extension headers. This has been observed to affect recursive resolvers. Some 30% of users that sit behind IPv6-capable resolvers use resolvers that are seen to be unable to receive a fragmented IPv6 packet. The F and M root servers fragment the IPv6 packet as if the server used a 1,280 octet MTU. This is not optimal behaviour in the light of this packet mishandling.
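As a rough illustration of what the fragmenting servers put on the wire, the sketch below splits the 1,316 octet IPv6 response at a 1,280 octet MTU and packs the 8-octet Fragment extension header that each piece then carries. It assumes a plain 40-octet IPv6 header with no other extension headers; the function names and the identification value are hypothetical.

```python
import struct

IPV6_HEADER = 40
FRAGMENT_HEADER = 8
NEXT_HEADER_UDP = 17


def fragment_header(next_header, offset_octets, more, ident):
    """Pack the 8-octet IPv6 Fragment extension header."""
    # The offset is carried in 8-octet units in the top 13 bits; the lowest bit is the M flag.
    offset_and_flags = ((offset_octets // 8) << 3) | int(more)
    return struct.pack("!BBHI", next_header, 0, offset_and_flags, ident)


def fragment_ipv6(packet_octets, mtu=1280):
    """Return the on-the-wire size of each fragment of an IPv6 packet."""
    if packet_octets <= mtu:
        return [packet_octets]
    data = packet_octets - IPV6_HEADER
    # Every fragment but the last carries a multiple of 8 octets of data.
    per_fragment = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while data > 0:
        chunk = min(per_fragment, data)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + chunk)
        data -= chunk
    return sizes


print(fragment_ipv6(1316))
print(fragment_header(NEXT_HEADER_UDP, 0, True, 1).hex())
```

Under these assumptions the response leaves the server as a 1,280 octet leading fragment and a 92 octet trailing fragment, and both carry the extension header that some routers and firewalls refuse to pass.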

The other option instead of fragmentation is to perform response truncation.

In IPv4, the B and G servers do not deliver a large UDP response, even when the query specifies a large UDP buffer size. Instead, the server truncates the response so as not to deliver a UDP datagram larger than 1,280 octets.

In IPv6, A and J join B and G in truncating the IPv6 response as if there were a local MTU size of 1,280 octets. The intention here is to push the client resolver into re-issuing the query over TCP. So, how does TCP work with the root servers?

Firstly, TCP is not a viable option for all resolvers. In fact, previous measurements have shown that 17% of resolvers that query authoritative name servers appear to be unable to perform a query using TCP. This inability of the resolver to perform a TCP query is either due to some local resolver configuration, or an overly zealous firewall front end that assumes that the DNS is exclusively a UDP-based protocol. The result is that just under 3% of clients are affected by this and cannot resolve a name where the UDP response is truncated. So, TCP has its problems for the DNS.

What happens when the resolver is capable of performing a TCP DNS query?

In IPv4, all the root servers offer a TCP MSS of 1,460 octets. This is indicative of a local MTU setting of 1,500 octets in IPv4 for TCP. This appears to be a robust choice.

In IPv6, all the root servers offer a TCP MSS of 1,440 octets. Again, this is indicative of a local MTU setting of 1,500 octets. However, in this case, I would offer the view that this is a sub-optimal local configuration. The problem lies when the path includes some form of Path MTU Black Hole.

In IPv6, sending a TCP packet that is too large for the path results in an ICMP6 Packet Too Big (PTB) message being sent back to the host. If the host receives the diagnostic message, then it is in a position to drop its session maximum segment size and resend the packet. If the ICMP6 PTB message is lost or filtered before it reaches the original packet’s sender, then the TCP session is wedged and cannot proceed.

The conservative workaround for this is to avoid the ‘packet too big’ situation altogether. If the sender were to set the TCP MSS for IPv6 down to 1,220 octets, then no TCP packet would be larger than 1,280 octets and the packet would not require fragmentation (at least if everyone honours the 1,280 MTU limit in the IPv6 transit networks). For the response sizes we are talking about for the root servers, the marginal speed increase in raising the MSS from 1,220 to 1,440 is negligible, while the consequent risk of Path MTU blackholing is significant.
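The MSS values quoted here follow directly from the MTU, as the sketch below shows. It assumes 20-octet TCP and IPv4 headers and a 40-octet IPv6 header with no options, and it counts the 2-octet length prefix that DNS uses over TCP. The function names are hypothetical.

```python
TCP_HEADER = 20             # assuming no TCP options
IP_HEADER = {4: 20, 6: 40}  # assuming no IPv4 options and no IPv6 extension headers
DNS_TCP_LENGTH_PREFIX = 2   # DNS messages over TCP are preceded by a 2-octet length


def mss_for_mtu(mtu, ip_version):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IP_HEADER[ip_version] - TCP_HEADER


def segments_needed(dns_response_octets, mss):
    """Number of full-sized TCP segments needed to carry one DNS response."""
    total = dns_response_octets + DNS_TCP_LENGTH_PREFIX
    return -(-total // mss)  # ceiling division


print(mss_for_mtu(1500, 4), mss_for_mtu(1500, 6), mss_for_mtu(1280, 6))
print(segments_needed(1425, 1440), segments_needed(1425, 1220))
```

Under these assumptions the 1,425 octet DNSKEY response takes one segment at an MSS of 1,440 and two at an MSS of 1,220, which is the scale of the cost being traded against the risk of a wedged session. On platforms that expose it, the `TCP_MAXSEG` socket option is one way an application could request the lower value, though that is an aside rather than something these measurements reveal.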

Scoring the Root Servers

How can we score the actions of each root server when dealing with a response that’s larger than 1,280 octets?

If the IPv4 UDP packet is sent without fragmentation for packets up to 1,500 octets in size, then let’s give the server a star.
If the offered IPv4 TCP MSS value is 1,460 octets, then let’s give the server another star.
If the IPv6 UDP packet is sent without fragmentation for packets up to 1,500 octets in size, then let’s give the server a star.
If the IPv6 UDP packet is sent without truncation for IPv6 packet sizes up to 1,500 octets, then let’s give the server a star.
If the offered IPv6 TCP MSS value is no larger than 1,220 octets, then let’s give the server another star.

How do the root servers fare on this five-star rating system? Again, I should repeat that these are the results from a test performed from just one vantage point of the Internet. It could be that different anycast instances of these root servers have different behaviour. However, such variation in behaviour in an anycast situation would make some tasks, particularly those related to diagnosing failure, far worse. Therefore, let’s assume that the sane thing is going on here and all anycast instances are essentially the same in this respect.

A	IPv4 is good	Could do better	⭐⭐⭐
	IPv6 truncates UDP at 1280 and uses a TCP MSS of 1440

B	IPv4 truncates UDP at 1280	Epic fail!
	IPv6 truncates UDP at 1280 and uses a TCP MSS of 1440

C	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

D	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

E	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

F	IPv4 is good	Pretty ordinary	⭐⭐
	IPv6 fragments UDP at 1280 and uses a TCP MSS of 1440

G	IPv4 truncates UDP at 1280	Epic fail!
	IPv6 truncates UDP at 1280 and uses a TCP MSS of 1440

H	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

I	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

J	IPv4 is good	Could do better	⭐⭐⭐
	IPv6 truncates UDP at 1280 and uses a TCP MSS of 1440

K	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

L	IPv4 is good	Almost there	⭐⭐⭐⭐
	IPv6 UDP uses a 1500 MTU and uses a TCP MSS of 1440

M	IPv4 is good	Could do better	⭐⭐⭐
	IPv6 fragments UDP at 1280 and uses a TCP MSS of 1440

Table 2 – Rating the Root Servers

By this metric, the average for the entire root server system is 3 out of 5 stars, which is passable, but not exactly inspiring.

If we split out IPv4 and IPv6, the average IPv4 score is 1.7 out of 2 stars, whereas the average IPv6 score for the root server system is 1.5 out of 3 stars.

This is a somewhat disappointing outcome. We are talking about the servers for the DNS root zone, and when a DNSSEC-validating recursive resolver cannot prime its local state with the ZSK state of the root zone through DNS queries, then the resolver simply cannot function. So, in some sense, failure is not an option here, yet the settings we see in the DNS root zone’s servers, particularly for IPv6, elevate the odds of encountering failure when the response being managed is one that sits in that twilight zone between 1,280 and 1,500 octets in length. And in a little over a year from now, that’s exactly what will be happening in the root zone of the DNS.

However, the real question is what this behaviour implies for users. How many users will be stranded from the entire DNS root name system for the period when a large response is an intrinsic part of anchoring a validating DNS resolver into the global DNS? Earlier work has suggested that this count should be a relatively small number, but perhaps we should revisit this result in the light of this additional information about exactly how root servers behave when sending large DNS responses.

But that’s best left as a question for another day and another article.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.

5 Comments

Yoshiki Ishida November 15, 2016 at 6:24 pm

In Table 2, the M root is missing.

1. Yoshiki Ishida November 16, 2016 at 5:01 pm
  
  Fixed. Great!
  
Howard November 17, 2016 at 2:58 am

The IPv6 TCP MSS on H has been changed to 1220.

Hugo Salgado November 17, 2016 at 4:49 am

At the end of the first paragraph in “DNS Large Responses”, the ZSK size is 2,048 *bits*, not octets!

Maciej Żenczykowski September 11, 2018 at 2:35 pm

I disagree with how you score things.

For something as critical as DNS you want both performance and reliability.

By having most of the root servers operate in a highly performant way, while some in a more resilient way, you can get the best of both worlds.

Remember: for DNS resolution to work, only *one* of the 13 root servers has to be capable of getting a response through to you.

Certain clients will fail to retry on TCP (or won’t be able to get a tcp connection established), certain clients will fail to receive fragments, etc… you want some diversity.

Furthermore if the UDP response never reaches you (because it was dropped, because of being too large, or because it was fragmented), most resolvers won’t even try to retry on TCP and instead just assume the DNS server is dead.

So, what’s highly performant and highly resilient?

IPv4 UDP:
– performant: DF clear, frag at 1500 mtu
– reliable: DF clear, frag at 1460 (enough so that you can handle ipv4-in-ipv{4,6} or ipv4-gre or pppoe encapsulation) or even 1280 mtu, or truncate and ask for tcp retry

my recommendation: 10 servers should frag at 1460, 2 should frag at 1280 (1280 is rapidly becoming a standard even for ipv4, it’s also just bloody likely to work), 1 should send a truncated <=576 byte response (this might be going too far, maybe truncate to <=1280 instead, there's probably no real 576s left in the internet anyway)

Obviously packets should always have DF clear, it's better to let the internet frag as needed: the DNS server won't know how to handle any error it gets back anyway.

IPv4 TCP:
– performant: xmit with DF set (or maybe clear?), advmss 1460
– reliable: xmit with DF clear, advmss 1420 or even 1240

(DF clear is 'mtu lock' in routing table on linux)

my recommendation would be for 10 servers to xmit with DF clear and advmss 1420, and 3 with DF clear and advmss 1240

Why DF clear? These tcp transactions are short, doing pmtud isn't worth it, and the DNS servers don't want to remember the extra route state anyway.

IPv6 UDP:
– performant: frag at 1500 mtu
– reliable: frag at 1280 mtu or truncate

my recommendation would be for 10 servers to frag at 1280 and 3 to truncate to 1280

IPv6 TCP:
– performant: advmss 1440
– reliable: advmss 1220

my recommendation: all servers should do mss 1220

The above is give-or-take optimized for every client being able to get a response from at least one of the servers.
(now whether it's worth doing, seeing as the current system works… is an altogether entirely new question)
