Lessons learned when building my DNS resolver

By on 12 Jun 2024

Category: Tech matters

Tags: , , ,

Blog home

I learned about the ‘Implement DNS in a weekend’ project by Julia Evans long ago and I always wanted to give it a try.

I mean, we all know something about the DNS, at least at a high level, right? Yeah, right, probably. So I immediately delved into the implementation after checking the first few pages of it. Looking back, this was a mistake (and I took almost a week to build from scratch).

Anyway, this post is about a guy who never did low-level network programming, trying to build his own DNS resolver with the language he was using on and off, occasionally.

Some notes before we start:

  • It’s highly recommended to read ‘Implement DNS in a weekend’ then RFC 1035 and other resources if you are interested.
  • All diagrams, if not mentioned specifically, are borrowed from RFC 1035.
  • You can find my code at my GitHub repo. Keep in mind it’s merely a toy resolver with incomplete support of standards.
  • There are tons of DNS-related RFCs. I have a huge respect for whoever implemented these.

Prerequisite knowledge

  • An octet is a byte.
  • For a UDP message, the message limit is 512 octets at most.
  • A domain (for example, ‘blog.wtcx.dev‘) should be less or equal to 255 octets. For the way how length is calculated, see ‘Domain representation and message compression’.
  • A label, aka the components of the domain (for example, ‘blog‘, ‘wtcx‘, or ‘dev‘), is less or equal to 63 octets.

Big-endian, little-endian

2.3.2. Data Transmission Order

…Whenever an octet represents a numeric quantity, the left most bit in the diagram is the high order or most significant bit.

…Similarly, whenever a multi-octet field represents a numeric quantity the left most bit of the whole field is the most significant bit. When a multi-octet quantity is transmitted the most significant octet is transmitted first.

RFC 1035

In terms of computer networking, data is mostly transferred in big-endian order. This means that if the data consists of multiple octets, for example, the TTL (unsigned 32-bit) in a Resource Record (RR), we have to ensure the most significant byte (MSB) is the leftmost one.

Let’s see how this works in action:

fn main() {
    let n: u32 = 1;

    let n_ne = n.to_ne_bytes();
    let n_be = n.to_be_bytes();
    let n_le = n.to_le_bytes();
    print!(
        "n (native byte order): {:?}\nbig-endian: {:?}\nlittle-endian: {:?}\n",
        n_ne, n_be, n_le
    )
}

Output:

n (native byte order): [1, 0, 0, 0]
big-endian: [0, 0, 0, 1]
little-endian: [1, 0, 0, 0]

If we want to construct a u32 from that response (in a buffer of [u8]):

fn main() {
    // let's say we have a TTL of 1 second
    let buf: [u8; 4] = [0, 0, 0, 1];

    let n_ne = u32::from_ne_bytes(buf.try_into().unwrap());
    let n_be = u32::from_be_bytes(buf.try_into().unwrap());
    let n_le = u32::from_le_bytes(buf.try_into().unwrap());
    print!(
        "n (native byte order): {:?}\nbig-endian: {:?}\nlittle-endian: {:?}\n",
        n_ne, n_be, n_le
    )
}

Output:

n (native byte order): 16777216
big-endian: 1
little-endian: 16777216

Huge difference, right? So make sure you are building packets with .to_be_bytes() and reading with .from_be_bytes().

Message format

Before we can process the DNS message, we need to know the format.

4.1. Format

+---------------------+
|        Header       |
+---------------------+
|       Question      | the question for the name server
+---------------------+
|        Answer       | RRs answering the question
+---------------------+
|      Authority      | RRs pointing toward an authority
+---------------------+
|      Additional     | RRs holding additional information
+---------------------+

For a query, we can only construct Header and Question sections. When we receive a response, it will contain both of these fields as well, followed by optional RRs for Answer, Authority, and Additional.

Header

The first part of the message is Header, which has a fixed length of 12 octets so it is relatively easy to parse.

4.1.1. Header section format


                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
FieldLength (Bits)Comment
ID16Used to validate the question and response.
QR10 for query, 1 for response.
OPCODE40 for standard query, refer to RFC for others.
AA11 = authoritative answer.
TC11 = message is truncated, refer to RFC for others.
RD11 = recursion desired, refer to RFC for others.
Z3Reserved for future use.
RA11 = recursion available, refer to RFC for others.
RCODE4Response code. 0 = no error, 3 = NXDOMAIN, check the RFC for the rest.
QDCOUNT16How many entries in Question section (after Header).
ANCOUNT16Record count in Answer section.
NSCOUNT16Record count in Authority section.
ARCOUNT16Record count in Additional section.
Table 1 — Header fields.

This is the structure (struct) I used for Header:

pub struct MessageHeader {
    id: u16,
    flags: u16,
    qd_count: u16,
    an_count: u16,
    ns_count: u16,
    ar_count: u16,
}

It’s also easy to build a header for a standard query — we can just change ID and QDCOUNT, leave the other fields to 0 and call it a day.

Extract flags in Header section

We learned that different flags in the Header take different lengths, but how do we parse it from a response? Say we want to check the RCODE (response code) field to see whether the response is successful; the RCODE takes the last four bits of the flags, and since we already parsed the whole flags as a u16, we need to AND it with decimal 15:

// some made up flags
let flags: u16 = 3; // 0b0000000000000011
let rcode = flags & 0b0000000000001111;
// or
let rcode = flags & 0x000F;

Let’s see it in binary form:

      0b0000000000000011
AND   0b0000000000001111

After this operation, we will get the last four bits, which is the RCODE we wanted. And when RCODE is 3, it’s the NXDOMAIN that you know and probably don’t love.

Question

The Question section is the query’s payload. Parsing becomes slightly trickier because the length is not fixed.

4.1.2. Question section format

                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                     QNAME                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     QTYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     QCLASS                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
FieldLength (Bits)Comment
QNAMEVariousQNAME contains label sequence; no one knows how long this will be. But QNAME ends with a 0 octet so just keep parsing until you see it. This field is likely to be compressed.
QTYPE16QTYPE is the type you want to query (for example, 1 for A record), and is the superset of TYPE. See RFC 1035 Section 3.2.3.
QCLASS16QCLASS is the class you want to query (mostly just 1 for IN, aka the internet). See RFC 1035 Section 3.2.5.
Table 2 — The question section.

The struct:

pub struct MessageQuestion {
    domain: String,
    q_type: RecordType,
    q_class: RecordClass,
}

Where RecordType and RecordClass are just enums mapping u16 to corresponding variants.

I thought for a while about whether I should use String for the domain. It does make sense since we don’t know how long it will be, and it’s way easier to perform common operations on a String than &[u8]. The cost is an additional allocation, though. But I guess performance is not the problem I need to deal with at the moment so String it is.

Resource Record (RR)

The last three sections are Answer, Authority, and Additional. All three of them share the same format:

4.1.3. Resource record format

                                1  1  1  1  1  1
  0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                                               /
/                      NAME                     /
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     CLASS                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TTL                      |
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                   RDLENGTH                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
/                     RDATA                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
FieldLength (Bits)Comment
NAMEVariousA domain name, like QNAME in Question.
TYPE16 (RFC 1035 didn’t specify whether it’s a signed or unsigned 16-bit. I am using u16.)Type of the record (for example, 1 for A record). See RFC 1035 Section 3.2.2.
CLASS16 (RFC 1035 didn’t specify whether it’s a signed or unsigned 16-bit. I am using u16.)Class of the record (mostly just 1 for IN, aka the internet). See RFC 1035 Section 3.2.4.
TTLUnsigned 32The content varies in the different TYPE of resource record. See RFC 1035 Section 3.3. Standard RRs.
RDLENGTHUnsigned 16How long the RDATA field is.
RDATAVariousThe content varies in different TYPEs of resource records. See RFC 1035 Section 3.3. Standard RRs.
Table 3 — Resource records.

The struct:

pub struct ResourceRecord {
    pub name: String,
    pub r_type: RecordType,
    pub r_class: RecordClass,
    pub ttl: u32,
    pub rd_length: u16,
    pub r_data: RecordData,
}

Domain representation and message compression

Every time you see <domain-name> in the RFC, such as QNAME in the Question section or the RDATA field for RRs, it can be compressed to save the payload size, without repeating the data that has been transferred before.

You don’t necessarily need to compress the domain name before sending the query, but the response will likely have a compressed domain.

Uncompressed domain representation

If we want to query the address of ‘blog.wtcx.dev‘, we will encode the domain into a series of ‘length octet + label’ format.

blog.wtcx.dev‘ consists of three labels, ‘blog‘, ‘wtcx‘, and ‘dev‘. We will repeatedly use 1 octet to indicate how long the following label is, then follow with the actual label. Finally, we end the whole domain with a 0 octet.

Remember there is a 63 octets length limit for each label?

3.1. Name space definitions

…The high order two bits of every length octet must be zero, and the remaining six bits of the length field limit the label to 63 octets or less.

RFC 1035

That is, for every length octet, we can only use the lower 6 bits, which is 0b00111111 (63 in decimal) at most.

The total domain length is calculated by combining all <label_length><label> pairs with the ending octet and don’t have the dots (‘.‘) we see in a human-readable format. Therefore, the domain representation for ‘blog.wtcx.dev‘ is something like 4blog4wtcx3dev0, which is 15 octets long. And we can have at most 255 octets for a domain.

Compression format

Let’s borrow the diagram from RFC again:

4.1.4. Message compression

   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
20 |           1           |           F           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
22 |           3           |           I           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
24 |           S           |           I           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
26 |           4           |           A           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
28 |           R           |           P           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
30 |           A           |           0           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
40 |           3           |           F           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
42 |           O           |           O           |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
44 | 1  1|                20                       |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
64 | 1  1|                26                       |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
92 |           0           |                       |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

From the 20th (1) to the 31st octet (0), that’s precisely the uncompressed domain representation above.

The range starting from 40th (3) to 43rd (O, English letter, not zero) shares the same logic as the uncompressed format. However, unlike the uncompressed format, the higher two bits of the 44th octet are 11.

Remember the limit where a label can only be 63-bits long? That limit leaves the higher two bits for another purpose.

  • When you see a length octet like 0b00111111, the first two bits are 00, which means it’s an uncompressed, normal label.
  • When the first two bits are 11, like you saw on the 44th octet above, it’s in compression format.
  • The first two bits, 10 or 01, are reserved for future use. I didn’t look into that since it’s not required at this moment.

Check the domain name compression and where the pointer points to

We know a compressed domain consists of a 2-octet group. When the higher 2 bits is 11, the following domain is compressed and the position we need to ‘jump to’ is calculated by the lower 14 bits.

So how do we check whether it’s compressed, and where the position is? Luckily, we can reuse the AND operation in the flag example. Normally, I parse the domain name octet by octet because that works well for both uncompressed and compressed domains. For example, if we are parsing the length octet (u8) and see 0b11000010, we check the first two bits by ANDing a u8 0b11000000 (or a 0xC0 in hex); that way everything other than the first 2 bits will be set to 0.

      0b11000010
AND   0b11000000

And we have the result — 0b11000000.

The actual code is self-explanatory as well:

let len = buf[curr_pos] as usize;
// 0xC0 = 0b11000000
// Check whether the first two bits are 11.
let is_compressed = len & 0xC0 == 0xC0;

If it’s not compressed, we continue to parse octet by octet until we see a 0 octet. Otherwise, it’s compressed, and we have to combine the remaining six bits with the following eight bits to calculate the pointer.

Let’s imagine the two octets look like this:

      0b1100000000010100

Combine the two octets with:

u16::from_be_bytes([buf[curr_pos], buf[curr_pos + 1]])

And then we get 0b1100000000010100.

To set the first two bits to 0, we will AND the value with 0b0011111111111111 (or 0x3FFF in hex).

      0b1100000000010100
AND   0b0011111111111111
let offset = (u16::from_be_bytes([buf[curr_pos], buf[curr_pos + 1]]) & 0x3FFF) as usize;

In this case, we get 20. So we can ‘jump’ to the 20th octet and continue to parse the domain. If you are not sure where we were before the ‘jump’, check the 44th octet of the compression format diagram.

You can check my (and ChatGPT’s) attempt on domain name parsing on GitHub.

Binding 127.0.0.1 won’t work

When sending a DNS query message, we first need to bind a UDP socket on the host and then send packets with it.

I don’t want to bind on every interface, maybe just 127.0.0.1 with a random port. So:

// ...
let socket = UdpSocket::bind("127.0.0.1:0").map_err(Error::NetworkError)?;
println!("socket binding OK!");
let bytes_sent = socket
    .send_to(&query.to_query_bytes(), addr)
    .map_err(Error::NetworkError)?;
// ...

And then:

Looking up blog.wtcx.dev
socket binding OK!
thread 'main' panicked at src/main.rs:6:47:
called `Result::unwrap()` on an `Err` value: NetworkError(Os { code: 49, kind: AddrNotAvailable, message: "Can't assign requested address" })

Seems like we can bind to 127.0.0.1 without problems. But we can’t send packets to a remote nameserver — why?

After Googling a bit, the answer is actually in the UdpSocket::bind documentation all along:

…Note that bind declares the scope of your network connection. You can only receive datagrams from and send datagrams to participants in that view of the network. For instance, binding to a loopback address as in the example above will prevent you from sending datagrams to another device in your local network.

UdpSocket::bind

Even though the nameservers we are about to query are not ‘another device in your local network’, you get the idea. The remote peer is not the host that the resolver program is running on, therefore binding on 127.0.0.1 just won’t do.

We can verify this by changing the remote address to an imaginary local nameserver at 127.0.0.1:53, and the packets will be sent without any issues!

let bytes_sent = socket
    // WARNING: Of course we don't want to do this. This is just to check that loopback addr <-> loopback addr is possible.
    .send_to(&query.to_query_bytes(), "127.0.0.1:53")
    .map_err(Error::NetworkError)?;

Since a DNS resolver primarily queries remote nameservers (if the response is not cached), let’s simply bind to 0.0.0.0:0 and be done with it.

// ...
let socket = UdpSocket::bind("0.0.0.0:0").map_err(Error::NetworkError)?;
println!("socket binding OK!");
let bytes_sent = socket
    .send_to(&query.to_query_bytes(), addr)
    .map_err(Error::NetworkError)?;
// ...

Unexpected IPv6 ambush

Just when I thought that would be pretty much it and started my test, along came the twist.

My resolver crashed during the parsing of the RR. It complained about an unknown record type 28. I went back to search the RFC and couldn’t find type 28 anywhere. After checking with ChatGPT I then learned that’s in RFC 3596, the IPv6 extension with the new record type AAAA.

So, even though I thought I wouldn’t have to deal with IPv6 anytime soon, it quickly became relevant. That’s fair, considering we’re not living in the 1980s. The good news is, all I need to do is update the RecordType enum and add a case to the match expression.

Improvements worth considering

Because this is just a toy DNS resolver, it doesn’t contain extensive tests and doesn’t have most of the standards supported. There are still things I want to do in the future (if possible).

Caching and negative caching

The RFC specifically states that the resolver should cache the result whenever possible (while respecting the TTL). This way, we don’t constantly bother the root nameservers and the authoritative nameservers.

IPv6

This is obvious, right? Supporting IPv6 is probably one of the most important things for a resolver.

More protocols

There are many protocols that can be used for DNS queries other than UDP nowadays — TCP, DNS over TLS, or DNS over HTTPS. I don’t know the difficulty yet, but let’s put it on the wish list.

Async operation

I don’t want to use the word ‘obvious’ again but you get the idea. A resolver with synchronous operation probably only works for a toy CLI program.

Internationalized domain names (IDN)

Internationalized domain names like 中文.tw may not be very popular but may be fun to implement.

Final thoughts

I mentioned that I always wanted to create my own DNS resolver, and I finally did. However, implementing it without looking at how others have done it can be both a good and a bad thing.

The good thing is that I really delved into the RFC and tried to understand my progress. It was eye-opening to see how everything underneath really works. Seeing the resolver function properly felt like the first time you dealt with some web APIs and saw them work just as the documentation mentioned, but at a much lower level. Of course, it has its own caveats to be addressed, but it’s interesting in its own way. I would recommend this to anyone who’s interested!

However, I ended up spending way more time than I expected. Most of the time, I found myself repeatedly reviewing the same paragraphs, trying to grasp the meaning of the document. In hindsight, I could have saved time by referring to examples (or even asking AI for clarification). There are tons of RFCs available, but I only focused on RFC 1035 to create a minimum working Proof of Concept (PoC).

I also struggled in several places for a while like bitwise operations and finding out what is TYPE 28. These look so obvious in hindsight but really took me some time to figure out before I turned to ChatGPT.

At work, I’m mostly involved in high-level programming tasks, such as integrating RESTful APIs and handling regular CRUD operations. As an infrastructure engineer, I have plenty of debugging tools readily available. Rarely do I get the chance to build a tool that works on a low-level because, quite frankly, there’s just no need for it — just like everyone else.

It was a fun experience and a good project to practice Rust (unlike the abandoned advent of code).

Below are some observations during the making of a DNS resolver:

  • There are more length checks than I thought (I guess it’s still not enough).
  • A malformed pointer in compression format can severely disrupt the resolver, potentially causing it to become infinitely recursive. Therefore, the resolver must be diligent and not naive when handling such situations.
  • I am not a security expert. But when I saw I got a ~500 bytes response from a ~30 bytes query, that really gave me a concept of how dangerous a DNS amplification attack can be!
  • DNS caching is very important (especially… in Kubernetes, right?)
Rate this article

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.

Leave a Reply

Your email address will not be published. Required fields are marked *

Top