This is the first of what will be a series of blog posts on our experiences with eXpress Data Path (XDP), an exciting new technology in networking.
Network programming using eXpress Data Path (XDP) has been on our radar at NLnet Labs (and if you are reading this, possibly on yours too!) for a while now. As tooling around this technology has vastly improved, we decided that it was time to finally get our hands dirty and see what this technology is all about.
As part of SURFnet’s Research on Networks program, we will be investigating how XDP can be used to improve the things that we care about at NLnet Labs: open and reliable solutions for core protocols in our Internet. More specifically for this research, anything DNS.
We will experiment with how we can leverage the power of XDP to improve the performance of resolvers and increase the versatility of name servers, as well as to perform low-level measurements on high-speed links. Where will we end up exactly? Nobody knows, but we plan to share our experiences with this exciting new technology with you as we go. We have a lot ahead of us, but we sure did enjoy our first encounters!
XDP in a nutshell
Let’s take a step back first. To understand what XDP is and where it comes into play, we need to understand what (e)BPF is.
eBPF stands for extended Berkeley Packet Filter. It was inspired by what we now call classic BPF, even though it can do much more than filter packets. eBPF is a (simple) virtual machine running in kernel space, enabling the execution of eBPF programs. These programs are not necessarily networking related, which perhaps makes the name of this technology a bit misleading. They can be used to trace and get insights into any system call or kernel event!
Most people just say or write ‘BPF’ when referring to eBPF, and so will we for the remainder of this post. (The original, classic BPF is commonly abbreviated to cBPF.)
XDP is a hook in the network driver that enables you to execute BPF code right there, in the network driver. The code is executed before an incoming packet is passed on to the networking stack in the kernel, and thus before the socket buffer comes into play. This provides flexibility even at very high packet rates: you simply attach your BPF program to the network interface, without even requiring a reboot.
Note that even though XDP is a part of BPF focused on networking, it can still do much more than only filter packets. As we will see, it can be used to modify packets, send them out again directly, pass them on to the network stack, and more.
As always, with such great power comes the necessary responsibility.
We cannot ‘just execute arbitrary code inside the kernel’ if we also want system uptimes of more than a handful of minutes. To ensure BPF code does not crash the kernel and thus the entire system, the program is verified upon loading it. The verifier makes sure the code actually terminates in finite time, and that no illegal memory access occurs. We’ll see that writing valid XDP programs has its challenges and sometimes requires certain creativity.
Some fun first
To start, we conducted some simple exercises — perhaps not of any use in the real world, but we needed to find out what the limits of XDP are (both qualitatively and quantitatively) in the area we want to apply it. Note that the code snippets in this part are not full programs (you can find them in our Git repository).
First, we wanted to know whether we can use XDP to properly refuse (that is, send a response with RCODE 5) any incoming DNS query. The answer is that we can, and while simple in nature, the resulting program touches most of the core concepts and features of XDP programs. Moreover, it includes actions common to most programs that process and modify packets, such as truncation and updating checksums.
Let’s go through a few iterations of our dns-says-no program, working our way towards an implementation that adheres to all the current standards, and examine some of those XDP concepts and common actions.
Round 1: Swap, set, send
The first iteration of dns-says-no does the bare minimum: swapping the source and destination addresses and ports, setting the QR bit to 1 to turn the query into a response, and setting the response code to REFUSED.
Return codes: Looking at the code snippet below, we see that the function returns an int, which represents the destiny of the packet being processed. If the packet should be passed on to the network stack of the operating system, XDP_PASS is returned. In our code, this is used for packets that are not DNS queries (udp_dns_reply() returns a non-zero value). Otherwise, with the IP and MAC addresses swapped, we make sure the packet leaves the interface again by returning XDP_TX.
Note that in that case, all handling for this packet was done in the BPF virtual machine, without involving the OS network stack in any way. Also, no userland code has been executed, because XDP allows us to do all the necessary parsing and modification as well.
Parsing: To find out what type of packet we are dealing with, common frame and packet header structs are used, branching on protocol numbers in switch or if conditions. After the version of IP has been determined, parsing of the UDP and DNS headers follows. We have provided our own struct dnshdr to parse a DNS header in a fashion similar to the header structs provided by the Linux kernel header files.
The verifier needs to keep track of which part of the packet has been parsed and needs to verify that we stayed within the packet boundaries. We have learned from experience that the verifier has an easier job when parsing based on earlier choices is handled directly in the branches for those choices. Therefore scanning of the UDP and DNS header is done in the branches where the version of IP was determined with the udp_dns_reply() function. This catering for the verifier is quite peculiar and takes some time and experience to get used to.
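To make the bounds-checking pattern concrete, here is a minimal sketch in plain userspace C rather than actual BPF code. The cursor with its pos and end members follows the one our programs use; the helper name parse_dnshdr() and the exact field layout of struct dnshdr are assumptions for illustration, modelled on the fixed 12-byte header from RFC 1035.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tracks how far we have parsed and where the packet ends. */
struct cursor {
    uint8_t *pos;
    uint8_t *end;
};

/* Fixed 12-byte DNS header; multi-byte fields are in network byte order.
 * This layout is an assumption for illustration (after RFC 1035). */
struct dnshdr {
    uint16_t id;
    uint16_t flags;    /* QR, opcode, AA, TC, RD, RA, Z, RCODE */
    uint16_t qdcount;
    uint16_t ancount;
    uint16_t nscount;
    uint16_t arcount;
};

/* Return a pointer to the header and advance the cursor, or return NULL
 * (leaving the cursor untouched) when fewer than 12 bytes remain. */
static struct dnshdr *parse_dnshdr(struct cursor *c)
{
    struct dnshdr *dns = (struct dnshdr *)c->pos;

    assert(c->pos <= c->end);
    if ((size_t)(c->end - c->pos) < sizeof(*dns))
        return NULL;
    c->pos += sizeof(*dns);
    return dns;
}
```

Every parse step follows the same shape: check that the struct fits before the end of the packet, and only then hand out a pointer and move pos forward. It is this explicit comparison against end that lets the verifier conclude that later accesses through the pointer stay inside the packet.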
Modification: With the structs at hand, we can directly alter the buffer, and thus modify the contents of the packet before it is sent out or passed on to the network stack: source and destination addresses/ports are swapped, and fields in the DNS header are set.
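As a small illustration of the DNS header change, the sketch below turns a query's flags word into that of a REFUSED response. The macro and function names are hypothetical, and the function works on the flags in host byte order, so a real program would convert from and to network byte order around it.

```c
#include <assert.h>
#include <stdint.h>

#define DNS_FLAG_QR       0x8000u  /* 0 = query, 1 = response */
#define DNS_RCODE_MASK    0x000fu  /* the low four bits hold the RCODE */
#define DNS_RCODE_REFUSED 5u

/* Takes and returns the 16-bit flags word in host byte order. */
static uint16_t dns_flags_refuse(uint16_t flags)
{
    flags |= DNS_FLAG_QR;                              /* mark as response */
    flags = (uint16_t)((flags & ~DNS_RCODE_MASK)       /* clear old RCODE  */
                       | DNS_RCODE_REFUSED);           /* set REFUSED (5)  */
    return flags;
}
```

All other bits, such as the opcode and the RD flag, are left exactly as the client sent them.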
The XDP_TX return code will send out the modified packet immediately without any more help from the kernel’s network stack. One of those things the kernel’s network stack would normally help out with is calculating the checksums in the IPv4 and UDP headers. With XDP_TX, we have to do those calculations ourselves.
Network checksums are based on the summation of 16-bit aligned values, and the IPv4 header checksum covers values within the IPv4 header only. The checksum over the IPv4 header therefore stays unchanged: swapping the IP addresses does not change the sum of all the 16-bit values in the header.
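This is easy to check in userspace. The sketch below computes the Internet checksum (RFC 1071) over a 20-byte IPv4 header and swaps the two addresses in place; both function names are made up for this example. Because the source and destination addresses sit at the 16-bit aligned offsets 12 and 16, the swap merely reorders whole 16-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over an even number of bytes. Words are
 * read big-endian, so the result does not depend on the host byte order. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)(data[i] << 8 | data[i + 1]);
    while (sum >> 16)                      /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Swap the 4-byte source (offset 12) and destination (offset 16)
 * addresses of an IPv4 header in place. */
static void swap_ipv4_addrs(uint8_t *iph)
{
    for (int i = 0; i < 4; i++) {
        uint8_t tmp = iph[12 + i];
        iph[12 + i] = iph[16 + i];
        iph[16 + i] = tmp;
    }
}
```

Computing inet_checksum() before and after swap_ipv4_addrs() on the same header yields the same value, since addition is commutative.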
The UDP checksum covers the UDP header including the payload. The payload (that is, the DNS message) did change, although only the 16 bits containing the flags and rcode. We therefore recalculate the UDP checksum from the previous and the new value of these 16 bits with update_checksum().
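An incremental update of this kind can be sketched as follows. The name update_checksum() is taken from our program, but the signature and body here are an assumption: a straightforward rendering of the method from RFC 1624, where the new checksum is ~(~csum + ~old + new) in one's complement arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Incrementally update an Internet checksum after one aligned 16-bit
 * word changed from old_val to new_val (RFC 1624, equation 3). All three
 * values must be in the same byte order. */
static uint16_t update_checksum(uint16_t csum, uint16_t old_val,
                                uint16_t new_val)
{
    uint32_t sum = (uint16_t)~csum;

    sum += (uint16_t)~old_val;             /* subtract the old word */
    sum += new_val;                        /* add the new word      */
    while (sum >> 16)                      /* fold the carries      */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

As a worked example, the words 0x0001, 0xf203, 0xf4f5 and 0xf6f7 from RFC 1071 have checksum 0x220d. Changing the first word to 0x8005 and calling update_checksum(0x220d, 0x0001, 0x8005) gives 0xa208, the same value a full recomputation over the four modified words produces.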
This first version of dns-says-no is pretty self-contained. To generate the BPF module, we do not need much more than the clang compiler and the Linux kernel headers, plus the ip tool from the iproute2 package to load the module. For example, to compile this program and load it on eth0, we do:
$ clang -target bpf -O -c -o xdp_dns_says_no_kern_v1.o xdp_dns_says_no_kern_v1.c
$ sudo ip link set dev eth0 xdpgeneric obj xdp_dns_says_no_kern_v1.o sec xdp-dns-says-no-v1
Sending a query with dig returns:
$ dig @22.214.171.124 will.you.answer.me. A +norec
; <<>> DiG 9.16.1-Ubuntu <<>> @126.96.36.199 will.you.answer.me. A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55544
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: c5ebfcd9e4943c77 (echoed)
;; QUESTION SECTION:
;will.you.answer.me. IN A
;; Query time: 136 msec
;; SERVER: 188.8.131.52#53(184.108.40.206)
;; WHEN: do jul 02 22:00:45 CEST 2020
;; MSG SIZE rcvd: 59
This program has an error: it echoes back the options found in the EDNS(0) OPT PSEUDOSECTION. Our program does not support EDNS(0) options and should not reply with any option. It should not echo back the DNS COOKIE to dig. In version 2 of dns-says-no we will improve on this by returning REFUSED without echoing back all EDNS(0) options.
Round 2: Stripping the cookie
The OPT Resource Record (RR) is not at a fixed position in the DNS message. Variable locations are particularly difficult for the BPF verifier, which requires limited variation to be able to determine that a program will end in finite time. Luckily, the DNS has all kinds of hard limits that save the day.
The OPT RR is the first record in the ‘Additional’ section. A well-formatted DNS query will have exactly one question in the ‘Question’ section, zero RRs in the ‘Answer’ and ‘Authority’ sections, and potentially one OPT RR in the ‘Additional’ section.
To get to the OPT RR, we thus need to parse one Question RR. It contains one query name (qname) of variable length; however, a qname will have at most 128 labels of at most 63 characters each. The function below can be used to move the cursor forward past a domain name in wire format, and thus to skip the qname:
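A minimal sketch of what such a skip function might look like, written as plain userspace C. The name skip_dname() and the choice to reject compression pointers outright are assumptions made for this example; the point is the fixed upper bound on the loop, which is what gives a verifier something finite to reason about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cursor {
    const uint8_t *pos;
    const uint8_t *end;
};

#define MAX_LABELS 128   /* hard upper bound on labels in a domain name */

/* Advance the cursor past one uncompressed domain name in wire format.
 * Returns 0 on success; returns -1 and leaves the cursor untouched when
 * the name is truncated, has too many labels, or has a label longer
 * than 63 bytes (which also covers compression pointers). */
static int skip_dname(struct cursor *c)
{
    const uint8_t *p = c->pos;

    for (int i = 0; i < MAX_LABELS; i++) {
        if (p >= c->end)
            return -1;
        uint8_t len = *p++;
        if (len == 0) {                    /* root label: end of name */
            c->pos = p;
            return 0;
        }
        if (len > 63)
            return -1;
        if ((size_t)(c->end - p) < len)    /* label runs off the packet */
            return -1;
        p += len;
    }
    return -1;
}
```

For the name will.you. (bytes 4 'w' 'i' 'l' 'l' 3 'y' 'o' 'u' 0), the cursor ends up ten bytes further, right at the start of the type field.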
Following the qname is the 16-bit RR type (for example A, AAAA or TXT), followed by the 16-bit class (IN). All RRs other than Question RRs (such as the OPT RR) additionally have a 32-bit TTL field and a 16-bit Resource Record data length (rdata length), followed by that many bytes of Resource Record data (rdata). We introduce two structs to parse Question and regular RRs:
All RRs are preceded by a variable but bounded dname.
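The fixed-size parts described above could be declared roughly as follows. The struct and member names are assumptions for illustration. The second struct is packed because it can start at any byte offset in the message, and because the compiler would otherwise pad it from 10 to 12 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed part of a Question entry; it directly follows the qname.
 * All fields are in network byte order. */
struct dns_qrr {
    uint16_t qtype;
    uint16_t qclass;
};

/* Fixed part of any other RR; it directly follows the owner name and is
 * itself followed by rdata_len bytes of rdata. */
struct dns_rr {
    uint16_t type;
    uint16_t class;
    uint32_t ttl;
    uint16_t rdata_len;
} __attribute__((packed));
```

With these definitions, sizeof(struct dns_qrr) is 4 and sizeof(struct dns_rr) is 10, matching the sizes on the wire.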
The udp_dns_reply() function is adapted to check for proper values for the section RR counters of a well-formatted DNS Query packet, and to skip the cursor over the Question RR:
The next RR should be the OPT RR (since the Answer and Authority sections contain no RRs). The owner name of an OPT RR is always the root label, so a single byte indicating a zero-length label. In theory, we could have reused the parse_dname() function to parse the OPT RR owner name, but this introduces too much variation for the verifier in the current Linux kernel to check. So instead we check for the single zero byte, after which we continue parsing the regular RR struct.
Of course, setting the rdata length of the OPT RR to zero requires an update to the UDP checksum. Note that we need to check the alignment of the modified value. If it does not sit exactly on a 16-bit boundary, we need to update the two 16-bit values overlapping the rdata length value. The setting of the rcode and flags and the swapping of the UDP source and destination ports remain the same as before.
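The alignment issue can be illustrated with a small sketch. It assumes an RFC 1624 style update_checksum() helper (repeated here so the example stands alone) and a hypothetical wrapper csum_replace_at(), where off is the byte offset of the modified field relative to the start of the checksummed data and the values are given as they read on the wire (first byte in the high 8 bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1624 incremental update for one aligned 16-bit word. */
static uint16_t update_checksum(uint16_t csum, uint16_t old_val,
                                uint16_t new_val)
{
    uint32_t sum = (uint16_t)~csum;

    sum += (uint16_t)~old_val;
    sum += new_val;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Update the checksum for a 16-bit field at byte offset `off`. At an odd
 * offset the field straddles two aligned words: its first byte is the
 * low byte of one word, its second byte the high byte of the next. */
static uint16_t csum_replace_at(uint16_t csum, size_t off,
                                uint16_t old_val, uint16_t new_val)
{
    if (off % 2 == 0)
        return update_checksum(csum, old_val, new_val);

    csum = update_checksum(csum, (uint16_t)(old_val >> 8),
                           (uint16_t)(new_val >> 8));
    return update_checksum(csum, (uint16_t)(old_val << 8),
                           (uint16_t)(new_val << 8));
}
```

For example, the bytes 00 01 f2 03 f4 f5 f6 f7 have checksum 0x220d. Zeroing the two bytes at offset 1 (the value 0x01f2) gives the bytes 00 00 00 03 f4 f5 f6 f7 with checksum 0x140f, which is also what csum_replace_at(0x220d, 1, 0x01f2, 0x0000) returns.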
So, does this version behave as desired? Let’s check:
$ dig @220.127.116.11 answer.me.please. A +norec
; <<>> DiG 9.16.1-Ubuntu <<>> @18.104.22.168 answer.me.please. A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 37260
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: Message has 12 extra bytes at end
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;answer.me.please. IN A
;; Query time: 24 msec
;; SERVER: 22.214.171.124#53(126.96.36.199)
;; WHEN: do jul 02 23:59:11 CEST 2020
;; MSG SIZE rcvd: 57
Well… we got rid of the echoed COOKIE, but now we have 12 bytes of excess trailing data at the end. We need to properly truncate our packet.
Finally: Properly truncate the response
To properly truncate a packet within an XDP program, we need to use an external function. The normal C library functions are not available to XDP programs. Note that we used clang compiler builtins for htons() and ntohs(). Similar builtins can be used for htonl() and ntohl(), memset(), memcpy() and memmove() (mem* functions for constant amounts only though). The only functions that are allowed in XDP programs are the functions listed in bpf_helpers.h.
Although the manpage describing the functions in this header file is readily available on Linux systems (man bpf-helpers), the header file with the function definitions is not. It is available via the libbpf git repo, which we included as a submodule in our Git repository. Make sure you have run git submodule update --init to make this header available.
The function from bpf_helpers.h that we need is bpf_xdp_adjust_tail(). However, we are still responsible for updating the IP and UDP header length fields and for updating the IP and UDP header checksums. bpf_helpers.h also provides a convenient function to keep track of how the checksum needs to be updated with all (kinds of) modifications: bpf_csum_diff(). We will use that function in this version of dns-says-no instead of the more limited update_checksum() used earlier.
Below is our final and properly REFUSED responding main function for the dns-says-no program:
A few noteworthy observations:
- Whether bytes need to be stripped from the tail is derived from the cursor’s pos member being smaller than the end member.
- Calculating the UDP checksum for both IPv4 and IPv6 introduced too much control flow variation in our program. Therefore, we only computed the IP header checksum for IPv4 and not the UDP checksum (which is optional for IPv4 anyway). IPv6 does not have an IP header checksum, but the UDP checksum is mandatory.
- We could not simply calculate the effect of removing all data “to strip” by passing the to_strip variable to csum_remove_data(). This again introduced too many different variations of the program for the verifier to check. We worked around this by limiting the maximum amount we can strip to 128 (0x80) and calculating the effect that has on the checksum in chunks (first 0x40, then 0x20, then 0x10, and so forth). Furthermore, we had to limit the number of labels in qname to 40.
- It would be worthwhile to see if we can overcome those limitations by splitting our program into multiple distinct programs (that can be verified independently) and creating a chain of those programs to fulfill a functionality, but this would be for a future blog post.
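The chunking idea from the third observation can be sketched in userspace C as follows. This is an illustration of the principle only, not the code from our program: the function names are made up, the tail is assumed to start on a 16-bit boundary of the checksummed data, and a real BPF version would spell out each chunk as a separate if statement with a constant length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum (not yet folded) of `len` bytes read as
 * big-endian 16-bit words; a trailing odd byte is padded with zero. */
static uint32_t sum_bytes(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(p[i] << 8 | p[i + 1]);
    if (i < len)
        sum += (uint32_t)(p[i] << 8);
    return sum;
}

/* The same total, gathered in power-of-two chunks (0x40, 0x20, ...,
 * 0x01), so that every call to sum_bytes() uses one of only seven
 * possible lengths instead of an arbitrary one. */
static uint32_t sum_tail_chunked(const uint8_t *tail, uint32_t to_strip)
{
    uint32_t sum = 0;

    assert(to_strip < 0x80);
    for (uint32_t chunk = 0x40; chunk != 0; chunk >>= 1) {
        if (to_strip & chunk) {
            sum += sum_bytes(tail, chunk);
            tail += chunk;
        }
    }
    return sum;
}
```

Because the chunks are handled from large to small, every chunk except the final single byte has an even length and starts at an even offset, so the chunked total equals the sum over the whole tail. That total is what has to be subtracted from the checksum when the tail is removed.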
BPF and XDP are complex subjects and we are just scratching the surface with this introductory post. In future posts, we intend to dive deeper and get more technical, and also showcase programs that will actually benefit the DNS and operator community.
One of these programs, which we hope to describe in our next post, implements Response Rate Limiting (RRL) in XDP with interaction from userspace for configuration flexibility. We will see what so-called Maps are in BPF and how we can leverage them.
In the meantime, we encourage people to dive into the matter themselves and discover what XDP can offer. Some starting points we found very helpful are provided below.
- Read Fast Packet Processing with eBPF and XDP: Concepts, Code, Challenges, and Applications by Marcos Vieira et al, in: ACM Computing Surveys, Vol. 53, No. 1, Article 16. (February 2020)
- Watch Brendan Gregg on Performance Analysis Superpowers with Linux eBPF at Velocity Conf 2017
- Or, start hacking with https://github.com/xdp-project/xdp-tutorial/
Contributors: Willem Toorop
Adapted from original post which appeared on RIPE Labs.
Luuk Hendriks is a PhD candidate at the University of Twente, the Netherlands.