Rocky road towards ultimate UDP server with BPF-based load balancing on Linux: Part 2

By Pavel Odintsov on 19 Oct 2023

Category: Tech matters




In the first part of this series, we evaluated multiple possible options to load balance UDP traffic between multiple sockets using the SO_REUSEPORT capability.

To improve on simple load balancing — use BPF

Due to the nature of the traffic in my particular case (the majority of it comes from a single device), the standard Linux kernel load balancing algorithm did not distribute it well, and we were limited in our options to use more threads to process UDP traffic. (For the tests in this series of posts I used Ubuntu 22.04 with Linux kernel 6.2.0-26-generic.)

Figure 1 — Default Linux kernel load balancing logic for reuse port does not work well when all traffic comes from just a few sources.

To achieve even traffic distribution between sockets/threads, we will use classic BPF (cBPF) microcode loaded into the Linux kernel, used for the sole purpose of selecting the best thread ID for a particular UDP packet. From this very simple microcode, we will have access to the UDP packet itself and multiple network stack-specific variables, such as:

  • len: skb->len
  • proto: skb->protocol
  • type: skb->pkt_type
  • poff: payload start offset
  • ifidx: skb->dev->ifindex
  • mark: skb->mark
  • queue: skb->queue_mapping
  • hatype: skb->dev->type
  • rxhash: skb->hash
  • cpu: raw_smp_processor_id()
  • vlan_tci: skb_vlan_tag_get(skb)
  • vlan_avail: skb_vlan_tag_present(skb)
  • vlan_tpid: skb->vlan_proto
  • rand: prandom_u32()

You can find the full list in the Linux kernel networking documentation, in the section ‘Possible BPF extensions are shown in the following table’.

I recommend doing a detailed evaluation of the protocol you work with to find the best approach to distribute traffic. In my case, we decided to do random distribution over all available threads using the rand extension.

There is another more complex form of BPF if you need it — eBPF

In addition to classic BPF microcode, which can be loaded using SO_ATTACH_REUSEPORT_CBPF, the Linux kernel supports SO_ATTACH_REUSEPORT_EBPF, which uses the significantly more advanced eBPF. My particular case is very simple and I do not need all the advanced tricks available in eBPF; I prefer to keep things simple. If you’ve reached the limitations of classic BPF then eBPF may be your choice.

How to write the BPF microcode

What is the best way to write BPF microcode? You may manually write it in opcodes as documented in the FreeBSD documentation. Don’t worry, BPF is almost the same on Linux and FreeBSD (eBPF is only available on Linux at this time; it hasn’t been back-ported to FreeBSD).

I found some examples in different guides and was able to create my microcode in opcodes:

struct sock_filter bpf_random_load_distribution[3] = {
    /* Load random 32-bit value to A */
    { BPF_LD  | BPF_W | BPF_ABS,  0,  0, 0xfffff038 },
    /* A = A % number of threads */
    { BPF_ALU | BPF_MOD, 0, 0, threads_per_port },
    /* Return A */
    { BPF_RET | BPF_A, 0, 0, 0 },
};

The step-by-step logic is as follows:

  1. Load the random 32-bit integer, exposed at the special offset 0xfffff038, into accumulator A.
  2. Compute A modulo the number of worker threads, leaving the result in A.
  3. Return the value of accumulator A to the Linux kernel, which uses it as the index of the receiving socket.

You may just copy it as is, but if you want to implement alternative logic, be warned: it may get exceptionally tricky to do so in opcodes.

What is the alternative option? We can write it in BPF assembler. In this case, my microcode will look like this:

; Load random 32 bit value to A
ld rand
; Mod divide register A by number of worker threads
mod #4
ret a

If you have previous experience with assembler, you may find this notation a lot easier to understand.

To compile this code into sock_filter format we can use the Python module: bpf-asm. It can be installed this way:

sudo pip3 install bpf-asm

After that, we can compile the assembler into the C structure compatible with BPF:

pybpf_asm reuseport_thread_distribution_bpf_random.txt -c

In my case, I got the following output:

{ 0x20, 0, 0, 0xfffff038 },
{ 0x94, 0, 0, 0x00000004 },
{ 0x16, 0, 0, 0x00000000 },

It looks different from my handwritten opcodes, but it is completely equivalent:

{ BPF_LD  | BPF_W | BPF_ABS,  0,  0, 0xfffff038 },
{ BPF_ALU | BPF_MOD, 0, 0, threads_per_port },
{ BPF_RET | BPF_A, 0, 0, 0 }

The only difference is that the manually written code is easier to read, as it uses symbolic constants instead of raw numbers. If you do not like the Python-based bpf-asm for some reason, you may use the C-based bpf_asm tool from the Linux kernel sources and build it this way:

tar -xf linux-6.4.12.tar.xz
cd linux-6.4.12/tools/bpf
sudo apt install -y binutils-dev libreadline-dev bison flex
make bpf_asm
sudo cp bpf_asm /opt

The compile command for a BPF assembler program is similar:

/opt/bpf_asm -c reuseport_thread_distribution_bpf_random.txt

How to load the microcode into the kernel

The last remaining step is to load our microcode into the Linux kernel for reuse_port load balancing. We will start by modifying our multi-threaded UDP server. In the official Linux man pages, we find a pretty relaxed description of the ways to load it:

Figure 2 — Extract from the Linux man page for BPF.

I did my best to understand the documentation and added logic to load the BPF microcode into our previous version of the UDP server with reuse_port-based load balancing. You can find my complete source code here. I decided to load it after we set the SO_REUSEPORT flag and before the actual bind() call. Unfortunately, it didn't work:

We will listen on :::2055 udp port
Setting reuse port
Loading BPF to implement random UDP traffic distribution over available threads
Successfully loaded reuse port BPF
Successful bind
We will listen on :::2055 udp port
Setting reuse port
Loading BPF to implement random UDP traffic distribution over available threads
Successfully loaded reuse port BPF
Can't bind on port: 2055 on host :: errno:98 error: Address already in use
Cannot create / bind socket
Started capture

I did my best to read all the documentation again and again. I shared my broken code with the community and asked for feedback. At the same time, I started looking for examples of successful use of SO_ATTACH_REUSEPORT_CBPF on GitHub and found just a few pull requests and very little actual source code.

As with all relatively new features, we can expect examples in the Linux kernel source code itself. I was lucky and found reuseport_bpf.c in the tools/testing/selftests/net subdirectory of the latest Linux kernel (6.4.12), which includes a complete test kit for SO_ATTACH_REUSEPORT_CBPF. After half an hour of code removal, I was able to make it fail in the same way as my code. You can find the complete source code of the test.

---- IPv4 UDP ----
Testing CBPF mod 10...
Testing CBPF mod 20...
./reuseport_bpf: failed to bind recv socket 0: Address already in use

At this point, I was pretty sure that this logic actually worked: the original Linux kernel test suite did not fail, so something had to be different in my implementation. After a careful review of the code I had removed, I found one place where another flag, SO_REUSEADDR, was set in addition to SO_REUSEPORT. I decided to try this idea and added SO_REUSEADDR right after the SO_REUSEPORT call. Success! It did not fail. You can find the complete source code of this attempt.

Can we be sure that it works? SO_REUSEADDR clearly has nothing to do with load balancing and fixed our issue by lucky accident. I expected that it might break something, so I had to repeat the production tests to confirm that balancing worked as expected.

Luckily for us, everything worked just fine:

Thread ID	UDP packets / second
Thread 0	41263
Thread 1	41224
Thread 0	41374
Thread 1	40995
Thread 0	41260
Thread 1	41265

And the CPU load was distributed evenly, as shown in Figure 3.

Figure 3 — CPU load with two threads.

Just out of curiosity, I increased the number of worker threads to four and got the same perfect results as with two (Figure 4).

Figure 4 — CPU load with four threads.

It may be time to declare success, but we have to keep in mind that SO_REUSEADDR has pretty nasty side effects and can clearly be considered insecure: it allows other applications on the same machine to listen on the same port.

A cleaner solution is out there — pre-create the sockets and attach the BPF afterwards

I explained my original issue to Vadim Fedorenko from Meta (Facebook) and, after checking the source code of the Linux kernel, he explained that once we assign a BPF program to a socket, we can no longer bind more sockets to the same port in that reuseport group. That gave me the idea to remove the SO_REUSEADDR workaround by making small structural changes to my logic.

I moved the logic that assigns the BPF to the sockets so it runs after we have created all the sockets for the reuseport group and called bind() for each of them. At that point, no more sockets need to join the group, and we can simply assign the BPF microcode to each socket. You can find the complete source code here.

I ran production tests to confirm that everything worked just fine without any need for the SO_REUSEADDR workaround.

Thread ID	UDP packets / second
Thread 0	44711
Thread 1	44368
Thread 0	45027
Thread 1	45253

It was clearly a rocky road but I’m very happy to have made this code work. It creates several opportunities to make UDP servers more efficient and increase their throughput. I hope the need to use these tricks will be addressed in future versions of the Linux kernel. Meanwhile, you can use my example as a reference as it will work for all current versions of the Linux kernel without any issues.

Pavel is the author of FastNetMon, an open source DDoS detection tool with a variety of traffic capture methods, and works in software development and community management.

Adapted from the original posts at Pavel’s Blog.


The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.
