Towards verifying programmable switches at runtime with P6

Programmable networks such as Software Defined Networks (SDNs) and P4 networks herald a paradigm shift in the design and operation of networks. Because programmable networks break the tie between vendor-specific hardware and proprietary software, they allow software and hardware to evolve independently. The P4 language lets users define, in a program, how packets are processed: how a received packet should be read, manipulated, and forwarded by a network device such as a P4 switch.

However, because P4 switches run user-written programs, complex bugs and faults can surface, especially at runtime when the switches face diverse workloads and unprecedented threats.

Consider a single P4 switch: a programmer writes a P4 program that defines the packet processing logic and deploys it on the switch. Programming errors or switch-specific errors can then cause abnormal switch behaviour under active traffic, so bugs in both the P4 program and the P4 switch need to be detected and eliminated.

To detect runtime bugs in a P4 program, my colleagues and I at TU Berlin collaborated with the Huawei Munich Research Center, University of Vienna, and MPI Informatics to develop P4RL, which uses reinforcement learning-guided fuzzing. To extend P4RL with localization and patching of bugs in a P4 program, we developed P6 (which we will present at Infocom 2021 in May). P6 stands for P4 with runtime Program Patching. P6 is an important foray into self-driving networks, which come with stringent requirements on dependability and automation.

The challenge

Bugs or errors can occur at any stage in the P4 pipeline. If a bug occurs in one of the programmable blocks of P4, we call it platform-independent, and a software patch can solve the problem. If the bug appears in the nonprogrammable or platform-dependent blocks, namely the PRE or BQE, the vendor has to be informed so it can fix the issue, because the implementation is hardware-related or vendor-specific. Existing P4 program verification systems detect bugs using static analysis. Unfortunately, static analysis has some shortcomings:

It mostly detects memory safety bugs
It is prone to false positives
It cannot detect platform-dependent bugs
It cannot detect runtime bugs that require actively sending real packets

Figure 1 – Illustration of a platform-independent bug. Packets with the wrong IPv4 checksum are accepted (or generated) as the checksum calculation (or update) is not done on the P4 switch. Blue arrows show the expected behaviour and red arrows show the faulty behaviour.
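To make the checksum bug concrete, here is a minimal Python sketch of the standard IPv4 header checksum (the one's-complement sum described in RFC 1071). It is not part of P6; the function names are illustrative assumptions, and it only shows the check that a correct switch is expected to perform on ingress and redo on egress.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over an IPv4 header.

    To compute the value to store, zero the checksum field (bytes 10-11)
    first. Run over a header that already carries a correct checksum,
    the result is 0.
    """
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_is_valid(header: bytes) -> bool:
    """True if the header's stored checksum matches its contents."""
    return ipv4_checksum(header) == 0
```

A switch that decrements the TTL without recomputing the checksum produces exactly the kind of header that `checksum_is_valid` rejects, which corresponds to the "generated wrong checksum" case in the figure.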

The opportunity and insight

Moving from the older P4₁₄ to the newer P4₁₆ version, released in 2016, doubles the number of programmable blocks, which increases the chances for patchability. Bugs detected in the platform-independent part can be localized and patched; a platform-dependent bug may not be patchable as it is hardware-related.

Figure 2 – The evolution of P4₁₄ to the more advanced version P4₁₆ (P4₁₆ at bottom).

The solution

In P4, automated program repair is uncharted territory, and it is becoming increasingly important because the software development lifecycle in programmable networks is short, often with insufficient testing. With P6, we show that the structure of P4 programs makes it possible to automate the patching of platform-independent bugs (P4 program-specific software bugs), provided a patch is available. P6 is a novel runtime P4-switch verification system that: (a) detects; (b) localizes; and (c) patches software bugs in a P4 program. P6 extends our earlier work, P4RL, with automated localization and runtime patching. It combines static analysis of the P4 program with a Reinforcement Learning (RL) technique that guides the fuzzing process to verify the P4-switch behaviour at runtime.
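The idea of letting a reward signal steer a fuzzer can be illustrated with something far simpler than the RL agent used in P6. The sketch below swaps in an epsilon-greedy bandit: each mutation of a valid packet is an action, and the reward is 1 when the malformed packet is wrongly accepted. The toy switch, the action names, and the parameters are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical mutations a fuzzer could apply to a valid IPv4 packet.
ACTIONS = ["flip_checksum", "set_ttl_zero", "set_version_6"]


def toy_switch_accepts(packet: dict) -> bool:
    """Stand-in for a buggy switch: checks version and checksum, not TTL."""
    return packet["version"] == 4 and packet["checksum_ok"]


def mutate(action: str) -> dict:
    """Return a valid packet with one field broken by the given action."""
    packet = {"version": 4, "checksum_ok": True, "ttl": 64}
    if action == "flip_checksum":
        packet["checksum_ok"] = False
    elif action == "set_ttl_zero":
        packet["ttl"] = 0
    elif action == "set_version_6":
        packet["version"] = 6
    return packet


def reward(action: str) -> float:
    """1.0 when a malformed packet is accepted, i.e. a bug is exposed."""
    return 1.0 if toy_switch_accepts(mutate(action)) else 0.0


def guided_fuzz(steps: int = 200, epsilon: float = 0.2, seed: int = 0) -> dict:
    """Epsilon-greedy bandit over mutations; returns mean reward per action."""
    rng = random.Random(seed)
    counts = {a: 0 for a in ACTIONS}
    values = {a: 0.0 for a in ACTIONS}
    for step in range(steps):
        if step < len(ACTIONS):
            action = ACTIONS[step]  # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=values.get)  # exploit best so far
        counts[action] += 1
        # Incremental mean of the rewards observed for this action.
        values[action] += (reward(action) - values[action]) / counts[action]
    return values
```

After the first pass over all actions, the bandit spends most of its budget on the mutation that exposes the missing TTL check, which is the intuition behind guiding the fuzzer rather than sampling packets uniformly.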

Figure 3 – The P6 workflow. Modules from P6 are in solid green boxes.

In a nutshell, the first step in P6 is to capture the expected behaviour of a P4 switch, using information from three different sources:

The control plane configuration
Queries in p4q, a query language that we leverage to describe expected behaviour using conditional statements
Accepted header layouts such as IPv4, IPv6, and so forth, learned via static analysis of the P4 program

P6 then sends test packets generated via machine-learning guided fuzzing. If the switch's actual runtime behaviour on these packets differs from the expected behaviour, that is, if a p4q query is violated, this signals a bug to P6, which then looks for a matching patch in a library of patches. If a patch is available, P6 modifies the original P4 program to fix the bug signalled by the p4q queries. The patched P4 program is then subjected to sanity and regression testing.
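The detect, patch, and retest loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the queries are plain Python predicates rather than real p4q syntax, the "switch" is a function on dictionaries, and the patch library maps a violated query to a textual description instead of a P4 code change. Every name here is hypothetical.

```python
from typing import Callable, Dict, List, Optional

Packet = Dict[str, int]
Switch = Callable[[Packet], Optional[Packet]]  # None means "dropped"


def ttl_expired_is_dropped(pkt_in: Packet, pkt_out: Optional[Packet]) -> bool:
    return pkt_in["ttl"] > 1 or pkt_out is None


def ttl_is_decremented(pkt_in: Packet, pkt_out: Optional[Packet]) -> bool:
    return pkt_out is None or pkt_out["ttl"] == pkt_in["ttl"] - 1


# Expected behaviour, expressed as named predicates (stand-ins for queries).
QUERIES = {
    "ttl_expired_is_dropped": ttl_expired_is_dropped,
    "ttl_is_decremented": ttl_is_decremented,
}

# Hypothetical patch library keyed by the violated query.
PATCH_LIBRARY = {
    "ttl_expired_is_dropped": "drop packets whose TTL is 0 or 1",
    "ttl_is_decremented": "decrement the TTL in the forwarding action",
}


def buggy_switch(pkt: Packet) -> Optional[Packet]:
    """Toy data plane: forwards everything and never touches the TTL."""
    return dict(pkt)


def patched_switch(pkt: Packet) -> Optional[Packet]:
    """The same data plane with both patches applied."""
    if pkt["ttl"] <= 1:
        return None
    return {**pkt, "ttl": pkt["ttl"] - 1}


def violated_queries(switch: Switch, test_packets: List[Packet]) -> List[str]:
    """Send each packet through the switch and collect violated queries."""
    found: List[str] = []
    for pkt in test_packets:
        out = switch(pkt)
        for name, query in QUERIES.items():
            if not query(pkt, out) and name not in found:
                found.append(name)
    return found


def select_patches(violations: List[str]) -> Dict[str, Optional[str]]:
    """Look up a patch per violation; None means no patch is available."""
    return {name: PATCH_LIBRARY.get(name) for name in violations}
```

Running `violated_queries` again on the patched switch with the same packets plays the role of the regression test: an empty result means the patch fixed the signalled bugs without breaking the other queries.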

We developed a prototype of P6 and evaluated it by testing it on eight P4₁₆ application programs from switch.p4, P4 tutorial solutions, and NetPaxos codebase across two P4 switch platforms, namely, behavioural model version 2 (bmv2) and Tofino. Our results show that P6 successfully detects, localizes, and patches diverse bugs in all P4₁₆ programs while significantly outperforming bug detection baselines without introducing any regressions. To ensure reproducibility and facilitate follow-up work, we have released the P6 software and library of ready patches for all existing bugs in the P4 programs to the research community.

Figure 4 – P6 in action: the automated detection, localization, and patching of a bug in an L3 switch P4 program.

The results

P6 discovered 10 bugs (7 platform-independent and 3 platform-dependent) in 8 publicly available P4 programs. It detects bugs in about 2 seconds in 7 of the programs, and in about 10 seconds in switch.p4 (8715 LOC) with only 28 packets.

Bug IDs	Bugs
1	Accepted wrong checksum (PI)
2	Generated wrong checksum (PI)
3	Incorrect IP version (PI)
4	IP IHL value out of bounds (PI)
5	IP TotalLen value is too small (PI)
6	TTL 0 or 1 is accepted (PI)
7	TTL not decremented (PI)
8	Clone not dropped (PD)
9	Resubmitted packet not dropped (PD)
10	Multicast not dropped (PD)

Table 1 – Bugs (with Bug IDs) detected by the P6 prototype. PI and PD stand for platform-independent and platform-dependent, respectively.
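Several of the platform-independent bugs in Table 1 (IDs 3 to 6) are missing sanity checks on IPv4 header fields. The Python sketch below shows what such checks look like on a raw 20-byte header. The exact bounds that P6's queries enforce are not given in this post, so the thresholds here (IHL of at least 5, TotalLen of at least the header length, TTL greater than 1) are assumptions based on the IPv4 header format, and the function name is hypothetical.

```python
import struct
from typing import List


def ipv4_header_violations(header: bytes) -> List[str]:
    """Return the sanity checks that a raw IPv4 header fails."""
    if len(header) < 20:
        return ["header shorter than 20 bytes"]
    # Fixed 20-byte part of the IPv4 header, in network byte order.
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, _proto, _csum, _src, _dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    version, ihl = ver_ihl >> 4, ver_ihl & 0x0F
    problems = []
    if version != 4:
        problems.append("incorrect IP version")
    if ihl < 5:  # IHL counts 32-bit words; the minimum header is 5 words
        problems.append("IHL out of bounds")
    if total_len < ihl * 4:  # TotalLen must at least cover the header
        problems.append("TotalLen too small")
    if ttl <= 1:  # a forwarding device should not forward these
        problems.append("TTL 0 or 1")
    return problems
```

A switch program that forwards a packet for which this function returns a non-empty list would exhibit one of bugs 3 to 6.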

Figures 6 and 7 illustrate the performance of P6 in terms of median bug detection, localization, and patching time across the bmv2 and Tofino switch platforms. It is noteworthy that in all runs on bmv2, except for the switch.p4 program, P6 was able to detect all bugs in less than two seconds. In switch.p4 (8715 LOC), P6 was able to detect all bugs in less than ten seconds.

Figure 6 – P6 detection time (second-scale) and localization and patching time (ms-scale) on a Tofino switch. Each plot represents a median over 10 runs.

Figure 7 – P6 detection time (second-scale) and localization and patching time (ms-scale) on a bmv2 switch. Each plot represents a median over 10 runs.

With P6, developers of P4 programs and operators of P4-enabled devices can improve the security of their products. As part of our future agenda, we plan to apply P6 to commercial-grade P4 programs and networks and report on our experience. We note that, as they leverage programmability, future programmable networks will be exposed to even more kinds of faults, with a mix of vendor code, reusable libraries, and in-house code. As such, the general problem of network verification will persist, and we will have to explore how to extend the P6 system to traditional IP-based networks. The extensive P6 technical report is also available.

Dr. Apoorv Shukla is an expert in the verification of programmable networks such as Software Defined Networking and P4, and is currently a Senior Networks Researcher in Cloud at Huawei Munich Research Center.


