Microbursts are a phenomenon in which a sudden surge of packets enters a network router, increasing queuing delay and jitter and potentially causing packet drops when the router runs out of queuing buffer.
As microbursts happen on millisecond timescales, existing monitoring and diagnosis tools provide very little insight (beyond counting packet drops), let alone enable mitigation.
ConQuest is an algorithm that analyses the contribution of individual flows to a router’s queue in real time, running on first-generation commodity P4-based programmable switches.
With ConQuest, operators and researchers can continuously monitor the queuing buffer, pinpoint the particular bursty flows leading to the microbursts, and take immediate corrective action (such as re-routing or de-prioritizing the packets) to avoid a full queuing buffer and packet drops.
Scrutinizing queues for large flows
Our goal has been to measure how many bytes of queuing buffer space any given flow is occupying, at any given time, whenever the packet queue length exceeds a threshold. We can then use this information to identify the large flows in the queue, which are the likely culprits for microbursts.
Maintaining statistics for each flow in a router’s queue is more challenging than it sounds, as high link speeds (as much as 100Gbps) require the router to process one packet in as little as 6 nanoseconds. This makes it infeasible for an algorithm to update a data structure in memory both when a packet is enqueued and when it is dequeued.
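To see where a figure of this size comes from, here is a back-of-the-envelope calculation. It assumes the number refers to minimum-size (64-byte) Ethernet frames plus the 20 bytes of preamble and inter-frame gap that each frame occupies on the wire; the function name is ours, for illustration only. Under those assumptions a packet occupies a 100Gbps link for about 6.7 nanoseconds.

```python
def per_packet_budget_ns(link_gbps: float, frame_bytes: int,
                         overhead_bytes: int = 20) -> float:
    """Time one frame occupies the link, in nanoseconds.

    overhead_bytes is an assumption: 8 bytes of preamble and start-of-frame
    delimiter plus a 12-byte inter-frame gap on Ethernet.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    # 1 Gbps is exactly 1 bit per nanosecond, so bits / Gbps gives nanoseconds.
    return bits_on_wire / link_gbps


if __name__ == "__main__":
    for size in (64, 512, 1500):
        budget = per_packet_budget_ns(100, size)
        print(f"{size:>5}-byte frame at 100 Gbps: {budget:.2f} ns")
```

Larger packets leave more time per packet, but a router has to be provisioned for the worst case of back-to-back small packets.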
We addressed this challenge by slicing the traffic into very small time windows and performing queuing analysis using only packet dequeue records and each packet’s queuing delay. With this approach, we demonstrated in a testbed experiment that ConQuest can act surgically on bursty flows and improve network performance.
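The idea can be illustrated in software. The sketch below is not ConQuest's data-plane implementation: it uses exact Python dictionaries where a switch would need compact, fixed-size structures, and the class name, window length, and number of windows are assumptions made for illustration. It relies on one property of a first-in-first-out queue: the packets that were ahead of a packet when it was enqueued are exactly the packets dequeued during its queuing delay. Summing a flow's dequeued bytes over the time windows spanning that delay therefore approximates how much of the queue the flow occupied.

```python
from collections import defaultdict


class WindowedQueueEstimator:
    """Illustrative only: per-flow byte counts kept in a ring of time windows."""

    def __init__(self, window_ns: int, num_windows: int):
        self.window_ns = window_ns
        self.num_windows = num_windows
        # One byte counter table per slot; slots are reused round-robin.
        self.snapshots = [defaultdict(int) for _ in range(num_windows)]
        # Which absolute window index each slot currently holds.
        self.window_of_slot = [None] * num_windows

    def on_dequeue(self, flow_id, size_bytes: int,
                   dequeue_ns: int, queuing_delay_ns: int) -> int:
        """Record a dequeued packet and return the estimated number of bytes
        its flow had in the queue when this packet was enqueued."""
        current = dequeue_ns // self.window_ns
        oldest = (dequeue_ns - queuing_delay_ns) // self.window_ns
        # Windows older than the ring can hold have already been overwritten.
        oldest = max(oldest, current - self.num_windows + 1)

        # Sum the flow's bytes dequeued during this packet's queuing delay,
        # rounded to whole windows.
        estimate = 0
        for window in range(oldest, current + 1):
            slot = window % self.num_windows
            if self.window_of_slot[slot] == window:
                estimate += self.snapshots[slot].get(flow_id, 0)

        # Record this packet in the current window, recycling a stale slot.
        slot = current % self.num_windows
        if self.window_of_slot[slot] != current:
            self.snapshots[slot] = defaultdict(int)
            self.window_of_slot[slot] = current
        self.snapshots[slot][flow_id] += size_bytes
        return estimate


if __name__ == "__main__":
    est = WindowedQueueEstimator(window_ns=1000, num_windows=4)
    est.on_dequeue("A", 1500, dequeue_ns=100, queuing_delay_ns=0)
    # This packet waited 1100 ns, during which 1500 bytes of flow A left the queue.
    print(est.on_dequeue("A", 1500, dequeue_ns=1200, queuing_delay_ns=1100))
```

Each packet triggers a single update, at dequeue time, rather than one at enqueue and one at dequeue. The trade-off is precision: the estimate is rounded to window boundaries, so shorter windows give finer answers but more windows are needed to cover the same maximum queuing delay.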
How we set up the experiment
We sent a million small- to medium-sized flows (resembling typical web traffic distribution) over a 10Gbps bottleneck link, at 20% utilization, and periodically injected synthetic bursty flows to create queuing.
Without ConQuest, a baseline setup indiscriminately throttles or drops packets from all flows whenever the queue grows longer than some threshold. When the switch instead throttled only the packets from bursty flows, as measured by ConQuest in real time, we reduced flow completion time by as much as 11%, as the small and medium flows were left intact and kept sending at the full rate.
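The difference between the two policies can be written down in a few lines. This is a simplified sketch rather than the logic used in our testbed: the function names and both thresholds are hypothetical, and the per-flow byte count is assumed to come from a measurement such as ConQuest's.

```python
def baseline_should_throttle(queue_bytes: int, queue_threshold_bytes: int) -> bool:
    """Baseline: once the queue is long, every packet is a candidate,
    regardless of which flow it belongs to."""
    return queue_bytes > queue_threshold_bytes


def selective_should_throttle(queue_bytes: int, flow_bytes_in_queue: int,
                              queue_threshold_bytes: int,
                              flow_share_threshold: float) -> bool:
    """Selective: throttle only packets of flows that hold more than a given
    share of the queue (flow_share_threshold is a hypothetical parameter)."""
    if queue_bytes <= queue_threshold_bytes:
        return False
    return flow_bytes_in_queue > flow_share_threshold * queue_bytes


if __name__ == "__main__":
    queue, limit = 200_000, 100_000
    # A small web flow holding 3,000 bytes of a 200,000-byte queue.
    print(baseline_should_throttle(queue, limit))
    print(selective_should_throttle(queue, 3_000, limit, 0.2))
    # A bursty flow holding 150,000 bytes of the same queue.
    print(selective_should_throttle(queue, 150_000, limit, 0.2))
```

With these example numbers the baseline penalizes the small flow along with the bursty one, whereas the selective policy leaves it alone.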
Bringing queue analysis to legacy devices
There’s one catch, though: legacy, non-programmable routers in today’s networks can’t run ConQuest. This is unfortunate, as network operators need such real-time, fine-grained queuing analysis before deploying next-generation programmable boxes.
To meet this challenge, we propose a novel setup to analyse and diagnose queuing dynamics in legacy routers on demand, using traffic mirrored via tapping links, which are readily available in many networks.
This involves simultaneously tapping both the ingress and egress links of the legacy router, and feeding the traffic to a P4 programmable switch running ConQuest. The P4 switch first matches the two appearances of the same packet (once at ingress before queuing and once at egress after queuing) to calculate the queuing delay experienced by each packet. It then runs our queuing analysis algorithm to identify and report bursty flows in real time.
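The matching step can be sketched as follows. This is an illustration in ordinary software, not the P4 program: a switch would use a fixed-size hash table rather than an unbounded dictionary, and the class name and the choice of packet key (here a tuple of header fields) are assumptions.

```python
class DelayMatcher:
    """Illustrative only: pairs the ingress and egress copies of a packet."""

    def __init__(self):
        # packet key -> timestamp (ns) at which the ingress copy was seen
        self.pending = {}

    def on_ingress(self, packet_key, timestamp_ns: int) -> None:
        """Remember when the packet entered the legacy router."""
        self.pending[packet_key] = timestamp_ns

    def on_egress(self, packet_key, timestamp_ns: int):
        """Return the queuing delay in ns, or None if no ingress copy was seen
        (for example, because the packet was dropped from the tap)."""
        ingress_ns = self.pending.pop(packet_key, None)
        if ingress_ns is None:
            return None
        return timestamp_ns - ingress_ns


if __name__ == "__main__":
    matcher = DelayMatcher()
    # Hypothetical key: source, destination, ports, and an IP identification value.
    key = ("10.0.0.1", "10.0.0.2", 1234, 80, 42)
    matcher.on_ingress(key, 1_000)
    print(matcher.on_egress(key, 4_500))
```

The delay computed this way is the same per-packet input that the queuing analysis needs, which is why the tapping setup can reuse the algorithm unchanged. Packets that the router drops never produce an egress copy, so a real implementation also has to expire stale entries.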
Although we cannot immediately react to microbursts and suppress bursty flows in this passive tapping setup, ConQuest still gives operators granular insight into queuing at small timescales in production networks.
ConQuest in action
Using the tapping setup, we have deployed ConQuest to analyse queuing in two real-world production networks.
At Princeton University, we observed that a border router with 100G peering to Internet2 suffered occasional packet losses (high drop count) at a particular 10G port, yet the link utilization at this port (reported by SNMP) never exceeded 20%.
Using ConQuest, we scrutinized the traffic whenever the queue was long and grouped these queuing events into three categories; we also found that the drops were potentially caused by an active performance measurement tool.
We presented our observations (PowerPoint) in more detail at the Buffer Sizing Workshop in December 2019.
At AT&T, the operators have been aware of microbursts in the carrier core network from the per-second maximum queue length (high watermark counter) statistics, but no further insight was available.
After deploying ConQuest, we observed fine-grained burst characteristics and pinpointed the type of application traffic that leads to burstiness and high queuing delay. The burst size distribution we observed also lets us extrapolate the packet drop rate to expect if we deploy switches with shallow buffers.
These observations were also presented at the Buffer Sizing Workshop.
In conclusion, ConQuest is a powerful fine-grained analysis tool that reports the size of individual flows in packet queues. It can be used both to optimize Active Queue Management in real time and, in an off-path fashion, to diagnose queuing dynamics in today’s production networks.
Read our paper ‘Fine-Grained Queue Measurement in the Data Plane’ to learn more.
Xiaoqi Chen is a PhD student at Princeton University. His research interests include designing and implementing approximate algorithms to run in programmable network switches, for performing network measurement and optimization.