A look at Route Flap Damping in the wild

By Clemens Mosig on 20 Jul 2020

BGP Route Flap Damping (RFD) has been controversial since its creation more than 20 years ago. Recommendations have been revised multiple times over the past two decades, and they still differ from vendor default values.

To understand which configurations are in use, and to provide RFD parameter sets for your router based on past recommendations, a group of us from Freie Universität Berlin, Internet Initiative Japan, University of Strasbourg, Hamburg University of Applied Sciences and Arrcus have been measuring RFD use in the wild as part of the Beacon Measurements project. In this post, I’ll share what we’ve uncovered thus far.

A chequered past

The Border Gateway Protocol (BGP) connects Autonomous Systems (ASes) in the Internet by announcing (and withdrawing) routes. To prevent oscillating routes, RFD was introduced in the mid-nineties (and standardized in RFC 2439 in 1998) to suppress repeating BGP updates.

Route flaps cause performance problems on routers. Although RFD was developed to mitigate these problems, it has also been shown to suppress well-behaved, stable routes, and in some cases it has made networks unreachable. Mao et al. confirmed these observations, and RIPE consequently recommended disabling RFD in 2006 (ripe-378). Because of these drawbacks, the common belief is that RFD is not widely deployed.

In 2011, Pelsser et al. suggested slight adjustments to the previously recommended RFD configuration, making RFD usable without any changes to vendor implementations. Based on their findings, RIPE (ripe-580) and the IETF (RFC 7454) published recommendations to use RFD with adjusted parameters. The harmful vendor default parameters, however, were not revised.

Figure 1 — Timeline of RFD.

In all this time there has been no study attempting to measure real-world deployment and configuration of RFD. Understanding the configurations network operators use in practice is crucial for Internet operations and measurements.

The RFD signature

An RFD-enabled router maintains a penalty value per-prefix per-BGP session that defines when a prefix should be suppressed or released. This value increases with each announcement or withdrawal for that prefix and decreases exponentially over time. When it exceeds a threshold the prefix is suppressed until the penalty decays below a second threshold.
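
Assuming the usual exponential-decay model with a half-life H (15 minutes in Table 1), the decay between updates and the time a suppressed prefix needs to fall from its peak penalty back to the reuse threshold can be sketched as follows. The symbols p, H, p_peak and p_reuse are notation introduced here for illustration, not taken from the measurements.

```latex
p(t) = p(t_0)\cdot 2^{-(t - t_0)/H},
\qquad
\Delta t_{\mathrm{reuse}} = H \,\log_2\!\frac{p_{\mathrm{peak}}}{p_{\mathrm{reuse}}}
```

For instance, with H = 15 minutes, a peak penalty of 3,000 and an assumed reuse threshold of 750 give 15 · log2(4) = 30 minutes of suppression after the last update, ignoring the maximum suppress time.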

Figure 2 visualizes how the penalty for one of our prefixes behaves in an RFD-enabled router.

Figure 2 — RFD router perspective: The penalty for a prefix that oscillates between announcement (green) and withdrawal (orange).

At t0, the router starts to receive updates for the prefix and additively increments the penalty.

At t1, the penalty exceeds the suppress-threshold, so the router withdraws the prefix from its peers and stops propagating any further updates it receives.

At t2, the router no longer receives updates for the prefix, so the penalty decays until it falls below the reuse-threshold at t3. The reuse-threshold defines when a prefix is considered usable again. As a result, the router re-advertises the prefix to its peers.
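
The t0 to t3 behaviour can be reproduced with a small simulation. This is a minimal sketch rather than any vendor's implementation: the penalty increments and half-life come from Table 1 (Cisco column), while the suppress and reuse thresholds of 2,000 and 750 are assumed values chosen for illustration. The names `RfdParams` and `simulate_penalty` are hypothetical, and the maximum suppress time and attribute-change penalty are ignored.

```python
from dataclasses import dataclass


@dataclass
class RfdParams:
    withdrawal_penalty: float = 1000.0      # Table 1
    readvertisement_penalty: float = 0.0    # Table 1, Cisco column
    half_life_min: float = 15.0             # Table 1
    suppress_threshold: float = 2000.0      # ASSUMPTION, illustrative value
    reuse_threshold: float = 750.0          # ASSUMPTION, illustrative value


def simulate_penalty(updates, params, end_min):
    """Sample the penalty of one prefix once per minute.

    updates: iterable of (minute, kind) with whole-minute timestamps,
             kind is "W" (withdrawal) or "A" (announcement).
    Returns a list of (minute, penalty, suppressed) tuples.
    """
    decay = 0.5 ** (1.0 / params.half_life_min)  # one-minute decay factor
    pending = sorted(updates)
    penalty, suppressed, samples, i = 0.0, False, [], 0
    for minute in range(end_min + 1):
        # Add the penalty of every update that has arrived by this minute.
        while i < len(pending) and pending[i][0] <= minute:
            kind = pending[i][1]
            penalty += (params.withdrawal_penalty if kind == "W"
                        else params.readvertisement_penalty)
            i += 1
        # Suppress above the suppress threshold, release below the reuse threshold.
        if not suppressed and penalty > params.suppress_threshold:
            suppressed = True
        elif suppressed and penalty < params.reuse_threshold:
            suppressed = False
        samples.append((minute, penalty, suppressed))
        penalty *= decay  # exponential decay until the next sample
    return samples


updates = [(0, "W"), (1, "A"), (2, "W"), (3, "A"), (4, "W"), (5, "A")]
samples = simulate_penalty(updates, RfdParams(), end_min=60)
suppressed_minutes = [m for m, _, s in samples if s]
```

Under these assumptions, the prefix is suppressed at the third withdrawal and released roughly half an hour later, once the penalty has decayed below the reuse threshold.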

Measurement infrastructure: rapid Beacons to trigger RFD

The re-advertisement (t3 in Figure 2) would not occur if the router continued to receive updates at a sufficiently high rate. Therefore, in our experiment, we announce and withdraw Beacon prefixes in a so-called ‘burst and break’ pattern (light-blue and white bands in Figure 2).

In bursts, we begin with a withdrawal, alternate between announcement and withdrawal, and end with an announcement. In breaks, we do not send any updates.
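
A burst-and-break schedule of this shape could be generated as below. This is only a sketch: the function names, the number of updates per burst, and the convention that the break runs from the last update of one burst to the first update of the next are assumptions made for illustration, not details of the actual Beacon setup.

```python
def burst_schedule(start_min, interval_min, n_updates):
    """One burst: starts with a withdrawal ("W"), alternates, ends with an announcement ("A")."""
    if n_updates < 2 or n_updates % 2 != 0:
        raise ValueError("n_updates must be a positive even number")
    return [(start_min + k * interval_min, "W" if k % 2 == 0 else "A")
            for k in range(n_updates)]


def burst_and_break(interval_min, n_updates, break_min, n_cycles):
    """Concatenate n_cycles bursts separated by update-free breaks."""
    schedule, start = [], 0
    for _ in range(n_cycles):
        burst = burst_schedule(start, interval_min, n_updates)
        schedule.extend(burst)
        # Assumed convention: the break is measured from the last update of the burst.
        start = burst[-1][0] + break_min
    return schedule


# Hypothetical example: 1-minute interval, 4 updates per burst, 6-hour break.
example = burst_and_break(interval_min=1, n_updates=4, break_min=360, n_cycles=2)
```

Because every burst ends with an announcement, the prefix stays reachable throughout the break.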

The interval between two consecutive updates in a burst determines which RFD configurations are triggered. We did not expect configurations stricter than the vendor default values, which already suppress 14% of all prefixes. A Juniper or Cisco router would start damping a prefix that flaps at least every 9 or 8 minutes, respectively. The Minimum Route Advertisement Interval (MRAI) prevents us from going much lower than 1 minute: the Cisco default is 30 seconds, and other routers are probably configured similarly. We used update intervals of 1, 2 and 3 minutes with a 6-hour break in our first experiment, and 5, 10 and 15 minutes with a 2-hour break in our second experiment.
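
To see why the update interval decides which configurations are triggered, the following sketch computes the penalty a prefix would reach if it kept flapping forever at a fixed interval, and compares it with the suppress threshold. It is an idealised geometric-series model, not router code. The penalty increments and half-life are those of Table 1, whereas the suppress thresholds (2,000 for Cisco, 3,000 for Juniper) are assumed default values. The names `steady_state_peak`, `VENDOR_DEFAULTS` and `would_damp` are hypothetical.

```python
def steady_state_peak(interval_min, withdrawal_penalty,
                      readvertisement_penalty, half_life_min=15.0):
    """Penalty right after a withdrawal when W and A alternate every interval_min forever."""
    r = 0.5 ** (interval_min / half_life_min)  # decay over one update interval
    # Fixed point of p = (p * r + a) * r + w, solved for p.
    return (withdrawal_penalty + readvertisement_penalty * r) / (1.0 - r * r)


# (withdrawal penalty, re-advertisement penalty, ASSUMED suppress threshold)
VENDOR_DEFAULTS = {
    "cisco": (1000.0, 0.0, 2000.0),
    "juniper": (1000.0, 1000.0, 3000.0),
}


def would_damp(vendor, interval_min):
    """True if the idealised steady-state penalty exceeds the suppress threshold."""
    w, a, suppress = VENDOR_DEFAULTS[vendor]
    return steady_state_peak(interval_min, w, a) > suppress


results = {(vendor, interval): would_damp(vendor, interval)
           for vendor in VENDOR_DEFAULTS
           for interval in (1, 2, 3, 5, 10, 15)}
```

In this model both parameter sets damp at intervals of 5 minutes and below, and neither damps at 10 or 15 minutes. The exact cut-off depends on how an implementation applies and rounds the penalty, so the model only approximates the 8-minute and 9-minute figures quoted above.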

To collect a variety of path data, we announced our ‘Beacons’ from seven different locations: Bangkok, Thailand; Johannesburg, South Africa; Copenhagen, Denmark; Munich, Germany; São Paulo, Brazil; Seattle, USA; and Tokyo, Japan. To pick up the Beacon pattern after it has been altered by RFD-enabled routers, we used three route collector projects: Isolario, RIPE RIS, and RouteViews. We refer to peers of route collector projects as ‘vantage points’.

We use the term BGP Beacon or ‘Beacon’ to refer to a publicly documented prefix having global visibility and a published schedule for announcements and withdrawals.

To detect instances of RFD we interpreted all received updates for each AS path. We could decipher only the announcement pattern because withdrawals do not contain AS paths. Figures 3 and 4 visualize what we observed at the vantage point for two AS paths.

Figure 3 — RFD signature. At least one AS on the path (701, 2497, 3130) has RFD enabled.
Figure 4 — Non-altered Beacon pattern. RFD does not occur on this path.

The topmost axis shows exactly when we received announcements for the given path from the vantage point. The other two axes show when updates were sent from the Beacon router and whether or not they were received.

Figure 3 clearly shows the RFD signature sketched in Figure 2, whereas in Figure 4 almost all announcements were exported by the vantage point and not damped. This means we can infer that either 701 or 2914 uses RFD (3130 is the Beacon AS).

Based on whether we could observe the RFD signature, we labelled each path with RFD true or false. Although the resulting dataset gives an idea of RFD deployment, we wanted to know exactly which AS is damping. Here we faced the challenge that a non-negligible share of ASes uses RFD selectively, for example, suppressing only churn from customers. On top of the usual measurement noise, selective damping makes the dataset contradictory.
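
A path-labelling step of this kind might look like the sketch below. It is a deliberately simplified stand-in for the actual analysis: a path is labelled as damped when the first announcement of a burst arrives, the last few announcements of the burst do not, and an announcement shows up again during the break. The function name, the one-minute propagation tolerance and the tail length of three are all assumptions.

```python
def has_rfd_signature(sent_min, received_min, tolerance_min=1.0, tail_len=3):
    """Label one AS path for one burst.

    sent_min:     sorted minutes at which the Beacon sent announcements in the burst
    received_min: minutes at which the vantage point saw announcements with this path
    """
    if len(sent_min) <= tail_len:
        raise ValueError("burst too short for the chosen tail_len")
    received = sorted(received_min)

    def arrived(sent_at):
        # An announcement counts as received if it shows up within the tolerance.
        return any(sent_at <= r <= sent_at + tolerance_min for r in received)

    head_arrived = arrived(sent_min[0])
    tail_missing = not any(arrived(t) for t in sent_min[-tail_len:])
    # A late announcement during the break corresponds to the release at t3.
    readvertised = any(r > sent_min[-1] + tolerance_min for r in received)
    return head_arrived and tail_missing and readvertised


sent = [1, 3, 5, 7, 9, 11]                      # hypothetical burst
damped = has_rfd_signature(sent, [1.2, 3.1, 40.0])
undamped = has_rfd_signature(sent, [1.2, 3.1, 5.2, 7.1, 9.3, 11.2])
```

A real classifier would also have to cope with lost updates, path changes and the selective damping described above.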

We developed three heuristics to determine which ASes deploy RFD on the Internet. The first heuristic simply computes the relative occurrence of an AS on damped and non-damped paths. The second relies on alternative paths that are announced after the damped path has been suppressed. The third uses the characteristic that RFD-enabled ASes export, on average, fewer updates towards the end of bursts.
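
The first heuristic can be sketched in a few lines. The function name and the example paths are hypothetical (the AS numbers are taken from the range reserved for documentation), and no decision threshold is applied here, since none is given in this post.

```python
from collections import Counter


def damped_path_share(labelled_paths):
    """For each AS, return the share of its paths that were labelled as damped.

    labelled_paths: iterable of (as_path, is_damped), where as_path is a tuple of AS numbers.
    """
    damped, total = Counter(), Counter()
    for as_path, is_damped in labelled_paths:
        for asn in set(as_path):  # count each AS once per path
            total[asn] += 1
            if is_damped:
                damped[asn] += 1
    return {asn: damped[asn] / total[asn] for asn in total}


# Hypothetical labelled paths; 64496 plays the role of the Beacon AS.
paths = [
    ((64500, 64501, 64496), True),
    ((64502, 64501, 64496), True),
    ((64502, 64503, 64496), False),
]
shares = damped_path_share(paths)
```

In this toy dataset, AS 64501 appears only on damped paths and AS 64503 only on non-damped ones, while the Beacon AS appears on every path and therefore carries no information by itself.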

Real-world deployment and configurations

At this time, there are two relevant parameter sets: vendor default values and the recommendations by the IETF (BCP 194) and RIPE (ripe-580). These are displayed in Table 1.

RFD Parameter               Cisco    Juniper   BCP 194 / RIPE-580
Withdrawal penalty          1,000    1,000     1,000
Re-advertisement penalty    0        1,000     0/1,000
Attributes change penalty   500      500       500
Half-life (min)             15       15        15
Max suppress time (min)     60       60        60

   Table 1 — Vendor default parameters and recommendations.

Default parameters have proven harmful in the past (Mao et al.) because they can lead to reachability issues, which is why the recommended parameters differ from them.

We used six different update intervals in our experiment: 1, 2, 3, 5, 10, and 15 minutes. Although Cisco and Juniper have different suppress-thresholds and penalty increments with re-advertisements, both start damping at the 5-minute update interval. Figure 5 shows the number of damping ASes that we identified for each update interval. 

Figure 5 — Number of damping ASes for each update interval. Total measured ASes = 610.

There is a clear increase from the 10-minute to the 5-minute update interval. This shows that many ASes use the harmful vendor default values. This observation is confirmed by ground truth, in which 60% use vendor defaults.

The slight increase for the smaller update intervals is likely caused by some network operators following the recommendations. The ASes damping at the larger update intervals are likely also using vendor default values, but receive more updates than we send from the Beacon routers due to topology phenomena, and therefore dampen the low-frequency Beacon prefixes.

We contacted network operators to validate our results and found 95% precision with one false positive.

Recommended RFD parameter sets

In the network operator community, there seems to be much confusion about how to apply the current recommendations correctly. As a result, some operators simply use the default values supplied by vendors because they do not expect vendors to ship harmful configurations. Therefore, we chose to supply the exact configuration parameters in Table 1.

It is worth noting that these values are based on previous measurements (ripe-580) and might need to be updated based on recent data. This will be part of our future work.

Tier 1 and small ISPs are using RFD

In contrast to the expectation of the networking community, we found that RFD is being used by at least 8% of the measured ASes.

Tier 1 providers, as well as small ISPs, deploy RFD, and most of them use deprecated, harmful vendor defaults. If you are among them, please check and update your configuration. We will report on our ongoing work via the project website.

This research was presented at RIPE 80 — watch the recording.

Contributors: Clemens Mosig, Randy Bush, Cristel Pelsser, Thomas C Schmidt, Matthias Wählisch.

Adapted from the original version which appeared on RIPE Labs.

Clemens Mosig studies Computer Science at Freie Universität Berlin and works at the Internet Technologies research lab.
