Automatically detect peering infrastructure outages with new tool

By Vasileios Giotsas on 2 May 2018

Category: Tech matters

Networks rely increasingly on Internet Exchange Points (IXPs) and carrier-neutral interconnection facilities that enable dense localized peering connectivity to handle the massive traffic exchange between clients and servers.

IXPs provide layer-2 Ethernet switches to interconnect edge routers of IXP members, while co-location facilities offer physical space for networks to deploy equipment and establish direct cross-connects.

Today, there are over 640 IXPs and more than 2,600 facilities in the world. The largest IXPs have over 700 connected members with hundreds of thousands of peering interconnections among them.

Given the high concentration of interconnections, the uptime of peering infrastructures is crucial for overlay Internet applications. Facilities and IXPs strive to meet Service Level Agreements (SLAs) of ‘five nines’ (about five minutes of downtime per year) or ‘four nines’ (about 50 minutes of downtime per year). However, outages still occur due to power failures, human errors, attacks, and natural disasters.
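As a quick check of those downtime budgets, the arithmetic can be sketched in a few lines of Python. The function name and the choice of an average 365.25-day year are assumptions made for illustration.

```python
# Average year length in minutes (365.25 days, an assumption for illustration).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (5 -> 99.999%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in (4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

This gives roughly 52.6 minutes for four nines and 5.3 minutes for five nines, in line with the rounded figures quoted above.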

The geographic agglomeration of the peering activity leads to tight interdependencies between IXPs and co-location facilities, while practices such as remote peering extend the reach of local infrastructures to global scale. Therefore, failures can have cascading effects that mask the outage source and hinder accurate detection. Operators also often lack monitoring capabilities for infrastructures outside their network perimeter, so they resort to mailing lists, online forums, or social media to understand the causes of interruptions.

Kepler introduces a new methodology to automate the localization and monitoring of outages at IXPs and interconnection facilities, by using publicly available connectivity and routing data.

Key points:

  • IXP and carrier-neutral interconnection facility failures can have cascading effects that mask the outage source and hinder accurate detection.
  • Kepler is a new method to automate the localization and monitoring of outages at IXPs and interconnection facilities by using publicly available data.
  • It can detect outages in facilities that have at least six different members that appear in BGP paths annotated with location-tagging BGP Communities.

Detecting outages by measuring routing paths

To understand the challenges in detecting and localizing infrastructure outages in routing data, consider how a facility outage in the example below (Figure 1) is reflected on paths:

  • When Facility 2 fails, the traffic between AS1-AS2 switches to Facility 1 but the AS path remains the same.
  • The backup path shifts away both from the failed facility and the IXP. Having only AS2 as a vantage point (VP) doesn’t suffice to pinpoint the exact source of the outage. But if we also monitor the paths through AS4 we can observe that the IXP is still available.
  • To detect changes in the traversed infrastructures, we need to compare the routing states before and during the outage to find the affected hops.

Figure 1 — Example of how a facility outage is reflected on paths.

Therefore, Point of Presence (PoP)-level outage detection requires measuring routing paths from diverse vantage points, at high frequency, and at the granularity of infrastructure-level hops.
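The comparison of routing states described above can be illustrated with a minimal sketch. The topology below is hypothetical (it loosely follows the AS2 and AS4 vantage points in the example, with an invented Facility 3), and the function name and data layout are assumptions, not Kepler's implementation.

```python
from collections import defaultdict

# Hypothetical PoP-level hops seen per (vantage point, destination) pair.
before = {
    ("AS2", "AS1"): {"Facility 2", "IXP"},
    ("AS4", "AS1"): {"Facility 3", "IXP"},
}
during = {
    ("AS2", "AS1"): {"Facility 1"},          # backup path avoids Facility 2 and the IXP
    ("AS4", "AS1"): {"Facility 3", "IXP"},   # unchanged: the IXP is still reachable
}


def pop_status(before, during):
    """For each baseline PoP, count paths that still cross it and paths that left it."""
    kept = defaultdict(int)
    lost = defaultdict(int)
    for key, pops in before.items():
        current = during.get(key, set())
        for pop in pops:
            if pop in current:
                kept[pop] += 1
            else:
                lost[pop] += 1
    return {pop: (kept[pop], lost[pop]) for pop in set(kept) | set(lost)}


status = pop_status(before, during)
# A PoP is a suspect only if every path that used to cross it has moved away.
suspects = sorted(pop for pop, (k, l) in status.items() if k == 0 and l > 0)
print(suspects)
```

With AS2 alone, both Facility 2 and the IXP would look suspicious; adding AS4 shows that the IXP is still in use, which leaves Facility 2 as the only suspect.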

Note: we use the term ‘outage’ to refer strictly to the status of connectivity over the affected infrastructures. From an end-user perspective, this could also be a degradation of service with little or theoretically no impact.

Passive BGP measurements satisfy the first two requirements, but BGP encodes only AS-level paths. On the other hand, traceroute measurements reveal IP-level hops that can be mapped to IXPs and co-location facilities [1,2,3], but the high measurement overhead is virtually prohibitive for continuous probing.

To tackle these challenges, Kepler deciphers infrastructure-level data encoded in the Communities BGP attribute.

The BGP Communities attribute

BGP Communities are 32-bit numerical values used by AS operators to attach arbitrary information to BGP advertisements. Communities offer flexibility in defining complex and dynamic routing policies.

Between 2010 and 2016, the visible ASes using BGP Communities more than doubled, and the number of unique community values tripled to more than 50,000.

A popular application of Communities is to tag the interconnection point where a network received a route advertisement. Figure 2 shows how AS13030 uses the Community values 13030:51702 and 13030:4006 to annotate the facility and the IXP where the prefix is received by AS20940.

Figure 2 — Diagram showing how AS13030 uses the BGP Community values 13030:51702 and 13030:4006 to annotate the facility and the IXP where prefix is received by AS20940.
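To make the 32-bit layout concrete, here is a small sketch of how a value such as 13030:51702 maps to and from a single integer. It follows the usual convention of placing the AS number in the upper 16 bits and the operator-defined value in the lower 16 bits; the function names are illustrative.

```python
def parse_community(text: str) -> int:
    """Pack an 'ASN:value' string into its 32-bit integer form."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError(f"both halves must fit in 16 bits: {text!r}")
    return (asn << 16) | value


def format_community(raw: int) -> str:
    """Unpack a 32-bit community into the conventional 'ASN:value' notation."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


raw = parse_community("13030:51702")
print(raw, format_community(raw))
```

The upper half identifies which operator defined the value, which is why the same lower half can mean different things for different ASes.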

The Communities attribute values are not standardized, so their interpretation requires documentation sources. Many operators document their Communities values in Internet Routing Registry (IRR) records or on their web pages, but typically not in a machine-parsable format.

Kepler combines web mining techniques with Stanford’s Natural Language Processing platform to automatically compile a Communities dictionary that includes 5,284 interpreted Communities by 468 ASes and 48 route servers and covers 288 cities in 72 countries, 172 IXPs, and 103 facilities. While 468 ASes is a small fraction of the total ASes, it includes all but two Tier-1 ASes and most major peering ASes.
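Once compiled, such a dictionary can be used as a simple lookup table. The sketch below is only an illustration of the idea: the two community values come from the AS13030 example, but the location names (`facility-A`, `ixp-A`), the extra community `13030:1`, and the data layout are placeholders rather than entries from Kepler's dictionary.

```python
from typing import NamedTuple


class LocationTag(NamedTuple):
    kind: str  # "facility", "ixp", or "city"
    name: str  # placeholder label; real entries would carry the documented name


# Hypothetical excerpt of a Communities dictionary.
COMMUNITY_DICTIONARY = {
    "13030:51702": LocationTag("facility", "facility-A"),
    "13030:4006": LocationTag("ixp", "ixp-A"),
}


def locate(communities):
    """Return the location tags for the communities that the dictionary can interpret."""
    return [COMMUNITY_DICTIONARY[c] for c in communities if c in COMMUNITY_DICTIONARY]


# Communities attached to one route; "13030:1" stands in for a non-location value.
route_communities = ["13030:1", "13030:51702", "13030:4006"]
for tag in locate(route_communities):
    print(tag.kind, tag.name)
```

Communities that are missing from the dictionary are skipped, so a route is usable only if at least one of its communities is interpreted.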

Figure 3 — Map showing the location of BGP Communities ASes.

As shown in the above map (Figure 3), the majority of the Communities (66%) tag a location in Europe, followed by North America (24.5%). Only 2% of the communities cover locations in Africa and South America. Importantly, the interpreted BGP Communities are present in about 50% of all BGP IPv4 updates.

How Kepler works

The system is initialized by obtaining a stream of BGP data through BGPStream to extract BGP updates annotated with interpreted Communities. By continuously monitoring the BGP messages, Kepler establishes a baseline of paths that consistently traverse the same PoPs. Then, it monitors the baseline of stable paths to capture PoP-level changes, signalled either by explicit BGP withdrawals or by changes in the PoP-tagging community values.
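A minimal sketch of the baseline step follows, assuming the updates have already been reduced to (timestamp, path key, PoP) tuples. The stability rule used here (a path must have kept the same PoP tag for at least `min_stable_seconds`) is an assumption for illustration; the article does not state the exact criterion.

```python
def build_baseline(observations, min_stable_seconds):
    """Return {path_key: pop} for paths whose PoP tag has been stable long enough.

    observations: (timestamp, path_key, pop) tuples sorted by timestamp,
    where pop is None for an explicit withdrawal.
    """
    current = {}  # path_key -> (pop, timestamp of the first sighting of that pop)
    last_ts = None
    for ts, key, pop in observations:
        last_ts = ts
        if pop is None:
            current.pop(key, None)       # withdrawn paths leave the candidate set
        elif key not in current or current[key][0] != pop:
            current[key] = (pop, ts)     # new path or changed PoP restarts the clock
    if last_ts is None:
        return {}
    return {
        key: pop
        for key, (pop, since) in current.items()
        if last_ts - since >= min_stable_seconds
    }


# Hypothetical observations for one prefix from two vantage points.
observations = [
    (0, ("VP1", "192.0.2.0/24"), "Facility 1"),
    (0, ("VP2", "192.0.2.0/24"), "Facility 2"),
    (50, ("VP2", "192.0.2.0/24"), "Facility 3"),
    (100, ("VP1", "192.0.2.0/24"), "Facility 1"),
]
baseline = build_baseline(observations, min_stable_seconds=60)
print(baseline)
```

Only the VP1 path enters the baseline here, because the VP2 path changed its PoP tag too recently to count as stable.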

Routing updates are binned in time intervals to correlate path changes. For each interval, Kepler calculates the fraction of paths that continue to traverse the baseline PoP and raises an outage signal if, for a certain PoP, this fraction falls below a threshold.
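The binning and threshold check can be sketched as below. The bin width, the threshold value, and the per-bin simplification (each bin is evaluated independently against the baseline) are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def outage_signals(baseline, changes, bin_seconds, threshold):
    """Return (bin_start, pop, fraction) for PoPs whose remaining fraction is below threshold.

    baseline: {path_key: pop}
    changes:  (timestamp, path_key, new_pop) tuples, new_pop is None for a withdrawal
    """
    total = Counter(baseline.values())   # baseline paths per PoP
    moved = defaultdict(set)             # (bin_start, pop) -> paths that left the PoP
    for ts, key, new_pop in changes:
        old_pop = baseline.get(key)
        if old_pop is not None and new_pop != old_pop:
            bin_start = ts - ts % bin_seconds
            moved[(bin_start, old_pop)].add(key)
    signals = []
    for (bin_start, pop), keys in sorted(moved.items()):
        fraction = 1 - len(keys) / total[pop]
        if fraction < threshold:
            signals.append((bin_start, pop, fraction))
    return signals


# Hypothetical baseline: four paths via Facility 2, two via Facility 1.
baseline = {"p1": "Facility 2", "p2": "Facility 2", "p3": "Facility 2",
            "p4": "Facility 2", "p5": "Facility 1", "p6": "Facility 1"}
changes = [(65, "p1", None), (70, "p2", "Facility 1"),
           (80, "p3", "Facility 1"), (130, "p5", "Facility 2")]
print(outage_signals(baseline, changes, bin_seconds=60, threshold=0.5))
```

In this toy input, three of the four Facility 2 paths move away within one bin, so only a quarter of its baseline remains and a signal is raised; the single change at Facility 1 leaves half of its paths in place and does not cross the threshold.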

Outage signals can have different types of triggering events:

  • Link-level signals are caused by changes to an AS link that carries a large number of prefixes, for example, de-peering.
  • AS-level signals indicate changes in the availability of a densely connected AS at a specific location, for example, disconnection from an IXP.
  • Operator-level signals occur when all ASes under the same organization (sibling ASes) are affected.

PoP-level signals involve multiple AS links with disjoint near-end and far-end ASes and organizations. Kepler infers a PoP-level incident if at least three operator-level incidents occur in the same time bin at the same PoP.
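One way to picture the distinction between signal types is a small classifier over the AS links affected at one PoP in one time bin. This is a simplified reading of the rules above, not Kepler's actual logic: it treats each distinct organization as one operator-level incident, requires at least three distinct organizations at both ends for a PoP-level incident, and uses hypothetical AS numbers and organization names.

```python
def classify_signal(affected_links, as_to_org, min_operators=3):
    """Classify the links affected at one PoP in one time bin.

    affected_links: iterable of (near_end_asn, far_end_asn)
    as_to_org:      {asn: organization}; an AS without an entry is its own organization
    """
    links = set(affected_links)
    if not links:
        raise ValueError("no affected links")
    near_ases = {near for near, _ in links}
    far_ases = {far for _, far in links}
    near_orgs = {as_to_org.get(asn, asn) for asn in near_ases}
    far_orgs = {as_to_org.get(asn, asn) for asn in far_ases}
    if len(near_orgs) >= min_operators and len(far_orgs) >= min_operators:
        return "pop-level"
    if len(links) == 1:
        return "link-level"
    if len(near_ases) == 1 or len(far_ases) == 1:
        return "as-level"
    if len(near_orgs) == 1 or len(far_orgs) == 1:
        return "operator-level"
    return "unclassified"


siblings = {64496: "OrgA", 64497: "OrgA"}  # hypothetical sibling ASes
print(classify_signal([(64496, 64510)], siblings))
print(classify_signal([(64496, 64510), (64497, 64511)], siblings))
print(classify_signal([(64496, 64506), (64497, 64507), (64498, 64508)], {}))
```

The ordering of the checks matters: the PoP-level test comes first so that a broad incident is not mistaken for several unrelated link-level ones.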

Kepler validates the occurrence and duration of outages via periodic traceroute measurements. Using RIPE Atlas and CAIDA’s Ark paths, it selects sources and destinations whose paths were found to cross the affected PoP and checks whether those paths still traverse it. When over half of the paths return to the baseline, the outage is inferred as restored.
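The restoration rule (more than half of the paths back on the baseline) can be expressed as a short check. The data structures are assumptions for illustration: a set of source and destination pairs known to cross the PoP before the outage, and the set of PoPs seen in each pair's latest traceroute.

```python
def is_restored(baseline_pairs, latest_hops, pop):
    """True when more than half of the baseline pairs traverse `pop` again.

    baseline_pairs: set of (source, destination) pairs that crossed `pop` before the outage
    latest_hops:    {(source, destination): set of PoPs seen in the latest traceroute}
    """
    if not baseline_pairs:
        return False
    back = sum(1 for pair in baseline_pairs if pop in latest_hops.get(pair, set()))
    return back / len(baseline_pairs) > 0.5


# Hypothetical probes and targets.
pairs = {("probe1", "dst1"), ("probe2", "dst1"), ("probe3", "dst2")}
latest = {
    ("probe1", "dst1"): {"Facility 2", "IXP"},
    ("probe2", "dst1"): {"Facility 2"},
    ("probe3", "dst2"): {"Facility 1"},
}
print(is_restored(pairs, latest, "Facility 2"))
```

Pairs without a recent measurement count as not restored, which keeps the check conservative.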

Increasing signal resolution and signal disambiguation

The majority of Communities annotate routes at city-level granularity. To achieve infrastructure-level detection, Kepler uses a co-location map, built from PeeringDB and data on AS websites, that links:

  • ASes to IXPs
  • ASes to facilities
  • IXPs to facilities

The co-location map is used to de-correlate the ‘fate’ of ASes during a city-level outage signal, according to their connectivity at facilities in the same city. The co-location map is also used in disambiguating outage signals.

Figure 4 — The physical connectivity between two ASes can involve multiple PoPs, while Communities only identify the nearest-end PoP (highlighted in green). A failure in any of them will trigger a signal at the near-end facility.

Kepler determines the outage source by correlating the affected ASes with their presence at common facilities. If there are concurrent signals for multiple infrastructures in the same city, the signals are collapsed to a single IXP-level or city-level incident.
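The correlation step can be sketched as a ranking of facilities by the share of their tracked members that are affected. This scoring is a simplification chosen for illustration, not the method from the paper, and the AS numbers and facility labels are hypothetical.

```python
def rank_facilities(as_to_facilities, affected, tracked):
    """Rank facilities by the fraction of their tracked member ASes that are affected."""
    scores = {}  # facility -> (affected members, tracked members)
    for asn in tracked:
        for facility in as_to_facilities.get(asn, ()):
            hit, total = scores.get(facility, (0, 0))
            scores[facility] = (hit + (asn in affected), total + 1)
    return sorted(
        ((hit / total, facility) for facility, (hit, total) in scores.items()),
        reverse=True,
    )


# Hypothetical co-location map for one city.
as_to_facilities = {
    64496: {"A", "B"},
    64497: {"A"},
    64498: {"A", "C"},
    64499: {"B"},
    64500: {"C"},
}
affected = {64496, 64497, 64498}
ranked = rank_facilities(as_to_facilities, affected, tracked=set(as_to_facilities))
print(ranked)
```

Facility A ranks first because every tracked member present there is affected, while B and C each host a member whose paths did not change.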

How it performs

Kepler can detect outages in facilities that have at least six different members that can be located through Communities, three at the near-end of a link, and three at the far-end.

Figure 5 — About 50% of IPv4 and 30% of IPv6 paths in 2016 were annotated with at least one location-encoding Community and thus were usable by Kepler. Moreover, Kepler’s Communities consistently tag over 35% of the IPv4 and 28% of the IPv6 AS links across every BGP snapshot.

Of the 1,742 facilities in the co-location map, 1,209 have fewer than six members; for another 130 there are fewer than six trackable members. Therefore, Kepler can track 403 facilities (23%), meaning that the detected outages are a lower bound. However, Kepler covers 180 out of 183 (98%) facilities with at least 20 members, which are the most prominent interconnection hubs.
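The coverage figures follow directly from the counts in the paragraph above, as this short calculation shows (variable names are illustrative).

```python
total_facilities = 1742
fewer_than_six_members = 1209
fewer_than_six_trackable = 130

# Facilities left after removing those with too few (trackable) members.
trackable = total_facilities - fewer_than_six_members - fewer_than_six_trackable
print(trackable, f"{trackable / total_facilities:.1%}")

# Coverage among facilities with at least 20 members.
large_total, large_covered = 183, 180
print(f"{large_covered / large_total:.1%}")
```

This yields 403 trackable facilities (23.1% of the map) and 98.4% coverage among the largest facilities.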

Using Kepler to analyze historical BGP data from 2012 to 2016, we found 159 outages among 87 facilities and 41 IXPs. The number of outages remained relatively stable over time, fluctuating between 10 and 15 outages per half-year, with the exception of the second half of 2012 due to the visible impact of Hurricane Sandy.

We are currently working on making Kepler available via an interactive interface and making our dictionary of location-based communities available on request.

For more information on Kepler, please read the full paper published at SIGCOMM 2017 and watch our presentation, or leave us a comment.

Vasileios Giotsas is a researcher at Lancaster University. His work focuses on systems security and resilience, distributed systems, and measurement and analysis of Internet protocols.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.
