How complete is our view of the Internet?

Originally designed as a research network, the Internet has evolved into the most popular commercial network. As such, the complexity of its structure has increased significantly due to the introduction of new entities such as Internet Exchange Points (IXPs), content providers, and cloud providers.

Tracking the influence of such complexity changes and the global effects of local routing decisions has piqued the interest of both operators and researchers. This led to the introduction of various route collectors — dedicated devices that establish sessions with Autonomous Systems (ASes) in order to collect routing information exchanged via the Border Gateway Protocol (BGP). ASes that provide their routing information to route collectors are usually referred to as monitors or feeders.

Today, many commonly used services such as hijack detection, outage detection, or importance rankings rely on the data of route collectors. While a single route collector can have a large number of feeders, every feeder provides only a limited ‘view’ of the Internet’s routing ecosystem because of various business aspects of inter-domain routing. Notably, capturing peering links is much harder than capturing provider or customer links. It is therefore important to understand the limitations of the currently available route collector data and their implications.

Below is a summary of a pilot project that we at the Max-Planck-Institut für Informatik conducted to understand these limitations. We gathered four days of data from the public route collector projects BGPMon, Isolario, PCH, RIPE RIS, and Route Views.

We collected both routing information base (RIB) snapshots and update messages. To account for anomalies in the collected routing data, for example announcements of unassigned address space or AS_PATHs containing loops, we applied various well-known sanitation steps. Afterward, we generated topology information based on the AS_SEQUENCE fields within the AS_PATH as well as single-element AS_SET fields.
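To make the link extraction more concrete, here is a minimal sketch of how undirected AS links could be derived from a single AS_PATH. It is not our actual pipeline: the segment representation, the function names, and the choice to discard a whole path when it contains a multi-element AS_SET or a loop are assumptions made for illustration.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

# Hypothetical representation: an AS_PATH is a list of segments, each a
# ("SEQUENCE" | "SET", [ASNs]) tuple.
Segment = Tuple[str, List[int]]
Link = FrozenSet[int]


def flatten_path(segments: List[Segment]) -> Optional[List[int]]:
    """Flatten AS_SEQUENCEs and single-element AS_SETs into one hop list.

    Returns None if a multi-element AS_SET is present (assumption: such
    paths are discarded entirely).
    """
    hops: List[int] = []
    for seg_type, asns in segments:
        if seg_type == "SET" and len(asns) != 1:
            return None
        hops.extend(asns)
    return hops


def collapse_prepending(hops: List[int]) -> List[int]:
    """Remove consecutive duplicates caused by AS path prepending."""
    out: List[int] = []
    for asn in hops:
        if not out or out[-1] != asn:
            out.append(asn)
    return out


def links_from_path(segments: List[Segment]) -> Set[Link]:
    """Return the undirected links implied by one sanitized AS_PATH."""
    hops = flatten_path(segments)
    if hops is None:
        return set()
    hops = collapse_prepending(hops)
    # After collapsing prepending, any repeated ASN indicates a loop.
    if len(set(hops)) != len(hops):
        return set()
    return {frozenset(pair) for pair in zip(hops, hops[1:])}


# Toy example using ASNs reserved for documentation.
example = [("SEQUENCE", [64500, 64501, 64501, 64502]), ("SET", [64503])]
print(sorted(sorted(link) for link in links_from_path(example)))
```

Storing each link as a `frozenset` makes it undirected, so the same link seen in both directions is counted once. Filtering announcements of unassigned address space would need an external allocation list and is left out here.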

Making sense of all the data

As illustrated in Figure 1, Isolario, RIPE RIS, and Route Views monitor a similar number of links; however, none of these projects by itself comes close to the combined visibility of all projects. Isolario, RIPE RIS, and Route Views contribute roughly 130k, 22k, and 70k links respectively that are not visible in the data of any other project.

Figure 1 — Number of unique undirected links viewed.
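The number of links that only one project sees boils down to a set difference. The sketch below shows the idea on toy data; the function name and the tiny link sets are made up for illustration and have nothing to do with the real measurements.

```python
from typing import Dict, FrozenSet, Set

Link = FrozenSet[int]


def exclusive_links(links_by_project: Dict[str, Set[Link]]) -> Dict[str, Set[Link]]:
    """For each project, return the links that no other project observed."""
    exclusive: Dict[str, Set[Link]] = {}
    for name, links in links_by_project.items():
        # Union of the links seen by every other project.
        others: Set[Link] = set().union(
            *(l for n, l in links_by_project.items() if n != name)
        )
        exclusive[name] = links - others
    return exclusive


def link(a: int, b: int) -> Link:
    """Build an undirected link between two ASNs."""
    return frozenset((a, b))


# Toy data only; ASNs are taken from the documentation range.
toy = {
    "project_a": {link(64496, 64497), link(64497, 64498)},
    "project_b": {link(64497, 64498), link(64498, 64499)},
    "project_c": {link(64497, 64498)},
}

for name, links in exclusive_links(toy).items():
    print(name, len(links))
```

The combined visibility of all projects is simply the union of the per-project sets.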

When we look at the transit degree of each project’s feeders, that is, the number of ASes for which a feeder appears to provide transit (Figure 2), Isolario and RIPE RIS appear to be biased towards larger networks, while PCH and Route Views appear to be biased towards smaller networks. Thus, despite observing a similar number of links, every project captures a slightly different part of the Internet’s ecosystem.

Figure 2 — Transit degree for each route collector.
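For readers who want to compute something similar, here is a rough sketch of one common way to approximate the transit degree from AS paths: count the distinct neighbours that appear directly next to an AS whenever it sits in the middle of a path. This is an assumption about the metric rather than a description of our exact method, and the paths are assumed to be sanitized already (no prepending, no loops).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def transit_degrees(paths: Iterable[List[int]]) -> Dict[int, int]:
    """Approximate the transit degree of every AS seen mid-path.

    An AS only counts as providing transit when it appears between two
    other ASes; the ASes at either end of a path are not credited.
    """
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for path in paths:
        # Slide a window of three hops over the path.
        for left, mid, right in zip(path, path[1:], path[2:]):
            neighbours[mid].update((left, right))
    return {asn: len(adjacent) for asn, adjacent in neighbours.items()}


# Toy paths using ASNs reserved for documentation.
toy_paths = [
    [64496, 64497, 64498],
    [64499, 64497, 64498],
    [64500, 64498, 64501],
]

degrees = transit_degrees(toy_paths)
print(degrees.get(64497, 0), degrees.get(64498, 0), degrees.get(64496, 0))
```

ASes that never appear in the middle of a path are missing from the result, so `degrees.get(asn, 0)` treats them as having a transit degree of zero.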

While estimating the incompleteness of each data source is hard, we can derive a lower bound on the number of missing links by taking other data sources into account. We gathered additional control plane information from 233 looking glasses, 31 route servers, and the Internet Routing Registries.

We further extended this information with two types of inferences. First, we inferred AS links from IP paths gathered from RIPE Atlas, CAIDA’s Ark, the bismark project, the portolan project, and active measurements sourced from VMs hosted in clouds of Amazon and Google. Second, we inferred links based on multilateral peering (MLP) agreements leveraging explicitly defined BGP community encodings of 42 IXP route servers.
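To give a feel for the second type of inference, the sketch below derives multilateral peering links from the export policies that members signal to a route server. Everything specific in it is an assumption for illustration: the route server ASN, the community scheme (`RS:RS` announce to all, `0:peer` exclude a peer, `0:RS` announce to none, `RS:peer` include a peer), and the rule that a link exists only when both members allow each other. Real IXPs document their own encodings, and these can differ.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Community = Tuple[int, int]
RS_ASN = 64511  # hypothetical route server ASN


def allowed_peers(communities: Set[Community], members: Set[int], self_asn: int) -> Set[int]:
    """Decode one member's export policy into the set of members it announces to."""
    others = members - {self_asn}
    if (0, RS_ASN) in communities:
        # "Announce to none", optionally followed by explicit includes.
        return {peer for high, peer in communities if high == RS_ASN and peer in others}
    # Default: announce to all members except explicit excludes.
    blocked = {peer for high, peer in communities if high == 0}
    return others - blocked


def infer_mlp_links(communities_by_member: Dict[int, Set[Community]]) -> Set[FrozenSet[int]]:
    """Infer a link for every pair of members that allow each other."""
    members = set(communities_by_member)
    allowed = {
        asn: allowed_peers(comms, members, asn)
        for asn, comms in communities_by_member.items()
    }
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(members), 2)
        if b in allowed[a] and a in allowed[b]
    }


# Toy policies; member ASNs come from the documentation range.
toy_policies = {
    64496: {(RS_ASN, RS_ASN)},             # announce to everyone
    64497: {(0, 64496)},                   # everyone except 64496
    64498: {(0, RS_ASN), (RS_ASN, 64496)}  # only 64496
}

print(sorted(sorted(l) for l in infer_mlp_links(toy_policies)))
```

Because this kind of inference works on policies rather than on observed routes, it yields topology only, not the BGP dynamics that route collectors record.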

When adding all of these data sources together, we found an additional 146k undirected AS links (527k in total). MLP inferences contributed the most additional links (62%), followed by IP paths (30%); however, both of those sources capture only topological information and thus fail to capture BGP dynamics. Notably, the inference of AS links based on IP paths is known to be inaccurate, which makes this data source the least reliable in our study. Despite requiring the most effort, looking glasses and route servers contributed only a marginal fraction (2%) of additional links.

The aggregated data from all collector projects revealed 72.3% of all AS links. Combining data only from Route Views and RIPE RIS revealed no more than 57% of all AS links. Considering that these are the most frequently used route collector tools, this combination of data sources appears highly incomplete.
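The 72.3% figure follows directly from the link counts given earlier. As a quick sanity check, assuming the rounded values of 527k links in total and 146k additional links:

```python
# Rounded figures from the text; the exact counts are not given here.
total_links = 527_000
additional_links = 146_000

# Links visible to the route collector projects alone.
collector_links = total_links - additional_links
collector_share = collector_links / total_links

print(f"{collector_links:,} links, {collector_share:.1%} of all observed links")
```

Keep in mind that the total is itself a lower bound, so the true share seen by the collectors can only be smaller.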

Finally, we compared the individual data sources against private routing information of a European Tier-1 ISP. The route collector data alone contained 98.3% of AS links visible from the ISP. Adding the other inferences only increased this number by 0.3%.

However, when explicitly looking at the direct neighbors of the ISP, the picture changes. The route collector data alone sees only 85.9% of the direct neighbors. The inferences detected an additional 7% of the ISP’s direct neighbors. In summary, when using all data sources the topological view of our Tier-1 ISP was reflected rather well.

Data is only as good as its completeness

Whenever you are confronted with claims derived from one specific data source, pay attention to its incompleteness and possible bias. Furthermore, if you plan to use routing information for your own analysis, make sure to use not only RIPE RIS and Route Views but also Isolario: its data is structured in the same way and complements the others nicely.

Finally, if you operate a network that is currently not feeding any collector project, consider providing your view. Some collector projects, such as Isolario, give you a set of real-time monitoring tools the moment you feed their system. Your data also helps day-to-day services such as hijack detection achieve more accurate results.

Lars Prehn is a PhD student in Computer Science at Max-Planck-Institut für Informatik.


The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.
