Improving coverage of Internet outage detection

By Guillermo Baltra on 9 Jun 2020

Category: Tech matters

Internet reliability is of concern to all Internet users, and improving reliability is the goal of industry and governments.

Several groups use different approaches to measure Internet outages, including active probing of most of the IPv4 Internet and passive observation of traffic. Our group at the Information Sciences Institute of the University of Southern California (USC/ISI) developed Trinocular, a system that has pinged millions of /24 IPv4 blocks every 11 minutes since October 2014.

Broad coverage is an important goal of any outage detection system. As figure 1 shows, active detection systems like Trinocular report 4M /24 trackable blocks, and passive systems using Content Delivery Network data (Akamai/MIT) report coverage for more than 2M blocks. More specialized systems focus coverage on areas with bad weather (Thunderping), or provide broad, country-level or regional coverage, but perhaps without /24-level granularity inside the regions (CAIDA UCSD-NT).

Figure 1 — State-of-art outage detection systems’ IPv4 address coverage.

Although each of the systems provides broad coverage, each recognizes there are portions of the Internet that it cannot measure because the signal it measures is not strong enough. Systems typically detect and ignore areas where they have an insufficient signal.

Out of the 5.9M responsive blocks, Trinocular tracks 4M, but it produces reliable results with few or no false outages for only 3.4M of them, since it requires at least 15 responsive addresses in a block. Other systems also ignore blocks where only a few addresses reply. Finding a good threshold can be hard: setting thresholds too high reduces coverage, yet setting them too low risks false outages from misinterpreting a weak signal.

Our PAM paper describes two new algorithms designed to increase coverage in Trinocular from 4M to more than 5M /24 blocks. Full Block Scanning (FBS) improves coverage for sparse blocks while retaining accuracy and limits on probing rates. Lone-Address-Block Recovery (LABR) increases coverage by providing partial results for blocks with very few active addresses.

Overcoming the limitations of scanning

Sparse blocks challenge accuracy because of a tension between the amount of probing and the likelihood of getting a response. To constrain traffic to each block, and to track millions of blocks, Trinocular limits each block to 15 probes per round.

Limited probing can cause false outages in two ways: First, it may fail to reach a definitive belief and mark the block as unknown. Alternatively, if the block is usually responsive, a few non-responses may produce a down belief.
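To see why sparse blocks are prone to these false outages, consider a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each probe sent to a block that is actually up is answered independently with probability A, the fraction of probed addresses that currently respond. The function name and the independence model are assumptions made for this example; this is not Trinocular's actual inference.

```python
def prob_all_probes_silent(availability: float, probes: int = 15) -> float:
    """Chance that every probe in one round goes unanswered although the block is up.

    Illustrative model only: each probe is assumed to be answered
    independently with probability `availability`.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if probes < 0:
        raise ValueError("probes must be non-negative")
    return (1.0 - availability) ** probes


if __name__ == "__main__":
    # Compare sparse and dense blocks under the 15-probe-per-round limit.
    for a in (0.05, 0.10, 0.30, 0.60):
        print(f"A={a:.2f}: P(no reply in 15 probes) = {prob_all_probes_silent(a):.4f}")
```

Under this simplified model, the sparser the block, the more likely a full round of 15 probes returns nothing even though the block is up, which is the weak signal that leads to an unknown or down belief.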

As an example, figure 2 shows a block that passes through four levels of sparsity (periods starting 2017-10-06, 2017-10-27, 2017-11-14 and 2017-12-16), with (d) individual address responses to Trinocular probes and (c) Trinocular state inferences. As the block gets denser, Trinocular's inferences become more accurate. Every address in this block has responded in the past, but in the first three periods only a few are actually in use, making the block temporarily sparse.

Figure 2 — A sample block over time (columns). The top bar (a) shows Lone-Address-Block Recovery. Bar (b) is Full Block Scanning and (c) shows the Trinocular status (up, unknown, and down). The bottom bar (d) shows individual addresses as rows, with coloured dots when the address responds to Trinocular.

FBS handles sparse blocks by first identifying them and then, when a sparse block appears to go down, requiring a Full Round of non-responses before confirming the outage. A Full Round means attempting to probe every known responsive address in the block. FBS thus avoids making a hasty ‘out’ decision based on non-responses from a few addresses when others in the block are still responsive.
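The Full Round idea can be sketched roughly as follows. This is a minimal illustration rather than the Trinocular implementation: the class name, the method, and the intermediate 'pending' state are hypothetical, and it assumes each probing round reports which addresses were probed and whether each one replied.

```python
from typing import Dict, Iterable, Set


class FullBlockScan:
    """Hypothetical sketch: confirm 'down' only after a Full Round of silence."""

    def __init__(self, known_responsive: Iterable[str]) -> None:
        # Addresses in the block that have responded in the past.
        self.known: Set[str] = set(known_responsive)
        # Known addresses probed without reply since the last positive response.
        self.silent: Set[str] = set()

    def observe_round(self, results: Dict[str, bool]) -> str:
        """`results` maps each probed address to True if it replied."""
        if any(results.values()):
            # Any reply shows the block is reachable; restart the scan.
            self.silent.clear()
            return "up"
        # Accumulate non-responses across rounds, keeping per-round probing limited.
        self.silent.update(addr for addr in results if addr in self.known)
        if self.known and self.silent >= self.known:
            # Every known responsive address has been tried and none replied.
            return "down"
        return "pending"


if __name__ == "__main__":
    fbs = FullBlockScan(["192.0.2.1", "192.0.2.7", "192.0.2.200"])
    print(fbs.observe_round({"192.0.2.1": False, "192.0.2.7": False}))
    print(fbs.observe_round({"192.0.2.200": False}))
```

Accumulating non-responses over several rounds, instead of probing more per round, is one way to keep the per-block probing limit while still covering every known responsive address before declaring an outage.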

The second challenge to coverage is blocks where only one or two addresses are active; we call these lone-address blocks.

When only a single address is active, a lack of response may indicate a network outage, but it may also be caused by a reboot of that one computer or by something else entirely. The implication of non-response from a single address is ambiguous.

To avoid false down events resulting from non-outage problems with a lone address, we use LABR. We accept up events, but because outages are rare (much rarer than packet loss), we convert down events to ‘unknown’ for blocks with very few recently active addresses. By ‘few’ we mean fewer than three addresses, so that we avoid making decisions based on one or two addresses where packet loss could change results.
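The LABR rule can be expressed as a small post-processing step on each block's inferred state. The sketch below is illustrative only: the function name and the state strings are hypothetical, and it assumes ‘few’ means fewer than three recently active addresses (that is, one or two).

```python
LONE_ADDRESS_LIMIT = 3  # assumption: blocks with fewer than three active addresses count as 'lone'


def apply_labr(state: str, recently_active: int, limit: int = LONE_ADDRESS_LIMIT) -> str:
    """Hypothetical sketch of Lone-Address-Block Recovery.

    Up events are always accepted. A down event for a block with very few
    recently active addresses is ambiguous, so it is reported as 'unknown'.
    """
    if state == "down" and recently_active < limit:
        return "unknown"
    return state


if __name__ == "__main__":
    for state, active in [("up", 1), ("down", 1), ("down", 2), ("down", 3)]:
        print(state, active, "->", apply_labr(state, active))
```

The asymmetry is deliberate: a reply from even one address is solid evidence that the block is up, whereas silence from one or two addresses could be packet loss or a single machine rebooting.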

Trinocular now has 43% greater coverage

Applying these algorithms increases Trinocular coverage from 4M blocks to more than 5M. Figure 1 shows the coverage gains from the two new algorithms. Trinocular with FBS achieves greater coverage than other methods of filtering or detection. FBS repairs 1.2M blocks, most of them sparse: of 0.9M sparse blocks, we find that FBS fixes 0.8M.

The remaining 100K are either good blocks that went dark after a change in usage, pushing the quarterly average of active addresses down, or sparse blocks with few active addresses where Trinocular can do a better job of inferring the correct state.

Watch – Guillermo’s presentation at PAM 2020.

Guillermo Baltra is a PhD student in the Analysis of Network Traffic (ANT) Lab at USC/ISI.

