Troubleshooting the other half

Altered from Karla Hernandez's original at Unsplash.

You should know the operational state of your network because there really is no reason why you should not. The IETF and your hardware vendor of choice have worked hard to give you a whole fleet of protocols and machinery to assess all aspects of the pieces of the Internet that are under your direct control. But once you leave your own network and enter the public Internet, there is hardly anything you can do to pinpoint potential problems. And you might argue that nobody other than the network operator in question should actually be able to do that. Well, people using vital infrastructure hosted in the cloud might disagree.

While the Operations, Administration, and Management (OAM) superpowers you are accustomed to in your own network are not available on the public Internet, there are two tools everyone can use: ping and traceroute. Ping does not give you an awful lot of information. It tells you whether an interface is reachable (a yes-or-no answer) and it estimates the round-trip time (RTT) to that interface. Traceroute extends ping’s functionality by also enumerating the routers on the path towards a destination and providing an estimate of the RTT to each of them. That doesn’t sound so bad. But traceroute output can be difficult to interpret correctly, even for network professionals.

One of the ‘problems’ that traceroute has is that it can only discover the forward path between you and a destination on the Internet. At first glance, that might not seem like a big problem, but on today’s Internet, most paths are asymmetric, that is, the forward and return paths differ. Traceroute output might therefore suggest a problem in the forward direction, whereas the real problem in the network is on the reverse path.

Figure 1 – Most paths are asymmetric in today’s Internet.

Figure 1 above illustrates this point. A client in network A is running a traceroute towards www.example.com, which is in network B. Traceroute’s output at the client is shown at the top of the figure. Network A keeps the traceroute packets inside its own network for as long as possible, until router C hands those packets off to router D, which is in network B. All Internet Control Message Protocol (ICMP) messages generated by routers A, B and C will take a return path through network A. The routers in network B follow that network’s own policy, which prefers to deliver packets addressed to the client through network C. So starting at router D, all ICMP messages will travel back to the client through network C.

Now, in Figure 1, there is a problem between routers E and F. These routers are not part of the traceroute output above, as traceroute does not ‘see’ the reverse path. But the effect of the problem is clearly visible: the latency in the traceroute output jumps from a few milliseconds to over 300 ms between routers C and D, even though the router names suggest that both of them are in Frankfurt. From the perspective of the client, the problem is between routers C and D, whereas in reality, it is somewhere completely different.
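The effect is easy to reproduce with a toy model. The following Python sketch is an illustration only, not a measurement: the link delays are invented, the exact return path through routers E and F is an assumption about the topology in Figure 1, and the names FORWARD_PATH, RETURN_PATH, LINK_DELAY_MS and traceroute_rtts are ours. It computes the RTT traceroute would report for each hop as the delay of the forward path to that hop plus the delay of the path the hop's ICMP reply takes back.

```python
# Toy model of Figure 1. All delays are invented, one-way, in milliseconds.
FORWARD_PATH = ["client", "A", "B", "C", "D"]

# One-way delay of each link, keyed by the (unordered) pair of its endpoints.
LINK_DELAY_MS = {
    frozenset(("client", "A")): 1.0,
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 1.0,
    frozenset(("C", "D")): 1.0,
    frozenset(("D", "E")): 1.0,
    frozenset(("E", "F")): 300.0,  # the faulty link, only on the return path
    frozenset(("F", "client")): 1.0,
}

# Path taken by each router's ICMP reply on its way back to the client.
RETURN_PATH = {
    "A": ["A", "client"],
    "B": ["B", "A", "client"],
    "C": ["C", "B", "A", "client"],
    "D": ["D", "E", "F", "client"],  # network B prefers to return via network C
}


def path_delay(path):
    """Sum the one-way delays of all links along a path."""
    return sum(LINK_DELAY_MS[frozenset(link)] for link in zip(path, path[1:]))


def traceroute_rtts():
    """Return the RTT the client would observe for every forward hop."""
    rtts = {}
    for i, hop in enumerate(FORWARD_PATH[1:], start=1):
        forward = path_delay(FORWARD_PATH[: i + 1])
        rtts[hop] = forward + path_delay(RETURN_PATH[hop])
    return rtts


if __name__ == "__main__":
    for hop, rtt in traceroute_rtts().items():
        print(f"{hop}: {rtt:.1f} ms")
```

In this model every forward link is healthy, yet the reported RTT jumps at router D, because D is the first hop whose reply crosses the faulty link between E and F.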

Interestingly, there is a ‘solution’ today. Network professionals often subscribe to mailing lists where they discuss issues, report problems and ask for favours. On these lists, it is not uncommon to present traceroute output and ask people to traceroute back. Why? Because, as we have seen in the example, their traceroute output suggests that there is something wrong, they suspect they know what it is, but given traceroute’s output alone, they cannot really tell. A traceroute in the other direction would allow them to be certain. These communities are usually extremely helpful and often somebody will eventually come up with the information, but that ‘mechanism’ is not quite real-time, reliable or universally available. We should be able to do better. We need an actual ‘reverse traceroute’ tool.

We have proposed a reverse traceroute mechanism to the IETF. More specifically, we have submitted an Internet-Draft to the Internet Area Working Group of the IETF. We have also implemented that mechanism as an eBPF program (server) and in Python (client). There are even a few endpoints (servers) already online, so everybody can play around with it. And it works for both IPv4 and IPv6.

The protocol is pretty straightforward. We are using a new ICMP message to trigger a single packet being sent by a remote host towards the requestor. You can specify which protocol to use (UDP, TCP or ICMP), what time-to-live (TTL) value to set in that packet and a few other things. The packet sent by the remote host looks like any ordinary probe packet that traceroute sends today. It also gives you exactly the same kind of information: the IP address of the router at which the packet expired and an RTT estimate. That information is reported back to the client.
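To make the flow concrete, here is a minimal Python sketch of the client-side loop. It is an illustration under stated assumptions, not our implementation and not the message format from the draft: ProbeRequest, ProbeResult, make_simulated_host and reverse_traceroute are hypothetical names, the remote host is simulated in memory instead of being contacted with ICMP, the addresses come from the documentation ranges, and the way the end of the path is signalled (a reached_client flag) is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

ALLOWED_PROTOCOLS = ("udp", "tcp", "icmp")


@dataclass(frozen=True)
class ProbeRequest:
    """Hypothetical request: which single probe the remote host should send."""
    protocol: str
    ttl: int

    def __post_init__(self) -> None:
        if self.protocol not in ALLOWED_PROTOCOLS:
            raise ValueError(f"unsupported protocol: {self.protocol!r}")
        if not 1 <= self.ttl <= 255:
            raise ValueError(f"TTL out of range: {self.ttl}")


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical report: where the probe expired and the RTT estimate."""
    hop_address: str
    rtt_ms: float
    reached_client: bool


def make_simulated_host(
    reverse_path: Sequence[Tuple[str, float]],
) -> Callable[[ProbeRequest], ProbeResult]:
    """Simulate a remote host; the last entry of reverse_path is the client."""
    def handle(request: ProbeRequest) -> ProbeResult:
        # A probe with TTL n expires at the n-th hop, or arrives at the client.
        index = min(request.ttl, len(reverse_path)) - 1
        address, rtt_ms = reverse_path[index]
        return ProbeResult(address, rtt_ms, index == len(reverse_path) - 1)
    return handle


def reverse_traceroute(
    send_request: Callable[[ProbeRequest], ProbeResult],
    protocol: str = "udp",
    max_ttl: int = 30,
) -> List[ProbeResult]:
    """Request one probe per TTL until a probe reaches the client."""
    hops: List[ProbeResult] = []
    for ttl in range(1, max_ttl + 1):
        result = send_request(ProbeRequest(protocol, ttl))
        hops.append(result)
        if result.reached_client:
            break
    return hops


# Made-up reverse path: two routers, then the client itself.
EXAMPLE_REVERSE_PATH = [
    ("198.51.100.1", 1.2),
    ("203.0.113.7", 8.5),
    ("192.0.2.10", 12.0),
]

if __name__ == "__main__":
    host = make_simulated_host(EXAMPLE_REVERSE_PATH)
    for ttl, hop in enumerate(reverse_traceroute(host), start=1):
        print(f"{ttl:2d}  {hop.hop_address:<15}  {hop.rtt_ms:.1f} ms")
```

The point of the sketch is the division of labour: the client only chooses the protocol and the TTL, the remote host sends exactly one probe per request, and the client assembles the reverse path hop by hop from the reports it gets back.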

We designed the protocol with several goals in mind: safety and security, applying the lessons learned from past attempts (attempts at reverse traceroute go back at least as far as 1993, with RFC 1393), and following common wisdom (for example, the mechanism should not require changes to routers).

A blog post is probably not the right place to go into too much detail. If you are interested, there is a project website, information about the implementation and its use can be found on GitHub, and of course there is a document describing the protocol in detail.

We asked the potential ‘customers’ of this work: network operators, or more precisely, the people who operate networks. We presented reverse traceroute at DENOG14, the German Network Operators Group’s annual conference, in November 2022. The talk is in English and you can find it online.

The feedback was phenomenal. Clearly, people running production networks understand the problem, and they know that a better solution than mailing lists would be a great addition to their operational tool belt. The public instances of reverse traceroute that exist today are all hosted by participants from DENOG14. So when can you expect widespread adoption?

The ultimate goal is that such a mechanism will eventually be available in every operating system, just as ping and traceroute are today. The key ingredient to achieve this is an Internet standard. We have, as previously mentioned, suggested the protocol to the IETF and here’s the catch: The people that operate networks like it, but they usually do not participate in the IETF. A lot of people in the IETF usually have something very specific they work on and do not pick up on new work. If a new draft does not overlap with their specific interests inside the IETF, the chances are high that it won’t be discussed.

‘If you build it, it will come’ won’t work here. We’ve received some feedback in private but the discussion must take place in public, specifically, on the IntArea mailing list. Take this as a call to action if you would like to see reverse traceroute become available on your console and help us develop a tool for the Internet to measure the other half!

Dr Rolf Winter is the Professor of Data Communications at Augsburg University of Applied Sciences, where he teaches and researches computer networks, especially the Internet.

Valentin Heinrich is a Research Assistant at the University of Applied Sciences Augsburg, working mainly on network measurements and network protocol design.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.