PSA: Traceroute — safe and effective to use

By on 11 Feb 2025

Category: Tech matters

Traceroute is a safe and effective troubleshooting tool for all levels of IT and networking experience. It is used daily by people worldwide and is supported by all major network vendors, operating systems, and hardware — including end-user operating systems like Windows and Linux!

Don’t have a USD 5,000–10,000 per month OpEx budget for an observability tool? No full-time development team to build and maintain the tooling needed to visually map your entire network based on routing? Or maybe your packets seem to be trapped in an alternate reality, not reaching their destination on time? In that case, traceroute is probably your best starting point!

Contrary to recent messages claiming that ‘traceroute doesn’t exist’ (I won’t link to them or to the rebuttals), it is a good troubleshooting tool, and this post is intended to counter that disinformation. I often use traceroute to demonstrate routing to folks during training. There might not be an official RFC about it, but it helps with troubleshooting and validation.

Figure 1 — Public comments made about traceroute not existing

Some people argue that traceroute is a ‘hack’ or an ‘exploit’, but I think it’s just another tool, like ping. I will acknowledge first, though, that if you don’t know what you’re looking at in a traceroute, you can be drawn to the wrong conclusion or think something is happening that is not (this often happens with non-network engineers).

When you issue a ping, you really only learn three things: that you can get to and from a destination, the end-to-end latency to the destination, and whether the end device is online. However, we know that if a ping fails, a device can still be online, and the failure could be due to a firewall blocking the ping. We also know that just because a ping goes through, it doesn’t mean our TCP/UDP traffic (or similar) will succeed if there is an MTU issue. That is why, when using ping, we set the Don’t Fragment (DF) bit and adjust the data size to determine the Maximum Transmission Unit (MTU).
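The DF-bit technique amounts to searching for the largest packet that gets through unfragmented. Here is a minimal sketch of that search, assuming IPv4 with a 20-byte header (no options) and an 8-byte ICMP echo header. The `probe` callback is hypothetical: it stands in for whatever wrapper you write around your platform's ping with the DF bit set, and it should return True when a reply comes back.

```python
from typing import Callable, Optional

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def payload_for_mtu(mtu: int) -> int:
    """ICMP data size that yields an IPv4 packet of exactly `mtu` bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe: Callable[[int], bool],
                  low: int = 576, high: int = 1500) -> Optional[int]:
    """Binary-search the largest packet size in [low, high] that passes.

    `probe(payload_size)` is a hypothetical callback that sends one ping
    with the DF bit set and reports whether a reply was received.
    Returns None if even the smallest size fails.
    """
    if not probe(payload_for_mtu(low)):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always advances
        if probe(payload_for_mtu(mid)):
            low = mid        # this size fits, try larger
        else:
            high = mid - 1   # too big, try smaller
    return low
```

For a manual check, the commonly used equivalents are `ping -M do -s 1472 <destination>` on Linux and `ping -f -l 1472 <destination>` on Windows, where 1472 is the payload for a 1500-byte MTU under the same header assumptions.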

As with ping, traceroute output needs careful interpretation. High latency doesn’t necessarily indicate packet loss, and unresponsive hops (see example below) could be due to a Multiprotocol Label Switching (MPLS) hop or a firewall, not necessarily dropped traffic. Traceroute still reveals a possible path the traffic takes, along with IP addresses, hostnames, latency between hops, and other details.

High latency can result from inter-state / inter-city geography or load balancing and does not necessarily indicate congestion at those hops — a common misconception. Additionally, control plane policing and other factors affecting Internet Control Message Protocol (ICMP) messages in traceroute can lead to missing responses or increased latency due to dropped traces and the tool’s retries.

However, assuming we are getting hostnames, the information can be valuable. Like a compass, we get a direction of travel. We can determine the last hop our upstream network has before it hands off to the next network and who the network is. We can likely determine the geographies of where the traffic is going and how many networks are travelled before the destination.

Once we get close to the destination, we can likely see the last hops, which can help validate a design. For example, if you requested a specific Point of Presence (PoP) to be used on a circuit in a remote location, you could see the router’s hostname, which may include the city. How else can you verify that without someone on the provider’s side? If you log into a looking glass (which is probably not the same router or city), you won’t see the last hop IP or hostname.

Let’s say you’re deploying a new router and a new connection for Equal-Cost Multipath (ECMP). First, you’ll check your Border Gateway Protocol (BGP) and Interior Gateway Protocol (IGP) configurations, along with the relevant show commands and information. You’ll likely also review the routing table, but won’t you also run a traceroute to verify that traffic is indeed being distributed across both paths?

Speaking of multiple links and load balancing, a common sign that one link in a bundle is having issues is when traffic is intermittently dropped or behaving inconsistently. With ping, you’ll only notice drops and possibly see different millisecond measurements. However, a better way to verify this is by running a traceroute and observing if different IP addresses appear between the same hostnames when a drop occurs. Some tools combine pinging with tracing, which can help isolate the problematic interface.
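To make the 'different IP addresses at the same hop' check concrete, here is a small sketch that groups repeated traces by hop number and reports the hops where more than one responder showed up. It assumes each trace has already been parsed into a list of responder IPs, with None for a hop that did not answer; the function names and the addresses in the usage example are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

Trace = List[Optional[str]]  # responder IP per hop, None if no reply


def responders_per_hop(traces: List[Trace]) -> Dict[int, Set[str]]:
    """Collect every responder IP seen at each hop number (1-based)."""
    seen: Dict[int, Set[str]] = defaultdict(set)
    for trace in traces:
        for hop_number, ip in enumerate(trace, start=1):
            if ip is not None:
                seen[hop_number].add(ip)
    return dict(seen)


def multipath_hops(traces: List[Trace]) -> Dict[int, List[str]]:
    """Return only the hops where more than one responder IP appeared."""
    return {
        hop: sorted(ips)
        for hop, ips in responders_per_hop(traces).items()
        if len(ips) > 1
    }
```

Fed three traces whose second hop answers as 192.0.2.1, then 192.0.2.5, then not at all, `multipath_hops` reports hop 2 with both addresses. Correlating which address was in use when a drop occurred is what points to the suspect link.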

It’s situations like this where you can verify behaviours or deployments before and after a change. I can count many times when I used a traceroute as part of my validation process after a change to confirm everything was working as expected.

Another example is when you know there are two paths from a certain location or subnet, and an end user reports something isn’t working. You can ask for a traceroute to determine whether the issue lies on the secondary or primary path, giving you an immediate insight into the network state. This is why it’s useful to either find hardware / tools or leverage your existing monitoring tools to deploy probes and run traceroutes across your network. By documenting the state and setting up alerts for any changes, you can quickly detect issues. Tools like ThousandEyes, NetBrain, and SolarWinds come to mind, though there are likely many others that can help with this.
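Documenting the state and alerting on changes can start as simply as comparing a stored baseline trace with the latest one. This is a rough sketch rather than how any of the tools named above work; it assumes traces are lists of responder IPs with None for an unanswered hop, and the name `path_changes` is made up for illustration.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

Trace = List[Optional[str]]  # responder IP per hop, None if no reply


def path_changes(baseline: Trace,
                 current: Trace) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (hop_number, baseline_ip, current_ip) for every hop that differs.

    If one trace is shorter, its missing hops are compared as None, so a
    change in path length is reported as well.
    """
    changes = []
    pairs = zip_longest(baseline, current, fillvalue=None)
    for hop_number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            changes.append((hop_number, old, new))
    return changes
```

An empty result means the path matches the baseline. A non-empty result is a prompt to look closer, not proof of a fault, since load balancing or an occasional unanswered probe will also show up as a difference.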

It’s generally safe to run traceroutes consistently on your network and even to destinations over networks you don’t control for data measurements. However, be aware that this traffic will likely be classified as best effort and could be dropped, though it’s unlikely. It’s also important to consider the CPU load that running these traceroutes can impose on other operators’ networks, so it’s best to keep the intervals moderate to avoid overloading their systems.

I’ve noticed that Microsoft Teams clients run traceroutes quite consistently in the background. Just think about how much you could learn from millions of hosts running traces all over the Internet, with that data being visualized to show common hops, latency, and IP addresses.

Furthermore, if you were deploying a new, diverse secondary Direct Internet Access (DIA) circuit to a remote co-location, for example, you’d want to run traceroutes to determine or confirm that your Internet path is different from your primary path. This could change, of course, but you’d want a baseline path to start with.
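One quick way to sanity-check diversity is to list the responder IPs that show up on both circuits. The sketch below assumes one parsed trace per circuit toward the same destination, with None for unanswered hops; `shared_hops` and its `ignore` parameter are illustrative names, not part of any tool.

```python
from typing import Iterable, List, Optional

Trace = List[Optional[str]]  # responder IP per hop, None if no reply


def shared_hops(primary: Trace, secondary: Trace,
                ignore: Iterable[str] = ()) -> List[str]:
    """Return IPs seen in both traces, in primary-path order.

    `ignore` holds addresses that are expected on both paths, such as
    the destination itself.
    """
    secondary_ips = {ip for ip in secondary if ip is not None}
    skip = set(ignore)
    return [
        ip for ip in primary
        if ip is not None and ip in secondary_ips and ip not in skip
    ]
```

An empty list only shows that no responding IP was common to both traces. Unanswered hops and shared physical infrastructure are invisible here, so treat it as a baseline check rather than proof of diversity.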

Lastly, imagine you have a customer facing an issue with an obscure Autonomous System (AS) and subnet that you don’t directly peer with, where there are five possible paths to take, some of which are longer, have higher latency, or have congestion in intermediate ASes. The level 1 techs can’t spend time sifting through routing data to determine all the paths and make a decision, either because they don’t know how or lack the necessary permissions. Instead, they ask the customer to run a traceroute and then pass the ticket to you. From there, you can see in real time where traffic is exiting your network. Once you identify which path the traffic is preferring, you can assess your edge configuration and potentially adjust the local preference for that route via another peer, providing a better path for the customer and resolving the issue. In situations like this, network tracing can be invaluable for diagnosis. It’s not always the final answer, but it’s helpful.

Figure 2 — Example traceroute with information from BGP Toolkit.

Looking at Figure 2’s random traceroute output, we can see that AT&T is likely the last-mile carrier based on hop 3, and we can see the first hop when the traffic leaves the local network (the last RFC 1918 IP with low latency). Next, we see another AT&T hop at 71.149.23.68, followed by several unknown hops (marked with ***). Eventually, we see 192.205.32.138 as hop 9, which is the last AT&T IP address in the trace. While there are some unknown hops, we can reasonably conclude that the traffic is still on the AT&T network before hop 9.
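Spotting the last RFC 1918 hop before traffic leaves the local network is easy to automate. Below is a minimal sketch using Python's standard `ipaddress` module. The hop list in the usage note is illustrative and is not the output shown in Figure 2; the function names are my own.

```python
import ipaddress
from typing import List, Optional, Tuple

# The three private IPv4 blocks defined in RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(ip: str) -> bool:
    """True if `ip` falls inside one of the RFC 1918 blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in RFC1918_NETWORKS)


def last_private_hop(hops: List[Optional[str]]) -> Optional[Tuple[int, str]]:
    """Return (hop_number, ip) of the last RFC 1918 responder seen
    before the first public responder, or None if there is none.

    Unanswered hops (None) are skipped. The scan stops at the first
    public address because carriers may reuse private space deeper
    in the path.
    """
    last = None
    for hop_number, ip in enumerate(hops, start=1):
        if ip is None:
            continue
        if is_rfc1918(ip):
            last = (hop_number, ip)
        else:
            break
    return last
```

For a trace that answers as 192.168.1.1, 10.20.0.1, no reply, then 203.0.113.1, the function returns hop 2 (10.20.0.1) as the likely edge of the local network.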

Once we’re on the Hurricane Electric (HE) network, we notice the DNS names include ‘port-channel’ and ‘core’, suggesting a high likelihood of multiple links on the path and that it’s a core router. From there, we can see that the traffic stays within the HE network until it reaches the destination. Additionally, hop 3 lists hstntx (Houston, TX), while later HE hostnames list fmt2 (Fremont, CA), which helps us pinpoint the locations of these routers along the path.

Note the latency: If we had just pinged the destination, we would have only seen the ~48ms latency. However, here we can observe the latency increase between hops 9 and 10, because the traffic is going from Texas to California. This suggests that there are likely more Layer 2 (L2) type devices between those routers, which could be contributing to the increased latency. Validating this would require a different tracing tool to gather more insight. This is just a quick example to emphasize the points above about the effectiveness of network tracing.
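The 'where does the latency jump' question can be answered by differencing the round-trip times of consecutive answered hops. This is a sketch under the assumption that you have one representative RTT per hop in milliseconds, with None for hops that did not reply; the function name is illustrative, and the numbers in the usage note are made up rather than read from Figure 2.

```python
from typing import List, Optional, Tuple


def largest_latency_jump(
    rtts_ms: List[Optional[float]],
) -> Optional[Tuple[int, int, float]]:
    """Return (from_hop, to_hop, delta_ms) for the biggest RTT increase
    between consecutive answered hops, or None if fewer than two replied.

    Hop numbers are 1-based. Unanswered hops are skipped, so a jump may
    span a gap in the trace.
    """
    answered = [
        (hop, rtt)
        for hop, rtt in enumerate(rtts_ms, start=1)
        if rtt is not None
    ]
    best = None
    for (prev_hop, prev_rtt), (hop, rtt) in zip(answered, answered[1:]):
        delta = rtt - prev_rtt
        if best is None or delta > best[2]:
            best = (prev_hop, hop, delta)
    return best
```

With RTTs of 1, 3, no reply, 9, 10, 46 and 48 ms, the largest jump is 36 ms between hops 5 and 6. As noted earlier, per-hop replies can be delayed by control plane policing, so a jump like this is a hint about distance or path, not a measurement of congestion.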

Yes, you aren’t getting the full picture if you don’t control the network. Yes, there are likely hidden hops even on Internet paths (you can explore tracing to include label-switched hops), and yes, you will need to look at the routing table on your routers or via looking glasses to get a more complete view of the network path.

If no DNS records are displayed, you’ll have to check a Regional Internet Registry (RIR) website or Google for more information, which could be time-consuming.

Figure 3 — Traceroute not showing any hostnames from DNS Checker.

You probably shouldn’t rely on traceroute to determine the root cause of a performance issue unless you suspect the problem is increased latency due to distance or frequent routing changes, though even then, it might not provide a conclusive answer. Traceroute is also not ideal for Service Level Agreement (SLA) validation (unless the end-to-end path is part of the SLA). A tool based on the Two-Way Active Measurement Protocol (TWAMP) is better suited to measuring something like that.

Definitely use traceroute to help with various types of validations like determining a network path, routing behaviour, first and last hops, and for reconnaissance. It’s true it can be more effective if you use it on your own network versus external networks.

I recall an issue where, after switching to a new service provider, I noticed my ping was high to a certain destination. I ran a traceroute and saw that traffic was going all the way to Texas before reaching the last-mile provider closer to the destination’s origin. After some additional research based on the trace, I discovered that the new provider on the source side only peered with the last-mile provider in Texas. Without this information, I would have had no idea what was happening or where to begin developing a hypothesis.

It’s important to educate yourself, your peers, and your direct reports on when to use traceroute, when not to use it, and how to interpret the results. Once you have a solid understanding of basic routing, network architecture, and how traceroute works, you can confidently use it. Traceroute can be an effective tool for quickly illustrating a network path or diagnosing an issue without immediately reviewing multiple devices. It can also help identify which device may have a misconfiguration or where a problem originates (the CCNP TSHOOT curriculum covers this).

This topic is becoming one of those divisive discussions where many will have strong opinions, and influencers may use it to spark debate. However, by now you should at least be aware that traceroute exists and that it can be safely and effectively used in the right situations.

Thank you for reading, and good luck!

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.
