
For the past decade, the tech world has been obsessed with the raw power of graphics processing units (GPUs). We celebrate every new chip announcement, every leap in teraflops, as a milestone in the Artificial Intelligence (AI) revolution. We’ve come to believe that unlocking AI is simply a matter of cramming ever more powerful processors into a data centre. But this is a dangerously incomplete picture.
Imagine building the world’s most advanced race car. You spare no expense on the engine, a masterpiece of engineering that can generate unprecedented horsepower. But then, you connect it to the wheels with a transmission made of brittle plastic. What happens? The moment you hit the accelerator, the engine roars, but the car barely moves. The power is there, but it can’t be delivered.
Welcome to the hidden bottleneck of the modern ‘AI factory’: the sprawling, multi-billion-dollar clusters of servers that act as the engines of AI. The dirty little secret of the AI industry is that for many of these supercomputers, the biggest constraint isn’t the GPU. It’s the network.
The only metric that matters: Job Completion Time
In the world of AI, we love to talk about speeds and feeds — the clock speed of a GPU, the bandwidth of a memory bus. But these are just vanity metrics. They are the equivalent of admiring the engine on a test bench. For the Chief Technology Officers (CTOs) and researchers running these AI factories, there is only one metric that truly matters: Job Completion Time (JCT).
JCT is exactly what it sounds like — how long does it take to get the answer? How long does it take to train a large language model to the desired accuracy? How long does it take to run a complex climate simulation? This is the metric that dictates the return on investment (ROI) of a billion-dollar infrastructure. A 20% reduction in JCT means you can run 20% more experiments, get your product to market 20% faster, or ask 20% more questions of your data.
The primary killer of JCT is a phenomenon known as tail latency. In a distributed training job, thousands of GPUs must communicate and synchronize with each other thousands of times per second. The entire system can only move as fast as the slowest connection. If just one message, composed of thousands of data packets, is delayed, the entire multi-million-dollar array of GPUs sits idle, waiting. This is the ‘straggler’ problem, and in the world of high-performance computing, it’s death by a thousand paper cuts. The performance of your entire supercomputer is defined not by its average speed, but by its worst-case, slowest moment.
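To make that concrete, here is a small illustrative Python simulation; every number in it is invented and it models no real cluster. Each synchronized step finishes only when the slowest worker does:

```python
import random

# Illustrative sketch only (all numbers invented): treat each synchronized
# training step as finishing when the slowest of N workers finishes, where
# every worker takes a fixed compute time plus a small random 'tail'.

def step_time(num_workers, base_ms=10.0, mean_tail_ms=2.0):
    """One synchronized step is only as fast as its slowest worker."""
    return max(base_ms + random.expovariate(1.0 / mean_tail_ms)
               for _ in range(num_workers))

def average_step_time(num_workers, steps=2000):
    return sum(step_time(num_workers) for _ in range(steps)) / steps

for n in (8, 64, 512, 4096):
    print(f"{n:5d} workers: ~{average_step_time(n):5.1f} ms per step")
```

Even though each worker is, on average, just as fast at any scale, the expected worst case among more workers keeps climbing, so the synchronized step keeps getting slower. That is tail latency eating into JCT.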
The unique, punishing nature of AI traffic
So what makes AI workloads so uniquely punishing for networks? It’s because they don’t behave like the familiar, chaotic traffic of the Internet. Instead, AI training is a highly choreographed, synchronized dance. The primary communication patterns are ‘collective operations’ like All-Reduce.
Imagine a team of a thousand analysts in a room, each with a piece of a puzzle. On a signal, every single analyst must share their piece with every other analyst. Then, they all have to agree on the combined result before anyone can proceed. This is what happens during an All-Reduce operation: every GPU in the cluster contributes its latest calculations (gradients), and every GPU must receive the same combined result before the next training step can begin.
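As a rough sketch of what the operation produces (real clusters use collective libraries such as NCCL over the network fabric; the toy function below just shows the end state), every GPU contributes its local gradients and every GPU ends up holding the identical sum:

```python
# Toy stand-in for All-Reduce: in a real cluster this is done by a collective
# library (for example NCCL) over the network, not by a single function call.

def all_reduce_sum(gradients_per_gpu):
    """Every GPU contributes its gradient; every GPU receives the same sum."""
    summed = [sum(values) for values in zip(*gradients_per_gpu)]
    return [list(summed) for _ in gradients_per_gpu]  # one identical copy per GPU

# Four GPUs, each holding a 3-element gradient computed from its local batch.
local_gradients = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]
print(all_reduce_sum(local_gradients)[0])  # [22.0, 26.0, 30.0] on every GPU
```

The punchline for the network is that nothing downstream can start until the last contribution has arrived, which is exactly why the straggler problem above is so damaging.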
This many-to-many communication pattern creates a network nightmare known as incast congestion. It’s like every single person in a stadium trying to exit through the same gate at the same time. The switch ports leading to a specific server get overwhelmed, their tiny buffers overflow in milliseconds, and they start dropping packets.
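A deliberately crude sketch of that stampede, with invented buffer and burst sizes, and ignoring whatever the port can drain while the burst arrives:

```python
# Crude incast sketch: a synchronized burst from many senders converges on one
# switch port whose buffer holds far fewer packets than the burst contains.
# All sizes are invented, and draining during the burst is ignored.

def incast_drops(senders, packets_each, buffer_pkts):
    burst = senders * packets_each
    accepted = min(burst, buffer_pkts)  # whatever fits in the buffer
    return burst - accepted             # the rest is dropped on the floor

# 256 GPUs each burst 4 packets toward the same server at the same instant.
print(incast_drops(senders=256, packets_each=4, buffer_pkts=100))  # 924 dropped
```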
The flaw in the machine: ECMP, hashing, and the illusion of balance
Traditional Ethernet networks have a clever trick up their sleeve to prevent traffic jams: Equal-Cost Multi-Path Routing (ECMP). The idea is simple — if you have multiple roads going to the same destination, don’t send all the cars down one road. Spread them out. ECMP does this for data packets.
It works like a mail sorter at a post office. The sorter looks at the address on an envelope (the packet’s header information) and uses a consistent rule to decide which mailbag (which network path) it goes into. This ‘rule’ is a mathematical function called a hash. The hash typically looks at five key pieces of information in the packet header — the source IP, destination IP, source port, destination port, and protocol — collectively known as the ‘5-tuple.’
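A minimal sketch of that path selection, assuming a simple hash-modulo scheme (real switches use their own vendor-specific hash functions; SHA-256 here is just a convenient stand-in):

```python
import hashlib

# Hash the 5-tuple, then take the result modulo the number of equal-cost
# links. Real switches use vendor-specific hashes; SHA-256 is a stand-in.

def pick_link(src_ip, dst_ip, src_port, dst_port, protocol, num_links):
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}"
    digest = hashlib.sha256(five_tuple.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_links

# Every packet of a given flow carries the same 5-tuple, so every packet of
# that flow is sent down the same link.
print(pick_link("10.0.0.1", "10.0.1.1", 49152, 4791, "UDP", num_links=8))
```

The important property is that the choice is deterministic: the same 5-tuple always lands on the same link, which keeps the packets of a flow in order.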
For the diverse, chaotic traffic of the Internet, this works beautifully. Millions of different users are talking to millions of different servers, creating a huge variety of 5-tuples. The hash function scatters these conversations evenly across all available network links, achieving excellent load balancing.
But AI traffic breaks this model completely.
AI training workloads are characterized by ‘low entropy.’ Instead of millions of short, random conversations, you have a small number of extremely large, long-lived conversations. A single GPU might need to send a massive, multi-gigabyte ‘elephant flow’ of data to another specific GPU. For the entire duration of that transfer, the 5-tuple values do not change.
This means the hash function, which is designed to be consistent, produces the exact same result for every single packet in that elephant flow. The result? The entire flow gets locked onto a single network path. Now, imagine you have several of these elephant flows happening at once between different pairs of GPUs. Due to the low variability, it’s highly likely that several of these massive flows will be hashed to the same physical link.
This is called path polarization or a hash collision. You end up with a disastrously unbalanced network; a few links are completely saturated, overwhelmed by multiple elephant flows, while other parallel links sit nearly idle. The network’s theoretical capacity is high, but its real-world throughput is crippled by these self-inflicted hotspots.
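Continuing the toy ECMP sketch above (the traffic mix and addresses are invented purely for illustration), compare how eight long-lived flows and ten thousand short ones spread across eight links:

```python
import hashlib
import random
from collections import Counter

# Self-contained toy: re-implements the hash-modulo path choice from the
# earlier sketch, then compares a handful of long-lived 'elephant' flows
# against thousands of short flows on 8 links. All traffic is invented.

def pick_link(five_tuple, num_links=8):
    digest = hashlib.sha256("|".join(map(str, five_tuple)).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_links

def random_flows(count):
    return [(f"10.0.0.{random.randint(1, 254)}",
             f"10.0.1.{random.randint(1, 254)}",
             random.randint(1024, 65535), 443, "TCP")
            for _ in range(count)]

random.seed(7)
for label, flows in (("8 elephant flows  ", random_flows(8)),
                     ("10,000 mouse flows", random_flows(10_000))):
    per_link = Counter(pick_link(flow) for flow in flows)
    print(label, [per_link[i] for i in range(8)])
```

On almost any run, some links end up carrying two or three of the eight elephants while others carry none, yet the ten thousand small flows spread almost perfectly evenly. The hash isn’t broken; the traffic simply doesn’t give it enough flows to balance, and each collision parks multiple multi-gigabyte transfers on a single link.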
The catastrophic cost of a single dropped packet
In a normal enterprise network, a dropped packet is no big deal. The TCP protocol on your computer notices it’s missing, asks for it to be re-sent, and your webpage loads a fraction of a second slower. You never even notice.
In a high-performance AI fabric, a dropped packet is a catastrophe.
The protocols used for AI networking are built for speed and assume a perfect, lossless network. When a packet is dropped — which is exactly what happens when those ECMP-induced hotspots overwhelm a switch — the entire process grinds to a halt. The recovery process is slow and computationally expensive, introducing massive jitter and latency. The numbers are staggering: Theoretical analysis of training a large model like GPT-3 shows that a packet loss rate of just 0.1% — one packet in a thousand — can slash your effective GPU use by over 13%. If that loss rate climbs to 1%, your GPUs are spending less than 5% of their time actually computing.
You’ve built a billion-dollar AI factory, and it’s spending 95% of its time waiting for lost mail.
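The exact percentages depend on the model, the collective schedule, and how the transport recovers, and the toy model below (all parameters invented) makes no attempt to reproduce the figures above. It only shows the mechanism: when every lost packet stalls a synchronized step while thousands of GPUs wait, a tiny loss rate turns into an enormous amount of idle silicon.

```python
# Back-of-the-envelope model with invented parameters; it does not attempt to
# reproduce the GPT-3 figures above, only the shape of the problem: every lost
# packet stalls a synchronized step while the whole cluster waits.

def effective_utilization(loss_rate,
                          compute_ms=50.0,         # useful GPU work per step
                          comm_ms=10.0,            # loss-free communication per step
                          packets_per_step=100_000,
                          recovery_ms_per_loss=0.5):
    stall_ms = loss_rate * packets_per_step * recovery_ms_per_loss
    return compute_ms / (compute_ms + comm_ms + stall_ms)

for loss in (0.0, 0.001, 0.01):
    busy = effective_utilization(loss)
    print(f"loss rate {loss:.1%}: GPUs computing ~{busy:.0%} of the time")
```

The invented numbers are harsher than the cited ones, but the shape is the point: the cost of a single drop is multiplied by every GPU that has to wait for it.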
This is why the focus of AI networking has shifted. The game is no longer just about building a fast network; it’s about building a predictable, lossless one. It’s about engineering a fabric that can handle the synchronized, brutal communication patterns of AI workloads without flinching.
The network is not just the plumbing of the AI factory. It is its central nervous system. A slow, congested, or unpredictable network leads to an empty-headed supercomputer, no matter how powerful its silicon heart may be.
Arun is a veteran Software Engineer specializing in high-performance computing, network automation, and scalable infrastructure. More recently, he has been involved in building secure enterprise agentic AI platforms at scale. He is also an active contributor to the open-source community.
This post is adapted from the original at Scribbles Into The Void, as part of the ‘Fabric Wars’ series.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.