In a previous post, I observed that the classic way to compute round-trip time (RTT) statistics and retransmission timers does not work well. When acknowledgements are too frequent, the correlation between successive RTT measurements causes the smoothed RTT estimate to closely track the most recent values, and the RTT variation to be greatly underestimated.
This imprecision impacts both the effectiveness of loss recovery and the implementation of delay-based congestion control algorithms like BBR. I think this can be fixed by updating the variables just ‘once per RTT’. In this post, I first describe the once per RTT update algorithm, then report on promising measurements using the simulations in the Picoquic test suite.
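The over-weighting of recent samples is easy to quantify: with the standard exponentially weighted moving average (EWMA) gain of 1/8 per sample, applying k updates within a single RTT leaves the previous estimate with a weight of only (7/8)^k. A quick back-of-the-envelope check in Python (the count of 10 ACKs per RTT is illustrative, not a figure from the post):

```python
# Weight remaining on the old smoothed_rtt estimate after k EWMA
# updates with the classic RFC 9002 gain of 1/8 per sample.
def old_estimate_weight(k, gain=1/8):
    return (1 - gain) ** k

# One update per RTT keeps 87.5% of the old estimate...
print(round(old_estimate_weight(1), 4))   # 0.875
# ...but 10 ACKs in one RTT leave only ~26%, so the estimate
# mostly reflects the most recent round trip.
print(round(old_estimate_weight(10), 4))  # 0.2631
```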
The once per RTT update algorithm
The goal of the algorithm is to update the smoothed_rtt, rttvar, and Probe Time Out (PTO) variables just once per RTT. The ‘once per RTT update’ algorithm operates as follows:
- On connection start, initialize smoothed_rtt and rttvar as specified in RFC 9002:

```
smoothed_rtt = kInitialRtt
rttvar = kInitialRtt / 2
```
- Remember the packet number of the first packet sent in an RTT measurement epoch, and initialize the epoch-specific variables:

```
rtt_sample_number_in_epoch = 0
rtt_epoch_pn = last_packet_number_sent + 1
rtt_epoch_nb_samples = 0
rtt_epoch_sum_samples = 0
rtt_epoch_sample_max = 0
rtt_epoch_sample_min = UINT64_MAX
```
- Upon receiving an acknowledgement, process the RTT sample:
- If this is the first sample after initialization, set:

```
smoothed_rtt = latest_rtt
rttvar = latest_rtt / 2
min_rtt = latest_rtt
```
- Otherwise, update the per-epoch statistics:

```
adjusted_rtt = latest_rtt
if (latest_rtt >= min_rtt + ack_delay):
    adjusted_rtt = latest_rtt - ack_delay
rtt_epoch_nb_samples += 1
rtt_epoch_sum_samples += adjusted_rtt
rtt_epoch_sample_max = max(adjusted_rtt, rtt_epoch_sample_max)
rtt_epoch_sample_min = min(adjusted_rtt, rtt_epoch_sample_min)
min_rtt = min(min_rtt, adjusted_rtt)
```
- If the acknowledged packet number is greater than the first packet number sent in the epoch, update the RTT variables. The rttvar_sample is set to the larger of the absolute differences between the smoothed_rtt and the maximum or minimum RTT seen during the epoch:

```
rttvar_sample = max(abs(rtt_epoch_sample_max - smoothed_rtt),
                    abs(rtt_epoch_sample_min - smoothed_rtt))
rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
rtt_epoch_estimate = rtt_epoch_sum_samples / rtt_epoch_nb_samples
smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_epoch_estimate
PTO = smoothed_rtt + 3*rttvar + max_ack_delay
```

Then, reinitialize the per-epoch variables.
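The steps above can be put together in a minimal Python sketch. This is my transcription of the pseudocode, not the Picoquic C code: the first-sample and epoch-advance checks are simplified, and the initial RTT and max_ack_delay constants are illustrative.

```python
class OncePerRttEstimator:
    """Sketch of the 'once per RTT' RTT estimator described above."""
    K_INITIAL_RTT = 333   # ms, illustrative initial value
    MAX_ACK_DELAY = 25    # ms, illustrative peer max_ack_delay

    def __init__(self):
        self.smoothed_rtt = self.K_INITIAL_RTT
        self.rttvar = self.K_INITIAL_RTT / 2
        self.min_rtt = None          # None until the first sample
        self.pto = None
        self.last_packet_number_sent = -1
        self._start_epoch()

    def _start_epoch(self):
        # Reinitialize the per-epoch variables.
        self.rtt_epoch_pn = self.last_packet_number_sent + 1
        self.rtt_epoch_nb_samples = 0
        self.rtt_epoch_sum_samples = 0
        self.rtt_epoch_sample_max = 0
        self.rtt_epoch_sample_min = float('inf')

    def on_packet_sent(self, packet_number):
        self.last_packet_number_sent = packet_number

    def on_ack(self, acked_pn, latest_rtt, ack_delay=0):
        if self.min_rtt is None:
            # First sample after initialization.
            self.smoothed_rtt = latest_rtt
            self.rttvar = latest_rtt / 2
            self.min_rtt = latest_rtt
        else:
            # Accumulate the per-epoch statistics.
            adjusted_rtt = latest_rtt
            if latest_rtt >= self.min_rtt + ack_delay:
                adjusted_rtt = latest_rtt - ack_delay
            self.rtt_epoch_nb_samples += 1
            self.rtt_epoch_sum_samples += adjusted_rtt
            self.rtt_epoch_sample_max = max(adjusted_rtt, self.rtt_epoch_sample_max)
            self.rtt_epoch_sample_min = min(adjusted_rtt, self.rtt_epoch_sample_min)
            self.min_rtt = min(self.min_rtt, adjusted_rtt)
        if acked_pn >= self.rtt_epoch_pn and self.rtt_epoch_nb_samples > 0:
            # A full RTT has elapsed: fold the epoch into the estimators.
            rttvar_sample = max(abs(self.rtt_epoch_sample_max - self.smoothed_rtt),
                                abs(self.rtt_epoch_sample_min - self.smoothed_rtt))
            self.rttvar = 0.75 * self.rttvar + 0.25 * rttvar_sample
            epoch_estimate = self.rtt_epoch_sum_samples / self.rtt_epoch_nb_samples
            self.smoothed_rtt = 0.875 * self.smoothed_rtt + 0.125 * epoch_estimate
            self.pto = self.smoothed_rtt + 3 * self.rttvar + self.MAX_ACK_DELAY
            self._start_epoch()
```

Note that however many acknowledgements arrive within an epoch, smoothed_rtt and rttvar move only once, when the epoch closes.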
Evaluating the once per RTT algorithm
After coding the once per RTT update algorithm in Picoquic, I ran the Picoquic test suite to evaluate the results. The test suite includes a couple of hundred simulations of QUIC connections under various conditions. Each of these tests verifies results such as the expected simulated time for completing the simulated transfers, or sometimes a maximum number of retransmission errors or a maximum number of spurious retransmissions. All these tests passed, which means that the new algorithm stayed well within the tolerances set for the existing implementation. Looking at the details, I saw that:
- In many cases, the old and new algorithms produced exactly the same results. This is expected because many tests do not involve packet losses, or involve packet losses that are detected via packet numbers instead of timers.
- When they differ, the new algorithm results were generally better, but not always. In particular, the results tended to be better when simulating RTT-dependent congestion control algorithms like BBR, but delays could be very slightly higher for some tests using Cubic.
The big differences happened with tests simulating connection events such as link breaks in multi-path scenarios. In these cases, having a better estimate of the RTT helps with reaction time. In single path scenarios, link breaks result in connection failures and the measurements are not very interesting, but the test suite includes ‘MTU Drop’ tests that simulate a change in the path MTU size, and measure how long it takes to react to the event and continue the session with the new MTU. This reaction is also very dependent on RTT estimates, and I analysed it in detail.
Comparison of RTT algorithms in MTU-Drop test
The ‘MTU Drop’ test simulates the download of a 1MB file over a 1Mbps link, with the transmission delay set at 100ms in each direction, for a nominal RTT of 200ms. Initially, the path MTU is set to 1,500 bytes. The code simulates the establishment of the connection, then waits for 1 second, and verifies that path MTU discovery was performed and that the connection is using a packet size of 1,440 bytes. At that point, the path MTU is reset to 1,394 bytes. The test verifies that the code notices the drop in size, and renegotiates an appropriate Path MTU (PMTU). The test verifies that the download completes in the expected time.
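As a sanity check on the scenario's scale (my arithmetic, not a figure from the post, and assuming a decimal megabyte), the nominal transfer time is about eight seconds, so a 200ms RTT and the associated recovery timers are a meaningful fraction of the total:

```python
FILE_SIZE_BYTES = 1_000_000   # 1 MB file (decimal MB, assumption)
LINK_RATE_BPS = 1_000_000     # 1 Mbps link
RTT_SECONDS = 0.2             # 100 ms each way

# Serialization time alone, ignoring handshake and recovery stalls.
nominal_seconds = FILE_SIZE_BYTES * 8 / LINK_RATE_BPS
print(nominal_seconds)  # 8.0
```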
The goal of the test is to verify the PMTU adaptation, but that adaptation depends on the packet loss recovery algorithm, which itself depends on the computation of the RTT and the retransmission timer. We evaluate two ways to compute these timers:
- Classic, as specified in RFC 9002.
- Once per RTT, as explained above.
When using the once per RTT computation, we evaluate three variants:
- Once 4V, in which the retransmission timer is computed as specified in RFC 9002, adding four times ‘RTT_VAR’ and the ACK delay to the smoothed RTT value.
- Once MaxRTT, in which the retransmission timer is set to the MaxRTT plus RTT_VAR plus the ACK delay.
- Once 3V, which is the same as Once 4V, but only adding three times ‘RTT_VAR’ instead of four.
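The three timer formulas can be written out directly. This is a sketch with variable names of my choosing; in particular I assume `max_rtt` tracks the largest adjusted RTT sample observed on the path:

```python
def pto_once_4v(smoothed_rtt, rttvar, max_ack_delay):
    # 'Once 4V': the RFC 9002 formula, four deviations.
    return smoothed_rtt + 4 * rttvar + max_ack_delay

def pto_once_max_rtt(max_rtt, rttvar, max_ack_delay):
    # 'Once MaxRTT': anchored on the largest RTT observed.
    return max_rtt + rttvar + max_ack_delay

def pto_once_3v(smoothed_rtt, rttvar, max_ack_delay):
    # 'Once 3V': same as 4V, but only three deviations.
    return smoothed_rtt + 3 * rttvar + max_ack_delay

# Illustrative values: smoothed_rtt=200ms, rttvar=40ms,
# max_rtt=320ms, max_ack_delay=25ms.
print(pto_once_4v(200, 40, 25))       # 385
print(pto_once_max_rtt(320, 40, 25))  # 385
print(pto_once_3v(200, 40, 25))       # 345
```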
The results of the four test runs are summarized in the following table:
Table 1 shows that ‘Once 3V’ performs better than the classic implementation — the download completes faster, and there are significantly fewer packet losses.
It is hard to understand precisely why one specific tuning works better than others. When looking at ‘Once 3V’ versus ‘Classic’, it seems that most of the difference happened during the initial ‘start-up’ phase. The log of the ‘Once 3V’ trial shows it exiting the start-up 631ms after the beginning of the connection, with the ‘smoothed RTT’ set at 224ms. The ‘Classic’ log shows it exiting the start-up 991ms after the start, with the smoothed RTT set at 353ms. By the time the MTU drop is simulated, the ‘Classic’ connection has the smoothed RTT set at 407ms, while the ‘Once 3V’ connection has the RTT at 217ms. The larger RTT means that the ‘Classic’ connection takes almost twice as long to detect the packet losses caused by the MTU drop, and recovers almost twice as slowly.
That same reasoning explains why setting the retransmission timer from the Max RTT does not work so well. In our samples, the maximum observed RTT happens at the end of the start-up, before stabilizing to lower values. The retransmission timer set from Max RTT is too large, and losses are not detected quickly enough.
How many deviations in the timer?
We thus arrive at the notion that ‘large retransmission timers slow down recovery’. That’s the observation that pushes us to deviate from the original formula:
```
retransmission_timer = smoothed_RTT + 4 * rtt_var + ack_delay
```
That formula seems pretty arbitrary. It is related to the classic observation that, for a Gaussian distribution, observations more than four standard deviations away from the average happen less than 0.01% of the time, which seems a good way to minimize ‘spurious packet loss detection’. But it is far from clear that the distribution of RTT samples is Gaussian, and the computation of rtt_var is only a vague approximation of computing a standard deviation. In any case, the use of N times rtt_var is a tradeoff: fewer spurious detections versus longer recovery times after errors.
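Those tail probabilities are easy to check with the normal distribution's complementary error function, where the one-sided tail is P(X > kσ) = erfc(k/√2)/2:

```python
import math

def gaussian_tail(k):
    # One-sided probability that a Gaussian sample exceeds
    # the mean by more than k standard deviations.
    return math.erfc(k / math.sqrt(2)) / 2

# Four deviations: roughly 0.003%, well under 0.01%.
print(f"{gaussian_tail(4):.6%}")
# Three deviations: about 0.13% -- the extra spurious-timeout
# risk accepted when moving from 4*rtt_var to 3*rtt_var,
# assuming (generously) that RTT samples were Gaussian.
print(f"{gaussian_tail(3):.4%}")
```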
In QUIC, most loss detections are based on packet number comparisons, not timers. This alone renders spurious loss detections fairly rare. In addition, congestion control algorithms like BBR recover much better from spurious loss detection than older algorithms like Reno. It is probably time to revisit the tradeoff. Our tests show that using a coefficient of 3 rather than 4 speeds up error detection without visible side effects.
The ‘once per RTT’ algorithm design appears sound, but the evaluations are based on simulations. The next step is to work with other QUIC developers and see whether the algorithm can be validated in large-scale deployments.
Christian Huitema (Mastodon) is a researcher and software architect who has written books and developed standards. He’s worked on IPv6, TCP-IP, multimedia transmission over the Internet, security, and privacy. He’s currently working on DNS metrics for ICANN, standardization of QUIC, Picoquic, and privacy and security in the IETF.
Adapted from the original at Private Octopus.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.