A couple of months ago, in July 2022, I wrote about our work in measuring the level of use of QUIC in the Internet. Getting this measurement ‘right’ has been an interesting exercise, and it’s been a learning experience that I’d like to relate. We’ll start from the end of the previous article and carry on from there.
There’s too little QUIC!
We’ve used APNIC Labs’ measurement platform where the measurement is embedded in a script within an online advertisement. The advertisement script directs the user to perform a number of URL fetches, and the servers that serve the referenced objects are instrumented to allow client capabilities and behaviours to be inferred from the server’s actions.
In this case, the client is directed to load a basic URL object (a minimal 1×1 pixel ‘blot’) where the domain name part of the URL is unique to each individual measurement. To set up a QUIC measurement we’ve taken the following steps:
- Used the NGINX server v1.23.1 with QUIC support included.
- Used a URL domain name with a defined HTTPS resource record (HTTPS RR) type whose value is alpn="h3".
- Used an Alternative Service directive on the content, namely Alt-Svc: h3=":443", which is intended to direct the client to use HTTP/3 for subsequent retrievals.
Normally this last measure, the Alternative Service directive, would be largely ineffectual. Each client receives the ad as a single event and the script directs each ad to load once, so a client should not be performing a second load of the URL. In this case, we’ve used a variant of the script that directs the client to wait for two seconds and then repeat the load of this URL. It is assumed that this delayed repeat would be sufficient for the client to act on the Alternative Service directive.
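To make the directive concrete, here is a minimal Python sketch of how a client might read an Alt-Svc value such as the one above. The function name `parse_alt_svc` and the returned dictionary layout are illustrative assumptions, not part of any browser or of our measurement code, and the parser is deliberately simplified (it assumes every entry carries an explicit port and does not handle percent-encoding). The 24-hour default lifetime comes from RFC 7838.

```python
from typing import Dict, List


def parse_alt_svc(value: str) -> List[Dict]:
    """Parse an Alt-Svc header value into a list of alternatives (simplified)."""
    if value.strip() == "clear":
        # "clear" tells the client to forget previously advertised alternatives.
        return []
    alternatives = []
    for entry in value.split(","):
        parts = [p.strip() for p in entry.split(";")]
        # First part looks like: h3=":443" (protocol id = "host:port")
        protocol, _, authority = parts[0].partition("=")
        host, _, port = authority.strip('"').rpartition(":")
        params = {}
        for param in parts[1:]:
            key, _, val = param.partition("=")
            params[key.strip()] = val.strip().strip('"')
        alternatives.append({
            "protocol": protocol.strip(),
            "host": host,              # empty string means "same host as the origin"
            "port": int(port),         # raises ValueError if no port was given
            "max_age": int(params.get("ma", 86400)),  # RFC 7838 default: 24 hours
        })
    return alternatives


print(parse_alt_svc('h3=":443"'))
```

A client that understands the `h3` protocol identifier can then attempt HTTP/3 to the same host on UDP port 443 for later fetches, which is the behaviour the delayed repeat is trying to provoke.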
The QUIC measurement commenced at the start of June 2022. In this measurement, we measure both the number of users who query for an HTTPS record and the number of users who use HTTP/3 (QUIC) to retrieve the URL.
Figure 1 compares the HTTP/3 retrieval rate using these two triggering processes. If the client does not use HTTP/3 in the first retrieval but changes to use HTTP/3 in the second retrieval we assume it used the Alt-Svc. If the client used HTTP/3 in the initial retrieval, then we assume it used the DNS HTTPS mechanism.
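The inference rule described here can be written down as a short Python sketch. The function name `classify_trigger` and the label strings are assumptions made for illustration; the input is taken to be the ordered list of protocols ("h2" or "h3") that the server logged for one measurement's fetches.

```python
from typing import Sequence


def classify_trigger(protocols: Sequence[str]) -> str:
    """Infer which mechanism led a client to HTTP/3, from its ordered fetch log."""
    if not protocols:
        return "no-data"
    if protocols[0] == "h3":
        # HTTP/3 on first contact: assume the client used the DNS HTTPS record.
        return "dns-https"
    if "h3" in protocols[1:]:
        # Switched to HTTP/3 on a later fetch: assume it acted on Alt-Svc.
        return "alt-svc"
    return "no-quic"


print(classify_trigger(["h2", "h3"]))  # alt-svc
print(classify_trigger(["h3", "h3"]))  # dns-https
```

This is only an inference from server-side observations: a client that queried the HTTPS record but failed to connect over QUIC on the first attempt would be counted under a different label.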
The issue is that the second fetch rate is just too low. In October 2020, Google announced that the Chrome browser was adding support for QUIC, using the Alt-Svc directive to trigger QUIC. Cloudflare’s Radar Report puts the use of QUIC at 30%, although it’s unclear whether that figure refers to traffic volume or session counts. In any case, a 30% use of QUIC is far more than the 3% to 4% we were seeing.
What’s going wrong?
Increase the number of repeats
Our first thought was that a single repeat, scheduled two seconds after the first fetch, was just not enough. The NGINX server will prefer to keep a session with the client open if it can, in order to amortize the cost of the TLS session establishment, and HTTP/2 can support this session reuse, so the repeat fetch may simply have been reusing the existing HTTP/2 session.
We increased the number of repeat fetches from one to seven, making a total of eight fetches in all. The scheduling interval between fetches was kept at two seconds. This change improved the picture, lifting the second fetch HTTP/3 query rate from 4% to 26% (Figure 2).
The result was an obvious improvement, but it is still below the Cloudflare measurement. We were seeing around half of the Chrome browsers still not shifting to HTTP/3 across the repeat interval.
Adjust the server’s keepalive timer
Persistent connections in HTTP help amortize the overheads of establishing a TCP connection (and a TLS connection for HTTPS) over a number of subsequent fetches from the same client. When a fetch is complete the server will keep the connection open for a further number of seconds (the ‘keepalive’ interval) before shutting down the session. This improves the server’s responsiveness for the client at the expense of some additional memory state in the server to keep the session open.
In our case, this server behaviour is not what we want for clients that use the Alternative Service directive to switch to HTTP/3. We want the server to close the HTTP/2 session that was used for the initial fetch, so that the browser opens a new session, hopefully over QUIC and HTTP/3, for the second and every subsequent fetch, rather than the more intermittent behaviour we are seeing across the seven repeat fetches.
We tried setting the NGINX server’s keepalive timer to zero seconds, but this had the unintended side effect of disabling all QUIC support in the server. Clearly, this was not the intended outcome!
We then raised the keepalive timer to one second, on the basis that a non-zero value would enable QUIC support. The result of these actions on the count of seen QUIC sessions is shown in Figure 3.
Clearly, for the Chrome-like behaviour that is triggered by the Alternative Service directive, this has had the desired effect. The subsequent fetch cycle has seen the level of use of QUIC rise to over 50% of the tests.
It should be noted that slightly more than half of the QUIC-capable samples switched to use QUIC on the second fetch, the remainder switched on the third or later fetches, and in around 10% of fetch sequences, clients switched back to using TLS over TCP. It was also noted that this one-second timer value had the side effect of disabling all use of QUIC on the first fetch (the use triggered by the DNS HTTPS resource record, seen predominantly in the Safari browsers).
It appears that the browsers that use the DNS lookup of the HTTPS record are experiencing the issue that the QUIC connection is shutting down before it can get established.
The IETF specification of QUIC (RFC 9000) includes an Idle Timeout (Section 10.1); it is possible that the server’s keepalive timer value is being used by the underlying QUIC transport as well, and the QUIC code is performing an early close of the QUIC session before the HTTP context can be established.
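For reference, the idle timeout rule in RFC 9000 can be sketched in a few lines of Python. The function name `effective_idle_timeout_ms` and its arguments are illustrative assumptions. Whether NGINX actually feeds its keepalive timer into the advertised max_idle_timeout is our conjecture, not something this sketch confirms.

```python
from typing import Optional


def effective_idle_timeout_ms(local_ms: Optional[int],
                              peer_ms: Optional[int],
                              pto_ms: float) -> Optional[float]:
    """Effective QUIC idle timeout per RFC 9000, Section 10.1.

    local_ms / peer_ms: each endpoint's advertised max_idle_timeout
    (None or 0 means that endpoint did not advertise a timeout).
    pto_ms: the current Probe Timeout.
    """
    advertised = [t for t in (local_ms, peer_ms) if t]
    if not advertised:
        # Neither endpoint advertised a non-zero value: idle timeout is disabled.
        return None
    # The minimum of the advertised values, but never less than three PTOs.
    return max(min(advertised), 3 * pto_ms)


# Assumed values: a server advertising 1 second, a client advertising 30 seconds.
print(effective_idle_timeout_ms(1000, 30000, pto_ms=100))  # 1000
```

If the one-second keepalive value were being advertised this way, the server's value would win regardless of what the client offered, which would be consistent with the early close we suspect.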
Further adjustment to the server’s keepalive timer
To test this theory, we adjusted the server’s keepalive parameter up from one second to 20 seconds.
At this point, the measurement script is performing seven repeat fetches, scheduled at two-second intervals. The server’s keepalive timeout is set to 20 seconds. Also, to speed up the DNS-triggered path we’ve added ipv4hint and ipv6hint fields to the HTTPS record in addition to the alpn="h3" field, allowing the client to bypass an additional DNS query, assuming that the client code will accept these fields in place of a further explicit DNS query.
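As an illustration of what such a record looks like, the following Python sketch assembles an HTTPS resource record in zone-file presentation format (RFC 9460). The function name `https_record`, the owner name `x.example.`, the TTL, and the addresses (taken from the documentation ranges) are all hypothetical; they are not the values used in the measurement.

```python
import ipaddress
from typing import Sequence


def https_record(owner: str, ttl: int, alpn: Sequence[str],
                 ipv4hints: Sequence[str] = (), ipv6hints: Sequence[str] = (),
                 priority: int = 1, target: str = ".") -> str:
    """Build an HTTPS RR in presentation format with alpn and address hints."""
    params = ['alpn="' + ",".join(alpn) + '"']
    if ipv4hints:
        # ipaddress validates each hint and raises ValueError on a bad address.
        v4 = [str(ipaddress.IPv4Address(a)) for a in ipv4hints]
        params.append("ipv4hint=" + ",".join(v4))
    if ipv6hints:
        v6 = [str(ipaddress.IPv6Address(a)) for a in ipv6hints]
        params.append("ipv6hint=" + ",".join(v6))
    # A target of "." in service mode means "the owner name itself".
    return f"{owner} {ttl} IN HTTPS {priority} {target} " + " ".join(params)


print(https_record("x.example.", 300, ["h3"], ["192.0.2.1"], ["2001:db8::1"]))
# x.example. 300 IN HTTPS 1 . alpn="h3" ipv4hint=192.0.2.1 ipv6hint=2001:db8::1
```

With the hints present, a client that trusts them has everything it needs to open a QUIC connection from the single HTTPS answer.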
These changes appeared to have addressed the major issues we’ve been having with low QUIC counts (Figure 4).
The longer keepalive value allows the initial fetch to use QUIC, assuming that the client is one that uses the DNS HTTPS lookup, so we are once more seeing a first fetch value of 1.9% of samples using QUIC as a global average, while some economies are seeing the use of QUIC on first fetch higher than 6%. It’s difficult to tell whether this rate could be pushed higher by further exploration of the keepalive timers and the scheduling of the subsequent fetches in the measurement script. What we cannot see at all is the failed connection attempts for QUIC, where the outbound path from the client to the server discards UDP packets addressed to port 443. Overly enthusiastic local firewall filter rules have had quite a notable impact on the robustness of other services (such as IPv6-in-IPv4 tunnelling using 6to4), and because we have no direct way of looking at the outbound behaviour of QUIC packets, the robustness of this approach is challenging to measure.
In any case, a deployment rate across the entire Internet of more than 50% is no mean feat. The world map of QUIC support shows this level of support for QUIC in almost all economies. The only major economy with a level of QUIC support below 20% is China (Figure 5).
If a server supports the delivery of content over QUIC, will clients then use QUIC?
The answer appears to be ‘yes’, in that a majority of browser clients will use QUIC when it is offered, but there are a number of caveats to this positive response. Given the majority of browser clients use Chrome, and Chrome still uses the Alternative Service directive to switch to HTTP/3 for subsequent fetches, this implies that both the server and the content need to enable QUIC use. It also requires that the content is constructed to enable the use of multiple fetches in sequence. As we’ve found, this may require some careful tuning of the keepalive parameter to allow the client to switch over to use QUIC for subsequent fetches.
In this respect, the DNS HTTPS record approach looks preferable as a triggering mechanism for QUIC. It allows the attributes of QUIC (faster connection establishment, zero round-trip time, or 0-RTT, session reestablishment, superior multi-session support, and full encryption of the end-to-end transport parameters) to be used from the point of first contact between client and server.
However, deploying change is becoming slower on the Internet, in line with its continued growth. In this case, the change is not confined to the behaviour of the client browser set (which is not a large set) or to the collection of servers (again, with the centrality pressures in mind, this is not a particularly large set). While QUIC has some obvious advantages in terms of offering a faster and more secure experience to the user, it appears the downside lies in the cost of making changes to the various provisioning systems for the DNS and coordinating DNS published capabilities with server-side capabilities. But this is perhaps not as much of a barrier to widespread adoption as it might appear.
In a scan of the Tranco domain name list (31 August 2022), of the top 250K names in that list, some 12.5% of these domain names have an HTTPS record. Most of these HTTPS records have a similar format, including address hint field values that point to Cloudflare servers. So, in some ways, this does not indicate a general adoption of HTTPS records in servers, but the use of this construct by a single Content Distribution Network (CDN) platform that is widely used. In Cloudflare’s case, the names of the services hosted by Cloudflare are served by authoritative servers also operated by Cloudflare, so the inclusion of the appropriate HTTPS records is relatively straightforward. Given the almost universal desire to speed up the elements of HTTP content delivery it’s likely that other content delivery platforms will also adopt QUIC support in the near future.
The next step for the general adoption of QUIC in the browser world may well lie in the timing of the code release for the Chrome browser to add the HTTPS DNS record as the trigger to enable QUIC at the time of the initial fetch. QUIC use measurements are available at APNIC Labs.