This is the fourth blog post on the topic of the centralization of the Internet. The previous posts cover the diversity of authoritative name servers, the diversity of MX records, and an analysis of the use of CAA records across generic Top-Level Domains (gTLDs).
The Domain Name System (DNS), aside from being inevitably the cause of whatever infrastructure problems you’re encountering, is a treasure trove of data. As the foundation for so much of what happens on the Internet, it serves as a good means to measure and assess the degree of centralization of the Internet at large.
After having looked at NS, MX, and CAA records, I’ve now analysed the use of A and AAAA records making up the large set of so-called ‘naked domains’. The term ‘naked’ or ‘bare’ domain is generally used to refer to a second-level domain (example.com), in contrast to a more qualified subdomain, such as www.example.com.
Any second-level domain necessarily must have some resource records. At a minimum, there’s the Start of Authority (SOA) record, as well as one or more NS records. For any domain that’s DNSSEC signed, we also get a few additional records (RRSIG, DNSKEY, and so on), and the analysis of those might be another project to undertake some time.
“The World Wide Web is the only thing I know of whose shortened form takes three times longer to say than what it’s short for.” (Douglas Adams)
Even though we’ve trained users well to type ‘www.example.com’, this is very annoying to do. For most users, there really is no distinction between the World Wide Web and the Internet, so they will often leave off the ‘www’ and simply enter ‘example.com’ into their browser, still expecting to be taken to that company’s website. That is why many domains add IP addresses to their second-level domain name, creating such a naked domain.
Putting the possible complications relating to, for example, X.509 certificates and wildcards, Content Distribution Network (CDN) load-balancing on the apex (and more) aside, I once again dug into the gTLD zone files as well as miscellaneous ccTLD data and threw about 9GB of compressed zone data at my little virtual private server running bind(9) to find out just how many domains are using which IP addresses at the apex.
As I had noted when talking about email addresses, some TLDs are completely naked, as it were. By having A / AAAA or MX records, they are dotless domains. ICANN prohibits dotless domains, but nevertheless, I found 12 TLDs with A or AAAA records:
ai, arab, cm, music, pn, tk, uz, va, ws, xn--l1acc, xn--mxtq1m, and xn--ngbrx. Of those, only va has an IPv6 address (and is, in fact, IPv6 only).
But back to the naked second-level domains…
IPv4 vs IPv6 usage
After parsing the zone data, I ended up with roughly 240 million domain names (166 million, or nearly 70%, of which are in .com alone) and looked at the distribution of IPv4 vs IPv6 addresses.
Now DNS data is necessarily always a snapshot in time, and during the lookup there are many things that can go wrong. As a result, I saw a number of NXDOMAIN and SERVFAIL results. Those aside, I found:
- About 9% of all domains (24M) do not have either an A or AAAA record.
- Roughly three-quarters of all domains are IPv4 only.
- 8.5% of all domains (22.5M) are dual-stack.
- Roughly 65K domains are IPv6 only.
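A minimal sketch of how this bucketing could be done once the lookups are in hand. It assumes the results have already been collected into a mapping from domain name to its A and AAAA answers; the sample names and addresses are hypothetical documentation values, not data from this analysis.

```python
from collections import Counter


def classify(a_records, aaaa_records):
    """Return the address-family category for one domain's apex records."""
    has_v4 = bool(a_records)
    has_v6 = bool(aaaa_records)
    if has_v4 and has_v6:
        return "dual-stack"
    if has_v4:
        return "ipv4-only"
    if has_v6:
        return "ipv6-only"
    return "no-address"


def summarize(lookups):
    """lookups maps domain -> (list of A values, list of AAAA values)."""
    return Counter(classify(a, aaaa) for a, aaaa in lookups.values())


# Hypothetical sample data using documentation addresses.
sample = {
    "a.example": (["192.0.2.1"], []),
    "b.example": (["192.0.2.2"], ["2001:db8::2"]),
    "c.example": ([], ["2001:db8::3"]),
    "d.example": ([], []),
}
summary = summarize(sample)
```

Domains that returned NXDOMAIN or SERVFAIL would need to be filtered out before this step, since an empty answer and a failed lookup are not the same thing.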
But not all domains are of equal importance, so in addition to running this analysis for all domains, I also ran it for the Top 1 Million domains only, as identified by the Tranco list. There, things are slightly better with respect to IPv6 support; among the most popular one million domains, we have almost a quarter of them supporting IPv6:
CNAMEs at the apex
One thing that’s not shown in Figures 1 and 2 is that aside from A and AAAA records, there are also a lot of naked domains that are CNAMEs. That is, the actual second-level domain has a CNAME record that may then eventually resolve to an IPv4 and/or IPv6 address.
Now, having a CNAME record at the apex is, of course, against RFC 1034 because as a second-level domain there must already be at least SOA and NS records, and RFC 1034 says that thou shalt not have a CNAME in the presence of any other resource records (except, of course, there’s also RFC 4034 and DNSSEC ruins everything by saying, no, wait a second, you can have other RRs next to a CNAME, because otherwise, DNSSEC won’t work, but those are the only exceptions, pinky promise, so anyway, you still can’t have a CNAME at the apex).
Only… many organizations really want that precisely because users like to leave off the www when entering a domain in their browser. In total, I found that about 3.8% (9.2M) of all domains and 9K of the Top 1M domains have a CNAME record at the apex, and while that violates RFC 1034, by and large, the world doesn’t end because most systems are still tolerant in what they accept. Looking at what these CNAMEs point to, we find them to cluster in around 200K different domains:
Notably, 54% of all CNAMEs (4.9M) point to a single name (traff-1.hugedomains.com), with other sizeable representations by different domain name registrars, domain name monetization companies, and so on.
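To make the coexistence rule concrete, here is a small sketch that flags owner names where a CNAME sits next to other record types. It assumes the record types per name have already been extracted from zone data or lookups; the allow-list reflects the DNSSEC exception for RRSIG and NSEC, and the zone contents are hypothetical.

```python
# Record types that DNSSEC allows next to a CNAME at the same owner name.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(rrtypes_by_name):
    """Return {name: conflicting types} for names that break the CNAME rule."""
    conflicts = {}
    for name, rrtypes in rrtypes_by_name.items():
        types = {t.upper() for t in rrtypes}
        if "CNAME" in types:
            extra = types - ALLOWED_WITH_CNAME
            if extra:
                conflicts[name] = sorted(extra)
    return conflicts


# Hypothetical zone: the apex has SOA and NS, so a CNAME there conflicts.
zone = {
    "example.com.": ["SOA", "NS", "CNAME"],
    "www.example.com.": ["CNAME"],
    "signed.example.com.": ["CNAME", "RRSIG", "NSEC"],
}
result = cname_conflicts(zone)
```

Because every apex carries SOA and NS records, any apex CNAME ends up in the conflict set, which is exactly the situation described above.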
But let’s go back to IP addresses. NXDOMAIN, SERVFAIL, and NODATA aside, I found 84% of all domains (around 201M) did have at least one IP address. And, of course, it’s not uncommon for a domain to have more than one IP address. For the small number of domains that are dual-stack, it’s quite common to have an equal number of IPv4 and IPv6 addresses:
- 59% (119M) have exactly one IP address
- 22% (44M) have two addresses
- 7% (14M) have four addresses (for example, two IPv4 and two IPv6)
- 236 domains have >100 addresses
Having a domain map to over 100 IP addresses may seem excessive, but when you make a few hundred million DNS lookups, you find all sorts of funny things:
When I collected data and ran the DNS lookups, the domain vannaoh.com had 2,024 IP addresses in total (912 IPv4, 1,112 IPv6), micahclay.us had 1,433 total (896 IPv4, 537 IPv6), and kejanigarage.com had 760 total (228 IPv4, 532 IPv6). That’s an interesting attempt to implement DNS-based round-robin load balancing, I suppose, although it doesn’t seem to consider the various failure modes that such gigantic DNS responses incur.
Oh, and talking about finding funny things in the data, guess what? There are over 668,000 domains that use reserved IP addresses:
- 562K use 127.0.0.0/8 (most 127.0.0.1, but over 300 others appear; 2.1K use ::1)
- 35K use RFC 1918 / RFC 4193 addresses, most commonly 10.10.10.10, 10.0.0.1, 192.168.1.1; some fd00::/8 and even some fc00::/8
- 31K use 0.0.0.0/8, most commonly 0.0.0.0
- 1.5K use RFC 6598 shared address space 100.64.0.0/10, most commonly 100.100.100.100
- 10K use a link-local address 169.254.0.0/16, fe80::/64, most commonly 169.254.254.254
- 1.8K use documentation IPs (for example, 192.0.2.0/24, 198.51.100.0/24, 2001:db8::/32)
- 2.1K use 240.0.0.0/4
- 570 use ::
- 341 use 255.255.255.255
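A check like this can be sketched with Python’s standard `ipaddress` module. The table below mirrors the ranges listed above; the category labels are my own, and I use fe80::/10 (the full link-local block, which contains fe80::/64) and fc00::/7 (which covers both fc00::/8 and fd00::/8). Treat it as an illustration rather than the exact classification behind these counts.

```python
import ipaddress

# Checked in order, so the broadcast address is tested before 240.0.0.0/4.
SPECIAL_RANGES = [
    ("loopback", "127.0.0.0/8"),
    ("loopback", "::1/128"),
    ("private", "10.0.0.0/8"),
    ("private", "172.16.0.0/12"),
    ("private", "192.168.0.0/16"),
    ("private", "fc00::/7"),
    ("this-network", "0.0.0.0/8"),
    ("shared", "100.64.0.0/10"),
    ("link-local", "169.254.0.0/16"),
    ("link-local", "fe80::/10"),
    ("documentation", "192.0.2.0/24"),
    ("documentation", "198.51.100.0/24"),
    ("documentation", "203.0.113.0/24"),
    ("documentation", "2001:db8::/32"),
    ("broadcast", "255.255.255.255/32"),
    ("reserved", "240.0.0.0/4"),
    ("unspecified", "::/128"),
]
NETWORKS = [(label, ipaddress.ip_network(cidr)) for label, cidr in SPECIAL_RANGES]


def special_category(address):
    """Return the label of the first matching special range, or None."""
    ip = ipaddress.ip_address(address)
    for label, network in NETWORKS:
        # Membership is simply False when IP versions differ.
        if ip in network:
            return label
    return None
```

For example, `special_category("100.100.100.100")` falls into the shared address space, while an ordinary public address returns `None`.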
Ok, all the funny business aside, let’s talk about which IP addresses are encountered most frequently.
All in all, I found 328M A and AAAA records across 13.7M unique IP addresses. This, of course, implies that many domains share an IP address — they resolve to the same A or AAAA addresses. And some IP addresses are found more often than others:
- 201 IPs are used by more than 100K domains each
- 64 IPs are used by more than 500K domains each
- 29 IPs are used by more than 1M domains each
- The top five most used IPs are:
- 220.127.116.11 (25M domains) — 18.104.22.168.bc.googleusercontent.com
- 22.214.171.124 (9.8M domains) — a2aa9ff50de748dbe.awsglobalaccelerator.com
- 126.96.36.199 (9.8M domains) — a2aa9ff50de748dbe.awsglobalaccelerator.com
- 188.8.131.52 (6M domains) — unalocated.63.wixsite.com
- 184.108.40.206 (6M domains) — unalocated.63.wixsite.com
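Counting how often each address appears is a straightforward tally. The sketch below assumes a mapping from domain to its resolved addresses and counts each address once per domain; the sample data is hypothetical.

```python
from collections import Counter


def ip_usage(domain_to_ips):
    """Count how many distinct domains resolve to each IP address."""
    usage = Counter()
    for ips in domain_to_ips.values():
        usage.update(set(ips))  # count each IP once per domain
    return usage


def ips_above(usage, threshold):
    """Number of IPs used by more than `threshold` domains."""
    return sum(1 for count in usage.values() if count > threshold)


# Hypothetical sample data using documentation addresses.
sample = {
    "a.example": ["192.0.2.1"],
    "b.example": ["192.0.2.1", "192.0.2.2"],
    "c.example": ["192.0.2.1", "2001:db8::1"],
}
usage = ip_usage(sample)
top = usage.most_common(1)
```

At the scale of hundreds of millions of records, the same tally would more likely be done with sorted files or a database, but the logic is the same.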
In addition, it’s worth noting that all of the top 20 most frequently used IP addresses are IPv4 addresses and that there are only eight IPv6 addresses in the top 100 most frequently seen IPs.
For the Top 1M domains, I found about 1.8 million A and AAAA records mapping to around 700,000 unique IPs, with 50 IPs being used by more than 1,000 domains each, six IPs being used by more than 4,000 domains each, and the top five most used IPs being:
- 220.127.116.11 (7,487 domains) — myshopify.com
- 18.104.22.168 (5,804 domains) — myshopify.com
- 22.214.171.124 (4,726 domains) — unalocated.63.wixsite.com
- 126.96.36.199 (4,688 domains) — unalocated.63.wixsite.com
- 188.8.131.52 (4,654 domains) — unalocated.63.wixsite.com
Now obviously the top IP addresses here belong to some of the same organizations, but it seemed like it might be interesting to see how the IP addresses reverse to identify controlling domains, so I went ahead and tortured my DNS resolver a little bit more…
Now a little over half of the IP addresses found don’t have a PTR record at all or fail resolution. The remaining 6 million addresses do have PTR records; for those, I found 30,000 in-addr.arpa records that are CNAMEs, meaning they follow RFC 2317 to allow for in-addr.arpa delegation on a non-octet boundary.
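For readers who haven’t done reverse lookups by hand: the PTR query name is derived mechanically from the address. A minimal sketch using Python’s standard `ipaddress` module follows; it only builds the name and does not perform any DNS queries.

```python
import ipaddress


def ptr_name(address):
    """Return the fully qualified owner name to query for an IP's PTR record."""
    return ipaddress.ip_address(address).reverse_pointer + "."


v4_name = ptr_name("192.0.2.1")    # octets reversed under in-addr.arpa
v6_name = ptr_name("2001:db8::1")  # nibbles reversed under ip6.arpa
```

If the answer for such a name is a CNAME rather than a PTR, the zone is using the RFC 2317 style of delegation, and the resolver has to follow the alias to reach the actual PTR record.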
Another thing that’s perhaps not entirely obvious is that a single IP address can reverse to multiple domains: I found over 500 IPs that reversed to over 100 domains each. The IP addresses reversing to the largest number of domains were:
- 184.108.40.206 (reverses to 3.2K domains), used by 23 domains
- 220.127.116.11 (2.8K), used by 14 domains
- 18.104.22.168 (2.4K), used by 10 domains
- 22.214.171.124 (2.3K), used by 25 domains
- 126.96.36.199 (2.1K), used by 674 domains
What I think is interesting here is that there are some IPs that have a much, much larger number of PTR records than they are currently used for, hinting at a possible use for domain name parking, and reuse lacking cleanup when the domain is resold, expired, or moved.
Looking at the names to which most IP addresses reverse, we see again concentration across common providers, notably Amazon AWS here, although I thought that having PTR records pointing to the root (.) was also interesting:
- s3-website-us-east-1.amazonaws.com (16K IPs)
- connect.rcp.net (15.5K IPs)
- unassigned.psychz.net (7.5K IPs)
- s3-website-us-west-2.amazonaws.com (5K IPs)
- . (5.7K IPs)
Noticing the shared domain names here, I then went ahead and normalized all the reversed FQDNs to add up the distribution into second-level domains:
We see about 32% of all IPs that do have PTR records point to amazonaws.com, almost 7% pointing to googleusercontent.com, 6% to secureserver.net (owned by GoDaddy / Wild West Domains) and so on.
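The normalization step can be approximated by keeping only the last two labels of each name. This is a deliberately naive sketch: it is wrong for multi-label public suffixes such as co.uk, where a Public Suffix List lookup would be needed, and it is not necessarily how the numbers here were produced.

```python
def second_level(fqdn):
    """Naively reduce an FQDN to its last two labels."""
    labels = [label for label in fqdn.lower().rstrip(".").split(".") if label]
    if len(labels) < 2:
        # Single-label names (for example localhost) and the root stay as-is.
        return ".".join(labels) or "."
    return ".".join(labels[-2:])
```

With this helper, both `s3-website-us-east-1.amazonaws.com` and `s3-website-us-west-2.amazonaws.com` collapse into `amazonaws.com`, while a PTR pointing to the root stays `.`.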
But these are unique IPs; it seems that we should weight the IPs that are used by more domains differently from those that are used by only one domain. When taking this into consideration, we quickly find the most widely used providers: almost 100 million IP addresses, added up by frequency of use, reverse into only 20 hostnames, with the top five being:
- 188.8.131.52.bc.googleusercontent.com (25.4M)
- a4ec4c6ea1c92e2e6.awsglobalaccelerator.com (19.7M)
- unalocated.63.wixsite.com (18M)
- a904c694c05102f30.awsglobalaccelerator.com (6.5M)
- a16e665f42988324c.awsglobalaccelerator.com (3.4M)
(It looks like wixsite.com had a typo in its name but probably couldn’t go and fix it once it was so widely in use. Remember kids, domain names are forever!).
If we again sum up the stats of the reversed names by domain name, instead of FQDN, we see an even starker concentration: 37% of cumulatively encountered IPs (121M) reverse into only five different domains:
- awsglobalaccelerator.com (33.6M)
- googleusercontent.com (32.5M)
- amazonaws.com (21M)
- wixsite.com (18M)
- 1e100.net (16M)
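The difference between counting unique IPs and counting them by frequency of use can be sketched as a weighted sum. The function below assumes two inputs that are not shown in this post: a per-IP count of domains using it, and a mapping from IP to its PTR target. All sample values are hypothetical.

```python
from collections import Counter


def weighted_by_domain_use(ip_usage, ip_to_ptr):
    """Sum per-IP domain counts under each PTR target's last two labels."""
    totals = Counter()
    for ip, domains_using_ip in ip_usage.items():
        ptr = ip_to_ptr.get(ip)
        if ptr is None:
            continue  # no PTR record or failed resolution
        key = ".".join(ptr.rstrip(".").lower().split(".")[-2:])
        totals[key] += domains_using_ip
    return totals


# Hypothetical inputs: domains per IP, and the name each IP reverses to.
usage = {"192.0.2.1": 5, "192.0.2.2": 3, "192.0.2.3": 1, "192.0.2.4": 7}
ptrs = {
    "192.0.2.1": "a1.cdn.example.net.",
    "192.0.2.2": "b2.cdn.example.net",
    "192.0.2.3": "host.example.org",
}
totals = weighted_by_domain_use(usage, ptrs)
```

Here two IPs reverse into example.net, but weighted by use they account for eight domains, which is the kind of shift that makes the concentration more visible.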
Oh, and of course, there are also 554K IPs that reverse to localhost, but let’s stay focused on the domain names. Looking at the top five above, you should quickly notice that, of course, awsglobalaccelerator.com and amazonaws.com are the same entity, as are googleusercontent.com and 1e100.net.
This got me thinking as to how I can better identify IP address space concentration in use here. Visual inspection by domain name is not going to work well, and whois data these days is often unusable, so instead I went ahead and mapped the IP addresses to their Autonomous System Numbers (ASNs).
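Mapping an address to an ASN boils down to a longest-prefix match against a table of announced prefixes. The sketch below shows only that matching logic; the prefix table is hypothetical (documentation prefixes and documentation ASNs), and a real run would load it from a routing data source, which this post does not name.

```python
import ipaddress

# Hypothetical prefix-to-ASN table (documentation prefixes and ASNs only).
PREFIXES = {
    "192.0.2.0/24": 64496,
    "198.51.100.0/24": 64497,
    "198.51.100.128/25": 64498,  # more specific than the /24 above
    "2001:db8::/32": 64499,
}
TABLE = [(ipaddress.ip_network(prefix), asn) for prefix, asn in PREFIXES.items()]


def lookup_asn(address):
    """Return the ASN of the longest matching prefix, or None."""
    ip = ipaddress.ip_address(address)
    matches = [(net.prefixlen, asn) for net, asn in TABLE if ip in net]
    return max(matches)[1] if matches else None
```

A linear scan like this is fine for illustration; for millions of addresses, a radix tree or a bulk lookup service would be the more practical choice.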
IPs by ASN
The top 1 million most frequently encountered IP addresses map into around 12,000 different ASNs but, of course, here too we can see a high degree of concentration: 48.5% map into just 20 ASNs, with the top five being:
- AS13335 (CLOUDFLARENET, US) (120K IPs)
- AS16509 (AMAZON-02, US) (45K IPs)
- AS47583 (AS-HOSTINGER, CY) (34K IPs)
- AS26347 (DREAMHOST-AS, US) (31K IPs)
- AS16276 (OVH, FR) (28K IPs)
OVH likely shows prominently here because .fr is one of the few ccTLDs for which the list of domain names is available.
If we then look at the 328 million cumulatively encountered A / AAAA records, we find nearly 50% mapping into just five ASNs:
- AS16509 (AMAZON-02, US) (52.4M, 16%)
- AS13335 (CLOUDFLARENET, US) (40M, 12%)
- AS396982 (GOOGLE-CLOUD-PLATFORM, US) (31M, 9.5%)
- AS15169 (GOOGLE, US) (19M, 5.8%)
- AS58182 (WIX_COM, IL) (18M, 5.5%)
As before, we can again identify distinct ASNs owned by the same company (for example, AS396982 and AS15169) and add them up, which then gives us this picture:
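Using only the five ASNs listed above, the grouping can be worked through as follows. The ASN-to-owner mapping is hand-maintained for this sketch, and the result covers just these five ASNs, so it will not match a chart that also folds in the companies’ other ASNs.

```python
from collections import defaultdict

TOTAL_RECORDS_M = 328  # cumulative A / AAAA records, in millions

# Records per ASN in millions, as listed in the top five above.
ASN_COUNTS_M = {
    "AS16509": 52.4,
    "AS13335": 40.0,
    "AS396982": 31.0,
    "AS15169": 19.0,
    "AS58182": 18.0,
}

# Grouping of ASNs by operator; an assumption maintained by hand.
ASN_OWNER = {
    "AS16509": "Amazon",
    "AS13335": "Cloudflare",
    "AS396982": "Google",
    "AS15169": "Google",
    "AS58182": "Wix",
}


def share_by_owner(counts, owners, total):
    """Return each owner's share of `total` as a percentage, one decimal."""
    sums = defaultdict(float)
    for asn, count in counts.items():
        sums[owners[asn]] += count
    return {owner: round(100 * n / total, 1) for owner, n in sums.items()}


shares = share_by_owner(ASN_COUNTS_M, ASN_OWNER, TOTAL_RECORDS_M)
```

Combining the two Google ASNs gives 50M records, or roughly 15% of the total, which puts Google close to Amazon’s 16% from AS16509 alone.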
Alright, that’s a whole lot of numbers and pie charts. Let’s try to summarize and see what this data is telling us.
For starters — and this may not be surprising, but still disappointing — we find that we’re still very much in an IPv4-only world — almost three-quarters of all domains are IPv4 only.
We also saw that there’s a clear demand for CNAMEs at the apex. This is something that especially CDNs have to deal with, as they often rely on CNAMEs to effectively load balance traffic across their edge networks.
Akamai, for example, offers a feature called Zone Apex Mapping, Cloudflare uses ‘CNAME flattening’, and Amazon Route53 offers so-called alias records. In my opinion, there’s an opportunity for standardization of a workable solution here (SVCB / HTTPS DNS records, as specified in this draft, are a good candidate).
But with respect to the question of whether the Internet is (effectively) centralized, we did note that:
- A single IP address is used by over 10% of all domains, which I think is impressive
- 37% of cumulatively encountered IPs reverse into five domains owned by three companies (AWS, Google, Wix)
- Four companies (Amazon, Google, Cloudflare, and Wix) control over 50% of the most frequently encountered IP addresses
However, it is important to note that the data used for this research is incomplete. The inability to get access to the ccTLD zones means that our data is necessarily skewed towards the US market. For example, I would expect there to be a significant weight shift if I had been able to analyse the .CN, .JP, and .DE zones, to name just a few examples.
Secondly, what I have presented here is only a snapshot in time. It might be interesting to track these findings over a longer timeframe to see whether or not we are moving towards an increasingly centralized Internet.
Jan Schaumann is a Distinguished Infrastructure Security Architect, and Adjunct Professor of Computer Science, with an interest in information security and the overall health of the internet, as well as the safety and privacy of its users. You can follow Jan on Twitter and Mastodon.
This post is adapted from the original at Jan’s Blog.