The Internet was not designed with security in mind.
A number of recent protocols such as Encrypted DNS and HTTPS encrypt critical parts of the web architecture, which can otherwise be exploited by eavesdroppers to infer users’ data. But encryption may not necessarily guarantee privacy, especially when it comes to metadata.
Emerging standards such as DNS-over-HTTPS (DoH) and Encrypted Server Name Indication (ESNI) can protect the content of DNS queries and of the TLS SNI extension, which is otherwise sent in cleartext as part of the ClientHello of the TLS handshake. However, it might still be possible to determine which websites users are visiting by simply looking at the destination IP addresses of the traffic originating from users’ devices, which remain visible in the IP header of every packet.
This metadata can be exploited and monetized by various parties to profile users and target them with advertising.
Searching for Page Load Fingerprints
We, at the University of Illinois, conducted a measurement study to understand whether an adversary can deduce the websites a user is trying to connect to, using only the set of IP addresses contacted by the user’s device.
Using a highly configurable web crawler built on top of Chromium called MIDA, we performed DNS resolution on all domains involved in rendering the most popular websites listed in the Alexa Top 1 Million.
Figure 1 — The workflow we adopted to perform our measurement study.
We also accounted for the resources that get loaded from different web servers through the sub-requests performed when a website is requested. The set of all IP addresses contacted during a page load is referred to as the Page Load Fingerprint (PLF) of the website.
We adopted the model of an adversary who aims to recover domain information by collecting forward mappings of various candidate domains, and then using the answers to infer the reverse mapping of a given IP.
Figure 2 — A graphical representation of how a PLF accounts for the several resources loaded on part of a web request. Disclaimer: The above website is used just as an example.
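As a minimal sketch of the two ideas above (building a PLF and inverting forward lookups), assume the crawl and the DNS answers are already available as plain dictionaries. The domains, the documentation-range IP addresses, and the function names here are hypothetical and only for illustration; this is not the MIDA pipeline used in the study.

```python
from collections import defaultdict

# Hypothetical forward DNS answers (domain -> set of IPs).
# In a real measurement these would come from live lookups.
FORWARD = {
    "news.example": {"192.0.2.10"},
    "static.news.example": {"192.0.2.11"},
    "cdn.example": {"198.51.100.5"},
    "fonts.example": {"198.51.100.5"},
    "shop.example": {"203.0.113.7"},
}

# Hypothetical crawl output: site -> domains contacted while rendering it.
SITE_RESOURCES = {
    "news.example": ["news.example", "static.news.example", "cdn.example"],
    "shop.example": ["shop.example", "cdn.example"],
}


def page_load_fingerprint(site, resources=SITE_RESOURCES, forward=FORWARD):
    """Return the set of all IPs contacted when loading `site`."""
    ips = set()
    for domain in resources[site]:
        ips |= forward.get(domain, set())
    return frozenset(ips)


def reverse_mapping(forward=FORWARD):
    """Invert forward lookups: IP -> set of candidate domains."""
    reverse = defaultdict(set)
    for domain, ips in forward.items():
        for ip in ips:
            reverse[ip].add(domain)
    return dict(reverse)
```

In this toy data, `reverse_mapping()` maps `198.51.100.5` to both `cdn.example` and `fonts.example`, so that address alone does not reveal which of the two domains was contacted.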
DNS and SNI privacy offers limited protection
For each IP address in our dataset we calculated the number of domains that map to it as its anonymity set.
Just under half of the IP addresses in our data set (47.6%) correspond to a single domain. For these domains, where the adversary knows the set of potential addresses a user may look up and is able to perform forward lookups on them, encrypted DNS provides little to no benefit. About 20% of the requests are uniquely identifying in this way. Notably, XMLHttpRequests (XHRs) are less likely to map to site-unique IP addresses, whereas stylesheets and images are more likely to do so.
Figure 3 — This graph maps the number of anonymity sets generated to their sizes indicating that almost half the anonymity sets are of size 1 and thus, can be uniquely mapped to a website.
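The anonymity-set calculation can be sketched in a few lines. This is an illustrative example under assumed inputs: the forward-lookup table below is made up, and the function names are ours rather than anything from the study's tooling.

```python
from collections import Counter


def anonymity_set_sizes(forward):
    """For each IP, count how many distinct domains resolve to it."""
    sizes = Counter()
    for domain, ips in forward.items():
        for ip in set(ips):  # de-duplicate repeated answers for one domain
            sizes[ip] += 1
    return sizes


def fraction_single_domain(sizes):
    """Share of IPs whose anonymity set has size 1."""
    if not sizes:
        return 0.0
    return sum(1 for n in sizes.values() if n == 1) / len(sizes)


# Hypothetical forward lookups (domain -> list of IPs).
forward = {
    "a.example": ["192.0.2.1"],
    "b.example": ["192.0.2.2", "198.51.100.9"],
    "c.example": ["198.51.100.9"],
    "d.example": ["198.51.100.9"],
}
sizes = anonymity_set_sizes(forward)
```

Here two of the three IP addresses map to exactly one domain, so an observer who sees either of them can name the domain with a single forward-lookup table, while `198.51.100.9` hides among three domains.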
Around 68% of the IPs in our data set are unique to a single site, and a total of 402,524 (42.6%) sites use at least one resource whose domain maps to a site-unique IP address. The vast majority of websites (95.7%) have a unique PLF, suggesting there is a risk that a visit to the site can be identified solely from the list of contacted IP addresses.
Figure 4 — How a page load fingerprint can be used as a signature to identify the webpage that was requested by simply looking at the IP addresses.
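To make the signature idea concrete, here is a small sketch of closed-world matching: an observed set of destination IPs is compared against a table of known PLFs. The table, the site names, and the exact-match rule are simplifying assumptions for illustration; a real trace may interleave several page loads and would need a more tolerant matcher.

```python
from collections import defaultdict

# Hypothetical PLFs (site -> set of IPs contacted during a page load).
PLFS = {
    "news.example": {"192.0.2.10", "198.51.100.5"},
    "shop.example": {"203.0.113.7", "198.51.100.5"},
    "blog.example": {"198.51.100.5"},
    "mirror.example": {"198.51.100.5"},
}


def index_by_fingerprint(plfs):
    """Group sites by their exact PLF."""
    index = defaultdict(list)
    for site, ips in plfs.items():
        index[frozenset(ips)].append(site)
    return index


def identify(observed_ips, plfs=PLFS):
    """Return every known site whose PLF equals the observed IP set."""
    index = index_by_fingerprint(plfs)
    return sorted(index.get(frozenset(observed_ips), []))


def unique_plf_fraction(plfs=PLFS):
    """Share of sites whose PLF is not shared with any other site."""
    if not plfs:
        return 0.0
    index = index_by_fingerprint(plfs)
    unique = sum(1 for sites in index.values() if len(sites) == 1)
    return unique / len(plfs)
```

With this toy table, observing `{"203.0.113.7", "198.51.100.5"}` singles out `shop.example`, whereas observing only `{"198.51.100.5"}` leaves two candidates, `blog.example` and `mirror.example`.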
We thus conclude that, in the context of web browsing, DNS and SNI privacy offers limited protection against an adversary who knows a plausible set of sites a user might visit (even if the set is quite large), and who performs forward lookups to infer the domain names and sites associated with given IP addresses.
Real-world inference will differ somewhat from our closed-world assumption, because the adversary will be working with a wider data set than ours. A PLF that appears unique in our study may in fact belong to two different websites, so our uniqueness figures are optimistic. Even so, we have identified IP addresses that map to a single domain, and these can potentially be used to identify websites uniquely.
We do identify a significant opportunity for content distribution networks (CDNs) to offer additional protection by coalescing more domains onto the same IP address. HTTP/2 connection coalescing can suppress connections from the page load trace and contribute to improved user privacy.
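The effect of coalescing on anonymity sets can be illustrated with a toy model. This sketch only rewrites a made-up forward-lookup table so that several domains share one address; it does not model HTTP/2 connection coalescing itself, and all names and addresses are hypothetical.

```python
from collections import Counter


def smallest_anonymity_set(forward):
    """Size of the smallest anonymity set across all IPs in the table."""
    sizes = Counter(ip for ips in forward.values() for ip in set(ips))
    return min(sizes.values())


def coalesce(forward, domains, shared_ip):
    """Return a copy of the table with `domains` all served from `shared_ip`."""
    return {
        domain: ([shared_ip] if domain in domains else list(ips))
        for domain, ips in forward.items()
    }


# Hypothetical table before coalescing: every domain has its own IP.
before = {
    "a.example": ["192.0.2.1"],
    "b.example": ["192.0.2.2"],
    "c.example": ["192.0.2.3"],
}
after = coalesce(before, {"a.example", "b.example", "c.example"}, "198.51.100.1")
```

Before coalescing, every address identifies its domain; afterwards, the single shared address corresponds to three domains, so the destination IP alone no longer distinguishes them.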
To learn more about our work, watch our presentation at the Applied Networking Research Workshop 2019.
Simran Patil is a Master's student in Computer Engineering at the University of Illinois at Urbana-Champaign. She is a part of the Hatswitch research group led by her advisor, Prof. Nikita Borisov, at Security and Privacy Research at Illinois.