The recent explosion in the number of consumer Internet of Things (IoT) devices has increased the Internet’s exposure to large-scale attacks, such as the Mirai DDoS attack that broke records for traffic volumes.
To mitigate such attacks, network operators need to be able to identify and locate IoT devices in their networks, a task that is becoming increasingly challenging.
After all, in large networks with millions of subscribers, there is already 100TB of traffic per day to sift through, and usually only sparsely-sampled summaries of network traffic (such as NetFlow) are available.
To improve on current practice, my colleagues and I, in a multinational project led by the Max Planck Institute for Informatics, set out to determine whether sparsely-sampled flow captures can be used to detect IoT devices, and at what granularity those devices can be detected.
In our study, we empirically demonstrated that IoT devices have communication patterns that appear even in sparsely-sampled data. We used these patterns to generate signatures and rules that rely on only a limited set of packet fields. We applied these rules to flow data collected at a large European ISP and IXP, and detected millions of subscriber lines that had devices from at least one of the studied IoT manufacturers. We have made these signatures available, so you can apply them to your own data.
Detecting IoT devices
IoT devices tend to provide functionality by relying on a backend infrastructure in the form of a set of remote servers. We found that by studying the destinations of IoT communication, we can infer the type of device hosted at a subscriber line, even if the device is behind Network Address Translation (NAT).
This left us the problem of how to determine which devices contact which servers. To bootstrap this mapping between devices and the servers they rely on, we set up two testbeds with 56 IoT products, from 40 vendors, in six categories.
Next, we established ground truth for each device’s traffic using two types of experiments:
- Idle experiments, where we powered on the devices and left them alone, so their network traffic corresponds to heartbeat and synchronization communication.
- Active experiments, where we used automation to interact with devices, for example, turning on/off, changing volume, and sending voice commands.
Conducting these experiments in a lab provides ground truth about each device’s network communication, but not about whether the device is detectable in the sparsely-sampled NetFlow records we would expect at a network provider. To bridge this gap, we tunnelled the traffic from our two IoT testbeds (located outside of the ISP being studied) to a household inside the ISP, as shown in Figure 2. This also helped us to ensure that the contacted destinations are not biased toward the location of the testbeds and are similar to the ones contacted by the other subscribers of the ISP.
Generating detection rules
Once we generated the IoT traffic, the next step was to identify which features (IP addresses, ports, domains, and protocols) can be used to identify these devices. Remember that at the provider level, we use flow captures such as NetFlow, and these captures contain only packet headers.
We started with a simple approach: create a hit list of IPs and port numbers that each IoT device communicates with. We then used this list to find other subscriber lines that communicate with the same set of IPs, or even a subset of them. When the lists matched, we inferred the presence of the given IoT device.
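To make the hit-list idea concrete, here is a minimal sketch, not the code used in the study. The device labels, the documentation-range IP addresses, the `(subscriber, dst_ip, dst_port)` flow tuple layout, and the `min_matches` threshold are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical hit list: device label -> (server IP, port) pairs seen in a testbed.
HIT_LIST = {
    "example-speaker": {("198.51.100.10", 443), ("198.51.100.11", 8883)},
    "example-camera": {("203.0.113.5", 443)},
}

def detect_devices(flows, hit_list, min_matches=1):
    """Map each subscriber line to the device labels whose hit list it touches.

    `flows` is an iterable of (subscriber, dst_ip, dst_port) tuples, a
    simplified stand-in for sampled flow records.
    """
    # Collect the set of endpoints contacted by each subscriber line.
    contacted = defaultdict(set)
    for subscriber, dst_ip, dst_port in flows:
        contacted[subscriber].add((dst_ip, dst_port))

    detections = {}
    for subscriber, endpoints in contacted.items():
        labels = set()
        for device, signature in hit_list.items():
            # A subset match is enough; never require more hits than the list holds.
            needed = min(min_matches, len(signature))
            if len(signature & endpoints) >= needed:
                labels.add(device)
        if labels:
            detections[subscriber] = labels
    return detections

flows = [
    ("line-a", "198.51.100.10", 443),
    ("line-a", "192.0.2.77", 80),
    ("line-b", "203.0.113.5", 443),
    ("line-c", "192.0.2.77", 80),
]
result = detect_devices(flows, HIT_LIST)
print(result)
```

Raising `min_matches` trades coverage for confidence: fewer lines match, but each match rests on more evidence.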
However, we probably saw only a subset of the relevant IP addresses in the ground truth data, and those addresses might change over time. Moreover, the IP addresses of Content Delivery Networks (CDNs) serve a wide range of websites and applications that are not necessarily related to IoT devices. Therefore, we needed to exclude CDN IP addresses.
We addressed these challenges by analysing the fully qualified domain names resolved by the devices and determining which domains and their associated IPs can be used to generate our rules. For this purpose, we used passive DNS datasets to map domains to IPs and vice versa, to find additional IP addresses of a domain, and to infer whether an IP belongs to a shared (CDN) infrastructure or to a dedicated infrastructure that supports IoT devices.
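One way to picture the shared-versus-dedicated inference is the sketch below. It is an illustration under stated assumptions, not the study's actual criterion: it treats an IP as dedicated when every domain observed on it in a passive DNS dataset belongs to the IoT domains of interest, and as shared otherwise. The record layout, the `.example` domain names, and the naive two-label `registered_domain` helper are hypothetical; a real pipeline would need a public suffix list.

```python
from collections import defaultdict

def registered_domain(fqdn):
    # Simplification (assumption): treat the last two labels as the registered domain.
    return ".".join(fqdn.rstrip(".").lower().split(".")[-2:])

def classify_ips(pdns_records, iot_domains):
    """Split IPs serving IoT domains into dedicated and shared sets.

    `pdns_records` is an iterable of (fqdn, ip) pairs, a simplified
    stand-in for a passive DNS dataset.
    """
    ip_to_bases = defaultdict(set)
    for fqdn, ip in pdns_records:
        ip_to_bases[ip].add(registered_domain(fqdn))

    iot_bases = {registered_domain(d) for d in iot_domains}
    dedicated, shared = set(), set()
    for ip, bases in ip_to_bases.items():
        if not bases & iot_bases:
            continue  # this IP never served an IoT domain
        if bases <= iot_bases:
            dedicated.add(ip)  # only IoT domains were seen on this IP
        else:
            shared.add(ip)  # also serves unrelated domains, e.g. a CDN
    return dedicated, shared

records = [
    ("api.iot-vendor.example", "198.51.100.10"),
    ("mqtt.iot-vendor.example", "198.51.100.11"),
    ("cdn.iot-vendor.example", "203.0.113.5"),
    ("www.news-site.example", "203.0.113.5"),
    ("www.other.example", "192.0.2.1"),
]
iot = ["api.iot-vendor.example", "mqtt.iot-vendor.example", "cdn.iot-vendor.example"]
dedicated, shared = classify_ips(records, iot)
print(sorted(dedicated), sorted(shared))
```

Only the dedicated IPs would then be candidates for detection rules; the shared ones are dropped because unrelated traffic also reaches them.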
Filtering out shared and non-relevant domains meant we lost some of the features and had to work with less information. As a result, some devices could only be detected at a coarser granularity. We generated rules at three detection levels, from the finest granularity to the coarsest:
- Product-level, for example, Amazon Echo
- Manufacturer-level, for example, Samsung device
- Platform-level: IoT device (we can’t infer the product type or manufacturer)
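The three levels above can be thought of as a fallback order: report the finest level whose rule matches. The sketch below illustrates that idea only. The rule contents, the labels, and the choice to require every domain in a rule to be contacted are assumptions, not the rules published with the study.

```python
from enum import IntEnum

class Level(IntEnum):
    # Higher value means finer granularity.
    PLATFORM = 1
    MANUFACTURER = 2
    PRODUCT = 3

# Hypothetical rules: (level, label, domains that must all be contacted).
RULES = [
    (Level.PRODUCT, "example-product", {"product.vendor.example"}),
    (Level.MANUFACTURER, "example-vendor", {"api.vendor.example"}),
    (Level.PLATFORM, "example-iot-platform", {"broker.platform.example"}),
]

def finest_detection(contacted_domains, rules=RULES):
    """Return the (level, label) of the finest matching rule, or None."""
    matches = [
        (level, label)
        for level, label, domains in rules
        if domains <= contacted_domains  # every domain in the rule was seen
    ]
    # Tuples compare by level first, so max() picks the finest granularity.
    return max(matches, default=None)

both = finest_detection({"api.vendor.example", "product.vendor.example"})
platform_only = finest_detection({"broker.platform.example"})
nothing = finest_detection({"unrelated.example"})
print(both, platform_only, nothing)
```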
Applying the detection rules on datasets from an ISP
The final part of our project involved applying our methodology to data from a large European ISP with 15 million subscribers. Figure 3 shows that Alexa-enabled devices, that is, devices that respond to Alexa Voice Service commands, can be detected within minutes!
Increasing the observation period for each subscriber from one hour to 24 hours lets us detect devices at even more subscriber lines, as shown in Figure 4. The figure also shows that some devices need a longer observation period before they become detectable.
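A back-of-the-envelope model helps explain why a longer window finds more devices. Assuming each packet is sampled independently with probability p (a simplification of how routers sample), the chance that at least one of n packets to a signature endpoint shows up in the flow data is 1 - (1 - p)^n. The sketch below evaluates this for hypothetical numbers; the 1-in-1,000 sampling rate and the 60 packets per hour are illustrative assumptions, not figures from the study.

```python
def detection_probability(packets_sent, sampling_rate):
    """Probability that at least one of `packets_sent` packets is sampled,
    assuming independent per-packet sampling (a simplifying assumption)."""
    return 1.0 - (1.0 - sampling_rate) ** packets_sent

SAMPLING_RATE = 1 / 1000   # hypothetical 1-in-1,000 packet sampling
PACKETS_PER_HOUR = 60      # hypothetical heartbeat volume of an idle device

p_one_hour = detection_probability(PACKETS_PER_HOUR * 1, SAMPLING_RATE)
p_one_day = detection_probability(PACKETS_PER_HOUR * 24, SAMPLING_RATE)
print(f"1 hour: {p_one_hour:.3f}, 24 hours: {p_one_day:.3f}")
```

Under these assumptions, a device that sends little traffic is unlikely to appear in a single hour of sampled data but becomes much more likely to appear over a full day.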
Our insights can be used to develop signatures that allow an ISP to identify households that use specific IoT services. If such services are, for example, subject to security concerns, the ISP can use these signatures to notify the affected customers of the potential problem and how to fix it.
The same applies if the IoT service is no longer supported or needs manual upgrades by the end user, for example, to mitigate threats. Such signatures may also help operators move from observing DDoS attacks towards identifying the culprits. To learn more about our work, read our ACM IMC’20 paper A Haystack Full of Needles: Scalable Detection of IoT Devices in the Wild.
Said Jawad Saidi is a Research Assistant at the Max Planck Institute for Informatics.