Real-time detection of DNS exfiltration

Published on 24 Jun 2019

Category: Tech matters

DNS traffic has historically been policed less strictly by organizations than services such as email, FTP, and HTTP. As a result, cyber criminals have exploited it in recent years, causing damages that have ranged into the millions of dollars.

Enterprise firewalls are typically configured to allow all packets on UDP port 53 (used by the DNS) because the DNS is a crucial service for virtually all applications. Some firewalls do offer enhanced DNS protection, but this requires deep packet inspection of DNS messages to identify the covert channel and then isolate domains that contain encoded data. The significant resources required for this capability, and the resulting impact on firewall forwarding performance, usually lead enterprise network operators to disable such features.

This ability to transit firewalls gives attackers a covert channel, albeit a low-rate one, through which they can exfiltrate private data and maintain communication with malware by tunneling other protocols (for example, SSH and FTP) to command-and-control centres. One example is DNSMessenger, a remote access trojan discovered in 2017 that used DNS queries and responses to execute malicious PowerShell commands on compromised hosts.

What are our contributions to tackle this problem?

We at the University of New South Wales (UNSW) have developed a real-time approach to detect data theft via the DNS in an enterprise network. Our approach has an accuracy of 98% for both cross-validation and testing phases.

We developed, tuned, and trained a machine learning algorithm (isolation forest) to detect anomalous DNS queries, using a known dataset of benign domains only. Because the model learns what normal queries look like rather than the signatures of known attacks, it can also help find attacks that have not already been detected. We then implemented our approach on live 10 Gbps traffic streams from the borders of two test organizations: UNSW’s campus and CSIRO/Data61.

Our solution is based on stateless attributes of fully qualified domain names (FQDNs). We call these attributes stateless because they can be computed in real time for each query without any prior knowledge. The stateless attributes are as follows:

  • Total count of characters in FQDN
  • Count of characters in the sub-domain
  • Count of uppercase characters
  • Count of numerical characters
  • Entropy
  • Number of labels
  • Maximum label length
  • Average label length
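The attributes listed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from our paper: the function names, the choice to compute entropy over the whole name (dots included), and the naive rule that the last two labels form the registered domain are all assumptions made for the example. A real deployment would more likely consult a public suffix list to find the sub-domain.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def fqdn_features(fqdn: str, registered_labels: int = 2) -> dict:
    """Compute the eight stateless attributes for one query name.

    registered_labels is a hypothetical parameter: it assumes the last N
    labels (2 by default, e.g. "example.com") are the registered domain and
    everything to the left of them is the sub-domain.
    """
    name = fqdn.rstrip(".")  # ignore the trailing root dot, if present
    labels = name.split(".") if name else []
    if len(labels) > registered_labels:
        subdomain = ".".join(labels[:-registered_labels])
    else:
        subdomain = ""
    label_lengths = [len(label) for label in labels]
    return {
        "fqdn_length": len(name),
        "subdomain_length": len(subdomain),
        "uppercase_count": sum(c.isupper() for c in name),
        "digit_count": sum(c.isdigit() for c in name),
        "entropy": shannon_entropy(name),
        "label_count": len(labels),
        "max_label_length": max(label_lengths, default=0),
        "avg_label_length": sum(label_lengths) / len(labels) if labels else 0.0,
    }
```

Calling fqdn_features("www.Example1.com") returns one dictionary per query, and nothing is remembered between calls, which is what makes the attributes stateless.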

Based on these attributes, we developed a machine learning technique to determine whether a DNS query from an enterprise host is normal or anomalous (anomaly detection). We trained our anomaly detection model with benign data from four days of our dataset and kept three days of data for testing.
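To give a feel for benign-only training, here is a small sketch that uses scikit-learn's IsolationForest. It is an illustration under several assumptions rather than our tuned model: the training names are made up, the feature vector is a reduced version of the attribute list, and the n_estimators and contamination values are placeholders, not the settings from the paper.

```python
import math
from collections import Counter

from sklearn.ensemble import IsolationForest


def simple_features(fqdn: str) -> list:
    """Reduced feature vector: length, digits, entropy, labels, longest label."""
    name = fqdn.rstrip(".")
    labels = name.split(".")
    counts = Counter(name)
    entropy = -sum(
        (n / len(name)) * math.log2(n / len(name)) for n in counts.values()
    )
    return [
        len(name),
        sum(c.isdigit() for c in name),
        entropy,
        len(labels),
        max(len(label) for label in labels),
    ]


# Made-up benign names standing in for a real benign training set.
hosts = ["www", "mail", "api", "cdn", "login", "static", "img", "docs"]
zones = ["example.com", "example.org", "example.net"]
benign_names = [f"{h}.{z}" for h in hosts for z in zones]

# Placeholder hyperparameters (assumptions, not the paper's tuned values).
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
model.fit([simple_features(n) for n in benign_names])

# One ordinary-looking name and one with a long hex-encoded sub-domain.
queries = [
    "www.example.com",
    "4d5a90000300000004000000ffff0000b8000000.3f2a.example.com",
]
query_vectors = [simple_features(q) for q in queries]
predictions = model.predict(query_vectors)        # +1 = normal, -1 = anomalous
scores = model.decision_function(query_vectors)   # lower = more anomalous

for q, p, s in zip(queries, predictions, scores):
    print(f"{q}: {'anomalous' if p == -1 else 'normal'} (score {s:.3f})")
```

The model only ever sees benign names during fit, so it needs no labelled attack traffic. A toy training set this small will not reproduce the accuracy we report.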

Validating benign and malicious instances

The research community has largely drawn ground-truth benign domains from highly ranked popular domains. We used the top 10,000 domains from the Majestic Million list. Majestic ranks websites by the number of subnets linking to that site.

For ground-truth malicious instances, we generated more than 1.4 million DNS queries using an open source tool called Data Exfiltration Toolkit (DET). These DNS queries are publicly available.

For more details, please refer to our paper ‘Real-Time Detection of DNS Exfiltration and Tunneling from Enterprise Networks’, which we presented at the 16th IFIP/IEEE Symposium on Integrated Network and Service Management.

Performance evaluation

We evaluated the efficacy of our approach by:

  1. Cross validating and testing the accuracy of the trained model for benign instances.
  2. Testing the detection rate for malicious DNS queries that we generated using the DET tool.
  3. Quantifying the performance in real-time on a live 10 Gbps traffic stream from the two organizations.

Our approach had an accuracy of 98% in both the cross-validation and testing phases. To address the 2% of false alarms, we populated a whitelist with highly trusted domains (for example, mcafee.com and sophosxl.com) plus the top 100 domains on the Majestic list.
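One simple way to apply such a whitelist is to suppress an alarm whenever the query name equals, or is a sub-domain of, a trusted domain. The sketch below is an assumption about how that check might look; the function names are hypothetical and the two entries are just the examples mentioned above.

```python
def is_whitelisted(fqdn: str, whitelist: set) -> bool:
    """True if fqdn equals, or is a sub-domain of, any whitelisted domain."""
    labels = fqdn.rstrip(".").lower().split(".")
    # Check every suffix made of whole labels, e.g. for "a.b.mcafee.com":
    # "a.b.mcafee.com", "b.mcafee.com", "mcafee.com", "com".
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))


def should_alert(fqdn: str, flagged_anomalous: bool, whitelist: set) -> bool:
    """Raise an alarm only for anomalous queries outside the whitelist."""
    return flagged_anomalous and not is_whitelisted(fqdn, whitelist)


# Example entries taken from the text; a real list would also hold the
# top 100 Majestic domains.
trusted = {"mcafee.com", "sophosxl.com"}
```

Matching on whole labels matters here: a plain string suffix test would wrongly trust a name such as notmcafee.com.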

This approach has been incorporated into project Nozzle, an enterprise network security solution we are working on with CSIRO/Data61. Nozzle combines software-defined networking and network function virtualization in a novel way to decouple ‘data collection’ from ‘data analytics’. This enables the use of machine learning and AI methods for cybersecurity analytics in software while maintaining high-speed data forwarding in programmable hardware (OpenFlow and P4).

Jawad Ahmed is a Ph.D. student at the University of New South Wales (UNSW) Sydney.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.
