There is a proverb I like to refer to when talking about network security:
“It is better and more useful to meet a problem in time than to seek a remedy after the damage is done.”
These words rang loudly in my ears when I recently read how the 2017 Equifax breach, which exposed nearly 150 million people’s personal credit records, has since cost the credit reporting company nearly USD 1.4 billion to clean up and to overhaul its information security program. This figure excludes the legal fees for ongoing lawsuits and the reputational damage the breach has inflicted.
It has already been dubbed the greatest security catastrophe of modern times, and all because their network security system was slow to detect anomalous behaviour in their network traffic.
In this post, I want to share my organization’s approach to anomaly detection, specifically for DNS traffic, which has become a popular target in recent years, and how machine learning (ML) can help overcome some of the shortcomings of manual detection processes.
Keeping pace with increased demand
The organization that I work for, Link3 Technologies Limited, serves more than 820 million queries a day, from five caching DNS servers, all running IP anycast. This equates to more than 420GB of data a day, and nearly 150TB over the course of a year.
When I first started at the company, we were using fairly conventional methods to detect anomalies (we were only working with 10 to 20GB of data daily at the time), some of which we continue to use today. These include:
- Time series analysis of DNS statistics and NetFlow: we use NfSen, configured to show us protocol- and port-based top talkers, statistics on standard protocols, and threshold-based notifications for any spike or change relative to the previous time series.
- Passive checks of DNS queries: we use the Farsight DNSDB database to investigate domain issues.
- Real-time monitoring of hardware resources and DNS services: this includes checking network use as well as RAM, CPU, hard disk input/output operations, system processes, and the TCP/UDP sockets that are serving DNS queries.
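To make the idea of a threshold-based notification on a time series more concrete, here is a minimal Python sketch. It is not how NfSen works internally and it is not our production configuration; the function name `find_spikes`, the window size, the multiplier `k`, and the sample numbers are all assumptions chosen for illustration. It flags any sample that rises well above the average of the samples just before it.

```python
from statistics import mean, pstdev


def find_spikes(counts, window=6, k=3.0):
    """Return the indices of samples that exceed the recent baseline.

    A sample is flagged when it is greater than
    mean + k * deviation of the previous `window` samples.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        # Use at least 5% of the mean as the deviation, so that a very
        # flat baseline does not turn small wobbles into alerts.
        limit = mu + k * max(sigma, 0.05 * mu)
        if counts[i] > limit:
            spikes.append(i)
    return spikes


# Hypothetical queries-per-minute samples, invented for this example.
qpm = [1000, 1020, 990, 1010, 1005, 995, 1000, 2500, 1015]
print(find_spikes(qpm))  # [7]
```

A static multiplier like this is easy to reason about, but it still needs a person to pick sensible values for each server, which is part of the manual effort discussed later in this post.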
To give one example of how we have set up these anomaly detection systems, we have configured a static threshold for every zone of our anycast DNS system. If the failure ratio of any server crosses 10%, the system notifies us and we investigate the cause.
While many of these options remain viable and are part of many organizations’ best current practices, they have their weaknesses, most of which relate to the human element at the centre of them. The more human interaction, the greater the chance of human error, not to mention the more work hours required (not just from those working on the systems but from all the other departments that support these teams). There is also a certain lag that comes from human command.
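The static threshold check can be sketched in a few lines of Python. This is an illustration only: the server names, the counters, and the definition of a "failed" query (for example, a response that was not answered successfully) are assumptions, and only the 10% threshold comes from our setup.

```python
FAILURE_RATIO_THRESHOLD = 0.10  # the 10% static threshold described above


def failure_ratio(failed, total):
    """Fraction of queries that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def servers_over_threshold(stats, threshold=FAILURE_RATIO_THRESHOLD):
    """Return the sorted names of servers whose failure ratio crosses the threshold.

    `stats` maps a server name to a (failed_queries, total_queries) tuple.
    """
    return sorted(
        name
        for name, (failed, total) in stats.items()
        if failure_ratio(failed, total) > threshold
    )


# Hypothetical per-server counters for one measurement interval.
sample = {
    "dns1": (50, 1000),   # 5% failures
    "dns2": (150, 1000),  # 15% failures
    "dns3": (0, 0),       # no traffic in this interval
}
print(servers_over_threshold(sample))  # ['dns2']
```

The check itself is trivial; the work lies in what happens after the notification, when someone has to find out why the ratio went up.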
All these factors played a large part in our decision to incorporate ML into our detection mix. So far, we have implemented it to monitor for:
- Open DNS resolvers
- Malware domain queries
- Cache poisoning
We have developed a custom algorithm for these separate requirements and have also used the k-nearest neighbours (k-NN) algorithm.
Phase-by-phase testing is ongoing on bare-metal server hardware running the Debian operating system, with Python as the coding base. Our final target is to move to a cloud and container system so that we can scale hardware resources according to our needs.
The biggest lesson learnt thus far is that implementing ML is not easy. However, the performance benefits achieved so far have been well worth the effort.
Our development and testing are ongoing, and we hope to publish more of our lessons once we finish the project. Until then, feel free to ask any questions in the comment section below or share your experience with anomaly detection.
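For readers unfamiliar with k-NN, the following is a generic, minimal sketch of the algorithm in plain Python. It is not our custom algorithm and it does not reflect the features we use: the two features (the share of NXDOMAIN responses and the share of previously unseen domains per client), the labels, and every data point are made up for illustration. The classifier labels a new observation with the majority label among its k closest training examples.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(training, point, k=3):
    """Classify `point` by majority vote among its k nearest neighbours.

    `training` is a list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training data: (nxdomain_ratio, unseen_domain_ratio) -> label.
# Both features are hypothetical and already scaled to the range 0..1.
TRAINING = [
    ((0.02, 0.10), "normal"),
    ((0.05, 0.15), "normal"),
    ((0.03, 0.12), "normal"),
    ((0.60, 0.90), "suspicious"),
    ((0.70, 0.85), "suspicious"),
    ((0.55, 0.95), "suspicious"),
]

print(knn_predict(TRAINING, (0.65, 0.88)))  # suspicious
print(knn_predict(TRAINING, (0.04, 0.11)))  # normal
```

Because k-NN relies on distances, features should be on comparable scales; otherwise the feature with the largest numeric range dominates the vote.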
A. S. M. Shamim Reza is Deputy Manager of the NOC team at Link3 Technologies Limited.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.