Analysing network data to detect anomalies, such as intrusions, is an effective and proven technique for threat detection. However, it is a resource-intensive task, and it becomes more so as demand for cybersecurity services grows.
Amid the current enthusiasm for Artificial Intelligence (AI), the networking and network security communities have taken an interest in using Deep Learning (DL) to improve efficiency and to reduce the risk of mistakes when detecting anomalies in network data.
To test the feasibility of this approach, DCSO Labs recently conducted a research project with two aims: building a practical system for machine-learning-based anomaly detection on network flow data, and evaluating data processing architectures suitable for large-scale collection and analysis of flow data (regardless of the specific method used to analyse it).
In this post, we briefly summarize the results and key takeaways of this project and our own work at KIProtect on data security for network analytics.
Collecting and preparing data for machine learning models
The flow data for the project was collected from several network sensors distributed over a real-world, large-scale client network comprising hundreds of thousands of endpoints. The flows were extracted using the Suricata IDS software.
Note: Network flows contain IP addresses, which are considered personal data under the General Data Protection Regulation (GDPR). See below for how we handled this.
DCSO developed a Golang-based flow extraction tool (Fever), which receives network flows in EVE-JSON format, encodes them into a binary format and sends them in batches to a processing backend. This method proved suitable even for large data volumes of 10 Gb/s per sensor.
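To make the encode-and-batch step concrete, here is a minimal sketch in Python rather than Golang. It is not Fever's actual code or wire format: the fixed-size record layout and the names `RECORD`, `encode_flow` and `encode_batch` are assumptions made for illustration. Only the EVE-JSON field names (`event_type`, `src_ip`, `dest_ip`, `src_port`, `dest_port` and the counters under `flow`) follow Suricata's output.

```python
import ipaddress
import json
import struct

# Hypothetical record layout (NOT Fever's real format): two 16-byte addresses,
# two 16-bit ports and four 64-bit counters, all in network byte order.
RECORD = struct.Struct("!16s16sHHQQQQ")


def _addr_bytes(text):
    """Return a 16-byte address; IPv4 is stored as an IPv4-mapped IPv6 address."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        addr = ipaddress.IPv6Address("::ffff:" + str(addr))
    return addr.packed


def encode_flow(event):
    """Pack one parsed EVE-JSON flow event into a fixed-size binary record."""
    flow = event["flow"]
    return RECORD.pack(
        _addr_bytes(event["src_ip"]),
        _addr_bytes(event["dest_ip"]),
        event.get("src_port", 0),   # ports are absent for some protocols
        event.get("dest_port", 0),
        flow["bytes_toserver"],
        flow["bytes_toclient"],
        flow["pkts_toserver"],
        flow["pkts_toclient"],
    )


def encode_batch(lines, batch_size=1000):
    """Yield byte strings, each holding up to batch_size encoded flow records."""
    batch = []
    for line in lines:
        event = json.loads(line)
        if event.get("event_type") != "flow":
            continue  # skip alerts, DNS records and other EVE event types
        batch.append(encode_flow(event))
        if len(batch) == batch_size:
            yield b"".join(batch)
            batch = []
    if batch:
        yield b"".join(batch)
```

Each yielded batch could then be handed to whatever transport feeds the backend. A fixed-size record keeps decoding trivial, at the cost of dropping every field that is not packed.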
The pseudonymized flows arrive from the network sensors in chunks and are still unordered. To prepare them for machine learning, we need to group related flows together to form a time series. How this grouping happens depends on the task we are trying to solve; in general, grouping by the IP addresses of the communication partners and the server port (if applicable) is a common choice.
The grouping can be performed using stream processing, for which we again implemented a message-queue based, decentralized solution in Golang. The grouped flows can then be stored in a suitable database system.
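The grouping itself can be sketched in a few lines. The version below is a batch illustration in Python, not the message-queue-based Golang implementation. It assumes each flow is a dict with hypothetical keys `src_ip`, `dest_ip`, `dest_port` and a numeric `start` timestamp, and it treats the destination port as the server port.

```python
from collections import defaultdict


def flow_key(flow):
    """Grouping key: both endpoints plus the (assumed) server port."""
    return (flow["src_ip"], flow["dest_ip"], flow["dest_port"])


def group_flows(flows):
    """Return {key: [flows ordered by start time]} for an unordered iterable."""
    series = defaultdict(list)
    for flow in flows:
        series[flow_key(flow)].append(flow)
    for group in series.values():
        group.sort(key=lambda f: f["start"])  # restore time order per group
    return dict(series)
```

A streaming variant would keep the same key function but maintain per-key buffers and flush them once a time window has passed.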
For long-term data processing we chose Cassandra, as it is designed for large-scale storage of sequential, well-ordered data and is particularly suited to ‘write-once-read-often’ usage scenarios. To store flows for near-real-time analysis we chose Redis, as it provides distributed, low-latency storage, automated expiration of old data and support for stream processing.
To prepare the data for machine learning we performed several additional transformation steps. We replaced the timestamp of each flow with the time difference between that flow and the previous one in the same group. We then discretized this time difference and the other numerical features (bytes sent, bytes received, packets sent, packets received) by applying a logarithmic transformation and grouping the results into distinct bins. This encodes each feature as a one-hot vector, which is well suited for ingestion by a neural network.
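The transformation steps above can be sketched as follows. The number of bins (16), the logarithm base (2) and the field names are assumptions chosen for illustration; the project's actual binning parameters are not stated here.

```python
import math

NUM_BINS = 16  # assumed number of bins per feature

# Hypothetical field names for the four counters of a flow.
COUNTERS = ("bytes_sent", "bytes_received", "packets_sent", "packets_received")


def log_bin(value, num_bins=NUM_BINS):
    """Map a non-negative number to a bin index via log2(1 + value).

    Values too large for the available bins are clipped into the last bin.
    """
    index = int(math.log2(1.0 + value))
    return min(index, num_bins - 1)


def one_hot(index, size=NUM_BINS):
    """Return a list of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_series(flows):
    """Turn a time-ordered list of flows into one feature vector per flow.

    Each vector concatenates five one-hot blocks: the time difference to the
    previous flow, followed by the four counters.
    """
    rows = []
    previous_start = None
    for flow in flows:
        delta = 0.0 if previous_start is None else flow["start"] - previous_start
        previous_start = flow["start"]
        values = [delta] + [flow[name] for name in COUNTERS]
        row = []
        for value in values:
            row.extend(one_hot(log_bin(max(value, 0))))
        rows.append(row)
    return rows
```

The logarithm compresses the heavy-tailed range of byte and packet counts, so a handful of bins can cover values from a few bytes to many gigabytes.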
Training machine-learning models
With the ordered, feature-engineered data in place, we can finally start using it to train machine-learning models. For this purpose, we investigated several deep-learning models.
For the initial project stage, our goal was to produce a model that can predict the application type of a communication stream based on the flow data alone. We tested two different model architectures for this task based on recurrent neural networks (RNN) and convolutional neural networks (CNN), respectively.
To make a prediction, we feed the data into our model in small chunks, for example, 128 time steps at a time. The network is trained to predict the application class of a given flow sequence by presenting it with flow sequences from a number of different applications.
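Cutting a flow series into fixed-length chunks might look like the sketch below. The helper name `windows` and the stride of 64 are assumptions; a stride smaller than the window length produces overlapping chunks.

```python
def windows(sequence, length=128, stride=64):
    """Split a sequence into windows of `length` steps.

    Windows start every `stride` steps; a trailing remainder shorter than
    `length` is dropped, and a sequence shorter than `length` yields nothing.
    """
    last_start = len(sequence) - length
    return [sequence[start:start + length]
            for start in range(0, last_start + 1, stride)]
```

For a series of 300 encoded flows, this yields three windows starting at steps 0, 64 and 128.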
We measured the accuracy of the prediction both on the training data and on a test data set. Our initial model architecture reached an accuracy of 80% on short, 128-step flow sequences. This may not seem encouraging. Bear in mind, however, that we can run our model on many, potentially overlapping, sequences of flow data and then take the majority vote of all predictions. As long as the individual errors are not strongly correlated, this reduces the overall error as we feed more data into our system.
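To see how much a majority vote can help, consider the idealized case where every window is classified correctly with probability 0.8 and the errors are independent. Both are simplifying assumptions: overlapping windows share data, so real-world gains will be smaller. The function names below are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the most frequent label among the per-window predictions."""
    return Counter(predictions).most_common(1)[0][0]


def majority_accuracy(p, n):
    """Probability that more than half of n independent votes are correct.

    With more than two classes this is a lower bound on the accuracy of the
    vote, because the correct class can also win with fewer than half the votes.
    """
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))
```

Under these assumptions, `majority_accuracy(0.8, 3)` is about 0.90 and `majority_accuracy(0.8, 5)` is about 0.94, compared with 0.80 for a single window.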
Machine-learning network analytics is as good as the data that feeds it
While this was a very short research project, we were able to show that large-scale processing of network flow data with machine learning is feasible, and that even quite simple model architectures trained on labelled data can be accurate enough to be useful in practice.
A key takeaway from the project was that building high-quality machine learning models requires large, well-annotated training data sets. Generating and storing such data is not trivial and is hindered by privacy and security concerns. We intend to keep working on better methods for anonymization and pseudonymization of network flow data to make publishing data sets easier.
We also hope to open-source more parts of our data processing and machine-learning pipeline in the near future, so that other researchers and organizations can benefit from it as well.
No, we didn’t forget GDPR
In light of the GDPR, we needed to consider how collecting and storing flows on a large scale poses a risk to the privacy of the people whose data we collected.
To minimize this risk, we developed a novel, cryptography-based pseudonymization method for IP addresses. This allowed us to generate a unique, pseudorandom and reversible mapping of IPv4 and IPv6 values that preserves information about IP subnets, while making re-identification of the original addresses very difficult for an adversary. It can guarantee a unique, reversible mapping without relying on any central coordination beyond a shared cryptographic key. This is very desirable as it allows the pseudonymization to be carried out in a distributed system, making the method suitable for large-scale data processing.
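The following sketch is not the method developed in the project, whose mechanism is not described here. It only illustrates the properties listed above (keyed, reversible, subnet-preserving, no coordination beyond a shared key) using a well-known bit-by-bit construction in the style of Crypto-PAn, with HMAC-SHA256 swapped in as the pseudorandom function. All function names are hypothetical, and the code has not been reviewed for production use.

```python
import hashlib
import hmac
import ipaddress


def _flip_bit(key, version, prefix_bits):
    """Pseudorandom bit derived from the key and the ORIGINAL address prefix."""
    message = "{}:{}".format(version, prefix_bits).encode("ascii")
    return hmac.new(key, message, hashlib.sha256).digest()[0] & 1


def pseudonymize(key, address):
    """Map an IPv4 or IPv6 address to a pseudonym of the same family.

    Bit i is XORed with a keyed function of the i original bits before it, so
    two addresses that share a k-bit prefix get pseudonyms that share a k-bit
    prefix as well.
    """
    addr = ipaddress.ip_address(address)
    bits = format(int(addr), "0{}b".format(addr.max_prefixlen))
    out = []
    for i, bit in enumerate(bits):
        out.append(str(int(bit) ^ _flip_bit(key, addr.version, bits[:i])))
    return str(type(addr)(int("".join(out), 2)))


def depseudonymize(key, pseudonym):
    """Invert pseudonymize() by recovering the original bits left to right."""
    addr = ipaddress.ip_address(pseudonym)
    bits = format(int(addr), "0{}b".format(addr.max_prefixlen))
    original = ""
    for bit in bits:
        original += str(int(bit) ^ _flip_bit(key, addr.version, original))
    return str(type(addr)(int(original, 2)))
```

Because the mapping depends only on the key and the address, any sensor holding the same key produces the same pseudonym without talking to the others.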
To keep the method secure, it is necessary to rotate the cryptographic keys regularly, for example, once per day or even once per hour. This limits the ability of an adversary to collect statistical information about a given mapping, which could otherwise allow re-identification of individual addresses or subnets. The method is similar to Crypto-PAn but relies on a different mechanism for generating pseudonyms.
As with other components of the system, we hope to open-source the algorithm and a reference implementation early next year on the DCSO GitHub repository, so stay tuned.
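One possible way to rotate keys without distributing a new key every hour is to derive a per-period key from a long-lived shared secret and the current time bucket. This derivation is an assumption made for illustration, not something the project is known to use, and `period_key` is a hypothetical name.

```python
import hashlib
import hmac


def period_key(master_key, timestamp, period_seconds=3600):
    """Derive the key for the rotation period that contains `timestamp`.

    `timestamp` is in seconds since the Unix epoch. All sensors sharing
    `master_key` derive the same key for the same period.
    """
    period = int(timestamp) // period_seconds
    message = "period:{}".format(period).encode("ascii")
    return hmac.new(master_key, message, hashlib.sha256).digest()
```

Reversing a pseudonym later requires knowing which period it was created in, so the period index (or the flow timestamp) has to be stored alongside the pseudonymized data.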
Contributors: Katharine Jarmul and Andreas Lehner.
Andreas Dewes is Co-Founder of KIProtect. He is a trained physicist with a background in experimental quantum computing, working as a (data) scientist and tech entrepreneur.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC.