In recent years, much has been made of the advances of machine learning (ML) and artificial intelligence (AI) across many areas of technology. This is certainly true for cybersecurity, where ML-powered security solutions are on the rise. However, there are inherent risks, both at a high level and specifically when ML is applied to cybersecurity, that need to be kept in mind.
Arvind Narayanan has published an excellent presentation on recognizing AI snake oil that summarizes the technology as being: good at “perception” tasks; improving at automating judgement (where cybersecurity ML fits in); and fundamentally dubious at predicting outcomes. While the harms of “bad” ML/AI can be quite evident and broadly felt in other disciplines (including raising human rights concerns), in cybersecurity the harms are largely limited to the potential interruption of legitimate use.
That being said, one of the key differences between cybersecurity ML and other forms of ML (image classification, NLP, and so forth) is that cybersecurity starts with data that is under the complete control of malicious actors who have decades of experience in manipulating our information systems. Anti-analysis, anti-debugging, obfuscation, and other forms of deception are well established in their processes. And for obvious privacy reasons, it is difficult to get “known-good” user behaviour, because no one wants all their data collected and analysed.
Beyond these initial concerns, the question is “how can cybersecurity ML be done safely?”
There are two primary things that need to be handled safely and correctly: determining the classes and choosing the features.
Analogies make determining classes difficult
Cybersecurity has a seemingly simple classification system: is something bad or is it good? The difficulty is that we use rough analogies for each class.
For example, a search on scholar.google.com for domain name whitelisting shows that almost all the conversation focuses on using various forms of “popularity” (for example, Alexa, Umbrella, Majestic, or Tranco) as a whitelist. This is a close analogy, but it is not 100% accurate. Just because something is popular does not make it safe (for instance, various dynamic DNS domains are highly ranked), and just because something isn’t in the most popular 100,000 domains does not make it malicious.
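To make the pitfall concrete, here is a minimal sketch of the common pattern, assuming a locally downloaded Tranco-style top list in `rank,domain` CSV format; the file name and the 100,000 cutoff are illustrative choices, not a standard:

```python
import csv

# Load a locally downloaded Tranco-style top list (CSV rows of "rank,domain").
# "top-1m.csv" and the 100,000 cutoff are illustrative, not prescribed.
def load_top_domains(path, cutoff=100_000):
    top = set()
    with open(path, newline="") as f:
        for rank, domain in csv.reader(f):
            if int(rank) > cutoff:  # list is sorted by rank
                break
            top.add(domain.lower())
    return top

top_domains = load_top_domains("top-1m.csv")

def naive_label(domain):
    # The "sleight of hand": popular is treated as benign, unpopular as
    # suspicious. Dynamic DNS domains can rank highly, and plenty of
    # legitimate domains sit far outside the top 100,000.
    return "benign" if domain.lower() in top_domains else "suspicious"
```

Everything a model learns downstream inherits this substitution: it is learning “popular versus unpopular”, not “benign versus malicious”.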
A more illustrative example is domain generation algorithms (DGAs), which are popular in malware. The problem is that a DGA is a tool: its use is determined by humans making decisions, of which using a DGA or choosing a domain name is only one of many. Ad tracking and telemetry tools use DGAs as well, and for similar reasons to malware authors. So, choosing a class of “is a DGA” includes both good and bad data.
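A minimal character-entropy heuristic shows the conflation; the threshold and the example domains below are illustrative, not taken from any real feed:

```python
import math
from collections import Counter

def shannon_entropy(label):
    # Entropy of the character distribution in a domain label;
    # algorithmically generated labels tend to score high.
    counts = Counter(label)
    total = len(label)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def looks_generated(domain, threshold=3.5):
    # A crude "is a DGA" test on the leftmost label.
    # The 3.5 threshold is illustrative, not tuned.
    return shannon_entropy(domain.split(".")[0]) >= threshold

# Both of these hypothetical domains trip the same detector: one styled
# like a malware C2, one styled like an ad-telemetry endpoint.
for d in ["xj4k9qpzvw2m.example.com", "t7g3hq8zr1c4.tracker.example"]:
    print(d, looks_generated(d))
```

Both domains score as “generated”, yet only one of the two hypothetical operators is a criminal; the class label cannot tell them apart.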
Picking the classes falls to the security professionals who are deploying these tools; data scientists do not inherently have the expertise to make these determinations. It is important not to allow “sleight of hand” to substitute some “analogy” class (popular, is a DGA) for the desired class (malicious, benign). Many cybersecurity research efforts, commercial tools, and in-house developed technologies do exactly this, often without thinking about it.
Determining features requires expertise
The second consideration is which features to use in an ML system. Features are the various inputs the system examines for correlations and statistical relationships. Feature selection, however, is entirely up to the security professionals developing those tools and setting the requirements of a system. It requires domain expertise to know what matters and what does not.
There are two ways to determine those features. The first, where any automated security tool will start, is to automate the work of a security analyst and do it quickly and at scale. A security analyst examining network traffic will look at IP reputation, whois details of the domain, available DNS records, metadata about the network traffic, and so on. Many of these checks can be automated and “scored” as inputs to an ML system; often analysts are already using APIs that produce machine-parsable data.
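As a sketch of what that automation can look like, the following assumes the dnspython package is installed; the commented-out reputation feature stands in for whatever feed or API an analyst would normally consult:

```python
import dns.resolver  # assumes the dnspython package is installed

def record_count(domain, rtype):
    # Number of records of the given type the domain answers with (0 on failure).
    try:
        return len(list(dns.resolver.resolve(domain, rtype)))
    except Exception:
        return 0

def extract_features(domain):
    # Each feature mirrors a question an analyst would ask by hand.
    return {
        "a_records": record_count(domain, "A"),
        "mx_records": record_count(domain, "MX"),  # is real mail configured?
        "ns_records": record_count(domain, "NS"),
        "label_length": len(domain.split(".")[0]),
        # "ip_reputation": would call whatever reputation feed or API the
        # analyst normally uses; left out here as a placeholder.
    }

print(extract_features("example.com"))
```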
The other, often overlooked, way to determine features is to analyse the items that represent decision points of the adversary. For instance, a domain name is a tool: it can be good or bad. What makes it bad is the intention of the person behind it and the decisions they make (or at least as many of them as you have access to) as they go about their activity.
In this example, a domain has to have various DNS records; the operator chooses a registrar and points the domain to certain IPs that are provided by someone else. Many of those items have text fields that can be parsed (domain whois, netblock whois, SSL/TLS certificate details if available). Some features may not be available in the logs or traffic itself and must be acquired or otherwise extracted.
In the case of SSL/TLS certificates, these contain a variety of metadata about the requestor. More than a few malware campaigns can be tracked simply by how they generate SSL/TLS certificates, or because their signing certificates stay the same over years of activity.
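Pulling that certificate metadata is straightforward with Python’s standard library; a minimal sketch (the host is illustrative, and verification failures, such as self-signed certificates, will raise an error here):

```python
import socket
import ssl

def cert_metadata(host, port=443):
    # Fetch the leaf certificate during a TLS handshake and return the
    # fields an analyst would eyeball: subject, issuer, validity window.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return {
        "subject": cert.get("subject"),
        "issuer": cert.get("issuer"),
        "not_before": cert.get("notBefore"),
        "not_after": cert.get("notAfter"),
    }

print(cert_metadata("example.com"))
```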
To go back to a previous example, someone choosing a DGA domain clearly intends for the domain not to be used by humans (no one would remember it) and likely expects it to be blocked at some point, but that applies to both criminals and ad trackers. However, ad trackers operate on legitimate infrastructure to receive the telemetry, likely have appropriate domain records, and, if one looks hard enough, have a fully incorporated company behind them. Analysing more than one decision (not just the choice of domain) makes it easier to clearly infer the intent of the actor and to safely automate judgement about them.
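Putting it together, here is a hypothetical scorer over several of the actor’s decisions; every feature name, weight, and threshold is illustrative rather than tuned:

```python
def judge(features):
    # Each entry scores one decision the actor made; no single decision
    # is damning, but together they separate malware from ad trackers.
    signals = [
        (features["dga_like_name"],        1),  # both populations do this
        (features["no_mx_records"],        1),  # no legitimate mail setup
        (features["registered_days"] < 30, 2),  # freshly registered domain
        (features["self_signed_cert"],     2),  # no real company behind it
        (not features["whois_has_org"],    1),  # anonymous registration
    ]
    score = sum(weight for hit, weight in signals if hit)
    if score >= 4:
        return "likely malicious"
    if score >= 2:
        return "needs review"
    return "likely benign"

# A hypothetical ad tracker: generated name, but real mail, an old
# registration, a proper certificate, and an incorporated org in whois.
tracker = {"dga_like_name": True, "no_mx_records": False,
           "registered_days": 900, "self_signed_cert": False,
           "whois_has_org": True}
print(judge(tracker))  # -> likely benign

# A hypothetical malware C2: the same name style, but every other
# decision points the other way.
c2 = {"dga_like_name": True, "no_mx_records": True,
      "registered_days": 5, "self_signed_cert": True,
      "whois_has_org": False}
print(judge(c2))  # -> likely malicious
```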
By choosing classes carefully and selecting not only the features an analyst would use, but also ones that help analyse each decision an attacker makes, it becomes possible to more safely automate judgement in cybersecurity systems.
John Bambenek is President of Bambenek Labs, a cybersecurity threat intelligence firm that uses machine-learning to find and track malicious activity on the Internet.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.