In the first blog post of this series, we discussed considerations for measuring and understanding the performance of machine learning models in information security. In the second post, we compared machine learning models at a fairly coarse level on a malware detection task, noting several considerations of performance for each model, including accuracy (as measured by AUC), training time, test time, model size, and more. In this post, we'll depart slightly from the more academic discussion of our previous blog posts and discuss real-world implementation considerations. Specifically, we’ll build upon a related white paper and address operationalizing a malware classifier on an endpoint in the context of a hunt paradigm.
A Word about the Hunt
First, let's establish some context around malware detection in a hunt framework. We define hunting as the proactive, methodical, and stealthy pursuit and elimination of never-before-seen adversaries in one's own environment. Threat hunters establish a process to locate and understand a sentient adversary prior to eviction, without prematurely alerting the adversary to the hunter's presence. A thorough understanding of the extent of an adversary's access is necessary for complete removal and future prevention of the adversary. Prematurely alerting an adversary that he's being pursued can prompt him to destroy evidence of his exploitation, accelerate his timeframe for causing damage and stealing data, or cause him to burrow deeper and more carefully into systems than he otherwise might. Therefore, the hunter uses stealthy tools to survey the enterprise, secures assets to prevent the adversary from moving laterally within networks, detects the adversary's TTPs, and surgically responds to adversary tactics without disruption to day-to-day operations, for example, by terminating a single injected thread used by the adversary.
In this context, malware detection is part of a multi-stage detection framework that focuses particularly on discovering tools used by the adversary (e.g., backdoors used for persistence and C2). Unlike passive detection frameworks, malware detection in the hunt framework must meet particularly rigid standards for stealth, including a low memory and CPU footprint, while still providing high detection rates with low false positive rates. A low memory and CPU footprint allows an agent to hide amongst normal threads and processes, making it difficult for an adversary to detect or attempt to disable monitoring and protective measures. For this purpose, we focus specifically on developing a robust, lightweight model that resides in memory on an endpoint to support hunting as one part of a larger detection framework. The task of this lightweight classifier is to determine the maliciousness of files that are:
- Newly created or recently modified;
- Started automatically at boot or other system or user event;
- Executed (pre-execution);
- Backing running processes;
- Deemed suspicious by other automated hunting mechanisms; or
- Specifically queried by the hunt team.
Lightweight Model Selection
Consistent with the results of the coarse model comparison in our second post, and following additional, more detailed experiments, gradient boosted decision trees (GBDTs) offer a compelling set of metrics for lightweight malware detection in this hunt paradigm, including:
- Small model size;
- Extremely fast query time; and
- Competitive performance as measured by AUC, allowing model thresholds that result in low false positive and false negative rates.
However, to improve the performance of GBDTs and tune them for real-world deployment, one must do much more than train a model using off-the-shelf code on an ideally-labeled dataset. We discuss several of these considerations below.
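For reference, the off-the-shelf starting point really is only a few lines of code. The sketch below, using scikit-learn's GBDT implementation on randomly generated placeholder features and labels (so the reported AUC is meaningless), illustrates the basic train-and-evaluate loop that the rest of this post builds upon; it is not the production model.

```python
# Minimal off-the-shelf GBDT baseline: a sketch of the starting point, not the product.
# X and y are random placeholders standing in for real file features and curated labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)               # hypothetical static file features
y = rng.randint(0, 2, 1000)          # 1 = malicious, 0 = benign (placeholder labels)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

scores = model.predict_proba(X_val)[:, 1]
print("validation AUC:", roc_auc_score(y_val, scores))
```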
From Data Collection to Model Deployment
The development of a lightweight malware classifier requires a large amount of diverse training data. As discussed in the first blog post of this series, these data come from a variety of both public and private sources. Two things complicate our initial description of data collection: (1) data must often be relabeled based on preliminary labels, and (2) most of these data are unlabeled, with no definitive malicious/benign or family indicator attached to the collected samples.
Data Labels: It Just Seemed Too Easy, Didn't It?
Even when data come with labels, they may not be the labels one needs for a malware classifier. For example, each label might consist of a family name (e.g., Virut), a malware category (e.g., trojan), or, in a less structured setting, a free-form description of functionality from an incident response report. From these initial labels, one must produce a benign or malicious tag (a simple label-mapping sketch follows these questions). When curating labels, consider the following questions:
- While a backdoor is malicious, what about legitimate network clients, servers and remote administration tools that may share similar capabilities?
- Should advanced system reporting and manipulation tools, like Windows Sysinternals, that might be used for malicious purposes be considered malicious or benign?
- Are nuisance families like adware included in the training dataset?
These questions speak to the larger issue of how to properly define and label "grayware" categories of binaries. Analyzing these grayware categories and building consensus on how to treat them is paramount for constructing an effective training dataset.
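To make the label-curation step concrete, here is a deliberately simplified sketch of mapping raw family or category labels onto benign, malicious, and grayware buckets. The family lists and rules are hypothetical; in practice this mapping is far richer and encodes the analyst consensus on the grayware questions above.

```python
# Hypothetical mapping from raw vendor/analyst labels to training tags.
# Real pipelines require richer rules and analyst consensus, especially for grayware.

GRAYWARE_HINTS = {"adware", "riskware", "pup"}            # nuisance categories
DUAL_USE_HINTS = {"psexec", "sysinternals", "vnc"}        # legitimate admin tooling

def to_training_label(raw_label: str) -> str:
    """Map a free-form label to 'benign', 'malicious', or 'grayware'."""
    label = raw_label.strip().lower()
    if label in ("clean", "benign", "whitelisted"):
        return "benign"
    if any(hint in label for hint in GRAYWARE_HINTS | DUAL_USE_HINTS):
        return "grayware"      # handled by an explicit policy, not assumed malicious
    return "malicious"         # e.g. 'virut', 'trojan.generic', backdoor reports

print(to_training_label("Trojan.Virut"))    # -> malicious
print(to_training_label("Adware.Bundler"))  # -> grayware
```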
Unlabeled Data: The Iceberg Beneath the Surface
Unlabeled data may be under-appreciated in the security data science field, but they are often the most interesting data. For example, among the unlabeled data may be a small fraction of bleeding-edge malware strains that have yet to be detected in the industry. Although it is not always straightforward to do so, these unlabeled samples can still be leveraged using so-called semi-supervised machine learning methods. Semi-supervised learning can be thought of as a generalization of supervised learning, in which both labeled and unlabeled samples are available for training models. Most models, like many considered in our second post, do not natively support the use of unlabeled samples, but with care, they can be modified to take them into account. We explain two such methods here.
First, semi-supervised methods exist that work in cooperation with a human analyst to judiciously select "important" samples for hand-labeling, after which traditional supervised learning models can be used with the augmented labeled dataset. This so-called active learning framework is designed to reduce the burden on a human analyst while enhancing the model's performance. Instead of inspecting and hand-labeling all of the unlabeled samples, a machine learning model guides the human to a small fraction of samples that would be of the greatest benefit to the classifier. For example, samples may be selected to maximize the number of labels that can be inferred: by asking the human to label a single representative sample, the labels for an entire tight cluster of similar samples can be inferred. There are many similar, sometimes competing and sometimes complementary, objectives, outlined below (a minimal uncertainty-sampling sketch appears after this list):
- Which unlabeled samples is the model most uncertain about?
- If labeled correctly, which sample will maximize information gain in my model?
- Which unlabeled samples could represent new attack trends?
Malware samples selected by active learning can address one or more of these objectives while respecting the labeling bandwidth of human analysts.
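The first question above has the simplest implementation: uncertainty sampling, in which the current model scores the unlabeled pool and the samples closest to the decision boundary are surfaced for analyst review. The sketch below uses placeholder data and a hypothetical batch size; production active learning typically combines uncertainty with diversity- and cluster-based selection.

```python
# Uncertainty sampling: pick the unlabeled samples the current model is least sure about.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def select_for_labeling(model, X_unlabeled, batch_size=25):
    """Return indices of the unlabeled samples closest to the decision boundary."""
    probs = model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = np.abs(probs - 0.5)            # 0 means maximally uncertain
    return np.argsort(uncertainty)[:batch_size]

# Placeholder data: a small labeled seed set and a large unlabeled pool.
rng = np.random.RandomState(0)
X_labeled, y_labeled = rng.rand(200, 50), rng.randint(0, 2, 200)
X_pool = rng.rand(5000, 50)

model = GradientBoostingClassifier().fit(X_labeled, y_labeled)
to_review = select_for_labeling(model, X_pool)
# 'to_review' is routed to a human analyst; the new labels are appended to the
# training set and the model is retrained, closing the active-learning loop.
```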
A second category of semi-supervised methods leverages unlabeled data without human intervention. One approach in this category involves hardening decision trees (and, by extension, GBDTs), which are known to be overly sensitive in regions of the feature space where there are few labeled samples. The objective is to produce a GBDT model that is regularized towards uncertainty (producing a score closer to 0.5 than to 0.0 or 1.0) in regions of the feature space where there are many unlabeled samples but few or no labeled samples. Especially in the hunt paradigm, a model should have a very low false positive rate, and this locally feature-dependent regularization can save a hunter from the alert fatigue that would otherwise erode the utility of the alerting system.
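The hardening approach itself modifies the tree-learning objective and is beyond a short example, but the intuition can be roughly approximated with standard tooling: treat unlabeled samples as low-weight targets of 0.5 so that scores drift toward uncertainty where labeled data are scarce. The sketch below is that crude approximation on placeholder data, not the referenced algorithm.

```python
# Crude approximation of regularizing a GBDT toward 0.5 in unlabeled regions.
# This illustrates the intuition only; it is not the hardening algorithm itself.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X_lab = rng.rand(500, 20)
y_lab = rng.randint(0, 2, 500).astype(float)     # placeholder labeled data
X_unlab = rng.rand(2000, 20)                     # placeholder unlabeled data

# Unlabeled samples get a soft target of 0.5 and a small sample weight.
X = np.vstack([X_lab, X_unlab])
y = np.concatenate([y_lab, np.full(len(X_unlab), 0.5)])
w = np.concatenate([np.ones(len(X_lab)), np.full(len(X_unlab), 0.1)])

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X, y, sample_weight=w)

# Scores in regions with little labeled support are pulled toward 0.5, which in a
# hunt setting trades some recall for a lower false positive rate.
scores = np.clip(model.predict(X_unlab), 0.0, 1.0)
```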
Other semi-supervised methods that do not require human intervention include label spreading and label propagation, which assume that neighboring samples in feature space should share the same label, and self-training, in which the model predicts labels for unlabeled samples and the most confident predictions are added to the labeled training set for retraining.
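Both techniques are available off the shelf; the sketch below runs scikit-learn's implementations on placeholder data, with unlabeled samples marked by -1 as that library expects.

```python
# Label spreading and self-training with scikit-learn; unlabeled samples use label -1.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = rng.randint(0, 2, 1000)
y_partial = y.copy()
y_partial[rng.rand(1000) < 0.9] = -1     # pretend 90% of the samples are unlabeled

# Label spreading: labels diffuse to nearby samples in feature space.
spreader = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)

# Self-training: the base classifier labels the samples it is most confident about,
# adds them to its training set, and retrains until no confident candidates remain.
self_trained = SelfTrainingClassifier(
    GradientBoostingClassifier(), threshold=0.9
).fit(X, y_partial)
```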
Automation and Deployment
For an enterprise-relevant data science solution, a wholly automated process is required for acquiring samples (malicious, benign, and unlabeled), generating labels for those samples, extracting features, partitioning the features into training and validation sets (for feature and model selection), and updating or re-training a model with new data. This may seem like a mundane point, but data lifecycle management and model versioning don't yet enjoy the standard processes and maturity that are now common in software version management. For example, consider four independent elements of a data science solution that could change the functionality and performance of an endpoint model: 1) the dataset used to train the model; 2) the feature definitions and code used to describe the data; 3) the model trained on those features; and 4) the scaffolding that integrates the data science model with the rest of the system. How does one track versioning when new samples are added to the dataset or labels are changed? When new descriptive features are added? When a model is retrained? When encapsulating middleware is updated? Introducing these engineering processes into a machine learning solution narrows the chasm between an interesting one-off prototype and a bona fide production machine learning malware detection system. Once a model is trained and its performance on the holdout validation set is well characterized, the model is automatically pushed to customers.
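One lightweight way to keep those four elements traceable is to pin them in a release manifest that travels with every deployed model. The sketch below is hypothetical (field names and values are illustrative), but it captures the idea that any change in data, features, model, or scaffolding should produce a new, identifiable release.

```python
# Hypothetical model-release manifest: pinning the four elements that can change a
# deployed model's behavior makes any regression traceable to a specific change.
import json
from dataclasses import dataclass, asdict

@dataclass
class ModelManifest:
    dataset_version: str     # snapshot id or content hash of the training data
    feature_version: str     # version of the feature definitions/extraction code
    model_version: str       # identifier of the trained artifact
    scaffold_version: str    # version of the middleware that wraps the model
    validation_auc: float    # headline holdout metric recorded at release time

manifest = ModelManifest(             # illustrative values only
    dataset_version="2017-03-snapshot",
    feature_version="features-v4",
    model_version="gbdt-v12",
    scaffold_version="agent-1.8.2",
    validation_auc=0.99,
)
print(json.dumps(asdict(manifest), indent=2))
```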
But the job doesn't stop there. As a form of quality assurance for data science, performance metrics of the deployed model are continuously gathered and checked against pre-deployment metrics. In particular, the following questions must be answered in order to monitor the health of the deployed model:
- Is there an unusual spike in the number of detections or have the detections gone quiet?
- Are there categories or families the model is no longer correctly classifying?
- For a sampling of files submitted to the model, can we discover the true labels and compare them against the model's predictions?
The answers to these questions are particularly important in information security, since malware samples are generated by a dynamic adversary. In effect, the thing we're trying to detect is a moving target: the malware (and benign!) samples we want to predict continue to evolve away from the samples we trained on. Whether one acknowledges and addresses this drift head on is another factor that separates naive from sophisticated offerings. Clever use of unlabeled data and strategies that proactively probe machine learning models against possible adversarial drift can be the difference between rapidly discovering a new campaign against your enterprise and being "pwned".
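A minimal, purely illustrative version of that post-deployment check might compare the live detection rate against the rate observed on the pre-deployment holdout set and alert when detections spike or go quiet; the thresholds below are placeholders, and real monitoring would also track per-family performance and sampled ground truth.

```python
# Toy drift check: compare the deployed model's detection rate against the
# pre-deployment baseline and flag large deviations in either direction.

def detection_rate(scores, threshold=0.5):
    flagged = sum(1 for s in scores if s >= threshold)
    return flagged / max(len(scores), 1)

def drift_alert(live_scores, baseline_rate, tolerance=3.0):
    """Alert if detections spike or go quiet relative to the baseline rate."""
    live_rate = detection_rate(live_scores)
    if baseline_rate == 0:
        return live_rate > 0
    ratio = live_rate / baseline_rate
    return ratio > tolerance or ratio < 1.0 / tolerance

# baseline_rate would come from the holdout set characterized before deployment.
print(drift_alert(live_scores=[0.1, 0.2, 0.95, 0.05], baseline_rate=0.02))
```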
Endgame MalwareScore™
Optimizing a malware classification solution for the hunt use case produces a lightweight endpoint model trained on millions of benign, malicious, and unlabeled samples. The Endgame model allows for a stealthy presence on the endpoint by maintaining a minuscule memory footprint without requiring external connectivity. Paired with a sub-100 millisecond query time, the model represents the ideal blend of speed and sophistication necessary for successful hunt operations. The endpoint model produces the Endgame MalwareScore™: scores approaching 100 indicate that the file in question should be considered malicious. The analyst can also easily tune the malicious-score threshold to better suit the needs of the current hunt operation, for example reducing the threshold during an active incident to surface more suspicious files to the hunter. The Endgame MalwareScore™ is an integrated element of detecting an adversary during hunt operations: all executable file-backed items are enriched with the score, highlighting potentially bad artifacts like persistence mechanisms and processes to guide the hunter as effectively as possible.
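The MalwareScore™ internals aren't described here, but the general pattern of mapping a classifier output onto a 0-100 score and tying the threshold to a tolerable false positive rate is straightforward; the sketch below is a generic, hypothetical illustration on placeholder validation data, not the product's implementation.

```python
# Generic illustration (not the MalwareScore implementation): scale a probability to a
# 0-100 score and derive an operating threshold from a target false positive rate.
import numpy as np
from sklearn.metrics import roc_curve

def to_score(probability: float) -> int:
    """Scale a model probability in [0, 1] onto a 0-100 score."""
    return int(round(100 * probability))

def threshold_for_fpr(y_val, p_val, target_fpr=0.001):
    """Return the most permissive threshold whose validation FPR stays at or below target."""
    fpr, _, thresholds = roc_curve(y_val, p_val)
    eligible = np.where(fpr <= target_fpr)[0]
    return thresholds[eligible[-1]]

# Placeholder validation data with rough class separation.
rng = np.random.RandomState(0)
y_val = rng.randint(0, 2, 1000)
p_val = np.clip(0.6 * y_val + 0.4 * rng.rand(1000), 0.0, 1.0)

thr = threshold_for_fpr(y_val, p_val, target_fpr=0.01)
print("probability threshold:", thr, "-> score cutoff:", to_score(thr))
# Lowering the score cutoff during an active incident surfaces more suspicious files
# at the cost of additional analyst triage.
```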
That's a Wrap: Machine Learning's Place in Security
After reading this series of three blogs, we hope that you are able to see through the surface-level buzz and hype that are all too prevalent when machine learning is applied to cyber security. You should be better equipped to know and do the following:
- Understand that, like any security solution, machine learning models are susceptible to both false positives and false negatives. Hence, they are best used in concert with a broader defensive or proactive hunting framework.
- Ask the right questions to understand a model's performance and its implications on your enterprise.
- Compare machine learning models by the many facets and considerations of importance (FPR, TPR, model size, query time, training time, etc.), and choose one that best fits your application.
- Identify key considerations for hunting on the endpoint, including stealthiness (low memory and CPU footprint), model accuracy, and a model's interoperability with other facets of a detection pipeline for use by a hunt team.
- Understand that real-world datasets and deployment conditions are more "crunchy" than sterile. Dataset curation, model management, and model deployment considerations have major implications in continuous protection against evolving threats.
Endgame's use of machine learning for malware detection is a critical component of automating the hunt. Sophisticated adversaries lurking in enterprise networks constantly evolve their TTPs to remain undetected and subvert the hunter. Endgame's hunt solution automates the discovery of malicious binaries in a covert fashion, in line with the stealth capabilities developed for elite US Department of Defense cyber protection teams and high-end commercial hunt teams. We've detailed only one layer of Endgame's tiered threat detection strategy: the endpoint. Complementary models on the on-premises hunt platform and in the cloud provide additional information about threats, including malware, to the hunter as part of the Endgame hunt platform.
Endgame is productizing the latest machine learning research and data science practices to revolutionize information security. Although beyond the scope of this blog series, machine learning models are also applicable to other stages of the hunt cycle: survey, secure, and respond. We've described the machine learning aspect of malware hunting, specifically the ability to identify persistence mechanisms and never-before-seen malware during the "detect" phase of the hunt. Given the breadth of challenges in the threat and data environment, automated malware classification can greatly enhance an organization's ability to detect malicious behavior within enterprise networks.