
Mitigating Stagefright Attacks with the ARM Performance Monitoring Unit


Last summer, Stagefright became a household name after security researcher Joshua Drake highlighted vulnerabilities in the multimedia engine in Android that goes by the same name. His BlackHat USA talk last August set off a gold rush amongst bug hunters. Android security bulletins continue to be filled with libstagefright and mediaserver vulnerabilities each month, as depicted in the chart below. This high volume of bug fixes in Android is both comforting and alarming.

 

[Chart: Android CVE reports]

 

Vulnerability discovery, disclosures, and patch management remain an integral part of improving the security of platforms such as Android. However, exploit mitigations can also increase the level of difficulty for exploitation by forcing the attacker to adapt their techniques. At Endgame, our Vulnerability Research & Prevention (VR&P) team is actively pursuing both models in order to help defend against exploitation. As an example of the latter approach, this post discusses how the performance monitoring unit (PMU) of certain ARM cores can be utilized to perform system call monitoring. This hardware-assisted technique adds minimal performance overhead, avoids any requirement for patching the kernel, and offers a unique way to perform integrity checks on system calls, such as ROP detection.

 

Hardware-Assisted Exploit Prevention

Over the past year, our VR&P team has been investigating an emerging area of exploit mitigation – the use of performance counters for hardware-assisted security solutions. Our exciting research on hardware-assisted control-flow integrity on the Intel architecture will be presented later this summer at BlackHat USA 2016. As a mobile security researcher at heart, I became curious about the applicability of our approach to the ARM architecture almost immediately upon starting our x86 research.

As it turns out, Intel x86 isn’t the only architecture that can count. ARM cores can count too! In fact, a performance monitoring unit (PMU) is quite common on many modern CPU architectures. I covered the adventures of reading ARM processor manuals [1,2] and researching performance counters on various ARM chipsets during my talk at REcon 2016 last week in Montreal, Canada, titled “Hardware-Assisted Rootkits & Instrumentation: ARM Edition”. Much of my talk focused on using the PMU to enable instrumentation tools on ARM and on some offense-oriented use cases such as rootkits. However, as with most findings in the InfoSec field, capabilities can often be utilized for either offensive or defensive purposes.

Much like the Intel PMU, the ARM PMU includes support for a number of general counters, architectural events, and a performance monitoring interrupt feature to signify a counter overflow. However, something interesting surfaced in the ARM reference manuals: the PMU on several Cortex-A and other custom ARM cores is able to count exceptions for each of the ARM exception vectors individually. By configuring a hardware performance counter to overflow on every instance of the event, it is then possible to effectively trap each of these events via the performance monitoring interrupt. The Supervisor Call (SVC) exception vector is utilized by many operating systems, including Android, to handle system calls. Trapping SVC instructions offers many interesting use cases for both offense and defense.

 

EMET and Anti-ROP

Code reuse attacks such as Return-Oriented Programming (ROP) have been a staple of the attacker arsenal over the past decade. As such, anti-ROP products have become widespread in the PC market, stemming from Microsoft’s BlueHat competition in 2012. Microsoft’s EMET (Enhanced Mitigation Experience Toolkit) [3] was one of the initial byproducts of the competition, as it integrated many of the ROP detection techniques presented by one of the contest winners. Since EMET launched, many third-party security vendors have added similar anti-ROP detections to their products.

One of the primary weaknesses in EMET and similar tools is that they rely on injecting code into each user-mode process they protect. First, this actually increases the attack surface for code-reuse attacks, given that it adds code and data to the process being attacked. Moreover, EMET bypasses have emerged that disarm protections by reusing injected code within EMET.dll. A second key challenge to user-mode anti-ROP protection is the fact that integrity checks are typically introduced by hooking critical APIs. ROP chains can jump past function prologues to avoid a hook point, and hooking every possible API is reminiscent of the old-fashioned arcade game Whac-A-Mole.

Anti-ROP integrity checks from the kernel have not been explored as often in Windows products, likely due to kernel patch protection. However, being able to trap SVC instructions (system calls) on the ARM architecture without modifying the Exception Vector Table (EVT) or any portion of a kernel image opens up new possibilities. As a fun application of this ARM PMU research, I decided to implement an anti-ROP prototype loadable kernel module for Android that traps all SVC instructions using only performance monitoring interrupts, without requiring any modifications to the running kernel. The prototype adds less than 5% performance overhead on Android and can monitor all system calls system-wide.

 

Blocking Stagefright

I put this prototype to the test by using it against the very popular libstagefright attack vector in Android. To do so, I pulled pieces of Stagefright exploit proofs-of-concept from Mark Brand of Project Zero and from NorthBit’s Metaphor, both targeting CVE-2015-3864. Both ROP chains utilize the same stack pivot, and the pivot was easily detected at the mprotect or mmap system calls. The video below depicts the outcome of the test.

While this is just a proof-of-concept, it hopefully demonstrates the potential for extending hardware-assisted exploit prevention techniques to the ARM architecture. Slides from my REcon talk can be found here. Be sure to check out our talk at BlackHat USA later this summer, where our team will discuss and demonstrate our PMU research on the Intel architecture to detect and prevent control-flow hijacks in real time.

 

References
  1. ARM, ARM Architecture Reference Manual: ARMv7-A and ARMv7-R edition. 

  2. ARM, ARM Architecture Reference Manual: ARMv8, for ARMv8-A architecture profile. 

  3. Microsoft, Enhanced Mitigation Experience Toolkit 5.5 User’s Guide, Jan 2016

 



Some Implications of the Brexit on the Digital Domain


The policy world will spend the day shocked that the Brexiteers defeated the Remainers by 52-48%, leading Prime Minister David Cameron to promise to resign this fall. The majority of security professionals likely didn’t follow the ebbs and flows of the debate with the same fervor they give to Game of Thrones. With much of the media discussion appropriately focused on the economic and financial implications, there has not been much analysis of the implications for the digital domain. With the Brexit now a reality, the digital domain warrants a renewed focus by the security community as the consequences of the vote play out over the next few days, months, and years.

 

For the most part, the security industry has focused on whether or not a Brexit would make the UK safer (or less safe) with regard to cybersecurity. A poll administered to security professionals claimed that there likely would not be cybersecurity implications, noting that Britain may simply pursue a national implementation of EU policies. A different poll of security professionals disagreed, concluding that a Brexit would weaken cybersecurity because there would be additional hurdles and, likely, weakened information sharing with the EU. Clearly, it is too early to tell, but a Brexit could continue the devolution toward the Balkanization of the Internet. The UK may opt to implement its own version of EU policies, such as the General Data Protection Regulation, which aims to facilitate commerce through a Digital Single Market while providing enhanced digital privacy. However, the vote in favor of the Brexit was a vote in favor of greater national control over the UK’s economy. There is reason to believe this same desire will also bleed into the digital arena, with a push for digital sovereignty and greater national control over the Internet. As has historically been the case when more isolationist policies defeat internationalist ones, these policies are not single-issue, but address the larger need for national control over all aspects of life. A Brexit may further augment the Balkanization of the Internet if the UK pursues its own sovereign path in the digital domain.

 

The tech world largely sided with the Remainers, due to the ease of access to markets as well as a larger talent pool. This sentiment was especially strong among the Fintech community, which wants London to remain the well-established hub of the financial world. With Bitcoin surging and the pound dropping, Fintech’s concerns about a Brexit are well-founded. With the EU push for a “Digital Single Market”, Fintech companies will no longer benefit from their passport to the European Economic Area, likely resulting in UK-based companies moving to EU countries. The UK will adapt, but London’s role as the financial hub is now increasingly threatened by the Brexit, coupled with the rise of digital currencies and the EU’s move toward greater digital integration among member states.

 

Finally, while most people in security associate bots with malware, there are initial signs that ‘bots’ attempted to influence voting behavior. Bot traffic comprises 60% of all online traffic, with social media bots found for both the Brexit and Remain camps. Focusing on Twitter traffic for Brexit, Remain, and neutral hashtags, the most active 1% of contributors made up 30% of the content. As undecided voters generally don’t make up their minds until the last 24-48 hours, these social media bots can potentially influence those last-minute decisions. Moreover, as elections increasingly see a spike in phishing scams, it would be surprising if the same did not occur in the lead-up to yesterday’s vote.

 

With the Brexit now a reality, UK workers will encounter more limitations on their mobility, and the UK’s security industry will face a rising patchwork of regulations and hurdles to sharing threat intelligence and other forms of digital collaboration. And for those who still fail to see the relevance of the Brexit, even Game of Thrones is impacted, as the show receives EU subsidies while filming in Northern Ireland. Winter may come in a different location next season. That said, the pendulum historically tends to swing between extremes of isolation and integration, progressing eventually toward greater integration. While the Brexit vote may not have registered on the radar of many in the security community, it is a very impactful vote with unknowable security implications.


Machine Learning: You Gotta Tame the Beast Before You Let It Out of Its Cage


Machine learning is a fashionable buzzword right now in infosec, and is often referenced as the key to next-gen, signature-less security. But along with all of the hype and buzz, there also is a mind-blowing amount of misunderstanding surrounding machine learning in infosec. Machine learning isn't a silver bullet for all information security problems, and in fact can be detrimental if misinterpreted. For example, company X claims to block 99% of all malware, or company Y's intrusion detection will stop 99% of all attacks, yet customers see an overwhelming number of false positives.  Where’s the disconnect?  What do the accuracy numbers really mean? In fact, these simple statistics lose meaning without the proper context. However, many in the security community lack the basic fundamentals of machine learning, limiting the ability to separate the relevant and credible insights from the noise.  

To help bridge this gap, we're writing a series of machine learning-related blog posts to cut through the fluff and simplify the relevant fundamentals of machine learning for operationalization.  In this first post, we provide a basic description of machine learning models. Infosec is ripe for a multitude of machine learning applications, but we’ll focus our overview on classifying malware, using this application to demonstrate how to compare models, train and test data, and how to interpret results. In subsequent blog posts, we'll compare the most prominent machine learning models for malware classification, and highlight a model framework that we believe works well for the malware-hunting paradigm on a lightweight endpoint sensor. While this post is focused on malware classification, the machine learning fundamentals presented are applicable across all domains of machine learning.

 

What is Machine Learning?

In general, machine learning is about training a computer to make decisions.  The computer learns to make decisions by observing patterns or structures in a dataset.  In machine learning parlance, the output of training is called a model, which is an imperfect generalization of the dataset, and is used to make predictions on new data.  Machine learning has many advantages, automating many aspects of data munging and analysis at scale. For example, executables are either benign or malicious, but it’s impossible to manually review all of them. A corporate system may contain millions of files that require classification and few, if any, companies have enough staff to manually inspect each file. Machine learning is perfect for this challenge. A machine learning model can classify millions of files in minutes and can generalize better than manually created rules and signatures. 

Supervised learning models are trained with examples to answer precisely these kinds of questions, such as, “Is this file malicious?” In this supervised learning setting, the training set may consist of two million Windows executables: one million malware samples and one million benign samples. A machine will observe these samples and learn how to differentiate between benign and malicious files in order to answer the question. Typically, this decision is in the form of a score, such as a single value between 0 and 1. The figure below demonstrates the creation and use of a supervised model.

[Figure: creating and using a supervised machine learning model]

The model’s scores are converted to yes/no answers by way of a threshold.  For example, if our scores ranged from 0 to 1, we may want to set a threshold of 0.5.  Anything less than 0.5 is benign (“not malicious”), and everything equal or greater than 0.5 is malware (“is malicious”).  However, models are rarely perfect (discussed later) and we may want to tweak this threshold for a more acceptable performance.  For instance, perhaps missing malware is far worse than labeling a benign sample as malware.   We might set the threshold lower, say 0.3.  Or, maybe mislabeling in general is very bad, but our use case will allow for outputting unknown labels.  In this case we can set two thresholds.  We could choose to set everything below 0.3 as benign, everything above 0.7 as malicious, and everything else is unknown. A visualization of this is below. The important point here is that there is no one-size-fits-all solution, and it is essential to understand your data well and adjust models based on the use case and data constraints.

[Figure: score thresholds separating benign, unknown, and malware labels]
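To make the thresholding logic concrete, here is a minimal sketch (not taken from any product) of a two-threshold decision function; the 0.3 and 0.7 cutoffs are the illustrative values from the paragraph above.

```python
# Minimal sketch: converting a model score into benign / unknown / malicious.
def label_sample(score, benign_max=0.3, malicious_min=0.7):
    """Map a model score in [0, 1] to a decision label.

    The 0.3 / 0.7 cutoffs are illustrative; in practice they would be tuned
    to the false positive / false negative costs of the use case.
    """
    if score < benign_max:
        return "benign"
    if score >= malicious_min:
        return "malicious"
    return "unknown"

print([label_sample(s) for s in (0.05, 0.45, 0.92)])
# ['benign', 'unknown', 'malicious']
```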

How do we evaluate models?

Metrics are necessary to compare models to determine which model might be most appropriate for a given application. The most obvious metric is accuracy, or the percentage of the decisions that your model gets right after you select appropriate thresholds on the model score. However, accuracy can be very misleading. For example, if 99.9% of your data is benign, then just blindly labeling everything as benign (no work required!) will achieve 99.9% accuracy. But obviously, that's not a useful model!

Results should be reported in terms of false positive rate (FPR), true positive rate (TPR), and false negative rate (FNR). FPR measures the rate at which we label benign samples as malicious. An FPR of 1/10 would mean that we classify 1 in 10 (or 10%) of all benign samples incorrectly as malicious. This number should be as close to 0 as possible. TPR measures the rate at which we label malicious samples as malicious. A TPR of 8/10 would mean that we classify 8 in 10 (or 80%) of all malicious samples as malicious. We want this number to be as close as possible to 1. FNR measures the rate at which we label malicious samples as benign and is the complement of TPR (equal to 1-TPR). A 9/10 (9 in 10) TPR is the same as 1-9/10, or a 1/10 (1 in 10) FNR. Like FPR, we want this number to be as close to 0 as possible.
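As a quick illustration of these definitions, the toy example below computes FPR, TPR, and FNR from hand-made labels and predictions; the numbers are invented solely to mirror the rates discussed above.

```python
# Toy example of the rates defined above. 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

tpr = tp / (tp + fn)   # 8/10 = 0.8
fnr = fn / (tp + fn)   # 2/10 = 0.2 (equal to 1 - TPR)
fpr = fp / (fp + tn)   # 1/10 = 0.1
print(tpr, fnr, fpr)
```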

Using these three measurements, our results don't look so great for the useless model above (always labeling a sample benign). Our FPR is 0% (perfect!), but our TPR is also 0%. Since we never label anything as positive, we will not get any false positives. But this also means we will never get a true positive, and while our accuracy may look great, our model is actually performing quite poorly.

This tradeoff between FPR and TPR can be seen explicitly in a model's receiver operating characteristic (ROC) curve.  A ROC curve for a given model shows the FPR / TPR tradeoff over all thresholds on the model’s score.

[Figure: ROC curve showing the tradeoff between false positive rate and true positive rate]
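For readers who want to reproduce this kind of curve, the sketch below uses scikit-learn's roc_curve and roc_auc_score on a handful of invented scores; a real evaluation would use scores produced by an actual model over a full test set.

```python
# Sketch: tracing the FPR/TPR tradeoff over all thresholds with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 0, 0, 1, 1, 1, 1]                      # invented labels
y_scores = [0.1, 0.35, 0.4, 0.8, 0.45, 0.6, 0.7, 0.9]    # invented model scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

print("AUC =", roc_auc_score(y_true, y_scores))
```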

What FPR, TPR, and FNR metrics really mean to a user greatly depends on scale. Let's say that we have an FPR of 0.001 (1/1000). Sounds great, right? Well, a Windows 7 x64 box has approximately 50,000 executables on a given system. Let’s say you have 40,000 endpoints across your network. Absent any actions to rectify the false positives, this model would produce approximately two million false positives if applied to all Windows executables on all systems across your entire network! If your workflow calls for even the most basic triage on alerts, which it should, this would be an inordinate amount of work.

 

Importance of Training and Testing Data

So is the key to good machine learning a good metric? Nope! We also have to provide our learning stage with good training and testing data. Imagine all of our training malware samples update the registry, but none of the benign samples do. In this example, a model could generalize anything that touches the registry as malware, and anything that does not touch the registry as benign. This, of course, is wrong, and our model will fail miserably in the real world. This example highlights a phenomenon known as overtraining or overfitting, which occurs when our model becomes too specific to our training set and does not generalize. It also highlights a problem of bias, where our training/testing data over-represents a sub-population of the real-world data. For example, if the training data is disproportionately populated with samples from a small number of malware families, you may end up with a model that does a great job detecting new variants of those specific families but a lousy job detecting anything else.

Why is this important?  Overfitting and bias can lead to misleading performance.   Let's assume you are training your malware classification model on 5 families (benign + 4 different malware families).  Let's also assume that you computed FPR, TPR, and FNR on a different set of samples consisting of the same 5 families.  We'll also assume that your model is awesome and gets near perfect performance.  Sounds great, right?  Not so fast! What do you think will happen when you run this model in the real world and it’s forced to classify families of malware that it has never seen?  If your answer is "fail miserably" then you're probably correct.

Good performance metrics are not enough. The training/testing data must be representative of the real world if you expect runtime performance similar to the performance measured at test time.

Acquiring Data

So where do you get good training and testing data? At Endgame we leverage a variety of sources ranging from public aggregators and corporate partnerships to our own globally distributed network of sensors. Not only do these sources provide a vast amount of malicious samples, but they also aid in acquiring benign samples, helping Endgame achieve a more diverse set of data. All of this allows us to gather real-world data to best train our classifiers. However, your training data can never be perfect. In the case of malware, samples are always changing (such as new obfuscation techniques) and new families of malware will be deployed. To counteract this ever-evolving field, we must be diligent in collecting new data and retraining our models as often as possible. Without a constantly evolving model, we will have blind spots, and adversaries are sure to exploit them. Yes, the game of cat and mouse still exists when we use machine learning!

Acquiring Labels

Now that your training/testing data is in the computer, we need to decide which samples are benign and which samples are malicious.  Unlabeled or poorly labeled data will certainly lead to unsuccessful models in the real world. Malware classification raises some interesting labeling problems ranging from the adversary use of benign tools for nefarious means to new, ever-evolving malware that may go undetected. 

 

  • Unlabeled data. Data acquired from a sensor is often unlabeled, and presents a serious problem for machine learning. Unlabeled binaries require human expertise to reverse engineer and manually label the data, which is an expensive operation that does not scale well to the size of the training set needed for model creation.  
  • Incorrectly labeled data. Unfortunately, when classifying malware, the malware is often incorrectly labeled. Whether labeled by humans or machines, the necessary signatures may not be accessible or known, and therefore by default the malware is classified as benign. This will ultimately confuse the classifier and degrade performance in terms of FPR, TPR, and FNR.  
  • Nebulous or inconsistent definitions. There is no consistent or concrete definition of what constitutes malware.  For example, it is common for attackers to use administrative tools to navigate a system post-attack. These tools aren't inherently malicious, but they often have the same capabilities as their malicious counterparts (e.g., the ability to send network commands, view processes, dump passwords) and are often used in attacks.  On the flip side, they are readily available and often used on a system administrator's machine.  Constantly calling them malicious may annoy those administrators enough to entirely ignore your model's output.  Decisions have to be made on what labels, and how much weight, to give these sorts of programs during training to ensure models generalize instead of overfit, and support business operations instead of becoming a nuisance. 

 

The table below summarizes some primary challenges of labeling and their solutions.

[Table: primary labeling challenges and their solutions]

 

Conclusion

So remember, machine learning is awesome, but it is not a silver bullet. Signatures and IOCs are not going away, but some of the most relevant, impactful advances in infosec will stem from machine learning. Machine learning will help leapfrog our detection and prevention capabilities so they can better compete with modern attacker techniques, but it requires a lot of effort to build something useful, and, like all approaches, will never be perfect. Each iterative improvement to a machine learning model will be met with a novel technique by an adversary, which is why the models must evolve and adapt just as adversaries do.

Evaluating the performance of machine learning models is complicated, and results are often not as really, really, ridiculously good looking as the latest marketecture would have you believe. When it comes to security you should always be a skeptic, and the same goes for machine learning. If you are not a domain expert, just keep the following questions in mind the next time you are navigating the jungle of separating machine learning fact from fiction.

[Figure: checklist of questions for evaluating machine learning claims]

We believe that machine learning has the potential to revolutionize the field of security. However, it does have its limitations. Knowing these limitations will allow you to deploy the best solutions, which will require both machine learning and conventional techniques such as signatures, rules, and technique-based detections. Now that you are armed with a high-level introduction to interpreting machine learning models, the next step is choosing the appropriate model. There are a myriad of tradeoffs to consider when selecting a model for a particular task, and this will be the focus of our next post. Until then, keep this checklist handy and you too will be able to separate data science fact from data science fiction.

It's a Bake-off!: Navigating the Evolving World of Machine Learning Models


In our previous blog, we reviewed some of the core fundamentals in machine learning with respect to malware classification.  We provided several criteria for properly evaluating a machine learning model to facilitate a more thorough understanding of its true performance.  In this piece, we dig deeper into the operationalization of machine learning, covering the basics of feature engineering and a few commonly used types of machine learning models. We conclude with a head-to-head comparison of several models to illustrate the various strengths and performance tradeoffs of these varied machine learning implementations. When operationalizing machine learning models, organizations should seek the solution that best balances their unique requirements, including query time, efficacy performance, and overall system performance impact.

 

Feature Engineering

Feature engineering is a major component of most machine learning tasks, and entails taking some raw piece of data, such as malware, and turning it into a meaningful vector of features.  Features are a numeric representation of the raw data that can be used natively by machine learning models. For this post, our feature vector is a vector of floats generated from a static Portable Executable (PE) file. We follow the feature extraction techniques outlined in Saxe and Berlin’s malware detection piece, and use these features for all classifiers.   The features include:

  • Byte-level histogram
  • Entropy measured in a sliding window across the PE file
  • Information parsed from the PE file, including imports, exports, section names, section entropy, resources, etc.

Similar to previous research, these features are condensed into a fixed-length vector of 1,562 floats. Turning a seemingly infinite list of imports, exports, etc. into a fixed-length vector of floats may seem like an impossible task, but feature hashing makes this possible. Think of the vector of floats as a hash table, wherein a hashing function maps one or more features to a single feature that accumulates the values that correspond to those features. This technique is quite common and is surprisingly effective.
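As a rough illustration of feature hashing (not the exact pipeline described in the paper), the sketch below uses scikit-learn's FeatureHasher to fold an arbitrary-length list of strings, such as import names or section names, into a fixed-length vector; the 256-bucket size is purely illustrative.

```python
# Sketch of feature hashing: mapping a variable-length set of strings
# (imports, section names, etc.) into a fixed-length vector of floats.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=256, input_type="string")

# Hypothetical per-file string features pulled from two PE files.
pe_strings = [
    ["kernel32.dll:CreateFileA", "wininet.dll:FtpGetFile", ".text", ".rsrc"],
    ["kernel32.dll:CreateFileA", "user32.dll:MessageBoxA", ".text"],
]

X = hasher.transform(pe_strings).toarray()
print(X.shape)  # (2, 256): one fixed-length vector per PE file
```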

 

Models

Now that we have defined our features, it is time to select the models. There are too many machine learning models for a complete review, so we focus on seven common models: naive Bayes, logistic regression, support vector machine, random forest classifier, gradient-boosted decision tree, k-nearest neighbor classifier, and a deep learning model. In our descriptions below, we explain how each model behaves as it pertains to malware classification, although clearly these models support many other use cases.
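For reference, the sketch below shows how six of these seven model families might be instantiated with scikit-learn; the hyperparameters are placeholders rather than the settings used in our bake-off, and the deep learning model (built separately, e.g. with Keras) is omitted.

```python
# Sketch: scikit-learn counterparts of six of the seven model families.
# Hyperparameters are placeholders, not the values used in the bake-off.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "naive_bayes":         GaussianNB(),
    "logistic_regression": LogisticRegression(),
    "linear_svm":          LinearSVC(C=1.0),
    "random_forest":       RandomForestClassifier(n_estimators=100),
    "gbdt":                GradientBoostingClassifier(n_estimators=100, max_depth=3),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=10),
}

# for name, clf in models.items():
#     clf.fit(X_train, y_train)            # X_train: hashed PE feature vectors
#     scores = clf.predict_proba(X_test)   # LinearSVC exposes decision_function instead
```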

 

Naive Bayes

Naive Bayes is one of the most elementary machine learning models. Naive Bayes crudely models the features of each class as having been generated randomly from some user-specified statistical distribution, like the Gaussian distribution. Its name comes from its reliance on Bayes theorem with a naive assumption that all features are independent of one another. Independence of features implies that the occurrence of one feature (e.g., entropy of the .text section) has no bearing on the value or occurrence of another feature (e.g., an import of FtpGetFile from WinINet.dll). It is termed naive since this assumption rarely holds true for all features in most applications (e.g., an import of FtpGetFile is correlated with an import of InternetOpen). Despite its simplicity, naive Bayes works surprisingly well in some real-world problems such as spam filtering.

 

Logistic Regression

Logistic regression (LR) is a model that learns a mapping from a vector of feature values to a number between 0 (benign) and 1 (malicious).  LR learns one coefficient for each element of a sample's feature vector, multiplying the value of each element by the corresponding coefficient, and summing up those products across all elements. LR then compresses that number to a value between 0 and 1 using a logistic function, and approximates from the features the target label (0=benign, 1=malicious) provided in the training set.
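A toy numeric sketch of that scoring step, with invented weights and feature values:

```python
# Toy illustration of LR scoring: a weighted sum of features squashed by the
# logistic function. Weights and features here are invented for illustration.
import math

weights  = [0.8, -1.2, 2.5]       # one learned coefficient per feature
bias     = -0.3
features = [0.4, 1.0, 0.7]        # one sample's feature values

z = sum(w * x for w, x in zip(weights, features)) + bias
score = 1.0 / (1.0 + math.exp(-z))   # value between 0 (benign) and 1 (malicious)
print(score)
```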

 
Support Vector Machine

A support vector machine (SVM) is a so-called max-margin classifier, of which we only consider a linear SVM.  Linear models seek a straight line that bisects malicious from benign samples in our feature space.  In turn, a linear SVM aims to find the fattest line/slab that separates malicious from benign. That is, of all slabs that could separate malicious features from benign features, SVM finds the one that occupies the most empty space between the malicious and benign classes.  This "fatness" objective provides a measure of robustness against new malware samples that may begin to encroach on the margin between malicious and benign samples.

 

Random Forest Classifier

A random forest classifier is a model that learns to partition the feature space using a committee of decision trees.  Each decision tree is trained to essentially play a game of "20 questions" to determine whether a sample's features represent a malicious or benign file. The number of questions and which questions to ask are learned by an algorithm. To provide good generalization without overfitting, each decision tree is trained only on a subset of the data and questions can be asked only from a subset of features (hence, "random").  Tens or even hundreds of decision trees of this type (hence, "forest") are bagged together in a single model, where the final model's output is a majority vote of each randomized decision tree in the forest.

 
Gradient-Boosted Decision Tree

Like the random forest classifier, a gradient-boosted decision tree (GBDT) combines the decision of many small decision trees.  In this case, however, the number of questions that can be asked about the features are restricted to a relatively small number (perhaps one to ten questions per tree).  From this initial tree, the model makes several errors, and in the next round (i.e., next decision tree), the algorithm focuses more attention on correcting errors from the previous round.  After tens or hundreds of rounds, a final determination is made by a weighted vote of all the trees.  

 

K-Nearest Neighbors (k-NN)

k-nearest neighbors (k-NN) is another extremely simple model.  The idea is that you can classify an object by its closest (or most similar) neighbors.  For example, in a database of ten million malware and benign samples, if a file's ten most similar matches in the database are all malware, then the object is probably malware.  This classifier is non-parametric (doesn't try to fit the data using some user-prescribed function) and completely lazy (requires no training!).  Nevertheless, in straightforward implementations one must search the entire dataset for every query, so memory and speed may be limiting factors for large datasets. A whole branch of research is devoted to approximating nearest neighbor search to improve query speed.

 

Deep Learning

Deep learning refers to a broad class of models defined by layers of neural networks.  Since this class is so broad, we briefly describe only the deep learning classifier in the malware detection paper previously referenced in the feature engineering section.  This model consists of 1 input layer, 2 fully-connected hidden layers, and an output classifier layer. Our implementation of the model differs slightly from that paper in an effort to improve model training convergence.  It should be noted that, like all models in this blog post, this deep learning model does not represent any company's next-gen malware product, and is intended only for coarse comparison with other model types. Since deep learning is a fast moving research field and may produce models of varying complexity and performance, it would be unfair and misleading to draw narrow product conclusions about this particular embodiment of deep learning.

 

Bake-Off Framework

With our features and models selected, it is now time to evaluate the performance of each model and explore whether there is a best-of-breed model for malware classification.

As we previously highlighted in our last post, one must take care when comparing models: "accuracy" alone doesn't tell the whole story.  We discussed several performance measures including false positive rate (FPR), false negative rate (FNR), true positive rate (TPR)—which depend on a user-specified threshold on a model's output—and the Receiver Operating Characteristic (ROC) which explicitly shows the tradeoff between FPR and TPR for all possible model score thresholds.  In what follows, we'll provide a summary of the model performance by reporting the area under the ROC curve (AUC), which can be thought of as the TPR averaged over all possible thresholds of the model score: it allows performance comparisons without first selecting an arbitrary threshold.  However, at the end of the day, a model must make an up/down decision on whether the sample is malicious, so in addition to AUC, we also report FPR and FNR with a threshold set at the "middle" value (0.5 for models that report scores between 0 and 1).

Since these metrics are not the only consideration for a particular task, we also compare model size (in memory) and speed (query rate measured in samples per second using wall clock time).  Wall clock time can be misleading due to differences in machine specs, threading, and current process loads. We attempt to mitigate these issues by testing all models on the same machine, under minimal load, and all speed measurements are done on a single thread.  
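The snippet below sketches one way such measurements might be taken: model size via the pickled object on disk, and query rate via single-threaded wall-clock timing. The `clf` and `X_test` names are placeholders for a trained model and a feature matrix, not part of our actual harness.

```python
# Sketch of how model size and query rate might be measured.
import os
import pickle
import time

def model_size_bytes(clf, path="model.pkl"):
    with open(path, "wb") as f:
        pickle.dump(clf, f)               # serialized model as a size proxy
    return os.path.getsize(path)

def queries_per_second(clf, X_test):
    start = time.perf_counter()
    clf.predict(X_test)                   # single thread, wall clock time
    elapsed = time.perf_counter() - start
    return len(X_test) / elapsed

# print(model_size_bytes(clf), queries_per_second(clf, X_test))
```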

 

Experimental Setup

We implement our bake-off test in Python leveraging scikit-learn, Keras, and a custom NearestNeighbor algorithm. We train each of these models against the same dataset: a random sample of 2 million unique malicious and benign PE files. Unfortunately, in the real world data doesn’t follow a perfect Gaussian distribution, but rather is often quite skewed or imperfect. In this case, the breakdown of malicious to benign is 98/2, highlighting the class imbalance problem mentioned in the previous post. We attempt to maximize the effectiveness of each model during training by performing a grid search to determine ideal parameters for learning. Each machine learning algorithm has a set of hand tunable parameters that determine how the training is conducted. In an exhaustive grid search, the models are trained with multiple combinations of these parameters and the performance at each grid point is analyzed.

Performance measures were calculated using a training set and a testing set.  The models are trained on the training set, then performance measures are calculated on the disjoint testing set.  The disjointness is key: if our testing set were to contain items from our training set, then we would be cheating and the results would not represent the performance of our classifier in the real world.  However, it is possible to hand craft the training and testing set to achieve better (or worse) performance.  To remove any bias, we use a technique called cross-validation. Cross-validation creates several randomly-selected training/test splits and the performance metrics can be aggregated over all sets.  The number of training/test set divisions is called folds and we use ten folds in our experiments. 
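Putting the last two ideas together, the sketch below runs an exhaustive grid search scored by AUC under 10-fold cross-validation using scikit-learn's GridSearchCV; the parameter grid shown is illustrative, not the grid used in our experiments.

```python
# Sketch: grid search over hand-tunable parameters, scored by AUC with
# 10-fold cross-validation. The parameter grid is illustrative only.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators":  [100, 300, 500],
    "max_depth":     [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    scoring="roc_auc",   # AUC, averaged over the folds
    cv=10,               # ten train/test splits
)
# search.fit(X, y)       # X, y: feature vectors and labels from the training corpus
# print(search.best_params_, search.best_score_)
```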

 

Results

Results for the bake-off and interpretation of each model's performance are given in Table 1. False positive and false negative rates were calculated using a scoring threshold naively chosen halfway between benign and malicious.

Table 1: AUC, FP/FN rate and Model Metrics for Classifiers

* Query time does not include time required to extract features

** Update: We used a default threshold for each model (0.5 on a scale of 0 to 1) to report FP and FN rates. As with other details in each model, the threshold would be chosen in a real-world scenario to achieve a fixed low FP rate with a corresponding trade-off in FN rate. AUC is an average measure of accuracy before thresholding, and is the most appropriate metric for comparing models here.

 

AUC

The top performing models based on the average AUC are the tree ensemble models (GBDT and Random Forest), as well as the Deep Learning and k-NN models, with scores hovering around 0.99. These models all had significantly smaller FP/FN rates than the next highest scoring models, SVM and Logistic Regression, which saw FP rates in the 20% range. A model that produces 1 FP for every 5 samples is far too many to be used operationally (imagine the alerts!). Naive Bayes performed poorly in this exercise, likely because its assumption of feature independence does not hold for these features.

 

Model Size

The size of the final model has little to do with an ability to correctly classify samples, but it is extremely important for real-world implementation for classifying malware on a user’s machine. The bigger the model is on disk (or in memory), the more resources it consumes. Larger models are infeasible on an endpoint because of the negative impact on the host environment. Avoiding any degradation to the user’s experience on their machine is paramount for any successful endpoint solution. The results of the bake-off show three major classes of model sizes by impact on the host system.

Figure 1 – Model AUC vs. Model Size on Disk

 

As Table 1 depicts, SVM, Logistic Regression, Naive Bayes, and GBDT models consumed less than 1MB, which would be considered small and have a negligible impact on a system.  After eliminating the models that had unreasonable average AUC scores, GBDT again outperforms the competition. 

Both Deep Learning and Random Forest performed well in the AUC test, but the size of the models on disk could have a negative impact. While 80-100MB is certainly not large, these models look like monsters compared to the tiny GBDT. Moreover, there is not a significant enough increase in AUC to warrant a model of this size.

Finally, k-NN weighed in at 10.5GB, which is probably not a viable solution for an endpoint model. Remember: k-NN has no parameters; instead, it requires maintaining the entire training set in order to perform a nearest neighbor search! With its high AUC average and low FP/FN rates, k-NN represents a potentially viable cloud-based classification solution.

 

Query Time 

An endpoint malware classifier must make snap decisions on whether a newly observed binary is malicious or benign. Failure to do so could result in the execution of a binary that wreaks havoc on an individual machine or an enterprise network. The ability for a model to quickly classify multiple binaries could be a major differentiator in our search for an ideal model. Figure 2 shows how many queries can be made against each model per second. This time does not include feature calculation, since that time is equivalent for each model.

Figure 2 – Model AUC vs. Queries per Second

 

SVM and Logistic Regression outperformed the rest of the competition in this category. Both models only require a couple of addition and multiplication operations per feature at query time. GBDT and Random Forest models were in the next tier for speed. The query time of these models is dependent on the size and quantity of the decision trees they've trained. The number and size of those trees can be controlled during training, and there is a tradeoff between speed and classification accuracy here. For our bake-off, both models were tuned for best classification accuracy, so it is nice to see those models still performing quickly. Deep Learning involves a series of matrix multiplications for each query, and k-NN involves searching through the entire training set for each query, so it is not surprising to see that these models are slower.

 
Other Considerations

AUC, model size, and query time are all important metrics for evaluating machine learning models, but there are other areas that warrant consideration, too. For instance, training time is an important factor, particularly when your dataset is constantly evolving. Malware is a great example of this problem as new families and attack vectors are constantly developed. The ability to quickly retrain and evaluate models is crucial to keeping pace with the latest malicious techniques. Table 2 details each model’s initial training time and how the model can be updated.

Table 2 – Model Training Time and Update Method

** We show GPU instead of CPU time due to the default implementation and customary training method. The CPU time would be nearly 16 hours (57K seconds!).

*** SVM and LR models could also be conveniently trained using a GPU.  However, default implementations of publicly available packages leverage CPU.

 

Final Thoughts

Machine learning has garnered a lot of hype in the security industry. However, machine learning models used in products may differ in architecture, implementation, training set quality, quantity and labeling (including the use of unlabeled samples!), feature sets, and thresholds.  None of the models analyzed represent any company's next-gen malware product, and are intended only for coarse comparison with other model types.   That said, one may coarsely compare results to get a sense of suitability to a particular application (e.g., endpoint vs. cloud-based malware detection).  

As the bake-off demonstrated, each model has strengths and weaknesses. GBDTs are an especially convenient model for real-time endpoint-based malware classification due to a good combination of high AUC, small size, and fast query speed. However, GBDTs require a complete retraining to update a model, which is the most time-consuming training process of all tested algorithms. Both Random Forest and k-NN models performed nearly as well as GBDT (avg. AUC), but produced significantly larger models on disk. However, despite its size, k-NN requires no training at all!

There is no magical formula for choosing the machine learning model that's right for your situation. Some of the considerations that we've discussed, like query time or training time, might not matter in other application areas. In some situations, you're trying to provide the best experience to the user of your model, and so convenient metrics for optimization like AUC may not exist. When computing resources are so cheap and developer and data scientist time so expensive, it might be best to go with the model with which you have the most experience and that best meets your operational requirements. The next step to building a great classifier involves handling corner cases; continually cleaning, managing, and improving your collected dataset; and verifying your results at every step of the way. In the next and final post in this three-part series, we'll consider some of the finer details of implementing a lightweight endpoint malware classifier using GBDTs.

Vegas Hacker Summer Camp 2016: Mind the Gap


"But the real magic comes when you take the expertise that you've got in security and you translate it and you rebuild it and you reform it. Don't be afraid to take the knowledge you have and make it accessible to vastly more people."  Dan Kaminsky during the Black Hat 2016 keynote address.

 

Information sharing – snuck semi-discreetly into last year’s omnibus bill – is often portrayed as the policy community’s silver bullet for increased security. The policy community does not maintain a monopoly on silver bullets, as many buzzwords and much hype made the rounds during last week’s major infosec conferences – BSidesLV, Black Hat, and Defcon. Both communities’ propensity for bumper sticker solutions to extremely complex issues aside, there is something to be said for greater information sharing, not just of data as it is currently conceived, but more so of information sharing across communities. The combination of last week’s conferences with President Obama’s recent directives and Cybersecurity National Action Plan (CNAP) priorities all points back to the need for greater information sharing. However, we need to reframe the notion of information sharing such that the emphasis becomes one of laying the foundation for greater accessibility of inbound and outbound knowledge. That is, the industry requires a greater diversity of perspectives to complement the current phenomenal domain expertise and help overcome some of today’s greatest challenges in security.

 

Greater Integration of Strategic, Business, and Technical Thinkers

It’s a useful heuristic to gauge this notion of knowledge accessibility by looking at the Vegas conferences through the lens of two of CNAP’s major objectives. First, CNAP called for the creation of a Commission on enhancing cyber security that would represent a public/private partnership. At its core, the goal is to combine the top “strategic, business, and technical thinkers” to dramatically innovate the policy realm, very similar to the government’s outreach to the larger tech community, as epitomized by organizations like DIUx and collaborative projects such as Hack the Pentagon. However, last week there was a noticeably smaller number of policy panels. In fact, using the Black Hat categorization filter, there were only seven policy talks, the majority of which likely wouldn’t be considered policy topics by their authors or by policy wonks. Previous years had much greater engagement by the government or those in the policy community, as well as more direct content straight from the policy community. Each year, these talks tend to be very well attended, and this year was no exception. Jason Healey’s talk on technologies and policies for a defensible cyberspace was standing room only, AND people stuck around to ask questions instead of bolting as so often happens.

Why does this matter? Well, let’s look at one of the few policy-relevant talks last week, where two Congressmen told the Defcon attendees that it will likely take years for the government to begin to reach some initial resolutions on encryption. There is no guarantee that, even after that prolonged time, the solution will be something both technically and politically viable. The latest round of Wassenaar Arrangement discussions validates this, with many security experts finding it not only off the mark, but overall detrimental to international cybersecurity.  While the security industry certainly can’t speed up the legislative process, it can participate more to ensure that those proposals that do emerge are technically sound and viable. This requires greater interaction between the communities, including at these extremely technical conferences. Clearly, the technical aspect should continue to dominate, but there needs to be more than a few loosely categorized panels to truly stimulate the policy innovation so desperately needed.

  

Boost Cyber Workforce

Another objective of the CNAP is to boost the cyber workforce. With the skills gap identified as the most critical problem, most proposals point to greater training and education as the key solution. As many acknowledge, this may be useful in the future, but does nothing for the current dearth of talented applicants. In contrast, many of the skills required actually can be acquired from other disciplines, for which security does not seem like a natural career path. Look at data science, for example, where disciplines from electrical engineering to physics to materials science can provide the modeling and coding skills essential for today’s security data environment. To illustrate this point, Endgame was fortunate enough to have speakers at all three main conferences, covering a range of topics including machine learning, blockchain, control-flow integrity, and steganography. The speakers’ backgrounds, in turn, vary significantly and demonstrate the importance of the cross-pollination of ideas across disciplines, and the value of cross-functional teams. Unfortunately, most security job reqs today remain so focused on specific security skills or languages that they omit the most important aspects of the security workforce – the ability to adapt, innovate, collaborate, and maintain strong technical acumen.

Not only can disciplinary diversity help augment the current workforce, but the industry also requires a diversity of perspectives of all kinds, including various educational backgrounds, disciplines, and races. This clearly also includes the overwhelming gender gap in the industry, which seems to only be getting worse. Defcon created the most buzz in this area on social media in response to a hacker jeopardy category. In contrast, there is some good news to report – BSidesLV had roughly 22% female speakers, and two of three keynotes were women. For any other industry, 22% would be nothing to cheer about, but given the declining current rate of roughly 8-11% women in the industry, 22% looks pretty good! Conference culture can go a long, long way toward helping impact this gender gap. From reducing the number of manels to increasing the number of female speakers to creating a conference culture that penalizes blatant harassment, the conferences are a key gauge of the state of the industry. Unfortunately, it appears to remain stagnant at best, or possibly trending in the wrong direction.

  

Biggest Bang for the Buck

In the keynote address at Black Hat, Dan Kaminsky stressed the need to make the knowledge of the industry more accessible, to translate it, and to reform it. This is exactly what the industry needs to help tackle what is truly one of the most complex and dynamic geopolitical environments, which happens to coincide with one of the most impactful technological revolutions. Last week at BSidesLV, Black Hat, and Defcon, we witnessed some truly phenomenal technical breakthroughs. Unfortunately, they alone are not enough. The gap between policy and security, and the workforce gap, will continue to impact private and public security for the foreseeable future. Fortunately, simple and effective solutions exist for these two major gaps. My vote for this year’s biggest bang for the buck toward addressing one of these gaps is TiaraCon, a movement that caught on quickly and garnered funding to provide a safe and supportive hacking environment for all genders. Let’s hope other social movements similarly gain momentum, because the status quo will not be sustainable against today’s threat landscape.

Endpoint Malware Detection for the Hunt: Real-world Considerations


In the first blog post of this series, we discussed considerations for measuring and understanding the performance of machine learning models in information security.  In the second post, we compared machine learning models at a fairly coarse level on a malware detection task, noting several considerations of performance for each model, including accuracy (as measured by AUC), training time, test time, model size, and more.  In this post, we'll depart slightly from the more academic discussion of our previous blog posts and discuss real-world implementation considerations. Specifically, we’ll build upon a related white paper and address operationalizing a malware classifier on an endpoint in the context of a hunt paradigm.

 

A Word about the Hunt

First, let's establish some context around malware detection in a hunt framework.  We define hunting as the proactive, methodical, and stealthy pursuit and elimination of never-before-seen adversaries in one's own environment.  Threat hunters establish a process to locate and understand a sentient adversary prior to eviction, without prematurely alerting the adversary to the hunter's presence.  A thorough understanding of the extent of an adversary's access is necessary for complete removal and future prevention of the adversary.  Prematurely alerting an adversary that he's being pursued can prompt him to destroy evidence of his exploitation, accelerate his timeframe for causing damage and stealing data, or cause him to burrow deeper and more carefully into systems than he otherwise might.  Therefore, the hunter uses stealthy tools to survey the enterprise, secures assets to prevent the adversary from moving laterally within networks, detects the adversary's TTPs, and surgically responds to adversary tactics without disruption to day-to-day operations, for example, by terminating a single injected thread used by the adversary. 

In this context, malware detection is part of a multi-stage detection framework that focuses particularly on discovering tools used by the adversary (e.g., backdoors used for persistence and C2). Unlike a passive detection framework, malware detection in the hunt framework must meet particularly rigid standards for stealth, including a low memory and CPU footprint, while still providing high detection rates with low false positive rates. A low memory and CPU footprint allows an agent to hide amongst normal threads and processes, making it difficult for an adversary to detect or attempt to disable monitoring and protective measures. For this purpose, we focus attention specifically on developing a robust, lightweight model that resides in-memory on an endpoint to support hunting as one part of a larger detection framework. The task of this lightweight classifier is to determine the maliciousness of files that are:

  1. Newly created or recently modified;
  2. Started automatically at boot or other system or user event;
  3. Executed (pre-execution);
  4. Backing running processes;
  5. Deemed suspicious by other automated hunting mechanisms; or
  6. Specifically queried by the hunt team.

 

Lightweight Model Selection

Similar to results presented in the coarse model comparison experiment in our second post, and following additional, more detailed experiments, gradient boosted decision trees (GBDTs) offer a compelling set of metrics for lightweight malware detection for this hunt paradigm, including:

  1. Small model size;
  2. Extremely fast query time; and
  3. Competitive performance as measured by AUC, allowing model thresholds that result in low false positive and false negative rates.

However, to improve the performance of GBDTs and tune them for real-world deployment, one must do much more than train a model using off-the-shelf code on an ideally-labeled dataset. We discuss several of these elements below.

 

From Data Collection to Model Deployment

The development of a lightweight malware classifier requires a large amount of diverse training data.  As discussed in the first blog post of this series, these data come from a variety of both public and private sources.  Complicating our initial description of data collection is that: (1) data must often be relabeled based on preliminary labels, and (2) most of these data are unlabeled—there's no definitive malicious/benign or family indicator to the collected samples.

 
Data Labels: It Just Seemed Too Easy, Didn't It?

Even when data come with labels, they may not be the labels one needs for a malware classifier.  For example, each label might consist of a family name (e.g., Virut), a malware category (e.g., trojan) or in a more unstructured setting, a free-form description of functionality from an incident response report, for example.  From these initial labels, one must produce a benign or malicious tag.  When curating labels, consider the following questions:  

  • While a backdoor is malicious, what about legitimate network clients, servers and remote administration tools that may share similar capabilities?  
  • Should advanced system reporting and manipulation tools, like Windows Sysinternals, that might be used for malicious purposes be considered malicious or benign?
  • Are nuisance families like adware included in the training dataset? 

These questions speak to the larger issue of how to properly define and label "grayware" categories of binaries. Analyzing and understanding grayware categories to build a consensus are paramount for constructing an effective training dataset. 

 

Unlabeled Data: The Iceberg Beneath the Surface

Unlabeled data may be under-appreciated in the security data science field, but are often the most interesting data.  For example, among the unlabeled data may be a small fraction of bleeding-edge malware strains that have yet to be detected in the industry.  Although not always straightforward to do so, these unlabeled samples can still be leveraged using so-called semi-supervised machine learning methods.  Semi-supervised learning can be thought of as a generalization of supervised learning, in which both labeled and unlabeled samples are available for training models.  Most models, like many considered in our second post, do not natively support the use of unlabeled samples, but with care, they can be modified to take them into account.  We explain two such methods here.

First, semi-supervised methods exist that work in cooperation with a human analyst to judiciously select "important" samples for hand-labeling, after which traditional supervised learning models can be used with an augmented labeled dataset.  This so-called active learning framework is designed to reduce the burden on a human analyst while enhancing the model’s performance. Instead of inspecting and hand-labeling all of the unlabeled samples, a machine learning model guides the human to a small fraction of samples that would most benefit the classifier.  For example, samples may be selected that yield the greatest gain in label coverage: by asking the human to label a single sample, the labels for an entire tight cluster of samples can be inferred.  There are several similar, sometimes competing and sometimes complementary objectives outlined below:  

  • Which unlabeled samples is the model most uncertain about?
  • If labeled correctly, which sample will maximize information gain in my model?  
  • Which unlabeled samples could represent new attack trends?

Malware samples selected by active learning can address one or more of these objectives while respecting the labeling bandwidth of human analysts.
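
The first objective above, model uncertainty, is the easiest to illustrate. The sketch below is a generic uncertainty-sampling routine, not Endgame's selection strategy; the model, feature matrix, and labeling budget are placeholders.

```python
# Uncertainty-sampling sketch: ask the analyst to label only the unlabeled
# samples the current model is least sure about.
import numpy as np

def select_for_labeling(model, X_unlabeled, budget=20):
    """Return indices of the `budget` unlabeled samples closest to the decision boundary."""
    proba = model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = np.abs(proba - 0.5)        # 0 means maximally uncertain
    return np.argsort(uncertainty)[:budget]

# to_label = select_for_labeling(model, X_unlabeled)
# ...hand those samples to an analyst, then retrain with the augmented labels.
```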

A second category of semi-supervised methods leverages unlabeled data without human intervention.  One approach in this category involves hardening decision trees (and by extension, GBDTs), which are known to be overly sensitive in regions of the feature space where there are few labeled samples.  The objective is to produce a GBDT model that is regularized towards uncertainty (producing a score closer to 0.5 than to 0.0 or 1.0) in regions of the feature space that contain many unlabeled samples but few or no labeled samples.  Especially in the hunt paradigm, a model should have a very low false positive rate, and this locally-feature-dependent regularization can save a hunter from alert fatigue, which quickly erodes the utility of any alerting system.

Other semi-supervised methods that do not require human intervention include label spreading and label propagation, which infer labels from a sample's neighbors (neighbors of a labeled sample should carry the same label), and self-training, in which the model predicts labels for unlabeled samples and the most confident predictions are added to the labeled training set for re-training.
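
As a brief illustration of the first of these, the sketch below uses scikit-learn's LabelSpreading with the usual convention that unlabeled samples carry the label -1. The data are placeholders; this is not the method or data used in our production pipeline.

```python
# Label spreading sketch: unlabeled samples are marked -1 and receive labels
# inferred from nearby labeled samples in feature space.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.rand(500, 32)                      # placeholder feature matrix
y_partial = np.full(500, -1)                     # -1 = unlabeled
y_partial[:50] = np.random.randint(0, 2, 50)     # a small labeled subset

spreader = LabelSpreading(kernel="knn", n_neighbors=7)
spreader.fit(X, y_partial)
inferred = spreader.transduction_                # labels for every sample, including the unlabeled ones
```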

 

Automation and Deployment

For an enterprise-relevant data science solution, a wholly automated process is required for acquiring samples (malicious, benign and unlabeled), generating labels for these samples, automatically extracting features, partitioning the features into training and validation sets (for feature and model selection), then updating or re-training a model with new data. This may seem like a mundane point, but data lifecycle management and model versioning and management don’t enjoy the standard processes and maturity that are now common within software version management.  For example, consider four independent elements of a data science solution that could change the functionality and performance of an endpoint model: 1) the dataset used to train the model; 2) the feature definitions and code to describe the data; 3) the model trained on those features; and 4) the scaffolding that integrates the data science model with the rest of the system.  How does one track versioning when new samples are added to or labels are changed in the dataset?  When new descriptive features are added?  When a model is retrained? When encapsulating middleware is updated?  Introducing the engineering processes into a machine learning solution narrows the chasm between an interesting one-off prototype and a bona fide production machine learning malware detection system. Once a model is trained and its performance on the holdout validation set is well characterized, the model is then automatically pushed to a customer.
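
One simple way to make those four elements traceable, sketched below with hypothetical file names, is to fingerprint each artifact and derive a composite version identifier. This is an illustration of the idea, not a description of Endgame's actual versioning system.

```python
# Hedged sketch: fingerprint the dataset, feature code, trained model, and
# scaffolding so any deployed score can be traced to exactly what produced it.
import hashlib, json

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    "dataset":      sha256_of("training_set.parquet"),  # hypothetical artifact names
    "feature_code": sha256_of("features.py"),
    "model":        sha256_of("model.bin"),
    "scaffolding":  sha256_of("agent_glue.py"),
}
version_id = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()[:12]
print(version_id, manifest)
```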

But the job doesn’t stop there. In what constitutes quality assurance for data science, performance metrics of the deployed model are continuously gathered and checked against pre-deployment metrics. Following deployment, the following questions must be answered in order to monitor the status of a deployed model:

  • Is there an unusual spike in the number of detections or have the detections gone quiet?
  • Are there categories or families the model is no longer correctly classifying?
  • For a sampling of files submitted to the model, can we discover the true label and compare them against the model’s prediction? 

The answers to these questions are particularly important in information security, since malware samples are generated by a dynamic adversary. In effect, the thing we’re trying to detect is a moving target: the malware (and benign!) samples we want to predict continue to evolve from the samples we trained on. Whether one acknowledges and addresses this issue head on is another factor that separates naive from sophisticated offerings. Clever use of unlabeled data, and strategies that proactively probe machine learning models against possible adversarial drift, can be the difference between rapidly discovering a new campaign against your enterprise and being “pwned”. 

 

Endgame MalwareScore™

Optimizing a malware classification solution for the hunt use case produces a lightweight endpoint model trained on millions of benign, malicious, and unlabeled samples.  The Endgame model allows for a stealthy presence on the endpoint by maintaining a minuscule memory footprint without requiring external connectivity. Paired with a sub-100 millisecond query time, the model represents the blend of speed and sophistication necessary for successful hunt operations.  The endpoint model produces the Endgame MalwareScore™, where scores approaching 100 inform the analyst that the file in question should be considered malicious.  Endgame MalwareScore™ also allows the analyst to easily tune the malicious threshold to better suit the needs of the current hunt operation, such as reducing the threshold during an active incident to surface more suspicious files to the hunter. The Endgame MalwareScore™ is an integrated element of detecting an adversary during hunt operations: every executable file-backed item is enriched with the score, highlighting potentially bad artifacts like persistence mechanisms and processes to guide the hunter as effectively as possible.

 

That's a Wrap: Machine Learning's Place in Security

After reading this series of three blogs, we hope that you are able to see through the surface-level buzz and hype that is too prevalent in machine learning applied to cyber security.  You should be better equipped to know and do the following:

  • Understand that, like any security solution, machine learning models are susceptible to both false positives and false negatives.  Hence, they are best used in concert with a broader defensive or proactive hunting framework.
  • Ask the right questions to understand a model's performance and its implications on your enterprise.
  • Compare machine learning models by the many facets and considerations of importance (FPR, TPR, model size, query time, training time, etc.), and choose one that best fits your application.
  • Identify key considerations for hunting on the endpoint, including stealthiness (low memory and CPU footprint), model accuracy, and a model's interoperability with other facets of a detection pipeline for use by a hunt team.
  • Understand that real-world datasets and deployment conditions are more "crunchy" than sterile.  Dataset curation, model management, and model deployment considerations have major implications in continuous protection against evolving threats.

Endgame’s use of machine learning for malware detection is a critical component of automating the hunt. Sophisticated adversaries lurking in enterprise networks constantly evolve their TTPs to remain undetected and subvert the hunter. Endgame’s hunt solution automates the discovery of malicious binaries in a covert fashion, in line with the stealth capabilities developed for elite US Department of Defense cyber protection teams and high end commercial hunt teams. We’ve detailed only one layer of Endgame’s tiered threat detection strategy: the endpoint. Complementary models exist on the on-premises hunt platform and in the cloud that can provide additional information about threats, including malware, to the hunter as part of the Endgame hunt platform.

Endgame is productizing the latest research in machine learning and practices in data science to revolutionize information security. Although beyond the scope of this blog series, machine learning models are applicable to other stages of the hunt cycle—survey, secure, and respond. We’ve described the machine learning aspect of malware hunting, specifically the ability to identify persistence and never-before-seen malware during the “detect” phase of the hunt. Given the breadth of challenges in the threat and data environment, automated malware classification can greatly enhance an organization’s ability to detect malicious behavior within enterprise networks. 

Capturing 0day Exploits with PERFectly Placed Hardware Traps


As we discussed in an earlier post, most defenses focus on the post-exploitation stage of the attack, by which point it is too late and the attacker will always maintain the advantage. Instead of focusing on the post-exploitation stage, we leverage the enforcement of coarse-grained Control Flow Integrity (CFI) to enhance detection at the exploitation stage. Existing implementations of CFI require recompilation, extensive software updates, or incur a significant performance penalty, making them difficult to adopt and use in the enterprise. At Black Hat USA 2016, we presented our hardware-assisted technique that has proven successful at blocking exploits, while minimizing the impact on performance to ensure operational utility at scale. To enable earlier detection while limiting the impact on performance, we have developed a new concept we’re calling Hardware Assisted Control Flow Integrity, or HA-CFI. This technology utilizes hardware features available in Intel processors to monitor and prevent exploitation in real time, with manageable overhead. By leveraging hardware features we can detect exploits before they reach the “Post-Exploitation” stage and provide stronger protections while defense still has the upper hand.

 

Prior Art and Operational Constraints

Our work builds on previous research that identified the Performance Monitoring Unit (PMU) of microprocessors as a good candidate for enforcing control-flow integrity. The PMU is a specialized unit in most microprocessor architectures that provides useful performance measuring facilities for developers. Most features of the unit are intended to count hardware level events during program execution to aid in program optimization and debugging.

In their paper, Yuan et al. [YUAN11] introduced the novel application of these events to exploit detection for software security. Their research focused on using PMU events along with Branch Trace Store (BTS) messages to correlate and detect code-injection and code-reuse attacks without source code. Xia et al. explored the idea further in their paper CFIMon [XIA12], combining precise event context gathering with the BTS and PEBS to enforce real-time control-flow integrity. In addition to these foundational papers, others have pursued variations on the idea to specifically target exploit techniques such as Return-Oriented Programming.

Alternatively, just-in-time CFI solutions have been proposed using dynamic instrumentation frameworks such as PIN [PIN12] or DynamoRIO [DYN16]. These frameworks dynamically interpret code as it is executed while providing instrumentation functionality to developers. Applying control flow policies with a framework like PIN allows for flexible and reliable checking of code. However, it often incurs a significant CPU overhead, on the order of 10 to 100x, making it unusable in the enterprise.

To ensure our approach to dynamic run-time CFI is relevant to enterprise security while providing significant detection and prevention assurances, we established several functional requirements: the system must work on both 32- and 64-bit operating systems, and it must protect applications without software recompilation or access to source code.

 

Approach

HA-CFI uses PMU-based traps to apply coarse-grained CFI to indirect calls on the x86 architecture. The system uses the PMU to count and trap mispredicted indirect branches in order to validate branch destinations in real-time. In addition to a carefully tuned PMU, a practical implementation of this approach requires support from Intel’s Last Branch Record (LBR) feature, a method for tracking thread context switching in a given OS, and an algorithm for validating branch destination addresses, all while keeping performance overhead to a minimum. After more than a year of fine-tuning these hardware features, we have proven our model is capable of generically detecting control-flow hijacks in real-time with acceptable performance overhead on both Windows and Linux. Because control-flow hijack attacks often stem from a corrupted or modified VTable, many CFI designs focus on validating all indirect branches. Because these call sites have never before jumped to the attacker-controlled address, the hijacked indirect call is almost always mispredicted by the branch prediction unit. Therefore, by focusing only on mispredicted indirect call sites we greatly limit the number of places where a CFI check is necessary.

HA-CFI configures the Intel PMU on each core to count and generate an interrupt on every mispredicted indirect branch. The PMU is capable of delivering an interrupt any time an event counter overflows, and thus HA-CFI sets the initial counter value to -1 and resets the counter to -1 from the interrupt service routine to generate a trap for every occurrence of the event. In this way, the HA-CFI interrupt service routine becomes our CFI component, capable of validating each mispredicted call and determining whether it is the result of malicious behavior. To validate target indirect branch addresses, HA-CFI builds a comprehensive whitelist of valid code pointer addresses as each .dll/.so is loaded into protected processes. When a counter overflows, the Interrupt Service Routine (ISR) compares the mispredicted branch against the whitelist and determines whether the branch is anomalous.
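
For illustration only, the following Python fragment sketches the check the interrupt service routine performs. The production system implements this logic in a kernel-mode driver against the PMU and LBR; the function names, whitelist structure, and bit layout here are simplified assumptions.

```python
# Conceptual illustration of the ISR-time whitelist check; not the real driver code.

# Whitelist built at image-load time: every valid indirect-call destination
# (exported functions, vtable entries, etc.) observed in loaded modules.
valid_call_targets = set()

def on_image_load(code_pointer_addresses):
    """Called when a .dll/.so is mapped into a protected process."""
    valid_call_targets.update(code_pointer_addresses)

def on_pmu_interrupt(lbr_from, lbr_to):
    """Called for each counter overflow; returns True if the branch looks hijacked."""
    mispredicted = bool(lbr_from >> 63)          # MSB of the LBR FROM entry flags a misprediction
    if not mispredicted:
        return False                             # predicted correctly: ignore
    return lbr_to not in valid_call_targets      # unknown destination: possible control-flow hijack
```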

 

Figure 1: High level design of HA-CFI using the PMU to validate mispredicted branches


 

To ensure we minimized the overhead of HA-CFI while maintaining an extremely low false-positive rate, several key design decisions had to be made, and are described below.

The Indirect Branch: On the Intel x86 architecture, an indirect branch can occur at both a CALL or JMP instruction. We focus exclusively on the CALL instruction for several reasons, including the frequent use of indirect JMP branch locations for switch statements. In our experimentation on Linux, we found roughly 12% of hijacked indirect branches occurred as part of an indirect JMP, but occurred even less frequently on Windows. Secondly, ignoring mispredicted JMP instructions further reduces the overhead of HA-CFI. Therefore, we opted to omit mispredicted JMP branches during this research, which can be achieved with settings on the PMU and LBR.

 

Figure 2: A breakdown of hijackable indirect JMP vs CALL instructions found in Windows and Linux x64 binaries

 

Added Precision with the LBR: Given our requirement for real-time detection and prevention of control-flow hijacks, unlike the majority of previous research, we couldn’t use the Intel Branch Trace Store (BTS), which does not permit analysis of the trace data in real-time. Instead, to precisely resolve the exact branch that caused the PMU to generate an interrupt, we make use of Intel’s Last Branch Record (LBR) stack.  A powerful feature of the LBR is the ability to filter the types of branches that are recorded. For example, returns, indirect calls, indirect jumps, and conditional branches can all be included or excluded. With this in mind, we can configure the LBR to only record indirect call branches occurring in user mode. Additionally, the most significant bit of the LBR branch FROM address indicates whether the branch was actually mispredicted. As a result, this provides a quick filter for the ISR to ignore the branch if it was predicted correctly. It’s important to note that we are not iterating over the entire LBR stack, only the most recently inserted branch.

On-Demand PMU-Assisted CFI: HA-CFI is focused on protecting commonly exploited applications such as browsers, mail clients, and Flash. As such, the PMU and LBR are both configured to only operate on mispredicted indirect calls occurring in user mode, ignoring those that occur in ring-0. Moreover, by monitoring thread context switches in both Windows and Linux, we can turn the entire PMU on and off depending upon which applications are being protected. This design decision is perhaps the most critical element in keeping our performance overhead at an acceptable level.

Runtime Whitelist Generation: The final component of the HA-CFI system is the actual integrity check, which involves querying a whitelist data structure containing valid destination addresses for indirect calls. Whitelist generation is performed at run-time for each image loaded into a protected process. We generated the whitelist such that all branches from our dataset could be verified against a hashtable, leaving zero unknown captured branches.

 

Implementation Challenges 


Throughout the course of our research, we encountered numerous hurdles to meeting our original goal of low overhead and high detection rates. First, registering for PMU interrupts on Windows was a major challenge. Our initial prototype was developed under Linux, and transferring the same techniques to Windows proved problematic, especially with regard to Kernel Patch Protection. After significant research, we discovered an undocumented option in the Windows Hardware Abstraction Layer (HAL) that registers a driver-supplied interrupt handler for PMU interrupts. Second, our Windows implementation needed a way to restrict PMU monitoring to only the processes and threads being protected.

The technique we ultimately arrived at makes use of a thread's Asynchronous Procedure Call (APC) mechanism. Windows allows developers to register APC routines for a given thread, which are then added to a queue to be executed at certain points. By keeping an APC registered on all threads that we seek to monitor, we are notified that a thread has resumed execution when our routine runs. The routine re-enables the PMU counter if necessary and updates various tracking metrics. We detect that a processor has swapped out a thread and begun executing another when we receive an interrupt in a different thread context, and we can then disable the PMU counters if needed.

 

Results

To evaluate our system, we measured success both in terms of performance overhead added by HA-CFI as well as detection statistics when tested against various exploits in common client applications, including the most common web browsers, as well as Microsoft Office and Adobe Flash. We sourced exploits from Metasploit modules for testing, as well as numerous live samples from popular Exploit Kits found in the wild.

Performance: After completing our prototype, we were concerned about the overhead of monitoring with HA-CFI and its impact on system performance and usability. Since each mispredicted branch in a monitored process causes an interrupt, there was the potential for a very high number of interrupts to be generated. We subjected our prototype implementations to several tests to measure overhead. Monitoring Internet Explorer 11 while running a JavaScript performance test suite, the driver handled approximately 83,000 interrupts per second on average. In contrast, monitoring an “idle” IE resulted in roughly 1,000 interrupts per second. Our performance analysis revealed that overhead is highly dependent upon the process being protected. For example, with Firefox we saw around 10% overhead while running Dromaeo [DRO16] JavaScript benchmarks, and with the PassMark benchmarking tool we saw 8-10% overhead. With Internet Explorer under heavy usage this number was above 10%, but under normal user behavior the overhead is significantly less. We have deployed HA-CFI on systems in daily use monitoring web browsing, and observed little impact on performance or usability.

Exploit Detection: We extensively tested HA-CFI against a variety of exploits to determine its efficacy against as many bug classes and exploitation techniques as possible, with an emphasis on recent samples using approaches intended to bypass other mitigation measures. We ran one set of tests against more than 15 Metasploit exploits targeting Adobe Flash Player, Internet Explorer, and Microsoft Word. HA-CFI detected and prevented exploitation for each of the tested modules, with an overall detection rate greater than 98%. 

We found the Metasploit results to be encouraging, but came to the conclusion that they did not provide sufficient diversity in exploitation techniques needed to comprehensively test HA-CFI. We used the VirusTotal service to download a set of samples used in real-world exploit kit campaigns from several widespread kits [KAF16]. In total, we tested forty-eight samples comprising twenty unique CVE vulnerabilities. We analyzed the samples to verify that they employed a varied set of both Return-Oriented Programming (ROP) and “ROPless” techniques. HA-CFI succeeded in detecting all 48 samples, with an overall detection rate of 96% in a multiple trial consistency test.

 

Results of VirusTotal Sample Testing, by Exploitation Technique

 

Results of VirusTotal Sample Testing, by Bug Class

 

Conclusion

Modern exploitation techniques are rapidly changing, requiring a new approach to exploit detection. We demonstrated such an approach by using the Performance Monitoring Unit to enforce control flow integrity on branch mispredictions. A whitelist generated at run-time determines the validity of mispredicted indirect call destinations. This approach greatly reduces the overhead of instrumentation by moving policy enforcement to a “coarse-grained” verifier that inspects only mispredicted indirect branch targets. The data also show the efficacy of such a system against samples captured in the wild. These samples, from popular exploit kits, allow us to measure against unknown threats, further validating the approach. As exploits advance, exploit detection must advance as well. Our hardware-assisted CFI (HA-CFI) system has a low performance impact and measurable prevention success against 0day exploits and previously unknown exploitation techniques. Using HA-CFI we have advanced the state of the art, moving detection from the post-exploitation stage to the exploitation stage, to give enterprise-scale security software an upper hand in earlier detection of exploitation. To learn more about pre-exploit detection and mitigation, we'll be discussing our approach during a webinar on August 25th, at 1 pm ET.

 

References

[YUAN11] L. Yuan, W. Xing, H. Chen, B. Zang, “Security Breaches as PMU Deviation: Detecting and Identifying Security Attacks Using Performance Counters”, APSys ’11, July 11-12, 2011.

[PIN12] PIN: A Dynamic Binary Instrumentation Tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

[DYN16] DynamoRIO: Dynamic Instrumentation Tool Platform. http://www.dynamorio.org/

[XIA12] Y. Xia, Y. Liu, H. Chen, and B. Zang, “CFIMon: Detecting violation of control flow integrity using performance counters”, Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-12, IEEE Computer Society, 2012.

[DRO16] Dromaeo JavaScript Benchmark. http://www.dromaeo.com

[EME16] The Enhanced Mitigation Experience Toolkit. https://support.microsoft.com/en-us/kb/2458544

[KAF16] Kafeine. Exploit Kit Samples. http://malware.dontneedcoffee.com/

Instegogram: Leveraging Instagram for C2 via Image Steganography


Social media sites are frequently used for stealthy malware command and control (C2). Because many hosts on most networks communicate with popular social media sites regularly, it is very easy for a C2 channel hiding in this traffic to appear normal. Further, there are often rich APIs for communicating via social media sites allowing a malware author to easily and flexibly use the services for malicious purposes. Blocking the HTTP and HTTPS connections to these sites generally is infeasible since it would likely cause a revolt amongst the workforce. Security researchers have discovered multiple malware campaigns that have used social media for C2 capabilities. For example, Twitter has been used to direct downloaders to websites for installing other malware or for controlling botnets. Further, the information posted to a social media site may be obfuscated or Base64 encoded. Even if the malicious social media content is discovered, it would require an astute defender to recognize the intent.

Separately, a few malware strains (e.g., ZeusVM banking trojan) have used image steganography to very effectively disguise information required to operate malware. The stego image represents an effective decoy, with information subtly encoded within something that is seemingly innocuous. This can make malicious intent difficult to uncover. In the case of ZeusVM, an encoded image contained a list of financial institutions that the malware was targeting. However, even a close look at the image would not reveal the presence of a payload. The image payloads were discovered only because the stego images were retrieved on a server containing other malicious files.

We combine these two methods for hiding in plain sight and demonstrate "Instegogram", wherein we hide C2 messages in digital images posted to the social media site Instagram. We presented this research earlier this month at Defcon’s Crypto Village, as part of our larger research efforts that leverage our knowledge of offensive techniques to build more robust defenses. Since our research aims to help inform and strengthen defenses, we conclude with a discussion of some simple approaches for preventing steganographic C2 channels on social media sites, such as bit jamming methods and analysis of user account behaviors.

 

A Brief History of Steganography

Steganography is the art of hiding information in plain sight and has been used by spies for centuries to conceal information within other text or data. Recently, digital image steganography has been used to obfuscate configuration information for malware. The following timeline contains a brief history of steganography used in actual malware case studies:

 

JPEG-Robust Image Steganography Techniques & Instagram

Digital image steganography involves altering bits within an image to conceal a message payload.  Let’s say Alice encodes a message in some of the bits of a cover image and sends the stego image (with message payload) to Bob, who decodes the message using a private key that he shares with Alice. If Eve intercepts the image in transit, she is oblivious to the fact that the stego image contains any message at all since the image appears to be totally legitimate both digitally and to the human eye.  

A simple steganography approach demonstrates how this is possible.  Alice wants to encode the bits 0100 into an image whose first 4 pixels are 0x81, 0x80, 0x7f, 0x7e.  Alice and Bob agree (private key) that the message will be encoded in row-major order in the image using the least significant bit (LSB) to determine the message bit.  Since the LSBs of the first 4 pixels are 1010, Alice must flip the LSBs of the first three pixels so that the LSBs are equal to the desired message bits.  Since modifying the LSBs of a few pixels changes the pixel intensity by 1/255, the stego image appears identical to the original cover image.
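
The exchange above maps directly to a few lines of code. The sketch below is a minimal LSB embed/extract routine over raw pixel values, using the same four pixels and message bits as the example; it is illustrative only and ignores image I/O entirely.

```python
# Minimal LSB-embedding sketch: set each pixel's least significant bit to the
# corresponding message bit, in row-major order.
def embed_lsb(pixels, bits):
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit       # clear the LSB, then set it to the message bit
    return out

def extract_lsb(pixels, n_bits):
    return [p & 1 for p in pixels[:n_bits]]

stego = embed_lsb([0x81, 0x80, 0x7F, 0x7E], [0, 1, 0, 0])
print([hex(p) for p in stego])             # ['0x80', '0x81', '0x7e', '0x7e']
print(extract_lsb(stego, 4))               # [0, 1, 0, 0]
```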

 

Additionally, since Instagram re-encodes uploaded images in JPEG/JFIF form, the steganography encoding for Instegogram must be robust against JPEG compression artifacts.

JPEG images are compressed by quantizing the coefficients of a 2D block discrete cosine transform (DCT).  The DCT is applied after a color transformation from RGB to YCbCr color space, which consists of a luminance (Y) channel and two chrominance channels, blue (Cb) and red (Cr).  The DCT on each channel has the effect of condensing most of an image block’s information into a few upper-left (low frequency) coefficients in the DCT domain.  The lossy compression step in JPEG comes from applying an element-wise quantization to each coefficient in the DCT coefficient image, with the amount of quantization per coefficient determined by a quantization table specified in the JPEG file.  The resulting quantized coefficients (integers) are then compactly encoded to disk.

 

One method of encoding a message in a JPEG image is to encode the message bits in the quantized DCT coefficients rather than the raw image pixels, since there are no subsequent lossy steps.  But for use with Instagram, an additional step is required. Upon upload, Instagram standardizes images by resizing and re-encoding them using the JPEG file format.  This presents two pitfalls in which the message can be clobbered in the quantized DCT coefficients: (1) if Alice's stego image is resized, the raw DCT coefficients can change, since the 8x8 block in the original image may map to a different rectangle in the resized image; (2) if Alice's stego image is recompressed using a different quantization table, this so-called double-compression can change the LSBs that may contain the secret message.
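
The sketch below shows the essence of DCT-domain embedding for a single 8x8 block: transform, quantize, and set the LSB of a chosen quantized coefficient. The quantization table, coefficient position, and data are placeholder assumptions, not the actual Instegogram encoder.

```python
# Sketch of embedding one message bit in a quantized DCT coefficient of an 8x8 block.
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    return dct(dct(block.T, norm="ortho").T, norm="ortho")

def idct2(coeffs):
    return idct(idct(coeffs.T, norm="ortho").T, norm="ortho")

qtable = np.full((8, 8), 16)                        # placeholder quantization table
block = np.random.randint(0, 256, (8, 8)).astype(float) - 128   # level-shifted pixels

coeffs = dct2(block)
quantized = np.round(coeffs / qtable).astype(int)   # the values actually stored by JPEG

bit, r, c = 1, 3, 2                                 # embed one bit in a mid-frequency coefficient
quantized[r, c] = (quantized[r, c] & ~1) | bit

stego_block = idct2(quantized * qtable) + 128       # what the decoded pixels would look like
```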

To prevent the effects of resizing, all cover images can be resized in advance to a size and aspect ratio that Instagram will accept without resizing.  To prevent the double-compression problem, the quantization table can be extracted from an existing Instagram image; over the course of this research, monitoring these tables suggested that the same quantization table is used across all Instagram images.  Messages are then encoded in cover images that use those same quantization tables.
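
Extracting the tables is straightforward with Pillow, as sketched below; the filename is a placeholder for an image previously downloaded from the service.

```python
# Sketch: read the quantization tables out of a JPEG previously downloaded from
# Instagram, so cover images can later be re-encoded with the same tables.
from PIL import Image

img = Image.open("downloaded_from_instagram.jpg")   # placeholder filename
print(img.quantization)   # dict: table id -> list of 64 quantization values
print(img.size)           # encode covers at a size the service will not resize

# Pillow's JPEG encoder accepts a qtables argument, e.g.:
# cover.save("stego.jpg", format="JPEG", qtables=img.quantization)
```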

A number of other standard practices can be used to provide additional robustness against image manipulations that might occur upon upload to Instagram.  First, pre-encoding the message payload using error correcting coding allows one to retrieve the message even when some bits become corrupted.  For small messages, the encoded message can be repeated in the image bits until all image bits have been used.  A message storing format that includes a header to specify the message length allows the receiver to determine when the message ends and a duplicate copy begins.  Finally, for a measure of secrecy, simple methods for generating a permutation of pixel locations (for example, a simple linear congruential generator with shared seeds between Alice and Bob) can communicate to Bob the order in which the message bits are arranged in the image.
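
Two of those tricks, repetition coding with majority-vote decoding and a shared-seed permutation of bit positions, are sketched below. These are generic illustrations under simple assumptions, not the exact scheme used in the POC.

```python
# Robustness sketches: repetition coding with majority-vote decoding, and a
# linear congruential generator (LCG) that orders bit positions from a shared seed.
from collections import Counter

def repeat_encode(bits, capacity):
    """Tile the message bits until the channel capacity is filled."""
    return [bits[i % len(bits)] for i in range(capacity)]

def repeat_decode(received, msg_len):
    """Majority vote across the repeated copies of each message bit."""
    votes = [Counter() for _ in range(msg_len)]
    for i, bit in enumerate(received):
        votes[i % msg_len][bit] += 1
    return [v.most_common(1)[0][0] for v in votes]

def lcg_permutation(n, seed, a=1103515245, c=12345, m=2**31):
    """Deterministic ordering of n positions derived from a seed Alice and Bob share."""
    state, order, seen = seed, [], set()
    while len(order) < n:
        state = (a * state + c) % m
        pos = state % n
        if pos not in seen:
            seen.add(pos)
            order.append(pos)
    return order
```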

 

Instegogram Overview

Digital image steganography can conceal any message.  Steganography appeals to malware authors for C2 because of its inherent stealthiness.  Hosting C2 on social media sites is appealing for the same reason: it is hard to filter out or identify as malicious.  Instegogram combines these two capabilities, digital image steganography and social media C2, mirroring the years-long rise in the use of social networks for C2 while exploring the feasibility of using stego on one particular site: Instagram.

The delivery mechanism for our proof of concept (POC) malware was chosen based on one of today’s most commonly used infiltration methods: a successful spearphish that causes the user to open a document and run a malicious macro. The remote access trojan (RAT) is configured to communicate with specific Instagram accounts that we control, to which we post images containing messages encoded with our steganographic scheme. The malware includes a steganographic decoder that extracts a payload from each downloaded image, allowing arbitrary command execution on the remote system.

The malicious app continuously checks the Instagram account feed for the next command image, decodes the shell command, and executes it to trigger whatever nefarious behavior is requested in the command.  Results are embedded steganographically in another image which is posted to the same account.  

As with any steganographic scheme, there is a limited number of characters which can be sent through the channel. In this simple POC, 40 characters could be reliably transmitted in the JPEG stego images.  The capacity can be increased using more robust coding techniques as discussed previously.

In short, once the remote system is compromised, encoded images can be posted from the command machine using Instagram’s API. The remote system will download the image, decode it, execute the encoded commands, encode the results in another image, and post back to Instagram. This process can be repeated at will.  This attack flow is depicted in the graphic below.

 

Our Instegogram POC was built on Mac OS X, specifically as an uncertified macOS app developed in Objective-C.  To execute our POC, we needed to bypass Apple’s built-in Gatekeeper protection, which enforces code signing requirements on downloaded applications and thereby makes it more difficult for adversaries to launch a malicious application on an endpoint. We discovered a Gatekeeper bypass, disclosed it to Apple, and are currently working with Apple on a fix.  

 

Instagram API and Challenges

Instagram's API is only partially public. They encourage third-party apps to use likes, subscriptions, requests for images, and similar actions. But, the specifics of calls for uploads are only provided via iPhone hooks and Android Intents. The webapp is a pared down version of the mobile app and doesn't allow uploads. This is likely to deter spam, bots, and maybe malware C2 architectures.

To work as an effective C2 system, we need to be able to integrate our system with Instagram's to automate uploads, downloads, and comments. Creating an image with an embedded message, transferring it to a cell phone, and then uploading it via the official Instagram app is too cumbersome to be useful. We needed to reverse the upload API.

With the help of some open source efforts and a little web-proxy work, we identified the required fields and formats for all the necessary API calls. Charles Proxy was used to sniff out the payloads of API requests coming from the phone app itself. After the general structure was determined, we used a fake Android user-agent, crafted the body of the request, and were in business. The code demonstrating this is in the "api_access" section of Endgame’s github repo.

It is worth noting that the steganographic encode/decode capabilities and capability to programmatically interact with the Instagram API are not specific to a malware use case.  They are tools that can be used for other purposes such as secure hidden messaging.  

 

Detecting & Preventing Instegogram

As previously mentioned, this research is motivated by the requirement to strengthen defenses. Therefore, after successfully implementing the malware, we identified the most robust means to detect and prevent Instegogram.  There is a range of measures for detecting and preventing C2 via image steganography as well as additional non-steganography measures that can be implemented.

First, a complementary field of research to steganography is steganalysis, in which a defender aims to:  (a) predict whether an image may contain a steganographic payload, and if detected, (b) attempt to recover the message from the stego image.  However, using statistical techniques to detect whether an image is corrupted may be infeasible for very small message payloads relative to the image size.  Payload recovery is a difficult cryptanalysis problem in its own right, but is of interest only for forensic analysis of a detected C2 channel.  Given the challenges in successfully implementing these steganalysis techniques through per-image or per-user defensive strategies, we don’t recommend a steganalysis approach.

In contrast, a much simpler set of measures can be implemented by the social media site owner via site-wide policies that effectively jam potential stego traffic.  For example, one can create an effectively noisy stego channel for steganographic C2 with minimal visual distortion through one or more of the following methods:

  1. Regularly and pseudorandomly change the site-wide JPEG quantization table used for re-encoding images.  This can induce double-compression problems for simple stego routines like our POC that rely on a specific quantization table.  Introducing a quantization table mismatch can reduce the channel capacity for communication, and force sophisticated attackers to resort to more robust and advanced information hiding methods.
  2. Implement other minor transformations that are visually imperceptible but digitally alter the image, such as cropping the image by a few boundary pixels, randomly shifting the image by a few pixels, and/or randomly flipping the LSBs of quantized bits.  While this does not qualitatively alter the image, it creates a challenging environment for transmitting information via stego C2.
  3. Institute a policy that requires mandatory application of an Instagram filter, which represents a nonlinear transformation in the image domain.  While this qualitatively changes the image, it represents a “visually appealing” change, whilst also providing an effective attack on possible stego C2.

In addition to the stego-focused mitigations, the social media provider can attempt to detect accounts which may be hosting C2.  For example, the provider could look for account access anomalies such as a single account being used in rapid succession from geographically distant locations.  Anomalies may also be visible in account creation or other factors.

Non-steganography based security policies can also be implemented by a local admin in an attempt to defend against this attack:

  1. Limit access to third-party websites: Like any other remote access trojan or backdoor, you want a policy that limits connections to the command and control server. The simplest technical solution would be to block third-party websites entirely if they are not related to the mission. As mentioned earlier, although this is a simple solution, it may be infeasible due to workforce demands.
  2. Outliers in network behavior: Network monitoring can be configured to detect anomalous network behavior, such as responses from multiple infected machines that utilize a single Instagram account for C2.
  3. Android string detection: The Instagram API client we provided utilizes an Android network configuration. If your network consists primarily of Windows endpoints, a user-agent string containing Android identifiers would be an obvious anomaly.
  4. Disable VBA Macros: If not needed, Microsoft Windows Office VBA Macros should be disabled to avoid spearphishing campaigns utilizing the infiltration technique adopted by the POC.

 

Conclusion

We accomplished our goal of creating a proof of concept malware that utilizes a C2 architecture for nearly untraceable communications via social media and encoded images. We demonstrated a small message channel with the simplest of steganography algorithms.  By combining digital image steganography with a prominent trend in malware research - social media for C2 - our research reflects the ongoing vulnerabilities of social media, as well as the novel and creative means of exploiting these vulnerabilities against which modern defenses must be hardened.

There are numerous defensive measures that can be taken to detect and prevent malware similar to Instegogram. As noted, applying Instagram filters or identifying preprocessed images based on quantization tables could be a powerful prevention strategy.  Account anomalies can also point at potential issues.  However, this is really only possible on the provider side, not the local defender side.  

This type of attack is difficult to defend against within a given enterprise.  Blocking social media services will be infeasible in many organizations.  Local defenders can look at policy-based hardening and more general anomaly detection to defend against attacks like our proof of concept.  These steps will help defend against an even broader set of malware.  

 

Malware Timeline Sources

 


Influencing Elections in the Digital Age


Throughout history, foreign entities have meddled in the internal affairs of other countries, influencing leadership tenure, reputations, and elections. Whether it’s a coup receiving external support, such as last year’s attempted coup in Burundi, or major power politics battling it out, as between East and West during the Cold War, external actors often attempt to influence domestic elections to achieve a variety of objectives. Yesterday, The Washington Post reported that the US is investigating the possibility of covert Russian operations to influence this fall’s presidential elections. Following the DNC hack, the DCCC hack, and last week’s news about breaches of Illinois and Arizona’s voter databases, this is just the latest potential indication of Russian tactics aimed at undermining the US elections and democracy.

For years, Russian President Vladimir Putin has exerted domestic control of the Internet, employing his team of Internet trolls to control the domestic narrative. Building upon the domestic success of this tactic, he has also employed it in Ukraine, and other former Eastern bloc countries. As targeted sanctions continue to strangle the Russian economy, coupled with low oil prices, Putin’s international behavior predictably reflects a rational-actor response to his domestic situation. He is going on the offensive, while attempting to undermine the very foundation of US democracy.  The potential for Russian covert operations reinforces the point that offensive cyber activity does not occur in a stovepipe from other cross-domain geo-political strategies. Instead, it is one part of a concerted, multi-pronged effort to achieve strategic objectives, which may include undermining the foundation of US democracy domestically and internationally.

 

US Election Digital Hacking: A brief history

At least as far back as 2004, security professionals have warned that elections can be digitally hacked. By some accounts, the US experienced its first election hack in 2012, when Miami-Dade County received several thousand fraudulent phantom absentee ballot requests. It wasn’t the only political entity subject to hacking attempts. Both the Romney and Obama campaigns also experienced attempted breaches by foreign entities seeking to access databases and social media sites. Even earlier, in 2008, the Obama and McCain campaigns suffered the loss of “a serious amount of files” from cyber attacks, with China or Russia the key suspects based on the sophistication of the attacks. Clearly, this has been an ongoing problem with little fix in sight. Many states continue to push forward with electronic voting systems vulnerable to common yet entirely preventable attacks, and many have yet to implement a paper trail.

 

Hacking an Election around the World

The US is not the only country that has or continues to experience election hacking. External actors can pursue multiple strategies to influence an election. In the digital domain, these can roughly be bucketed into the following:

  • Data theft: This is not merely a concern in the United States, but is a global phenomenon and is likely to grow as elections increasingly become digitized. The hack on the Philippines Commission on Elections exposed the personally identifiable information of 55 million citizens. The Ukraine election system was attacked prior to the 2014 election with a virus intended to delete the election results. More recently, Taiwan is experiencing digital attacks (likely from China) aimed at gaining opposition information. Hong Kong is in a similar situation, with digital attacks targeting government agencies leading up to legislative elections. The data theft often occurs for espionage purposes, with the stolen data either kept covert for intelligence gain or disclosed as a large data dump, such as the DNC data leak.
  • Censorship: Government entities – largely in authoritarian countries – attempt to control social media prior to elections. As a recent article noted, “Election time in Iran means increased censorship.” This is not just true of Iran, but of many countries across the globe. In Africa, countries as diverse as Ghana, Ethiopia, and Republic of Congo have censored social media during the election season. Ugandan president Yoweri Museveni, who has ruled for 30 years, has deployed Internet censorship, targeting specific social media sites, as a means to influence elections.  Last year, Turkey similarly cracked down on social media sites prior to the June elections, similar to Erdogan’s tactics in 2014 during presidential elections. Of course, a government’s censorship often coincides with the electorate’s circumvention of the censorship, which can be analyzed through the use of anti-censorship tools like Tor or Psiphon.
  • Disinformation: In one of the most intriguing alleged ‘election hacks’ to date, earlier this year Bloomberg reported the case of a man who may have literally rigged elections in nine Latin American countries. He did not do this solely via data theft, but rather through disinformation. Andrés Sepúlveda would conduct digital espionage to gather information on the opposition, while also running smear campaigns and influencing social media.  This is similar to Putin’s tactics via the Trolls, as he created fake YouTube videos, Wikipedia entries, and countless fake social media accounts. While the Russian trolls have largely focused domestically or in Eastern Europe, there is indication that they are increasingly posing as supporters within the US election.

 

Adjusting the Risk Calculus

Elections are just one of the many socio-political events that experience heightened malicious activity. However, within democracies, elections are the bedrock of the democratic process, and manipulation of them can undermine both domestic and international legitimacy.  At this weekend’s G-20 summit, President Obama and Putin met to discuss, among other things, acceptable digital behavior. President Obama noted the need to avoid the escalation of an arms race as has occurred in the past, stressing the need to institutionalize global norms.

This is an important point, as digital espionage and propaganda – especially when aimed against a core foundation of democracy – do not necessitate digital responses. In yesterday’s The Washington Post article, Senator Ben Sasse urged Obama to publicly name Russia as the source behind the latest string of attacks on the US presidential elections. He noted, “Free and legitimate elections are non-negotiable. It’s clear that Russia thinks the reward outweighs any consequences….That calculation must be changed. . . . This is going to take a cross-domain response — diplomatic, political and economic — that turns the screws on Putin and his cronies.”

A key feature of any deterrence strategy is to adjust the risk calculus. Russia’s risk calculus remains steadfast, with the benefits of disinformation and data theft clearly outweighing the costs. The latest investigation of Russian covert operations further elucidates that the cyber domain must not be viewed solely through a cyber lens, as is too often the case. Instead, nefarious digital activity is part of the broader strategic context, requiring creative, cross-domain solutions that impact an adversary’s risk calculus, while minimizing escalation. 

How to Hunt: Detecting Persistence & Evasion with the COM


After adversaries breach a system, they usually consider how they will maintain uninterrupted access through events such as system restarts. This uninterrupted access can be achieved through persistence methods. Adversaries are constantly rotating and innovating persistence techniques, enabling them to evade detection and maintain access for extended periods of time. A prime example is the recent DNC hack, where it was reported that the attackers leveraged very obscure persistence techniques for some time while they evaded detection and exfiltrated sensitive data.

The number of ways to persist code on a Windows system can be counted in the hundreds and the list is growing.  The discovery of novel approaches to persist is not uncommon. Further, the mere presence of code in a persistence location is by no means an indicator of malicious behavior as there are an abundance of items, usually over a thousand, set to autostart under various conditions on a standard Windows system. This can make it particularly challenging for defenders to distinguish between legitimate and malicious activity.

With so many opportunities for adversaries to blend in, how should organizations approach detection of adversary persistence techniques? To address this question, Endgame is working with The MITRE Corporation, a not-for-profit R&D organization, to demonstrate how the hunting paradigm fits within the MITRE ATT&CK™ framework. The ATT&CK™ framework—which stands for Adversarial Tactics, Techniques & Common Knowledge—is a model for describing the actions an adversary can take while operating within an enterprise network, categorizing actions into tactics, such as persistence, and techniques to achieve those tactics. Endgame has collaborated with MITRE to help extend the ATT&CK™ framework by adding a new technique – COM Object Hijacking– to the persistence tactic, sparking some great conversations and insights that we’ve pulled together into this post. Thanks to MITRE for working with Endgame and others in the community to help update the model, and a special thanks to Blake Strom for co-authoring this piece. Now let the hunt for persistence begin!

 

Hunting for Attacker Techniques

Hunting is not just the latest buzzword in security. It is a very effective process for detection as well as a state of mind. Defenders must assume breach and hunt within the environment continually as though an active intrusion is underway. Indicators of compromise (IOC) are not enough when adversaries can change tool indicators often. Defenders must hunt for never-before-seen artifacts by looking for commonly used adversary techniques and patterns. Given constantly changing infrastructure and the increasingly customized nature of attacks, hunting for attacker techniques greatly increases the likelihood of catching today’s sophisticated adversaries.

Persistence is one such tactic for which we can effectively hunt. Defenders understand that adversaries will try to persist and generally know the most common ways this can be done. Hunting in persistence locations for anomalies and outliers is a great way to find the adversary, but it isn’t always easy. Many techniques adversaries use resemble ways software legitimately behaves on a system. Adversary persistence behavior in a Windows environment could show up as installing seemingly benign software to run upon system boot, when a user logs into a system, or even more clever techniques such as utilizing Windows Management Instrumentation (WMI). Smart adversaries know what is most common and will try to find poorly understood and obscure ways to persist during an intrusion in order to evade detection.

MITRE has provided the community with a cheat sheet of persistence mechanisms through ATT&CK™, which describes the universe of adversary techniques to help inform comprehensive coverage during hunt operations. It includes a wide variety of techniques ranging from simply using legitimate credentials to more advanced techniques like component firmware modification approaches. The goal of an advanced adversary is not just to persist - it is to persist without detection by evading common defensive mechanisms as well. These common evasion techniques are also covered by ATT&CK™. MITRE documented these techniques using in-depth knowledge about how adversaries can and do operate, like with COM hijacking for persistence.

 To demonstrate the value of hunting for specific techniques, we focus on Component Object Model (COM) Hijacking, which can be used for persistence as well as defense evasion.

 

So what’s up with the COM?

Microsoft’s Component Object Model (COM) has been around forever – well not exactly – but at least since 1993 with MS Windows 3.1. COM basically allows for the linking of software components. This is a great way for engineers to make components of their software accessible to other applications. The classic use case for COM is how Microsoft Office products link together. To learn more, Microsoft’s official documentation provides a great, comprehensive overview of COM.

Like many other capabilities attackers use, COM is not inherently malicious. However, there are ways it can be used by the adversary which are malicious. As we discussed earlier, most adversaries want to persist. Therefore, hunters should regularly look for signs of persistence, such as anomalous files which are set to execute automatically. Adversaries can cleverly manipulate the COM to execute their code, specifically by manipulating software classes in the current user registry hive, and enabling persistence.

But before we dive into the registry, let’s have a quick history lesson. Messing with the COM is not an unknown technique by any means. Even as early as 2005, adversaries were utilizing Internet Explorer to access the machine’s COM to cause crashes and other issues. Check out CVE-2005-1990 or some of CERT’s vulnerability notes discussing exactly this problem.

COM object hijacking first became mainstream in 2011 at the Virus Bulletin conference, when Jon Larimer presented “The Dangers of Per-User COM Objects.” Hijacking is a fairly common term in the infosec community, and describes the action of maliciously taking over an otherwise legitimate function at the target host: session hijacking, browser hijacking, and search-order hijacking to name a few. It didn’t take long for adversaries to start leveraging the research presented at Virus Bulletin 2011. For example, in 2012, the ZeroAccess rootkit started hijacking the COM, while in 2014 GDATA reported a new Remote Administration Tool (RAT) dubbed COMpfun which persists via a COM hijack. The following year, GDATA again reported the use of COM hijacking, with COMRAT seen persisting via a COM hijack. The Roaming Tiger Advanced Persistent Threat (APT) group reportedly also used COM hijacking with the BBSRAT malware. These are just a few examples to demonstrate that COM hijacking is a real concern which hunters need to consider and handle while looking for active intrusions in the network.

 

The Challenges and Opportunities to Detect COM Hijacking

Today, COM hijacking remains relevant, but is often forgotten. We see it employed by persistent threats as well as included in crimeware. Fortunately, we have one advantage - the hijack is fairly straightforward to detect. To perform the hijack, the adversary relies on the operating system to load current user objects prior to the local machine objects in the COM. This load order is the fundamental principle behind the hijack and also the key to detecting it.

Easy, right? Well, there are some gotchas to watch out for. First, most existing tools detect COM hijacking through signatures. A COM object is identified in the system by a globally unique identifier called a CLSID. A signature-based approach will only look at and alert on specific CLSIDs which reference an object that has been previously reported as hijacked. This is nowhere near enough because, in theory, any COM object and any CLSID could be hijacked.

Second, for us hunters, the presence of a user COM object in general can be considered anomalous, but some third-party applications will generate such objects causing false positives in your hunt. To accurately find COM hijacks, a more in-depth inspection within the entire current user and local machine registry hive is necessary. In our single default Windows 7 VM, we had 4697 CLSIDs within the local machine hive. To perform the inspection, you will need to dust off your scripting skills and perform a comparative analysis within the registry. This could become difficult and may not scale if you are querying thousands of enterprise systems, which is why we baked this inspection into the Endgame platform.
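To make that comparative analysis concrete, here is a minimal sketch in Python using the standard winreg module. It flags per-user CLSIDs that shadow a machine-wide registration - the load-order behavior a hijack abuses. The key paths are the standard Windows locations, but the output, filtering, and enrichment are left as an exercise and will depend on your environment.

    import winreg

    def hkcu_clsid_overrides():
        # Per-user COM registrations (HKCU\Software\Classes\CLSID) are loaded
        # before machine-wide ones, which is the behavior a hijack abuses.
        # Flag any per-user CLSID that shadows an HKLM registration.
        overrides = []
        try:
            user_root = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"Software\Classes\CLSID")
        except OSError:
            return overrides  # no per-user COM classes registered at all
        with user_root:
            index = 0
            while True:
                try:
                    clsid = winreg.EnumKey(user_root, index)
                except OSError:
                    break  # no more subkeys
                index += 1
                try:
                    winreg.CloseKey(winreg.OpenKey(
                        winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\Classes\CLSID" + "\\" + clsid))
                    overrides.append(clsid)
                except OSError:
                    pass  # user-only CLSID; still worth review, but not a shadowing entry
        return overrides

    if __name__ == "__main__":
        for clsid in hkcu_clsid_overrides():
            print("Possible COM hijack candidate:", clsid)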


So many objects to hijack...

 

At Endgame, we inspect the registry to hunt for exactly these artifacts across all objects, and this investigation scales across an entire environment. This is critical because hunters need to perform their operations in a timely and efficient manner. Please reference the following video to see a simple COM hijack and automatic detection with the Endgame platform. Endgame enumerates all known persistence locations across a network, enriches the data, and performs a variety of analytics to highlight potentially malicious artifacts in seconds. COM hijacking detection is one capability of many in the Endgame platform.

 

Hunting for COM Hijacking using Endgame

 

Conclusion

Persistence is a tactic used by a wide range of adversaries.  It is part of almost every compromise. The choice of persistence technique used by an adversary can be the most interesting and sophisticated aspect of an attack. This makes persistence, coupled with the usual defense evasion techniques, prime focus areas for hunting and subsequent discovery and remediation. Furthermore, we can’t always rely on indicators of compromise alone. Instead, defenders must seek out anomalies within the environment, either at the host or in the network, which can reveal the breadcrumbs to follow and find the breach.

Without a framework and intelligent automation, the hunt can be time-consuming, resource-intensive, and unfocused. MITRE’s ATT&CK™ framework provides an abundance of techniques that can guide the hunt in a structured way. With this as a starting point, we have explored one persistence technique in depth: COM hijacking. COM hijacks can be detected without signatures through intelligent automation and false positive mitigation, overcoming many of the challenges an analyst would face trying to find COM hijacks manually. This is just one way in which a technique-focused hunt mindset can allow defenders to detect, prevent, and remediate those adversaries that continue to evade even the most advanced defenses.

Hunting for Exploit Kits


E-mail spam and browser exploitation are two very popular avenues used by criminals to compromise computers.  Most compromises result from human error, such as clicking a malicious link or downloading and opening an attachment within an email and enabling macros.  Email filtering can offer effective protections against the delivery of widespread malicious spam, and user training can be reasonably effective in reducing the number of employees willing to open an unknown document, enable macros, and self-infect.

Protection against browser exploitation, which occurs simply by visiting a website, is more difficult. Users are prone to clicking links to malicious sites, and much worse, criminals actively exploit well-trafficked sites (directly through web server exploitation or indirectly via domain squatting, ad injections, or other techniques) and cause the user’s browser to covertly visit an exploit server. The user’s browser gets hit, a malicious payload such as ransomware is installed, and the user has a bad day.

The hardest part of this for attackers is the exploitation code itself. Fortunately for criminals, a thriving underground market for exploit kits is available. These occasionally contain zero days, but more often, exploit kit authors rapidly weaponize new vulnerabilities, allowing them to exploit users who fail to patch their systems quickly.

Exploit kits are often key components of crimeware campaigns, a global market estimated at $400 billion. Capturing these often evasive exploit kits is essential to advance research into protecting against them, but samples are hard to obtain for researchers. To solve this problem, we created Maxwell, an automated exploit kit collection and detection tool that crawls the web hunting for exploits. For researchers, Maxwell significantly decreases the time it takes to find exploit kit samples, and instead enables us to focus on the detection and prevention capabilities necessary to counter the growing criminal threat of exploit kits.

 

Exploit Kits in the Wild

The Angler exploit kit - responsible for a variety of malvertising and ransomware compromises - is indicative of just how lucrative these exploit kits can be. By some estimates, Angler was the most lucrative compromise platform for crimeware, reeling in $60 million annually in ransomware alone. Earlier this year, a Russian criminal group was arrested in connection with the Lurk Trojan. This coincided with an end of Angler exploit kit activity. A battle for market share has ensued since, with RIG and Neutrino EK jockeying for the market leading position.

A typical business model for exploit kit authors is malware as a service. The authors rent access to their exploit kits to other criminals for several thousand dollars a month, on average.

Other criminal groups instead opt to focus more on traffic distribution services or gates, and set out to compromise as many web servers as possible. Once compromised, they can insert iframe re-directions to websites of their choosing. The users of exploit kits can pay for this as a service to increase the amount of traffic their exploit kits receive and drive up infections.

 

The Exploitation Process

The graphic below depicts a high-level overview of the six core steps of the exploitation process. There are numerous existing case studies on exploit kits, such as one on Nuclear, that provide additional, very low level details on this process.

 


  1. A user visits a legitimate web page.
  2. Their browser is silently redirected to an exploit kit landing page. This stage varies, but focuses on traffic redirection. Either the attacker compromises the website and redirects the user by adding JavaScript or an iframe, or the attacker pays for an advertisement to perform the redirection.
  3. The exploit kit sends a landing page which contains JavaScript to determine which browser, plugins and versions the user is running. Neutrino EK is a little different in that this logic is embedded in a Flash file that is initially sent.
  4. If the user’s configuration matches a vulnerability to a particular exploit (e.g., an outdated Flash version), the user’s browser will be directed to load the exploit.
  5. The exploit’s routines run, and gain code execution on the user’s machine.
  6. The exploit downloads and executes the chosen malware payload. Today, this is usually ransomware, but it can also be banking trojans, click fraud, or other malware.

 

How to Catch an Exploit Kit: Introducing Maxwell

There are numerous motivations for collecting and analyzing exploit kits. As a blue teamer, you may want to test your defenses against the latest and greatest threat in the wild. As a red teamer, you may want to do adversary emulation with one of the big named exploit kits (e.g., Neutrino, RIG, Magnitude). Or maybe you have some other cool research initiative. How would you go about collecting new samples and tracking activity? If you work for a large enterprise or AV company,  it is relatively easy as your fellow employees or customers will provide all the samples you need.  You can simply set up packet collection and some exploit kit detections at your boundary and sit back and watch.  But what if you are a researcher, without access to that treasure trove of data? That’s where Maxwell comes in. Maxwell is an automated system for finding exploit kit activity on the internet. It crawls websites with an army of virtual machines to identify essential information, such as the responsible exploit kit, as well as related IPs and domains.

 

The Maxwell Architecture

Maxwell consists of a central server, which is basically the conductor or brains of the operation, connecting to a vSphere or other cloud architecture to spin up and down virtual machines. RabbitMQ and ElasticSearch provide the means for message queuing and indexing the malicious artifacts. The virtual machine consists of a variety of Python agent scripts to enable iterative development, as well as a pipe server that receives messages from the instrumentation library, filters those that match a whitelist, and forwards the remaining messages to a RabbitMQ server.

Flux is our instrumentation library, a DLL loaded into new processes with a simple AppInit DLL key. Flux hooks the usual functions for dropping files, creating registry keys, process creation, etc. The hooking is done only in user-mode at the Nt function level. The hooks must be at the lowest possible level in order to capture the most data. Flux also has some exploit detection capabilities built in, and shellcode capturing, which will be discussed shortly.
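The AppInit mechanism itself boils down to two registry values under the Windows NT key. The sketch below (Python winreg, hypothetical DLL path - not Maxwell's actual setup code) shows what that registration looks like; it requires administrative rights, and newer Windows versions restrict AppInit DLLs to signed libraries when Secure Boot is enabled.

    import winreg

    # Hypothetical DLL path. AppInit_DLLs causes the listed DLL to be loaded into
    # every process that links user32.dll once LoadAppInit_DLLs is set to 1.
    APPINIT_KEY = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Windows"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, APPINIT_KEY, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "AppInit_DLLs", 0, winreg.REG_SZ, r"C:\maxwell\flux.dll")
        winreg.SetValueEx(key, "LoadAppInit_DLLs", 0, winreg.REG_DWORD, 1)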

Moving to outside the virtual machine, the controller is a Python script that listens on a RabbitMQ channel for new jobs, including basic information like the website to visit, a uuid, and basic config information. Once a new job is received, the controller is responsible for spinning up a virtual machine and sending the job information and the plugin to execute (which is currently only Flux). The controller uses ssh to copy files into the virtual machine. The results server is also a Python script that listens on a different RMQ channel. This script receives data from the virtual machines during execution. The data is forwarded to an Elasticsearch index for permanent storage and querying. Once a job has completed, the results server determines if any malicious activity has occurred. If so, it executes post processing routines. Finally, all extracted data and signature information is sent in a notification to the researcher. An important design decision worth noting is to stream events out of the virtual machine during execution, as opposed to all at once after a timeout. The latter is susceptible to losing information after ransomware wreaks havoc in the virtual machine.
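A stripped-down results server along those lines might look like the sketch below (Python with the pika and elasticsearch client libraries; hosts, queue names, index name, and the whitelist entry are placeholders rather than Maxwell's real configuration). The important part is the streaming design: events are filtered and indexed as they arrive instead of being collected after a timeout.

    import json
    import pika
    from elasticsearch import Elasticsearch

    # Hosts, queue names, and the whitelist entry are placeholders.
    es = Elasticsearch("http://localhost:9200")
    WHITELIST = {r"C:\Windows\Temp\example-benign.log"}  # benign artifacts seen during clean browsing

    def on_event(channel, method, properties, body):
        event = json.loads(body)
        if event.get("path") not in WHITELIST:
            # Stream each event into the index as it arrives, so nothing is lost
            # if ransomware later trashes the virtual machine.
            es.index(index="maxwell-events", document=event)  # 8.x client keyword; older clients use body=
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="maxwell-results", durable=True)
    channel.basic_consume(queue="maxwell-results", on_message_callback=on_event)
    channel.start_consuming()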

When configuring your virtual machine, it’s important to make it an attractive target for attackers, who work by market share and target the most popular software, such as Windows 7, Internet Explorer, and Flash. Be sure to also remove vmtools and any drivers that get dropped by VMware. You can browse to the drivers folder and sort by publisher to find VMware drivers. Finally, you should consider patch levels, and pick plugin versions that are exploitable by all major exploit kits, while also disabling any additional protections, such as IE protected mode.

 

Exploit Detection in Maxwell

As mentioned earlier, Maxwell automates exploit detection. While ROP detection was previously reliable, it is no longer effective at detecting modern exploit kits. The same is true for stack pivot detection, which basically checks whether ESP points to the heap instead of the stack, and is easily evaded by Angler and other exploit kits.

In Flux, we throw guard pages not only on the export address table, but also the IAT and MZ header of critical modules. We also use a small whitelist instead of a blacklisting technique, enabling us to catch shellcode that is designed to evade EMET. Even better, we can detect memory disclosure routines that execute before the shellcode. When a guard page violation is hit, we also save the shellcode associated with it for later inspection.

Similar to all sandboxes, Flux can log when files are dropped, new processes are created, registry keys are written to, or even when certain files are accessed. File access can be used to detect JavaScript vm-detection routines before an exploit is even triggered. However, these are all too noisy on their own so we must rely heavily on the whitelist capability. Essentially, every typical file/registry/process action that is observed from browsing benign pages is filtered, leaving only malicious activity caused by the exploit kit. Even if they evade our exploit detection techniques, we will still detect malicious activity of the payload as it is dropped and executed.

If malicious activity is detected, the post-processing step is activated by executing tcpflow on the PCAP to extract all sessions and files. Next, regular expressions are run across the GET/POST requests to identify traffic redirectors (such as EITEST), EK landing pages, and payload beacons. Finally, any dropped files, shellcode, and files extracted from the PCAP are scanned with Yara. The shellcode scanning allows for exploit kit tagging based on the specific shellcode routines used in each kit, which make for very long-lasting signatures.
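The Yara pass at the end of that pipeline is simple to sketch with the yara-python bindings; the rule and artifact paths below are assumptions.

    import glob
    import yara

    # Hypothetical rule and artifact locations. Compile the exploit kit rules once,
    # then scan everything the post-processing step produced: dropped files,
    # captured shellcode, and objects carved out of the PCAP by tcpflow.
    rules = yara.compile(filepath="signatures/exploit_kits.yar")

    for artifact in glob.glob("artifacts/*"):
        matches = rules.match(artifact)
        if matches:
            print(artifact, "->", [m.rule for m in matches])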

If you are protecting a network with snort rules for a component of the EK process, you need to know when Maxwell stops flagging a signature. Building robust signatures limits the necessity to frequently update them. There are a few tricks for writing robust signatures, such as comparing samples over time to extract commonalities or creating signatures from Flash exploits themselves. Both of these may result in longer lasting signatures. You can also take advantage of social media, and compare samples in Maxwell against those posted on Twitter by researchers, such as @kafeine, @malware_traffic and @BroadAnalysis.

 

Hunting for Exploit Kits

With the architecture and detection capabilities in place, it’s time to start hunting. But which websites should be explored to find evil stuff? A surprisingly effective technique is to continually cycle through the Alexa top 25,000 or top 100,000 websites, which can be streamlined by browsing five websites at a time instead of one for a 5x boost in processing capability. In less than 24 hours, you can crawl through 25,000 websites with just a handful of virtual machines. The only downside is losing the ability to know exactly which of the five websites was compromised without manually looking through the PCAP. If you have a good traffic anonymizing service, you can just reprocess each of the five websites.
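Feeding the crawler is mostly bookkeeping. A job publisher that walks an Alexa-style ranking file in batches of five could be as simple as the sketch below (file name, queue name, and job fields are assumptions; the controller consumes these jobs on the other end).

    import csv
    import json
    import uuid
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="maxwell-jobs", durable=True)

    # Hypothetical local copy of the Alexa ranking (rank,domain per row).
    with open("alexa-top-25000.csv", newline="") as fh:
        sites = [row[1] for row in csv.reader(fh)]

    # Hand each virtual machine five sites per job for the ~5x throughput boost.
    for start in range(0, len(sites), 5):
        job = {"uuid": str(uuid.uuid4()), "urls": sites[start:start + 5], "plugin": "flux"}
        channel.basic_publish(exchange="", routing_key="maxwell-jobs", body=json.dumps(job))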

At DerbyCon 6.0 Recharge, the Maxwell research was presented for the first time and we released the code under an MIT license. You can find it on GitHub. We look forward to comments, contributions, and suggestions for advancements. Maxwell has proven extremely useful in fully automating the detection and analysis of exploit kits and watering holes. Ideally, Maxwell can help both red and blue teamers test an organization’s defenses without requiring extensive resources or significant time. It also greatly simplifies a key pain point for researchers - actually collecting the samples. By hunting for exploit kits with Maxwell, researchers can spend more time analyzing and building defenses against exploit kits, instead of searching for them.

Is Hadoop Ready for Security?


Picture Source: artistsinspireartists

 

 

 In 2008, the number of internet-connected devices surpassed the number of people on the planet and Facebook overtook MySpace as the most popular social network. At the time, few people grasped the impact that these rapidly expanding digital networks would have on both national and cyber security. This was also the year I first used Hadoop, a distributed storage and processing framework.

Since then,  Hadoop has become the core of nearly every major large-scale analytic solution, but it has yet to reach its full potential in security. To address this, last week Cloudera, Endgame and other contributors announced the acceptance of Apache Spot, a cyber analytics framework, into the Apache Software Foundation incubator. At the intersection of Hadoop and security, this new project aims to revolutionize security analytics.

 

“When you invent the ship, you also invent the shipwreck” - Paul Virilio

 

Back in 2008, the security industry was recovering from the Zeus trojan and unknowingly gearing up for a date with Conficker, a worm that would go on to infect upwards of 9 million devices across 190 countries. Simultaneously, government think tanks warned that web 2.0 social networks would facilitate the radicalization and recruitment of terrorists.

As a computer engineer in the US Intelligence Community, I was lucky enough to work on these large-scale problems. Problems that, at their core, require solutions capable of ingesting, storing, and analyzing massive amounts of data in order to discern good from bad.

Around this time, engineers at search engine companies were the only other teams working on internet-scale data problems. Inspired by Google’s MapReduce and File System papers, Doug Cutting and a team at Yahoo open sourced Apache Hadoop, a framework that made it possible to work with large data sets across inexpensive hardware. Upon its release in 2006, this project began democratizing large-scale data analysis and gained adoption across a variety of industries.

Seeing the promise of Hadoop, the Intelligence Community became an early adopter, as it needed to cost-effectively perform analysis at unprecedented scale.  In fact, they ultimately invested in Cloudera, the first company founded to make Hadoop enterprise ready.

 

Fast Forward to Today: Hadoop for Security

In 2016, forward-leaning security teams across industry and government are increasingly adopting Hadoop to complement their Security Incident and Event Management (SIEM) systems. There are a number of fundamental characteristics that make Hadoop attractive for this application:

1. Scalability: Network devices, users, and security products emit a seemingly infinite flow of data.  Based on its distributed architecture, Hadoop provides a framework capable of dealing with the volume and velocity of this cross-enterprise data.

2. Low Cost-per-Byte: Detection, incident response, and compliance use cases increasingly demand longer data retention windows.  Due to its use of commodity hardware and open source software, Hadoop achieves a scaling cost that is orders of magnitude lower than commercial alternatives.

3. Flexibility: Starting with a single Apache project, the Hadoop family has grown into an ecosystem of thirty plus interrelated projects.  Providing a “zoo” of data storage, retrieval, ingest, processing, and analytic capabilities, the Hadoop family is designed to address various technical requirements from stream processing to low-latency in-memory analytics.

Unfortunately, many Hadoop-based security projects exceed budget and miss deadlines. To kick off a project, engineers have to write thousands of lines of code to ingest, integrate, store, and process disparate security data feeds. Additionally, the numerous ways of storing data (e.g., Accumulo, HBase, Cassandra, Kudu...) and processing it tee up a myriad of design decisions. All of this distracts from the development and refinement of the innovative analytics our industry needs.

 

Apache Spot

Apache Spot is a new open source project designed to address this problem, and accelerate innovation and sharing within the security community. It provides an extensible turnkey solution for ingesting, processing, and analyzing data from security products and infrastructure. Hadoop has come a long way since its inception. Apache Spot opens the door for exciting security applications. Purpose-built for security, Spot does the heavy lifting, providing out-of-the-box connectors that automate the ingest and processing of relevant feeds. Through its open data models, customers and partners are able to share data and analytics across teams -  strengthening the broader community.

Endgame is proud to partner with Cloudera and Intel to accelerate the adoption of Apache Spot across customers and partners.  Our expertise in using machine learning to hunt for adversaries and deep knowledge of endpoint behavior will help Apache Spot become a prominent part of the Hadoop ecosystem. We’re excited to contribute to this open source project, and continue pushing the industry forward to solve the toughest security challenges.

To find out more about Apache Spot, check out the announcement from Cloudera and get involved.

Defeating the Latest Advances in Script Obfuscation


As the security research community develops newer and more sophisticated means for detecting and mitigating malware, malicious actors continue to look for ways to increase the size of their attack surface and utilize whatever means are necessary to bypass protections. The use of scripting languages by malicious actors, despite their varying range of limited access to native operating system functionality, has spiked in recent years due to their flexibility and straightforward use in many attack scenarios.

Scripts are frequently leveraged to detect a user’s operating system and browser environment configuration, or to extract or download a payload to disk. Malicious actors also may obfuscate their scripts to mask the intent of their code and circumvent detection, while also deterring future reverse engineering. As I presented at DerbyCon 6 Recharge, many common obfuscation techniques can be subverted and defeated. Although obfuscated scripts seem confusing at first glance, there are a variety of techniques that help quickly deobfuscate them. In this post, I’ll cover the latest advances in script obfuscation and how they can be defeated. I’ll also provide some practical tips for quickly cleaning up convoluted code and transforming it into something human-readable and comprehensible.

 

Obfuscation

When discussing malicious scripts, obfuscation is a technique attackers use to purposefully obscure their source code. They do this primarily for two purposes: subverting antivirus and intrusion detection / prevention systems and deterring future reverse engineering efforts.

Obfuscation is typically employed via an automated obfuscator, and there are many freely available tools to choose from.

Since obfuscation does not alter the core functionality of a script (superfluous code may be added to further obscure a script’s purpose), would it be possible to simply utilize dynamic malware analysis methods to determine the script’s intended functionality and extract indicators of compromise (IOCs)? Unfortunately for analysts and researchers, it’s not quite that simple. While dynamic malware analysis methods may certainly be used as part of the process for analyzing more sophisticated scripts, deobfuscation and static analysis are needed to truly know the full extent of a script’s capabilities and may provide insight into determining its origin.

 

Tips for Getting Started

When beginning script deobfuscation, you should keep four goals in mind:

  1. Human-readability: Simplified, human-readable code should be the most obvious goal achieved through the deobfuscation process.
  2. Simplified code: The simpler and more readable the code is, the easier it will be to understand the script’s control flow and data flow.
  3. Understand control flow / data flow:  In order to be able to statically trace through a script and its potential paths of execution, a high level understanding of its control flow and data flow is needed.  
  4. Obtain context: Context pertaining to the purpose of the script and why it was utilized will likely be a byproduct of the first three goals.

Prior to starting the deobfuscation process, you should be sure you have the following:

  • Virtual machine
  • Fully-featured source code editor with syntax and function / variable highlighting
  • Language-specific debugger

It should also go without saying that familiarity with scripting languages is a prerequisite since you’re trying to (in most cases) understand how the code was intended to work without executing it. Keep the official documentation for the relevant scripting languages close at hand.

Online script testing frameworks provide a straightforward means for executing script excerpts, and can serve as a stepping-stone between statically evaluating code sections and setting up a full-fledged debugging session.

Before you begin, it is important to know that there is no specific sequence of steps required to properly deobfuscate a script. Deobfuscation is a non-linear process that relies on your intuition and ability to detect patterns and evaluate code. So, you don’t have to force yourself to go from top to bottom or the perceived beginning of the control flow to the end of the control flow. You’ll simply want to deobfuscate code sections that your eyes gravitate towards and that are not overly convoluted. The more sections you’re initially able to break down, the easier the overall deobfuscation process will be.

Code uniformity is crucial to the deobfuscation process. As you’re deobfuscating and writing out your simplified version of the code, you’ll want to employ consistent coding conventions and indentation wherever possible. You’ll also want to standardize and simplify how function calls are written where possible and how variables are declared and defined. If you take ownership of the code and re-write it in a way that you easily understand, you'll quickly become more familiar with the code and may pick up on subtle nuances in the control or data flow that may otherwise be overlooked.

Also, as previously mentioned, simplify where possible. It can't be reiterated enough!

 

Obfuscation Techniques

Garbage Code

Obfuscators will sometimes throw in superfluous code sections. Certain variables and functions may be defined, but never referenced or called.  Some code sections may be executed, but ultimately have no effect on the overall operation of the script. Once discovered, these sections may be commented out or removed.

In the following example, several variables within the subroutine are defined and set to the value of an integer plus the string representation of an integer. These variables are not referenced elsewhere within the code, so they can be safely removed from the subroutine and not affect the result.

Obfuscation_Code_1
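One cheap way to triage for this kind of garbage is to count identifier occurrences across the script: a name that appears exactly once is defined but never referenced. The Python sketch below is a rough heuristic along those lines, with a deliberately incomplete keyword list, and it will produce false positives on anything clever.

    import re
    from collections import Counter

    # Deliberately partial keyword list -- this is a quick heuristic, not a parser.
    VBA_KEYWORDS = {"Dim", "Set", "Sub", "End", "Function", "If", "Then", "Else",
                    "As", "String", "Integer", "Call"}

    def possible_garbage(source):
        # Identifiers that occur exactly once exist only at their definition and
        # are never referenced again -- prime candidates for garbage code.
        names = re.findall(r"\b[A-Za-z_]\w*\b", source)
        counts = Counter(names)
        return sorted(name for name, count in counts.items()
                      if count == 1 and name not in VBA_KEYWORDS)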

Complicated Names

Arguably the most common technique associated with obfuscation is the use of overly complicated variable and function names. Strings containing a combination of uppercase and lowercase letters, numbers, and symbols are difficult to look at and differentiate at first glance. These should be replaced with more descriptive and easier to digest names, reinforcing the human-readable goal. While you can use the find / replace function provided by your text editor, in this case you’ll need to be careful to avoid any issues when it comes to global versus local scope. Once you have a better understanding of the purpose of a function or a variable later on in the deobfuscation process, you can go back and update these names to something more informative like “post_request” or “decoding_loop.”

Obfuscation_Code_2

 In the above example, each variable and function that is solely local in scope to the subroutine is renamed to a more straightforward label describing its creation or limitation in scope. Variables or function calls that are referenced without being declared / defined within the subroutine are left alone for the moment. These variables and functions will be handled individually at the global scope.   

Indirect Calls and Obscured Control Flow

Obscured control flow is usually not evident until much later in the deobfuscation process. You will generally look for ways to simplify function calls so they’re more direct. For instance, if you have one function that is called by three different functions, but each of those functions transforms the input in the exact same way and calls the underlying function identically, then those three functions could be merged into one simple function. Function order can also come into play. If you think a more logical ordering of the functions that matches up with the control flow you are observing will help you better understand the code, then by all means rearrange the function order.

 

Obfuscation_Code_3

In this case, we have five subroutines. After these subroutines are defined, there is a single call to sub5. If you trace through the subroutines, the ultimate subroutine that is executed is sub2. Thus, any calls outside of this code section to any of these five subs will result in a call to sub2, which actually carries out an operation other than calling another subroutine. So, removing sub1, sub3, sub4, and sub5 and replacing any calls to those subs with a direct call to sub2 would be logically equivalent to the original code sequence.

Arithmetic Sequences

When it comes to hard-coded numeric values, obfuscators may employ simple arithmetic to thwart reverse engineers. Beyond doing the actual math, it is important to research the exact behavior of the scripting language’s implementation of the mathematical functions.

Obfuscation_Code_4

In the line above, eight double values are added to and subtracted from each other, and the result is passed into the ASCII character function. Upon further inspection, the obfuscator likely threw in the "86" values, as they ultimately cancel each other out and add up to zero. The remaining values add up to 38 which, when passed into the character function, results in an ampersand.
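Folding this kind of arithmetic away is easy to automate. The sketch below uses hypothetical values that mimic the pattern just described (the 86s cancel and the remainder sums to 38, i.e. an ampersand); the regular expression restricts the argument to digits and arithmetic operators before it is evaluated.

    import re

    CHR_CALL = re.compile(r"Chr\(([\d\.\s\+\-\*\/\(\)]+)\)")

    def fold_chr_arithmetic(line):
        # Evaluate Chr() arguments that are purely numeric arithmetic (the regex
        # only admits digits, whitespace, parentheses, and + - * /) and splice the
        # resulting character back in as a string literal.
        def resolve(match):
            value = eval(match.group(1), {"__builtins__": {}})
            return '"' + chr(int(round(value))) + '"'
        return CHR_CALL.sub(resolve, line)

    # Hypothetical input mirroring the pattern above: the 86s cancel, the rest sums to 38.
    print(fold_chr_arithmetic('url = Chr(86 - 86 + 12.5 + 25.5) & "amp;co=1"'))  # url = "&" & "amp;co=1"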

Obfuscation_Code_5

While the code section above may look quite intimidating, it is easily reduced down to one simple variable definition by the end of the deobfuscation process. The first line initializes an array of double values which is only referenced on the second line. The second line declares and sets a variable to the second value in the array. Since the array is not subsequently referenced, the array can be removed from the code. The variable from the second line is only used once in the third line, so its value can be directly placed inline with the rest of the code, thus allowing us to remove the second line from the code. The code further simplifies down as the Sgn function calls cancel each other and the absolute value function yields a positive integer, which will be subtracted from the integer value previously defined in the variable from the second line.

 

Obfuscated String Values

As for obfuscated string values, you’ll want to simplify the use of any ASCII character functions, eliminate any obvious null strings, and then standardize how strings are concatenated and merge together any substrings where possible.

This line of code primarily relies on the StrReverse VBA function which, you guessed it, reverses a string. The null strings are removed right off the bat since they serve no purpose. Once the string is reversed and appended to the initial “c” string, we’re left with code which invokes a command shell to run the command represented by the variable and terminate itself.

Obfuscation_Code_6

 

A common technique employed in malicious VBA macros is dropping and invoking scripts in other scripting languages. The macro in this case builds a Windows batch file, which will later be executed. While it is quite evident that a batch file is being constructed, the exact purpose of the file is initially unclear. 

Obfuscation_Code_7

If we carry out string concatenations, eliminate null strings, and resolve ASCII characters, we can see that the batch file is used to invoke a separate VBScript file located in a subdirectory of the current user’s temp directory. 

 

Advanced Cases

Okay, so you tried everything and your script is still obfuscated and you have no idea what else to do…

Well, in this case you’ll want to utilize a debugger and start doing some more dynamic analysis. Our goal in this case is to circumvent the obfuscation and seek out any silver bullets in the form of eval functions or string decoding routines. Going back to the ”resolve what you can first” approach, you also might want to start out by commenting out code sections to restrict program execution. Sidestepping C2 and download functions with the aid of a debugger may also be necessary.

Obfuscation_Final_Code_8

 

If you follow the function that is highlighted in green in the above example, you can see that it is referred to several times. It takes as input one hexadecimal string and one alphanumeric string with varying letter cases, and returns a value. Based off of the context in which the function is called (as part of a native scripting language function call in most cases), the returned value is presumed to be a string. Thus, we can hypothesize that this function is a string decoding routine. Since we are using a debugger, we don’t need to manually perform the decoding or reverse engineer all of its inner workings. We can simply set breakpoints before the function is called and prior to the function being returned in order to resolve what the decoded strings are. Once we resolve the decoded strings, we can replace them inline with the code or place them as inline comments as I did in the sample code.

 

Conclusion

Script deobfuscation doesn't require any overly sophisticated tools. Your end result should be simple, human-readable code that is logically equivalent to the original obfuscated script. As part of the process, rely on your intuition to guide you, and resolve smaller sections of code in order to derive context to how they’re used. When provided the opportunity, removing unnecessary code and simplifying code sections can help to make the overall script much more readable and easier to comprehend. Finally, be sure to consult the official scripting language documentation when needed. These simple yet effective tips should provide a range of techniques next time you encounter obfuscated code.  Good luck!

How to Hunt: The [File] Path Less Traveled


As any good hunter knows, one of the first quick-win indicators to look for is malware within designated download or temp folders. When users are targeted via spear phishing or browser based attacks, malware will often be initially staged and executed from those folders because they are usually accessible to non-privileged processes. In comparison, legitimate software rarely runs from, and even more rarely persists from, these locations. Consequently, collecting and analyzing file paths for outliers provides a signature-less way to detect suspicious artifacts in the environment.

In this post, we focus on the power of file paths in hunting. Just as our earlier post on COM Hijacking demonstrated the value of the ATT&CK™ framework for building defensive postures, this piece addresses another branch of the persistence framework and illustrates the efficacy of hunting for uncommon file paths. This method has proven effective time and time again in catching cyber criminals and nation state actors alike early in their exploitation process.

 

Hunting for Uncommon Paths: The Cheap Approach

In past posts, we highlighted some free approaches for hunting, including passive DNS, DNS anomaly detection, and persistence techniques. Surveying file locations on disk is another cost effective approach for your team. The method is straightforward: simply inspect running process paths or persistent file locations and then look through the data for locations where attackers tend to install their first-stage malware. For an initial list of such locations, Microsoft’s Malware Protection Center provides a good list of folders where malware authors commonly write files.

Your hunt should focus on files either running or persisting within temp folders, such as %TEMP%, %APPDATA%, %LOCALAPPDATA%, and even extending to %USERPROFILE%, but first you need to collect the appropriate data. To gather persistence data with freely available tools, you can use Microsoft’s SysInternals Autoruns to discover those persistent files and extract their paths. For detailed running process lists, there are many tools available, but we recommend utilizing PowerShell via the Get-Process cmdlet.

Some popular commands include:

  • “Get-Process | Select -Property name, path”: To list current processes and their path.
  • “Get-Process | Select -Property path | Select-String C:\\Users”: In this case we filter for running processes from a user’s path. Like grep, utilize Select-String to filter results.
  • “Get-Process | Group-Object -Property path | Sort count | Select count, name”: Group objects by path to count the number of occurrences of each.

Expanding your hunt beyond just temp folders can also be beneficial, since most legitimate processes will be running from %SystemRoot% or %ProgramFiles%. Because of this, outlier analysis will be effective and assist your hunt. Simply aggregate the results from your running process or persistent surveys across your environment, then perform frequency analysis to find the least occurring samples.
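Once the survey results are exported - say to a CSV of host and path pairs (the column layout below is an assumption; adjust it to match your collection output) - the stacking itself is only a few lines of Python.

    import csv
    from collections import Counter

    # Assumes a CSV of (hostname, process_path) rows exported from your surveys.
    path_counts = Counter()
    with open("process_survey.csv", newline="") as fh:
        for hostname, path in csv.reader(fh):
            path_counts[path.lower()] += 1

    # The least frequent paths across the environment float to the top for review.
    for path, count in sorted(path_counts.items(), key=lambda item: item[1])[:25]:
        print(f"{count:6d}  {path}")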

 

The Noise Dilemma

As you may have guessed, anomalous items uncovered with this approach to hunting are not necessarily suspicious. Users can install applications locally in uncommon locations, adding benign results to your hunt. Experienced hunters know that noise is always going to be part of many hunt analytics, and the art is in discerning the artifacts that are most likely to be malicious and should thus be focused on first.

There are many ways to triage and filter results, but here are a few examples.

  • You can utilize authenticode by filtering out executables that are legitimately signed by well-known signers or by signers you might expect in your environment. Unsigned code out of an odd location might be especially unusual and worth inspecting.
  • If you have a known-good, baseline image of your environment, you can use hash lookups to weed out any applications which were not approved for installation.
  • You may want to ignore folders protected by User Account Controls (UAC). In a well-configured environment, a standard user is not allowed to write to a secure directory, such as %SystemRoot% and some directories under %ProgramFiles%. Filtering out items executing from there can reduce the data set. It’s worth mentioning that you are assuming administrative credentials weren’t harvested and UAC wasn’t bypassed.
  • Once you’ve filtered data out, especially if you are left with many suspicious files, you may want to submit files for malware scans (e.g. VirusTotal; a minimal lookup is sketched after this list). At Endgame, we prioritize based on the MalwareScore™ of the file, our proprietary signature-less malware detection engine. The higher the score, the more likely it is that the file is malicious.
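As a minimal illustration of that hash-lookup triage, the sketch below queries VirusTotal's v3 file endpoint for an existing report; the API key is a placeholder, and rate limiting and batching are left out.

    import requests

    VT_API_KEY = "YOUR_API_KEY"  # hypothetical placeholder

    def vt_detection_count(sha256):
        # Ask VirusTotal's v3 file endpoint for an existing report on this hash.
        url = "https://www.virustotal.com/api/v3/files/" + sha256
        response = requests.get(url, headers={"x-apikey": VT_API_KEY})
        if response.status_code == 404:
            return None  # never submitted to VT -- itself interesting during a hunt
        response.raise_for_status()
        stats = response.json()["data"]["attributes"]["last_analysis_stats"]
        return stats.get("malicious", 0)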

You may come up with many other ways to filter and reduce the data based on your specific knowledge of your environment. As always, your analysis of the hunt data and subsequent analytics should be driven by environmental norms, meaning that if you observe something uncommon to your system configurations, it’s worth investigating.

 

How Endgame Detects

At Endgame, we facilitate one-click collection of processes and persistence locations to allow for rapid anomaly and suspicious behavior detection. Our platform examines each endpoint within your network and can prevent, alert, or hunt for malicious behavior at scale. The following video shows a simple PowerShell-based approach to collection and stacking, and then shows the Endgame platform enumerating and enriching data in seconds to bubble suspicious artifacts to the top with the uncommon path technique.

 

 

Conclusion

There are many very effective ways to hunt for suspicious artifacts across your systems. Hunting for files executing or persisting out of strange paths can be effective in many environments. We’ve discussed some free ways to make this happen, and shown you how this can be done with the Endgame platform, highlighting malicious activity with ease. Layering this hunt approach with other analytics in your hunt operations will allow you to find intrusions quickly and do what every hunter wants to do: close the gap between breach and discovery to limit damage and loss.

It's Time for Cyber Policy to Leapfrog to the Digital Age


In Rise of the Machines, Thomas Rid details the first major digital data breach against the US government. The spy campaign began on October 7, 1996, and was later dubbed Moonlight Maze. This operation exfiltrated data that, if stacked, would exceed the height of the Washington Monument. Once news of the operation was made public, Newsweek cited Pentagon officials as clearly stating it was, "a state-sponsored Russian intelligence effort to get U.S. technology". That is, the US government publicly attributed a data breach aimed at stealing vast amounts of military and other technology trade secrets.

Fast-forward twenty years, and on October 7, 2016, ODNI and DHS issued a joint statement noting, “The U.S. Intelligence Community (USIC) is confident that the Russian Government directed the recent compromises of e-mails from US persons and institutions, including from US political organizations.” It’s been twenty years, and our policies have not yet evolved, leaving adversaries virtual carte blanche to steal intellectual property, classified military information, and personally identifiable information. They’re able to conduct extensive reconnaissance into our critical infrastructure, and hack for real-world political impact without recourse. This recent attribution to Russia, which cuts at the heart of democratic institutions, must be a game-changer that finally instigates the modernization of policy as it pertains to the digital domain.

Despite the growing scale and scope of digital attacks, with each new record-breaking breach dwarfing previous intrusions, this is only the fourth time in recent years that the US government has publicly attributed a major breach. Previous public attribution resulted in the indictment of five People’s Liberation Army officials, economic sanctions against North Korea following the Sony breach and earlier this year the indictment of seven Iranians linked to attacks on banks and a New York dam. As breach after breach occurs, those in both the public and private sectors are demanding greater capabilities in defending against these attacks.

Unfortunately, much of the cyber policy discussion continues to rely upon frameworks from decades, if not centuries, ago and is ill equipped for the digital era. For instance, Cold War frameworks may provide a useful starting point, but nuclear deterrence and cyber deterrence differ enormously in the core features of Cold War deterrence – numbers of actors, signaling, attribution, credible commitments, and so forth. Unfortunately, even among those highest ranking government officials there continues to be comparisons between nuclear and cyber deterrence, and so we continue to rely upon an outdated framework that has little relevance for the digital domain.

Some prefer to look back not decades, but centuries and point to Letters of Marque and Reprisal as the proper framework for the digital era. Created at a time to legally empower private companies to take back property that was stolen from them, they are beginning to gain greater attention as ‘hacking back’ also grows in popularity in the discourse. Nevertheless, there’s a reason Letters of Marque and Reprisal no longer exist. They fell out of favor, largely because of their escalatory effect on international conflict, during an era that didn’t even come close to the scope and scale of today’s digital attacks, or the interconnectivity of people, money and technologies. Similarly, technical challenges further complicate retaking stolen property.  Adversaries can easily make multiple copies of the stolen data and use misdirection, obfuscation, and short-lived command and control infrastructure.   This confounds the situation and heightens the risk of misguided retaliation.

So where does this leave us? The Computer Fraud and Abuse Act (CFAA) from 1986 remains the core law for prosecuting illegal intrusions. Unfortunately, just like the Wassenaar Arrangement and definitions of cyber war, the CFAA is so vaguely worded that it risks both judicial over-reach as well as circumvention. This year’s Presidential Policy Directive 41 is essential and helps incident response but has no deterrent effect. In contrast, Executive Order 13694 in 2015, which basically sanctions people engaging in malicious cyber activity, is a start. It clearly signals the exact repercussions of an attack, but has yet to be implemented, and thus lacks the deterrent effect.

Similar steps must be taken to further specify the options available and that will be enacted in response to the range of digital intrusions. Too often it is assumed that a cyber tit for tat is the only viable option. That is extremely myopic, as the US has the entire range of statecraft available, including (but not limited to) cutting diplomatic ties, sanctions, indictments and, at the extreme, the military use of force. The use of each of these must be predicated on the target attacked, the consequences of that attack, as well as the larger geopolitical context.

Clearly, it is time for our policies to catch up with modern realities, and move beyond decades of little to no recourse for adversaries. This must be a high priority, as it affects the US as well as our allies. Last year’s attack on the French TV network, TV5Monde, came within hours of destroying the entire network. The attacks on a German steel mill, which caused massive physical damage, as well as the Warsaw Stock Exchange, not to mention the attacks on the White House, State Department and Joint Staff unclassified emails systems, have also been linked to Russia.

The world has experienced change at a more rapid pace arguably than any other time in history, largely driven by the dramatic pace of technological change. At the same time, our cyber policies have stagnated, leaving us unprepared to effectively counter the digital attacks that have been ongoing for decades. Given both the domestic and global implications, the US must step forward and offer explicit policy that clearly states the implications of a given attack, including consideration of targets, impacts of the attack, and the range of retaliatory responses at our disposal.

To be fair, balancing between having a retaliatory deterrent effect and minimizing escalation is extremely difficult, but we haven’t even really begun those discussions. Absent this discourse and greater legal and policy clarity, the intrusions will continue unabated. At the same time, many in the private sector will continue to debate the merits of a hacking back framework that has serious escalatory risks, and likely is ineffective.  The next few weeks are extremely important, as the Obama administration weighs the current range of options that cut across diplomatic, information, military and economic statecraft. Hopefully we’ll see a rise in discourse and concrete steps that begin to address viable deterrent options, and signal the implications of digital attacks that have hit our economy, our government, and now a core foundation of our democracy.

 

*Note: This post was updated on 10/12/2016 to also include the indictment against seven Iranians.

 


The Hard Thing About Safe Things


Information security needs a more accurate metaphor to represent the systems we secure. Invoking castles, fortresses and safes implies a single, at best layered, attack surface for security experts to strengthen. This fortified barrier mindset has led to the crunchy on the outside, and soft chewy center decried by those same experts. Instead of this candyshell, a method from safety engineering - System Theoretic Process Analysis -  provides a way to deal with the complexity of the real world systems we build and protect.

 

A Brief Background on STPA

Occasionally referred to as Stuff That Prevents Accidents, System Theoretic Process Analysis (STPA) was originally conceived to help design safer spacecraft and factories. STPA is a toolbox for securing systems which allows the analyst to efficiently find vulnerabilities and the optimal means to fix them. STPA builds upon systems theory, providing the flexibility to choose the levels of abstraction appropriate to the problem.  Nancy Leveson’s book, Engineering a Safer World, details how the analysis of such systems of systems can be done in an orderly manner, leaving no possible failure unexamined. Because it focuses on the interaction within and across systems, it can be applied far outside the scope of software, hardware and network topologies to also include the humans operating the systems and their organizational structure. With STPA, improper user action, like clicking through a phishing email, can be included in the analysis of the system as much as vulnerable code.

 

Benefits of STPA

At its core, STPA provides a safety first approach to security engineering. It encourages analysts to diagram and depict a specific process or tool, manifesting potential hazards and vulnerabilities that otherwise may not be noticed in the daily push toward production and deadlines. There are several key benefits to STPA, described below.

Simplicity

Diagrammatically, the two core pieces of system theory are the box as a system and the arrow as a directional connection for actions. There is no dogmatic view of what things must be called.  Just draw boxes and arrows to start, and if you need to break it down, draw more or zoom into a box.  The exhaustive analysis works on the actions through labels and the systems’ responses.  The networked system of systems can be approached one connection at a time. Unmanageable cascading failure becomes steps of simultaneous states. Research has shown that this part of STPA can be done programmatically with truth tables and logic solvers. The diagram below illustrates a simple framework and good starting point for building out the key components of a system and their interdependencies.

 

Completeness

The diagram of systems and their interconnections allows you to exhaustively check the possible hazards that could be triggered by actions.  Human interactions are modeled the same way as other system interactions, allowing for rogue operators to be modeled as well as attackers. This is an especially useful distinction for infosec, which often fails to integrate the insider threat element or human vulnerabilities into the security posture. As you see below, the user is an interconnected component within a more exhaustive depiction of the system, which can be useful to extensively evaluate vulnerabilities and hazards.

 

Clear Prioritization

Engineering a Safer World - and STPA more broadly - urges practitioners and organizations to step back and assess one thing: What can I not accept losing?  In infosec, for example, loss could mean exfiltrated, maliciously encrypted or deleted data or a system failure leading to downtime. During system design, if the contents of a box can be lost acceptably without harming the boxes connected to it, you don’t have to analyze it. An alternative method is to estimate the likelihood of possible accidents and assign probabilities to risks. Analyzing these probabilities of accidents instead makes it more likely that low likelihood problems will be deprioritized in order to handle the higher likelihood and seemingly more impactful events.  But, since the probabilities of failure for new, untested designs can’t be trusted, the resulting triage is meaningless. Instead, treating all losses as either unacceptable or acceptable forces analysts to treat all negative events seriously regardless of likelihood. Black Swan events that seemed unlikely have taken down many critical systems from Deepwater Horizon to Fukushima Daiichi. Treating unacceptable loss as the only factor, not probability of loss, may seem unscientific, but it produces a safer system.  As a corollary, the more acceptable loss you can build into your systems, the more resilient they will be. Building out a system varies depending on each use case. In some cases, a simple diagram is sufficient, while in others, a more exhaustive framework is required. Depending on your specific situation, you could arrive at a system diagram that falls in between those extremes, and clearly prioritizes components based on acceptable loss, as the diagram depicts below.

 

Resilience

Still accidents happen and we must then recover.  Working on accident investigation teams, Dr. Leveson found that the rush to place blame hindered efforts to repair the conditions that made the accident possible.  Instead, STPA focuses the investigation on the connected systems, making the chain of cause and effect into more of a structural web of causation. To blame the user for clicking on a malicious link and say you’ve found the root cause of their infection ignores the fact that users click on links in email as part of their job. The solutions to such problems require more than a blame, educate, blame cycle. We must look at the whole system of defenses from their OS, to their browser, to their firewall. No longer artificially constrained to simply checking off the root cause, responders can address the systemic issues, making the whole structure more resilient.

 

Challenges with STPA

Although designed for safety, STPA has been recently expanded to security and privacy. Colonel William E. Young, Jr created STPA-Sec in order to directly apply STPA to the military’s need to survive attack. Stuart Shapiro, Julie Snyder and others at MITRE have worked on STPA-Priv for privacy related issues. Designing safe systems from the ground up, or analyzing existing systems, using STPA requires first defining unacceptable loss and working outwards. While there are clear operational benefits, STPA does come with some challenges.

Time Constraints

STPA is the fastest way to perform a full systems analysis, but who has the luxury of a full system analysis when half the system isn’t built yet and the other half is midway through an agile redesign? It may be difficult to work as cartographer, archeologist and safety analyst when you have other work to get done. Also, who has the time to read Engineering a Safer World? To address the time constraint, I recommend the STPA Primer. When time can be found, the scope of a project design and the analysis to be done may look like a never-ending task. If a project has 20 services, 8 external API hits and 3 user types, the vital systems can be whittled down to perhaps 4 services and 1 user type, simply by defining unacceptable loss properly. Then, within those systems, subdivide out the hazardous from the harmless. Now the system under analysis only contains the components and connections relevant to failure and unacceptable loss. While there may be a somewhat steep learning curve, once you get a hang of it, STPA can save time and resources, while baking in safe engineering practices.

Too Academic

STPA may be cursed by an acronym and a wordiness that hides the relative simplicity beneath.  The methodology may seem too academic at first, but it has been used in the real world from Nissan to NASA. I urge folks to play around with the concepts which stretch beyond this cursory introduction. Getting buy-in doesn’t require shouting the fun-killing SAFETY word and handing out hard hats.  It can be as simple as jumping to a whiteboard while folks are designing a system and encouraging a discussion of the inputs and outputs to that single service systematically.  I bet a lot of folks inherently do that already, but STPA provides a framework to do this exhaustively for full systems if you want to go all the way.

 

From Theory to Implementation: Cybersecurity and STPA

STPA grew out of the transformation of classically mechanical or electromechanical systems, like plants, cars, and rockets, as they became computer controlled. The layout of analog systems was often laid bare to the naked eye in gears or wires, but these easy-to-trace systems became computerized. The hidden complexity of digital sensors and actuators was missed by the standard chain-of-events models. What was once a physical problem now had an additional web of code, wires, and people that could interact in unforeseen ways.

Cybersecurity epitomizes the complexity and systems-of-systems approach ideal for STPA. If we aren't willing to methodically explore our systems piece by piece to find vulnerabilities, there is an attacker who will. However, such rigor rarely goes into software development planning or quality assurance. This contributes to the assumed insecurity of hosts, servers, and networks, and to the "Assume Compromise" starting point we operate from at Endgame. Secure software systems outside of the lab remain a fantasy. Instead, defenders must continually face the challenge of detecting and eradicating determined adversaries who break into brittle networks. STPA will help people design the systems of the future, but for now we must secure the systems we have.

 

 

Protecting the Financial Sector: Early Detection of Trojan.Odinaff


The financial sector continues to be a prime target for highly sophisticated, customized attacks for an obvious reason - that's where the money is. Earlier this year, the SWIFT money transfer system came under attack, resulting in an $81 million heist from the Bangladesh Bank. That number pales in comparison to estimates of close to $1 billion stolen by the Carbanak group from over 100 banks worldwide.

Earlier this month, Symantec detailed a new threat to the financial sector which they said resembles the work of the highly sophisticated Carbanak group. In their excellent post, they describe Odinaff, a precision toolkit used by criminal actors with a narrow focus on the financial industry and tradecraft resembling that of nation-state hackers. It appears that the malware is being used in part to remove SWIFT transaction records, which could indicate an attempt to cover up other financial fraud.

Given the sophistication and stakes involved in the Odinaff campaign, we wanted to see how well Endgame's early-stage detection capabilities would do against this emergent and damaging threat. The verdict: extremely well. Let's walk through the campaign and show how it can be detected early and at multiple stages, before any damage is done.

 

Background

According to Symantec, the Odinaff trojan is deployed at the initial compromise. The group then deploys additional tools to complete its operations on specific machines of interest. The group is conscious of its operational footprint and uses stealth techniques designed to circumvent defenses, such as in-memory-only execution. The toolkit includes a variety of techniques, offering the group the flexibility to do just about anything, including credential theft, keylogging, lateral movement, and much more.

The integration of multiple, advanced attack techniques with careful operational security is a trademark of most modern, sophisticated attacks.  We’ve put a lot of effort into developing detection and prevention techniques which allow our customers to detect and prevent initial compromise and entrenchment by adversaries (for example, see our How to Hunt posts on file paths and COM hijacking for more information on how we’re automating and enabling hunt operations across systems).  Our exploit prevention, signature-less malware detection, in-memory only detections, malicious persistence detection, and kernel-level early-stage adversary technique detections combine to make it extraordinarily difficult for adversaries to operate.  This prevents the adversary from establishing a beachhead in the network and protects the critical assets they’re after. Let’s take a look and see how this layering and integration of early stage detection and prevention fare against the Odinaff trojan. We tested the following dropper referenced in the Symantec post: F7e4135a3d22c2c25e41f83bb9e4ccd12e9f8a0f11b7db21400152cd81e89bf5.

Initial Malware Infection

According to Symantec, the initial trojan is delivered via a variety of methods including malicious macros and uploaded through an existing botnet.  The instant this malware hits disk, Endgame catches it. Our proprietary signature-less detection capability, MalwareScore™, immediately alerts that the new file is extraordinarily malicious, scoring it a 99.86 out of 100.  This would lead to immediate detection through Endgame.  

 

[Image: pic 3_odinaff.png - MalwareScore™ verdict on the Odinaff dropper]

 

Persistence Detection

One of the first things the malware does is persist itself as a run key in the registry. Endgame’s comprehensive persistence enumeration and analytics cause the malicious Odinaff persistence item to clearly stand out, warning the network defender and enabling quick remediation.  The persistence inspection is crucial because even if the Odinaff actors had cleverly written their malware to evade our MalwareScore™, other characteristics of the dropped persistence item are caught by Endgame’s automated analytics. These include an anomaly between the filename on disk and the original compilation filename from Microsoft’s Version Info, the fact that it’s unsigned, outlier analysis highlighting the anomalous artifact in the environment, and more. All these analytics are presented to the user in a rich and intuitive user interface and point to the persistence item as very suspicious.
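For readers hunting without such tooling, the same class of persistence can be enumerated by hand. Below is a minimal sketch, assuming a Windows host and Python 3's standard winreg module, that lists entries in the common HKLM and HKCU Run keys so unexpected items (like the Odinaff persistence entry) can be reviewed; it is illustrative only, not the Endgame implementation.

```python
# Minimal Run-key enumeration sketch (assumes Windows + Python 3 stdlib only).
import winreg

RUN_KEYS = [
    (winreg.HKEY_LOCAL_MACHINE, r"Software\Microsoft\Windows\CurrentVersion\Run"),
    (winreg.HKEY_CURRENT_USER, r"Software\Microsoft\Windows\CurrentVersion\Run"),
]

def enumerate_run_keys():
    """Yield (key_path, value_name, command) for every Run key entry found."""
    for hive, key_path in RUN_KEYS:
        try:
            key = winreg.OpenKey(hive, key_path)
        except OSError:
            continue  # key may not exist on this host
        with key:
            index = 0
            while True:
                try:
                    name, command, _type = winreg.EnumValue(key, index)
                except OSError:
                    break  # no more values under this key
                yield key_path, name, command
                index += 1

if __name__ == "__main__":
    for key_path, name, command in enumerate_run_keys():
        print(f"{key_path}\\{name} -> {command}")
```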

 

[Image: odinaff_analytics.png - persistence analytics highlighting the Odinaff persistence item]

 

In-Memory Detection

Endgame’s patent-pending detections of memory anomalies allow users to find all known techniques adversaries use to hide in memory.  On install, Odinaff sleeps for about one minute and then modifies its own memory footprint in a way which Endgame detects as malicious. The evasion technique used is uncommon and will be very difficult to detect with other endpoint security products, requiring at a minimum a tremendous amount of manual analysis.  On the other hand, Endgame highlights Odinaff’s in-memory footprint as malicious with high confidence in seconds.  Endgame discovers other known in-memory stealth techniques just as easily.

 

[Image: odinaff_inmem.png - in-memory detection of Odinaff]

 

Layered and Early Detection at Scale

This malware was not widely discussed in the security community before the Symantec report, yet these sophisticated attackers have been deploying Odinaff in the wild since at least January 2016, according to Symantec. Signature-based techniques do not provide adequate protection as new threats emerge because it takes time for threats to become known and for signatures to be created and propagated.

As we've described above, Endgame's layered detection technology detects Odinaff with ease and with no prior knowledge or signatures. By focusing on detection of techniques such as in-memory stealth, which are seen time and time again as initial access is gained, detection and prevention can reliably take place early. Early detection stops advanced adversaries from achieving their objectives and in turn prevents damage. Take a look at the video below to walk through these layered defenses and see how Endgame detected Odinaff early and at various stages of the attack.

 

 

 

How to Hunt: The Masquerade Ball


Masquerading was once conducted by the wealthiest elite at elaborate dances, allowing them to take on the guise of someone else and hide amidst the crowd. Today, we see digital masquerading used by the most sophisticated as well as less skilled adversaries to hide in the noise while conducting operations. We continue our series on hunting for specific adversary techniques and get into the Halloween spirit by demonstrating how to hunt for masquerading. So let's start the masquerade ball and hunt for a simple but devious defense evasion technique.

 

Defense Evasion

In nature, camouflage is a time-proven, effective defensive technique which enables the hunted to evade the hunters. It shouldn’t come as any surprise that attackers have adopted this strategy for defense evasion during cyber exploitation, hiding in plain sight by resembling common filenames and paths you would expect within a typical environment. By adopting common filenames and paths, attackers blend into and persist within environments, evading many defensive techniques.

Part of the attacker's tradecraft is to avoid detection. We can look to frameworks, like Mitre's ATT&CK™, to guide us through the adversary lifecycle. We've shown how it's useful for hunting for persistence (as our COM hijacking post demonstrated), and it also covers the broad range of attacker techniques, including defense evasion. DLL search order hijacking, UAC bypassing, and time stomping are all effective for defense evasion, as is the one we will discuss today: masquerading.

Attackers use these defense evasion techniques to blend in, making them easy to miss when hunting, especially when dealing with huge amounts of data from thousands of hosts. Let’s start with some DIY methods to hunt for masquerading, which require an inspection of persistent or running process file names or paths.

 

The Masquerading Approach

We previously explored hunting for uncommon filepaths, which is a simple approach for detecting suspicious files. We can expand on this method by understanding masquerading. Let’s focus on two different masquerading techniques: 

  1. Filename masquerading where legitimate Windows filenames appear in a non-conventional location.
  2. Filename mismatching where filenames on disk differ from those in the resource section of the compiled binary.
Filename Masquerading

For filename masquerading, you need to first build the list of files which have masquerade potential.  We’ll call that the anchor list.  A good approach is installing a clean base image representative of your environment (a fresh install of Windows will do).  Next, you need to choose which files you care about.  Like most things, there is a lazy approach and an approach that takes a little more effort, but will probably give you more meaningful results with less noise.  To build your anchor list the lazy way, simply enumerate all files in C:\Windows including the filename and path and use that as your anchor list.  
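As a rough illustration, here is a minimal sketch of that lazy approach, assuming a clean Windows base image and Python 3; the anchor_list.json output name is just a placeholder:

```python
# Build the "lazy" anchor list: every filename and full path under C:\Windows
# on a clean base image (assumes Python 3 on a representative Windows install).
import json
import os

def build_anchor_list(root=r"C:\Windows"):
    """Map lowercase filename -> list of legitimate full paths seen on the clean image."""
    anchors = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename).lower()
            anchors.setdefault(filename.lower(), []).append(full_path)
    return anchors

if __name__ == "__main__":
    anchor_list = build_anchor_list()
    with open("anchor_list.json", "w") as fh:   # persist for later hunt operations
        json.dump(anchor_list, fh, indent=2)
```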

However, there are a huge number of filenames in this list, and you should ask yourself questions about the likelihood of an adversarial masquerade before putting each one in the anchor list. After all, it isn't much of a masquerade if the legitimate filename, seen in a process list or anywhere else, might cause someone to question its legitimacy, even if it's a system file, such as NetCfgNotifyObjectHost.exe. So, put in a bit more work and make a custom list of native Windows files, such as svchost, lsass, wininit, smss, and logonui, which show up constantly and are likely to be passed over if an experienced but rushed investigator is inspecting the name. It is also a good idea for the anchor list to include names for other common applications you expect to find in your environment, such as reader_sl.exe, winword.exe, and more.

Once the anchor list is compiled, you can start using it during your hunt operations. List the running processes, persistent files, or some other file-backed artifact you're interested in, and compare those names to the anchor list. Do the filenames match? There will be many matches. What about the filepaths? If a filename matches but its path does not, you know where to target your hunt. There are legitimate reasons for this happening (users do unexpected things), but locating this simple defensive evasion technique is a good way to find intrusions.
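For example, a minimal comparison sketch might look like the following; it assumes the anchor_list.json produced above and the third-party psutil package for listing running processes, and is only an illustration of the idea:

```python
# Compare running process names and paths against the anchor list (assumes
# psutil >= 5.x and the anchor_list.json written by the previous sketch).
import json
import psutil

with open("anchor_list.json") as fh:
    anchors = json.load(fh)   # lowercase filename -> list of legitimate paths

for proc in psutil.process_iter(attrs=["pid", "name", "exe"]):
    name = (proc.info["name"] or "").lower()
    path = (proc.info["exe"] or "").lower()
    if name in anchors and path and path not in anchors[name]:
        # Filename matches a known binary, but the path does not: masquerade candidate.
        print(f"[!] possible masquerade: pid={proc.info['pid']} {name} at {path}")
```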

We'd also recommend some additional triage of results before calling this a legitimate detection and embarking on an incident response. An easy first step is checking the suspect file's hash against the masquerade target in the anchor list; if it's a match, it's probably a false alarm. Also check the signer information for the file, as we discussed in the previous post. Be sure to avoid placing too much trust in the name on the cert, as actors can sometimes obtain code signing certs that look similar to something legitimate...but that's a topic for another day.
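A simple way to do that hash check is sketched below; it assumes you have extended the anchor list with the SHA-256 of each legitimate file (the known_good_hashes structure here is hypothetical):

```python
# Triage sketch: hash the suspect file and compare it to known-good hashes
# (known_good_hashes is a hypothetical set derived from the clean base image).
import hashlib

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def probably_false_alarm(suspect_path, known_good_hashes):
    """True if the suspect file is byte-identical to a legitimate copy."""
    return sha256_of(suspect_path) in known_good_hashes
```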

If you find this approach worthwhile, you will have to keep your anchor list updated. Software changes, and if you don't change with it, you'll have gaps in your analysis.

Filename Mismatch

Why stop at simply comparing files to your anchor list when more can be done? In this bonus masquerading approach, let's compare filenames on disk with those in the resource section of the compiled binary. There's a wealth of additional information here, including the MS Version Info. As Microsoft's documentation notes, it includes the original name of the file but not its path, which can tell you whether the file has been renamed.

Obviously, if the filename on disk doesn't match the original filename, there are generally two possibilities: either the user renamed it, or someone brought a tool with them and doesn't want you to know. Take DLL implants, for example. Many APT groups have brought rundll32 with them rather than using the native Windows version. And APT groups aren't the only ones masquerading. Everyone does this!
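To illustrate, here is a minimal sketch of the mismatch check using the third-party pefile package (a recent release); it reads OriginalFilename from the PE version resource and compares it to the name on disk. The suspect path is hypothetical:

```python
# Filename-mismatch sketch: compare the on-disk name with the compiled-in
# OriginalFilename from the PE version info (assumes a recent pefile release).
import os
import pefile

def original_filename(path):
    """Return the OriginalFilename string from the version resource, or None."""
    pe = pefile.PE(path)
    for file_info in getattr(pe, "FileInfo", []) or []:
        for entry in file_info:                      # StringFileInfo / VarFileInfo blocks
            if entry.Key == b"StringFileInfo":
                for table in entry.StringTable:
                    value = table.entries.get(b"OriginalFilename")
                    if value:
                        return value.decode(errors="replace")
    return None

def is_renamed(path):
    original = original_filename(path)
    return bool(original) and original.lower() != os.path.basename(path).lower()

if __name__ == "__main__":
    suspect = r"C:\Users\Public\rundll32.exe"   # hypothetical suspect file
    print(is_renamed(suspect))
```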

 

Endgame @ the Masquerade Ball

Crafting your own anchor list, regularly updating it, and manually comparing the list to your hunt data or adding this analytic to your bag of post-processing scripts may work for some, but it calls for routine grooming.  Let’s take a look at how easy it is to hunt for masquerading using Endgame, where we provide this as one of the many one-click automations in the platform.

 

 

Conclusion

Who amongst us doesn't love to use Halloween as an excuse to masquerade as someone, or something, else? Unfortunately, adversaries embrace this mentality year round, hiding in plain sight, actively evading detection, and trying to blend in. Clever use of masquerading within filenames can make their activities difficult to detect. While there are manual means to detect mismatches and masquerading, they can be time intensive and may not scale well to larger environments. Thanks to Endgame's advanced detection capabilities, in a few clicks we are able to quickly catch those masqueraders, remediate the intrusion early, and get back to the ball.

 

 

Endgame Research @ AISec: Deep DGA


Machine learning is often touted as a silver bullet, enabling big data to defeat cyber adversaries, or some other empty trope. Beneath the headlines, there is rigorous academic discourse, and there are advances that often get lost in the hype. Last month, the Association for Computing Machinery (ACM) held its 9th annual Artificial Intelligence in Security (AISec) Workshop in conjunction with the 23rd ACM Conference on Computer and Communications Security in Vienna, Austria. This is largely an academic-focused workshop, highlighting some of the most novel advances in the field. I was lucky enough to present research on adversarial machine learning co-authored with my colleagues Hyrum Anderson and Jonathan Woodbridge, titled DeepDGA: Adversarially-Tuned Domain Generation and Detection. As one of only three presenters from outside academia, I quickly saw that more conferences focused on the intersection of machine learning and infosec are desperately needed. While the numerous information security conferences are beginning to introduce data science tracks, it is simply not enough. Given how nascent machine learning is in infosec, the industry would benefit greatly from more venues for cross-pollinating insights across industry, academia, and government.

 

AISec brings together researchers from academic institutions and a few corporations to share research in the fields of security and privacy, highlighting what most would consider novel applications of learning algorithms to hard security problems. Researchers applied machine learning (ML) and network analysis techniques to malware detection, application security, and privacy concerns. The workshop kicked off with a keynote from Elie Bursztein, the director of Anti-Abuse at Google. Although the bulk of his talk discussed various applications of ML at Google, he ended by stressing the need for openness and reproducibility in our research. It was a great talk, and I felt Google was backing its open source message with libraries like TensorFlow and a steady pace of releases of high-performing models such as Inception-ResNet-v2, which allow even an entry-level deep learning (DL) enthusiast to get their hands dirty with a "real" model. In a way, this call for reproducible research could go a long way toward eliminating a common misconception about the use of ML in infosec: that ML is only a black box obfuscated by the latest marketing speak. Opportunities abound to provide infosec data scientists an avenue to demonstrate results without "giving away the farm" and potentially losing out on any (and much deserved) intellectual property rights.

 

AISec compiled a fantastic day of talks, ranging from the release of a new dataset, cleverly named "SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research" (Mirsky et al), to "Identifying Encrypted Malware Traffic with Contextual Flow Data" (Anderson and McGrew, Cisco Systems) and "Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection" (Deo et al, Royal Holloway, University of London, UK). The latter tackled a problem common amongst those who do ML on malware: stale models, or models that were trained on old or outdated samples (a problem known as adversarial drift). The basic question is: how do you decide when it is time to retrain a malware classification model? This can be difficult, especially since training can be expensive, both in terms of time and computational resources. A model can go stale for a variety of reasons, such as shifts in malware techniques or changes to the original labels of training samples. Drift is a fascinating topic, and the researchers did an excellent job describing their methodology (the use of Venn-Abers predictors) for handling such problems.

 

Our presentation focused on the use of Generative Adversarial Networks (GANs) to create a "red team vs. blue team" game around Domain Generation Algorithms (DGAs). We leveraged GANs to construct a deep-learning DGA designed to bypass an independent classifier (red team), and posited that adversarially generated domains can improve training data enough to harden an independent classifier (blue team). In the end, we showed that adversarially-crafted domain names targeting a DL model are also adversarial for an independent external classifier and, at least experimentally, that those same adversarial samples can be used to augment a training set and harden an independent classifier.

[Photo: Bobby presenting at AISec in Vienna, Austria on October 28, 2016]

 

While more mainstream security conferences have an occasional talk on ML, this was my first experience where it was the sole focus. It will be shocking if AISec remains the only ML-focused security conference in the coming years, with the rise of ML in the information security domain. Frankly, it’s well past time to have a larger, multi-day conference, where both academic and information security company researchers come together to learn, recruit, and network on topics we spend our days (and often nights) trying to solve.

 

That is not to take anything away from AISec, as this conference packed quite the proverbial punch with the eight hours it had at its disposal. In fact, my biggest takeaway from AISec is that more conferences like it are needed.  To that end, we’re working across partners and within our networks to formulate such a gathering in 2017, so stay tuned! Providing a venue for researchers to put aside various rivalries, including academic, corporate or public sectors, even for a day, would greatly benefit the entire infosec community, allowing us the opportunity to listen, learn, and apply.

 

 

0 to 31337 Real Quick: Lessons Learned by Reversing the FLARE-On Challenge


The FireEye Labs Advanced Reverse Engineering (FLARE) team just hosted the third annual FLARE-On Challenge, its reverse-engineering CTF. The CTF is made up of linear challenges: one must solve the first to proceed to the next. Out of the 2,063 participants, only 124 people were able to finish all of the challenges. Of those 124 successful participants, only 17 were from the U.S., including us! Josh completed the FLARE-On Challenge last year (2015), learned a lot, and improved his reversing skills significantly, which was motivation enough to challenge himself again and attempt to finish this year's contest. Last year's challenge had interesting problems and gave him first-hand experience reverse engineering Windows kernel drivers, native Android ARM libraries, and .NET executables, among other types of binaries. This year's challenge was also a good way to validate that his reverse-engineering skills hadn't dipped, and he was able to get through some of this year's challenges faster thanks to the lessons learned last year. Blaine's passion is reverse engineering, which he has applied for years to analyzing malicious binaries, including those of APTs. Given the competition's reputation, Blaine decided to enter as a way to validate and hone his reverse engineering and malware analysis skills. While we each approached the problems in our own way, the competition this year did not disappoint.

So What is the FLARE-On Challenge?

This year's contest consisted of ten levels, each requiring a different strategy and set of skills. The levels progressed in difficulty, starting with more basic reversing skills and escalating to more difficult and lesser-known ones. Many levels employed different anti-analysis techniques, including:

    • Custom base64 encoding (a minimal decoding sketch follows this list)
    • Various symmetric encryption routines
    • Obfuscation
    • Anti-VM and anti-debugger checks
    • Custom virtual machine
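As a quick illustration of the first item, a custom base64 alphabet can usually be handled by translating it back to the standard alphabet before decoding; the custom alphabet below is purely hypothetical:

```python
# Custom base64 decoding sketch: map a (hypothetical) custom alphabet back to
# the standard one, then reuse the normal decoder.
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CUSTOM = STANDARD[::-1]   # hypothetical: a real challenge defines its own ordering

def custom_b64decode(data: str) -> bytes:
    translated = data.translate(str.maketrans(CUSTOM, STANDARD))
    return base64.b64decode(translated)
```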

As per the tradition in past FLARE-On challenges, each level consisted of a binary that participants needed to reverse-engineer to uncover a hidden flag -- an email address ending in “@flare-on.com”. Each challenge is unique and doesn’t build upon the previous challenges. Some examples of binaries seen in this year’s contest were:

    • .NET executable
    • DLL
    • Compiled Go executable
    • Ransomware sample
    • PCAP
    • SWF

This year’s FLARE-On Challenge used a variant of the CTFd framework to host the competition. Upon completing a level and finding the flag, you’d enter it into the system and it would score your flag as correct or incorrect, and record your time of completion. The framework is nice as it also provides you with a statistics view of how you are proceeding through the challenge. Below are examples of our stats:

[Image: Screen Shot 2016-11-08 at 2.49.30 PM.png - FLARE-On challenge completion statistics]

 

  

 

[Image: Screen Shot 2016-10-30 at 4.26.37 PM.png - FLARE-On challenge completion statistics]

Note: The failed submission attempts are a result of not being able to tell the difference between “0”s and “O”s on the challenges where the flag was in the form of an image.

 

Getting Started

The FLARE-On Challenge is open to all who wish to participate, and welcomes all skill levels from beginners to experts, or just the plain curious. Simply register on the site with your email and you’re off to the races. The challenges are generally open to contestants for 5-6 weeks at a time and have so far been held between July and November. This year’s contest was held for 6 weeks, starting on Sept 23 and ending on Nov 4.

If you're new to the CTF world, there are some fundamental building blocks you'll need to get started. First, you'll need a virtual machine so you can run applications and various programs in either Windows or Linux. While not 100% necessary, we always recommend a VM, as precautions should be taken when running unknown binaries on your machine. Your debugger of choice, such as OllyDbg, Immunity Debugger, WinDbg, or x64dbg, is also a necessity. Similarly, you'll need your favorite disassembler. We highly recommend IDA (pro or free version), radare2, or Hopper (pro or free version). Additionally, a foundation in x86/x64 assembly will help you read the disassembly.

With that infrastructure in place, there are a few more decisions to make. The challenge is language agnostic, although Python and C/C++ are always solid options. As we mentioned previously, each challenge is unique, so you’ll be acquiring and relying on different tools as you progress through the challenges. At various points, we relied upon tools like dnSpy and ffdec, keeping in mind that each binary is a precious snowflake that brings unique challenges. To that end, an interest in solving puzzles is perhaps the most essential requirement for succeeding at the FLARE-On challenge.

FLARE-On Strategery

A good strategy, especially for CTF-style reversing problems, is to start at the "win" basic block and work your way backwards to see what conditions need to be satisfied to reach it. Pay attention to how your user input affects the flow of execution, and learn to block out the stuff that doesn't matter (i.e., the white noise). That skill generally comes with experience -- so until then, enjoy determining what all the nitty-gritty parts do!

Writing out the pseudocode can aid in understanding what various functions and basic blocks do. For practice we highly recommend the open source IDA Pro Binary Auditing Training Material, which has many binaries representing high-level language (HLL) constructs (such as if-then statements, pointers, C++ virtual tables, etc.). By understanding how these HLL constructs map to their disassembly counterparts, you'll quickly be able to understand what's happening at the disassembly level and be able to reproduce near-source pseudocode. Or you can use a decompiler to do the bulk of the work for you, such as the ones found in IDA Pro and Hopper Pro.

Anti-analysis checks (such as anti-VM and anti-debugging) are sometimes thrown into the challenges to slow analysis of the binaries. This mirrors real-world malware, which usually has multiple anti-analysis checks built in. These checks serve multiple purposes -- to hinder analysis and to prevent infection of non-targeted systems (such as a malware analyst's machine or a honeypot). They can usually be overcome by patching the binary (NOPing out the instructions) or by modifying the VM if necessary (renaming or deleting specific drivers or programs).
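As a simple illustration of the patching approach, the sketch below overwrites a range of bytes in a copy of the binary with x86 NOPs; the file name, offset, and length are hypothetical and would come from your disassembler:

```python
# NOP-patch sketch: blank out an anti-debug check on disk (offset/length are
# hypothetical; locate the real ones in your disassembler first).
def nop_patch(path, file_offset, length, out_path=None):
    """Overwrite `length` bytes at `file_offset` with 0x90 (x86 NOP) and save a copy."""
    with open(path, "rb") as fh:
        data = bytearray(fh.read())
    data[file_offset:file_offset + length] = b"\x90" * length
    patched = out_path or path + ".patched"
    with open(patched, "wb") as fh:
        fh.write(bytes(data))
    return patched

# Example (hypothetical): NOP out a 2-byte conditional jump after an IsDebuggerPresent call.
# nop_patch("challenge.exe", 0x1A2B, 2)
```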

If all else fails, keep it simple. Break apart the code one piece at a time, and if you hit a wall, Google it out! Remember, any and all tools (hopefully obtained legally) are at your disposal, so use them. Be creative! This is a great opportunity to gain a deeper understanding of real-world techniques used by malware authors to increase the difficulty of reversing.

So You Think You Know Reversing?

Are you looking for a personal test of skills and mental fortitude? Yearning for that mad street cred? Want to be the envy of everyone in your office with this most elusive of swag? The FLARE-On Challenge is for you! This year’s prize is the police-style badge below. Pretty cool, right?

[Image: 1478293868725.jpg - the FLARE-On 2016 prize, a police-style badge]

More importantly, the FLARE-On Challenge is a tremendous way to continue to test, hone, and expand your reverse engineering skills. Now that you know how to get started, we strongly encourage you to consider participating in next year's challenge. Over the remainder of the year, and to further assist you in your FLARE-On aspirations, we'll provide a few more posts pertaining to the FLARE-On Challenge. We'll get into the weeds of some of the more creative and daunting challenges we overcame en route to joining the esteemed ranks of those who completed previous challenges.

 
