
Stop Saying Stegosploit Is An Exploit


Security researcher Saumil Shah recently presented “Stegosploit” (slides available here). His presentation received a lot of attention on several security news sites, including Security Affairs, Hacker News, and Motherboard, which reported that users could be exploited simply by viewing a malicious image file in their web browser. If that were true, this would be terrifying.

“Just look at the image and you are HACKED!” – thehackernews

Here’s the thing. That is not what is happening with Stegosploit. Saumil Shah has created a “polyglot”. A polyglot is defined as “a person who knows and is able to use several languages,” but in the security world, the term can refer to a file that is a valid representation of two different data types. For example, you can concatenate a RAR file to the end of a JPG file. If you double click the JPG image, a photo pops up. If you then rename that JPG file to a .rar file, the appended RAR file will open. This is due to how the JPG and RAR file formats specify where the file begins. Stegosploit is using this same premise to embed JavaScript code inside of an image file, and obscure the JavaScript payload within pixel data.
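To make the concatenation trick concrete, here is a minimal Python sketch of building a JPG/RAR polyglot. The filenames are placeholders, and this is an illustration of the general concept rather than Stegosploit itself: JPG parsers read from the start of the file, while RAR tools scan forward for the RAR signature, so both interpretations of the combined file remain valid.

# Minimal sketch of a JPG/RAR polyglot via concatenation.
# "photo.jpg" and "archive.rar" are hypothetical input files.

def make_polyglot(jpg_path, rar_path, out_path):
    # JPG readers parse from the start of the file (the 0xFFD8 SOI marker),
    # while RAR extractors scan forward for the RAR signature, so the
    # combined file is valid to both.
    with open(jpg_path, "rb") as jpg, open(rar_path, "rb") as rar:
        blob = jpg.read() + rar.read()
    with open(out_path, "wb") as out:
        out.write(blob)

make_polyglot("photo.jpg", "archive.rar", "photo_polyglot.jpg")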

This is still an interesting vector because it is difficult to detect. It adds a layer of obfuscation, relying on security through obscurity to evade detection.

Embedding your code inside images requires a defensive product to not only process every packet, but also to inspect the individual artifacts extracted from the connection. Security through obscurity is widely considered ineffective. However, it is important to note that in order to identify even the most rudimentary steganography, you have to analyze every image file, which is computationally expensive and increases the cost to defenders.

What is really interesting here is that Saumil Shah was actually rather forthcoming about this during his talk, clearly announcing that he was using a loader to deliver the payload, although that may not have been obvious to some of the observers. The exploit was delivered because the attacker sent malicious, obfuscated JavaScript to the browser. Stegosploit simply obfuscates an attack that could have been executed anyway. Just looking at an image will not exploit your web browser.

In the screenshot above, taken from the recording of the actual conference talk, Saumil is showing the audience the exploit “loader”. This is where a traditional JavaScript payload would be injected. The operative text in that screenshot is <script src="elephant3.jpg"></script>, which takes a valid image file and interprets it as JavaScript. It simply injects the malicious code into a carrier signal so it looks innocuous. While it may seem like splitting hairs, there is an extremely important distinction between “looking at this photo will exploit your machine” and “this photo is camouflage that hides an exploit that has already occurred.”

All that being said, legitimate image exploits have been discovered in the past. Most notably, MS04-028 was a buffer overrun in Windows’ JPEG processing library (GDI+). In that case, loading an image into your browser could quite literally exploit your machine. It was tagged as a critical vulnerability and promptly patched.

Stegosploit is an obfuscation technique to hide an exploit within images. It creates a JavaScript/image polyglot. Don’t worry, you can keep looking at captioned cat photos without fear.


Curt Barnard

OPM Breach: Corporate and National Security Adversaries Are One and the Same


On June 5, 1989, images of a lone person standing ground in front of Chinese tanks in Tiananmen Square transfixed the world. On the same day twenty-six years later, the United States government announced one of the biggest breaches of personnel data in history. This breach is already being attributed to China. China has also recently been stepping up its efforts to censor any mention of the Tiananmen Square massacre. The confluence of these two events – censorship of a pivotal human rights incident coupled with the theft of four million USG personnel records – should clarify beyond a doubt China’s intentions and vision for what constitutes appropriate norms in the digital domain. It is time for all of the diverse sectors and industries of the United States – from the financial sector in New York City to the tech industry in Silicon Valley to the government in Washington – to recognize the gravity of this common threat and commit to a legitimate public-private partnership that extends beyond lip service. As the OPM breach demonstrates, the United States government faces the same threats and intellectual property theft as the financial, tech, and other private sector industries. It’s time to move beyond our cultural divisions and unify against the common adversaries who are the true threats to privacy, security, democracy and human rights across the globe.

I attended a “Cyber Risks in the Boardroom” event yesterday in New York City. More often than not, these kinds of cybersecurity conferences will include one panel of private sector experts complaining about government regulations, infringements on privacy, and failure to grasp the competitive disadvantage of US companies thanks to proposed legislation. I have even heard the USG referred to as an “advanced persistent threat.” A government panel generally follows, and bemoans the inability of the private sector to grasp the magnitude of the threat. There is often an anecdote about an unnamed company that refuses government assistance when a breach has been identified, and there’s the obligatory attempt at humor to assuage fears that the government is really not interested in reading your email or tracking your Snapchat conversations.

That did not happen yesterday. The one comment that struck me the most was a call for empathy between the private and public sectors. In fact, at a conference held in the heart of the financial capital of the world, panel after panel reiterated the need for the government and private sector to work together to ensure the United States’ competitive economic advantage. The United States economy and its innovative drive is the bedrock of national security. The financial sector – one of the largest targets of digital theft and espionage – seems to grasp the essential role the government can and should play in safeguarding a level digital playing field. Nonetheless, even in this hospitable environment, cultural and linguistic hurdles, not to mention trust issues, continue to limit cooperation between the financial sector and government.

News of the OPM breach broke just as I was leaving the conference. Many are attributing the breach to China. As someone who lives at the intersection of technology and international affairs, it is impossible to ignore the irony. There continues to be heated debate about US surveillance programs, as well as potentially impending legislation on intrusion software. These debates will not likely end soon, and they are part of the democratic process and freedom of speech that is so often taken for granted. Compare that to China’s expansive censorship and propaganda campaign that not only forces US companies operating in China to censor any mention of Tiananmen Square, but limits any mention of activities that may lead to collective gatherings. Or compare that to China’s 50 cent party, a group of individuals paid by the Chinese government to provide positive social media content about the government. (Russia has a similar program, which extends internationally, including spreading disinformation on US media outlets.) Perhaps even more timely, China is censoring online discussion about the horrific cruise ship capsizing earlier this week on the Yangtze River. This is very similar to the approach taken following the 2011 train crash, which likewise led to censorship of any negative media coverage of the government’s response.

The enormous and historic OPM breach, revealed on the 26th anniversary of the Tiananmen Square protests, should cause the disparate industries and sectors that form the bedrock of US national security to pause…and empathize. Combating common adversaries that threaten not only national security, but also freedom of information and speech, requires a united front. The private and public sectors are much stronger working together than apart. Despite significant cultural differences, there are core values that unite the private and public sectors, and it’s time to put aside differences and work as a cohesive unit against US corporate and national security adversaries—for they are truly one and the same. This does not mean that debates about privacy and legislation should subside. On the contrary, those debates should continue, but must become constructive forms of engagement rather than divisive editorials. Many – especially those in the financial sector – seem to grasp the appropriate role for the government in handling these threats. It’s time to put aside differences and pursue constructive and united private-public sector collaboration to deter the persistent theft of IP and PII information at the hands of the adversaries we all face together.


Andrea Little Limbago

The Digital Domain’s Inconvenient Truth: Norms are Not the Answer


To say the last week has been a worrisome one for any current or former federal government employees is a vast understatement. Now, with this weekend’s revelations that the data stolen in the OPM breach potentially included SF-86 forms as well—the extraordinarily detailed forms required to obtain a security clearance—almost every American is in fact indirectly impacted, whether they realize it or not. As China’s repository of data on United States citizens continues to grow, it’s time that the United States adjusts its foreign digital policy to reflect modern realities. Despite this latest massive digital espionage, the United States continues to pursue a policy based largely on instilling global norms of appropriate behavior in cyberspace, the success of which depends on all actors playing by the same rules. Norms only work when all relevant actors adhere to and commit to them, and the OPM breach, as well as other recent breaches by Russia, North Korea, and Iran, confirms that each state is playing by its own playbook for appropriate behavior in the digital domain. The U.S. needs to adopt a new approach to digital policy, or else this collective-action problem will continue to plague us for the foreseeable future. Global norms are not the silver bullet that many claim.

The Problem with Norms in a Multi-Polar International System

In recent testimony before Congress, the State Department Coordinator for Cyber Policy, Christopher Painter, outlined the key tenets of US foreign policy in the cyber domain. During this testimony, he highlighted security and cybercrime, with norms as a key approach to tackling that issue. He explicated the following four key tenets (abridged) on which global norms should be based:

1. States cannot conduct online activity that damages critical infrastructure.

2. States cannot prevent CSIRTs from responding to cyber incidents.

3. States should cooperate in investigations of online criminal activity by non-state actors.

4. States should not support the theft of IP information, including that which provides competitive advantage to commercial entities.

While these are all valid pursuits, the OPM breach confirms the age-old tenet that states are self-interested, and therefore are quite simply not going to adhere to the set of norms that the United States seeks to instill. The United States government is not the only one calling for “norms of responsible behavior to achieve global strategic stability”. Microsoft recently released a report entitled International Cybersecurity Norms, while one of the most prominent international relations academics has written about Cultivating International Cyber Norms. Rather than focusing on norms, policy for the digital domain must reflect the economic, political, military and diplomatic realities of international relations. It should not be viewed as a stove-piped arena for cooperation and conflict across state and non-state actors. For example, the omnipresent tensions in the South China Sea are indicative of China’s larger, cross-domain global strategy. Russian rhetoric and activities in Eastern Europe similarly are a source of great consternation, with digital espionage a key aspect of Russia’s foreign policy behavior. These cross-domain issues absolutely spill over into the digital domain and therefore hinder the chance that norms will be successful. These tensions are exacerbated by completely orthogonal perspectives on the desired digital end-state of many authoritarian regimes, which focus on the notion of cyber sovereignty. These issues are further compounded when these states continue to maintain an economic system predicated on state-owned enterprises, which are essentially an extension of the state, meaning that IP theft directly supports the government and its favored quasi-commercial entities. Finally, the notion of credible commitments is again an essential factor in norm diffusion. Because of the surveillance revelations of recent years, other states remain cautious and dubious that the United States will itself adhere to these norms. This lack of trust further undermines the set of norms that the United States is advocating.

Towards a New Approach: Change the Risk Calculus for the Adversary

Instead of a norms-based approach, formal, multi-actor models that focus on calculating the risks and opportunities of actions from an adversary’s perspective could greatly contribute to more creative (and potentially deterrent) policies. Thomas Schelling’s research on bargaining and strategy is emblematic of this approach, expanding on the interdependence and the strategic interplay that occurs between actors. Mancur Olson’s work on collective action similarly remains especially applicable when pursuing policies that require adherence by all actors within a group. These frameworks account for the preferences of multiple actors in a decision-making process and help identify the probability of preferences across a spectrum of options. If done well, incorporating multi-actor preferences not only provides insights into why some actors pursue policies or activities that seem irrational to others, but it also forces the analyst or policymaker to view the range of preferred outcomes from the adversary’s perspective. Multi-actor models advocate for a strong understanding of activities that can favorably impact the expected utility and risk calculus of adversaries. The United States has taken some steps in this direction, and it should increasingly rely on policies that raise the costs of a breach for the adversary. For example, the indictment of the five PLA officers last year is a positive signal that digital intrusions will incur punishment. In addition to punitive legal responses targeted at adversaries, greater technical capabilities that hunt the adversaries within the network can also raise the cost of an intrusion. If the cost of entry outweighs the benefits, adversaries will be much less likely to attack at will. Until then, attackers will steal information without any fear of retribution or retaliation and the digital domain will remain anarchic. Finally, instead of focusing on global norms that give the competitive advantage to those who do not cooperate, digital cooperation should be geared toward allies, encouraging the synchronization of similar punitive legislation and responses in light of an attack. In this regard, cooperation can reinforce collective security, and focus on enabling the capabilities of allied states, not limiting those capabilities to allow adversaries the upper hand.

The United States continues to pursue policies that require global support and commitment in order to be effective, rather than focusing on changing the risk calculus for the adversary. The OPM breach—one that affects almost all former and current federal employees and their contacts and colleagues throughout their lives—is evidence that other states play by a different playbook. While the U.S. should continue its efforts to shape the digital domain as one that fosters economic development, transparency, equality and democracy, the reality is that those views are not shared by some of the most powerful states in the global community. Until that inconvenient truth is integrated into policy, states and state-affiliated groups will continue to compile an ever-expanding database of U.S. personnel and trade secrets, which not only impacts national security, but also the economic competitiveness on which that security is built.


Andrea Little Limbago

Data Science for Security: Using Passive DNS Query Data to Analyze Malware


Most of the time, DNS services—which produce the human-friendly, easy-to-remember domain names that map to numerical IP addresses—are used for legitimate purposes. But they are also heavily used by hackers to route malicious software (or malware) to victim computers and build botnets to attack targets. In this post, I’ll demonstrate how data science techniques can be applied to passive DNS query data in order to identify and analyze malware.

A botnet is a network of hosts infected by malware and used to conduct nefarious activities, usually without the awareness of their owners. A command-and-control host hidden in the network communicates with the infected computers to give instructions and receive results. In such a botnet topology, the command-and-control host becomes the single point of failure. Once its IP address is identified, it can easily be blocked and all communication with the botnet would be lost. Therefore, hackers are more likely to use a domain name to identify the command-and-control host, and employ techniques like fast flux to switch the IP addresses mapped to a single domain name.

As data scientists at Endgame, we leverage data sets in large variety and volume to tackle botnets. While the data we analyze daily is often proprietary and confidential, there is a publicly available data set provided by Georgia Tech that documents DNS queries issued by malware from 2011 to 2014. The malware samples were contained in a controlled environment and had limited Internet access. Each and every domain name query was recorded, and if a domain name could be resolved, the corresponding IP address was also recorded.

This malware passive DNS data alone would not provide sufficient information to conduct a fully-fledged botnet analysis, but it does possess rich and valuable insights about malware behaviors in terms of DNS queries. I’ll explain how to identify malware based on this data set, using some of the methods the Endgame data science team employs daily.

Graphical Representation of DNS Queries

Here is the data set I’ll examine. Each row is a DNS query record, including the date, the MD5 of the malware file, the domain name being queried, and the IP address if the query returned a result.

What approach might enable the grouping of malware or suspicious programs based on specific domain names? As we have no information about the malware beyond its queries, conventional static analysis focused on investigating binary files would not be helpful here. Clustering with machine learning might work if each domain name were treated as a feature, but the feature space would be very sparse, making computation expensive.

Instead, we can represent the DNS queries as a graph showing which domain names each malware sample queried, as displayed in Figure 1. Each malware program is labeled by an MD5 string. While Figure 1 only shows a very small part of the network, the entire data set could be transformed into a huge graph.

Figure 1. A small DNS query network
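As a rough illustration of this representation, the following Python sketch builds such a graph with the networkx package. The records are hypothetical rows shaped like the data set described above (the third MD5 is a placeholder).

import networkx as nx

# Hypothetical rows: (date, malware MD5, queried domain, resolved IP or None)
records = [
    ("2012-09-29", "0398eff0ced2fe28d93daeec484feea6", "xudunux.info", None),
    ("2012-09-29", "0398eff0ced2fe28d93daeec484feea6", "rylicap.info", None),
    ("2012-09-29", "d41d8cd98f00b204e9800998ecf8427e", "rylicap.info", "203.0.113.7"),
]

G = nx.Graph()
for date, md5, domain, ip in records:
    G.add_node(md5, kind="malware")
    G.add_node(domain, kind="domain")
    G.add_edge(md5, domain)  # edge: this malware queried this domain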

There are numerous advantages to expressing the queries as a graph. First, it expedites querying complex relationships. A modern graph database, such as Neo4j, OrientDB, or TitanDB, can efficiently store a large graph and conduct join queries that are normally computationally expensive for relational databases such as MS SQL Server, Oracle, or MySQL. Second, network analytic methods from a diverse range of scientific fields can be employed to analyze the data set and gain additional insights.

Graph Analysis on the Malware Network

The entire passive DNS data set covers several years, so I randomly picked a day during the data collection period and will present the analysis on the reduced data set. A graph was created out of a day’s worth of data, and the nodes include both domain names and malware MD5 strings. In other words, a node in the graph can either be an MD5 string, or a domain name, and an edge (or a connection) links an MD5 and a domain if the MD5 queries that domain name. The total number of nodes is 17,629, and the number of edges is 54,939. The average number of connections per node is about 3.

In my graph representation of DNS queries, there are two distinct sets of nodes: domain names and malware. A node in one set only connects with a node in the other set, and not one in its own set. Graph theory defines such a network as a bipartite graph, as shown in Figure 2. I wanted to split the graph into two graphs, one containing all the nodes of domain names, and the other containing only malware programs. This can be done by projecting the large graph onto the two sets of nodes, which creates two graphs. In each graph, two nodes are connected by an edge if they have connections to the same node of the other type. For example, domains xudunux.info and rylicap.info would be connected in the domain graph because both of them have connections with the same malware in the larger graph.

Figure 2. Bipartite graph showing two distinct types of nodes
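The projection can be sketched with networkx’s bipartite utilities, continuing from the graph G built earlier; the node sets are recovered from the kind attribute set at construction time.

from networkx.algorithms import bipartite

malware_nodes = {n for n, d in G.nodes(data=True) if d["kind"] == "malware"}
domain_nodes = set(G) - malware_nodes

# Two malware (or two domains) are linked in the projection if they
# share at least one neighbor in the other node set.
malware_graph = bipartite.projected_graph(G, malware_nodes)
domain_graph = bipartite.projected_graph(G, domain_nodes)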

Let’s look at the graph of malware first. For the day 2012-09-29 alone, there are 9,876 unique malware samples recorded in the data set. First, I would like to know the topological layout of these malware and find out how many connected components exist in the malware graph.

A connected component is a subset of nodes where any two nodes are connected to each other by one or multiple paths. We can view connected components (or just components) as islands that have no bridge connecting each other.

The Python programming language has an excellent network analysis package called networkx, which includes a function, number_connected_components, that computes the number of connected components of a graph. Running it shows there are 2,114 components in the 9,876-node graph, 1,619 of which are one-node components. Eleven components have more than 100 nodes within them. I will analyze those large components because the malware inside may be variants of the same program.
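A sketch of that computation on the projected malware graph from the snippets above:

import networkx as nx

print(nx.number_connected_components(malware_graph))

# Keep only the large components (more than 100 nodes) for closer analysis.
large_components = [c for c in nx.connected_components(malware_graph)
                    if len(c) > 100]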

Figure 3 shows four components of the malware graph. The nodes in each component are densely connected to each other but not to any other components. That means the malware assigned to a component clearly possess some similar characteristics that are not shared by the malware from other components. 

Figure 3. Four out of eleven components in the malware graph

Component 1 contains 201 nodes. I computed the betweenness centrality of the nodes in the graph, which are all zeros, while the closeness centrality values of the nodes are all ones. This indicates that each node has a direct connection with each other node in the component, meaning that each malware queried exactly the same domain names as the other malware programs. This is a strong indication that all 201 malware are variants of a certain type of malicious executable.
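A sketch of that centrality check, assuming component 1’s node set is the first entry in large_components from the snippet above:

comp1 = malware_graph.subgraph(large_components[0])

betweenness = nx.betweenness_centrality(comp1)
closeness = nx.closeness_centrality(comp1)

# All-zero betweenness with all-one closeness means every node is one hop
# from every other node, i.e. the component is a complete graph.
is_complete = (all(v == 0 for v in betweenness.values())
               and all(v == 1 for v in closeness.values()))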

Let’s return to the large DNS query graph to find out what domains the malware targeted. Using a graph database like Neo4j or OrientDB, or a graph analytic tool like networkx, the search is easy. The result shows that the malware in component 1 were only interested in three domain names: ns1.musicmixa.net, ns1.musicmixa.org, and ns1.musiczipz.com.

I queried VirusTotal for each of the 201 malware in component 1. VirusTotal submits the MD5 to a list of scanning engines and returns the reports from those engines. Each report includes the engine’s determination of the MD5 as either positive or negative. If positive, the report provides more information about what kind of malware the MD5 is, based on the signature that the scanning engine uses.

I assigned a score to each malware by computing the ratio of the number of positive results to the total number of results. The distribution of the scores is shown in Figure 4. The scanning reports imply that the malware is a Win32 Trojan.

Figure 4. Histogram of VirusTotal score of malware in Component 1

Using Social Network Analytics to Understand Unknowns

Looking at each of the components, not all of them exhibit as high a level of homophily as component 1 does. A different component has 2,722 malware nodes and 681,060 edges. 309 of the 2,722 malware in this component were not known to VirusTotal, while the remaining 2,413 had reports on the website. We need a way to analyze those unknown malware.

Social network analytic (SNA) methods provide insights into unknown malware by identifying known malware that are similar to the unknowns. The first step is to try to break the large component into communities. The concept of community is easy to understand in the context of a social network. Connections within a community are usually much denser than those outside a community. Members of a community tend to share some common trait, such as mission, geo-location, or profession. In this analysis, malware are connected if they queried the same domain, which can be interpreted as two malware samples exhibiting a common interest in that domain name. Therefore, we can expect that malware programs that have queried similar domains represent a community. Communities exist inside a connected component and differ from components in that communities still have connections between each other.

Community detection is a particular kind of data clustering within the domain of machine learning. There are a wide variety of methods for community detection in a graph. The Louvain method is a well-known, well-performing one that optimizes the measure of modularity by partitioning a graph into groups of densely connected nodes. By applying the Louvain method to the big component with 2,722 nodes, I identified 15 communities; the number of nodes within each community is shown in Figure 5.

Figure 5. Number of nodes in each community
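A sketch using the python-louvain package (conventionally imported as community), continuing from the component extraction above; whether this exact package was used in the original analysis is an assumption.

import community as community_louvain  # pip install python-louvain
from collections import Counter
import networkx as nx

# Largest connected component of the projected malware graph (built above).
big_component = malware_graph.subgraph(
    max(nx.connected_components(malware_graph), key=len))

partition = community_louvain.best_partition(big_component)  # node -> community id
print(Counter(partition.values()).most_common())  # community sizes, cf. Figure 5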

Let’s take a specific malware as an example. The MD5 of this malware is 0398eff0ced2fe28d93daeec484feea6, and a search for it on VirusTotal returns no result, as shown in Figure 6.

Figure 6. Malware not found on VirusTotal

I want to know what malware programs have the most similar behavior in terms of DNS queries to this unknown malware. By looking into the similar malware that we do have knowledge about, we could gain insights into the unknown one.

I found malware 0398eff0ced2fe28d93daeec484feea6 in Community 4, which contains 256 malware. To find the most similar malware programs, we need a quantitative definition of similarity. I chose the Jaccard index to compute how similar two sets of queried domains are.

Suppose malware M1 queried a set of domains D1, and malware M2 queried another set of domains D2. The Jaccard index of sets D1 and D2 is calculated as J(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|, the size of the intersection divided by the size of the union.

The Jaccard index goes from 0 to 1, with 1 indicating an exact match.
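In Python, the index reduces to a few lines of set arithmetic; the domain sets below are hypothetical examples reusing the domains found earlier.

def jaccard(d1, d2):
    """Jaccard index of two sets of queried domains."""
    if not d1 and not d2:
        return 0.0
    return len(d1 & d2) / len(d1 | d2)

queries_m1 = {"ns1.musicmixa.net", "ns1.musicmixa.org", "ns1.musiczipz.com"}
queries_m2 = {"ns1.musicmixa.net", "ns1.musicmixa.org", "ns1.musiczipz.com"}
print(jaccard(queries_m1, queries_m2))  # 1.0 -> identical query behavior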

Out of the 2,722 nodes in this large component, 100 malware programs have exactly the same domain queries as malware 0398eff, meaning their Jaccard indices against malware 0398eff are 1. However, only 9 of those malware are known to VirusTotal. The 9 malware are shown below.

Each of the 100 malware programs that have the same domain queries as malware 0398eff, including the 9 known ones, appears in community 4. The histogram of Jaccard indices is shown in Figure 7.

Figure 7. Histogram of Jaccard index for nodes in community 4

We can tell from the histogram that the malware programs in community 4 can generally be split into two sets. One set contains the 100 malware that have exactly the same domain queries as malware 0398eff, and the other set contains nodes that are much less similar to it. The graph visualization in Figure 8 demonstrates the split. Through this analysis, we have found that the 91 previously unknown malware behave similarly to some known malware.

This blog post demonstrates how I used DNS query data to conduct graph-based network analysis of malware. Similar analysis can be done with the domain names to identify groups of domains that tend to be queried together by a malware program. This can help identify potentially malicious domains that were previously unknown.

Given the vast quantities of data those of us in the security world handle daily, data science techniques are an increasingly efficient and informative way to identify malware and targeted domains. While machine learning and clustering tend to dominate these kinds of analyses, graph-based social network methods should increasingly become another tool in the data science toolbox for malware detection. Through the identification of communities, betweenness, and similarity scores, network analysis helps show not only connectivity, but also logical groupings and outliers within the network. Viewing malware and domains as a network provides another, more intuitive approach for wrangling the big data security environment. Given the limited features available in the passive DNS query data, graph analytic approaches supplement traditional static and dynamic approaches and elevate capabilities in malware analytics.


Richard Xie

Meet Endgame at Black Hat 2015


 

Endgame will be at Black Hat!

Stop by Booth #1215 to:

GET AN ENDGAME ENTERPRISE DEMO

Sign up here for a private demo to learn how we help customers automate the hunt for cyber adversaries.
 

MEET WITH ENDGAME EXPERTS

Meet our experts and learn more about threat detection and data science. Check out the Endgame blog to read the latest news, trends, and research from our experts before you go.
 

EVERYONE NEEDS A SMART WATCH!

Enter to win an Apple or LG smart watch. Stop by the booth Wednesday, August 5 or Thursday, August 6 for a chance to win. We'll announce each day's winner on Twitter at 5pm PT.


Examining Malware with Python


Before I came to Endgame, I had participated in a couple of data science competitions hosted by Kaggle. I didn’t treat them as competitions so much as learning opportunities. Like most things in the data science community, these competitions felt very new. But now that I work for a security company, I’ve learned about the long history of CTF competitions meant to test and add to a security researcher’s skills. When the Microsoft Malware Challenge came along, I thought this would be a great opportunity to learn about new ways of applying machine learning to better understand malware. Also, as I’ve talked about before, the lack of open and labeled datasets is a huge obstacle to developing machine learning models to solve security problems. Here was an opportunity to work with an already prepared large labeled dataset of malware samples.

I gave a talk at the SciPy conference this year that describes how I used the scientific computing tools available in Python to participate in the competition. You can check out my slides or watch the video from that talk here. I tried to drive home two main points in this talk: first, that Python tools for text classification can be easily adopted for malware classification, and second, that details of your disassembler and analysis passes are very important for generalizing any results. I’ll summarize those points here, but take a look at the video and slides for more details and code snippets.

My final solution to the classification challenge was mainly based on counting combinations of bytes and instructions, called ngrams. This method counts the frequency with which a byte or an instruction occurs in a malware sample. When n is greater than one, I count the frequency of combinations of two, three, or more bytes or instructions. Because the number of possible combinations climbs very quickly, a hashing vectorizer must be used to keep the size of the feature space manageable.
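As a minimal sketch of that idea (not the actual competition pipeline), scikit-learn’s HashingVectorizer can hash byte 2grams into a fixed-size space if each sample’s bytes are rendered as space-separated hex tokens; the two “documents” below are hypothetical.

from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical "documents": one hex-token string per malware sample.
byte_docs = [
    "ff d8 ff e0 00 10 4a 46",  # JPG-like header bytes
    "4d 5a 90 00 03 00 00 00",  # PE-like header bytes
]

vectorizer = HashingVectorizer(
    analyzer="word",
    ngram_range=(2, 2),   # byte 2grams
    n_features=2 ** 20,   # hashing fixes the feature space size
)
X = vectorizer.transform(byte_docs)  # sparse matrix, one row per sample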

Figure 1: Example byte 2grams from the binglide documentation

 

Figure 2: Byte 2grams from a malware sample included in the competition

At first, I was only using byte ngrams and I was very surprised that feeding these simple features to a model could provide such good classifications. In order to explore this, I used binglide to better understand what the bytes inside an executable look like. Figure 1 and Figure 2 show the results of this exploration. Figure 1 shows example output from binglide’s documentation and Figure 2 shows the output when I ran the tool on a sample from the competition. In all the images, the entropy of a binary is displayed on the strip to the left and a histogram of the 2gram frequency is shown on the right. For that frequency histogram, each axis contains 256 possible values for a byte and a pixel turns blue as that combination of bytes occurs more frequently.

You can see that the first 2gram pattern in Figure 2 generally looks like the first 2gram pattern in Figure 1. The .text section is usually used for executable code so this match to example x86 code is reassuring. The second 2gram pattern in Figure 2 is very distinctive and doesn’t really match any of the examples from the binglide documentation. Machine learning algorithms are well suited to picking out unique patterns like this if they are reused throughout a class. Finding this gave me more confidence that the classification potential of the byte ngram features was real and not due to any mistake on my part.

I also used instruction ngrams in a similar way. In this case, instructions refer to the first part of the assembly code after it’s been disassembled from the machine code. I wrote some Python code to extract the instructions from the IDA disassembly files that were provided by those running the competition. Again, feature hashing was necessary to restrain the size of the feature space. To me, it’s very easy to see why instruction ngrams could provide good classifications. Developing software is hard, and malware authors are going to want to reuse code in order to not waste effort. That repeated code should produce similar patterns in the instruction ngram space across families of malware.

Using machine learning algorithms to classify text is a mature field with existing software tools. Building word and character ngrams from text is a very similar problem to building the byte and instruction ngrams that I was interested in. In the slides from my SciPy talk I show some snippets of code where I adapted the existing text classification tools in the scikit-learn library to the task of malware classification. Those tools were a couple of different vectorizers, pipelines for cross validating multiple steps together, and a variety of models to try out.
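A hedged sketch of how those pieces fit together; the instruction strings and family labels below are toy stand-ins for the competition data, and the model choice is illustrative rather than the one used in the talk.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-ins: one space-separated instruction string per sample,
# with a family label for each.
docs = [
    "push mov sub pop ret",
    "push mov sub pop ret",
    "mov xor call ret",
    "mov xor call ret",
]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("ngrams", HashingVectorizer(analyzer="word", ngram_range=(1, 2),
                                 n_features=2 ** 18)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validate the vectorizer and model together; the competition was
# judged on multi-class logarithmic loss.
scores = cross_val_score(pipeline, docs, labels, cv=2, scoring="neg_log_loss")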

All throughout this process, I was aware that the disassembly provided in the competition would not be available in a large, distributed malware processing engine. IDA Pro is the leading program for reverse engineering and disassembling binaries. It is also restrictively licensed and intended to be run interactively. I’m more interested in extracting features from disassembly automatically in batch and providing some insight into the files generated by a statistical model. I spent a lot of time during and after the competition searching for open source tools that could automatically generate the disassembly provided by the competition.

I found Capstone to be a very easy-to-use open source disassembler. I used it to generate instruction ngrams and compared the classification performance of models based on those ngrams against the same models based on IDA instructions. Both performed well, with very few misclassifications. The competition was judged on a multi-class logarithmic loss metric, though, and that metric was always better when using the IDA instructions.
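For reference, extracting mnemonics with Capstone takes only a few lines; the code buffer here is a hypothetical x86 function prologue and epilogue rather than a real sample.

from capstone import Cs, CS_ARCH_X86, CS_MODE_32

code = b"\x55\x89\xe5\x83\xec\x10\x5d\xc3"  # push/mov/sub/pop/ret
md = Cs(CS_ARCH_X86, CS_MODE_32)

# One linear sweep over the buffer, keeping mnemonics for instruction ngrams.
mnemonics = [insn.mnemonic for insn in md.disasm(code, 0x1000)]
print(mnemonics)  # ['push', 'mov', 'sub', 'pop', 'ret']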

After talking to some security experts at Endgame, I’ve learned that this could be due to the analysis passes that IDA does before disassembling. Capstone will just execute one sweep over the binary and disassemble anything it finds as it goes. IDA will more intelligently decode the binary looking for entry points, where functions and subroutines begin and end, and what sections actually contain code, data, or imports. I was able to relate this to my machine learning experience in that I viewed IDA’s disassembly as a more intelligent feature engineering pipeline. The result is that I’m still working on finding or building the best performing distributable disassembler.

This Kaggle competition was a great example of how data science can be applied to solve specific security problems. Data science has been described as a combination of skills in software, math, and statistics, along with domain expertise. While I didn’t have the domain expertise when I first joined Endgame, working closely with our security experts has expanded my breadth of knowledge while giving me a new opportunity to explore how data science techniques can be used to solve security challenges.


Phil Roth

Why We Need More Cultural Entrepreneurs in Security & Tech


Recently, #RealDiversityNumbers provided another venue for those in the tech community to vent and commiserate over the widely publicized lack of diversity within the industry. The hashtag started trending and gained some media attention. This occurred as Twitter came under fire for organizing a frat-themed party, while also facing a gender inequality claim. Unfortunately, as dire as the diversity situation is in the tech sector writ large, it pales in comparison to the statistics on diversity in the security sector. The security community not only faces a pipeline shortage, but it has also made almost no progress in actively attracting a diverse workforce. The tectonic shifts required to achieve true diversity in the security sector also mean a fundamental shift in the tech culture must take place. However, while companies such as Pinterest have publicly noted their commitment to diversity, very little has changed from top-down approaches to diversification in the tech community. Certainly internal policies and recruiting practices matter, and leadership support is essential. These are the core enablers, but are not sufficient for institutionalizing cultural change. Instead, cultural entrepreneurs reflecting technical expertise across an organization must lead a grassroots movement to truly institutionalize cultural change within organizations and across the tech community. All of us must move beyond our comfort zones of research, writing and coding and truly take ownership of organizational culture.

Given the competition for talent in the security industry, an organization’s culture (ceteris paribus) often proves to be the determining factor that fosters, attracts, and retains a highly skilled and diversified workforce. Because an organization cannot engineer its way toward an innovative, inclusive culture or simply throw money at the issue, this problem can be perplexing to tech-focused industries. As anyone who has even briefly studied cultural approaches knows, culture is very sticky and entails a concerted and persistent effort to achieve the desired effects. It requires a paradigm shift much in the same way Kuhn, Lakatos and Popper all approached the various avenues toward scientific progress. The good news – if there is any – is that many of the cultural shifts required to foster a driven, innovative and (yes!) inclusive work environment do not cost a lot of money. Similar to the role of policy entrepreneurs in pushing forth new ideas in the public sector, cultural entrepreneurs are key individuals who can use their technical credibility to push forth ideas and promote solutions for any cultural challenges they identify or experience. By serving as a gateway between various aspects of an organization, cultural entrepreneurs can move an organization and ideally the industry beyond a “brogramming” mentality and reputation. Cultural entrepreneurs must reflect technical expertise across a diverse range of skills and demographics in order to legitimately encourage diversity and innovation. This enables the credible organic shifts from below that foment cultural change.

Cultural entrepreneurs are required to ensure an organization’s culture is inclusive and purpose-driven, instead of perpetuating the status quo. In this regard, diversity is a key aspect of this cultural shift. Diversity provides an innovation advantage and positively impacts the bottom line. Many in the tech community are starting to realize this, with companies like Intel investing $300 million in diversity, and CEOs lamenting that they wished they had built diversity into their culture from the start. Admitting that the problem exists is an important step, but this rhetoric has yet to translate into a more diversified workforce. A concerted effort by major tech companies to address diversity resulted in at most a 1% increase in gender diversity and an even smaller increase in ethnic diversity. Cultural entrepreneurs, and their ability to foster grassroots cultural shifts, may be the missing link in many of these cultural and diversity initiatives.  

Cultural entrepreneurs across an organization can make a significant impact with minimal work or cost by focusing on both internal and external cultural aspects of an organization. First, there is a large literature on how cross-cutting links (think social network analysis) develop social capital, which in turn has a positive impact on civic engagement and economic success. A recent Gallup Poll reinforces just how hard it is to foster social capital, with results confirming that over 70% of the American workforce does not feel engaged. Many organizations know this, but unfortunately fail at implementation by opting for social activities that reinforce exclusivity or feel contrived or overly corporate. Events ranging from frat-themed parties to cruise boats with concubines clearly do little to attract a diverse workforce. Cultural entrepreneurs can encourage or informally organize inclusive activities – such as sports, team outings, or discussion boards – within and across departments to increase engagement. While these kinds of social activities may seem superfluous to the bottom line, they can positively impact retention, workforce engagement, and inclusivity by building cross-cutting social networks. The kinds of social activities certainly should vary depending on an organization, but they must appeal to multiple segments of the workforce to foster social capital instead of reinforcing stereotypes and stovepipes within organizations. However, with everyone’s heads to keyboard all day every day, technical cultural entrepreneurs rarely emerge, hindering the development of social capital.

Second, perception is reality, and cultural entrepreneurs can help shift external perceptions of the industry. A quick search of Google images for “hacker” reveals endless images of male figures in hoodies working in dark, nefarious environments.  The media perpetuates this with similar images every time a new high profile breach occurs. It’s not just a media problem. It is also perpetuated within the industry itself. A recent analysis of the RSA conference guide showed shockingly little diversity.  The study notes that “women are absent” and “people of colour are totally absent.” While it adequately reflects the reality of the security industry, it makes those of us currently in the security community feel more out of place if we don’t fit that profile, while also deterring anyone not fitting those profiles from entering the field.  Let’s hope the upcoming Black Hat and Def Con conferences are more inclusive, with a broader representation of gender, race and appearance, but I wouldn’t bet on it. It’s up to cultural entrepreneurs to continue to press their organizations and the industry to help shift the perception of the security community away from nefarious loners and toward one with a universal mission that requires a diverse range of skillsets and backgrounds. Providing internal and external thought leadership through blogs, presentations and marketing can go a long way toward helping reality reflect the growing rhetoric advocating for diversity.

The security industry, which mirrors the diversity problems in the tech industry writ large, would benefit from a cultural approach to workforce engagement and inclusivity. All of the amenities in the world are not enough to overcome the tech industry’s cultural problems that not only persist, but that are also much more exclusive than they were two decades ago. In creative industries, cultural entrepreneurs are essential to fostering the social capital and intrinsic satisfaction that emerges from an inclusive and innovative culture. At Endgame, this is something that we think about daily and always seek to improve. We benefit from leadership that supports and understands the role of culture, while also letting us grow that culture organically. This organic growth relies on technical leaders across the company working together and pushing both the technical and cultural envelopes. This combination of technical mastery coupled with a collaborative and driven culture provides the foundation on which we will continue to foment inclusivity while disrupting an industry which for too long has relied on outdated solutions to modern technical and workforce challenges.


Andrea Little Limbago

Sprint Defaults and the Jeep Hack: Could Basic Network Settings Have Prevented the Industry Uproar?


In mid-July, research into the security of a Jeep Cherokee was disclosed through a Wired article and subsequent Black Hat presentation. The researchers, Charlie Miller and Chris Valasek, found an exploitable vulnerability in the Uconnect entertainment system that operates over the Sprint cellular network. The vulnerability was serious enough to prompt a 1.4 million-vehicle recall from Chrysler.

In the Wired article, Miller and Valasek describe two important aspects of the vulnerability. First, they can target their exploit against a specific vehicle: “anyone who knows the car’s IP address can gain access from anywhere in the country,” and second, they can scan the network for vulnerable vehicles including a Dodge Ram, Jeep Cherokee, and a Dodge Durango. Both of these capabilities, to scan and target remotely through the cellular network, are necessary in order to trigger the exploit against a target vehicle.

While it’s really scary to think that a hacker anywhere in the country can drive your car off the road with the push of a button, the good news is that the cellular network has safeguards in place to prevent remotely interacting with phones and devices like Uconnect. For some inexplicable reason, Sprint disabled these safeguards and left the door wide open for the possibility of remote exploitation against the Uconnect cars. Had Sprint not disabled these safeguards, the Uconnect vulnerability would have just been another of several that require physical access to exploit and may not have prompted an immediate recall. 

The Gateway

Cellular networks are firewalled at the edge (Figure 1). GSM, CDMA and LTE networks are all architected a little differently, but each contains one of the following Internet gateways:

  • CDMA: Packet Data Serving Node (PDSN) (Verizon and Sprint)
  • GSM: Gateway GPRS Support Node (GGSN) (T-Mobile or AT&T)
  • LTE: the responsibilities of the gateway are absorbed into multiple components in the System Architecture Evolution (SAE). All major Telcos in the US operate LTE networks.

Figure 1: Network layout

To keep things simple and generic, we’ll just call this component “the gateway.” Network connections only originate in one direction: outbound. You can think of the core of your cellular network as a big firewalled LAN; it is not possible to gain access to a phone from outside the phone network (Figure 2).

Figure 2: The attacker is blocked from outside the core network.

Miller was able to operate behind this firewall by tethering his laptop to a burner phone that was on the Sprint network (Figure 3).

But by default, phones are blocked from seeing each other as well. So even if the attacker knows the IP address of another phone on the network, the network won’t allow her to make a data connection to connect to that phone (Figure 4).  The network enforces this by what are called Access Point Names (APNs). 

Figure 3: Device-to-device was enabled for the car’s APN, enabling remote exploitation. Why?

Figure 4: Default configuration, device-to-device connections disabled. The attacker cannot access the target device from inside the firewall.

When a phone on the network needs to make a data connection, it provides an APN to the network. If you want to view the APN settings on your personal phone, you can follow these instructions for iPhone or Android. The network gateway uses the APN to determine how to allow your phone to connect to the Internet. There are hundreds of APNs in every network, and your carrier uses APNs to organize how different devices are allocating data for billing purposes. In the case of Uconnect, all Uconnect devices operate on the Sprint network and use their own private APN. APNs are really useful for third parties, like Uconnect, to sell a service that runs over a cellular network. So that each Uconnect user doesn’t need to maintain a line of service with Sprint, Uconnect is responsible for the data connection, and end users pay Uconnect for service, which runs through a private APN that was set up for Uconnect.

APNs are used extensively to isolate private networks for machine-to-machine systems like smart road signs and home alarm systems. If you’ve ever bought a soda from a vending machine with a credit card, the back end connection was using a private APN. 

Vulnerabilities caused by misconfigured APNs are not new; the APN of the bike-sharing system in Madrid was hacked just last summer. These bike-sharing systems need device-to-device access because technicians perform maintenance on these machines via remote desktop.  

Aftermath

There is no obvious reason for Uconnect to need remote administration. Why then are device-to-device connections allowed for the Uconnect APN, especially since it opens the door to a remote access exploit?  We will probably never know, because six days after the Wired story was published, Miller tweeted that Sprint had blocked phone-to-car traffic as well as car-to-car traffic. What this really means is that Sprint disabled internal traffic for the Uconnect APN. The remote access vector was closed.

The fact that Sprint made this change so quickly suggests that device-to-device traffic was not necessary in the first place, which leads us to two conclusions: 1) Had Sprint simply left device-to-device traffic disabled, the Jeep incident would have required physical access and not have been any more of a story than the Ford Escape story in 2013, or 2) More seriously, if the story hadn’t attracted mainstream media attention, Chrysler might not have taken the underlying vulnerability as seriously, and the fix would have rolled out much later, if ever. 

Security shouldn’t be a function of the drama circus that surrounds it.

 

Firewall icon created by Yazmin Alanis from the Noun Project
Pirate Phone icon created by Adriana Danaila from the Noun Project
Pickup truck icon created by Jamie M. Laurel from the Noun Project


Adam Harder

Black Hat 2015 Analysis: An Island in the Desert


This year’s Black Hat broke records yet again with the highest levels of attendance, including the highest number of countries represented and, based on the size of the business hall, companies represented as well. While it featured some truly novel technical methods and the advanced security research for which it is so well known, this year’s conference even more than others reflected an institutionalization of the status quo within the security industry. Rather than reflecting the major paradigm shifts that are occurring in the security community, it seemed to perpetuate the insularity for which this community is often criticized.

In her Black Hat keynote speech, Jennifer Granick, lawyer and Director of Civil Liberties at Stanford University, noted that inclusion is at the heart of the hacker’s ethos and called for the security community to take the lead and push forth change within the broader tech sector. She explicitly encouraged the security community to refrain from being so insular, and to transform into a community that not only thinks globally but is also much more participatory in the policies and laws that directly affect them. While she focused on diversity and equality, there are several additional areas where the security community could greatly benefit from a more expansive mindset. Unfortunately, these strategic level discussions were largely absent from the majority of the Black Hat briefings that followed the keynote. The tactical, technical presentations understandably comprise the majority of the dialogue and garner the most attention.  However, given the growing size and expanding representation of disparate parts of the community, there was a noticeable absence of nuanced discussion about the state of the security community, including broader thinking about the three big strategic issues and trends that will define the community for the foreseeable future:

  • Where’s the threat? Despite a highly dynamic threat landscape, ranging from foreign governments to terrorist organizations to transnational criminal networks, discussion of these threat actors was embarrassingly absent from the panels this year. Although the security community is often criticized for over-hyping the threat, this was not the case at this year’s Black Hat. Even worse, the majority of discussions of the threat focused on the United States and Western European countries as the greatest security threats. Clearly, technology conferences must focus on the latest technological approaches and trends in the field. However, omitting the international actors and context in which these technologies exist perpetuates an inward-facing bias of the field that leads many to misunderstand the nature, capabilities and magnitude of the greatest threats to corporate and national security.
  • Toward détente? Last year’s Black Hat conference was still reeling from the Snowden revelations that shook the security community. A general feeling of distrust of the U.S. government was still apparent in numerous panels, heightening interest in privacy and circular discussions over surveillance. While sentiments of distrust still exist, this no longer appears to be the only perspective. In a few briefings, there was a surprising lack of the hostility toward the government that existed at similar panels a year ago. In fact, the very few panels that had government representation were not only well attended, but also contained civil discourse between the speakers and the audience. This does not mean that there were softball questions. On the contrary, there was blunt conversation about the "trust deficit" between the security community and the government. For instance, the biggest concern expressed regarding data sharing with the government (including the information sharing bill which Congress discussed last week, but is now delayed) was not about information sharing itself, but rather how the security community can trust that the government can protect the shared data in light of OPM and other high-profile breaches. This is a very valid concern and one that ignited a lot of bilateral dialogue. Organizations from the DHS to the Federal Trade Commission requested greater partnerships with the security community. While there are certainly enormous challenges ahead, it was refreshing to see not only signs of a potential thawing of relations between the government and the security community, but also hopefully some baby steps toward mutually beneficial collaboration.
  • Diversity. The general lack of diversity at the conference comes as no surprise given the well-publicized statistics of the demographics of the security community, as well as the #ilooklikeanengineer campaign that took off last week. However, diversity is not just about gender – it also pertains to diversity of perspectives, backgrounds and industries. Areas such as human factors, policy and data science seemed to be less represented than in previous years, conflicting with much of the rhetoric that permeated the business hall. In many of the talks that did cover these areas, there were both implicit and explicit requests for a more expansive partnership and role within the community.

Given the vast technological, geopolitical and demographic shifts underway, the security community must transform beyond the traditional mindset and truly begin to think beyond the insular perimeter. Returning to Granick’s key points, the security community can consciously provide leadership not only in shaping the political discourse that impacts the entire tech community, but also lead by example through promoting equality and thinking globally. The security community must play a participatory role in the larger strategic shifts that will continue to impact it instead of remaining an insularly focused island in the desert.

Black Hat 2015 Analysis: An Island in the Desert

Andrea Little Limbago

NLP for Security: Malicious Language Processing


Natural Language Processing (NLP) is a diverse field in computer science dedicated to automatically parsing and processing human language. NLP has been used to perform authorship attribution and sentiment analysis, as well as being a core function of IBM’s Watson and Apple’s Siri. NLP research is thriving due to the massive amounts of diverse text sources (e.g., Twitter and Wikipedia) and multiple disciplines using text analytics to derive insights. However, NLP can be used for more than human language processing and can be applied to any written text. Data scientists at Endgame apply NLP to security by building upon advanced NLP techniques to better identify and understand malicious code, moving toward an NLP methodology specifically designed for malware analysis—a Malicious Language Processing framework. The goal of this Malicious Language Processing framework is to operationalize NLP to address one of the security domain’s most challenging big data problems by automating and expediting the identification of malicious code hidden within benign code.

How is NLP used in InfoSec?

Before we delve into how Endgame leverages NLP, let’s explore a few different ways others have used it to tackle information security problems:

  • Domain Generation Algorithm classification – Using NLP to identify malicious domains (e.g., blbwpvcyztrepfue.ru) from benign domains (e.g., cnn.com)
  • Source Code Vulnerability Analysis – Determining function patterns associated with known vulnerabilities, then using NLP to identify other potentially vulnerable code segments.
  • Phishing Identification – A bag-of-words model determines the probability an email message contains a phishing attempt or not.
  • Malware Family Analysis – Topic modeling techniques assign samples of malware to families, as discussed in my colleague Phil Roth’s previous blog.

Over the rest of this post, I’ll discuss how Endgame data scientists are using Malicious Language Processing to discover malicious language hidden within benign code. 

Data Acquisition/Corpus Building

In order to perform NLP you must have a corpus, or collection of documents. While this is relatively straightforward in traditional NLP (e.g., APIs and web scraping) it is not necessarily the same in malware analysis. There are two primary techniques used to get data from malicious binaries: static and dynamic analysis. 

Fig 1. Disassembled source code

 

Static analysis, also called source code analysis, is performed using a disassembler, which provides output similar to the above (Fig 1). The disassembler presents a flat view of a binary; structurally, however, we lose important contextual information by not clearly delineating the logical order of instructions. In disassembly, jmp or call instructions should lead to different blocks of code that a standard flat file misrepresents. Luckily, static analysis tools exist that can produce call graphs conveying the logical flow of instructions via a directed graph, like this and this.

Dynamic analysis, often called behavioral analysis, is the collection of metadata from an executed binary in a sandbox environment. Dynamic analysis can provide data such as network access, registry/file activity, and API function monitoring. While dynamic analysis is often more informative, it is also more resource intensive, requiring a suite of collection tools and a sandboxed virtual environment. Alternatively, static analysis can be automated to generate disassembly over a large set of binaries, producing a corpus ready for the NLP pipeline. At Endgame we have engineered a hybrid approach that automates the analysis of malicious binaries, providing data scientists with metadata from both static and dynamic analysis.

Lexical Parsing

Lexical parsing is paramount to the NLP process as it provides the ability to turn large bodies of text into individual tokens. The goal of Malicious Language Processing is to parse a binary the same way an NLP researcher would parse a document:

To generate the “words” in this process we must perform a few traditional NLP techniques. First is tokenization, the process of breaking down a string of text into meaningful segments called tokens.  Segmenting on whitespace, new line characters, punctuation or regular expressions can generate tokens. (Fig 2)

Fig 2. Tokenized disassembly
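
In Python, the tokenization step might look like the following minimal sketch. The instruction strings are made up for the example; real input would come from a disassembler, and a production tokenizer would likely use more carefully chosen segmentation rules.

import re

# Hypothetical disassembly text standing in for real disassembler output.
disassembly = """
push ebp
mov ebp, esp
call SetWindowTextA
jnz loc_401010
"""

def tokenize(text):
    # Segment on whitespace, newlines, and commas, dropping empty strings.
    return [tok for tok in re.split(r"[\s,]+", text) if tok]

print(tokenize(disassembly))
# ['push', 'ebp', 'mov', 'ebp', 'esp', 'call', 'SetWindowTextA', 'jnz', 'loc_401010']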

The next step in the lexical parsing process is text normalization: merging families of derivationally related words with similar meanings. The two forms of this process are called stemming and lemmatization.

Stemming seeks to reduce a word to its functional stem. For example, in malware analysis this could reduce SetWindowTextA or SetWindowTextW to SetWindowText (Windows API), or JE, JLE, JNZ to JMP (x86 instructions), accounting for multiple variations of essentially the same function.
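
A rule-based stemmer for those two cases might look like this minimal Python sketch; the suffix rule and the jump table are illustrative assumptions, not an exhaustive rule set.

import re

# Conditional jump mnemonics that all reduce to the generic "jmp" stem.
CONDITIONAL_JUMPS = {"je", "jne", "jz", "jnz", "jl", "jle", "jg", "jge"}

def stem(token):
    if token.lower() in CONDITIONAL_JUMPS:
        return "jmp"
    # Strip the ANSI/Unicode suffix from Windows API names,
    # e.g. SetWindowTextA / SetWindowTextW -> SetWindowText.
    return re.sub(r"(?<=[a-z])[AW]$", "", token)

print(stem("SetWindowTextA"))  # SetWindowText
print(stem("jnz"))             # jmp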

Lemmatization is more difficult in general because it requires context or the part-of-speech tag of a word (e.g., noun, verb, etc.). In English, the word “better” has “good” as its lemma. In malware we do not yet have the luxury of parts-of-speech tagging, so lemmatization is not yet applicable. However, a rules-based dictionary that associates Windows API equivalents of C runtime functions may provide a step towards lemmatization, such as mapping _fread to ReadFile or _popen to CreateProcess.
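
Such a rules-based dictionary could start out as simply as the sketch below; the mappings shown are illustrative examples rather than a vetted equivalence table.

# Hand-built mapping from C runtime functions to Windows API "lemmas".
CRT_TO_WINAPI = {
    "_fread": "ReadFile",
    "_fwrite": "WriteFile",
    "_popen": "CreateProcess",
}

def lemmatize(token):
    # Fall back to the original token when no lemma is known.
    return CRT_TO_WINAPI.get(token, token)

print(lemmatize("_popen"))  # CreateProcess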

Semantic Networks

Semantic or associative networks represent the co-occurrence of words within a body of text to gain an understanding of the semantic relationship between words. For each unique word in a corpus, a node is created on a directed graph. Links between words are generated with an associated weight based on the frequency that the two words co-occurred. The resulting graph can then be clustered to derive cliques or communities of functions that have similar behavior.
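
A minimal Python sketch of the counting step might look like the following, assuming each subroutine has already been reduced to its list of function calls. The call lists here are invented, and the graph is kept undirected for brevity.

from collections import Counter
from itertools import combinations

# Each inner list holds the API calls observed in one subroutine.
subroutines = [
    ["VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"],
    ["VirtualAllocEx", "WriteProcessMemory", "ReadFile"],
    ["ReadFile", "WriteFile"],
]

edge_weights = Counter()
for calls in subroutines:
    # Count every unordered pair of functions that co-occur in a subroutine.
    for a, b in combinations(sorted(set(calls)), 2):
        edge_weights[(a, b)] += 1

for (a, b), weight in edge_weights.most_common():
    print(f"{a} -- {b}: {weight}")

Clustering the resulting weighted graph, for instance with a community detection algorithm, would then surface the cliques of functions with similar behavior described above.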

A malicious language semantic network could aid in the generation of a lexical database capability for malware similar to WordNet. WordNet is a lexical database of English nouns, verbs, and adjectives grouped into sets of cognitive synonyms. Endgame data scientists are in the incipient stages of exploring ways to search and identify synonyms or synsets of malicious functions. Additionally, we hope to leverage our version of WordNet in the development of lemmatization and the Parts-of-Speech tagging within the Malicious Language Processing framework.

Parts-of-Speech Tagging

A Parts-of-Speech (POS) tagger is a piece of software capable of labeling each token in a string of text with the correct grammatical annotation, such as noun, verb, etc. POS tagging is crucial for gaining a better understanding of text and establishing semantic relationships within a corpus. Above I mentioned that there is currently no representation of POS tagging for malware. Source code may be too abstract to break down into nouns, prepositions or adjectives. However, it is possible to treat subroutines as “sentences” and gain an understanding of functions used as subjects, verbs and predicates. Using pseudo code for a process injection in Windows, for example, a Malicious Language Processing POS-tagger might yield something like the sketch below:
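
The tags in this Python sketch are purely illustrative, a guess at what a malware tag set could look like rather than an established annotation scheme.

# A hypothetical "sentence": the API calls of a process-injection subroutine.
INJECTION_SENTENCE = [
    "OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread",
]

# Invented tag assignments, for illustration only.
TAGS = {
    "OpenProcess": "SUBJECT",           # acquires the target process handle
    "VirtualAllocEx": "VERB",           # allocates memory in the target
    "WriteProcessMemory": "VERB",       # writes the payload
    "CreateRemoteThread": "PREDICATE",  # executes the payload
}

for token in INJECTION_SENTENCE:
    print(f"{token}/{TAGS.get(token, 'UNK')}")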

Closing Thoughts

While the majority of the concepts mentioned in this post are being leveraged by Endgame today to better understand malware behavior, there is still plenty of work to be done. The concept of Malicious Language Processing is still in its infancy. We are currently working hard to expand the Malicious Language Processing framework by developing a malicious stop word list (a list of the most common words/functions in a corpus of binaries) and creating an anomaly detector capable of determining which function(s) do not belong in a benign block of code. With more research and larger, more diverse corpora, we will be able to understand the behavior and basic capabilities of a suspicious binary without executing it or having a human reverse engineer it. We view NLP as an additional tool in a data scientist’s toolkit, and a powerful means by which we can apply data science to security problems, quickly parsing the malicious from the benign.

NLP for Security: Malicious Language Processing

Bobby Filar

Hunting for Honeypot Attackers: A Data Scientist’s Adventure


The U.S. Office of Personnel Management (known as OPM) won the “Most Epic Fail” award at the 2015 Black Hat Conference for the worst known data breach in U.S. government history, with more than 22 million employee profiles compromised. Joining OPM as contenders for this award were other victims of high-profile cyber attacks, including Poland's Plus Bank and the website AshleyMadison.com. The truth is, hardly a day goes by without news of cyber intrusions. As an example, according to databreachtoday.com, just in recent months PNI Digital Media and many retailers such as Wal-Mart and Rite-Aid had their photo services compromised, UCLA Health’s network was breached, and information of 4.5 million people may have been exposed. Criminals and nation-state actors break into systems for many reasons with catastrophic and often irremediable consequences for the victims.

Traditionally, security experts are the main force for investigating cyber threats and breaches. Their expertise in computers and network communication provides them with an advantage in identifying suspicious activities. However, with more data being collected, not only in quantity but also in variety, data scientists are beginning to play a more significant role in the adventure of hunting malicious attackers. At Endgame, the data science team works closely with the security and malware experts to monitor, track and identify cyber threats, and applies a wide range of data science tools to provide our customers with intelligence and insights. In this post, I’ll explain how we analyze attack data collected from a honeypot network, which provides insight into the locations of the attackers behind those activities. The analysis surfaces organized attacks from a vast number of seemingly unrelated attempts.

This post is divided into three sections. The first section describes the context of the analysis and provides an overview of the hacking activities. The second section focuses on investigating the files that the attackers implanted into the breached systems. Finally, the third section demonstrates how I identified similar attacks through uncovering behavioral characteristics. All of this demonstrates one way that data science can be applied to the security domain. (My previous post explained another application of data science to security.)

Background

Cyber attackers are constantly looking for targets on the Internet. Much like a lion pursuing its prey, an attacker usually conducts a sequence of actions, known as the cyber kill chain, including identifying the footprints of a victim system, scanning the open ports of the system, and probing the holes trying to find an entrance into the system. Professional attackers might be doing this all day long until they find a weak system.

All of this would be bad news for any weak system the attacker finds – unless that weak system is a honeypot. A honeypot is a trap set up on the Internet with minimum security settings so an attacker may easily break into it, without knowing his/her activities are being monitored and tracked. Though honeypots have been used widely by researchers to study the methods of attackers, they can also be very useful to defenders. Compared to sophisticated anomaly detection techniques, honeypots provide intrusion alerts with low false positive rates because no legitimate user should be accessing them. Honeypots set up by a company might also be used to confuse attackers and slow down the attacks against their networks. New techniques are on the way to make setting up and managing honeypots easier and more efficient, and may play an increasingly prominent role in future cyber defense.

A network of honeypots is called a honeynet. The particular honeynet I analyzed logged activities showing that attackers enumerated pairs of common user names and passwords to enter the system, downloaded malicious files from their own hosting servers, changed the privileges on those files, and then executed them. From March 2015 through the end of June 2015, more than 21,000 attacker IP addresses were detected and about 36 million SSH attempts were logged. Attackers tried 34,000 unique user names and almost 1 million unique passwords to break into those honeypots. That’s a lot of effort by the attackers to break into the system. Over time, the honeynet has identified about 500 malicious domains and more than 1,000 unique malware samples.

The IP addresses that were owned by the attackers and used to host malware are geographically dispersed. Figure 1 shows that the recorded attacks mostly came from China, the U.S., the Middle East and Europe. While geographic origination doesn’t tell us everything, it still gives us a general idea of potential attacker locations. 

Figure 1. Attacks came from all around the world, color coded on counts of attack. The darker the color, the greater the number of attacks originating from that country.

The frequency of attacks varies daily, as shown in Figure 2, but the trend shows that more attacks were observed during workdays than weekends, with peaks often appearing on Wednesday or Thursday. This seems to support the suspicion that humans (rather than bots) were behind the scenes, and that professionals rather than amateur hobbyists conducted the attacks.

Figure 2. Daily Attack Counts.

Now that we understand where and when those attacks were orchestrated, we want to understand if any of the attacks were organized. In other words, were they carried out by same person or same group of people over and over again?

Attackers change IP addresses from attack to attack, so looking at the IP addresses alone won’t provide us with much information. To find the answer to the question above, we need to use the knowledge about the files left by the attackers. 

File Similarity

Malware to an attacker is like a hammer and level to a carpenter. We expect that an attacker would use his/her set of malware repeatedly in different attacks, even though the files might have appeared in different names or variants. Therefore, the similarity across the downloaded malware files may provide informative links to associated attacks.

One extreme case is a group of 17 different IPs (shown in Figure 3) used on a variety of days containing exactly the same files and folders organized in exactly the same structure. That finding immediately portrayed a lazy hacker who used the same folder time and time again. However, we would imagine that most attackers might be more diligent. For example, file structures in the hosting server may be different, folders could be rearranged, and the content of a malicious binary file may be tweaked. Therefore, a more robust method is needed to calculate the level of similarity across the files, and then use that information to associate similar attacks.

Figure 3. 17 IPs have exactly the same file structure.

How can we quantitatively and algorithmically do this?

The first step is to find similar files to each of the files in which we are interested. The collected files include different types, such as images, HTML pages, text files, compressed tar balls, and binary files, but we are probably only interested in binary files and tar balls, which are riskier. This reduces the number of files to work on, but the same approach can be applied to all file types.

File similarity computation has been researched extensively in the past two decades but still remains a rich field for new methods. Some mature algorithms to compute file similarities include block-based hashing, Context-Triggered Piecewise (CTP) hashing (also known as fuzzy hashing), and Bloom filter hashing. Endgame uses more advanced file similarity techniques based on file structural and behavioral attributes. However, for this investigation I used fuzzy hashing to compute file similarities for simplicity and since open source code is widely available.
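
With the open source ssdeep fuzzy-hashing library (one common option; the file names below are placeholders), the pairwise comparison reduces to a few lines of Python:

import ssdeep  # python-ssdeep bindings for the fuzzy hashing library

# Compute CTP (fuzzy) hashes for two samples and score their similarity.
h1 = ssdeep.hash_from_file("sample_a.bin")
h2 = ssdeep.hash_from_file("sample_b.bin")

score = ssdeep.compare(h1, h2)  # 0 = no similarity, 100 = near-identical
print(f"similarity score: {score}")

Repeating the comparison over every pair of unique hashes yields the symmetric similarity matrix described below.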

I took each of the unique files based on its fuzzy hashing string and computed the similarity to all the other files. The result is a large symmetric similarity matrix for all files, which we can visualize to check if there are any apparent structures in the similarity data. The way I visualize the matrix is to connect two similar files with a line, and here I would choose an arbitrary threshold of 80, which means that if two files are more than 80% similar, they will be connected. The visualization of the file similarity matrix is shown in Figure 4.

Figure 4. Graph of files based on similarity.

It is visually clear that the files are indeed partitioned into a number of groups. Let’s zoom into one group and see the details in Figure 5. The five files, represented by their fuzzy hash strings, are connected to each other, with mutual similarity of over 90%. If we look at them very carefully, the strings differ by only one or two letters, even though the files have totally different file names and MD5 hashes. VirusTotal recognizes four of the five files as malware, and the scan reports indicate that they are Linux Trojans.

Figure 5. One group of similar files.

Identifying Similar Attacks

Now that we have identified the groups of similar files, it’s time to identify the attacks that used similar malware. If I treat each attack as a document, and the malware used in an attack as words, I can construct a document-term matrix to encapsulate all the attack information. To incorporate the malware similarity information into the matrix, I tweaked it a bit: for malware that was not used in a specific attack but still shares a certain amount of similarity with the malware that was used, the corresponding cell takes the value of that similarity level. For example, if malware M1 was not used in attack A1, but M1 is most similar to malware M2, which was used in attack A1, and the similarity level is 90%, then the element at cell (A1, M1) will be 0.9, while (A1, M2) will be 1.0.
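
A minimal numpy sketch of that tweak, using a toy usage matrix and a toy similarity matrix in place of the real data:

import numpy as np

# Toy data: usage[i, j] = 1.0 if attack i used malware j;
# sim[j, k] = fuzzy similarity (0-1) between malware j and k.
usage = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

attack_malware = usage.copy()
for i in range(usage.shape[0]):
    used = usage[i] > 0
    for j in range(usage.shape[1]):
        if not usage[i, j]:
            # An unused malware inherits its best similarity to any malware
            # that was used in this attack, e.g. (A1, M1) = 0.9.
            attack_malware[i, j] = sim[j, used].max()

print(attack_malware)  # row 0: [0.9, 1.0, 0.2]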

For readers who are familiar with NLP (Natural Language Processing) and text mining, the matrix I’ve described above is similar to a document-term matrix, except the values are not computed from TF-IDFs (Term Frequency-Inverse Document Frequency). More on applications of NLP to malware analysis can be found in a post published by my fellow Endgamer Bobby Filar. The essence of such a matrix is to reflect the relationship between data records and features. In this case, data records are attacks and features are malware, while for NLP they are documents and words. The resulting matrix is an attack-malware matrix, which has more than 400 columns representing malware hashes. To get a quick idea of how the attacks (the rows) are dispersed in such a high dimensional space, I plotted the data using the T-SNE (t-Distributed Stochastic Neighbor Embedding) technique and colored the points according to the results from K-means (K=10) clustering. I chose K=10 arbitrarily to illustrate the spatial segmentation of the attacks. The T-SNE graph is shown in Figure 6, and each color represents a cluster labeled by the K-means clustering. T-SNE tries to preserve the topology when projecting data points from a high dimensional space to a much lower dimensional space, and it is widely used for visualizing the clusters within a data set.
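
The plotting step itself is short with scikit-learn; in this sketch, random data stands in for the real attack-malware matrix (rows are attacks, columns are malware hashes):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Placeholder for the real attack-malware matrix.
rng = np.random.RandomState(0)
attack_malware = rng.rand(200, 400)

labels = KMeans(n_clusters=10, random_state=0).fit_predict(attack_malware)
coords = TSNE(n_components=2, random_state=0).fit_transform(attack_malware)

# Color each attack by its K-means cluster label.
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("T-SNE projection of the attack-malware matrix")
plt.show()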

Figure 6 shows that K-Means did a decent job of spatially grouping close data points into clusters, but it fell short of providing a quantitative measurement of similarity between any two data points. It is also quite difficult to choose the optimum value for K, the number of clusters. To overcome the challenges that K-Means faces, I will use Latent Semantic Indexing (LSI) to compute the similarity level for each pair of attacks, build a graph connecting similar attacks, and eventually apply social network analytics to determine the clusters of similar attacks.

Figure 6. T-SNE projection of Attack-Malware matrix to 2-D space.

LSI is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a document-term matrix. SVD projects the original n-dimensional space (with n words in columns) onto a k-dimensional space, where k is much smaller than n. The projection transforms a document’s vector in n-dimensional space into a vector in the reduced k-dimensional space in such a way that the reconstruction error – the Euclidean distance between the original matrix and its low-rank approximation – is minimized.

SVD decomposes the attack-malware matrix into three matrices, one of which defines the new dimensions in order of significance. We call the new dimensions principal components. The components are ordered by the amount of variance they explain in the original data. Let’s call this matrix the attack-component matrix. At the risk of losing some information, we can plot the attack data points in 2-D space using the first and second components just to illustrate the differences between data points, as shown in Figure 7. The vectors pointing in perpendicular directions are most different from each other.

Figure 7. Attack data projected to the first and second principal components.

The similarity between attacks can be computed with the results of LSI, more specifically, by calculating the dot product of the attack-component matrix.
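
In scikit-learn terms, the LSI step and the similarity computation might look like the sketch below, where cosine similarity serves as the normalized dot product, the number of components is chosen arbitrarily, and attack_malware is the matrix from the earlier snippet:

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Project the attack-malware matrix onto k latent components.
svd = TruncatedSVD(n_components=20, random_state=0)
attack_component = svd.fit_transform(attack_malware)

# Pairwise similarity between attacks in the reduced space.
attack_similarity = cosine_similarity(attack_component)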

Table 1. Attacks Similar to Attack from 61.160.212.21:5947 on 2015-03-23.

I connect two attacks if their similarity is above a certain threshold, e.g. 90%, and come up with a graph of connected attacks, shown in Figure 8.

 

Figure 8. Visualization of attacks connected by similarity.

There are a few big component subgraphs in the large graph. A component subgraph represents a group of attacks closely similar to each other. We can examine each of them in terms of what malware were deployed in the given attack group, what IP addresses were used, and how frequently the attacks were conducted.
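
Extracting those component subgraphs is straightforward with networkx, assuming the attack_similarity matrix from the LSI sketch above:

import networkx as nx

THRESHOLD = 0.9  # connect attacks that are more than 90% similar
n = attack_similarity.shape[0]

g = nx.Graph()
g.add_nodes_from(range(n))
for i in range(n):
    for j in range(i + 1, n):
        if attack_similarity[i, j] > THRESHOLD:
            g.add_edge(i, j)

# Each connected component is one group of closely similar attacks.
groups = sorted(nx.connected_components(g), key=len, reverse=True)
print(f"largest attack group contains {len(groups[0])} attacks")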

I plotted the daily counts of attack for the two largest attack groups in Figure 9 and Figure 10. Both of them show that attacks happened more often on weekdays than on weekends. These attacks may have targeted different geo-located honeypots in the system and could be viewed as a widely expanded search for victims.

Figure 9. Daily counts of attack in one group.

Figure 10. Daily counts of attack in another group.

We can easily find out where those attackers’ IPs were located (latitude and longitude), and the WHOIS data associated with the IPs. But it’s much more difficult to fully investigate the true identities of the attackers.

Summary

In this post, I explained how to apply data science techniques to identify honeypot attackers. Mathematically, I framed the problem as an Attack-Malware matrix, and used fuzzy hashing to represent files and compute the similarity between files. I then employed latent semantic indexing methods to calculate the similarity between attacks based on file similarity values. Finally, I constructed a network graph where similar attacks are linked so that I could apply social network analytics to cluster the attacks.

As with my last blog post, this post demonstrates that data science can provide a rich set of tools that help security experts make sense of the vast amount of data often seen in cyber security and discover relevant information. Our data science team at Endgame is constantly researching and developing more effective approaches to help our customers defend themselves – because the hunt for attackers never ends.

Hunting for Honeypot Attackers: A Data Scientist’s Adventure

Richard Xie

Three Questions: Smart Sanctions and The Economics of Cyber Deterrence


The concept of deterrence consistently fails to travel well to the cyber realm. One (among the many) reasons is that, although nuclear deterrence is achieved through nuclear means, cyber deterrence is not achieved solely through cyber means. In fact, any cyber activity meant for deterrence is likely going to be covert, while the more public deterrence activities fall into diplomatic, economic, financial, and legal domains. Less than six months after President Obama  signed an executive order to further expand the range of responses available to penalize individuals or companies conducting “malicious cyber-enabled activities”, there are now reports that it may be put to use in a big and unprecedented way. Numerous news outlets have announced the possibility of sanctions against Chinese individuals and organizations associated with economic espionage within the cyber domain. If the sanctions do come to fruition, it may not be for a few more weeks. Until then, below are some of the immediate questions that may help provide greater insight into what may be one of the most significant policy evolutions in the cyber domain.

1. Why now?  

Many question the timing of the potential Chinese sanctions, especially given President Xi Jinping’s upcoming state visit to Washington. It is likely that a combination of events over the summer in both the US and China have instigated this policy shift:

Chinese domestic factors: China’s stock market has been consistently falling since June, with the most visible plunge occurring at the end of August, which has had global ramifications. A major slowdown in economic growth has also hit China, which by some estimates could be as low as 4% (counter to the ~10% growth of the last few decades, and lower than even the recent record low of 7.4% in 2014). The latest numbers from today reinforce a slowing economy, with the manufacturing sector recording a three-year low. Simultaneously, President Xi continues to consolidate power, leading a purge of Communist Party officials targeted for corruption and asserting greater control of the military. In short, President Xi is looking increasingly vulnerable, handling economic woes as well as continuing a political power grab, which has led two influential generals to resign and created discontent among some of the highest ranks of leadership.

US domestic factors: The most obvious reason for the timing of potential US sanctions seems to be in response to this summer’s OPM breach, which has been largely attributed to China. This is just the latest in an ongoing list of public and private sector hacks attributed to China, including United Airlines and Anthem. The OPM breach certainly helped elevate the discussions over retaliation, but it’s unlikely that it was the sole factor. Instead, the persistent theft of IP and trade secrets, undermining US competitiveness and creating an uneven playing field, is the dominant rationale provided. Ranging from the defense sector to solar energy to pharmaceuticals to tech, virtually no sector remains unscathed by Chinese economic espionage. The continuing onslaught of attacks may have finally reached a tipping point.

The White House also has experienced increased pressure to respond in light of this string of high-profile breaches. Along with pressure from foreign policy groups and the public sector, given the administration’s pursuit of greater public-private partnerships, there is likely similar pressure from powerful parts of the private sector – including the financial sector and Silicon Valley – impacting the risk calculus of economic and financial espionage. For instance, last week, Secretary of Defense Ashton Carter visited Silicon Valley, encouraging greater cooperation and announcing a $171 million joint venture with government, academia and over 160 tech companies. These partnerships have been a high priority for the administration, meaning that the government likely feels pressure to respond when attacks attributed to the Chinese, such as the GitHub attacks this spring, hit America’s tech giants.

2. Why is this different from other sanctions?

Sanctions against Russia and Iran were in response to the aggressive policies of those countries, while those against North Korea were in response to the Sony breach. However, each of these countries lacks the economic interdependence with the US that exists for China.  Mutually assured economic destruction is often used to describe the economic grip the US and China have on each other’s economies. The United States is mainland China’s top trading partner, based on exports plus imports, while China is the United States’ third largest trading partner, following the European Union and Canada. Compare this to the situation in Russia, North Korea, and Iran, the most prominent countries facing US sanctions, none of which have significant trade interdependencies with the US.

Similarly, foreign direct investment (FDI) between China and the US is increasingly significant, with proposals for a bilateral investment treaty (BIT) exchanged this past June, and discussions ongoing in preparation for President Xi’s visit this month. China is also the largest foreign holder of US Treasury securities, despite its recent unloading of Treasury bonds to help stabilize its currency. Compare this to Russia, North Korea, or Iran, none of which the US economy relied on prior to their respective sanctions. Even in Iran and Russia’s strongest industry – oil and gas – the US has become less reliant and more economically independent, especially given that the US was the world’s largest producer of oil in 2014.

3. Who or what might be targeted?

If sanctions are administered, the US will most likely continue its use of “smart” or targeted sanctions that focus on key individuals and organizations, rather than the entire country. The US sanctions against Russia provide some insight into the approach the administration might take. Russian sanctions are targeted at Putin’s inner circle, including its affiliated companies. These range from defense contractors to the financial sector to the energy sector, and include close allies such as Gennady Timchenko. Similarly, North Korean sanctions following the Sony hack focused on three organizations and ten individuals. In the case of China, the state-owned enterprises (SOEs) deemed to reap the most benefits from economic espionage will likely be targeted. In fact, the top twelve Chinese companies are SOEs, meaning they have close ties to the government. More specifically, sanctions could include energy giants CNOOC, Sinopec and PetroChina, some of the large banks, or the global tech giant Huawei because of their large role in the economy and their potential to benefit from IP theft. Interestingly, the largest Chinese companies do not include several of their more famous tech companies, such as Alibaba, Tencent, Baidu and Xiaomi. Most of these enterprises have yet to achieve a significant global footprint, which means they are less likely to top any sanctions list. In considering who among Xi’s network might be targeted, some point to the Shaanxi Gang, Xi’s longtime friends, while others look at those most influential within the economy, such as Premier Li Keqiang.

Given President Xi’s upcoming visit, is the talk of sanctions diplomatic maneuvering, or will it be backed by concrete action? If enacted, the administration’s intent will be revealed through the actual targets of the sanctions. If the objective is to deter future cyber aggression, then sanctions must be targeted at these influential state-owned companies and the regime’s inner circle. Otherwise, it will be perceived as a purely symbolic act both in the United States and in China, and will lack the teeth to truly enact change.

Three Questions: Smart Sanctions and The Economics of Cyber Deterrence

Andrea Little Limbago

A Keynesian Approach to Information Freedom


A free and open Internet is the cornerstone of net neutrality, advocated by civil liberties groups and the US government alike. A wide range of actors have taken this concept to the extreme by publicly releasing very private information and pictures.  This reflects a laissez-faire approach to the Internet, completely removing government intervention from the equation. Simultaneously, authoritarian regimes have implemented policies and approaches aimed at Internet censorship, reflecting a protectionist view, with the government determining exactly what information will be accessible to its citizens. But many democratic states are testing the waters with a third approach: Minimal intervention in the name of privacy and protecting citizens. Similar to Keynesian economics, where some government intervention is necessary to optimize economic stability, this third way may serve as a harbinger for the future of cyber security—especially as the public now very clearly understands just how fragile their private data is on the Internet.

Too Little: Laissez-Faire Data Freedom

Continuing the economics analogy, a laissez-faire approach to data freedom argues that data wishes to be free, and government intervention to prevent disclosures is anathema to that principle. The release of private information is already becoming a cottage industry from which blackmailers hope to reap financial benefits. One example is the Wikileaks compilation and aggregated release of the emails of Sony employees following last December’s attack. Similarly, last month’s Ashley Madison hack went a step further, publishing information that not only has the potential to ruin the personal and professional lives of its customers, but also their families.

Personally identifiable information (PII) is also a key target of disclosures and has implications both for personal as well as national security. The Islamic State of Iraq and Syria (ISIS) recently released email addresses, phone numbers, passwords and names of US military personnel, calling for attacks against the service members and their families. Other recent cases of PII theft include the Anthem and OPM breaches. Most of these were instigated by nation-states and criminal groups who have yet to publicly release the data, but who nevertheless now have access to a vast trove of PII. And it’s not just adversarial groups who are releasing PII, but also data brokers, which are poorly regulated and have released billions of customer records. For instance, in 2013 Experian accidentally sold the PII of close to two-thirds of Americans to a criminal group in Vietnam.

Finally, with the omnipresence of smart phones, very private pictures also are released online without personal consent. This ranges from celebrities’ hacked iPhones, to photos used as a component of cyber bullying or revenge porn.

In each of these broad categories, the argument that data demands to be free often prevails, leading to victim blaming instead of facing the larger issue that – given the slow pace of the legal system – currently perpetrators face few, if any, repercussions for the theft and posting of personal information.

Too Much: Government Intervention & Information Control

Many authoritarian regimes operate at the opposite end of the spectrum when it comes to information freedom. For example, as the Chinese stock market plunged at the end of last month, impacting markets across the globe, the major Chinese newspapers either barely referenced the largest drop in eight years, or failed to mention it at all. The search engine Baidu and the micro-blogging site Weibo blocked much of the content related to information about the crash. Similarly, after promising not to censor the Internet a few years ago, the Malaysian government blocked the Sarawak Report, a UK-based news website, following its publication of an article alleging a bank transfer from a state investment fund into the Malaysian Prime Minister’s personal account. Russia was also busy with censorship last week, blocking Russian Wikipedia following similar censorship of Reddit earlier this month.

This behavior is not just limited to authoritarian regimes; democracies are also increasingly censoring material. The Mexican government continues to control the narrative surrounding the students who disappeared en route to a protest last year, recently releasing Twitter bots to squash anti-government activists’ online activity. South Korea seems to be borrowing from this playbook, recently censoring LGBT apps as part of a larger censorship campaign, blocking or deleting almost 100,000 sites in 2013 alone. As these examples illustrate, too much government intervention in authoritarian and democratic regimes alike can lead to extreme infringements on civil liberties.

Just Right?

Complete information freedom is not the panacea many imagined, thanks to malicious actors and profiteers who benefit from the release of private information.  At the same time, the rise of censorship, generally a tool of governments hoping to control the flow of information to their citizens, is a serious concern. Is there a middle ground that can support the freedom of information that promotes development and democracy, while also protecting privacy?

Attaining this middle ground will require creativity and innovation both from the security community and the legal system. For instance, there was some discussion that the Intimate Privacy Protection Act would be introduced this summer, but it continues to be stalled while other countries have successfully passed legislation focusing on protecting individual privacy. The UK passed and has sentenced perpetrators under a new law criminalizing revenge porn. The European Court of Justice ruled in favor of the “right to be forgotten” and has increasingly required compliance by the major search engines. Most of these laws focus on non-complicit posting of private data, and reveal legislation that addresses the growing concerns over privacy protection. Nevertheless, as we’ve seen with the Ashley Madison hack, because the legal system lags behind technological change, prosecutors may seek more creative solutions to protect personal privacy. For example, the Canadian government may rely upon recent cyber-bullying legislation to prosecute perpetrators of this behavior.

These examples demonstrate the emerging trend of patchwork legislation focused on protecting private data. However, this kind of legislation often encounters opposition from free-speech groups, many of which worry about a slippery slope toward censorship, as well as concerns about third-party sites being legally liable.  If nothing else, the hacks ranging from OPM to Ashley Madison prove just how insecure private information can be on the Internet. Until the legal system catches up to the pace of technology, there will continue to be a greater need for security solutions. With the watershed hacks of the past year, there will be a greater demand to more actively defend against the range of adversaries and better protect not only personal information but intellectual property as well. It’s unfortunate that we’re increasingly talking about the millions of people affected. Groups and individuals on both sides of this argument must get more comfortable with a middle ground approach that integrates minimal government intervention to protect personal privacy while safeguarding information freedom. An incremental approach based on information sharing legislation is not enough. We should move towards greater protection of privacy, including the ability to prosecute for the theft and non-consensual disclosure of digital data. But until the legal system catches up, technical security solutions remain the main (yet imperfect) safeguard for information protection. This combination of more active security as well as modernized legislation is exactly what is needed to tip the balance back in favor of citizens and privacy protection.

 

A Keynesian Approach to Information Freedom

Andrea Little Limbago

A New Year, A New Normal: Our Top Cybersecurity Predictions for 2016


Each of the last several years has been dubbed the “year of the breach,” or more creatively the “year of the mega-breach.” But instead of continuing this trend and calling 2016 the “year of the uber-mega-breach,” Endgame’s team of engineers, researchers and scientists have pulled together their top predictions for the security industry. We anticipate a threatscape that will continue to grow in complexity and sophistication. And while policymakers have yet to acknowledge that cyber innovations like encryption, Tor, and intrusion software will not simply go away through legislation, global enterprises should recognize that the “year of the breach” is the new normal.

 

Increased Focus on the Cloud
Mark Dufresne, Director of Malware Research and Threat Intelligence

Cyber attackers will increasingly interact with cloud services to acquire sensitive data from targets. Through compromising credentials and socially engineering their way into access, attackers will successfully gain control of sensitive data and services hosted by commercial cloud providers. In addition to data exposure, we may see companies that rely heavily on the cloud significantly impacted by ransom demands for restoration of cloud-hosted assets, potentially with new cases of outright destruction of data or services that are often perceived by users as backed-up and secured by their cloud provider. As part of their continuing effort to evade detection, adversaries will increasingly use these same cloud-based services for command and control as well as exfiltration in order to blend in with the noise in modern Internet-connected environments. Encryption and the heterogeneity of most environments make drawing a distinction between legitimate and malicious activity very difficult. Attackers will increasingly take advantage of this heterogeneity, leading some organizations to increase investments in securing and controlling their perimeter.

 

Targeted Malvertising Campaigns
Casey Gately, Cyber Intel/Malware Analyst

State sponsored actors will continue exploiting the social dimension of breaches, focusing on susceptible human vulnerabilities in diverse ways, such as through targeted spear phishing or more widespread malvertising campaigns. Many of these widespread campaigns will become increasingly targeted given the growing sophistication of attacks. Spear-phishing is a very reliable method for a state-sponsored actor to gain a foothold into a given network. In contrast, malvertising is more of a 'spray and pray' approach - where attackers generally hope that some of the millions of attempts will succeed.

Attackers could also take a more targeted malvertising approach by dispersing a series of weaponized ads for a particular item – such as weight training equipment. When someone conducts a search for “barbell sets” those ads would be auto-delivered to the potential victim. If the ads were targeted to fit the output, mission statement or goal of a specific corporation, the chance of victimizing someone from that company would be greater.

 

Increase in Mobile App Breaches
Adam Harder, Technical Director of Mobile Strategy

The volume of payments and digital transactions via mobile apps will continue to grow as end-users continue to shift from desktops and the web to mobile platforms.  Walmart is in the process of standing up a complete end-to-end mobile payment system, and 15% of all Starbucks revenue is processed through its mobile app.  Unfortunately, more of these apps will likely fall victim to breaches this year. Consider all the apps installed on your mobile device. How many of these are used to make purchases or view credit/loyalty account balances? Several popular consumer apps - Home Depot, Ebay, Marriott, and Starbucks - have been associated with data breaches in the last 24 months.

 

Public Perception Shift from Security to Safety
Rich Seymour, Data Scientist

People are slowly coming to realize the lack of implicit security in the systems they trust with their data. Many users operate under the false assumption that security is inherently baked into private services.  This isn't a paradigm shift for folks used to untrusted networks (like the manually switched telephone systems of the pre-rotary era), but people who simply assumed their information was stored, retrieved, and archived securely need to recognize that not only must they trust the physical security of a data center, they must also trust the entire connected graph of systems around it.  

Based on some leading literature from last year, including the work of Nancy Leveson, expect to see “safety” become the new information security buzzword of 2016. There also could be big things from the Rust community (including intermezzOS and nom) and QubesOS.

 

Malicious Activity Focused on Exploiting PII & Critical Infrastructure
Doug Weyrauch, Senior CNO Software Engineer

With the rise in frequency and severity of data breaches, including those at OPM and several health care companies, cyber criminals and hacktivists are increasingly using PII and other sensitive data for extortion, public shaming, and to abuse access to victims’ health records and insurance. Unlike credit card information, personal health and background information cannot be canceled or voided. If health records are co-opted and modified by a malicious actor, it is very difficult for a victim to correct the misinformation. And with the US Presidential election heating up this year, it’s likely one or more candidates will suffer a breach that will negatively impact their campaign.

As more stories surface regarding the cyber risks unique to critical infrastructure, such as in the energy sector, terror groups will increasingly target these systems. In 2016, there will likely be at least one major cyber attack impacting critical infrastructure or public utilities. Hopefully this propels critical infrastructure organizations and governments to actually put money and resources behind overhauling the digital security of the entire industry.

A New Year, A New Normal: Our Top Cybersecurity Predictions for 2016

How Banks' Spending on Cybersecurity Ranks If They Were Small Countries


Last week, our team predicted the biggest cybersecurity trends in the new year – specifically, that as attacks grow in complexity and sophistication, breaches will be the new normal.

Indicative of the growing importance of cybersecurity to critical infrastructure industries, the financial sector is responding to this new normal, and is investing its resources accordingly. In light of high profile breaches like JP Morgan Chase and the Carbanak campaign, current and anticipated spending on cybersecurity in the financial sector exposes the resources required to counter this new normal. To highlight this, we’ve compared cybersecurity spending of four of the largest banks to the GDP of four small countries to demonstrate the vast resources required to manage current and emerging threats.

 

How Banks' Spending on Cybersecurity Ranks If They Were Small Countries

With the new year kicking off with a high profile attack on the Ukrainian power grid, it is increasingly evident that the new normal is here to stay. Tackling this dynamic and complex threatscape requires organizations – especially those in the highly targeted critical infrastructure sectors –  to think like the adversary. That’s why we’ve built a solution that intimately understands adversarial techniques and tactics – enabling our customers to go from being the hunted to the hunter and identifying threats at the earliest possible moment before damage and loss can occur.


Moving Beyond the Encryption Debate


With the Cybersecurity Information Sharing Act snuck into the omnibus budget bill in December, and the horrific terrorist attacks in Paris and San Bernardino, encryption has returned front and center as the next cybersecurity policy battleground.  Unfortunately, like so many reactive policy issues, the encryption debate remains muddled in myopic discussions that ignore the complex realities of both technology as well as the modern international system. Since the technological challenges have been widely covered, below are just three of the key structural social challenges that further indicate that it’s time to move onto more productive discussion regarding the national security implications of the cyber domain.

 

  • Collective Action Problem  – Similar to the Wassenaar Arrangement, any policy that depends on global adherence will fail unless it is in everyone’s interest to abide by it. Digital safe havens will continue to exist with and without legislation requiring backdoor access to data. Nefarious actors will take advantage of and circumvent any legal mandates if deemed in their best interest to do so. This is why norms are so challenging in this domain. Because – whether illegal or not – encryption without backdoor access will be used by criminals, spies and terrorists if it helps them achieve their objectives. Moreover, adhering to the law would then become a self-imposed competitive disadvantage for corporations as it could weaken the security and protection of their PII and IP. Weakening encryption assists those trying to exploit the system or limit civil liberties, while hindering those trying to protect them. Given the very widespread data breaches of the last few years, if anything, we need stronger security practices around our personal and intellectual data, not weaker.

 

  • Dictatorships – While the notion that we’re entering an era of authoritarian resurgence remains highly debated, it is clear that major powers such as China and Russia, as well as smaller states like Uzbekistan, continue to leverage the Internet as a key source of international statecraft and domestic control. Many state and non-state actors better achieve their objectives if the Internet is not free and open. In this case, encryption becomes part of their strategy of domestic control: they implement impenetrable encryption to safeguard their own communications and data (with no obligation to provide backdoor access), while cracking others’ encryption as part of a larger surveillance strategy. Dictatorships further achieve these objectives by working with companies whose main purpose is to crack the encryption systems of companies such as Facebook and Google. As long as there are leaders who pursue domestic policies of censorship and Internet control, they will find ways to impose or crack encryption systems to their benefit. They also constantly pursue vulnerabilities and weaknesses to exploit – especially among pro-democracy groups and social media companies – and will therefore devote significant resources toward gaining access to data via any backdoor channels.

 

  • Head in the sand – Finally, as policy slowly muddles along to grasp technological realities, encryption systems are increasingly ubiquitous. The recent presidential debates demonstrated the void in comprehension of the problem and certainly did not provide viable solutions. On the one hand, the most recent Democratic debate avoided providing any coherent platform other than the need to balance security and privacy. The Republican debate similarly failed to offer viable solutions, with bewildering comments ranging from cutting off parts of the Internet to confusing statements about smartphone encryption. Unfortunately, it’s possible that reactive policy responses may win out over more thoughtful recommendations that clearly address the core problems. The recent terrorist acts put renewed pressure on Congress to respond quickly to a dominant national security concern, elevating the risk that misguided policy will be passed.

 

For instance, there has been talk of a bipartisan commission that would bring both DC and Silicon Valley leadership together to explore the problem, similar to the 9/11 commission. Worried that it will take too long, the Senate may instead push forward with encryption legislation that may not adequately address the actual national security challenges. A bipartisan commission – a rare display of unity in Congress – could help Congressional leaders better grasp the technical implications of their policies, while also helping the tech community better comprehend the complexity of modern national security challenges. Until then, based on the recent level of discourse, the more likely reality unfortunately is ill-conceived, reactionary legislation.

The encryption debate – centered at its core on whether there is a security and privacy trade-off – only continues to further the wedge between DC and Silicon Valley. It would be more productive for both the tech and policy communities to look beyond encryption. Although cybersecurity was not addressed in last month’s State of the Union address, hopefully meetings such as the one between national security leaders and Silicon Valley CEOs last month are a sign that these two sides can work toward more innovative solutions that meet both the technological and geopolitical realities of the current era. Of course, this will require both sides to compromise. Silicon Valley needs to accept that safeguards are necessary given the national security landscape, while Congress needs to lean on Silicon Valley to optimize the way advanced technologies can simultaneously protect both privacy and national security. Until then, we’re likely to see misguided policy proposals that are ill-fitted to achieve the desired national security objectives.

Moving Beyond the Encryption Debate

Andrea Little Limbago

Distilling the Key Aspects of Yesterday’s Threat Assessment, Budget Proposal, and Action Plan


In light of the latest breach – including 200GB of PII of Department of Justice and FBI personnel – yesterday’s news from DC is all the more compelling. As is often the case, the most intriguing aspects are hidden deep within the texts or spread across the various documents and hearings. To help make sense of this extremely active week in cyber policy, we have analyzed some of the crosscutting themes on the threat and policy responses from the Worldwide Threat Assessment, the President’s budget proposal, and the Cybersecurity National Action Plan (CNAP):

 

Disparate & Unprecedented Threats

  • State & Non-State Actors: The CNAP and the threat assessment both highlight the range of adversaries, including criminals, lone wolves, terrorists, and state-sponsored espionage (i.e. spies). The sophistication of their techniques clearly varies, but each type of threat actor is increasingly leaning on the availability and low risk of offensive cyber operations to achieve its objectives.
  • Adversaries’ offensive tradecraft: Threat actors are keeping all options on the table, pursuing the range of cyber statecraft from propaganda to deception to espionage. Both Russia and China rely heavily on misinformation and espionage, while data integrity and accountability are increasingly problematic, which has strategic-level implications for attribution and U.S. policy responses.
  • Targets: The targets vary depending on the threat actor, which means that most industries remain potential targets. Those entities with significant PII, IP, or critical infrastructure are at the greatest risk. These include power grids and financial systems, as well as defense contractors.
  • Tech & Data Science: Cyber and technology dominate all discussions of leading national security challenges, consistent with previous assessments. In contrast, data science and security are rarely referenced when talking about adversaries’ capabilities, but this year’s threat assessment breaks new ground in identifying the foreign data science capabilities of threat actors. While Director Clapper focuses more on foreign data collection capabilities, the sophistication of the data science will determine any insights that can be gleaned from the collection.
  • Between the lines: There is increasingly the potential for unintended consequences given the complex mix of actors, capabilities, and targets. Sophisticated digital tools in the hands of unsophisticated actors are likely to produce negative externalities. Moreover, adversaries’ risk calculus is extraordinarily slanted in favor of offensive attacks. As long as the benefits of a cyber attack outweigh the costs, prepare for more high profile breaches.

 

Multi-faceted Responses

  • Greater spending: The new budget proposal includes a 35% increase in cybersecurity spending to $19 billion. This will cover a broad range of initiatives, including new defensive teams, IT modernization, and broader training initiatives across society.
  • Additional bureaucracy: Just as the NCCIC was formed to create a central source for information sharing, the CNAP recommends the creation of a federal CISO. While the intent is to parallel the organizational structure of the private sector, it may cause confusion considering there is already an extant cyber czar.
  • Proactive hunting: Given the seemingly endless string of breaches, the CNAP calls for “proactively hunting for intruders”. This will be an interesting area to observe, as it’s among the first federal signs of an offensive-based strategy to defend the government networks.
  • Tech Outreach: The budget and the CNAP both stress the need for better government relationships with Silicon Valley. This includes the formation of a new commission comprised of national security experts and Silicon Valley technologists, which would be responsible for longer-term cyber initiatives. President Obama’s reference to the federal system as an “Atari game in an Xbox world” likely resonates with the tech crowd. However, given the absence of anything close to security at this week’s Crunchies, it is unclear whether Silicon Valley is ready to invest in the tough security challenges.
  • Elevated Role of R&D: - The CNAP calls for a testing lab for government and industry to pursue cutting-edge technologies. Director Clapper similarly noted the need to stay ahead of the sophisticated research of many adversarial states in the realms of AI, data science and the Internet of Things. This may be another signal that we are working toward crafting this era’s Sputnik moment, just as President Obama described over five years ago.
  • Between the lines: Protecting digital infrastructure remains a top national security priority, with an emphasis on strengthening and diversifying our cyber defenses to counter the growing range of adversaries. Interestingly, the pursuit of norms to counter adversarial behavior was markedly absent, potentially because it has yet to have any clear deterrent effect. Instead, the budget and CNAP advocate for changes across the workforce, modernization of archaic federal IT infrastructure, creative strategic thinking, proactive cyber techniques, and strengthened partnerships between Silicon Valley and DC. This is a challenge that requires the best strategic thinkers working alongside the most innovative technologists to help secure the country’s critical assets. The budget battle has already begun, so it is uncertain whether many of these necessary changes will in fact become a reality.

Distilling the Key Aspects of Yesterday’s Threat Assessment, Budget Proposal, and Action Plan

Andrea Little Limbago

Welcome to the Jungle: RSA 2016


 

RSA is just a few weeks away, and everyone is finalizing their dance cards. There are multiple opportunities to meet the Endgame team and talk about everything from the Endgame Hunt Cycle to data science to our global network of honeypot sensors to gender diversity in the cybersecurity workplace.

 

1. Booth 2127 – Stop by our RSA booth to learn how Endgame detects known and unknown adversarial techniques and eradicates them from enterprise networks.

2. Lightning Tech Talks – We’re excited to share with you some of the great work of our R&D team. Building upon the theme of multi-layer detection, we’ll show you three distinct approaches to detection. The first focuses on strategic-level trends, providing insights garnered from our global honeypot network. The second dives into dynamic malware analysis and the tit-for-tat interactions of defenders and attackers. The final talk describes our automated malware classification capabilities, which build upon the broad expertise of our data science team.

3. Personalized demo – Overwhelmed by the crowds? Prefer a quieter, calmer environment to take a look at the Endgame platform? Schedule a private demo here.

 

Welcome to the Jungle: RSA 2016

Editorial Team

Employing Latent Semantic Analysis to Detect Malicious Command Line Behavior


Detecting anomalous behavior remains one of security’s most impactful data science challenges. Most approaches rely on signature-based techniques, which are reactive in nature and fail to predict new patterns of malicious behavior and modern adversarial techniques. Instead, as a key component of research in intrusion detection, I’ll focus on command line anomaly detection using a machine learning-based approach. A model based on command line history can potentially detect a range of anomalous behavior, including intruders using stolen credentials and insider threats. Command lines contain a wealth of information and serve as a valid proxy for user intent. Users have their own discrete preferences for commands, which can be modeled using a combination of unsupervised machine learning and natural language processing. I demonstrate the ability to model discrete commands, highlighting normal behavior while also detecting outliers that may be indicative of an intrusion. This approach can help inform anomaly detection at scale without requiring extensive resources or domain expertise.

A Little Intro Material

Before diving into the model, it’s helpful to quickly address previous research, the model’s assumptions, and its key components. Some previous work focuses solely on the commands, while other work also uses a command’s arguments to create a richer dataset. I focus only on commands and leave the arguments for future work. In addition, this work focuses on server resources, as opposed to personal computers, where the command line is often not the primary means of interacting with the machine. Since we are focusing on enterprise-scale security, I leave applications of this model to personal computers for future work. I also focus on UNIX/Linux/BSD machines due to current data availability.

Authors of previous work often rely on the uniqueness of their set of commands. For (an overly simple) example, developer A uses emacs while developer B uses vi, hence it is an anomaly if user A uses vi. These works come in many forms, including sequence alignment (similar to bioinformatics), command frequency comparisons, and transition models (such as Hidden Markov Models). One common issue across many of these works is the explosion in the number of dimensions. To illustrate this, how many commands can you type from your command line? My OS X machine has about 2,000 commands. Now add Linux, Windows, and all the uncommon or custom commands. This can easily grow to the order of tens of thousands of commands (or dimensions)!

In addition to dimensionality challenges, data representation further contributes to the complexity of the data environment. There are many ways to represent a set of command sequences. The simplest is to keep them as strings. Strings can work for some algorithms, but they can lack efficiency and generalization. For example, assumptions of Gaussian distributions don’t really work for strings, and plugging strings into complex models that require mathematical operators like matrix multiplication (e.g., neural nets) is not going to work. Often, people use one-hot encoding in order to use more complicated models with nominal data, but this still suffers from the curse of dimensionality as the number of unique names increases. In addition, one-hot encoding treats each unique categorical value as completely independent from every other value. This, of course, is not an accurate assumption when classifying command lines.

Fortunately, dimensionality reduction algorithms can counteract the growing number of dimensions caused by one-hot encoding. Principal Component Analysis (PCA) is one of the most common data reduction techniques, but one-hot encodings don’t follow the Gaussian distributions for which PCA would optimally reduce the data. Another technique is binary encoding. This technique is generic, making it easy to use, but it can suffer in performance because it doesn’t take domain-specific knowledge into account. Binary encoding is typically used for compression, but it actually works fairly well for encoding categorical variables when each bit is treated as a feature. A toy comparison of the two encodings is sketched below.
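To make the dimensionality contrast concrete, here is a minimal Python sketch (not from the original post; the eight-command vocabulary is an illustrative assumption) of the two encodings:

import numpy as np

# Hypothetical vocabulary; a real deployment would have tens of thousands.
commands = ["ls", "cd", "cat", "vi", "emacs", "grep", "mv", "cp"]
index = {cmd: i for i, cmd in enumerate(commands)}

def one_hot(cmd):
    # One dimension per unique command: 10,000 commands -> 10,000 dims.
    vec = np.zeros(len(commands))
    vec[index[cmd]] = 1.0
    return vec

def binary(cmd):
    # Encode the command's index in binary: 10,000 commands -> 14 dims.
    bits = max(1, int(np.ceil(np.log2(len(commands)))))
    return np.array([(index[cmd] >> b) & 1 for b in range(bits)], dtype=float)

print(one_hot("vi"))  # 8 dimensions for 8 commands
print(binary("vi"))   # only 3 dimensions for the same 8 commands

Note how the one-hot vector grows linearly with the vocabulary while the binary vector grows logarithmically, at the cost of bits that carry no domain meaning.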

So how can we reduce the number of dimensions while using domain knowledge to squeeze the best performance out of our classifiers? One answer, which I present here, is Latent Semantic Analysis, or LSA (also known as Latent Semantic Indexing, or LSI). LSA is a technique that takes in a large set of documents (many thousands or more) and assigns “topics” to each document through singular value decomposition (SVD). LSA is a mature and heavily used technique (meaning lots of open source!) in many other domains. To generate the topics, I use man pages and other documentation for each command. A minimal sketch of this step appears below.
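Here is a minimal sketch of that topic-modeling step, assuming each command’s man page has been dumped to a plain-text file. The corpus/ directory layout and the 200-topic count are illustrative assumptions, not details from the original experiment:

import glob
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

# Assumed layout: one man page per file, e.g. `man ls | col -b > corpus/ls.txt`.
paths = sorted(glob.glob("corpus/*.txt"))
commands = [os.path.splitext(os.path.basename(p))[0] for p in paths]
docs = [open(p, encoding="utf-8", errors="ignore").read() for p in paths]

# TF-IDF the man pages, project onto 200 latent "topics" via truncated SVD,
# then L2-normalize so cosine similarity reduces to a dot product.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=200, random_state=0),
    Normalizer(copy=False),
)
topic_vectors = lsa.fit_transform(docs)  # shape: (n_commands, 200)
command_topics = dict(zip(commands, topic_vectors))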

The assumption (or hypothesis) is that we can represent commands as a distribution of some limited and overlapping set of topics that proxy user intent, and can be used for detecting anomalous behavior. For an overlapping example, mv (or move) can be mimicked using a cp (copy) and an rm (delete). Or, from our previous example, emacs and vi do basically the same thing and probably overlap quite a bit.

LSA on Command Lines

To test the hypothesis, I need to evaluate how well LSA organizes commands into topics using the text from man pages. I use around 3,100 commands (and their respective man pages) to train my LSA model. Next, I take the top 50 most used commands and show how well they cluster with other commands using cosine similarity. I could visualize even more commands, but the intent is to show a coherent and understandable clustering of commands (so you don’t have to run man a hundred times to understand the graphic). Likewise, only edges with weights greater than 0.8 are kept for visualization purposes (where cosine similarity is bounded in [0,1], with 1 as the most similar).
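Under the same assumptions as the previous sketch, building the edge list for that graph might look like the following, reusing the command_topics mapping; the 0.8 threshold is from the post, everything else is illustrative:

import itertools
import numpy as np

# Vectors were L2-normalized above, so cosine similarity is a dot product.
edges = []
for a, b in itertools.combinations(command_topics, 2):
    sim = float(np.dot(command_topics[a], command_topics[b]))
    if sim > 0.8:  # keep only strong similarities, as in the visualization
        edges.append((a, b, sim))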

[Figure: the 50 most-used commands, linked by edges where the cosine similarity of their LSA topic vectors exceeds 0.8]
If you look closely you can see clusters of like commands. This was done totally unsupervised.  No domain experts. That's pretty cool!

That’s a great first step, but how can we use this to classify command lines? The idea is to average intent over small windows of commands (such as three, ten, or fifty commands) and to use that average as a feature vector. For example, if the user types cd, ls, cat, we find the LSA representation of each command from its corresponding man page. Assuming we model commands with 200 topics, we take each of the three 200-point feature vectors and compute a simple mean to get one 200-point feature vector for those three commands. I tried a few other simple ways of combining feature vectors, such as concatenation, but found that the mean works best. Of course, there are probably better, more advanced techniques, but those are left to future work. We can generate a large training and testing set by applying a sliding window over a user’s command sequence.

For fun, I use the one-class SVM from sklearn and employ data from the command line histories of eleven colleagues. I create a total of eleven models, one trained on each respective user. These are one-class models, so no positive (i.e., anomalous) examples appear in any of the training data. I run ten folds using this setup and average the results. For each fold, I train on 50% of the data and hold out 50% of all commands from each user for testing. I admit this setup is not completely representative of a real-world deployment, since here anomalous command sequences far outnumber normal ones. I also perform the most basic preprocessing on the man pages before running LSA to create topics, such as stemming and stop-word removal using NLTK and stop_words (both can be installed through pip). A compressed sketch of the setup follows.
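The following is a compressed sketch of that training loop under stated assumptions: it reuses the command_topics mapping from the earlier snippets, and the window size and nu value are illustrative rather than the post’s tuned settings:

import numpy as np
from sklearn.svm import OneClassSVM

def window_features(history, command_topics, window=10):
    # Mean LSA topic vector over each sliding window of commands; commands
    # without a man page (hence no topic vector) are skipped, as in the post.
    vecs = [command_topics[c] for c in history if c in command_topics]
    return np.array([np.mean(vecs[i:i + window], axis=0)
                     for i in range(len(vecs) - window + 1)])

def train_user_model(history, command_topics, window=10):
    X = window_features(history, command_topics, window)
    return OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)

# model.predict(window_features(...)) returns +1 for windows that look like
# the training user and -1 for windows flagged as anomalous.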

For a baseline, I run the same experiment using one-hot, binary, and PCA encoded feature vectors for each command. I take the mean of these feature vectors over windows as I did before.

I run the experiment on windows of three, ten, and fifty commands and display the corresponding receiver operating characteristic (ROC) curves. The ROC curves describe how well the eleven user models identified the held-out commands. One caveat is that not all commands are represented in the man pages. For simplicity and reproducibility, I currently ignore those commands and leave them for future work.

The first image is not so great. Here we are displaying the ROC for a window size of three. Except for PCA, everything is about the same; LSA is marginally better than one-hot and binary encoding. However, with such a small window size, you’re best off using PCA.

As we increase the window size, the results get a little more interesting. One-hot and LSA encoding get the largest boost in performance, while PCA degrades. As I stated earlier, PCA is a bad choice for reducing categorical variables, so this drop-off is not overly surprising. The other interesting point is that larger windows make for a better classifier. This is also not very surprising, as the users in this study are very similar in their usage patterns; larger windows incorporate more context, allowing for a more informative feature vector.

The results get even better for LSA with a window size of fifty.  Of course, we could enrich our features with command line arguments and probably get even better results, but we are already doing really well with just the commands.

Final Thoughts

LSA works very well in clustering commands, serving as a useful proxy for user intent and, more importantly, detecting anomalous behavior. This was completely unsupervised, making the model an easy fit for a real-world deployment where labels often don’t exist. One assumption of this post is that the training data is not polluted (i.e., does not contain command line sequences from other users). Also, this data comes from the command lines of software developers and researchers who are very similar in their usage patterns. This means a command line pattern may be common across several users, leading to false negatives in this experimental setup. Hence, we may see much better results when we feed a malicious user’s command lines to a normal user’s model. In addition, we could possibly create a more robust model by using the command histories of multiple normal users (instead of building a model from a single user). I will leave the answers to these questions to another post!

 

 

Employing Latent Semantic Analysis to Detect Malicious Command Line Behavior

Jonathan Woodbridge

Endgame Tech Talks @ RSA: Adding Substance to Form


Last week, Endgame’s malware researchers and data scientists provided a welcome break from the chaos of the convention floor at RSA. Our four talks addressed the need for a multi-stage approach to detection, given the sophistication and diversity of attackers and the complexity of enterprise networks. Since no single detection methodology is fail-proof, multiple comprehensive detection capabilities are required to maximize the likelihood of quickly detecting known and unknown attacks.

 

With that in mind, our talks began with an overview of Faraday, Endgame’s globally distributed set of customized sensors that listens to activity on the Internet. This talk addressed the ability to differentiate targeted from non-targeted attacks, as well as some recent research on the Cisco ASA vulnerability. It was followed by a talk on the five most impactful malicious behaviors: what they are, how they have evolved over time and in sophistication, and how to counter them. Next, our data science talk covered the use of machine learning to automate malware classification and contextualize it by determining capabilities. We concluded with the essential role of stealth in helping defenders evade detection by adversaries. Together, our talks presented four unique aspects of our multi-stage approach to detection, which feed into the Endgame cyber operations platform and inform our hunting capabilities.

 

Take a look for yourself at each of these unique presentations and diverse approaches to detection.

Extracting the Malware Signal from the Internet Noise: Andrew Morris

Dynamic Detection of Malicious Behavior: Amanda Rousseau

Machine Learning for Malware Classification and Clustering: Phil Roth

Worst-Case Scenario: Being Detected without Knowing You’re Detected: Braden Preston

 

Endgame Tech Talks @ RSA: Adding Substance to Form

Editorial Team