
Geeks, Machines and Outsiders: How the Security Industry Fared at RSA


Last week at RSA—the security industry’s largest conference—Andrew McAfee, co-author of “The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies”, introduced the trifecta of geeks, machines and outsiders as technological innovation’s driving factors. However, after listening to numerous panels and talks during the week that glossed over or downplayed the relevance of geeks, machines and outsiders in moving the security industry forward, it was impossible to miss the irony of McAfee’s argument.

So using the criteria of geeks, machines and outsiders as the driving factors in technology innovation, how does the security industry fare? Based on my week at RSA, here is my assessment:

  • Geeks: By geeks, McAfee refers to people who are driven by evidence and data. Despite the buzzword bingo of anomaly detection, outliers and machine learning, it is not apparent that the implementation of data science in security has evolved to the point it has in other industries. This might be shocking to industry insiders who believe data science has already reached its peak impact in security. To the contrary, as one presenter accurately noted, data science is “still in the dark ages in this space.”

    Most data science panels at RSA devoted entire presentations to non-technical and bureaucratic descriptions of data science. In fact, one presenter joked that the goal of the presentation was to only show one equation at most, and only in passing, in order to try to maintain the audience’s attention. While the need to reach a broader audience is understood, panels on similarly technical topics such as malware detection, authentication or encryption dove much deeper into the relevant technologies and methodologies. It’s unfortunate for the industry that the highly technical and complex realm of data science is not always granted the same privilege.

    Incorrect assumptions about data science were also prevalent. At one point during one of the talks, someone commented that “the more data you have, the higher the accuracy of the results.” Comments like these perpetuate the myth that more data is always better and ignore the distinction between precision, recall, and accuracy (a short illustration follows this list). Even worse, the notion of “garbage in, garbage out”, which is taught in any introductory quantitative course, did not even seem to be a consideration.

    Finally, security companies seem to buy into the notion that data scientists are necessary for the complex, dynamic big data environment, but they have no idea how to gainfully employ them. During one panel, a Q&A session focused on what to do with the data scientists in a company. Do you partner them with the marketing team? Finance? Something else? It was clear that data science remains an elusive concept that everyone knows they need but no one knows how to operationalize.
     

  • Machines: Ironically, it was a data science presentation that, although short on real data science, provided the strongest case for increasing human-machine interaction in security by illustrating its success in other industries. In his own argument about machines as a driving factor in technology innovation, McAfee pointed out that companies that ignore human-machine partnerships fall behind. This remains a dominant problem in the security industry, as the numerous high-profile breaches of the last few years illustrate.

    Unlike in many other extraordinarily technical fields, the human factor is often overlooked or ignored in security. Whether it’s boasting thousands of alerts a day (which no human could ever analyze), or the omnipresent donut/pie chart visualization that is the bane of the existence of anyone who actually has to use it, the human factor approach to security—like data science—lags well behind other industries. While there was an entire RSA category devoted to human factors, the vast majority of those panels focused on the insider threat rather than on the user experience in security. The importance of the human-machine interplay is simply not on the security industry’s radar.
     

  • Outsiders: McAfee’s last point about outsiders emphasizes the erroneous mindset in some industries that unless you grew up and were trained in that specific field, you have nothing to offer. Instead, industries that are open to ideas and skills from other fields will have the greatest success in the foreseeable future. This perspective has been the driving force of creative innovation throughout history. The wariness (and at times exclusion) of outsiders in the security industry is extraordinarily detrimental not only to the industry, but to corporate and national security as well. It impedes cooperation at the policy level and innovation within the security companies themselves. Although not commenting on the security industry specifically, McAfee reiterated the foundational role of a diversity of views and experiences, working collaboratively, in fostering innovation and paradigm shifts.

    This preference toward industry insiders is the driving factor limiting the integration of data science and human-machine partnerships and hindering security innovation. The response to McAfee himself was perhaps indicative of the industry’s perspective on the issue of outsiders. McAfee was the last keynote presenter of the day. Many attendees sat through a series of talks by security insiders, but unfortunately left when it came time for an outsider’s perspective.
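To make the precision/recall point above concrete, here is a minimal, purely illustrative sketch (using scikit-learn; the numbers are invented): a “detector” that never alerts on traffic that is 95% benign scores 95% accuracy while catching nothing.

            # Hypothetical illustration: accuracy alone can mask a useless detector.
            from sklearn.metrics import accuracy_score, precision_score, recall_score

            y_true = [0] * 95 + [1] * 5   # 100 events, only 5 of them malicious
            y_pred = [0] * 100            # a "detector" that never raises an alert

            print(accuracy_score(y_true, y_pred))                    # 0.95
            print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
            print(recall_score(y_true, y_pred))                      # 0.0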

Changing an embedded mindset can be even harder than developing the technical skills. This is especially apparent in the security industry, which has yet to figure out how to take the great advances in data science and human-machine interaction from other industries and leverage them for security. As a quantitative social scientist, it was truly mind-boggling to see just how nascent data science and user experience are in the security industry. The future of the security workplace should obviously maintain subject matter experts, but must also pair them with the data scientists who truly understand the realm of the possible, as well as UI/UX experts who can take the enormous complexity of the security data environment and render it useful to the vast user community. It’s ironic that such a technology-driven industry as security completely discounts its roots in Ada Lovelace’s vision of bringing together arts and sciences, machines and humans. Maintaining the status quo—which in the security industry is 0 for 3 in McAfee’s categories for innovation—should not be an option. There is simply too much at stake for corporate and national security. Technical innovation must be coupled with organizational innovation to truly leverage the insights of geeks, machines and outsiders in security.


Change: Three Ways to Challenge Today’s Security (UX) Thinking


Last week, I was fortunate enough to spend three and a half days on the floor at RSA for a conference built around the theme “Change: Challenge Today’s Security Thinking.” I was simply observing and absorbing the vast array of companies and products. As someone new to the world of security (but very well-versed in the field of UX), I was afforded an opportunity to look at an entire industry with a fresh perspective. One of the most distinctive challenges facing the growing world of user experience professionals is knowing just enough about a target user group to create compelling solutions without being too “in the weeds.” In my experience, being too close to a particular industry or audience segment can prevent the more objective approach that a seasoned designer can, and should, bring to a product. Having said that, there were some interesting trends as well as some areas that could benefit from the thematic undercurrent of “Change” presented at RSA. I focused my research on 51 companies spanning multiple verticals, sizes and problem sets—and because the majority were not direct Endgame competitors, the true purpose of my research was to understand more about how the industry thinks and to find key areas of improvement for the field of UX.

 

Color as a key component

Color played a large part in virtually every product, whether by choice or chance. Color palettes were dominated by bold hues that usually included black, gray, red, orange and blue. Yellow, purple and green were used far less frequently, and likely for good reason. Traditionally, black and gray represent simplicity, prestige and balance, with red and orange representing importance, danger, caution and change. Blue will always represent strength and trust. On the flip side, yellow, green and purple tend to represent sunshine and warmth, growth and fertility, and magic and mystery, unlikely traits in the security industry. Still, some companies utilized these weaker palette choices in their products, possibly without a true understanding of the “baggage” they bring.

Outside of content color, background color use went one of two ways: either dark content on a light background, the design paradigm used by roughly 80% of companies, or the much less common light-on-dark construct. Neither is “better” or “correct” in application development; however, the former tends to be more common in the business-to-business realm and is far more familiar to business-centric application users. When I asked the companies that had chosen the less common light-on-dark approach, they generally said they did so either to differentiate themselves or to target a very specific population of their target market. Whether these two outcomes are realized still remains to be seen. These companies were all young start-ups, clearly taking a bit of a risk.

 

Maps as presentation vehicles

There were a multitude of products that featured some sort of map – whether network, geographic, server, GPS, Sankey, or tree – you name it, there was a map for it. This was both good and bad. For those companies that did it well, the maps provided a much-needed visualization of data that wouldn’t fare well in a tabular or list format. When a security professional needs a bird’s-eye view of where their vulnerabilities lie, providing a visual representation over a list of IP addresses may allow them to comprehend what requires their attention in a fraction of the time. However, the maps started to suffer in situations where their presence had no clear purpose. Several products had unnecessary animations. Others were so small that the corresponding data and labels overlapped, rendering the graphic unusable. I saw quite a few stuck into a corner of a dashboard simply to fill an otherwise empty space. The D3 collapsible tree map was extremely popular, often at the cost of legibility and a clear understanding of the complexity of the processes that the visualizations were supposed to clarify.

 

Features as framework

Perhaps the greatest challenge I found in the majority of products from both small and large companies, but particularly the industry behemoths, was the lack of a clear, well thought-out information architecture (IA), particularly as it related to feature development and organization. There is a common misunderstanding that more features equate to better “sellability”, particularly in products that like to position themselves head-to-head with their competitors. In the industry, this is often referred to as feature bloat, and time and again it presents itself in products that are designed by product management, marketing, and/or engineers. Generally, these are the individuals who are the most removed from the end-user. It’s the idea that if some is good, more must be better, and the false assumption that commanding a big price tag means being able to do a lot. We see this as the mark of success in many industries including the automobile, electronics and vacation/travel sectors.

However, in an industry where time is critical and decision-making is crucial (and competitors are abundant), the feature bloat present in many products shown on the floor can detract from product success and may actually make those products harder to use when time is of the essence. Think scalpel over Swiss army knife, especially if you’re a startup.

 

What does this mean?

The good news is that UX is starting to make inroads in the security industry and this is an exciting time to be talking about UX in this massive field. Fully bringing UX to an entire product and team takes time, but there are three things that every company can start doing now.

  • First, know your audience and your brand. Figure out to whom you want to sell and for whom you want to build (hint: they may not be the same person). What does your company stand for? What are your core values and selling points? What problems are you solving and for whom? How are you solving them? Then figure out what it is that makes your company and your product different from everyone else. This is your own brand pyramid. Ask yourself with each new feature that gets proposed: “Does this align with our core strategy, and does our product really need this? Does this solve a specific problem for the user?” Don’t assume that you know the answer to this simply because you work in marketing or are an engineer. Subsequently, if your answer is “no” and/or the feature doesn’t align with supporting your original brand pyramid, it’s extraneous at best, and distracting or detrimental at worst.
  • Second, don’t be afraid to be different—but not so different that people don’t even understand it. This is where your UX team needs to understand how to do good user research and then analyze that research. Don’t just comb your analytics—watch people use your product. Don’t just ask your users what they need—it’s likely that they actually won’t be able to tell you. Don’t assume every user is of a certain demographic and will like some wacky color scheme. Instead, try to understand what it is they want to do with your product. Have an open conversation around their roles in their organizations and the problems they face in those roles. Seek ways to create solutions they wouldn’t have thought of and then iterate on how to best manifest that within the product interface without sacrificing usability.
  • Finally, offer unique and targeted solutions even if it means having more than one product. It is better to have several separate but logically connected solutions than it is to have a bloated product with many layers of navigation and too many features. If possible, create roles within the product and give those roles specific policies that can hide data and modules when a particular user does not need them. This may seem obvious, but again, when a feature is proposed, ask “does every user in my system need this and if so, are they all using it the same way?” Chances are, they aren’t.

Interestingly enough, on several occasions at RSA I heard the question “what products will this replace?” The end goal of any product should be to solve problems, not displace the competition. If a competitor’s product already solves a user’s problem, then your company is facing an uphill battle if the only goal is to unseat that product. Instead, ask if there is a more unique way to solve that same problem. Perhaps there is a different problem worth solving. Seek the blue ocean. As Apple would say, “Think Different.” Apple wasn’t successful because Apple wanted to outsell Microsoft. Apple was successful because Apple wanted to make products that solved users’ problems. They’ve done this by investing the necessary resources into their user experience. They’ve aligned their business with user needs. Sounds simple—but it takes dedication and time.

In the end, UX does take effort. It can feel like starting over. In some ways, it is. However, in every other industry that has embraced it, especially industries inundated with solutions (think healthcare, education, mobile development), it’s often the difference between an “ok product” and a market success. Even if your organization has already invested a lot of time in your existing products, as RSA taught us, it’s never too late to “Change”.

How the Sino-Russian Cyber Pact Furthers the Geopolitical Digital Divide


As I wrote at the end of last year, China and Russia have been in discussions to initiate a security agreement to tackle the various forms of digital behavior in cyberspace. Last Friday, Xi Jinping and Vladimir Putin formally signed a cyber security pact, bringing the two countries closer together and solidifying a virtual united front against the US. This non-aggression pact is just one of a series of cooperative agreements occurring in the cyber domain, and is indicative of the increasingly divisive power politics that will shape the polarity of the international system for decades to come.

Non-aggression pacts are not new, and by definition focus solely on preventing the use of force between signatories to the pact. However, although they are structured to only impact bilateral country relations, historically they have had significant international implications. By signaling ideological, political, or military intentions, non-aggression pacts can exclude similar levels of cooperation with other states. In fact, when states form neutrality pacts (which are similar to, but slightly distinct from, non-aggression pacts), the probability of a state initiating a conflict is 57% higher than for states without any alliance commitments. Regardless of the make-up of a state’s alliance portfolio—whether non-aggression or neutrality pacts, offensive or defensive alliances—a state’s involvement in alliances of any kind increases the likelihood of that state initiating conflict. It would be a mistake to assume that pacts in the cyber domain should be any different, as they serve as a similar signaling mechanism of affiliation in the international system. In fact, last week’s cyber security pact has already prompted analogies to the Molotov-Ribbentrop Pact, the non-aggression treaty signed in 1939 between Germany and the USSR. While externally the emphasis was on preventing conflict between the two signatories (which clearly didn’t last), the pact contained a secret protocol dividing parts of Eastern Europe into German and Soviet spheres of influence. In short, while non-aggression pacts may appear pacifistic, rarely has that been the case historically.

Moreover, the Sino-Russian pact provides a forum for each state to further shape the guiding principles and norms in cyberspace away from their foundation of Internet freedom of information and access, and toward the norm of cyberspace sovereignty. Following the surveillance revelations beginning in 2013, global interest in the notion of cyberspace sovereignty has increased, largely aimed at limiting external interventions viewed as an infringement on traditional notions of state sovereignty. On the surface, this merely extends the Westphalian notion of state sovereignty. However, authoritarian regimes (such as Russia and China) have coopted the de jure legitimacy of state sovereignty to control, monitor and censor information within their borders. This is orthogonal to the norms generally favored by Western democracies and further divides cyberspace into two distinct spheres defined by proponents of freedom of information versus proponents of domestic state control. The Sino-Russian pact will likely only encourage greater fractionalization of the Internet based on the norm of cyberspace sovereignty.

Finally, this pact must be viewed in the context of the growing trend of bilateral cyber security pacts. Japan and the US recently announced the Joint Defense Guidelines, which cover a wide range of cooperative activities targeted at the cyber domain and the promotion of international cyber norms. Just as the agreement with Japan is likely targeted at countering China, many states in the Middle East are requesting similar cooperation in light of the potential easing of Iranian sanctions. The Gulf Cooperation Council—a political and economic union of Arab states in the Middle East—is similarly pushing for a cyber security agreement with the US to help deter Iranian aggression in cyberspace. In short, these cooperative cyber security agreements are indicative of the larger power politics that shape the international system. States are increasingly jockeying for positions in cyberspace, signaling their intent and allegiance, which will have implications for the foreseeable future. The Sino-Russian agreement is only the latest in the string of cyber pacts that reflects the competing visions for cyberspace, and the ever-growing geopolitical digital divide.

 

Open-Sourcing Your Own Python Library 101


Python has become an increasingly common language for data scientists, back-end engineers, and front-end engineers, providing a unifying platform for the range of disciplines found on an engineering team. One of the benefits of Python is that it gives software developers an enormous ecosystem of good code packages to choose from. Among the many excellent Python packages, a data scientist may use pandas for data manipulation, NumPy for matrix computation, matplotlib for plotting, SciPy for mathematical modeling, and scikit-learn for machine learning. Another benefit of using Python is that it allows developers to contribute their own code packages to the community or share a library with other Python programmers. At Endgame, library sharing is very common across projects for agile product development. For example, the implementation of a new clustering algorithm as a Python library can be used in multiple products with minimal adaptation. This tutorial will cover the basic steps and recommended practices for how to structure a Python project, package the code, distribute it over a Git repository (GitHub or a private Git repository), and install the package via pip.

For busy readers, I’ve developed a workflow diagram, below, so that you can quickly glance at the steps that I’ll outline in more detail throughout the post. Feel free to look back at the workflow diagram anytime you need a reminder of how the process works.

 

 

[Figure: Workflow Diagram for Open-Sourcing a Python Library]

 

Step One: Setup

Let’s suppose we are going to develop a new Python package that will include some exciting machine learning functionality. We decide to name the package "egclustering" to indicate that it contains functions for clustering. In the future, if we are to develop a new set of functions for classification, we could create a new package called "egclassification". In this way, functions designed for different purposes are organized into different buckets. We will name the project folder on the local computer as "eglearning". In the end, the whole project will be version controlled via Git, and be put on a remote Git repository, either GitHub or a private remote repository. Anyone who wants to use the library would just need to install the package from the remote repository. 

Term Definitions

Before we dig into the details, let’s define some terms:

  • Python Module: A Python module is a .py file that contains classes, functions and/or other Python definitions and statements (a minimal sketch follows this list). More detailed information can be found here.
  • Python Package: A Python package includes a collection of modules and an __init__.py file. Packages can be nested at any depth, provided that the sub-directories contain their own __init__.py file.
  • Distribution: A distribution is one level higher than a package. A distribution may contain one or multiple packages. In file systems, a distribution is the folder that includes the folders of packages and a dedicated setup.py file. 
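As a concrete illustration of a module, here is a minimal, hypothetical sketch of the clusteringModuleLight.py file that appears in the project structure below (the function name and logic are invented for this tutorial):

            # egclustering/clusteringModuleLight.py -- a hypothetical, minimal module.
            import numpy as np

            def cluster_means(points, labels):
                """Return the mean coordinate of each labeled group of points."""
                points = np.asarray(points, dtype=float)
                labels = np.asarray(labels)
                return {label: points[labels == label].mean(axis=0)
                        for label in np.unique(labels)}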

 

Step Two: Project Structure

A clearly defined project structure is critically important when creating a Python code package. Not only will it present your work in an organized way and help users find valuable information easily, but it will also be much easier to add new packages or files in the future if the project scales.

I will take the recommendation from "Repository Structure and Python" to structure a new project, only adding a new file called README.md which is an introductory file used on GitHub, as shown below.

README.rst
README.md
LICENSE
setup.py
requirements.txt
egclustering/
            __init__.py
            clusteringModuleLight.py  (this .py file contains the core code)
            helpers.py
docs/
            conf.py
            index.rst
tests/
            test_basic.py
            test_advanced.py

The project structure is well explained on the page referenced above. Still, it might be helpful to emphasize a few points here:

  • setup.py is the file that tells a distribution tool, such as distutils or setuptools, how to install and configure the package. It is a must-have.
  • egclustering is the actual package name. How would we (or a distribution tool) know that? Because it contains a __init__.py file. The __init__.py file can be empty, or contain statements for some initialization activities (see the sketch after this list).
  • clusteringModuleLight.py is the core file that defines the classes and functions. A single .py file like this is called a module. A package may contain multiple modules. A package may also contain other packages, namely sub-packages, as long as a __init__.py is included in each package folder. A project may contain multiple packages as well. For instance, we could create a new folder on par with "egclustering" called "egclassification" and put a new __init__.py under it.
  • Once you find a structure you like, it can serve as a template for future projects. You only need to copy and paste the whole project folder and give it a new project name. More advanced users can try a template tool such as cookiecutter.
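For example, a minimal __init__.py that re-exports the package's public API might look like this (assuming the hypothetical cluster_means function sketched earlier):

            # egclustering/__init__.py -- hypothetical initialization file.
            # Re-export the public API so users can simply write:
            #     from egclustering import cluster_means
            from .clusteringModuleLight import cluster_means

            __all__ = ['cluster_means']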

 

Step Three: Setup Git and GitHub (or private GitHub) Repository

Press Ctrl+Alt+T to open a new terminal, and type in the following two commands to install Git on your computer, if you haven't done so already.

            sudo apt-get update
            sudo apt-get install git

If the remote repository will be on GitHub (or any other source code host, such as bitbucket.org), open a web browser and go to github.com, apply for an account, and create a new repository with a name like 'peterpan' in my case. If the remote repository will be on a private GitHub, create a new repository in a similar way. In either situation, you will need to tell GitHub your public key so that you can use ssh protocol to access the repository. 

To generate a new pair of ssh keys (private and public), type the commands in the terminal:

            ssh-keygen -t rsa -C "your_email@example.com"
            eval "$(ssh-agent -s)"
            ssh-add ~/.ssh/id_rsa

Then go to the settings page of your GitHub account and copy and paste the content of the .pub file into a new key. The details of generating ssh keys can be found on this settings page.

You should now have a new repository on GitHub ready to go. Click on the link of the repo and it will open the repo's webpage. At the moment, you only have a master branch. We need to create a new branch called "develop" so that all the development will happen on the "develop" branch. Once the code reaches a level of maturity, we put it on "master" branch for release.

To do that, click "branch", and in the blank field, type "develop". When that's done, a new branch will be created. 

 

Step Four: Initiate the Local Git and Sync with the Remote Repository

So far, we have installed Git locally to control the source code version, created the skeleton structure of the project, and set up the remote repository that will be linked with the local Git. Now, open a terminal window and change directory (cd) into the project folder (in my case, ~/workspace/peterpan). Type:

            git init
            git add .  

The period “.” after “git add” tells Git to add the current folder, and everything under it, to version control.

If you haven't done so already, you will need to tell Git who you are. Type:

            git config --global user.name "your name"
            git config --global user.email "your email address"

Now let's tell local Git what remote repository it will be associated with. Before doing that, we need to get the URL of the remote repository so that the local Git knows where to locate it. On your browser, open the remote Git repository webpage, either on Github or your private GitHub. On the bottom of the right-side panel, you will see URL in different protocols of https, SSH, or subversion. If you're using GitHub and your repository is public, you may choose to use the https URL. Otherwise, use the SSH URL. Click the "copy to clipboard" button to copy the link.

In the same terminal, type:

            git remote -v 

to check what remote repositories you currently have. There should be nothing.

Now use the copied URL (which in my case is git@github.com:richardxy/peterpan.git) to construct the command below. "peterpanssh" is the name I gave to this specific remote repository which helps the local Git to identify which remote repository we deal with.

            git remote add peterpanssh git@github.com:richardxy/peterpan.git

When you type the command “git remote -v” again, you should see that the new remote repository has been registered with the local Git. You can add more remote repositories in this way using the “git remote add” command. If you would like to delete a remote repository, which basically means breaking the link between the local Git and the remote repository, you can run “git remote rm (repository name)”, for example:

            git remote rm peterpanssh

If you don't like the current name of a repository, you can rename it with “git remote rename (oldname) (newname)”, for example:

            git remote rename peterpanssh myrepo

At the moment, the local Git repository has only one branch. Use “git branch” to check, and you will see “master” only. A better practice is to create a “develop” branch and develop your work there. To do this, type:

            git checkout -b develop

Now type “git branch” again and hit enter in the terminal window, and you will see the branch “develop” with an asterisk in front of it, meaning that “develop” is the current working branch.
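For this walkthrough, the output would look like this, with the asterisk marking the active branch:

            * develop
              master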

Now that we have linked a remote Git repository with the local Git, we can start synchronizing them. When you created the new repository on the remote Git (GitHub or your company's private Git repository), you may have opted to add a .gitignore file. At the moment, the .gitignore file exists only in the remote repository, not in the local Git repository. So we need to pull it to the local repository and merge it with what we have locally. To do that, we use the command below:

            git pull peterpanssh develop 

Of course, peterpanssh is the name of the remote repository registered with the local git. You may use your own name.

“Git pull” works fine in small and simple projects like this. But when working on a project that has many branches in its repository, separate commands "git fetch" and "git merge" are recommended. More advanced materials can be found at git-pull Documentation and Mark's blog.

Once the local Git repository has everything the remote Git repository has (and more), we can commit and push the contents in the local Git to the remote Git.

The reason for committing to Git is to put the source code under Git's version control. The workflow related to committing usually includes:

Modify code -> Stage code -> Commit code

So, before we actually commit the code, we need to stage the modified files. We do this to tell Git what changes should be kept and put under version control. The easiest way to stage the changes is to use:

            git add -p

That will bring up an interactive session that presents you with all the changes and lets you decide to stage them or not. As we haven't made many changes so far, this interactive session should be short. Now we can enter:

            git commit -m "initial commit"

The letter "m" means "message", and the string after "-m" is the message to describe the commit.

After committing, the staged changes (by the "git add" command) are now placed in the local Git repository. The next step is to push it to the remote repository. Using the command below will do this:

            git push peterpanssh HEAD:develop

In this case, "peterpanssh" is the remote repository name registered with the local Git, and "develop" is the branch that you would like to push the code to. 

 

Step Five: Develop the Software Package

So far, we have built the entire infrastructure for hosting the local project, controlling the software versions both locally and remotely. Now it's time to work on the code in the package. To put the changes under version control (when you’re done with the project, or any time you think it’s needed), use:

            git add -p
            git commit -m "messages"
            git push repo_name HEAD:repo_branch

 

Step Six: Write setup.py

When your code package has reached a certain level of maturity, you can consider releasing it for distribution. A distribution may contain one or multiple packages that are meant to be installed at the same time. A designated setup.py file is required to be present in the folder that contains the package(s) to be distributed. Earlier, when we created the project structure, we already created an empty setup.py file. Now it's time to populate it with content.

A setup.py file contains at least the following information:

           from setuptools import setup, find_packages

           setup(
               name='eglearning',
               packages=find_packages(),
           )

There are a few distribution tools in Python. The standard tool for packaging in Python is distutils, and setuptools is an upgrade of distutils, with more features. In the setup() function, the minimum information we need to supply is the name of the distribution, and what packages are to be included. The function find_packages() will recursively go through the current folder and its sub-folders to collect package information, as long as a __init__.py is found in a package folder.
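As a quick sanity check, you can run find_packages() yourself from the project root; with the structure above it should report just the one package (a hypothetical interactive session):

           >>> from setuptools import find_packages
           >>> find_packages()
           ['egclustering']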

It is also helpful to provide the meta data for the distribution, such as version, a description of what the distribution does, and author information. If the distribution has dependencies, it is recommended to include the installation requirements in setup.py. Therefore, it may end up looking like this:

           from setuptools import setup, find_packages

           setup(
               name='eglearning',
               version='0.1a',
               description='a machine learning package developed at Endgame',
               packages=find_packages(),
               install_requires=[
                   'Pandas>=0.14',
                   'Numpy>=1.8',
                   'scikit-learn>=0.13',
                   'elasticsearch',
                   'pyes',
               ],
           )

To write a more advanced setup.py, the Python documentation or this web page are good resources.

When you are done with setup.py, commit the change and push it to the remote repository by typing the following commands:

           git add -p
           git commit -m 'modified setup.py'
           git push peterpanssh HEAD:develop

 

Step Seven: Merge Branch Develop to Master

According to Python engineer Vincent Driessen, "we consider origin/master to be the main branch where the source code of HEAD always reflects a production-ready state." When the code in the develop branch enters the production-ready state, it should be merged into the master branch. To do this, simply type in the terminal under the project directory:

           git checkout master
           git merge develop

Now we can push the master branch to the remote repository:

           git push peterpanssh

 

Step Eight: Install the Distribution from the Remote Repository

The Python package management tool "pip" supports the installation of a package distribution from a remote repository such as GitHub, or a private remote repository. pip currently supports cloning over the protocols of git, https and ssh. Here we will use ssh.

You may choose to install from a specific commit (identified by its commit hash) or from the latest commit on a branch. To specify a commit for cloning, type:

           sudo pip install -e git://github.com/richardxy/peterpan.git@4e476e99ce2649a679828cf01bb6b3fd7856281f#egg=MLM0.01

In this case, "github.com/richardxy/peterpan.git" is the ssh clone URL with the ":" after ".com" replaced by "/". This is tricky, and it won't work if you omit the replacement. The "egg" parameter is also required; its value is up to you.

If you opt to clone the latest version in the branch (e.g. “develop” branch), type:

           sudo pip install -e git://github.com/richardxy/peterpan.git@develop#egg=MLM0.02

You only need to specify the branch name after "@" and before the "egg" parameter. This is my preferred method.

Then pip will check if the installation requirements are met and install the dependencies and the package for you. Once it's done, type: 

           pip freeze

to find the newly installed package. You will see something like this:

           -e git://github.com/richardxy/peterpan.git@2251f3b9fd1b26cb41526f394dad81016d099b03#egg=eglearning-develop

Here, 2251f3b9fd1b26cb41526f394dad81016d099b03 is the hash of the latest commit.

Type the command below to create a requirements document that registers all of the installed packages and versions. 

           pip freeze > requirements.txt

Then open requirements.txt, replace the commit hash with the branch name, such as “develop”, and save it. The reason for doing this is that the next time a user tries to install the package, there may be new commits and the hash will have changed. Using the branch name will always point to the latest commit on that branch.
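After that edit, the requirements line from above would read:

           -e git://github.com/richardxy/peterpan.git@develop#egg=eglearning-develop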

One caveat: if virtualenv is used, the pip freeze command should look like this so that only the configurations in the virtual environment will be captured:

           pip freeze -l > requirements.txt

 

Conclusion

This tutorial covers the most fundamental and essential procedures for creating a Python project, applying version control during development, packaging the code, distributing it over code-sharing repositories, and installing the package by cloning the source code. Following this process can help data scientists without formal computer science training get more comfortable with Python and well-known collaborative tools for software development and distribution.

Stop Saying Stegosploit Is An Exploit


Security researcher Saumil Shah recently presented “Stegosploit” (slides available here). His presentation received a lot of attention on several hacker news sites, including Security Affairs, Hacker News, and Motherboard, reporting that users could be exploited simply by viewing a malicious image file in their web browser. If that were true, this would be terrifying.

“Just look at the image and you are HACKED!” – thehackernews

Here’s the thing. That is not what is happening with Stegosploit. Saumil Shah has created a “polyglot”. A polyglot is defined as “a person who knows and is able to use several languages,” but in the security world, the term can refer to a file that is a valid representation of two different data types. For example, you can concatenate a RAR file to the end of a JPG file. If you double click the JPG image, a photo pops up. If you then rename that JPG file to a .rar file, the appended RAR file will open. This works because of how the two formats locate where their data begins: a JPG parser reads from the start of the file, while RAR tools scan for the RAR signature. Stegosploit uses this same premise to embed JavaScript code inside of an image file and obscure the JavaScript payload within pixel data.
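A minimal sketch of the JPG/RAR trick described above, assuming a photo.jpg and an archive.rar already exist on disk (the file names are hypothetical):

            # Build a JPG/RAR polyglot by simple concatenation.
            # JPG parsers read from the start of the file, while RAR tools
            # scan for the RAR signature, so both formats remain valid.
            with open('photo.jpg', 'rb') as f:
                jpg_bytes = f.read()
            with open('archive.rar', 'rb') as f:
                rar_bytes = f.read()

            with open('polyglot.jpg', 'wb') as out:
                out.write(jpg_bytes + rar_bytes)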

This is still an interesting vector due to the difficulty of detection. It adds a layer of obfuscation, which relies on security through obscurity to avoid detection.

Embedding your code inside images requires a defensive product to not only process every packet, but also to inspect the individual artifacts extracted from the connection. Security through obscurity is widely considered ineffective. However, it is important to note that in order to identify even the most rudimentary steganography, you have to analyze every image file, which is computationally expensive, and increases the cost to defenders.

What is really interesting here is that Saumil Shah was actually rather forthcoming about this during his talk, clearly announcing that he was using a loader to deliver the payload, although that may not have been obvious to some of the observers. The exploit was delivered because the attacker sent malicious, obfuscated JavaScript to the browser. Stegosploit simply obfuscates an attack that could have been executed anyway. Just looking at an image will not exploit your web browser.

 

 

[Screenshot from the recording of the conference talk: the Stegosploit exploit “loader”]

In the screenshot above, Saumil is showing the audience the exploit “loader”. This is where a traditional JavaScript payload would be injected. The operative text in that screenshot is <script src=”elephant3.jpg”></script>, which takes a valid image file and interprets it as JavaScript. It simply injects the malicious code into a carrier signal so it looks innocuous. While it may seem like splitting hairs, there is an extremely important distinction between “looking at this photo will exploit your machine” and “this photo is camouflage that hides an exploit that has already occurred.”

All that being said, legitimate image exploits have been discovered in the past. Most notably, MS04-028 actually exploited the JPG processing library. In this case, loading an image into your browser would quite literally exploit your machine. This was tagged as a critical vulnerability, and promptly patched.

Stegosploit is an obfuscation technique to hide an exploit within images. It creates a JavaScript/image polyglot. Don’t worry, you can keep looking at captioned cat photos without fear.

Much Ado About Wassenaar: The Overlooked Strategic Challenges to the Wassenaar Arrangement’s Implementation


In the past couple of weeks, the US Bureau of Industry and Security (BIS), part of the US Department of Commerce, announced the potential implementation of the 2013 changes to the Wassenaar Arrangement (WA), a multinational arrangement intended to control the export of certain “dual-use” technologies. The proposed changes place additional controls on the export of “systems, equipment or components specially designed for the generation, operation or delivery of, or communication with, intrusion software.” Many in the security community have been extraordinarily vocal in opposition to this announcement, especially with regard to the newly proposed definition of "intrusion software" in the WA. This debate is important and should contribute to the open comment period requested by the BIS, which ends July 20. While the WA appears to be a legitimate attempt to control the export of subversive software, the vague wording has raised alarms within the security community.

For decades the security community has developed and studied exploit and intrusion techniques to understand and improve defenses. Like many research endeavors, it has involved the development, sharing, and analysis of information across national boundaries through articles, conferences, and academic publications. This research has successfully produced countermeasures like DEP (Data Execution Prevention) and ASLR (Address Space Layout Randomization), which mitigate numerous exploits seen in the wild. These kinds of countermeasures resulted directly from the kind of exploitation research that falls under the new WA definition. While a robust debate on the WA’s implications is useful for the security community, what seems to be lacking is a strategic-level discussion on whether these kinds of arrangements even have the potential to achieve the desired effect. The debate over the definition and wording of key terms is indicative of the larger hurdles these kinds of multinational arrangements encounter. This is especially problematic when building upon legacy agreements. By most measures, the WA simply renamed the COCOM (Coordinating Committee for Multilateral Export Controls) export control regime, a Cold War relic designed to limit the export of weapons and dual-use technologies to the Soviet bloc. The Cold War ended a quarter of a century ago, and yet agreements like the WA are still built on that same mentality and framework. Below are four key areas that impact the ability of the WA (and similar agreements) to achieve the desired effect of “international stability” and should be considered when seeking to limit the diffusion of strategically important and potentially destructive materials.

1. Members only: There are only 41 signatories to the WA (see the map below*). While to some that may seem extensive, it reflects less than a quarter of the states in the international community. In layman’s terms, three-quarters of the countries will be playing by a completely different set of rules and regulations, putting those who implement it at a competitive disadvantage – economically and in national security. Moreover, it means that three-quarters of the countries can export these potentially dual-use technologies – including countries like China, Iran, North Korea – rendering it unlikely to achieve the desired effect. To be clear, this concern is not just about US adversaries, but also about allies that could gain a competitive advantage. Israel, not a signatory of the WA, has a thriving cyber security industry and may increasingly attract more investment (and innovation!) in light of implementation of the WA.

2. Credible commitments: International cooperation depends heavily on credible commitments and the ability of states to implement the policies embedded in the treaty domestically. As membership rises, so too does diversity in domestic political institutions and foreign policy objectives. It would be startling (to say the least) if Western European countries and Russia pursued implementations that produced uniform adherence to the WA. Even within Western Europe, elections may usher in a new way of approaching digital security. The recent UK elections that produced a Tory majority may alter legislation pertaining to surveillance issues, and may run counter to the WA.

3. Ambiguity of language: The most unifying theme of the security community’s opposition to the WA is the vague and open-ended definition of intrusion software. By some estimates, anti-virus software and Chrome auto-updates may fit within the definition. The government will likely receive many comments on the definition over the 60-day response period. It is strongly in the best interest of all parties involved if greater specificity is included. Otherwise, there will continue to be headlines vilifying the government for classifying everything digital as a weapon of war, which clearly is not the case. As we grapple with securing systems globally and ensuring our defenses can prevent advanced threats, one might imagine a future where loose policy definitions move software and techniques underground or off-shore for fear of prosecution. This could be counterproductive to understanding and securing the new and changing connected world.

4. Rudderless ship: The most successful international agreements have relied heavily on global leadership, either directly by a hegemonic state or indirectly through leadership within a specific international governmental organization (IGO). This leadership is essential to ensure compliance with, and diffusion of, the norms inherent in a treaty or agreement. The WA lacks any form of IGO support and certainly lacks any hegemonic or bipolar leadership. Even if this leadership did exist, the cyber domain simply lends itself to obfuscation and manipulation of data and techniques, rendering external monitoring difficult. Moreover, China and Russia continue to push norms completely orthogonal to those of the WA, including cyber sovereignty. Without global acceptance and agreement on these foundational concepts, the WA has little chance of adherence even if there is domestic support for the verbiage (which clearly is not currently the case).

In short, the hurdles the WA will encounter when trying to achieve its objectives constitute a typical two-level game that hinders international cooperation. States must balance international polarity and norms on the one hand with domestic constituents, institutions and domestic norms on the other. Without the proper conditions at both the domestic and international levels, agreements have little chance of actually achieving their objectives. If the goal is truly international stability, human rights, and privacy, the WA may not be the optimal means of achieving it. As organizations, researchers, and activists continue to contribute to the critical debate about the value and feasibility of the WA, the policy and security communities should take advantage of the open comment period and remember that the complexity and dynamism of the current digital landscape requires novel thinking beyond obsolete Cold War approaches.

*Wassenaar Arrangement Participants (source: https://www.armscontrol.org/factsheets/wassenaar)

OPM Breach: Corporate and National Security Adversaries Are One and the Same


On June 5, 1989, images of a lone person standing ground in front of Chinese tanks in Tiananmen Square transfixed the world. On the same day twenty-six years later, the United States government announced one of the biggest breaches of personnel data in history. This breach is already being attributed to China. China has also recently been stepping up its efforts to censor any mention of the Tiananmen Square massacre. The confluence of these two events – censorship of a pivotal human rights incident coupled with the theft of four million USG personnel records – should clarify beyond a doubt China’s intentions and vision for what constitutes appropriate norms in the digital domain. It is time for all of the diverse sectors and industries of the United States – from the financial sector in New York City to the tech industry in Silicon Valley to the government in Washington – to recognize the gravity of this common threat and commit to a legitimate public-private partnership that extends beyond lip service. As the OPM breach demonstrates, the United States government faces the same threats and intellectual property theft as the financial, tech, and other private sector industries. It’s time to move beyond our cultural divisions and unify against the common adversaries who are the true threats to privacy, security, democracy and human rights across the globe.

I attended a “Cyber Risks in the Boardroom” event yesterday in New York City. More often than not, these kinds of cybersecurity conferences will include one panel of private sector experts complaining about government regulations, infringements on privacy, and failure to grasp the competitive disadvantage US companies face thanks to proposed legislation. I have even heard the USG referred to as an “advanced persistent threat.” A government panel generally follows, and bemoans the inability of the private sector to grasp the magnitude of the threat. There is often an anecdote about an unnamed company that refuses government assistance when a breach has been identified, and there’s the obligatory attempt at humor to assuage fears that the government is really not interested in reading your email or tracking your Snapchat conversations.

That did not happen yesterday. The one comment that struck me the most was a call for empathy between the private and public sectors. In fact, at a conference held in the heart of the financial capital of the world, panel after panel reiterated the need for the government and private sector to work together to ensure the United States’ competitive economic advantage. The United States economy and its innovative drive are the bedrock of national security. The financial sector – one of the largest targets of digital theft and espionage – seems to grasp the essential role the government can and should play in safeguarding a level digital playing field. Nonetheless, even in this hospitable environment, cultural and linguistic hurdles, not to mention trust issues, continue to limit cooperation between the financial sector and government.

News of the OPM breach broke just as I was leaving the conference. Many are attributing the breach to China. As someone who lives at the intersection of technology and international affairs, it is impossible to ignore the irony. There continues to be heated debate about US surveillance programs, as well as potentially impending legislation on intrusion software. These debates will not likely end soon, and they are part of the democratic process and freedom of speech that is so often taken for granted. Compare that to China’s expansive censorship and propaganda campaign that not only forces US companies operating in China to censor any mention of Tiananmen Square, but limits any mention of activities that may lead to collective gatherings. Or compare that to China’s 50 Cent Party, a group of individuals paid by the Chinese government to post positive social media content about the government. (Russia has a similar program, which extends internationally, including spreading disinformation on US media outlets.) Perhaps even more timely, China is censoring online discussion of the horrific cruise ship capsizing on the Yangtze River earlier this week. This is very similar to the approach taken following the 2011 train crash, which also led to censorship of any negative media coverage of the government’s response.

The enormous and historic OPM breach, revealed on the 26th anniversary of the Tiananmen Square protests, should cause the disparate industries and sectors that form the bedrock of US national security to pause…and empathize. Combating common adversaries that threaten not only national security, but also freedom of information and speech, requires a united front. The private and public sectors are much stronger working together than apart. Despite significant cultural differences, there are core values that unite the private and public sectors, and it’s time to put aside differences and work as a cohesive unit against US corporate and national security adversaries—for they are truly one and the same. This does not mean that debates about privacy and legislation should subside. On the contrary, those debates should continue, but they must become constructive forms of engagement rather than divisive editorials. Many – especially those in the financial sector – seem to grasp the appropriate role for the government in handling these threats. It’s time to pursue constructive and united private-public sector collaboration to deter the persistent theft of IP and PII at the hands of the adversaries we all face together.

The Digital Domain’s Inconvenient Truth: Norms are Not the Answer


To say the last week has been a worrisome one for any current or former federal government employee is a vast understatement. Now, with this weekend’s revelations that the data stolen in the OPM breach potentially included SF-86 forms as well—the extraordinarily detailed forms required to obtain a security clearance—almost every American is in fact indirectly impacted, whether they realize it or not. As China’s repository of data on United States citizens continues to grow, it’s time for the United States to adjust its foreign digital policy to reflect modern realities. Despite this latest massive digital espionage, the United States continues to pursue a policy based largely on instilling global norms of appropriate behavior in cyberspace, the success of which depends on all actors playing by the same rules. Norms only work when all relevant actors adhere and commit to them, and the OPM breach, as well as other recent breaches by Russia, North Korea, and Iran, confirms that each state is playing by its own playbook for appropriate behavior in the digital domain. The U.S. needs a new approach to digital policy, or else this collective-action problem will continue to plague us for the foreseeable future. Global norms are not the silver bullet that many claim.

The Problem with Norms in a Multi-Polar International System

In recent testimony before Congress, the State Department Coordinator for Cyber Policy, Christopher Painter, outlined the key tenets of US foreign policy in the cyber domain. During this testimony, he highlighted security and cybercrime, with norms as a key approach to tackling those issues. He explicated the following four key tenets (abridged) on which global norms should be based:

1. States cannot conduct online activity that damages critical infrastructure.

2. States cannot prevent CSIRTs from responding to cyber incidents.

3. States should cooperate in investigations of online criminal activity by non-state actors.

4. States should not support the theft of IP information, including that which provides competitive advantage to commercial entities.

While these are all valid pursuits, the OPM breach confirms the age-old tenet that states are self-interested, and therefore quite simply will not adhere to the set of norms the United States seeks to instill. The United States government is not the only one calling for “norms of responsible behavior to achieve global strategic stability”: Microsoft recently released a report entitled International Cybersecurity Norms, while one of the most prominent international relations academics has written about Cultivating International Cyber Norms. Rather than focusing on norms, policy for the digital domain must reflect the economic, political, military and diplomatic realities of international relations. The digital domain should not be viewed as a stove-piped arena for cooperation and conflict across state and non-state actors. For example, the omnipresent tensions in the South China Sea are indicative of China’s larger, cross-domain global strategy. Russian rhetoric and activities in Eastern Europe are similarly a source of great consternation, with digital espionage a key aspect of Russia’s foreign policy behavior. These cross-domain issues spill over into the digital domain and therefore hinder the chance that norms will succeed. The tensions are exacerbated by the completely orthogonal perspectives of many authoritarian regimes on the desired digital end-state, which center on the notion of cyber sovereignty. They are further confounded when these states maintain economic systems predicated on state-owned enterprises, which are essentially extensions of the state, meaning that IP theft directly supports the government and its favored quasi-commercial entities. Finally, the notion of credible commitments is an essential factor in norm diffusion. Because of the surveillance revelations of recent years, other states remain cautious and dubious that the United States will itself adhere to these norms. This lack of trust further undermines the set of norms that the United States is advocating.

Towards a New Approach: Change the Risk Calculus for the Adversary

Instead of a norms-based approach, formal, multi-actor models that calculate the risks and opportunities of actions from an adversary’s perspective could greatly contribute to more creative (and potentially deterrent) policies. Thomas Schelling’s research on bargaining and strategy is emblematic of this approach, expanding on the interdependence and strategic interplay that occur between actors. Mancur Olson’s work on collective action similarly remains especially applicable when pursuing policies that require adherence by all actors within a group. These frameworks account for the preferences of multiple actors in a decision-making process and help identify the probability of preferences across a spectrum of options. Done well, incorporating multi-actor preferences not only provides insights into why some actors pursue policies or activities that seem irrational to others, but also forces the analyst or policymaker to view the range of preferred outcomes from the adversary’s perspective. Multi-actor models advocate a strong understanding of activities that can favorably shift the expected utility and risk calculus of adversaries. The United States has taken some steps in this direction, and it should increasingly rely on policies that raise the costs of a breach for the adversary. For example, the indictment of the five PLA officers last year is a positive signal that digital intrusions will incur punishment. In addition to punitive legal responses targeted at adversaries, greater technical capabilities that hunt adversaries within the network can also raise the cost of an intrusion. If the cost of entry outweighs the benefits, adversaries will be much less likely to attack at will. Until then, attackers will steal information without any fear of retribution or retaliation, and the digital domain will remain anarchic. Finally, instead of focusing on global norms that give the competitive advantage to those who do not cooperate, digital cooperation should be geared toward allies, encouraging the synchronization of similar punitive legislation and responses in light of an attack. In this regard, cooperation can reinforce collective security and focus on enabling the capabilities of allied states, not limiting those capabilities to give adversaries the upper hand.
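To illustrate the logic these models formalize, consider a toy expected-utility sketch in Python (every number below is a hypothetical placeholder, not an empirical estimate):

    # A toy expected-utility sketch of an adversary's decision to attack.
    # All probabilities and payoffs are hypothetical placeholders.

    def expected_utility(p_success, benefit, p_punished, penalty, cost_of_entry):
        """Adversary's expected payoff from attempting an intrusion."""
        return p_success * benefit - p_punished * penalty - cost_of_entry

    # Status quo: cheap entry, near-zero odds of punishment -> attacking pays.
    print(expected_utility(0.9, 100.0, 0.05, 50.0, 5.0))   # 82.5

    # After indictments and in-network hunting raise costs -> attacking does not pay.
    print(expected_utility(0.5, 100.0, 0.4, 200.0, 40.0))  # -70.0

Raising the probability of punishment and the cost of entry, rather than appealing to shared norms, is what flips the sign of that calculation.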

The United States continues to pursue policies that require global support and commitment in order to be effective, rather than focusing on changing the risk calculus for the adversary. The OPM breach—one that affects almost all former and current federal employees and their contacts and colleagues throughout their lives—is evidence that other states play by a different playbook. While the U.S. should continue its efforts to shape the digital domain as one that fosters economic development, transparency, equality and democracy, the reality is that those views are not shared by some of the most powerful states in the global community. Until that inconvenient truth is integrated into policy, states and state-affiliated groups will continue to compile an ever-expanding database of U.S. personnel and trade secrets, which not only impacts national security, but also the economic competitiveness on which that security is built.


Data Science for Security: Using Passive DNS Query Data to Analyze Malware


Most of the time, DNS services—which translate human-friendly, easy-to-remember domain names into numerical IP addresses—are used for legitimate purposes. But they are also heavily used by hackers to route malicious software (or malware) to victim computers and to build botnets that attack targets. In this post, I’ll demonstrate how data science techniques can be applied to passive DNS query data in order to identify and analyze malware.

A botnet is a network of hosts infected with malware that conducts nefarious activities, usually without the awareness of their owners. A command-and-control host hidden in the network communicates with the infected computers to issue instructions and receive results. In such a botnet topology, the command-and-control host becomes a single point of failure: once its IP address is identified, it can easily be blocked and all communication with the botnet is lost. Therefore, hackers are more likely to use a domain name to identify the command-and-control host, and employ techniques like fast flux to rapidly switch the IP addresses mapped to a single domain name.

As data scientists at Endgame, we leverage data sets of great variety and volume to tackle botnets. While the data we analyze daily is often proprietary and confidential, there is a publicly available data set provided by Georgia Tech that documents DNS queries issued by malware from 2011 through 2014. The malware samples were run in a controlled environment with limited Internet access. Each and every domain name query was recorded, and if a domain name was resolved, the corresponding IP address was also recorded.

This malware passive DNS data alone would not provide sufficient information to conduct a fully-fledged botnet analysis, but it does contain rich and valuable insights into malware behavior in terms of DNS queries. I’ll explain how to identify malware based on this data set, using some of the methods the Endgame data science team employs daily.

Graphical Representation of DNS Queries

Here is the data set I’ll examine. Each row is a record of a DNS query, including the date, the MD5 of the malware file, the domain name being queried, and the IP address if the query returned a result.

What approach might enable the grouping of malware or suspicious programs based on the specific domain names they query? Since we have no information about the malware beyond these queries, conventional static analysis focused on investigating binary files would not be helpful here. Clustering with machine learning could work if each domain name were treated as a feature, but the feature space would be very sparse, resulting in expensive computation.

Instead, we can represent the DNS queries as a graph showing which domain names a malware sample queried, as displayed in Figure 1. Each malware program is labeled by an MD5 string. While Figure 1 shows only a very small part of the network, the entire data set can be transformed into one huge graph.

Figure 1. A small DNS query network
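As a minimal sketch of this representation, the graph can be built with the Python networkx package (the records below, including the second MD5, are illustrative placeholders):

    import networkx as nx

    # Hypothetical parsed records: (malware MD5, queried domain name) pairs.
    records = [
        ("0398eff0ced2fe28d93daeec484feea6", "xudunux.info"),
        ("0398eff0ced2fe28d93daeec484feea6", "rylicap.info"),
        ("ffffffffffffffffffffffffffffffff", "rylicap.info"),  # placeholder MD5
    ]

    G = nx.Graph()
    for md5, domain in records:
        G.add_node(md5, kind="malware")
        G.add_node(domain, kind="domain")
        G.add_edge(md5, domain)  # edge means "this malware queried this domain"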

There are numerous advantages to expressing the queries as a graph. First, it expedites querying complex relationships. A modern graph database, such as Neo4j, OrientDB, or TitanDB, can efficiently store a large graph and perform joint queries that are normally computationally expensive for relational databases such as MS SQL Server, Oracle, or MySQL. Second, network analytic methods from a diverse range of scientific fields can be employed to analyze the data set and gain additional insights.

Graph Analysis on the Malware Network

The entire passive DNS data set covers several years, so I randomly picked a day during the collection period and will present the analysis of that reduced data set. A graph was created from one day’s worth of data, with nodes for both domain names and malware MD5 strings. In other words, a node in the graph is either an MD5 string or a domain name, and an edge (or a connection) links an MD5 and a domain if that MD5 queried the domain name. The total number of nodes is 17,629 and the number of edges is 54,939, giving an average of about three connections per node.

In my graph representation of DNS queries, there are two distinct sets of nodes: domain names and malware. A node in one set only connects with nodes in the other set, never with nodes in its own set. Graph theory defines such a network as a bipartite graph, as shown in Figure 2. I wanted to split the graph in two: one graph containing all the domain-name nodes, and the other containing only malware programs. This can be done by projecting the large graph onto each of the two node sets, which creates two graphs. In each projected graph, two nodes are connected by an edge if they share a connection to the same node of the other type. For example, the domains xudunux.info and rylicap.info would be connected in the domain graph because both have connections to the same malware in the larger graph.

Figure 2. Bipartite graph showing two distinct types of nodes
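A sketch of that projection, assuming G is the bipartite graph built above:

    from networkx.algorithms import bipartite

    malware_nodes = {n for n, d in G.nodes(data=True) if d["kind"] == "malware"}
    domain_nodes = set(G) - malware_nodes

    # Two malware nodes are linked if they queried at least one common domain;
    # two domains are linked if at least one malware sample queried them both.
    malware_graph = bipartite.projected_graph(G, malware_nodes)
    domain_graph = bipartite.projected_graph(G, domain_nodes)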

Let’s look at the graph of malware first. For the day 2012-09-29 alone, there are 9,876 unique malware samples recorded in the data set. First, I would like to understand the topological layout of these malware samples and find out how many connected components exist in the malware graph.

A connected component is a subset of nodes in which any two nodes are connected to each other by one or more paths. We can view connected components (or just components) as islands with no bridges connecting them.

The Python programming language has an excellent network analysis package called networkx, which includes a function to compute the number of connected components of a graph. Running that function, number_connected_components, shows there are 2,114 components in the 9,876-node graph, 1,619 of which are single-node components. Eleven components have more than 100 nodes. I will analyze those large components, because the malware inside each may be variants of the same program.
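Continuing the sketch above, with malware_graph as the projected malware graph:

    n_components = nx.number_connected_components(malware_graph)

    # Keep only the large components (more than 100 nodes) for further analysis.
    large = [c for c in nx.connected_components(malware_graph) if len(c) > 100]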

Figure 3 shows four components of the malware graph. The nodes in each component are densely connected to each other but not to any other components. That means the malware assigned to a component clearly possess some similar characteristics that are not shared by the malware from other components. 

Figure 3. Four out of eleven components in the malware graph

Component 1 contains 201 nodes. I computed the betweenness centrality of the nodes in this component, which is zero for every node, while the closeness centrality of every node is one. This indicates that each node has a direct connection to every other node in the component, meaning that each malware sample queried exactly the same domain names as the other malware programs. This is a strong indication that all 201 malware samples are variants of a single type of malicious executable.
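A sketch of that check, where component_1 is a hypothetical variable holding the 201-node subgraph (here taken from the large components computed above):

    component_1 = malware_graph.subgraph(large[0])  # assumed to be the 201-node set

    betweenness = nx.betweenness_centrality(component_1)
    closeness = nx.closeness_centrality(component_1)

    # All-zero betweenness plus all-one closeness means the component is a
    # clique: every sample queried exactly the same set of domain names.
    is_clique = (all(v == 0 for v in betweenness.values())
                 and all(v == 1 for v in closeness.values()))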

Let’s return to the large DNS query graph to find out what domains the malware targeted. Using a graph database like Neo4j or OrientDB, or a graph analytic tool like networkx, the search is easy. The result shows that the malware in component 1 were only interested in three domain names: ns1.musicmixa.net, ns1.musicmixa.org, and ns1.musiczipz.com.

I queried VirusTotal for each of the 201 malware samples in component 1. VirusTotal submits the MD5 to a list of scanning engines and returns the reports from those engines. A report includes each engine’s determination of the MD5 as either positive or negative. If it’s positive, the report provides more information about what kind of malware the MD5 represents, based on the signature that the scanning engine uses.

I assigned a score to each malware sample by computing the ratio of positive results to total results. The distribution of the scores is shown in Figure 4. The scanning reports imply that the malware is a Win32 Trojan.

Figure 4. Histogram of VirusTotal score of malware in Component 1

Using Social Network Analytics to Understand Unknowns

Looking at each of the components, not all of them show as high a level of homophily as component 1 does. One component has 2,722 malware nodes and 681,060 edges. Of the 2,722 malware samples in this component, 309 were unknown to VirusTotal, while the remaining 2,413 had reports on the website. We need a way to analyze those unknown malware samples.

Social network analytic (SNA) methods provide insights into unknown malware by identifying known malware that are similar to the unknowns. The first step is to try to break the large component into communities. The concept of community is easy to understand in the context of a social network: connections within a community are usually much denser than connections across communities, and members of a community tend to share some common trait, such as mission, geo-location, or profession. In this analysis, two malware samples were connected if they queried the same domain, which can be interpreted as a common interest in that domain name. We can therefore expect malware programs that queried similar domains to form a community. Communities exist inside a connected component, and differ from components in that communities still have connections between each other.

Community detection is a particular kind of data clustering within the domain of machine learning, and there is a wide variety of community detection methods for graphs. The Louvain method is a well-known, high-performing one that optimizes the measure of modularity by partitioning a graph into groups of densely connected nodes. Applying the Louvain method to the big component of 2,722 nodes identifies 15 communities, with the number of nodes in each community shown in Figure 5.

Figure 5. Number of nodes in each community
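A sketch of that step, assuming the python-louvain package and a hypothetical big_component subgraph holding the 2,722-node component:

    import community  # the python-louvain package
    from collections import Counter

    partition = community.best_partition(big_component)  # node -> community id
    community_sizes = Counter(partition.values())        # 15 communities here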

Let’s take a specific malware sample as an example. The MD5 of this malware is 0398eff0ced2fe28d93daeec484feea6, and a search for it on VirusTotal returned no result, as shown in Figure 6.

Figure 6. Malware not found on VirusTotal

I want to know which malware programs have the most similar behavior, in terms of DNS queries, to this unknown malware. By looking into similar malware that we do have knowledge about, we can gain insights into the unknown one.

I found malware 0398eff0ced2fe28d93daeec484feea6 in community 4, which contains 256 malware samples. To find the most similar malware programs, we need a quantitative definition of similarity. I chose the Jaccard index to compute just how similar two sets of queried domains are.

Suppose malware M1 queried a set of domains D1, and malware M2 queried another set of domains D2. The Jaccard index of sets D1 and D2 is calculated as:

J(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|

The Jaccard index goes from 0 to 1, with 1 indicating an exact match.
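A short Python function makes this concrete:

    def jaccard_index(d1, d2):
        """Jaccard similarity between two sets of queried domain names."""
        if not d1 and not d2:
            return 0.0
        return len(d1 & d2) / float(len(d1 | d2))

    jaccard_index({"a.com", "b.net"}, {"a.com", "b.net"})  # 1.0: identical queries
    jaccard_index({"a.com", "b.net"}, {"a.com", "c.org"})  # ~0.33: one shared domain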

Out of the 2,722 nodes in this component, 100 malware programs have exactly the same domain queries as malware 0398eff, meaning their Jaccard indices against malware 0398eff are 1. However, only 9 of them are known to VirusTotal. Those 9 malware samples are shown below.

Each of the 100 malware programs with the same domain queries as malware 0398eff, including the 9 known ones, appears in community 4. The histogram of Jaccard indices for community 4 is shown in Figure 7.

Figure 7. Histogram of Jaccard index for nodes in community 4

We can tell from the histogram that the malware programs in community 4 can be split into two sets. One set contains the 100 malware samples that have exactly the same domain queries as malware 0398eff, and the other contains nodes that are much less similar to it. The graph visualization in Figure 8 demonstrates the split. Through this analysis, we have found that the 91 previously unknown malware samples behave similarly to known malware.

This blog post demonstrates how I used DNS query data to conduct graph-based malware analysis. A similar analysis can be done on the domain names to identify groups of domains that tend to be queried together by a malware program, which can help identify potentially malicious domains that were previously unknown.

Given the vast quantities of data those of us in the security world handle daily, data science techniques are an increasingly efficient and informative way to identify malware and targeted domains. While machine learning and clustering tend to dominate these kinds of analyses, social-network-based graph methods should increasingly become another tool in the data science toolbox for malware detection. Through the identification of communities, betweenness, and similarity scores, network analysis reveals not only connectivity, but also logical groupings and outliers within the network. Viewing malware and domains as a network provides another, more intuitive approach to wrangling the big data security environment. Given the limited features available in the passive DNS query data, graph analytic approaches supplement traditional static and dynamic approaches and elevate capabilities in malware analytics.

Meet Endgame at Black Hat 2015


 

Endgame will be at Black Hat!

Stop by Booth #1215 to:

 

GET AN ENDGAME ENTERPRISE DEMO

Sign up here for a private demo to learn how we help customers automate the hunt for cyber adversaries.
 

MEET WITH ENDGAME EXPERTS

Meet our experts and learn more about threat detection and data science. Check out the Endgame blog to read the latest news, trends, and research from our experts before you go.
 

EVERYONE NEEDS A SMART WATCH!

Enter to win an Apple or LG smart watch. Stop by the booth Wednesday, August 5 or Thursday, August 6 for a chance to win. We'll announce each day's winner on Twitter at 5pm PT.

Examining Malware with Python


Before I came to Endgame, I had participated in a couple of data science competitions hosted by Kaggle. I didn’t treat them as competitions so much as learning opportunities. Like most things in the data science community, these competitions felt very new. But now that I work for a security company, I’ve learned about the long history of CTF competitions meant to test and add to a security researcher’s skills. When the Microsoft Malware Challenge came along, I thought this would be a great opportunity to learn about new ways of applying machine learning to better understand malware. Also, as I’ve talked about before, the lack of open and labeled datasets is a huge obstacle to developing machine learning models to solve security problems. Here was an opportunity to work with an already prepared large labeled dataset of malware samples.

I gave a talk at the SciPy conference this year that describes how I used the scientific computing tools available in Python to participate in the competition. You can check out my slides or watch the video from that talk here. I tried to drive home two main points in this talk: first, that Python tools for text classification can be easily adopted for malware classification, and second, that details of your disassembler and analysis passes are very important for generalizing any results. I’ll summarize those points here, but take a look at the video and slides for more details and code snippets.

My final solution to the classification challenge was mainly based on counting combinations of bytes and instructions called ngrams. This method counts the frequency with which a byte or an instruction occurs in a malware sample. When n is greater than one, I count the frequency of combinations of two, three, or more bytes or instructions. Because the number of possible combinations climbs very quickly, a hashing vectorizer must be used to keep the size of the feature space manageable.
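The snippet below is a simplified sketch of that idea, not the exact competition code; byte_strings is a hypothetical list in which each sample is rendered as space-separated hex bytes (e.g., "4d 5a 90 00 ..."):

    from sklearn.feature_extraction.text import HashingVectorizer

    # Hash 1grams and 2grams of bytes into a fixed-width feature space.
    vectorizer = HashingVectorizer(analyzer="word",
                                   ngram_range=(1, 2),
                                   n_features=2 ** 20)
    X = vectorizer.transform(byte_strings)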

Figure 1: Example byte 2grams from the binglide documentation

 

Figure 2: Byte 2grams from a malware sample included in the competition

At first, I was only using byte ngrams, and I was very surprised that feeding these simple features to a model could provide such good classifications. In order to explore this, I used binglide to better understand what the bytes inside an executable look like. Figures 1 and 2 show the results of this exploration. Figure 1 shows example output from binglide’s documentation, and Figure 2 shows the output when I ran the tool on a sample from the competition. In all the images, the entropy of the binary is displayed in the strip on the left and a histogram of the 2gram frequencies is shown on the right. In the frequency histogram, each axis spans the 256 possible values of a byte, and a pixel turns blue as that combination of bytes occurs more frequently.

You can see that the first 2gram pattern in Figure 2 generally looks like the first 2gram pattern in Figure 1. The .text section is usually used for executable code so this match to example x86 code is reassuring. The second 2gram pattern in Figure 2 is very distinctive and doesn’t really match any of the examples from the binglide documentation. Machine learning algorithms are well suited to picking out unique patterns like this if they are reused throughout a class. Finding this gave me more confidence that the classification potential of the byte ngram features was real and not due to any mistake on my part.

I also used instruction ngrams in a similar way. In this case, instructions refer to the first part of the assembly code after it’s been disassembled from the machine code. I wrote some Python code to extract the instructions from the IDA disassembly files that were provided by those running the competition. Again, feature hashing was necessary to restrain the size of the feature space. To me, it’s very easy to see why instruction ngrams could provide good classifications. Developing software is hard, and malware authors are going to want to reuse code in order to not waste effort. That repeated code should produce similar patterns in the instruction ngram space across families of malware.

Using machine learning algorithms to classify text is a mature field with existing software tools. Building word and character ngrams from text is a very similar problem to building the byte and instruction ngrams I was interested in. In the slides from my SciPy talk, I show some snippets of code where I adapted the existing text classification tools in the scikit-learn library to the task of malware classification: a couple of different vectorizers, pipelines for cross-validating multiple steps together, and a variety of models to try out.
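A simplified sketch of that adaptation (labels and byte_strings are hypothetical stand-ins for the competition data):

    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    clf = Pipeline([
        ("ngrams", vectorizer),          # the hashing vectorizer sketched above
        ("model", LogisticRegression()),
    ])

    # The competition was judged on multi-class log loss.
    scores = cross_val_score(clf, byte_strings, labels,
                             cv=5, scoring="neg_log_loss")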

All throughout this process, I was aware that the disassembly provided in the competition would not be available in a large, distributed malware processing engine. IDA Pro is the leading program for reverse engineering and disassembling binaries. It is also restrictively licensed and intended to be run interactively. I’m more interested in extracting features from disassembly automatically, in batch, and in using a statistical model to provide some insight into the files. I spent a lot of time during and after the competition searching for open source tools that could automatically generate the disassembly provided by the competition.

I found Capstone to be a very easy-to-use open source disassembler. I used it to generate instruction ngrams and compared the classification performance of models based on those ngrams against the same models based on IDA instructions. Both performed well, with very few misclassifications. The competition was judged on a multi-class logarithmic loss metric, though, and that metric was always better when using the IDA instructions.
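A minimal sketch of generating instruction tokens with Capstone (the file path and load address are placeholders):

    from capstone import Cs, CS_ARCH_X86, CS_MODE_32

    md = Cs(CS_ARCH_X86, CS_MODE_32)
    with open("sample.bin", "rb") as f:   # placeholder path
        code = f.read()

    # One linear sweep over the buffer: no entry-point discovery or
    # code/data separation, unlike IDA's analysis passes.
    instructions = [insn.mnemonic for insn in md.disasm(code, 0x1000)]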

After talking to some security experts at Endgame, I’ve learned that this could be due to the analysis passes that IDA performs before disassembling. Capstone will just execute one sweep over the binary and disassemble anything it finds as it goes. IDA will decode the binary more intelligently, looking for entry points, where functions and subroutines begin and end, and which sections actually contain code, data, or imports. I related this to my machine learning experience by viewing IDA’s disassembly as a more intelligent feature engineering pipeline. The result is that I’m still working on finding or building the best-performing distributable disassembler.

This Kaggle competition was a great example of how data science can be applied to solve specific security problems. Data science has been described as a combination of skills in software, math, and statistics, along with domain expertise. While I didn’t have the domain expertise when I first joined Endgame, working closely with our security experts has expanded my breadth of knowledge while giving me a new opportunity to explore how data science techniques can be used to solve security challenges.

Why We Need More Cultural Entrepreneurs in Security & Tech


Recently, #RealDiversityNumbers provided another venue for those in the tech community to vent and commiserate over the widely publicized lack of diversity within the industry. The hashtag started trending and gained some media attention. This occurred as Twitter came under fire for organizing a frat-themed party, while also facing a gender inequality claim. Unfortunately, as dire as the diversity situation is in the tech sector writ large, it pales in comparison to the statistics on diversity in the security sector. The security community not only faces a pipeline shortage, but it has also made almost no progress in actively attracting a diverse workforce. The tectonic shifts required to achieve true diversity in the security sector also mean a fundamental shift in the tech culture must take place. However, while companies such as Pinterest have publicly noted their commitment to diversity, very little has changed from top-down approaches to diversification in the tech community. Certainly internal policies and recruiting practices matter, and leadership support is essential. These are the core enablers, but are not sufficient for institutionalizing cultural change. Instead, cultural entrepreneurs reflecting technical expertise across an organization must lead a grassroots movement to truly institutionalize cultural change within organizations and across the tech community. All of us must move beyond our comfort zones of research, writing and coding and truly take ownership of organizational culture.

Given the competition for talent in the security industry, an organization’s culture (ceteris paribus) often proves to be the determining factor that fosters, attracts, and retains a highly skilled and diversified workforce. Because an organization cannot engineer its way toward an innovative, inclusive culture or simply throw money at the issue, this problem can be perplexing to tech-focused industries. As anyone who has even briefly studied cultural approaches knows, culture is very sticky and entails a concerted and persistent effort to achieve the desired effects. It requires a paradigm shift much in the same way Kuhn, Lakatos and Popper all approached the various avenues toward scientific progress. The good news – if there is any – is that many of the cultural shifts required to foster a driven, innovative and (yes!) inclusive work environment do not cost a lot of money. Similar to the role of policy entrepreneurs in pushing forth new ideas in the public sector, cultural entrepreneurs are key individuals who can use their technical credibility to push forth ideas and promote solutions for any cultural challenges they identify or experience. By serving as a gateway between various aspects of an organization, cultural entrepreneurs can move an organization and ideally the industry beyond a “brogramming” mentality and reputation. Cultural entrepreneurs must reflect technical expertise across a diverse range of skills and demographics in order to legitimately encourage diversity and innovation. This enables the credible organic shifts from below that foment cultural change.

Cultural entrepreneurs are required to ensure an organization’s culture is inclusive and purpose-driven, instead of perpetuating the status quo. In this regard, diversity is a key aspect of this cultural shift. Diversity provides an innovation advantage and positively impacts the bottom line. Many in the tech community are starting to realize this, with companies like Intel investing $300 million in diversity, and CEOs lamenting that they wished they had built diversity into their culture from the start. Admitting that the problem exists is an important step, but this rhetoric has yet to translate into a more diversified workforce. A concerted effort by major tech companies to address diversity resulted in at most a 1% increase in gender diversity and an even smaller increase in ethnic diversity. Cultural entrepreneurs, and their ability to foster grassroots cultural shifts, may be the missing link in many of these cultural and diversity initiatives.  

Cultural entrepreneurs across an organization can make a significant impact with minimal work or cost by focusing on both internal and external cultural aspects of an organization. First, there is a large literature on how cross-cutting links (think social network analysis) develop social capital, which in turn has a positive impact on civic engagement and economic success. A recent Gallup Poll reinforces just how hard it is to foster social capital, with results confirming that over 70% of the American workforce does not feel engaged. Many organizations know this, but unfortunately fail at implementation by opting for social activities that reinforce exclusivity or feel contrived or overly corporate. Events ranging from frat-themed parties to cruise boats with concubines clearly do little to attract a diverse workforce. Cultural entrepreneurs can encourage or informally organize inclusive activities – such as sports, team outings, or discussion boards – within and across departments to increase engagement. While these kinds of social activities may seem superfluous to the bottom line, they can positively impact retention, workforce engagement, and inclusivity by building cross-cutting social networks. The kinds of social activities certainly should vary depending on an organization, but they must appeal to multiple segments of the workforce to foster social capital instead of reinforcing stereotypes and stovepipes within organizations. However, with everyone’s heads to keyboard all day every day, technical cultural entrepreneurs rarely emerge, hindering the development of social capital.

Second, perception is reality, and cultural entrepreneurs can help shift external perceptions of the industry. A quick search of Google images for “hacker” reveals endless images of male figures in hoodies working in dark, nefarious environments.  The media perpetuates this with similar images every time a new high profile breach occurs. It’s not just a media problem. It is also perpetuated within the industry itself. A recent analysis of the RSA conference guide showed shockingly little diversity.  The study notes that “women are absent” and “people of colour are totally absent.” While it adequately reflects the reality of the security industry, it makes those of us currently in the security community feel more out of place if we don’t fit that profile, while also deterring anyone not fitting those profiles from entering the field.  Let’s hope the upcoming Black Hat and Def Con conferences are more inclusive, with a broader representation of gender, race and appearance, but I wouldn’t bet on it. It’s up to cultural entrepreneurs to continue to press their organizations and the industry to help shift the perception of the security community away from nefarious loners and toward one with a universal mission that requires a diverse range of skillsets and backgrounds. Providing internal and external thought leadership through blogs, presentations and marketing can go a long way toward helping reality reflect the growing rhetoric advocating for diversity.

The security industry, which mirrors the diversity problems in the tech industry writ large, would benefit from a cultural approach to workforce engagement and inclusivity. All of the amenities in the world are not enough to overcome the tech industry’s cultural problems that not only persist, but that are also much more exclusive than they were two decades ago. In creative industries, cultural entrepreneurs are essential to fostering the social capital and intrinsic satisfaction that emerges from an inclusive and innovative culture. At Endgame, this is something that we think about daily and always seek to improve. We benefit from leadership that supports and understands the role of culture, while also letting us grow that culture organically. This organic growth relies on technical leaders across the company working together and pushing both the technical and cultural envelopes. This combination of technical mastery and a collaborative, driven culture provides the foundation on which we will continue to foment inclusivity while disrupting an industry which for too long has relied on outdated solutions to modern technical and workforce challenges.

Sprint Defaults and the Jeep Hack: Could Basic Network Settings Have Prevented the Industry Uproar?


In mid-July, research into the security of a Jeep Cherokee was disclosed through a Wired article and subsequent Black Hat presentation. The researchers, Charlie Miller and Chris Valasek, found an exploitable vulnerability in the Uconnect entertainment system that operates over the Sprint cellular network. The vulnerability was serious enough to prompt a 1.4 million-vehicle recall from Chrysler.

In the Wired article, Miller and Valasek describe two important aspects of the vulnerability. First, they can target their exploit against a specific vehicle: “anyone who knows the car’s IP address can gain access from anywhere in the country,” and second, they can scan the network for vulnerable vehicles including a Dodge Ram, Jeep Cherokee, and a Dodge Durango. Both of these capabilities, to scan and target remotely through the cellular network, are necessary in order to trigger the exploit against a target vehicle.

While it’s really scary to think that a hacker anywhere in the country can drive your car off the road with the push of a button, the good news is that the cellular network has safeguards in place to prevent remote interaction with phones and devices like Uconnect. For some inexplicable reason, Sprint disabled these safeguards and left the door wide open for the possibility of remote exploitation against the Uconnect cars. Had Sprint not disabled these safeguards, the Uconnect vulnerability would have just been another of several that require physical access to exploit, and it may not have prompted an immediate recall.

The Gateway

Cellular networks are firewalled at the edge (Figure 1). GSM, CDMA and LTE networks are all architected a little differently, but each contains one of the following Internet gateways:

  • CDMA: Packet Data Serving Node (PDSN) (Verizon and Sprint)
  • GSM: Gateway GPRS Support Node (GGSN) (T-Mobile and AT&T)
  • LTE: the gateway’s responsibilities are absorbed into multiple components of the System Architecture Evolution (SAE). All major US telcos operate LTE networks.

Figure 1: Network layout

To keep things simple and generic, we’ll just call this component “the gateway.” Network connections only originate in one direction: outbound. You can think of the core of your cellular network as a big firewalled LAN, and it is not possible to gain access to a phone from outside the phone network (Figure 2).

Figure 2: The attacker is blocked from outside the core network.

Miller was able to operate behind this firewall by tethering his laptop to a burner phone that was on the Sprint network (Figure 3).

But by default, phones are blocked from seeing each other as well. So even if the attacker knows the IP address of another phone on the network, the network won’t allow her to make a data connection to that phone (Figure 4). The network enforces this through what are called Access Point Names (APNs).

Figure 3: Device-to-device was enabled for the car’s APN, enabling remote exploitation. Why?

Figure 4: Default configuration, device-to-device connections disabled. The attacker cannot access the target device from inside the firewall.

When a phone on the network needs to make a data connection, it provides an APN to the network. If you want to view the APN settings on your personal phone, you can follow these instructions for iPhone or Android. The network gateway uses the APN to determine how to allow your phone to connect to the Internet. There are hundreds of APNs in every network, and your carrier uses APNs to organize how different devices allocate data for billing purposes. In the case of Uconnect, all Uconnect devices operate on the Sprint network and use their own private APN. APNs are really useful for third parties, like Uconnect, that sell a service running over a cellular network. So that each Uconnect user doesn’t need to maintain a line of service with Sprint, Uconnect is responsible for the data connection, and end users pay Uconnect for service, which runs through a private APN set up for Uconnect.

APNs are used extensively to isolate private networks for machine-to-machine systems like smart road signs and home alarm systems. If you’ve ever bought a soda from a vending machine with a credit card, the back end connection was using a private APN. 

Vulnerabilities caused by misconfigured APNs are not new; the APN of the bike-sharing system in Madrid was hacked just last summer. These bike-sharing systems need device-to-device access because technicians perform maintenance on these machines via remote desktop.  

Aftermath

There is no obvious reason for Uconnect to need remote administration. Why then are device-to-device connections allowed for the Uconnect APN, especially since it opens the door to a remote access exploit?  We will probably never know, because six days after the Wired story was published, Miller tweeted that Sprint had blocked phone-to-car traffic as well as car-to-car traffic. What this really means is that Sprint disabled internal traffic for the Uconnect APN. The remote access vector was closed.

The fact that Sprint made this change so quickly suggests that device-to-device traffic was not necessary in the first place, which leads us to two conclusions: 1) Had Sprint simply left device-to-device traffic disabled, the Jeep incident would have required physical access and would not have been any more of a story than the 2013 Ford Escape story, and 2) More seriously, if the story hadn’t attracted mainstream media attention, Chrysler might not have taken the underlying vulnerability as seriously, and the fix would have rolled out much later, if ever.

Security shouldn’t be a function of the drama circus that surrounds it.

 

Firewall icon created by Yazmin Alanis from the Noun Project
Pirate Phone icon created by Adriana Danaila from the Noun Project
Pickup truck icon created by Jamie M. Laurel from the Noun Project

Black Hat 2015 Analysis: The Need for Global Thinking and Participation in the Security Community


This year’s Black Hat broke records yet again with the highest levels of attendance, including the highest number of countries represented and, judging by the size of the business hall, the most companies represented as well. While it featured some truly novel technical methods and the advanced security research for which it is so well known, this year’s conference, even more than others, reflected an institutionalization of the status quo within the security industry. Rather than reflecting the major paradigm shifts occurring in the security community, it seemed to perpetuate the insularity for which this community is so often criticized.

In her Black Hat keynote speech, Jennifer Granick, lawyer and Director of Civil Liberties at Stanford University, noted that inclusion is at the heart of the hacker’s ethos and called for the security community to take the lead and push forth change within the broader tech sector. She explicitly encouraged the security community to refrain from being so insular, and to transform into a community that not only thinks globally but is also much more participatory in the policies and laws that directly affect it. While she focused on diversity and equality, there are several additional areas where the security community could greatly benefit from a more expansive mindset. Unfortunately, these strategic-level discussions were largely absent from the majority of the Black Hat briefings that followed the keynote. The tactical, technical presentations understandably comprise the majority of the dialogue and garner the most attention. However, given the growing size and expanding representation of disparate parts of the community, there was a noticeable absence of nuanced discussion about the state of the security community, including broader thinking about three big strategic issues and trends that will define the community for the foreseeable future:

  • Where’s the threat? Despite a highly dynamic threat landscape, ranging from foreign governments to terrorist organizations to transnational criminal networks, discussion of these threat actors was embarrassingly absent from the panels this year. Although the security community is often criticized for over-hyping the threat, this was not the case at this year’s Black Hat. Even worse, the majority of discussions of the threat focused on the United States and Western European countries as the greatest security threats. Clearly, technology conferences must focus on the latest technological approaches and trends in the field. However, omitting the international actors and context in which these technologies exist perpetuates an inward-facing bias of the field that leads many to misunderstand the nature, capabilities and magnitude of the greatest threats to corporate and national security.
  • Toward détente? Last year’s Black Hat conference was still reeling from the Snowden revelations that shook the security community. A general feeling of distrust of the U.S. government was still apparent in numerous panels, heightening interest in privacy and circular discussions over surveillance. While sentiments of distrust still exist, this no longer appears to be the only perspective. In a few briefings, there was a surprising lack of the hostility toward the government that existed at similar panels a year ago. In fact, the very few panels that had government representation were not only well attended, but also contained civil discourse between the speakers and the audience. This does not mean that there were softball questions. On the contrary, there was blunt conversation about the "trust deficit" between the security community and the government. For instance, the biggest concern expressed regarding data sharing with the government (including the information sharing bill which Congress discussed last week but which is now delayed) was not about information sharing itself, but rather how the security community can trust that the government can protect the shared data in light of OPM and other high-profile breaches. This is a very valid concern and one that ignited a lot of bilateral dialogue. Organizations from the DHS to the Federal Trade Commission requested greater partnerships with the security community. While there are certainly enormous challenges ahead, it was refreshing to see not only signs of a potential thawing of relations between the government and the security community, but also hopefully some baby steps toward mutually beneficial collaboration.
  • Diversity. The general lack of diversity at the conference comes as no surprise given the well-publicized statistics of the demographics of the security community, as well as the #ilooklikeanengineer campaign that took off last week. However, diversity is not just about gender – it also pertains to diversity of perspectives, backgrounds and industries. Areas such as human factors, policy and data science seemed to be less represented than in previous years, conflicting with much of the rhetoric that permeated the business hall. In many of the talks that did cover these areas, there were both implicit and explicit requests for a more expansive partnership and role within the community.

Given the vast technological, geopolitical and demographic shifts underway, the security community must transform beyond the traditional mindset and truly begin to think beyond the insular perimeter. Returning to Granick’s key points, the security community can consciously provide leadership not only in shaping the political discourse that impacts the entire tech community, but also lead by example through promoting equality and thinking globally. The security community must play a participatory role in the larger strategic shifts that will continue to impact it instead of remaining an insularly focused island in the desert.

NLP for Security: Malicious Language Processing


Natural Language Processing (NLP) is a diverse field in computer science dedicated to automatically parsing and processing human language. NLP has been used to perform authorship attribution and sentiment analysis, as well as being a core function of IBM’s Watson and Apple’s Siri. NLP research is thriving due to the massive amounts of diverse text sources (e.g., Twitter and Wikipedia) and multiple disciplines using text analytics to derive insights. However, NLP can be used for more than human language processing and can be applied to any written text. Data scientists at Endgame apply NLP to security by building upon advanced NLP techniques to better identify and understand malicious code, moving toward an NLP methodology specifically designed for malware analysis—a Malicious Language Processing framework. The goal of this Malicious Language Processing framework is to operationalize NLP to address one of the security domain’s most challenging big data problems by automating and expediting the identification of malicious code hidden within benign code.

How is NLP used in InfoSec?

Before we delve into how Endgame leverages NLP, let’s explore a few different ways others have used it to tackle information security problems:

  • Domain Generation Algorithm classification – Using NLP to distinguish malicious domains (e.g., blbwpvcyztrepfue.ru) from benign domains (e.g., cnn.com)
  • Source Code Vulnerability Analysis – Determining function patterns associated with known vulnerabilities, then using NLP to identify other potentially vulnerable code segments.
  • Phishing Identification – A bag-of-words model determines the probability that an email message contains a phishing attempt.
  • Malware Family Analysis – Topic modeling techniques assign samples of malware to families, as discussed in my colleague Phil Roth’s previous blog.

Over the rest of this post, I’ll discuss how Endgame data scientists are using Malicious Language Processing to discover malicious language hidden within benign code. 

Data Acquisition/Corpus Building

In order to perform NLP you must have a corpus, or collection of documents. While this is relatively straightforward in traditional NLP (e.g., APIs and web scraping), it is not necessarily so in malware analysis. There are two primary techniques used to get data from malicious binaries: static and dynamic analysis.

Fig 1. Disassembled source code

 

Static analysis, also called source code analysis, is performed using a disassembler, which provides output similar to the above (Fig 1). The disassembler presents a flat view of a binary; structurally, however, we lose important contextual information because the logical order of instructions is not clearly delineated. In disassembly, jmp or call instructions should lead to different blocks of code, which a standard flat file misrepresents. Luckily, static analysis tools exist that provide call graphs conveying the logical flow of instructions via a directed graph, like this and this.

Dynamic analysis, often called behavioral analysis, is the collection of metadata from an executed binary in a sandbox environment. Dynamic analysis can provide data such as network access, registry/file activity, and API function monitoring. While dynamic analysis is often more informative, it is also more resource intensive, requiring a suite of collection tools and a sandboxed virtual environment. Alternatively, static analysis can be automated to generate disassembly over a large set of binaries generating a corpus ready for the NLP pipeline. At Endgame we have engineered a hybrid approach that automates the analysis of malicious binaries providing data scientists with metadata from both static and dynamic analysis.

Lexical Parsing

Lexical parsing is paramount to the NLP process as it provides the ability to turn large bodies of text into individual tokens. The goal of Malicious Language Processing is to parse a binary the same way an NLP researcher would parse a document:

To generate the “words” in this process we must perform a few traditional NLP techniques. First is tokenization, the process of breaking down a string of text into meaningful segments called tokens. Tokens can be generated by segmenting on whitespace, newline characters, punctuation, or regular expressions (Fig 2).

Fig 2. Tokenized disassembly
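As a simple illustrative sketch, a regular expression can tokenize a line of disassembly on whitespace, commas, and brackets:

    import re

    line = "mov     eax, dword ptr [ebp+8]"

    # Split on whitespace, commas, brackets, and '+', dropping empty strings.
    tokens = [t for t in re.split(r"[\s,\[\]+]+", line) if t]
    # ['mov', 'eax', 'dword', 'ptr', 'ebp', '8']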

The next step in the lexical parsing process is to merge families of derivationally related words with similar meaning or text normalization. The two forms of this process are called stemming and lemmatization.

Stemming seeks to reduce a word to its functional stem. In malware analysis, for example, this could reduce SetWindowTextA or SetWindowTextW to SetWindowText (Windows API), or JE, JLE, and JNZ to JMP (x86 instructions), accounting for multiple variations of essentially the same function.

Lemmatization is generally more difficult because it requires context, or the part-of-speech tag of a word (e.g., noun, verb, etc.). In English, the word “better” has “good” as its lemma. In malware we do not yet have the luxury of part-of-speech tagging, so lemmatization is not yet applicable. However, a rules-based dictionary that associates C runtime functions with their Windows API equivalents provides a step toward lemmatization, such as mapping _fread to ReadFile or _popen to CreateProcess.
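A toy sketch of both ideas (the rules and dictionary entries are illustrative, not a complete mapping):

    # Stemming: collapse conditional jumps and strip the ANSI/Unicode
    # suffix from Windows API names.
    CONDITIONAL_JUMPS = {"je", "jne", "jle", "jnz", "jg"}  # abridged list

    def stem(token):
        if token.lower() in CONDITIONAL_JUMPS:
            return "jmp"                      # JE, JLE, JNZ, ... -> JMP
        if token.endswith(("A", "W")) and len(token) > 1:
            return token[:-1]                 # SetWindowTextA -> SetWindowText
        return token

    # "Lemmatization": a rules-based dictionary of C runtime equivalents.
    LEMMAS = {"_fread": "ReadFile", "_popen": "CreateProcess"}

    def lemmatize(token):
        return LEMMAS.get(token, token)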

Semantic Networks

Semantic or associative networks represent the co-occurrence of words within a body of text in order to capture the semantic relationships between words. For each unique word in a corpus, a node is created in a directed graph. Links between words carry weights based on the frequency with which the two words co-occur. The resulting graph can then be clustered to derive cliques or communities of functions that exhibit similar behavior.

A malicious language semantic network could aid in the generation of a lexical database for malware similar to WordNet, the lexical database of English nouns, verbs, and adjectives grouped into sets of cognitive synonyms. Endgame data scientists are in the incipient stages of exploring ways to search for and identify synonyms or synsets of malicious functions. Additionally, we hope to leverage our version of WordNet in developing lemmatization and Parts-of-Speech tagging within the Malicious Language Processing framework.
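A sketch of building such a co-occurrence network, where windows is a hypothetical list of token lists (e.g., the API calls within each subroutine); for simplicity this version uses an undirected graph, since plain co-occurrence is symmetric:

    from itertools import combinations
    import networkx as nx

    sem_net = nx.Graph()
    for window in windows:
        # Weight each pair of co-occurring functions by frequency.
        for a, b in combinations(set(window), 2):
            if sem_net.has_edge(a, b):
                sem_net[a][b]["weight"] += 1
            else:
                sem_net.add_edge(a, b, weight=1)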

Parts-of-Speech Tagging

A Parts-of-Speech (POS) tagger is a piece of software capable of tagging a list of tokens in a string of text with the correct language annotation, such as noun, verb, etc. POS tagging is crucial for gaining a better understanding of text and establishing semantic relationships within a corpus. As noted above, there is currently no representation of POS tagging for malware. Source code may be too abstract to break down into nouns, prepositions, or adjectives. However, it is possible to treat subroutines as “sentences” and gain an understanding of functions used as subjects, verbs, and predicates. Using pseudo code for a process injection in Windows, for example, would yield the following from a Malicious Language Processing POS tagger:

Closing Thoughts

While the majority of the concepts mentioned in this post are being leveraged by Endgame today to better understand malware behavior, there is still plenty of work to be done. The concept of Malicious Language Processing is still in its infancy. We are currently working hard to expand the Malicious Language Processing framework by developing a malicious stop word list (a list of the most common words/functions in a corpus of binaries) and creating an anomaly detector capable of determining which function(s) do not belong in a benign block of code. With more research and larger, more diverse corpuses, we will be able to understand the behavior and basic capabilities of a suspicious binary without executing or having a human reverse engineer it. We view NLP as an additional tool in a data scientist’s toolkit, and a powerful means by which we can apply data science to security problems, quickly parsing the malicious from the benign.


Hunting for Honeypot Attackers: A Data Scientist’s Adventure


The U.S. Office of Personnel Management (known as OPM) won the “Most Epic Fail” award at the 2015 Black Hat Conference for the worst known data breach in U.S. government history, with more than 22 million employee profiles compromised. Joining OPM as contenders for this award were other victims of high-profile cyber attacks, including Poland's Plus Bank and the website AshleyMadison.com. The truth is, hardly a day goes by without news of cyber intrusions. As an example, according to databreachtoday.com, just in recent months PNI Digital Media and many retailers such as Wal-Mart and Rite-Aid had their photo services compromised, UCLA Health’s network was breached, and information of 4.5 million people may have been exposed. Criminals and nation-state actors break into systems for many reasons with catastrophic and often irremediable consequences for the victims.

Traditionally, security experts are the main force for investigating cyber threats and breaches. Their expertise in computers and network communication provides them with an advantage in identifying suspicious activities. However, with more data being collected, not only in quantity but also in variety, data scientists are beginning to play a more significant role in the adventure of hunting malicious attackers. At Endgame, the data scientist team works closely with the security and malware experts to monitor, track and identify cyber threats, and applies a wide range of data science tools to provide our customers with intelligence and insights. In this post, I’ll explain how we analyze attack data collected from a honeypot network, which provides insight into the locations of attackers behind those activities. The analysis captures those organized attacks from a vast amount of seemingly separated attempts.

This post is divided into three sections. The first section describes the context of the analysis and provides an overview of the hacking activities. The second section focuses on investigating the files that the attackers implanted into the breached systems. Finally, the third section shows how I identified similar attacks by uncovering behavioral characteristics. All of this demonstrates one way that data science can be applied to the security domain. (My previous post explained another application of data science to security.)

Background

Cyber attackers are constantly looking for targets on the Internet. Much like a lion pursuing its prey, an attacker usually conducts a sequence of actions, known as the cyber kill chain, including identifying the footprints of a victim system, scanning the open ports of the system, and probing the holes trying to find an entrance into the system. Professional attackers might be doing this all day long until they find a weak system.

All of this would be bad news for any weak system the attacker finds – unless that weak system is a honeypot. A honeypot is a trap set up on the Internet with minimum security settings so an attacker may easily break into it, without knowing his/her activities are being monitored and tracked. Though honeypots have been used widely by researchers to study the methods of attackers, they can also be very useful to defenders. Compared to sophisticated anomaly detection techniques, honeypots provide intrusion alerts with low false positive rates because no legitimate user should be accessing them. Honeypots set up by a company might also be used to confuse attackers and slow down the attacks against their networks. New techniques are on the way to make setting up and managing honeypots easier and more efficient, and honeypots may play an increasingly prominent role in future cyber defense.

A network of honeypots is called a honeynet. The particular honeynet for which I have data logged activity showing attackers enumerating pairs of common user names and passwords to enter a system, downloading malicious files from their own hosting servers, changing the privileges on those files, and then executing them. From March 2015 through the end of June 2015, more than 21,000 attacker IP addresses were detected and about 36 million SSH attempts were logged. Attackers tried 34,000 unique user names and almost 1 million unique passwords to break into those honeypots. That’s a lot of effort by the attackers to break into the system. Over time, the honeynet has identified about 500 malicious domains and more than 1,000 unique malware samples.

The IP addresses that were owned by the attackers and used to host malware are geographically dispersed. Figure 1 shows that the recorded attacks mostly came from China, the U.S., the Middle East and Europe. While geographic origination doesn’t tell us everything, it still gives us a general idea of potential attacker locations. 

Figure 1. Attacks came from all around the world, color coded on counts of attack. The darker the color, the greater the number of attacks originating from that country.

The frequency of attacks varies daily, as shown in Figure 2, but the trend shows that more attacks were observed on workdays than weekends, with peaks often appearing on Wednesday or Thursday. This seems to support the suspicion that humans (rather than bots) were behind the scenes, and that professionals rather than amateur hobbyists conducted the attacks.

Figure 2. Daily Attack Counts.

Now that we understand where and when those attacks were orchestrated, we want to understand whether any of the attacks were organized. In other words, were they carried out by the same person or the same group of people over and over again?

Attackers change IP addresses from attack to attack, so looking at IP addresses alone won’t provide much information. To answer the question above, we need to use knowledge about the files the attackers left behind.

File Similarity

Malware to an attacker is like a hammer and level to a carpenter. We expect that an attacker would use his/her set of malware repeatedly in different attacks, even though the files might appear under different names or as variants. Therefore, the similarity across the downloaded malware files may provide informative links between associated attacks.

One extreme case is a group of 17 different IPs (shown in Figure 3), used on a variety of days, each containing exactly the same files and folders organized in exactly the same structure. That finding immediately portrayed a lazy hacker who used the same folder time and time again. However, we would imagine that most attackers might be more diligent. For example, file structures in the hosting server may be different, folders could be rearranged, and the content of a malicious binary file may be tweaked. Therefore, a more robust method is needed to calculate the level of similarity across the files, and then use that information to associate similar attacks.

Figure 3. 17 IPs have exactly the same file structure.

How can we quantitatively and algorithmically do this?

The first step is to find, for each file of interest, the files similar to it. The collected files include different types, such as images, HTML pages, text files, compressed tarballs, and binary files, but we are probably only interested in binary files and tarballs, which are riskier. This reduces the number of files to work on, but the same approach can be applied to all file types.

File similarity computation has been researched extensively over the past two decades but remains a rich field for new methods. Mature algorithms for computing file similarity include block-based hashing, Context-Triggered Piecewise (CTP) hashing (also known as fuzzy hashing), and Bloom filter hashing. Endgame uses more advanced file similarity techniques based on file structural and behavioral attributes, but for this investigation I used fuzzy hashing, both for simplicity and because open source code for it is widely available.

I took each unique file, identified by its fuzzy hash string, and computed its similarity to every other file. The result is a large symmetric similarity matrix for all files, which we can visualize to check whether there is any apparent structure in the similarity data. To visualize the matrix, I connect two similar files with a line, using an arbitrary threshold of 80: if two files are more than 80% similar, they are connected. The visualization of the file similarity matrix is shown in Figure 4.
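The thresholding step is simple enough to show directly. Below is a minimal Go sketch, with a toy, made-up similarity matrix, that turns pairwise similarities into the kind of graph visualized in Figure 4.

package main

import "fmt"

// BuildSimilarityGraph connects files i and j whenever sim[i][j] exceeds the
// threshold (e.g., 80 on a 0-100 fuzzy-hash similarity scale).
func BuildSimilarityGraph(sim [][]int, threshold int) map[int][]int {
	graph := make(map[int][]int)
	for i := range sim {
		for j := i + 1; j < len(sim[i]); j++ {
			if sim[i][j] > threshold {
				graph[i] = append(graph[i], j)
				graph[j] = append(graph[j], i)
			}
		}
	}
	return graph
}

func main() {
	// A toy 4x4 symmetric similarity matrix (values are illustrative).
	sim := [][]int{
		{100, 85, 10, 0},
		{85, 100, 5, 0},
		{10, 5, 100, 90},
		{0, 0, 90, 100},
	}
	fmt.Println(BuildSimilarityGraph(sim, 80)) // map[0:[1] 1:[0] 2:[3] 3:[2]]
}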

Figure 4. Graph of files based on similarity.

It is visually clear that the files are indeed partitioned into a number of groups. Let’s zoom into one group and see the details in Figure 5. The five files, represented by their fuzzy hash strings, are connected to each other with mutual similarity over 90%. Looking carefully, they differ in only one or two letters of their hash strings, even though they have totally different file names and MD5 hashes. VirusTotal recognizes four of the five files as malware, and the scan reports indicate that they are Linux Trojans.

Figure 5. One group of similar files.

Identifying Similar Attacks

Now that we have identified the groups of similar files, it’s time to identify the attacks that used similar malware. If I treat each attack as a document and the malware used in an attack as words, I can construct a document-term-style matrix that encapsulates all the attack information. To incorporate the malware similarity information, I tweaked the matrix a bit: for malware that was not used in a specific attack but that shares similarity with malware that was used, the cell takes the value of the highest such similarity. For example, if malware M1 was not used in attack A1, but M1 is most similar to malware M2, which was used in A1, with a similarity level of 90%, then the element at cell (A1, M1) will be 0.9, while (A1, M2) will be 1.0.
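A minimal Go sketch of this construction follows. The inputs are hypothetical (per-attack malware lists and a 0.0-1.0 file-similarity matrix), but it reproduces the (A1, M1) = 0.9, (A1, M2) = 1.0 example above.

package main

import "fmt"

// BuildAttackMalwareMatrix builds the tweaked "document-term" matrix described
// above: cell (attack, malware) is 1.0 if the malware was used in the attack,
// otherwise the highest similarity between that malware and any malware that
// the attack did use.
func BuildAttackMalwareMatrix(attacks [][]int, sim [][]float64, nMalware int) [][]float64 {
	matrix := make([][]float64, len(attacks))
	for a, used := range attacks {
		row := make([]float64, nMalware)
		for _, m := range used {
			row[m] = 1.0 // malware actually used in this attack
		}
		for m := 0; m < nMalware; m++ {
			if row[m] == 1.0 {
				continue
			}
			for _, u := range used { // most similar malware that was used
				if sim[m][u] > row[m] {
					row[m] = sim[m][u]
				}
			}
		}
		matrix[a] = row
	}
	return matrix
}

func main() {
	attacks := [][]int{{1}} // attack A1 used malware M2 (index 1)
	sim := [][]float64{{1.0, 0.9}, {0.9, 1.0}}
	fmt.Println(BuildAttackMalwareMatrix(attacks, sim, 2)) // [[0.9 1]]
}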

For readers familiar with NLP (Natural Language Processing) and text mining, the matrix described above is similar to a document-term matrix, except the values are not computed from TF-IDF (Term Frequency-Inverse Document Frequency). More on applications of NLP to malware analysis can be found in a post published by my fellow Endgamer Bobby Filar. The essence of such a matrix is to reflect the relationship between data records and features. In this case, data records are attacks and features are malware, while in NLP they are documents and words. The resulting attack-malware matrix has more than 400 columns representing malware hashes. To get a quick idea of how the attacks (the rows) are dispersed in such a high-dimensional space, I plotted the data using t-SNE (t-Distributed Stochastic Neighbor Embedding) and colored the points according to the results of K-means clustering with K=10, chosen arbitrarily to illustrate the spatial segmentation of the attacks. The t-SNE graph is shown in Figure 6, where each color represents a cluster labeled by K-means. t-SNE tries to preserve the topology when projecting data points from a high-dimensional space to a much lower-dimensional space, and it is widely used for visualizing the clusters within a data set.

Figure 6 shows that K-means did a decent job of spatially grouping close data points into clusters, but it falls short of providing a quantitative measure of similarity between any two data points, and it is quite difficult to choose the optimal number of clusters K. To overcome these challenges, I will use Latent Semantic Indexing (LSI) to compute the similarity level for attack pairs, build a graph connecting similar attacks, and eventually apply social network analytics to determine the clusters of similar attacks.

Figure 6. T-SNE projection of Attack-Malware matrix to 2-D space.

LSI is the application of a particular matrix factorization, Singular Value Decomposition (SVD), to a document-term matrix. SVD projects the original n-dimensional space (with n words in columns) onto a k-dimensional space, where k is much smaller than n. The projection transforms a document’s vector in n-dimensional space into a vector in the reduced k-dimensional space in such a way that the Euclidean (Frobenius) distance between the original matrix and its reduced-rank reconstruction is minimized.
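In symbols, this is the standard truncated SVD (notation introduced here for clarity, not taken from the original figures). For the attack-malware matrix X:

    X ≈ X_k = U_k Σ_k V_kᵀ,   where X_k = argmin over rank-k matrices Y of ‖X − Y‖_F

The rows of U_k Σ_k give each attack’s coordinates in the reduced k-dimensional space; this is the attack-component matrix used below.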

SVD decomposes the attack-malware matrix into three matrices, one of which defines the new dimensions in order of significance. We call the new dimensions principal components; they are ordered by the amount of variance in the original data they explain. Projecting each attack onto the components yields what I’ll call the attack-component matrix. At the risk of losing some information, we can plot the attack data points in 2-D space using the first and second components to illustrate the differences between data points, as shown in Figure 7. Vectors pointing in perpendicular directions are most different from each other.

Figure 7. Attack data projected to the first and second principal components.

The similarity between attacks can then be computed from the results of LSI, specifically by taking the dot products (or, after normalization, the cosine similarities) of rows of the attack-component matrix.
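A minimal Go sketch of that comparison, using cosine similarity on two hypothetical attack vectors in a 3-component LSI space:

package main

import (
	"fmt"
	"math"
)

// CosineSim computes the cosine similarity between two attack vectors taken
// from rows of the attack-component matrix produced by LSI.
func CosineSim(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0 // avoid dividing by zero for an all-zero vector
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a1 := []float64{0.8, 0.1, -0.2} // illustrative attack vectors
	a2 := []float64{0.7, 0.2, -0.1}
	fmt.Printf("similarity: %.3f\n", CosineSim(a1, a2))
}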

Table 1. Attacks Similar to Attack from 61.160.212.21:5947 on 2015-03-23.

I connect two attacks if their similarity is above a certain threshold, e.g. 90%, and come up with a graph of connected attacks, shown in Figure 8.

 

Figure 8. Visualization of attacks connected by similarity.

There are a few big component subgraphs in the large graph. Each component subgraph represents a group of closely similar attacks. We can examine each of them in terms of which malware were deployed by the given attack group, which IP addresses were used, and how frequently the attacks were conducted.

I plotted the daily attack counts for the two largest attack groups in Figure 9 and Figure 10. Both show that attacks happened more often on weekdays than on weekends. These attacks may have targeted different geo-located honeypots in the system and can be viewed as a widely expanded search for victims.

Figure 9. Daily counts of attack in one group.

Figure 10. Daily counts of attack in another group.

We can easily find out where those attackers’ IPs were located (latitude and longitude) and the WHOIS data associated with the IPs, but it is much more difficult to fully investigate the true identity of the attackers.

Summary

In this post, I explained how to apply data science techniques to identify honeypot attackers. Mathematically, I framed the problem as an Attack-Malware matrix, and used fuzzy hashing to represent files and compute the similarity between files. I then employed latent semantic indexing methods to calculate the similarity between attacks based on file similarity values. Finally, I constructed a network graph where similar attacks are linked so that I could apply social network analytics to cluster the attacks.

As with my last blog post, this post demonstrates that data science can provide a rich set of tools that help security experts make sense of the vast amount of data often seen in cyber security and discover relevant information. Our data science team at Endgame is constantly researching and developing more effective approaches to help our customers defend themselves – because the hunt for attackers never ends.

Three Questions: Smart Sanctions and The Economics of Cyber Deterrence


The concept of deterrence consistently fails to travel well to the cyber realm. One (among the many) reasons is that, although nuclear deterrence is achieved through nuclear means, cyber deterrence is not achieved solely through cyber means. In fact, any cyber activity meant for deterrence is likely going to be covert, while the more public deterrence activities fall into diplomatic, economic, financial, and legal domains. Less than six months after President Obama signed an executive order to further expand the range of responses available to penalize individuals or companies conducting “malicious cyber-enabled activities”, there are now reports that it may be put to use in a big and unprecedented way. Numerous news outlets have announced the possibility of sanctions against Chinese individuals and organizations associated with economic espionage within the cyber domain. If the sanctions do come to fruition, it may not be for a few more weeks. Until then, below are some of the immediate questions that may help provide greater insight into what may be one of the most significant policy evolutions in the cyber domain.

1. Why now?  

Many question the timing of the potential Chinese sanctions, especially given President Xi Jinping’s upcoming state visit to Washington. It is likely that a combination of events over the summer in both the US and China have instigated this policy shift:

Chinese domestic factors: China’s stock market has been falling consistently since June, with the most visible plunge occurring at the end of August, which has had global ramifications. A major slowdown in economic growth has also hit China, which by some estimates could be as low as 4% (counter to the ~10% growth of the last few decades, and lower than even the recent record low of 7.4% in 2014). The latest numbers from today reinforce a slowing economy, with the manufacturing sector recording a three-year low. Simultaneously, President Xi continues to consolidate power, leading a purge of Communist Party officials targeted for corruption and asserting greater control of the military. In short, President Xi is looking increasingly vulnerable, handling economic woes while continuing a political power grab, which has led two influential generals to resign and bred discontent among some of the highest ranks of leadership.

US domestic factors: The most obvious trigger for the timing of potential US sanctions seems to be this summer’s OPM breach, which has been largely attributed to China. This is just the latest in an ongoing list of public and private sector hacks attributed to China, including United Airlines and Anthem. The OPM breach certainly helped elevate the discussions over retaliation, but it’s unlikely that it was the sole factor. Instead, the persistent theft of IP and trade secrets, undermining US competitiveness and creating an uneven playing field, is the dominant rationale provided. From the defense sector to solar energy to pharmaceuticals to tech, virtually no sector remains unscathed by Chinese economic espionage. The continuing onslaught of attacks may have finally reached a tipping point.

The White House also has experienced increased pressure to respond in light of this string of high-profile breaches. Along with pressure from foreign policy groups and the public sector, given the administration’s pursuit of greater public-private partnerships, there is likely similar pressure from powerful parts of the private sector – including the financial sector and Silicon Valley – impacting the risk calculus of economic and financial espionage. For instance, last week, Secretary of Defense Ashton Carter visited Silicon Valley, encouraging greater cooperation and announcing a $171 million joint venture with government, academia and over 160 tech companies. These partnerships have been a high priority for the administration, meaning that the government likely feels pressure to respond when attacks attributed to the Chinese, such as the GitHub attacks this spring, hit America’s tech giants.

2. Why is this different from other sanctions?

Sanctions against Russia and Iran were in response to the aggressive policies of those countries, while those against North Korea were in response to the Sony breach. However, each of these countries lacks the economic interdependence with the US that exists for China.  Mutually assured economic destruction is often used to describe the economic grip the US and China have on each other’s economies. The United States is mainland China’s top trading partner, based on exports plus imports, while China is the United States’ third largest trading partner, following the European Union and Canada. Compare this to the situation in Russia, North Korea, and Iran, the most prominent countries facing US sanctions, none of which have significant trade interdependencies with the US.

Similarly, foreign direct investment (FDI) between China and the US is increasingly significant, with proposals for a bilateral investment treaty (BIT) exchanged this past June, and discussions ongoing in preparation for President Xi’s visit this month. China is also the largest foreign holder of US Treasury securities, despite its recent unloading of Treasury bonds to help stabilize its currency. Compare this to Russia, North Korea, or Iran, none of which the US economy relied on prior to their respective sanctions. Even in Iran and Russia’s strongest industry – oil and gas – the US has become less reliant and more economically independent, especially given that the US was the world’s largest producer of oil in 2014.

3. Who or what might be targeted?

If sanctions are administered, the US will most likely continue its use of “smart” or targeted sanctions that focus on key individuals and organizations, rather than the entire country. The US sanctions against Russia provide some insight into the approach the administration might take. Russian sanctions are targeted at Putin’s inner circle, including its affiliated companies. These range from defense contractors to the financial sector to the energy sector, and include close allies such as Gennady Timchenko. Similarly, North Korean sanctions following the Sony hack focused on three organizations and ten individuals. In the case of China, the state-owned enterprises (SOEs) deemed to reap the most benefits from economic espionage will likely be targeted. In fact, the top twelve Chinese companies are SOEs, meaning they have close ties to the government. More specifically, sanctions could include energy giants CNOOC, Sinopec and PetroChina, some of the large banks, or the global tech giant Huawei, because of their large role in the economy and their potential to benefit from IP theft. Interestingly, the largest Chinese companies do not include several of the more famous tech companies, such as Alibaba, Tencent, Baidu and Xiaomi. Most of these enterprises have yet to achieve a significant global footprint, which means they are less likely to top any sanctions list. In considering who among Xi’s network might be targeted, some point to the Shaanxi Gang, Xi’s longtime friends, while others look at those most influential within the economy, such as Premier Li Keqiang.

Given President Xi’s upcoming visit, is the talk of sanctions diplomatic maneuvering, or will it be backed by concrete action? If enacted, the administration’s intent will be revealed through the actual targets of the sanctions.  If the objective is to deter future cyber aggression, then sanctions must be targeted at these influential state-owned companies and inner circle of the regime.  Otherwise, it will be perceived as a purely symbolic act both in the United States and in China and lack the teeth to truly enact change. 

Meet Endgame at AWS re:Invent 2015


See how we automate the hunt for cyber adversaries.

Stop by Booth #1329 to:

SEE A DEMO OF ENDGAME PRODUCTS

Sign up here for a private demo to learn how we detect attacks that:

  • Use native tools to locate, stage, and exfiltrate customer data
  • Exploit application vulnerabilities to install unknown malware
  • Install backdoors to gain control of critical servers
     

JOIN US AT 1923 BOURBON BAR!

Join Endgame for an evening of bourbon, cigar rolling, and jazz at 1923 Bourbon Bar on Wednesday, October 7. Registration is required to attend. Learn more and register here.

MinHash vs. Bitwise Set Hashing: Jaccard Similarity Showdown


As demonstrated in an earlier post, establishing relationships (between files, executable behaviors, or network packets, for example) is a key objective of researchers when automating the hunt.  But, the scale of information security data can present a challenge if naïvely measuring pairwise similarity.  Let’s take a look at two prominent methods used in information security to estimate Jaccard similarity at scale, and compare their strengths and weaknesses.  Everyone loves a good head-to-head matchup, right?

Jaccard distance is a metric¹ that measures the dissimilarity of two sets, A and B, by

    J_d(A,B) = 1 − |A ∩ B| / |A ∪ B| = 1 − J_s(A,B)

where J_s denotes the Jaccard similarity, bounded on the interval [0,1].  Jaccard similarity has proven useful in applications such as malware nearest-neighbor search, clustering, and code reuse detection.  In such cases, the sets A and B might contain imported functions, byte or mnemonic n-grams, or behavioral properties observed in dynamic analysis of each file.
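For reference, here is the exact computation that both methods below approximate, as a minimal Go sketch over string sets (the API-name elements are illustrative). With many samples, many feature sets per sample, and many elements per set, evaluating this pairwise is exactly the cost the sketches below avoid.

package main

import "fmt"

// JaccardSim computes the exact Jaccard similarity |A∩B| / |A∪B| of two sets.
func JaccardSim(A, B map[string]bool) float64 {
	inter, union := 0, len(B)
	for a := range A {
		if B[a] {
			inter++ // element in both sets
		} else {
			union++ // element only in A
		}
	}
	if union == 0 {
		return 0 // both sets empty
	}
	return float64(inter) / float64(union)
}

func main() {
	A := map[string]bool{"CreateFileA": true, "WriteFile": true, "CloseHandle": true}
	B := map[string]bool{"CreateFileA": true, "ReadFile": true, "CloseHandle": true}
	fmt.Printf("Js = %.2f\n", JaccardSim(A, B)) // 2 shared / 4 total = 0.50
}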

Since each datapoint (e.g., malware sample) often consists of many feature sets (e.g., imports, exports, strings, etc.) and each set can itself contain many elements, naïve computation of Jaccard similarity can be computationally expensive.  Instead, it’s customary to leverage efficient descriptions of the sets A and B together with a fast comparison mechanism to compute J_d(A,B) or J_s(A,B). Minwise Hashing (MinHash) and bitwise set hashing are two methods to estimate Jaccard similarity.  Bitwise set hashing will be referred to in this blog post as BitShred, since it is used as the core similarity estimator in the BitShred system proposed for large-scale malware triage and similarity detection.

Let’s review some preliminaries: first, the key ideas behind MinHash and BitShred, with a few observations about each estimator; then, an experimental comparison of the two methods on supervised and unsupervised machine learning tasks in information security.

 

MinHash

MinHash approximates a set with a random sampling (with replacement) of its elements.  A hash function h(a) is used to map any element a from set A to a distinct integer, which mimics (but, with consistency) a draw from a uniform distribution.  For any two sets A and B, Jaccard similarity can be expressed in terms of the probability of hash collisions:

    J_s(A,B) = Pr[ min over a∈A of h(a) = min over b∈B of h(b) ]

where the min operator acts as the random sampling mechanism.  Approximating the probability by a single MinHash comparison of A and B is actually an unbiased estimator, but has quite large variance—the value is either identically 1 or 0.  To reduce the variance, MinHash averages over m trials to produce an unbiased estimator with variance O(1/m).

Estimating Jaccard similarity via MinHash is particularly efficient if one approximates h(a) using only its least significant bit (LSB).  This, of course, introduces collisions between distinct elements, since the LSB of h(a) is 1 with probability 0.5—but the approximation has been shown to be effective if one uses many bits in the code.  Overloading notation a bit, let a (respectively, b) be the bit string of 1-bit MinHashes for set A (respectively, B). Then Jaccard similarity can be approximated via a CPU-efficient Hamming distance computation (xor and popcount instructions):

    Ĵ_s(A,B) = 1 − (2/m) · Hamming(a, b)

It has been shown that the variance of 1-bit MinHash is 2(1−J_s)/m when using m total bits, and indeed any summary-based Jaccard estimator has variance at least 1/m.  Interestingly, the variance of b-bit MinHash does not decrease if one uses more than b=1 bits to describe each hash output h(a) while retaining the same number of bits in the overall description.  With a little arithmetic, one can see that to achieve an estimation error of at most εJ_s with probability exceeding 1/2, one requires m > (1−J_s)/(εJ_s)² bits of 1-bit MinHash, by Chebyshev’s inequality.  For example, resolving J_s = 0.5 to within 10% (ε = 0.1) by this bound requires m > 0.5/(0.05)² = 200 bits.

Code (golang) to generate a 1-bit MinHash code and approximate Jaccard similarity from two codes is shown below.

// Helper routines assumed by the snippets in this post. Minimal standard-
// library versions are sketched here (imports: "encoding/binary", "hash/fnv",
// "io", "math/bits"; the MinHash code below also needs "math").
func Hash64(s string, seed uint64) uint64 {
	h := fnv.New64a() // any seeded 64-bit hash will do
	binary.Write(h, binary.LittleEndian, seed)
	io.WriteString(h, s)
	return h.Sum64()
}

func PopCountUint64(x uint64) int {
	return bits.OnesCount64(x)
}

func OneBitMinHash(set []string, N_BITS int) []uint64 {
  code := make([]uint64, N_BITS/64)
  var minhash_value uint64
  for bitnum := 0; bitnum < N_BITS; bitnum++ {
    minhash_value = math.MaxUint64
    for _, s := range set {
      minhash_tmp := Hash64(s, uint64(bitnum)) // bitnum as seed
      if minhash_tmp < minhash_value {
        minhash_value = minhash_tmp
      }
    }
    whichword := bitnum / 64   // which uint64 in the slice?
    whichbit := bitnum % 64    // which bit in the uint64?
    if minhash_value&0x1 > 0 { // is the bit set?
      code[whichword] = code[whichword] | (1 << uint8(whichbit))
    }
  }
  return code
}

func JaccardSim_OneBitMinHash(codeA []uint64, codeB []uint64) float64 {
  var hamming int
  N_BITS := len(codeA) * 64
  for i, a := range codeA {
    hamming += PopCountUint64(a ^ codeB[i])
  }
  return 1.0 - 2.0*float64(hamming)/float64(N_BITS)
}

 
BitShred: Bitwise Set Hashing

Feature hashing is a space-efficient method to encode feature-value pairs as a sparse vector.  This is useful when the number of features is a priori unknown, or when otherwise constructing a feature vector on the fly.  To create an m-dimensional vector from an arbitrary number of feature/value pairs, one simply applies a hash function and a modulo operator to each feature name to retrieve a column index, then updates that column in the vector with the provided value.  Column collisions are a natural consequence in the typical use case, where the size of the feature space n is much larger than m.  A minimal sketch of this general technique is shown below, before specializing to BitShred’s bitwise variant.
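This sketch is mine, not from the BitShred paper; it reuses the Hash64 helper defined in the MinHash snippet above.

// FeatureHash encodes arbitrary feature/value pairs into a fixed m-dimensional
// vector: hash each feature name to a column index and accumulate its value.
func FeatureHash(features map[string]float64, m int) []float64 {
	vec := make([]float64, m)
	for name, value := range features {
		col := Hash64(name, 0) % uint64(m) // feature name -> column index
		vec[col] += value                  // collisions accumulate
	}
	return vec
}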

BitShred uses an adaptation of feature hashing in which each element of a set is encoded as a single bit in a bit string.  Since m << n, the many-to-one mapping between set elements and bit locations introduces collisions.  A concise bit description of set A is created by setting the bit at [h(a) mod m] for all elements a in A.  Overloading notation again, let a (respectively, b) be the BitShred description of set A (respectively, B).  Then Jaccard similarity is estimated efficiently by replacing set operators with bitwise operators:

    Ĵ_s(A,B) = popcount(a & b) / popcount(a | b)

To make sense of this estimator, let random variable C_i denote the event that one or more elements from each of sets A and B map to the ith bit.  Similarly, let random variable U_i denote the event that one or more elements from either set A or B (or both) map to the ith bit.  Then the BitShred similarity estimator can be analyzed by considering the ratio

    Ĵ_s(A,B) = (Σ_i C_i) / (Σ_i U_i)

which is simply the (noisy, with collisions) sum of the intersections divided by the sum of the union.  Estimating the bias of this ratio of random variables will not be detailed here.  But note that, due to the many-to-one mapping, the numerator generally overestimates the true cardinality of the set intersection, while the denominator underestimates the true cardinality of the set union.  So, without cranking laboriously through any math, it’s easy to see from the ratio of “too big” to “too small” that this estimator is biased² and generally overestimates the true Jaccard similarity.

Code (golang) to generate a BitShred code and estimate Jaccard similarity from two BitShred codes is shown below.

 

// Hash64 and PopCountUint64 are the same helpers defined in the MinHash
// snippet above.

func BitShred(set []string, N_BITS uint16) []uint64 {
  code := make([]uint64, N_BITS/64)
  for _, s := range set {
    bitnum := Hash64(s, 0) % uint64(N_BITS)
    whichword := bitnum / 64  // which uint64 in the slice?
    whichbit := bitnum % 64   // which bit in the uint64?
    code[whichword] = code[whichword] | (1 << uint8(whichbit))
  }
  return code
}

func JaccardSim_BitShred(codeA []uint64, codeB []uint64) float64 {
  var numerator, denominator int
  for i, a := range codeA {
    numerator += PopCountUint64(a & codeB[i])
    denominator += PopCountUint64(a | codeB[i])
  }
  return float64(numerator) / float64(denominator)
}

 

 
Estimator Quality

The math is over; let’s look at some plots.

This plot shows the estimated vs. true Jaccard similarity for MinHash and BitShred, for the contrived case where sets A and B consist of randomly generated alphanumeric strings, |A|=|B|=64, and the number of bits is m=128.  The mean and one-standard-deviation error bars are plotted from 250 trials for each point on the similarity graph.  The y=x identity line (dotted) is also plotted for reference.

A few things are evident. As expected, MinHash shows its unbiasedness, with modest variance.  BitShred is grossly biased, but has low variance.  Note, however, that the variance of both estimators vanishes as similarity approaches unity.  In many applications, such as approximate nearest-neighbor search, it’s the consistent rank-order of similarities that matters, rather than the actual similarity values.  In this regard, one is concerned about the variance and strict monotonicity of this kind of curve only on the right-hand side, where J_s approaches 1.  The extent to which the bias and variance near J_s=1 play a role in applications will be explored next.

 

Nearest Neighbor Search

So, what about nearest-neighbor search?  Let’s compare k-NN recall.

As a function of neighborhood size k, we measure the recall of true nearest neighbors: what fraction of the true k neighbors did we capture in our k-NN query?  The plot above shows recall vs. k, averaged over 250 trials with one-standard-deviation error bars, for MinHash vs. BitShred.  The same contrived case is used as before, in which sets A and B consist of randomly generated alphanumeric strings, |A|=|B|=64, and the number of bits is m=128.  While it’s mostly a wash for small k, one observes that the lower-variance BitShred estimator generally provides better recall.

Note that in this toy dataset, the neighborhood size increases linearly with similarity; but in real datasets the monotonic relationship is far from linear.  For example, the first 3 nearest neighbors may enjoy Jaccard similarity greater than 0.9, while the 4th neighbor may be very dissimilar (e.g., Jaccard similarity < 0.5).

Applications: Malware Visualization and Classification

Let’s take a look at an application.   In what follows, we form a symmetric nearest neighbor graph of 250 samples from each of five commodity malware families plus a benign set, with k=5 nearest neighbors retrieved via Jaccard similarity (MinHash or BitShred).  For each sample, codes are generated by concatenating five 128-bit codes (640 bits per sample) consisting of a single 128-bit code for each of the following feature sets extracted from publicly available VirusTotal reports:

  • PE file section names;
  • language resources (English, simplified Chinese, etc.);
  • statically-declared imports;
  • runtime modification to the hosts file (Cuckoo sandbox); and
  • IP addresses used at runtime (Cuckoo sandbox).

t-SNE plots of the data—which aim to respect local similarity—are shown below for MinHash and BitShred. (I use the same random initialization for both plots.)

Figure 1: MinHash similarity from k=5 symmetric similarity matrix

Figure 2: BitShred similarity from k=5 symmetric similarity matrix

The effects of BitShred’s positive bias can be observed when comparing to the MinHash plot.  It’s evident that BitShred is merging clusters that are distinct in the MinHash plot.  This turns out to be good for Allaple, but very bad for Ramnit, Sality and benign, which exhibit cleaner separation in the MinHash plot.  Very small, tight clusters of Soltern and Vflooder appear to be purer in the BitShred visualization. Embeddings produced from graphs with higher connectivity (e.g., k=50) show qualitatively similar findings.

For a quantitative comparison, we run simple k-NN classification with k=5 neighbors and measure classification performance.  For MinHash, the confusion matrix and classification summary are:


And for BitShred:

In this contrived experiment, the numbers agree with our intuition derived from the visualization: BitShred confuses Ramnit, Sality and benign, but shows marginal improvements for Soltern and Vflooder.

 

Summary

MinHash and BitShred are two useful methods to approximate Jaccard similarity between sets with low memory and computational footprints.  MinHash is unbiased, while BitShred has lower variance with nonnegative bias.  In non-extensive experiments, we verified the intuition that BitShred overestimates Jaccard similarity, which can introduce errors in visual clustering and nearest-neighbor classification.  In our contrived experiments, this caused confusion/merging of distinct malware families, an effect that also plays out in practice.

The bias issue of BitShred could be partially ameliorated by using neighbors that fall within a ball of small radius r, where the BitShred bias is small.  (This is in contrast to k-NN approaches in which similarities in the “local” neighborhood can range from 0 to 1, with associated bias/variance.) 

Finally, the Jaccard metric represents a useful measure of similarity.  There are many others based on common or custom similarity measures, which may also be approximated by Hamming distance on compact binary codes.   These, together with efficient search strategies (also not detailed in this blog post) can be employed for powerful large-scale classification, clustering and visualization.

¹How can one show that Jaccard distance is really a metric?  Nonnegativity, coincidence axiom, and symmetry? Check, check, and check.  But the triangle inequality?  Tricky!  Alternatively, one can start with a known metric—the symmetric set difference between A and B—then rely on the Steinhaus Transform to crank through the necessary arithmetic and arrive at Jaccard distance.

²One may reduce the bias of BitShred by employing tricks similar to those used in feature hashing. For example, a second hash function may be employed to determine whether to xor the current bit with a 1 or a 0. This reduces bias at the expense of variance.  For brevity, I do not include this approach in the comparisons.

Read more blog posts about Data Science.

Follow Hyrum on Twitter @drhyrum.

Webinar: Automating the Hunt for Network Intruders


As adversaries - whether criminal or otherwise - make use of increasingly sophisticated attack methods, network defenses have not kept pace; they remain focused on signature-based, reactive measures that close the barn door after the horses have escaped. Automated threat detection offers the opportunity for truly proactive network defense, by reducing the amount of time an intruder remains undetected and introducing remedies earlier than otherwise possible. Automation can also enable better use of scarce resources and reduced exposure to network-based threats. This webcast discusses how to automate the hunt for network threats and move an organization's security posture to the next level.

Sign up for this SANS webcast and be among the first to receive an advance copy of a SANS whitepaper discussing the automation of threat detection. Register here.
