
Data-Driven Strategic Warnings: The Case of Yemeni ISPs


In 2007, a flurry of denial of service attacks targeted Estonian government websites as well as commercial sites, including banks. Many of these Russian-backed attacks were hosted on servers located in Russia. The following year, numerous high profile Georgian government and commercial sites were forced offline, redirected to servers in Moscow. Eventually, the Georgian government transferred key sites, such as the president’s site, to US servers. These examples illustrate the potential vulnerability of hosting sites on servers in adversarial countries. Both Estonia and Georgia are highly dependent on the Internet, with Estonia conducting virtually everything online from voting to finance. At the opposite end of the spectrum is Yemen, with twenty Internet users per 100 people. Would the same kind of vulnerability experienced by Georgian sites be a concern for a country with minimal Internet penetration?

For low- and middle-income countries, traditional indicators of instability and dependence – such as conflict measures or foreign aid, respectively – tend to drive risk assessments. When modern technologies are taken into account, most of this work focuses on the role of social media, as the majority of research on the Arab Spring and now ISIS reflects. While these technologies are important to include, they do not reflect the full spectrum of digital insights that can be garnered for geopolitical analysis. More specifically, the hosting or transfer of strategic servers in adversarial (or allied) sovereign territory could provide an oft-overlooked signal of a country's intent. Eliminating this kind of dependency could be a subtle but insightful change that warrants additional attention, and the changing digital landscape could provide great value and potentially strategic warning of a shifting geopolitical one.

The Public Telecommunication Corporation (PTC) is the operator of Yemen's major Internet service providers, Yemennet and TeleYemen. Using Endgame's proprietary data, it is possible to analyze the changing digital landscape of all Internet-facing devices, including the digital footprint of these ISPs. The geo-enrichment and organizational information, when explored temporally, may shed light both on transitioning allegiances and on who controls access to key digital instruments of power during conflict. Because these are state-affiliated ISPs, they can be used for censorship and propaganda by those who control them, as exemplified in Eastern Europe. In fact, news broke on 26 March that Yemennet is blocking access to numerous websites opposed to the Houthis, who control the capital and have expanded their reach, prompting the recent air strikes by Saudi Arabia and its Gulf Cooperation Council allies.

Looking at data from early 2011 to the present, it is apparent that the PTC, and Yemennet in particular, had a footprint mainly in Yemen but also in Saudi Arabia.

PTC Cumulative Host Application Footprint 2011-2015

Yemennet Cumulative Host Application Footprint 2011-2015

However, the longer temporal horizon masks changes that occurred during these years. The maps below illustrate data over the last year, highlighting that the digital footprint has moved entirely to Sanaa.

PTC Footprint 2014-2015

Yemennet Footprint March 2014-2015

An overview of the time series data shows a dramatic termination of the Saudi Arabian presence during the summer of 2013.

To ensure this breakpoint was not simply an elimination of the IP blocks located in Riyadh and Jeddah, but rather a move to Sanaa, I explored numerous IP addresses independently to assess the change. In each case, the hosting of the IP address transferred from Saudi Arabia to Yemen. Interestingly, just prior to the breakpoint in the data, an allegedly Iranian shipment of Chinese missiles, intended at the time for Houthi rebels in the northwestern part of the country, was intercepted off the coast of Yemen. Moreover, the breakpoint occurs within the same timeframe as the termination of Saudi Arabia's aid to Yemen, which had been the bedrock of the relationship for decades. In fact, the elimination of this aid was described as giving "breathing space for it (Yemen) to become independent of its 'big brother' next door." It is plausible that this transfer of host locations is similarly part of the larger desire for "breathing space" and the elimination of dependencies on Yemen's powerful neighbor.
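To illustrate the kind of check involved, the minimal sketch below uses pandas on a hypothetical CSV export of host observations. The column names (timestamp, ip, country) are assumptions for illustration only; this is not Endgame's actual tooling.

        import pandas as pd

        # Hypothetical export of host observations over time
        hosts = pd.read_csv("ptc_hosts.csv", parse_dates=["timestamp"])

        # Count distinct hosts per country per month
        monthly = (hosts
                   .groupby([hosts["timestamp"].dt.to_period("M"), "country"])["ip"]
                   .nunique()
                   .unstack(fill_value=0))

        # Candidate breakpoint: the first month the Saudi footprint drops to
        # zero while the Yemeni footprint persists
        print(monthly[(monthly["SA"] == 0) & (monthly["YE"] > 0)].index.min())

        # Spot-check individual addresses: did each IP's geolocation actually
        # move from Saudi Arabia to Yemen rather than simply disappear?
        moves = hosts.sort_values("timestamp").groupby("ip")["country"].agg(["first", "last"])
        print(moves[(moves["first"] == "SA") & (moves["last"] == "YE")].head())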

Does this transfer of the main Yemeni ISPs away from Saudi Arabia to entirely within Yemen's borders indicate a strategic change? As with all strategic warnings, it should be validated with additional research. Nevertheless, data-driven strategic warnings are few and far between in the realm of international relations. Even the smallest proactive insight into potential changes in the geopolitical landscape could help focus attention on areas previously overlooked. Despite the presence of al-Qaeda in the Arabian Peninsula (AQAP), Yemen has not garnered much attention outside of the counter-terrorism domain. But as we're seeing now, Yemen could very well be the battleground for a proxy conflict between the dominant actors in the Middle East. An exploration of Yemen's digital landscape during 2013 could have prompted a more holistic and proactive analysis of the changing regional dynamics. The digital landscape of key organizations may offer enough strategic insight to enable proactive research into regions on the verge of major geopolitical shifts. With the cyber domain emerging as a major battleground for power politics, digital data must be integrated not only into tactical analyses but also into strategic warning.


Meet Endgame at RSA 2015


Endgame will be at RSA 2015!

Stop by the South Hall, Booth #2127 to:

  • Get a product demo. Learn more about how we help customers instantly detect and actively respond to adversaries.

  • Learn from our experts. We’ll present three technical talks at our booth throughout the week. No registration required - just show up!

  • Enter to win an iPad mini! We'll be giving one away Monday, Tuesday and Wednesday of RSA. We'll announce the winners at the end of each day here on our website and on Twitter (@EndgameInc). Come to the booth to claim your prize.  ***Congratulations to the iPad mini winners for Tuesday 4/21 - #233026 and Wednesday 4/22 - #233120. Come to our booth tomorrow (South Hall 2127) to claim your prize!***

Don't have an expo pass? Register here for a free expo pass courtesy of Endgame (use the registration code X5EENDGME).

 

Technical Talk Descriptions

Vulnerability and Exploit Stats: Combining Behavioral Analysis and OS Defenses to Combat Emerging Threats


Speaker: Cody Pierce, Endgame Director of Vulnerability Research

Despite the best efforts of the security community—and big claims from security vendors—large classes of vulnerabilities and exploits remain available for adversaries to leverage. Attendees will learn about:

  • A new perspective on the current state of software flaws.
  • The wide margin between disclosed vulnerabilities and public exploits, including a historical analysis and trending patterns.
  • Effective countermeasures that can be deployed to detect and prevent the exploitation of vulnerabilities.
  • The limitations of operating system-provided mitigations, and how combining increased countermeasures with behavioral analysis will get defenders closer to preventing the largest number of threats.

Cody Pierce has been involved in computer and network security since the mid 90s. For the past 13 years he has focused on discovery and remediation of known and unknown vulnerabilities. Instrumental in the success of HP's Zero Day Initiative program, Cody has been exposed to hundreds of 0day vulnerabilities, advanced threats, and the most current malware research. At Endgame, Cody has led a successful team tasked with analyzing complex software to identify unknown vulnerabilities, leveraging global situational awareness to manage customer risk.

Global Attack Patterns to Improve Threat Detection  


Speaker: Curt Barnard, Endgame Software Implementation Engineer

The Internet is flooded with traffic from web crawlers, port scanners, and brute force attacks. Data analyzed from Sensornet™, a unique network of sensors, allows us to observe trends on the Internet at large. Attendees will learn:

  • How to identify if malicious traffic directed at your network service is part of a larger CNO campaign.
  • How to get advanced warning of new attacks and malware seen in the wild but not yet reported on.
  • How network defenders can better protect themselves against attacks that occur at scale.
  • How Endgame identifies malicious hosts that are attempting to leverage exploits such as the Shellshock vulnerability at scale.

Curt Barnard is a network security professional with expertise in advanced methods of covert data exfiltration, steganography, and digital forensics. As a Department of Defense employee, Curt focused on analysis and operations to counter some of the most advanced cyber threats. At Endgame, Curt continues this research, coaxing malicious actors into revealing their TTPs and creating defensive measures based on real-time threat data.

How Data Science Techniques Can Help Investigators Detect Malicious Behavior 

Speaker: Phil Roth, Endgame Data Scientist

Data science techniques can help organizations solve their security problems — but they aren’t a silver bullet. Working directly with customers, Endgame has been able to match the right science to unsolved customer security challenges to create effective solutions. In this talk, attendees will experience a small part of that process by learning:

  • How machine learning techniques can be used to find security insights in large amounts of data.
  • The difference between supervised and unsupervised learning and the different types of security problems they can solve.
  • How a lack of labeled data and the high cost of misclassifications present challenges to data scientists in the security industry.
  • How Endgame has used an unsupervised clustering technique to group cloud-based infrastructure, a fundamental step in the detection of malicious behavior.
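For a concrete flavor of the kind of unsupervised clustering described above, the minimal sketch below groups hosts by simple behavioral features with k-means. The features and data are hypothetical illustrations, not Endgame's actual technique:

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        # Hypothetical per-host features: open ports, unique peers, GB out per day
        X = np.array([
            [2,   5,   1.2],
            [3,   7,   0.9],
            [40, 300, 55.0],   # hosts behaving very differently...
            [38, 280, 60.0],   # ...should land in their own cluster
        ])

        # No labels required: the algorithm discovers the groups itself
        labels = (KMeans(n_clusters=2, n_init=10, random_state=0)
                  .fit_predict(StandardScaler().fit_transform(X)))
        print(labels)  # e.g. [0 0 1 1]: two groups of similarly behaving infrastructure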

Phil Roth cleans, organizes, and builds models around security data for Endgame. He learned those skills in academia while earning his physics PhD at the University of Maryland. It was there that he built data acquisition systems and machine learning algorithms for IceCube, a large neutrino telescope based at the South Pole. He has also built image processors for air- and space-based radar systems.

Git Hubris? The Long-Term Implications of China’s Latest Censorship Campaign


Last Friday, GitHub, the popular collaborative site for developers, experienced a series of distributed denial of service (DDoS) attacks. The attacks were the largest in the company's history, and continued through Tuesday before being fully brought under control. GitHub has not been immune to these kinds of attacks in the past, and is quite experienced at maintaining or restoring the site during an onslaught. GitHub experienced a series of DDoS attacks in both 2012 and 2013, and similar attacks earlier in March. By all independent accounts, the Cyberspace Administration of China (CAC) is behind this latest wave of attacks, redirecting traffic from the Chinese search engine Baidu to overwhelm GitHub. While the malicious activity bears the fingerprints of a Chinese campaign, its perpetrators may have awoken a sleeping giant in the open source development community. Unlike other recent high-profile attacks – such as those on Sony and Anthem – these attacks visibly disrupted the day-to-day life of a tight-knit, transnational, and largely middle-class social network. And it is these kinds of transnational networks that, when unified, spawn social movements.

This week's attack focused on pressuring GitHub to remove content related to GreatFire.org and another site that hosts links to the Chinese version of The New York Times. Both are platforms for circumventing the Great Firewall, and the attack is therefore a direct assault both on free speech and on the tech community. In the past, China has restored access to GitHub due to criticism from the domestic developer community. However, China has been tightening censorship over the last few years, which has prompted groups like GreatFire to partner with external organizations – such as Reporters Without Borders – to help fight Chinese censorship. With over 300 cofounders, GreatFire is gaining traction and has tightened relations with major media outlets outside of China. It is these kinds of transnational activist networks that have proven so successful in the past. Written well before the rise of social media, Margaret Keck and Kathryn Sikkink's Activists Beyond Borders introduced the concept of the boomerang effect: when a state is unresponsive to the demands of domestic groups, those groups form transnational alliances that amplify their demands and bring them back upon the state via international pressure. To date, GreatFire is pursuing a trajectory similar to previous successful social movements.

Is it possible that the latest wave of DDoS attacks is enough to fully solidify the relationship of groups like GreatFire not only with journalists, but also with the open source development community? A brief review of Twitter content pertaining to the GitHub DDoS attacks produces three general themes: 1) who is doing this?; 2) why are they doing this?; and 3) stop messing with my project. In fact, one popular source for open source news asks, "Who on Earth would attack GitHub?" The open source community is clearly one of the largest proponents of free speech and collaboration, and while it has been very vocal on issues of privacy, it has been relatively silent on global events. Nevertheless, couple that intrinsic and core set of beliefs with disruption to their own projects, and the conditions are created under which social movements begin to coalesce. More recent literature on social movements further highlights the greater success of movements that pursue non-violent means to instigate change.

The latest executive order sanctions those associated with cyber attacks, but it is more reactive than proactive. The open source community could build upon lessons learned from the GitHub experience and collaborate with colleagues throughout the tech community to inflict economic damage on those who are directly attacking open source development. For instance, a de facto embargo on certain technologies to China would be far more politically feasible, and more costly to China, than working through the ITAR process. While the tipping point for awareness has not yet been reached – one indication of which is the lack of prominent mainstream media coverage of the GitHub attacks – the conditions are ripe for the start of a transnational social movement, driven by the open source development community, if it coalesces around this cause (as it did over privacy concerns) instead of allowing it to silently dissipate.

In contrast, China likely sees this latest GitHub campaign as simply an extension of previous breaches, which failed to garner any political blowback but aided its larger censorship efforts. However, China will increasingly have to deal with the growing paradox of promoting censorship alongside technical development. This is one of the many contradictions China continues to encounter as it simultaneously modernizes its economy and pursues global ambitions. The choice of Baidu, for instance, potentially reveals another rift in China's approach to development. Robin Li, Baidu's CEO, is the third wealthiest man in China and a member of the government's top political advisory council. This makes the choice of Baidu potentially confrontational, as it is publicly traded and not part of the state-owned enterprises that tend to operate at the behest of the government. So far, Baidu has denied any connection to the GitHub attacks. Contradictions like these will only continue to surface as corruption campaigns, censorship, and the extension of power dominate Chinese politics.

The latest GitHub attack, the largest in its history, remains off the radar for all but those in the larger technology and open source communities. This is unfortunate, as it has the potential to have much broader long-term implications within China than any of the other Chinese-associated attacks of the last year. It will be interesting to watch whether the open source community uses this as a springboard for global advocacy of free speech, with the potential to inflict economic and technological pain. The current response has been lukewarm at best, but the conditions are ripe for change. China might do well to heed the advice of Barrington Moore, who over a half century ago wrote about the preconditions for social movements toward democracy and dictatorship. He noted that the tipping point of change tends to occur when the daily routines of the middle class are disrupted or threatened with destruction. China has crossed this threshold, and may very well be uniting the transnational network on which movements are made.

The Endgame Guide to Informed Cocktail Party Conversations on Data Science and the Latest Security Trends at RSA 2015


The statistician George Box famously noted that "all models are wrong, but some are useful." This is especially useful advice when looking at quantitatively driven analytics—a topic that is increasingly dominating research and media coverage in the security industry. While the move toward more data-driven analyses is a welcome one, without a proper understanding of the indicators, parameters, and compilation of the data, the field is ripe for misinterpretation and apples-to-oranges comparisons of the state of the security threatscape. New security industry research and related media coverage over the last week indicate a strong focus on the escalation of cyber attacks over the last year. The quantitative findings, coupled with the growing qualitative narrative of China's Great Cannon extending censorship capabilities outside of Chinese sovereign territory, indicate a troubling rise in malicious activity in the cyber domain. These estimates, at a strategic level, are likely correct, but they are prone to misinterpretation and confusion when translated into business, policy, and course-of-action decisions for executive leaders. To make sense of competing analytics and best evaluate specific organizational risks, executives need to understand how data and behavioral science work together. As security executives and practitioners get ready to head to RSA next week, here are a few guidelines for comparing and interpreting the latest security research:

  • Parameters: Many of the recent industry reports focus on specific geographic or industry coverage, or even company size. For instance, headlines claiming that attacks are up 40% pertain only to large companies with over 2,500 employees. Similarly, headlines that cyber attacks cost companies $400B require the qualification that this is an estimate for some companies. SCADA systems seemed especially vulnerable in analyses that focus solely on SCADA systems, which may or may not apply to other targets. Finally, research is frequently based on a sample or subset of the data and therefore may not reflect the entire population. In short, findings in one region, vertical, or target type do not necessarily translate into the same risk factor outside of those specific parameters, and this can be exacerbated by small sample sizes. The type, severity, and frequency of attacks against the financial services industry in the US likely vary significantly from those targeting the telecommunications industry in Peru. Distinguishing even further based on company size adds another level of complexity that cannot be ignored.
  • Measurement: What constitutes an attack? This is perhaps one of the most challenging and inconsistent areas of quantitative security analytics. For instance, in the critical infrastructure industry there are significant discrepancies in the number of reported attacks, partly due to a lack of consensus on the nature of an attack. Critical infrastructure is not alone, as organizations vary in their definition of an attack. What was the target? From where did the attack occur? Was data breached? For some, the breach of data appears to be the distinguishing element of defining an attack. "It wasn't an actual hack, no data was breached," noted Alex Willette when the State of Maine's website went down last month after being the target of a series of denial of service attacks. And this is just the key independent variable. A series of control and dependent variables are also prone to measurement discrepancies. In fact, most quantitative analytics base their measurement on raw numbers and ignore what percentage those numbers represent of the larger population (see the short sketch after this list). In an industry where the number of connected objects and people continues to expand exponentially, raw numbers mask the growing population size from which these measurements are drawn. Are there more attacks simply because there are a greater number of connected devices and people? Maybe not, but it certainly is a factor that must be considered in any rigorous analysis.
  • Collection: Even with the parameters and measurement well established, the security industry faces great challenges in data collection. This is both a technical and a social challenge. Clearly, the technical means to collect the data often remain proprietary and therefore limit apples-to-apples comparisons of the findings. However, the social dimension likely provides an even greater collection problem. Unlike other areas where risk factors are visible (such as conflict), the security industry leans heavily on self-reporting of breaches. This is one of the many areas where behavioral and social science can be integrated into the quantitative analytics. For instance, the notion of norms emerges frequently, but rarely is it applied to norms pertaining to reporting. Previously, companies and organizations were disinclined to report on a breach for fear of the reputational costs. Is this norm even more embedded in light of CEOs at Target and Sony losing their positions? Or is rising awareness of the geo-political threats leading to greater disclosure to the government? In short, the latest figures on the escalating malicious digital activity might reflect changes in reporting, detection, increased activity, or more likely a confluence of the three. Given the nature of obfuscation and continued norms that may limit reporting or even information sharing, it is essential to remain cognizant of how data collection directly impacts any findings in the security industry.
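A toy sketch of the population-size point above, with hypothetical numbers: raw attack counts can rise even while the per-device rate falls.

        # Hypothetical yearly figures for illustration only
        attacks = {2013: 1_000_000, 2014: 1_500_000}   # raw counts rise 50%
        devices = {2013: 10e9, 2014: 20e9}             # connected devices double

        for year in attacks:
            rate = attacks[year] / devices[year]
            print(year, f"attacks={attacks[year]:,}", f"per-device={rate:.1e}")
        # 2013: per-device=1.0e-04; 2014: per-device=7.5e-05 -- a 25% drop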

As corporate executives and the security industry flock to San Francisco next week for the RSA conference, there will be plenty of discussion of the latest reports and big data techniques to help tackle the escalating malicious digital activity. This may be the one time data munging and structuring discussions are welcome at the numerous cocktail party receptions that coincide with the RSA conference. When asked for thought-provoking insights on the latest trends in the security industry, it never hurts to remember that models are oversimplifications of reality. The parameters, measurement, and collection of the data dramatically impact a model's robustness, and thus the validity of the findings. It is best to avoid oversimplifying such a complex domain, and instead opt for digging beneath the surface of the latest trends to fully comprehend exactly how they might apply to a given organization.

Geeks, Machines and Outsiders: How the Security Industry Fared at RSA


Last week at RSA—the security industry's largest conference—Andrew McAfee, co-author of "The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies," introduced the trifecta of geeks, machines, and outsiders as technological innovation's driving factors. However, after listening to numerous panels and talks during the week that glossed over or downplayed the relevance of geeks, machines, and outsiders in moving the security industry forward, it was impossible to miss the irony of McAfee's argument.

So using the criteria of geeks, machines and outsiders as the driving factors in technology innovation, how does the security industry fare? Based on my week at RSA, here is my assessment:

  • Geeks: By geeks, McAfee refers to people who are driven by evidence and data. Despite the buzzword bingo of anomaly detection, outliers, and machine learning, it is not apparent that the implementation of data science has evolved to the point in security that it has in other industries. This might be shocking to insider experts who believe that data science has almost reached its peak impact in security. To the contrary, as one presenter accurately noted, data science is "still in the dark ages in this space."

    Most data science panels at RSA devoted entire presentations to non-technical and bureaucratic descriptions of data science. In fact, one presenter joked that the goal of the presentation was to only show one equation at most, and only in passing, in order to try to maintain the audience’s attention. While the need to reach a broader audience is understood, panels on similarly technical topics such as malware detection, authentication or encryption dove much deeper into the relevant technologies and methodologies. It’s unfortunate for the industry that the highly technical and complex realm of data science is not always granted the same privilege.

    Incorrect assumptions about data science were also prevalent. At one point during one of the talks, someone commented that "the more data you have, the higher the accuracy of the results." Comments like these perpetuate the myth that more data is always better and ignore the distinction between precision, recall, and accuracy (see the short example after this list). Even worse, the notion of "garbage in, garbage out," which is taught in any introductory quantitative course, did not even seem to be a consideration.

    Finally, security companies seem to buy into the notion that data scientists are necessary for the complex, dynamic big data environment, but they have no idea how to gainfully employ them. During one panel, a Q&A session focused on what to do with the data scientists in a company. Do you partner them with the marketing team? Finance? Something else? It was clear that data science remains an elusive concept that everyone knows they need, but have no idea how to operationalize.
     

  • Machines: Ironically, it was a data science presentation that, although short on real data science, provided the strongest case for increasing human-machine interaction in security by illustrating its success in other industries. In his own argument about machines as a driving factor in technology innovation, McAfee pointed out that companies that ignore human-machine partnerships fall behind. This remains a dominant problem in the security industry, as the numerous high-profile breaches of the last few years illustrate.

    Unlike in many other extraordinarily technical fields, the human factor is often overlooked or ignored in security. Whether it's boasting thousands of alerts a day (which no human could ever analyze) or the omnipresent donut/pie chart visualization, which is the bane of the existence of anyone who actually has to use it, the human factor approach to security—like data science—lags well behind other industries. While there was an entire RSA category devoted to human factors, the vast majority of those panels focused on the insider threat rather than on the user experience in security. The importance of the human-machine interplay is simply not on the security industry's radar.
     

  • Outsiders: McAfee’s last point about outsiders emphasizes the erroneous mindset in some industries that unless you grew up and are trained in that specific field, you have nothing to offer. Instead, industries that are open to ideas and skills from other fields will have the greatest success in the foreseeable future. This perspective has actually been the driving force of creative innovation throughout time. The wariness (and at times exclusion) of outsiders in the security industry is extraordinarily detrimental not only to the industry, but to corporate and national security as well. It impedes cooperation at the policy level and innovation within the security companies themselves. Although not commenting on the security industry specifically, McAfee reiterated the foundational role of a diversity of views and experiences, working collaboratively together, to foster innovation and paradigm shifts.

    This preference toward industry insiders is the driving factor limiting the integration of data science and human-machine partnerships and hindering security innovation. The response to McAfee himself was perhaps indicative of the industry's perspective on outsiders. McAfee was the last keynote presenter of the day. Many attendees sat through a series of talks by security insiders, but unfortunately left when it came time for an outsider's perspective.
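As promised in the Geeks section above, here is a minimal example of why accuracy alone misleads on the imbalanced data typical of security: a detector that never fires scores 99% "accuracy" while catching nothing.

        from sklearn.metrics import accuracy_score, precision_score, recall_score

        # 1,000 events, only 10 of which are actually malicious
        y_true = [1] * 10 + [0] * 990
        # A "detector" that simply never alerts
        y_pred = [0] * 1000

        print(accuracy_score(y_true, y_pred))                    # 0.99 -- looks great
        print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 -- misses every attack
        print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- no useful alerts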

Changing an embedded mindset can be even harder than developing the technical skills. This is especially apparent in the security industry, which has yet to figure out how to take the great advances in data science and human-machine interaction from other industries and leverage them for security. As a quantitative social scientist, it was truly mind-boggling to see just how nascent data science and user experience are in the security industry. The future of the security workplace should obviously maintain subject matter experts, but must also pair them with the data scientists who truly understand the realm of the possible, as well as UI/UX experts who can take the enormous complexity of the security data environment and render it useful to the vast user community. It’s ironic that such a technology-driven industry as security completely discounts its roots in Ada Lovelace’s vision of bringing together arts and sciences, machines and humans. Maintaining the status quo—which in the security industry is 0 for 3 in McAfee’s categories for innovation—should not be an option. There is simply too much at stake for corporate and national security. Technical innovation must be coupled with organizational innovation to truly leverage the insights of geeks, machines and outsiders in security.

Change: Three Ways to Challenge Today’s Security (UX) Thinking


Last week, I was fortunate enough to spend three and a half days on the floor at RSA for its "Change: Challenge Today's Security Thinking"-themed conference. I was simply observing and absorbing the vast array of companies and products. As someone new to the world of security (but very well-versed in the field of UX), I was afforded an opportunity to look at an entire industry with a fresh perspective. One of the unique challenges facing the growing world of user experience professionals is knowing just enough about a target user group to create compelling solutions without being too "in the weeds." In my experience, being too close to a particular industry or audience segment can prevent the more objective approach that a seasoned designer can, and should, bring to a product. That said, there were some interesting trends as well as some areas that could benefit from the thematic undercurrent of "Change" presented at RSA. I focused my research on 51 companies spanning multiple verticals, sizes, and problem sets—and because the majority were not direct Endgame competitors, the true purpose of my research was to understand more about how the industry thinks and to find key areas of improvement for the field of UX.

 

Color as a key component

Color played a large part in virtually every product, whether by choice or chance. Color palettes were dominated by bold hues that usually included black, gray, red, orange, and blue. Yellow, purple, and green were used far less frequently, and likely for good reason. Traditionally, black and gray represent simplicity, prestige, and balance, with red and orange representing importance, danger, caution, and change. Blue will always represent strength and trust. On the flip side, yellow, green, and purple tend to represent sunshine and warmth, growth and fertility, and magic and mystery, unlikely traits in the security industry. Still, some companies utilized these weaker palette choices in their products, possibly without a true understanding of the "baggage" they bring.

Outside of content color, background color use went one of two ways: either dark content on a light background, the design paradigm of roughly 80% of companies, or the much less common light-on-dark construct. Neither is "better" or "correct" in application development; however, the former tends to be more common in the business-to-business realm and is far more familiar to business-centric application users. When I asked the companies that had chosen the less common light-on-dark approach, they said they generally did so either to differentiate themselves or to target a very specific population of their market. Whether either outcome materializes remains to be seen. These companies were all young start-ups, clearly taking a bit of a risk.

 

Maps as presentation vehicles

There were a multitude of products that featured some sort of map – whether network, geographic, server, GPS, Sankey, or tree – you name it, there was a map for it. This was both good and bad. For those companies that did it well, the maps provided a much-needed visualization of data that wouldn't fare well in a tabular or list format. When a security professional needs a bird's-eye view of where their vulnerabilities lie, a visual representation rather than a list of IP addresses may allow them to comprehend what requires their attention in a fraction of the time. However, the maps started to suffer in situations where their presence had no clear purpose. Several products had unnecessary animations. Others were so small that the corresponding data and labels overlapped, rendering the graphic unusable. I saw quite a few stuck into a corner of a dashboard simply to fill an otherwise empty space. The D3 collapsible tree map was extremely popular, often at the cost of legibility and a clear understanding of the complexity of the processes that the visualizations were supposed to clarify.

 

Features as framework

Perhaps the greatest challenge I found in the majority of products from both small and large companies, but particularly the industry behemoths, was the lack of a clear, well-thought-out information architecture (IA), particularly as it related to feature development and organization. There is a common misunderstanding that more features equate to better "sellability," particularly in products that like to position themselves head-to-head with their competitors. In the industry, this is often referred to as feature bloat, and time and again it presents itself in products that are designed by product management, marketing, and/or engineering. Generally, these are the individuals who are most removed from the end user. It's the idea that if some is good, more must be better, paired with the false assumption that commanding a big price tag means being able to do a lot. We see this as the mark of success in many industries, including the automobile, electronics, and vacation/travel sectors.

However, in an industry where time is critical and decision-making is crucial (and competitors are abundant), the feature bloat present in many products shown on the floor can detract from product success and may actually make products harder to use when time is of the essence. Think scalpel over Swiss army knife, especially if you're a startup.

 

What does this mean?

The good news is that UX is starting to make inroads in the security industry and this is an exciting time to be talking about UX in this massive field. Fully bringing UX to an entire product and team takes time, but there are three things that every company can start doing now.

  • First, know your audience and your brand. Figure out to whom you want to sell and for whom you want to build (hint: they may not be the same person). What does your company stand for? What are your core values and selling points? What problems are you solving, and for whom? How are you solving them? Then figure out what it is that makes your company and your product different from everyone else. This is your own brand pyramid. Ask yourself with each new feature that gets proposed: "Does this align with our core strategy, and does our product really need this? Does this solve a specific problem for the user?" Don't assume that you know the answer simply because you work in marketing or are an engineer. If your answer is "no" and/or the feature doesn't align with your original brand pyramid, it's extraneous at best, and distracting or detrimental at worst.
  • Second, don't be afraid to be different—but not so different that people don't even understand it. This is where your UX team needs to understand how to do good user research and then analyze that research. Don't just comb your analytics—watch people use your product. Don't just ask your users what they need—it's likely that they actually won't be able to tell you. Don't assume every user is of a certain demographic and will like some wacky color scheme. Instead, try to understand what it is they want to do with your product. Have an open conversation around their roles in their organizations and the problems they face in those roles. Seek ways to create solutions they wouldn't have thought of, and then iterate on how to best manifest them within the product interface without sacrificing usability.
  • Finally, offer unique and targeted solutions even if it means having more than one product. It is better to have several separate but logically connected solutions than it is to have a bloated product with many layers of navigation and too many features. If possible, create roles within the product and give those roles specific policies that can hide data and modules when a particular user does not need them. This may seem obvious, but again, when a feature is proposed, ask “does every user in my system need this and if so, are they all using it the same way?” Chances are, they aren’t.

Interestingly enough, on several occasions at RSA I heard the question “what products will this replace?” The end goal of any product should be to solve problems, not displace the competition. If a competitor’s product already solves a user’s problem, then your company is facing an uphill battle if the only goal is to unseat that product. Instead, ask if there is a more unique way to solve that same problem. Perhaps there is a different problem worth solving. Seek the blue ocean. As Apple would say, “Think Different.” Apple wasn’t successful because Apple wanted to outsell Microsoft. Apple was successful because Apple wanted to make products that solved users’ problems. They’ve done this by investing the necessary resources into their user experience. They’ve aligned their business with user needs. Sounds simple—but it takes dedication and time.

In the end, UX does take effort. It can feel like starting over. In some ways, it is. However, in every other industry that has embraced it, especially industries inundated with solutions (think healthcare, education, mobile development), it's often the difference between an "ok product" and a market success. Even if your organization has already invested a lot of time in your existing products, as RSA taught us, it's never too late to "Change".

How the Sino-Russian Cyber Pact Furthers the Geopolitical Digital Divide


As I wrote at the end of last year, China and Russia have been in discussions to initiate a security agreement to tackle the various forms of digital behavior in cyberspace. Last Friday, Xi Jinping and Vladimir Putin formally signed a cyber security pact, bringing the two countries closer together and solidifying a virtual united front against the US. This non-aggression pact is just one of a series of cooperative agreements occurring in the cyber domain, and is indicative of the increasingly divisive power politics that will shape the polarity of the international system for decades to come.

Non-aggression pacts are not new, and by definition focus solely on preventing the use of force between their signatories. However, although they are structured to affect only bilateral relations, historically they have had significant international implications. By signaling ideological, political, or military intentions, non-aggression pacts can preclude similar levels of cooperation with other states. In fact, when states form neutrality pacts (which are similar to, but distinct from, non-aggression pacts), the probability of a state initiating a conflict is 57% higher than for those without any alliance commitments. Regardless of the make-up of a state's alliance portfolio—whether non-aggression or neutrality pacts, offensive or defensive alliances—a state's involvement in alliances of any kind increases the likelihood of that state initiating conflict. It would be a mistake to assume that pacts in the cyber domain are any different, as they serve as a similar signaling mechanism of affiliation in the international system. In fact, last week's cyber security pact has already prompted analogies to the Molotov-Ribbentrop Pact, the non-aggression treaty signed in 1939 between Germany and the USSR. While the public emphasis was on preventing conflict between the two signatories (which clearly didn't last), the pact contained a secret protocol dividing parts of Eastern Europe into German and Soviet spheres of influence. In short, while non-aggression pacts may appear pacifistic, rarely has that been the case historically.

Moreover, the Sino-Russian pact provides a forum for each state to further shape the guiding principles and norms of cyberspace away from its foundation of Internet freedom of information and access, and toward the norm of cyberspace sovereignty. Following the surveillance revelations beginning in 2013, global interest in the notion of cyberspace sovereignty has increased, largely aimed at limiting external interventions viewed as infringements on traditional notions of state sovereignty. On the surface, this merely extends the Westphalian notion of state sovereignty. However, authoritarian regimes such as Russia and China have co-opted the de jure legitimacy of state sovereignty to control, monitor, and censor information within their borders. This is orthogonal to the norms generally favored by Western democracies, and further divides cyberspace into two distinct spheres defined by proponents of freedom of information versus proponents of domestic state control. The Sino-Russian pact will likely only encourage greater fractionalization of the Internet along the lines of cyberspace sovereignty.

Finally, this pact must be viewed in the context of the growing trend of bilateral cyber security pacts. Japan and the US recently announced the Joint Defense Guidelines, which cover a wide range of cooperative activities targeted at the cyber domain and the promotion of international cyber norms. Just as the agreement with Japan is likely targeted at countering China, many states in the Middle East are requesting similar cooperation in light of the potential easing of Iranian sanctions. The Gulf Cooperation Council—a political and economic union of Arab states in the Middle East—is similarly pushing for a cyber security agreement with the US to help deter Iranian aggression in cyberspace. In short, these cooperative cyber security agreements are indicative of the larger power politics that shape the international system. States are increasingly jockeying for position in cyberspace, signaling their intent and allegiance, with implications for the foreseeable future. The Sino-Russian agreement is only the latest in a string of cyber pacts reflecting the competing visions for cyberspace and the ever-growing geopolitical digital divide.

 

Open-Sourcing Your Own Python Library 101


Python has become an increasingly common language for data scientists, back-end engineers, and front-end engineers, providing a unifying platform for the range of disciplines found on an engineering team. One of the benefits of Python is that it allows software developers to choose from and make use of zillions of good code packages. Among the huge number of excellent Python packages, a data scientist may use Pandas for data manipulation, NumPy for matrix computation, matplotlib for plotting, SciPy for mathematical modeling, and scikit-learn for machine learning. Another benefit of using Python is that it allows developers to contribute their own code packages to the community or share a library with other Python programmers. At Endgame, library sharing is very common across projects for agile product development. For example, the implementation of a new clustering algorithm as a Python library can be used in multiple products with minimal adaptation. This tutorial will cover the basic steps and recommended practices for how to structure a Python project, package the code, distribute it over a Git repository (GitHub or a private Git repository), and install the package via pip.

For busy readers, I’ve developed a workflow diagram, below, so that you can quickly glance at the steps that I’ll outline in more detail throughout the post. Feel free to look back at the workflow diagram anytime you need a reminder of how the process works.

 

 

Workflow Diagram for Open-Sourcing a Python Library

 

Step One: Setup

Let’s suppose we are going to develop a new Python package that will include some exciting machine learning functionality. We decide to name the package "egclustering" to indicate that it contains functions for clustering. In the future, if we are to develop a new set of functions for classification, we could create a new package called "egclassification". In this way, functions designed for different purposes are organized into different buckets. We will name the project folder on the local computer as "eglearning". In the end, the whole project will be version controlled via Git, and be put on a remote Git repository, either GitHub or a private remote repository. Anyone who wants to use the library would just need to install the package from the remote repository. 

Term Definitions

Before we dig into the details, let’s define some terms:

  • Python Module: A Python module is a .py file that contains classes, functions, and/or other Python definitions and statements. More detailed information can be found here.
  • Python Package: A Python package includes a collection of modules and an __init__.py file. Packages can be nested at any depth, provided that the sub-directories contain their own __init__.py file.
  • Distribution: A distribution is one level higher than a package. A distribution may contain one or multiple packages. In file systems, a distribution is the folder that includes the folders of packages and a dedicated setup.py file.

 

Step Two: Project Structure

A clearly defined project structure is critically important when creating a Python code package. Not only will it present your work in an organized way and help users find valuable information easily, but it will also be much easier to add new packages or files in the future if the project scales.

I will take the recommendation from "Repository Structure and Python" to structure the new project, adding only a new file called README.md, the introductory file used on GitHub, as shown below.

README.rst
README.md
LICENSE
setup.py
requirements.txt
egclustering
            __init__.py
            clusteringModuleLight.py (This py file contains the code.)
            helpers.py
docs
            conf.py
            index.rst
tests
            test_basic.py
            test_advanced.py 

The project structure is well explained on the page referenced above. Still, it might be helpful to emphasize a few points here:

  • setup.py is the file that tells a distribution tool, such as Distutils or Setuptools, how to install and configure the package. It is a must-have.
  • egclustering is the actual package name. How would we (or a distribution tool) know that? Because it contains an __init__.py file. The __init__.py file can be empty, or contain statements for initialization activities (a short example follows this list).
  • clusteringModuleLight.py is the core file that defines the classes and functions. A single py file like that is called a module. A package may contain multiple modules. A package may also contain other packages, namely sub-packages, as long as there is a __init__.py included in a package folder. A project may contain multiple packages as well. For instance, we may create a new folder on par with "egclustering" called "egclassification" and put a new __init__.py under it.
  • Once you find a structure you like, it can serve as a template for future projects. You only need to copy and paste the whole project folder and give it a new project name. More advanced users can try a template tool such as cookiecutter.
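As an example of a non-empty __init__.py, the sketch below re-exports the package's main class so users can write "from egclustering import ClusterModel". ClusterModel is a hypothetical name standing in for whatever clusteringModuleLight.py actually defines:

        # egclustering/__init__.py
        # Re-export the package's public interface; the class name is hypothetical
        from .clusteringModuleLight import ClusterModel

        __all__ = ["ClusterModel"]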

 

Step Three: Set Up Git and a GitHub (or Private GitHub) Repository

Press Ctrl+Alt+T to open a new terminal, and type in the following two commands to install Git on your computer, if you haven't done so already:

            sudo apt-get update
            sudo apt-get install git

If the remote repository will be on GitHub (or any other source code host, such as bitbucket.org), open a web browser and go to github.com, apply for an account, and create a new repository with a name like 'peterpan' in my case. If the remote repository will be on a private GitHub, create a new repository in a similar way. In either situation, you will need to tell GitHub your public key so that you can use ssh protocol to access the repository. 

To generate a new pair of ssh keys (private and public), type the commands in the terminal:

            ssh-keygen -t rsa -C "your_email@example.com"
            eval "$(ssh-agent -s)"
            ssh-add ~/.ssh/id_rsa

Then go to the settings page of your GitHub account and copy and paste the content of the .pub file into a new key. The details of generating ssh keys can be found on this settings page.

You should now have a new repository on GitHub ready to go. Click on the link of the repo and it will open the repo's webpage. At the moment, you only have a master branch. We need to create a new branch called "develop" so that all the development will happen on the "develop" branch. Once the code reaches a level of maturity, we put it on "master" branch for release.

To do that, click "branch", and in the blank field, type "develop". When that's done, a new branch will be created. 

 

Step Four: Initiate the Local Git and Sync with the Remote Repository

So far, we have installed Git locally to control source code versions, created the skeleton structure of the project, and set up the remote repository that will be linked with the local Git. Now, open a terminal window and change directory (command 'cd') into the project folder (in my case, ~/workspace/peterpan). Type:

            git init
            git add .  

The period "." after "git add" tells Git to add the current folder, and everything under it, to version control.

If you haven't done so already, you will need to tell Git who you are. Type:

            git config --global user.name "your name"
            git config --global user.email "your email address"

Now let's tell local Git what remote repository it will be associated with. Before doing that, we need to get the URL of the remote repository so that the local Git knows where to locate it. On your browser, open the remote Git repository webpage, either on Github or your private GitHub. On the bottom of the right-side panel, you will see URL in different protocols of https, SSH, or subversion. If you're using GitHub and your repository is public, you may choose to use the https URL. Otherwise, use the SSH URL. Click the "copy to clipboard" button to copy the link.

In the same terminal, type:

            git remote -v 

to check what remote repositories you currently have. There should be nothing.

Now use the copied URL (which in my case is git@github.com:richardxy/peterpan.git) to construct the command below. "peterpanssh" is the name I gave to this specific remote repository; it helps the local Git identify which remote repository we are dealing with.

            git remote add peterpanssh git@github.com:richardxy/peterpan.git

When you type the command "git remote -v" again, you should see that the new remote repository has been registered with the local Git. You can add more remote repositories in this way using the "git remote add" command. If you would like to delete a remote repository, which basically means "break the link between the local Git and the remote repository," you can run "git remote rm <repository name>", such as:

            git remote rm peterpanssh

If you don't like the current name of a repository, you can rename it by using the following command.

            git remote rename <oldname> <newname>, such as:
            git remote rename peterpanssh myrepo

At the moment, the local Git repository has only one branch. Use “git branch” to check, and you will see “master” only. A better practice is to create a “develop” branch and develop your work there. To do this, type:

            git checkout -b develop

Now type “git branch” again and hit enter in the terminal window, and you will see the branch “develop” with an asterisk attached ahead of it, which means that the branch “develop” is the current working branch.

Now that we have linked a remote Git repository with the local Git, we can start synchronizing them. When you created the new repository on the remote Git (GitHub or your company's private Git repository), you may have opted to add a .gitignore file. At the moment, the .gitignore file exists only at the remote repository, not at the local one. So we need to pull it to the local repository and merge it with what we have locally. To do that, we use the command below:

            git pull peterpanssh develop 

Of course, peterpanssh is the name of the remote repository registered with the local git. You may use your own name.

"Git pull" works fine in small and simple projects like this. But when working on a project that has many branches in its repository, the separate commands "git fetch" and "git merge" are recommended. More advanced material can be found in the git-pull documentation and Mark's blog.
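For this project, the equivalent two-step form would be:

            git fetch peterpanssh
            git merge peterpanssh/develop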

Once the local Git repository has everything the remote Git repository has (and more), we can commit and push the contents in the local Git to the remote Git.

The reason for committing to Git is to put the source code under Git's version control. The workflow related to committing usually includes:

Modify code -> Stage code -> Commit code

So, before we actually commit the code, we need to stage the modified files. We do this to tell Git what changes should be kept and put under version control. The easiest way to stage the changes is to use:

            git add -p

That will bring up an interactive session that presents you with all the changes and lets you decide to stage them or not. As we haven't made many changes so far, this interactive session should be short. Now we can enter:

            git commit -m "initial commit"

The letter "m" means "message", and the string after "-m" is the message to describe the commit.

After committing, the staged changes (by the "git add" command) are now placed in the local Git repository. The next step is to push it to the remote repository. Using the command below will do this:

            git push peterpanssh HEAD:develop

In this case, "peterpanssh" is the remote repository name registered with the local Git, and "develop" is the branch that you would like to push the code to. 

 

Step Five: Develop the Software Package

So far, we have built the entire infrastructure for hosting the local project, controlling the software versions both locally and remotely. Now it's time to work on the code in the package. To put the changes under version control (when you’re done with the project, or any time you think it’s needed), use:

            git add -p
            git commit -m "messages"
            git push repo_name HEAD:repo_branch

 

Step Six: Write setup.py

When your code package has reached a certain level of maturity, you can consider releasing it for distribution. A distribution may contain one or multiple packages that are meant to be installed at the same time. A designated setup.py file is required to be present in the folder that contains the package(s) to be distributed. Earlier, when we created the project structure, we created an empty setup.py file. Now it's time to populate it with content.

A setup.py file contains at least the following information:

           from setuptools import setup, find_packages
           setup(name='eglearning',
           packages=find_packages()
           )

There are a few distribution tools in Python. The standard tool for packaging is distutils, and setuptools is an upgrade of distutils with more features. In the setup() function, the minimum information we need to supply is the name of the distribution and the packages to be included. The function find_packages() recursively walks the current folder and its sub-folders to collect package information, as long as an __init__.py file is found in each package folder.
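For example, given a hypothetical layout like the one below, find_packages() would pick up the eglearning package because it contains an __init__.py (learner.py is a placeholder name):

           peterpan/
                setup.py
                eglearning/
                      __init__.py
                      learner.py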

It is also helpful to provide metadata for the distribution, such as the version, a description of what the distribution does, and author information. If the distribution has dependencies, it is recommended to include the installation requirements in setup.py. The file may then end up looking like this:

           from setuptools import setup, find_packages
           setup(name='eglearning',
                      version='0.1a',
                      description='a machine learning package developed at Endgame',
                      packages=find_packages(),
                      install_requires=[
                                 'Pandas>=0.14',
                                 'Numpy>=1.8',
                                 'scikit-learn>=0.13',
                                 'elasticsearch',
                                 'pyes',
                      ],
           )

To write a more advanced setup.py, the Python documentation and this web page are good resources.

When you are done with setup.py, commit the change and push it to the remote repository by typing the following commands:

           git add -p
           git commit -m 'modified setup.py'
           git push peterpanssh HEAD:develop

 

Step Seven: Merge Branch Develop to Master

According to Python engineer Vincent Driessen, "we consider origin/master to be the main branch where the source code of HEAD always reflects a production-ready state." When the code in the develop branch enters the production-ready state, it should be merged into the master branch. To do this, simply type in the terminal under the project directory:

           git checkout master
           git merge develop

Now we can push the master branch to the remote repository:

           git push peterpanssh

 

Step Eight: Install the Distribution from the Remote Repository

The Python package management tool "pip" supports installing a package distribution from a remote repository such as GitHub, or from a private remote repository. pip currently supports cloning over the git, https, and ssh protocols. Here we will use ssh.

You may choose to install from a specific commit (identified by its commit hash) or from the latest commit on a branch. To specify a commit for cloning, type:

           sudo pip install -e git://github.com/richardxy/peterpan.git@4e476e99ce2649a679828cf01bb6b3fd7856281f#egg=MLM0.01

In this case, "github.com/richardxy/peterpan.git" is the ssh clone URL with the ":" after ".com" replaced by "/". This is easy to miss, and the command won't work if you omit the replacement. The "egg" parameter is also required; its value is up to you.

If you opt to clone the latest version in the branch (e.g. “develop” branch), type:

           sudo pip install -e git://github.com/richardxy/peterpan.git@develop#egg=MLM0.02

You only need to specify the branch name after "@" and before the "#egg=" parameter. This is my preferred method.

Then pip will check if the installation requirements are met and install the dependencies and the package for you. Once it's done, type: 

           pip freeze

to find the newly installed package. You will see something like this:

           -e git://github.com/richardxy/peterpan.git@2251f3b9fd1b26cb41526f394dad81016d099b03#egg=eglearning-develop

Here, 2251f3b9fd1b26cb41526f394dad81016d099b03 is the hash of the latest commit.

Type the command below to create a requirements document that registers all of the installed packages and versions. 

           pip freeze > requirements.txt

Then open requirements.txt, replace the checksum with the branch name, such as "develop", and save it. The reason for doing this is that the next time a user tries to install the package, there may be a new commit, so the recorded hash would no longer point to the latest code. Using the branch name will always point to the latest commit in that branch.
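After the edit, the line shown above would read:

           -e git://github.com/richardxy/peterpan.git@develop#egg=eglearning-develop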

One caveat: if virtualenv is used, the pip freeze command should look like this so that only the configurations in the virtual environment will be captured:

           pip freeze -l > requirements.txt

 

Conclusion

This tutorial covers the most fundamental and essential procedures for creating a Python project: applying version control during development, packaging the code, distributing it over code-sharing repositories, and installing the package by cloning the source code. Following this process can help data scientists without formal computer science training get more comfortable using well-known collaborative tools like Git and Python for software development and distribution.


Stop Saying Stegosploit Is An Exploit


Security researcher Saumil Shah recently presented "Stegosploit" (slides available here). His presentation received a lot of attention on several hacker news sites, including Security Affairs, Hacker News, and Motherboard, which reported that users could be exploited simply by viewing a malicious image file in their web browser. If that were true, this would be terrifying.

“Just look at the image and you are HACKED!” – thehackernews

Here’s the thing. That is not what is happening with Stegosploit. Saumil Shah has created a “polyglot”. A polyglot is defined as “a person who knows and is able to use several languages,” but in the security world, the term can refer to a file that is a valid representation of two different data types. For example, you can concatenate a RAR file to the end of a JPG file. If you double click the JPG image, a photo pops up. If you then rename that JPG file to a .rar file, the appended RAR file will open. This is due to how the JPG and RAR file formats specify where the file begins. Stegosploit is using this same premise to embed JavaScript code inside of an image file, and obscure the JavaScript payload within pixel data.

This is still an interesting vector due to the difficulty of detection. It adds a layer of obfuscation, relying on security through obscurity to avoid detection.

Embedding code inside images forces a defensive product not only to process every packet, but also to inspect the individual artifacts extracted from the connection. Security through obscurity is widely considered ineffective as a defense, but it is important to note that in order to identify even the most rudimentary steganography, a defender has to analyze every image file, which is computationally expensive and increases the cost to defenders.

What is really interesting here is that Saumil Shah was actually rather forthcoming about this during his talk, clearly announcing that he was using a loader to deliver the payload, although that may not have been obvious to some of the observers. The exploit was delivered because the attacker sent malicious, obfuscated JavaScript to the browser. Stegosploit simply obfuscates an attack that could have been executed anyway. Just looking at an image will not exploit your web browser.

 

 

In the screenshot above, taken from the recording of the conference talk, Saumil is showing the audience the exploit "loader". This is where a traditional JavaScript payload would be injected. The operative text in that screenshot is <script src="elephant3.jpg"></script>, which takes a valid image file and interprets it as JavaScript. It simply injects the malicious code into a carrier signal so it looks innocuous. While this may seem like splitting hairs, there is an extremely important distinction between "looking at this photo will exploit your machine" and "this photo is camouflage that hides an exploit that has already occurred."

All that being said, legitimate image exploits have been discovered in the past. Most notably, MS04-028 actually exploited the JPG processing library. In this case, loading an image into your browser would quite literally exploit your machine. This was tagged as a critical vulnerability, and promptly patched.

Stegosploit is an obfuscation technique to hide an exploit within images. It creates a JavaScript/image polyglot. Don’t worry, you can keep looking at captioned cat photos without fear.

Much Ado About Wassenaar: The Overlooked Strategic Challenges to the Wassenaar Arrangement’s Implementation


In the past couple of weeks, the US Bureau of Industry and Security (BIS), part of the US Department of Commerce, announced the potential implementation of the 2013 changes to the Wassenaar Arrangement (WA), a multinational arrangement intended to control the export of certain "dual-use" technologies. The proposed changes place additional controls on the export of "systems, equipment or components specially designed for the generation, operation or delivery of, or communication with, intrusion software." Many in the security community have been extraordinarily vocal in opposition to this announcement, especially with regard to the newly proposed definition of "intrusion software" in the WA. This debate is important and should contribute to the open comment period requested by the BIS, which ends July 20. While the WA appears to be a legitimate attempt to control the export of subversive software, the vague wording has raised alarms within the security community.

For decades the security community has developed and studied exploit and intrusion techniques to understand and improve defenses. Like many research endeavors, it has involved the development, sharing, and analysis of information across national boundaries through articles, conferences, and academic publications. This research has successfully produced countermeasures like DEP (Data Execution Prevention) and ASLR (Address Space Layout Randomization), which mitigate numerous exploits seen in the wild. These kinds of countermeasures resulted directly from exploitation research, the very research now captured by the new WA definition. While a robust debate on the WA's implications is useful for the security community, what seems to be lacking is a strategic-level discussion on whether these kinds of arrangements even have the potential to achieve the desired effect. The debate over the definition and wording of key terms is indicative of the larger hurdles these kinds of multinational arrangements encounter. This is especially problematic when building upon legacy agreements. By most measures, the WA simply renamed the COCOM (Coordinating Committee for Multilateral Export Controls) export control regime, a Cold War relic designed to limit the export of weapons and dual-use technologies to the Soviet bloc. The Cold War ended a quarter of a century ago, and yet agreements like the WA are still built on that same mentality and framework. Below are four key areas that impact the ability of the WA (and similar agreements) to achieve the desired effect of "international stability" and that should be considered when seeking to limit the diffusion of strategically important and potentially destructive materials.

1. Members only: There are only 41 signatories to the WA (see the map below*). While to some that may seem extensive, it reflects less than a quarter of the states in the international community. In layman’s terms, three-quarters of the countries will be playing by a completely different set of rules and regulations, putting those who implement it at a competitive disadvantage – economically and in national security. Moreover, it means that three-quarters of the countries can export these potentially dual-use technologies – including countries like China, Iran, North Korea – rendering it unlikely to achieve the desired effect. To be clear, this concern is not just about US adversaries, but also about allies that could gain a competitive advantage. Israel, not a signatory of the WA, has a thriving cyber security industry and may increasingly attract more investment (and innovation!) in light of implementation of the WA.

2. Credible commitments: International cooperation depends heavily on credible commitments and the ability of states to implement the policies embedded in a treaty domestically. As membership rises, so too does diversity in domestic political institutions and foreign policy objectives. It would be startling (to say the least) if Western European countries and Russia pursued implementations that produced uniform adherence to the WA. Even within Western Europe, elections may usher in a new way of approaching digital security. The recent UK elections with a Tory majority may alter legislation pertaining to surveillance issues, which may run counter to the WA.

3. Ambiguity of language: The most unifying theme of the security community’s opposition to the WA is the vague and open-ended definition of intrusion software. By some estimates, anti-virus software and Chrome auto-updates may fit within the definition. The government will likely receive many comments on the definition over the 60-day response period. It is strongly in the best interest of all parties involved if greater specificity is included. Otherwise, there will continue to be headlines vilifying the government for classifying everything digital as a weapon of war, which clearly is not the case. As we grapple with securing systems globally and ensuring our defenses can prevent advanced threats, one might imagine a future where loose policy definitions move software and techniques underground or off-shore for fear of prosecution. This could be counterproductive to understanding and securing the new and changing connected world.

4. Rudderless ship: The most successful international agreements have relied heavily on global leadership, either directly by a hegemonic state or indirectly through leadership within a specific international governmental organization (IGO). This leadership is essential to ensure compliance with, and norm diffusion of, the regulations inherent in a treaty or agreement. The WA lacks any form of IGO support and certainly lacks any hegemonic or bipolar leadership. Even if this leadership did exist, the cyber domain simply lends itself to obfuscation and manipulation of data and techniques, rendering external monitoring difficult. Moreover, China and Russia continue to push norms completely orthogonal to those of the WA, including cyber sovereignty. Without global acceptance and agreement on these foundational concepts, the WA has little chance of adherence even if there is domestic support for the verbiage (which clearly is not currently the case).

In short, the hurdles the WA will encounter when trying to achieve its objectives are typical of the two-level game that hinders international cooperation. States must balance international polarity and norms on the one hand with domestic constituents, institutions, and norms on the other. Without the proper conditions at both the domestic and international levels, agreements have little chance of actually achieving their objectives. If the goal is truly international stability, human rights, and privacy, the WA may not be the optimal means of achieving it. As organizations, researchers, and activists continue to contribute to the critical debate about the value and feasibility of the WA, the policy and security communities should take advantage of the open comment period to remember that the complexity and dynamism of the current digital landscape requires novel thinking beyond obsolete Cold War approaches.

*Wassenaar Arrangement Participants (source: https://www.armscontrol.org/factsheets/wassenaar)

OPM Breach: Corporate and National Security Adversaries Are One and the Same


On June 5, 1989, images of a lone person standing ground in front of Chinese tanks in Tiananmen Square transfixed the world. On the same day twenty-six years later, the United States government announced one of the biggest breaches of personnel data in history. This breach is already being attributed to China. China has also recently been stepping up its efforts to censor any mention of the Tiananmen Square massacre. The confluence of these two events – censorship of a pivotal human rights incident coupled with the theft of four million USG personnel records – should clarify beyond a doubt China’s intentions and vision for what constitutes appropriate norms in the digital domain. It is time for all of the diverse sectors and industries of the United States – from the financial sector in New York City to the tech industry in Silicon Valley to the government in Washington – to recognize the gravity of this common threat and commit to a legitimate public-private partnership that extends beyond lip service. As the OPM breach demonstrates, the United States government faces the same threats and intellectual property theft as the financial, tech, and other private sector industries. It’s time to move beyond our cultural divisions and unify against the common adversaries who are the true threats to privacy, security, democracy and human rights across the globe.

I attended a "Cyber Risks in the Boardroom" event yesterday in New York City. More often than not, these kinds of cybersecurity conferences include one panel of private sector experts complaining about government regulations, infringements on privacy, and a failure to grasp the competitive disadvantage US companies face thanks to proposed legislation. I have even heard the USG referred to as an "advanced persistent threat." A government panel generally follows, bemoaning the inability of the private sector to grasp the magnitude of the threat. There is often an anecdote about an unnamed company that refuses government assistance when a breach has been identified, and there's the obligatory attempt at humor to assuage fears that the government is really not interested in reading your email or tracking your Snapchat conversations.

That did not happen yesterday. The one comment that struck me the most was a call for empathy between the private and public sectors. In fact, at a conference held in the heart of the financial capital of the world, panel after panel reiterated the need for the government and private sector to work together to ensure the United States’ competitive economic advantage. The United States economy and its innovative drive is the bedrock of national security. The financial sector – one of the largest targets of digital theft and espionage – seems to grasp the essential role the government can and should play in safeguarding a level digital playing field. Nonetheless, even in this hospitable environment, cultural and linguistic hurdles, not to mention trust issues, continue to limit cooperation between the financial sector and government.

News of the OPM breach broke just as I was leaving the conference. Many are attributing the breach to China. As someone who lives at the intersection of technology and international affairs, it is impossible to ignore the irony. There continues to be heated debate about US surveillance programs, as well as potentially impending legislation on intrusion software. These debates will not likely end soon, and they are part of the democratic process and freedom of speech that is so often taken for granted. Compare that to China's expansive censorship and propaganda campaign, which not only forces US companies operating in China to censor any mention of Tiananmen Square, but limits any mention of activities that may lead to collective gatherings. Or compare that to China's 50 Cent Party, a group of individuals paid by the Chinese government to provide positive social media content about the government. (Russia has a similar program, which extends internationally, including spreading disinformation on US media outlets.) Perhaps even more timely, China is censoring online discussion about the horrific cruise ship capsizing earlier this week on the Yangtze River. This is a very similar approach to that taken following the 2011 train crash, which likewise led to censorship of any negative media coverage of the government's response.

The enormous and historic OPM breach, revealed on the 26th anniversary of the Tiananmen Square protests, should cause the disparate industries and sectors that form the bedrock of US national security to pause…and empathize. Combating common adversaries that threaten not only national security, but also freedom of information and speech, requires a united front. The private and public sectors are much stronger working together than apart. Despite significant cultural differences, there are core values that unite the private and public sectors, and it’s time to put aside differences and work as a cohesive unit against US corporate and national security adversaries—for they are truly one and the same. This does not mean that debates about privacy and legislation should subside. On the contrary, those debates should continue, but must become constructive forms of engagement rather than divisive editorials. Many – especially those in the financial sector – seem to grasp the appropriate role for the government in handling these threats. It’s time to put aside differences and pursue constructive and united private-public sector collaboration to deter the persistent theft of IP and PII information at the hands of the adversaries we all face together.

The Digital Domain’s Inconvenient Truth: Norms are Not the Answer


To say the last week has been a worrisome one for any current or former federal government employee is a vast understatement. Now, with this weekend's revelations that the data stolen in the OPM breach potentially included SF-86 forms as well—the extraordinarily detailed forms required to obtain a security clearance—almost every American is in fact indirectly impacted, whether they realize it or not. As China's repository of data on United States citizens continues to grow, it's time for the United States to adjust its foreign digital policy to reflect modern realities. Despite this latest massive digital espionage, the United States continues to pursue a policy based largely on instilling global norms of appropriate behavior in cyberspace, the success of which depends on all actors playing by the same rules. Norms only work when all relevant actors adhere and commit to them, and the OPM breach, as well as other recent breaches attributed to Russia, North Korea, and Iran, confirms that each state is playing by its own playbook for appropriate behavior in the digital domain. The U.S. needs to adopt a new approach to digital policy, or else this collective-action problem will continue to plague us for the foreseeable future. Global norms are not the silver bullet that many claim.

The Problem with Norms in a Multi-Polar International System

In recent testimony before Congress, the State Department Coordinator for Cyber Policy, Christopher Painter, outlined the key tenets of US foreign policy in the cyber domain. During this testimony, he highlighted security and cybercrime, with norms as a key approach to tackling that issue. He explicated the following four key tenets (abridged) on which global norms should be based:

1. States cannot conduct online activity that damages critical infrastructure.

2. States cannot prevent CSIRTs from responding to cyber incidents.

3. States should cooperate in investigations of online criminal activity by non-state actors.

4. States should not support the theft of IP information, including that which provides competitive advantage to commercial entities.

While these are all valid pursuits, the OPM breach confirms the age-old tenet that states are self-interested, and therefore quite simply are not going to adhere to the set of norms that the United States seeks to instill. The United States government is not the only one calling for "norms of responsible behavior to achieve global strategic stability". Microsoft recently released a report entitled International Cybersecurity Norms, while one of the most prominent international relations academics has written about Cultivating International Cyber Norms. Rather than focusing on norms, policy for the digital domain must reflect the economic, political, military, and diplomatic realities of international relations. It should not be viewed as a stove-piped arena for cooperation and conflict across state and non-state actors. For example, the omnipresent tensions in the South China Sea are indicative of China's larger, cross-domain global strategy. Russian rhetoric and activities in Eastern Europe are similarly a source of great consternation, with digital espionage a key aspect of Russia's foreign policy behavior. These cross-domain issues absolutely spill over into the digital domain and therefore hinder the chance that norms will be successful. These tensions are exacerbated by the completely orthogonal perspectives of many authoritarian regimes on the desired digital end-state, which center on the notion of cyber sovereignty. The issues are further confounded when these states maintain economic systems predicated on state-owned enterprises, which are essentially extensions of the state, meaning that IP theft directly supports the government and its favored quasi-commercial entities. Finally, the notion of credible commitments is again an essential factor in norm diffusion. Because of the surveillance revelations of recent years, other states remain cautious and dubious that the United States will itself adhere to these norms. This lack of trust further undermines the set of norms that the United States is advocating.

Towards a New Approach: Change the Risk Calculus for the Adversary

Instead of a norms-based approach, formal, multi-actor models that focus on calculating the risks and opportunities of actions from an adversary's perspective could greatly contribute to more creative (and potentially deterrent) policies. Thomas Schelling's research on bargaining and strategy is emblematic of this approach, expanding on the interdependence and the strategic interplay that occurs between actors. Mancur Olson's work on collective action similarly remains especially applicable when pursuing policies that require adherence by all actors within a group. These frameworks account for the preferences of multiple actors in a decision-making process and help identify the probability of preferences across a spectrum of options. If done well, incorporating multi-actor preferences not only provides insights into why some actors pursue policies or activities that seem irrational to others, but it also forces the analyst or policymaker to view the range of preferred outcomes from the adversary's perspective. Multi-actor models advocate for a strong understanding of activities that can favorably impact the expected utility and risk calculus of adversaries. The United States has taken some steps in this direction, and it should increasingly rely on policies that raise the costs of a breach for the adversary. For example, the indictment of the five PLA officers last year is a positive signal that digital intrusions will incur punishment. In addition to punitive legal responses targeted at adversaries, greater technical capabilities that hunt the adversaries within the network can also raise the cost of an intrusion. If the cost of entry outweighs the benefits, adversaries will be much less likely to attack at will. Until then, attackers will steal information without any fear of retribution or retaliation, and the digital domain will remain anarchic. Finally, instead of focusing on global norms that give the competitive advantage to those who do not cooperate, digital cooperation should be geared toward allies, encouraging the synchronization of similar punitive legislation and responses in light of an attack. In this regard, cooperation can reinforce collective security, and focus on enabling the capabilities of allied states, not limiting those capabilities to allow adversaries the upper hand.

The United States continues to pursue policies that require global support and commitment in order to be effective, rather than focusing on changing the risk calculus for the adversary. The OPM breach—one that affects almost all former and current federal employees and their contacts and colleagues throughout their lives—is evidence that other states play by a different playbook. While the U.S. should continue its efforts to shape the digital domain as one that fosters economic development, transparency, equality and democracy, the reality is that those views are not shared by some of the most powerful states in the global community. Until that inconvenient truth is integrated into policy, states and state-affiliated groups will continue to compile an ever-expanding database of U.S. personnel and trade secrets, which not only impacts national security, but also the economic competitiveness on which that security is built.

Data Science for Security: Using Passive DNS Query Data to Analyze Malware


Most of the time, DNS services—which produce the human-friendly, easy-to-remember domain names that map to numerical IP addresses—are used for legitimate purposes. But they are also heavily used by hackers to route malicious software (or malware) to victim computers and build botnets to attack targets. In this post, I’ll demonstrate how data science techniques can be applied to passive DNS query data in order to identify and analyze malware.

A botnet is a network of hosts infected by malware to conduct nefarious activities, usually without the awareness of their owners. A command-and-control host hidden in the network communicates with the infected computers to give instructions and receive results. In such a botnet topology, the command-and-control host becomes the single point of failure: once its IP address is identified, it can easily be blocked, and all communication with the botnet would be lost. Therefore, hackers are more likely to use a domain name to identify the command-and-control host, and employ techniques like fast flux to switch the IP addresses mapped to a single domain name.

As data scientists at Endgame, we leverage data sets in large variety and volume to tackle botnets. While the data we analyze daily is often proprietary and confidential, there is a publicly available data set provided by Georgia Tech that documents DNS queries issued by malware across the years 2011 - 2014. The malware were contained in a controlled environment and had limited Internet access. Each and every domain name query was recorded, and if a domain name could be resolved, the corresponding IP address was also recorded.

This malware passive DNS data alone would not provide sufficient information to conduct a fully-fledged botnet analysis, but it does possess rich and valuable insights about malware behaviors in terms of DNS queries. I’ll explain how to identify malware based on this data set, using some of the methods the Endgame data science team employs daily.

Graphical Representation of DNS Queries

Here is the data set I'll examine. Each row is a record of a DNS query, including the date, the MD5 of the malware file, the domain name being queried, and the IP address if the query returned a result.
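A couple of hypothetical rows in that format (the pairings and IP addresses below are illustrative, not taken from the data set):

            2012-09-29  0398eff0ced2fe28d93daeec484feea6  xudunux.info  203.0.113.10
            2012-09-29  0398eff0ced2fe28d93daeec484feea6  rylicap.info  203.0.113.11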

What approach might enable the grouping of malware or suspicious programs based on the specific domain names they query? Since we have no information about the malware beyond these queries, conventional static analysis focusing on the binary files would not help here. Clustering using machine learning may work only if each domain name is treated as a feature, but that feature space would be very sparse, resulting in expensive computation.

Instead, we can represent the DNS queries as a graph showing which domain names each malware program queried, as displayed in Figure 1. Each malware program is labeled by an MD5 string. While Figure 1 only shows a very small part of the network, the entire data set can be transformed into a huge graph.

Figure 1. A small DNS query network

There are numerous advantages to expressing the queries as a graph. First, it expedites querying complex relationships. A modern graph database, such as Neo4j, OrientDB, or TitanDB, can efficiently store a large graph and conduct join queries that are normally computationally expensive for relational databases such as MS SQL Server, Oracle, or MySQL. Second, network analytic methods from a diverse range of scientific fields can be employed to analyze the data set and gain additional insights.

Graph Analysis on the Malware Network

The entire passive DNS data set covers several years, so I randomly picked a day during the data collection period and will present the analysis on the reduced data set. A graph was created out of a day’s worth of data, and the nodes include both domain names and malware MD5 strings. In other words, a node in the graph can either be an MD5 string, or a domain name, and an edge (or a connection) links an MD5 and a domain if the MD5 queries that domain name. The total number of nodes is 17,629, and the number of edges is 54,939. The average number of connections per node is about 3.

In my graph representation of DNS queries, there are two distinct sets of nodes: domain names and malware. A node in one set only connects with a node in the other set, and not one in its own set. Graph theory defines such a network as a bipartite graph, as shown in Figure 2. I wanted to split the graph into two graphs, one containing all the nodes of domain names, and the other containing only malware programs. This can be done by projecting the large graph onto the two sets of nodes, which creates two graphs. In each graph, two nodes are connected by an edge if they have connections to the same node of the other type. For example, domains xudunux.info and rylicap.info would be connected in the domain graph because both of them have connections with the same malware in the larger graph.

Figure 2. Bipartite graph showing two distinct types of nodes
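To make the projection concrete, below is a minimal networkx sketch, assuming the query records are available as (MD5, domain) pairs; the edge list shown is hypothetical:

            import networkx as nx
            from networkx.algorithms import bipartite

            # Hypothetical (malware MD5, queried domain) pairs
            queries = [
                ("0398eff0ced2fe28d93daeec484feea6", "xudunux.info"),
                ("0398eff0ced2fe28d93daeec484feea6", "rylicap.info"),
            ]
            G = nx.Graph()
            for md5, domain in queries:
                G.add_node(md5, kind="malware")
                G.add_node(domain, kind="domain")
                G.add_edge(md5, domain)

            # Project the bipartite graph onto each node set
            malware_nodes = {n for n, d in G.nodes(data=True) if d["kind"] == "malware"}
            malware_graph = bipartite.projected_graph(G, malware_nodes)
            domain_graph = bipartite.projected_graph(G, set(G) - malware_nodes)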

Let's look at the malware graph first. For the day 2012-09-29 alone, there are 9,876 unique malware recorded in the data set. First, I would like to know the topological layout of these malware and find out how many connected components exist in the malware graph.

A connected component is a subset of nodes where any two nodes are connected to each other by one or multiple paths. We can view connected components (or just components) as islands that have no bridge connecting each other.

The Python programming language has an excellent network analysis package called networkx, which includes a function to compute the number of connected components of a graph. Running that function, number_connected_components, shows there are 2,114 components in the 9,876-node graph, 1,619 of which are one-node components. Eleven components have more than 100 nodes. I will analyze those large components because the malware inside each may be variants of the same program.
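A sketch of that computation with networkx, where malware_graph is the projected malware graph from above:

            components = list(nx.connected_components(malware_graph))
            print(nx.number_connected_components(malware_graph))  # 2,114 on this day's data
            large = [c for c in components if len(c) > 100]       # the 11 large components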

Figure 3 shows four components of the malware graph. The nodes in each component are densely connected to each other but not to any other components. That means the malware assigned to a component clearly possess some similar characteristics that are not shared by the malware from other components. 

Figure 3. Four out of eleven components in the malware graph

Component 1 contains 201 nodes. I computed the betweenness centrality of the nodes in this component, which are all zeros, while the closeness centrality values are all ones. This indicates that each node has a direct connection to every other node in the component, meaning that each malware program queried exactly the same domain names as the others. This is a strong indication that all 201 malware are variants of a single malicious executable.
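The centrality computations are one-liners in networkx; a sketch, where nodes_c1 is a hypothetical set holding the 201 node IDs of component 1:

            component1 = malware_graph.subgraph(nodes_c1)
            betweenness = nx.betweenness_centrality(component1)  # all zeros here
            closeness = nx.closeness_centrality(component1)      # all ones here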

Let’s return to the large DNS query graph to find out what domains the malware targeted. Using a graph database like Neo4j or OrientDB, or a graph analytic tool like networkx, the search is easy. The result shows that the malware in component 1 were only interested in three domain names: ns1.musicmixa.net, ns1.musicmixa.org, and ns1.musiczipz.com.

I queried VirusTotal for each of the 201 malware in component 1. VirusTotal submits the MD5 to a list of scanning engines and returns the reports from those engines. Each report includes the engine's determination of the MD5 as either positive or negative. If it's positive, the report provides more information about what kind of malware the MD5 represents, based on the signature the scanning engine uses.

I assigned a score to each malware by computing the ratio of positive results to total results. The distribution of the scores is shown in Figure 4. The scanning reports imply that the malware is a Win32 Trojan.

Figure 4. Histogram of VirusTotal score of malware in Component 1

Using Social Network Analytics to Understand Unknowns

Not all components have as high a level of homophily as component 1. One component has 2,722 malware nodes and 681,060 edges. Of those 2,722 malware, 309 were not known to VirusTotal, while the remaining 2,413 had reports on the website. We need a way to analyze those unknown malware.

Social network analytic (SNA) methods provide insights into unknown malware by identifying known malware that are similar to the unknowns. The first step is to try to break the large component into communities. The concept of community is easy to understand in the context of a social network: connections within a community are usually much denser than those across communities, and members of a community tend to share some common trait, such as mission, geo-location, or profession. In this analysis, malware are connected if they queried the same domain, which can be interpreted as two malware exhibiting a common interest in a domain name. Therefore, we can expect that malware programs that queried similar domains form a community. Communities exist inside a connected component and differ from components in that communities still have connections between each other.

Community detection is a particular kind of data clustering within machine learning. There are a wide variety of methods for community detection in a graph. The Louvain method is a well-known, well-performing one that optimizes the modularity measure by partitioning a graph into groups of densely connected nodes. Applying the Louvain method to the big component of 2,722 nodes identifies 15 communities, with the number of nodes in each community shown in Figure 5.

Figure 5. Number of nodes in each community
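For reference, a minimal sketch of this step, assuming the python-louvain package is installed and big_component holds the 2,722-node subgraph:

            import community  # the python-louvain package
            from collections import Counter

            partition = community.best_partition(big_component)  # maps node -> community id
            sizes = Counter(partition.values())                  # number of nodes per community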

Let’s take a specific malware as an example. The MD5 of this malware is 0398eff0ced2fe28d93daeec484feea6, and the search of it on VirusTotal found no result, as shown in Figure 6. 

Figure 6. Malware not found on VirusTotal

I want to know what malware programs have the most similar behavior in terms of DNS queries to this unknown malware. By looking into the similar malware that we do have knowledge about, we could gain insights into the unknown one.

I found malware 0398eff0ced2fe28d93daeec484feea6 in Community 4, which contains 256 malware. To find the most similar malware programs, we need a quantitative definition of similarity. I chose the Jaccard index to compute how similar two sets of queried domains are.

Suppose malware M1 queried a set of domains D1, and malware M2 queried another set of domains D2. The Jaccard index of sets D1 and D2 is calculated as:

           J(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|

The Jaccard index ranges from 0 to 1, with 1 indicating an exact match.
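A one-function Python sketch of the same computation, where each argument is the set of domains a malware program queried:

           def jaccard(d1, d2):
               # Size of the intersection divided by size of the union
               if not d1 and not d2:
                   return 0.0
               return len(d1 & d2) / len(d1 | d2)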

Out of the 2,722 nodes in this large component, 100 malware programs have exactly the same domain queries as malware 0398eff, meaning their Jaccard indices against malware 0398eff are 1. However, only 9 of those malware are known to VirusTotal. The 9 malware are shown below.

Each of the 100 malware programs, including the 9 known ones, that have the same domain queries as malware 0398eff appears in community 4. The histogram of Jaccard indices is shown in Figure 7.

Figure 7. Histogram of Jaccard index for nodes in community 4

We can tell from the histogram that the malware programs in community 4 can generally be split into two sets. One set contains the 100 malware that have exactly the same domain queries as malware 0398eff, and the other set contains nodes that are much less similar to it. The graph visualization in Figure 8 demonstrates the split. Through this analysis, we have found that the 91 previously unknown malware behave similarly to some known malware.

This blog post demonstrates how I used DNS query data to conduct graph-based analysis of malware. A similar analysis can be done with the domain names to identify groups of domains that tend to be queried together by a malware program, which can help identify potentially malicious domains that were previously unknown.

Given the vast quantities of data those of us in the security world handle daily, data science techniques are an increasingly efficient and informative way to identify malware and targeted domains. While machine learning and clustering tend to dominate these kinds of analyses, graph methods rooted in social network analysis should increasingly become another tool in the data science toolbox for malware detection. Through the identification of communities, betweenness, and similarity scores, network analysis reveals not only connectivity, but also logical groupings and outliers within the network. Viewing malware and domains as a network provides a more intuitive approach to wrangling the big data security environment. Given the limited features available in passive DNS query data, graph analytic approaches supplement traditional static and dynamic approaches and elevate capabilities in malware analytics.

Meet Endgame at Black Hat 2015


 

Endgame will be at Black Hat!

Stop by Booth #1215 to:

 

GET AN ENDGAME ENTERPRISE DEMO

Sign up here for a private demo to learn how we help customers automate the hunt for cyber adversaries.
 

MEET WITH ENDGAME EXPERTS

Meet our experts and learn more about threat detection and data science. Check out the Endgame blog to read the latest news, trends, and research from our experts before you go.
 

EVERYONE NEEDS A SMART WATCH!

Enter to win an Apple or LG smart watch. Stop by the booth Wednesday, August 5 or Thursday, August 6 for a chance to win. We'll announce each day's winner on Twitter at 5pm PT.

Examining Malware with Python


Before I came to Endgame, I had participated in a couple of data science competitions hosted by Kaggle. I didn’t treat them as competitions so much as learning opportunities. Like most things in the data science community, these competitions felt very new. But now that I work for a security company, I’ve learned about the long history of CTF competitions meant to test and add to a security researcher’s skills. When the Microsoft Malware Challenge came along, I thought this would be a great opportunity to learn about new ways of applying machine learning to better understand malware. Also, as I’ve talked about before, the lack of open and labeled datasets is a huge obstacle to developing machine learning models to solve security problems. Here was an opportunity to work with an already prepared large labeled dataset of malware samples.

I gave a talk at the SciPy conference this year that describes how I used the scientific computing tools available in Python to participate in the competition. You can check out my slides or watch the video from that talk here. I tried to drive home two main points in this talk: first, that Python tools for text classification can be easily adopted for malware classification, and second, that details of your disassembler and analysis passes are very important for generalizing any results. I’ll summarize those points here, but take a look at the video and slides for more details and code snippets.

My final solution to the classification challenge was mainly based on counting combinations of bytes and instructions called ngrams. This method counts the frequency with which a byte or an instruction occurs in a malware sample. When n is greater than one, I count the frequency of combinations of two, three, or more bytes or instructions. Because the number of possible combinations climbs very quickly, a hashing vectorizer must be used to keep the size of the feature space manageable.
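As a rough sketch of how this can be done with scikit-learn's HashingVectorizer (the token strings below are hypothetical):

            from sklearn.feature_extraction.text import HashingVectorizer

            samples = ["55 8b ec 83", "8b 45 08 5d"]  # space-separated byte tokens
            # Hash 1- to 3-token ngrams into a fixed 2**20-column sparse feature space
            vectorizer = HashingVectorizer(analyzer="word", ngram_range=(1, 3),
                                           n_features=2**20)
            X = vectorizer.transform(samples)  # one sparse row per sample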

Figure 1: Example byte 2grams from the binglide documentation

 

Figure 2: Byte 2grams from a malware sample included in the competition

At first, I was only using byte ngrams and I was very surprised that feeding these simple features to a model could provide such good classifications. In order to explore this, I used binglide to better understand what the bytes inside an executable look like. Figure 1 and Figure 2 show the results of this exploration. Figure 1 shows example output from binglide’s documentation and Figure 2 shows the output when I ran the tool on a sample from the competition. In all the images, the entropy of a binary is displayed on the strip to the left and a histogram of the 2gram frequency is shown on the right. For that frequency histogram, each axis contains 256 possible values for a byte and a pixel turns blue as that combination of bytes occurs more frequently.

You can see that the first 2gram pattern in Figure 2 generally looks like the first 2gram pattern in Figure 1. The .text section is usually used for executable code so this match to example x86 code is reassuring. The second 2gram pattern in Figure 2 is very distinctive and doesn’t really match any of the examples from the binglide documentation. Machine learning algorithms are well suited to picking out unique patterns like this if they are reused throughout a class. Finding this gave me more confidence that the classification potential of the byte ngram features was real and not due to any mistake on my part.

I also used instruction ngrams in a similar way. In this case, "instruction" refers to the mnemonic at the start of each line of assembly code after it's been disassembled from the machine code. I wrote some Python code to extract the instructions from the IDA disassembly files provided by the competition organizers. Again, feature hashing was necessary to restrain the size of the feature space. It's easy to see why instruction ngrams can provide good classifications: developing software is hard, and malware authors want to reuse code rather than waste effort. That repeated code should produce similar patterns in the instruction ngram space across families of malware.

Using machine learning algorithms to classify text is a mature field with existing software tools. Building word and character ngrams from text is a very similar problem to building the byte and instruction ngrams that I was interested in. In the slides from my SciPy talk I show some snippets of code where I adapted the existing text classification tools in the scikit-learn library to the task of malware classification. Those tools included a couple of different vectorizers, pipelines for cross-validating multiple steps together, and a variety of models to try out.
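A sketch of such an adaptation using the current scikit-learn API; X_raw and y below are hypothetical token strings and labels:

            from sklearn.feature_extraction.text import HashingVectorizer
            from sklearn.linear_model import LogisticRegression
            from sklearn.model_selection import cross_val_score
            from sklearn.pipeline import Pipeline

            X_raw = ["55 8b ec", "8b 45 08", "c3 90 90", "90 90 c3"]  # hypothetical samples
            y = [0, 0, 1, 1]                                          # hypothetical labels
            pipeline = Pipeline([
                ("ngrams", HashingVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("model", LogisticRegression()),
            ])
            # Cross-validate vectorization and model fitting as one unit
            scores = cross_val_score(pipeline, X_raw, y, cv=2)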

Throughout this process, I was aware that the disassembly provided in the competition would not be available in a large, distributed malware processing engine. IDA Pro is the leading program for reverse engineering and disassembling binaries, but it is restrictively licensed and intended to be run interactively. I'm more interested in extracting features from disassembly automatically, in batch, and providing some insight into the files via a statistical model. I spent a lot of time during and after the competition searching for open source tools that could automatically generate the disassembly provided by the competition.

I found Capstone to be a very easy-to-use open source disassembler. I used it to generate instruction ngrams and compared the classification performance of models based on those ngrams against the same models based on IDA instructions. Both performed well, with very few misclassifications. The competition was judged on a multi-class logarithmic loss metric, though, and that metric was always better when using the IDA instructions.
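A minimal Capstone sketch for pulling instruction mnemonics out of raw bytes (the byte string here is a hand-picked example, not taken from the competition data):

            import capstone

            # push ebp; mov ebp, esp; pop ebp; ret
            code_bytes = b"\x55\x89\xe5\x5d\xc3"
            md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_32)
            mnemonics = [insn.mnemonic for insn in md.disasm(code_bytes, 0x1000)]
            print(mnemonics)  # ['push', 'mov', 'pop', 'ret']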

After talking to some security experts at Endgame, I’ve learned that this could be due to the analysis passes that IDA does before disassembling. Capstone will just execute one sweep over the binary and disassemble anything it finds as it goes. IDA will more intelligently decode the binary looking for entry points, where functions and subroutines begin and end, and what sections actually contain code, data, or imports. I was able to relate this to my machine learning experience in that I viewed IDA’s disassembly as a more intelligent feature engineering pipeline. The result is that I’m still working on finding or building the best performing distributable disassembler.

This Kaggle competition was a great example of how data science can be applied to solve specific security problems. Data science has been described as a combination of skills in software, math, and statistics, along with domain expertise. While I didn’t have the domain expertise when I first joined Endgame, working closely with our security experts has expanded my breadth of knowledge while giving me a new opportunity to explore how data science techniques can be used to solve security challenges.


Why We Need More Cultural Entrepreneurs in Security & Tech


Recently, #RealDiversityNumbers provided another venue for those in the tech community to vent and commiserate over the widely publicized lack of diversity within the industry. The hashtag started trending and gained some media attention. This occurred as Twitter came under fire for organizing a frat-themed party, while also facing a gender inequality claim. Unfortunately, as dire as the diversity situation is in the tech sector writ large, it pales in comparison to the statistics on diversity in the security sector. The security community not only faces a pipeline shortage, but it has also made almost no progress in actively attracting a diverse workforce. The tectonic shifts required to achieve true diversity in the security sector also mean a fundamental shift in the tech culture must take place. However, while companies such as Pinterest have publicly noted their commitment to diversity, very little has changed from top-down approaches to diversification in the tech community. Certainly internal policies and recruiting practices matter, and leadership support is essential. These are the core enablers, but are not sufficient for institutionalizing cultural change. Instead, cultural entrepreneurs reflecting technical expertise across an organization must lead a grassroots movement to truly institutionalize cultural change within organizations and across the tech community. All of us must move beyond our comfort zones of research, writing and coding and truly take ownership of organizational culture.

Given the competition for talent in the security industry, an organization’s culture (ceteris paribus) often proves to be the determining factor that fosters, attracts, and retains a highly skilled and diversified workforce. Because an organization cannot engineer its way toward an innovative, inclusive culture or simply throw money at the issue, this problem can be perplexing to tech-focused industries. As anyone who has even briefly studied cultural approaches knows, culture is very sticky and entails a concerted and persistent effort to achieve the desired effects. It requires a paradigm shift much in the same way Kuhn, Lakatos and Popper all approached the various avenues toward scientific progress. The good news – if there is any – is that many of the cultural shifts required to foster a driven, innovative and (yes!) inclusive work environment do not cost a lot of money. Similar to the role of policy entrepreneurs in pushing forth new ideas in the public sector, cultural entrepreneurs are key individuals who can use their technical credibility to push forth ideas and promote solutions for any cultural challenges they identify or experience. By serving as a gateway between various aspects of an organization, cultural entrepreneurs can move an organization and ideally the industry beyond a “brogramming” mentality and reputation. Cultural entrepreneurs must reflect technical expertise across a diverse range of skills and demographics in order to legitimately encourage diversity and innovation. This enables the credible organic shifts from below that foment cultural change.

Cultural entrepreneurs are required to ensure an organization’s culture is inclusive and purpose-driven, instead of perpetuating the status quo. In this regard, diversity is a key aspect of this cultural shift. Diversity provides an innovation advantage and positively impacts the bottom line. Many in the tech community are starting to realize this, with companies like Intel investing $300 million in diversity, and CEOs lamenting that they wished they had built diversity into their culture from the start. Admitting that the problem exists is an important step, but this rhetoric has yet to translate into a more diversified workforce. A concerted effort by major tech companies to address diversity resulted in at most a 1% increase in gender diversity and an even smaller increase in ethnic diversity. Cultural entrepreneurs, and their ability to foster grassroots cultural shifts, may be the missing link in many of these cultural and diversity initiatives.  

Cultural entrepreneurs across an organization can make a significant impact with minimal work or cost by focusing on both internal and external cultural aspects of an organization. First, there is a large literature on how cross-cutting links (think social network analysis) develop social capital, which in turn has a positive impact on civic engagement and economic success. A recent Gallup Poll reinforces just how hard it is to foster social capital, with results confirming that over 70% of the American workforce does not feel engaged. Many organizations know this, but unfortunately fail at implementation by opting for social activities that reinforce exclusivity or feel contrived or overly corporate. Events ranging from frat-themed parties to cruise boats with concubines clearly do little to attract a diverse workforce. Cultural entrepreneurs can encourage or informally organize inclusive activities – such as sports, team outings, or discussion boards – within and across departments to increase engagement. While these kinds of social activities may seem superfluous to the bottom line, they can positively impact retention, workforce engagement, and inclusivity by building cross-cutting social networks. The kinds of social activities certainly should vary depending on an organization, but they must appeal to multiple segments of the workforce to foster social capital instead of reinforcing stereotypes and stovepipes within organizations. However, with everyone’s heads to keyboard all day every day, technical cultural entrepreneurs rarely emerge, hindering the development of social capital.

Second, perception is reality, and cultural entrepreneurs can help shift external perceptions of the industry. A quick search of Google images for “hacker” reveals endless images of male figures in hoodies working in dark, nefarious environments. The media perpetuates this with similar images every time a new high-profile breach occurs. It’s not just a media problem; it is also perpetuated within the industry itself. A recent analysis of the RSA conference guide showed shockingly little diversity. The study notes that “women are absent” and “people of colour are totally absent.” While this adequately reflects the reality of the security industry, it makes those of us already in the security community who don’t fit that profile feel more out of place, and deters anyone else who doesn’t fit it from entering the field. Let’s hope the upcoming Black Hat and Def Con conferences are more inclusive, with a broader representation of gender, race and appearance, but I wouldn’t bet on it. It’s up to cultural entrepreneurs to continue to press their organizations and the industry to help shift the perception of the security community away from nefarious loners and toward one with a universal mission that requires a diverse range of skillsets and backgrounds. Providing internal and external thought leadership through blogs, presentations and marketing can go a long way toward helping reality reflect the growing rhetoric advocating for diversity.

The security industry, which mirrors the diversity problems in the tech industry writ large, would benefit from a cultural approach to workforce engagement and inclusivity. All of the amenities in the world are not enough to overcome the tech industry’s cultural problems that not only persist, but that are also much more exclusive than they were two decades ago. In creative industries, cultural entrepreneurs are essential to fostering the social capital and intrinsic satisfaction that emerges from an inclusive and innovative culture. At Endgame, this is something that we think about daily and always seek to improve. We benefit from leadership that supports and understands the role of culture, while also letting us grow that culture organically. This organic growth relies on technical leaders across the company working together and pushing both the technical and cultural envelopes. This combination of technical mastery and a collaborative, driven culture provides the foundation on which we will continue to foment inclusivity while disrupting an industry which for too long has relied on outdated solutions to modern technical and workforce challenges.

Sprint Defaults and the Jeep Hack: Could Basic Network Settings Have Prevented the Industry Uproar?


In mid-July, research into the security of a Jeep Cherokee was disclosed through a Wired article and subsequent Black Hat presentation. The researchers, Charlie Miller and Chris Valasek, found an exploitable vulnerability in the Uconnect entertainment system that operates over the Sprint cellular network. The vulnerability was serious enough to prompt a 1.4 million-vehicle recall from Chrysler.

In the Wired article, Miller and Valasek describe two important aspects of the vulnerability. First, they can target their exploit against a specific vehicle: “anyone who knows the car’s IP address can gain access from anywhere in the country,” and second, they can scan the network for vulnerable vehicles including a Dodge Ram, Jeep Cherokee, and a Dodge Durango. Both of these capabilities, to scan and target remotely through the cellular network, are necessary in order to trigger the exploit against a target vehicle.

While it’s really scary to think that a hacker anywhere in the country can drive your car off the road with the push of a button, the good news is that the cellular network has safeguards in place to prevent remotely interacting with phones and devices like Uconnect. For some inexplicable reason, Sprint disabled these safeguards and left the door wide open for the possibility of remote exploitation against the Uconnect cars. Had Sprint not disabled these safeguards, the Uconnect vulnerability would have just been another of several that require physical access to exploit and may not have prompted an immediate recall. 

The Gateway

Cellular networks are firewalled at the edge (Figure 1). GSM, CDMA and LTE networks are all architected a little differently, but each contains one of the following Internet gateways:

  • CDMA: Packet Data Serving Node (PDSN), used by Verizon and Sprint
  • GSM: Gateway GPRS Support Node (GGSN), used by T-Mobile and AT&T
  • LTE: the responsibilities of the gateway are absorbed into multiple components in the System Architecture Evolution (SAE). All major Telcos in the US operate LTE networks.

Figure 1: Network layout

To keep things simple and generic, we’ll just call this component “the gateway.” Network connections only originate in one direction: outbound. You can think of a carrier’s core network as a big firewalled LAN; it is not possible to gain access to a phone from outside the phone network (Figure 2).

Figure 2: The attacker is blocked from outside the core network.

Miller was able to operate behind this firewall by tethering his laptop to a burner phone that was on the Sprint network (Figure 3).

But by default, phones are blocked from seeing each other as well. So even if the attacker knows the IP address of another phone on the network, the network won’t allow her to open a data connection to that phone (Figure 4). The network enforces this through what are called Access Point Names (APNs).

Figure 3: Device-to-device was enabled for the car’s APN, enabling remote exploitation. Why?

Figure 4: Default configuration, device-to-device connections disabled. The attacker cannot access the target device from inside the firewall.

When a phone on the network needs to make a data connection, it provides an APN to the network. If you want to view the APN settings on your personal phone, you can follow these instructions for iPhone or Android. The network gateway uses the APN to determine how to allow your phone to connect to the Internet. There are hundreds of APNs in every network, and your carrier uses APNs to organize how different devices allocate data for billing purposes. In the case of Uconnect, all Uconnect devices operate on the Sprint network and use their own private APN. APNs are really useful for third parties, like Uconnect, that sell a service running over a cellular network. So that each Uconnect user doesn’t need to maintain a line of service with Sprint, Uconnect is responsible for the data connection; end users pay Uconnect for service, which runs through a private APN set up for Uconnect.

APNs are used extensively to isolate private networks for machine-to-machine systems like smart road signs and home alarm systems. If you’ve ever bought a soda from a vending machine with a credit card, the back end connection was using a private APN. 

Vulnerabilities caused by misconfigured APNs are not new; the APN of the bike-sharing system in Madrid was hacked just last summer. These bike-sharing systems need device-to-device access because technicians perform maintenance on these machines via remote desktop.  

Aftermath

There is no obvious reason for Uconnect to need remote administration. Why then are device-to-device connections allowed for the Uconnect APN, especially since it opens the door to a remote access exploit?  We will probably never know, because six days after the Wired story was published, Miller tweeted that Sprint had blocked phone-to-car traffic as well as car-to-car traffic. What this really means is that Sprint disabled internal traffic for the Uconnect APN. The remote access vector was closed.

The fact that Sprint made this change so quickly suggests that device-to-device traffic was not necessary in the first place, which leads us to two conclusions: 1) Had Sprint simply left device-to-device traffic disabled, the Jeep incident would have required physical access and not have been any more of a story than the Ford Escape story in 2013, or 2) More seriously, if the story hadn’t attracted mainstream media attention, Chrysler might not have taken the underlying vulnerability as seriously, and the fix would have rolled out much later, if ever. 

Security shouldn’t be a function of the drama circus that surrounds it.

 

Firewall icon created by Yazmin Alanis from the Noun Project
Pirate Phone icon created by Adriana Danaila from the Noun Project
Pickup truck icon created by Jamie M. Laurel from the Noun Project

Black Hat 2015 Analysis: An Island in the Desert


This year’s Black Hat broke records yet again, with the highest levels of attendance, the highest number of countries represented and, judging by the size of the business hall, the most companies represented as well. While it featured some truly novel technical methods and the advanced security research for which it is so well known, this year’s conference, even more than others, reflected an institutionalization of the status quo within the security industry. Rather than reflecting the major paradigm shifts that are occurring in the security community, it seemed to perpetuate the insularity for which this community is often criticized.

In her Black Hat keynote speech, Jennifer Granick, lawyer and Director of Civil Liberties at Stanford University, noted that inclusion is at the heart of the hacker’s ethos and called for the security community to take the lead and push forth change within the broader tech sector. She explicitly encouraged the security community to refrain from being so insular, and to transform into a community that not only thinks globally but is also much more participatory in the policies and laws that directly affect them. While she focused on diversity and equality, there are several additional areas where the security community could greatly benefit from a more expansive mindset. Unfortunately, these strategic level discussions were largely absent from the majority of the Black Hat briefings that followed the keynote. The tactical, technical presentations understandably comprise the majority of the dialogue and garner the most attention.  However, given the growing size and expanding representation of disparate parts of the community, there was a noticeable absence of nuanced discussion about the state of the security community, including broader thinking about the three big strategic issues and trends that will define the community for the foreseeable future:

  • Where’s the threat? Despite a highly dynamic threat landscape, ranging from foreign governments to terrorist organizations to transnational criminal networks, discussion of these threat actors was embarrassingly absent from the panels this year. Although the security community is often criticized for over-hyping the threat, this was not the case at this year’s Black Hat. Even worse, the majority of discussions of the threat focused on the United States and Western European countries as the greatest security threats. Clearly, technology conferences must focus on the latest technological approaches and trends in the field. However, omitting the international actors and context in which these technologies exist perpetuates an inward-facing bias of the field that leads many to misunderstand the nature, capabilities and magnitude of the greatest threats to corporate and national security.
  • Toward détente? Last year’s Black Hat conference was still reeling from the Snowden revelations that shook the security community. A general feeling of distrust of the U.S. government was still apparent in numerous panels, heightening interest in privacy and circular discussions over surveillance. While sentiments of distrust still exist, this no longer appears to be the only perspective. In a few briefings, there was a surprising lack of the hostility toward the government that existed at similar panels a year ago. In fact, the very few panels that had government representation were not only well attended, but also contained civil discourse between the speakers and the audience. This does not mean that there were softball questions. On the contrary, there was blunt conversation about the "trust deficit" between the security community and the government. For instance, the biggest concern expressed regarding data sharing with the government (including the information sharing bill which Congress discussed last week, but is now delayed) was not about information sharing itself, but rather how the security community can trust that the government can protect the shared data in light of OPM and other high-profile breaches. This is a very valid concern and one that ignited a lot of bilateral dialogue. Organizations from the DHS to the Federal Trade Commission requested greater partnerships with the security community. While there are certainly enormous challenges ahead, it was refreshing to see not only signs of a potential thawing of relations between the government and the security community, but also hopefully some baby steps toward mutually beneficial collaboration.
  • Diversity. The general lack of diversity at the conference comes as no surprise given the well-publicized statistics of the demographics of the security community, as well as the #ilooklikeanengineer campaign that took off last week. However, diversity is not just about gender – it also pertains to diversity of perspectives, backgrounds and industries. Areas such as human factors, policy and data science seemed to be less represented than in previous years, conflicting with much of the rhetoric that permeated the business hall. In many of the talks that did cover these areas, there were both implicit and explicit requests for a more expansive partnership and role within the community.

Given the vast technological, geopolitical and demographic shifts underway, the security community must transform beyond the traditional mindset and truly begin to think beyond the insular perimeter. Returning to Granick’s key points, the security community can consciously provide leadership not only in shaping the political discourse that impacts the entire tech community, but also lead by example through promoting equality and thinking globally. The security community must play a participatory role in the larger strategic shifts that will continue to impact it instead of remaining an insularly focused island in the desert.

NLP for Security: Malicious Language Processing


Natural Language Processing (NLP) is a diverse field in computer science dedicated to automatically parsing and processing human language. NLP has been used to perform authorship attribution and sentiment analysis, as well as being a core function of IBM’s Watson and Apple’s Siri. NLP research is thriving due to the massive amounts of diverse text sources (e.g., Twitter and Wikipedia) and multiple disciplines using text analytics to derive insights. However, NLP can be used for more than human language processing and can be applied to any written text. Data scientists at Endgame apply NLP to security by building upon advanced NLP techniques to better identify and understand malicious code, moving toward an NLP methodology specifically designed for malware analysis—a Malicious Language Processing framework. The goal of this Malicious Language Processing framework is to operationalize NLP to address one of the security domain’s most challenging big data problems by automating and expediting the identification of malicious code hidden within benign code.

How is NLP used in InfoSec?

Before we delve into how Endgame leverages NLP, let’s explore a few different ways others have used it to tackle information security problems:

  • Domain Generation Algorithm classification – Using NLP to distinguish malicious domains (e.g., blbwpvcyztrepfue.ru) from benign domains (e.g., cnn.com); a toy sketch follows this list.
  • Source Code Vulnerability Analysis – Determining function patterns associated with known vulnerabilities, then using NLP to identify other potentially vulnerable code segments.
  • Phishing Identification – A bag-of-words model determines the probability that an email message contains a phishing attempt.
  • Malware Family Analysis – Topic modeling techniques assign samples of malware to families, as discussed in my colleague Phil Roth’s previous blog.
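
As a toy illustration of the first item above, character n-grams plus a simple linear model can flag DGA-looking domains. This is a sketch, not any production classifier; the training domains, labels, and test domain below are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

domains = ["cnn.com", "wikipedia.org", "blbwpvcyztrepfue.ru", "qxkvjzpwtyhb.info"]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = algorithmically generated (toy labels)

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),  # char 2- to 4-grams
    LogisticRegression(),
)
model.fit(domains, labels)
print(model.predict(["uryzzngjkpqoc.net"]))  # unusual n-grams -> likely DGA
```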

Over the rest of this post, I’ll discuss how Endgame data scientists are using Malicious Language Processing to discover malicious language hidden within benign code. 

Data Acquisition/Corpus Building

In order to perform NLP you must have a corpus, or collection of documents. While this is relatively straightforward in traditional NLP (e.g., APIs and web scraping), it is not necessarily the same in malware analysis. There are two primary techniques used to get data from malicious binaries: static and dynamic analysis.

Fig 1. Disassembled source code

 

Static analysis, also called source code analysis, is performed using a disassembler, which produces output similar to the above (Fig 1). The disassembler presents a flat view of a binary; structurally, we lose important contextual information because the logical order of instructions is not clearly delineated. In disassembly, jmp or call instructions should lead to different blocks of code, which a standard flat file misrepresents. Luckily, static analysis tools exist that can provide call graphs, capturing the logical flow of instructions as a directed graph, like this and this.

Dynamic analysis, often called behavioral analysis, is the collection of metadata from an executed binary in a sandbox environment. Dynamic analysis can provide data such as network access, registry/file activity, and API function monitoring. While dynamic analysis is often more informative, it is also more resource intensive, requiring a suite of collection tools and a sandboxed virtual environment. Alternatively, static analysis can be automated to generate disassembly over a large set of binaries, generating a corpus ready for the NLP pipeline. At Endgame we have engineered a hybrid approach that automates the analysis of malicious binaries, providing data scientists with metadata from both static and dynamic analysis.

Lexical Parsing

Lexical parsing is paramount to the NLP process as it provides the ability to turn large bodies of text into individual tokens. The goal of Malicious Language Processing is to parse a binary the same way an NLP researcher would parse a document.

To generate the “words” in this process we must perform a few traditional NLP techniques. First is tokenization, the process of breaking down a string of text into meaningful segments called tokens. Tokens can be generated by segmenting on whitespace, newline characters, punctuation, or regular expressions (Fig 2).

Fig 2. Tokenized disassembly
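
As a minimal sketch of this step (the regex and sample instruction are illustrative, not Endgame’s actual parser), tokenizing a line of disassembly might look like:

```python
import re

line = "mov eax, dword ptr [ebp+0x8]"
# Match identifiers/mnemonics, hex immediates, and operand punctuation.
tokens = re.findall(r"[A-Za-z_@?$][\w@?$]*|0x[0-9a-fA-F]+|[\[\]+,]", line)
print(tokens)
# ['mov', 'eax', ',', 'dword', 'ptr', '[', 'ebp', '+', '0x8', ']']
```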

The next step in the lexical parsing process is text normalization: merging families of derivationally related words with similar meanings. The two forms of this process are called stemming and lemmatization.

Stemming seeks to reduce a word to its functional stem. For example, in malware analysis this could reduce SetWindowTextA or SetWindowTextW to SetWindowText (Windows API), or JE, JLE, and JNZ to JMP (x86 instructions), accounting for multiple variations of essentially the same function.

Lemmatization is more difficult in general because it requires context or the part-of-speech tag of a word (e.g., noun, verb, etc.). In English, the word “better” has “good” as its lemma. In malware we do not yet have the luxury of parts-of-speech tagging, so lemmatization is not yet applicable. However, a rules-based dictionary that associates Windows API equivalents of C runtime functions may provide a step towards lemmatization, such as mapping _fread to ReadFile or _popen to CreateProcess.
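
A hedged sketch of both normalization steps follows; the suffix rule, jump set, and C-runtime-to-WinAPI dictionary are small illustrative assumptions standing in for real rule sets.

```python
import re

API_SUFFIX = re.compile(r"(ExA|ExW|Ex|A|W)$")   # ANSI/wide/extended variants
COND_JUMPS = {"je", "jle", "jnz", "jg", "jl"}   # conditional jumps to fold
C_TO_WINAPI = {"_fread": "ReadFile", "_popen": "CreateProcess"}

def stem(token):
    if token.lower() in COND_JUMPS:
        return "jmp"                     # JE/JLE/JNZ -> JMP
    return API_SUFFIX.sub("", token)     # SetWindowTextW -> SetWindowText

def lemmatize(token):
    return C_TO_WINAPI.get(token, token) # _fread -> ReadFile

print(stem("SetWindowTextA"), stem("jnz"), lemmatize("_popen"))
# SetWindowText jmp CreateProcess
```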

Semantic Networks

Semantic or associative networks represent the co-occurrence of words within a body of text to gain an understanding of the semantic relationship between words. For each unique word in a corpus, a node is created on a directed graph. Links between words are generated with an associated weight based on the frequency that the two words co-occurred. The resulting graph can then be clustered to derive cliques or communities of functions that have similar behavior.
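
A minimal sketch of building such a network, assuming hypothetical per-binary API call lists; the post describes a directed graph, but since the co-occurrence counts here are symmetric, an undirected graph is used for simplicity.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

# Hypothetical stand-in: API calls observed per binary.
samples = [
    ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    ["VirtualAlloc", "WriteProcessMemory", "ReadFile"],
]

weights = Counter()
for calls in samples:
    for a, b in combinations(sorted(set(calls)), 2):
        weights[(a, b)] += 1             # how often two calls co-occur

G = nx.Graph()
G.add_weighted_edges_from((a, b, w) for (a, b), w in weights.items())
# Clustering G then yields communities of functions with similar behavior,
# as described above.
```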

A malicious language semantic network could aid in the generation of a lexical database capability for malware similar to WordNet. WordNet is a lexical database of English nouns, verbs, and adjectives grouped into sets of cognitive synonyms. Endgame data scientists are in the incipient stages of exploring ways to search and identify synonyms or synsets of malicious functions. Additionally, we hope to leverage our version of WordNet in the development of lemmatization and the Parts-of-Speech tagging within the Malicious Language Processing framework.

Parts-of-Speech Tagging

Parts-of-Speech (POS) tagging annotates each token in a string of text with its grammatical role, such as noun or verb. POS tagging is crucial for gaining a better understanding of text and establishing semantic relationships within a corpus. Above I mentioned that there is currently no representation of POS tagging for malware. Source code may be too abstract to break down into nouns, prepositions or adjectives. However, it is possible to treat subroutines as “sentences” and gain an understanding of functions used as subjects, verbs and predicates. Running pseudo code for a Windows process injection through a Malicious Language Processing POS-tagger, for example, would yield tags along these lines.
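
A rule-based tagger along those lines might look like the following sketch; the tag set and rules are purely illustrative assumptions, not an established convention.

```python
import re

REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

def tag(token):
    # Hypothetical tag set: operands as nouns, mnemonics/API calls as verbs.
    if token.lower() in REGISTERS:
        return (token, "NOUN")
    if re.fullmatch(r"0x[0-9a-fA-F]+|\d+", token):
        return (token, "NUM")
    if re.fullmatch(r"[A-Za-z_]\w*", token):
        return (token, "VERB")
    return (token, "PUNCT")

print([tag(t) for t in ["call", "VirtualAlloc", "mov", "eax", "0x40"]])
# [('call', 'VERB'), ('VirtualAlloc', 'VERB'), ('mov', 'VERB'),
#  ('eax', 'NOUN'), ('0x40', 'NUM')]
```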

Closing Thoughts

While the majority of the concepts mentioned in this post are being leveraged by Endgame today to better understand malware behavior, there is still plenty of work to be done. The concept of Malicious Language Processing is still in its infancy. We are currently working hard to expand the Malicious Language Processing framework by developing a malicious stop word list (a list of the most common words/functions in a corpus of binaries) and creating an anomaly detector capable of determining which function(s) do not belong in a benign block of code. With more research and larger, more diverse corpora, we will be able to understand the behavior and basic capabilities of a suspicious binary without executing it or having a human reverse engineer it. We view NLP as an additional tool in a data scientist’s toolkit, and a powerful means by which we can apply data science to security problems, quickly parsing the malicious from the benign.

Hunting for Honeypot Attackers: A Data Scientist’s Adventure


The U.S. Office of Personnel Management (known as OPM) won the “Most Epic Fail” award at the 2015 Black Hat Conference for the worst known data breach in U.S. government history, with more than 22 million employee profiles compromised. Joining OPM as contenders for this award were other victims of high-profile cyber attacks, including Poland's Plus Bank and the website AshleyMadison.com. The truth is, hardly a day goes by without news of cyber intrusions. As an example, according to databreachtoday.com, just in recent months PNI Digital Media and many retailers such as Wal-Mart and Rite-Aid had their photo services compromised, UCLA Health’s network was breached, and information of 4.5 million people may have been exposed. Criminals and nation-state actors break into systems for many reasons with catastrophic and often irremediable consequences for the victims.

Traditionally, security experts are the main force for investigating cyber threats and breaches. Their expertise in computers and network communication provides them with an advantage in identifying suspicious activities. However, with more data being collected, not only in quantity but also in variety, data scientists are beginning to play a more significant role in the adventure of hunting malicious attackers. At Endgame, the data scientist team works closely with the security and malware experts to monitor, track and identify cyber threats, and applies a wide range of data science tools to provide our customers with intelligence and insights. In this post, I’ll explain how we analyze attack data collected from a honeypot network, which provides insight into the locations of attackers behind those activities. The analysis captures those organized attacks from a vast amount of seemingly separated attempts.

This post is divided into three sections. The first section describes the context of the analysis and provides an overview of the hacking activities. The second section focuses on investigating the files that the attackers implanted into the breached systems. Finally, the third section demonstrates how I identified similar attacks through uncovering behavioral characteristics. All of this demonstrates one way that data science can be applied to the security domain. (My previous post explained another application of data science to security.)

Background

Cyber attackers are constantly looking for targets on the Internet. Much like a lion pursuing its prey, an attacker usually conducts a sequence of actions, known as the cyber kill chain, including identifying the footprints of a victim system, scanning the open ports of the system, and probing the holes trying to find an entrance into the system. Professional attackers might be doing this all day long until they find a weak system.

All of this would be bad news for any weak system the attacker finds – unless that weak system is a honeypot. A honeypot is a trap set up on the Internet with minimum security settings so an attacker may easily break into it, without knowing his/her activities are being monitored and tracked. Though honeypots have been used widely by researchers to study the methods of attackers, they can also be very useful to defenders. Compared to sophisticated anomaly detection techniques, honeypots provide intrusion alerts with low false positive rates because no legitimate user should be accessing them. Honeypots set up by a company might also be used to confuse attackers and slow down the attacks against their networks. New techniques are on the way to make setting up and managing honeypots easier and more efficient, and may play an increasingly prominent role in future cyber defense.

A network of honeypots is called a honeynet. The particular honeynet for which I have data logged activities showing that an attacker enumerated pairs of common user names and passwords to enter the system, downloaded malicious files from his/her own hosting servers, changed the privilege over the files and then executed them. During the period from March 2015 through the end of June 2015, there were more than 21,000 attacker IP addresses being detected, and about 36 million SSH attempts being logged. Attackers have tried 34,000 unique user names and almost 1 million unique passwords to break into those honeypots. That’s a lot of effort by the attackers to break into the system. Over time, the honeynet has identified about 500 malicious domains and more than 1000 unique malware samples.

The IP addresses that were owned by the attackers and used to host malware are geographically dispersed. Figure 1 shows that the recorded attacks mostly came from China, the U.S., the Middle East and Europe. While geographic origination doesn’t tell us everything, it still gives us a general idea of potential attacker locations. 

Figure 1. Attacks came from all around the world, color coded on counts of attack. The darker the color, the greater the number of attacks originating from that country.

The frequency of attacks varies daily, as shown in Figure 2, but the trend shows that more attacks were observed during workdays than weekends, and peaks often appear on Wednesday or Thursday. This seems to support the suspicion that humans (rather than bots) were behind the scenes, and that professionals instead of amateur hobbyists conducted the attacks.

Figure 2. Daily Attack Counts.
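
A minimal sketch of the aggregation behind Figure 2, assuming a hypothetical attacks.csv with a timestamp column:

```python
import pandas as pd

df = pd.read_csv("attacks.csv", parse_dates=["timestamp"])
daily = df.set_index("timestamp").resample("D").size()   # attacks per day
weekday_means = daily.groupby(daily.index.dayofweek).mean()  # 0=Mon ... 6=Sun
print(weekday_means)  # weekday means exceeding weekend means suggests humans
```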

Now that we understand where and when those attacks were orchestrated, we want to understand if any of the attacks were organized. In other words, were they carried out by the same person or the same group of people over and over again?

Attackers change IP addresses from attack to attack, so looking at the IP addresses alone won’t provide us with much information. To find the answer to the question above, we need to use the knowledge about the files left by the attackers. 

File Similarity

Malware to an attacker is like a hammer and level to a carpenter. We expect that an attacker would use his/her set of malware repeatedly in different attacks, even though the files might have appeared in different names or variants. Therefore, the similarity across the downloaded malware files may provide informative links to associated attacks.

One extreme case is a group of 17 different IPs (shown in Figure 3) used on a variety of days containing exactly the same files and folders organized in exactly the same structure. That finding immediately portrayed a lazy hacker who used the same folder time and time again. However, we would imagine that most attackers might be more diligent. For example, file structures in the hosting server may be different, folders could be rearranged, and the content of a malicious binary file may be tweaked. Therefore, a more robust method is needed to calculate the level of similarity across the files, and then use that information to associate similar attacks.

Figure 3. 17 IPs have exactly the same file structure.

How can we quantitatively and algorithmically do this?

The first step is to find similar files to each of the files in which we are interested. The collected files include different types, such as images, HTML pages, text files, compressed tar balls, and binary files, but we are probably only interested in binary files and tar balls, which are riskier. This reduces the number of files to work on, but the same approach can be applied to all file types.

File similarity computation has been researched extensively in the past two decades but still remains a rich field for new methods. Some mature algorithms to compute file similarities include block-based hashing, Context-Triggered Piecewise (CTP) hashing (also known as fuzzy hashing), and Bloom filter hashing. Endgame uses more advanced file similarity techniques based on file structural and behavioral attributes. However, for this investigation I used fuzzy hashing to compute file similarities for simplicity and since open source code is widely available.

I took each of the unique files based on its fuzzy hashing string and computed the similarity to all the other files. The result is a large symmetric similarity matrix for all files, which we can visualize to check if there are any apparent structures in the similarity data. The way I visualize the matrix is to connect two similar files with a line, and here I would choose an arbitrary threshold of 80, which means that if two files are more than 80% similar, they will be connected. The visualization of the file similarity matrix is shown in Figure 4.

Figure 4. Graph of files based on similarity.
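
A sketch of the pairwise computation using the open source ssdeep fuzzy-hashing bindings, with toy byte strings standing in for real binaries:

```python
import itertools
import ssdeep  # python-ssdeep bindings; the pure-Python ppdeep has the same API

files = {
    "a.bin": b"\x7fELF" + b"payload" * 1000,
    "b.bin": b"\x7fELF" + b"payload" * 990 + b"variant" * 10,
    "c.bin": b"MZ" + b"unrelated" * 800,
}
hashes = {name: ssdeep.hash(data) for name, data in files.items()}

edges = []
for (a, ha), (b, hb) in itertools.combinations(hashes.items(), 2):
    score = ssdeep.compare(ha, hb)   # similarity score from 0 to 100
    if score > 80:                   # the threshold chosen above
        edges.append((a, b, score))  # connect files more than 80% similar
print(edges)
```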

It is visually clear that the files are indeed partitioned into a number of groups. Let’s zoom into one group and see the details in Figure 5. The five files, represented by their fuzzy hash strings, are connected to each other, with mutual similarity of over 90%. If we look at them very carefully, they differ in only one or two letters in the strings, even though they have totally different file names and MD5 hashes. VirusTotal recognizes four of the five as malware, and the scan reports indicate that they are Linux Trojans.

Figure 5. One group of similar files.

Identifying Similar Attacks

Now that we have identified the groups of similar files, it’s time to identify the attacks that used similar malware. If I treat each attack as a document, and the malware used in an attack as words, I can construct a document-term matrix to encapsulate all the attack information. To incorporate the malware similarity information into the matrix, I tweaked the matrix a bit: if a malware sample was not used in a specific attack but shares a certain amount of similarity with malware that was used, its cell takes the value of that similarity level. For example, if malware M1 was not used in attack A1, but M1 is most similar to malware M2, which was used in attack A1, and the similarity level is 90%, then the element at cell (A1, M1) will be 0.9, while (A1, M2) will be 1.0.
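
A sketch of this construction, with hypothetical stand-ins for the per-attack malware sets and the pairwise file similarities:

```python
import numpy as np

malware = ["M1", "M2", "M3"]
attacks = {"A1": {"M2"}, "A2": {"M1", "M3"}}
sim = {frozenset(("M1", "M2")): 0.9}   # symmetric pairwise similarities in [0, 1]

X = np.zeros((len(attacks), len(malware)))
for i, used in enumerate(attacks.values()):
    for j, m in enumerate(malware):
        if m in used:
            X[i, j] = 1.0   # malware used directly in this attack
        else:
            # otherwise, the best similarity to any malware that was used
            X[i, j] = max((sim.get(frozenset((m, u)), 0.0) for u in used),
                          default=0.0)
print(X)  # row A1, column M1 is 0.9, matching the example above
```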

For readers who are familiar with NLP (Natural Language Processing) and text mining, the matrix I’ve described above is similar to a document-term matrix, except the values are not computed from TF-IDFs (Term Frequency-Inverse Document Frequency). More on applications of NLP on malware analysis can be found in a post published by my fellow Endgamer Bobby Filar. The essence of such a matrix is to reflect the relationship between data records and features. In this case, data records are attacks and features are malware, while for NLP they are documents and words. The resulting matrix is an attack-malware matrix, which has more than 400 columns representing malware hashes. To get a quick idea of how the attacks (the rows) are dispersed in such a high dimensional space, I plotted the data using the T-SNE (t-Distributed Stochastic Neighbor Embedding) technique and colored the points according to the results from K-means (K=10) clustering. I chose K=10 arbitrarily to illustrate the spatial segmentation of the attacks. The T-SNE graph is shown in Figure 6, and each color represents a cluster labeled by the K-means clustering. T-SNE tries to preserve the topology when projecting data points from a high dimensional space to a much lower dimensional space, and it is widely used for visualizing the clusters within a data set.

Figure 6 shows that K-Means did a decent job of spatially grouping close data points into clusters, but it fell short of providing a quantitative measurement of similarity between any two data points. It is also quite difficult to choose the optimum value for K, the number of clusters. To overcome the challenges that K-Means faces, I will use Latent Semantics Indexing (LSI) to compute the similarity level for the attack pairs, and build a graph to connect similar attacks, and eventually apply social network analytics to determine the clusters of similar attacks.

Figure 6. T-SNE projection of Attack-Malware matrix to 2-D space.
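
A minimal sketch of the projection and coloring in Figure 6, with random data standing in for the real attack-malware matrix (hundreds of attack rows by roughly 400 malware columns):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

X = np.random.rand(300, 400)                      # stand-in attack-malware matrix
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)
xy = TSNE(n_components=2, perplexity=30).fit_transform(X)
plt.scatter(xy[:, 0], xy[:, 1], c=labels)         # color points by K-means cluster
plt.show()
```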

LSI is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a document-term matrix. SVD projects the original n-dimensional space (with n words in columns) onto a k-dimensional space, where k is much smaller than n. The projection then transforms a document’s vector in n-dimensional space into a vector in the reduced k-dimensional space under the requirement that the Euclidean distance between the original matrix and the resulting matrix after transformation is minimized.

SVD decomposes the attack-malware matrix into three matrices, one of which defines the new dimensions in the order of significance. We call the new dimensions principal components. The components are ordered by the amount of explained variance in the original data. Let’s call this matrix the attack-component matrix. At the risk of losing some information, we can plot the attack data points in 2-D space using the first and the second components just to illustrate the differences between data points, as shown in Figure 7. The vectors pointing in perpendicular directions are most different from each other.

Figure 7. Attack data projected to the first and second principal components.

The similarity between attacks can be computed from the results of LSI, more specifically by taking dot products of the rows of the attack-component matrix.

Table 1. Attacks Similar to Attack from 61.160.212.21:5947 on 2015-03-23.
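
A sketch of the LSI computation, again with random stand-in data; rows are normalized so that the dot products come out as cosine similarities between attacks:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

X = np.random.rand(300, 400)               # stand-in attack-malware matrix
svd = TruncatedSVD(n_components=50)        # k much smaller than n columns
attack_components = svd.fit_transform(X)   # the attack-component matrix
A = normalize(attack_components)
S = A @ A.T                                # S[i, j]: similarity of attacks i and j
```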

I connect two attacks if their similarity is above a certain threshold, e.g. 90%, and come up with a graph of connected attacks, shown in Figure 8.

 

Figure 8. Visualization of attacks connected by similarity.
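
A sketch of this thresholding and grouping step; any symmetric matrix of pairwise attack similarities in [0, 1], such as S from the LSI sketch above, works as input.

```python
import networkx as nx
import numpy as np

S = np.random.rand(300, 300); S = (S + S.T) / 2   # stand-in for the real S

G = nx.Graph()
G.add_nodes_from(range(S.shape[0]))
rows, cols = np.where(np.triu(S, k=1) > 0.9)      # link attacks >90% similar
G.add_edges_from(zip(rows.tolist(), cols.tolist()))

groups = sorted(nx.connected_components(G), key=len, reverse=True)
print([len(g) for g in groups[:3]])               # sizes of the largest attack groups
```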

There are a few big component subgraphs in the large graph. A component subgraph represents a group of attacks closely similar to each other. We can examine each of them in terms of what malware were deployed in the given attack group, what IP addresses were used, and how frequently the attacks were conducted.

I plotted the daily counts of attack for the two largest attack groups in Figure 9 and Figure 10. Both of them show that attacks happened more often on weekdays than on weekends. These attacks may have targeted different geo-located honeypots in the system and could be viewed as a widely expanded search for victims.

Figure 9. Daily counts of attack in one group.

Figure 10. Daily counts of attack in another group.

We can easily find out where those attackers’ IPs were located (latitude and longitude), and the WHOIS data associated with the IPs. But it’s much more difficult to fully investigate the true identities of the attackers.

Summary

In this post, I explained how to apply data science techniques to identify honeypot attackers. Mathematically, I framed the problem as an Attack-Malware matrix, and used fuzzy hashing to represent files and compute the similarity between files. I then employed latent semantic indexing methods to calculate the similarity between attacks based on file similarity values. Finally, I constructed a network graph where similar attacks are linked so that I could apply social network analytics to cluster the attacks.

As with my last blog post, this post demonstrates that data science can provide a rich set of tools that help security experts make sense of the vast amount of data often seen in cyber security and discover relevant information. Our data science team at Endgame is constantly researching and developing more effective approaches to help our customers defend themselves – because the hunt for attackers never ends.
