
To Forecast Global Cyber Alliances, Just Follow the Money (Part 2): Cooperation in the Cyber Domain - A Little-Noticed Global Trend That is Mirroring Economic Regionalism


This latest development in the realm of cyber cooperation is by no means unique. In fact, the US has signed its own cyber security agreement with Russia (although it is not as comprehensive as the potential Sino-Russian one), as well as agreements with India and the EU, one with Australia as part of a defense treaty, and a cyber security action plan with Canada. Similarly, the EU has formal cyber agreements with Japan, and the UK with Israel, while Japan and Israel have also formed their own bilateral cyber security agreement. India has cyber security agreements with countries as diverse as Kazakhstan and Brazil. RTAs (regional trade agreements) and other regional arrangements are also being augmented to include cyber: the African Union, the Shanghai Cooperation Organization, and the Council of Europe's Budapest Convention are all examples of this. This pattern parallels one found in the economic arena, where cooperative agreements often track closely with geopolitical affinities.

To better understand the impact of future cooperative cyber security agreements, policymakers should revisit the economic models and RTAs of the last quarter century – looking especially at the divergent perspectives on whether RTAs would be building blocs or stumbling blocs of a global international order. The building bloc camp believes that RTAs are merely stepping-stones toward global integration. The stumbling bloc camp believes that RTAs are a new form of neo-mercantilism that leads to protectionist walls built around member-states. These camps have theoretical equivalents in today’s cyber domain. The stumbling bloc argument has profound parallels to discussions of the Balkanization of the Internet (i.e. the Splinternet), while the building bloc camp is representative of those anticipating a global diffusion of the Internet. In fact, these two perspectives greatly mirror the divergent ways in which China and Russia approach the Internet (i.e. cyber-nationalism) as opposed to the US approach (i.e. global integration).

While cyberspace will continue to be portrayed as a combative domain as long as attacks persist, policymakers cannot ignore the cooperative aspects of cyber, which increasingly reflect the larger geopolitical and economic landscape. Beijing and Moscow have been expanding collaboration on a range of economic issues. While it’s convenient to point to Sino-Soviet tensions during the Cold War to discount any trans-Asian partnership between these two giants, such a heuristic would be not only erroneous but also detrimental to understanding global cyber trends. The two countries are increasingly aligned diplomatically, and even more so economically. This past spring, Russia and China signed an agreement between their largest banks to trade in local currencies, bypassing the historic role of the dollar. This summer, the two countries signed a more comprehensive agreement to further trade in local currencies, again eliminating the need for the US dollar. If the latest rumors are correct, next week Russia and China will sign a cyber security agreement at–of all places–the Asia Pacific Economic Cooperation (APEC) summit.

APEC will provide a global forum for China to assert an agenda of greater economic integration in the region, including a push for the Asian Infrastructure Investment Bank (AIIB). The AIIB is viewed as a Chinese attempt at restructuring the post-World War II economic order established by the US and Europe. The US has openly challenged the creation of the AIIB for exactly this reason, and for the possibility that it would emerge as a competitor to the World Bank (which was created at the Bretton Woods conference as one of the three pillars of the new Western-dominated global order). While China pushes forward with the AIIB, the US continues to press for the Trans-Pacific Partnership (TPP), a proposed free-trade agreement among a dozen states in the Asia-Pacific region that currently excludes China. China claims the TPP is a US attempt to contain it in the region and has been pushing its own alternatives, such as the AIIB and the Shanghai Cooperation Organization. Now, with a potential cyber agreement between Russia and China, it’s likely that this tit-for-tat behavior will overtly manifest in the cyber domain.


To Forecast Global Cyber Alliances, Just Follow the Money (Part 1): Understanding a Sino-Russian Cyber Agreement through Economic Regionalism


Former Secretary of Defense Leon Panetta called cyberspace “the battlefield of the future,” and this characterization of the cyber domain has only intensified as cyber attacks grow more prevalent and disruptive. But this militarization of the cyber domain often masks an underlying cooperation that is occurring alongside rising geopolitical friction. Rumors of a Sino-Russian cyber agreement have sparked alarm, and are a reminder that both cooperation and conflict are natural outcomes as states jockey for power in cyberspace.

The rumored Sino-Russian cyber agreement is just the latest in a global trend of states signaling diplomatic preferences and commitments via formalized cooperative cyber security agreements. Cooperation in cyberspace in the modern era is reminiscent of the transition to economic cooperation in the post-World War II era and the military cooperation that dominated the earlier eras. In each case, states rely upon those distinct domains to signal affinities and exert power. Since the latter part of the 20th century, economic regionalism has become the defining mode of cooperation among states, in many instances replacing the role alliances once played. With that in mind, policymakers should look to the economic cooperative landscape as a foundation for forecasting the future of cyber security cooperation.

Sino-Russian collaboration across the monetary, commercial, and investment space reveals ever tighter integration between the two countries, and thus a cyber agreement should come as no surprise to those who follow global economic relations. However, the real insight may come from using economic regionalism to assess the implications of this rumored agreement. While a Sino-Russian agreement could be extraordinarily disruptive to the global order, it may have unintended positive ramifications for the US. In fact, such an agreement may encourage other countries across the globe to ameliorate the persistent tensions with the US that have lingered since the Snowden disclosures. Given the current divergent approaches to the role of the Internet, most states are likely to find a universal approach to the Internet much more appealing than the model of censorship and control that Russia and China represent. A quick review of economic regionalism exemplifies the role of agreements, and soft power, in shaping global geopolitical partnerships.

Economic regionalism constitutes the range of economic relations between states, the most prevalent of which are regional trade agreements (RTAs). RTAs increased exponentially beginning with the end of the Cold War and the subsequent global economic liberalization. According to the World Trade Organization, there are currently 379 RTAs in force. In many cases, these RTAs have taken on military cooperative aspects, such as the Economic Community of West African States (ECOWAS). In fact, with the rise of globalization, RTAs often serve as the preferred mode of cooperation as formal alliances have declined. Similarly, cyber security cooperative agreements may soon become the modus operandi for cooperative power politics across the globe, superseding or augmenting the role of economic agreements.

While the impact of today’s RTA-influenced global economic order has been debated considerably, it is clear that cooperation in cyberspace is following a similar structure to that of cooperation in the commercial domain over the last 25 years. In a seminal overview of global political economy, Robert Gilpin notes that, “Important factors in the spread of economic regionalism include the emergence of new economic powers, intensification of international economic competition, and rapid technological developments…Economic regionalism is also driven by the dynamics of an economic security dilemma.” It’s easy to foresee a future wherein “cyber” replaces “economic” in Gilpin’s analysis. In fact, it’s not a stretch to imagine a cyber security dilemma emerging in response to a Sino-Russian cyber security agreement.

Back to the Future: Leveraging the DeLorean to Secure the Information Superhighway


In the cult classic trilogy Back to the Future, Doc claims, “Where we’re going, we don’t need roads.” He’s referencing 2015, and his assertion reminds us just how difficult it is to forecast the future of modern technology. The movies also remind us how tempting it can be to reflect on how things might have been. The current cyber security landscape is ripe for such reflection. What if you could go back in time, knowing what you know today, and alter the armed forces’ approach to cyber security? This was the focus of a dinner I recently had the privilege of attending at the United States Naval Academy Foundation (USNAF), which addressed the specific question,

“Knowing what you know now about cyber threats, cyber espionage, etc., if you could go back to the year 1999 (15 years ago), what advice would you give the armed forces regarding what is needed to prepare for the future…which is now. And how are we doing compared to what you would have said?”

Below are some of the key themes that emerged from this lively discussion, which brought together a diverse range of military, academic and industry perspectives—though unfortunately without the assistance of a DeLorean to facilitate implementation of the recommendations. But it’s never too late, and many of these themes and recommendations can help inform future capabilities and the structure of the cyber workforce:

 

Cyber-safe as a Precondition, Not an Afterthought

For the last fifteen years, cyber security has been treated as a luxury, not a necessity. This has created a technical debt that is difficult but essential to overcome. The acquisition process, warts and all, is a critical component for implementing cyber-safe requirements and ensuring that everything is built to a pre-defined minimal standard of cyber-safety. Cyber-safe as a precondition would have produced many unforeseen, but beneficial, externalities beyond the obvious improvements to cyber security. For example, users who demand modern web experiences but are currently stuck using archaic web applications would have greatly benefited from this approach. Too often, analytic solutions must be compatible with a five-year-old web browser (not naming names) that currently lacks available patches. A key challenge in the cyber domain – and really across the analytic spectrum – is creating modern applications for the community that are on par with users’ experiences in the unclassified environment. But in a world with cyber-safe as a requirement, users could benefit from modern web applications and all of the user-experience features and functionality that accompany modern web browsers. Data storage, indexing, processing, and many other areas well beyond data analysis would benefit from an a priori cyber-safe requirement for all technologies. Cyber-safe should not be viewed as an afterthought, and the armed forces must overcome significant technical debt to achieve greater cyber security.

 

Revolutionary, not Evolutionary, Changes to the Cyber Mindset

In addition to the technology itself, cyber practitioners are equally essential for successful cyber security. During the discussion, we debated the opportunities and challenges associated with greater inclusion of cyber experts who may follow what are currently viewed as non-traditional career tracks (i.e. little or no formal computer science experience). Including these non-traditional experts would require overcoming significant gaps in both pay and culture to attract many of the best and brightest in cyber security. While this may be a longer-term solution, several near-term and more tangible recommendations also emerged. The notion of a military version of the Black Hat conference (which I wrote about here) gained some traction within the group. This type of forum could bring together cyber practitioners across the military, academic and industry spectrum to highlight innovative research and thought leadership and ideally bridge the gap between these communities. There was also interest in formulating analogies in the cyber domain to current practices and doctrine—likely more geared toward tactical application and technical training, but pertinent at the strategic and policy level as well. Frameworks and analogies are useful heuristics, and should be emphasized to help evolve our thinking within the cyber domain.

 

Redefining Cyberwarriors

The US government has not been shy about its plans to dramatically expand its cadre of cyberwarriors. However, this usually entails an emphasis on STEM-centric training applied to information security. This is the bedrock of a strong cyber security foundation, but it is not enough. Everyone, regardless of discipline, must become cyber competent. The USNA has already started down this path ahead of most other academic institutions. Upon graduation, every student will have completed two core cyber courses, many take additional interdisciplinary cyber electives, and this year will be the second in which graduates can major in cyber operations. We discussed the need to further expand upon this core, especially in areas such as law that will enable graduates to navigate the complicated legal hurdles encountered within the cyber domain.

As expected with any paradigm shift, there has been resistance to this approach. Nevertheless, the USNA continues to push forward with dual cyber tracks – one for cyber operations majors, and another track for other majors to maintain cyber competency. This will pay great dividends in both the short and long term. Having now spent a significant amount of time with diverse groups of people from engineering, humanities and social science backgrounds, it is clear that linguistic and cultural divisions exist among these groups. Bridging this divide has longer-term implications for cyber competency both at the policy and tactical levels, and it can also spark innovation in the cyber security domain. It will ensure that cyber security technologists understand how their work fits into the larger mission, while similarly elevating technical cyber competency among military leaders and decision makers.

Expanding the notion of what constitutes a cyber warrior may in fact be one of the most important recommendations we discussed. Cyber can no longer be relegated to a niche competency required of only a small percentage of the workforce. The situation reminds me of quite possibly my favorite quote. When releasing the iPad a few years back, Steve Jobs noted, “It’s in Apple’s DNA that technology alone is not enough. It’s technology married with liberal arts, married with the humanities, that yields the results that make our hearts sing.” Knowing what we know now about the great potential for innovation in solutions that draw from technology as well as other disciplines, perhaps this same sort of cross-disciplinary competency can be applied equally to cyber challenges, which will only become more complex and pose even greater challenges to our national interests.

Challenges in Data-Driven Security


DEFCON 22 was a great learning experience for me. My goal was to soak up as much information security knowledge as possible to complement my existing data science experience. I grew more and more excited as each new talk taught me more and more security domain knowledge. But as Alex Pinto began his talk, this excitement turned to terror.

I knew exactly where he was going with this. And I also knew that any of the marketing blurbs he showed about behavioral analysis, mathematical models, and anomalous activity could have easily been from Endgame. I had visions of being named, pointed out, and subsequently laughed out of the room. None of that happened, of course. Between Alex’s talk and a quick Google search, I determined that none of those blurbs were from my company. But that wasn’t really the point. They could have been.

That’s because we at Endgame are facing the same challenges that Alex describes in that talk. We are building products that use machine learning and statistical models to help solve security problems. Anyone doing that is entering a field littered with past failures. To try and avoid the same fate, we’ve made sure to educate ourselves about what’s worked and what hasn’t in the past.

Alex’s talk at DEFCON was part of that education. He talked about the curse of dimensionality, adversaries gaming any statistical solution, and algorithms detecting operational rather than security concerns. This paper by Robin Sommer and Vern Paxson is another great resource that enumerates the problems that past attempts have run up against. It talks about general challenges facing unsupervised anomaly detection, the high cost of false-positive and false-negative misclassifications, the extreme diversity of network traffic data, and the lack of open and complete data sets to train on. Another paper critiques the frequent use of an old DARPA dataset for testing intrusion detection systems, and by doing that reveals a lot of the challenges facing machine learning researchers looking for data to train on.
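
To see why those false positives hurt so much, here is a quick back-of-the-envelope sketch of the base-rate problem that paper describes (all numbers below are invented purely for illustration):

```python
# Hypothetical numbers: 1 in 10,000 events is actually malicious, and the
# detector catches 99% of malicious events while false-alarming on 1% of benign ones.
events = 10_000_000
base_rate = 1 / 10_000
tpr, fpr = 0.99, 0.01

malicious = events * base_rate          # 1,000 truly malicious events
benign = events - malicious             # 9,999,000 benign events

true_alerts = malicious * tpr           # ~990 real detections
false_alerts = benign * fpr             # ~99,990 false alarms
precision = true_alerts / (true_alerts + false_alerts)
print(f"{precision:.1%} of alerts are real")   # roughly 1%
```

Even a detector that sounds excellent on paper drowns its analysts in noise, because malicious events are so rare relative to benign traffic.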

Despite all that pessimism, there have been successes using data science techniques to solve security problems. For years here at Endgame, we’ve successfully clustered content found on the web, provided data exploration tools for vulnerability researchers, and used large scale computing resources to analyze malware. We’ve been able to do this by engaging our customers in a conversation about the opportunities—and the limitations—presented by data science for security. The customers tell us what problems they have, and we tell them what data science techniques can and cannot do for them. This very rarely results in an algorithm that will immediately identify attackers or point out the exact anomalies you’d like it to. But it does help us create tools that enable analysts to do their jobs better.

There is a trove of other success stories included in this blog post by Jason Trost. One of these papers describes Polonium, a graph algorithm that classifies files as malicious or benign based on the reputations of the systems they are found on. This system avoids many of the pitfalls mentioned above. Trustworthy labeled malware data from Symantec allows the system to bootstrap its training, and the large-scale, reputation-based algorithm makes gaming the system difficult beyond file obfuscation.
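
As a rough, self-contained toy sketch of the intuition behind that kind of reputation-based scoring (far simpler than Polonium's actual belief propagation; the machines, files, and prior scores below are all made up):

```python
# Toy reputation propagation: scores are in [0, 1], higher = more likely benign.
machine_files = {                                  # which files sit on which machines
    "laptop-1": ["calc.exe", "dropper.bin"],
    "laptop-2": ["calc.exe", "notepad.exe"],
    "server-3": ["dropper.bin", "keylog.dll"],
}
machine_rep = {"laptop-1": 0.6, "laptop-2": 0.9, "server-3": 0.2}  # prior hygiene scores

file_score = {f: 0.5 for files in machine_files.values() for f in files}

# Alternate a few rounds: a file's score follows the reputation of the machines
# hosting it, and a machine's reputation is nudged toward the scores of its files.
for _ in range(10):
    for f in file_score:
        hosts = [m for m, files in machine_files.items() if f in files]
        file_score[f] = sum(machine_rep[m] for m in hosts) / len(hosts)
    for m, files in machine_files.items():
        machine_rep[m] = 0.5 * machine_rep[m] + 0.5 * sum(file_score[f] for f in files) / len(files)

# Lowest-scoring files (those found mainly on low-reputation machines) look most suspicious.
print(sorted(file_score.items(), key=lambda kv: kv[1]))
```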

The existence of success stories like these proves that data-driven approaches can help solve information security problems. When developing those solutions, it’s important to understand the challenges that have tested past approaches and always be cognizant of how your approach will avoid them.

We’ll use this blog over the next few months to share some of the successes and failures we here at Endgame have had in this area. Our next post will focus on our application of unsupervised clustering for visualizing large, high dimensional data sets. Stay tuned!

Soft Power is Hard: The World Internet Conference Behind the Great Firewall


For three days, Chinese citizens are able to tweet at will and access Google, Facebook, and other forms of social media and traditionally censored content—but only if they are in the historic town of Wuzhen, where China is currently hosting the World Internet Conference. During this temporary reprieve from Internet censorship in Wuzhen, the rest of the country has experienced a surge in censorship targeted at blocking access to several media outlets, such as The Atlantic, and to the content delivery network Edgecast. The conference appears to have been put together in response to a similar series of conferences on global cyber norms led by the UK, South Korea, Hungary and the Netherlands, and it’s just the latest effort China has made to influence and structure 21st century cyberspace norms. However, just as China failed to conceal the pollution during last week’s APEC summit, the government is encountering similar challenges in its attempt to simultaneously disguise the Great Firewall and promote Internet freedoms. The conference illuminates the stark contrast between China’s version of a state-controlled Internet within sovereign borders and the free and open Internet promoted by democratic states across the globe.

The goal of the World Internet Conference is to “give a panoramic view for the first time of the concept of the development of China’s Internet and its achievements,” according to Lu Wei, the minister of China’s new Cyberspace Administration. However, the conference may inadvertently highlight the hypocrisy of an uncensored Internet conference occurring within one of the most censored countries in the world. In fact, much of the world seems absent from what was meant to be a global conference, understanding full well the cognitive dissonance that seems to have evaded the Chinese leadership in organizing it. Only a handful of the speakers are non-Chinese, and the world’s biggest Internet players are largely absent from the discussion.

For several years, China has leveraged its clout to attempt to shape global cyberspace norms. China and Russia jointly proposed The International Code of Conduct for Information Security to the United Nations, ironically calling for a free and open global Internet while their domestic censorship continues to expand. Just as rising powers exerted their influence to shape the post-World War II international order, China is similarly leaning on extant institutions, and introducing new international institutions, to shape the cyber norms of the 21st century global order. However, China fails to grasp the importance of soft power in shaping global norms of any kind. Power can be achieved via coercion, payment, or attraction. Soft power occupies the realm of attraction: promoting values that are attractive to others. As Joseph Nye explained last year, China (and Russia, for that matter) is failing miserably at soft power because it does not account for the attraction component of the equation. The World Internet Conference makes this unabashedly clear.

As China continues to exert influence over the global cyber commons, there is certainly cause for concern that it might extend its sphere of influence and encourage others to limit Internet freedoms. As William Nee of Amnesty International notes, “Now China appears eager to promote its own domestic Internet rules as a model for global regulation. This should send a chill down the spine of anyone that values online freedom.” While concern is warranted, it would be myopic to overreact and ignore the vital component of attraction within soft power. What China fails to understand is that its version of soft power holds little attraction. Soft power is only truly effective when it promotes universal values such as freedom and openness—not dictatorial control over access to information.

Is This the Beginning of the End of “Duel”-track Foreign Policy?


The Iranian nuclear negotiations occupy a persistent spot in the foreign policy news cycle. The Associated Press recently reported that Iran has agreed to a list of nuclear concessions. Although still improbable, the likelihood of even minimal collaboration between the United States and Iran appears greater now than at any point in recent memory. Unless, of course, you happened to stumble upon the revelations of Operation Cleaver, which have been largely ignored by all but the tech media outlets. The report highlights an alleged widespread Iranian cyber campaign targeting critical infrastructure in about a dozen countries, including the United States. Just as we’re seeing the first glimpses of potential US-Iran cooperation in the nuclear realm, the opposite is happening in cyberspace. This uncomfortable reality highlights the modern age of diplomacy, wherein diplomacy in the physical world and diplomacy in the virtual world run along completely orthogonal tracks.

Congressman Mike Rogers, chair of the House Intelligence Committee, is one of the few policymakers who has actually noted this potential relationship between policy in the physical and virtual worlds, stating that if the nuclear negotiations fail, Iran could resume cyber activity. Unfortunately, as Operation Cleaver highlights, Iranian cyber activity targeting physical infrastructure has likely been escalating, not de-escalating, over the last two years. Operation Cleaver is perhaps the timeliest example of the Janus-faced nature of foreign policy, which has occurred for well over a decade and is not unique to Iranian-US relations. Take the recent APEC meeting, for example, where the US and China brokered a deal to counter climate change. This occurred within weeks of an FBI warning of a widespread Chinese cyber campaign targeted both at the US private sector and at government agencies, and within days of the announcement of a September breach at the US National Weather Service, which has also been linked to China. Similarly, the US and Russia continued the START nuclear arms reduction negotiations earlier this year just as cyber-attacks escalated, some of which targeted US federal agencies. Of course, both states have actually increased their deployed nuclear forces since this past March, but the two countries nevertheless remain on track for additional negotiations in 2015. It’s not unusual for states to pursue divergent relationships across distinct areas of foreign policy. Cooperation vacillates between the various arenas, but rarely does it take on the dueling nature we see between the physical and virtual worlds.

The Director of National Intelligence, James Clapper, and many, many other leaders have described the modern era as containing an unprecedented array of diverse and dynamic threats. This brings new challenges, of course, but perhaps one of the most striking challenges remains largely unspoken. Foreign policy in the modern era has thus far differentiated relationships in the physical and virtual worlds. Will this remain a distinct, modern foreign policy challenge? With the continued trend of the private sector surfacing foreign nation-state cyber campaigns, it seems 2014 may mark the beginning of the end of dueling foreign policies. The ongoing series of revelations of alleged foreign states and their affiliates targeting the US public and private sectors (e.g. China’s PLA Unit 61398 and Axiom group, Russian association with the JP Morgan breaches, North Korea with Sony, and now Operation Cleaver) is likely indicative of the future “outing” of cyber behavior by the private sector. In the future, the US is likely to leverage disclosures made by the private sector, which in turn provides the government the luxury of concealing or revealing its own information, and can even assist negotiations across the diplomatic spectrum.

This period of disparate US policies in the physical and virtual worlds will be increasingly difficult to juggle in light of publicized revelations of cyber campaigns conducted against US federal agencies and corporations. At some point, public opinion will reach a tipping point and demand a more coordinated response and defense against cyber campaigns by foreign states. It will be increasingly difficult to maintain a two-track foreign policy as new revelations occur. That tipping point may still be in the distant future, as the US public remains largely unaware of many of these campaigns because they are not broadly publicized. In fact, many of these foreign-sponsored cyber campaigns – especially if targeted against federal agencies – remain publicized only by tech-focused media outlets. We’ll spend some time examining this particular trend in more detail in a future post.

Blurred Lines: Dispelling the False Dichotomy between National & Corporate Security


Several US government agencies have experienced targeted cyber attacks over the last few months. Many believe China is responsible for cyber attacks on the Office of Personnel Management, the US Postal Service and the National Weather Service. Russia has been linked to many recent breaches, including those on the White House and State Department unclassified networks. Given the national security implications of such breaches, these attacks should have monopolized the news cycle. However, they have barely registered a blip. Conversely, the data breaches at large companies such as Sony, Home Depot, Target and Neiman Marcus have dominated the news and have led many Americans to rank concern over hacking higher than any other criminal activity. But characterizing these events as solely private sector or public sector breaches oversimplifies the state of cyber security today. Many of the private sector intrusions are linked to Russia, China, Iran, and now even North Korea. While attribution of the Sony breach remains contested, a North Korean spokesman claimed it was part of the larger struggle against US imperialism. In fact, many of these private sector breaches have been directly linked to, or are considered retaliation for, various aspects of US foreign policy. Formulating a rigid line between public and private sector categorization is not only erroneous, but it also masks the reality of the complex cyber challenges the US faces.

 

From Unity Against a Common Threat to Disunity Against a Hydra

In the late 1980s and early 1990s, Japan was perceived as a greater threat to US security than the Soviet Union. The private sector was quite vocal during this time, providing evidence of dumping and unfair trade practices, while supporting voluntary export restraints and a series of other protectionist measures for the US domestic sector. While one can question the success of the policies (and assessment of the threat!), it is clear that a unified understanding of a common threat among private and public sectors greatly enhanced the efficiency with which the US was able to respond. It is this common understanding between the two groups that is still missing today.

Russia, China, Iran and many other actors have been escalating cyber-attacks on the federal government and the private sector for well over a decade. China has been wielding cyber attacks against federal agencies since at least 1999, when it targeted the National Park Service and the Departments of Energy and Interior. However, this is no longer a government-to-government problem, given the rise of non-state actors as both perpetrators (e.g. the Syrian Electronic Army) and victims (e.g. multinational corporations). Each kind of attack – regardless of state or non-state actor involvement – has both national security and economic implications. For instance, Target’s profits and reputation have taken a big hit following last year’s credit card breach. Home Depot faces similar economic risk over the loss of customers following its data breach. It’s too soon to tell exactly how much financial and reputational damage the breach at Sony will incur. These private sector breaches also have national security implications, especially when targeted at the financial sector and critical infrastructure, which is increasingly a target of cyber-attacks by foreign governments (e.g. Operation Cleaver). Despite these similarities in adversaries, there remains a stark disconnect in the portrayal and general contextualization of breaches in the private and public sectors.

 

Technical Similarities

These private and public sector breaches exhibit not only similar threat profiles, but also technical similarities. These attacks are indicative of the larger tactics, techniques and procedures (TTPs) of adversaries as they conduct reconnaissance and trust-building intrusions that lead to major attacks such as the Sony breach. In many cases, the initial access to the target systems was gained through third party contractors, both government and commercial, as well as through targeted spear phishing and watering hole attacks. In each case, the commonality is leveraging trust. From an attacker’s standpoint, every breach of trust enables more opportunity. Successful spearphishing campaigns gather enough information about their targets to craft the most effective message to entice a click. In the case of recent federal agency breaches, it is important to remember that adversaries conduct reconnaissance of networks prior to escalating to major attacks, and they often begin with lower value targets before moving on to higher value ones. Every seemingly harmless intrusion must be viewed as a first step toward a larger attack, not an end in and of itself. If an attacker compromises a government office, what information does that office have that could be used to further compromise both government agencies and commercial companies? Something seemingly innocuous, like the email addresses of contractors, could be used to launch a new targeted operation. At some point, people make mistakes, and attackers thrive on mistakes. They have the benefit of time and information to make the best decision about how to expand their access until critical systems and information have been infiltrated. In short, the TTPs – especially the exploitation of trust to conduct ever-greater intrusions – are very similar in private and public sector breaches.

 

Could More Convergence Lead to a Unified Response?

Last week, the Senate Banking Committee discussed cybersecurity in the financial sector, including the Cybersecurity Information Sharing Act. Clearly, this is an important step. However, absent from this discussion were some of the major stakeholders in the financial industry, further perpetuating the divide between the public and private sectors. Only when there is a common understanding of the threats and challenges of cyberspace can the two sides come together and provide more holistic and effective responses. The cyber attacks on federal agencies and the private sector must finally be elevated within popular discourse and understood for what they are – reconnaissance and trust-building intrusions, increasingly by the same foreign adversaries. As news of another cyber attack on a federal agency or private sector company emerges, it would be much more helpful if it were placed in the larger context as a targeted, national security breach. A unified response by the US first requires a unified understanding of the threat. Absent a coherent and integrated understanding of the threat, attacks against banks, corporations and federal agencies will only continue to grow.

Understanding Crawl Data at Scale (Part 1)


A couple of years ago, in an effort to better understand technology trends, we initiated a project to identify typical web site characteristics for various geographic regions. We wanted to build a simple queryable interface that would allow our analysts to interact with crawl data and identify non-obvious trends in technology and web design choices. At first this may sound like a pretty typical analysis problem, but we have faced numerous challenges and gained some pretty interesting insights over the years.

Aside from the many challenges inherent in crawling the Internet and processing that data, at the end of the day we end up with hundreds of millions of records, each with hundreds of features. Identifying “normal trends” over such a large feature set can be a daunting task. Traditional statistical methods really break down at this point: they work well for one or two variables but are rendered pretty useless once you hit more than ten. This is why we have chosen to use cluster analysis in our approach to the problem.

Machine learning algorithms, the Swiss army knife of a data scientist’s toolbox, break down into three classifications: supervised learning, unsupervised learning, and reinforcement learning. Although mixed approaches are common, each of the three lends itself to different tasks. Supervised learning is great for classification problems where you have a lot of labeled training data and you want to identify appropriate labels for new data points. Unsupervised techniques help to determine the shape of your data, categorizing data points into groups by mathematical similarity. Reinforcement learning includes a set of behavioral models for agent-based decision-making in environments where the rewards (and penalties) are only given out on occasion (like candy!). Cluster analysis fits well within the realm of unsupervised learning but can take advantage of supervised learning (making it semi-supervised learning) in a lot of scenarios, too.

So what is cluster analysis and why do we care? Consider web sites and features of those sites. Some sites will be large, others small. Some will have lots of images; others will have lots of words. Some will have lots of outbound links, and others will have lots of internal links. Some web sites will use Angular; others will prefer React. If you look at each feature individually, you may find that the average web site has 11 pages, 4 images and 347 words. But what does that get you? Not a whole lot. Instead, let’s sit back and think about why some sites may have more images than others or choose one JavaScript library over another. Each webpage was built for a purpose, be it to disseminate news, create a community forum, or blog about food. The goals of the web site designer will often guide his or her design decisions. Cluster analysis applies #math to a wide range of features and attempts to cluster websites into groups that reflect similar design decisions.
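
To make that concrete, here is a minimal sketch of the kind of clustering this implies, using a handful of invented per-site summary features (the numbers, feature set, and cluster count are illustrative only, not our production pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row is one website: [pages, images, words, outbound_links, internal_links]
sites = np.array([
    [ 11,  4,  347,  25,  60],   # small blog-like site
    [  9,  3,  400,  30,  55],
    [850, 40, 9000, 300, 120],   # large news-like site
    [900, 35, 8500, 280, 140],
    [ 12, 90,  150,  10, 200],   # image-heavy portfolio-like site
    [ 14, 85,  180,  12, 190],
])

# The features live on very different scales, so standardize them first.
X = StandardScaler().fit_transform(sites)

# Group the sites into clusters that reflect similar design decisions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # which group each site landed in
print(kmeans.cluster_centers_)   # the "typical" profile of each group (standardized units)
```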

Once you have your groups, generated by #math, you’ve just made your life a whole lot simpler. A few minutes (or hours) ago you had potentially thousands or millions of items to compare across hundreds of fields. Now you’ve got tens of groups that you can compare in aggregate. Additionally, you now know what makes each group a group and how it distinguishes itself from one or more other groups. Instead of looking at each website or field individually, now you’re looking at everything holistically. Your job just got a whole lot easier!

Cluster analysis gives you some additional bonus wins. Now that you have normal groups of websites, you can identify outliers within the set - those that are substantially dissimilar from the bulk of their assigned group. You can also use these clusters as labels in a classifier and determine in which group of sites a new one fits best.
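
Here is a small, self-contained sketch of those two bonus wins on synthetic data (the feature values, threshold, and choice of classifier are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Two synthetic "design styles" ([pages, images, words]) plus one oddball site.
blogs   = rng.normal([10, 5, 400],    [2, 1, 50],   size=(50, 3))
portals = rng.normal([500, 40, 9000], [50, 5, 500], size=(50, 3))
oddball = np.array([[10, 400, 100]])               # image-stuffed outlier
sites = np.vstack([blogs, portals, oddball])

scaler = StandardScaler().fit(sites)
X = scaler.transform(sites)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Outlier detection: sites unusually far from their own cluster's centroid.
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
print(np.where(dist > np.percentile(dist, 99))[0])    # should flag the oddball (index 100)

# Classification: reuse the cluster assignments as labels for new, unseen sites.
clf = RandomForestClassifier(random_state=0).fit(X, kmeans.labels_)
print(clf.predict(scaler.transform([[12, 6, 350]])))  # lands in the small-blog group
```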

In coming posts, we will go into more detail about how we cluster and visualize web crawl data. Stay tuned!


The Fog of (Cyber) War: The Attribution Problem and Jus ad Bellum


The Sony Pictures Classics film The Fog of War is a comprehensive and seemingly unfiltered examination of former Secretary of Defense Robert McNamara, highlighting the key lessons he learned during his time as a central figure in US national security from WWII through the Cold War. The biopic calls particular attention to jus ad bellum – the criteria for engaging in conflict. Over a decade later, Sony itself is now at the center of a national security debate. As the US government ponders a “proportional response” – a key tenet of Just War theory – in retribution for the Sony hack, and many in the security community continue to question the government’s attribution of the breach to North Korea, it is time to return to many of McNamara’s key lessons and consider how the difficulty of cyber attribution – and the prospect of misattribution – can only exacerbate the already tenuous decision-making process in international relations.

  • Misperception: The misperception and miscalculation that stem from incomplete information are perhaps the most omnipresent instigators across all forms of conflict. McNamara addresses this through the notion that “seeing and belief” are often wrong. Similarly, given the difficulty of positively attributing a cyber attack, victims and governments often resort to confirmation bias, selecting the circumstantial evidence which best confirms their beliefs. Cyber attacks aggravate the misguided role of incomplete information, leaving victims to formulate a response without fully knowing: 1) the financial and national security magnitude of the breach; 2) what the perpetrator will do with the information; 3) the perpetrator’s identity. Absent this information, a victim may respond disproportionally and target the wrong adversary in response.
  • Empathize with your Enemy: McNamara’s lesson draws from Sun Tzu’s “know thy enemy” and describes the need to evaluate an adversary’s intent by seeing the situation through their eyes. Understanding the adversary and their incentives is an effective way to help identify the perpetrator, given the technical challenges with attribution. To oversimplify, code can be recycled from previous attacks, purchased through black markets for malware, and can be socially engineered to deflect investigations towards other actors. Moreover, states can outsource the attack to further redirect suspicions. A technical approach can limit the realm of potential actors responsible, such as to nation-states due to the scope and complexity of the malware. But it is even more beneficial to marry the technical approach with an understanding of adversarial intent to help gain greater certainty in attribution.
  • Proportionality: Proportionality is a key component of jus ad bellum as well as of jus in bello (the criteria for behavior once in war). Given his role in the carpet-bombing of Japan, McNamara somewhat surprisingly stresses the role of a proportional response. President Obama’s promise of a proportional response to the Sony breach draws specifically on this Just War mentality. But the attribution problem, coupled with misperception and incomplete information, makes it exceedingly difficult to formulate a proportional response to a cyber attack. Clearly, a response would be more straightforward if there were a kinetic effect of a cyber attack, such as was recently revealed in the Turkey attack that occurred six years ago. But even this raises the question of what a proportional response looks like after so many years. It could similarly be years before the complete magnitude of the Sony breach is realized, or before it is clear exactly what ‘red line’ would trigger a kinetic or non-kinetic response to a cyber attack.
  • Rational choice: A key theory in international relations, rational choice theory assumes that actors logically make decisions by weighing the potential costs and benefits of an action. While this continues to be debated, McNamara notes that with the advent of nuclear weapons, human error can lead to unprecedented destruction despite rational behavior. This is yet again magnified in the cyber domain, especially if misattribution leads to retaliation against the wrong adversary, or if human error in a cyber response has unintended consequences. Rational choice decisions are only as good as the data at hand, and therefore seemingly “rational” decisions can inadvertently lead to unintended outcomes due to limited data or misguided interpretations of the data. Moreover, similar to the nuclear era, human error can also lead to unprecedented destruction in the cyber domain. However, cyber retaliatory responses are not limited to a select few high-level officials; the capabilities are much more dispersed across agencies and leadership levels, expanding the scope for potential human error.
  • Data-driven Analyses: McNamara’s decision to bring in a team of quants to take a more innovative approach to national security analysis was a milestone in international relations. However, like all forms of analysis, quantitative and computational analyses must not be accepted at face value, but rather subjected to rigorous inspection of the data and methodologies used to produce the findings. The last few weeks have seen a range of analyses used either to validate or to cast doubt on the attribution of the Sony breach to North Korea. These vary significantly in analytic rigor, but many are plagued by limited data, which produces analytic problems such as: 1) a small N, meaning any results are not statistically significant and should be met with skepticism; 2) natural language processing analyses using models trained on different language structures that do not travel well to coding languages; 3) selection bias, wherein the set of potential actors analyzed is not a representative sample; 4) poor data sampling, wherein analyses of different subsets of the data lead to differing conclusions. Because of these analytic hurdles, various analyses point unequivocally to actors as diverse as North Korea, the Lizard Squad, Russia, the Guardians of Peace, and an insider threat. Clearly, attributing the attack is a key goal of these analyses, but limited data makes it all too easy to simply confirm prior beliefs. Data-driven analyses provide solid footing when making claims, but the various data gaps inherent in cyber make such analyses much more vulnerable to misinterpretation.

Beyond a Cold War Framework: Each of these lessons highlights how the digital age amplifies the already complex and opaque circumstances surrounding jus ad bellum. As we begin another year, we are yet again reminded not only of the seemingly cyclical nature of history, but also of just how distinct the modern era is from its predecessors. It’s time for a framework that builds upon past knowledge while also adapting to the realities of the cyber domain. Too often, decision-making remains relegated to Cold War frameworks built around conventional warfare, mutually assured destruction, and a known adversary. It would be devastating if the complexity of the cyber domain led to misattribution and a response against the wrong adversary – with all of the unintended consequences that would entail. If nothing else, let’s hope the Sony breach serves as a wake-up call for a new policy framework rigorous enough to handle the fog of cyber war.

The Year Ahead in Cyber: Endgame Perspectives on 2015


From the first CEO of a major corporation resigning in the wake of a cyber attack, to NATO incorporating the cyber realm into Article 5, to the still fresh-in-our-minds Sony attack, 2014 was certainly a year to remember in cyber security. As we begin another year, here’s what some of us at Endgame predict, anticipate, or hope 2015 will bring for cyber:

Lyndon Brown, Enterprise Product Manager 

In 2014, security teams were blind to most of the activity that happened within their networks and on their devices. While the majority of this activity was benign, security breaches and other malicious activity went unnoticed. These incidents often exposed corporate data and disrupted business operations.

2015 is the year that CISOs must decide that this reality is unsustainable. Motivated, in part, by high-profile breaches, security heads will adjust their strategy and manifest this shift in their 2015 budgets. On average, CISOs will increasingly fund threat detection and incident response initiatives. As the top security executive of a leading technology company poignantly stated, “we’ve finally accepted that any of our systems are or can be compromised.”

Since security budgeting is usually a zero-sum game, spending on preventive controls (such as anti-virus products) will stay stagnant or decline. As security buyers evaluate new products, they will prioritize solutions that leverage context and analysis to make advanced security judgments, and that see all security-relevant behavior – not just what is available in logs.

Rich Seymour, Senior Data Scientist @rseymour 

The world of computer security will no doubt see some harrowing attacks this year, but I remain more hopeful than in years past. Burgeoning work in electronic communication—secure, encrypted, pseudo-anonymized and otherwise (like Pond, ssh-chat, bitmessage, DIME, etc.)—won’t likely move into the mainstream in 2015, but it’s always neat to see which projects gain traction. The slowly paced rollout of sorely needed secure open voting systems will continue, which is awesome, and includes California’s SB360 allowing certification of open source voting systems, LA County’s work in revamping its election experience, Virginia’s online voter registration, and the OSET Foundation’s work, just to name a few.

I hope that this year’s inevitable front-page security SNAFUs will lead more people to temper their early adoption with a measure of humorous cynicism. Far on the other side of the innovation adoption graph, let’s hope that those same security SNAFUs lead the behemoth tech laggards to pull the plug on dubious legacy systems and begin a blunt examination of their infrastructural vulnerabilities. As a data scientist at Endgame, I don’t want to make any predictions in that domain, lest I get thrown to the wolves on twitter for incorrectly predicting that 2015 will be the year a convolutional deep learning network will pre-attribute an attack before the first datagram hits the wire. Let’s not kid ourselves—that’s not happening until 2016 at the earliest.

Jason Rodzik, Director of CNO Software Engineering 

In 2015, I expect to see companies—and maybe even the public as a whole—taking computer security much more seriously than they have previously. 2014 ended with not only a number of high-profile breaches, but also unprecedented fallout from those breaches, including the replacement of a major corporation’s (Target’s) CEO and CIO, increased interest in holding companies legally responsible if they fail to secure their systems, and most drastically, a chilling effect on artistic expression and speech (in addition to the large financial damages) with the reactions resulting from the Sony hack. Historically, it’s been hard for anyone looking at financial projections to justify spending money on a security department when it doesn’t generate revenue, but the cost associated with poor security is growing to the point where more organizations will have to be much more proactive in strengthening their security posture.

Douglas Raymond, Vice President 

One area where cybersecurity products will change in 2015 is in the application of modern design principles to user interfaces. There’s a shortage of skilled operators everywhere in the industry, and there isn’t enough time or resources to train them. Companies must solve their challenges with small staffs that have a diversity of responsibilities and not enough time to learn how to integrate a multitude of products. The cost of cognitive overload is high. Examples such as the shooting down of MH17 over Ukraine, the U.S. bombing of the Chinese Embassy in Belgrade, and the Target data breach, to cite a well-known cybersecurity example, demonstrate the real costs of presenting operators with too much information in a poorly designed interface. Data science isn’t enough—cyber companies in 2015 will synthesize data and control interfaces to provide operators with only the most critical information they need to solve the immediate security challenge.

Andrea Little Limbago, Principal Social Scientist @limbagoa 

This year will be characterized by the competing trends of diversity and stagnation. The diversity of actors, targets, activities, and objectives in cyberspace will continue to far outpace strategic understanding of the causes and repercussions of computer network operations. A growing number of state and non-state actors will seek creative means to use information technology to achieve their objectives. These will range from nation-state sponsored cyber attacks that may result in physical damage on the one extreme, to the use of cyber statecraft to advance political protest and social movements (e.g. potentially non-intuitive employments of DDoS attacks) and give a voice to those censored by their own governments on the other. Furthermore, there will be greater diversity in the actors involved in international computer network operations. With the transition away from resources and population toward knowledge-based capabilities within cyberspace, there will be a “rise of the rest” similar to economic forecasts of the BRICs (Brazil, Russia, India, China, and later South Africa) a decade and a half ago. Just like those forecasts, some of the rising actors will succeed, and some will falter. In fact, the BRIC countries will be key 2015 cyber actors, simultaneously using computer network operations internally to achieve domestic objectives and externally to further geopolitical objectives. Additionally, those actors new to the cyber domain – from rising states to multinational corporations to nongovernmental organizations – may subsequently expose themselves to retaliation for which they are ill prepared.

However, despite this diversity, we’ll continue to witness the juxtaposition of theoretical models from previous areas onto the cyber domain. From a Cold War framework to the last decade’s counter-terrorism models, many will attempt to simplify the complexities of cyberspace by merely placing it in the context of previous doctrine and theory. This “square peg in a round hole” problem will continue to plague the public and private sectors, and hinder the appropriate institutional changes required for the modern cyber landscape. Most actors will continue to respond reactively instead of proactively, with little understanding of the strategic repercussions of the various aspects of tactical computer network operations.

Graphic credit: Anne Harper

Could a Hollywood Breach and Some Tweets Be the Tipping Point for New Cyber Legislation?


Two months ago, near-peer cyber competitors breached numerous government systems. During this same time, China debuted its new J-31 stealth fighter jet, which has components that bear a remarkable resemblance to the F-35 thanks to the cyber-theft of data from Lockheed Martin and subcontractors. One might think that this string of cyber breaches into a series of government systems and emails, coupled with China’s display of the fighter jet, would raise public alarm about the increasing national security impact of cyber threats. But that didn’t happen. Instead, it took the breach of an entertainment company, and the cancellation of a movie, to dramatically increase public awareness and media coverage of these threats. While the Sony breach ultimately had minimal direct national security implications, it nevertheless marks a dramatic turning point in the level of attention and public concern over cybersecurity.

Whereas the hack of a combatant command’s Twitter feed a month ago would not have garnered much attention, this week it was considered breaking news and covered by all major news outlets - despite the fact that the Twitter account is not hosted on government servers, and the Department of Defense noted that although it was a nuisance, it does not have direct operational impact. Media coverage consistently reflects public interest. The high-profile coverage of these two latest events, which exhibit tertiary links to national security, reflects the sharp shift in public interest toward cybersecurity and a potentially greater demand for government involvement in the cybersecurity domain. In all likelihood, the Sony breach will not be remembered for its vast financial and reputational impact, but rather for its impact on the public discourse. This discourse, in turn, may well be the impetus that the government requires to finally emerge from a legislative stasis and enable Congress and the President to pursue the comprehensive cyber legislation and response strategies that have been lacking for far too long.

The widespread reporting and interest in the Sony breach may in fact spark a sharp change from an incremental approach to public policy toward a much more dramatic shift. In social and organizational theory, this is known as punctuated equilibrium, whereby certain events instigate major policy changes. While it is disconcerting - but not shocking - that the Sony breach may be just this event, the recent intense media focus on CENTCOM’s Twitter feed (which some go so far as to call a security threat) signals that the discourse has dramatically changed. This is great timing for President Obama, as he speaks this week about private-public information sharing and partnerships prior to highlighting cyber threats within his State of the Union address next week. In fact, he is using these recent events to validate his emphasis on cybersecurity in next week’s address, noting, “With the Sony attack that took place, with the Twitter account that was hacked by Islamist jihadist sympathizers yesterday, it just goes to show how much more work we need to do both public and private sector to strengthen our cyber security.” Clearly, these events - which on the national security spectrum of breaches over the last few years are relatively mundane - have triggered a tipping point in the discourse of cybersecurity threats such that cyber legislation may actually be possible.

These recent events provide a “rally around the flag” effect, fostering a public environment that is conducive to greater government involvement in the cybersecurity realm (and is a notably stark contrast to the public discourse post-Snowden in 2013). Of course, while there is reason for optimism that 2015 may be the year of significant cybersecurity legislation, even profound public support for greater government involvement in cybersecurity cannot fix a divided Congress. With previous cybersecurity measures enacted through an Executive Order after legislation failed to pass Congress, there is little reason to believe there won’t be similar roadblocks this time around. In addition to the institutional hurdles, legislators will also have to strike the balance between freedom of speech, privacy and security - a debate that has divided the policy and tech communities for years. European leaders just released a Joint Statement, which includes a greater emphasis on efforts to “combat terrorist propaganda and the misleading messages it conveys”. Doing this effectively without stepping on freedom of speech will be challenging to say the least. However, despite these potential roadblocks, the environment is finally ripe for cyber legislation thanks to the cancellation of a movie over the holiday season and a well-timed hack of a COCOM Twitter feed. Now that the public is paying more attention, cybersecurity policy and legislation may finally move beyond an incremental shift and closer to the dramatic change that is ultimately in sync with the realities of the cyber threat landscape.

Understanding Crawl Data at Scale (Part 2)


Effective analysis of cyber security data requires understanding the composition of networks and the ability to profile the hosts within them according to the large variety of features they possess. Cyber-infrastructure profiling can generate many useful insights. These can include: identification of general groups of similar hosts, identification of unusual host behavior, vulnerability prediction, and development of a chronicle of technology adoption by hosts. But cyber-infrastructure profiling also presents many challenges because of the volume, variety, and velocity of data. There are roughly one billion Internet hosts in existence today. Hosts may vary from each other so much that we need hundreds of features to describe them. The speed of technology change can also be astonishing. We need a technique to address these rapid changes and enormous feature sets that will save analysts and security operators time and provide them with useful information faster. In this post and the next, I will demonstrate some techniques in clustering and visualization that we have been using for cyber security analytics.

To deal with these challenges, the data scientists at Endgame leverage the power of clustering. Clustering is one of the most important analytic methodologies used to boil down a big data set into groups of smaller sets in a meaningful way. Analysts can then gain further insights using the smaller data sets.

I will continue the use case given in Understanding Crawl Data at Scale (Part 1): the crawled data of hosts. At Endgame, we crawl a large, global set of websites and extract summary statistics from each. These statistics include technical information like the average number of javascript links or image files per page. We aggregate all statistics by domain and then index these into our local Elasticsearch cluster for browsing through the results. The crawled data is structured into hundreds of features including both categorical features and numerical features. For the purpose of illustration, I will only use 82 numerical features in this post. The total number of data points is 6668.
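For readers curious about the plumbing, the sketch below shows one way such per-domain summaries might be bulk-indexed into Elasticsearch with the elasticsearch-py client. The index name, field names, and example documents are purely illustrative stand-ins, not our production pipeline, and the exact client calls can vary by client version.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])

# Hypothetical per-domain summary statistics produced by the crawler
crawl_stats = [
    {"domain": "example.com", "avg_js_links_per_page": 12.4,
     "avg_image_files_per_page": 31.0, "html_files_count": 210},
    {"domain": "example.org", "avg_js_links_per_page": 3.1,
     "avg_image_files_per_page": 8.5, "html_files_count": 42},
]

# One document per crawled domain, so the results can be browsed later
actions = [
    {"_index": "crawl-summaries", "_id": doc["domain"], "_source": doc}
    for doc in crawl_stats
]
helpers.bulk(es, actions)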

First, I’ll cover how we use visualization to reduce the number of features. In a later post, I’ll talk about clustering and the visualization of clustering results.

Before we actually start clustering, we should first try to reduce the dimensionality of the data. The most basic EDA (Exploratory Data Analysis) technique for numerical features is to plot them in a scatter matrix, as shown in Figure 1. It is an 82 by 82 plot matrix. Each cell in the matrix, except the ones on the diagonal line, is a two-variable scatter plot, and the plots on the diagonal are the histograms of each variable. Given the large number of features, we can hardly see anything from this busy graph. An analyst could spend hours trying to decipher this and derive useful insights:

Figure 1. Scatter Matrix of 82 Features
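As a point of reference, a scatter matrix like the one in Figure 1 can be generated with just a few lines of pandas and matplotlib (in newer pandas the helper lives in pandas.plotting). This is only a sketch: the DataFrame below is filled with synthetic stand-in data rather than our actual crawl features.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Stand-in for the real crawl data: 6668 hosts x 82 numerical features
rng = np.random.RandomState(0)
features_df = pd.DataFrame(rng.lognormal(size=(6668, 82)),
                           columns=["feat_%d" % i for i in range(82)])

# Histograms on the diagonal, pairwise scatter plots everywhere else
scatter_matrix(features_df, figsize=(40, 40), diagonal="hist", alpha=0.2)
plt.savefig("scatter_matrix_82_features.png", dpi=100)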

Of course, we can try to break up the 82 variables into smaller sets and develop a scatter matrix for each set. However, there is a better visualization technique available for handling high-dimensional data called a Self-Organizing Map (SOM).

The basic idea of a SOM is to place similar data points closely on a (usually) two dimensional map by training the weight vector of each cell on the map with the given data set. A SOM can also be applied to generate a heat map for each of the variables, like in Figure 2. In that case, a one-variable data set is used for creating each subplot in the component plane.

Figure 2. SOM Component Plane of 82 Features

By color-coding the magnitude of a variable, as shown in Figure 2, we can vividly identify those variables whose plots are covered by mostly blue. These variables have low entropy values, which, in information theory, implies that the amount of information is low. We can safely remove those variables and only keep the ones whose heat maps are more colorful. The component plane can also be used to identify similar or linearly correlated variables, such as the image at cell (2,5) and the one at cell (2,6). These cells represent the internal HTML pages count and HTML files count variables, respectively.
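To make this concrete, here is a rough sketch of how component planes can be generated from a trained SOM. I am using the open-source MiniSom package purely for illustration – it is not necessarily the implementation behind Figure 2 – and the data is again a synthetic stand-in. Planes that come out nearly uniform correspond to the low-information features described above.

import numpy as np
import matplotlib.pyplot as plt
from minisom import MiniSom

# X: (n_hosts, n_features) matrix of normalized crawl features (synthetic here)
rng = np.random.RandomState(0)
X = rng.rand(6668, 82)

som = MiniSom(20, 20, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X, 5000)          # train the weight vector of each map cell

weights = som.get_weights()        # shape: (20, 20, 82)
fig, axes = plt.subplots(9, 10, figsize=(20, 18))
for i, ax in enumerate(axes.flat):
    if i < X.shape[1]:
        ax.pcolor(weights[:, :, i])   # heat map of feature i across the map
    ax.set_xticks([]); ax.set_yticks([])
fig.savefig("som_component_planes.png")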

Based on Figure 2, 29 variables stood out as potential high information variables. This is a data-driven heuristic for distilling the data, without needing to know anything about information gains, entropy, or standard deviation.

However, 29 variables may still be too many, as we can see that some of them are pretty similar. It would be great to sort the 29 variables based on their similarities, and that can be done with a SOM. Figure 3 is an ordered SOM component plane of the 29 variables, in which similar features are placed close to each other. Again, the benefit of creating this sorted component plane is that any analyst, without the requirement of strong statistical training, can safely look at the graph and hand pick similar features out of each feature group.

Figure 3. Ordered SOM Component Plane

So far, I have demonstrated how to use visualization, specifically a SOM, to help reduce the dimensionality of the data set. Please note that dimensionality reduction is another very rich research topic (besides clustering) in data science. Here I have touched only the tip of the iceberg, using a SOM component plane to visually select a subset of features. One more important point about the SOM is that it not only helps reduce the number of features, but also brings down the number of data points for analysis by generating a set of codebook data points that summarize the original larger data set according to some criteria.

In Part 3 of this series on Understanding Crawl Data at Scale, I’ll show how we use codebook data to visualize clustering results.

Understanding Crawl Data at Scale (Part 3)


In Understanding Crawl Data at Scale (Part 2), I demonstrated how to use a SOM to visualize a high-dimensional dataset and to help reduce its dimensionality. As you may remember, this technique is a time-saver for analysts who are dealing with large data sets consisting of hundreds of features. In this section, I will briefly show the process of clustering and the visualization of its results using a few classical clustering methods. As we know, it is difficult for humans to visually digest any information with more than three dimensions, so I would like to start by illustrating the clustering process using a 2-D data set. The two sort-of-arbitrarily-chosen features come from the data set used before, namely the minimum number of image files and the total number of HTML files.

The 2-D data set can be easily drawn as a scatter plot, as shown in Figure 1. Most of the data are located in a small region at the lower-left corner, while sparser points stretch far out along both dimensions. Intuitively, we might come up with 3 clusters, something like the shaded areas below, just by looking at the scatter plot. How true is that? We can use a SOM to get a better idea.


Figure 1. Scatter Plot of the 2-D Data Set

A SOM technique places similar data points close to each other, or even together in the same cell, on a given map. The cells with data populated are part of a codebook data set, which is viewed as a representation of the original data set but with a much smaller number of data points. After the placement is done, the extent of dissimilarity (or distance) between the cells can be computed, and the results can be plotted as a unified distance matrix plot (or U-Mat plot).
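A minimal sketch of computing such a U-Mat from a trained SOM follows, again using MiniSom as a stand-in implementation and synthetic data in place of the two real features.

import numpy as np
import matplotlib.pyplot as plt
from minisom import MiniSom

# X: (n_points, 2) -- the two illustrative features, normalized to [0, 1]
rng = np.random.RandomState(0)
X = rng.rand(6668, 2)

som = MiniSom(15, 15, X.shape[1], sigma=1.2, learning_rate=0.5, random_seed=0)
som.train_random(X, 5000)

u_matrix = som.distance_map()      # each cell's mean distance to its neighbors
plt.figure(figsize=(7, 6))
plt.pcolor(u_matrix, cmap="hot_r") # bright cells suggest cluster boundaries
plt.colorbar()
plt.savefig("u_mat.png")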

Figure 2 shows the U-Mat plot for this SOM, with two possible splits of clusters overlaid. Darker regions indicate lower distance values and bright red color usually indicates a separation of clusters. Legitimate guesses of the number of clusters might be three or four on the given SOM U-Mat plot, and we are confident that it won’t be more than that.

Figure 2. U-Mat Plot with Possible Separation of Clusters

Now that we have a good idea of how many clusters we would like to try, we can use the K-Means method to group the data points into 3 or 4 clusters (K = 3 or 4 respectively). It is also always a good practice to normalize the data in each dimension before clustering takes place. Here I normalized the data into the range of [0, 1] in both dimensions so that they are comparable.
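Here is a sketch of that normalization plus the two K-Means runs, using scikit-learn (an assumption on my part – the post’s figures were not necessarily produced with it) and synthetic stand-in data.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# X: (n_points, 2) -- minimum image files and total HTML files (synthetic here)
rng = np.random.RandomState(0)
X = np.column_stack([rng.lognormal(size=6668), rng.lognormal(size=6668)])

X_scaled = MinMaxScaler().fit_transform(X)   # rescale each dimension to [0, 1]

for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    shares = np.bincount(labels) / float(len(labels))
    print("K=%d cluster shares: %s" % (k, np.round(shares, 3)))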

Figure 3 and Figure 4 show the color-coded clustering results with 3 and 4 clusters. With 3 clusters, 90.5% of data points are assigned to cluster 2, 9% to cluster 1, and 0.5% to cluster 3. Apparently the data points in cluster 3 are outliers in this data set.

Figure 3. Three-Cluster Split Using K-Means

The four-cluster split is a bit different. Cluster 1 now takes 17.5% of the data points, cluster 2, 3.4%, cluster 3, 78.6%, cluster 4, 0.4%.

The contours in both Figure 3 and Figure 4 indicate the areas where data points may have the same level of membership likelihood. In the area where data points are dense, the contours change much more rapidly than those in sparse areas because the clustering is sensitive to the distance of data points to the cluster centers.

Figure 4. Four-Cluster Split Using K-Means

Thanks to the low dimensionality of this 2-D data set, the split in each case is clear-cut. We can visualize the two different labeling systems using the codebook data placed on a SOM map, as in Figure 5.

The U-Mat plot on the left side of Figure 5 is the same as the one in Figure 2, only drawn on a map of a slightly different size. On the right side of Figure 5, the codebook data are plotted with 3-cluster or 4-cluster labeling. Both of them seem to make sense, and the choice of which to use is really up to the analyst.

Figure 5. Labeled Codebook Data with K-Means(3) and K-Means(4), 2-D Dataset

The world of high dimensionality is much blurrier. Having shown that we can get a satisfactory clustering result with 2-D data, let’s move up to the high-dimensional space, using the same data set but with all 29 variables.

The SOM U-Mat of the 29-feature data set is shown on the left side of Figure 6. The separation of clusters is much less obvious than that in 2-D space. Although the 29 features include the 2 features we used in the 2-D data set, there are many other features adding noise to the once-clear split of the data. The consequence is somewhat mixed-up labels, as shown in the labeled codebook plots on the right side of Figure 6.

Figure 6. Labeled Codebook Data with K-Means(3) and K-Means(4), 29-D Dataset

We also may want to use the same technique to visually compare the results from different clustering methods. Even with more rigorous measurements of clustering evaluation available, visualization remains a very powerful way for analysts (who may not necessarily be statisticians or data scientists) to gauge the performance of a variety of clustering methods. That being said, a rigorous validation of clustering is always encouraged whenever it is possible, but we won’t be going into the details of this today.

Figure 7 shows the results from two other clustering methods, K-Medoid with K = 4 and Fuzzy C-Means with C = 4.

Figure 7. Labeled Codebook Data with K-Medoid(4) and Fuzzy C-Means(4), 29-D Dataset

Lastly, I’d like to close this post with hierarchical clustering using the codebook data. When dealing with a very large amount of data, clustering it directly might not be feasible. In that case, vector quantization (VQ) will be a handy tool to reduce the data set. SOMs are one such VQ method. By training the weight vector of each cell in a map, some or all of the cells will resemble a portion of the original data set. The weight vectors associated with those cells are the codebook data. Clustering on the codebook data becomes much less computationally expensive because of the dramatic reduction in data set size.

Figure 8 shows the K-Means clustering (K=10) on the codebook data. K=10 is a sort of arbitrarily chosen large number. With the K-Means clustering result, we can do agglomerative hierarchical clustering.

Figure 8. Codebook Data Grouped into 10 Clusters Created with K-Means

Figure 9 shows the dendrograms of two hierarchical clustering results. The difference lies in the choice of how the distance between two clusters is computed. The x-axis of the dendrogram is the clusters being agglomerated, and the y-axis is the distance measure between two merged clusters. By choosing a threshold of cluster distance, one can cut off the linkage and identify a number of separate clusters. More clusters are generated as the threshold decreases.

Figure 9. Dendrograms of Hierarchical Clustering with Single Linkage (top) and Complete Linkage (bottom)
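Dendrograms like those in Figure 9 are straightforward to produce once you have the codebook. Below is a sketch using SciPy’s hierarchical clustering, with a random matrix standing in for the trained SOM codebook vectors.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.RandomState(0)
codebook = rng.rand(200, 29)                 # stand-in for SOM codebook vectors

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
for ax, method in zip(axes, ("single", "complete")):
    Z = linkage(codebook, method=method)     # agglomerative linkage matrix
    dendrogram(Z, ax=ax, no_labels=True)
    ax.set_title("%s linkage" % method)
fig.savefig("codebook_dendrograms.png")

# Cut the (complete-linkage) tree at a distance threshold to get flat clusters
flat_labels = fcluster(Z, t=1.5, criterion="distance")
print("clusters at threshold 1.5:", np.unique(flat_labels).size)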

In summary, this post only highlights some of the ways for visualizing high-dimensional data and the clustering results. It certainly cannot cover everything related to multivariate clustering and visualization. We didn’t even mention projection methods, such as PCA (Principal Component Analysis), MDS (Multi-Dimensional Scaling), and Sammon Mapping. However, I hope that this post provides some interesting ideas for data enthusiasts on clustering and visualization, two of the techniques that are extremely useful in data science. Although the example data is taken from a cybersecurity context, the same practice can be successfully applied to other industries, such as credit risk, customer segmentation, biology, finance, and more.

A Martian's Take on Cyber in the National Security Strategy


In the recent New York Times bestselling book, The Martian, Andy Weir depicts a future world where space travel to Mars is feasible. Through an unfortunate string of events, the book’s hero, Mark Watney, becomes stranded on Mars, unable to communicate with anyone on Earth. After watching Friday’s release of the National Security Strategy (NSS) and the way in which it mimicked last month’s State of the Union (SOTU) address, I wondered how Watney (if he makes it back to Earth – not giving the ending away!) – would interpret the major foreign policy challenges depicted in those speeches. If someone were to land on Earth after being away for years, what would they think of the state of international relations if they only based it on the NSS and SOTU? When it comes to the cyber domain, the rhetoric seems completely misaligned with the realities of the global system. But what would Watney think? Let’s imagine Watney’s interpretation of the NSS and SOTU after having been away from Earth for years…

Log Entry: Day 8

I’m not sure which is worse – being on the brink of death everyday thanks to the inhospitable environment on Mars, or coming home and learning about the various threats present in the inhospitable international environment. Sure, this isn’t really my area, but I’m dying to focus on anything besides botany and engineering for once. The good news is that, despite the laundry list of challenges, we’re in it together with China. Both the SOTU and the NSS give big props to China for its great cooperation in helping battle climate change. Well, that’s a huge relief! We certainly can’t reverse this most existential of threats without the support of the world’s most populous country and second (phew, we’re still number 1!) largest economy.

But here’s what I don’t understand. Sure, I get that climate change is important, but what does it have to do with Ebola and cyber? In each address, those three are grouped together, apparently because they all rely on international norms and cooperation and aren’t considered geopolitical. It’s strange to think that cyber doesn’t belong in the discussion about foreign adversaries like Russia, Iran and North Korea, but I’m simply an engineer, what do I know about that? I guess when it comes to cyber the key concern is privacy and individual rights. I haven’t had privacy for years, so no biggie there. I’m just glad we’re friends with China. I’d hate to relive those days of major power rivalries and espionage.

Log Entry: Day 10

At first I was thankful to finally have something to read besides Agatha Christie novels, but I’ll tell you what, this NSS is even more of a mystery when it comes to cyber. I wasn’t planning on reading it, since Friday’s release simply provided a bit more detail on the list of foreign policy challenges elucidated in the SOTU. But here’s what is interesting. If you actually read the document, there’s a single, yet important line in there that would go completely unnoticed if you listened only to the speeches. On page 24, the NSS states, “On cybersecurity, we will take necessary actions to protect our businesses and defend our networks against cyber-theft of trade secrets for commercial gain whether by private actors or the Chinese government.” What? Where did this come from? This is the concluding sentence in a paragraph that actually talks about concern over China’s military mobilization and the potential for miscalculation. So wait, China is a major cyber threat and has been stealing from us? Where did this come from? I thought we were BFFs. This is so confusing. So I went back and looked at some previous doctrine, just for the heck of it. China isn’t mentioned explicitly by name in the 2011 International Strategy for Cyberspace, and the 2010 NSS is all about seeking cooperation with China. Sooo….the speeches say one thing, the document says another, and this latest NSS takes one big step forward in surfacing China’s espionage within a strategic document. Foreign policy is not for me. I’d much rather deal with the certainty of the plant world instead of these competing narratives. In my world, any miscalculations are entirely my fault and are much more predictable than those in the foreign policy world.

Five Thoughts from the White House Summit on Cybersecurity and Consumer Protection


The Obama Administration deserves credit for putting together the first-ever White House summit on cybersecurity on Friday and – contrary to what some media coverage may lead you to believe – the U.S. private sector mostly deserves credit for showing up.

Rather than offer yet another perspective on how to structure the Cyber Threat Intelligence Integration Center (CTIIC), or speculate on what it means that this or that CEO didn’t attend, I thought I’d just share a few thoughts from a day at Stanford that was packed with conversations with colleagues from across the government, the security industry, and the nation’s critical infrastructure.

1. More than most industries, the security community really is a community and must be bound by trust. Examples of this oft-overlooked reality were abundant: government officials pledging that “the U.S. government will not leave the private sector to fend for itself” and that our actions should be guided by “a shared approach” as a basic, guiding principle; Palo Alto Networks CEO Mark McLaughlin plugging the much-needed Cyber Threat Alliance, a voluntary network of security companies sharing threat intelligence for the good of all; Facebook CISO Joe Sullivan stressing the importance of humility, of talking openly about security failures, and about information security as a field that’s ultimately about helping people. Many of the day’s conversations kept coming back to trust – both the magnitude of what we can accomplish when we have it, and the paralyzing effect of its absence.

2. All companies are now tech companies. Home Depot doesn’t just sell hammers, and even small businesses have learned the great lesson of the past decade’s dev-ops revolution: outsource any software you don’t write yourself by moving it to the cloud and putting the security responsibility on the vendor. An interesting corollary to this is whether, as larger companies get more capable with their security, we will see hackers moving down-market to target smaller companies in increasingly sophisticated ways. This is sobering because scoping the magnitude of the challenge before us leads to the conclusion that it includes…well…everything.

3. Our adversaries will continue getting better partly because we will continue getting better. There’s a nuance here that isn’t captured in the simple notion that higher walls only beget taller ladders. An example from the military world is that Iraq’s insurgents became vastly more capable between 2003 and 2007 because they spent those four years sharpening their blades on a very hard stone: us. So consider, for example, the challenge facing new payments companies today: you’re fighting the guys who cut their teeth against PayPal fifteen years ago, and you’re doing it with a tiny number of defenders since you’re only a start-up, not with the major resources of PayPal’s current security team. Submitting to an “arms race” mentality—or quitting the race altogether—isn’t the answer. But this reality does put the security bar higher and higher for new ventures, and suggests that competition for experienced security talent will only grow more heated.

4. Too many policy-makers are still a long way from basic fluency in this field. That’s intended more as observation than criticism. It takes time to build a deep reservoir of talent in any field of endeavor – across the whole pipeline from funding basic research in science and technology, through nurturing the ecosystem of analysts and writers who can inform a robust conversation about occasionally arcane topics, to reaping the benefits of multi-generational experience where newer practitioners can learn from the battle scars of those who came before them. The traditional defense community has this, as do tax policy, health care policy, and most other major areas of public-private collaboration. It’ll come in the cyber arena too. What worries me, though, is that too many policy makers, when they refer to “the private sector” in this context, seem to imply either that it’s less important than the government, or even (bizarrely) that it’s smaller than the government. The government has a massively important role in cyber security, but it isn’t the whole game, and it probably isn’t even most of the game.

5. Information sharing is only a means to an end. If one of the day’s two major themes was “trust,” then the other was “information sharing.” Yes, our security is only as good as the data we have. Yes, there can be a “neighborhood watch-like” network effect in sharing threat intelligence. Yes, the sharing needs to happen across multiple axes: public to public, public to private, and private to private. But all of that sharing will be for naught if it doesn’t lead to some kind of effective action – across people, process, and technology. (Remember that “Bin Laden Determined to Strike in U.S.” was the heading of the President’s daily briefing from the CIA on August 6, 2001…) The Summit was one action, and the security community needs to take many, many more.


Streaming Data Processing with PySpark Streaming


Streaming data processing has existed in our computing lexicon for at least 50 years. The ideas Doug McIlroy presented in 1964 regarding what would become UNIX pipes have been revisited, reimagined and reengineered countless times. As of this writing, the Apache Software Foundation has Samza, Spark, and Storm for processing streaming data… and those are just the projects beginning with S! Since we use Spark and Python at Endgame, I was excited to try out the newly released PySpark Streaming API when it was announced for Apache Spark 1.2. I recently gave a talk on this at the Washington DC Area Apache Spark Interactive Meetup. The slides for the talk are available here. What follows in this blog post is an in-depth look at some PySpark functionality that some early adopters might be interested in playing with.

 

USING UPDATESTATEBYKEY IN PYSPARK STREAMING

In the meetup slides, I present a rather convoluted method for calculating CPU percentage use from the Docker stats API using PySpark Streaming. updateStateByKey is a better way to calculate such information on a stream, but the Python documentation was a bit lacking. Also, the lack of type signatures can make PySpark programming a bit frustrating. To make sure my code worked, I took a cue from one of the attendees (thanks Jon) and did some test-driven development. TDD works so well I would highly suggest it for your PySpark transforms, since you don’t have a type system protecting you from returning a tuple when you should be returning a list of tuples.

Let’s dig in. Here is the unit test for updateStateByKey from https://github.com/apache/spark/blob/master/python/pyspark/streaming/tests.py#L344-L359:

def test_update_state_by_key(self):

    def updater(vs, s):
        if not s:
            s = []
        s.extend(vs)
        return s

    input = [[('k', i)] for i in range(5)]

    def func(dstream):
        return dstream.updateStateByKey(updater)

    expected = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3], [0, 1, 2, 3, 4]]
    expected = [[('k', v)] for v in expected]
    self._test_func(input, func, expected)

This test code tells us, if we play around a bit, that for the input:

[[('k', 0)], [('k', 1)], [('k', 2)], [('k', 3)], [('k', 4)]]

we expect the output:

[[('k', [0])],
 [('k', [0, 1])],
 [('k', [0, 1, 2])],
 [('k', [0, 1, 2, 3])],
 [('k', [0, 1, 2, 3, 4])]]

 

updateStateByKey allows you to maintain state by key. This test is fine, but if you ran it in production you’d end up with an out-of-memory error, as 's' will extend without bound. In a unit test with a fixed input it’s fine, though. For my presentation, I wanted to pull out the time in nanoseconds that a given container had used the CPUs of my machine and divide it by the time in nanoseconds that the system CPU had used. For those of you thinking back to calculus, I want to do a derivative on a stream.

How do I do that and keep it continuous? Well, one idea is to keep a limited number of these delta x’s and delta y’s around and then calculate it. In the presentation slides, you’ll see that’s what I did by creating multiple DStreams, joining them, and computing differences in lambda functions. It was overly complicated, but it worked.

In this blog I want to present a different idea that I cooked up after the meetup. First the code:

from itertools import chain, tee, izip
def test_complex_state_by_key(self):

    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return izip(a, b)

    def derivative(s,x,y):
        "({'x':2,'y':1},{'x':6,'y':2}) -> derivative(_,'x','y') -> float(1)/4 -> 0.25"
        return float(s[1][y] - s[0][y])/(s[1][x]-s[0][x])

    def updater(vs, s): # vs is the input stream, s is the state
        if s and s.has_key('lv'):
            _input = [s['lv']] + vs
        else:
            _input = vs
        d = [derivative(p,'x','y') for p in pairwise(_input)]
        if s and s.has_key('d'):
            d = s['d'] + d
        last_value = vs[-1]
        if len(d) > len(_input):
            d = d[-len(_input):]  # trim to the length of _input
        state = {'d':d,'lv':last_value}
        return state

    input = [[('k',{'x':2,'y':1})],[('k',{'x':3,'y':2})],[('k',{'x':5,'y':3})]]

    def func(dstream):
        return dstream.updateStateByKey(updater)

    expected = [[('k', {'d': [], 'lv': {'x': 2, 'y': 1}})],
                [('k', {'d': [1.0], 'lv': {'x': 3, 'y': 2}})],
                [('k', {'d': [1.0, 0.5], 'lv': {'x': 5, 'y': 3}})]]
    self._test_func(input, func, expected)

Here’s an explanation of what I’m trying to do here. I pulled in the pairwise function from the itertools recipe page. Then I crafted a very specific derivative method that takes a pair of dictionaries and two key names, and returns the slope of the line (rise over run). You can plug this code into the PySpark streaming tests and it passes. It can be used as an unoptimized recipe for keeping a continuous stream of derivatives, although I can imagine a few nice changes for usability/speed. The state keeps d, which holds the differences between pairs of the input, and lv, which is the last value of the data stream. That should allow this to work on a continuous stream of values. Integrating this into the demo I did in the presentation is left as an exercise for the reader. ;)
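For completeness, here is a rough sketch of the driver-side wiring for an updater like this. The socket source and line format are hypothetical stand-ins for the Docker stats feed from the presentation, and updater refers to the function defined above; note that updateStateByKey requires a checkpoint directory to be set.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-derivative")
ssc = StreamingContext(sc, 5)                  # 5-second micro-batches
ssc.checkpoint("/tmp/stream-derivative-ckpt")  # required for updateStateByKey

def parse(line):
    # hypothetical input: "container_id,system_cpu_ns,container_cpu_ns"
    key, x, y = line.split(",")
    return (key, {'x': float(x), 'y': float(y)})

lines = ssc.socketTextStream("localhost", 9999)
derivatives = lines.map(parse).updateStateByKey(updater)  # updater from above
derivatives.pprint()

ssc.start()
ssc.awaitTermination()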

Comments, questions, code review welcome at @rseymour. If you find these sorts of problems and their applications to the diverse world of cyber security interesting, you might like to work with the data science team here at Endgame.

Repression Technology: An Authoritarian Whole of Government Approach to Digital Statecraft


Last week, as discussions of striped dresses and llamas dominated the headlines, academia and policy coalesced in a way that rarely happens. On February 25th, Director of National Intelligence James Clapper addressed the Senate Armed Services Committee to provide the annual worldwide threat assessment. In addition to highlighting the rampant instability, Director Clapper specified Russia as the number one threat in the cyber domain. He noted, “the Russian cyber threat is more severe than we’ve previously assessed.” Almost simultaneously, the Journal of Peace Research, a preeminent international relations publication, pre-released its next issue that focuses on communication, technology and political conflict. Within this issue, an article contends that internet penetration in authoritarian states leads to greater repression, not greater freedoms. Social media quickly was abuzz, with the national security community focusing on Russia’s external relations, while international relations academics were debating the internal relations of authoritarian states, like Russia. And thus, within twenty-four hours, policy and academia combined to present a holistic, yet rarely addressed, perspective on the threat – the domestic and international authoritarian whole of government approach when it comes to controlling the cyber domain.

First, Director Clapper made headlines when he elevated the Russian cyber threat above that of the Chinese. Both are still the dominant threats, a select group in which he also includes Iran and North Korea – most prominently responsible for the attacks on the Las Vegas Sands Casino Corporation and Sony, respectively. This authoritarian quartet stands out for its advanced digital techniques and targeting of numerous foreign sectors and states. Director Clapper highlighted the sophistication of the Russian capabilities, while also noting China’s persistent espionage campaign. Clearly, this perspective should predominate in a worldwide threat assessment.

At the same time, the Department of State calls this the “Internet Moment in Foreign Policy”, reinforcing former Secretary of State Hillary Clinton’s push for internet freedoms to promote freedom of speech and civil liberties. However, what is often overlooked in her speech from five years ago is the double-edged sword of any form of information technology. Clinton warned, “technologies with the potential to open up access to government and promote transparency can also be hijacked by governments to crush dissent and deny human rights.” She succinctly describes the liberation versus repression technology hypotheses around internet penetration. While the view of liberation technology is the one largely promoted by the tech community and diplomats in a rare agreement, the actual impact of internet penetration in authoritarian regimes has never been empirically tested – until now. Espen Geelmuyden Rod and Nils B Weidmann provide the first empirical analysis to test the liberation versus repression technology debate by analyzing the impact of internet penetration on censorship within authoritarian regimes. They find that, contrary to popular perceptions, there is a statistically significant association between internet penetration and repression technology, even after controlling for a variety of domestic indicators and temporal lags. The authoritarian regimes in the sample reflect the authoritarian quartet Clapper references – a group that clearly employs digital statecraft both domestically and internationally to achieve national objectives.

These two distinct perspectives together provide the yin and the yang of authoritarian regime behavior in cyberspace. Instead of being viewed in isolation from one another, the international and domestic use of digital instruments of power reflect a whole-of-government strategy pursued by China and Russia, and other authoritarian states to various degrees. As I wrote last year, internet censorship globally is increasing, but clearly is more pronounced in authoritarian regimes. For instance, since the time of that post, China has begun to crack down on VPN access as part of an even more concerted internet crackdown. In February, Russia declared that it too might follow suit, cracking down not only on VPN access but also on Tor. When focusing on US national interests, it may seem like only the foreign behavior of these states matters. However, that is a myopic assumption and ignores one of the most prevalent aspects of international relations – the necessity to understand the adversary. While the US was extraordinarily well informed about Soviet capabilities at home and abroad, the same is no longer true for this larger and more diverse threatscape, especially as it pertains to the cyber domain. This gap could be ameliorated through an integrated perspective of the domestic and international digital statecraft of adversaries.

The confluence of this worldwide threat assessment to Congress and the academic publication is striking, and should be more than an esoteric exercise. It simultaneously reinforced the current gap between academia and policy in matters pertaining to the cyber domain, while also demonstrating that the academic perspective can and should help augment the dialogue when it comes to digital statecraft. However, perhaps even more pertinent is the way in which the article and the Congressional remarks reflect two pieces of the whole. Governments pursue national interests domestically and internationally. It is time we viewed these high priority authoritarian regimes through this bifocal lens. There are many insights to be gained about adversarial centralization of power, regime durability, and technological capabilities by also looking at the domestic digital behavior of authoritarian regimes. Coupling the international perspective with the domestic cyber behavior into threat assessments can help provide great insights into the capabilities, targets, and intent of adversaries.

Hacking the Glass Ceiling


As we approach International Women’s Day this week and edge closer to the 100th anniversary of women’s suffrage (okay, four years to go, but still, a remarkable moment), and as news and current events often focus on the negative facts and statistics about women in technology, and especially women and venture capital, I feel particularly grateful to be working at Endgame, a technology company with an amazing cast of phenomenal women, from our developers to scientists to business minds. Our team, not just our leadership but our entire company, is dynamic and diverse. Of course, Endgame is not alone. At the Montgomery Summit, a technology conference that takes place March 9th-11th in Los Angeles, there is a session devoted to Female Founders of technology companies. I am thrilled to be taking part in this event, which highlights a group of remarkable women who have founded and are leading tech companies in a diverse set of industries.

As a prelude to the conference and the celebration of International Women’s Day, and in hopes of encouraging more girls to embrace the STEM disciplines in school and pursue a career in technology, I want to highlight some amazing women who have dedicated their lives to making a difference—as technologists and as entrepreneurs, because there is true cause for inspiration.

The list of technology heroines is long and hard to winnow. So many have dedicated their lives and technical genius to service and solving some of our hardest problems, especially in the field of security: cyber, information, and national security. Many will never be acknowledged publicly, but below are a few who can be:

• Professor Dorothy Denning is not only teaching and working with the next generation of security vanguards at the Naval Postgraduate School, but she is also credited with the original idea of intrusion detection systems (IDS) back in 1986.

• Chien-Shiung Wu, the first female professor in Princeton’s physics department, earned a reputation as a pioneer of experimental physics, not only by disproving a “law” of nature (the Law of Conservation of Parity), but also in her work on the Manhattan Project. Wu’s discoveries earned her colleagues the Nobel Prize in physics.

• Lene Hau is a Danish physicist who literally stopped light in its tracks. This critical process of manipulating coherent optical information by sharing information in light-form has important implications in the fields of quantum encryption and quantum computing.

• There are many visionary entrepreneurs like Sandy Lerner, co-founder of CISCO, Joan Lyman, co-founder of SecureWorks, and Helen Greiner, co-founder of iRobot and CEO of CyPhyWorks, who work tirelessly and brilliantly to deliver the solutions necessary to keep the world, and the people in it, safe.

• Window Snyder, a security and privacy specialist at Apple, Inc., significantly reduced the attack surface of Windows XP during her tenure at Microsoft, which led to a new way of thinking about threat modeling. She has many contemporaries who have also broken with stereotype and are having tremendous impact in making the technologies we interact with safer. Women like Jennifer Lesser Henley who heads up security operations at Facebook, and Katie Moussouris, Chief Policy Officer at HackerOne.

If we look further back in history, the list of amazing women in technology gets even longer. Many of the names may even surprise you:

• Ada Lovelace: The world’s first computer programmer and Lord Byron’s daughter (“She walks in Beauty, like the night/ Of cloudless climes and starry skies;/ And all that’s best of dark and bright/ Meet in her aspect and her eyes”), she has a day, a medal, a competition, and most notably, a Department of Defense language named after her. Ada, the computer language, is a high-level programming language used for mission-critical applications in defense and commercial markets where there is low tolerance for bugs. And herein lies the admittedly tenuous connection to security: despite being a cumbersome language in some ways, “Ada churns out less buggy code,” and buggy code remains the Achilles’ heel of security.

• Hedy Lamarr: A contract star during MGM’s Golden Age, Hedy Lamarr was “the most beautiful woman in films,” an actress, dancer, singer, and dazzling goddess. She was also joint owner of US Patent 2,292,387, a secret communication system (frequency hopping) that serves as the basis for spread-spectrum communication technology, secure military communications, and mobile phone technology (CDMA). Famous for her quote, “Any girl can look glamorous. All you have to do is stand still and look stupid,” Hedy Lamarr’s legacy is that of a stunningly beautiful woman who refused to stand still. Thankfully, her refusal to accept society’s chosen role for her resulted in a very significant contribution to secure mobile communications.

• Rear Admiral Grace Hopper: Also known as the Grand Lady of Software, Amazing Grace, Grandma COBOL, and Admiral of the Cyber Sea, say hello to Rear Admiral Grace Hopper, a “feisty old salt who gave off an aura of power.” She was a pioneer in information technology and computing before anyone knew what that meant. Embracing the unconventional, Admiral Grace believed the most damaging phrase in the English language is “We’ve always done it this way,” and to bring the point home, the clock in her office ran counterclockwise. Grace Hopper invented the first machine-independent computer language and literally discovered the first computer “bug.” Hopper began her career in the Navy as the first programmer of the Mark I computer, the mechanical miracle of its day. The Mark I was a five-ton, fifty-foot-long, glass-encased behemoth — a scientific miracle at the time, made of vacuum tubes, relays, rotating shafts and clutches with a memory for 72 numbers and the ability to perform 23-digit multiplication in four seconds. It contained over 750,000 components and was described as sounding like a “roomful of ladies knitting.” Unable to balance a checkbook (as she jokingly described herself), Hopper changed the computer industry by developing COBOL (common-business-oriented language), which made it possible for computers to respond to words rather than numbers. Admiral Hopper is also credited with coining the term “bug” when she traced an error in the Mark II to a moth trapped in a relay. The bug was carefully removed and taped to a daily log book; hence the term “computer bug” was born.

There is also a group of women who helped save the world with the work they did in cryptology/cryptanalysis during World War I and World War II. There were thousands of female scientists and thinkers who helped ensure Allied victory. I will only highlight a few, but they were emblematic of the many.

• Agnes Meyer Driscoll: Born in 1889, the “first lady of cryptology” studied mathematics and physics in college, when it was very atypical for a woman to do so. Miss Aggie, as she was known, was responsible for breaking a multitude of Japanese naval manual codes (the Red Book Code of the ‘20s, the Blue Book Code of the ‘30s, and the JN-25 Naval codes in the ‘40s) as well as a developer of early machine systems, such as the CM cipher machine.

• Elizebeth Friedman: Another cryptanalyst pioneer, with minimal mathematical training, she was able to decipher coded messages regardless of the language or complexity. During her career, she deciphered messages from ships at sea (during the Prohibition era, she deciphered over 12,000 rum-runner messages in a three-year period) to Chinese drug smugglers. An impatient, opinionated Quaker with a disdain for stupidity, she spent the early part of her career working as a hairdresser, a seamstress, a fashion consultant, and a high school principal. Her love of Shakespeare took her to Riverbank Laboratories, the only U.S. facility capable of exploiting and solving enciphered messages. There she worked on a project to prove that Sir Francis Bacon had authored Shakespeare’s plays and sonnets using a cipher that was supposed to have been contained within. She eventually went to work for the US government where she deciphered innumerable coded messages for the Coast Guard, the Bureau of Customs, the Bureau of Narcotics, the Bureau of Prohibition, the Bureau of Internal Revenue, and the Department of Justice.

• Genevieve Grotjan: Another code breaker whose discovery in September 1940, a correlation in a series of intercepted Japanese coded messages, changed the course of history and allowed the U.S. Navy to build a “Purple” analog machine to decode Japanese diplomatic messages. This allowed Allied forces to continue reading coded Japanese missives throughout World War II. Prior to Grotjan's success, the Purple Code had proved so hard to break that William Friedman, the chief cryptologist at the US Army Signal Corps (and Elizebeth Friedman’s husband), suffered a breakdown trying to break it.

So as we approach International Women’s Day and as we reflect on the many amazing women who have made a difference throughout history, I hope everyone joins me in celebrating these stories, finding inspiration, and most importantly, sharing that inspiration with the next generation in the hopes that they, too, might find themselves in the position of using their intellect, their skills, and their spirit to change the world for the better.

Beyond the Buzz: Integrating Big Data & User Experience for Improved Cyber Security


Big Data and UX are much more than industry buzzwords—they are some of the most important solutions making sense of the ever-increasing complexity and dynamism of the international system. While the big data analytics and user experience (UX) communities have made phenomenal technical and analytic breakthroughs, they remain stovepiped, often working at odds, and alone will never be silver bullets. Big data solutions aim to contextualize and forecast anything from disease outbreaks to the next Arab Spring. Conversely, the UX community points to the interface as the decisive battleground that will either make or break companies. This disconnect is especially prevalent in cyber security, and it is the user (and their respective companies) who suffers most. Users are either left with too much data but not the means within their skillset to explore it, or a beautiful interface that lacks the data or functionality the users require. But the monumental advances in data science and UX together have the potential to instigate a paradigm shift in the security industry. These disparate worlds must be brought together to finally contextualize the threat and the risks, and make the vast range of security data much more accessible to a larger analytic and user base within an organization.

 

THE TECH BATTLEGROUNDS

At a 2012 Strata conference, there was a pointed discussion on the importance of machine learning versus domain expertise. Not surprisingly, the panelists leaned in favor of machine learning, highlighting its many successes in forecasting across a variety of fields. The die was cast. Big data replaced the need for domain expertise and has become a booming industry, expanding from $3.2B in 2010 to $16.9B in 2015. For companies, the ability to effectively and efficiently sift through the data is essential. This is especially true in security, where the challenges of big data are even more pronounced given the need to expeditiously and persistently maintain situational awareness of all aspects of a network. Called anything from the sexiest job of the twenty-first century to a field whose demand is exploding, there is no shortage of articles highlighting the need for strong data scientists. More often than not, the spotlight is warranted. Depending on which source is referenced, over 90% of the world’s data has been created in the last two years, garnering big data superlatives such as total domination and the data deluge.

Clearly, there is a need to leverage everything from machine learning to applied statistics to natural language processing to help make sense of this data. However, most big data analysis tools – such as Hadoop, NoSQL, Hive, R or Python – are crafted for experienced data scientists. These tools are great for the experts, but are completely foreign to many. As has been well documented, the experts are few and far between, restricting full data exploration to the technical experts, no matter how quantitatively minded one might be. The user experience of these tools is not big data’s only problem. Without the proper understanding of the data and its constraints, data analytics can have numerous unintended consequences. For instance, had first responders focused on big data analyses of Twitter during Hurricane Sandy, they would have ignored the large swath of land without Internet access, where the help was most needed. In the education realm, universities are worried about profiling as a result of data analysis, even to the extreme of viewing big data as an intruder. Similarly, even with the most comprehensive data, policy responses require a combination of data-driven input, as well as contextual cultural, social, and economic trade-offs that correspond with various policy alternatives. As Erin Simpson notes, “The information revolution is too important to be left to engineers alone.” David Brooks summarized some of the shortcomings of big data, with an emphasis on bringing the necessary human element to big data analytics. Not only are algorithms required, but contextualization and domain expertise are also necessary conditions in this realm. This is especially true in cyber security, where some of the major breaches of the last few years occurred despite the targets actually possessing the data to identify a breach.

So how can companies turn big data to their advantage in a way that actually enables their current workforce to explore, access and discover within a big data environment? A new tech battleground has emerged, one for the customer interface. The UX community boasts its essential role in determining a tech company’s success and ability to bring services to users. Similar to the demand for data scientists, UX is one of the fastest growing fields, becoming “the most important leaders of the new business era…The success of companies in the Interface Layer will be designer-driven, and the greatest user experience (speed, design, etc.) will win.” The user-experience can either breed great product loyalty, or forever deter a user from a given product or service. From this perspective, technology is a secondary concern, driven by UX. The UX community prioritizes the essential role of humans over technologies, focusing on what the users experience and perceive. This is not just a matter of preferences and brand loyalty; it’s about the bottom line. By one measure, every $1 invested in UX yields a $2-$100 return.

In fact, the UX community is increasingly noting the essential role of UX in extracting insights from the data. Until relatively recent advances in UX, the data and the technologies were both inaccessible to the majority of the population, driving them to spreadsheets and post-it notes to explore data. UX provides the translation layer between the big data analytics technologies and the users, enabling visually intuitive and functional access to data. The UX democratizes access to big data – both the technologies driving big data analytics as well as the data itself. Unfortunately, the pendulum may have swung too far, with data perceived at best as “a supporting character in a story written by user experience” and at worst simply ignored. The interface layer alone is not sufficient for meeting the challenges of a modern data environment.

 

A UNIFIED APPROACH

The data science and UX communities are innovating and modernizing in parallel silos. In some industries, such as cyber security, they are unfortunately rarely a consideration. Although necessary, neither is sufficient to meet the needs of the user community. Customers are not drawn to a given product for its interface, no matter how beautiful and elegant it might be. It has to solve a problem. The reason products such as Amazon, Uber and Spotify are so popular is because of the data and data analytics underlying the services they provide. In each case, each product filled a niche or disrupted an inefficient process. That said, none of these would have caught on so quickly or at all without the modern UX that enabled that fast, efficient and intuitive exploration of the data. Steve Jobs mastered this confluence of technology and the arts, noting “technology alone is not enough. It’s technology married with liberal arts, married with humanities, that yields the results that make our hearts sing.”

It is this confluence of the arts and technology – the UX and the data science – that can truly revolutionize the security industry. The tech battlegrounds over machine learning and domain expertise or big data and UX are simply a waste of time. To borrow from Jerome Kagan, this is similar to asking whether a blizzard is caused by temperature or humidity – both are required. Together, sophisticated data science and modern, intuitive UX can truly innovate the security community. It is not a zero sum game, and the integration of the two is long overdue for security practitioners. The security threatscape is simply too dynamic, diverse and disparate to be tackled with a single approach. Moreover, the stakes are too high to continue limiting access to digital tools and data to only a select few personnel within a company. The smart integration of data science and the UX communities could very well be the long overdue paradigm shift the security community needs to truly distill the signal from the noise.

Graphic credit: Philip Jean-Pierre

See Your Company Through the Eyes of a Hacker: Turning the Map Around On Cybersecurity


Today, Harvard Business Review published “See Your Company Through the Eyes of a Hacker: Turning the Map Around On Cybersecurity” by Endgame CEO Nate Fick. In this piece, Nate argues that in order for enterprises to better defend themselves against the numerous advanced and pervasive threats that exist today, they must take a new approach. By looking at themselves through the eyes of their attackers—in the military, “turning the map around”—companies can get inside the mind of the adversary, see the situation as they do, and better prepare for what’s to come.

Nate identifies four ways that companies can “turn the map around” and better defend themselves against attackers. Read the full article at HBR.org

Sign up here for more News & Communications from Endgame.

Meet Nate and the Endgame team at RSA 2015. We’ll be in booth #2127 – register here for a free expo pass (use the registration code X5EENDGME) and stop by to learn more about Endgame.
