A few years ago Jeff Hammerbacher famously claimed, “The best minds of my generation are thinking about how to make people click ads.” This seems to have changed only marginally, with teams of data scientists in Silicon Valley often devoted to discovering solutions that yield indiscernible improvements within a broader range of recommender engines. In large part, data science within the tech community remains focused on e-commerce and the sharing economy, which are largely at the point of diminishing returns from a customer’s perspective, instead of disrupting industries such as healthcare, education, or security. This general lack of integration of data science innovations into products in other realms is anecdotally reinforced at the various data science-focused conferences, which overwhelmingly present incremental changes to driving times, deliveries, or more targeted shopping experiences. Areas awash with data at scale, such as security, rarely even garner a blip on the radar at data science-focused tech conferences.
The failure of data science to extend significantly into products in new industries may be a major contributing factor to where data science sits in the 2015 Gartner Hype Cycle for Emerging Technologies. The 2015 Hype Cycle breaks data science into its various approaches, including machine learning and NLP, and places each of them just before or after the peak of inflated expectations. Interestingly, digital security remains in the innovation trigger phase, highlighting the great opportunities that exist in the security space.
Below is a quick synopsis of some observations from a range of data science- and technology-focused conferences I’ve attended on both coasts this year. In short, Hammerbacher’s admonitions are as relevant today as they were a few years ago. However, this does not need to be the case: there are great opportunities for data science to disrupt the security industry.
Current State of Data Science
- Much of the Same: Targeted marketing continues to prevail, with emphasis on fine-tuning the already refined and complex algorithms for better shopping experiences and search results within sites.
- Diminishing Returns: Large teams are focused on incremental improvements to the user experience, creating an ever-bigger void between what users understand is being done with their personal data and what is actually being done. Much of this work also focuses on social media mining for marketing and e-commerce purposes.
- Black Box Approach: Although the Harvard Business Review hailed data scientist as the sexiest job of the twenty-first century, there are signs that many believe the current work of data scientists will soon be automated, or simply is not the silver bullet it has been portrayed to be in marketing and media materials. The prevalent mentality belittles domain expertise in the data and/or data science techniques in favor of a black box approach. This impacts the frequency and kind of data collected, which questions can be addressed with the data, and even the theoretical validity of the multitude of correlations that are bound to occur in a large-scale data environment.
- Chasing Fads: The majority of data science research and development focuses on edge cases to solve niche problems, instead of the common problems whose solutions would cause the biggest disruption across an industry. While the technology may be novel and groundbreaking, it may actually provide little utility for a product. Theoretically interesting breakthroughs that fail to be relevant for a product remain stovepiped in the Ivory (or Silicon) Tower.
- General Misperception of Data Science: The less technical conferences with sections on data science or big data generally feature lengthy Q&A sessions, which reveal the ongoing struggles of those outside the field to comprehend how data science might be applied within their company or industry. In many cases, companies have hired data scientists but aren’t really sure what to do with them. The media portrayal does not help in this regard, arguing that BI tools can serve as next-gen data science.
Data Science’s Next Disruption: Security
The Gartner Hype Cycle for Emerging Technologies’ bleak outlook for data science highlights the need for data science to expand into products in industries beyond the e-commerce, sharing economy, and marketing realms. These markets have greatly benefited from machine learning and other data science techniques, but could very well be at the point of diminishing returns. In contrast, the security community, which is ultimately a key player in the protection of individual privacy as well as economic and national security, lags badly in integrating vetted and advanced data science techniques into commercial software. The vast majority of security products are based on rules and signatures, which are brittle and fail to scale or generalize to current environments. While there is arguably a growing emphasis on quantitative approaches to security research, these remain one-off services, with very few actually making their way into products that could truly disrupt an industry that remains focused on Cold War, perimeter-based mindsets.
There are great opportunities for data science to play a critical role in the next generation of security research and product instantiation. There is untapped potential for the application of anything from machine learning to natural language processing to dynamic, Bayesian approaches that can be automatically updated with prior and additional knowledge. Similarly, the socio-technical interplay is another under-explored area. For instance, time series econometric models could help inform repeatable and scalable risk assessment frameworks. Finally, there is the unfortunate perception that security-related work is at odds with individual privacy. In fact, data science algorithms should help inform the next wave of privacy features, ranging from encryption to fraud detection to preventing the extraction of personally identifiable information by malicious actors.
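To make the idea of a Bayesian approach that is "automatically updated with prior and additional knowledge" a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of any particular product: the `BetaBelief` class, the alert-triage scenario, and all of the counts are hypothetical. It uses the standard Beta-Binomial conjugate update, where a prior belief about the fraction of alerts that are malicious is revised as analysts label new batches of alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief about the probability that an alert is malicious.

    Hypothetical helper for illustration only.
    """
    alpha: float  # pseudo-count of malicious alerts
    beta: float   # pseudo-count of benign alerts

    def update(self, malicious: int, benign: int) -> "BetaBelief":
        # Conjugate update: observed counts are added to the prior pseudo-counts.
        return BetaBelief(self.alpha + malicious, self.beta + benign)

    @property
    def mean(self) -> float:
        # Posterior mean estimate of the malicious rate.
        return self.alpha / (self.alpha + self.beta)


# Assumed prior: roughly 10% of alerts are malicious (made-up numbers).
prior = BetaBelief(alpha=1.0, beta=9.0)

# Each new batch of labeled alerts refines the belief without retraining from scratch.
day_one = prior.update(malicious=3, benign=7)
day_two = day_one.update(malicious=0, benign=10)

print(round(prior.mean, 3), round(day_one.mean, 3), round(day_two.mean, 3))
```

The point of the sketch is that yesterday's posterior becomes today's prior, so the estimate adapts as new evidence arrives, in contrast to a static rule or signature that has to be rewritten by hand.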
Join Us at the Data Mining in Cybersecurity Meetup
Data science within security is admittedly difficult, with low tolerance for errors and few open datasets for training and testing. These challenges, however, make the work that much more rewarding and impactful. Endgame’s data science and research and development teams are increasingly pursuing many of the established and bleeding-edge techniques in data science across a wide range of data feeds. If you’d like to meet some of the team and hear more about our research, we’ll be hosting the Data Mining in Cybersecurity Meetup in San Francisco on November 12th.