Capital One recently hosted the excellent Data Intelligence conference in northern Virginia. As a data scientist working in infosec, I enjoyed meeting so many new people and old friends who were all interested in applying machine learning to diverse fields. I presented an overview of our early research into malware classification titled “Which Model Came Hot and Fresh Out the Kitchen in our Malware Classifier Bake Off?”. We had previously documented this research in our technical blog, which detailed in depth our path toward choosing a machine learning model for our endpoint malware classifier and provided tips for others as they evaluate machine learning models. The results of the bake off gave us a great foundation for building what eventually became MalwareScore™. In this talk, I added context about data science in security in general, the benefits and drawbacks of running a model bake off, and more information about our conclusions.
In our bake off, the Endgame data science team evaluated many machine learning models to see how well they could power a malware detection capability on customer endpoints. We focused not only on classification performance, but also on model size and query execution time, since the model had to fit within tight memory and CPU constraints. Our data science team members brought their expertise in many model families to this bake off, including nearest neighbors, decision tree-based models, and even a deep learning model.
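To make that kind of comparison concrete, here is a minimal sketch of a bake-off harness in the style described above. It is not our actual evaluation code: the dataset is synthetic, the candidate models and metrics are illustrative, and serialized pickle size stands in as a rough proxy for on-endpoint footprint.

```python
# Minimal bake-off sketch: score candidate models on accuracy,
# serialized size, and per-sample query latency. Illustrative only.
import pickle
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a labeled corpus of benign/malicious feature vectors.
X, y = make_classification(n_samples=5000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    size_kb = len(pickle.dumps(model)) / 1024  # proxy for on-disk footprint
    start = time.perf_counter()
    model.predict(X_test)
    latency_us = (time.perf_counter() - start) / len(X_test) * 1e6
    print(f"{name:>18}: acc={accuracy:.3f}  size={size_kb:8.1f} KB  "
          f"query={latency_us:6.1f} us/sample")
```

Even a toy harness like this surfaces the tradeoffs we cared about: a nearest neighbors model, for example, effectively stores its training set, which shows up immediately in the size and latency columns.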
In addition, this project allowed our data science team to share knowledge and insights. Having used a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel many years ago to find neutrinos from other galaxies during my graduate research, I thought I knew everything there was to know about them. But during this project, I learned that it’s best to train SVMs differently when the feature count is as high as it was here (>2000 features). This is just one example of how our data science team gained additional knowledge throughout the bake off process, and as a result expanded our own skillsets.
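The post above doesn’t spell out exactly what “differently” meant in our case, but one common version of this lesson is that in very high-dimensional feature spaces a linear kernel often performs comparably to an RBF kernel while training far faster. The sketch below illustrates that general idea on synthetic data; the dataset and parameters are assumptions for demonstration, not our bake-off configuration.

```python
# Hypothetical illustration of a high-dimensionality SVM lesson: with
# thousands of features, a linear kernel can rival RBF at a fraction
# of the training cost. Not the actual bake-off experiment.
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# High-dimensional synthetic data, mimicking a >2000-feature space.
X, y = make_classification(n_samples=3000, n_features=2500,
                           n_informative=50, random_state=0)
X = StandardScaler().fit_transform(X)  # SVMs are sensitive to feature scale
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, clf in [("RBF SVM", SVC(kernel="rbf", gamma="scale")),
                  ("Linear SVM", LinearSVC(C=1.0, max_iter=5000))]:
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy={clf.score(X_test, y_test):.3f}  "
          f"train_time={elapsed:.1f}s")
```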
After my talk, most questions focused on the many things we could have done to further improve the performance of the models across the board. I had to laugh at one point when my answer was once again going to be “no, we didn’t try that,” and explain that at some point the purpose of the bake off had been accomplished. Once we had learned from each other and seen some early performance results, it was important for the team to decide on a model and iterate on delivering an actual product. In our case, it was clear that gradient boosted decision trees offered the best combination of detection, size, and performance. Many of the audience’s suggestions (identifying problem executables and improving on them, searching for better features, Bayesian hyperparameter optimization) are techniques we have since used to improve and optimize MalwareScore™ performance, something we do on a nearly continual basis.
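For readers unfamiliar with Bayesian hyperparameter optimization, here is a short sketch of the technique applied to a gradient boosted model. It uses Optuna (whose default TPE sampler is a Bayesian-style optimizer); the post doesn’t say which library or search space we actually used, so the data, parameters, and ranges here are purely illustrative.

```python
# Sketch of Bayesian hyperparameter optimization for a gradient boosted
# classifier using Optuna's TPE sampler. Illustrative only.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

def objective(trial):
    # Each trial proposes hyperparameters informed by previous trials,
    # rather than sampling a fixed grid or drawing uniformly at random.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = GradientBoostingClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print("Best AUC:", study.best_value)
print("Best params:", study.best_params)
```

The appeal over grid or random search is sample efficiency: by modeling which regions of the search space look promising, the optimizer spends fewer expensive training runs on clearly bad configurations.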
If you’re considering using machine learning to solve a problem at your company or organization, a bake off is a great way to determine which direction you should take or to challenge existing assumptions about the best path. As you design a bake off, make sure to clearly define what you hope to learn from it. Machine learning can be applied to many different domains, and a model you didn’t expect could turn out to be the right fit for your problem area. At the same time, focusing too many resources on a bake off could distract your team from all the other tasks required before shipping a data product. By defining what questions you need answered from a bake off, you can reduce the chances of it becoming too large a project.
In the process of revisiting our earlier work to build this talk, I was reminded how this bake off really served as the ignition for our efforts to build MalwareScore™. We may remix the effort in the coming months to include what our team has learned about training deep learning architectures, and we will share results if and when they’re available.