Last year, Endgame released an open source benchmark dataset called EMBER (Endgame Malware BEnchmark for Research). EMBER contains 1.1 million portable executable (PE) file sha256 hashes scanned in or before 2017, features extracted from those PE files, a benchmark model, and a code repository that makes it easy to work with this data. Since then, researchers have used this dataset to quantify how quickly models degrade [1], investigate how labels evolve over time [2], and even explore how malware classifiers are vulnerable to attack [3]. We were very pleased to see this response from the community, but we were also aware of a couple of areas where we thought EMBER could improve.
Today, we’d like to announce a new release of EMBER that we built with those improvements in mind. This release is accompanied by a new set of 1 million PE files scanned in or before 2018 and an expanded feature set. Here are some of the differences between this release and the original EMBER:
More Difficult Dataset
We found that the test set in the original EMBER 2017 dataset release was very easy to classify. The benchmark model on that data was trained with default LightGBM parameters and still achieved a 0.999 area under the receiver operating characteristic (ROC) curve. So this time around, we selected the PE files to include with the explicit goal of making the classification task harder. The benchmark model trained on the EMBER 2018 files is now optimized by a simple parameter grid search, yet it only achieves a 0.996 area under the ROC curve. This is still excellent performance, but we hope it leaves more room for innovative and improved classification techniques to be developed.
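For readers who want to experiment with a model along these lines, here is a rough sketch (not our exact training script) of a small LightGBM parameter grid search scored with ROC AUC. It assumes X_train, y_train, X_test, and y_test hold the vectorized EMBER features with unlabeled (-1) training rows already removed, and the grid values are purely illustrative.

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out part of the training data for model selection.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

best_auc, best_model = 0.0, None
# A tiny, illustrative grid; a real search would cover more parameters.
for num_leaves in (31, 64, 128):
    for learning_rate in (0.05, 0.1):
        model = lgb.LGBMClassifier(
            n_estimators=400,
            num_leaves=num_leaves,
            learning_rate=learning_rate,
        )
        model.fit(X_fit, y_fit)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_auc, best_model = auc, model

# Report the ROC AUC of the selected model on the held-out test set.
print(roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
```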
Expanded and Updated Feature Set
This latest release of EMBER contains a new feature set that we are calling feature set version 2. The main difference between this feature set and version 1 features is that we are now using LIEF version 0.9.0 instead of version 0.8.3. Feature calculations using the new version of LIEF are not guaranteed to be equivalent to those using the older version.
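As a quick sketch of how the new features can be computed for a single file with the ember package (the class and argument names below follow the current code repository and may differ slightly between versions), the extractor parses the PE with LIEF under the hood, so the installed LIEF version matters:

```python
from ember.features import PEFeatureExtractor

# feature_version=2 selects the new feature set described in this post.
extractor = PEFeatureExtractor(feature_version=2)

with open("sample.exe", "rb") as f:  # "sample.exe" is just a placeholder path
    bytez = f.read()

vector = extractor.feature_vector(bytez)  # numpy array of model-ready features
print(vector.shape)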
Also, one of our goals for the features that we provided last year was that researchers could recreate decisions made by the Adobe Malware Classifier without going back to the original PE file. It turns out that the version 1 features were insufficient to reproduce that work. The new data directory features in version 2 of our feature set now allow that analysis.
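To illustrate the information these features capture, the small LIEF snippet below (assuming LIEF 0.9 or later and a placeholder PE path) dumps each data directory's type, relative virtual address, and size. Values such as the debug directory size or the import address table RVA are among the inputs used by the Adobe Malware Classifier.

```python
import lief

binary = lief.parse("sample.exe")  # placeholder path to a PE file
for directory in binary.data_directories:
    # Each entry exposes its type, relative virtual address, and size.
    print(directory.type, hex(directory.rva), directory.size)
```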
Thanks to a clustering analysis carried out on the original EMBER release, we found that the distribution of some feature values was very uneven. The main culprits were samples with a large number of ordinal imports that were all hashed to the exact same place. In order to smooth the feature space, ordinal imports are now hashed along with named imports in feature set version 2.
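The sketch below shows the general idea using scikit-learn's FeatureHasher; it simplifies the actual feature code, and the library and function names are made up. The point is that an ordinal import is turned into a string and hashed like any other import name instead of being counted separately.

```python
from sklearn.feature_extraction import FeatureHasher

imports = [
    "kernel32.dll:CreateFileA",
    "kernel32.dll:ordinal17",  # ordinal import, hashed like a named import
    "ws2_32.dll:ordinal2",
]

hasher = FeatureHasher(n_features=1024, input_type="string")
hashed = hasher.transform([imports]).toarray()[0]
print(hashed.shape)  # (1024,)
```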
After freezing the development of version 2 of our feature set, we calculated these new features for the original EMBER 2017 set of files. Those new features calculated on the old files are now publicly available for download in addition to the features from the new 2018 files.
PE File Selection
Clustering analysis of EMBER 2017 also revealed extreme outliers and PE files that had exactly the same feature vector. Although any deployed static malware classifier should expect this sort of dirty data in the real world, we wanted our benchmark dataset to better capture performance across a normalized view of the set of PE files. For this reason, we used this outlier and duplicate detection to clean some of the worst offenders before finalizing our selection of files for the 2018 dataset.
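As a rough illustration (not the exact cleaning pipeline we ran), flagging exact duplicate feature vectors and crude outliers can be done in a few lines of numpy, assuming X is a (num_samples, num_features) array of feature vectors:

```python
import numpy as np

# Exact duplicates: keep only the first occurrence of each feature vector.
_, first_index = np.unique(X, axis=0, return_index=True)
keep = np.zeros(len(X), dtype=bool)
keep[first_index] = True

# Crude outlier flag: any feature more than 10 standard deviations from
# its column mean (constant columns are skipped to avoid division by zero).
mean, std = X.mean(axis=0), X.std(axis=0)
std[std == 0] = 1.0
outliers = (np.abs((X - mean) / std) > 10).any(axis=1)

X_clean = X[keep & ~outliers]
```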
Adding an additional year of labels and features onto the original data release opens the door to more longitudinal studies of model performance degradation and malware family evolution. While these possibilities are exciting, they are complicated by the fact that different logic was used in each year to select the set of files that were included in the dataset. The malware and benign samples are not sampled identically from a single distribution. Depending on the goal of the research, this difference may not matter. But researchers must be aware of this when forming and testing their hypotheses.
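As one example of the kind of longitudinal analysis this enables, the sketch below scores a fixed model month by month. It assumes a trained scikit-learn style classifier (such as the LGBMClassifier above) plus test features, labels, and an aligned array of year-month strings taken from the metadata; all of these variable names are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# appeared: array of year-month strings aligned with X_test / y_test rows.
for month in np.unique(appeared):
    idx = appeared == month
    if len(np.unique(y_test[idx])) < 2:
        continue  # need both classes present to compute ROC AUC
    scores = model.predict_proba(X_test[idx])[:, 1]
    print(month, round(roc_auc_score(y_test[idx], scores), 4))
```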
It’s been inspiring to see all the interest in EMBER over the last year and a half. We’re hoping that this new release can further empower researchers to find new static classification techniques, quantify the performance of existing models, or simply to practice their data analysis skills. Please reach out if you have any questions or suggestions!
[1] Understanding Model Degradation In Machine Learning Models
[2] Measure Twice, Quarantine Once: A Tale of Malware Labeling over Time
[3] TreeHuggr: Discovering where tree-based classifiers are vulnerable to adversarial attack