Last week, in an unprecedented move, researchers at OpenAI announced that they would not be releasing the dataset, code, or model weights of their powerful new language model due to safety and security concerns. The researchers cited founding principles in the OpenAI charter, which predicts that “safety and security concerns will reduce our traditional publishing in the future.” The particular fear is that the GPT-2 model could be misused to generate fake news articles at scale, impersonate others, or fuel phishing campaigns.
Coincidentally, this news comes almost exactly one year after the release of the Malicious Use of AI report, co-authored in part by researchers from OpenAI and Endgame. In some security circles, such as AI Village, the discussion around the GPT-2 non-release has outweighed that of the original report. As we noted in our blog post that coincided with that report, cybersecurity, with its lack of established norms, lies at a critical inflection point for the potential misuse of machine learning, a problem that spans a broad array of fields and, in GPT-2’s case, includes misinformation.
Endgame researchers, including authors of this post, have a history of releasing models, datasets, code, and papers that could conceivably be used for malicious purposes. As co-authors with OpenAI on the Malicious Use of AI report, how do we justify these releases? What discussions and checks are in place when choosing to (or not to) release? Is a contraction in the sharing of infosec models and code to be expected?
Deciding to release
Each decision to release should come after a thoughtful debate about potential social and security impacts. In our case, none of our previously released research openly targeted flaws in software products; rather, it highlighted more general blind spots and other causes for concern. We identified and discussed the costs attackers would need to bear so as to avoid releasing tools that exploit “low-hanging fruit.” Importantly, for each red-team tactic, we also disclosed blue-team defenses.
The series of releases has refined a few guiding principles about responsible release that are worth highlighting:
- Invite debate that deliberately includes parties with no vested interest in the code release or publication. They need not be fatalists, but they can walk through worst-case scenarios with fresh eyes.
- Adopt long-standing and broadly recognized responsible disclosure guidelines that include early notification of impacted parties, generous time before disclosure, and generally acting in good faith.
- Expect and embrace public debate and pushback, which is an important socially-rooted check.
- Be willing to give inventors and authors the benefit of the doubt on their decision to release or withhold. There may be additional details or discussion, not made public, that contributed to the decision. Even while maintaining the public dialogue, don’t commit the fundamental attribution error by ascribing recklessness, greed, or malintent.
Why did we release?
The reality is that most AI/ML research can be leveraged for benign or malicious uses. Technologies like DeepFakes highlight how relatively innocuous research can pivot quickly into nightmare territory. For this reason, it is important to acknowledge these concerns by committing to guiding principles such as those above. This is one net positive of OpenAI withholding the GPT-2 model: though not without criticism, the decision set a precedent and brought the conversation into the public forum.
Let’s take two recent examples from Endgame history where models, tools, and frameworks could have been leveraged by threat actors.
The malware (evasion) gym provided a reinforcement learning approach for manipulating malicious binaries so that they bypass antimalware machine learning models. This certainly constitutes red-team activity, and abiding by the principles above, we also disclosed the best known defense against the attack. In this case, the machine learning model the project targeted was a “likeness” of commercial NGAV products, not a replica. In the spirit of responsible disclosure, it represented an early-warning proof of concept, not a gaping vulnerability. In reality, adversaries would need considerable effort to make this sort of attack succeed. The fact of the matter is that there are significantly easier, and less costly, ways to evade network or endpoint defenses.
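To make the red-team loop concrete, here is a minimal sketch of the kind of episode loop such a gym exposes. It assumes the open-source gym-malware package and a "malware-v0" environment id; both names are assumptions to be verified against the repository, and a trained agent would replace the random policy shown here.

```python
# Minimal sketch of an evasion episode loop (assumes the gym-malware package;
# the "malware-v0" environment id is an assumption -- check the repository).
import gym
import gym_malware  # registers the malware manipulation environments

env = gym.make("malware-v0")

for episode in range(10):
    observation = env.reset()  # feature vector of a malware sample
    done = False
    while not done:
        # Random functionality-preserving mutation; a trained RL agent
        # would choose actions from a learned policy instead.
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        # A positive reward indicates the surrogate antimalware model
        # scored the mutated binary as benign (i.e., evasion).
```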
EMBER is a benchmark dataset and anti-malware model that was released, in part, for adversarial research, and as such could be misused. Giving an adversary the “source code” of a modern machine learning antimalware architecture certainly can aid an attack, but as in the other cases, education and transparency for researchers and defenders were ultimately paramount. We strongly believe that it is important to shine a bright light on the strengths and weaknesses of ML technologies in the information security domain. In fact, exposing weaknesses in models like EMBER can help blue teams mitigate risk when facing a red team or a real adversary. This game-theoretic approach is common across the infosec community and not one that ML research should shy away from.
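For readers unfamiliar with the release, the sketch below shows roughly how the published ember package is used to build and query the baseline model. The local dataset path and sample file are placeholders, and the exact function names should be confirmed against the EMBER repository.

```python
# Rough sketch of the EMBER workflow (paths are placeholders; verify the
# function names against the EMBER repository before relying on them).
import ember

data_dir = "/data/ember2018/"  # local copy of the EMBER dataset

# Turn the raw JSON feature records into numpy arrays on disk, then load them.
ember.create_vectorized_features(data_dir)
X_train, y_train, X_test, y_test = ember.read_vectorized_features(data_dir)

# Train the baseline LightGBM model over the vectorized features.
lgbm_model = ember.train_model(data_dir)

# Score an arbitrary PE file; the output is a maliciousness score.
with open("sample.exe", "rb") as f:
    print(ember.predict_sample(lgbm_model, f.read()))
```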
A less liberal future?
Part of the machine learning community is emphasizing reproducibility through open code and model releases. It is all too easy to create a model that is a little overtrained and cherry-picked, and open release mitigates these issues. This is a wonderful aspect that has drawn many to the community, and it is the source of one of the biggest criticisms of OpenAI’s decision not to release the full model. Based on descriptions of the dataset size and training cost, it is less likely that GPT-2 is overtrained, but there remains a legitimate call for open release.
However, models that can produce images, text, and audio capable of fooling humans can be harmful. The common practice of reverse image searching a suspected bot’s profile photo to check whether it is a stock image is now completely defeated by facial GANs. There is also the disturbing success that the red-team model SNAP_R had in phishing on Twitter. These aren’t vulnerabilities in software that can be patched out; attacks based on these generative models are attacks on human perception. We can educate people not to follow links and not to trust everything they read in a Facebook post, but that takes time, and there will always be “unpatched” people who are new to the internet.
Final thoughts
Healthy discussion about this continues, and we look forward to more formal forums for debating how to release and mitigate dangerous models. Certainly, not every model need be readily available to advance the state of machine learning. If authors feel that their model could be dangerous, there should be a mechanism for them to submit it for review. If the review process validates the science and agrees that the model is dangerous, a report on the model, including a mitigation strategy, should be published. It is impossible to “put the genie back in the bottle” once a dangerous model is created. Nation-state actors have the resources to reproduce the full GPT-2, but we probably should not give every script kiddie on the internet access to it.
Security sits in an unenviable position. As we have argued before, openness and transparency are ideals to which our community should aspire. But where is the bright line that one shouldn’t cross? We think that line becomes crisper when one abides by the simple guiding principles we’ve outlined here.