Conversational interfaces have improved customer interactions across a wide range of industries and use cases, providing interactive and intuitive experiences. That experience, however, is diminished if the core model of the conversational interface is poorly implemented. While many chatbots have a wealth of data on which to train, many have never been trained in the wild. This limits their ability to target customers’ pain points, and at worst, the bot becomes infuriating (think Clippy). In other words: garbage in, garbage out, garbage UX. And unlike the airline industry, which has the open ATIS dataset, infosec currently has no open training datasets. This absence of training data was one of the challenges we encountered in developing Artemis, our machine learning (ML)-powered security chatbot.
Artemis is designed to elevate and enhance the capabilities of analysts across all levels of expertise. Given natural language queries, Artemis can perform and automate complex analyses, data collection, investigation, and response to threats. But how does this bot’s ML engine work? Artemis extracts information from natural language queries and converts it into an action, typically an API call to the core Endgame platform. This requires an optimized training set that accurately captures a generalized set of expected and unexpected user input. I’ll discuss the fundamental steps our team took to train Artemis, including the process of collecting training data, as well as our tool for validating the language models, BotInspector. Going forward, this foundation will be essential to meet evolving customer needs and use cases through a regular cadence of Artemis updates.
Natural Language Understanding
Before discussing the data, it is useful to provide a brief refresher on two of the key components of natural language understanding (NLU). As we discussed in a previous post, entities and intents are the core components of the user input (the utterance) passed into our data pipeline. The entities in an utterance may include IP addresses, endpoints, usernames, and filenames, all of which the platform can perform some action with or on. The intent is, of course, what we would like to do. Do we want to search for processes or process lineage? Search DNS or netflow data? Sentence structures often vary depending on the intent, so entity extraction and intent classification go hand in hand.
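To make this concrete, here is a minimal sketch of how one utterance might decompose into an intent and entities. The field names and intent labels below are illustrative, not Artemis’ actual schema:

```python
# A hypothetical parse of a single utterance. This schema is invented
# for illustration; it is not Artemis' internal format.
utterance = "search dns for requests from 10.0.3.17 on WIN-ALPHA"

parsed = {
    "intent": "search_dns",       # what the user wants done
    "entities": [                 # what the action operates with or on
        {"type": "ip_address", "value": "10.0.3.17"},
        {"type": "endpoint", "value": "WIN-ALPHA"},
    ],
}
```

Everything that follows is about producing enough labeled pairs like this one to train both components well.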
Training a Chatbot with a Chatbot
To train the entity extractor and intent classifier components of the model, we need quite a bit of data—particularly labeled data consisting of real utterances matched with their intents and entities. Sure, there are natural language libraries freely available, but we aren’t aiming for Artemis to understand all sentences and strike up a casual conversation with us. Rather than arbitrary English sentences, we need to train the model on realistic natural language queries used in infosec. This requires amassing enough data to train a chatbot to comprehend intent based on utterances full of security jargon, and translate it into a more intuitive conversational structure. Unfortunately, this data is not readily available—so we generated our own!
Initially, this data was generated using a complex script that randomized both utterance structures and fields in a manner resembling Mad Libs (sketched below). That is, we created a set of templates that were auto-populated with information security jargon, each paired with its associated label (intent), to compile our dataset. While this data enabled us to train a reasonably accurate model, it was limited by the speech patterns created by the author of the templates. How did we fix this without having to manually modify the generator to support more utterance structures? With another chatbot!
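Before we get to that second chatbot, here is a minimal sketch of the original template-driven generator. The templates, entity pools, and intent labels are invented for illustration:

```python
import random

# Hypothetical templates with slots to fill, each paired with its
# intent label. The real template set was far larger and more varied.
TEMPLATES = [
    ("search process for {file} on {host}", "search_process"),
    ("show me the parent tree of {file}", "parent_process_tree"),
    ("search dns for lookups from {host}", "search_dns"),
]

FILES = ["svch0st.exe", "notasuspiciousfile.exe", "payload.dll"]
HOSTS = ["WIN-ALPHA", "DC-01", "LAPTOP-7"]

def generate(n):
    """Emit n labeled (utterance, intent) pairs by filling random slots."""
    samples = []
    for _ in range(n):
        template, intent = random.choice(TEMPLATES)
        utterance = template.format(file=random.choice(FILES),
                                    host=random.choice(HOSTS))
        samples.append((utterance, intent))
    return samples
```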
Designed as a chat room on an Endgame chat server, the artemis_test chatbot allows users to directly supply sample queries to the Artemis engine, which responds with its interpretation based on the current model. This allows us to employ active learning to improve the model. The bot prompts the user to correct any issues with its interpretation via a straightforward conversational interface. This process outputs labeled data that can be used to train both components of the model, as you see in the HipChat conversations below.
HipChat Output of User and Bot Interactions
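In outline, each artemis_test exchange works something like the loop below. The model.parse call and the console I/O are stand-ins for the real chat plumbing, which runs over HipChat:

```python
def active_learning_turn(utterance, model):
    """One artemis_test exchange: predict with the current model, ask
    the user to confirm or correct the interpretation, and emit a
    labeled training sample. A sketch, not the real implementation."""
    prediction = model.parse(utterance)
    answer = input(f"I understood: {prediction!r}. Is that right? (y/n) ")
    if answer.strip().lower() != "y":
        prediction = input("What should the intent and entities be? ")
    return {"utterance": utterance, "label": prediction}
```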
Since Artemis’ machine learning algorithms require several samples of each utterance structure, we wrote another generator that treats data from artemis_test as templates, but only randomizes the extracted entities, such as filenames or IP addresses, rather than the entire utterance structure. This allows us to generate numerous samples for each template provided by the original data generator and artemis_test. While we must still rely on the original generator for a sizable portion of the training data, artemis_test allows us to add new utterance structures and patch misinterpretations quickly, without a lot of tedious manual work. We have seen massive improvements in entity extraction as a result.
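A sketch of that entity-substitution idea, using invented entity pools and the same simplified sample format as above:

```python
import random

# Hypothetical replacement pools, keyed by entity type.
ENTITY_POOLS = {
    "file": ["svch0st.exe", "dropper.bin", "update.dll"],
    "ip_address": ["10.0.3.17", "192.168.1.44", "172.16.9.2"],
}

def expand(sample, n):
    """Treat one labeled utterance as a template: keep its structure
    and intent, but swap each tagged entity for a random value of the
    same type."""
    out = []
    for _ in range(n):
        text = sample["utterance"]
        for entity in sample["entities"]:
            replacement = random.choice(ENTITY_POOLS[entity["type"]])
            text = text.replace(entity["value"], replacement)
        out.append({"utterance": text, "intent": sample["intent"]})
    return out

seed = {
    "utterance": "search process for svch0st.exe",
    "intent": "search_process",
    "entities": [{"type": "file", "value": "svch0st.exe"}],
}
print(expand(seed, 3))
```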
Tuning Our Training Data
BotInspector, a tool we developed to evaluate our results, serves two purposes: it gives us a detailed understanding of the makeup of our training data, and it provides insight into the resulting models’ performance.
We analyze the composition of our training data to make sure that different intents and entity types occur at frequencies appropriate to the complexity of utterances for each intent. For example, there are more ways to phrase the query, “I wanna search process for notasuspiciousfile.exe,” than there are to ask Artemis to “cancel,” so our training data must reflect this difference. Furthermore, the intents “search process” and “parent process tree” are often present in queries with only subtle differences in utterance structure. A significant imbalance in training data between the two could cause one intent to dominate in the classifier, resulting in misclassifications of unfamiliar utterances. BotInspector also generates frequencies of different entity types, ensuring that the model is trained equally on each entity type, as well as on entities appearing in lists of varying lengths.
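A simplified version of this composition check, counting intent and entity-type frequencies over a list of labeled samples (generic Python, not BotInspector’s actual code):

```python
from collections import Counter

def composition_report(samples):
    """Tally intent and entity-type frequencies so that under- or
    over-represented classes stand out before training."""
    intents = Counter(s["intent"] for s in samples)
    entity_types = Counter(
        e["type"] for s in samples for e in s.get("entities", [])
    )
    return intents, entity_types
```

Comparing these counts against the expected phrasing variety for each intent tells us where to generate more, or fewer, samples.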
Model Validation
In addition to analyzing training data, BotInspector serves as a model validation tool. Given a test file containing a set of queries not present in the training data, BotInspector returns a results file containing accuracy percentages for entity extraction, intent classification, and the combination of the two. It also displays a list of misinterpreted samples: incorrect entity extractions sorted by intent, and incorrect intent classifications categorized by whether the entity extractor also failed on the same samples. This is useful because, as previously mentioned, intent classification and entity extraction are intertwined.
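In outline, the validation pass looks something like the sketch below. The sample format and model.parse interface are assumptions carried over from the earlier examples, not BotInspector’s real API:

```python
def validate(model, test_samples):
    """Score a model on held-out queries: entity-extraction accuracy,
    intent-classification accuracy, and the two combined, plus the
    misinterpreted samples for inspection."""
    counts = {"entities": 0, "intent": 0, "both": 0}
    misses = []
    for sample in test_samples:
        pred = model.parse(sample["utterance"])
        entities_ok = pred["entities"] == sample["entities"]
        intent_ok = pred["intent"] == sample["intent"]
        counts["entities"] += entities_ok
        counts["intent"] += intent_ok
        counts["both"] += entities_ok and intent_ok
        if not (entities_ok and intent_ok):
            misses.append((sample, pred))
    n = len(test_samples)
    return {k: v / n for k, v in counts.items()}, misses
```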
These results can be compared with those of previous model versions via BotInspector’s comparison function, which highlights areas of improvement and regression between versions.
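Conceptually, that comparison is just a per-metric delta over two validation runs:

```python
def compare(old_accuracy, new_accuracy):
    """Report per-metric change between two model versions; positive
    values are improvements, negative values are regressions."""
    return {metric: new_accuracy[metric] - old_accuracy[metric]
            for metric in new_accuracy}
```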
With these results, we can now close the model-building loop. BotInspector tells us where the model goes wrong, indicating which utterance structures it fails to comprehend. We can then add or reinforce these structures by supplying them to artemis_test, using BotInspector to monitor training data quality. And although this completes one pass through the cycle, the work is never finished: we continually train, validate with BotInspector, and repeat.
Next Steps
The implementation of artemis_test and BotInspector has vastly improved our NLU training pipeline, not only allowing us to identify deficiencies, but also providing a means of eliminating them and measuring the NLU engine’s improvement. This system is well suited to our use case: compared to personal assistants like Siri and Alexa, Artemis supports a smaller, domain-specific set of intents and entity types, and this limited domain allows it to support more complex natural language queries with varied phrasing.
As users increasingly use Artemis to interact with their data and systems via the Endgame platform, we will use targeted customer feedback as an additional source of training data. This will help us close unanticipated gaps in Artemis’ comprehension, expand its functionality, and continue to hone the user experience.
Tools need to empower analysts, not obstruct them. At Endgame, we are committed not only to providing the best prevention and protection in the industry, but also to making our platform as easy and intuitive to use as possible. Artemis reflects this pairing of ease of use with bleeding-edge protections within the Endgame platform, facilitating and expediting the analytic workflow while surfacing insights for analysts across a broad range of expertise.