According to Cointelegraph, a team of researchers from artificial intelligence (AI) firm AutoGPT, Northeastern University, and Microsoft Research have developed a tool that monitors large language models (LLMs) for potentially harmful outputs and prevents them from executing. The agent is described in a preprint research paper titled “Testing Language Model Agents Safely in the Wild.” The research states that the agent is flexible enough to monitor existing LLMs and can stop harmful outputs such as code attacks before they happen.
Existing tools for monitoring LLM outputs for harmful interactions seemingly work well in laboratory settings, but when applied to testing models already in production on the open internet, they often fall short of capturing the dynamic intricacies of the real world. This is largely due to the existence of edge cases and the impossibility of researchers imagining every possible harm vector before it happens in the field of AI. Even when humans interacting with AI have the best intentions, unexpected harm can arise from seemingly innocuous prompts.
To train the monitoring agent, the researchers built a dataset of nearly 2,000 safe human/AI interactions across 29 different tasks ranging from simple text-retrieval tasks and coding corrections to developing entire webpages from scratch. They also created a competing testing dataset filled with manually-created adversarial outputs, including dozens designed to be unsafe. The datasets were then used to train an agent on OpenAI’s GPT 3.5 turbo, a state-of-the-art system, capable of distinguishing between innocuous and potentially harmful outputs with an accuracy factor of nearly 90%.