Can We Red Team Our Way to AI Accountability?

Ranjit Singh is senior researcher, research equity at Data & Society; Borhane Blili-Hamelin is the taxonomy lead at the AI Vulnerability Database, a community partner organization for the Generative Red Team (GRT) challenge at DEF CON; and Jacob Metcalf is program director, AI on the Ground, at Data & Society. Shutterstock Last week’s much publicized Generative Red Team (GRT) Challenge at Caeasar’s Forum in Las Vegas, held during the annual DEF CON computer security convention, underscored the enthusiasm for red-teaming as a strategy for mitigating generative AI harms. Red-teaming LLMs often involves prompting AI models to produce the kind of content they’re not supposed to — that might, for example, be offensive, dangerous, deceptive, or just uncomfortably weird. The GRT Challenge offered a set of categories of harmful content that models such as OpenAI’s GPT4, Google’s Bard, or Anthropic’s Claude may output, from disclosing sensitive information such as credit card numbers to biased statements against particular communities. This included “jailbreaking” — intentional efforts to get the model to misbehave — as well as attempts, through benign prompts, to model the normal activities of everyday internet users who might stumble into harm. Volunteer players were evaluated on their prompts that made the models misbehave.  Co-organized by the DEF CON-affiliated AI Village and the AI accountability nonprofits Humane Intelligence and SeedAI, the challenge was “aligned with the goals of the Biden-Harris Blueprint for an AI Bill of Rights and the NIST AI Risk Management Framework,” and involved the participation of the National…Can We Red Team Our Way to AI Accountability?