Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Lizhi Lin
Honglin Mu
Zenan Zhai
Minghan Wang
Yuxia Wang
Renxi Wang
Junjie Gao
Yixuan Zhang
Wanxiang Che
Timothy Baldwin
Xudong Han
Haonan Li

Abstract

Generative models are rapidly gaining popularity and being integrated into everyday applications, raising safety concerns as various vulnerabilities are exposed. In light of this, the field of red teaming is undergoing fast-paced growth, highlighting the need for a comprehensive survey that covers the entire pipeline and addresses emerging topics. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we develop the "searcher" framework to unify various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around LLM-based agents, overkill of harmless queries, and the balance between harmlessness and helpfulness.


Warning: This paper contains examples that may be offensive, harmful, or biased.
