Researchers propose Self-RedTeam, an online self-play reinforcement learning algorithm for training safer language models.
The algorithm casts safety training as a two-player zero-sum game in which an attacker agent and a defender agent co-evolve through continuous interaction.
This dynamic co-adaptation is designed to converge to a Nash equilibrium, at which point the defender responds reliably and safely even against the attacker's best strategies.
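The zero-sum, equilibrium-seeking dynamic can be illustrated with a toy sketch (an illustrative assumption, not the paper's actual training setup): fictitious play on a matching-pennies-style attack/defense game, where each player repeatedly best-responds to the opponent's empirical mixture of past actions. In two-player zero-sum games, these empirical mixtures converge to a Nash equilibrium, which is the analogue of the co-adaptation the paper describes.

```python
import numpy as np

# Toy zero-sum "red-teaming" game (hypothetical, for illustration only):
# the attacker picks one of two attack styles, the defender picks one of
# two defenses. The attack succeeds (+1 to the attacker, -1 to the
# defender) only when the defense does not match the attack, so the
# unique Nash equilibrium is for both players to mix 50/50.
ATTACKER_PAYOFF = np.array([[-1.0, 1.0],
                            [1.0, -1.0]])  # rows: attacks, cols: defenses

def fictitious_play(steps=20000):
    """Each player best-responds to the opponent's empirical action
    frequencies; in zero-sum games these frequencies converge to a
    Nash equilibrium (Robinson's classical result)."""
    attack_counts = np.ones(2)   # pseudo-counts of past attacker actions
    defense_counts = np.ones(2)  # pseudo-counts of past defender actions
    for _ in range(steps):
        # Attacker best-responds to the defender's empirical mixture.
        defense_mix = defense_counts / defense_counts.sum()
        attack = int(np.argmax(ATTACKER_PAYOFF @ defense_mix))
        # Defender best-responds by minimizing the attacker's payoff.
        attack_mix = attack_counts / attack_counts.sum()
        defense = int(np.argmin(attack_mix @ ATTACKER_PAYOFF))
        attack_counts[attack] += 1
        defense_counts[defense] += 1
    return (attack_counts / attack_counts.sum(),
            defense_counts / defense_counts.sum())

attacker_strategy, defender_strategy = fictitious_play()
print(attacker_strategy, defender_strategy)  # both mixtures approach [0.5, 0.5]
```

Self-RedTeam replaces this tabular best-response loop with reinforcement learning updates on the attacker and defender language models, but the equilibrium target is analogous: neither agent can improve by deviating unilaterally.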
Empirical results show that Self-RedTeam uncovers more diverse attacks and achieves higher robustness on safety benchmarks than approaches that train against a static defender.