An AI control protocol is a plan for deploying AI systems usefully while preventing them from intentionally causing unacceptable outcomes.
This paper investigates whether AI systems can generate and act on their own strategies for subverting control protocols while operating statelessly, i.e., without memory persisting between contexts.
To measure this, it introduces Subversion Strategy Eval, a suite of eight environments, and evaluates a range of AI models on their ability to subvert control protocols.
The results show that the evaluated models do not demonstrate substantial capability for strategizing to subvert control protocols statelessly, although providing them with additional affordances improves their performance.
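To make the stateless setting concrete, here is a minimal hypothetical sketch (not the paper's actual harness; all names are illustrative). A stateless policy is called once per episode with no memory of earlier episodes, so any cross-episode coordination, such as acting rarely but consistently, must come from the policy's own randomization over its inputs rather than stored state.

```python
import random

def stateless_episode(policy, seed):
    """Run one episode; nothing carries over from previous episodes."""
    rng = random.Random(seed)
    observation = rng.random()  # hypothetical environment observation in [0, 1)
    return policy(observation)  # policy sees only this episode's input

def evaluate(policy, n_episodes=100):
    """Estimate how often the policy chooses to act across independent episodes."""
    actions = [stateless_episode(policy, seed) for seed in range(n_episodes)]
    return sum(actions) / n_episodes

# One simple stateless strategy: act only on rare observations, so the
# per-episode action probability stays low without any shared memory.
p = 0.05
attack_rate = evaluate(lambda obs: obs < p)
```

The design point the sketch illustrates: because each call is independent, the policy cannot count episodes or remember past choices; a fixed threshold on a random observation is one of the few ways to hold a target action rate without state.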