The Deliberative Alignment paper explains how OpenAI tackles jailbreaks and enforces safety protocols by inducing deliberation in its latest models.
The paper focuses on teaching the model its safety specifications and having it reason over them systematically before answering.
The training process involves generating training data from safety specifications and prompts, filtering that data with a reward model, and then using reinforcement learning to reward chain-of-thought reasoning that adheres to the spec, as sketched below.
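To make the data-generation and filtering stage concrete, here is a minimal Python sketch of what such a pipeline could look like. The names (`generate_completion`, `judge_score`, `SAFETY_SPEC`, `THRESHOLD`) are hypothetical placeholders rather than OpenAI's actual code or API, and the detail that the spec is shown only at generation time is how the paper describes it, simplified here.

```python
SAFETY_SPEC = "..."  # excerpt of the safety specification text (placeholder)
THRESHOLD = 0.8      # assumed cutoff for the reward-model judge

def build_sft_dataset(prompts, generate_completion, judge_score):
    """Generate (prompt, chain-of-thought, answer) examples that reference the spec,
    then keep only those the reward model scores as spec-compliant."""
    dataset = []
    for prompt in prompts:
        # The spec is placed in context only at data-generation time;
        # the fine-tuned model must later recall and apply it on its own.
        cot, answer = generate_completion(prompt, spec=SAFETY_SPEC)
        score = judge_score(prompt, cot, answer, spec=SAFETY_SPEC)
        if score >= THRESHOLD:
            # The spec text itself is not stored with the example, so the
            # model learns to reason about the policy from memory.
            dataset.append({"prompt": prompt, "cot": cot, "answer": answer})
    return dataset
```

The filtered dataset would then feed supervised fine-tuning, with a subsequent reinforcement learning stage using a spec-aware reward signal to further reinforce this style of reasoning.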
The o1 model shows promise in improving true positives (unsafe prompts correctly refused) and decreasing false negatives (unsafe prompts that slip through), signaling a step towards AGI.
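For readers less familiar with these terms, the toy snippet below illustrates how such safety metrics could be tallied, assuming a labeled evaluation set where "unsafe" prompts should be refused; it is only an illustration of the metrics, not the paper's evaluation code.

```python
def safety_metrics(labels, refused):
    """labels[i] is True if prompt i is unsafe; refused[i] is True if the model refused it."""
    tp = sum(l and r for l, r in zip(labels, refused))        # unsafe prompts correctly refused
    fn = sum(l and not r for l, r in zip(labels, refused))    # unsafe prompts that slipped through
    fp = sum((not l) and r for l, r in zip(labels, refused))  # benign prompts over-refused
    return {"true_positives": tp, "false_negatives": fn, "false_positives": fp}

print(safety_metrics([True, True, False], [True, False, False]))
# -> {'true_positives': 1, 'false_negatives': 1, 'false_positives': 0}
```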