Group Relative Policy Optimization (GRPO) enhances policy learning by computing advantages from relative comparisons among a group of candidate outputs that share a common input prefix, removing the need for a separate learned value baseline.
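As a concrete illustration, here is a minimal sketch of the group-relative advantage computation GRPO is built on; the function name and the epsilon term are illustrative choices, not from the source:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one group of candidate outputs.

    Each candidate's reward is normalized against the statistics of its own
    group, so relative quality within the group drives the policy gradient.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four candidate completions sampled from the same shared prefix.
rewards = torch.tensor([0.2, 0.9, 0.5, 0.4])
print(group_relative_advantages(rewards))
```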
Prefix Grouper is an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy, reducing computational overhead in long shared-prefix scenarios.
Prefix Grouper restructures self-attention into two parts so that the shared prefix is encoded only once, while preserving full differentiability and compatibility with end-to-end training.
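The sketch below illustrates this two-part idea in a toy single-head setting; all names, shapes, and projections are illustrative assumptions rather than the library's API. The prefix keys and values are computed once and reused by every candidate suffix, and because the reuse is a broadcast view, gradients still flow end-to-end:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64                                   # head dimension (illustrative)
prefix_len, suffix_len, G = 128, 16, 4   # G candidate suffixes share one prefix

# Stand-ins for one attention head's q/k/v projection weights.
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

prefix = torch.randn(1, prefix_len, d)    # shared-prefix hidden states
suffixes = torch.randn(G, suffix_len, d)  # hidden states of G candidates

# Part 1: causal self-attention over the shared prefix, computed ONCE.
pq, pk, pv = prefix @ wq, prefix @ wk, prefix @ wv
prefix_out = F.scaled_dot_product_attention(pq, pk, pv, is_causal=True)

# Part 2: each suffix attends over the cached prefix K/V plus its own K/V.
sq, sk, sv = suffixes @ wq, suffixes @ wk, suffixes @ wv
k = torch.cat([pk.expand(G, -1, -1), sk], dim=1)  # broadcast, no recompute
v = torch.cat([pv.expand(G, -1, -1), sv], dim=1)

# Suffix tokens see the whole prefix and causally earlier suffix tokens.
mask = torch.ones(suffix_len, prefix_len + suffix_len, dtype=torch.bool)
mask[:, prefix_len:] = torch.tril(
    torch.ones(suffix_len, suffix_len, dtype=torch.bool)
)
suffix_out = F.scaled_dot_product_attention(sq, k, v, attn_mask=mask)
print(prefix_out.shape, suffix_out.shape)  # (1, 128, 64) (4, 16, 64)
```

A plain GRPO forward would instead attend over G full (prefix + suffix) sequences, re-encoding the prefix G times; here the prefix cost is paid once regardless of group size.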
Empirical results show that Prefix Grouper trains equivalently to standard GRPO while substantially reducing computational cost, improving scalability to more complex tasks and larger models.