Reinforcement Learning with Verifiable Rewards (RLVR) is effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving.
However, the scarcity of human-labeled math problems and the limited verifiability of answers in existing datasets constrain the effectiveness of RL training.
To address this, a Self-aware Weakness-driven problem Synthesis framework (SwS) is introduced that identifies model deficiencies and leverages them for problem augmentation.
SwS systematically identifies model weaknesses, extracts the core concepts underlying failure cases, and synthesizes new problems that target those weak areas in subsequent training, yielding average performance gains across mainstream reasoning benchmarks.
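To make the loop concrete, the following is a minimal, hypothetical sketch of the weakness-driven synthesis cycle described above: sample the model on training problems, flag low-pass-rate problems as weaknesses, collect their core concepts, and synthesize new problems targeting those concepts. The helper interfaces (`sample_solutions`, `extract_concepts`, `synthesize_problem`, `verify_answer`) are assumed placeholders for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of a weakness-driven problem synthesis loop.
# All helper callables are placeholders supplied by the caller.
from collections import Counter
from typing import Callable

def weakness_driven_augmentation(
    model,
    train_problems: list[dict],          # each: {"question": str, "answer": str}
    sample_solutions: Callable,          # (model, question, k) -> list[str]
    extract_concepts: Callable,          # (question) -> list[str]
    synthesize_problem: Callable,        # (concepts) -> {"question": str, "answer": str} | None
    verify_answer: Callable,             # (solution, reference_answer) -> bool
    k: int = 8,
    fail_threshold: float = 0.25,
    num_new: int = 100,
) -> list[dict]:
    """Flag problems the model mostly fails, tally their core concepts,
    and synthesize new problems biased toward the weakest concepts."""
    weak_concepts = Counter()
    for prob in train_problems:
        solutions = sample_solutions(model, prob["question"], k)
        pass_rate = sum(verify_answer(s, prob["answer"]) for s in solutions) / k
        if pass_rate <= fail_threshold:  # low pass rate marks a weakness
            weak_concepts.update(extract_concepts(prob["question"]))

    # Generate new problems targeting the most frequent weak concepts.
    top_concepts = [c for c, _ in weak_concepts.most_common(10)]
    new_problems = []
    for _ in range(num_new):
        candidate = synthesize_problem(top_concepts)
        if candidate is not None:
            new_problems.append(candidate)
    return new_problems
```

The synthesized problems would then be merged into the pool for the next round of RL training, so augmentation stays coupled to the model's current failure profile.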