Improving open-source models on real-world software engineering (SWE) tasks, such as solving GitHub issues, faces two challenges: scalable curation of execution environments and optimal scaling of test-time compute.
AgentGym is introduced as the largest procedurally curated executable gym environment for training real-world SWE-agents, comprising over 8.7K tasks.
SYNGEN, a synthetic data curation recipe, enables scalable curation of executable environments, which in turn improves training performance.
Hybrid Test-time Scaling is employed to exploit the complementary strengths and limitations of execution-based and execution-free verifiers.
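
To make the idea of combining the two verifier types concrete, here is a minimal sketch of one possible selection rule. It is not the method described in the source. It assumes each candidate patch carries an execution-based signal (tests passed out of tests run) and an execution-free score from a learned verifier, and it ranks candidates by test pass rate first, using the execution-free score only to break ties. The `Candidate` class, its fields, and the tie-breaking rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate patch produced by one agent rollout (hypothetical structure)."""
    patch_id: str
    tests_passed: int   # execution-based signal: tests that passed
    tests_total: int    # execution-based signal: tests that were run
    free_score: float   # execution-free signal: assumed verifier score in [0, 1]


def exec_score(c: Candidate) -> float:
    """Fraction of tests passed; 0.0 when no tests were run."""
    return c.tests_passed / c.tests_total if c.tests_total else 0.0


def select_hybrid(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate with the best pass rate, breaking ties
    with the execution-free score (an assumed combination rule)."""
    return max(candidates, key=lambda c: (exec_score(c), c.free_score))


# Illustrative data only; these are not results from the source.
candidates = [
    Candidate("a", tests_passed=3, tests_total=4, free_score=0.90),
    Candidate("b", tests_passed=4, tests_total=4, free_score=0.40),
    Candidate("c", tests_passed=4, tests_total=4, free_score=0.75),
]

best = select_hybrid(candidates)
print(best.patch_id)
```

In this toy example, candidates "b" and "c" tie on the execution-based signal, so the execution-free score decides between them. This shows one way the two signals can complement each other: executed tests give direct evidence but may fail to separate candidates, while a learned score can rank candidates that the tests cannot distinguish.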