Inductive program synthesis, or programming by example, is the task of synthesizing, from input-output examples, a function that generalizes to unseen inputs.
A new evaluation framework called CodeARC (Code Abstraction and Reasoning Challenge) has been proposed to benchmark the reasoning capabilities of large language model (LLM) agents in the context of inductive program synthesis.
CodeARC places agents in an interactive setting: they query a hidden target function with new inputs, synthesize candidate functions, and iteratively refine their solutions using feedback from a differential testing oracle.
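The interaction loop described above can be sketched as follows. This is a minimal illustration, not CodeARC's actual implementation: the hidden target, the candidate functions, and the helper names (`query_oracle`, `differential_test`) are all hypothetical, chosen only to show how input queries and a differential testing oracle drive refinement.

```python
import random

# Hypothetical hidden target: the agent may query it, never inspect it.
def _hidden_target(xs):
    return sorted(set(xs))

def query_oracle(inputs):
    """Input-query oracle: reveal the target's outputs on chosen inputs."""
    return [_hidden_target(x) for x in inputs]

def differential_test(candidate, trials=100, seed=0):
    """Differential testing oracle: compare the candidate against the
    hidden target on random inputs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(rng.randint(0, 8))]
        if candidate(xs) != _hidden_target(xs):
            return xs  # counterexample the agent can use to refine
    return None

# A deliberately wrong first candidate, then a refined one.
first_guess = lambda xs: sorted(xs)       # forgets to deduplicate
refined = lambda xs: sorted(set(xs))

counterexample = differential_test(first_guess)  # some list with duplicates
assert counterexample is not None
assert differential_test(refined) is None        # candidate accepted
```

The counterexample returned by the oracle plays the same role as a new input-output example: the agent folds it into its evidence and proposes a revised candidate, repeating until the oracle finds no discrepancy or the query budget runs out.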
Among the 18 models evaluated, o3-mini performs best with a success rate of only 52.7%, underscoring the difficulty of inductive program synthesis.