Reward Models (RMs) are crucial for large model alignment but are underexplored in complex embodied tasks like Embodied Question Answering (EQA).
EQA-RM is a generative multimodal reward model tailored for EQA, trained using Contrastive Group Relative Policy Optimization (C-GRPO) to capture fine-grained behavioral distinctions.
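The text above does not spell out how the contrastive part of C-GRPO works, so the following is only a minimal sketch of the group-relative normalization that standard GRPO applies to the rewards of responses sampled for the same prompt. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Responses scoring above the group mean get positive advantages.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because each advantage is measured relative to the group mean, no separate value network is required to provide a baseline.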
EQA-RM produces structured reward feedback rather than a single scalar, which enables test-time scaling: evaluation granularity can be adjusted dynamically without retraining.
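One common way to realize test-time scaling with a generative reward model is to sample several independent evaluations and aggregate their scores. The sketch below assumes this sample-and-average scheme; `scaled_reward` and the `sample_evaluation` callable are hypothetical names standing in for a stochastic call to the reward model, and the aggregation actually used by EQA-RM may differ.

```python
import random
from statistics import mean
from typing import Callable


def scaled_reward(sample_evaluation: Callable[[], float], k: int) -> float:
    """Average k independently sampled scores from a generative reward model.

    sample_evaluation: hypothetical zero-argument callable that runs one
        stochastic evaluation and returns its numeric score.
    k: number of samples; larger k spends more inference compute for a
        lower-variance estimate, with no retraining involved.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return mean([sample_evaluation() for _ in range(k)])


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_evaluator() -> float:
        # Stub evaluator (an assumption): a true score of 7 plus Gaussian noise.
        return 7.0 + rng.gauss(0.0, 1.0)

    for k in (1, 4, 16):
        print(k, scaled_reward(noisy_evaluator, k))
```

Here `k` is the only knob: raising it trades additional inference cost for a more stable reward estimate.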
EQA-RM-Bench is a new benchmark built on OpenEQA for standardized assessment of EQA reward models.
EQA-RM, fine-tuned on Qwen2-VL-2B-Instruct, achieves 61.9% accuracy on EQA-RM-Bench with high sample efficiency, outperforming various strong baselines and state-of-the-art models.
The code and dataset for EQA-RM can be accessed at https://github.com/UNITES-Lab/EQA-RM.