Researchers have introduced a new goal specification method called cross-view goal alignment to guide agent interactions in 3D environments.
The method lets users specify target objects with segmentation masks from their own camera views rather than the agent's, enhancing the agent's spatial reasoning abilities.
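To make the goal format concrete, the following is a minimal sketch of how a goal given as a segmentation mask over a user's camera frame might be represented. The names `CrossViewGoal`, `frame_id`, and `bounding_box` are hypothetical and chosen for illustration; the source does not describe the actual data structures used.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Binary mask: 1 marks target pixels, 0 marks background.
Mask = List[List[int]]


@dataclass
class CrossViewGoal:
    """Hypothetical container for a goal specified from the user's camera view."""

    frame_id: str  # illustrative identifier of the user's camera frame
    mask: Mask     # segmentation mask over that frame

    def __post_init__(self) -> None:
        # Reject masks that cannot identify a target.
        if not self.mask or not self.mask[0]:
            raise ValueError("mask must be non-empty")
        width = len(self.mask[0])
        for row in self.mask:
            if len(row) != width:
                raise ValueError("mask rows must have equal length")
            if any(v not in (0, 1) for v in row):
                raise ValueError("mask must be binary")
        if not any(any(row) for row in self.mask):
            raise ValueError("mask must mark at least one target pixel")

    def bounding_box(self) -> Tuple[int, int, int, int]:
        """Return (top, left, bottom, right), inclusive, of the marked pixels."""
        rows = [r for r, row in enumerate(self.mask) if any(row)]
        cols = [c for row in self.mask for c, v in enumerate(row) if v]
        return min(rows), min(cols), max(rows), max(cols)


goal = CrossViewGoal(
    frame_id="user_view_0",
    mask=[
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
    ],
)
print(goal.bounding_box())
```

The mask is tied to the user's frame, not the agent's, so the agent has to work out where the marked object lies in its own view.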
ROCKET-2, a state-of-the-art agent trained in Minecraft, shows improved efficiency and generalizes zero-shot to other 3D environments such as Doom, DMLab, and Unreal.
ROCKET-2 is trained with auxiliary objectives, including a cross-view consistency loss and a target visibility loss, that align the agent's behavior with human intent when the human's and the agent's camera views differ significantly.
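As a rough illustration of how such auxiliary objectives can be combined with a main policy loss, here is a minimal sketch. The source names the two losses but does not define them, so the specific forms are assumptions: visibility is treated as a binary cross-entropy on whether the target is visible in the agent's view, and consistency as an L1 distance between the predicted and true target position in the agent's view. The weights `w_vis` and `w_con` and all function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in normalized image coordinates


def visibility_loss(p_visible: float, is_visible: bool) -> float:
    """Assumed form: binary cross-entropy on the predicted probability
    that the target is visible in the agent's view."""
    eps = 1e-7
    p = min(max(p_visible, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_visible else -math.log(1.0 - p)


def consistency_loss(pred_xy: Point, true_xy: Point) -> float:
    """Assumed form: L1 distance between the predicted and true target
    position in the agent's view."""
    return abs(pred_xy[0] - true_xy[0]) + abs(pred_xy[1] - true_xy[1])


def total_loss(
    policy_loss: float,
    p_visible: float,
    is_visible: bool,
    pred_xy: Point,
    true_xy: Point,
    w_vis: float = 1.0,  # hypothetical weight
    w_con: float = 1.0,  # hypothetical weight
) -> float:
    """Main policy loss plus the weighted auxiliary terms."""
    loss = policy_loss + w_vis * visibility_loss(p_visible, is_visible)
    if is_visible:
        # A target position is only defined when the target is in view.
        loss += w_con * consistency_loss(pred_xy, true_xy)
    return loss


print(total_loss(0.5, 0.5, False, (0.0, 0.0), (1.0, 1.0)))
```

In this sketch the position term is skipped when the target is out of view, so the visibility term is the only auxiliary signal for those samples.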