Researchers have introduced the Navigation World Model (NWM), a controllable video generation model that predicts future visual observations from past observations and navigation actions.
NWM employs a Conditional Diffusion Transformer (CDiT) with 1 billion parameters, trained on a diverse collection of egocentric videos of human and robotic agents.
In familiar environments, NWM can plan navigation trajectories by simulating candidate action sequences and evaluating how well each one achieves the desired goal, and it can incorporate new constraints dynamically during planning.
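The simulate-and-evaluate planning loop described above can be sketched as simple sampling-based planning. This is a hedged toy illustration, not the paper's implementation: `world_model` here is a stand-in toy dynamics function (NWM's actual model predicts video frames with a diffusion transformer), and `goal_score` substitutes Euclidean distance for comparing predicted observations against a goal image.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    # Stand-in for NWM's learned prediction step: a toy deterministic
    # dynamics (position shifts by the action) so the sketch runs.
    return state + action

def goal_score(state, goal):
    # Negative distance to the goal; NWM instead scores how closely
    # the simulated observation matches the goal observation.
    return -np.linalg.norm(state - goal)

def plan(start, goal, horizon=5, n_candidates=64):
    """Sample candidate action sequences, roll each out through the
    world model, and keep the best-scoring trajectory."""
    best_actions, best_score = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, start.shape[0]))
        state = start
        for a in actions:
            state = world_model(state, a)  # simulate the trajectory
        score = goal_score(state, goal)    # evaluate goal achievement
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score

start, goal = np.zeros(2), np.array([2.0, 1.0])
actions, score = plan(start, goal)
```

Constraints (e.g. "avoid this region") could be folded into `goal_score` as penalties, which is one way to read the paper's claim that constraints can be incorporated dynamically at planning time without retraining the model.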
NWM can also imagine trajectories in unfamiliar environments from a single input image using learned visual priors, making it a versatile tool for next-generation navigation systems.