Chain-of-Action (CoA) is a new visuo-motor policy paradigm based on Trajectory Autoregressive Modeling.
Unlike conventional approaches that predict the next action(s) forward in time, CoA generates an entire trajectory by reasoning backward from a task-specific goal.
This backward reasoning acts as an action-level Chain-of-Thought (CoT) and is carried out within a single autoregressive structure.
The first token in CoA is a stable keyframe action that encodes the task-specific goal; subsequent action tokens are generated autoregressively, conditioned on this initial keyframe and on the previously predicted actions.
This backward action reasoning enforces a global-to-local structure, where each local action is tightly constrained by the final goal.
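The goal-first generation order described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_next_action` is a hypothetical stand-in for the learned autoregressive decoder, the fixed `num_steps` is an assumption made to keep the example short, and actions are assumed to be plain 3-D positions.

```python
import numpy as np

def toy_next_action(keyframe, history, current_pose, step=0.25):
    """Hypothetical stand-in for the learned decoder.

    A real model would condition on the keyframe and all previous tokens;
    this toy only moves the latest action a fixed fraction toward the
    robot's current pose, so `keyframe` is accepted but unused.
    """
    last = history[-1]
    return last + step * (current_pose - last)

def generate_backward(keyframe, current_pose, num_steps):
    """Generate actions goal-first, then return them in execution order."""
    current_pose = np.asarray(current_pose, dtype=float)
    # The first token is the keyframe action that encodes the goal.
    trajectory = [np.asarray(keyframe, dtype=float)]
    for _ in range(num_steps):
        nxt = toy_next_action(trajectory[0], trajectory, current_pose)
        trajectory.append(nxt)
    # Generation ran from the goal back toward the current pose,
    # so reverse the list before executing it on the robot.
    return trajectory[::-1]

if __name__ == "__main__":
    plan = generate_backward([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], num_steps=4)
    for action in plan:
        print(action)
```

Because every generated action is derived from the one before it, and the chain starts at the keyframe, the executed trajectory always ends exactly at the goal action regardless of how many intermediate steps are produced.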
To make this action reasoning effective, CoA incorporates four complementary designs: a continuous action token representation, dynamic stopping for variable-length trajectory generation, a reverse temporal ensemble, and multi-token prediction.
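Of these designs, dynamic stopping is the easiest to sketch in isolation. The example below assumes a distance-based criterion: backward generation halts once the latest predicted action lies within a tolerance of the current end-effector pose. That criterion, the names `generate_until_close` and `toy_predictor`, and the values of `tol`, `max_step`, and `max_steps` are all illustrative assumptions, not details confirmed by the text above.

```python
import numpy as np

def toy_predictor(history, current_pose, max_step=0.05):
    """Hypothetical decoder stand-in: move at most `max_step` toward the current pose."""
    delta = current_pose - history[-1]
    dist = np.linalg.norm(delta)
    if dist <= max_step:
        return current_pose.copy()
    return history[-1] + delta * (max_step / dist)

def generate_until_close(keyframe, current_pose, predict_next, tol=0.01, max_steps=100):
    """Backward generation with an assumed distance-based stopping rule."""
    current_pose = np.asarray(current_pose, dtype=float)
    trajectory = [np.asarray(keyframe, dtype=float)]
    while len(trajectory) < max_steps:
        # Stop once the chain has reached the robot's current pose.
        if np.linalg.norm(trajectory[-1] - current_pose) <= tol:
            break
        trajectory.append(predict_next(trajectory, current_pose))
    # Return in execution order: from near the current pose to the goal.
    return trajectory[::-1]

if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    near = generate_until_close([0.1, 0.0, 0.0], start, toy_predictor)
    far = generate_until_close([0.5, 0.0, 0.0], start, toy_predictor)
    print(len(near), len(far))
```

Under this rule the trajectory length adapts to the distance between the current pose and the goal, which is what allows variable-length generation without a fixed horizon; `max_steps` serves only as a safety cap.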
CoA demonstrates strong spatial generalization capabilities while maintaining a flexible and simple visuo-motor policy.
Empirical results show that CoA achieves state-of-the-art performance on 60 RLBench tasks and 8 real-world manipulation tasks.