MIT researchers developed a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks.
Their method involves aligning data from varied domains, like simulations and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared “language” that a generative AI model can process.
This approach can be used to train a robot to perform a variety of tasks without starting training from scratch each time, which could make it faster and less expensive than traditional techniques.
Their architecture, called Heterogeneous Pretrained Transformers (HPT), unifies data from these varied modalities and domains.
The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process.
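The article does not give code, but the idea can be sketched in PyTorch: each modality gets its own small tokenizer that maps raw inputs into tokens of a shared width, and the combined token sequence is fed through a single transformer trunk. The class names, dimensions, and layer counts below (`ModalityTokenizer`, `SharedTrunkPolicy`, an 8-token image representation, a 7-dimensional proprioceptive state) are illustrative assumptions, not the actual HPT implementation.

```python
import torch
import torch.nn as nn

class ModalityTokenizer(nn.Module):
    """Projects one input modality into a fixed number of shared-width tokens."""
    def __init__(self, input_dim: int, token_dim: int, num_tokens: int):
        super().__init__()
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        # A plain linear projection stands in for a modality-specific encoder.
        self.proj = nn.Linear(input_dim, num_tokens * token_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_dim) -> (batch, num_tokens, token_dim)
        return self.proj(x).view(-1, self.num_tokens, self.token_dim)

class SharedTrunkPolicy(nn.Module):
    """Tokenizes vision and proprioception, concatenates the tokens,
    and processes them with a shared transformer trunk."""
    def __init__(self, vision_dim=512, proprio_dim=7, token_dim=128, num_actions=7):
        super().__init__()
        self.vision_tok = ModalityTokenizer(vision_dim, token_dim, num_tokens=8)
        self.proprio_tok = ModalityTokenizer(proprio_dim, token_dim, num_tokens=1)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # A small action head maps the pooled trunk output to robot actions.
        self.head = nn.Linear(token_dim, num_actions)

    def forward(self, vision_feat: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat(
            [self.vision_tok(vision_feat), self.proprio_tok(proprio)], dim=1)
        out = self.trunk(tokens)           # (batch, tokens, token_dim)
        return self.head(out.mean(dim=1))  # pool tokens, predict an action

# Example: a batch of 2 observations with precomputed image features
# and 7-dimensional joint positions.
policy = SharedTrunkPolicy()
actions = policy(torch.randn(2, 512), torch.randn(2, 7))
print(actions.shape)  # torch.Size([2, 7])
```

The design roughly mirrors the article's description: modality- or robot-specific tokenizers absorb differences in hardware and sensors, while the shared trunk is what gets pretrained across many datasets.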
The larger the transformer becomes, the better it performs.
When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks, compared with training from scratch each time.
This approach enables robot learning methods to significantly scale up the size of datasets that they can train on.
The researchers are working toward a universal robot brain that could be downloaded and used by a robot without any training.