Robot perception involves using sensors like cameras, LiDAR, and radar to gather environmental data, which algorithms process to interpret surroundings.
In previous experiments, Grounded SAM 2 (G-SAM2) proved effective at detecting and segmenting robots from natural-language prompts.
The next step is to run G-SAM2 on a real robot, where the main challenge is achieving real-time performance.
The inference time of G-SAM2 on an NVIDIA T4 GPU was around 1.5 seconds, which is too slow for real-time applications.
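To put the 1.5-second figure in perspective, the short sketch below converts a latency into an achievable frame rate and into the speedup needed to reach a target rate. It assumes the reported time is for a single frame processed on its own. The 30 FPS target is an illustrative assumption rather than a requirement stated here, and the function names are hypothetical.

```python
def throughput_fps(latency_s: float) -> float:
    """Frames per second achievable when frames are processed one at a time."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_s


def speedup_needed(latency_s: float, target_fps: float) -> float:
    """Factor by which latency must shrink to sustain target_fps."""
    return latency_s * target_fps


G_SAM2_LATENCY_S = 1.5  # approximate G-SAM2 inference time on an NVIDIA T4 (from the text)
TARGET_FPS = 30.0       # assumption: illustrative real-time target, not from the source

print(f"{throughput_fps(G_SAM2_LATENCY_S):.2f} FPS")                           # 0.67 FPS
print(f"{speedup_needed(G_SAM2_LATENCY_S, TARGET_FPS):.0f}x speedup needed")   # 45x
```

Under these assumptions, the model delivers well under one frame per second, so reaching a camera-rate target would require a speedup of more than an order of magnitude.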