<ul><li>TransforMerger is a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs.</li><li>It merges multimodal data into a single unified sentence and employs probabilistic embeddings to handle uncertainty.</li><li>The model integrates contextual scene understanding to resolve ambiguous references and is robust to noise, misalignment, and missing information.</li><li>TransforMerger outperforms deterministic baselines, demonstrating its effectiveness in enabling more robust and flexible human-robot communication.</li></ul>
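The merging and probabilistic-embedding ideas in the summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the slot format (a timestamp plus a dictionary of candidate words with probabilities), the toy `EMBEDDINGS` table, and the function names `merge_slots`, `soft_embedding`, and `embed_sentence` are all assumptions made for illustration. The sketch orders voice and gesture hypotheses by time into one sequence and represents each uncertain slot as the probability-weighted average of its candidates' embeddings, which is one common way to carry recognition uncertainty into a transformer input.

```python
from typing import Dict, List, Tuple

# Hypothetical slot format: (timestamp in seconds, {candidate word: probability}).
Slot = Tuple[float, Dict[str, float]]

# Toy 3-dimensional embeddings (assumption); a real model would use a learned table.
EMBEDDINGS: Dict[str, List[float]] = {
    "pick": [1.0, 0.0, 0.0],
    "kick": [0.0, 1.0, 0.0],
    "cup":  [0.0, 0.0, 1.0],
    "cube": [0.0, 1.0, 1.0],
}


def merge_slots(voice: List[Slot], gesture: List[Slot]) -> List[Slot]:
    """Combine slots from both modalities into one sequence ordered by timestamp."""
    return sorted(voice + gesture, key=lambda slot: slot[0])


def normalize(candidates: Dict[str, float]) -> Dict[str, float]:
    """Rescale candidate scores so they sum to 1."""
    total = sum(candidates.values())
    if total <= 0:
        raise ValueError("candidate probabilities must sum to a positive value")
    return {token: p / total for token, p in candidates.items()}


def soft_embedding(candidates: Dict[str, float]) -> List[float]:
    """Return the probability-weighted average of the candidates' embeddings."""
    probs = normalize(candidates)
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    for token, p in probs.items():
        for i, x in enumerate(EMBEDDINGS[token]):
            vec[i] += p * x
    return vec


def embed_sentence(voice: List[Slot], gesture: List[Slot]) -> List[List[float]]:
    """Merge both modalities and embed each slot of the unified sequence."""
    return [soft_embedding(cands) for _, cands in merge_slots(voice, gesture)]


if __name__ == "__main__":
    # Speech recognizer is unsure between "pick" and "kick";
    # the pointing gesture is unsure between two nearby objects.
    voice = [(0.0, {"pick": 0.8, "kick": 0.2})]
    gesture = [(0.5, {"cup": 0.6, "cube": 0.4})]
    for vec in embed_sentence(voice, gesture):
        print(vec)
```

In this sketch a deterministic baseline would correspond to keeping only the most likely candidate per slot, which discards the information that "kick" or "cube" were also plausible.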

TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication
