TransforMerger is a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs.
It merges multimodal data into a single unified sentence and employs probabilistic embeddings to handle uncertainty.
The model integrates contextual scene understanding to resolve ambiguous references and is robust to noise, misalignment, and missing information.
In evaluation, TransforMerger outperforms deterministic baselines, enabling more reliable and flexible human-robot communication.
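The probabilistic fusion idea described above can be illustrated with a minimal sketch: a noisy speech distribution over candidate objects is combined with a pointing-gesture distribution via a normalized element-wise product, resolving an ambiguous reference. The object names, probabilities, and the `fuse_distributions` helper are hypothetical and not the paper's actual implementation.

```python
import numpy as np

def fuse_distributions(p_voice, p_gesture, eps=1e-9):
    """Fuse two per-object probability vectors into one posterior.

    A simple Bayesian-style fusion: multiply element-wise, add a small
    epsilon for robustness to zeros (e.g. a missing modality), renormalize.
    """
    p_voice = np.asarray(p_voice, dtype=float)
    p_gesture = np.asarray(p_gesture, dtype=float)
    joint = p_voice * p_gesture + eps
    return joint / joint.sum()

# Illustrative scene: speech says "cup", which is ambiguous between two cups,
# while the pointing gesture favors the red cup.
objects = ["red cup", "blue cup", "bowl"]
p_voice = [0.45, 0.45, 0.10]
p_gesture = [0.70, 0.20, 0.10]

p_fused = fuse_distributions(p_voice, p_gesture)
target = objects[int(np.argmax(p_fused))]
print(target)  # → red cup (the gesture disambiguates the spoken "cup")
```

In this toy example, neither modality alone pins down the referent, but the fused distribution does, mirroring how the model uses one modality to compensate for noise or ambiguity in the other.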