NVIDIA has introduced Hymba, a new family of small language models built on a hybrid-head architecture: within each layer, transformer attention heads and Mamba state-space model (SSM) heads process the same input in parallel. The attention heads provide high-resolution recall over the context, while the SSM heads summarize it efficiently. The flagship Hymba-1.5B model also prepends learnable meta tokens to the input, which reduces the computational load placed on attention without compromising memory recall.
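The parallel-branch idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the actual Hymba implementation: the real model uses learned Mamba parameters, meta tokens, and a learned fusion of the branch outputs, whereas here the SSM is a fixed scalar-decay recurrence and the two branches are simply normalized and averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8  # toy sequence length and model width

def attention_branch(x, Wq, Wk, Wv):
    """Standard scaled dot-product self-attention over the sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ssm_branch(x, a, Wb, Wc):
    """Minimal linear recurrence standing in for a Mamba-style SSM:
    h_t = a * h_{t-1} + x_t @ Wb ;  y_t = h_t @ Wc."""
    h = np.zeros(Wb.shape[1])
    ys = []
    for t in range(x.shape[0]):
        h = a * h + x[t] @ Wb
        ys.append(h @ Wc)
    return np.stack(ys)

def hybrid_head(x):
    """Run both branches on the same input in parallel and average their
    normalized outputs (a stand-in for Hymba's learned fusion)."""
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Wb = rng.standard_normal((d, d)) / np.sqrt(d)
    Wc = rng.standard_normal((d, d)) / np.sqrt(d)
    y_attn = attention_branch(x, Wq, Wk, Wv)
    y_ssm = ssm_branch(x, 0.9, Wb, Wc)
    # normalize each branch before mixing so neither dominates
    norm = lambda y: y / (np.linalg.norm(y, axis=-1, keepdims=True) + 1e-6)
    return 0.5 * (norm(y_attn) + norm(y_ssm))

x = rng.standard_normal((T, d))
y = hybrid_head(x)
print(y.shape)  # (6, 8)
```

The key design point the sketch mirrors is that both heads see the same tokens at the same time, rather than being stacked in alternating layers as in earlier hybrid models.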
NVIDIA reports that Hymba outperforms comparably sized small language models on both accuracy and throughput, with a smaller cache footprint, making it well suited to deployment on resource-constrained hardware.