Speculative decoding is a technique for improving the efficiency of large-scale autoregressive Transformer models: a cheap draft model proposes several tokens, and the target model verifies them all in a single forward pass, so multiple tokens can be accepted per target-model call.
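A minimal sketch of the draft-and-verify loop is shown below. The names `draft_model`, `target_model`, and the greedy acceptance rule are illustrative assumptions (the original schemes typically use rejection sampling), not the paper's exact procedure.

```python
import torch

def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One speculative decoding step: the draft model proposes k tokens,
    the target model scores them in a single forward pass, and the longest
    prefix the target agrees with is accepted (greedy variant, for clarity)."""
    # Draft model proposes k tokens autoregressively (cheap).
    draft_tokens = []
    ctx = prefix
    for _ in range(k):
        logits = draft_model(ctx)[:, -1, :]            # (batch, vocab)
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_tokens.append(next_tok)
        ctx = torch.cat([ctx, next_tok], dim=-1)

    # Target model verifies all k proposals in one forward pass
    # (expensive, but parallel over the proposed positions).
    target_logits = target_model(ctx)[:, -(k + 1):-1, :]
    target_choice = target_logits.argmax(dim=-1)       # (batch, k)

    # Accept the longest prefix where draft and target agree.
    accepted = prefix
    for i, tok in enumerate(draft_tokens):
        if torch.equal(target_choice[:, i:i + 1], tok):
            accepted = torch.cat([accepted, tok], dim=-1)
        else:
            break
    return accepted
```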
Speculative decoding has been extended to state-space models (SSMs), leveraging hardware concurrency to make their inference more efficient.
A scalable algorithm has been proposed for tree-based speculative decoding in SSMs and in hybrid architectures that combine SSM and Transformer layers, built on accumulated state transition matrices; a sketch of the idea follows.
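The sketch below illustrates the accumulation idea under a simplifying assumption of a diagonal discretized SSM recurrence h_t = Ā_t h_{t-1} + B̄_t x_t: once a draft path in the speculation tree is verified, its steps can be collapsed into a single state update by multiplying the per-token transition matrices. The names `h_root`, `A_bars`, and `Bx_bars` are hypothetical, and this is not the paper's exact implementation.

```python
import torch

def jump_state(h_root, A_bars, Bx_bars):
    """Collapse m accepted draft steps into one state update by accumulating
    diagonal state transition matrices, instead of replaying the recurrence
    token by token after verification.

    h_root  : (d_state,)     SSM state at the tree root
    A_bars  : (m, d_state)   diagonal discretized transitions Ā_t per token
    Bx_bars : (m, d_state)   discretized input contributions B̄_t x_t
    """
    # h_m = (Ā_m ... Ā_1) h_root + sum_s (Ā_m ... Ā_{s+1}) B̄_s x_s
    A_acc = A_bars.prod(dim=0)                          # accumulated transition
    # Suffix products Ā_m ... Ā_{s+1} for each step s on the path.
    rev_cum = torch.flip(torch.cumprod(torch.flip(A_bars, [0]), dim=0), [0])
    suffix = torch.cat([rev_cum[1:], torch.ones_like(A_bars[:1])], dim=0)
    input_acc = (suffix * Bx_bars).sum(dim=0)
    return A_acc * h_root + input_acc
```

Because each tree node's state depends only on the root state and the accumulated products along its path, states for many candidate branches can be evaluated concurrently on the accelerator rather than sequentially.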
The proposed hardware-aware implementation outperforms vanilla speculative decoding with SSMs on three benchmarks, paving the way for faster and more efficient inference with SSM and hybrid models.