Speculative decoding accelerates language model inference by using a lightweight draft model to propose candidate tokens that a larger target model then verifies in parallel.
Applying speculative decoding to vision-language models (VLMs) is challenging: small language models lack the components needed to process visual inputs, and their token predictions are poorly aligned with those of larger VLMs.
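For context, the base draft-then-verify loop can be sketched as follows. This is a minimal greedy-acceptance sketch for illustration only; the callables `draft_next` and `target_next`, the draft length `gamma`, and the token-by-token verification loop are simplifying assumptions, not MASSV's actual implementation (real systems score all drafted positions in a single batched forward pass and typically use probabilistic acceptance).

```python
# Minimal greedy-acceptance sketch of speculative decoding.
# `draft_next` and `target_next` are illustrative stand-ins: each maps a
# token context to that model's greedy next token.
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[int]], int],   # small, fast drafter
    target_next: Callable[[List[int]], int],  # large target model
    prompt: List[int],
    gamma: int = 4,                           # tokens drafted per round
    max_new_tokens: int = 32,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft: the small model proposes `gamma` tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(gamma):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Verify: accept the longest prefix the target agrees with.
        #    (In practice all positions are checked in one forward pass.)
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # 3. Append the target's own next token so every round makes progress,
        #    whether the proposal was rejected early or fully accepted.
        tokens.append(target_next(tokens))
    return tokens
```

The speedup comes from the accepted prefix length: the more drafted tokens the target accepts per round, the fewer expensive target forward passes are needed per generated token.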
MASSV transforms small language models into effective multimodal drafters for VLMs in two phases: it first connects the drafter to the target VLM's vision encoder, then aligns the drafter's token predictions with the target's through self-distilled visual instruction tuning.
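The sketch below illustrates the first phase under stated assumptions: a HuggingFace-style interface, a frozen vision encoder reused from the target VLM, and a small trainable MLP projector that maps patch features into the drafter's embedding space (the class and parameter names here are hypothetical, not MASSV's code).

```python
# Illustrative sketch: giving a small language model a visual pathway by
# reusing the target VLM's vision encoder and adding a trainable projector.
import torch
import torch.nn as nn

class MultimodalDrafter(nn.Module):
    def __init__(self, vision_encoder: nn.Module, draft_lm: nn.Module,
                 vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # reused from the target VLM, kept frozen
        self.projector = nn.Sequential(           # lightweight trainable connector
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.draft_lm = draft_lm                  # small language model backbone
        for p in self.vision_encoder.parameters():
            p.requires_grad = False

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image; assumed output shape (batch, n_patches, vision_dim).
        vision_feats = self.vision_encoder(pixel_values)
        # Map patch features into the drafter's embedding space.
        vision_tokens = self.projector(vision_feats)
        # Prepend projected visual tokens to the text embeddings (LLaVA-style)
        # and run the small LM as usual.
        fused = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.draft_lm(inputs_embeds=fused)
```

In the second phase, this drafter would be fine-tuned on responses generated by the target VLM itself, which is what aligns its token distribution with the target's and raises the acceptance rate during verification.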
Experiments show that MASSV increases accepted length by up to 30% and yields end-to-end inference speedups of up to 1.46x, providing a scalable method for accelerating both current and future VLMs.