OmniDraft is a unified framework for speculative decoding in online deployment settings, where it addresses two problems: vocabulary mismatch between the draft and target models, and decoding latency.
OmniDraft lets a single draft model work with any target model and adapt dynamically to user data, using an online n-gram cache together with hybrid distillation fine-tuning.
This makes the framework well suited to on-device Large Language Model (LLM) applications where model cost, efficiency, and user customization are the main concerns.
OmniDraft is evaluated through online learning on math reasoning, coding, and text generation tasks, where it remains compatible with various target models and speeds up decoding.
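
To make the idea of an online n-gram cache for cross-vocabulary drafting more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not OmniDraft's actual implementation: the class name `NGramCache`, its methods, and the greedy longest-match merge rule are all hypothetical. The assumption is that the draft tokenizer may split a string into several pieces that the target tokenizer represents as one token, so the cache records such draft n-grams and rewrites them into target tokens before verification.

```python
from typing import Dict, List, Tuple


class NGramCache:
    """Hypothetical cache mapping draft-token n-grams to one target token."""

    def __init__(self, max_n: int = 4) -> None:
        self.max_n = max_n  # longest draft n-gram the cache will store
        self._table: Dict[Tuple[str, ...], str] = {}

    def add(self, draft_tokens: List[str], target_token: str) -> None:
        """Record that these draft pieces concatenate to one target token."""
        if not 2 <= len(draft_tokens) <= self.max_n:
            raise ValueError("n-gram length must be between 2 and max_n")
        if "".join(draft_tokens) != target_token:
            raise ValueError("draft pieces must concatenate to the target token")
        self._table[tuple(draft_tokens)] = target_token

    def merge(self, draft_tokens: List[str]) -> List[str]:
        """Rewrite a draft sequence, greedily preferring the longest cached n-gram."""
        out: List[str] = []
        i = 0
        while i < len(draft_tokens):
            matched = False
            longest = min(self.max_n, len(draft_tokens) - i)
            for n in range(longest, 1, -1):
                key = tuple(draft_tokens[i:i + n])
                if key in self._table:
                    out.append(self._table[key])
                    i += n
                    matched = True
                    break
            if not matched:
                # No cached n-gram starts here; pass the token through unchanged.
                out.append(draft_tokens[i])
                i += 1
        return out


cache = NGramCache(max_n=4)
# Filled online: whenever a target token is observed to span several draft tokens.
cache.add(["spec", "ulative"], "speculative")
merged = cache.merge(["a", "spec", "ulative", "x"])
```

In this sketch the cache only grows as new mismatches are seen during use, which is the sense in which it is "online". A real system would work on token IDs and would also need to combine the draft probabilities of the merged pieces, which is omitted here.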