Diffusion-based Large Language Models (dLLMs) have emerged as a new paradigm for text generation, offering advantages over Autoregressive Models (ARMs) such as parallel token generation and bidirectional context modeling.
However, dLLMs suffer from high inference latency because traditional ARM acceleration techniques such as Key-Value (KV) caching are incompatible with their bidirectional attention mechanism.
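The incompatibility is easiest to see in a toy comparison: under causal attention, the keys and values of already-generated tokens never change and can therefore be cached, whereas bidirectional denoising re-encodes every position at every step. The following is only a minimal sketch with random toy tensors (assumed shapes and a simplified attention function, not the models' actual code):

```python
# Toy illustration: why KV caching suits causal ARMs but not bidirectional dLLMs.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 8
torch.manual_seed(0)

# Causal ARM decoding: each new token attends only to earlier tokens,
# whose keys/values are frozen once computed, so they can be cached.
k_cache, v_cache = [], []
for t in range(4):
    x_t = torch.randn(1, d)                   # hidden state of the newest token
    k_cache.append(x_t); v_cache.append(x_t)  # cached entries never change
    out = attention(x_t, torch.cat(k_cache), torch.cat(v_cache))

# Bidirectional dLLM denoising: every step re-encodes all positions, and each
# position attends to every other one, so previously computed keys/values go
# stale and a naive KV cache would yield incorrect attention outputs.
for step in range(4):
    x_all = torch.randn(6, d)                 # all token states change each step
    out = attention(x_all, x_all, x_all)      # full recomputation every step
```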
To address this, dLLM-Cache, a training-free adaptive caching framework, has been introduced; it combines prompt caching with adaptive response updates so that intermediate computations can be reused across denoising steps.
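One way to picture this reuse pattern is a two-tier cache whose prompt entries are refreshed rarely (the prompt is static) and whose response entries are refreshed more often (the response keeps evolving). The sketch below is only a schematic under that assumption; the class name, refresh intervals, and step hooks are hypothetical and not the paper's API:

```python
# Hedged sketch of the caching idea (hypothetical names, not the authors' code):
# prompt features are refreshed at long intervals, response features at short
# ones, and cached features are reused on all intermediate denoising steps.
import torch

class DLLMFeatureCache:
    def __init__(self, prompt_interval: int = 100, response_interval: int = 5):
        self.prompt_interval = prompt_interval      # long interval for the static prompt
        self.response_interval = response_interval  # short interval for the evolving response
        self.prompt_feats = None
        self.response_feats = None

    def step(self, t, compute_prompt, compute_response):
        # Recompute prompt features only every `prompt_interval` denoising steps.
        if self.prompt_feats is None or t % self.prompt_interval == 0:
            self.prompt_feats = compute_prompt()
        # Recompute response features more frequently, reusing them in between.
        if self.response_feats is None or t % self.response_interval == 0:
            self.response_feats = compute_response()
        return torch.cat([self.prompt_feats, self.response_feats], dim=0)

# Toy usage with random "features" standing in for transformer-layer outputs.
cache = DLLMFeatureCache(prompt_interval=50, response_interval=4)
for t in range(20):
    feats = cache.step(
        t,
        compute_prompt=lambda: torch.randn(32, 16),   # prompt token features
        compute_response=lambda: torch.randn(8, 16),  # response token features
    )
```

In practice the refresh decision for the response side could also be driven adaptively, for example by comparing new features against the cached ones, rather than by a fixed interval; the fixed intervals here are purely illustrative.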
Experiments on dLLMs such as LLaDA 8B and Dream 7B show that dLLM-Cache speeds up inference by up to 9.1 times without sacrificing output quality, bringing dLLM latency much closer to that of ARMs.