Diffusion-based Large Language Models (dLLMs) have emerged as a new paradigm for text generation, offering advantages over Autoregressive Models (ARMs) such as parallel token generation and bidirectional context modeling.
However, dLLMs suffer from high inference latency because traditional ARM acceleration techniques such as Key-Value (KV) caching are incompatible with their bidirectional attention mechanism.
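The incompatibility is easiest to see in a toy comparison: under causal attention, the keys and values of already-generated tokens never change and can therefore be cached, whereas bidirectional denoising re-encodes every position at every step. The following is only a minimal sketch with random toy tensors (assumed shapes and a simplified attention function, not the models' actual code):

```python
# Toy illustration: why KV caching suits causal ARMs but not bidirectional dLLMs.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 8
torch.manual_seed(0)

# Causal ARM decoding: each new token attends only to earlier tokens,
# whose keys/values are frozen once computed, so they can be cached.
k_cache, v_cache = [], []
for t in range(4):
    x_t = torch.randn(1, d)                   # hidden state of the newest token
    k_cache.append(x_t); v_cache.append(x_t)  # cached entries never change
    out = attention(x_t, torch.cat(k_cache), torch.cat(v_cache))

# Bidirectional dLLM denoising: every step re-encodes all positions, and each
# position attends to every other one, so previously computed keys/values go
# stale and a naive KV cache would yield incorrect attention outputs.
for step in range(4):
    x_all = torch.randn(6, d)                 # all token states change each step
    out = attention(x_all, x_all, x_all)      # full recomputation every step
```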
To address this, dLLM-Cache, a training-free adaptive caching framework, has been introduced; it combines prompt caching with adaptive response updates so that intermediate computations can be reused across denoising steps.
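One way to picture this reuse pattern is a two-tier cache whose prompt entries are refreshed rarely (the prompt is static) and whose response entries are refreshed more often (the response keeps evolving). The sketch below is only a schematic under that assumption; the class name, refresh intervals, and step hooks are hypothetical and not the paper's API:

```python
# Hedged sketch of the caching idea (hypothetical names, not the authors' code):
# prompt features are refreshed at long intervals, response features at short
# ones, and cached features are reused on all intermediate denoising steps.
import torch

class DLLMFeatureCache:
    def __init__(self, prompt_interval: int = 100, response_interval: int = 5):
        self.prompt_interval = prompt_interval      # long interval for the static prompt
        self.response_interval = response_interval  # short interval for the evolving response
        self.prompt_feats = None
        self.response_feats = None

    def step(self, t, compute_prompt, compute_response):
        # Recompute prompt features only every `prompt_interval` denoising steps.
        if self.prompt_feats is None or t % self.prompt_interval == 0:
            self.prompt_feats = compute_prompt()
        # Recompute response features more frequently, reusing them in between.
        if self.response_feats is None or t % self.response_interval == 0:
            self.response_feats = compute_response()
        return torch.cat([self.prompt_feats, self.response_feats], dim=0)

# Toy usage with random "features" standing in for transformer-layer outputs.
cache = DLLMFeatureCache(prompt_interval=50, response_interval=4)
for t in range(20):
    feats = cache.step(
        t,
        compute_prompt=lambda: torch.randn(32, 16),   # prompt token features
        compute_response=lambda: torch.randn(8, 16),  # response token features
    )
```

In practice the refresh decision for the response side could also be driven adaptively, for example by comparing new features against the cached ones, rather than by a fixed interval; the fixed intervals here are purely illustrative.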
Experiments on dLLMs such as LLaDA 8B and Dream 7B show that dLLM-Cache speeds up inference by up to 9.1 times without sacrificing output quality, bringing dLLM latency much closer to that of ARMs.