Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., state space models) have gained traction for their ability to support long contexts in Large Language Model (LLM) serving.
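The efficiency gap comes from how inference state scales with context length: an Attention layer's KV cache grows linearly with the number of tokens, while a Recurrent layer carries a fixed-size state. The sketch below illustrates the contrast; the dimensions are illustrative and not tied to any particular model.

```python
def attention_state_elems(seq_len: int, n_kv_heads: int, head_dim: int) -> int:
    """Elements held in a KV cache: K and V tensors, one slot per token."""
    return 2 * seq_len * n_kv_heads * head_dim

def recurrent_state_elems(d_model: int, d_state: int) -> int:
    """Elements in a fixed-size recurrent/SSM state, independent of seq_len."""
    return d_model * d_state

# The KV cache grows 100x from 1K to 100K tokens, while the recurrent
# state stays constant no matter how long the context gets.
for seq_len in (1_000, 100_000):
    kv = attention_state_elems(seq_len, n_kv_heads=8, head_dim=128)
    rec = recurrent_state_elems(d_model=4096, d_state=16)
    print(f"{seq_len:>7} tokens: KV cache {kv:>13,} elems | recurrent {rec:,} elems")
```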
Marconi is a system that supports efficient prefix caching with Hybrid LLMs.
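To ground the idea, here is a minimal sketch of prefix caching in general (the names and structure are hypothetical, not Marconi's API): inference state is stored keyed by an exact token prefix, and a new request reuses the state of its longest cached prefix so that only the remaining suffix needs to be prefilled.

```python
from typing import Optional, Sequence

class PrefixCache:
    """Toy prefix cache: maps an exact token prefix to saved inference state.

    A linear scan is used for clarity; a production system would use a
    trie/radix-tree index instead.
    """

    def __init__(self) -> None:
        self._entries: dict[tuple[int, ...], object] = {}

    def insert(self, tokens: Sequence[int], state: object) -> None:
        """Cache the inference state produced after processing `tokens`."""
        self._entries[tuple(tokens)] = state

    def longest_prefix(self, tokens: Sequence[int]) -> tuple[int, Optional[object]]:
        """Return (matched_length, state) for the longest cached exact prefix.

        Exact matching matters for Hybrid models: a recurrent state summarizes
        the precise token sequence that produced it, so it can only be reused
        when the new request extends that sequence verbatim.
        """
        for n in range(len(tokens), 0, -1):
            state = self._entries.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

# Usage: cache the state after a shared prompt, then a follow-up request
# that extends it only needs to prefill the unmatched suffix.
cache = PrefixCache()
cache.insert([1, 2, 3], state="state-after-3-tokens")
matched, state = cache.longest_prefix([1, 2, 3, 4, 5])
assert matched == 3 and state == "state-after-3-tokens"
```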
Marconi uses novel admission and eviction policies that assess potential cache entries not only on recency, but also on their likelihood of reuse and the compute savings a hit would deliver.
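As a hedged illustration of such a policy (a hypothetical scoring function, not the paper's actual formula), an eviction score might combine the three signals multiplicatively and evict the entry with the lowest score:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    num_tokens: int          # length of the token prefix this entry covers
    flops_per_token: float   # estimated cost to recompute one prefill token
    hits: int = 0            # observed reuses; crude proxy for reuse likelihood
    last_access: float = field(default_factory=time.monotonic)

    def compute_savings(self) -> float:
        """Prefill FLOPs a request avoids by hitting this entry."""
        return self.num_tokens * self.flops_per_token

def eviction_score(entry: CacheEntry, now: float) -> float:
    """Lower score = evict first. The multiplicative combination is an
    illustrative choice that weighs recency, reuse likelihood, and compute
    savings together rather than recency alone."""
    recency = 1.0 / (1.0 + (now - entry.last_access))   # decays with idle time
    reuse = entry.hits + 1                              # avoid zeroing new entries
    return recency * reuse * entry.compute_savings()

def evict_one(entries: list[CacheEntry]) -> CacheEntry:
    """Remove and return the entry with the lowest combined score."""
    now = time.monotonic()
    victim = min(entries, key=lambda e: eviction_score(e, now))
    entries.remove(victim)
    return victim
```

Note how this differs from plain LRU: a large, frequently reused prefix with high recompute cost survives even after a period of inactivity, while a small, rarely hit entry is evicted first.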
Marconi achieves substantially higher token hit rates than state-of-the-art prefix caching systems.