Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., state space models) have gained traction for their ability to support long contexts in Large Language Model (LLM) serving.
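The efficiency gap comes from how inference state scales with context length: an Attention layer's KV cache grows linearly with the number of tokens, while a Recurrent layer carries a fixed-size state. The sketch below illustrates the contrast; the dimensions are illustrative and not tied to any particular model.

```python
def attention_state_elems(seq_len: int, n_kv_heads: int, head_dim: int) -> int:
    """Elements held in a KV cache: K and V tensors, one slot per token."""
    return 2 * seq_len * n_kv_heads * head_dim

def recurrent_state_elems(d_model: int, d_state: int) -> int:
    """Elements in a fixed-size recurrent/SSM state, independent of seq_len."""
    return d_model * d_state

# The KV cache grows 100x from 1K to 100K tokens, while the recurrent
# state stays constant no matter how long the context gets.
for seq_len in (1_000, 100_000):
    kv = attention_state_elems(seq_len, n_kv_heads=8, head_dim=128)
    rec = recurrent_state_elems(d_model=4096, d_state=16)
    print(f"{seq_len:>7} tokens: KV cache {kv:>13,} elems | recurrent {rec:,} elems")
```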
Marconi is a system that supports efficient prefix caching with Hybrid LLMs.
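To ground the idea, here is a minimal sketch of prefix caching in general (the names and structure are hypothetical, not Marconi's API): inference state is stored keyed by an exact token prefix, and a new request reuses the state of its longest cached prefix so that only the remaining suffix needs to be prefilled.

```python
from typing import Optional, Sequence

class PrefixCache:
    """Toy prefix cache: maps an exact token prefix to saved inference state.

    A linear scan is used for clarity; a production system would use a
    trie/radix-tree index instead.
    """

    def __init__(self) -> None:
        self._entries: dict[tuple[int, ...], object] = {}

    def insert(self, tokens: Sequence[int], state: object) -> None:
        """Cache the inference state produced after processing `tokens`."""
        self._entries[tuple(tokens)] = state

    def longest_prefix(self, tokens: Sequence[int]) -> tuple[int, Optional[object]]:
        """Return (matched_length, state) for the longest cached exact prefix.

        Exact matching matters for Hybrid models: a recurrent state summarizes
        the precise token sequence that produced it, so it can only be reused
        when the new request extends that sequence verbatim.
        """
        for n in range(len(tokens), 0, -1):
            state = self._entries.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

# Usage: cache the state after a shared prompt, then a follow-up request
# that extends it only needs to prefill the unmatched suffix.
cache = PrefixCache()
cache.insert([1, 2, 3], state="state-after-3-tokens")
matched, state = cache.longest_prefix([1, 2, 3, 4, 5])
assert matched == 3 and state == "state-after-3-tokens"
```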
Marconi uses novel admission and eviction policies that assess potential cache entries not only on recency, but also on their likelihood of reuse and the compute savings a hit would deliver.
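As a hedged illustration of such a policy (a hypothetical scoring function, not the paper's actual formula), an eviction score might combine the three signals multiplicatively and evict the entry with the lowest score:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    num_tokens: int          # length of the token prefix this entry covers
    flops_per_token: float   # estimated cost to recompute one prefill token
    hits: int = 0            # observed reuses; crude proxy for reuse likelihood
    last_access: float = field(default_factory=time.monotonic)

    def compute_savings(self) -> float:
        """Prefill FLOPs a request avoids by hitting this entry."""
        return self.num_tokens * self.flops_per_token

def eviction_score(entry: CacheEntry, now: float) -> float:
    """Lower score = evict first. The multiplicative combination is an
    illustrative choice that weighs recency, reuse likelihood, and compute
    savings together rather than recency alone."""
    recency = 1.0 / (1.0 + (now - entry.last_access))   # decays with idle time
    reuse = entry.hits + 1                              # avoid zeroing new entries
    return recency * reuse * entry.compute_savings()

def evict_one(entries: list[CacheEntry]) -> CacheEntry:
    """Remove and return the entry with the lowest combined score."""
    now = time.monotonic()
    victim = min(entries, key=lambda e: eviction_score(e, now))
    entries.remove(victim)
    return victim
```

Note how this differs from plain LRU: a large, frequently reused prefix with high recompute cost survives even after a period of inactivity, while a small, rarely hit entry is evicted first.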
Marconi achieves substantially higher token hit rates than state-of-the-art prefix caching systems.