Transformer-based large language models (LLMs) can be costly to serve in production because memory-bound attention operators underutilize the expensive, compute-optimized accelerators they run on.
To address this, the paper introduces model-attention disaggregation, which offloads attention operators to cheap, memory-optimized devices while the rest of the model remains on compute-optimized accelerators.
By matching each operator to hardware suited to its workload, this approach improves both performance and cost efficiency.
Experimental results show that Lamina, an LLM inference system built on this approach, delivers higher estimated throughput than existing solutions at similar cost.
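To make the idea concrete, the following is a minimal sketch of one disaggregated decode step, not Lamina's actual implementation: the device names, tensor shapes, and single-layer structure are illustrative assumptions. The dense projections (compute-bound matmuls) run on the compute device, while the KV cache and the attention over it, which dominate memory traffic, live on a stand-in memory device; only the small per-token vectors cross between them.

```python
# Conceptual sketch of model-attention disaggregation (illustrative only).
import torch

compute_dev = "cuda" if torch.cuda.is_available() else "cpu"  # compute-optimized accelerator
memory_dev = "cpu"  # stand-in for a cheap, memory-optimized device

d_model, n_ctx = 64, 128

# Dense, compute-bound operators (QKV and output projections) stay on the compute device.
wq = torch.randn(d_model, d_model, device=compute_dev)
wk = torch.randn(d_model, d_model, device=compute_dev)
wv = torch.randn(d_model, d_model, device=compute_dev)
wo = torch.randn(d_model, d_model, device=compute_dev)

# The KV cache is placed on the memory device, where the memory-bound
# attention operator will read it.
k_cache = torch.randn(n_ctx, d_model, device=memory_dev)
v_cache = torch.randn(n_ctx, d_model, device=memory_dev)

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One attention block for a single new token `x` (shape [d_model])."""
    global k_cache, v_cache

    # 1. QKV projections on the compute device (dense matmuls).
    q, k, v = x @ wq, x @ wk, x @ wv

    # 2. Ship only the small per-token q/k/v to the memory device; append
    #    to the cache and run attention over the full context there.
    q_m, k_m, v_m = (t.to(memory_dev) for t in (q, k, v))
    k_cache = torch.cat([k_cache, k_m.unsqueeze(0)])
    v_cache = torch.cat([v_cache, v_m.unsqueeze(0)])
    scores = torch.softmax(k_cache @ q_m / d_model**0.5, dim=0)
    attn_out = scores @ v_cache  # [d_model]

    # 3. Return the small attention output to the compute device for the
    #    output projection and the rest of the model.
    return attn_out.to(compute_dev) @ wo

out = decode_step(torch.randn(d_model, device=compute_dev))
print(out.shape)  # torch.Size([64])
```

Note the asymmetry the design exploits: per token, only a few vectors of size d_model move between devices, while the attention reads the entire KV cache, so the bulk of the memory traffic stays local to the memory-optimized device.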