Transformer-based large language models (LLMs) can be costly to serve in production because memory-bound attention operators underutilize the expensive, compute-optimized accelerators they run on.
To address this, the paper introduces model-attention disaggregation, which offloads attention operators to cheap, memory-optimized devices while the rest of the model remains on compute-optimized accelerators.
By matching each operator to hardware suited to its workload, this approach improves both performance and cost efficiency.
Experimental results show that Lamina, an LLM inference system built on this approach, delivers higher estimated throughput than existing solutions at similar cost.
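To make the idea concrete, the following is a minimal sketch of one disaggregated decode step, not Lamina's actual implementation: the device names, tensor shapes, and single-layer structure are illustrative assumptions. The dense projections (compute-bound matmuls) run on the compute device, while the KV cache and the attention over it, which dominate memory traffic, live on a stand-in memory device; only the small per-token vectors cross between them.

```python
# Conceptual sketch of model-attention disaggregation (illustrative only).
import torch

compute_dev = "cuda" if torch.cuda.is_available() else "cpu"  # compute-optimized accelerator
memory_dev = "cpu"  # stand-in for a cheap, memory-optimized device

d_model, n_ctx = 64, 128

# Dense, compute-bound operators (QKV and output projections) stay on the compute device.
wq = torch.randn(d_model, d_model, device=compute_dev)
wk = torch.randn(d_model, d_model, device=compute_dev)
wv = torch.randn(d_model, d_model, device=compute_dev)
wo = torch.randn(d_model, d_model, device=compute_dev)

# The KV cache is placed on the memory device, where the memory-bound
# attention operator will read it.
k_cache = torch.randn(n_ctx, d_model, device=memory_dev)
v_cache = torch.randn(n_ctx, d_model, device=memory_dev)

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One attention block for a single new token `x` (shape [d_model])."""
    global k_cache, v_cache

    # 1. QKV projections on the compute device (dense matmuls).
    q, k, v = x @ wq, x @ wk, x @ wv

    # 2. Ship only the small per-token q/k/v to the memory device; append
    #    to the cache and run attention over the full context there.
    q_m, k_m, v_m = (t.to(memory_dev) for t in (q, k, v))
    k_cache = torch.cat([k_cache, k_m.unsqueeze(0)])
    v_cache = torch.cat([v_cache, v_m.unsqueeze(0)])
    scores = torch.softmax(k_cache @ q_m / d_model**0.5, dim=0)
    attn_out = scores @ v_cache  # [d_model]

    # 3. Return the small attention output to the compute device for the
    #    output projection and the rest of the model.
    return attn_out.to(compute_dev) @ wo

out = decode_step(torch.randn(d_model, device=compute_dev))
print(out.shape)  # torch.Size([64])
```

Note the asymmetry the design exploits: per token, only a few vectors of size d_model move between devices, while the attention reads the entire KV cache, so the bulk of the memory traffic stays local to the memory-optimized device.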