Image Credit: Arxiv

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

  • Large language models (LLMs) are being deployed on mobile devices, but limited DRAM capacity constrains the model size.
  • ActiveFlow is introduced as an LLM inference framework that enables adaptive DRAM usage for modern LLMs.
  • ActiveFlow utilizes novel techniques such as cross-layer active weights preloading and sparsity-aware self-distillation.
  • The framework reaches the performance-cost Pareto frontier relative to existing efficiency optimization methods; a rough sketch of the weight-swapping idea follows this list.
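
To make the bullets above more concrete, here is a minimal, hedged Python sketch of what "active-weight swapping" with preloading can look like: only the predicted-active rows of each layer's weights are copied from flash into DRAM, and the next layer's rows are prefetched on a background thread while the current layer computes. This is not ActiveFlow's implementation; the names (flash_dir, predict_active_rows, load_active_rows) and the top-k activity heuristic are illustrative assumptions, with .npy files on disk standing in for flash.

```python
# Sketch of active-weight swapping between "flash" (disk) and DRAM.
# Not the paper's code: predictor, file layout, and sizes are toy assumptions.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

HIDDEN, LAYERS = 256, 4

# "Flash": persist each layer's full weight matrix to disk.
flash_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
for i in range(LAYERS):
    np.save(os.path.join(flash_dir, f"layer{i}.npy"),
            rng.standard_normal((HIDDEN, HIDDEN), dtype=np.float32))

def predict_active_rows(x, k=64):
    """Toy activity predictor: pretend the top-k |x| coordinates tell us
    which weight rows will matter for the next layer."""
    return np.argsort(-np.abs(x))[:k]

def load_active_rows(layer, rows):
    """Copy only the predicted-active rows from flash into DRAM.
    mmap_mode='r' avoids reading the whole matrix from disk."""
    w = np.load(os.path.join(flash_dir, f"layer{layer}.npy"), mmap_mode="r")
    return rows, np.asarray(w[rows])

x = rng.standard_normal(HIDDEN).astype(np.float32)
with ThreadPoolExecutor(max_workers=1) as io:
    # Preload layer 0's active rows before compute starts.
    pending = io.submit(load_active_rows, 0, predict_active_rows(x))
    for layer in range(LAYERS):
        rows, w_active = pending.result()      # wait for the DRAM copy
        if layer + 1 < LAYERS:                 # overlap I/O with compute
            pending = io.submit(load_active_rows, layer + 1,
                                predict_active_rows(x))
        y = np.zeros(HIDDEN, dtype=np.float32)
        y[rows] = w_active @ x                 # compute with active rows only
        x = np.maximum(y, 0.0)                 # ReLU keeps activations sparse
print("output norm:", float(np.linalg.norm(x)))
```

In this toy setup the DRAM footprint per layer is k rows instead of the full matrix, and the background load of the next layer's rows overlaps with the current layer's matmul, which is the intuition behind cross-layer preloading.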
