Large language models (LLMs) have high computational and memory requirements, hindering deployment.
Existing compression methods for LLMs lack adaptability to runtime memory variations and diverse user requests.
RAP is an elastic pruning framework driven by reinforcement learning that dynamically adjusts compression strategies based on evolving memory and workload conditions.
Experiments show that RAP outperforms current compression methods by jointly accounting for model weights and the key-value (KV) cache during execution.
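To make the idea of an RL-driven elastic pruning loop concrete, the sketch below shows a toy policy that picks a joint weight/KV-cache compression configuration from observed memory and workload signals. All names (PRUNE_CONFIGS, observe_state, memory_cost, reward) and numbers are hypothetical illustrations, not RAP's actual interface or method; a simple epsilon-greedy bandit stands in for the reinforcement-learning policy, and a full treatment would condition the policy on the observed state rather than averaging over it.

```python
import random

# Candidate compression strategies: fraction of weights pruned and fraction of KV cache kept.
PRUNE_CONFIGS = [
    {"weight_sparsity": 0.0, "kv_cache_keep": 1.00},
    {"weight_sparsity": 0.3, "kv_cache_keep": 0.75},
    {"weight_sparsity": 0.5, "kv_cache_keep": 0.50},
]

q_values = [0.0] * len(PRUNE_CONFIGS)  # running value estimate per configuration
counts = [0] * len(PRUNE_CONFIGS)
EPSILON = 0.1  # exploration rate for the epsilon-greedy policy

def observe_state():
    """Hypothetical runtime signals: available memory (GB) and current request load."""
    return {"free_mem_gb": random.uniform(4, 24), "batch_size": random.randint(1, 32)}

def memory_cost(cfg, state):
    """Rough footprint: dense weights scaled by sparsity plus per-request KV cache (assumed sizes)."""
    weights_gb = 14.0 * (1.0 - cfg["weight_sparsity"])        # assumed 14 GB dense model
    kv_gb = 0.5 * state["batch_size"] * cfg["kv_cache_keep"]  # assumed 0.5 GB KV cache per request
    return weights_gb + kv_gb

def reward(cfg, state):
    """Reward = retained task quality, heavily penalized when the memory budget is exceeded."""
    quality = 1.0 - 0.6 * cfg["weight_sparsity"] - 0.3 * (1.0 - cfg["kv_cache_keep"])
    penalty = 1.0 if memory_cost(cfg, state) > state["free_mem_gb"] else 0.0
    return quality - penalty

for step in range(1000):
    state = observe_state()
    # Epsilon-greedy selection over pruning configurations.
    if random.random() < EPSILON:
        action = random.randrange(len(PRUNE_CONFIGS))
    else:
        action = max(range(len(PRUNE_CONFIGS)), key=lambda i: q_values[i])
    r = reward(PRUNE_CONFIGS[action], state)
    counts[action] += 1
    q_values[action] += (r - q_values[action]) / counts[action]  # incremental mean update

print("Learned preference (Q-values):", [round(q, 3) for q in q_values])
```

The key design point the sketch illustrates is that the action space couples weight sparsity with KV-cache retention, so the policy trades the two off under a shared memory budget rather than compressing either in isolation.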