Large language models (LLMs) have high computational and memory requirements, hindering deployment.
Existing compression methods for LLMs lack adaptability to runtime memory variations and diverse user requests.
RAP is an elastic pruning framework driven by reinforcement learning that dynamically adjusts compression strategies based on evolving memory and workload conditions.
Experiments show that RAP outperforms current compression methods by jointly accounting for model weights and the key-value (KV) cache during execution.
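To make the idea of an RL-driven elastic pruning loop concrete, the sketch below shows a toy policy that picks a joint weight/KV-cache compression configuration from observed memory and workload signals. All names (PRUNE_CONFIGS, observe_state, memory_cost, reward) and numbers are hypothetical illustrations, not RAP's actual interface or method; a simple epsilon-greedy bandit stands in for the reinforcement-learning policy, and a full treatment would condition the policy on the observed state rather than averaging over it.

```python
import random

# Candidate compression strategies: fraction of weights pruned and fraction of KV cache kept.
PRUNE_CONFIGS = [
    {"weight_sparsity": 0.0, "kv_cache_keep": 1.00},
    {"weight_sparsity": 0.3, "kv_cache_keep": 0.75},
    {"weight_sparsity": 0.5, "kv_cache_keep": 0.50},
]

q_values = [0.0] * len(PRUNE_CONFIGS)  # running value estimate per configuration
counts = [0] * len(PRUNE_CONFIGS)
EPSILON = 0.1  # exploration rate for the epsilon-greedy policy

def observe_state():
    """Hypothetical runtime signals: available memory (GB) and current request load."""
    return {"free_mem_gb": random.uniform(4, 24), "batch_size": random.randint(1, 32)}

def memory_cost(cfg, state):
    """Rough footprint: dense weights scaled by sparsity plus per-request KV cache (assumed sizes)."""
    weights_gb = 14.0 * (1.0 - cfg["weight_sparsity"])        # assumed 14 GB dense model
    kv_gb = 0.5 * state["batch_size"] * cfg["kv_cache_keep"]  # assumed 0.5 GB KV cache per request
    return weights_gb + kv_gb

def reward(cfg, state):
    """Reward = retained task quality, heavily penalized when the memory budget is exceeded."""
    quality = 1.0 - 0.6 * cfg["weight_sparsity"] - 0.3 * (1.0 - cfg["kv_cache_keep"])
    penalty = 1.0 if memory_cost(cfg, state) > state["free_mem_gb"] else 0.0
    return quality - penalty

for step in range(1000):
    state = observe_state()
    # Epsilon-greedy selection over pruning configurations.
    if random.random() < EPSILON:
        action = random.randrange(len(PRUNE_CONFIGS))
    else:
        action = max(range(len(PRUNE_CONFIGS)), key=lambda i: q_values[i])
    r = reward(PRUNE_CONFIGS[action], state)
    counts[action] += 1
    q_values[action] += (r - q_values[action]) / counts[action]  # incremental mean update

print("Learned preference (Q-values):", [round(q, 3) for q in q_values])
```

The key design point the sketch illustrates is that the action space couples weight sparsity with KV-cache retention, so the policy trades the two off under a shared memory budget rather than compressing either in isolation.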