Image Credit: Arxiv

RAP: Runtime-Adaptive Pruning for LLM Inference

  • Large language models (LLMs) impose heavy compute and memory demands, which hinders deployment on resource-constrained hardware.
  • Existing LLM compression methods are static: they cannot adapt to runtime memory fluctuations or to heterogeneous user requests.
  • RAP is an elastic pruning framework driven by reinforcement learning that dynamically adjusts its compression strategy as memory availability and workload conditions evolve.
  • In experiments, RAP surpasses existing methods by jointly accounting for model weights and the key-value (KV) cache during execution (a minimal sketch of this idea follows the list).
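
For intuition, the sketch below shows one way such a runtime-adaptive controller could work; it is an illustration, not the paper's implementation. An epsilon-greedy bandit stands in for RAP's reinforcement-learning policy, the feasibility check jointly accounts for pruned-weight memory and the KV cache demanded by the current batch, and every name, constant, and reward proxy here is a hypothetical assumption.

    import random

    # All names and numbers below are illustrative assumptions, not from the RAP paper.
    BYTES_PER_PARAM = 2     # assume fp16 weights
    BYTES_PER_KV_ELEM = 2   # assume fp16 KV cache

    def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                       batch: int, seq_len: int) -> int:
        """Memory needed for the KV cache of the current workload (2x for keys and values)."""
        return 2 * n_layers * n_kv_heads * head_dim * batch * seq_len * BYTES_PER_KV_ELEM

    def weights_bytes(n_params: int, keep_ratio: float) -> int:
        """Memory needed for the pruned model weights."""
        return int(n_params * keep_ratio * BYTES_PER_PARAM)

    class PruningController:
        """Epsilon-greedy bandit as a toy stand-in for an RL pruning policy.

        The action space is a discrete set of weight keep-ratios; the reward is a
        caller-supplied quality proxy, so the controller learns which compression
        level preserves the most quality while fitting the memory budget.
        """

        def __init__(self, keep_ratios=(1.0, 0.8, 0.6, 0.4), epsilon=0.1):
            self.keep_ratios = keep_ratios
            self.epsilon = epsilon
            self.value = {r: 0.0 for r in keep_ratios}  # running mean reward per action
            self.count = {r: 0 for r in keep_ratios}

        def act(self, n_params: int, kv_demand: int, budget: int) -> float:
            # Only consider ratios whose pruned weights + KV cache fit the budget.
            feasible = [r for r in self.keep_ratios
                        if weights_bytes(n_params, r) + kv_demand <= budget]
            if not feasible:
                return min(self.keep_ratios)             # prune as hard as we can
            if random.random() < self.epsilon:
                return random.choice(feasible)           # explore
            return max(feasible, key=lambda r: self.value[r])  # exploit best-known action

        def update(self, ratio: float, reward: float) -> None:
            # Incremental running-mean update of the chosen action's value.
            self.count[ratio] += 1
            self.value[ratio] += (reward - self.value[ratio]) / self.count[ratio]

    # Hypothetical usage for one decoding step of a 7B-parameter model:
    ctrl = PruningController()
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, batch=4, seq_len=2048)
    ratio = ctrl.act(n_params=7_000_000_000, kv_demand=kv, budget=12 * 1024**3)
    # ... run inference with the model pruned to `ratio`, score output quality ...
    ctrl.update(ratio, reward=0.9)  # feed the quality proxy back as the reward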

