Source: Cloudblog
Data loading best practices for AI/ML inference on GKE

  • As AI models grow in sophistication, increasingly large amounts of model data are needed to serve them.
  • Loading the models and weights, along with the frameworks needed to serve them for inference, can add seconds or even minutes of scaling delay, impacting both costs and the end-user experience.
  • Inference servers such as Triton, Text Generation Inference (TGI), or vLLM are packaged as containers that are often over 10GB in size.
  • This can make them slow to download and can extend pod startup times in Kubernetes.
  • This blog explores techniques to accelerate data loading both for inference serving containers and for models + weights.
  • Accelerating container load times using secondary boot disks to cache container images, with your inference engine and applicable libraries, directly on the GKE node (see the node-pool sketch after this list).
  • Accelerating model + weight load times from Google Cloud Storage with Cloud Storage Fuse or Hyperdisk ML (see the manifest sketches after this list).
  • GKE lets you pre-cache your container image into a secondary boot disk that is attached to your node at creation time.
  • With Cloud Storage as the source of truth, there are two main products for retrieving your data at the GKE pod level: Cloud Storage Fuse and Hyperdisk ML (HdML).
  • Loading large AI models, weights, and container images into GKE-based AI workloads can delay workload startup times.
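
A minimal sketch of the secondary boot disk approach, assuming a disk image named inference-engine-cache that already contains your serving container (built beforehand, e.g. with GKE's disk image builder) and a cluster named inference-cluster; all names are hypothetical, and the exact flag spelling may vary by gcloud version:

```sh
# Create a node pool whose nodes attach a secondary boot disk that is
# pre-loaded with the inference serving container image (e.g. Triton, TGI,
# or vLLM), so the image does not need to be pulled at pod startup.
# "inference-cluster", "my-project", and "inference-engine-cache" are
# placeholder names.
gcloud container node-pools create inference-pool \
  --cluster=inference-cluster \
  --project=my-project \
  --secondary-boot-disks=disk-image=projects/my-project/global/images/inference-engine-cache,mode=CONTAINER_IMAGE_CACHE
```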
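
A sketch of the Cloud Storage Fuse path: an inference pod that mounts a bucket of model weights read-only through the Cloud Storage FUSE CSI driver. The bucket name, service account, namespace, and container image are placeholders, and the cluster is assumed to have the Cloud Storage FUSE CSI driver and Workload Identity enabled:

```yaml
# Pod that mounts a Cloud Storage bucket holding model weights as a
# read-only volume via the Cloud Storage FUSE CSI driver.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
  annotations:
    gke-gcsfuse/volumes: "true"      # request injection of the gcsfuse sidecar
spec:
  serviceAccountName: inference-ksa  # assumed to be bound to a GCP SA with bucket read access
  containers:
  - name: inference-server
    image: vllm/vllm-openai:latest   # example serving container
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-weights-bucket   # placeholder bucket
        mountOptions: "implicit-dirs"
```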
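
And a sketch of the Hyperdisk ML path: a StorageClass backed by the Compute Engine Persistent Disk CSI driver with the hyperdisk-ml disk type, plus a read-only-many claim that multiple serving pods can share. Names and sizes are illustrative, and the full workflow (not shown) also involves pre-populating the volume with model data from Cloud Storage before serving pods mount it:

```yaml
# StorageClass for Hyperdisk ML volumes (read-optimized, many-reader disks).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-ml
volumeBindingMode: WaitForFirstConsumer
---
# Claim that inference pods mount read-only once the volume has been
# populated with model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-hdml
spec:
  storageClassName: hyperdisk-ml
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 300Gi   # illustrative size
```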
