As AI models increase in sophistication, the amount of model data needed to serve them grows as well.
Loading models and weights, along with the frameworks needed to serve them for inference, can add seconds or even minutes of scaling delay, impacting both cost and the end-user experience.
Inference servers such as Triton, Text Generation Inference (TGI), or vLLM are packaged as containers that are often over 10GB in size.
This makes them slow to download and can extend pod startup times in Kubernetes.
This blog explores two techniques to accelerate data loading for both inference serving containers and model + weight downloads:

1. Accelerating container load times by using secondary boot disks to cache container images, with your inference engine and applicable libraries, directly on the GKE node.

2. Accelerating model + weight load times from Google Cloud Storage with Cloud Storage FUSE or Hyperdisk ML.
GKE lets you pre-cache your container image into a secondary boot disk that is attached to your node at creation time.
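As a rough sketch (the flag syntax may vary by GKE version, and names such as `inference-pool`, `my-cluster`, and the disk image path are placeholders), creating a node pool whose nodes attach a pre-built secondary boot disk as a container image cache could look like this:

```bash
# Hypothetical example: attach a secondary boot disk built from a disk image
# that already contains the inference-server container image, so nodes can
# serve the image from the local cache instead of pulling it at startup.
gcloud container node-pools create inference-pool \
    --cluster=my-cluster \
    --region=us-central1 \
    --secondary-boot-disks=disk-image=projects/my-project/global/images/inference-image-cache,mode=CONTAINER_IMAGE_CACHE
```

The disk image itself is prepared ahead of time, for example by pulling your inference container images onto a disk and capturing it as a Compute Engine image; depending on your GKE version, image streaming may also need to be enabled on the cluster.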
With Cloud Storage as the source of truth, there are two main products for retrieving your data at the GKE pod level: Cloud Storage FUSE and Hyperdisk ML (HdML).
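As an illustrative sketch, once the Cloud Storage FUSE CSI driver is enabled on the cluster, a pod can mount a bucket of model weights directly; the pod name, service account, container image, bucket name, and mount path below are placeholders:

```yaml
# Sketch: mount a Cloud Storage bucket of model weights into an inference pod
# via the Cloud Storage FUSE CSI driver (all names are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
  annotations:
    gke-gcsfuse/volumes: "true"        # ask GKE to inject the gcsfuse sidecar
spec:
  serviceAccountName: inference-ksa    # Kubernetes SA with access to the bucket
  containers:
  - name: inference
    image: vllm/vllm-openai:latest     # placeholder inference server image
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-weights-bucket   # placeholder bucket
        mountOptions: "implicit-dirs"
```

For Hyperdisk ML, a common pattern (sketched here with assumed names) is a StorageClass backed by the Persistent Disk CSI driver with the `hyperdisk-ml` type; the volume is populated with model data once and then exposed to inference pods as a read-only claim that many nodes can attach concurrently:

```yaml
# Sketch: StorageClass for Hyperdisk ML volumes on GKE (placeholder name).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-ml
volumeBindingMode: WaitForFirstConsumer
```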
Loading large AI models, weights, and container images for GKE-based AI workloads can delay workload startup times.