Google Kubernetes Engine now supports clusters of up to 65,000 nodes, providing unmatched scale for training and inference and much-needed capacity for the world’s most resource-hungry AI workloads.
This advancement enables customers to reduce model training time or scale models to multiple trillions of parameters.
According to Google, this capacity is more than 10X the scale offered by the two other largest public cloud providers.
GKE is transitioning from etcd, the open-source distributed key-value store, to a new, more robust key-value store built on Spanner, Google’s distributed database that delivers virtually unlimited scale.
Beyond supporting larger GKE clusters, this change will improve the latency of cluster operations and enable a stateless cluster control plane.
GKE now scales significantly faster with a control plane that automatically adjusts to high-volume operations, while maintaining predictable operational latencies.
Guided by Google's long-standing open-source culture, the GKE team makes substantial contributions to the open-source community, including work on scaling Kubernetes itself.
Recent innovations in this space include fully managed DCGM metrics for improved accelerator monitoring and custom compute classes, which offer greater control over compute resource allocation and scaling.
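As a rough illustration of the custom compute class idea, the sketch below uses the Kubernetes Python client to create a hypothetical ComputeClass object that prefers Spot capacity and falls back to on-demand machines. The resource name `cost-optimized-gpu` and the fields under `spec` are assumptions made for illustration, not GKE's documented schema; consult the GKE custom compute class documentation for the exact API.

```python
# Illustrative sketch only: creates a hypothetical ComputeClass resource with
# the Kubernetes Python client. The spec field names are assumptions and may
# differ from the schema in your GKE version.
from kubernetes import client, config


def apply_compute_class() -> None:
    config.load_kube_config()  # use the current kubeconfig context
    api = client.CustomObjectsApi()

    compute_class = {
        "apiVersion": "cloud.google.com/v1",
        "kind": "ComputeClass",
        "metadata": {"name": "cost-optimized-gpu"},  # hypothetical name
        "spec": {
            # Ordered fallback priorities (assumed fields): prefer Spot
            # capacity, then fall back to on-demand if Spot is unavailable.
            "priorities": [
                {"machineFamily": "g2", "spot": True},
                {"machineFamily": "g2", "spot": False},
            ],
        },
    }

    # ComputeClass is assumed to be cluster-scoped, so use the
    # cluster-scoped create call.
    api.create_cluster_custom_object(
        group="cloud.google.com",
        version="v1",
        plural="computeclasses",
        body=compute_class,
    )


if __name__ == "__main__":
    apply_compute_class()
```

Workloads would then opt into the class through a node selector referencing its name, letting the autoscaler provision capacity in the priority order defined above.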
GKE’s support for larger clusters will help companies like Anthropic accelerate AI innovation and work more efficiently across diverse workloads.
Google Cloud says it is dedicated to providing the best platform for running containerized workloads and to consistently pushing the boundaries of innovation.