Google Cloud continues to expand its high-performance computing (HPC) capabilities with new technology and infrastructure, including the next generation of H-series VMs, expected in early 2025. These VMs will offer improved scalability and native support for provisioning full, tightly coupled HPC clusters on demand, along with superior performance, reliability, and security.
Parallelstore is a fully managed, scalable, high-performance storage solution based on next-generation DAOS technology and designed for HPC and AI workloads. It can provide up to six times greater read throughput than competitive Lustre scratch offerings. It also suits applications that need fast access to large datasets, such as analyzing massive genomic datasets for personalized medicine.
A3 Ultra VMs with NVIDIA H200 Tensor Core GPUs deliver significant performance improvements over previous generations, offering 3.2 Tbps of non-blocking GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). They can scale to tens of thousands of GPUs in a dense, performance-optimized cluster for large AI and HPC workloads.
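To put the 3.2 Tbps figure in per-GPU terms, the following is a rough back-of-the-envelope conversion, not an official specification. It assumes the figure is quoted per VM and that each VM has 8 GPUs; neither detail is stated in the text above, and the function names are purely illustrative.

```python
# Back-of-the-envelope conversion of an aggregate network figure
# into per-GPU bandwidth.
# ASSUMPTIONS (not stated in the article): the 3.2 Tbps figure is
# per VM, and each VM has 8 GPUs.

TOTAL_TBPS = 3.2   # aggregate GPU-to-GPU bandwidth, terabits per second
GPUS_PER_VM = 8    # assumed GPU count per VM


def per_gpu_gbps(total_tbps: float, gpus: int) -> float:
    """Split an aggregate Tbps figure evenly across GPUs, in Gbps."""
    return total_tbps * 1000 / gpus


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


per_gpu = per_gpu_gbps(TOTAL_TBPS, GPUS_PER_VM)
print(f"Per-GPU bandwidth: {per_gpu:.0f} Gbps "
      f"(about {gbps_to_gigabytes_per_s(per_gpu):.0f} GB/s)")
```

Under these assumptions, each GPU gets roughly 400 Gbps, or about 50 GB/s, of network bandwidth.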
Google Cloud recently announced Trillium, its sixth-generation Tensor Processing Unit (TPU), now available in preview. Compared with TPU v5e, Trillium delivers more than 4x the training performance, up to 3x the inference throughput, and a 67% increase in energy efficiency.
Cluster Toolkit provides open-source tools for deploying and managing HPC environments on Google Cloud. In GKE, secondary boot disks speed up workload startup by caching container images, while custom compute classes offer greater control over compute resource allocation and scaling.
Visit Google Cloud at booth #1730 at Supercomputing 2024 in Atlanta to learn more about HPC and AI infrastructure and quantum solutions. The booth will feature a Trillium TPU board, an NVIDIA H200 GPU and ConnectX-7 NIC, hands-on labs, a full schedule of talks, a comfortable lounge, and plenty of great swag!
Atommap, a company specializing in atomic-scale materials design, is using Google Cloud HPC features such as H3 VMs and Parallelstore to accelerate its research and development efforts. The company has achieved significant success, including shorter time to results and better scalability with optimized infrastructure costs.
Google Cloud has launched the Google Cloud Advanced Computing Community, a new kind of community of practice for sharing and growing HPC, AI, and quantum computing expertise, innovation, and impact. The Community brings together thought leaders and experts from Google, its partners, and HPC, AI, and quantum computing organizations around the world.
Google is leading the way for containerized workloads with GKE, which supports up to 65,000 nodes and offers major improvements in automating and simplifying the building of HPC and AI platforms. Kueue (kueue.sh) is a job queueing system for Kubernetes that offers topology-aware scheduling, priority and fairness in queueing, multi-cluster support, and more.
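As a rough illustration of what priority in job queueing means, here is a toy sketch of a queue that admits higher-priority jobs first and breaks ties in arrival order. It is not Kueue's API or its scheduling algorithm, and it ignores topology, quotas, and multi-cluster concerns; the class `ToyJobQueue` and the job names are hypothetical.

```python
import heapq
import itertools


class ToyJobQueue:
    """Toy priority queue: higher priority first, FIFO among equal priorities.

    Illustrative only; this is NOT how Kueue is implemented or used.
    """

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        # Monotonic counter records arrival order for tie-breaking.
        self._arrival = itertools.count()

    def submit(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so negate priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def next_job(self) -> str:
        # Remove and return the name of the next job to admit.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def drain(queue: ToyJobQueue) -> list[str]:
    """Pop every job and return the names in admission order."""
    order = []
    while len(queue) > 0:
        order.append(queue.next_job())
    return order


queue = ToyJobQueue()
queue.submit("batch-a", priority=1)
queue.submit("train-b", priority=5)
queue.submit("train-c", priority=5)
queue.submit("batch-d", priority=1)
admission_order = drain(queue)
print(admission_order)
```

Equal-priority jobs leave the queue in the order they arrived, which is a much simpler notion of fairness than the fair sharing a production scheduler provides.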
Google Cloud is committed to empowering researchers and engineers to tackle the world's most complex computational challenges through continuous innovation for HPC. Further enhancements are planned for HPC VMs, Parallelstore, Cluster Toolkit, Slurm-gcp, and other HPC products and solutions.
Google will also take part in several areas of SC24's technical program, including BoFs, user groups, and workshops. Googlers will participate in technical sessions such as "Converged HPC and Cloud Computing in the Era of Generative AI" and "HPC & Cloud Convergence: drivers, triggers, and constraints," among others.