Google, ByteDance, and Red Hat have partnered to introduce new capabilities for generative AI inference on Kubernetes.
The Gateway API Inference Extension now supports LLM-aware routing, making it more cost-effective to operationalize popular techniques for large language model (LLM) inference.
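LLM-aware routing in the Gateway API Inference Extension is configured through custom resources such as `InferencePool`, which groups model-serving pods behind a routing extension. A minimal sketch is shown below; the names (`vllm-llama3-pool`, `endpoint-picker`, port 8000, the pod labels) are illustrative placeholders, and exact field names can vary between alpha releases of the extension.

```yaml
# Illustrative sketch of an InferencePool, grouping model-serving
# pods behind an endpoint-picker extension for LLM-aware routing.
# Names and labels here are placeholders, not a definitive config.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-pool
spec:
  # Pods carrying this label serve the model and receive routed traffic.
  selector:
    app: vllm-llama3
  # Port the model server listens on inside each pod.
  targetPortNumber: 8000
  # The extension service that picks an endpoint using LLM-aware
  # signals (e.g. queue depth, KV-cache utilization).
  extensionRef:
    name: endpoint-picker
```

A Gateway `HTTPRoute` can then reference this pool as a backend, letting the extension choose the best replica per request rather than relying on round-robin load balancing.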
A new inference performance project provides a benchmarking standard that yields detailed insights into model performance on accelerators and informs Horizontal Pod Autoscaler (HPA) scaling metrics and thresholds.
Dynamic Resource Allocation (DRA) simplifies and automates how Kubernetes allocates and schedules GPUs, TPUs, and other devices to pods and workloads, ensuring scheduling efficiency and portability across accelerators.
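With DRA, a workload requests a device through a resource claim that references a device class, and the scheduler matches the claim to available hardware. The sketch below assumes the beta `resource.k8s.io/v1beta1` API and a hypothetical device class name `gpu.example.com`; field layouts may differ across Kubernetes releases.

```yaml
# Illustrative DRA sketch: a claim template requesting one device
# from a device class, consumed by a pod. The class name and image
# are placeholders.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # Device class published by the accelerator's DRA driver.
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: inference
    image: example.com/inference-server:latest
    resources:
      claims:
      - name: gpu   # Binds this container to the claim below.
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Because the pod references an abstract device class rather than a vendor-specific resource name, the same manifest can be scheduled onto different accelerator types wherever a matching DRA driver is installed.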