Google, ByteDance, and Red Hat have partnered to introduce new capabilities for generative AI inference on Kubernetes.
The Gateway API Inference Extension now supports LLM-aware routing, making it more cost-effective to operationalize popular techniques for large language model (LLM) inference.
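LLM-aware routing in the Gateway API Inference Extension is configured through custom resources such as `InferencePool`, which groups model-serving pods behind a routing extension. A minimal sketch is shown below; the names (`vllm-llama3-pool`, `endpoint-picker`, port 8000, the pod labels) are illustrative placeholders, and exact field names can vary between alpha releases of the extension.

```yaml
# Illustrative sketch of an InferencePool, grouping model-serving
# pods behind an endpoint-picker extension for LLM-aware routing.
# Names and labels here are placeholders, not a definitive config.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-pool
spec:
  # Pods carrying this label serve the model and receive routed traffic.
  selector:
    app: vllm-llama3
  # Port the model server listens on inside each pod.
  targetPortNumber: 8000
  # The extension service that picks an endpoint using LLM-aware
  # signals (e.g. queue depth, KV-cache utilization).
  extensionRef:
    name: endpoint-picker
```

A Gateway `HTTPRoute` can then reference this pool as a backend, letting the extension choose the best replica per request rather than relying on round-robin load balancing.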
A new inference performance project provides a benchmarking standard that yields detailed insights into model performance on accelerators and informs Horizontal Pod Autoscaler (HPA) scaling metrics and thresholds.
Dynamic Resource Allocation (DRA) simplifies and automates how Kubernetes allocates and schedules GPUs, TPUs, and other devices to pods and workloads, ensuring scheduling efficiency and portability across accelerators.
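With DRA, a workload requests a device through a resource claim that references a device class, and the scheduler matches the claim to available hardware. The sketch below assumes the beta `resource.k8s.io/v1beta1` API and a hypothetical device class name `gpu.example.com`; field layouts may differ across Kubernetes releases.

```yaml
# Illustrative DRA sketch: a claim template requesting one device
# from a device class, consumed by a pod. The class name and image
# are placeholders.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # Device class published by the accelerator's DRA driver.
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: inference
    image: example.com/inference-server:latest
    resources:
      claims:
      - name: gpu   # Binds this container to the claim below.
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Because the pod references an abstract device class rather than a vendor-specific resource name, the same manifest can be scheduled onto different accelerator types wherever a matching DRA driver is installed.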