<ul><li>OpenAI suffered a major platform outage lasting four hours on December 11, 2024.</li><li>The outage was caused by a faulty deployment of a telemetry service, which overwhelmed the Kubernetes API servers with resource-intensive API calls.</li><li>The issue went unnoticed until the rollout reached the largest clusters, where it left DNS-based service discovery unresponsive and blocked access to the Kubernetes control plane.</li><li>OpenAI aims to prevent future outages by improving phased rollouts, conducting fault injection testing, decoupling the Kubernetes data plane from the control plane, and implementing a break-glass mechanism for accessing the API server.</li></ul>
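
To make the phased-rollout idea concrete, here is a minimal Python sketch of a rollout gate that deploys to clusters from smallest to largest and halts as soon as a health check fails. It is an illustration under stated assumptions, not OpenAI's tooling: the `Cluster` type, `phased_rollout`, and the `deploy` and `api_server_healthy` callbacks are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cluster:
    """Hypothetical cluster record; node_count stands in for cluster size."""
    name: str
    node_count: int


def phased_rollout(
    clusters: List[Cluster],
    deploy: Callable[[Cluster], None],
    api_server_healthy: Callable[[Cluster], bool],
) -> List[str]:
    """Deploy smallest-first and stop at the first cluster that fails its check.

    Returns the names of the clusters that received the change,
    including the one whose health check failed.
    """
    touched: List[str] = []
    for cluster in sorted(clusters, key=lambda c: c.node_count):
        deploy(cluster)
        touched.append(cluster.name)
        if not api_server_healthy(cluster):
            # Halt here so the remaining (larger) clusters are never touched.
            break
    return touched


# Usage: assume (hypothetically) that the change only overloads the API
# server on clusters with 1000 or more nodes.
clusters = [
    Cluster("large-a", 5000),
    Cluster("small", 50),
    Cluster("large-b", 6000),
    Cluster("medium", 500),
]
touched = phased_rollout(
    clusters,
    deploy=lambda c: None,  # placeholder for a real deployment step
    api_server_healthy=lambda c: c.node_count < 1000,
)
print(touched)  # ['small', 'medium', 'large-a']
```

In this sketch the failure still only appears on the first large cluster, much as the issue in the summary went unnoticed on smaller ones, but the gate stops the rollout before `large-b` is affected. A real gate would also need a soak period and load-aware health signals, which are omitted here.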

11th Dec 2024 — OpenAI's Kubernetes Outage Explained
