OpenAI faced a major platform outage lasting four hours on December 11th, 2024.
The outage was caused by a faulty deployment of a telemetry service, which overwhelmed the API server with resource-intensive API calls.
The issue went unnoticed until it propagated to the largest clusters, leading to unresponsive DNS-based service discovery and preventing access to the Kubernetes control plane.
OpenAI aims to prevent future outages by improving phased rollouts, conducting fault injection testing, decoupling the Kubernetes data plane from the control plane, and implementing a break-glass mechanism for accessing the API server.