menu
techminis

A naukri.com initiative

google-web-stories
Home

>

Devops News

>

11th Dec 2...
source image

Dev

1w

read

169

img
dot

Image Credit: Dev

11th Dec 2024 — OpenAI's Kubernetes Outage Explained

  • OpenAI faced a major platform outage lasting four hours on December 11th, 2024.
  • The outage was caused by a faulty deployment of a telemetry service, which overwhelmed the API server with resource-intensive API calls.
  • The issue went unnoticed until it propagated to the largest clusters, leading to unresponsive DNS-based service discovery and preventing access to the Kubernetes control plane.
  • OpenAI aims to prevent future outages by improving phased rollouts, conducting fault injection testing, decoupling the Kubernetes data plane from the control plane, and implementing a break-glass mechanism for accessing the API server.

Read Full Article

like

10 Likes

For uninterrupted reading, download the app