In the blog post 'Launch Observability at Netflix Scale,' Varun Khaitan discusses the strategies and architecture implemented to achieve comprehensive title observability at scale.
A key step was the introduction of observability endpoints: each microservice in the Personalization stack was required to expose a 'Title Health' endpoint.
The endpoints were designed to accurately reflect production behavior, standardize communication, and follow the Insight Triad principle of 'Healthy,' 'why not healthy,' and 'how to fix it.'
Standardization was achieved through a stable proto request/response format, which eased adoption, kept the system simple, and made debugging easier for engineers.
The post emphasized that endpoint responses should carry detailed information so that Launch Managers and partner engineers could understand and address issues.
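As an illustration of the Insight Triad, the sketch below shows one way a standardized Title Health response could be shaped. The class name, field names, and the `evaluate` helper are hypothetical and are not the proto schema used in the post; the failure reasons and fixes are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TitleHealth:
    """Hypothetical response shape mirroring the Insight Triad."""
    title_id: int
    row: str
    healthy: bool                                              # "Healthy"
    why_not_healthy: List[str] = field(default_factory=list)  # "why not healthy"
    how_to_fix: List[str] = field(default_factory=list)       # "how to fix it"


def evaluate(title_id: int, row: str, failed_checks: Dict[str, str]) -> TitleHealth:
    """Build a response from failed checks.

    failed_checks maps a failure reason to a suggested fix (both assumed
    for illustration). An empty mapping means the title is healthy.
    """
    return TitleHealth(
        title_id=title_id,
        row=row,
        healthy=not failed_checks,
        why_not_healthy=list(failed_checks.keys()),
        how_to_fix=list(failed_checks.values()),
    )
```

Keeping the reason and the fix in the same response is what lets a reader who does not own the service act on an unhealthy result without digging through logs.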
The post outlined a high-level architecture comprising observability endpoints, proactive monitoring, real-time data tracking, optimized data storage, and APIs for stakeholders.
Proactive monitoring was carried out by scheduled collector jobs that ran title health evaluations for various Netflix rows.
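A single run of such a collector job might look like the following sketch. The function name, its parameters, and the choice to record a failed call as unhealthy are assumptions made for illustration; the post does not describe the job at this level of detail.

```python
from typing import Callable, Dict, Iterable, Tuple


def collect_title_health(
    title_ids: Iterable[int],
    rows: Iterable[str],
    check: Callable[[int, str], bool],
) -> Dict[Tuple[int, str], bool]:
    """Evaluate every (title, row) pair once and return the results.

    `check` stands in for a call to a service's Title Health endpoint
    (hypothetical signature). A call that raises is recorded as
    unhealthy so that one failing service does not abort the whole run.
    """
    title_ids = list(title_ids)  # allow reuse across rows
    results: Dict[Tuple[int, str], bool] = {}
    for row in rows:
        for title_id in title_ids:
            try:
                results[(title_id, row)] = check(title_id, row)
            except Exception:
                results[(title_id, row)] = False
    return results
```

A scheduler would invoke this function at a fixed interval and hand the results to storage; the interval and the storage step are left out here.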
Real-time title impressions were monitored via a Kafka queue, with impression data aggregated to assess title performance in near real time.
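The aggregation step can be sketched as counting impressions per title in fixed time windows. In this sketch a plain iterable of dictionaries stands in for messages consumed from Kafka; the event fields (`title_id`, `ts`) and the 60-second window are assumptions, not details from the post.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def aggregate_impressions(
    events: Iterable[Mapping[str, int]],
    window_seconds: int = 60,
) -> Dict[Tuple[int, int], int]:
    """Count impressions per (title_id, window_start).

    Each event is assumed to carry a `title_id` and an integer Unix
    timestamp `ts` in seconds. Events are bucketed into tumbling
    windows of `window_seconds`.
    """
    counts: Counter = Counter()
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(event["title_id"], window_start)] += 1
    return dict(counts)
```

A title whose count stays at zero in recent windows, despite a healthy endpoint result, would be a natural candidate for closer inspection.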
Data storage and distribution were facilitated through Hollow Feeds, allowing for efficient dissemination of health data across service boxes.
An Observability Dashboard, powered by the Health Check Engine, provided stakeholders with current title status across supported rows.