FINRA has built an observability framework, based on Prometheus and Grafana, that provides operational-metric insights into big data processing workloads running on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) clusters.
Monitoring EMR clusters in real time is crucial for identifying root causes, minimizing manual actions, and increasing productivity. Organizations that observe cluster performance face challenges that include scale, dynamic environments, data variety, resource utilization, latency, centralizing observability dashboards, alerting, incident management, and cost management.
Insights gained from "Monitor and Optimize Analytic Workloads on Amazon EMR with Prometheus and Grafana" helped FINRA build its central enterprise monitoring solution using Amazon Managed Service for Prometheus (Managed Prometheus) and Amazon Managed Grafana (Managed Grafana).
Managed Prometheus collects high-volume data in real time and scales the ingestion, storage, and querying of operational metrics, which lets the solution provide Ganglia-like metrics. Additionally, a data ingestion layer for every cluster, configuration for metrics collection, and inefficiencies were included in the solution.
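As a rough illustration of what a per-cluster ingestion layer with metrics-collection configuration might look like, the sketch below builds a Prometheus configuration that scrapes exporters on each cluster node and forwards samples through `remote_write` with SigV4 signing. It is not FINRA's actual setup: the function name, job names, the `cluster_id` label, the JMX exporter port, and the 15-second scrape interval are all assumptions made for illustration.

```python
import json


def build_prometheus_config(cluster_id, node_hosts, remote_write_url, region,
                            node_exporter_port=9100, jmx_exporter_port=7001):
    """Build an illustrative Prometheus config dict for one EMR cluster.

    Assumptions (not from the source): job names, the cluster_id label,
    the JMX exporter port, and the scrape interval.
    """
    def targets(port):
        # One scrape target per cluster node for the given exporter port.
        return [f"{host}:{port}" for host in node_hosts]

    def job(name, port):
        return {
            "job_name": name,
            "static_configs": [
                {"targets": targets(port), "labels": {"cluster_id": cluster_id}}
            ],
        }

    return {
        "global": {"scrape_interval": "15s"},
        "scrape_configs": [
            job("node_exporter", node_exporter_port),  # OS-level metrics
            job("jmx_exporter", jmx_exporter_port),    # JVM-based services
        ],
        # Forward scraped samples to a remote Prometheus-compatible endpoint,
        # signing requests with AWS SigV4 for the given region.
        "remote_write": [
            {"url": remote_write_url, "sigv4": {"region": region}}
        ],
    }


def render_config(config):
    """Serialize the config as JSON, which is also valid YAML."""
    return json.dumps(config, indent=2)
```

Calling `render_config(build_prometheus_config(...))` with a cluster's node addresses and a workspace remote-write URL yields text that could be written to a Prometheus configuration file on the cluster. Tagging every series with a cluster label is one way to keep metrics from many clusters separable in a single central workspace.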
A mechanism was built to render task-level, node-level, and cluster-level metrics on Managed Grafana dashboards that can be promoted from lower environments to higher environments.
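One way dashboard promotion between environments could work is to export the dashboard JSON from the lower environment, retarget its Prometheus data source references, and import the result into the higher environment. The following is a minimal sketch under that assumption, not a description of FINRA's mechanism; `promote_dashboard` and its parameters are hypothetical names, and the JSON shape assumed is the common Grafana dashboard model in which data sources are referenced as `{"type": ..., "uid": ...}` objects.

```python
import copy


def promote_dashboard(dashboard, target_datasource_uid):
    """Return a copy of a dashboard dict prepared for another environment.

    Hypothetical helper: clears the instance-specific numeric id and
    version, and points every Prometheus data source reference at the
    target environment's data source uid.
    """
    promoted = copy.deepcopy(dashboard)  # never mutate the exported source
    promoted["id"] = None                # numeric id is specific to one instance
    promoted.pop("version", None)        # version history does not carry over

    def retarget(node):
        # Walk the nested JSON structure and rewrite data source references.
        if isinstance(node, dict):
            ds = node.get("datasource")
            if isinstance(ds, dict) and ds.get("type") == "prometheus":
                ds["uid"] = target_datasource_uid
            for value in node.values():
                retarget(value)
        elif isinstance(node, list):
            for item in node:
                retarget(item)

    retarget(promoted)
    return promoted
```

The dashboard's own `uid` is left unchanged here so that links to it stay stable across environments; whether that is desirable depends on how the environments are organized.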
The scalable solution significantly reduced the time to resolution and improved the overall operational posture. It gives the operations and engineering teams comprehensive insights into various Amazon EMR metrics, such as OS-level, Spark, JMX, HDFS, and YARN metrics, all consolidated in one place.
The solution extends to other use cases, such as Amazon Elastic Kubernetes Service (Amazon EKS) clusters, including EMR on EKS clusters, and other applications, establishing it as a one-stop system for monitoring metrics across FINRA's infrastructure and applications.