The Performance Engineering team at GitHub conducted experiments to observe the impact of CPU utilization on system performance.
The team established a performance baseline by sending moderate production traffic to the Large Unicorn Collider (LUC) Kubernetes pod, which mirrors the architecture and configuration of flagship workloads, and then used a tool called “stress” to gradually increase CPU utilization.
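The `stress` utility generates load by spawning workers that spin in tight loops (e.g. `stress --cpu 4 --timeout 60` keeps four cores busy for a minute). As a minimal sketch of the same idea, the following Python snippet spawns busy-loop worker processes for a fixed duration; the function names are illustrative, not part of the team's tooling:

```python
import multiprocessing
import time

def burn_cpu(seconds: float) -> None:
    """Busy-loop for the given duration, keeping one core fully utilized."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass  # spin

def apply_load(workers: int, seconds: float) -> None:
    """Roughly what `stress --cpu <workers> --timeout <seconds>` does:
    one busy-loop worker process per requested CPU."""
    procs = [multiprocessing.Process(target=burn_cpu, args=(seconds,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    apply_load(workers=2, seconds=0.2)
```

In practice `stress` itself is preferable, since it also supports memory, I/O, and disk workers for stressing other resources.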
As expected, CPU time increased on all instance types as CPU utilization rose, though each instance type exhibited distinct behavior and a different threshold at which performance began to degrade. In every case, lower CPU utilization corresponded to better performance and higher utilization to worse.
Intel’s Turbo Boost Technology also affected the results: as utilization increased, CPU frequency decreased, since Turbo Boost can sustain its highest frequencies only while few cores are active, and overall system performance declined with it.
All nodes also had Hyper-Threading enabled, allowing each physical CPU core to operate as two logical cores.
To balance these effects, the team looked for a threshold at which CPU utilization is high enough to avoid wasting capacity but not so high that it significantly impairs performance.
The team identified this threshold with a simple mathematical model: decide what percentage of CPU time (latency) degradation is acceptable, plot CPU utilization against CPU time, and read off the utilization at which degradation crosses that limit.
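The post does not give the model's exact form, but the approach it describes can be sketched as follows: given measured (utilization, latency) pairs, take the latency at the lowest utilization as the baseline and return the highest utilization at which latency stays within an acceptable degradation of that baseline. The function name and sample data below are hypothetical:

```python
def utilization_threshold(samples, acceptable_degradation=0.20):
    """Return the highest CPU utilization (%) at which CPU time (latency)
    stays within `acceptable_degradation` of the low-utilization baseline.

    `samples` is a list of (utilization_pct, cpu_time_ms) pairs.
    """
    samples = sorted(samples)
    baseline = samples[0][1]  # latency at the lowest measured utilization
    limit = baseline * (1 + acceptable_degradation)
    threshold = samples[0][0]
    for util, latency in samples:
        if latency <= limit:
            threshold = util
        else:
            break  # latency curve has crossed the acceptable limit
    return threshold

# Hypothetical measurements: latency climbs sharply past ~60% utilization.
data = [(10, 100), (30, 104), (50, 112), (60, 118), (70, 140), (90, 210)]
print(utilization_threshold(data, acceptable_degradation=0.20))  # → 60
```

Because each instance type has its own latency curve, this calculation would be repeated per instance type to yield per-type utilization targets.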
The team also discovered that certain instances were not reaching their advertised maximum Turbo Boost frequency because a CPU C-state had been disabled, preventing cores from halting even when idle. Since Turbo Boost relies on the power and thermal headroom freed by idle cores, re-enabling the C-state resolved the issue.
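On Linux, the cpuidle driver exposes each C-state's name and disable flag under `/sys/devices/system/cpu/cpuN/cpuidle/`. As a minimal sketch (assuming that sysfs layout; the function name is illustrative and the snippet returns an empty list on systems without the interface), one can check whether any idle state has been disabled:

```python
from pathlib import Path

def cpu0_idle_states():
    """Return (name, disabled) for each C-state that Linux's cpuidle driver
    exposes for cpu0, or [] where the sysfs interface is absent.
    A disabled state means the core never halts in it, which can starve
    Turbo Boost of the headroom it needs to hit peak frequency."""
    states = []
    base = Path("/sys/devices/system/cpu/cpu0/cpuidle")
    if not base.is_dir():
        return states
    for state_dir in sorted(base.glob("state[0-9]*")):
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        states.append((name, disabled))
    return states

for name, disabled in cpu0_idle_states():
    print(f"{name}: {'disabled' if disabled else 'enabled'}")
```

Writing to the `disable` files (as root) is one way such a state could be re-enabled, though on managed nodes the setting is more often controlled by kernel boot parameters or BIOS configuration.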
These insights now inform GitHub’s resource provisioning strategies, helping it maximize its hardware investments.