Understanding the inner workings of Kubernetes resource management means building an end-to-end picture of how it works, from the user-facing abstractions all the way down to the Linux kernel mechanisms that implement them.
Kubernetes pods get scheduled on nodes purely based on their requests. Node “fullness” is request-based, ignoring usage and limits.
For memory resources, there’s no cgroup setting corresponding to the memory request abstraction.
CPU time can be withheld or deferred without terminating the process, though doing so might hurt performance. But when it comes to memory, you either get it or you don’t. There is no try and there is no defer.
When you set a memory limit in Kubernetes, all the container runtime does is plug that number straight into the memory.max control for the container’s cgroup. If the cgroup tries to use more memory than that limit and the kernel can’t reclaim enough to keep it under, the OOMKiller will smite it.
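To make that concrete, here’s a minimal sketch (in Go, not the runtime’s actual code) of what the runtime effectively does on a cgroup v2 host: convert the limit to bytes and write it into the container cgroup’s memory.max file. The cgroup path and function name below are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// applyMemoryLimit mirrors what the container runtime effectively does with a
// Kubernetes memory limit: write the value, in bytes, into the cgroup's
// memory.max file. Nothing cleverer than that.
func applyMemoryLimit(cgroupPath string, limitBytes int64) error {
	target := filepath.Join(cgroupPath, "memory.max")
	return os.WriteFile(target, []byte(fmt.Sprintf("%d", limitBytes)), 0644)
}

func main() {
	// Hypothetical cgroup path; a pod limit of "128Mi" ends up as 134217728 bytes.
	if err := applyMemoryLimit("/sys/fs/cgroup/kubepods/example-pod/example-container", 128*1024*1024); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set memory.max:", err)
	}
}
```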
Kubernetes does not set any cgroup controls based on memory requests.
Kubernetes won’t schedule a new pod onto a node if the memory requests of the containers already running there, plus the new pod’s requests, would exceed the node’s allocatable memory.
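In other words, the fit check is just request arithmetic. Here is a simplified sketch of that check (the real predicate lives in the kube-scheduler; the names here are invented):

```go
package main

import "fmt"

// fitsOnNode is a simplified stand-in for the scheduler's request-based fit
// check: a pod fits only if its memory request, plus the requests of pods
// already on the node, stays within the node's allocatable memory. Actual
// usage and limits never enter into it.
func fitsOnNode(allocatableBytes int64, existingRequests []int64, newRequest int64) bool {
	var requested int64
	for _, r := range existingRequests {
		requested += r
	}
	return requested+newRequest <= allocatableBytes
}

func main() {
	allocatable := int64(8 << 30)        // 8 GiB allocatable on the node
	running := []int64{3 << 30, 2 << 30} // pods already there request 3 GiB and 2 GiB

	fmt.Println(fitsOnNode(allocatable, running, 4<<30)) // false: 9 GiB of requests > 8 GiB
	fmt.Println(fitsOnNode(allocatable, running, 2<<30)) // true: 7 GiB of requests fits
}
```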
The OOMKiller is a Linux kernel mechanism invoked when the kernel needs memory it can’t reclaim, whether because the node as a whole has run out of physical memory or because a cgroup has hit its memory limit.
Kubernetes sets the oom_score_adj for every container process it starts, and it uses some clever math (sketched below) to ensure that containers using more memory than they requested will always be terminated before well-behaved containers.
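The clever math looks roughly like this (a simplified sketch of the kubelet’s OOM score policy; the exact constants and clamping vary by version and shouldn’t be treated as authoritative): Guaranteed pods get a strongly negative adjustment, BestEffort pods get the maximum, and Burstable pods get a score that shrinks as their memory request grows relative to the node’s capacity, so the less a container asked for, the juicier an OOM target it becomes.

```go
package main

import "fmt"

// oomScoreAdj approximates how the kubelet biases the OOMKiller: Guaranteed
// pods are heavily protected, BestEffort pods are first in line, and
// Burstable pods get a score inversely proportional to how much memory they
// requested relative to node capacity. Constants and clamping are only an
// approximation of the kubelet's real policy.
func oomScoreAdj(qosClass string, memoryRequestBytes, nodeCapacityBytes int64) int {
	switch qosClass {
	case "Guaranteed":
		return -997
	case "BestEffort":
		return 1000
	}
	// Burstable: the bigger the request, the lower (safer) the score.
	adj := 1000 - (1000*memoryRequestBytes)/nodeCapacityBytes
	if adj < 3 {
		adj = 3 // stay out of the protected range reserved for system/Guaranteed processes
	}
	if adj >= 1000 {
		adj = 999 // stay below BestEffort
	}
	return int(adj)
}

func main() {
	capacity := int64(16 << 30) // 16 GiB node

	fmt.Println(oomScoreAdj("Burstable", 1<<30, capacity)) // small request -> 938, killed early
	fmt.Println(oomScoreAdj("Burstable", 8<<30, capacity)) // big request   -> 500, killed later
	fmt.Println(oomScoreAdj("Guaranteed", 0, capacity))    // -997, protected
}
```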
Making a reasonable decision about which process to kill when something needs to be killed is great, but ideally, we want to avoid having to ever make that decision in the first place.