SGD minimizes a free energy function $F = U - TS$ during neural network training, balancing the training loss $U$ against the entropy $S$ of the weight distribution.
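In symbols (a sketch using generic notation rather than the paper's exact symbols), with $\rho$ the distribution over weights $\theta$ that SGD samples from and $L(\theta)$ the training loss:

$$
F[\rho] \;=\; \underbrace{\mathbb{E}_{\theta \sim \rho}\big[L(\theta)\big]}_{U} \;-\; T\,\underbrace{\Big(-\!\int \rho(\theta)\log\rho(\theta)\,d\theta\Big)}_{S},
$$

so lowering $F$ trades off finding low-loss weights (small $U$) against spreading probability mass over many weight configurations (large $S$).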
The temperature $T$ in the free energy is set by the learning rate: a larger learning rate corresponds to a higher temperature, which explains why training runs with different learning rates converge and then stabilize at different loss levels.
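A minimal sketch of this dependence (not the paper's experiment; the quadratic loss, noise model, and function name below are illustrative assumptions): SGD on a quadratic with additive gradient noise stabilizes at a loss floor that grows with the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationary_loss(lr, noise_std=1.0, steps=20_000, dim=10):
    """SGD on L(w) = 0.5 * ||w||^2 with noisy gradients g = w + noise.

    Returns the average loss over the last half of training, i.e. the
    level the loss stabilizes at once the initial transient has decayed.
    """
    w = rng.normal(size=dim)
    losses = []
    for _ in range(steps):
        g = w + noise_std * rng.normal(size=dim)  # stochastic gradient of L
        w -= lr * g
        losses.append(0.5 * w @ w)
    return float(np.mean(losses[steps // 2:]))

for lr in (0.01, 0.03, 0.1, 0.3):
    print(f"lr={lr:<4}  stationary loss ~ {stationary_loss(lr):.4f}")
```

For this toy model the stationary loss works out analytically to $\mathbb{E}[L] = \frac{d\,\eta\,\sigma^2}{2(2-\eta)}$, roughly linear in the learning rate $\eta$ when $\eta$ is small, mirroring a temperature that rises with the learning rate.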
Empirical validation of the free energy framework shows that underparameterized models consistently follow free energy minimization, with temperature increasing with the learning rate. Overparameterized models, in contrast, converge to the optimal loss at low learning rates, effectively driving the temperature to zero.
The difference between the two regimes is attributed to the signal-to-noise ratio of stochastic gradients near optima: an overparameterized model can interpolate the training data, so per-example gradients (and hence gradient noise) vanish near its optimum, whereas an underparameterized model retains residual errors that keep injecting noise. Experimental results support this explanation, and the sketch below illustrates the effect.
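A hedged illustration (not the paper's experiment; the least-squares setup, learning rates, and function name are assumptions made for the sketch): batch-size-1 SGD on linear regression, where "overparameterized" means more features than examples, so the model can interpolate.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_train_loss(n_samples, n_features, lr, steps=30_000):
    """Single-example SGD on 0.5 * mean((X @ w - y)**2); returns final loss.

    Rows of X are normalized to unit norm so the per-example update is
    stable for lr < 2 in both regimes, making the comparison fair.
    """
    X = rng.normal(size=(n_samples, n_features))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.normal(size=n_samples)            # arbitrary targets
    w = np.zeros(n_features)
    for _ in range(steps):
        i = rng.integers(n_samples)           # sample one example
        w -= lr * (X[i] @ w - y[i]) * X[i]    # per-example gradient step
    return 0.5 * np.mean((X @ w - y) ** 2)    # full training loss

for lr in (0.1, 0.5):
    over = sgd_train_loss(n_samples=20, n_features=100, lr=lr)
    under = sgd_train_loss(n_samples=100, n_features=20, lr=lr)
    print(f"lr={lr}: overparameterized loss={over:.2e}, "
          f"underparameterized loss={under:.3f}")
```

The overparameterized run reaches (numerically) zero loss at both learning rates because every per-example gradient vanishes at the interpolating solution, while the underparameterized run stalls at a floor above the least-squares optimum that grows with the learning rate, consistent with the signal-to-noise-ratio explanation.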