A study explores the connection between gradient-based optimization of parametric models like neural networks and optimization of linear combinations of random features.
The main finding is that if a parametric model can be learned by mini-batch stochastic gradient descent (bSGD) without any assumptions on the data distribution, then, with high probability, the target function can be approximated by a polynomial-sized linear combination of random features.
The required number of random features depends on the number of gradient steps and on the numerical precision used in the bSGD process.
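As a rough illustration of the two model classes the result relates, the sketch below trains a small parametric model (a two-layer network) with mini-batch SGD and, separately, fits a linear combination of random features on a frozen random embedding. This is not the paper's construction or bound; the architecture, feature count, and hyperparameters are arbitrary choices made for the example.

```python
# Illustrative sketch only (not the paper's reduction): compare a parametric
# model trained by mini-batch SGD with a linear combination of random features.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; neither learner is told the data distribution.
d, n = 10, 2000
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))           # some unknown target function

# (a) Parametric model: two-layer ReLU network trained by mini-batch SGD.
width, batch, steps, lr = 64, 32, 2000, 0.05  # arbitrary illustrative values
W = rng.normal(size=(d, width)) / np.sqrt(d)
a = np.zeros(width)
for _ in range(steps):
    idx = rng.integers(0, n, size=batch)
    h = np.maximum(X[idx] @ W, 0.0)           # ReLU hidden layer
    pred = h @ a
    grad_out = (pred - y[idx]) / batch        # squared-loss gradient
    a -= lr * h.T @ grad_out
    W -= lr * X[idx].T @ (np.outer(grad_out, a) * (h > 0))

# (b) Random-features model: random frozen features, only the linear
# coefficients are learned (here via ridge-regularized least squares).
num_features = 512                            # stand-in for "polynomial-sized"
V = rng.normal(size=(d, num_features)) / np.sqrt(d)
Phi = np.maximum(X @ V, 0.0)                  # random ReLU features
coef = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(num_features), Phi.T @ y)

print("SGD-trained net MSE:      ", np.mean((np.maximum(X @ W, 0.0) @ a - y) ** 2))
print("Random-features model MSE:", np.mean((Phi @ coef - y) ** 2))
```

The contrast is the point of the sketch: in (a) all parameters move under gradient updates, while in (b) the features are fixed at random and only the linear coefficients on top of them are fitted, which is the sense in which the result says distribution-free bSGD learning can be matched by a sufficiently large (polynomial-sized) random-features model.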
The study highlights the limitations of distribution-free learning for neural networks trained by gradient descent and emphasizes that, in practice, making assumptions about the data distribution matters.