Researchers have studied the approximation capabilities and convergence behavior of one-layer transformers on in-context reasoning and next-token prediction tasks.
The work addresses gaps in the theoretical understanding by proving that certain one-layer transformers with linear attention and with ReLU attention are Bayes optimal.
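For concreteness, here is a minimal NumPy sketch of the two attention variants named above; the weight shapes, the 1/n normalization, and the function names are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def linear_attention(X, W_q, W_k, W_v):
    # Linear attention: raw dot-product scores, no softmax.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return (Q @ K.T / X.shape[0]) @ V

def relu_attention(X, W_q, W_k, W_v):
    # ReLU attention: an elementwise ReLU replaces the softmax.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return (np.maximum(Q @ K.T, 0.0) / X.shape[0]) @ V

rng = np.random.default_rng(0)
n, d = 8, 4                       # context length, token dimension
X = rng.standard_normal((n, d))   # toy in-context prompt
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
print(linear_attention(X, W_q, W_k, W_v).shape)  # (8, 4)
print(relu_attention(X, W_q, W_k, W_v).shape)    # (8, 4)
```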
A finite-sample analysis further shows that, when trained with gradient descent, the expected loss of these transformers converges to the Bayes risk at a linear rate.
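Linear-rate (geometric) convergence here can be read schematically as a bound of the form

\[
\mathbb{E}[\mathcal{L}(\theta_t)] - \mathcal{L}^\star \;\le\; \rho^{t}\,\bigl(\mathbb{E}[\mathcal{L}(\theta_0)] - \mathcal{L}^\star\bigr), \qquad \rho \in (0,1),
\]

where \(\theta_t\) denotes the parameters after \(t\) gradient-descent steps and \(\mathcal{L}^\star\) is the Bayes risk; the contraction factor \(\rho\) and the symbols used are placeholders rather than the paper's notation.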
The study also demonstrates that the trained models generalize to unseen samples and exhibit the learning behaviors observed empirically in prior work.