Theoretical analysis suggests that Transformers can outperform traditional feedforward and recurrent neural networks. In particular, Transformers can adapt to dynamic sparsity in the input, which yields improved sample complexity.
A single-layer Transformer can learn a sequence-to-sequence data-generating model with a small sample complexity that depends on the number of attention heads. By comparison, recurrent networks require significantly more samples to learn the same problem.
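To make the object of these claims concrete, the sketch below shows the kind of single-layer Transformer the statement refers to: one multi-head self-attention layer followed by a linear readout, mapping an input sequence to an output sequence. The module name, dimensions, and head count are hypothetical choices for illustration, not the exact construction analyzed in the theory.

```python
import torch
import torch.nn as nn

class SingleLayerTransformer(nn.Module):
    """Minimal sketch of a single-layer Transformer:
    multi-head self-attention followed by a linear readout."""

    def __init__(self, d_model: int = 32, num_heads: int = 4, d_out: int = 16):
        super().__init__()
        # num_heads is the quantity the sample-complexity claim depends on:
        # more heads let the layer attend to more positions in parallel.
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.readout = nn.Linear(d_model, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); self-attention over the sequence,
        # then a per-position readout gives a sequence-to-sequence map.
        attn_out, _ = self.attn(x, x, x)
        return self.readout(attn_out)


if __name__ == "__main__":
    model = SingleLayerTransformer()
    x = torch.randn(8, 10, 32)   # batch of 8 sequences of length 10
    y = model(x)
    print(y.shape)               # torch.Size([8, 10, 16])
```

Because attention weights are recomputed for every input, the layer can place its "focus" on whichever positions matter for a given sequence, which is one way to read the adaptability-to-dynamic-sparsity advantage mentioned above; a recurrent network must instead carry all potentially relevant information through its fixed-size hidden state.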