Tokenization is widely regarded as a necessary first step in building performant language models.
Without tokenization, transformers trained on data drawn from certain simple Markov processes fail to learn the correct distribution and instead predict characters according to a unigram model.
With tokenization, transformers break through this barrier and model the probabilities of sequences drawn from the source near-optimally.
Studying transformers on Markovian data in this way provides a principled justification for the use of tokenization in language modeling.
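To make the two quantitative claims above concrete, here is a minimal numerical sketch, not the paper's exact construction: a symmetric binary first-order Markov source with an assumed switching probability p = 0.05, a character-level unigram model, and a unigram model over run-length tokens (an illustrative tokenizer choice). The character-level unigram model is pinned at about 1 bit per character by the uniform stationary distribution, while a unigram model over run tokens pays roughly (1 + h(p)/p)·p bits per character, which approaches the entropy rate h(p) as p shrinks.

```python
"""Illustrative sketch: character-level unigram vs. unigram-over-tokens
cross-entropy on a binary first-order Markov source (assumed p = 0.05)."""
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

p = 0.05            # switching probability of the Markov chain (assumed)
n = 1_000_000       # length of the sampled character sequence


def binary_entropy(q: float) -> float:
    """Entropy in bits of a Bernoulli(q) variable."""
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))


# --- sample the chain: each character flips the previous one with prob p ----
flips = rng.random(n) < p
flips[0] = False
x = ((rng.integers(2) + np.cumsum(flips)) % 2).astype(np.int8)

# Optimal per-character cross-entropy is the entropy rate of the chain.
entropy_rate = binary_entropy(p)

# --- best character-level unigram model --------------------------------------
# The stationary marginal is uniform, so this model is stuck near 1 bit/char
# no matter how predictable the chain is: the "unigram barrier".
freq = np.bincount(x, minlength=2) / n
char_unigram_ce = -np.mean(np.log2(freq[x]))

# --- run-length tokenization: tokens are maximal runs, i.e. (symbol, length) -
change = np.flatnonzero(np.diff(x)) + 1
starts = np.concatenate(([0], change))
ends = np.concatenate((change, [n]))
tokens = list(zip(x[starts].tolist(), (ends - starts).tolist()))

# --- unigram model over tokens ------------------------------------------------
# Charge each token -log2 of its empirical frequency, then normalize by the
# number of characters to get a per-character cross-entropy.
counts = Counter(tokens)
total = len(tokens)
token_bits = -sum(c * np.log2(c / total) for c in counts.values())
token_unigram_ce = token_bits / n

print(f"entropy rate (optimal):        {entropy_rate:.3f} bits/char")
print(f"character-level unigram model: {char_unigram_ce:.3f} bits/char")
print(f"unigram model over run tokens: {token_unigram_ce:.3f} bits/char")
```

Under these assumptions the character-level unigram model stays near 1 bit per character while the unigram model over run tokens lands much closer to the entropy rate of roughly 0.29 bits per character, illustrating how a suitable tokenizer lets even a very weak downstream model approach the source's optimal cross-entropy.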