Tokenization is widely regarded as a necessary first step in building performant language models.
Without tokenization, transformers trained on data drawn from certain simple Markov processes fail to learn the correct distribution and instead predict characters according to a unigram model.
With tokenization, transformers break through this barrier and model the probabilities of sequences drawn from the source near-optimally.
Studying transformers on Markovian data in this way provides a principled justification for the use of tokenization in language modeling.
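To make the two quantitative claims above concrete, here is a minimal numerical sketch, not the paper's exact construction: a symmetric binary first-order Markov source with an assumed switching probability p = 0.05, a character-level unigram model, and a unigram model over run-length tokens (an illustrative tokenizer choice). The character-level unigram model is pinned at about 1 bit per character by the uniform stationary distribution, while a unigram model over run tokens pays roughly (1 + h(p)/p)·p bits per character, which approaches the entropy rate h(p) as p shrinks.

```python
"""Illustrative sketch: character-level unigram vs. unigram-over-tokens
cross-entropy on a binary first-order Markov source (assumed p = 0.05)."""
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

p = 0.05            # switching probability of the Markov chain (assumed)
n = 1_000_000       # length of the sampled character sequence


def binary_entropy(q: float) -> float:
    """Entropy in bits of a Bernoulli(q) variable."""
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))


# --- sample the chain: each character flips the previous one with prob p ----
flips = rng.random(n) < p
flips[0] = False
x = ((rng.integers(2) + np.cumsum(flips)) % 2).astype(np.int8)

# Optimal per-character cross-entropy is the entropy rate of the chain.
entropy_rate = binary_entropy(p)

# --- best character-level unigram model --------------------------------------
# The stationary marginal is uniform, so this model is stuck near 1 bit/char
# no matter how predictable the chain is: the "unigram barrier".
freq = np.bincount(x, minlength=2) / n
char_unigram_ce = -np.mean(np.log2(freq[x]))

# --- run-length tokenization: tokens are maximal runs, i.e. (symbol, length) -
change = np.flatnonzero(np.diff(x)) + 1
starts = np.concatenate(([0], change))
ends = np.concatenate((change, [n]))
tokens = list(zip(x[starts].tolist(), (ends - starts).tolist()))

# --- unigram model over tokens ------------------------------------------------
# Charge each token -log2 of its empirical frequency, then normalize by the
# number of characters to get a per-character cross-entropy.
counts = Counter(tokens)
total = len(tokens)
token_bits = -sum(c * np.log2(c / total) for c in counts.values())
token_unigram_ce = token_bits / n

print(f"entropy rate (optimal):        {entropy_rate:.3f} bits/char")
print(f"character-level unigram model: {char_unigram_ce:.3f} bits/char")
print(f"unigram model over run tokens: {token_unigram_ce:.3f} bits/char")
```

Under these assumptions the character-level unigram model stays near 1 bit per character while the unigram model over run tokens lands much closer to the entropy rate of roughly 0.29 bits per character, illustrating how a suitable tokenizer lets even a very weak downstream model approach the source's optimal cross-entropy.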