<ul><li>Tokenization can affect model performance by skewing a model's next-byte predictions, a phenomenon called tokenization bias.</li><li>The Byte-Token Representation Lemma establishes a mapping between token-level and byte-level distributions.</li><li>A next-byte sampling algorithm eliminates tokenization bias, converting tokenized LMs into token-free ones.</li><li>The method improves performance on fill-in-the-middle (FIM) tasks and in model ensembles across several benchmarks.</li></ul>
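
A toy illustration may help make tokenization bias concrete. The sketch below is not the paper's algorithm. It assumes a hypothetical three-token vocabulary (`AA`, `A`, `B`) and a greedy longest-prefix-match tokenizer, and the function names are made up for this example. It enumerates every string over the bytes `A` and `B` of a fixed length and records which tokens can follow the token `A`. At the byte level, `A` can be followed by either `A` or `B`. At the token level, the token `A` is never followed by a token that starts with `A`, because the tokenizer would have merged the two into `AA`. A model trained on such token sequences would therefore put no probability on an `A` continuation when the prompt ends in the token `A`.

```python
from itertools import product

# Hypothetical toy vocabulary, chosen only for this illustration.
VOCAB = ["AA", "A", "B"]


def tokenize(text):
    """Greedy longest-prefix-match tokenization over VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try longer vocabulary entries first so "AA" wins over "A".
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return tokens


def tokens_following(prev_token, length=8):
    """Collect every token that directly follows prev_token
    across all strings over {A, B} of the given length."""
    followers = set()
    for chars in product("AB", repeat=length):
        toks = tokenize("".join(chars))
        for prev, nxt in zip(toks, toks[1:]):
            if prev == prev_token:
                followers.add(nxt)
    return followers


def bytes_following(prev_byte, length=8):
    """Collect every byte that directly follows prev_byte
    across all strings over {A, B} of the given length."""
    followers = set()
    for chars in product("AB", repeat=length):
        for prev, nxt in zip(chars, chars[1:]):
            if prev == prev_byte:
                followers.add(nxt)
    return followers


if __name__ == "__main__":
    print("bytes after byte 'A':  ", sorted(bytes_following("A")))
    print("tokens after token 'A':", sorted(tokens_following("A")))
```

The example only compares which continuations are possible, not their probabilities. The gap between the two sets is the kind of mismatch that matters when a prompt ends in the middle of a token, as can happen in fill-in-the-middle tasks.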

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
