Researchers from the University of Washington, NVIDIA, and the Allen Institute for AI have challenged the assumption that language model tokenizers should split text strictly along word boundaries.
Their paper, 'SuperBPE: Space Travel for Language Models', argues that allowing tokens to cross word boundaries makes language models both more efficient and more capable.
The researchers report that this approach encodes text with up to 33% fewer tokens, requires 27% less compute, and improves average performance by 4% across a diverse set of downstream tasks.
By questioning the convention of whitespace-based pretokenization, the study demonstrates the advantages of letting tokens travel across word boundaries.
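To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation or training recipe) contrasting a toy BPE trainer that pretokenizes on whitespace with one that lets merges cross spaces; the corpus, merge budget, and the helper `train_bpe` are invented for illustration.

```python
from collections import Counter

def train_bpe(sequences, num_merges):
    """Tiny greedy BPE: repeatedly merge the most frequent adjacent pair."""
    seqs = [list(s) for s in sequences]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # nothing left to merge
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs

corpus = "by the way by the way by the way"
budget = 10  # same merge budget for both runs

# Conventional BPE: pretokenize on whitespace, so no merge can ever
# cross a word boundary; it plateaus at one token per word.
word_level = train_bpe(corpus.split(" "), budget)
standard_tokens = sum(len(seq) for seq in word_level)

# Superword-style: treat the whole string (spaces included) as one
# sequence, so a frequent phrase like "by the way" can keep merging
# into a single multi-word token.
superword_level = train_bpe([corpus], budget)
superword_tokens = len(superword_level[0])

print("standard BPE token count:   ", standard_tokens)   # 9 (one per word)
print("superword-style token count:", superword_tokens)  # far fewer
```

On this toy corpus, the whitespace-bound tokenizer can never do better than one token per word, while the superword-style run collapses the recurring phrase into a handful of multi-word tokens; this is the intuition behind the paper's reported reduction in token counts.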