Researchers from Google, the Max Planck Institute, and Peking University have introduced a new approach called TokenFormer that addresses the scaling issues faced by the traditional Transformer architecture.
TokenFormer introduces a token-parameter attention (Pattention) layer that treats model parameters as learnable tokens, enabling incremental scaling without retraining the entire model from scratch.
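At a high level, the Pattention layer replaces a fixed linear projection with attention between input tokens and a set of learnable key-value "parameter tokens." The sketch below is a minimal illustration of that idea, not the authors' implementation: the class name, the scaling factor, and the use of a plain GeLU in place of the paper's modified softmax normalization are all simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Minimal token-parameter attention sketch: input tokens attend over
    learnable key/value "parameter tokens" instead of a fixed weight matrix."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable parameter tokens acting as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        scores = x @ self.key_params.t() / (x.shape[-1] ** 0.5)  # (batch, seq, n_params)
        # The paper uses a modified normalization instead of softmax; a plain GeLU
        # is only a rough stand-in, chosen so zero-scoring tokens contribute nothing.
        weights = F.gelu(scores)
        return weights @ self.value_params                       # (batch, seq, dim_out)
```

Because the number of parameter tokens is a free axis, capacity can be grown by adding rows to the key and value parameters rather than by widening every projection in the network.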
The approach has demonstrated impressive results, scaling progressively from 124M to 1.4B parameters while maintaining performance comparable to Transformers trained from scratch.
One of TokenFormer's most compelling features is its ability to preserve existing knowledge while scaling, offering a new approach to continual learning.
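Knowledge preservation follows from how new capacity is added: new parameter tokens can be appended and zero-initialized so the layer's output is initially unchanged, and training then continues from that point. The following is a hedged sketch of that growth step, building on the hypothetical Pattention class above; the exact initialization and training schedule in the paper may differ.

```python
import torch
import torch.nn as nn

def grow_pattention(layer: Pattention, extra_tokens: int) -> None:
    """Append zero-initialized parameter tokens to the (hypothetical) Pattention
    layer above. With the GeLU weighting used there, the new tokens contribute
    nothing at first, so the layer's output, and hence the knowledge already
    learned, is preserved until further training updates them."""
    dim_in = layer.key_params.shape[1]
    dim_out = layer.value_params.shape[1]
    new_keys = torch.zeros(extra_tokens, dim_in)
    new_values = torch.zeros(extra_tokens, dim_out)
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys]))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values]))
```

Repeating this grow-then-train cycle is what allows the model to move from smaller to larger parameter counts without restarting training from scratch.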
In benchmark tests, TokenFormer achieved performance comparable to standard Transformers while requiring only one-tenth of the computational budget.
This efficiency extends to both language and vision tasks, with the model demonstrating competitive performance across various benchmarks, including zero-shot evaluations and image classification tasks.
Furthermore, because TokenFormer scales by adding parameter tokens rather than widening channel dimensions, the computational cost of token-token interactions stays constant as parameters grow, making it better suited to processing longer sequences (see the rough cost comparison below).
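A back-of-the-envelope cost model makes the distinction concrete. The figures below are illustrative estimates, not measurements from the paper: the token-token term grows with sequence length but is independent of the number of parameter tokens, so only the token-parameter term increases as the model is scaled.

```python
def approx_attention_flops(seq_len: int, channel_dim: int, num_param_tokens: int) -> dict:
    """Rough per-layer FLOP estimates (illustrative assumptions, not paper numbers)."""
    # Token-token self-attention: QK^T scores plus the weighted sum over values.
    token_token = 2 * seq_len * seq_len * channel_dim * 2
    # Token-parameter attention: scores against the parameter keys plus the
    # weighted sum over the parameter values.
    token_parameter = 2 * seq_len * num_param_tokens * channel_dim * 2
    return {"token_token": token_token, "token_parameter": token_parameter}

# Growing the parameter tokens (e.g. 1024 -> 4096) leaves the token-token term untouched.
print(approx_attention_flops(seq_len=2048, channel_dim=768, num_param_tokens=1024))
print(approx_attention_flops(seq_len=2048, channel_dim=768, num_param_tokens=4096))
```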
However, some Hacker News users have raised concerns, saying it is hard to trust the numbers reported in the research.
TokenFormer provides a new level of modularity and compatibility between publicly available weight sets, assuming they use similar channel dimensions.
While the approach looks promising on paper, we'll have to wait for developers to implement it in actual models.