Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference; these fast weights act as a temporary memory of earlier tokens in the current sequence.
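To make the mechanism concrete, below is a minimal sketch of the fast-weight idea, not the paper's actual architecture: a single linear fast-weight state updated online by one gradient step per token on a self-supervised reconstruction loss. The function name `fast_weight_step`, the learning rate, and the loss are illustrative assumptions.

```python
import torch

def fast_weight_step(W, k, v, lr=0.1):
    """One online update: nudge W so that W @ k better reconstructs v.
    This is a gradient-descent step on 0.5 * ||v - W @ k||^2 w.r.t. W."""
    pred = W @ k                       # read the memory with key k
    err = v - pred                     # self-supervised reconstruction error
    return W + lr * torch.outer(err, k)  # rank-1 gradient update

d = 16
W = torch.zeros(d, d)                  # fast weights: temporary memory, reset per sequence
keys = torch.randn(8, d)               # per-token keys derived from the sequence (toy data)
vals = torch.randn(8, d)               # per-token values to memorize (toy data)
for k, v in zip(keys, vals):           # sequential (online) update over past tokens
    W = fast_weight_step(W, k, v)
out = W @ keys[0]                      # later tokens query the stored memory
```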
Existing TTT methods struggle to handle long-context data efficiently on modern GPUs: their fast-weight updates use very small online minibatches, which leads to low FLOPs utilization and imposes fine-grained sequential dependencies that make the approach hard to apply beyond 1D ordered sequences.
A new approach, Large Chunk Test-Time Training (LaCT), updates the fast weights with extremely large chunks (from 2K to 1M tokens, depending on the task), which significantly improves hardware utilization, enlarges the fast-weight state capacity, and makes it easy to integrate sophisticated optimizers.
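The efficiency idea can be sketched as follows, again assuming a simple linear fast-weight state: gradients are accumulated over an entire large chunk and applied in one batched step, so the update is dominated by large matrix multiplications, and optimizer state such as momentum becomes cheap relative to the chunk. The name `chunk_update`, the chunk size, and all hyperparameters here are illustrative assumptions, not LaCT's exact formulation.

```python
import torch

def chunk_update(W, K, V, momentum, lr=0.1, beta=0.9):
    """One fast-weight update over a whole chunk; K, V are (chunk_size, d).
    Gradient of the mean loss 0.5 * ||V - K @ W.T||^2 over the chunk."""
    pred = K @ W.T                           # batched memory read for the full chunk
    grad = -(V - pred).T @ K / K.shape[0]    # one large matmul instead of per-token updates
    momentum = beta * momentum + grad        # momentum: an example of the kind of
    W = W - lr * momentum                    # optimizer that per-token TTT cannot afford
    return W, momentum

d, chunk = 16, 4096                          # large chunks keep the GPU's matmul units busy
W = torch.zeros(d, d)
momentum = torch.zeros_like(W)
for _ in range(4):                           # iterate over chunks of a long sequence (toy data)
    K, V = torch.randn(chunk, d), torch.randn(chunk, d)
    W, momentum = chunk_update(W, K, V, momentum)
```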
LaCT has been validated across multiple modalities and tasks, scaling up to a 14B-parameter autoregressive video diffusion model and enabling novel view synthesis with a context length of 1 million tokens, with the aim of advancing research on long-context modeling and test-time training.