Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference; these fast weights act as a temporary memory of earlier tokens in the current sequence.
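To make the mechanism concrete, below is a minimal sketch of the fast-weight idea, not the paper's actual architecture: a single linear fast-weight state updated online by one gradient step per token on a self-supervised reconstruction loss. The function name `fast_weight_step`, the learning rate, and the loss are illustrative assumptions.

```python
import torch

def fast_weight_step(W, k, v, lr=0.1):
    """One online update: nudge W so that W @ k better reconstructs v.
    This is a gradient-descent step on 0.5 * ||v - W @ k||^2 w.r.t. W."""
    pred = W @ k                       # read the memory with key k
    err = v - pred                     # self-supervised reconstruction error
    return W + lr * torch.outer(err, k)  # rank-1 gradient update

d = 16
W = torch.zeros(d, d)                  # fast weights: temporary memory, reset per sequence
keys = torch.randn(8, d)               # per-token keys derived from the sequence (toy data)
vals = torch.randn(8, d)               # per-token values to memorize (toy data)
for k, v in zip(keys, vals):           # sequential (online) update over past tokens
    W = fast_weight_step(W, k, v)
out = W @ keys[0]                      # later tokens query the stored memory
```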
Existing TTT methods struggle to handle long-context data efficiently on modern GPUs: their fast-weight updates use very small online minibatches, which leads to low FLOPs utilization and imposes fine-grained sequential dependencies that make the approach hard to apply beyond 1D ordered sequences.
A new approach, Large Chunk Test-Time Training (LaCT), updates the fast weights with extremely large chunks (from 2K to 1M tokens, depending on the task), which significantly improves hardware utilization, enlarges the fast-weight state capacity, and makes it easy to integrate sophisticated optimizers.
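The efficiency idea can be sketched as follows, again assuming a simple linear fast-weight state: gradients are accumulated over an entire large chunk and applied in one batched step, so the update is dominated by large matrix multiplications, and optimizer state such as momentum becomes cheap relative to the chunk. The name `chunk_update`, the chunk size, and all hyperparameters here are illustrative assumptions, not LaCT's exact formulation.

```python
import torch

def chunk_update(W, K, V, momentum, lr=0.1, beta=0.9):
    """One fast-weight update over a whole chunk; K, V are (chunk_size, d).
    Gradient of the mean loss 0.5 * ||V - K @ W.T||^2 over the chunk."""
    pred = K @ W.T                           # batched memory read for the full chunk
    grad = -(V - pred).T @ K / K.shape[0]    # one large matmul instead of per-token updates
    momentum = beta * momentum + grad        # momentum: an example of the kind of
    W = W - lr * momentum                    # optimizer that per-token TTT cannot afford
    return W, momentum

d, chunk = 16, 4096                          # large chunks keep the GPU's matmul units busy
W = torch.zeros(d, d)
momentum = torch.zeros_like(W)
for _ in range(4):                           # iterate over chunks of a long sequence (toy data)
    K, V = torch.randn(chunk, d), torch.randn(chunk, d)
    W, momentum = chunk_update(W, K, V, momentum)
```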
LaCT has been validated across multiple modalities and tasks, scaling up to a 14B-parameter autoregressive video diffusion model and enabling novel view synthesis with a context length of 1 million tokens, with the aim of advancing research on long-context modeling and test-time training.