MIT researchers have achieved a record 61.9% accuracy on the Abstraction and Reasoning Corpus (ARC) benchmark using test-time training.
François Chollet, creator of Keras, built the ARC-AGI benchmark to measure progress on logical reasoning abilities.
The current leader, MindsAI, scored 55% by using a technique that fine-tunes the model at the time of testing.
Although MIT scored 61.9%, MindsAI remains the leader because of the competition's time-limit requirements and private data usage guidelines.
MIT updated the model's parameters using low-rank adaptation (LoRA), after an initial fine-tuning on a publicly available ARC-AGI dataset.
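To make the low-rank adaptation idea concrete, here is a minimal sketch of a LoRA-style layer in plain Python. It is not MIT's code: the function names, the tiny matrices, and the `alpha` scaling factor are illustrative assumptions that follow the commonly described LoRA formulation, in which a frozen weight matrix W is left untouched and only two small matrices A and B are trained.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha):
    """Compute x @ W + (alpha / r) * (x @ A @ B).

    x: (n, d_in) inputs
    W: (d_in, d_out) frozen pretrained weights (never updated)
    A: (d_in, r) and B: (r, d_out) are the small trainable matrices
    alpha: scaling factor (hypothetical value chosen by the caller)
    """
    r = len(B)  # rank of the low-rank update
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


def trainable_parameter_counts(d_in, d_out, r):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return d_in * d_out, r * (d_in + d_out)
```

With B initialised to zeros the layer reproduces the frozen model exactly, so training starts from the pretrained behaviour. The appeal at test time is the parameter count: for a 1024 by 1024 matrix and rank 8, the adapter has 16,384 trainable values instead of 1,048,576.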
The test-time training technique strengthens the model’s understanding of the ARC problem dataset by omitting examples and learning from the rest.
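One way to picture the "omit examples and learn from the rest" step is as a leave-one-out split over a task's demonstration pairs. The sketch below is an assumption about how such training items could be built, not the authors' implementation; the function name and the dictionary keys are hypothetical.

```python
def leave_one_out_tasks(demos):
    """Build one training item per demonstration pair.

    demos: list of (input_grid, output_grid) pairs for a single ARC task.
    Each item holds one pair out as the query/target and keeps the
    remaining pairs as context the model can learn from.
    """
    tasks = []
    for i, (held_in, held_out) in enumerate(demos):
        context = demos[:i] + demos[i + 1:]  # every pair except the i-th
        tasks.append({"context": context, "query": held_in, "target": held_out})
    return tasks


# Usage: three demonstration pairs yield three training items.
demos = [("in_a", "out_a"), ("in_b", "out_b"), ("in_c", "out_c")]
items = leave_one_out_tasks(demos)
```

Each item asks the model to predict the held-out output from the other pairs, which is the same shape of problem it faces on the real test input.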
The model then votes on its predictions by frequency: it picks the top predictions within each transformation, compares those candidates across transformations, selects the most frequent output, and maps it back to the original input format.
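The voting step can be sketched as a two-stage majority vote over grid predictions. This is a simplified illustration under stated assumptions: the three transformations, the `top_k` value, and the function names are hypothetical, and the sketch maps each prediction back to the original orientation before comparing candidates, since predictions made under different transformations are not directly comparable otherwise.

```python
from collections import Counter


def identity(grid):
    return grid


def transpose(grid):
    return tuple(zip(*grid))


def flip_lr(grid):
    return tuple(tuple(reversed(row)) for row in grid)


# Each entry maps a name to (forward transform, inverse transform).
# All three happen to be their own inverse.
TRANSFORMS = {
    "identity": (identity, identity),
    "transpose": (transpose, transpose),
    "flip_lr": (flip_lr, flip_lr),
}


def vote(predictions_by_transform, top_k=2):
    """Pick a final grid by frequency.

    predictions_by_transform: name -> list of predicted grids (tuples of
    tuples), each expressed in that transformation's frame.
    """
    candidates = []
    for name, preds in predictions_by_transform.items():
        _, inverse = TRANSFORMS[name]
        restored = [inverse(p) for p in preds]  # back to the original frame
        # Stage 1: keep the most frequent predictions for this transformation.
        candidates.extend(g for g, _ in Counter(restored).most_common(top_k))
    # Stage 2: vote across the per-transformation winners.
    return Counter(candidates).most_common(1)[0][0]
```

For example, if a grid is predicted under all three transformations and a competing grid appears under only one, the first grid wins the second-stage vote.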
Test-time methods could play a pivotal role in advancing the next generation of Large Language Models.
ARC-AGI is still the only benchmark designed to resist memorisation and to measure progress towards closing the gap between current AI and AGI.
As the data corpus grows, the boundaries between specialised and general-purpose models tend to blur.