Large language models (LLMs) have grown rapidly, with some recent models containing trillions of parameters, creating substantial memory and compute challenges during training.
Efforts to address these challenges include parameter-efficient approaches such as LoRA, which have proven effective for fine-tuning but are harder to apply to pre-training, where the model must learn vast datasets from scratch rather than adapt an already-trained one.
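For intuition, LoRA constrains each weight update to a product of two low-rank matrices, so only a small fraction of parameters is trained. A minimal PyTorch-style sketch of the idea (class and parameter names here are illustrative, not taken from the study):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T
```

With r much smaller than the layer dimensions, the trainable factors add only r * (in_features + out_features) parameters per layer, which is the source of the memory savings.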
The study aims to determine whether parameter- or memory-efficient methods can make pre-training more efficient while maintaining performance comparable to full-model training, and proposes practical techniques such as weight refactorization and momentum reset toward this goal.
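One way to read these two techniques (a sketch of the general idea, not necessarily the study's exact procedure): periodically merge the low-rank factors into the base weights and restart both the factors and their optimizer state, so that successive low-rank updates can accumulate into a high-rank change. Building on the illustrative LoRALinear above:

```python
import torch

@torch.no_grad()
def refactorize_and_reset(layer, optimizer):
    """Merge the current low-rank update into the base weights,
    reinitialize the factors, and clear their optimizer momentum.
    `layer` is the LoRALinear sketch above; names are illustrative."""
    # Weight refactorization: fold the rank-r update B @ A into the base matrix.
    layer.base.weight += layer.scaling * (layer.B @ layer.A)
    # Restart the factors so the next training segment learns a fresh update.
    layer.A.normal_(mean=0.0, std=0.01)
    layer.B.zero_()
    # Momentum reset: drop stale optimizer statistics for the reinitialized factors,
    # since they refer to the parameters that were just folded away.
    for p in (layer.A, layer.B):
        if p in optimizer.state:
            optimizer.state[p] = {}
```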
Benchmark evaluations of memory-efficient pre-training approaches show that full-rank training with the right optimizer and hyperparameters delivers the best performance, and that incorporating high-rank updates is crucial for low-rank approaches to close the gap.
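The high-rank point can be seen numerically: each individual low-rank update has rank at most r, but the sum of several independent rank-r updates (as accumulated through refactorization) generically has much higher rank. A small illustrative check, not taken from the benchmark itself:

```python
import torch

torch.manual_seed(0)
d, r, steps = 64, 4, 8

# Accumulate several independent rank-r updates into one weight matrix.
W = torch.zeros(d, d)
for _ in range(steps):
    B = torch.randn(d, r)
    A = torch.randn(r, d)
    W += B @ A

# A single update has rank 4; eight independent ones generically reach rank 32.
print(torch.linalg.matrix_rank(W).item())  # typically prints 32
```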