During the training of Large Language Models (LLMs), a significant amount of tensor data is periodically checkpointed so that training can be recovered in case of failure.
This paper focuses on optimizing the checkpointing process by analyzing the checkpoint data and maximizing the effectiveness of lossless compression to reduce the data volume.
An effective compression solution named the Language Model Compressor (LMC) has been developed, based on byte-grouping and Huffman encoding; it offers better compression than existing alternatives such as BZ2 while requiring significantly less compression time.
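To illustrate the byte-grouping idea, the following is a minimal sketch (not the paper's implementation): bytes of each tensor element are regrouped by byte position before entropy coding, since exponent and mantissa bytes have very different statistics. The function names are hypothetical, and `zlib` is used only as a stand-in for the Huffman coder described in the paper.

```python
import numpy as np
import zlib  # placeholder for the Huffman encoder used by LMC


def byte_group_compress(tensor: np.ndarray) -> list[bytes]:
    """Regroup bytes by position within each element, then compress each
    group separately; grouping keeps similar bytes together, which helps
    the entropy coder."""
    raw = tensor.view(np.uint8).reshape(-1, tensor.itemsize)
    # One stream per byte position, e.g. two streams for fp16/bf16.
    groups = [raw[:, i].tobytes() for i in range(tensor.itemsize)]
    return [zlib.compress(g) for g in groups]


def byte_group_decompress(blobs: list[bytes], dtype, shape) -> np.ndarray:
    """Reverse the grouping to restore the original tensor losslessly."""
    cols = [np.frombuffer(zlib.decompress(b), dtype=np.uint8) for b in blobs]
    raw = np.stack(cols, axis=1).reshape(-1)
    return raw.view(dtype).reshape(shape)


if __name__ == "__main__":
    w = np.random.randn(1024, 1024).astype(np.float16)
    blobs = byte_group_compress(w)
    restored = byte_group_decompress(blobs, np.float16, w.shape)
    assert np.array_equal(w, restored)  # lossless round trip
```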
LMC's 16-core parallel implementation achieves high compression and decompression throughput, reducing the CPU resources required for checkpointing and enabling higher-frequency checkpoints during model training.
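A parallel implementation can be approximated as below: the checkpoint payload is split into shards that are compressed concurrently across worker processes, so throughput scales with core count. This is only a sketch under assumed interfaces; the function names are hypothetical and `zlib` again stands in for the paper's Huffman coder.

```python
from concurrent.futures import ProcessPoolExecutor
import zlib  # placeholder for LMC's Huffman-based coder

import numpy as np


def _compress_chunk(chunk: bytes) -> bytes:
    # Each worker compresses one shard independently.
    return zlib.compress(chunk)


def parallel_compress(tensor: np.ndarray, workers: int = 16) -> list[bytes]:
    """Split the serialized tensor into `workers` shards and compress them
    concurrently; 16 workers mirror the 16-core setup reported above."""
    raw = tensor.tobytes()
    step = (len(raw) + workers - 1) // workers
    shards = [raw[i:i + step] for i in range(0, len(raw), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_compress_chunk, shards))
```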