A new method called parallel scaling (ParScale) has been introduced for language models; it increases the model's parallel computation during both training and inference.
ParScale applies several diverse, learnable transformations to the input, executes the corresponding forward passes of the model in parallel, and dynamically aggregates their outputs, enabling more efficient scaling without a significant increase in memory or latency.
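The following is a minimal PyTorch sketch of this idea, not the paper's implementation: the wrapper class, the additive per-stream offsets, the gating head, and all hyperparameters are illustrative assumptions chosen only to show the three steps (transform, parallel forward, dynamic aggregation).

```python
# Illustrative sketch of the ParScale idea; names and design choices are assumptions.
import torch
import torch.nn as nn

class ParScaleWrapper(nn.Module):
    """Wrap a shared backbone with P learnable input transformations,
    run the P streams as one batched (parallel) forward pass, and
    dynamically aggregate the P outputs with learned weights."""

    def __init__(self, backbone: nn.Module, d_model: int, num_streams: int = 4):
        super().__init__()
        self.backbone = backbone                  # parameters shared across streams
        self.num_streams = num_streams
        # One learnable additive transformation per stream (illustrative choice).
        self.stream_offsets = nn.Parameter(torch.randn(num_streams, d_model) * 0.02)
        # Small gating head that scores each stream's output for aggregation.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        B, T, D = x.shape
        P = self.num_streams
        # 1) Apply P diverse, learnable transformations to the same input.
        streams = x.unsqueeze(0) + self.stream_offsets.view(P, 1, 1, D)  # (P, B, T, D)
        # 2) Execute the P forward passes in parallel by folding streams into the batch.
        out = self.backbone(streams.reshape(P * B, T, D)).reshape(P, B, T, D)
        # 3) Dynamically aggregate: softmax over per-stream gate scores.
        weights = torch.softmax(self.gate(out), dim=0)                   # (P, B, T, 1)
        return (weights * out).sum(dim=0)                                # (B, T, D)

if __name__ == "__main__":
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    )
    model = ParScaleWrapper(backbone, d_model=64, num_streams=4)
    y = model(torch.randn(8, 16, 64))
    print(y.shape)  # torch.Size([8, 16, 64])
```

Because the P streams share one backbone, the extra cost is mostly parallel compute rather than extra parameters, which is what allows scaling without a large memory footprint.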
Theoretical analysis and large-scale pre-training validate a new scaling law for ParScale, showing that a model with P parallel streams performs comparably to one whose parameter count is scaled by a logarithmic factor in P, while retaining superior inference efficiency.
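One schematic way to write this claim is the relation below, where L(N, P) denotes the loss of a model with N parameters and P parallel streams; the constant k and the exact functional form are illustrative assumptions, not the paper's fitted law.

```latex
% Schematic rendering of the claim: P parallel streams behave roughly like
% inflating the parameter count by an O(log P) factor (k is illustrative).
\mathcal{L}(N, P) \;\approx\; \mathcal{L}\bigl(N \cdot (1 + k \log P),\; 1\bigr), \qquad k > 0
```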
Compared with traditional parameter scaling that reaches the same performance improvement, ParScale incurs up to 22 times less additional memory and 6 times less additional latency, making it a more resource-efficient approach to enhancing model performance.