Mosaic is a system for creating and deploying pruned large language models (LLMs) using composite projection pruning.
Projection pruning is a fine-grained method for reducing the size of LLMs by removing unnecessary model parameters.
Composite projection pruning combines unstructured pruning, which helps retain accuracy, with structured pruning, which reduces model size.
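To make the distinction concrete, the following is a minimal sketch of the two pruning styles applied to a single weight matrix, not Mosaic's actual implementation. The function names, the magnitude and L1-norm criteria, and the order of the two steps (mask first, then drop rows) are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names, criteria, and step order are assumptions,
# not taken from Mosaic.

def unstructured_prune(weights, sparsity):
    """Zero the smallest-magnitude entries; the matrix shape is unchanged."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # ties at the threshold are zeroed too
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights, fraction):
    """Remove whole rows with the smallest L1 norm; the matrix shrinks."""
    n_drop = int(len(weights) * fraction)
    order = sorted(range(len(weights)),
                   key=lambda i: sum(abs(w) for w in weights[i]))
    dropped = set(order[:n_drop])
    return [row[:] for i, row in enumerate(weights) if i not in dropped]


def composite_prune(weights, sparsity, row_fraction):
    """Apply the unstructured mask, then drop the weakest rows (assumed order)."""
    masked = unstructured_prune(weights, sparsity)
    return structured_prune(masked, row_fraction)


# A toy 4x4 "projection" matrix.
W = [
    [0.9, -0.1, 0.4, 0.05],
    [0.02, 0.03, -0.01, 0.04],
    [-0.7, 0.6, 0.2, -0.3],
    [0.15, -0.8, 0.5, 0.25],
]

pruned = composite_prune(W, sparsity=0.5, row_fraction=0.25)
for row in pruned:
    print(row)
```

In this toy case the unstructured step zeroes half of the entries without changing the matrix shape, and the structured step then removes the one row that carries the least remaining weight, so the result is a smaller 3x4 matrix that is also sparse.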
Compared with existing approaches, Mosaic produces models 7.19 times faster, and the resulting models achieve up to 84.2% lower perplexity and 31.4% higher accuracy, while also improving inference speed and GPU memory utilization.