Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption.
Puzzle is a hardware-aware framework that accelerates LLM inference while preserving model capabilities; it combines neural architecture search (NAS) with blockwise local knowledge distillation (BLD) to optimize models for a target deployment hardware.
The framework is demonstrated with Nemotron-51B, a model that achieves a 2.17x inference throughput speedup on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracy.
These results show that powerful LLMs can be deployed efficiently with negligible loss in quality when optimization targets inference performance rather than parameter count alone.