Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption.
Puzzle is a hardware-aware framework that accelerates LLM inference while preserving model capabilities; it combines neural architecture search (NAS) with blockwise local knowledge distillation (BLD) to optimize models for a target deployment hardware.
The framework is demonstrated with Nemotron-51B, a model that achieves a 2.17x inference throughput speedup on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracy.
These results show that powerful LLMs can be deployed efficiently with negligible loss in quality when optimization targets inference performance rather than parameter count alone.