Large language models (LLMs) power a wide range of applications, but the strictly sequential structure of transformer layers creates computational bottlenecks at inference time, motivating architectural optimization.
NVIDIA researchers introduced FFN Fusion, a method that identifies sequences of feed-forward network (FFN) layers with weak inter-layer dependencies and merges them into a single, wider FFN that runs in one parallel step, reducing sequential computation.
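The key property behind this merge is that summing the outputs of several parallel FFNs is mathematically identical to running one wider FFN whose projection matrices are the concatenation of the originals. A minimal NumPy sketch of that equivalence, using a toy ReLU FFN rather than the gated FFN of Llama-style models:

```python
import numpy as np

def ffn(x, w_in, w_out):
    # Toy FFN: up-projection, ReLU nonlinearity, down-projection.
    return np.maximum(x @ w_in, 0.0) @ w_out

rng = np.random.default_rng(0)
d, h = 8, 16  # hidden size and per-FFN intermediate size (toy values)
w_in1, w_out1 = rng.normal(size=(d, h)), rng.normal(size=(h, d))
w_in2, w_out2 = rng.normal(size=(d, h)), rng.normal(size=(h, d))

x = rng.normal(size=(4, d))  # a small batch of hidden states

# If two consecutive FFN blocks depend only weakly on each other, both
# effectively read the same input x, so their residual stream becomes:
#   y = x + FFN1(x) + FFN2(x)
parallel_sum = x + ffn(x, w_in1, w_out1) + ffn(x, w_in2, w_out2)

# Fused wider FFN: concatenate input projections along columns and
# output projections along rows -- identical output, one matmul pair.
w_in_fused = np.concatenate([w_in1, w_in2], axis=1)    # (d, 2h)
w_out_fused = np.concatenate([w_out1, w_out2], axis=0)  # (2h, d)
fused = x + ffn(x, w_in_fused, w_out_fused)

assert np.allclose(parallel_sum, fused)
```

The fused form exposes one large matrix multiplication instead of several small sequential ones, which is what GPUs exploit for the speedup.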
FFN Fusion was applied to the Llama-3.1-405B model, producing Ultra-253B-Base, which improved speed and resource efficiency without compromising model quality.
The fused model achieved notable gains: a 1.71x inference speedup and a 35x reduction in per-token computational cost.
Benchmark results showed Ultra-253B-Base remains competitive with the original model, retaining high accuracy while roughly halving memory usage.
FFN Fusion demonstrates that redesigning model architectures can lead to significant efficiency enhancements, enabling broader applications across different model sizes.
The technique's systematic use of cosine-distance analysis identifies FFN sequences whose layers depend only weakly on one another and are therefore safe to fuse, with validation across varied model scales.
While FFN Fusion is most effective at larger model scales and complements techniques like pruning and quantization, parallelizing full transformer blocks (attention included) remains an open question because of their stronger interdependencies.
This research sets the stage for more parallel-friendly, hardware-efficient LLM designs, showing that rethinking model architecture itself, rather than only compressing it, can directly address the challenges of sequential computation.