In transformer models, the attention block is typically followed by a feed-forward (FF) layer with a single hidden layer and a ReLU activation.
Because its hidden dimension is usually several times larger than the model dimension, the FF layer holds most of the transformer's weights and does much of the per-token processing.
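As a minimal sketch of such a block (assuming PyTorch; the dimensions and the 4x hidden expansion are illustrative defaults, not values from this text):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward block: Linear -> ReLU -> Linear."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # expand to the (large) hidden dimension
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),  # project back to the model dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); applied to every token independently
        return self.net(x)
```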
Transformers stack this block many times, so the total parameter count grows quickly, which motivates efficiency techniques such as the sparsely-gated Mixture of Experts (MoE).
An MoE layer consists of a set of expert networks and a router (or gate): the router scores the experts for each token, the top-scoring experts process the token, and their outputs are combined as a weighted average.
The goal is to increase model capacity without a comparable rise in computational cost, since each token only activates a small subset of the parameters.
Experts that are not among the top K for a token are skipped entirely, so no computation is spent on them.
In effect, the overall model size can grow substantially while the per-token compute stays close to that of a dense model.
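A minimal sketch of this top-K routing (assuming PyTorch; the expert count, K, and the naive per-expert loop are illustrative simplifications rather than an actual production implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparsely-gated MoE: a router picks the top-K experts for each token."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # gate producing expert scores
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- flatten batch and sequence dims beforehand
        scores = self.router(x)                                # (tokens, experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # keep top-K per token
        weights = F.softmax(top_scores, dim=-1)                # normalize kept scores
        out = torch.zeros_like(x)
        # Experts outside each token's top-K are never run.
        for e, expert in enumerate(self.experts):
            token_pos, k_pos = (top_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue
            out[token_pos] += weights[token_pos, k_pos].unsqueeze(-1) * expert(x[token_pos])
        return out
```

The weighted average here is taken only over the selected experts' outputs; real implementations replace the Python loop with batched dispatch, but the routing logic is the same.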
An efficient MoE implementation is complicated by load balancing across experts; common remedies include adding noise to the router scores and auxiliary load-balancing losses.
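As an illustration, here is a sketch of one widely used auxiliary loss in the style of the Switch Transformer load-balancing loss; the scaling factor `alpha` and the exact batching are assumptions for this example, not details from the text above:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_idx: torch.Tensor,
                        num_experts: int, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary loss that encourages tokens to be spread evenly across experts.

    router_logits: (num_tokens, num_experts) raw router scores.
    top_idx:       (num_tokens, top_k) expert indices each token was routed to.
    """
    probs = F.softmax(router_logits, dim=-1)                       # router probabilities
    # f: fraction of tokens dispatched to each expert (from the top-K assignments).
    dispatch = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (tokens, experts)
    f = dispatch.mean(dim=0)
    # p: mean router probability assigned to each expert.
    p = probs.mean(dim=0)
    # The dot product shrinks as routing becomes more evenly spread.
    return alpha * num_experts * torch.sum(f * p)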
The MoE architecture also encourages specialization: each expert can focus on different capabilities or kinds of tokens, and the router learns to send each token to the experts best suited to it.
MoE papers also highlight practical challenges such as the Shrinking Batch Problem: each expert only sees the tokens routed to it, so its effective batch size shrinks to roughly batch_size × K / num_experts, which hurts computational efficiency.
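To make the scale concrete with purely illustrative numbers (not taken from the source): with a batch of 4,096 tokens, 64 experts, and K = 2, each expert processes on average only 4,096 × 2 / 64 = 128 tokens per step, a much smaller effective batch than the dense layers see.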