Interpretability researchers have typically studied the MLP neurons of language models through either their activating contexts or their output weight vectors, neglecting the interaction between a neuron's input and output weights.
A study examined the cosine similarity between each neuron's input and output weight vectors across 12 models, finding that enrichment neurons are prevalent in early-to-middle layers while depletion neurons dominate later layers. Enrichment neurons reinforce concept representations and aid factual recall in the early stages, whereas later layers lean towards depletion, suppressing certain components of their input.
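As a rough illustration, the comparison can be sketched in a few lines of PyTorch. The snippet below loads GPT-2 (an assumed choice for illustration, not necessarily one of the 12 models studied) and computes, per layer, the cosine similarity between each MLP neuron's input and output weight vectors; labeling positive similarity as enrichment-like and negative as depletion-like is an assumed reading of the study's categories, not its exact criterion.

```python
# Minimal sketch (not the study's code): per-neuron cosine similarity between
# input and output MLP weights in GPT-2, assuming Hugging Face's Conv1D layout
# where mlp.c_fc.weight is (d_model, d_mlp) and mlp.c_proj.weight is (d_mlp, d_model).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

for layer_idx, block in enumerate(model.transformer.h):
    w_in = block.mlp.c_fc.weight    # (d_model, d_mlp): column i is neuron i's input direction
    w_out = block.mlp.c_proj.weight  # (d_mlp, d_model): row i is neuron i's output direction

    # Cosine similarity between each neuron's input and output weight vectors.
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=-1)  # (d_mlp,)

    # Assumed labeling for illustration: positive similarity reinforces the input
    # direction ("enrichment-like"), negative similarity subtracts it ("depletion-like").
    frac_enrich = (cos > 0).float().mean().item()
    frac_deplete = (cos < 0).float().mean().item()
    print(f"layer {layer_idx:2d}: enrichment-like {frac_enrich:.2f}, depletion-like {frac_deplete:.2f}")
```

Under this reading, one would expect the printed enrichment-like fraction to be higher in early-to-middle layers and the depletion-like fraction to grow in later layers.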
This input-output perspective complements activation-based analyses and approaches that treat input and output weights separately when interpreting neural network behavior.