A recent study analyzes residual specialization in transformer networks, with a focus on vision transformers.
The study links the specialization of residual contributions to the low-dimensional structure of visual head representations.
The authors examine how head specialization affects multimodal models and, in turn, their zero-shot classification performance.
The study introduces ResiDual, a technique for spectral alignment of the residual stream, which reaches fine-tuning-level performance across different data distributions.
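
To make the two ideas above concrete, here is a minimal sketch and not the authors' implementation. It assumes that "low-dimensional structure" can be probed with PCA on a single head's outputs, and that "spectral alignment" amounts to rescaling that head's contribution along its principal components before it is added back to the residual stream. The function names (`pca_basis`, `effective_dim`, `spectral_reweight`), the 90% variance threshold, the synthetic data, and the hand-set weights are all illustrative assumptions; an actual method would presumably learn the weights from data.

```python
import numpy as np


def pca_basis(head_outputs):
    """Mean, principal directions (as rows), and explained-variance ratios."""
    mean = head_outputs.mean(axis=0)
    centered = head_outputs - mean
    # Rows of vt are the principal directions of the centered outputs.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    return mean, vt, variance / variance.sum()


def effective_dim(ratio, threshold=0.9):
    """Smallest number of components whose cumulative variance reaches threshold."""
    k = int(np.searchsorted(np.cumsum(ratio), threshold)) + 1
    return min(k, len(ratio))


def spectral_reweight(head_outputs, mean, components, weights):
    """Project onto the principal basis, rescale each coordinate, map back."""
    coords = (head_outputs - mean) @ components.T
    return (coords * weights) @ components + mean


# Synthetic stand-in for one head's outputs (assumption): 200 samples in
# 16 dimensions that mostly live in a 3-dimensional subspace, plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 16))
head = latent @ mixing + 0.01 * rng.normal(size=(200, 16))

mean, comps, ratio = pca_basis(head)
k = effective_dim(ratio, threshold=0.9)

# Hand-set weights (assumption): keep the top-k directions, drop the rest.
weights = np.zeros(len(comps))
weights[:k] = 1.0
aligned = spectral_reweight(head, mean, comps, weights)
```

With all weights set to one, `spectral_reweight` returns the head output unchanged, so the reweighting is a strict generalization of the untouched residual contribution. In a full model, the same operation would be applied per head and the reweighted contributions summed to form the residual stream.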