Vision Transformers (ViTs) achieve strong performance across computer vision tasks owing to their rich learned representations.
A layer-wise analysis using neuron labeling reveals that the concepts ViTs encode grow progressively more complex with depth.
Early layers primarily encode basic features like colors and textures, while later layers represent more specific classes, such as objects and animals.
Different pretraining strategies influence both the quantity and the category of encoded concepts; finetuning reduces the number of concepts and shifts them toward categories relevant to the target task.
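
To make the neuron-labeling setup concrete, below is a minimal sketch of one common approach in the spirit of CLIP-Dissect: record each MLP neuron's activation profile over a set of probe images, score the same images against a concept vocabulary with a CLIP model, and assign each neuron the concept whose profile it correlates with most strongly. The model names, concept list, and probe images are illustrative assumptions, not the exact pipeline from the analysis described above.

```python
"""Minimal sketch of layer-wise ViT neuron labeling (CLIP-Dissect style).

Assumptions: a standard timm ViT-B/16 as the model under analysis, an
open_clip ViT-B/32 as the concept scorer, and a toy concept vocabulary.
"""
import torch
import timm
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT whose neurons we want to label (illustrative choice).
vit = timm.create_model("vit_base_patch16_224", pretrained=True).to(device).eval()

# CLIP model used to score concept names against the probe images.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Toy concept vocabulary spanning low-level to high-level concepts.
concepts = ["red", "striped texture", "fur", "dog", "car wheel", "bird"]

# Collect per-block MLP neuron activations via forward hooks.
activations = {}  # block index -> list of (batch, num_neurons) tensors

def make_hook(idx):
    def hook(module, inputs, output):
        # Average over the token dimension: one scalar per neuron per image.
        activations.setdefault(idx, []).append(output.mean(dim=1).detach())
    return hook

for i, block in enumerate(vit.blocks):
    block.mlp.fc1.register_forward_hook(make_hook(i))

@torch.no_grad()
def label_neurons(probe_images):
    """probe_images: (N, 3, 224, 224) preprocessed probe batch.

    Simplification: the same tensor is fed to both models; in practice
    the ViT and CLIP preprocessing pipelines differ slightly.
    """
    vit(probe_images.to(device))

    # CLIP image/concept similarities over the probe set: (N, num_concepts).
    img_feat = clip_model.encode_image(probe_images.to(device))
    txt_feat = clip_model.encode_text(tokenizer(concepts).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    concept_scores = img_feat @ txt_feat.T

    labels = {}
    for idx, acts in activations.items():
        acts = torch.cat(acts)  # (N, num_neurons)
        # Correlate each neuron's activation profile with each concept's
        # similarity profile across the probe images.
        a = (acts - acts.mean(0)) / (acts.std(0) + 1e-6)
        c = (concept_scores - concept_scores.mean(0)) / (concept_scores.std(0) + 1e-6)
        corr = a.T @ c / len(acts)  # (num_neurons, num_concepts)
        labels[idx] = [concepts[j] for j in corr.argmax(dim=1)]
    return labels
```

Under this sketch, the layer-wise trend reported above would show up as early blocks being dominated by labels like "red" or "striped texture" while later blocks pick up labels like "dog" or "bird"; a real analysis would use a much larger probe set and concept vocabulary.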