A team of scientists from the University of Osaka has discovered that Vision Transformers (ViTs) can learn to focus visual attention similarly to humans without labeled examples or explicit instructions.
Trained with a method called self-distillation with no labels (DINO), the ViTs developed human-like visual attention, organizing visual information without human guidance.
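The core idea behind DINO-style self-distillation can be sketched in a few lines: a student network is trained to match the softened predictions of a teacher network, where the teacher is simply an exponential moving average of the student's own weights, so no labels are ever needed. The toy "networks", temperatures, and learning rate below are hypothetical stand-ins for illustration, not the actual architecture or hyperparameters from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, temp):
    # Temperature-scaled softmax; lower temp gives sharper distributions.
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy linear models standing in for the ViT backbones (hypothetical sizes).
dim_in, dim_out = 16, 8
student_w = rng.normal(size=(dim_in, dim_out))
teacher_w = student_w.copy()       # teacher starts as a copy of the student

lr, ema_momentum = 0.1, 0.99
t_student, t_teacher = 0.1, 0.04   # teacher output is sharpened more

for step in range(100):
    # Two augmented "views" of the same inputs (here: input plus noise).
    x = rng.normal(size=(4, dim_in))
    view1 = x + 0.1 * rng.normal(size=x.shape)
    view2 = x + 0.1 * rng.normal(size=x.shape)

    p_teacher = softmax(view1 @ teacher_w, t_teacher)  # target, no gradient
    p_student = softmax(view2 @ student_w, t_student)

    # Cross-entropy gradient w.r.t. the student logits is (p - target).
    grad_logits = (p_student - p_teacher) / len(x)
    student_w -= lr * (view2.T @ grad_logits)

    # Teacher slowly follows the student via an exponential moving average.
    teacher_w = ema_momentum * teacher_w + (1 - ema_momentum) * student_w

# Cross-entropy between teacher targets and student predictions.
loss = -(p_teacher * np.log(p_student + 1e-9)).sum(axis=-1).mean()
```

The key design choice is that supervision comes entirely from the model's own past predictions on differently augmented views, which is why no human labels are required.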
The ViTs trained with DINO attended to the same areas as humans, such as faces, the outlines of people, and background details, with attention heads emerging naturally in a way that mirrors how the human brain organizes visual information.
This research could lead to more human-aware AI systems: robots that follow human gaze for smoother communication, and educational tools that support child development. It may also offer new insights into human perception itself, by studying how machine learning models learn to see.