Convolutional Neural Networks (CNNs) have long been the backbone of computer vision, excelling at image-related tasks. Vision Transformers (ViTs) challenge that dominance by processing images with self-attention mechanisms instead of convolutions. ViTs outperform CNNs when pre-trained on large datasets, but they struggle when training data is limited. Efficient architectures are an active area of research aimed at reducing the quadratic complexity of self-attention in ViTs.
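To make the mechanism concrete, here is a minimal NumPy sketch (not any particular library's implementation) of the two ideas above: an image is split into non-overlapping patch tokens, and self-attention then compares every token with every other, which is where the quadratic cost in the number of tokens comes from. The function names `image_to_patches` and `self_attention` are illustrative, and the attention omits learned projection weights for brevity.

```python
import numpy as np

def image_to_patches(img, patch=16):
    # img: (H, W, C) -> (N, D) tokens, where N = (H//patch) * (W//patch)
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    tokens = img[:rows * patch, :cols * patch]
    tokens = tokens.reshape(rows, patch, cols, patch, C)
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * C)
    return tokens

def self_attention(x):
    # Single-head scaled dot-product attention without learned projections.
    # The (N, N) score matrix is the source of the quadratic complexity.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                         # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over tokens
    return weights @ x                                    # (N, D)

img = np.random.rand(224, 224, 3)          # a 224x224 RGB image
tokens = image_to_patches(img)             # 14*14 = 196 tokens of dim 16*16*3 = 768
out = self_attention(tokens)
print(tokens.shape, out.shape)             # (196, 768) (196, 768)
```

Doubling the image side length quadruples the token count N, so the attention score matrix grows sixteen-fold; this scaling is what motivates the efficient-architecture work mentioned above.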