An Explanation of the Vision Transformer (ViT) Paper

Medium

The Vision Transformer (ViT) paper adapts the transformer architecture used in NLP to images, treating each image as a sequence of small, fixed-size patches that is processed by a pure transformer.
The ViT takes an image as input, divides it into small, fixed-size patches, flattens each patch, and converts it into a numerical representation called a patch embedding.
The ViT adds positional embeddings to the patch embeddings to help the model retain the spatial structure of the image.
The ViT prepends a special classification token ([CLS]) to the sequence; during processing this token aggregates information from all patches, and its final state serves as a summary of the image for classification.
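The input pipeline described above (patches, patch embeddings, classification token, positional embeddings) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the random weights, and the sizes (224x224 image, 16x16 patches, embedding width 64) are assumptions chosen for the example, and the classification token is placed at position 0 of the sequence.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_image(image, patch_size, w_proj, cls_token, pos_emb):
    """Build the token sequence that would be fed to the transformer encoder."""
    patches = patchify(image, patch_size)   # (N, P*P*C)
    tokens = patches @ w_proj               # (N, D) patch embeddings
    # Put the classification token at position 0 of the sequence.
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N+1, D)
    return tokens + pos_emb                 # add one positional embedding per token

# Hypothetical sizes and randomly initialized (untrained) parameters.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
P, D = 16, 64
N = (224 // P) ** 2                         # 14 * 14 = 196 patches
w_proj = rng.normal(size=(P * P * 3, D)) * 0.02
cls_token = np.zeros(D)
pos_emb = rng.normal(size=(N + 1, D)) * 0.02

tokens = embed_image(image, P, w_proj, cls_token, pos_emb)
print(tokens.shape)  # (197, 64)
```

In a trained model the projection matrix, the classification token, and the positional embeddings would all be learned parameters; here they are random placeholders so the shapes can be checked.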
When pre-trained on large datasets, the ViT matched or outperformed CNNs, scaled well, and transferred well to downstream tasks, including those with few labeled examples. CNNs, on the other hand, performed better when trained on smaller datasets.
The authors proposed an optional hybrid architecture that starts with a CNN to extract feature maps, which are then treated as input patches for the Vision Transformer.
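As a rough sketch of the hybrid idea, the snippet below turns a CNN feature map into a token sequence by treating each spatial location as one "patch" and projecting it to the transformer width. The feature map here is random data standing in for the output of a real CNN backbone, and the function name, shapes (14x14x1024), and embedding width are assumptions for illustration only.

```python
import numpy as np

def feature_map_to_tokens(feature_map, w_proj):
    """Flatten an (h, w, c) feature map into (h * w) tokens and project to width D."""
    h, w, c = feature_map.shape
    seq = feature_map.reshape(h * w, c)   # one token per spatial location
    return seq @ w_proj                   # (h * w, D)

rng = np.random.default_rng(0)
feature_map = rng.random((14, 14, 1024))  # stand-in for a CNN backbone's output
w_proj = rng.normal(size=(1024, 64)) * 0.02

hybrid_tokens = feature_map_to_tokens(feature_map, w_proj)
print(hybrid_tokens.shape)  # (196, 64)
```

From this point on, the sequence would be handled like ordinary patch embeddings: a classification token and positional embeddings are added before the transformer encoder.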
ViT outperformed BiT and other state-of-the-art methods on the Natural and Structured categories of the VTAB benchmark suite, demonstrating its ability to generalize well across varied datasets.
The ViT processes images differently from CNNs: it learns spatial relationships from scratch, without a CNN's built-in assumptions about localized patterns such as textures, edges, or shapes.
The authors also explored self-supervised learning applied to ViT, where parts of the input image were hidden, and the model was tasked with reconstructing the missing patches.
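The masking step of such a self-supervised setup might look like the sketch below. It is a simplified illustration under stated assumptions, not the paper's exact recipe: the 50% mask ratio, the zero mask value, the mean-squared-error loss, and all function names are hypothetical choices, and the "prediction" is a placeholder where a model's output would go.

```python
import numpy as np

def mask_patches(patches, mask_ratio, mask_value, rng):
    """Hide a random subset of patches; return the corrupted copy and the mask."""
    n = patches.shape[0]
    n_masked = int(round(n * mask_ratio))
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    is_masked = np.zeros(n, dtype=bool)
    is_masked[masked_idx] = True
    corrupted = patches.copy()
    corrupted[is_masked] = mask_value      # overwrite the hidden patches
    return corrupted, is_masked

def reconstruction_loss(pred, target, is_masked):
    """Mean squared error computed only over the hidden patches."""
    diff = pred[is_masked] - target[is_masked]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
patches = rng.random((196, 768))           # flattened 16x16x3 patches (random stand-in)
corrupted, is_masked = mask_patches(patches, mask_ratio=0.5, mask_value=0.0, rng=rng)

# Placeholder "prediction": a real model would map `corrupted` to its guess
# for the original patches; here the corrupted input is reused unchanged.
loss = reconstruction_loss(corrupted, patches, is_masked)
print(int(is_masked.sum()))  # 98
```

Restricting the loss to the hidden positions means the model is scored only on content it could not see directly.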
The ViT showed promising scaling efficiency but has not reached its full potential yet and could perform even better with larger datasets.
