This article walks through a PyTorch implementation of the deep learning model proposed in the paper "Show and Tell: A Neural Image Caption Generator".
The image captioning task can be solved by combining a CNN, which encodes the image into a feature vector, with an RNN, which generates the caption word by word. The paper proposes GoogLeNet as the CNN and an LSTM as the RNN.
In the PyTorch implementation, these two networks are written as the InceptionEncoder and LSTMDecoder classes.
The ShowAndTell class packages the encoder and decoder together and can be used for both training and inference.
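One plausible shape for this wrapper (everything beyond the ShowAndTell name is an assumption of this sketch): it simply chains an encoder and a decoder, so the training loop can call it with images and ground-truth captions.

```python
import torch
import torch.nn as nn


class ShowAndTell(nn.Module):
    """Bundle an image encoder and a caption decoder into one trainable model."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        features = self.encoder(images)          # (batch, embed_dim)
        return self.decoder(features, captions)  # (batch, steps, vocab_size)
```

At training time the returned logits would typically be compared against the shifted ground-truth captions with nn.CrossEntropyLoss; inference is delegated to a separate generation method.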
The EMBED_DIM and LSTM_HIDDEN_DIM hyperparameters are both set to 512.
A pretrained GoogLeNet model is used for the encoder, following a transfer learning approach.
The generate() method first encodes the image into a feature vector and then produces the token sequence one step at a time, feeding each predicted token back into the LSTM.
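The autoregressive loop can be sketched as follows. The class and parameter names here, and the use of greedy search (always taking the argmax token), are assumptions of this sketch; the paper itself also discusses beam search as an alternative.

```python
import torch
import torch.nn as nn


class GreedyCaptioner(nn.Module):
    """Toy decoder illustrating the autoregressive generate() loop."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, start_id, end_id):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.start_id, self.end_id = start_id, end_id

    @torch.no_grad()
    def generate(self, features: torch.Tensor, max_len: int = 20) -> list:
        # Step 1: feed the image features once to prime the LSTM state.
        _, state = self.lstm(features.unsqueeze(1))
        token = torch.tensor([[self.start_id]])
        ids = []
        # Step 2: feed each predicted token back in until <end> or max_len.
        for _ in range(max_len):
            out, state = self.lstm(self.embed(token), state)
            token = self.fc(out).argmax(dim=-1)  # greedy choice per step
            ids.append(int(token))
            if ids[-1] == self.end_id:
                break
        return ids
```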
As a post-processing step, the token-ID sequence returned by generate() needs to be converted into a sequence of words using the vocabulary.
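This lookup is plain Python; a minimal sketch is shown below. The helper name and the example vocabulary are made up for illustration; a real run would use the vocabulary built from the training captions.

```python
def ids_to_caption(token_ids, id_to_word, end_token="<end>"):
    """Map generated token IDs back to words, stopping at the <end> marker."""
    words = []
    for idx in token_ids:
        word = id_to_word.get(idx, "<unk>")  # unknown IDs map to <unk>
        if word == end_token:
            break
        words.append(word)
    return " ".join(words)
```

For example, with a vocabulary mapping `{2: "a", 3: "dog", 4: "running", 1: "<end>"}`, the ID sequence `[2, 3, 4, 1]` decodes to the caption "a dog running".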
The rest of the article summarizes the model pipeline, with each required piece of code explained in order.