The decoder architecture of the Transformer is made up of six identical blocks, with each block consisting of masked self-attention, cross-attention, and feed-forward neural network operations.
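As a rough sketch of that structure, PyTorch's built-in nn.TransformerDecoderLayer and nn.TransformerDecoder can be stacked in the same way; the d_model=512, nhead=8, and dim_feedforward=2048 values below follow the original paper's base configuration, and the tensors are toy placeholders rather than real embeddings.

```python
import torch
import torch.nn as nn

# One decoder block: masked self-attention, cross-attention, feed-forward,
# each followed by a residual connection and layer normalization.
decoder_block = nn.TransformerDecoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
decoder = nn.TransformerDecoder(decoder_block, num_layers=6)  # six stacked blocks

tgt = torch.randn(1, 4, 512)      # toy embeddings of the shifted output sentence
memory = torch.randn(1, 6, 512)   # toy contextual embeddings from the encoder
tgt_mask = nn.Transformer.generate_square_subsequent_mask(4)  # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)  # (batch, tgt_len, d_model)
```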
During training, the input sentence goes through the encoder, which generates contextual embeddings for it. The output sentence goes through the input part of the decoder, where it is tokenized, shifted one position to the right, converted to token embeddings, and combined with positional encodings.
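A minimal sketch of this input stage, assuming a toy batch of Hindi token ids, a hypothetical &lt;bos&gt; id used for the right shift, a vocabulary size of 10,000, and the paper's model dimension of 512 (all illustrative choices, not values from the text):

```python
import math
import torch
import torch.nn as nn

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings as in the original Transformer paper.
    position = torch.arange(seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

BOS_ID = 1                                         # assumed id of the <bos> token
tgt_ids = torch.tensor([[7, 42, 13, 2]])           # toy Hindi token ids, batch of 1

# "Shifted right": prepend <bos> and drop the last token, so position t is
# predicted only from tokens strictly before t.
shifted = torch.cat([torch.full((1, 1), BOS_ID), tgt_ids[:, :-1]], dim=1)

embed = nn.Embedding(num_embeddings=10_000, embedding_dim=512)  # assumed vocab size
x = embed(shifted) * math.sqrt(512)                # scale embeddings by sqrt(d_model)
x = x + positional_encoding(shifted.size(1), 512)  # add positional information
```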
The masked multi-head attention operation generates a corresponding contextual embedding vector for every token of the shifted output sentence; the mask ensures that each position can attend only to itself and earlier positions, so the decoder cannot peek at future tokens during training. The results of this operation are added to the original input vectors through a residual connection, and the combined vectors are normalized using layer normalization to create the contextual embeddings for each output token.
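In code, this sublayer might look like the following sketch, with toy shapes and a boolean causal mask; the 512-dimensional, 8-head configuration is an assumption taken from the original paper rather than from the text above:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(1, 4, d_model)  # toy embeddings of the shifted output sentence

# Causal mask: True marks positions a query may NOT attend to,
# so every position can only look at itself and earlier positions.
tgt_len = x.size(1)
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

attn_out, _ = self_attn(x, x, x, attn_mask=causal_mask)
x = norm(x + attn_out)          # residual connection followed by layer normalization
```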
Cross-attention combines the contextual embeddings of the input sentence generated by the encoder, used as keys and values, with the contextual embeddings of the output sentence produced by the preceding masked self-attention sublayer, used as queries. The results are added to the normalized vectors from that sublayer, and the combined vectors are normalized again to create the contextual embeddings for each output token.
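A comparable sketch of the cross-attention sublayer, again with toy tensors standing in for the real encoder output and decoder activations:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

dec_x = torch.randn(1, 4, d_model)    # output of the masked self-attention sublayer
memory = torch.randn(1, 6, d_model)   # contextual embeddings from the encoder

# Queries come from the decoder; keys and values come from the encoder output.
attn_out, _ = cross_attn(query=dec_x, key=memory, value=memory)
dec_x = norm(dec_x + attn_out)        # residual connection + layer normalization
```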
The feed-forward neural network block consists of two linear layers: the first is followed by a ReLU activation, while the second applies no further non-linearity. The output of this block is added back to its input using a residual connection, and the final vectors are normalized once more using layer normalization.
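With the original paper's base dimensions (d_model=512 and an inner dimension of 2048, both assumptions here), this sublayer can be sketched as:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first linear layer, followed by ReLU
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # second linear layer, no further non-linearity
)
norm = nn.LayerNorm(d_model)

x = torch.randn(1, 4, d_model)  # toy input from the cross-attention sublayer
x = norm(x + ffn(x))            # residual connection + layer normalization
```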
Finally, the output block, consisting of a linear layer followed by a softmax, generates a probability distribution over the Hindi vocabulary at each output position, and the word with the highest probability is chosen as the prediction for that position.
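A sketch of this output stage, assuming a hypothetical Hindi vocabulary of 10,000 tokens:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10_000          # assumed model and vocabulary sizes
proj = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(1, 4, d_model)       # toy decoder output (batch, tgt_len, d_model)
logits = proj(dec_out)                     # (batch, tgt_len, vocab_size)
probs = torch.softmax(logits, dim=-1)      # distribution over the vocabulary per position
predicted_ids = probs.argmax(dim=-1)       # greedy pick of the most probable token
```

In practice, the argmax is how words are read out at inference time; during training the probability distributions are instead compared against the reference Hindi tokens with a cross-entropy loss.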
This decoder architecture is specifically for training and works alongside the encoder to generate translations for machine translation tasks.
Overall, the decoder architecture of Transformers may seem overwhelming at first, but breaking it down into smaller parts makes it easier to understand how it works.
This discussion also highlights the importance of self-attention, cross-attention, and feed-forward neural network operations in generating the contextual embeddings that enable the decoder to produce accurate output translations.
In short, the decoder architecture for training involves a series of inputs, transformations, and computations that ultimately produce accurate translations for machine translation tasks.