Cross Attention is a multi-head attention block that, unlike the other attention blocks in a Transformer, receives inputs from both the encoder and the decoder. It finds relationships between two sequences, which makes it the key component of the Transformer's decoder architecture.
The Cross Attention mechanism captures the relationship between each word of the input sequence and each word of the output sequence, using multi-head attention to relate the two different sequences.
Cross Attention is used in situations where two different types of sequences need to be related, for instance in machine translation, question-answering systems, and multimodal applications such as image captioning, text-to-image, and text-to-speech.
Query vectors come from the output sequence, while Key and Value vectors come from the input sequence. The attention-score calculation in Cross Attention is the same as in Self Attention; the two mechanisms differ only in that Cross Attention operates on two sequences instead of one.
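To make the flow of Queries, Keys, and Values concrete, here is a minimal single-head sketch in NumPy. It is an illustration under simplifying assumptions (single head, no masking, no output projection); the names `decoder_states`, `encoder_states`, `W_q`, `W_k`, and `W_v` are hypothetical placeholders, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Single-head cross attention sketch (illustrative, not a library API).

    decoder_states: (T_out, d_model) -- representations of the output sequence
    encoder_states: (T_in,  d_model) -- representations of the input sequence
    """
    Q = decoder_states @ W_q   # Queries come from the output (decoder) sequence
    K = encoder_states @ W_k   # Keys come from the input (encoder) sequence
    V = encoder_states @ W_v   # Values come from the input (encoder) sequence

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_out, T_in): each output word vs. each input word
    weights = softmax(scores, axis=-1)   # attention distribution over the input sequence
    return weights @ V                   # (T_out, d_k): one contextual vector per output word
```

The score computation is the familiar scaled dot-product attention from Self Attention; only the sources of Q versus K and V change.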
The difference between Self Attention and Cross Attention lies in the input (Self Attention takes the embeddings of a single sequence, while Cross Attention requires two sequences as input) and in the output (the number of word vectors in the contextual embeddings produced by Cross Attention equals the number of words in the output sequence).
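A quick shape check with the sketch above illustrates the output-side point. The dimensions here (an 8-dimensional model, a 6-token input sequence, a 4-token output sequence) are arbitrary values chosen only for the example:

```python
d_model, T_in, T_out = 8, 6, 4
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
enc = rng.normal(size=(T_in, d_model))   # stand-in for input-sequence embeddings
dec = rng.normal(size=(T_out, d_model))  # stand-in for output-sequence embeddings

out = cross_attention(dec, enc, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one contextual vector per output-sequence word
```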
Structurally, Cross Attention is conceptually very similar to Self Attention; apart from these input and output differences, the only change is that a second sequence takes part in the computation.
The Cross Attention mechanism is frequently used in situations that require simultaneously checking the relationships between two sequences.
The Cross Attention mechanism in the Transformer decoder is an essential concept to understand, both for building NLP applications and for understanding how Transformer-based applications work.
Grasping Cross Attention requires an understanding of Self Attention, the Transformer's decoder architecture, and the use-case scenarios in which it appears.