The Transformer is a new sequence-to-sequence architecture for NLP that handles long-distance dependencies with ease. It computes input and output representations without using sequence-aligned RNNs or convolutions, relying entirely on self-attention. Let’s look in detail at what Transformers are.
The Basic Architecture
In general, the Transformer model is based on the encoder-decoder architecture. The encoder is the gray rectangle on the left and the decoder is on the right. The encoder and decoder consist of two and three sublayers, respectively: multi-head self-attention, a fully connected feed-forward network, and, in the case of the decoder, encoder-decoder attention (also a multi-head attention layer, shown in the following visualizations).
Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed-length vector called a context vector.
Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.
Let’s see how this setup of the encoder and the decoder stack works:
1. The word embeddings of the input sequence are passed to the first encoder
2. These are then transformed and propagated to the next encoder
3. The output from the last encoder in the encoder-stack is passed to all the decoders in the decoder-stack as shown in the figure below:
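The data flow through the two stacks can be sketched schematically. This is a minimal illustration of the wiring only; the layer functions here are placeholders, not the real sublayer math:

```python
def encoder_layer(x):
    # Placeholder: a real encoder layer applies self-attention + feed-forward
    return [v + 1 for v in x]

def decoder_layer(x, enc_out):
    # Placeholder: a real decoder layer also attends over the encoder output
    return [v + sum(enc_out) for v in x]

def transformer(embeddings, n_layers=2):
    # Steps 1-2: embeddings pass through each encoder in turn
    x = embeddings
    for _ in range(n_layers):
        x = encoder_layer(x)
    enc_out = x  # Step 3: the final encoder output feeds every decoder layer
    y = embeddings
    for _ in range(n_layers):
        y = decoder_layer(y, enc_out)
    return y

print(transformer([1, 2, 3]))
```

Note that `enc_out` is computed once and reused by every decoder layer, mirroring step 3 above.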
What exactly does this “Self-Attention” layer do in the Transformer?
Self-Attention in Transformers
Self-attention is a new spin on the attention technique. Instead of looking only at prior hidden vectors when considering a word embedding, self-attention computes a weighted combination of all other word embeddings, including those that appear later in the sentence:
How self-attention is implemented:
1. Each word embedding is transformed into three separate vectors — a query, a key, and a value — by multiplying the embedding by three learned weight matrices. These weight matrices are updated during the training process.
2. Consider this sentence: “action leads to results”. To calculate the self-attention for the first word, “action”, calculate the scores of all the words in the phrase with respect to “action”. These scores determine how much importance to place on other words when encoding a particular word in the input sequence.
- The score for the first word is calculated by taking the dot product of its query vector (q1) with the key vectors (k1, k2, k3) of all the words
- Then, these scores are divided by 8, the square root of the key vector’s dimension (64):
- Next, these scores are normalized using the softmax activation function
- These normalized scores are then multiplied by the value vectors (v1, v2, v3), and the resultant vectors are summed to arrive at the final vector (z1). This is the output of the self-attention layer for the first word; it is passed on to the feed-forward network as input
- The same process is repeated for all the words
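The steps above can be sketched in NumPy. This is a minimal illustration: the weight matrices here are random stand-ins for learned parameters, and the dimensions are toy-sized:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a whole sequence.

    X: (seq_len, d_model) word embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v  # step 1: queries, keys, values
    scores = Q @ K.T                     # step 2: dot products q_i . k_j
    scores /= np.sqrt(K.shape[-1])       # divide by sqrt(d_k) (8 when d_k = 64)
    weights = softmax(scores)            # normalize each row of scores
    return weights @ V                   # weighted sum of value vectors -> z_i

# Toy example: 4 words ("action leads to results"), d_model=8, d_k=4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)  # one output vector z_i per input word
```

Because the whole sequence is processed as one matrix product, every z_i is computed in parallel rather than one time step at a time.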
In the Transformer architecture, self-attention is calculated not just once, but multiple times in parallel and independently. It is therefore called multi-head attention. The outputs are concatenated and linearly transformed, as shown in the following figure.
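A multi-head layer can be sketched as several independent attention heads whose outputs are concatenated and projected. As before, the weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head.

    Each head attends independently; the per-head outputs are
    concatenated and projected by W_o back to the model dimension.
    """
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(1)
d_model, d_k, n_heads, seq_len = 8, 4, 2, 5
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
out = multi_head_attention(X, heads, W_o)
print(out.shape)  # back to one d_model-sized vector per word
```

Each head can learn to focus on a different kind of relationship (e.g. syntactic vs. positional), which is the motivation for running several in parallel.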
Attention in the Transformer Architecture and How It Works:
The Transformer architecture uses multi-headed attention in three places (refer to Fig. 1):
1. The first is the encoder-decoder attention layer. For this type of layer, the queries are taken from the previous decoder layer, and the keys and values are taken from the encoder output. This allows every position in the decoder to attend to every position in the input sequence.
2. The second type is the self-attention layer contained in the encoder. This layer receives its keys, values, and queries from the output of the previous encoder layer. Each position in the encoder can attend to every position in the previous encoder layer.
3. The third type is decoder self-attention. This is similar to encoder self-attention, where all queries, keys, and values come from the previous layer. However, each decoder position may attend only to positions up to and including itself; the scores of future positions are masked with -Inf. This is called masked self-attention.
4. The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
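The masking in decoder self-attention can be illustrated directly. Setting future scores to -Inf before the softmax drives their attention weights to exactly zero (a minimal sketch with random Q, K, V in place of learned projections):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(Q, K, V):
    """Decoder self-attention: position i may attend only to
    positions j <= i; future positions get a score of -inf,
    which softmax turns into a weight of exactly 0."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # above diagonal
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights, weights @ V

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(4, 4))
weights, out = masked_self_attention(Q, K, V)
print(np.round(weights, 2))
# Row i has zero weight on every column j > i; row 0 attends only to itself.
```

This lower-triangular weight pattern is what lets the decoder be trained on whole sequences at once without letting any position peek at future words.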
Comparison to RNNs
The Transformer architecture eliminates the time-dependent aspect of the RNN architecture by handling those aspects of learning in a completely different way. The Transformer has as many positions as there are words in the longest sentence, but these positions are processed independently of one another rather than sequentially, as in an RNN. It is therefore highly parallel and efficient to compute.
Transformers are not better than traditional RNNs in all applications; RNNs still win in some contexts. But in the applications where Transformers match or beat traditional RNNs, they do so at lower computational cost.
Advantages of Transformers
1. They hold the potential to understand the relationships between sequential elements that are far from each other.
2. They are typically more accurate on sequence tasks such as translation.
3. They can attend to every element in the sequence, regardless of its position.
4. Transformers can process and train on more data in less time, since positions are processed in parallel.
5. They can work with virtually any kind of sequential data.
6. Transformers have also proved helpful in anomaly detection.
The Transformer model is a new kind of encoder-decoder model that uses self-attention to make sense of language sequences. This allows parallel processing, making it much faster than any other model with comparable performance. In doing so, Transformers paved the way for modern language models (such as BERT and GPT) and, more recently, generative models.