In this tutorial, you will learn what the attention mechanism is and how to implement an attention model architecture from scratch.
So let's dive in.
Before moving ahead, we should first understand the encoder-decoder architecture. In this architecture, we encode a sequence using RNN cells such as LSTMs or GRUs, and the entire sequence is compressed into a single vector. This encoded vector is then passed to the decoder, which is also an RNN layer, and the decoder turns the encoded vector into a sequence in the output space.

But there is a significant problem in the traditional encoder-decoder architecture: as the sequence length grows, it becomes very hard to hold the entire sequence in a single vector, so the decoder never gets the full context of the sequence and decodes poorly.
This effect is known as the bottleneck problem of the encoder-decoder architecture.
The attention mechanism provides a solution to this bottleneck: instead of relying on just a single encoded vector, we generate a separate vector for every decoded word, known as the context vector.

How to Generate the Context Vector:
We take a weighted sum over the entire input sequence. These weights change for each word in the output sequence, and they are obtained by training a feed-forward network that takes the input sequence (the encoder outputs) and the decoder's hidden state as input.
In general, we can say that with the attention mechanism, each output word can focus on any word or group of words from the input sequence.
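Concretely, in Bahdanau-style (additive) attention, if $h_1, \dots, h_n$ are the encoder outputs and $s_{t-1}$ is the previous decoder hidden state, the context vector $c_t$ for output step $t$ is computed as:

$$e_{t,i} = v^{\top}\tanh(W_1 h_i + W_2 s_{t-1}), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n}\exp(e_{t,j})}, \qquad c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$$

Here $W_1$, $W_2$ and $v$ are the learned weights of that small feed-forward scoring network.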
How to Implement the Attention Architecture in Python:
Let's first implement the encoder part. The encoder takes a sequence as input, which may come from a previous LSTM or embedding layer.
The shape of the input will be (batch_size, seq_length, embd_dims).
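A minimal sketch of such an encoder in TensorFlow/Keras could look like this (the class name Encoder and the enc_units argument are just illustrative; a single Dense projection with a ReLU is assumed):

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Projects (batch_size, seq_length, embd_dims) -> (batch_size, seq_length, enc_units)."""
    def __init__(self, enc_units):
        super().__init__()
        self.fc = tf.keras.layers.Dense(enc_units)

    def call(self, x):
        # x: (batch_size, seq_length, embd_dims)
        x = self.fc(x)          # project the last dimension to enc_units
        return tf.nn.relu(x)    # (batch_size, seq_length, enc_units)
```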

As we can see in the code, the encoder simply transforms embd_dims into whichever output size we give it explicitly.
Now, let's introduce the decoder layer (we will implement the attention layer after this).
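Here is a sketch of such a decoder, assuming the same TensorFlow/Keras setup: a GRU plus the BahdanauAttention layer that is implemented a bit further down (names like dec_units are my choices here).

```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(dec_units)   # implemented below

    def call(self, x, features, hidden):
        # x:        (batch_size, 1)                      id of the previous word
        # features: (batch_size, seq_length, enc_units)  full encoder output
        # hidden:   (batch_size, dec_units)              previous GRU state (all zeros at step 0)
        context_vector, attention_weights = self.attention(hidden, features)

        x = self.embedding(x)                                    # (batch_size, 1, embedding_dim)
        # prepend the context vector to the embedded word
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        output, state = self.gru(x, initial_state=hidden)
        output = tf.reshape(output, (-1, output.shape[2]))       # (batch_size, dec_units)
        logits = self.fc(output)                                 # (batch_size, vocab_size)
        return logits, state, attention_weights
```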

Now, let's explain the decoder layer. In the call function, x is the previously predicted word (a single word), features is the entire output of the encoding layer, and hidden is the previous hidden state of the GRU layer (all zeros initially).
The output of the decoder layer has shape (batch_size, len(vocab)) on each run; from it we take the word with the maximum probability (if we want greedy search), or otherwise the most likely word under whatever decoding strategy we choose.
In the decoder code we used BahdanauAttention, so let's discuss what it is and how to implement it.
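Below is a sketch of the Bahdanau (additive) attention layer in TensorFlow/Keras, following the equations shown earlier; treat it as one possible implementation rather than the only one.

```python
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects the encoder output
        self.W2 = tf.keras.layers.Dense(units)   # projects the decoder hidden state
        self.V = tf.keras.layers.Dense(1)        # turns each position into a single score

    def call(self, query, values):
        # query:  (batch_size, dec_units)             decoder hidden state
        # values: (batch_size, seq_length, enc_units) encoder output
        query_with_time_axis = tf.expand_dims(query, 1)

        # additive score for every input position, softmaxed over the sequence axis
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)          # (batch_size, seq_length, 1)

        # context vector = weighted sum of the encoder output
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)   # (batch_size, enc_units)
        return context_vector, attention_weights
```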

You can implement it yourself or copy it from here; it will just work, so you don't need to panic about how to implement this Bahdanau attention layer.
The thing you should remember from this layer is its output, the context vector: one attended summary of the entire input sequence for each decoded word, shaped (batch_size, enc_units) in the sketch above (the decoder adds the time axis back before concatenating it with the embedded word).
Now we have to create the encoder and decoder objects.
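For example (256 and 512 are the sizes this tutorial uses; vocab is assumed to be your word-to-index mapping):

```python
enc_units = 256           # size the encoder projects the inputs to
embedding_dim = 256       # decoder word-embedding size
dec_units = 512           # GRU / attention units in the decoder
vocab_size = len(vocab)   # 'vocab' is assumed to be your word -> index mapping

encoder = Encoder(enc_units)
decoder = Decoder(vocab_size, embedding_dim, dec_units)
```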

Note: you can change 256 or 512 according to your requirements.
Training the Attention Model:
In training we will use teacher forcing: since we have the target output data, instead of feeding the previously predicted word back into the network, we feed the previous ground-truth word.
For training we will use TensorFlow's GradientTape.
But first, let's define the loss object and the loss function.
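A sketch of the masked loss, assuming the decoder outputs raw logits and index 0 is the <pad> token:

```python
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')    # keep per-token losses so we can mask them

def loss_function(real, pred):
    # real: (batch_size,) true word ids, pred: (batch_size, vocab_size) decoder logits
    mask = tf.math.logical_not(tf.math.equal(real, 0))   # 0 is the <pad> token
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)            # zero out the padded positions
    return tf.reduce_mean(loss_)
```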

We are going to use sparse categorical cross-entropy loss, and we will also use masking because of the padded sequences; we used index 0 for the <pad> token.
So our loss function multiplies the loss at padded positions by zero.
Now, we have to define our optimizer and some training hyperparameters.
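For example (the batch size, epoch count, and steps per epoch here are placeholders; tune them for your dataset):

```python
optimizer = tf.keras.optimizers.Adam()

BATCH_SIZE = 64        # placeholder values - adjust for your data
EPOCHS = 10
steps_per_epoch = 100  # how many batches train_data_gen provides per epoch (assumed)
```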

We will use the Adam optimizer, and we assume a train_data_gen generator that returns random batches of samples from our dataset; you can implement that part yourself.
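Putting it together, here is a sketch of the training loop under the assumptions above: train_data_gen(BATCH_SIZE) stands in for your own batch generator, vocab['<start>'] is an assumed start-of-sequence token, and the loss of every batch is recorded so it can be plotted afterwards.

```python
import matplotlib.pyplot as plt

start_token = vocab['<start>']   # assumed: every target sequence begins with a <start> token
loss_history = []                # per-batch loss, for the plot

def train_step(inp, targ):
    loss = 0.0
    hidden = tf.zeros((BATCH_SIZE, dec_units))                    # initial decoder state: all zeros
    dec_input = tf.expand_dims([start_token] * BATCH_SIZE, 1)     # first decoder input: <start>

    with tf.GradientTape() as tape:
        features = encoder(inp)                                   # (batch, seq_length, enc_units)
        for t in range(1, targ.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)             # teacher forcing: feed the true word

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(targ.shape[1])

for epoch in range(EPOCHS):
    for step in range(steps_per_epoch):
        inp, targ = train_data_gen(BATCH_SIZE)   # assumed: returns one random (input, target) batch
        batch_loss = train_step(inp, targ)
        loss_history.append(float(batch_loss))

plt.plot(loss_history)
plt.xlabel('batch')
plt.ylabel('loss')
plt.show()
```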

Here, we train the model and also plot the loss on a per-batch basis.
So that's it. Now you know what the attention architecture is and how to implement it, and you can use it in your own tasks.
Here are some applications where we can use the attention architecture:
Image Captioning
Neural Machine Translation
Text to Summary Generator
Further Reading :
Neural Machine Translation by Jointly Learning to Align and Translate, 2015.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015.
Hierarchical Attention Networks for Document Classification, 2016.
Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016.
Effective Approaches to Attention-based Neural Machine Translation, 2015.
Article by: Gajesh Ladhar