1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:05,000
So we are going to continue our discussion of the encoder-decoder sequence-to-sequence

3
00:00:05,000 --> 00:00:06,000
architecture.

4
00:00:06,000 --> 00:00:10,000
Already we have seen how it exactly works.

5
00:00:10,000 --> 00:00:14,000
We have seen the architecture with its complete in-depth intuition.

6
00:00:14,000 --> 00:00:21,000
But let us go ahead and try to understand the problems with the encoder-decoder architecture,

7
00:00:21,000 --> 00:00:23,000
sequence to sequence architecture.

8
00:00:23,000 --> 00:00:25,000
First, we will just try to understand the problem.

9
00:00:25,000 --> 00:00:30,000
And what we are going to use to solve that particular problem, and what architecture

10
00:00:30,000 --> 00:00:35,000
change will happen in this encoder-decoder sequence-to-sequence architecture, that also

11
00:00:35,000 --> 00:00:38,000
we will be seeing in the upcoming video.

12
00:00:38,000 --> 00:00:43,000
So let me quickly go ahead and draw the generic structure again.

13
00:00:43,000 --> 00:00:50,000
So this is my encoder. Now, with respect to the encoder, let's say I'm using an LSTM.

14
00:00:50,000 --> 00:00:52,000
So over here I have my LSTM.

15
00:00:52,000 --> 00:00:57,000
Here also I have my LSTM, and similarly here also I have my LSTM.

16
00:00:57,000 --> 00:01:03,000
Let's say, with respect to time, I have just drawn these for t is equal to 1, 2, 3, 4.

17
00:01:03,000 --> 00:01:04,000
Right.

18
00:01:04,000 --> 00:01:08,000
And here, you know that we use an embedding layer.

19
00:01:08,000 --> 00:01:11,000
So here I'm going to basically use an embedding layer.

20
00:01:11,000 --> 00:01:14,000
So embedding layer okay.

21
00:01:14,000 --> 00:01:18,000
Now, to this embedding layer, we are going to give X11.

22
00:01:19,000 --> 00:01:24,000
Let's say this is just for the first sentence, then X12, X13.

23
00:01:24,000 --> 00:01:28,000
And this will be my X14, which is the end of sentence, okay?

24
00:01:29,000 --> 00:01:34,000
Now you know that all of this information goes in over here in the form of vectors.

25
00:01:34,000 --> 00:01:37,000
So here a vector will be created.

26
00:01:37,000 --> 00:01:40,000
Here also a vector will be created, based on this embedding layer.

27
00:01:40,000 --> 00:01:42,000
Here also a vector will be created.

28
00:01:42,000 --> 00:01:44,000
And finally, here also a vector will be created.

29
00:01:45,000 --> 00:01:45,000
Okay.

30
00:01:45,000 --> 00:01:54,000
Now, with respect to these LSTMs, you know that whatever output

31
00:01:54,000 --> 00:01:55,000
we get at the earlier time steps,

32
00:01:55,000 --> 00:01:57,000
we do not consider.

33
00:01:57,000 --> 00:02:02,000
Only the final output here will be considered, which will be my w.

34
00:02:02,000 --> 00:02:03,000
Let's consider this as w.

35
00:02:03,000 --> 00:02:06,000
This is my context vector, right?

36
00:02:06,000 --> 00:02:11,000
And usually this context vector is represented by two important lines.

37
00:02:11,000 --> 00:02:15,000
One is your long-term memory, the cell state, and the other is your hidden state.

38
00:02:15,000 --> 00:02:17,000
So I will just combine this specific line over here.

39
00:02:17,000 --> 00:02:18,000
Right.

40
00:02:18,000 --> 00:02:22,000
And only this line needs to be forwarded on to the decoder.

41
00:02:22,000 --> 00:02:23,000
Right.

42
00:02:23,000 --> 00:02:30,000
So any use case that has this entire encoder-decoder architecture is usually

43
00:02:30,000 --> 00:02:32,000
solving a sequence to sequence problem statement.

44
00:02:32,000 --> 00:02:33,000
Right.

45
00:02:33,000 --> 00:02:36,000
So here, let's consider this is my decoder.

46
00:02:36,000 --> 00:02:41,000
And here again I will have my LSTMs, okay?

47
00:02:41,000 --> 00:02:45,000
And let's say in this case I'll just go ahead and draw three, right?

48
00:02:45,000 --> 00:02:47,000
So what is going to happen?

49
00:02:47,000 --> 00:02:50,000
This entire line will get passed to the decoder, okay?

50
00:02:50,000 --> 00:02:54,000
And here I will pass my SOS.

51
00:02:54,000 --> 00:02:56,000
That is, let's say, the start of the sentence.

52
00:02:56,000 --> 00:02:57,000
Okay?

53
00:02:57,000 --> 00:02:58,000
I will get my output.

54
00:02:59,000 --> 00:03:03,000
This state will still be passed along till the last LSTM.

55
00:03:03,000 --> 00:03:09,000
But this particular output over here, let me just go ahead and create this.

56
00:03:09,000 --> 00:03:14,000
This will be passed to a softmax activation function.

57
00:03:14,000 --> 00:03:18,000
Softmax activation function.

58
00:03:18,000 --> 00:03:19,000
Right.

59
00:03:19,000 --> 00:03:23,000
And with respect to this softmax activation function, I'm going to get the output.

60
00:03:23,000 --> 00:03:30,000
Then this output will be passed as the input to the next LSTM.

61
00:03:30,000 --> 00:03:32,000
Similarly this will get passed over here.

62
00:03:32,000 --> 00:03:37,000
Again, that output will go as the input to the next one.

63
00:03:37,000 --> 00:03:41,000
And finally I get my last output, which may be my end of sentence.

64
00:03:41,000 --> 00:03:42,000
Right?

65
00:03:42,000 --> 00:03:44,000
And these will be my y hat outputs.

66
00:03:44,000 --> 00:03:44,000
Right.

67
00:03:44,000 --> 00:03:46,000
It can be y1 hat, y2 hat.

68
00:03:46,000 --> 00:03:52,000
Now, this is the entire architecture of an encoder-decoder.
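A minimal sketch of the decoding loop just described, assuming the most likely word is picked after the softmax at every step. The vocabulary and toy_decoder_step are hypothetical stand-ins for a trained LSTM step, not a real model.
```python
import math
VOCAB = ["<sos>", "how", "are", "you", "<eos>"]  # hypothetical toy vocabulary
def softmax(scores):
    # Turn raw scores into probabilities that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def toy_decoder_step(token_id, state):
    # Stand-in for one LSTM step: it simply scores the next word in VOCAB
    # order highest and counts steps in "state". A real step would use
    # learned weights and the hidden and cell states.
    scores = [0.0] * len(VOCAB)
    scores[min(token_id + 1, len(VOCAB) - 1)] = 5.0
    return scores, state + 1
def greedy_decode(context, max_len=10):
    # Start from SOS with the encoder's context as the initial state.
    token_id, state, output = VOCAB.index("<sos>"), context, []
    for _ in range(max_len):
        scores, state = toy_decoder_step(token_id, state)
        probs = softmax(scores)
        token_id = probs.index(max(probs))  # this output feeds the next step
        output.append(VOCAB[token_id])
        if VOCAB[token_id] == "<eos>":
            break
    return output
print(greedy_decode(context=0))
```
Each predicted word goes back in as the next input, and the loop stops at the end of sentence token.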

69
00:03:52,000 --> 00:03:52,000
Right.

70
00:03:53,000 --> 00:03:56,000
But now let's focus on this context vector okay.

71
00:03:56,000 --> 00:04:00,000
So when I say this is my context vector, I'm considering it over here itself, because there will

72
00:04:00,000 --> 00:04:02,000
be two lines that will be passing.

73
00:04:02,000 --> 00:04:02,000
Right.

74
00:04:02,000 --> 00:04:04,000
One is my c_t, one is my h_t.

75
00:04:04,000 --> 00:04:08,000
Okay, if I combine both of them, this exactly becomes the context vector, right?

76
00:04:08,000 --> 00:04:12,000
So that is the reason why I'm representing it this way, because I'm just going to consider the output of the

77
00:04:12,000 --> 00:04:16,000
last LSTM with respect to time.

78
00:04:16,000 --> 00:04:18,000
Let's say this is the last time step, where the sentence ends.

79
00:04:18,000 --> 00:04:20,000
So this will be the context vector.

80
00:04:20,000 --> 00:04:23,000
But understand the importance of this context vector.

81
00:04:24,000 --> 00:04:32,000
This context vector represents the entire sentence, right?

82
00:04:32,000 --> 00:04:36,000
This entire sentence that we are passing right now.

83
00:04:36,000 --> 00:04:43,000
If this is a smaller sentence, right, this context vector will be more than sufficient to represent

84
00:04:43,000 --> 00:04:45,000
this particular sentence.

85
00:04:45,000 --> 00:04:47,000
But what did the researchers do?

86
00:04:47,000 --> 00:04:47,000
The researchers,

87
00:04:47,000 --> 00:04:59,000
when they were testing, right, used sentences of varying length to test it, right, sentences of

88
00:04:59,000 --> 00:05:00,000
varying length.

89
00:05:01,000 --> 00:05:02,000
Right.

90
00:05:02,000 --> 00:05:05,000
And they used a performance metric.

91
00:05:05,000 --> 00:05:06,000
Again.

92
00:05:06,000 --> 00:05:13,000
If you go ahead and see the research paper on the encoder and decoder, that research paper reported results

93
00:05:13,000 --> 00:05:15,000
in terms of the BLEU score, okay?

94
00:05:15,000 --> 00:05:19,000
And here, on the left-hand side, this will be the sentence length.

95
00:05:20,000 --> 00:05:26,000
Now, when they used this encoder-decoder architecture, you could see that they drew a graph which looks

96
00:05:26,000 --> 00:05:27,000
like this.

97
00:05:28,000 --> 00:05:29,000
Right.

98
00:05:29,000 --> 00:05:35,000
That basically means that as the sentence length kept on increasing, after a certain point,

99
00:05:36,000 --> 00:05:42,000
over here, let's say this is 30, 40, 50, as the sentence length went higher, the accuracy

100
00:05:42,000 --> 00:05:44,000
started decreasing.

101
00:05:44,000 --> 00:05:46,000
The BLEU score started decreasing.

102
00:05:46,000 --> 00:05:46,000
Right.

103
00:05:46,000 --> 00:05:52,000
And this BLEU score is the performance metric used to test how your model is performing with respect

104
00:05:52,000 --> 00:05:53,000
to sequence

105
00:05:53,000 --> 00:05:54,000
to sequence data.

106
00:05:55,000 --> 00:05:57,000
Sequence to sequence data.

107
00:05:57,000 --> 00:05:57,000
Right.

108
00:05:59,000 --> 00:06:00,000
So this was the problem.

109
00:06:00,000 --> 00:06:02,000
And why is this happening?

110
00:06:02,000 --> 00:06:06,000
To see why, understand the context vector that we are considering.

111
00:06:06,000 --> 00:06:08,000
We are not taking this output right.

112
00:06:08,000 --> 00:06:09,000
We are not taking this output.

113
00:06:09,000 --> 00:06:13,000
Instead we are just taking the last output, whatever context vector it creates.

114
00:06:13,000 --> 00:06:19,000
Now, this context vector will have more information about the nearest words, the ones just

115
00:06:19,000 --> 00:06:25,000
passed in at the nearest time steps. If this is my t is equal to four, t is equal to three,

116
00:06:25,000 --> 00:06:27,000
t is equal to two, t is equal to one,

117
00:06:27,000 --> 00:06:31,000
you know that this context vector will have more information about

118
00:06:31,000 --> 00:06:34,000
those words which are at the nearest time steps,

119
00:06:34,000 --> 00:06:35,000
right, at t is equal to four.

120
00:06:36,000 --> 00:06:38,000
But what about t is equal to one?

121
00:06:38,000 --> 00:06:43,000
And if my sentence keeps on becoming bigger and bigger, let's say this is t is equal to 100

122
00:06:43,000 --> 00:06:50,000
and there is t is equal to one, the single context vector will have more context information about

123
00:06:50,000 --> 00:06:53,000
the recent words with respect to the time step.

124
00:06:54,000 --> 00:06:57,000
But what about those words that came at t is equal to one, t is equal to two?

125
00:06:57,000 --> 00:07:02,000
It will have less context for them, because the context vector will have limited information about them.

126
00:07:02,000 --> 00:07:10,000
So the context vector that is used to represent the entire sentence will not be sufficient.

127
00:07:10,000 --> 00:07:10,000
Right.

128
00:07:10,000 --> 00:07:20,000
And that is the reason why, for longer sentences, the BLEU score was very, very low.

129
00:07:21,000 --> 00:07:28,000
And this was the problem with the encoder and decoder sequence to sequence architecture.
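To put rough numbers on this, here is a toy illustration, not a real LSTM. It assumes a simple recurrence where each new state keeps a fixed fraction (a hypothetical decay of 0.9) of the old state, and it shows how little of the first word is left in the final state as the sentence gets longer.
```python
# Toy recurrence (an assumption for illustration, not an LSTM):
#   h_t = decay * h_(t-1) + (1 - decay) * x_t, starting from h_0 = 0.
# Unrolling it, the word at step t ends up in the final state h_T
# with weight (1 - decay) * decay ** (T - t).
def share_of_step(t, length, decay=0.9):
    # Weight of the word seen at step t inside the final state.
    return (1 - decay) * decay ** (length - t)
for length in (4, 30, 100):
    first = share_of_step(1, length)       # word at t is equal to one
    last = share_of_step(length, length)   # most recent word
    print(f"length={length:3d}  first word: {first:.6f}  last word: {last:.6f}")
```
The most recent word always keeps the same weight, while the weight of the first word shrinks as the sentence length grows. A trained LSTM has gates and is not this simple, but a single fixed-size vector faces the same squeeze.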

130
00:07:28,000 --> 00:07:32,000
So how do we specifically fix this problem?

131
00:07:32,000 --> 00:07:41,000
To fix this problem, we have another amazing mechanism that we will be talking about, which is

132
00:07:41,000 --> 00:07:45,000
called the attention mechanism.

133
00:07:46,000 --> 00:07:48,000
And this attention mechanism,

134
00:07:48,000 --> 00:07:52,000
we will talk about it with respect to sequence-to-sequence networks.

135
00:07:52,000 --> 00:07:55,000
We'll try to understand the architecture in the next video.

136
00:07:55,000 --> 00:08:00,000
What exact change we need to make in this encoder and decoder.

137
00:08:00,000 --> 00:08:02,000
Because this is my encoder, right?

138
00:08:02,000 --> 00:08:04,000
This is my decoder.

139
00:08:04,000 --> 00:08:05,000
The work is very simple.

140
00:08:06,000 --> 00:08:10,000
My encoder will encode the information and my decoder will decode that information back.

141
00:08:10,000 --> 00:08:16,000
That is why we use it for use cases like text generation, sentence generation, language conversion.

142
00:08:16,000 --> 00:08:16,000
Right.

143
00:08:16,000 --> 00:08:18,000
Different kinds of use cases.

144
00:08:19,000 --> 00:08:24,000
So we will try to understand what exactly this attention mechanism is.

145
00:08:24,000 --> 00:08:25,000
Just to give you a brief idea.

146
00:08:25,000 --> 00:08:25,000
Right.

147
00:08:25,000 --> 00:08:33,000
Whenever we have a longer sentence or a longer paragraph, along with this context vector, right, that

148
00:08:33,000 --> 00:08:38,000
we are passing, we will pass additional context.

149
00:08:39,000 --> 00:08:41,000
How do we pass this additional context?

150
00:08:41,000 --> 00:08:42,000
We will see about it.

151
00:08:42,000 --> 00:08:44,000
How will the architecture diagram change?

152
00:08:44,000 --> 00:08:45,000
We'll see about it.

153
00:08:45,000 --> 00:08:51,000
But just to give you an idea, here is the research paper that we will be discussing: Neural Machine

154
00:08:51,000 --> 00:08:54,000
Translation by Jointly Learning to Align and Translate.

155
00:08:54,000 --> 00:08:59,000
And over here, if you go ahead and see this architecture, this is my encoder.

156
00:08:59,000 --> 00:09:05,000
This is my decoder. We will be discussing this particular architecture and how this architecture

157
00:09:05,000 --> 00:09:06,000
is different from the previous one.

158
00:09:06,000 --> 00:09:11,000
Here you can see that in the encoder you don't have just one LSTM layer.

159
00:09:11,000 --> 00:09:14,000
You have a bidirectional LSTM layer.

160
00:09:14,000 --> 00:09:19,000
So here also you have this, here also you have this, and then you combine all the information.

161
00:09:19,000 --> 00:09:22,000
You pass that additional context to your decoder.
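As a rough preview of "combine all the information", here is a minimal sketch that weights every encoder state and sums them into one extra context for the decoder. It assumes a plain dot-product score for simplicity, whereas the paper learns its scoring function, and the vectors and names are made up for illustration.
```python
import math
def softmax(scores):
    # Turn raw scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def attend(decoder_state, encoder_states):
    # Score each encoder state against the current decoder state
    # (dot product here; this scoring choice is an assumption).
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    # The additional context is the weighted sum of all encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context
# One made-up vector per input time step, t is equal to 1, 2, 3, 4.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights, context)
```
Every time step gets its own weight here, so an early word can still contribute to the context no matter how long the sentence is.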

162
00:09:22,000 --> 00:09:25,000
And that is what we are going to have a look at.

163
00:09:25,000 --> 00:09:26,000
Right.

164
00:09:26,000 --> 00:09:29,000
And then we'll understand each and every question as we go ahead.

165
00:09:29,000 --> 00:09:30,000
Right.

166
00:09:30,000 --> 00:09:37,000
Because whatever output you predict with the decoder here, you will

167
00:09:37,000 --> 00:09:39,000
have the context of the previous state.

168
00:09:39,000 --> 00:09:42,000
You will have this additional information that is coming in.

169
00:09:43,000 --> 00:09:48,000
And along with that, whatever your previous output was, that will also be there.

170
00:09:48,000 --> 00:09:49,000
Right.

171
00:09:49,000 --> 00:09:51,000
So we will try to combine all those things and we'll try to see.

172
00:09:51,000 --> 00:09:55,000
But before that, let's go ahead and see.

173
00:09:55,000 --> 00:09:57,000
And I hope you have understood

174
00:09:57,000 --> 00:10:02,000
what the problem is with the encoder-decoder sequence-to-sequence architecture.

175
00:10:02,000 --> 00:10:03,000
And that is what I'm going to discuss.

176
00:10:03,000 --> 00:10:08,000
In the next video, we will discuss the attention mechanism, how we will provide some additional

177
00:10:08,000 --> 00:10:11,000
attention while training this entire model.

178
00:10:11,000 --> 00:10:13,000
So yes, this was it from my side.

179
00:10:13,000 --> 00:10:14,000
I will see you all in the next video.

180
00:10:14,000 --> 00:10:14,000
Thank you.

