1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to our decoders.

3
00:00:04,000 --> 00:00:09,000
In our previous video, we already discussed this really important topic, which is called

4
00:00:09,000 --> 00:00:11,000
masked multi-head attention.

5
00:00:11,000 --> 00:00:17,000
And we understood how the different types of masking are done and what is the purpose of the

6
00:00:17,000 --> 00:00:18,000
masking.

7
00:00:18,000 --> 00:00:18,000
Right.

8
00:00:19,000 --> 00:00:23,000
Then after this, add and normalization will happen over here.

9
00:00:23,000 --> 00:00:26,000
And when I say normalization, layer normalization will basically happen.

10
00:00:26,000 --> 00:00:30,000
And this is the same layer normalization that we discussed in the encoder.

11
00:00:30,000 --> 00:00:31,000
Right.

12
00:00:31,000 --> 00:00:35,000
So from here you'll be able to see that a residual connection is there, which is also called a

13
00:00:35,000 --> 00:00:36,000
skip connection.

14
00:00:36,000 --> 00:00:41,000
So the information through this skip connection is also going forward, right.

15
00:00:41,000 --> 00:00:45,000
So I will also go ahead and say skip connection okay.

16
00:00:45,000 --> 00:00:52,000
And with the help of this, the decoder input will get added to the output vectors

17
00:00:52,000 --> 00:00:53,000
that are coming from the masked multi-head attention.

18
00:00:53,000 --> 00:00:54,000
Right.
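
The add and normalization step described above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming toy sizes (3 tokens, width 8) and leaving out the learnable scale and shift that layer normalization normally has; the names `layer_norm` and `add_and_norm` are made up for this sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    """Skip connection (x + sublayer output) followed by layer normalization."""
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))          # input to the masked attention sub-layer
attn_out = rng.normal(size=(3, 8))   # stand-in for the masked attention output
y = add_and_norm(x, attn_out)        # same shape as x, normalized per token
```

The skip connection is the `x + sublayer_out` part: the sub-layer's input is carried around the sub-layer and added back before normalizing.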

19
00:00:54,000 --> 00:00:57,000
And how did we calculate the output vector step by step?

20
00:00:57,000 --> 00:01:00,000
Each and every step was shown in the previous video.

21
00:01:00,000 --> 00:01:00,000
Right?

22
00:01:00,000 --> 00:01:06,000
We went till the softmax and found the softmax scores, and these were the scores

23
00:01:06,000 --> 00:01:09,000
with respect to the vectors that I had, right?

24
00:01:09,000 --> 00:01:16,000
What we are going to discuss in this particular video is encoder-decoder multi-head

25
00:01:16,000 --> 00:01:17,000
attention.

26
00:01:17,000 --> 00:01:17,000
Okay.

27
00:01:17,000 --> 00:01:21,000
So this multi-head attention is also called encoder-decoder attention.

28
00:01:21,000 --> 00:01:24,000
Why is it specifically called encoder-decoder attention?

29
00:01:24,000 --> 00:01:26,000
We'll discuss that now.

30
00:01:26,000 --> 00:01:32,000
And you'll be able to see that two of the connections are coming from the encoder output.

31
00:01:32,000 --> 00:01:33,000
Right.

32
00:01:34,000 --> 00:01:40,000
So here, from the encoder output, we are specifically taking the

33
00:01:40,000 --> 00:01:43,000
key and value vectors.

34
00:01:43,000 --> 00:01:43,000
Okay.

35
00:01:43,000 --> 00:01:48,000
So we are going to take the key and value vectors that come out of this particular

36
00:01:48,000 --> 00:01:49,000
encoder.

37
00:01:49,000 --> 00:01:53,000
And we are going to give them to the decoder's multi-head attention.

38
00:01:53,000 --> 00:01:58,000
And from the add and normalization layer here, what information will be

39
00:01:58,000 --> 00:01:59,000
going from here.

40
00:01:59,000 --> 00:02:02,000
Whatever query vector is basically generated.

41
00:02:02,000 --> 00:02:03,000
Right.

42
00:02:03,000 --> 00:02:05,000
That will specifically be going.

43
00:02:05,000 --> 00:02:10,000
Now you may be thinking: why the query from here, and why keys and values from the encoder output?

44
00:02:10,000 --> 00:02:12,000
The reason is very very simple.

45
00:02:13,000 --> 00:02:20,000
So here you'll be able to see, from the encoder output,

46
00:02:20,000 --> 00:02:21,000
Right.

47
00:02:21,000 --> 00:02:23,000
What we are specifically sending.

48
00:02:23,000 --> 00:02:30,000
We are sending a set of attention vectors.

49
00:02:34,000 --> 00:02:36,000
Attention vectors K and V.

50
00:02:36,000 --> 00:02:37,000
Right.

51
00:02:37,000 --> 00:02:46,000
And other than this, you'll be able to see the output from this masked multi-head

52
00:02:46,000 --> 00:02:47,000
attention.

53
00:02:47,000 --> 00:02:53,000
So the output from masked multi-head attention.

54
00:02:53,000 --> 00:02:55,000
Here we are sending.

55
00:02:57,000 --> 00:02:58,000
Attention.

56
00:02:58,000 --> 00:02:59,000
Vector.

57
00:03:04,000 --> 00:03:06,000
Q, which is also called the query vector.

58
00:03:08,000 --> 00:03:12,000
Now this query vector carries the information that we are processing over here.

59
00:03:12,000 --> 00:03:12,000
Right?

60
00:03:12,000 --> 00:03:16,000
We shift the outputs to the right, then we get the output embedding.

61
00:03:16,000 --> 00:03:19,000
Then we add the positional encoding, okay.

62
00:03:19,000 --> 00:03:21,000
And there is one very important thing.

63
00:03:21,000 --> 00:03:24,000
I also really want to discuss it, based on your input and output.

64
00:03:24,000 --> 00:03:26,000
And I'll discuss it in another five minutes.

65
00:03:26,000 --> 00:03:26,000
Okay.

66
00:03:26,000 --> 00:03:34,000
And why we are doing it is very simple, because you can understand that these keys

67
00:03:34,000 --> 00:03:52,000
and values are to be used by each decoder in its

68
00:03:54,000 --> 00:03:56,000
encoder-

69
00:03:57,000 --> 00:03:58,000
decoder

70
00:04:00,000 --> 00:04:01,000
attention layer.

71
00:04:04,000 --> 00:04:06,000
And why specifically we are doing this?

72
00:04:06,000 --> 00:04:28,000
Because this helps the decoder to focus on appropriate places in

73
00:04:28,000 --> 00:04:30,000
the input sequence.

74
00:04:32,000 --> 00:04:37,000
So during training time, we know that here we will be giving our training output, right?

75
00:04:37,000 --> 00:04:39,000
Whatever is my y, and we need to predict y hat.

76
00:04:39,000 --> 00:04:40,000
Right.

77
00:04:40,000 --> 00:04:45,000
So over here, the query comes from y, and the keys and values

78
00:04:45,000 --> 00:04:52,000
come from the encoder output. Both will be given together: the set of three vectors K, V, and

79
00:04:52,000 --> 00:04:56,000
query, where the query comes from this masked multi-head attention, will be given to the multi-head attention,

80
00:04:56,000 --> 00:04:59,000
which is also called encoder-decoder multi-head attention.

81
00:04:59,000 --> 00:05:06,000
And based on this, whatever necessary operations take place in multi-head attention will happen,

82
00:05:06,000 --> 00:05:06,000
right?

83
00:05:06,000 --> 00:05:08,000
So in the multi-head attention, what will happen?

84
00:05:08,000 --> 00:05:10,000
Attention will happen inside it, but here between the decoder queries and the encoder keys and values.

85
00:05:10,000 --> 00:05:12,000
Then you go ahead and multiply.

86
00:05:12,000 --> 00:05:18,000
the query with the keys, then you apply a softmax, then you multiply with

87
00:05:18,000 --> 00:05:19,000
the values.

88
00:05:19,000 --> 00:05:24,000
That is how you calculate the output; we have discussed each and every step of this

89
00:05:24,000 --> 00:05:28,000
in the previous session, where we went through all the numerical examples of

90
00:05:28,000 --> 00:05:29,000
what is going to happen, right?

91
00:05:29,000 --> 00:05:32,000
The same thing is going to happen over here.

92
00:05:32,000 --> 00:05:37,000
K and V are going to come from the encoder output, and the query is going to come from here.

93
00:05:37,000 --> 00:05:37,000
Right?
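
The steps just listed (multiply the query with the keys, apply a softmax, multiply with the values) can be sketched as below. This is a single-head illustration under assumed toy sizes, with random matrices standing in for the learned weights `W_q`, `W_k`, `W_v`; the division by the square root of the key size is the scaling used in the original Transformer paper. The function name `cross_attention` is made up for this sketch.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out, W_q, W_k, W_v):
    Q = dec_x @ W_q                           # queries come from the decoder side
    K = enc_out @ W_k                         # keys come from the encoder output
    V = enc_out @ W_v                         # values come from the encoder output
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (decoder tokens, encoder tokens)
    weights = softmax(scores)                 # where each decoder token looks
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                           # assumed toy sizes
enc_out = rng.normal(size=(3, d_model))       # 3 source words
dec_x = rng.normal(size=(2, d_model))         # 2 target positions so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = cross_attention(dec_x, enc_out, W_q, W_k, W_v)
```

Each row of `weights` sums to 1 and has one entry per source word, which is what "focus on appropriate places in the input sequence" means in practice.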

94
00:05:37,000 --> 00:05:40,000
Then all the necessary operations will take place during training.

95
00:05:40,000 --> 00:05:43,000
And then after this, add and normalization will again happen.

96
00:05:43,000 --> 00:05:47,000
And again here also you will be able to see whatever output I am getting over here, right.

97
00:05:47,000 --> 00:05:53,000
with respect to the queries, that will also be forwarded to the next layer, which is called add

98
00:05:53,000 --> 00:05:54,000
and normalization.

99
00:05:54,000 --> 00:05:58,000
This is just to make sure that the earlier information is also carried to the next layer.

100
00:05:58,000 --> 00:05:59,000
Right.

101
00:05:59,000 --> 00:06:04,000
So by this you will be able to understand what this multi-head attention does.

102
00:06:04,000 --> 00:06:10,000
And the major thing that is important for you all to understand is, for this multi-head attention,

103
00:06:10,000 --> 00:06:13,000
what is coming from the encoder output? What is the encoder output?

104
00:06:13,000 --> 00:06:15,000
That is the set of key and value vectors.

105
00:06:15,000 --> 00:06:18,000
And the query vectors will come from here.

106
00:06:18,000 --> 00:06:18,000
Right.

107
00:06:18,000 --> 00:06:20,000
And why do we do this?

108
00:06:20,000 --> 00:06:23,000
See, at the end of the day, you have a data set, right.

109
00:06:24,000 --> 00:06:30,000
So in this data set you have some input and output, right? In the input,

110
00:06:30,000 --> 00:06:32,000
let's say you have x1, x2, x3.

111
00:06:32,000 --> 00:06:35,000
In the output you have y1, y2, y3.

112
00:06:35,000 --> 00:06:35,000
Right.

113
00:06:35,000 --> 00:06:39,000
So while training I have to pass x1, x2, x3 over here.

114
00:06:39,000 --> 00:06:40,000
Right.

115
00:06:40,000 --> 00:06:43,000
So when we have these inputs, we need to pass the inputs from here.

116
00:06:43,000 --> 00:06:44,000
The whole process will basically happen.

117
00:06:44,000 --> 00:06:49,000
And with respect to that, while we are training this entire transformer with respect to our input

118
00:06:49,000 --> 00:06:54,000
and output, let's say the example that I'm actually going to take is machine translation.

119
00:06:54,000 --> 00:06:58,000
So here also we need to pass the output, which will be my y, right?

120
00:06:58,000 --> 00:07:00,000
Again, output embedding will happen.

121
00:07:00,000 --> 00:07:01,000
Positional encoding will happen.

122
00:07:01,000 --> 00:07:04,000
Then after that, masked multi-head attention will happen, right?

123
00:07:04,000 --> 00:07:07,000
We have discussed what masked multi-head attention is all about, right?

124
00:07:07,000 --> 00:07:12,000
But when we come over here, we need to make sure that, based on these particular vectors, we

125
00:07:12,000 --> 00:07:14,000
match them against the key and value vectors.

126
00:07:14,000 --> 00:07:19,000
So through training, the model learns how this query can be matched with this K and V,

127
00:07:19,000 --> 00:07:27,000
so that it will be able to focus on the appropriate places in the input sequence to

128
00:07:27,000 --> 00:07:28,000
predict the next word.

129
00:07:28,000 --> 00:07:29,000
Right.
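
To make "shifting the outputs to the right" concrete, here is a small sketch of how one training pair for the decoder could be built. The token strings `<start>` and `<end>` and the helper name `make_decoder_pair` are assumptions for illustration. Feeding the true previous words like this is commonly called teacher forcing, a term the lecture itself does not use; with it, the whole shifted sentence can be given to the decoder at once, and the masking from the previous video keeps each position from seeing later words.

```python
START, END = "<start>", "<end>"   # hypothetical special tokens

def make_decoder_pair(target_tokens):
    """Return (decoder input, labels) for one target sentence y."""
    decoder_input = [START] + list(target_tokens)   # y shifted right by one
    labels = list(target_tokens) + [END]            # the word to predict at each position
    return decoder_input, labels

dec_in, labels = make_decoder_pair(["I", "am", "a", "student"])
# dec_in -> ['<start>', 'I', 'am', 'a', 'student']
# labels -> ['I', 'am', 'a', 'student', '<end>']
```

At every position the decoder sees the words before it and is asked to predict the label at that position, which is the next word.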

130
00:07:29,000 --> 00:07:32,000
And this is really important for you all to understand.

131
00:07:32,000 --> 00:07:33,000
Okay.

132
00:07:33,000 --> 00:07:37,000
Then after that you have this feed-forward neural network and add and normalization.

133
00:07:37,000 --> 00:07:39,000
This is the same thing that we discussed in the encoder.

134
00:07:39,000 --> 00:07:41,000
Now let me do one thing.

135
00:07:41,000 --> 00:07:44,000
Let me go ahead and show you one of the most amazing diagrams.

136
00:07:44,000 --> 00:07:48,000
And by this you will get a clear idea how the entire training basically happens.

137
00:07:48,000 --> 00:07:51,000
So again, this is the blog from Jay Alammar.

138
00:07:51,000 --> 00:07:55,000
Transformers have been explained beautifully over here.

139
00:07:55,000 --> 00:08:00,000
What I have actually done is put more stress on the research paper, to explain

140
00:08:00,000 --> 00:08:02,000
to you each and everything that is going on under the hood.

141
00:08:02,000 --> 00:08:03,000
Okay.

142
00:08:03,000 --> 00:08:04,000
So see over here.

143
00:08:04,000 --> 00:08:10,000
So on the decoder side, we know how the components of the decoder work as well, okay.

144
00:08:10,000 --> 00:08:12,000
The encoder starts by processing the input sequence.

145
00:08:12,000 --> 00:08:15,000
So from here the input sequence basically starts.

146
00:08:15,000 --> 00:08:20,000
The output of the top encoder is transformed into a set of attention vectors K and V.

147
00:08:20,000 --> 00:08:23,000
So from here I will be getting this K and V from the encoder, right.

148
00:08:23,000 --> 00:08:30,000
Then they are used by each decoder in its encoder-decoder attention layer.

149
00:08:30,000 --> 00:08:31,000
Okay.
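
One way to picture "used by each decoder" is sketched below: the encoder output is computed once for the source sentence, and every decoder layer derives its own K and V from that same output. The layer count, sizes, and the dictionary layout are assumptions for this sketch, and random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 2                  # assumed toy sizes
enc_out = rng.normal(size=(3, d_model))   # top encoder output for 3 source words

# Hypothetical per-layer projection weights for the encoder-decoder attention.
layers = [
    {"W_k": rng.normal(size=(d_model, d_model)),
     "W_v": rng.normal(size=(d_model, d_model))}
    for _ in range(n_layers)
]

# K and V depend only on the source sentence, so they can be computed once
# and reused at every decoding step.
kv_per_layer = [(enc_out @ p["W_k"], enc_out @ p["W_v"]) for p in layers]
```

Only the query changes from one decoding step to the next, because only the decoder side receives new words.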

150
00:08:31,000 --> 00:08:37,000
So that is where we are using them inside our decoder: in the encoder-decoder

151
00:08:37,000 --> 00:08:42,000
attention layer, which is nothing but encoder-decoder multi-head attention, which helps the decoder

152
00:08:42,000 --> 00:08:45,000
focus on the appropriate places in the input sequence.

153
00:08:45,000 --> 00:08:51,000
Okay, now let's see this. At time step 1, we need to pass this entire information.

154
00:08:51,000 --> 00:08:52,000
See, at time step 1.

155
00:08:52,000 --> 00:08:54,000
This is the word "je".

156
00:08:54,000 --> 00:08:55,000
It goes to the encoder.

157
00:08:55,000 --> 00:09:00,000
I mean all these three words, "je suis étudiant", and then this vector is basically created.

158
00:09:00,000 --> 00:09:04,000
Then from this we derive our key and value vectors.

159
00:09:04,000 --> 00:09:04,000
Right.

160
00:09:04,000 --> 00:09:06,000
And this is passed to our decoder.

161
00:09:06,000 --> 00:09:08,000
We need to pass it to every decoder.

162
00:09:08,000 --> 00:09:11,000
And initially we don't pass anything from our training output.

163
00:09:11,000 --> 00:09:12,000
Understand one thing okay.

164
00:09:12,000 --> 00:09:14,000
So this is really important to understand.

165
00:09:15,000 --> 00:09:18,000
See guys, initially we don't pass any word of the training output, okay.

166
00:09:18,000 --> 00:09:19,000
From the second instance.

167
00:09:19,000 --> 00:09:23,000
Once we generate this, then we will be passing the output words, okay.

168
00:09:23,000 --> 00:09:28,000
For the training phase, I've also seen some research papers where they say that the

169
00:09:28,000 --> 00:09:30,000
first word needs to be passed.

170
00:09:30,000 --> 00:09:32,000
And based on that, some prediction needs to be done.

171
00:09:32,000 --> 00:09:37,000
But again, in some research papers and in some of the blogs that I was reading,

172
00:09:37,000 --> 00:09:41,000
a start token goes in here. Okay, so let me do one thing.

173
00:09:41,000 --> 00:09:45,000
Let me open my Epic Pen so that I will be able to explain it to you here also.

174
00:09:45,000 --> 00:09:45,000
Okay.

175
00:09:45,000 --> 00:09:49,000
And you'll get a clear idea of how things basically happen.

176
00:09:49,000 --> 00:09:52,000
But this diagram will actually showcase each and everything.

177
00:09:52,000 --> 00:09:54,000
So here I have got my encoder.

178
00:09:54,000 --> 00:09:54,000
Okay.

179
00:09:54,000 --> 00:09:55,000
Now the next step.

180
00:09:55,000 --> 00:09:57,000
This is my key and value encoder.

181
00:09:57,000 --> 00:09:58,000
So here what will happen.

182
00:09:58,000 --> 00:10:02,000
First, the start token will go in, okay.

183
00:10:02,000 --> 00:10:03,000
Now this start token.

184
00:10:03,000 --> 00:10:07,000
When it goes in, the embedding, positional encoding, everything will be done again.

185
00:10:07,000 --> 00:10:09,000
It will go to the next step.

186
00:10:09,000 --> 00:10:12,000
Then in our multi-head attention right.

187
00:10:12,000 --> 00:10:14,000
These keys and values will be taken.

188
00:10:14,000 --> 00:10:19,000
And whatever query vector is basically generated from here that will be given to the multi-head attention

189
00:10:19,000 --> 00:10:20,000
along with this.

190
00:10:20,000 --> 00:10:20,000
Right.

191
00:10:20,000 --> 00:10:23,000
And similarly the next step will go on once "I" is computed.

192
00:10:23,000 --> 00:10:28,000
Now once "I" is computed, what will happen at time step 2 when we are doing the decoding?

193
00:10:28,000 --> 00:10:29,000
Time step 2.

194
00:10:29,000 --> 00:10:31,000
Something is going to happen in this way.

195
00:10:31,000 --> 00:10:32,000
See this.

196
00:10:33,000 --> 00:10:37,000
So for time step 2, see, let's start this entirely from the beginning, okay.

197
00:10:37,000 --> 00:10:40,000
So at time step 2, what will happen? "I" will come in; the previous output is "I".

198
00:10:40,000 --> 00:10:42,000
Then it is going to go in, right.

199
00:10:42,000 --> 00:10:44,000
Then all the steps inside the decoder will basically happen.

200
00:10:44,000 --> 00:10:49,000
These will be your encoder key and value vectors.

201
00:10:49,000 --> 00:10:54,000
But for every word that you are getting over here, positional encoding

202
00:10:54,000 --> 00:10:55,000
will happen.

203
00:10:55,000 --> 00:10:59,000
And for this particular word, your query vector will be created.

204
00:10:59,000 --> 00:10:59,000
Okay.

205
00:10:59,000 --> 00:11:01,000
Don't worry about linear and softmax.

206
00:11:01,000 --> 00:11:06,000
I'll discuss more about this in the next video, but I hope you are able to understand this right.

207
00:11:06,000 --> 00:11:09,000
In the first instance I will just pass a start token.

208
00:11:09,000 --> 00:11:10,000
It will take this.

209
00:11:10,000 --> 00:11:12,000
It will generate a query vector for that.

210
00:11:12,000 --> 00:11:13,000
It will take this.

211
00:11:13,000 --> 00:11:16,000
It will pass it to our multi-head attention.

212
00:11:16,000 --> 00:11:18,000
You know what the steps in multi-head attention are, right?

213
00:11:18,000 --> 00:11:20,000
But look at time step 2.

214
00:11:20,000 --> 00:11:25,000
Whatever output is generated over here, like "I", will get passed over here; the previous

215
00:11:25,000 --> 00:11:26,000
output will be this.

216
00:11:26,000 --> 00:11:27,000
Right.

217
00:11:27,000 --> 00:11:32,000
And then again, for this, a vector is created: the query vector, to be used with the key and value vectors.

218
00:11:32,000 --> 00:11:35,000
Then we go to the next step where our next word will get predicted.

219
00:11:35,000 --> 00:11:39,000
Similarly, we'll go and take this previous word and give it as input to the decoder.

220
00:11:39,000 --> 00:11:43,000
Then again the third word will be given to the decoder.

221
00:11:43,000 --> 00:11:45,000
And finally we'll get the output over here.

222
00:11:45,000 --> 00:11:50,000
So this sentence that is given over here will get converted in this specific

223
00:11:50,000 --> 00:11:50,000
way.

224
00:11:51,000 --> 00:11:51,000
Right.

225
00:11:51,000 --> 00:11:54,000
So I hope you got an idea of this.

226
00:11:54,000 --> 00:11:54,000
Right.

227
00:11:54,000 --> 00:11:58,000
So let me just go ahead and explain it once again, okay.

228
00:11:59,000 --> 00:12:02,000
So here let's say this is my encoder right.

229
00:12:02,000 --> 00:12:04,000
You know what is there in the encoder right.

230
00:12:05,000 --> 00:12:10,000
So in the encoder I pass all my words after doing the positional encoding.

231
00:12:10,000 --> 00:12:12,000
Let's say this is my output.

232
00:12:12,000 --> 00:12:17,000
From this we take all our key and value vectors.

233
00:12:17,000 --> 00:12:21,000
We give them to our decoder. In the decoder,

234
00:12:21,000 --> 00:12:23,000
for the first step in the training, we give a start token.

235
00:12:24,000 --> 00:12:26,000
Okay, then what will happen?

236
00:12:26,000 --> 00:12:29,000
It will go to that masked multi-head attention.

237
00:12:29,000 --> 00:12:33,000
Then it will go to the multi-head attention, which is also called encoder-decoder attention.

238
00:12:33,000 --> 00:12:34,000
Okay.

239
00:12:34,000 --> 00:12:37,000
K and V will be going from here, and the query will be going from here.

240
00:12:37,000 --> 00:12:37,000
Right.

241
00:12:37,000 --> 00:12:39,000
Then we go to the next step.

242
00:12:40,000 --> 00:12:44,000
So I've just drawn a rough diagram but I hope you're able to understand it over here.

243
00:12:44,000 --> 00:12:44,000
Right.

244
00:12:44,000 --> 00:12:47,000
So I will just close this again and let's do one thing.

245
00:12:47,000 --> 00:12:50,000
Let's quickly take this image okay.

246
00:12:50,000 --> 00:12:55,000
So I will copy this image and I will paste it over here okay.

247
00:12:56,000 --> 00:13:00,000
So please make sure that you have all these steps noted down, okay.

248
00:13:00,000 --> 00:13:05,000
So what do we do in the first decoding time step? We give all the words to the encoder.

249
00:13:05,000 --> 00:13:07,000
We generate all these things right?

250
00:13:07,000 --> 00:13:08,000
We don't give anything from here.

251
00:13:08,000 --> 00:13:11,000
Instead, what we'll do, we'll do some kind of padding.

252
00:13:11,000 --> 00:13:13,000
We'll create a start token initially, right.

253
00:13:13,000 --> 00:13:19,000
And then we'll go and generate this output "I". Then "I" will be sent in as the decoder

254
00:13:19,000 --> 00:13:20,000
input.

255
00:13:20,000 --> 00:13:22,000
Then we'll go ahead and compute "am".

256
00:13:22,000 --> 00:13:26,000
So this will be "am". Again, "am" will be sent in at my third time step, right.

257
00:13:26,000 --> 00:13:29,000
Similarly, all the further time steps will happen.

258
00:13:29,000 --> 00:13:30,000
Okay.

259
00:13:30,000 --> 00:13:35,000
So over here you can see at time step 2, whatever output is generated at time step 1,

260
00:13:35,000 --> 00:13:37,000
"I", is sent as the previous output.

261
00:13:37,000 --> 00:13:42,000
Okay, so I hope you got an idea with respect to the multi-head attention.

262
00:13:42,000 --> 00:13:44,000
And you know what happens inside multi-head attention.

263
00:13:44,000 --> 00:13:49,000
I don't need to explain that because we have already discussed it. This is the entire training mechanism.

264
00:13:49,000 --> 00:13:53,000
Try to understand the training mechanism in this specific way, okay.

265
00:13:53,000 --> 00:13:56,000
In the training mechanism, what is going to go in over here initially?

266
00:13:56,000 --> 00:14:02,000
Then once we generate this particular output, this same output will be going as an input to the decoder

267
00:14:02,000 --> 00:14:03,000
for the next time step.

268
00:14:03,000 --> 00:14:04,000
Right.

269
00:14:04,000 --> 00:14:06,000
So like that we go ahead and continue it.
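
The step-by-step loop described above (start token in, one word out, that word fed back in) can be sketched as follows. The function `toy_next_word` is only a stand-in for the trained encoder and decoder: it looks the next word up in a fixed translation so that the loop can run. The names `greedy_decode`, `<start>`, and `<end>` are assumptions for this sketch.

```python
def greedy_decode(next_word, source, max_len=10, start="<start>", end="<end>"):
    """Generate one word per time step, feeding previous outputs back in."""
    output = [start]                       # time step 1 sees only the start token
    for _ in range(max_len):
        word = next_word(source, output)   # decoder sees the source and all previous outputs
        if word == end:
            break
        output.append(word)                # this word becomes input at the next time step
    return output[1:]                      # drop the start token

def toy_next_word(source, previous_outputs):
    """Stand-in for the model: returns the next word of a fixed translation."""
    translation = ["I", "am", "a", "student", "<end>"]
    return translation[len(previous_outputs) - 1]

result = greedy_decode(toy_next_word, ["je", "suis", "étudiant"])
# result -> ['I', 'am', 'a', 'student']
```

In a real model the source sentence would go through the encoder once, and `next_word` would run the decoder with the encoder's keys and values at every step.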

270
00:14:06,000 --> 00:14:09,000
So I hope you got an idea with respect to encoder-decoder multi-head attention.

271
00:14:09,000 --> 00:14:15,000
Now it's time that we understand the most important thing: how we calculate our output probabilities.

272
00:14:15,000 --> 00:14:18,000
And that is what we are going to discuss in our next video.

273
00:14:18,000 --> 00:14:19,000
So I hope you liked this particular video.

274
00:14:19,000 --> 00:14:20,000
I will see you all in the next video.

275
00:14:20,000 --> 00:14:21,000
Thank you.