1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:04,000
So we are going to continue the discussion with respect to Transformers.

3
00:00:04,000 --> 00:00:08,000
And in this video we are going to discuss decoders.

4
00:00:08,000 --> 00:00:12,000
The plan of action was already given in the previous video.

5
00:00:12,000 --> 00:00:17,000
We are going to discuss these three main topics: masked multi-head self-attention, multi-head

6
00:00:17,000 --> 00:00:20,000
attention, which is also called encoder-decoder attention.

7
00:00:20,000 --> 00:00:23,000
Why we call it encoder-decoder attention, we'll try to understand.

8
00:00:23,000 --> 00:00:26,000
Then we'll also understand the feed-forward neural network.

9
00:00:26,000 --> 00:00:32,000
But now if you see from the architecture here, you have your encoder, here you have your decoder.

10
00:00:32,000 --> 00:00:36,000
The difference between the encoder and the decoder is that instead of just having multi-head attention, here

11
00:00:36,000 --> 00:00:38,000
you have masked multi-head attention.

12
00:00:38,000 --> 00:00:43,000
So in this video, we will try to understand the importance of this masked multi-head attention.

13
00:00:43,000 --> 00:00:44,000
Okay.

14
00:00:44,000 --> 00:00:50,000
And we'll also see how this entire transformer gets trained during training time.

15
00:00:50,000 --> 00:00:50,000
Okay.

16
00:00:50,000 --> 00:00:51,000
So let's go ahead.

17
00:00:51,000 --> 00:00:57,000
And as I said, if you really want to understand masked multi-head attention, we will

18
00:00:57,000 --> 00:00:59,000
be going completely from the basics.

19
00:00:59,000 --> 00:00:59,000
Okay.

20
00:00:59,000 --> 00:01:02,000
Like how we understood the encoders.

21
00:01:02,000 --> 00:01:03,000
Okay.

22
00:01:03,000 --> 00:01:06,000
So here you can see that I have a data set.

23
00:01:06,000 --> 00:01:09,000
So this is my data set.

24
00:01:09,000 --> 00:01:15,000
Now inside this data set, let's say I have a task where I need to convert my English text into

25
00:01:15,000 --> 00:01:16,000
Hindi text.

26
00:01:16,000 --> 00:01:17,000
Okay.

27
00:01:17,000 --> 00:01:21,000
So I want to do a language translation from English to Hindi.

28
00:01:21,000 --> 00:01:24,000
So let's say there are three words like this in English.

29
00:01:24,000 --> 00:01:27,000
And for that you have two words with respect to Hindi.

30
00:01:27,000 --> 00:01:30,000
Now let's just consider one input data okay?

31
00:01:30,000 --> 00:01:33,000
And with the help of one input data we'll try to understand.

32
00:01:33,000 --> 00:01:39,000
So this diagram that you see this side is my encoder okay.

33
00:01:39,000 --> 00:01:40,000
Just a single encoder.

34
00:01:40,000 --> 00:01:42,000
And this side is my decoder.

35
00:01:42,000 --> 00:01:45,000
Now, during the training time what will happen.

36
00:01:45,000 --> 00:01:53,000
These inputs, that is x1, x2, x3, will be going in from here, right?

37
00:01:53,000 --> 00:01:55,000
Then we have our vectors.

38
00:01:55,000 --> 00:01:59,000
These vectors will get combined with positional vectors, right?

39
00:01:59,000 --> 00:02:01,000
Sorry positional encodings.

40
00:02:02,000 --> 00:02:07,000
Once that is done, we get this entire input embedding, which you can see here, right?

41
00:02:07,000 --> 00:02:09,000
So let me just go and draw this.

42
00:02:09,000 --> 00:02:15,000
So once I give my inputs over here, you will be able to see that I get these specific

43
00:02:15,000 --> 00:02:15,000
vectors.

44
00:02:15,000 --> 00:02:18,000
And this is added along with positional encoding.

45
00:02:18,000 --> 00:02:23,000
So over here, with respect to the inputs, I will be sending x1, x2, x3.

46
00:02:23,000 --> 00:02:26,000
And these will be sent in parallel to the encoder, right?

47
00:02:27,000 --> 00:02:33,000
Then this entire operation takes place over here, where it calculates your

48
00:02:33,000 --> 00:02:39,000
Q, K, V, and it goes and finds all the attention heads, like z1 to z8, according to the

49
00:02:39,000 --> 00:02:39,000
research paper.

50
00:02:39,000 --> 00:02:41,000
Then it will do the normalization here.

51
00:02:41,000 --> 00:02:44,000
It will be passing this information over here, which is my residual.

52
00:02:44,000 --> 00:02:49,000
Then it will go to the feed forward neural network and whatever output I'm getting over here, this

53
00:02:49,000 --> 00:02:51,000
is ready to go to this multi-head attention.

54
00:02:51,000 --> 00:02:59,000
But at the same time, you will be able to see that my outputs over here, y1 and y2, will be going in, right?

55
00:02:59,000 --> 00:03:04,000
And when we say output shifted right, what does this basically mean?

56
00:03:04,000 --> 00:03:06,000
This is a very important word.

57
00:03:06,000 --> 00:03:09,000
Now let's say, guys, over here I have three words.

58
00:03:09,000 --> 00:03:11,000
Here we have only two words right.

59
00:03:11,000 --> 00:03:16,000
It is always a good idea that whenever this kind of difference is there, we really need to make sure

60
00:03:16,000 --> 00:03:19,000
that we have a fixed number of tokens, right?

61
00:03:19,000 --> 00:03:24,000
So initially, while we are sending the output over here, we are going to make sure that

62
00:03:24,000 --> 00:03:28,000
we also make this text padded.

63
00:03:28,000 --> 00:03:28,000
Right.

64
00:03:28,000 --> 00:03:31,000
So let's say now my padded output will become like y1, y2.

65
00:03:31,000 --> 00:03:33,000
And over here I will be using a zero padding.

66
00:03:33,000 --> 00:03:34,000
Okay.

67
00:03:34,000 --> 00:03:37,000
So let's say that I have used a zero padding.

68
00:03:37,000 --> 00:03:38,000
Why did I use it?

69
00:03:38,000 --> 00:03:42,000
Because I wanted to make all the sentences of the same size.

70
00:03:42,000 --> 00:03:46,000
In this example, I've just considered three words.

71
00:03:46,000 --> 00:03:50,000
The length is actually three, but again, it depends on the data set.

72
00:03:50,000 --> 00:03:52,000
how long you can actually make this particular sentence.

73
00:03:52,000 --> 00:03:57,000
So now when I say I have basically done the output shifted to right.

74
00:03:57,000 --> 00:04:00,000
That basically means I have also added zero padding over here.

75
00:04:00,000 --> 00:04:04,000
Now with this zero padding, again, at the last position you will have zero.

76
00:04:04,000 --> 00:04:06,000
I will be getting this embedding again.

77
00:04:06,000 --> 00:04:08,000
This will go to my positional encoding.

78
00:04:08,000 --> 00:04:11,000
It will create a positional encoding, and we'll do the summation.

79
00:04:11,000 --> 00:04:15,000
And now it will send the data to the masked multi-head attention.

80
00:04:15,000 --> 00:04:21,000
So if you see from this particular point, in the decoder also this input embedding and positional

81
00:04:21,000 --> 00:04:22,000
encoding will occur.

82
00:04:22,000 --> 00:04:26,000
But here you are going to specifically do a kind of padding.

83
00:04:26,000 --> 00:04:29,000
Let's say for an example I'm saying zero padding.

84
00:04:29,000 --> 00:04:34,000
Why is this done? To make your sequence lengths equal.
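A minimal sketch of the padding step just described, assuming a hypothetical padding id of 0 and a maximum length of three tokens. The `shift_right` helper and its start id are extra assumptions: in the original Transformer paper, "shifted right" refers to offsetting the decoder input by one position (typically by prepending a start token), which is a separate step from padding.
```python
PAD_ID = 0    # hypothetical padding id
START_ID = 1  # hypothetical start-of-sequence id
def pad_to_length(tokens, max_len, pad_id=PAD_ID):
    """Right-pad a token list with pad_id up to max_len (truncates if longer)."""
    return (tokens + [pad_id] * max_len)[:max_len]
def shift_right(tokens, start_id=START_ID):
    """Prepend a start token and drop the last token (offset by one position)."""
    return [start_id] + tokens[:-1]
target = [7, 8]                      # stands in for y1, y2
padded = pad_to_length(target, 3)    # y1, y2, pad
decoder_input = shift_right(padded)  # start, y1, y2
print(padded)
print(decoder_input)
```
Both helpers keep every sequence at the same fixed length, which is what lets a batch be processed in parallel.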

85
00:04:36,000 --> 00:04:37,000
Perfect.

86
00:04:37,000 --> 00:04:44,000
Then here you will be able to see we are going to the masked Multi-head attention.

87
00:04:44,000 --> 00:04:47,000
Now what happens inside this masked multi-head attention?

88
00:04:47,000 --> 00:04:48,000
Let's talk about it.

89
00:04:48,000 --> 00:04:50,000
So I'm going to create this specific diagram.

90
00:04:52,000 --> 00:04:53,000
This will be.

91
00:04:55,000 --> 00:04:57,000
Masked.

92
00:04:59,000 --> 00:05:02,000
Multi head.

93
00:05:04,000 --> 00:05:06,000
Attention okay.

94
00:05:06,000 --> 00:05:07,000
Perfect.

95
00:05:07,000 --> 00:05:10,000
So here you have masked multi head attention okay.

96
00:05:10,000 --> 00:05:15,000
Now, with respect to masked multi-head attention, what are the steps that are going to happen?

97
00:05:16,000 --> 00:05:16,000
That is nothing.

98
00:05:16,000 --> 00:05:21,000
But it is the combination of all these three steps.

99
00:05:23,000 --> 00:05:23,000
Okay.

100
00:05:23,000 --> 00:05:27,000
The linear projection for Q, K, V will get calculated.

101
00:05:27,000 --> 00:05:30,000
Then you have the scaled dot product attention.

102
00:05:30,000 --> 00:05:31,000
It will get calculated.

103
00:05:31,000 --> 00:05:33,000
We know this right in encoders also.

104
00:05:33,000 --> 00:05:39,000
Then we will be understanding about this very important topic that is mask application.

105
00:05:39,000 --> 00:05:42,000
And finally we'll go ahead and find out the masked multi-head attention.

106
00:05:42,000 --> 00:05:45,000
We'll understand why this mask application is required okay.

107
00:05:45,000 --> 00:05:49,000
Step by step we will go ahead and we'll try to do it okay.

108
00:05:49,000 --> 00:06:01,000
So first of all, let's get an idea about why we specifically do this masking, okay.

109
00:06:01,000 --> 00:06:03,000
Which is basically written as masked multi-head attention, okay.

110
00:06:03,000 --> 00:06:07,000
So before that let me go ahead and take a very simple example.

111
00:06:07,000 --> 00:06:11,000
I will show you with respect to all the steps that is basically happening in the decoder.

112
00:06:11,000 --> 00:06:13,000
This is also happening in parallel.

113
00:06:13,000 --> 00:06:16,000
Okay, first I will be getting the information over here.

114
00:06:16,000 --> 00:06:20,000
So till here I hope everybody knows, but we need to understand what is basically happening over here

115
00:06:20,000 --> 00:06:22,000
in parallel when we are training it.

116
00:06:22,000 --> 00:06:22,000
Okay.

117
00:06:22,000 --> 00:06:26,000
And here also all your information will be going all at once.

118
00:06:26,000 --> 00:06:32,000
But with respect to the output, I will be getting the output one by one, word

119
00:06:32,000 --> 00:06:32,000
by word.

120
00:06:32,000 --> 00:06:37,000
Okay, let's say if I want to get y one hat y two hat.

121
00:06:37,000 --> 00:06:39,000
So like this I will be getting the output one by one.

122
00:06:39,000 --> 00:06:46,000
Okay, so let's take one basic example and I'm just going to discuss about this okay.

123
00:06:46,000 --> 00:06:49,000
I'm not going to discuss about this because this is already done okay.

124
00:06:49,000 --> 00:06:55,000
So let's take a very good example now in the masked multi-head attention.

125
00:06:55,000 --> 00:06:57,000
But before that, first of all, what will happen?

126
00:06:57,000 --> 00:06:59,000
Let's say that I have a sequence.

127
00:06:59,000 --> 00:07:02,000
Let's say, uh, my input sequence is something like this.

128
00:07:02,000 --> 00:07:04,000
My output sequence is over here in my data set.

129
00:07:04,000 --> 00:07:10,000
My input sequence is something like, uh, four, five, six.

130
00:07:10,000 --> 00:07:10,000
Okay.

131
00:07:10,000 --> 00:07:13,000
And let's say my four, five, six.

132
00:07:13,000 --> 00:07:15,000
And I'll go ahead and write seven.

133
00:07:15,000 --> 00:07:15,000
Okay.

134
00:07:15,000 --> 00:07:17,000
This is just my input sequence.

135
00:07:17,000 --> 00:07:18,000
I'm numbering it.

136
00:07:18,000 --> 00:07:18,000
Okay.

137
00:07:18,000 --> 00:07:21,000
Then my output sequence is one, two, three.

138
00:07:21,000 --> 00:07:26,000
Okay, so this is my training data that I have in my input and output okay.

139
00:07:26,000 --> 00:07:27,000
Input and output.

140
00:07:27,000 --> 00:07:29,000
Now let's do one thing okay.

141
00:07:29,000 --> 00:07:31,000
Let's see what will happen step by step.

142
00:07:31,000 --> 00:07:35,000
So the first step is called input embedding.

143
00:07:36,000 --> 00:07:37,000
In my decoder.

144
00:07:37,000 --> 00:07:42,000
The first step is nothing but input embedding and positional encoding.

145
00:07:43,000 --> 00:07:49,000
Now, with respect to this positional encoding, what is going to happen?

146
00:07:50,000 --> 00:07:51,000
It is very simple right.

147
00:07:51,000 --> 00:07:59,000
So here let's say I know that my output is specified by these three characters.

148
00:07:59,000 --> 00:07:59,000
Right.

149
00:07:59,000 --> 00:08:00,000
Three characters.

150
00:08:00,000 --> 00:08:02,000
I'm going to probably make this into four.

151
00:08:02,000 --> 00:08:05,000
How? I will be adding padding over here.

152
00:08:05,000 --> 00:08:09,000
Let's say I'm considering the maximum length to be four across this particular data set, so

153
00:08:09,000 --> 00:08:11,000
I will just try to make it as four.

154
00:08:11,000 --> 00:08:12,000
I have done zero padding over here.

155
00:08:12,000 --> 00:08:16,000
Now with respect to every character, okay.

156
00:08:16,000 --> 00:08:25,000
Let's say that every number is represented by a four-dimensional vector.

157
00:08:25,000 --> 00:08:29,000
So now, how will your output embedding look?

158
00:08:29,000 --> 00:08:33,000
See, output embedding basically means this output embedding, shifted right.

159
00:08:33,000 --> 00:08:39,000
This output embedding, okay. So here I will just go ahead and write what my output embeddings will be.

160
00:08:42,000 --> 00:08:43,000
Very simple.

161
00:08:43,000 --> 00:08:47,000
Let's consider that one is represented by a vector.

162
00:08:47,000 --> 00:08:48,000
Let's say this is

163
00:08:48,000 --> 00:08:55,000
0.1, 0.2, 0.3, 0.4.

164
00:08:55,000 --> 00:08:59,000
Okay, then I have my next embeddings.

165
00:08:59,000 --> 00:09:05,000
0.5, 0.6, 0.7, 0.8.

166
00:09:05,000 --> 00:09:13,000
Then I have my next embedding 0.9, 1.0, 1.1, 1.2.

167
00:09:13,000 --> 00:09:14,000
Okay.

168
00:09:14,000 --> 00:09:17,000
Then let's say that final one is zero, right?

169
00:09:17,000 --> 00:09:20,000
So zero padding basically means it will be just zero only.

170
00:09:20,000 --> 00:09:22,000
So I'm just going to write it down over here.

171
00:09:22,000 --> 00:09:28,000
So if I just consider this my entire vector is represented like this okay.

172
00:09:28,000 --> 00:09:33,000
My entire vector or my entire sentence is represented like this.

173
00:09:33,000 --> 00:09:36,000
So this basically becomes my output embedding.

174
00:09:36,000 --> 00:09:36,000
Okay.

175
00:09:36,000 --> 00:09:37,000
Perfect.

176
00:09:38,000 --> 00:09:45,000
Now you know that uh, my output embedding should specifically be going along with positional encoding.

177
00:09:45,000 --> 00:09:45,000
Okay.

178
00:09:45,000 --> 00:09:46,000
Over here.

179
00:09:46,000 --> 00:09:49,000
So here you can probably see positional encoding is there.

180
00:09:49,000 --> 00:09:51,000
Let's consider that the positional encoding is zero.

181
00:09:51,000 --> 00:09:52,000
All the values are zero.

182
00:09:52,000 --> 00:09:55,000
So I will be getting the same output embedding okay.
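A small sketch of this step, using the toy numbers from the walkthrough. The variable names are illustrative assumptions; the all-zero positional encoding is the simplification made in the example, not what a real model uses.
```python
# Toy output embeddings: three tokens plus one zero-padded position.
output_embedding = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [0.0, 0.0, 0.0, 0.0],  # padding position
]
# Positional encoding assumed to be all zeros, as in the example.
positional_encoding = [[0.0] * 4 for _ in range(4)]
# Element-wise sum of embedding and positional encoding.
masked_attention_input = [
    [e + p for e, p in zip(emb_row, pos_row)]
    for emb_row, pos_row in zip(output_embedding, positional_encoding)
]
print(masked_attention_input == output_embedding)
```
Because every positional value is zero here, the sum leaves the embedding matrix unchanged, which is why the same 4 x 4 matrix is carried into the next step.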

183
00:09:55,000 --> 00:09:59,000
Now in the next step what we do we go to this masked multi-head attention.

184
00:09:59,000 --> 00:10:03,000
Now there are some steps that are involved in masked multi-head attention.

185
00:10:03,000 --> 00:10:08,000
So I will go through the steps one by one. In the first step, we go ahead and calculate

186
00:10:08,000 --> 00:10:11,000
the linear projection.

187
00:10:13,000 --> 00:10:14,000
Linear projection that is nothing.

188
00:10:14,000 --> 00:10:19,000
But for Q, K, V. This we also do in the encoder.

189
00:10:19,000 --> 00:10:21,000
And you know the importance of this right.

190
00:10:21,000 --> 00:10:25,000
Then, in short, we are just calculating Q, K, V, okay.

191
00:10:25,000 --> 00:10:32,000
So once we do that then the next step that we have is nothing but scaled dot product.

192
00:10:33,000 --> 00:10:35,000
Dot product.

193
00:10:35,000 --> 00:10:39,000
And with the help of this we calculate the attention right.

194
00:10:39,000 --> 00:10:40,000
Attention scores.

195
00:10:40,000 --> 00:10:40,000
Right.

196
00:10:40,000 --> 00:10:43,000
So that is what we specifically get with respect to the scores.

197
00:10:43,000 --> 00:10:49,000
And the third step that you will see is something called mask application.

198
00:10:50,000 --> 00:10:52,000
Mask application okay.

199
00:10:52,000 --> 00:10:55,000
So this is what we are basically going to discuss.

200
00:10:55,000 --> 00:10:59,000
And here we are going to discuss two important mask applications.

201
00:10:59,000 --> 00:11:07,000
One is the look-ahead mask. I will write it down.

202
00:11:07,000 --> 00:11:10,000
And one more: we did the zero padding.

203
00:11:10,000 --> 00:11:11,000
Right.

204
00:11:11,000 --> 00:11:15,000
So this is nothing, but it is the padding mask.

205
00:11:16,000 --> 00:11:19,000
So we will discuss both of these masks.

206
00:11:19,000 --> 00:11:21,000
What these mask techniques are, okay.
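The two masks are only named at this point, so the following is a hedged sketch of how they are commonly built, not the construction used in this walkthrough. The function names, the boolean convention (True means "may attend"), and the padding id of 0 are all assumptions for illustration.
```python
def look_ahead_mask(n):
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]
def padding_mask(tokens, pad_id=0):
    """True for real tokens, False for padded positions."""
    return [t != pad_id for t in tokens]
def combine(causal, keep):
    """A key is visible only if it is not in the future and not padding."""
    return [[c and k for c, k in zip(row, keep)] for row in causal]
tokens = [1, 2, 3, 0]  # three real tokens, one padded position
mask = combine(look_ahead_mask(len(tokens)), padding_mask(tokens))
for row in mask:
    print(row)
```
In a typical implementation, score entries where the mask is False are set to a very large negative number before the softmax so they receive near-zero attention weight.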

207
00:11:21,000 --> 00:11:25,000
And then whatever steps come after this.

208
00:11:25,000 --> 00:11:26,000
You know that right?

209
00:11:26,000 --> 00:11:32,000
We finally need to get our contextual embedding, so we will continue with the same thing and will

210
00:11:32,000 --> 00:11:33,000
try to understand it.

211
00:11:33,000 --> 00:11:33,000
Okay.

212
00:11:33,000 --> 00:11:35,000
So now let me go back over here.

213
00:11:35,000 --> 00:11:38,000
So this is my output embedding right now in the step two.

214
00:11:38,000 --> 00:11:39,000
So this is my step one.

215
00:11:39,000 --> 00:11:44,000
And remember the step one is we have added the positional encoding.

216
00:11:44,000 --> 00:11:48,000
My positional encoding is nothing, but everything is zeros okay.

217
00:11:48,000 --> 00:11:51,000
So it is a 4 x 4 matrix where everything is zeros, right?

218
00:11:51,000 --> 00:11:54,000
So based on this, I will be getting the same matrix over here.

219
00:11:54,000 --> 00:11:56,000
Now coming to the step two.

220
00:11:58,000 --> 00:12:10,000
Now in step two, what we basically do is the linear projection for Q, K and V.

221
00:12:10,000 --> 00:12:11,000
Okay.

222
00:12:11,000 --> 00:12:14,000
So let's go ahead and create what we are doing in this.

223
00:12:14,000 --> 00:12:18,000
We are creating query q.

224
00:12:22,000 --> 00:12:27,000
Key k and value v.

225
00:12:30,000 --> 00:12:32,000
These vectors we will try to create.

226
00:12:32,000 --> 00:12:36,000
How do you create them? By multiplying with w of q, w of k and w of v.

227
00:12:36,000 --> 00:12:43,000
Let's go ahead and initialize w of q is equal to w of k is equal to w of v.

228
00:12:43,000 --> 00:12:46,000
We will initialize these to the identity matrix.

229
00:12:46,000 --> 00:12:49,000
So if you are initializing this to identity matrix.

230
00:12:49,000 --> 00:12:52,000
Anything that you multiply with the identity matrix stays the same.

231
00:12:52,000 --> 00:12:53,000
So I will just go ahead and calculate q.

232
00:12:53,000 --> 00:12:56,000
This is nothing but output embedding.

233
00:12:56,000 --> 00:12:59,000
Whatever output embedding I will just initialize it.

234
00:12:59,000 --> 00:13:03,000
See, at the end of the day we will be learning these specific parameters.

235
00:13:03,000 --> 00:13:08,000
This will be multiplied by w of q which will be nothing, but it will be output embedding only.

236
00:13:08,000 --> 00:13:08,000
Right.

237
00:13:08,000 --> 00:13:12,000
Because this is an identity matrix, right.

238
00:13:12,000 --> 00:13:19,000
Similarly, if I go ahead and do the k this multiplied by w of k, which will be nothing but the same

239
00:13:19,000 --> 00:13:20,000
output embedding.

240
00:13:20,000 --> 00:13:27,000
Similarly, I have my V output embedding multiplied by w of v, which will be equal to the same thing.

241
00:13:27,000 --> 00:13:27,000
Right?

242
00:13:27,000 --> 00:13:31,000
So this is how we calculate the q k v okay.

243
00:13:31,000 --> 00:13:35,000
At the end of the day the q k v is nothing, but it will be the same values.

244
00:13:35,000 --> 00:13:38,000
So let me just quickly copy this.

245
00:13:38,000 --> 00:13:43,000
So this is my final Q, K, V for the first iteration, let's say.

246
00:13:44,000 --> 00:13:47,000
So my q k v is nothing, but it is equal to the same thing.
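A minimal sketch of the linear projection with identity weights, as set up above. The helper names `matmul` and `identity` are assumptions for illustration; in a trained model W_Q, W_K and W_V are learned and are not identity matrices.
```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
X = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [0.0, 0.0, 0.0, 0.0],  # padding position
]
W_Q = W_K = W_V = identity(4)  # toy initialization from the walkthrough
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
print(Q == X and K == X and V == X)
```
With identity weights the projection is a no-op, so Q, K and V all equal the output embedding, which keeps the later arithmetic easy to follow by hand.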

247
00:13:48,000 --> 00:13:48,000
Right.

248
00:13:48,000 --> 00:13:50,000
So till here I hope everybody is clear.

249
00:13:50,000 --> 00:13:52,000
So this is how step by step it happens.

250
00:13:52,000 --> 00:13:58,000
And when I go to that multi-head attention, which includes self-attention, this is the

251
00:13:58,000 --> 00:14:00,000
first thing that is going to happen right.

252
00:14:00,000 --> 00:14:01,000
I will be able to calculate it.

253
00:14:01,000 --> 00:14:09,000
Now in the third step over here, we will go ahead and calculate our scaled dot

254
00:14:10,000 --> 00:14:11,000
product.

255
00:14:13,000 --> 00:14:15,000
Scaled dot product.

256
00:14:18,000 --> 00:14:19,000
Scaled.

257
00:14:24,000 --> 00:14:25,000
Dot.

258
00:14:28,000 --> 00:14:29,000
Product.

259
00:14:31,000 --> 00:14:32,000
Attention.

260
00:14:35,000 --> 00:14:36,000
Calculation.

261
00:14:36,000 --> 00:14:36,000
Okay.

262
00:14:36,000 --> 00:14:39,000
So we will go ahead and compute this okay.

263
00:14:39,000 --> 00:14:43,000
Now in order to compute the attention scores, it is very simple.

264
00:14:43,000 --> 00:14:44,000
How do we compute the score.

265
00:14:44,000 --> 00:14:51,000
I will just write the short formula: Q multiplied by K transpose, right?

266
00:14:51,000 --> 00:14:54,000
And then we divide by square root of d of k right.

267
00:14:54,000 --> 00:14:59,000
In this case d of k is nothing but four right dimension of k.

268
00:14:59,000 --> 00:15:04,000
That basically means it is Q dot K transpose, divided by two.

269
00:15:04,000 --> 00:15:05,000
Right.

270
00:15:05,000 --> 00:15:07,000
We specifically do this okay.

271
00:15:07,000 --> 00:15:14,000
Now let's go ahead and do the further calculation for this scaled dot product attention calculation.

272
00:15:14,000 --> 00:15:18,000
And here we will be taking the same data of q k and v okay.

273
00:15:18,000 --> 00:15:23,000
So now what we are going to do is go ahead and calculate our scores.

274
00:15:24,000 --> 00:15:30,000
Now in order to calculate our scores, understand what I mean when I say Q multiplied by K transpose, right?

275
00:15:30,000 --> 00:15:36,000
What exactly is it? It is the same matrix multiplied by the transpose of K.

276
00:15:36,000 --> 00:15:36,000
Right.

277
00:15:36,000 --> 00:15:40,000
So see, if this is Q, okay.

278
00:15:40,000 --> 00:15:43,000
And let's say this is k right.

279
00:15:43,000 --> 00:15:52,000
So first of all, I'm just going to multiply this with this. 0.1, sorry.

280
00:15:54,000 --> 00:15:57,000
So it should be something like this.

281
00:15:57,000 --> 00:16:03,000
0.1, 0.2, 0.3, 0.4.

282
00:16:03,000 --> 00:16:03,000
Right.

283
00:16:04,000 --> 00:16:06,000
So this will be my k transpose.

284
00:16:06,000 --> 00:16:08,000
So that is the reason what you'll do.

285
00:16:08,000 --> 00:16:13,000
0.1 multiplied by 0.1, plus 0.2 multiplied by 0.2, plus 0.3

286
00:16:13,000 --> 00:16:15,000
multiplied by 0.3, plus 0.4

287
00:16:15,000 --> 00:16:16,000
multiplied by 0.4.

288
00:16:16,000 --> 00:16:19,000
Then the next one will be 0.5.

289
00:16:19,000 --> 00:16:22,000
0.5 multiplied by 0.1, plus 0.6

290
00:16:22,000 --> 00:16:24,000
multiplied by 0.2,

291
00:16:24,000 --> 00:16:26,000
plus 0.7

292
00:16:26,000 --> 00:16:27,000
multiplied by 0.3.

293
00:16:27,000 --> 00:16:32,000
Then you have 0.8 multiplied by 0.4.

294
00:16:32,000 --> 00:16:35,000
Then again you will go ahead and calculate the third one.

295
00:16:35,000 --> 00:16:36,000
This will get multiplied by this.

296
00:16:36,000 --> 00:16:38,000
This will get multiplied by this.

297
00:16:38,000 --> 00:16:38,000
Okay.

298
00:16:38,000 --> 00:16:43,000
Then for the second time, what you need to do is take this 0.5, multiplied by

299
00:16:43,000 --> 00:16:50,000
this: 0.5 multiplied by 0.1, 0.6 multiplied by 0.2, 0.7 multiplied by 0.3, 0.8

300
00:16:50,000 --> 00:16:52,000
multiplied by 0.4.

301
00:16:52,000 --> 00:16:53,000
Then what will you do?

302
00:16:53,000 --> 00:16:58,000
Then again, you will take this entire number and then multiply with the second one.

303
00:16:58,000 --> 00:16:58,000
right?

304
00:16:58,000 --> 00:17:03,000
So here you'll be able to see I will be creating my separate key pairs.

305
00:17:03,000 --> 00:17:03,000
Right.

306
00:17:03,000 --> 00:17:09,000
So this will be another key transpose, where I'll say 0.5, 0.6, 0.7, 0.8.

307
00:17:09,000 --> 00:17:10,000
Right.

308
00:17:10,000 --> 00:17:12,000
So this will be my another key transpose.

309
00:17:12,000 --> 00:17:14,000
Then another key transpose will be this.

310
00:17:14,000 --> 00:17:16,000
Then another key transpose will be this.

311
00:17:16,000 --> 00:17:23,000
So finally, when you do the specific calculation and just to give you an idea how the calculation is

312
00:17:23,000 --> 00:17:27,000
done, I have written down each and every step over here, and I have already done the calculation

313
00:17:27,000 --> 00:17:27,000
over here.

314
00:17:27,000 --> 00:17:30,000
My final calculation will be nothing but.

315
00:17:30,000 --> 00:17:32,000
But this is simple, right?

316
00:17:32,000 --> 00:17:36,000
See, my Q, K, V are all these numbers.

317
00:17:36,000 --> 00:17:39,000
When I say Q multiplied by K transpose, here is what I have to take.

318
00:17:39,000 --> 00:17:40,000
This is my one query, right?

319
00:17:40,000 --> 00:17:41,000
This is my another query.

320
00:17:41,000 --> 00:17:42,000
This is my another query.

321
00:17:42,000 --> 00:17:43,000
This is my another query.

322
00:17:43,000 --> 00:17:48,000
And similarly if I probably consider this as my keys, this is my another key.

323
00:17:48,000 --> 00:17:49,000
Another key, another key, another key.

324
00:17:49,000 --> 00:17:54,000
So for every pair, I have to do the matrix multiplication, or dot operation.

325
00:17:54,000 --> 00:17:55,000
Right.

326
00:17:55,000 --> 00:17:58,000
So finally you'll be able to see that I have already done the calculation.

327
00:17:58,000 --> 00:18:00,000
This will basically be my score.

328
00:18:00,000 --> 00:18:02,000
The score will be nothing but 0.3,

329
00:18:02,000 --> 00:18:08,000
and again it will be a 4 x 4 matrix: 0.7, 1.1, 0.0.

330
00:18:08,000 --> 00:18:11,000
And then we will further do the calculation

331
00:18:11,000 --> 00:18:16,000
0.7, 1.74, 2.78, 0.0.

332
00:18:16,000 --> 00:18:17,000
Right.

333
00:18:17,000 --> 00:18:26,000
And then I will be having 1.1, 2.78, 4.46, 0.0 here.

334
00:18:26,000 --> 00:18:28,000
Here also, again, you'll be getting zero.

335
00:18:29,000 --> 00:18:35,000
This entire row has to be zero, because you can see that the last one is just zero, right?

336
00:18:35,000 --> 00:18:40,000
Again, in order to show you how the calculation is specifically happening.

337
00:18:40,000 --> 00:18:43,000
So I will just make a matrix over here, let's say.

338
00:18:43,000 --> 00:18:46,000
So this is my first matrix.

339
00:18:46,000 --> 00:18:56,000
My first entry will be 0.1 multiplied by 0.1, plus

340
00:18:58,000 --> 00:19:00,000
(it is a dot operation, and you should understand this) 0.2

341
00:19:00,000 --> 00:19:04,000
multiplied by 0.2, plus 0.3

342
00:19:04,000 --> 00:19:05,000
multiplied by 0.3,

343
00:19:05,000 --> 00:19:07,000
plus 0.4

344
00:19:07,000 --> 00:19:08,000
multiplied by 0.4.

345
00:19:08,000 --> 00:19:09,000
Okay.

346
00:19:09,000 --> 00:19:12,000
Then coming to the next one.

347
00:19:12,000 --> 00:19:12,000
Right.

348
00:19:12,000 --> 00:19:16,000
Again, what I will do is take this 0.1 again, okay.

349
00:19:16,000 --> 00:19:17,000
Point one.

350
00:19:17,000 --> 00:19:20,000
And here I will take the same query, right?

351
00:19:20,000 --> 00:19:21,000
Same query.

352
00:19:21,000 --> 00:19:23,000
And I will multiply with this.

353
00:19:23,000 --> 00:19:28,000
So here I will go ahead and write 0.1 multiplied by 0.5,

354
00:19:28,000 --> 00:19:39,000
plus 0.2 multiplied by 0.6, plus 0.3 multiplied by 0.7, plus 0.4 multiplied

355
00:19:39,000 --> 00:19:40,000
by 0.8.

356
00:19:40,000 --> 00:19:40,000
Right.

357
00:19:40,000 --> 00:19:41,000
So this is done.

358
00:19:42,000 --> 00:19:44,000
Then I will go to my next one.

359
00:19:44,000 --> 00:19:45,000
Next one is nothing.

360
00:19:45,000 --> 00:19:52,000
But again, if I do the transpose of the keys, because for one token I have to do this

361
00:19:52,000 --> 00:19:56,000
multiplication, or dot operation, with all the other tokens, just to understand how much

362
00:19:56,000 --> 00:19:58,000
weight each one is adding, then this

363
00:19:58,000 --> 00:20:04,000
third key will be 0.9, 1.0, 1.1, 1.2.

364
00:20:04,000 --> 00:20:08,000
So my third one will be something like 0.1 multiplied by 0.9.

365
00:20:08,000 --> 00:20:13,000
Then I have 0.2 multiplied by 1.0.

366
00:20:13,000 --> 00:20:21,000
Then I have 0.3 multiplied by 1.1, plus 0.4 multiplied by 1.2.

367
00:20:21,000 --> 00:20:25,000
Similarly, the last one will be nothing, but this will become transpose, right?

368
00:20:25,000 --> 00:20:27,000
Then it will become 0.1 multiplied by zero.

369
00:20:28,000 --> 00:20:30,000
Then you have 0.2 multiplied by zero.

370
00:20:30,000 --> 00:20:36,000
Then you have 0.3 multiplied by zero, plus 0.4 multiplied by zero.

371
00:20:37,000 --> 00:20:38,000
So this is done right.

372
00:20:39,000 --> 00:20:45,000
Overall, this is just for one query with respect to all the keys.

373
00:20:45,000 --> 00:20:45,000
Right.

374
00:20:45,000 --> 00:20:50,000
The last key will be nothing but 0000.

375
00:20:50,000 --> 00:20:53,000
Then what we do we go with the next query.

376
00:20:53,000 --> 00:20:55,000
This will try to multiply with the next thing.

377
00:20:55,000 --> 00:20:57,000
Then I will be getting my next matrix.

378
00:20:57,000 --> 00:21:03,000
Finally when I do this summation I'm going to get this values right point three.

379
00:21:03,000 --> 00:21:05,000
So this entire multiplication if you do it is point three.

380
00:21:05,000 --> 00:21:10,000
If I do the entire multiplication and addition for the others, this is 0.7, then 1.1, then 0.
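A minimal sketch of the dot products just described, for one query against all four keys. The first key is not restated in this part of the walkthrough, so [0.1, 0.2, 0.3, 0.4] is an assumption chosen to match the 0.3 that is read out; the names query, keys and dot are illustrative only.
```python
# One query against every key (pure Python, no libraries).
query = [0.1, 0.2, 0.3, 0.4]
keys = [
    [0.1, 0.2, 0.3, 0.4],  # assumed first key, not restated in this part
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [0.0, 0.0, 0.0, 0.0],  # key of the zero padding token
]
def dot(a, b):
    # multiply element by element, then add everything up
    return sum(x * y for x, y in zip(a, b))
# one row of the score matrix, rounded to hide floating point noise
scores_row = [round(dot(query, k), 2) for k in keys]
print(scores_row)
```
Repeating the same step for the other three queries fills in the remaining rows of the four cross four score matrix.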

381
00:21:10,000 --> 00:21:16,000
Then again the next one which will be in my next, uh, which will be my next one, which will be my

382
00:21:16,000 --> 00:21:16,000
next one.

383
00:21:16,000 --> 00:21:17,000
Like this.

384
00:21:17,000 --> 00:21:19,000
This will be my next one.

385
00:21:19,000 --> 00:21:23,000
And overall you'll be able to see this is also four cross four after all the multiplication.

386
00:21:23,000 --> 00:21:25,000
And finally you'll be able to get the attention scores.

387
00:21:25,000 --> 00:21:28,000
Okay so this is my attention score.

388
00:21:28,000 --> 00:21:32,000
Uh, where we have specifically done scaled dot product attention calculation.

389
00:21:32,000 --> 00:21:34,000
This is the first step.

390
00:21:34,000 --> 00:21:40,000
See, inside this, right after the linear projection of Q, K, V, we basically do this. Now comes mask application.

391
00:21:40,000 --> 00:21:43,000
Now what exactly is mask application we are going to discuss about.

392
00:21:43,000 --> 00:21:46,000
So let's go ahead and discuss about it.

393
00:21:46,000 --> 00:21:50,000
So here I'm just going to go ahead and write mask application.

394
00:21:51,000 --> 00:21:55,000
Now first of all you need to understand what is the purpose of mask application, okay.

395
00:21:55,000 --> 00:21:58,000
And this is probably the most important topic.

396
00:21:58,000 --> 00:22:01,000
And I'll try to explain you this with uh, multiple examples.

397
00:22:01,000 --> 00:22:02,000
Okay.

398
00:22:03,000 --> 00:22:06,000
Uh, here, first of all I will just go ahead and write.

399
00:22:06,000 --> 00:22:11,000
Mask application helps in

400
00:22:11,000 --> 00:22:15,000
managing the structure.

401
00:22:16,000 --> 00:22:34,000
Okay, the structure of the sequence being processed, okay, and ensures

402
00:22:36,000 --> 00:22:52,000
the model behaves correctly during training and inferencing.

403
00:22:53,000 --> 00:22:59,000
Okay, now what are the main reasons of masking?

404
00:22:59,000 --> 00:22:59,000
Okay.

405
00:22:59,000 --> 00:23:03,000
So what exactly masking is, we'll get to know about it.

406
00:23:03,000 --> 00:23:08,000
But you can just understand that it helps to manage the structure of the sequence being processed and

407
00:23:08,000 --> 00:23:09,000
ensures the model behaves correctly.

408
00:23:09,000 --> 00:23:15,000
Okay, now let's talk about the reasons why do we specifically do masking?

409
00:23:15,000 --> 00:23:20,000
The first reason is handling

410
00:23:21,000 --> 00:23:33,000
variable length sequences with padding mask.

411
00:23:35,000 --> 00:23:41,000
Okay, now see guys, as you all know during the output, right?

412
00:23:41,000 --> 00:23:45,000
I told you right when I'm actually calculating the output embedding, I'm also making sure that I do

413
00:23:45,000 --> 00:23:48,000
zero padding so that both the sequence lengths are equal.

414
00:23:48,000 --> 00:23:49,000
Right.

415
00:23:49,000 --> 00:23:55,000
So one of the reason why we specifically do mask application is that because we need to handle variable

416
00:23:55,000 --> 00:23:57,000
length sequences with a padding mask.

417
00:23:57,000 --> 00:23:59,000
What is the purpose behind this.

418
00:24:00,000 --> 00:24:00,000
Okay.

419
00:24:00,000 --> 00:24:01,000
You'll understand this.

420
00:24:01,000 --> 00:24:02,000
What is the purpose.

421
00:24:02,000 --> 00:24:08,000
The first purpose is that to handle sequences.

422
00:24:10,000 --> 00:24:12,000
Sequences of different length in batch.

423
00:24:14,000 --> 00:24:18,000
Different length in batch.

424
00:24:18,000 --> 00:24:20,000
Please just wait for some time.

425
00:24:20,000 --> 00:24:25,000
I will explain each and every thing what I'm actually saying to ensure.

426
00:24:25,000 --> 00:24:31,000
The second point is to ensure that padding tokens.

427
00:24:35,000 --> 00:24:36,000
Which are.

428
00:24:37,000 --> 00:24:39,000
added.

429
00:24:41,000 --> 00:24:43,000
Which are added to make

430
00:24:45,000 --> 00:24:57,000
sequences of uniform length do not affect the model prediction.

431
00:25:00,000 --> 00:25:03,000
Do not affect the model prediction.

432
00:25:03,000 --> 00:25:05,000
So guys now let's talk about the second point.

433
00:25:05,000 --> 00:25:11,000
To ensure that padding tokens, which are added to make sequences of uniform length, do not affect

434
00:25:11,000 --> 00:25:12,000
the model prediction.

435
00:25:12,000 --> 00:25:16,000
Now let me just explain you this particular point with an example okay.

436
00:25:17,000 --> 00:25:21,000
Let's say in my input data I have a sequence.

437
00:25:21,000 --> 00:25:21,000
Okay.

438
00:25:21,000 --> 00:25:25,000
Let's say the sequence is something like one, two, three.

439
00:25:25,000 --> 00:25:29,000
And in my output data, let's say this is my sequence one.

440
00:25:29,000 --> 00:25:31,000
In my output I have my sequence two.

441
00:25:32,000 --> 00:25:34,000
In my sequence two I have something like four five.

442
00:25:34,000 --> 00:25:37,000
And I have purposely done the padding over here.

443
00:25:37,000 --> 00:25:37,000
Okay.

444
00:25:37,000 --> 00:25:41,000
So here zero is nothing, but it is the padding token.

445
00:25:41,000 --> 00:25:42,000
Okay.

446
00:25:42,000 --> 00:25:44,000
We initially need to do the padding.

447
00:25:44,000 --> 00:25:44,000
Okay.

448
00:25:45,000 --> 00:25:45,000
Perfect.

449
00:25:46,000 --> 00:25:49,000
Now without masking see if I.

450
00:25:49,000 --> 00:25:52,000
If I just give this data like this.

451
00:25:52,000 --> 00:26:00,000
Okay, then in my data, since this zero padding is there, this will influence.

452
00:26:00,000 --> 00:26:01,000
Please note this point.

453
00:26:02,000 --> 00:26:09,000
This will influence the attention mechanism okay.

454
00:26:09,000 --> 00:26:11,000
The attention mechanism.

455
00:26:11,000 --> 00:26:15,000
Now you may be thinking, Chris, how? Right now you are just thinking, okay, only one zero is actually

456
00:26:15,000 --> 00:26:16,000
added.

457
00:26:16,000 --> 00:26:19,000
But just imagine in a real world problem statement when we do padding, right?

458
00:26:20,000 --> 00:26:25,000
Let's say the maximum length of the sentence is 100, okay.

459
00:26:25,000 --> 00:26:26,000
Or 100 or 200.

460
00:26:26,000 --> 00:26:26,000
Right.

461
00:26:26,000 --> 00:26:32,000
Let's say in one of the outputs I just have two words, y1 and y2, okay.

462
00:26:32,000 --> 00:26:34,000
In one of the output.

463
00:26:34,000 --> 00:26:34,000
Right.

464
00:26:34,000 --> 00:26:36,000
I have just two words y one and y two.

465
00:26:36,000 --> 00:26:38,000
So for this what we will do, we will do the padding.

466
00:26:38,000 --> 00:26:43,000
And for this, 98 zeros will get added. When 98 zeros are getting added over here,

467
00:26:43,000 --> 00:26:49,000
And when we probably send this zero padding to our attention self-attention mechanism.

468
00:26:49,000 --> 00:26:55,000
Don't you think because of this zero it will influence the entire attention mechanism and this will

469
00:26:55,000 --> 00:27:10,000
in turn lead to incorrect or biased predictions.

470
00:27:12,000 --> 00:27:14,000
Okay, this is the problem.

471
00:27:15,000 --> 00:27:17,000
So that is the reason.

472
00:27:17,000 --> 00:27:20,000
Over here we will do some kind of masking.

473
00:27:20,000 --> 00:27:28,000
And this masking will be making sure that these zero padding tokens are ignored.

474
00:27:28,000 --> 00:27:34,000
So for this, a kind of masking that we are going to do is something called as padding mask.

475
00:27:35,000 --> 00:27:35,000
Okay.

476
00:27:35,000 --> 00:27:39,000
So here we are specifically going to say padding mask.

477
00:27:39,000 --> 00:27:45,000
This padding mask is basically going to make sure that the tokens are ignored.

478
00:27:46,000 --> 00:27:49,000
The tokens are ignored okay.

479
00:27:49,000 --> 00:27:53,000
Okay, so in short, if you really want to understand about the masking over here.

480
00:27:54,000 --> 00:28:00,000
Masking is done in two different ways.

481
00:28:00,000 --> 00:28:10,000
One is something called as padding mask and the other one is something called as look ahead mask okay.

482
00:28:12,000 --> 00:28:15,000
And in transformer we take the combination of both of them.

483
00:28:15,000 --> 00:28:18,000
But we'll try to understand what is the importance of them okay.

484
00:28:18,000 --> 00:28:22,000
So let's take the sequences I have: 1, 2, 3 and 4, 5, 0.

485
00:28:22,000 --> 00:28:26,000
So this is my output in my training data set.

486
00:28:26,000 --> 00:28:27,000
This is my input in my training data.

487
00:28:27,000 --> 00:28:29,000
This will be sent to encoder.

488
00:28:29,000 --> 00:28:30,000
This will be sent to decoder.

489
00:28:30,000 --> 00:28:35,000
But we are still focusing on this right now for this since I have a zero right.

490
00:28:35,000 --> 00:28:37,000
There is a padding token called as zero right.

491
00:28:37,000 --> 00:28:40,000
So for this how do we apply a, uh, padding mask.

492
00:28:40,000 --> 00:28:43,000
So what we do over here is very simple.

493
00:28:43,000 --> 00:28:43,000
Okay.

494
00:28:44,000 --> 00:28:51,000
First of all, wherever I find a sequence of words, if it is, if it is not having any padding.

495
00:28:51,000 --> 00:28:51,000
Right.

496
00:28:51,000 --> 00:28:59,000
So what the padding mask does is this: since we need to consider all the three tokens,

497
00:28:59,000 --> 00:29:02,000
So padding mask for the sequence one will be one, one, one.

498
00:29:03,000 --> 00:29:05,000
I'm just taking as an example okay.

499
00:29:05,000 --> 00:29:10,000
So over here 123 because you don't have any zero padding over here.

500
00:29:10,000 --> 00:29:12,000
So I will be having the padding mask of 1, 1, 1.

501
00:29:12,000 --> 00:29:13,000
What about sequence two.

502
00:29:13,000 --> 00:29:19,000
So sequence two padding mask will be 1, 1, 0. Why 1, 1, 0?

503
00:29:19,000 --> 00:29:22,000
I will talk about it. These tokens are available, and here

504
00:29:22,000 --> 00:29:23,000
I just have a zero padding.

505
00:29:23,000 --> 00:29:26,000
So I'm just going to mention that as 1, 1, 0, okay.

506
00:29:26,000 --> 00:29:30,000
So this is basically my padding mask for my sequence two okay.
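The padding masks for the two example sequences can be sketched as below. This assumes, as in the example, that token id 0 is the padding token; PAD and padding_mask are illustrative names, not from any library.
```python
PAD = 0  # padding token id, as in the example sequences
def padding_mask(tokens):
    # 1 marks a real token, 0 marks a padding token
    return [0 if t == PAD else 1 for t in tokens]
seq1_mask = padding_mask([1, 2, 3])  # no padding in sequence one
seq2_mask = padding_mask([4, 5, 0])  # last position is padding
print(seq1_mask, seq2_mask)
```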

507
00:29:31,000 --> 00:29:33,000
But we will just not stop here okay.

508
00:29:33,000 --> 00:29:34,000
We will not stop here.

509
00:29:34,000 --> 00:29:35,000
There will.

510
00:29:35,000 --> 00:29:39,000
There'll also be another type of masking, which is called look-ahead masking.

511
00:29:39,000 --> 00:29:41,000
So let's talk about this lookahead masking.

512
00:29:41,000 --> 00:29:44,000
Then again I'll go back to this padding mask, okay.

513
00:29:44,000 --> 00:29:54,000
So the second type of masking which we specifically use is something called as look ahead masking.

514
00:29:54,000 --> 00:30:01,000
You'll understand why we are using 1, 1, 0 for the zero padding token.

515
00:30:01,000 --> 00:30:04,000
Uh, why we are creating a padding mask of this zero.

516
00:30:04,000 --> 00:30:05,000
Okay.

517
00:30:05,000 --> 00:30:08,000
Because this will probably get added and multiplied.

518
00:30:08,000 --> 00:30:10,000
I'll show you okay how things will happen.

519
00:30:10,000 --> 00:30:20,000
So look-ahead masking is used to maintain the autoregressive property.

520
00:30:23,000 --> 00:30:27,000
This is used to maintain autoregressive property.

521
00:30:27,000 --> 00:30:27,000
Okay.

522
00:30:28,000 --> 00:30:31,000
See one thing that you know right.

523
00:30:31,000 --> 00:30:34,000
In decoder whatever decoder I have okay.

524
00:30:34,000 --> 00:30:36,000
Let's say I'll draw it over here.

525
00:30:36,000 --> 00:30:38,000
This is my decoder Okay.

526
00:30:39,000 --> 00:30:41,000
The decoder's work is to.

527
00:30:41,000 --> 00:30:43,000
Let's say if this is my encoder.

528
00:30:43,000 --> 00:30:45,000
Decoder work is basically to.

529
00:30:45,000 --> 00:30:50,000
Whenever we give any kind of input to our encoder, this information will pass to the decoder.

530
00:30:50,000 --> 00:30:52,000
And finally I should be getting my output one by one.

531
00:30:52,000 --> 00:30:53,000
Right.

532
00:30:53,000 --> 00:30:59,000
One, one, one, like this, one by one. Okay, so let's say if I say: hey, how are you?

533
00:31:00,000 --> 00:31:00,000
Okay.

534
00:31:00,000 --> 00:31:07,000
So here, in Hindi, you will probably say "kaise ho aap", "kaise ho aap", something like this.

535
00:31:07,000 --> 00:31:13,000
So one word at a time should be coming out of this particular decoder, right? Now,

536
00:31:13,000 --> 00:31:16,000
You know that in order to find out the output.

537
00:31:16,000 --> 00:31:20,000
Okay, let's say my three words output is over here.

538
00:31:21,000 --> 00:31:27,000
If I want to get this word or if I want to get probably if I want to predict this word, I need to have

539
00:31:27,000 --> 00:31:28,000
the information of this specific word.

540
00:31:28,000 --> 00:31:29,000
Right.

541
00:31:29,000 --> 00:31:31,000
I need to have some idea about this specific word.

542
00:31:31,000 --> 00:31:32,000
Right.

543
00:31:32,000 --> 00:31:36,000
I don't require this further word information and that is how decoder will probably work, right?

544
00:31:36,000 --> 00:31:41,000
Similarly, if I want to probably get this specific word, I need to have the context of this particular

545
00:31:41,000 --> 00:31:42,000
word itself.

546
00:31:42,000 --> 00:31:42,000
Right.

547
00:31:42,000 --> 00:31:46,000
So a very important point that I'm actually going to write over here.

548
00:31:47,000 --> 00:31:50,000
Why look-ahead masking is specifically used,

549
00:31:50,000 --> 00:32:02,000
and why do we say maintaining the autoregressive property: to ensure that each position

550
00:32:04,000 --> 00:32:09,000
In the decoder.

551
00:32:10,000 --> 00:32:17,000
In the decoder output sequence.

552
00:32:19,000 --> 00:32:35,000
can only attend to itself and the previous positions, but not to future positions; no

553
00:32:35,000 --> 00:32:36,000
future position.

554
00:32:37,000 --> 00:32:39,000
And that is what we want, right?

555
00:32:39,000 --> 00:32:41,000
We don't want the future position.

556
00:32:41,000 --> 00:32:43,000
We don't want the future information.

557
00:32:43,000 --> 00:32:45,000
Instead, from the previous information I should be able to predict.

558
00:32:45,000 --> 00:32:53,000
And that is why we basically use the look-ahead mask, and why this is necessary for different sequence

559
00:32:53,000 --> 00:32:58,000
to sequence task like language modeling.

560
00:33:00,000 --> 00:33:05,000
Or you can also say language translation right here.

561
00:33:05,000 --> 00:33:06,000
Sequence is very much important.

562
00:33:07,000 --> 00:33:07,000
Okay.

563
00:33:07,000 --> 00:33:12,000
And to probably do the prediction I don't require the information of the future token in my decoder.

564
00:33:12,000 --> 00:33:13,000
Right.

565
00:33:13,000 --> 00:33:16,000
And that is how we predict the current token okay.

566
00:33:16,000 --> 00:33:19,000
So I hope you have got an idea about lookahead mask.

567
00:33:19,000 --> 00:33:26,000
Now let's uh, see some example and we'll try to find out, uh, how masking is basically done.

568
00:33:26,000 --> 00:33:26,000
Uh, okay.

569
00:33:26,000 --> 00:33:29,000
And then we will again come back to this specific example.

570
00:33:29,000 --> 00:33:31,000
But let's see one simple example.

571
00:33:31,000 --> 00:33:31,000
Okay.

572
00:33:32,000 --> 00:33:34,000
Uh, with respect to uh, masking okay.

573
00:33:34,000 --> 00:33:38,000
So here you will be able to see I will take an example okay.

574
00:33:39,000 --> 00:33:41,000
So let's take an example.

575
00:33:41,000 --> 00:33:49,000
So example is that uh let's say I have a sequence like four comma five comma zero okay.

576
00:33:49,000 --> 00:33:51,000
Zero basically means zero padding.

577
00:33:51,000 --> 00:33:54,000
Now for this what kind of padding mask?

578
00:33:54,000 --> 00:33:56,000
I will create one comma, one comma zero.

579
00:33:57,000 --> 00:33:57,000
Done.

580
00:33:58,000 --> 00:33:58,000
Right.

581
00:33:59,000 --> 00:34:05,000
Now let's consider that every token is three-dimensional, okay.

582
00:34:05,000 --> 00:34:07,000
Every token is basically three dimensional.

583
00:34:07,000 --> 00:34:10,000
Four is represented by three dimensions, five is represented by three dimensions,

584
00:34:10,000 --> 00:34:12,000
and zero is represented by three dimensions.

585
00:34:12,000 --> 00:34:13,000
So that is the reason.

586
00:34:13,000 --> 00:34:18,000
What we can do is that we can also take this padding mask and extend it to three cross three.

587
00:34:18,000 --> 00:34:20,000
So let's go ahead and extend it okay.

588
00:34:20,000 --> 00:34:25,000
So guys now our task is basically to convert this 1D to 2D mask right.

589
00:34:25,000 --> 00:34:30,000
And for that we need to extend padding mask to 2D okay.

590
00:34:30,000 --> 00:34:34,000
Now how do I how do we go ahead and extend the padding mask to 2D.

591
00:34:34,000 --> 00:34:35,000
Understand.

592
00:34:35,000 --> 00:34:39,000
Over here whenever we have this one, that basically means it is a real token.

593
00:34:39,000 --> 00:34:41,000
Zero is nothing, but it is a zero padded token.

594
00:34:41,000 --> 00:34:42,000
Okay.

595
00:34:42,000 --> 00:34:47,000
Now let's take this example: 4, 5, 0 I have represented as 1, 1, 0.

596
00:34:47,000 --> 00:34:52,000
Considering that this is a real token, this is a real token and this is just a zero token, okay.

597
00:34:52,000 --> 00:35:02,000
Now in order to convert this 1D to 2D mask, uh, you will be able to see that each row.

598
00:35:02,000 --> 00:35:07,000
See each row in 2D mask will be a copy of the 1D mask.

599
00:35:07,000 --> 00:35:12,000
Okay, now understand one thing, guys, uh, this is really important to just understand.

600
00:35:12,000 --> 00:35:14,000
How do we apply 2D masking attention?

601
00:35:14,000 --> 00:35:17,000
Okay, so first of all, I have already created my 1D mask.

602
00:35:17,000 --> 00:35:20,000
So this is this is nothing, but it is a 1D mask.

603
00:35:21,000 --> 00:35:22,000
One dimension mask.

604
00:35:23,000 --> 00:35:28,000
Now in order to convert one dimension to two dimension each.

605
00:35:28,000 --> 00:35:35,000
For each token in the sequence, the mask should indicate which token it can attend to.

606
00:35:35,000 --> 00:35:38,000
Okay, now when I say which token it can attend to.

607
00:35:39,000 --> 00:35:42,000
Over here, when I am probably doing the encoding for this.

608
00:35:42,000 --> 00:35:50,000
If I'm converting this into vectors, four should probably take the context of this token, right?

609
00:35:50,000 --> 00:35:54,000
Similarly, five can probably take the context of this token.

610
00:35:54,000 --> 00:35:59,000
They should not take the context of this zero, right?

611
00:35:59,000 --> 00:36:03,000
I don't want my vector to get changed because of this zero.

612
00:36:03,000 --> 00:36:07,000
Instead, I want my vector to get changed with respect to four and five.

613
00:36:07,000 --> 00:36:07,000
Right?

614
00:36:07,000 --> 00:36:09,000
So this two tokens I really want.

615
00:36:09,000 --> 00:36:10,000
Right?

616
00:36:10,000 --> 00:36:13,000
So what we are saying is that for each token.

617
00:36:15,000 --> 00:36:30,000
For each token in the sequence in the sequence, the mask should indicate.

618
00:36:32,000 --> 00:36:41,000
Which tokens it can attend to.

619
00:36:41,000 --> 00:36:42,000
Okay.

620
00:36:43,000 --> 00:36:43,000
Okay.

621
00:36:44,000 --> 00:36:50,000
So based on this, if my padding mask is one, the padding mask is this.

622
00:36:50,000 --> 00:36:52,000
We will just go ahead and repeat this.

623
00:36:52,000 --> 00:36:53,000
I know four is there.

624
00:36:53,000 --> 00:36:54,000
Right.

625
00:36:54,000 --> 00:36:55,000
So for four I will write 1, 1, 0.

626
00:36:55,000 --> 00:37:00,000
That basically means for four, the attention can be changed with respect to five.

627
00:37:00,000 --> 00:37:00,000
Right.

628
00:37:00,000 --> 00:37:03,000
Then if I go and this is for the first token right.

629
00:37:03,000 --> 00:37:04,000
This is for the first token.

630
00:37:04,000 --> 00:37:06,000
Similarly for the second token I will go ahead and write 1, 1, 0.

631
00:37:06,000 --> 00:37:06,000
Why?

632
00:37:06,000 --> 00:37:12,000
Because I'm saying that, hey, five also needs to probably change the attention based on this

633
00:37:12,000 --> 00:37:12,000
token.

634
00:37:12,000 --> 00:37:13,000
Four.

635
00:37:13,000 --> 00:37:13,000
Right.

636
00:37:13,000 --> 00:37:15,000
So both this context should be there.

637
00:37:15,000 --> 00:37:17,000
So that is the reason we have repeated 1, 1, 0 and 1, 1, 0.

638
00:37:17,000 --> 00:37:19,000
This token is basically for four.

639
00:37:19,000 --> 00:37:21,000
This token is basically for five.

640
00:37:21,000 --> 00:37:24,000
And if I probably consider with respect to this this can impact this.

641
00:37:24,000 --> 00:37:26,000
This can basically impact this okay.

642
00:37:26,000 --> 00:37:28,000
This can be a, uh, context.

643
00:37:28,000 --> 00:37:32,000
I can also basically write over here, uh, token one.

644
00:37:34,000 --> 00:37:40,000
Token one attends to token one comma two.

645
00:37:41,000 --> 00:37:44,000
Similarly token two attends to two comma 1 or 1 comma two.

646
00:37:45,000 --> 00:37:46,000
Right.

647
00:37:46,000 --> 00:37:48,000
So this can probably get impacted.

648
00:37:48,000 --> 00:37:53,000
Again I'm telling you what is the exact meaning why I'm saying it can attend to.

649
00:37:53,000 --> 00:37:56,000
Okay, this is important for you all to understand, right?

650
00:37:56,000 --> 00:37:58,000
When we say it can attend to.

651
00:37:58,000 --> 00:37:59,000
Right?

652
00:37:59,000 --> 00:38:06,000
Uh, in short, we are basically saying that, hey, this token four can attend to token five in the

653
00:38:06,000 --> 00:38:14,000
context of the attention mechanism; we can change the context,

654
00:38:14,000 --> 00:38:15,000
okay.

655
00:38:15,000 --> 00:38:17,000
And we can change the representation.

656
00:38:17,000 --> 00:38:22,000
But the final one, if I probably see with respect to the token zero, I don't want this to impact anything.

657
00:38:22,000 --> 00:38:29,000
So when I convert this 1D mask to a 2D mask, this is how it is basically going to look.

658
00:38:29,000 --> 00:38:29,000
Right.

659
00:38:29,000 --> 00:38:31,000
So this is my token one token two.

660
00:38:31,000 --> 00:38:33,000
This both can probably change its own attention.

661
00:38:33,000 --> 00:38:34,000
So I'm writing 1, 1, 0.

662
00:38:34,000 --> 00:38:38,000
Similarly I have 1, 1, 0 over here, then 0, 0, 0 over here, okay.
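The extension from the 1D mask to the 2D mask described above can be sketched like this. It follows the walkthrough, where the row for a padding token is all zeros as well; extend_padding_mask is an illustrative name, and other implementations may mask only the key columns.
```python
def extend_padding_mask(mask_1d):
    # entry [i][j] is 1 only when token i and token j are both real tokens
    return [[qi * kj for kj in mask_1d] for qi in mask_1d]
extended = extend_padding_mask([1, 1, 0])  # 1D mask of the sequence 4, 5, 0
for row in extended:
    print(row)
```
The rows for tokens four and five are copies of the 1D mask, and the row for the padding token is zeroed out, matching what was written on the board.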

663
00:38:38,000 --> 00:38:42,000
So this is how things work with, uh, respect to this.

664
00:38:42,000 --> 00:38:46,000
And I hope you got an idea with respect to this particular example.

665
00:38:46,000 --> 00:38:49,000
Okay, but we'll not just stop over here, okay?

666
00:38:49,000 --> 00:38:53,000
There are many more things that we really need to probably go ahead and discuss.

667
00:38:53,000 --> 00:38:53,000
Okay.

668
00:38:53,000 --> 00:38:54,000
And that is nothing.

669
00:38:54,000 --> 00:38:58,000
But we need to also go ahead and apply one more mask, okay.

670
00:38:58,000 --> 00:39:01,000
And how do we combine both of those masks?

671
00:39:01,000 --> 00:39:03,000
We will try to understand that okay.

672
00:39:03,000 --> 00:39:04,000
But I hope you got an idea.

673
00:39:04,000 --> 00:39:05,000
over here.

674
00:39:05,000 --> 00:39:08,000
So this is my extended padding mask.

675
00:39:08,000 --> 00:39:15,000
Okay, then the next step, what we do is that we will go ahead and create our look ahead mask.

676
00:39:17,000 --> 00:39:18,000
Now how do we calculate this.

677
00:39:18,000 --> 00:39:20,000
Look ahead mask.

678
00:39:20,000 --> 00:39:26,000
Look ahead mask is also based on which token can attend.

679
00:39:26,000 --> 00:39:30,000
See this is completely dependent on the decoder output.

680
00:39:31,000 --> 00:39:33,000
And here the sequence is very much important.

681
00:39:34,000 --> 00:39:40,000
Now if I really want to create the look ahead mask, I know right when I probably get the first token,

682
00:39:40,000 --> 00:39:41,000
I will keep it as one.

683
00:39:41,000 --> 00:39:45,000
This will be zero zero because the further context I don't require.

684
00:39:45,000 --> 00:39:49,000
Similarly, I will go ahead and write 110 the second token.

685
00:39:49,000 --> 00:39:53,000
If I'm probably getting it, it should have the context of first, not the future.

686
00:39:53,000 --> 00:39:58,000
And if I try to probably predict the third one, this should probably have the context of all the previous,

687
00:39:58,000 --> 00:39:59,000
not the further one, right?

688
00:39:59,000 --> 00:40:01,000
So since I'm working with three cross three.

689
00:40:01,000 --> 00:40:04,000
So this will be my another three cross three matrix okay.

690
00:40:04,000 --> 00:40:11,000
And uh, if I really want to specify this in a better way, it should be better that I create a separate

691
00:40:11,000 --> 00:40:13,000
arrays for this or matrix for this.

692
00:40:13,000 --> 00:40:15,000
So this will be 1, 0, 0.

693
00:40:15,000 --> 00:40:19,000
Then I have the next one, which is nothing but 1, 1, 0.

694
00:40:19,000 --> 00:40:22,000
And then I have nothing but 1, 1, 1.
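A small sketch of the look-ahead mask just written out, for any sequence length. The function name look_ahead_mask is an assumption for illustration; the rule it encodes is that position i may look at positions 0 to i and nothing after.
```python
def look_ahead_mask(n):
    # row i has ones up to and including column i, zeros for future positions
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
mask3 = look_ahead_mask(3)
for row in mask3:
    print(row)
```
Calling look_ahead_mask(4) gives the same lower triangular pattern for a four cross four score matrix.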

695
00:40:22,000 --> 00:40:25,000
Okay, since we are using three cross three, this is how it is basically going to look.

696
00:40:25,000 --> 00:40:28,000
Now in this kind of masking you'll be able to see the top.

697
00:40:28,000 --> 00:40:30,000
Everything above the diagonal will be zero, right?

698
00:40:30,000 --> 00:40:36,000
And in this masking, wherever the token is zero, I'll be making sure that I try to convert that into

699
00:40:36,000 --> 00:40:39,000
an extended mask where this zero will be created in the similar way.

700
00:40:39,000 --> 00:40:40,000
Okay.

701
00:40:40,000 --> 00:40:45,000
Now the next step will be that we combine.

702
00:40:46,000 --> 00:40:54,000
We combine padding and look-ahead

703
00:40:54,000 --> 00:40:54,000
Mask.

704
00:40:55,000 --> 00:40:59,000
Why do we combine with the help of padding mask?

705
00:40:59,000 --> 00:41:02,000
You'll be able to see that we are removing all the zeros.

706
00:41:02,000 --> 00:41:02,000
Right.

707
00:41:02,000 --> 00:41:03,000
And how do we do this?

708
00:41:03,000 --> 00:41:05,000
It is very much simple.

709
00:41:05,000 --> 00:41:08,000
It is basically with the help of element wise.

710
00:41:09,000 --> 00:41:10,000
Element wise.

711
00:41:10,000 --> 00:41:11,000
Multiplication of the two masks.

712
00:41:13,000 --> 00:41:17,000
Multiplication of the two masks, okay.

713
00:41:17,000 --> 00:41:21,000
Now in this scenario if I go ahead and combine my mask.

714
00:41:21,000 --> 00:41:23,000
So combine mask is nothing.

715
00:41:23,000 --> 00:41:25,000
But we will go ahead and do the calculation.

716
00:41:26,000 --> 00:41:29,000
Uh, it's nothing but a simple element-wise multiplication.

717
00:41:29,000 --> 00:41:33,000
Okay, so here, uh, I will just go ahead and write.

718
00:41:33,000 --> 00:41:38,000
If you just do the multiplication operation, the final output that I'm actually going to get is 1, 0, 0.

719
00:41:39,000 --> 00:41:39,000
Okay.

720
00:41:39,000 --> 00:41:42,000
Then I have one comma one comma zero.

721
00:41:42,000 --> 00:41:45,000
And then I have zero comma zero comma zero.

722
00:41:45,000 --> 00:41:45,000
Okay.

723
00:41:45,000 --> 00:41:47,000
So this is my final matrix.
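The combination step can be checked with a short sketch. It multiplies the two three cross three masks from the example position by position; the variable names are illustrative.
```python
padding_2d = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]  # extended padding mask
look_ahead = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]  # look-ahead mask
# element-wise multiplication: a position stays 1 only if both masks allow it
combined = [
    [p * l for p, l in zip(p_row, l_row)]
    for p_row, l_row in zip(padding_2d, look_ahead)
]
for row in combined:
    print(row)
```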

724
00:41:47,000 --> 00:41:48,000
And that is what we have done.

725
00:41:48,000 --> 00:41:53,000
We have done the multiplication of two, uh, important matrices,

726
00:41:53,000 --> 00:41:53,000
uh, mask matrices,

727
00:41:53,000 --> 00:41:54,000
in short.

728
00:41:54,000 --> 00:41:55,000
Okay.

729
00:41:55,000 --> 00:41:57,000
And this is how you have actually got it right.

730
00:41:57,000 --> 00:42:03,000
Then the next step in this will be that wherever the value is, zero.

731
00:42:03,000 --> 00:42:06,000
Please listen to this very, very carefully.

732
00:42:06,000 --> 00:42:11,000
Wherever in the combined mask the value is basically zero.

733
00:42:11,000 --> 00:42:14,000
There we specify okay.

734
00:42:15,000 --> 00:42:18,000
There we specifically specify minus infinity, okay.

735
00:42:18,000 --> 00:42:20,000
The reason I will tell you okay.

736
00:42:20,000 --> 00:42:23,000
Why do we specify minus infinity?
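As a preview of why this works, here is a hedged sketch: in common implementations the value written at masked positions is negative infinity, so that softmax gives those positions a weight of exactly zero. The scores below are only illustrative, masked_softmax is a made-up helper name, and the sketch assumes at least one position in the row is unmasked.
```python
import math
def masked_softmax(scores, mask):
    # write -inf wherever the mask is 0, keep the score wherever it is 1
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract the row maximum for numerical stability
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
weights = masked_softmax([0.3, 0.7, 1.1], [1, 1, 0])
print(weights)
```
The third weight comes out as zero even though its raw score was the largest, so the masked token contributes nothing to the attention output.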

737
00:42:23,000 --> 00:42:25,000
But let us do one thing.

738
00:42:25,000 --> 00:42:30,000
Let us do all the steps for this particular data that we have created right for this data.

739
00:42:30,000 --> 00:42:35,000
So here I'm just going to copy this particular data because we are working on this.

740
00:42:35,000 --> 00:42:37,000
And here we discussed mask application.

741
00:42:37,000 --> 00:42:39,000
We saw an example.

742
00:42:39,000 --> 00:42:41,000
And finally we got the combined mask.

743
00:42:41,000 --> 00:42:45,000
Now for this data we will go ahead and compute our mask okay.

744
00:42:45,000 --> 00:42:47,000
So let's go ahead and compute our mask.

745
00:42:47,000 --> 00:42:51,000
So here you could see uh this was my score right.

746
00:42:51,000 --> 00:42:54,000
And uh with respect to this particular score.

747
00:42:54,000 --> 00:42:57,000
Now let's go ahead and find out our look ahead mask.

748
00:42:57,000 --> 00:43:01,000
So here I'm going to go ahead and compute my look ahead mask.

749
00:43:01,000 --> 00:43:05,000
Now in order to get the look ahead mask.

750
00:43:05,000 --> 00:43:06,000
It is very simple here how much?

751
00:43:06,000 --> 00:43:12,000
I have four cross four, right? Now with respect to four cross four, again this will be 1, 0, 0, 0.

752
00:43:13,000 --> 00:43:16,000
Then the next one will be 1, 1, 0, 0.

753
00:43:16,000 --> 00:43:19,000
Then I have this 1, 1, 1, 0.

754
00:43:20,000 --> 00:43:24,000
And then I have finally 1, 1, 1, 1, since it is four-dimensional.

755
00:43:24,000 --> 00:43:25,000
Right.

756
00:43:25,000 --> 00:43:28,000
And here this is my complete look ahead mask.

757
00:43:28,000 --> 00:43:28,000
Right.

758
00:43:28,000 --> 00:43:32,000
And similarly if I want to get the padding mask okay.

759
00:43:32,000 --> 00:43:39,000
So in order to get the padding mask and with respect to the padding mask, I know I need to have this

760
00:43:39,000 --> 00:43:41,000
in the extended to 2D format.

761
00:43:42,000 --> 00:43:42,000
Okay.

762
00:43:43,000 --> 00:43:44,000
Single format.

763
00:43:44,000 --> 00:43:46,000
It is very simple but 2D format.

764
00:43:46,000 --> 00:43:51,000
What we really need to do over here is that I will go ahead and write one comma, one comma, one comma

765
00:43:51,000 --> 00:43:53,000
zero, because this is not important.

766
00:43:53,000 --> 00:43:54,000
Then again I have.

767
00:43:54,000 --> 00:43:56,000
This is dependent on this.

768
00:43:56,000 --> 00:43:57,000
This is dependent on this.

769
00:43:57,000 --> 00:43:59,000
This is in turn dependent on this dependent on this.

770
00:43:59,000 --> 00:44:01,000
So I'll again go ahead and write this one.

771
00:44:01,000 --> 00:44:04,000
Then I will be having my third one: one comma, one comma, one comma zero.

772
00:44:04,000 --> 00:44:09,000
And then the fourth one will be nothing but zero comma zero comma zero comma zero.

773
00:44:09,000 --> 00:44:09,000
Okay.

774
00:44:10,000 --> 00:44:15,000
So this is my padding mask right now when I want to probably create my combined mask.

775
00:44:15,000 --> 00:44:16,000
So it is nothing.

776
00:44:16,000 --> 00:44:28,000
But if I want to probably find out my combined mask, it is the product of the look-ahead mask

777
00:44:29,000 --> 00:44:35,000
multiplied by the padding mask.

778
00:44:35,000 --> 00:44:37,000
Now how are we going to do this particular product?

779
00:44:37,000 --> 00:44:38,000
It is very simple.

780
00:44:39,000 --> 00:44:43,000
Um, it's more like element wise multiplication.

781
00:44:43,000 --> 00:44:44,000
This will get multiplied to this.

782
00:44:44,000 --> 00:44:46,000
This will get multiplied to this.

783
00:44:46,000 --> 00:44:46,000
Like that.

784
00:44:46,000 --> 00:44:46,000
Right.

785
00:44:46,000 --> 00:44:52,000
So finally, if you go ahead and do this calculation, it is nothing but one multiplied by one, zero

786
00:44:52,000 --> 00:44:58,000
multiplied by one zero multiplied by one zero multiplied by zero.

787
00:44:58,000 --> 00:44:58,000
Okay.

788
00:44:58,000 --> 00:45:01,000
Then the next one will be nothing but one.

789
00:45:01,000 --> 00:45:08,000
Multiply by one, one multiplied by one, zero multiplied by one, and zero multiplied by zero.

790
00:45:08,000 --> 00:45:08,000
Okay.

791
00:45:09,000 --> 00:45:11,000
Then again one.

792
00:45:11,000 --> 00:45:12,000
Multiply by one.

793
00:45:12,000 --> 00:45:13,000
One multiplied by one.

794
00:45:13,000 --> 00:45:15,000
One multiplied by one.

795
00:45:15,000 --> 00:45:17,000
Zero multiplied by zero.

796
00:45:17,000 --> 00:45:20,000
So when we say multiplication, this is nothing but dot wise multiplication okay.

797
00:45:20,000 --> 00:45:22,000
Uh, element wise multiplication.

798
00:45:22,000 --> 00:45:22,000
Sorry.

799
00:45:22,000 --> 00:45:27,000
One multiplied by one, one multiplied by one, one multiplied by one.

800
00:45:27,000 --> 00:45:29,000
And here I have one multiplied by zero.

801
00:45:29,000 --> 00:45:30,000
Okay.

802
00:45:30,000 --> 00:45:32,000
So finally this is my output.

803
00:45:32,000 --> 00:45:34,000
And the output is very simple.

804
00:45:34,000 --> 00:45:38,000
Over here I will be getting one comma zero comma zero comma zero.

805
00:45:38,000 --> 00:45:42,000
Then I have one comma, one comma zero comma zero.

806
00:45:42,000 --> 00:45:47,000
Then I have one comma one comma one comma zero.

807
00:45:47,000 --> 00:45:51,000
Then I have one comma, one comma, one comma.

808
00:45:51,000 --> 00:45:52,000
Sorry one comma zero.

809
00:45:55,000 --> 00:45:59,000
So this is my final combined mask.

810
00:45:59,000 --> 00:46:03,000
This zero is there basically to ignore all the tokens.

811
00:46:03,000 --> 00:46:04,000
Padded tokens.

812
00:46:04,000 --> 00:46:07,000
And wherever there is one those are very important okay.
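The mask construction just walked through can be sketched in plain Python. This is a minimal illustration, not the code used in the video: the names `is_real`, `look_ahead`, `padding`, and `combined` are made up here, and it assumes the padding row 1, 1, 1, 0 is repeated for every row, which is what the element-wise multiplication above uses.

```python
# Sketch: build the 4x4 look-ahead and padding masks and combine them.
seq_len = 4

# Assumed padding flags: 1 = real token, 0 = padded token (last position padded).
is_real = [1, 1, 1, 0]

# Look-ahead mask: row i may attend to columns 0..i only.
look_ahead = [[1 if j <= i else 0 for j in range(seq_len)]
              for i in range(seq_len)]

# Padding mask extended to 2D: the same flags repeated for every row.
padding = [list(is_real) for _ in range(seq_len)]

# Combined mask: element-wise product of the two masks.
combined = [[look_ahead[i][j] * padding[i][j] for j in range(seq_len)]
            for i in range(seq_len)]

for row in combined:
    print(row)
```

A zero in `combined` marks a position to ignore (a future token or a padded token); a one marks a position that is kept.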

813
00:46:07,000 --> 00:46:11,000
Now the next step which is really really important okay.

814
00:46:11,000 --> 00:46:14,000
Which is about masked scores.

815
00:46:15,000 --> 00:46:17,000
We go ahead and compute our masked scores.

816
00:46:18,000 --> 00:46:19,000
And it is very simple.

817
00:46:19,000 --> 00:46:26,000
The masked score is nothing but this: wherever there are zeros in the mask, we will just go ahead and

818
00:46:26,000 --> 00:46:27,000
add.

819
00:46:27,000 --> 00:46:31,000
You can just basically do something over here okay.

820
00:46:31,000 --> 00:46:33,000
So let's, let's let's see this okay.

821
00:46:33,000 --> 00:46:36,000
Now here I have my combined mask okay.

822
00:46:36,000 --> 00:46:41,000
Now for getting this combined mask for applying this combined mask.

823
00:46:41,000 --> 00:46:46,000
What we'll do we'll apply this combined mask to this entire matrix okay.

824
00:46:47,000 --> 00:46:53,000
And the next step that you'll be seeing after this will be that we will try to convert this into a mask

825
00:46:53,000 --> 00:46:54,000
over here.

826
00:46:54,000 --> 00:47:01,000
And in this, instead of zero, here we are going to convert this into minus infinity. Why minus infinity?

827
00:47:01,000 --> 00:47:03,000
I'll just talk about it in a while okay.

828
00:47:03,000 --> 00:47:08,000
So here my next step will be converting everything into minus infinity.

829
00:47:08,000 --> 00:47:10,000
The reason is very simple.

830
00:47:10,000 --> 00:47:11,000
Very very simple.

831
00:47:11,000 --> 00:47:12,000
Just think over it.

832
00:47:12,000 --> 00:47:16,000
Till then I will go ahead and write this minus infinity.

833
00:47:16,000 --> 00:47:22,000
Wherever there is zero that will get converted into minus infinity okay.

834
00:47:22,000 --> 00:47:26,000
So this in turn I'm actually getting in the form of minus infinity.

835
00:47:26,000 --> 00:47:29,000
And again we do the element-wise operation with this, right?

836
00:47:29,000 --> 00:47:31,000
Wherever there is minus infinity that will become minus infinity.

837
00:47:31,000 --> 00:47:37,000
Okay, so for the final masked score, you'll be able to see that I will go ahead and just do the element-wise operation.

838
00:47:37,000 --> 00:47:43,000
So I'm going to get 0.3, minus infinity, minus infinity, minus infinity, okay.

839
00:47:43,000 --> 00:47:50,000
And then you have this 0.7, 1.9, minus infinity, minus infinity.

840
00:47:50,000 --> 00:47:55,000
Then similarly you have 1.1, 3.1, 5.1.

841
00:47:55,000 --> 00:47:57,000
And again you have minus infinity.

842
00:47:57,000 --> 00:48:02,000
Then you have 0.0, 0.0, 0.0.

843
00:48:02,000 --> 00:48:03,000
Then again you have minus infinity.

844
00:48:04,000 --> 00:48:07,000
So this is your final masked score.
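Here is a small sketch of how the combined mask turns raw scores into masked scores. The kept values are the ones read out above; the 9.9 entries are made-up placeholders for positions that get masked, and `apply_mask` is a hypothetical helper name. Minus infinity is added at masked positions, so whatever score was there becomes minus infinity.

```python
import math

# Raw scores: kept values come from the walkthrough; 9.9 is a made-up
# placeholder for every position that the mask will remove.
scores = [
    [0.3, 9.9, 9.9, 9.9],
    [0.7, 1.9, 9.9, 9.9],
    [1.1, 3.1, 5.1, 9.9],
    [0.0, 0.0, 0.0, 9.9],
]

# Combined mask from the walkthrough: 1 = keep, 0 = ignore.
combined = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]

def apply_mask(scores, mask):
    """Add 0.0 where the mask is 1 and minus infinity where it is 0."""
    return [[s + (0.0 if m == 1 else -math.inf)
             for s, m in zip(score_row, mask_row)]
            for score_row, mask_row in zip(scores, mask)]

masked_scores = apply_mask(scores, combined)
for row in masked_scores:
    print(row)
```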

845
00:48:07,000 --> 00:48:08,000
Now.

846
00:48:10,000 --> 00:48:18,000
The most important thing that you really need to understand is why you need to have this minus infinity.

847
00:48:19,000 --> 00:48:23,000
So here you'll be able to see that, uh, one minor mistake that I did was that I told, hey, we are

848
00:48:23,000 --> 00:48:25,000
replacing zero with minus infinity.

849
00:48:25,000 --> 00:48:29,000
No, minus infinity is getting added to this specific zero value.

850
00:48:29,000 --> 00:48:29,000
Okay.

851
00:48:29,000 --> 00:48:37,000
And why this is actually done is that I will just go ahead and write the reason over here, okay.

852
00:48:37,000 --> 00:48:44,000
When we apply or when we try to, uh, you know, add this specific value.

853
00:48:44,000 --> 00:48:44,000
Okay.

854
00:48:44,000 --> 00:48:56,000
It is basically to zero out or it is to remove the influence, zero out the influence when the softmax

855
00:48:56,000 --> 00:48:57,000
activation function is applied.

856
00:48:57,000 --> 00:49:00,000
Because the next step is applying softmax activation function.

857
00:49:01,000 --> 00:49:02,000
Okay.

858
00:49:03,000 --> 00:49:08,000
Now when we apply softmax to minus infinity you will be seeing the answer what you will be able to get.

859
00:49:08,000 --> 00:49:08,000
Okay.

860
00:49:09,000 --> 00:49:10,000
Uh, this is crucial.

861
00:49:10,000 --> 00:49:14,000
This step is crucial to ensure that certain position.

862
00:49:14,000 --> 00:49:14,000
Right.

863
00:49:14,000 --> 00:49:15,000
Example.

864
00:49:15,000 --> 00:49:20,000
If I probably consider padded tokens, or future tokens in the case of look-ahead masking, they do not affect

865
00:49:20,000 --> 00:49:22,000
the attention mechanism, okay.

866
00:49:22,000 --> 00:49:25,000
And obviously the impact of zeros will also get removed, right?

867
00:49:25,000 --> 00:49:30,000
When we apply a mask, we need to ensure that certain tokens do not influence the attention mechanism.

868
00:49:30,000 --> 00:49:37,000
We achieve this by setting the attention score for these tokens to a very large negative number.

869
00:49:37,000 --> 00:49:40,000
When the softmax function is applied to the scores, it converts this.

870
00:49:40,000 --> 00:49:43,000
This entire value will be getting converted to zero.

871
00:49:43,000 --> 00:49:49,000
Okay, ensuring that these tokens have no influence on the attention weights.

872
00:49:49,000 --> 00:49:52,000
So attention weights will not get impacted.

873
00:49:52,000 --> 00:49:53,000
Right?

874
00:49:53,000 --> 00:49:58,000
Because after this, after we calculate, uh, the masked scores, what we are specifically going to do,

875
00:49:58,000 --> 00:50:03,000
okay, we have to probably convert, uh, or apply a softmax activation function.

876
00:50:03,000 --> 00:50:05,000
I hope everybody knows that.

877
00:50:05,000 --> 00:50:05,000
Right.

878
00:50:05,000 --> 00:50:08,000
So that is what we are specifically going to do over here.

879
00:50:09,000 --> 00:50:12,000
Uh, over here, we are just going to go ahead and apply our softmax.

880
00:50:12,000 --> 00:50:13,000
That is our next step.

881
00:50:13,000 --> 00:50:21,000
So let's go ahead and do the softmax because after this we'll go ahead and add our values.

882
00:50:21,000 --> 00:50:21,000
Right.

883
00:50:21,000 --> 00:50:25,000
So softmax scores that we are going to get is nothing but softmax.

884
00:50:26,000 --> 00:50:33,000
And here I'm going to use my masked scores, right? Now with respect to the masked scores,

885
00:50:33,000 --> 00:50:35,000
When I apply the softmax.

886
00:50:35,000 --> 00:50:38,000
Uh, here you will be able to see that I will get 1.0.

887
00:50:38,000 --> 00:50:43,000
I've done the calculation; minus infinity is basically going to convert into zero.

888
00:50:43,000 --> 00:50:44,000
Okay.

889
00:50:44,000 --> 00:50:46,000
You can go ahead and write this or you can also check it out.

890
00:50:46,000 --> 00:50:51,000
This will be 0.3, 0.7, 0.0, 0.0.

891
00:50:52,000 --> 00:50:59,000
Okay then I have .3.6.60.0.

892
00:50:59,000 --> 00:51:04,000
And then I have 1.0, 0.0, 0.0, 0.0.

893
00:51:05,000 --> 00:51:09,000
So this is my entire score that I'm actually getting, okay.
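To check the claim that minus infinity becomes zero after softmax, here is a small sketch using only the standard library. The `softmax` helper is written for this illustration and assumes every row has at least one unmasked entry. It prints exact softmax values, so they may differ from the rounded figures read out above.

```python
import math

def softmax(row):
    """Row-wise softmax; exp(-inf) evaluates to 0.0, so masked entries vanish."""
    largest = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Masked scores from the walkthrough.
masked_scores = [
    [0.3, -math.inf, -math.inf, -math.inf],
    [0.7, 1.9, -math.inf, -math.inf],
    [1.1, 3.1, 5.1, -math.inf],
    [0.0, 0.0, 0.0, -math.inf],
]

weights = [softmax(row) for row in masked_scores]
for row in weights:
    print([round(w, 3) for w in row])
```

Every masked position gets a weight of exactly zero, and the remaining weights in each row still sum to one.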

894
00:51:09,000 --> 00:51:15,000
And then, as you know, in the self-attention, the final step will be that we will do the weighted

895
00:51:15,000 --> 00:51:18,000
sum of values, right.

896
00:51:18,000 --> 00:51:27,000
So every number over here inside the softmax is going to get multiplied by, uh, this this this softmax

897
00:51:27,000 --> 00:51:32,000
score okay is going to get multiplied by v okay.

898
00:51:32,000 --> 00:51:41,000
So here my attention output will be nothing, but it will be softmax scores.

899
00:51:43,000 --> 00:51:49,000
Multiplied by the vector V. Okay, so guys, this is the step-by-step mechanism that specifically happens

900
00:51:49,000 --> 00:51:51,000
in the decoder which is similar to encoder.

901
00:51:51,000 --> 00:51:56,000
But one additional step that you were able to find out was about masking.
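The final weighted sum can be sketched like this. The attention weights and the value matrix `V` below are made-up numbers chosen to be easy to follow, not the values from the video, and `weighted_sum` is a hypothetical helper name.

```python
# Assumed attention weights: each row sums to 1 and masked positions are 0.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.75, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.5, 0.25, 0.25, 0.0],
]

# Assumed value vectors, one row per token; the last row belongs to the
# padded token and never contributes because its weight is always 0.
V = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [9.0, 9.0],
]

def weighted_sum(weights, values):
    """Attention output: each output row is a weighted sum of the value rows."""
    dim = len(values[0])
    return [[sum(w * v[k] for w, v in zip(weight_row, values))
             for k in range(dim)]
            for weight_row in weights]

attention_output = weighted_sum(weights, V)
for row in attention_output:
    print(row)
```

The first output row is just the first value vector, because the first token can only attend to itself.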

902
00:51:56,000 --> 00:51:57,000
Right.

903
00:51:57,000 --> 00:51:59,000
And what is the importance of masking?

904
00:51:59,000 --> 00:52:03,000
I will be giving you this in written form in this specific PDF, so that you'll be able to understand.

905
00:52:03,000 --> 00:52:07,000
Masking in the transformer architecture is essential for several reasons.

906
00:52:07,000 --> 00:52:12,000
It helps manage the structure of the sequences being processed and ensures the model behaves correctly during

907
00:52:12,000 --> 00:52:13,000
training and inference.

908
00:52:13,000 --> 00:52:15,000
Here are the key reasons for using masking.

909
00:52:15,000 --> 00:52:19,000
One is basically to handle variable-length sequences with padding masking, because there will be a lot

910
00:52:19,000 --> 00:52:24,000
of zeros when we handle sequences of different lengths in a batch. It ensures that the padding tokens,

911
00:52:24,000 --> 00:52:27,000
which are added to make sequences of uniform length, do not affect the model's predictions.

912
00:52:27,000 --> 00:52:32,000
And the second is basically to maintain the autoregressive property with the look-ahead mask, right?

913
00:52:32,000 --> 00:52:37,000
The purpose is very simple: to ensure that each position in the decoder output sequence can only attend

914
00:52:37,000 --> 00:52:40,000
to the previous positions or itself, but not to future positions.

915
00:52:40,000 --> 00:52:46,000
This is crucial for sequence generation tasks like language modeling and translation, where the model

916
00:52:46,000 --> 00:52:49,000
should not have access to the future tokens while predicting the current token.

917
00:52:49,000 --> 00:52:54,000
Okay, so these are some of the important steps with respect to the masking.

918
00:52:54,000 --> 00:52:59,000
And now the next step that we are basically going to discuss is add and normalization.

919
00:52:59,000 --> 00:53:03,000
And then what is this exact line and why this particular line is also coming up.

920
00:53:03,000 --> 00:53:03,000
Okay.

921
00:53:03,000 --> 00:53:07,000
So that is what we are going to discuss in the next video.

922
00:53:07,000 --> 00:53:09,000
So yes, this was it from my side.

923
00:53:09,000 --> 00:53:10,000
I'll see you all in the next video.

924
00:53:10,000 --> 00:53:10,000
Thank you.

