1
00:00:00,000 --> 00:00:01,000
Hello guys!

2
00:00:01,000 --> 00:00:05,000
So finally we are in the last stage of understanding Transformers.

3
00:00:05,000 --> 00:00:13,000
I know this entire series now probably is more than five hours of content, but you can just understand

4
00:00:13,000 --> 00:00:16,000
why this entire architecture is so amazing, right?

5
00:00:16,000 --> 00:00:21,000
It looks complex, but if you break down this entire architecture, you will be able to understand each

6
00:00:21,000 --> 00:00:22,000
and every thing.

7
00:00:22,000 --> 00:00:26,000
And that is what I have actually done while explaining all these topics, right?

8
00:00:26,000 --> 00:00:32,000
So starting from positional encoding to multi-head attention to the feed-forward neural network, here

9
00:00:32,000 --> 00:00:34,000
also we discussed masked multi-head attention.

10
00:00:34,000 --> 00:00:36,000
We went through multi-head attention.

11
00:00:36,000 --> 00:00:38,000
We also covered the feed-forward neural network.

12
00:00:38,000 --> 00:00:44,000
Now it's time to understand the final layer of the decoder.

13
00:00:44,000 --> 00:00:44,000
Right?

14
00:00:44,000 --> 00:00:48,000
That is nothing but the linear and softmax layers.

15
00:00:48,000 --> 00:00:52,000
And through this we basically go ahead and calculate the output probabilities.

16
00:00:52,000 --> 00:01:00,000
Now I will try to just make you understand again, this entire diagram is from that amazing blog right

17
00:01:00,000 --> 00:01:02,000
about Transformers.

18
00:01:02,000 --> 00:01:06,000
And what I will do is that I will also go ahead and share that link with you all.

19
00:01:06,000 --> 00:01:08,000
Okay, you can go ahead and refer it.

20
00:01:09,000 --> 00:01:14,000
So till here, what have we actually seen? We have seen all the kinds of operations related to

21
00:01:14,000 --> 00:01:20,000
Q, K, V here, where K and V are basically coming from here.

22
00:01:20,000 --> 00:01:25,000
And again we also have a lot of skip connections, like from here to here.

23
00:01:25,000 --> 00:01:29,000
We are passing some information regarding positional encoding, then other information.

24
00:01:29,000 --> 00:01:31,000
I'm passing it over here.

25
00:01:31,000 --> 00:01:32,000
Now let's finally do one thing.

26
00:01:32,000 --> 00:01:37,000
See guys after this decoder is completed, right at the end of the day, you are just going to get vectors

27
00:01:37,000 --> 00:01:43,000
right? Now, how do we convert this vector into a word?

28
00:01:43,000 --> 00:01:46,000
That is what we are going to see in this session, right?

29
00:01:47,000 --> 00:01:55,000
So our main aim in this particular session is to see how we will be able to convert our vectors into our

30
00:01:55,000 --> 00:01:56,000
output word.

31
00:01:57,000 --> 00:02:00,000
And that is what I am going to discuss step by step.

32
00:02:01,000 --> 00:02:07,000
Now, first of all, here you will be able to see there is something called linear and softmax.

33
00:02:07,000 --> 00:02:10,000
First of all, we'll try to understand what exactly is this linear.

34
00:02:10,000 --> 00:02:15,000
So to begin with this linear or I will just go ahead and write the linear layer.

35
00:02:19,000 --> 00:02:34,000
is a simple fully connected neural network

36
00:02:37,000 --> 00:02:51,000
That projects the vector produced by the stack of decoders.

37
00:02:53,000 --> 00:02:54,000
Okay.

38
00:02:54,000 --> 00:02:56,000
Stack of decoders.

39
00:02:58,000 --> 00:03:05,000
And it basically generates a very large vector, which we also call the logits vector.

40
00:03:05,000 --> 00:03:06,000
Okay.

41
00:03:06,000 --> 00:03:07,000
Logits vectors.

42
00:03:07,000 --> 00:03:10,000
Now I'll talk about the importance of this logits vector.

43
00:03:10,000 --> 00:03:13,000
And this is what a linear layer basically does.

44
00:03:14,000 --> 00:03:14,000
Okay.

45
00:03:14,000 --> 00:03:16,000
Linear layer basically does.

46
00:03:17,000 --> 00:03:22,000
So let me go ahead and explain: a linear layer is a simple fully connected neural network

47
00:03:22,000 --> 00:03:26,000
that projects the vector produced by the stack of decoders.

48
00:03:26,000 --> 00:03:27,000
Right.

49
00:03:27,000 --> 00:03:32,000
And it basically projects this into a large vector, which is called the logits vector.

50
00:03:32,000 --> 00:03:35,000
Now what exactly is this logits vector?

51
00:03:35,000 --> 00:03:35,000
Okay.

52
00:03:35,000 --> 00:03:37,000
So we'll try to understand this.

53
00:03:37,000 --> 00:03:47,000
Let's say we have a specific model and, let's say, our model knows

54
00:03:47,000 --> 00:03:49,000
10,000 words, right.

55
00:03:49,000 --> 00:03:51,000
Let's say that there are 10,000 unique words.

56
00:03:51,000 --> 00:03:53,000
And this is my entire vocabulary.

57
00:03:54,000 --> 00:03:55,000
Right.

58
00:03:55,000 --> 00:03:59,000
So in this case my logits vector will be 10,000.

59
00:04:01,000 --> 00:04:07,000
The logits vector will be 10,000 cells wide.

60
00:04:07,000 --> 00:04:13,000
So here, instead of having just this vocab size, we'll be having a vocab size of 10,000, okay?

61
00:04:13,000 --> 00:04:15,000
Or logit size.

62
00:04:15,000 --> 00:04:21,000
Now understand each and every block that you see in this logits vector.

63
00:04:21,000 --> 00:04:24,000
It corresponds to a score of a unique word.

64
00:04:24,000 --> 00:04:32,000
So here this specifically corresponds to a score of a unique word.

65
00:04:33,000 --> 00:04:34,000
What kind of score?

66
00:04:34,000 --> 00:04:35,000
We will talk about it, right?

67
00:04:35,000 --> 00:04:38,000
It's just like mapping a vector to words.

68
00:04:38,000 --> 00:04:38,000
Right?

69
00:04:38,000 --> 00:04:45,000
So for this, once I get all the vectors from the stack of decoders,

70
00:04:45,000 --> 00:04:48,000
I'm going to pass them through the linear layer.

71
00:04:48,000 --> 00:04:51,000
It is a fully connected neural network layer.

72
00:04:52,000 --> 00:04:57,000
And once we pass through this, it is going to give us an output of this many nodes.

73
00:04:57,000 --> 00:04:57,000
Right.

74
00:04:57,000 --> 00:05:00,000
This many outputs, equal to the vocab size.

75
00:05:00,000 --> 00:05:03,000
So the number of words in the vocab, right?

76
00:05:04,000 --> 00:05:08,000
So again, over here you'll be able to see it will take this particular vector and it will convert

77
00:05:08,000 --> 00:05:10,000
this into a logits vector, okay?
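A minimal sketch of the linear layer described above, in plain Python. The sizes (d_model = 4, vocab_size = 6), the random weights and the decoder vector are made-up assumptions for illustration, not values from this session.
```python
import random
# Hypothetical sizes: a 4-dimensional decoder output and a 6-word vocabulary.
d_model, vocab_size = 4, 6
random.seed(0)
# Weights (vocab_size x d_model) and bias of the fully connected layer; untrained, random.
W = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]
b = [0.0] * vocab_size
def linear(x):
    """Project one decoder output vector to one score (logit) per vocabulary word."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]
decoder_vector = [0.5, -1.2, 0.3, 0.8]  # made-up output of the decoder stack
logits = linear(decoder_vector)
print(len(logits))  # 6, one cell per word in the vocabulary
```
With a real 10,000-word vocabulary, only vocab_size changes, and the logits vector becomes 10,000 cells wide.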

78
00:05:11,000 --> 00:05:15,000
Then if you go to the next step, it is a softmax layer.

79
00:05:15,000 --> 00:05:16,000
Now you know why softmax layer is used.

80
00:05:16,000 --> 00:05:20,000
It is specifically used for multi-class classification.

81
00:05:21,000 --> 00:05:24,000
Multi-Class classification.

82
00:05:24,000 --> 00:05:29,000
Now in the case of multi-class classification, what will happen as soon as I pass these particular vectors?

83
00:05:29,000 --> 00:05:32,000
Every cell of the vector is basically going to give us a probability.

84
00:05:32,000 --> 00:05:35,000
Right? Now, these probabilities mean what?

85
00:05:35,000 --> 00:05:39,000
Let's say if this vector has the highest probability right?

86
00:05:39,000 --> 00:05:43,000
Or if this word has the highest probability, then my output will be this specific word.

87
00:05:43,000 --> 00:05:48,000
If this word has the highest probability, then the output will be this specific word okay.

88
00:05:48,000 --> 00:06:05,000
So here in the second step what we are basically doing is that the softmax layer turns those scores

89
00:06:06,000 --> 00:06:08,000
into probabilities.

90
00:06:10,000 --> 00:06:18,000
Now, since it is turning this into probabilities, if I probably do the summation, it will all add

91
00:06:18,000 --> 00:06:19,000
up to one, right?

92
00:06:20,000 --> 00:06:24,000
That I hope everybody knows about because that is the property of softmax layer.

93
00:06:24,000 --> 00:06:40,000
Then the cell with the highest probability is chosen.

94
00:06:40,000 --> 00:06:40,000
Right.

95
00:06:41,000 --> 00:06:42,000
And the word.

96
00:06:46,000 --> 00:06:47,000
Associated.

97
00:06:51,000 --> 00:06:55,000
With it is produced as the output.

98
00:06:58,000 --> 00:07:00,000
As the output.

99
00:07:00,000 --> 00:07:01,000
Right.

100
00:07:01,000 --> 00:07:03,000
For that specific time step.

101
00:07:03,000 --> 00:07:03,000
Okay.

102
00:07:03,000 --> 00:07:07,000
Or, for that specific time step.
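The two steps above (softmax, then picking the highest-probability cell) can be sketched in plain Python like this. The six-word vocabulary, its ordering, the "<eos>" spelling and the logit values are made-up assumptions for illustration.
```python
import math
# Hypothetical vocabulary and scores; index i of logits belongs to vocab[i].
vocab = ["a", "am", "I", "thanks", "student", "<eos>"]
logits = [1.0, 0.5, 3.0, -1.0, 0.2, 0.0]
def softmax(scores):
    """Turn raw scores into probabilities that add up to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
probs = softmax(logits)
# The cell with the highest probability gives the output word for this time step.
best = max(range(len(probs)), key=probs.__getitem__)
print(vocab[best])  # I
```
Picking the single highest-probability cell like this is the simple greedy choice described in this session.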

103
00:07:08,000 --> 00:07:11,000
Now I hope you are able to understand it okay.

104
00:07:11,000 --> 00:07:15,000
So whatever output I get from the decoder that is passed to the linear layer, which is a fully connected

105
00:07:15,000 --> 00:07:20,000
layer, and its output will have as many cells as the vocab size.

106
00:07:20,000 --> 00:07:24,000
And once this is passed to the softmax, I will be getting probabilities, and whichever cell has the highest

107
00:07:24,000 --> 00:07:27,000
probability, that will be my output word.

108
00:07:27,000 --> 00:07:34,000
Okay, when we say vocab size, vocab size basically means if there are 10,000 words, for every index

109
00:07:34,000 --> 00:07:35,000
there will be one word.

110
00:07:35,000 --> 00:07:37,000
Okay, so more to talk about.

111
00:07:37,000 --> 00:07:39,000
Let's go through the entire training process now.

112
00:07:39,000 --> 00:07:43,000
So here I will talk about the recap of training.

113
00:07:43,000 --> 00:07:46,000
And here also I will take multiple examples.

114
00:07:46,000 --> 00:07:50,000
Let's say in my entire vocabulary I have this many words.

115
00:07:50,000 --> 00:07:52,000
Let's say this many words are there.

116
00:07:52,000 --> 00:07:57,000
So in my vocabulary I have: a, am, I, thanks, student, and EOS.

117
00:07:57,000 --> 00:08:00,000
Okay, so that is the reason I've given this as indexing.

118
00:08:00,000 --> 00:08:00,000
Okay.

119
00:08:01,000 --> 00:08:07,000
Now, for the first instance, what do we do? Let's just use a simple embedding technique.

120
00:08:07,000 --> 00:08:09,000
Okay, I will just go ahead and use this simple embedding technique.

121
00:08:09,000 --> 00:08:14,000
So for every word, what we will do is use one-hot encoding to represent it.

122
00:08:14,000 --> 00:08:17,000
So let's say this is my one hot encoding okay.

123
00:08:17,000 --> 00:08:20,000
This is my one-hot encoding for the word am, okay?

124
00:08:20,000 --> 00:08:23,000
So wherever am is present, since the vocabulary size is six,

125
00:08:23,000 --> 00:08:29,000
I will write 0.0; where am is present, it will become 1.0; then 0.0, 0.0, 0.0 and 0.0, okay?
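A minimal sketch of this one-hot encoding in plain Python. The index order of the six words and the "<eos>" spelling of the end token are assumptions for illustration.
```python
# Hypothetical index order for the six-word vocabulary.
vocab = ["a", "am", "I", "thanks", "student", "<eos>"]
def one_hot(word):
    """Return a vector with 1.0 at the word's index and 0.0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]
print(one_hot("am"))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```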

126
00:08:29,000 --> 00:08:36,000
Now, for the loss function, what will happen if I have started training on the word? Okay, if I have

127
00:08:36,000 --> 00:08:38,000
started training the word.

128
00:08:38,000 --> 00:08:42,000
And similarly, if I want to have thanks, then this will become one and all the remaining will become

129
00:08:42,000 --> 00:08:43,000
zero.

130
00:08:43,000 --> 00:08:47,000
Let's say that I have a problem statement where I need to convert this merci, right?

131
00:08:47,000 --> 00:08:48,000
Merci.

132
00:08:48,000 --> 00:08:49,000
I think it is in French.

133
00:08:49,000 --> 00:08:51,000
We need to convert this into thanks.

134
00:08:51,000 --> 00:08:52,000
Right.

135
00:08:52,000 --> 00:08:54,000
And vocabulary size is over here for me.

136
00:08:54,000 --> 00:08:55,000
Okay.

137
00:08:55,000 --> 00:08:58,000
So what will happen if I have an untrained model, right.

138
00:08:58,000 --> 00:09:01,000
My initial output may be something like this.

139
00:09:01,000 --> 00:09:04,000
See, I need to get this word right.

140
00:09:04,000 --> 00:09:04,000
Thanks.

141
00:09:04,000 --> 00:09:06,000
I need to get this value as 1.0.

142
00:09:06,000 --> 00:09:10,000
But if we see with respect to the untrained model initially, you'll be able to see these are different,

143
00:09:10,000 --> 00:09:11,000
different values.

144
00:09:11,000 --> 00:09:13,000
I may get different different values for each and everything.

145
00:09:13,000 --> 00:09:14,000
Okay.

146
00:09:14,000 --> 00:09:16,000
So this is how my output should come.

147
00:09:16,000 --> 00:09:18,000
But this is how I'm getting the output right.

148
00:09:18,000 --> 00:09:23,000
So after this what we do we basically go ahead and calculate the loss function okay.

149
00:09:23,000 --> 00:09:25,000
Loss function.

150
00:09:25,000 --> 00:09:26,000
Now this loss function is really important.

151
00:09:26,000 --> 00:09:30,000
That is where your back propagation will specifically happen okay.
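A minimal sketch of the loss between the target one-hot vector and the model output, in plain Python. Cross-entropy is assumed here as the loss (this session does not name one), and the index order and both output distributions are made-up values for illustration.
```python
import math
# Target: one-hot for "thanks" at index 3 (assumed order: a, am, I, thanks, student, <eos>).
target = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
untrained = [0.2, 0.2, 0.1, 0.2, 0.2, 0.1]      # made-up output of an untrained model
trained = [0.01, 0.01, 0.01, 0.93, 0.02, 0.02]  # made-up output after some training
def cross_entropy(target, predicted):
    """Cross-entropy between a one-hot target and a predicted probability distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)
# The loss shrinks as the probability of the correct word moves toward 1.0.
print(cross_entropy(target, untrained) > cross_entropy(target, trained))  # True
```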

152
00:09:30,000 --> 00:09:38,000
And just to make you understand, let's say I want to completely frame this particular sentence, okay?

153
00:09:38,000 --> 00:09:40,000
So just uh, I'll take an example.

154
00:09:40,000 --> 00:09:40,000
Okay.

155
00:09:40,000 --> 00:09:45,000
I have a task where I give these words and it should form a sentence for me.

156
00:09:45,000 --> 00:09:51,000
Okay, so here I have a, am, I, thanks, student. If I really want to know the output, it should be like

157
00:09:51,000 --> 00:09:52,000
I am.

158
00:09:52,000 --> 00:09:55,000
I am a I am a student and thanks.

159
00:09:55,000 --> 00:09:57,000
We can basically write right?

160
00:09:57,000 --> 00:09:59,000
I can basically say thanks, okay?

161
00:09:59,000 --> 00:10:04,000
Or let's say I don't want to use thanks, because thanks will not be a very good fit

162
00:10:04,000 --> 00:10:05,000
in that particular sentence.

163
00:10:05,000 --> 00:10:08,000
Let's say if I give these particular words, I need to frame a better sentence.

164
00:10:08,000 --> 00:10:15,000
So here I can frame one kind of sentence like this: I am a student, and then we should basically get

165
00:10:15,000 --> 00:10:16,000
our EOS, right?

166
00:10:17,000 --> 00:10:22,000
So in this particular scenario what will happen is that for positions also right for position one,

167
00:10:22,000 --> 00:10:24,000
what will be my target model output.

168
00:10:24,000 --> 00:10:26,000
And also for the real output.

169
00:10:26,000 --> 00:10:30,000
My target output in my output feature will be something like this.

170
00:10:30,000 --> 00:10:31,000
Right?

171
00:10:31,000 --> 00:10:32,000
So my input is something like this.

172
00:10:32,000 --> 00:10:35,000
Let's see I will just show you over here okay.

173
00:10:35,000 --> 00:10:38,000
Let's say my input over here in my data set.

174
00:10:38,000 --> 00:10:42,000
And this is my output. In my input I have words like am, I,

175
00:10:43,000 --> 00:10:47,000
thanks... okay, not thanks, student.

176
00:10:47,000 --> 00:10:49,000
Let's say I will go ahead and write.

177
00:10:49,000 --> 00:10:50,000
Student okay.

178
00:10:50,000 --> 00:10:52,000
Now my output is I will go ahead and write.

179
00:10:52,000 --> 00:11:02,000
I am a student. Okay, let's say this is the kind of output I really need to get. Now here, based on this particular

180
00:11:02,000 --> 00:11:04,000
actual output I should be getting.

181
00:11:04,000 --> 00:11:08,000
Like if I consider my vocabulary size as six, I should be getting something like this right?

182
00:11:08,000 --> 00:11:10,000
My position one should be I.

183
00:11:10,000 --> 00:11:16,000
Position two should be am, position three should be a, position four should be student.

184
00:11:16,000 --> 00:11:17,000
And then finally end of statement.

185
00:11:18,000 --> 00:11:23,000
But what happens is that while we are training, you know, with this particular data set, after some

186
00:11:23,000 --> 00:11:29,000
epochs during initial training, you will see that my output will not be that good.

187
00:11:29,000 --> 00:11:33,000
But after training for some time, let me just go and show you.

188
00:11:33,000 --> 00:11:36,000
After training for some time, you will be able to see over here.

189
00:11:36,000 --> 00:11:38,000
I will be able to get something like this.

190
00:11:38,000 --> 00:11:39,000
I will become 0.93.

191
00:11:39,000 --> 00:11:42,000
Then these will become 0.8, 0.9, 0.94, 0.98.

192
00:11:42,000 --> 00:11:43,000
Right.

193
00:11:43,000 --> 00:11:45,000
And this is all because of the loss function over the epochs.

194
00:11:45,000 --> 00:11:47,000
And this all keeps on getting updated.

195
00:11:47,000 --> 00:11:48,000
Okay.

196
00:11:48,000 --> 00:11:51,000
And that is what I really wanted to show you as an example.

197
00:11:51,000 --> 00:11:56,000
But at the end of the day, you'll be able to see that we will initially be doing encoding for this.

198
00:11:56,000 --> 00:11:57,000
Here we have just used one hot encoding.

199
00:11:57,000 --> 00:11:59,000
Different encoding can also be used, right.

200
00:11:59,000 --> 00:12:02,000
Untrained model may give you some kind of output.

201
00:12:02,000 --> 00:12:04,000
But again we need to go ahead and compute the loss.

202
00:12:04,000 --> 00:12:06,000
And then we will do the back propagation.

203
00:12:07,000 --> 00:12:10,000
And then we will go ahead and update all the weights.

204
00:12:10,000 --> 00:12:12,000
Right with respect to all your training data.

205
00:12:12,000 --> 00:12:13,000
And all right.

206
00:12:13,000 --> 00:12:16,000
So with backpropagation, the main aim is to reduce the loss.

207
00:12:16,000 --> 00:12:17,000
Right.

208
00:12:17,000 --> 00:12:22,000
So after training from some specific outputs initially my training model output will be like this.

209
00:12:22,000 --> 00:12:22,000
Right.

210
00:12:22,000 --> 00:12:29,000
If I convert each and every word into one hot encoding, one hot encoding, one hot encoding, one hot

211
00:12:29,000 --> 00:12:30,000
encoding, right.

212
00:12:30,000 --> 00:12:31,000
One hot encoding.

213
00:12:31,000 --> 00:12:33,000
And this is with respect to position also in position one.

214
00:12:34,000 --> 00:12:35,000
I is basically having 1.0.

215
00:12:35,000 --> 00:12:37,000
So this is what is the one hot encoding for this.

216
00:12:37,000 --> 00:12:40,000
Then you have this then you have this then you have this right.

217
00:12:40,000 --> 00:12:43,000
So once you do this this is what is my output that I need to get.

218
00:12:43,000 --> 00:12:47,000
But my model when it initially generates it can generate something else.

219
00:12:47,000 --> 00:12:47,000
Right?

220
00:12:47,000 --> 00:12:53,000
Then a loss will be calculated between both of them and it will be reduced till we get this kind of

221
00:12:53,000 --> 00:12:54,000
values.

222
00:12:54,000 --> 00:12:59,000
And once I get this particular model output, and this is after the, you can probably say, with

223
00:12:59,000 --> 00:13:02,000
respect to your fully connected layer right over here.

224
00:13:02,000 --> 00:13:05,000
So the linear layer, that is what I'm actually talking about.

225
00:13:05,000 --> 00:13:05,000
Right.

226
00:13:05,000 --> 00:13:09,000
And then once you probably compute the loss you will again be doing back propagation.

227
00:13:09,000 --> 00:13:15,000
So this was uh overall the final linear and softmax layer.

228
00:13:15,000 --> 00:13:18,000
I tried my level best to do whatever was possible.

229
00:13:18,000 --> 00:13:24,000
But just imagine, guys, I have actually written 34 pages just to explain this research paper, breaking down

230
00:13:24,000 --> 00:13:25,000
each and every thing.

231
00:13:25,000 --> 00:13:27,000
Uh, tried my level best.

232
00:13:27,000 --> 00:13:32,000
Okay, but yes, there may be again, some things that you really need to revise from the research paper,

233
00:13:32,000 --> 00:13:33,000
but I tried my level best.

234
00:13:33,000 --> 00:13:35,000
Whatever was possible from my side.

235
00:13:35,000 --> 00:13:35,000
Okay.

236
00:13:35,000 --> 00:13:38,000
But I hope you're quite happy with this explanation.

237
00:13:38,000 --> 00:13:40,000
I hope you are able to understand each and everything.

238
00:13:40,000 --> 00:13:41,000
So, yes, this was it from my side.

239
00:13:41,000 --> 00:13:43,000
I will see you all in the next video.

240
00:13:43,000 --> 00:13:44,000
Thank you.