1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to our Transformers.

3
00:00:04,000 --> 00:00:09,000
Now in this video we are going to discuss about the entire encoder architecture.

4
00:00:09,000 --> 00:00:11,000
Already we have discussed about self-attention.

5
00:00:11,000 --> 00:00:16,000
We have discussed about multi-head attention, we have discussed about positional encoding and so many

6
00:00:16,000 --> 00:00:17,000
different topics.

7
00:00:17,000 --> 00:00:18,000
We have also discussed.

8
00:00:18,000 --> 00:00:20,000
We also discussed about layer normalization and each and everything.

9
00:00:21,000 --> 00:00:26,000
But now based on the research paper, we will just try to discuss how the entire architecture is because

10
00:00:26,000 --> 00:00:28,000
this is just one encoder.

11
00:00:28,000 --> 00:00:28,000
Right.

12
00:00:28,000 --> 00:00:29,000
And this is just one decoder.

13
00:00:29,000 --> 00:00:31,000
So right now first of all we'll focus on encoder.

14
00:00:32,000 --> 00:00:34,000
But initially only I told you right.

15
00:00:34,000 --> 00:00:39,000
Based on the research paper you will be seeing that there will be six encoders like this.

16
00:00:39,000 --> 00:00:39,000
Right.

17
00:00:39,000 --> 00:00:40,000
So there will be not one encoder.

18
00:00:40,000 --> 00:00:45,000
Instead there will be six: encoder one, encoder two, three, four, five, six.

19
00:00:45,000 --> 00:00:48,000
Similarly there are six decoders okay.

20
00:00:48,000 --> 00:00:53,000
And with respect to any task like machine translation, you give your input over here,

21
00:00:53,000 --> 00:00:55,000
you get your output over here.

22
00:00:55,000 --> 00:00:55,000
Right.

23
00:00:55,000 --> 00:00:59,000
So this is the part that we are specifically going to discuss about.

24
00:00:59,000 --> 00:00:59,000
Okay.

25
00:00:59,000 --> 00:01:07,000
When I say this is just one encoder, that basically means this part can be mentioned in this way,

26
00:01:07,000 --> 00:01:08,000
right.

27
00:01:08,000 --> 00:01:10,000
If I expand this, it is nothing

28
00:01:10,000 --> 00:01:13,000
but this entire encoder part, right?

29
00:01:13,000 --> 00:01:19,000
Similarly, you have other encoders also, because after the information

30
00:01:19,000 --> 00:01:23,000
is basically coming out of this, it will go to the next encoder, then it will go to the next encoder,

31
00:01:23,000 --> 00:01:25,000
then it will go to the next encoder.

32
00:01:25,000 --> 00:01:30,000
All right, so based on the research paper, let me just go ahead and write some very important information.

33
00:01:30,000 --> 00:01:34,000
So first of all the input that you specifically take right.

34
00:01:34,000 --> 00:01:36,000
You convert that into vectors.

35
00:01:36,000 --> 00:01:39,000
Then you add a positional encoding on top of that.

36
00:01:39,000 --> 00:01:40,000
Then you pass through a self-attention layer.

37
00:01:40,000 --> 00:01:41,000
Right.

38
00:01:41,000 --> 00:01:46,000
So uh, if I probably show you the flow, it looks something like this, right.

39
00:01:46,000 --> 00:01:51,000
So I'll show you step by step, like how things basically happen.

40
00:01:51,000 --> 00:01:57,000
And then we will just try to understand how many embedding dimensions are specifically

41
00:01:57,000 --> 00:01:58,000
used and all.

42
00:01:58,000 --> 00:01:58,000
Okay.

43
00:01:58,000 --> 00:02:01,000
So let's say this is your input sequence.

44
00:02:01,000 --> 00:02:10,000
So here is my input sequence. Now, from this input sequence, what do we specifically do?

45
00:02:10,000 --> 00:02:12,000
We go to the next step.

46
00:02:12,000 --> 00:02:16,000
In the next step we are specifically going to convert this into vectors.

47
00:02:16,000 --> 00:02:18,000
That is through text embedding techniques.

48
00:02:18,000 --> 00:02:20,000
It can be any embedding techniques.

49
00:02:20,000 --> 00:02:25,000
Plus we add positional encoding to it, okay.

50
00:02:26,000 --> 00:02:29,000
So once we add this with positional encoding.

51
00:02:29,000 --> 00:02:33,000
So this is where your input embedding plus positional encoding is basically happening, right.

52
00:02:34,000 --> 00:02:38,000
The input sequence when it is converted into text embedding.

53
00:02:38,000 --> 00:02:43,000
So the number of dimensions that is specifically taken for the text embedding it is nothing but 512

54
00:02:43,000 --> 00:02:43,000
okay.

55
00:02:43,000 --> 00:02:51,000
That basically means every word is converted into a 512-dimensional vector, based on the research paper.

56
00:02:51,000 --> 00:02:52,000
Okay.

57
00:02:52,000 --> 00:02:57,000
So in the research paper it is mentioned when they try to entirely create this.

58
00:02:57,000 --> 00:03:01,000
Initially they used 512-dimensional vectors for every token in the input sequence.

59
00:03:01,000 --> 00:03:02,000
Right.

60
00:03:02,000 --> 00:03:04,000
So the positional encoding will also have the same dimension.
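As a rough illustration of this step, here is a minimal Python sketch that adds a sinusoidal positional encoding to token embeddings of size 512. The sinusoidal form follows the original Transformer paper; the names `positional_encoding`, `encoder_input`, and the random `toy_table` vocabulary are assumptions made up for this example, not something shown in the video.

```python
import math
import random

D_MODEL = 512  # embedding size per token, as stated in the video


def positional_encoding(seq_len, d_model=D_MODEL):
    """Sinusoidal positional encoding: one d_model-sized vector per position."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Even index i uses sin, odd index i + 1 uses cos of the same angle.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def encoder_input(token_ids, table):
    """Element-wise sum of token embeddings and positional encodings."""
    emb = [table[t] for t in token_ids]
    pe = positional_encoding(len(token_ids), len(table[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(emb, pe)]


# Hypothetical toy vocabulary of 10 tokens with random embeddings.
rng = random.Random(0)
toy_table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(10)]
x = encoder_input([3, 1, 4], toy_table)
print(len(x), len(x[0]))  # number of tokens, vector size per token
```

Because both pieces have the same width, the sum is a plain element-wise addition and every token still ends up as one 512-sized vector.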

61
00:03:04,000 --> 00:03:04,000
Right.

62
00:03:04,000 --> 00:03:10,000
Then after we pass through this what we get we get our multi-head attention.

63
00:03:10,000 --> 00:03:11,000
Right.

64
00:03:11,000 --> 00:03:13,000
We get our multi-head attention.

65
00:03:13,000 --> 00:03:17,000
And this is where your self-attention works, right?

66
00:03:17,000 --> 00:03:19,000
Between this multi-head attention.

67
00:03:19,000 --> 00:03:23,000
Multi-head attention basically means how many different attention heads you have over

68
00:03:23,000 --> 00:03:23,000
here.

69
00:03:23,000 --> 00:03:31,000
So if I talk about it here, you have eight attention heads, like z one, z two, z three, z four, z

70
00:03:31,000 --> 00:03:35,000
five, z six, z seven, z eight.

71
00:03:35,000 --> 00:03:38,000
So this is also based on the research paper.

72
00:03:38,000 --> 00:03:42,000
So in the research paper, how many attention heads have they specifically used?

73
00:03:42,000 --> 00:03:43,000
Eight.
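To see how eight heads fit into a 512-sized vector, here is a small Python sketch. It assumes each head produces a 64-sized output per token (the size mentioned later in the video) and that the head outputs are concatenated; in the paper a learned output projection follows the concatenation, which is left out here. The helper `concat_heads` and the toy values are made up for illustration.

```python
D_MODEL = 512
NUM_HEADS = 8
D_HEAD = D_MODEL // NUM_HEADS  # 64 values per head


def concat_heads(head_outputs):
    """Join the per-head vectors of one token back into one d_model vector."""
    return [value for head in head_outputs for value in head]


# Toy head outputs z1..z8 for a single token: head h is filled with the value h.
heads = [[float(h)] * D_HEAD for h in range(NUM_HEADS)]
z = concat_heads(heads)
print(D_HEAD, len(z))
```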

74
00:03:43,000 --> 00:03:51,000
Okay then, before we go ahead and pass this information from Multi-head attention here, what we are

75
00:03:51,000 --> 00:03:53,000
specifically doing here.

76
00:03:53,000 --> 00:03:57,000
We are first of all, adding and then normalizing.

77
00:03:57,000 --> 00:04:01,000
So that is the reason we say add and norm, right.

78
00:04:01,000 --> 00:04:05,000
The normalization technique that is specifically used is layer normalization.

79
00:04:05,000 --> 00:04:11,000
So here what we are doing before doing this we need to add this information that is probably coming

80
00:04:11,000 --> 00:04:12,000
from this layer.

81
00:04:12,000 --> 00:04:15,000
And this is what is called as residuals.

82
00:04:18,000 --> 00:04:18,000
Okay.

83
00:04:18,000 --> 00:04:20,000
This is basically called as residuals.
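The add and norm step can be sketched in a few lines of Python. This is a simplified version under stated assumptions: it works on one token vector at a time and leaves out the learnable scale and shift that layer normalization normally has. The function names are chosen for this example.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


# x is the sublayer input, the second argument is what the sublayer returned.
y = add_and_norm([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(y)
```

The input `x` is added back before normalizing, so the next layer sees both the original signal and what the sublayer computed.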

84
00:04:20,000 --> 00:04:21,000
What is the importance of this.

85
00:04:21,000 --> 00:04:28,000
Just imagine before I do the normalization the output of the multi head attention is also going.

86
00:04:28,000 --> 00:04:31,000
And we are adding the text embedding and positional encoding information.

87
00:04:31,000 --> 00:04:34,000
So that basically means we are giving some additional signals.

88
00:04:34,000 --> 00:04:35,000
Right.

89
00:04:35,000 --> 00:04:38,000
So we'll try to understand I'll write down point by point what is the importance of this.

90
00:04:38,000 --> 00:04:40,000
So residuals is basically over here.

91
00:04:40,000 --> 00:04:47,000
Then after doing the normalization we basically send it to the feed forward neural network.

92
00:04:48,000 --> 00:04:49,000
Right.

93
00:04:49,000 --> 00:04:54,000
So these are all the steps that we are specifically using in the feed forward neural network.

94
00:04:54,000 --> 00:04:55,000
It will be something like this.

95
00:04:55,000 --> 00:04:56,000
This will basically be my input.

96
00:04:56,000 --> 00:05:00,000
The input will be based on all these dimensions that we are getting.

97
00:05:00,000 --> 00:05:02,000
And then we have one hidden layer.

98
00:05:02,000 --> 00:05:05,000
In this hidden layer we have somewhere around 512 hidden nodes.

99
00:05:05,000 --> 00:05:06,000
Okay.

100
00:05:06,000 --> 00:05:08,000
And then I have my output layer okay.

101
00:05:08,000 --> 00:05:13,000
So this will basically give you the output based on the information that I have based on the number

102
00:05:13,000 --> 00:05:13,000
of vectors.

103
00:05:13,000 --> 00:05:14,000
That is probably coming up.
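A feed forward block of this shape (input, one hidden layer, output) can be sketched as below. The ReLU activation follows the original paper; the hidden width is left as a parameter, since the paper uses 2048 for this inner layer with 512 for input and output. The toy sizes, the random initialization, and the helper names are assumptions for illustration only.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def linear(x, W, b):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) + bi for row, bi in zip(W, b)]


def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between; output width equals input width."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


def init_ffn(d_model, d_ff, seed=0):
    """Hypothetical random initialization with zero biases."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_ff)]
    W2 = [[rng.gauss(0, 0.02) for _ in range(d_ff)] for _ in range(d_model)]
    return W1, [0.0] * d_ff, W2, [0.0] * d_model


params = init_ffn(d_model=8, d_ff=32)  # toy sizes to keep the example fast
y = feed_forward([1.0] * 8, *params)
print(len(y))
```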

104
00:05:14,000 --> 00:05:15,000
Okay.

105
00:05:15,000 --> 00:05:18,000
So here that is what it is basically happening.

106
00:05:18,000 --> 00:05:20,000
And this is the input that is coming from this.

107
00:05:20,000 --> 00:05:21,000
Right.

108
00:05:21,000 --> 00:05:23,000
All the input that we are specifically passing and it will get trained.

109
00:05:24,000 --> 00:05:27,000
Once I do all these things, then it is sent to the next encoder.

110
00:05:27,000 --> 00:05:31,000
And the next encoder will have all the information with respect to all these vectors that I have.

111
00:05:31,000 --> 00:05:31,000
Right.

112
00:05:31,000 --> 00:05:36,000
But understand, when I say 512, this is with respect to every word.

113
00:05:38,000 --> 00:05:42,000
Every word is basically converted into a 512-dimensional vector.

114
00:05:42,000 --> 00:05:43,000
Okay.

115
00:05:43,000 --> 00:05:48,000
Now this is the basic information, but there is still more information if I go inside

116
00:05:48,000 --> 00:05:51,000
self-attention, right.

117
00:05:51,000 --> 00:05:54,000
So what are the parameters that are specifically used?

118
00:05:54,000 --> 00:05:54,000
Right.

119
00:05:54,000 --> 00:05:57,000
There are three to four important parameters over there.

120
00:05:57,000 --> 00:06:03,000
One is your Q, K, V: query, key, value.

121
00:06:03,000 --> 00:06:06,000
So what dimension are this query, key and value, right?

122
00:06:06,000 --> 00:06:10,000
Based on the research paper, these are 64, 64, 64.

123
00:06:10,000 --> 00:06:16,000
So you'll be able to see inside self-attention when I do the product of query into K right.

124
00:06:16,000 --> 00:06:22,000
Later on when I normalize and when I have to divide right, I divide by root of 64, which is nothing

125
00:06:22,000 --> 00:06:23,000
but eight.
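The scaling step described here can be sketched for a single head as follows. This is a plain Python version written for readability, not speed; Q, K and V are lists of 64-sized vectors, one per token, and the function names are chosen for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # sqrt(64) = 8 when d_k is 64
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Toy input: two tokens with identical queries and keys, different values.
Q = [[1.0] * 64, [1.0] * 64]
K = [[1.0] * 64, [1.0] * 64]
V = [[2.0] * 64, [4.0] * 64]
out = scaled_dot_product_attention(Q, K, V)
print(len(out), len(out[0]))
```

Since both keys score the same in this toy input, each token receives an equal mix of the two value vectors.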

126
00:06:24,000 --> 00:06:27,000
Okay, this is what we are specifically doing, right?

127
00:06:27,000 --> 00:06:29,000
We are dividing by the square root of 64.

128
00:06:29,000 --> 00:06:32,000
And the number of attention heads is basically eight.

129
00:06:32,000 --> 00:06:32,000
Right.

130
00:06:32,000 --> 00:06:34,000
So all these information I have actually mentioned.

131
00:06:35,000 --> 00:06:39,000
So these are some of the basic parameters that are specifically used in research paper.

132
00:06:39,000 --> 00:06:41,000
Now the question is coming right.

133
00:06:41,000 --> 00:06:41,000
Krish,

134
00:06:41,000 --> 00:06:42,000
why 512?

135
00:06:42,000 --> 00:06:44,000
Well, why so many encoders?

136
00:06:44,000 --> 00:06:50,000
Because understand, guys, the kind of task that you do is a sequence-to-sequence task. Sequence

137
00:06:50,000 --> 00:06:53,000
to sequence tasks are very complex tasks.

138
00:06:55,000 --> 00:06:57,000
They are very, very complex tasks.

139
00:06:58,000 --> 00:06:58,000
Right.

140
00:06:58,000 --> 00:07:02,000
Let's say language translation, it is very complex.

141
00:07:02,000 --> 00:07:03,000
Right.

142
00:07:03,000 --> 00:07:04,000
You have one language.

143
00:07:04,000 --> 00:07:06,000
Then we're trying to convert that into other language.

144
00:07:06,000 --> 00:07:07,000
There'll be dialects.

145
00:07:07,000 --> 00:07:08,000
There'll be so many different things.

146
00:07:09,000 --> 00:07:13,000
Now, if I really want to solve this kind of task just with one encoder, you'll not be able to get

147
00:07:13,000 --> 00:07:14,000
a very good accuracy.

148
00:07:14,000 --> 00:07:17,000
So we definitely need to use many encoders over here.

149
00:07:17,000 --> 00:07:21,000
According to the research paper, they have used six encoders and they were able to get good results.
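Stacking works because every encoder returns vectors of the same shape it received, so the output of one can be fed straight into the next. Here is a minimal sketch of that chaining; `toy_layer` is a hypothetical stand-in for a real encoder layer and only exists to make the example runnable.

```python
NUM_ENCODERS = 6  # number of stacked encoders, as stated in the video


def run_encoder_stack(x, layers):
    """Feed the output of each encoder layer into the next one."""
    for layer in layers:
        x = layer(x)
    return x


def toy_layer(x):
    """Hypothetical stand-in for one encoder layer: keeps the shape, adds 1."""
    return [[v + 1.0 for v in token] for token in x]


tokens = [[0.0] * 4 for _ in range(3)]  # 3 tokens, toy width 4 instead of 512
out = run_encoder_stack(tokens, [toy_layer] * NUM_ENCODERS)
print(len(out), len(out[0]))
```

Changing the depth is then only a matter of passing a longer or shorter list of layers.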

150
00:07:21,000 --> 00:07:22,000
Okay.

151
00:07:22,000 --> 00:07:25,000
So this is one of the very important thing that you really need to understand.

152
00:07:25,000 --> 00:07:25,000
Okay.

153
00:07:25,000 --> 00:07:31,000
The second thing, uh, that usually comes in your mind, right.

154
00:07:31,000 --> 00:07:35,000
And obviously in everybody's mind: first of all, why residuals, okay.

155
00:07:35,000 --> 00:07:39,000
Why do you need to add residuals?

156
00:07:39,000 --> 00:07:43,000
Why do you need to provide some information over here

157
00:07:43,000 --> 00:07:47,000
and give it to the add and normalization layer?

158
00:07:47,000 --> 00:07:47,000
Right.

159
00:07:47,000 --> 00:07:51,000
This is the next question that many people will specifically ask.

160
00:07:51,000 --> 00:07:52,000
Right.

161
00:07:52,000 --> 00:07:56,000
And the third question that probably comes is why to use feed forward neural network.

162
00:07:56,000 --> 00:07:56,000
Right.

163
00:07:56,000 --> 00:07:59,000
So that is also there in your mind because feed forward neural network.

164
00:07:59,000 --> 00:08:02,000
Uh, here also you have used one kind of layer.

165
00:08:02,000 --> 00:08:07,000
Even though we are doing the backpropagation over here, we are updating the weights of W_Q, W_K, W_V and all.

166
00:08:07,000 --> 00:08:08,000
All right.

167
00:08:08,000 --> 00:08:11,000
So we'll try to answer step by step all these things.

168
00:08:11,000 --> 00:08:15,000
And uh again uh after thorough research, what are the points?

169
00:08:15,000 --> 00:08:19,000
What are the, uh, things that have basically been there from the researchers?

170
00:08:19,000 --> 00:08:20,000
We'll try to note it down.

171
00:08:20,000 --> 00:08:21,000
Okay.

172
00:08:21,000 --> 00:08:25,000
And as I have already shown you, self-attention, this is what is the operation that is specifically

173
00:08:25,000 --> 00:08:25,000
happening.

174
00:08:25,000 --> 00:08:29,000
And this is just an example with two encoders, encoder one and encoder two.

175
00:08:29,000 --> 00:08:30,000
You pass your self-attention.

176
00:08:30,000 --> 00:08:33,000
You pass this particular information from your input embedding vectors.

177
00:08:33,000 --> 00:08:35,000
So these are basically 512-dimensional vectors.

178
00:08:35,000 --> 00:08:36,000
Right.

179
00:08:36,000 --> 00:08:38,000
So this is nothing but 512.

180
00:08:38,000 --> 00:08:39,000
This is also 512 okay.

181
00:08:39,000 --> 00:08:41,000
Then positional encoding will also be 512.

182
00:08:41,000 --> 00:08:43,000
Then you have to add this.

183
00:08:43,000 --> 00:08:46,000
Then what you are doing you are passing this information to the next layer.

184
00:08:46,000 --> 00:08:47,000
That is add and normalization.

185
00:08:47,000 --> 00:08:49,000
Then this gets passed to the self-attention.

186
00:08:49,000 --> 00:08:51,000
Then after that a feed forward neural network.

187
00:08:51,000 --> 00:08:58,000
Then after every sublayer, self-attention or feed forward, you are doing normalization. And why is normalization

188
00:08:58,000 --> 00:08:59,000
specifically used?

189
00:08:59,000 --> 00:09:00,000
I've already spoken about it.

190
00:09:00,000 --> 00:09:01,000
Right.

191
00:09:01,000 --> 00:09:03,000
Your input distribution may be different here.

192
00:09:03,000 --> 00:09:07,000
And after doing all these operations, your distribution may become different, right?

193
00:09:07,000 --> 00:09:12,000
So what we do we basically do a layer normalization which we have discussed in our previous video.

194
00:09:12,000 --> 00:09:12,000
Right.

195
00:09:12,000 --> 00:09:17,000
So let's talk about the residual right.

196
00:09:18,000 --> 00:09:19,000
What is residual over here.

197
00:09:19,000 --> 00:09:22,000
Any information that I'm passing it to the next layer here.

198
00:09:22,000 --> 00:09:25,000
I'm also passing it here, and here, and here.

199
00:09:25,000 --> 00:09:27,000
And if you see in decoder also here also you are passing.

200
00:09:27,000 --> 00:09:29,000
We'll discuss about decoder later on.

201
00:09:29,000 --> 00:09:31,000
But let's focus over here in the encoder okay.

202
00:09:31,000 --> 00:09:33,000
So residual connection.

203
00:09:34,000 --> 00:09:38,000
What is it? It is also called a skip connection.

204
00:09:41,000 --> 00:09:41,000
Okay.

205
00:09:42,000 --> 00:09:46,000
That are used in neural networks okay.

206
00:09:46,000 --> 00:09:52,000
Now let's understand why residual connection is basically required okay.

207
00:09:53,000 --> 00:09:59,000
First of all, just imagine okay guys I'm passing this information to the next layer.

208
00:09:59,000 --> 00:10:00,000
I'm skipping the self-attention.

209
00:10:00,000 --> 00:10:02,000
I'm also passing what was the input.

210
00:10:02,000 --> 00:10:07,000
So if I'm passing additional information if I'm normalizing it, don't you think this layer will be

211
00:10:07,000 --> 00:10:09,000
having more information about the input?

212
00:10:09,000 --> 00:10:09,000
Right.

213
00:10:09,000 --> 00:10:15,000
Even though we are able to convert that into a contextual

214
00:10:15,000 --> 00:10:17,000
vector by using self-attention, right?

215
00:10:17,000 --> 00:10:22,000
But based on the research paper, there are multiple reasons why it is basically used.

216
00:10:22,000 --> 00:10:27,000
So first one is addressing the vanishing gradient problem.

217
00:10:29,000 --> 00:10:32,000
Now why do we say vanishing gradient problem.

218
00:10:32,000 --> 00:10:36,000
Because see guys here we have somewhere around six layers right.

219
00:10:36,000 --> 00:10:42,000
Six layers of encoders and six layers of decoders. In this particular case, you will be able to see

220
00:10:42,000 --> 00:10:46,000
that as the number of layers increases, there are chances the gradient of loss function with respect

221
00:10:46,000 --> 00:10:52,000
to the weights can be very small, and this phenomenon is basically called as vanishing gradient problem.

222
00:10:52,000 --> 00:10:53,000
Okay.

223
00:10:53,000 --> 00:10:57,000
Now how does the residual help you in this okay.

224
00:10:57,000 --> 00:10:59,000
So I will just go ahead and write.

225
00:10:59,000 --> 00:11:01,000
How does residuals help you in this.

226
00:11:01,000 --> 00:11:04,000
Or skip connection help you in this.

227
00:11:04,000 --> 00:11:05,000
It is very simple okay.

228
00:11:05,000 --> 00:11:09,000
Here, residual connections

229
00:11:12,000 --> 00:11:13,000
create

230
00:11:16,000 --> 00:11:18,000
create

231
00:11:20,000 --> 00:11:22,000
short paths,

232
00:11:25,000 --> 00:11:26,000
very important point,

233
00:11:26,000 --> 00:11:30,000
for gradients to flow,

234
00:11:36,000 --> 00:11:40,000
to flow directly through the network.

235
00:11:43,000 --> 00:11:44,000
Okay.

236
00:11:45,000 --> 00:11:51,000
Because of this, the gradient remains

237
00:11:53,000 --> 00:11:57,000
sufficiently large.

238
00:11:57,000 --> 00:12:00,000
And this even in deep neural networks.

239
00:12:00,000 --> 00:12:01,000
Now see what is happening over here.

240
00:12:01,000 --> 00:12:04,000
I'm passing this information from here to here.

241
00:12:04,000 --> 00:12:06,000
I'm skipping this entire operation of self-attention.

242
00:12:06,000 --> 00:12:06,000
Right.

243
00:12:06,000 --> 00:12:08,000
And I'm passing some additional information.

244
00:12:09,000 --> 00:12:13,000
So when this entire information is basically passed, you'll be able to see that we are creating a short

245
00:12:13,000 --> 00:12:17,000
path because see, in back propagation what will happen over here, all the weights will get updated.

246
00:12:18,000 --> 00:12:20,000
And here you are passing some additional information.

247
00:12:20,000 --> 00:12:20,000
Right.

248
00:12:20,000 --> 00:12:25,000
So because of this it creates a short path for gradient to flow directly through the network.

249
00:12:25,000 --> 00:12:27,000
Gradient remains sufficiently large okay.
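To make the short-path argument concrete, here is a small derivation. It is a simplified view that ignores the layer normalization applied after the addition. The symbols are notation introduced here, not taken from the video: x is the sublayer input, F is the sublayer (self-attention or feed forward), y is the output of the residual addition, and \mathcal{L} is the loss.

```latex
y = x + F(x)
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
= \frac{\partial \mathcal{L}}{\partial y}
+ \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term reaches x unchanged through the skip connection, so even when the Jacobian of F is very small, the gradient does not shrink to zero at this step.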

250
00:12:28,000 --> 00:12:30,000
Now because of this what will happen?

251
00:12:30,000 --> 00:12:38,000
The second very important point that I really want to mention it improves gradient flow.

252
00:12:40,000 --> 00:12:45,000
When I say gradient flow, that basically means my convergence will be faster.

253
00:12:49,000 --> 00:12:52,000
My convergence will be faster.

254
00:12:52,000 --> 00:12:53,000
Okay.

255
00:12:53,000 --> 00:12:59,000
Now if I don't do this, I may face a vanishing gradient problem.

256
00:12:59,000 --> 00:12:59,000
Okay.

257
00:12:59,000 --> 00:13:02,000
It may also be uh, exploding gradient problem.

258
00:13:02,000 --> 00:13:05,000
Because, let's say, if my weights are initialized to higher values, right.

259
00:13:05,000 --> 00:13:07,000
To very large values, okay.

260
00:13:07,000 --> 00:13:14,000
So when convergence is faster, then obviously your training will be also smoother, right?

261
00:13:14,000 --> 00:13:22,000
Now coming to the third point, this actually enables training of deeper networks.

262
00:13:26,000 --> 00:13:27,000
Deeper networks.

263
00:13:28,000 --> 00:13:31,000
Now here in the encoder you can see that I'm not using just one encoder.

264
00:13:31,000 --> 00:13:33,000
I'm using six encoders.

265
00:13:33,000 --> 00:13:36,000
So this becomes a very large network right.

266
00:13:36,000 --> 00:13:38,000
So very deep neural network right.

267
00:13:38,000 --> 00:13:44,000
So here because of this, at every part of this encoder we are doing this particular step that is called

268
00:13:44,000 --> 00:13:44,000
as residuals.

269
00:13:44,000 --> 00:13:45,000
Right.

270
00:13:45,000 --> 00:13:49,000
So this will basically enable training of deep neural networks okay.

271
00:13:49,000 --> 00:13:54,000
So you can basically say that residual connections allow for the effective training of much deeper networks

272
00:13:54,000 --> 00:14:00,000
by addressing the gradient flow issues and making it easier for the network to learn identity mappings.

273
00:14:00,000 --> 00:14:05,000
This enables the model to be very expressive and capable of learning more complex functions.

274
00:14:05,000 --> 00:14:06,000
Right.

275
00:14:06,000 --> 00:14:13,000
So this was about, uh, you know, the residual connection that we have specifically discussed.

276
00:14:13,000 --> 00:14:13,000
Okay.

277
00:14:13,000 --> 00:14:18,000
Now let's talk about over here why we are using this feed forward neural network.

278
00:14:18,000 --> 00:14:19,000
Okay.

279
00:14:19,000 --> 00:14:23,000
This also you really need to understand why feed forward neural network.

280
00:14:23,000 --> 00:14:25,000
What is the importance of feed forward neural network.

281
00:14:25,000 --> 00:14:30,000
Because, uh, I've just given one example with respect to feed forward neural network.

282
00:14:30,000 --> 00:14:36,000
Obviously, if you know about ANN, then an ANN is nothing but a feed forward neural network.

283
00:14:36,000 --> 00:14:40,000
So here why feed forward neural network?

284
00:14:40,000 --> 00:14:44,000
Again, based on the research paper, I will try to give you the definition with respect to some advantages.

285
00:14:44,000 --> 00:14:45,000
Okay.

286
00:14:45,000 --> 00:14:52,000
First of all, see, if you have ever created an ANN.

287
00:14:52,000 --> 00:14:52,000
Right.

288
00:14:52,000 --> 00:14:54,000
Let's say this is an ANN.

289
00:14:55,000 --> 00:15:00,000
And in the ANN, let's say in the output layer you don't have any activation function.

290
00:15:00,000 --> 00:15:05,000
So this basically solves a linear problem right.

291
00:15:07,000 --> 00:15:08,000
Sorry.

292
00:15:08,000 --> 00:15:10,000
This solves a linear function problem.

293
00:15:12,000 --> 00:15:13,000
Problem.

294
00:15:13,000 --> 00:15:20,000
But the main aim of feedforward neural network is basically to capture the nonlinear function.

295
00:15:20,000 --> 00:15:22,000
It is to solve the nonlinear function.

296
00:15:23,000 --> 00:15:23,000
Right?

297
00:15:23,000 --> 00:15:28,000
Now in the nonlinear function how do we solve this and all.

298
00:15:28,000 --> 00:15:32,000
Obviously we have seen, right, in every layer we apply an activation function.

299
00:15:32,000 --> 00:15:32,000
Right.

300
00:15:32,000 --> 00:15:37,000
And based on that it is able to capture the non-linear functions, the most complex properties within

301
00:15:37,000 --> 00:15:38,000
a specific data set.

302
00:15:38,000 --> 00:15:39,000
Right the behavior.

303
00:15:39,000 --> 00:15:40,000
And all right.

304
00:15:40,000 --> 00:15:47,000
So similarly the feed forward neural network will also do that same thing. Obviously, before that, the self-attention

305
00:15:47,000 --> 00:15:48,000
is able to do a good job.

306
00:15:48,000 --> 00:15:54,000
But to capture more information for this complex task we specifically use this feed forward neural network.

307
00:15:54,000 --> 00:16:01,000
So what it does over here, I will just add one point that it is going to add non-linearity.

308
00:16:03,000 --> 00:16:03,000
Okay.

309
00:16:04,000 --> 00:16:11,000
The second point, and probably this is not even mentioned in the research paper.

310
00:16:11,000 --> 00:16:17,000
But I was reading an article and there I specifically understood about this.

311
00:16:17,000 --> 00:16:17,000
Okay.

312
00:16:18,000 --> 00:16:25,000
It is nothing but processing each position independently.

313
00:16:30,000 --> 00:16:32,000
Okay, now let's talk about this.

314
00:16:32,000 --> 00:16:33,000
Why it is important.

315
00:16:33,000 --> 00:16:39,000
Now you know that self-attention mechanism captures the relationship between tokens, right?

316
00:16:39,000 --> 00:16:41,000
What does self-attention basically do?

317
00:16:42,000 --> 00:16:45,000
It captures the relationship.

318
00:16:50,000 --> 00:16:52,000
It captures the relationship between the token.

319
00:16:52,000 --> 00:16:59,000
It processes the relationship in such a way that each token can attend to every other token, right?

320
00:16:59,000 --> 00:17:02,000
That is how it basically captures the relation, right?

321
00:17:02,000 --> 00:17:07,000
But by itself it does not perform further transformation or feature extraction on the representation

322
00:17:07,000 --> 00:17:09,000
obtained from the self-attention.

323
00:17:09,000 --> 00:17:09,000
Okay.

324
00:17:10,000 --> 00:17:16,000
Now how does the feedforward neural network help? This feedforward neural network:

325
00:17:16,000 --> 00:17:23,000
Based on the tokens, whatever context you are specifically able to get, it processes each token representation

326
00:17:23,000 --> 00:17:26,000
independently, right. Before that, what does it need?

327
00:17:27,000 --> 00:17:29,000
We are sending all the information with the positional encoding.

328
00:17:29,000 --> 00:17:33,000
That is how your self-attention will understand which words are probably coming first.

329
00:17:33,000 --> 00:17:35,000
Which words are probably coming second, right?

330
00:17:35,000 --> 00:17:45,000
But if I talk about the feed forward neural network, this processes each token representation

331
00:17:48,000 --> 00:17:49,000
Independently.

332
00:17:50,000 --> 00:17:53,000
Now you may be thinking, what is so important about this?

333
00:17:53,000 --> 00:18:00,000
This means that for each token representation resulting from the self-attention mechanism, the feed

334
00:18:00,000 --> 00:18:04,000
forward neural network applies the same two layer neural network.

335
00:18:04,000 --> 00:18:11,000
As I said, there will be a two-layer neural network, and this helps since we are taking each and every token

336
00:18:11,000 --> 00:18:14,000
and we are trying to retrieve more specific information out of it.
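"Processing each token independently" can be checked with a tiny experiment: apply the same small network to every token, then reverse the token order and confirm the outputs are simply reversed too. The 2 -> 3 -> 2 network `tiny_ffn` and its fixed weights are hypothetical values chosen only for this sketch.

```python
def tiny_ffn(x):
    """Hypothetical 2 -> 3 -> 2 network with fixed weights and a ReLU."""
    W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
    W2 = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]
    hidden = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def apply_position_wise(tokens):
    """Apply the same network to each token vector, one at a time."""
    return [tiny_ffn(t) for t in tokens]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = apply_position_wise(tokens)
reordered = apply_position_wise(tokens[::-1])
# Each output depends only on its own token, so reversing the input
# order just reverses the output order.
print(out == reordered[::-1])
```

This is the contrast with self-attention, where changing the other tokens in the sequence would change every output.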

337
00:18:14,000 --> 00:18:23,000
This helps in transforming this representation.

338
00:18:26,000 --> 00:18:27,000
Further.

339
00:18:30,000 --> 00:18:32,000
And allows the model.

340
00:18:36,000 --> 00:18:40,000
Allows the model to learn.

341
00:18:43,000 --> 00:18:44,000
To learn.

342
00:18:45,000 --> 00:18:49,000
Richer representation.

343
00:18:51,000 --> 00:18:55,000
With the self-attention, we are definitely able to understand some of the information based on the

344
00:18:55,000 --> 00:18:56,000
contextual information.

345
00:18:56,000 --> 00:19:01,000
But by using feed forward neural network, again, we are taking each and every vector for that specific

346
00:19:01,000 --> 00:19:04,000
information or for that contextual embedding.

347
00:19:04,000 --> 00:19:10,000
And then we are allowing the model to learn more representation from that particular thing.

348
00:19:10,000 --> 00:19:10,000
Right.

349
00:19:10,000 --> 00:19:19,000
It's just like trying to explore more of these specific vectors that are created by the contextual

350
00:19:19,000 --> 00:19:20,000
embeddings.

351
00:19:20,000 --> 00:19:20,000
Right.

352
00:19:20,000 --> 00:19:23,000
So this is why we specifically use feedforward neural network.

353
00:19:23,000 --> 00:19:28,000
And if you know about ANN, which is a kind of feedforward neural network, you can just understand,

354
00:19:28,000 --> 00:19:33,000
with the help of forward and backward propagation, what all information it will be able to explore and

355
00:19:33,000 --> 00:19:33,000
get.

356
00:19:33,000 --> 00:19:33,000
Okay.

357
00:19:35,000 --> 00:19:42,000
Um, again, uh, the third important point about feed forward neural network is that already our network

358
00:19:42,000 --> 00:19:43,000
is really deep, right?

359
00:19:43,000 --> 00:19:49,000
So the feed forward neural network helps your neural network to become much

360
00:19:49,000 --> 00:19:50,000
deeper.

361
00:19:50,000 --> 00:19:51,000
Right.

362
00:19:51,000 --> 00:19:56,000
So it basically adds depth to the model.

363
00:19:56,000 --> 00:20:01,000
And one important property about depth is that, if you're handling all those vanishing gradient problems

364
00:20:01,000 --> 00:20:13,000
and all, as there is more depth, more learnings you will be able to get from the data, more learnings

365
00:20:14,000 --> 00:20:15,000
you'll be able to get from the data.

366
00:20:15,000 --> 00:20:19,000
So that basically means the neural network will be able to capture more and more information from that

367
00:20:19,000 --> 00:20:20,000
particular data.

368
00:20:20,000 --> 00:20:20,000
Okay.

369
00:20:20,000 --> 00:20:21,000
Now it's not like that.

370
00:20:21,000 --> 00:20:25,000
You can probably use just ten layers of feed forward neural network.

371
00:20:25,000 --> 00:20:30,000
But this is doing one thing because the two layers is specifically able to do much more things over

372
00:20:30,000 --> 00:20:31,000
there, right?

373
00:20:31,000 --> 00:20:36,000
And again, uh, when you say one more very important property with respect to feed forward neural network,

374
00:20:36,000 --> 00:20:38,000
it also increases the model parameters, okay.

375
00:20:38,000 --> 00:20:43,000
And because of which it will be able to help us to generalize well with the training data and with the

376
00:20:43,000 --> 00:20:45,000
test data, which is unseen data.

377
00:20:45,000 --> 00:20:46,000
Okay.

378
00:20:46,000 --> 00:20:51,000
So this is more about understanding why feed forward neural network, why this and all.

379
00:20:51,000 --> 00:20:56,000
But if I just go with the flow, I have each and every word that is passing to self-attention.

380
00:20:56,000 --> 00:21:00,000
Add normalization, feedforward neural network, then again passing through the next encoder.

381
00:21:00,000 --> 00:21:03,000
Now again you may also have a question Krish why so many number of encoders.

382
00:21:03,000 --> 00:21:04,000
Right.

383
00:21:04,000 --> 00:21:07,000
Again, as I said, sequence-to-sequence tasks are more complex.

384
00:21:07,000 --> 00:21:11,000
So the more information we are able to extract, the better it is, right.

385
00:21:11,000 --> 00:21:16,000
Tomorrow, if you are planning to create another kind of LLM model or any other model, you can also

386
00:21:16,000 --> 00:21:17,000
change this.

387
00:21:17,000 --> 00:21:19,000
You can use four encoders if you want.

388
00:21:19,000 --> 00:21:20,000
And all right.

389
00:21:20,000 --> 00:21:24,000
So after all this information is probably passed then from here it goes to the decoder.

390
00:21:24,000 --> 00:21:28,000
Now in my next video I'll try to make you understand what is exactly about the decoders.

391
00:21:28,000 --> 00:21:31,000
But in this video I told you about the parameters.

392
00:21:31,000 --> 00:21:34,000
So let me revise what all parameters are specifically used.

393
00:21:34,000 --> 00:21:36,000
One is for the embedding vectors.

394
00:21:37,000 --> 00:21:40,000
Embedding vectors is nothing but 512 dimension.

395
00:21:40,000 --> 00:21:46,000
Your Q, K, V are nothing but 64. Your multi-head attention:

396
00:21:46,000 --> 00:21:48,000
How many attention heads do you have?

397
00:21:48,000 --> 00:21:50,000
It is nothing but eight, right?

398
00:21:50,000 --> 00:21:55,000
And then along with this positional encoding will also be the same thing over here.
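The recap above can be collected into one small configuration object. This is only a convenience sketch: the class name `EncoderConfig` and its field names are made up for this example, while the values are the ones listed in the video.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    d_model: int = 512    # embedding and positional encoding size per token
    num_heads: int = 8    # attention heads in multi-head attention
    d_k: int = 64         # size of each query, key and value vector
    num_layers: int = 6   # stacked encoders

    def heads_fit(self) -> bool:
        """True when the heads together span exactly d_model values."""
        return self.num_heads * self.d_k == self.d_model


cfg = EncoderConfig()
print(cfg, cfg.heads_fit())
```

If the depth or head count is changed for another model, the `heads_fit` check gives a quick sanity test that the sizes still line up.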

399
00:21:55,000 --> 00:21:57,000
And all the remaining information I have discussed.

400
00:21:57,000 --> 00:21:58,000
Right.

401
00:21:58,000 --> 00:21:59,000
So yes, this was it from my side.

402
00:21:59,000 --> 00:22:01,000
I hope you liked this particular video.

403
00:22:01,000 --> 00:22:03,000
Uh, this was about the encoder architecture.

404
00:22:03,000 --> 00:22:04,000
I'll see you in the next video.

405
00:22:04,000 --> 00:22:04,000
Thank you.

406
00:22:04,000 --> 00:22:04,000
Take care.