1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue the discussion with respect to Transformers.

3
00:00:03,000 --> 00:00:07,000
Till now we have already discussed many things, right?

4
00:00:07,000 --> 00:00:11,000
We have covered something called self-attention.

5
00:00:11,000 --> 00:00:19,000
Then, after covering the self-attention layer and how exactly it works, we have also understood multi-head

6
00:00:19,000 --> 00:00:20,000
attention.

7
00:00:20,000 --> 00:00:27,000
Then, after covering multi-head attention, we saw the problem with respect to the sequence of words,

8
00:00:27,000 --> 00:00:33,000
and for this we also understood the importance of positional encoding.

9
00:00:33,000 --> 00:00:33,000
Right.

10
00:00:33,000 --> 00:00:40,000
And we have actually seen a lot of examples; I have personally done all the calculations showing how, with

11
00:00:40,000 --> 00:00:45,000
the help of the self-attention layer, we can convert the fixed vectors that we had from

12
00:00:45,000 --> 00:00:47,000
the inputs to contextual vectors.

13
00:00:47,000 --> 00:00:48,000
Right.

14
00:00:48,000 --> 00:00:50,000
So all those things we have specifically covered.

15
00:00:50,000 --> 00:00:59,000
So in this video we are going to go ahead and understand something called layer normalization.

16
00:00:59,000 --> 00:01:02,000
Now layer normalization is one of the components.

17
00:01:02,000 --> 00:01:08,000
Now if you see the architecture of the transformers and this architecture is from the research paper

18
00:01:09,000 --> 00:01:12,000
here, you will be able to see that initially I have my inputs.

19
00:01:12,000 --> 00:01:16,000
We convert this into some input embeddings.

20
00:01:16,000 --> 00:01:19,000
Then we add it with the positional encoding.

21
00:01:19,000 --> 00:01:22,000
And after this we go ahead and create our multi-head attention.

22
00:01:22,000 --> 00:01:27,000
So here we specifically pass through our self-attention layer right.

23
00:01:27,000 --> 00:01:32,000
Now when we are creating this multi-head attention now you can see the next step.

24
00:01:32,000 --> 00:01:34,000
What exactly comes after this, right?

25
00:01:34,000 --> 00:01:37,000
It is basically called add and normalize.

26
00:01:38,000 --> 00:01:38,000
Okay.

27
00:01:38,000 --> 00:01:41,000
Add and normalize.

28
00:01:42,000 --> 00:01:46,000
And this is what we are going to talk about right.

29
00:01:46,000 --> 00:01:51,000
When we say add and normalize, the normalization technique that is specifically applied is called as

30
00:01:51,000 --> 00:01:53,000
layer normalization.

31
00:01:53,000 --> 00:01:56,000
So we are going to understand the mathematical intuition.

32
00:01:56,000 --> 00:02:02,000
Along with this there is also one more topic that we really need to understand which is called as residuals.

33
00:02:03,000 --> 00:02:10,000
Now here you can see that before going to the multi-head attention, whatever vectors that I get after

34
00:02:10,000 --> 00:02:17,000
adding the positional encoding to the input embedding, that is also passed to this add and

35
00:02:17,000 --> 00:02:18,000
normalize.

36
00:02:18,000 --> 00:02:21,000
So these are called residuals, okay.

37
00:02:22,000 --> 00:02:23,000
Residuals.

38
00:02:23,000 --> 00:02:28,000
And this provides some additional signals okay.

39
00:02:28,000 --> 00:02:37,000
Additional signals to the add and to the layer normalization.

40
00:02:38,000 --> 00:02:38,000
Right.

41
00:02:38,000 --> 00:02:40,000
Layer normalization.

42
00:02:40,000 --> 00:02:48,000
It's just like this line is basically providing the information of this input embedding plus positional

43
00:02:48,000 --> 00:02:51,000
encoding vectors to the add and normalize layer.

44
00:02:51,000 --> 00:02:52,000
Right.

45
00:02:52,000 --> 00:02:56,000
So in short, that is the reason we basically say add and normalize here.

46
00:02:56,000 --> 00:03:00,000
So we are going to add these previous encodings.

47
00:03:00,000 --> 00:03:03,000
That is probably coming from the input embedding and positional encoding.

48
00:03:03,000 --> 00:03:06,000
And then we are basically going to normalize right.

49
00:03:06,000 --> 00:03:09,000
So those are basically the residuals that are getting added.

50
00:03:09,000 --> 00:03:10,000
So this is the information.

51
00:03:10,000 --> 00:03:14,000
Whatever we are able to get beforehand, we are passing it to the next stage, okay.

52
00:03:14,000 --> 00:03:16,000
And that is what residuals are all about.

53
00:03:16,000 --> 00:03:19,000
But again uh don't worry we will try to understand it.

54
00:03:19,000 --> 00:03:23,000
What exactly is normalized and how this addition and normalization takes place.

55
00:03:23,000 --> 00:03:23,000
Okay.

56
00:03:24,000 --> 00:03:31,000
Now before we go ahead, uh, let's understand where this addition and normalization will get added.

57
00:03:31,000 --> 00:03:31,000
Okay.

58
00:03:31,000 --> 00:03:35,000
So here uh, again I've taken this particular diagram from the blog.

59
00:03:35,000 --> 00:03:36,000
Right.

60
00:03:36,000 --> 00:03:39,000
So here you initially have your input embeddings okay.

61
00:03:39,000 --> 00:03:40,000
Layer by layer.

62
00:03:40,000 --> 00:03:40,000
It will be there.

63
00:03:40,000 --> 00:03:42,000
This is for my word one, word two.

64
00:03:42,000 --> 00:03:43,000
Right.

65
00:03:43,000 --> 00:03:48,000
Then for each of these words, based on the number of attentions, I will be initializing different

66
00:03:48,000 --> 00:03:49,000
different weights.

67
00:03:49,000 --> 00:03:50,000
Right?

68
00:03:50,000 --> 00:03:55,000
Then we probably go ahead and find our query key and value vectors for each and every attention.

69
00:03:55,000 --> 00:04:00,000
And finally we get all the attention outputs, right: z0, z1, up to z7.

70
00:04:00,000 --> 00:04:02,000
Now see one thing over here.

71
00:04:02,000 --> 00:04:04,000
Where does this Z go?

72
00:04:04,000 --> 00:04:06,000
Because this is the multi-head attention, right?

73
00:04:06,000 --> 00:04:11,000
And this multi head attention before we pass it to the feed forward neural network.

74
00:04:11,000 --> 00:04:12,000
Right.

75
00:04:12,000 --> 00:04:17,000
So here we will go ahead and apply our next step.

76
00:04:17,000 --> 00:04:18,000
That is nothing but.

77
00:04:21,000 --> 00:04:23,000
Add and normalize.

78
00:04:23,000 --> 00:04:30,000
So let me write it in another color add and normalize okay.

79
00:04:31,000 --> 00:04:34,000
And this is where your layer normalization basically happens.

80
00:04:34,000 --> 00:04:38,000
Layer normalization basically happens.

81
00:04:38,000 --> 00:04:40,000
Now what exactly is this layer normalization.

82
00:04:40,000 --> 00:04:41,000
We'll discuss more about it.

83
00:04:41,000 --> 00:04:47,000
But in terms of how it works, you'll be able to see that over here add and normalize

84
00:04:47,000 --> 00:04:48,000
is specifically happening okay.

85
00:04:49,000 --> 00:04:56,000
And here you can see, before we go ahead, we also need to add our positional

86
00:04:56,000 --> 00:04:56,000
encoding.

87
00:04:57,000 --> 00:04:57,000
Okay.

88
00:04:57,000 --> 00:05:03,000
And this positional encoding will have the same shape, right.

89
00:05:03,000 --> 00:05:06,000
So let's say if this is my positional encoding.

90
00:05:07,000 --> 00:05:09,000
So here: 1, 2, 3, 4.

91
00:05:09,000 --> 00:05:12,000
So this will be my positional encoding that will get added.

92
00:05:12,000 --> 00:05:13,000
And it will be sent over here.

93
00:05:14,000 --> 00:05:15,000
Along with this.

94
00:05:15,000 --> 00:05:21,000
What happens is that this positional encoding information is also sent over here, and it is added before

95
00:05:21,000 --> 00:05:23,000
the normalization takes place.

96
00:05:23,000 --> 00:05:23,000
Okay.

97
00:05:23,000 --> 00:05:25,000
And the reason over here why we do this.

98
00:05:25,000 --> 00:05:34,000
Because we are providing the signals, this entire information, to this layer, okay.

99
00:05:35,000 --> 00:05:36,000
Some more additional signals.

100
00:05:36,000 --> 00:05:39,000
So that is what residuals are all about.
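
The add and normalize step described above can be sketched roughly as follows. This is only a minimal illustration in plain Python, assuming made-up 4-dimensional vectors for a single word; the function names `layer_norm` and `add_and_normalize` are hypothetical, and the learnable scale and shift parameters that layer normalization usually carries are left out to keep it short.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one word's vector across its own features."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def add_and_normalize(residual, sublayer_output):
    """Add the residual to the sub-layer output, then layer-normalize the sum."""
    summed = [r + s for r, s in zip(residual, sublayer_output)]
    return layer_norm(summed)

# Hypothetical values, not taken from the video.
embedding_plus_position = [0.2, -1.0, 0.5, 1.3]  # what goes into multi-head attention
attention_output = [0.9, 0.1, -0.4, 0.6]         # what comes out of multi-head attention

normalized = add_and_normalize(embedding_plus_position, attention_output)
print(normalized)
```

The residual here is simply the same vector that was fed into the attention layer, added back before the normalization takes place.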

101
00:05:39,000 --> 00:05:42,000
Now let's go ahead and understand about the layer normalization.

102
00:05:42,000 --> 00:05:48,000
So first of all we will just get to know an idea about what exactly is normalization.

103
00:05:48,000 --> 00:05:54,000
And specifically with respect to deep learning we use normalization in two different ways.

104
00:05:54,000 --> 00:06:02,000
One is batch normalization.

105
00:06:03,000 --> 00:06:07,000
And the other one is something called as layer normalization.

106
00:06:09,000 --> 00:06:12,000
Batch normalization and layer normalization.

107
00:06:13,000 --> 00:06:16,000
Now let's discuss about this batch and layer normalization.

108
00:06:16,000 --> 00:06:18,000
So for this I will take an example.

109
00:06:18,000 --> 00:06:23,000
Let's say I have a neural network with two inputs.

110
00:06:23,000 --> 00:06:23,000
Okay.

111
00:06:23,000 --> 00:06:28,000
And let's say here in my hidden layer I have two hidden neurons.

112
00:06:28,000 --> 00:06:30,000
And then finally I have one output, right.

113
00:06:30,000 --> 00:06:36,000
So whenever this input is basically connected to the hidden layer right in this way.

114
00:06:37,000 --> 00:06:40,000
And then finally I'm able to get my output over here.

115
00:06:40,000 --> 00:06:41,000
So this is my output.

116
00:06:42,000 --> 00:06:45,000
Now what exactly is normalization.

117
00:06:45,000 --> 00:06:48,000
See guys let's say over here I am having two features.

118
00:06:48,000 --> 00:06:54,000
Let's say my feature is like house size right.

119
00:06:54,000 --> 00:06:56,000
This is my first feature.

120
00:06:56,000 --> 00:06:59,000
Let's say I will go ahead and write area of the house.

121
00:06:59,000 --> 00:07:00,000
Okay.

122
00:07:00,000 --> 00:07:01,000
House size and area.

123
00:07:01,000 --> 00:07:05,000
I'll say number of rooms is my second feature, number of rooms.

124
00:07:06,000 --> 00:07:10,000
And I need to probably predict the price of the house.

125
00:07:10,000 --> 00:07:13,000
Okay, so this is my F1 feature.

126
00:07:13,000 --> 00:07:16,000
This is my F2 feature, and the F1 feature goes over here.

127
00:07:16,000 --> 00:07:17,000
F2 goes over here.

128
00:07:17,000 --> 00:07:25,000
Now, before we send our F1 and F2 inputs to this particular

129
00:07:25,000 --> 00:07:27,000
ANN, that is, the artificial neural network.

130
00:07:27,000 --> 00:07:29,000
Or I can also say it as feed forward neural network.

131
00:07:29,000 --> 00:07:34,000
Right here I will be having some information and one by one I will probably send this information for

132
00:07:34,000 --> 00:07:36,000
training the neural network.

133
00:07:36,000 --> 00:07:40,000
Now what usually happens is that the first step and you know that right?

134
00:07:40,000 --> 00:07:42,000
I will be having multiple values over here.

135
00:07:42,000 --> 00:07:45,000
Let's say I will just go ahead and write what is the house size?

136
00:07:45,000 --> 00:07:49,000
Let's say it is 1200 square feet okay.

137
00:07:49,000 --> 00:07:50,000
Number of rooms are two.

138
00:07:50,000 --> 00:07:53,000
And let's say the price is 45 lakhs okay.

139
00:07:53,000 --> 00:07:58,000
And, uh, I will go ahead and write another house size; the other house

140
00:07:58,000 --> 00:08:00,000
data that we have is 1500 square feet.

141
00:08:00,000 --> 00:08:03,000
The number of rooms are three and this is 70 lakhs.

142
00:08:03,000 --> 00:08:04,000
Okay.

143
00:08:04,000 --> 00:08:07,000
And similarly here you have 2000 square feet.

144
00:08:07,000 --> 00:08:10,000
You have 3.5 BHK rooms.

145
00:08:10,000 --> 00:08:12,000
And here let's say the price is 80 lakhs.

146
00:08:12,000 --> 00:08:14,000
So all this information you specifically have.

147
00:08:15,000 --> 00:08:21,000
Now, the first thing that we basically do before we provide all these inputs, F1 and F2 for the training,

148
00:08:21,000 --> 00:08:21,000
right.

149
00:08:21,000 --> 00:08:22,000
What do we do?

150
00:08:22,000 --> 00:08:24,000
We do something called as normalization.

151
00:08:24,000 --> 00:08:29,000
One of the most common normalization we do.

152
00:08:29,000 --> 00:08:29,000
Right.

153
00:08:29,000 --> 00:08:32,000
That is nothing but standard scaling.

154
00:08:33,000 --> 00:08:35,000
Now what does standard scaling basically mean?

155
00:08:36,000 --> 00:08:38,000
With the help of standard scaling, right.

156
00:08:38,000 --> 00:08:44,000
What we do is that we take each of these features.

157
00:08:44,000 --> 00:08:45,000
Right.

158
00:08:45,000 --> 00:08:53,000
Each of these features, even the output feature if we want and we apply a general standard scaling

159
00:08:53,000 --> 00:08:57,000
formula, and that we basically call the z-score.

160
00:08:57,000 --> 00:08:58,000
Now in z-score,

161
00:08:58,000 --> 00:09:01,000
What happens is that we the formula is very simple.

162
00:09:01,000 --> 00:09:06,000
We take every input, subtract the mean from it, and divide by the standard deviation.

163
00:09:07,000 --> 00:09:12,000
Now once we do this, if we are considering any feature, let's say if I'm considering house size and

164
00:09:12,000 --> 00:09:19,000
if I apply this formula, all these values, all these values, you will be able to see that we will

165
00:09:19,000 --> 00:09:25,000
be getting in such a way where my mean will be zero and my standard deviation will be equal to one.

166
00:09:26,000 --> 00:09:28,000
Okay, so what is basically happening?

167
00:09:28,000 --> 00:09:36,000
Let's say f one is one of my distributions; after applying the standard scaling this will get converted

168
00:09:36,000 --> 00:09:37,000
into f one dash.

169
00:09:37,000 --> 00:09:42,000
And inside this f one dash here you will be able to see my mean will be zero and my standard deviation

170
00:09:42,000 --> 00:09:43,000
will be one.

171
00:09:43,000 --> 00:09:44,000
Okay.

172
00:09:45,000 --> 00:09:48,000
Now one thing that you really need to understand, right?

173
00:09:48,000 --> 00:09:54,000
Whenever we are applying this formula, this basically gets applied to this entire column okay.

174
00:09:54,000 --> 00:09:56,000
To entire column.

175
00:09:56,000 --> 00:10:00,000
So first of all, what we need to do in order to apply this formula we need to calculate the mean.

176
00:10:00,000 --> 00:10:03,000
We also need to calculate the standard deviation for this.

177
00:10:03,000 --> 00:10:07,000
And then for every number we will go ahead and apply standard scaling.

178
00:10:07,000 --> 00:10:12,000
This is what happens when we try to provide our input over here, right?

179
00:10:13,000 --> 00:10:14,000
When we provide our input over here.
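
As a small worked sketch of the column-wise standard scaling described above, here is the z-score applied to the house size and number of rooms values from the example. It uses only the Python standard library; the helper name `standard_scale` is hypothetical, and it assumes the population standard deviation is the one being used.

```python
from statistics import mean, pstdev

def standard_scale(column):
    """Z-score one feature column: (x - mean) / standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in column]

house_size = [1200, 1500, 2000]  # square feet, from the example
rooms = [2, 3, 3.5]              # number of rooms, from the example

house_size_scaled = standard_scale(house_size)
rooms_scaled = standard_scale(rooms)

print(house_size_scaled)
# After scaling, the column has mean ~0 and standard deviation ~1.
print(mean(house_size_scaled), pstdev(house_size_scaled))
```

Note that the mean and standard deviation are computed once per column, and then every value in that column is scaled with them.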

180
00:10:15,000 --> 00:10:17,000
Now this scaling is fine okay.

181
00:10:17,000 --> 00:10:19,000
And usually we do this.

182
00:10:19,000 --> 00:10:24,000
And what usually happens because of this I will tell you there are some very good advantages of using

183
00:10:24,000 --> 00:10:24,000
this.

184
00:10:24,000 --> 00:10:30,000
And similarly in deep learning if I talk about deep learning right.

185
00:10:30,000 --> 00:10:34,000
In deep learning, uh, let's say my input is in the form of images.

186
00:10:34,000 --> 00:10:36,000
So there also we do some kind of scaling.

187
00:10:36,000 --> 00:10:41,000
And that scaling that we specifically do for the data is something called as min max scaler.

188
00:10:42,000 --> 00:10:46,000
Now in the case of min max scaler you know that all your images.

189
00:10:47,000 --> 00:10:47,000
Right.

190
00:10:47,000 --> 00:10:53,000
Let's say I have some color images of some four cross four matrix.

191
00:10:53,000 --> 00:10:56,000
And you know, all the values will be between 0 and 255.

192
00:10:56,000 --> 00:10:59,000
So what we do after applying this min max scaler.

193
00:11:02,000 --> 00:11:11,000
We convert all these dimensions, all these dimensions, to between 0 and 1, right, 0 to 1, 0 to 1, all

194
00:11:11,000 --> 00:11:12,000
these pixels that we have.

195
00:11:12,000 --> 00:11:13,000
Right.

196
00:11:13,000 --> 00:11:14,000
If it is four cross four pixels.

197
00:11:14,000 --> 00:11:17,000
So we try to convert this into.

198
00:11:20,000 --> 00:11:24,000
We try to convert this into 0 to 1.

199
00:11:25,000 --> 00:11:30,000
So this usually we do for the input data whatever input data we specifically have.
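
A minimal sketch of the min max scaling just described, assuming a single four cross four channel with made-up pixel values and a known pixel range of 0 to 255. The helper name `min_max_scale` is hypothetical.

```python
def min_max_scale(values, low, high):
    """Map each value from the range [low, high] to the range [0, 1]."""
    return [(v - low) / (high - low) for v in values]

# Hypothetical 4 x 4 pixel values for one channel of an image.
image = [
    [0, 64, 128, 255],
    [12, 200, 34, 90],
    [255, 255, 0, 17],
    [45, 180, 220, 101],
]

# Pixel values are known to lie between 0 and 255, so that range is used directly.
scaled_image = [min_max_scale(row, 0, 255) for row in image]

for row in scaled_image:
    print([round(p, 3) for p in row])
```

With a fixed range of 0 to 255, this is the same as dividing every pixel by 255.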

200
00:11:30,000 --> 00:11:34,000
And what are the usual advantages for doing this.

201
00:11:34,000 --> 00:11:37,000
Scaling techniques, which we also call normalization.

202
00:11:37,000 --> 00:11:40,000
First let me just go ahead and write some of the advantages.

203
00:11:42,000 --> 00:11:47,000
The first advantage that I would like to write is about improved.

204
00:11:51,000 --> 00:11:52,000
Training.

205
00:11:54,000 --> 00:11:55,000
Stability.

206
00:11:56,000 --> 00:12:04,000
Okay, now what does this improved training stability basically mean, right?

207
00:12:04,000 --> 00:12:12,000
See, before we pass all the information to our ANN, right,

208
00:12:12,000 --> 00:12:16,000
We know that my data may be following a different kind of distribution.

209
00:12:16,000 --> 00:12:19,000
It can be this distribution, it can be this distribution.

210
00:12:19,000 --> 00:12:25,000
It can be probably, uh, this kind of distribution, different, different distribution.

211
00:12:25,000 --> 00:12:25,000
It may follow.

212
00:12:25,000 --> 00:12:26,000
Right.

213
00:12:26,000 --> 00:12:34,000
And when we actually convert this distribution into a distribution where all my information is centered

214
00:12:34,000 --> 00:12:35,000
near to zero.

215
00:12:35,000 --> 00:12:35,000
Right.

216
00:12:35,000 --> 00:12:40,000
Because here, what is happening here we are able to convert into data where our mean will be zero and

217
00:12:40,000 --> 00:12:42,000
it will be standard deviation equal to one.

218
00:12:42,000 --> 00:12:43,000
Right.

219
00:12:44,000 --> 00:12:49,000
So and all the values that you will be seeing let's say initially when I probably check the distribution,

220
00:12:49,000 --> 00:12:54,000
let's say that I'm checking the distribution between my f one and f two feature.

221
00:12:55,000 --> 00:12:57,000
Now you know my distribution is coming over here.

222
00:12:57,000 --> 00:12:59,000
Let's say it is coming over here.

223
00:13:00,000 --> 00:13:03,000
Now, what we are doing after applying scaling, it is very simple.

224
00:13:03,000 --> 00:13:08,000
We are taking this and this is my f one and f two.

225
00:13:08,000 --> 00:13:16,000
Now, whatever my magnitude was here, I'm trying to bring all the magnitudes over here, right.

226
00:13:16,000 --> 00:13:18,000
So here what is the main property over here.

227
00:13:18,000 --> 00:13:21,000
Your mean will be equal to zero and your standard deviation will be equal to one.

228
00:13:21,000 --> 00:13:24,000
This entire distribution will have this kind of distribution itself.

229
00:13:24,000 --> 00:13:31,000
Now, because of this, what will happen is that when we are doing the back propagation, okay, when

230
00:13:31,000 --> 00:13:38,000
we are doing the back propagation, we are much less likely to face problems related to vanishing

231
00:13:39,000 --> 00:13:40,000
and exploding.

232
00:13:40,000 --> 00:13:41,000
Gradient problem.

233
00:13:45,000 --> 00:13:46,000
Gradient problem.

234
00:13:46,000 --> 00:13:51,000
This is one of the properties that we usually get after doing the scaling.

235
00:13:51,000 --> 00:13:54,000
And right now all the values have been scaled around the center, right.

236
00:13:55,000 --> 00:14:00,000
If we had these huge values, there may be a scenario where, whenever we are doing some kind of multiplication

237
00:14:00,000 --> 00:14:01,000
with respect to the weights.

238
00:14:01,000 --> 00:14:03,000
And all these values will be quite huge.

239
00:14:03,000 --> 00:14:03,000
Okay.

240
00:14:03,000 --> 00:14:09,000
And because of this, what will happen is that the second important thing that may happen is that we

241
00:14:09,000 --> 00:14:12,000
will be able to get faster convergence.

242
00:14:14,000 --> 00:14:19,000
We will be able to get faster convergence because obviously these values are very small.

243
00:14:19,000 --> 00:14:20,000
Uh, less right.

244
00:14:20,000 --> 00:14:28,000
So if you remember the gradient descent with respect to our weights, uh, we try to find out our loss

245
00:14:28,000 --> 00:14:32,000
right now since our values are already centered completely.

246
00:14:32,000 --> 00:14:32,000
Right.

247
00:14:32,000 --> 00:14:33,000
This is zero centered.

248
00:14:33,000 --> 00:14:37,000
All the data over here is like zero centered, right?

249
00:14:37,000 --> 00:14:44,000
You'll be able to see that initially, if I compare with respect to our weights and loss, our loss

250
00:14:44,000 --> 00:14:49,000
when we are doing this, when we are probably doing the scaling initially, it will be very, very small.

251
00:14:49,000 --> 00:14:53,000
If you don't do this, there may be a scenario where there is a huge loss over here.

252
00:14:53,000 --> 00:14:57,000
And then our main aim is basically to come to the global minima, right?

253
00:14:57,000 --> 00:14:59,000
It can be from here.

254
00:14:59,000 --> 00:15:01,000
It can be from here, it can be from here.

255
00:15:01,000 --> 00:15:03,000
It can be from different different position.

256
00:15:03,000 --> 00:15:04,000
Right.

257
00:15:04,000 --> 00:15:08,000
So because of this we are able to do the faster convergence.

258
00:15:08,000 --> 00:15:12,000
And the best thing will be that when we apply back propagation.

259
00:15:14,000 --> 00:15:18,000
Back propagation there will be a stable update.

260
00:15:19,000 --> 00:15:22,000
Stable updates okay.

261
00:15:22,000 --> 00:15:24,000
Perfect then.

262
00:15:26,000 --> 00:15:29,000
There is one more important point that I really want to talk about.

263
00:15:29,000 --> 00:15:29,000
Right.

264
00:15:29,000 --> 00:15:33,000
And, uh, that that point I will discuss later on.

265
00:15:33,000 --> 00:15:36,000
But first of all, this is one of the ways we do the scaling.

266
00:15:36,000 --> 00:15:41,000
But again, if I take this particular neural network, okay, let's say this is my neural network.

267
00:15:41,000 --> 00:15:45,000
This is my input layer, this is my hidden layer and this is my output layer.

268
00:15:45,000 --> 00:15:47,000
Now, you know that this is connected to this.

269
00:15:47,000 --> 00:15:49,000
This is connected to this.

270
00:15:49,000 --> 00:15:51,000
This is connected to this.

271
00:15:51,000 --> 00:15:58,000
And I have this right now here you will be able to see that if I am doing all the normalization with

272
00:15:58,000 --> 00:15:59,000
respect to F1 and F2.

273
00:15:59,000 --> 00:16:05,000
Now, you know over here we have w one, we have w two, we have w three, we have w four right now

274
00:16:05,000 --> 00:16:06,000
in the hidden layer.

275
00:16:06,000 --> 00:16:11,000
If I just consider as an example let's let's take this as an example.

276
00:16:11,000 --> 00:16:21,000
Let's say I will just go ahead and take some house size right area, uh, rooms.

277
00:16:21,000 --> 00:16:22,000
How many number of rooms are there?

278
00:16:22,000 --> 00:16:25,000
And this will basically be my price.

279
00:16:25,000 --> 00:16:25,000
Okay.

280
00:16:25,000 --> 00:16:28,000
So let's consider these three features I have.

281
00:16:28,000 --> 00:16:29,000
Okay.

282
00:16:29,000 --> 00:16:33,000
The first feature with respect to the house size is let's say 1200 square feet.

283
00:16:33,000 --> 00:16:40,000
Now after I probably do the scaling, let's say I have some values like one, I'll just put some smaller

284
00:16:40,000 --> 00:16:41,000
value okay.

285
00:16:41,000 --> 00:16:43,000
Let's say 0.45 okay.

286
00:16:43,000 --> 00:16:49,000
Rooms I will just convert this into 0.55 and price I will keep it like that.

287
00:16:49,000 --> 00:16:50,000
Only 45 lakhs okay.

288
00:16:50,000 --> 00:16:52,000
Then again this can be 0.60.

289
00:16:52,000 --> 00:16:56,000
This can be um 0.20.

290
00:16:56,000 --> 00:16:57,000
This can be some other value.

291
00:16:57,000 --> 00:17:00,000
Similarly, I'm just putting some values over here okay.

292
00:17:00,000 --> 00:17:01,000
After doing the scaling.

293
00:17:01,000 --> 00:17:05,000
And this is my f one, this is my f two I'm sending this value okay.

294
00:17:05,000 --> 00:17:09,000
Now you know that when I'm sending F1 and F2 right.

295
00:17:09,000 --> 00:17:13,000
So for the first feature I have given 0.45 over here.

296
00:17:13,000 --> 00:17:15,000
And this is my 0.55, right.

297
00:17:15,000 --> 00:17:17,000
Now what is basically happening over here.

298
00:17:17,000 --> 00:17:19,000
First, what will I do for this particular node?

299
00:17:19,000 --> 00:17:25,000
You know, we will go ahead and see that, uh, the first Z one that we will go ahead and compute.

300
00:17:25,000 --> 00:17:27,000
And here, you know that a bias b1 is added.

301
00:17:27,000 --> 00:17:29,000
And here bias two will be added.

302
00:17:29,000 --> 00:17:32,000
And on top of that, an activation function will be applied.

303
00:17:32,000 --> 00:17:35,000
So inside this z one we will try to apply an activation function.

304
00:17:35,000 --> 00:17:38,000
Let's say I'm going to apply a sigmoid activation function.

305
00:17:38,000 --> 00:17:39,000
So what is my first feature's

306
00:17:39,000 --> 00:17:39,000
value?

307
00:17:39,000 --> 00:17:44,000
It is 0.45 multiplied by okay w one okay.

308
00:17:44,000 --> 00:17:46,000
So this basically becomes my w one.

309
00:17:46,000 --> 00:17:52,000
Then 0.55 multiplied by, what is this, w3, right.

310
00:17:52,000 --> 00:17:55,000
Once we do this, then we are adding a bias, right?

311
00:17:55,000 --> 00:17:58,000
We add a bias that is b one, right?

312
00:17:59,000 --> 00:18:04,000
Now once we do this entire calculation we pass it to the sigmoid activation function.

313
00:18:04,000 --> 00:18:05,000
We get some value.

314
00:18:05,000 --> 00:18:06,000
We will get some value.

315
00:18:06,000 --> 00:18:08,000
Over here you can go ahead and calculate the value if you want.

316
00:18:08,000 --> 00:18:09,000
Okay.

317
00:18:09,000 --> 00:18:11,000
Well you can initialize any weights and all okay.

318
00:18:12,000 --> 00:18:15,000
Now similarly I will go ahead and do it for z two.

319
00:18:15,000 --> 00:18:15,000
Right.

320
00:18:15,000 --> 00:18:16,000
Z two is the output of this.

321
00:18:16,000 --> 00:18:20,000
So z one is here and z two is over here right.

322
00:18:20,000 --> 00:18:24,000
So in the case of z two again I will go ahead and apply sigmoid activation function with respect to

323
00:18:24,000 --> 00:18:29,000
that. Whatever value I get here, I'll add a bias, and this will be my other value, okay.

324
00:18:30,000 --> 00:18:32,000
So here I'm getting my value one.

325
00:18:34,000 --> 00:18:36,000
Here I'm getting my value two.

326
00:18:38,000 --> 00:18:38,000
Okay.
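
The calculation for the two hidden neurons walked through above can be sketched like this. The inputs 0.45 and 0.55 and the pairing of w1 and w3 with the first neuron come from the example; the weight and bias values, and the pairing of w2 and w4 with the second neuron, are assumptions made only for illustration.

```python
import math

def sigmoid(x):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Scaled input features from the example.
f1, f2 = 0.45, 0.55

# Hypothetical weights and biases; any initial values would do.
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4
b1, b2 = 0.0, 0.0

# First hidden neuron: f1 * w1 + f2 * w3 + b1, then sigmoid.
z1 = sigmoid(f1 * w1 + f2 * w3 + b1)

# Second hidden neuron (assumed to use w2 and w4): f1 * w2 + f2 * w4 + b2, then sigmoid.
z2 = sigmoid(f1 * w2 + f2 * w4 + b2)

print("value one:", z1)
print("value two:", z2)
```

In matrix form, the one cross two input is multiplied by the two cross two weight matrix, which gives one value per hidden neuron.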

327
00:18:38,000 --> 00:18:43,000
If I consider this in the form of a matrix, uh, it is very simple.

328
00:18:43,000 --> 00:18:45,000
How many weights are basically getting created?

329
00:18:45,000 --> 00:18:45,000
Two.

330
00:18:45,000 --> 00:18:45,000
Cross two.

331
00:18:45,000 --> 00:18:49,000
So here the weight matrix that I'm actually going to get is two cross two.

332
00:18:49,000 --> 00:18:50,000
Right.

333
00:18:50,000 --> 00:18:52,000
So here you can see two, two, right.

334
00:18:52,000 --> 00:18:55,000
So two cross two will basically be my weight matrix, okay.

335
00:18:56,000 --> 00:18:57,000
Two inputs I have.

336
00:18:57,000 --> 00:18:58,000
So this is one cross two.

337
00:18:58,000 --> 00:19:05,000
Then um uh with respect to this you'll be able to see that I'm passing it through all the weights over

338
00:19:05,000 --> 00:19:07,000
here, which form a two cross two matrix.

339
00:19:07,000 --> 00:19:08,000
Okay, so this is fine.

340
00:19:08,000 --> 00:19:10,000
Let's discuss more about this okay.

341
00:19:11,000 --> 00:19:13,000
Now let's observe this.

342
00:19:13,000 --> 00:19:14,000
So this is my value one.

343
00:19:14,000 --> 00:19:17,000
This is my value one.

344
00:19:18,000 --> 00:19:19,000
This is my value two.

345
00:19:19,000 --> 00:19:25,000
Now when I have this value one and value two initially when I gave this F1 and f2, obviously we have

346
00:19:25,000 --> 00:19:27,000
done the normalization.

347
00:19:27,000 --> 00:19:32,000
But after doing all this kind of multiplication, don't you think the distribution of value one and value two

348
00:19:33,000 --> 00:19:38,000
will change when compared to the distribution of f1 and f2?

349
00:19:38,000 --> 00:19:39,000
Right.

350
00:19:39,000 --> 00:19:41,000
Whatever multiplications that we are doing over here.

351
00:19:42,000 --> 00:19:45,000
Might this distribution change?

352
00:19:47,000 --> 00:19:50,000
Might this distribution change?

353
00:19:50,000 --> 00:19:52,000
So here let me just go ahead and write it over here.

354
00:19:52,000 --> 00:19:54,000
Z one is nothing but some value.

355
00:19:54,000 --> 00:19:56,000
Z two is some value.

356
00:19:56,000 --> 00:19:56,000
Okay.

357
00:19:56,000 --> 00:19:59,000
Similarly for all the other inputs I'll be getting some value.

358
00:19:59,000 --> 00:20:06,000
So whatever inputs, whatever calculation that we are getting over here, don't you think this will

359
00:20:06,000 --> 00:20:07,000
be.

360
00:20:07,000 --> 00:20:13,000
This may have a different distribution when compared to this feature one and feature two right.

361
00:20:13,000 --> 00:20:15,000
It may have a different distribution.

362
00:20:16,000 --> 00:20:18,000
And that is what is the problem.

363
00:20:18,000 --> 00:20:19,000
Right.

364
00:20:19,000 --> 00:20:22,000
Because the distribution is getting changed.

365
00:20:22,000 --> 00:20:23,000
Right.

366
00:20:23,000 --> 00:20:25,000
It may have any kind of distribution over here.

367
00:20:25,000 --> 00:20:28,000
This and this distribution will definitely get changed.

368
00:20:28,000 --> 00:20:28,000
Right.

369
00:20:28,000 --> 00:20:33,000
But for most problem statements, we feel that this distribution should not be getting

370
00:20:33,000 --> 00:20:36,000
changed when compared to your input features distribution.

371
00:20:36,000 --> 00:20:36,000
Right.

372
00:20:36,000 --> 00:20:40,000
And it is generally good to have a standardized distribution, with mean zero and standard deviation one, at each and every layer.

373
00:20:40,000 --> 00:20:44,000
But here what is happening because of so many different types of operations that you have done, the

374
00:20:44,000 --> 00:20:46,000
entire distribution is getting changed over here.

375
00:20:47,000 --> 00:20:53,000
Now, because of this, what we have to do is that in order to make sure that the distribution does

376
00:20:53,000 --> 00:21:00,000
not change so much that it impacts the entire training of the artificial neural

377
00:21:00,000 --> 00:21:01,000
network or any neural network.

378
00:21:01,000 --> 00:21:07,000
What we do is that we further go ahead and perform normalization on top of Z1 and Z2.

379
00:21:07,000 --> 00:21:12,000
So here we go ahead and perform normalization.

380
00:21:12,000 --> 00:21:19,000
Normalization on every output, z1 and z2.

381
00:21:19,000 --> 00:21:25,000
And this normalization is basically called batch normalization.

382
00:21:26,000 --> 00:21:30,000
So in the case of batch normalization, we take all the values of z1 across the batch.

383
00:21:30,000 --> 00:21:32,000
We normalize it.

384
00:21:32,000 --> 00:21:37,000
How to normalize it, I will talk more about; I will just show you the formula over here, okay.

385
00:21:38,000 --> 00:21:39,000
And this Z two.

386
00:21:40,000 --> 00:21:46,000
So for this z1, if I really want to normalize it, what do I have to do? I have to go ahead and compute

387
00:21:46,000 --> 00:21:49,000
the mean and the standard deviation, let's say mu one and sigma one.

388
00:21:49,000 --> 00:21:50,000
For this.

389
00:21:50,000 --> 00:21:53,000
And for every number we will go ahead and apply the formula okay.

390
00:21:53,000 --> 00:21:58,000
And similarly for z two if I really want to normalize it again I will go ahead and compute the mu two

391
00:21:58,000 --> 00:22:02,000
and sigma two, the standard deviation, for that particular z2.

392
00:22:02,000 --> 00:22:05,000
And for every number I will go ahead and apply my z score.

393
00:22:05,000 --> 00:22:05,000
Right.

394
00:22:05,000 --> 00:22:09,000
And that is how the entire normalization probably happens, right.

395
00:22:10,000 --> 00:22:15,000
By doing the normalization again, our distribution will start matching this particular distribution.

396
00:22:15,000 --> 00:22:19,000
Because again, what we are doing when we do some kind of normalization over here, with the help of

397
00:22:19,000 --> 00:22:25,000
some z score value, you'll be able to see that my mean will again be coming to zero, and my standard

398
00:22:25,000 --> 00:22:27,000
deviation will be equal to one or near to one.

399
00:22:28,000 --> 00:22:31,000
And when we are doing that, the distribution will start matching that of your feature one and feature

400
00:22:31,000 --> 00:22:32,000
two.

401
00:22:32,000 --> 00:22:32,000
Okay.

402
00:22:32,000 --> 00:22:35,000
So this is batch normalization.
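A minimal sketch of the batch normalization step described above, in plain Python. The input values, the function name `batch_norm`, and the small constant `eps` (added to avoid dividing by zero) are assumptions made for illustration; they are not from the lecture.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each column (z1, z2, ...) using statistics over the whole batch."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mu = sum(column) / n                           # mean of this column
        var = sum((v - mu) ** 2 for v in column) / n   # population variance
        sigma = math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / sigma     # z-score
    return out

# Hypothetical outputs: each row is one record, the columns are z1 and z2.
z = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
z_hat = batch_norm(z)
```

After this step each column of `z_hat` has a mean close to zero and a standard deviation close to one, which is what the z-score is meant to give.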

403
00:22:35,000 --> 00:22:40,000
Now what is the difference between batch normalization and layer normalization I will talk about it

404
00:22:40,000 --> 00:22:40,000
okay.

405
00:22:40,000 --> 00:22:48,000
So here let's say I will also go ahead and talk about batch normalization versus layer normalization.

406
00:22:48,000 --> 00:22:56,000
Because if I consider my neural network, okay.

407
00:22:57,000 --> 00:22:59,000
Because if I consider transformer.

408
00:22:59,000 --> 00:23:04,000
So in transformer specifically we apply layer normalization okay.

409
00:23:04,000 --> 00:23:09,000
So let's say over here I have my feature one my feature two okay.

410
00:23:09,000 --> 00:23:11,000
And this was my Z one output.

411
00:23:11,000 --> 00:23:12,000
This was my Z two output.

412
00:23:12,000 --> 00:23:13,000
Right.

413
00:23:13,000 --> 00:23:14,000
I had some values over here.

414
00:23:14,000 --> 00:23:15,000
Right.

415
00:23:16,000 --> 00:23:18,000
So I had some values over here.

416
00:23:18,000 --> 00:23:24,000
So batch normalization says hey go ahead and normalize each and every value like this.

417
00:23:24,000 --> 00:23:25,000
Right.

418
00:23:26,000 --> 00:23:27,000
This was batch normalization.

419
00:23:27,000 --> 00:23:34,000
Layer normalization says: hey, instead of taking the complete z1 column, what you do is that you take row

420
00:23:34,000 --> 00:23:39,000
by row, one record at a time across all the outputs of the layer, and then you normalize them.

421
00:23:40,000 --> 00:23:42,000
So you normalize in this way.

422
00:23:42,000 --> 00:23:49,000
For this you go ahead and compute your mu one sigma one and apply your z score with respect to x of

423
00:23:49,000 --> 00:23:55,000
i minus mu one, divided by sigma one. x1 will be this, x2 will be this.

424
00:23:55,000 --> 00:23:59,000
Similarly for this you go ahead and calculate your mu two sigma two.

425
00:23:59,000 --> 00:24:01,000
Then you go ahead and calculate your z score.

426
00:24:01,000 --> 00:24:05,000
That is nothing but x of i minus mu two, divided by sigma two.

427
00:24:05,000 --> 00:24:10,000
And similarly for this you go ahead and compute mu three and sigma three.

428
00:24:10,000 --> 00:24:17,000
And for this you go ahead and apply your z score: x of i minus mu three, divided by sigma

429
00:24:17,000 --> 00:24:17,000
three.

430
00:24:17,000 --> 00:24:18,000
Okay.

431
00:24:18,000 --> 00:24:27,000
And this is the way you specifically normalize in layer normalization.

432
00:24:28,000 --> 00:24:34,000
This is the basic difference between batch and layer normalization.
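A small sketch of the row-by-row idea, assuming each row is one token's vector and that the statistics are computed across that row only. The numbers, the name `layer_norm`, and `eps` are hypothetical.

```python
import math

def layer_norm(batch, eps=1e-5):
    """Normalize each row with its own mean (mu_i) and standard deviation (sigma_i)."""
    out = []
    for row in batch:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        sigma = math.sqrt(var + eps)
        out.append([(v - mu) / sigma for v in row])   # (x_i - mu) / sigma
    return out

# Two hypothetical rows on very different scales.
tokens = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
normed = layer_norm(tokens)
```

Because every row uses only its own statistics, both rows end up with almost the same normalized values, and no row depends on what else is in the batch.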

433
00:24:36,000 --> 00:24:41,000
So guys I hope you got a basic idea with respect to batch normalization versus layer normalization.

434
00:24:41,000 --> 00:24:48,000
Now let me talk about two important learnable parameters, called gamma and beta, okay.

435
00:24:48,000 --> 00:24:56,000
These learnable parameters are also used while we are training our models okay.

436
00:24:56,000 --> 00:24:58,000
Now what are these learnable parameters.

437
00:24:58,000 --> 00:25:00,000
Why are they specifically used?

438
00:25:00,000 --> 00:25:01,000
We will discuss it.

439
00:25:01,000 --> 00:25:06,000
But before that, what is the use of layer normalization? That we will try to see. Let's say over here I have

440
00:25:06,000 --> 00:25:08,000
z one, I have z two okay.

441
00:25:08,000 --> 00:25:13,000
And this z one and z two is what it is basically the output from here.

442
00:25:13,000 --> 00:25:13,000
Right.

443
00:25:13,000 --> 00:25:16,000
So here we are basically talking about right.

444
00:25:16,000 --> 00:25:19,000
And like that we can have any number of hidden layers. For every hidden layer

445
00:25:19,000 --> 00:25:23,000
Also it is a good idea that we actually perform batch normalization okay.

446
00:25:23,000 --> 00:25:27,000
Now let's say with respect to z1 and z2, I have different values.

447
00:25:27,000 --> 00:25:31,000
And with respect to this let's say I'm also having some zero values okay.

448
00:25:33,000 --> 00:25:37,000
Let's say over here also I'm having zero values okay.

449
00:25:37,000 --> 00:25:40,000
Now for this values that I have.

450
00:25:40,000 --> 00:25:42,000
What is the use of doing layer normalization.

451
00:25:42,000 --> 00:25:49,000
See in batch normalization what we do we take this entire z one output z two output.

452
00:25:49,000 --> 00:25:50,000
When I say z one output.

453
00:25:50,000 --> 00:25:54,000
See, I'm basically considering the output of both of them, okay.

454
00:25:54,000 --> 00:25:54,000
Okay.

455
00:25:54,000 --> 00:25:55,000
Both of them?

456
00:25:55,000 --> 00:25:58,000
In the form of this particular matrix.

457
00:25:58,000 --> 00:26:00,000
So z1 is on one side and z2 is on the other side.

458
00:26:00,000 --> 00:26:01,000
Right.

459
00:26:01,000 --> 00:26:03,000
And it is in the form of matrix itself okay.

460
00:26:03,000 --> 00:26:05,000
Now understand one thing over here okay.

461
00:26:05,000 --> 00:26:06,000
Very simple.

462
00:26:06,000 --> 00:26:14,000
If I have many zeros over here, basically the output is not contributing anything.

463
00:26:14,000 --> 00:26:15,000
Right.

464
00:26:15,000 --> 00:26:19,000
When we have these kinds of numbers.

465
00:26:19,000 --> 00:26:23,000
So here we are definitely performing batch normalization.

466
00:26:23,000 --> 00:26:23,000
Right.

467
00:26:23,000 --> 00:26:27,000
We are taking this entire batch and we are normalizing it okay.

468
00:26:28,000 --> 00:26:31,000
Now once we are doing this you know that most of the values are zero.

469
00:26:31,000 --> 00:26:38,000
But for all these zero values also, we are still normalizing.

470
00:26:38,000 --> 00:26:38,000
Right.

471
00:26:38,000 --> 00:26:44,000
And we are normalizing how? By calculating your mean value for this.

472
00:26:44,000 --> 00:26:45,000
Okay.

473
00:26:45,000 --> 00:26:48,000
And then along with this you are calculating your standard deviation.

474
00:26:48,000 --> 00:26:50,000
And then you are applying z score for each and every number.

475
00:26:50,000 --> 00:26:54,000
Now when you apply the z score for each and every number, these values are also going to change, right?

476
00:26:54,000 --> 00:26:57,000
Some other values will definitely come over here.

477
00:26:57,000 --> 00:26:59,000
Similarly, over here you have these zeros.

478
00:26:59,000 --> 00:27:01,000
Now what is basically happening.

479
00:27:01,000 --> 00:27:04,000
These zeros are impacting the entire distribution.

480
00:27:04,000 --> 00:27:09,000
So this entire distribution will get impacted because of this normalization.

481
00:27:09,000 --> 00:27:09,000
Right.

482
00:27:09,000 --> 00:27:15,000
So usually in ANNs and all, we definitely apply batch normalization.

483
00:27:15,000 --> 00:27:20,000
But in transformers that we have seen we specifically go with layer normalization because in transformer

484
00:27:20,000 --> 00:27:26,000
we have these NLP kinds of use cases, where every word is specifically important considering the context

485
00:27:26,000 --> 00:27:28,000
of all the other words.

486
00:27:28,000 --> 00:27:28,000
Right.

487
00:27:28,000 --> 00:27:36,000
So what we do in layer normalization we go ahead and take each and every combination.

488
00:27:36,000 --> 00:27:38,000
We basically take each and every row, that is, each token's outputs across the layer.

489
00:27:39,000 --> 00:27:41,000
And then we go ahead and normalize it.

490
00:27:41,000 --> 00:27:43,000
Now let's say I have zero comma zero.

491
00:27:43,000 --> 00:27:47,000
Even if we go ahead and normalize this, I know I'm going to get a mean equal to zero for this.

492
00:27:47,000 --> 00:27:49,000
If I go ahead and calculate the standard deviation, it is also zero.

493
00:27:49,000 --> 00:27:52,000
So there is not much impact; no values are going to change.

494
00:27:52,000 --> 00:27:53,000
Similarly over here.

495
00:27:53,000 --> 00:27:54,000
Also, no values are going to change.

496
00:27:54,000 --> 00:27:57,000
Similarly, if I have zero over here, no values are going to change, right?

497
00:27:57,000 --> 00:28:02,000
This is the most important thing and the basic difference between layer and batch normalization.
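The zero-row argument above can be checked with a small sketch. The matrix, the helper name `standardize`, and the constant `eps` are assumptions for illustration; `eps` is what keeps the division safe when the standard deviation of an all-zero row is zero.

```python
import math

def standardize(values, eps=1e-5):
    """Apply the z-score to a list of numbers using its own mean and std."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

# Hypothetical outputs: columns are z1 and z2, the last two rows are all zeros.
z = [[4.0, 8.0], [6.0, 2.0], [0.0, 0.0], [0.0, 0.0]]

# Layer-normalization style: each row uses only its own statistics.
per_row = [standardize(row) for row in z]

# Batch-normalization style: each column uses statistics over every row.
columns = [standardize([row[j] for row in z]) for j in range(2)]
per_column = [[columns[j][i] for j in range(2)] for i in range(len(z))]
```

In `per_row` the zero rows stay at zero, while in `per_column` the same rows become non-zero and the zeros also pull the column mean and standard deviation used for the other rows.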

498
00:28:02,000 --> 00:28:05,000
Now one more question that comes to mind, Chris:

499
00:28:05,000 --> 00:28:09,000
Is it compulsory that we always have to do this normalization?

500
00:28:10,000 --> 00:28:17,000
What if the final output of z one and z two that we are getting and this distribution may be important,

501
00:28:17,000 --> 00:28:22,000
may be important for predicting the output at that point of time.

502
00:28:22,000 --> 00:28:25,000
What if I do not want to do the normalization?

503
00:28:25,000 --> 00:28:31,000
Okay, because I'm considering hey for my problem statement whatever distribution I'm getting in z one

504
00:28:31,000 --> 00:28:34,000
and z two, that is perfect for my execution.

505
00:28:34,000 --> 00:28:36,000
Let's say, for solving my problem statement.

506
00:28:37,000 --> 00:28:39,000
So if I do not want to normalize this.

507
00:28:39,000 --> 00:28:45,000
So for that we will be using two learnable parameters that is gamma and beta.

508
00:28:45,000 --> 00:28:51,000
And these learnable parameters will be specifically learnt when we are training the entire neural network.

509
00:28:51,000 --> 00:28:52,000
Right.

510
00:28:52,000 --> 00:28:54,000
So now let me just go ahead and write again.

511
00:28:54,000 --> 00:28:58,000
So Z one if you are calculating it will be some function.

512
00:28:58,000 --> 00:29:01,000
Let's say I'm going to go ahead and apply my sigmoid over here.

513
00:29:01,000 --> 00:29:02,000
Right.

514
00:29:02,000 --> 00:29:10,000
And this will be w transpose x, whatever transpose we are specifically doing, plus b1, right, or whatever

515
00:29:10,000 --> 00:29:16,000
bias b1 we are specifically getting. After this step, what happens is that we assign learnable parameters

516
00:29:16,000 --> 00:29:17,000
okay.

517
00:29:17,000 --> 00:29:21,000
And with these learnable parameters, I'll just write it as y; y will be my output.

518
00:29:21,000 --> 00:29:24,000
And here I will be using gamma okay gamma.

519
00:29:24,000 --> 00:29:30,000
And then I will go ahead and use this z1, okay.

520
00:29:30,000 --> 00:29:32,000
Minus mu one, divided by sigma one.

521
00:29:32,000 --> 00:29:35,000
Z one basically means all the numbers over here.

522
00:29:35,000 --> 00:29:38,000
We are going to subtract mu one, which we are going to calculate.

523
00:29:38,000 --> 00:29:43,000
And then along with this gamma parameter we are also going to use one beta parameter.

524
00:29:43,000 --> 00:29:45,000
And these are the learnable parameters.

525
00:29:45,000 --> 00:29:47,000
These are the learnable parameters.

526
00:29:48,000 --> 00:29:51,000
Now you can see that right learnable parameters.

527
00:29:51,000 --> 00:29:59,000
Let's say if our distribution is not important over here and we do not need to normalize it right then

528
00:29:59,000 --> 00:30:05,000
based on this gamma and beta value, we can make sure that we can tell, hey, do not normalize in this

529
00:30:05,000 --> 00:30:10,000
scenario, or do normalize in this scenario, based on the data that I have.

530
00:30:10,000 --> 00:30:10,000
Right.

531
00:30:10,000 --> 00:30:13,000
So that is the reason we use this gamma and beta value.
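One way to see how gamma and beta can switch the normalization off is the sketch below. It assumes the usual form y = gamma * (z - mu) / sigma + beta. The values of `z1` are hypothetical, and gamma and beta are set by hand here only for illustration; in training they would be learned.

```python
import math

def layer_norm_affine(x, gamma, beta, eps=1e-5):
    """Return gamma * (x - mu) / sigma + beta, with mu and sigma taken over x."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

z1 = [3.0, 7.0, 11.0, 19.0]   # hypothetical outputs
n = len(z1)
mu = sum(z1) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in z1) / n + 1e-5)

# gamma = 1 and beta = 0: plain normalization.
normalized = layer_norm_affine(z1, [1.0] * n, [0.0] * n)

# gamma = sigma and beta = mu: the scale and shift undo the normalization.
restored = layer_norm_affine(z1, [sigma] * n, [mu] * n)
```

With gamma equal to sigma and beta equal to mu, `restored` matches `z1` up to rounding, so the network can keep the original distribution when that is what works best for the data.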

532
00:30:14,000 --> 00:30:19,000
I will be showing you this entire example by solving a simple problem also as we go ahead.

533
00:30:19,000 --> 00:30:26,000
But I hope you got an idea with respect to your layer normalization and batch normalization and how

534
00:30:26,000 --> 00:30:29,000
this entire layer normalization is probably taking place.

535
00:30:29,000 --> 00:30:34,000
And we specifically use these two important parameters, gamma and beta, okay.

536
00:30:35,000 --> 00:30:38,000
Now I hope you have understood till here.

537
00:30:38,000 --> 00:30:42,000
Now we can go back over here and we can see why add and normalize is basically coming over here.

538
00:30:42,000 --> 00:30:49,000
Because after this particular output that I get, okay, it is really important to understand, okay,

539
00:30:50,000 --> 00:30:56,000
after this particular output that I am actually getting, once we combine this entire output

540
00:30:56,000 --> 00:31:04,000
with respect to the sequence length that we are getting right, this output is going to get added with

541
00:31:04,000 --> 00:31:10,000
this plus this, which I will mention it as x dash.

542
00:31:11,000 --> 00:31:15,000
So x dash is going to get added with this entire values right.

543
00:31:15,000 --> 00:31:18,000
And then we are going to apply our layer normalization.

544
00:31:18,000 --> 00:31:18,000
Right.

545
00:31:18,000 --> 00:31:25,000
So here we are going to specifically apply our layer normalization okay.
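A rough sketch of the add and normalize step for a single token, assuming `x_dash` is the input to the attention block and `attn_out` is the attention output for the same token. Both vectors and the function names are made up for illustration, and gamma and beta are left out to keep it short.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector with its own mean and standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)
    return [(v - mu) / sigma for v in x]

def add_and_norm(x_dash, sublayer_out):
    """Add the input back to the sublayer output, then layer-normalize the sum."""
    summed = [a + b for a, b in zip(x_dash, sublayer_out)]
    return layer_norm(summed)

x_dash = [0.5, 1.0, -0.5, 2.0]     # hypothetical embedding plus positional encoding
attn_out = [1.5, -1.0, 0.5, 1.0]   # hypothetical multi-head attention output
y = add_and_norm(x_dash, attn_out)
```

The addition comes first and the normalization is applied to the sum, which is the order described above.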

546
00:31:26,000 --> 00:31:27,000
Perfect.

547
00:31:27,000 --> 00:31:30,000
So that is what is all about layer normalization.

548
00:31:30,000 --> 00:31:36,000
Now what I will do I will go ahead and solve a basic problem, okay.

549
00:31:36,000 --> 00:31:38,000
Just by uh, putting some numbers.

550
00:31:38,000 --> 00:31:46,000
And when we use this gamma and beta, these are also called scale and shift parameters.

551
00:31:46,000 --> 00:31:56,000
So this gamma and this beta are also called scale and shift parameters.

552
00:31:57,000 --> 00:31:57,000
Okay.

553
00:31:58,000 --> 00:32:02,000
Now let us go ahead and let us take a simple problem.

554
00:32:02,000 --> 00:32:02,000
Okay?

555
00:32:02,000 --> 00:32:08,000
So I will take a simple problem where I will take an example of Cat as a token.

556
00:32:08,000 --> 00:32:17,000
Let's say I will denote this cat as token embeddings like 2.0, 4.0, 6.0, 8.0.

557
00:32:17,000 --> 00:32:23,000
Let's consider that this is my four-dimensional embedding, and the next parameters that I'm actually going to define

558
00:32:25,000 --> 00:32:26,000
are my gamma and beta.

559
00:32:26,000 --> 00:32:28,000
Because these are the learnable parameters.

560
00:32:28,000 --> 00:32:30,000
Initially this will be initialized.

561
00:32:30,000 --> 00:32:32,000
Later on we'll train them.

562
00:32:32,000 --> 00:32:34,000
We learn them. 1.0, 1.0.

563
00:32:34,000 --> 00:32:38,000
So let's go ahead and initialize gamma to 1.0, and beta

564
00:32:38,000 --> 00:32:43,000
I will go ahead and initialize to 0.0, 0.0, 0.0, 0.0.

565
00:32:43,000 --> 00:32:45,000
The reason is very simple.

566
00:32:45,000 --> 00:32:48,000
I have seen the research paper, and there also they have initialized them like this.

567
00:32:48,000 --> 00:32:48,000
Okay.

568
00:32:48,000 --> 00:32:53,000
Now we will go ahead and perform step by step calculation, like how specifically it happens with respect

569
00:32:53,000 --> 00:32:57,000
to the transformer and how this entire process basically happens.

570
00:32:57,000 --> 00:32:57,000
Okay.

571
00:32:58,000 --> 00:33:02,000
Um, so we will just go ahead and apply this normalization.

572
00:33:02,000 --> 00:33:04,000
We will try to calculate.

573
00:33:04,000 --> 00:33:06,000
We will try to find out the normalized vector.

574
00:33:06,000 --> 00:33:07,000
Then we'll do the scale and shift.

575
00:33:07,000 --> 00:33:10,000
And then we will finally get our y value okay.
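The steps listed above can be sketched with the numbers given for the cat token. The population variance (dividing by the number of dimensions) and the small constant `eps` are assumptions here; the lecture has not stated them at this point.

```python
import math

cat = [2.0, 4.0, 6.0, 8.0]        # token embedding from the example
gamma = [1.0, 1.0, 1.0, 1.0]      # scale, initialized to one
beta = [0.0, 0.0, 0.0, 0.0]       # shift, initialized to zero
eps = 1e-5                        # assumed constant to avoid division by zero

# Step 1: mean and variance over the four dimensions.
mu = sum(cat) / len(cat)                              # 5.0
var = sum((v - mu) ** 2 for v in cat) / len(cat)      # 5.0

# Step 2: normalized vector.
x_hat = [(v - mu) / math.sqrt(var + eps) for v in cat]

# Step 3: scale and shift to get y.
y = [g * h + b for g, h, b in zip(gamma, x_hat, beta)]

print([round(v, 4) for v in y])   # [-1.3416, -0.4472, 0.4472, 1.3416]
```

With gamma at one and beta at zero, y is just the normalized vector; the two parameters only start to matter once training moves them away from these initial values.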

576
00:33:10,000 --> 00:33:14,000
So this entire computation we will do. Remember one thing over here:

577
00:33:14,000 --> 00:33:17,000
I have not taken w transpose x.

578
00:33:17,000 --> 00:33:22,000
Instead, I'm directly taking this particular vector to show how one vector gets normalized.

579
00:33:22,000 --> 00:33:28,000
That is what I am actually planning to show you: how these vectors will get normalized.

580
00:33:29,000 --> 00:33:30,000
Okay, normalized.

581
00:33:30,000 --> 00:33:33,000
That is what I'm actually going to show you over here.

582
00:33:33,000 --> 00:33:36,000
So let's see this in the next video.

