1
00:00:00,000 --> 00:00:00,000
In this video.

2
00:00:00,000 --> 00:00:03,000
In this video we are going to talk about quantization.

3
00:00:03,000 --> 00:00:04,000
Exactly.

4
00:00:04,000 --> 00:00:04,000
Quantization.

5
00:00:04,000 --> 00:00:09,000
We are going to discuss full precision and half precision.

6
00:00:09,000 --> 00:00:14,000
And this is something related to data types like how the data is stored in the memory.

7
00:00:14,000 --> 00:00:20,000
When I say data in LLMs, I am talking about weights and parameters.

8
00:00:20,000 --> 00:00:21,000
Right.

9
00:00:21,000 --> 00:00:26,000
Because at the end of the day, an LLM is also a deep learning neural network in the form of a Transformer

10
00:00:26,000 --> 00:00:27,000
or BERT.

11
00:00:27,000 --> 00:00:27,000
Right.

12
00:00:28,000 --> 00:00:31,000
Then we are going to discuss what exactly calibration is.

13
00:00:32,000 --> 00:00:36,000
More precisely, calibration in model quantization.

14
00:00:36,000 --> 00:00:36,000
Right.

15
00:00:36,000 --> 00:00:39,000
We will also work through some problems.

16
00:00:40,000 --> 00:00:40,000
Right.

17
00:00:40,000 --> 00:00:42,000
How we can actually do calibration.

18
00:00:42,000 --> 00:00:45,000
Then there are different modes of quantization.

19
00:00:45,000 --> 00:00:46,000
Right.

20
00:00:46,000 --> 00:00:49,000
First of all, I will explain the definition.

21
00:00:49,000 --> 00:00:52,000
Only then will you be able to follow the modes of quantization, where we are going to discuss two

22
00:00:52,000 --> 00:00:52,000
types.

23
00:00:52,000 --> 00:00:56,000
One is post-training quantization and the other is quantization-aware training.

24
00:00:56,000 --> 00:00:57,000
Right.

25
00:00:57,000 --> 00:01:01,000
So all of these are very important for fine-tuning techniques.

26
00:01:01,000 --> 00:01:05,000
Now let's go ahead and talk about quantization.

27
00:01:05,000 --> 00:01:08,000
And we will try to see the definition right.

28
00:01:08,000 --> 00:01:10,000
Quantization okay.

29
00:01:10,000 --> 00:01:19,000
Now if you want to really understand the meaning of quantization it is better to write a simple definition

30
00:01:19,000 --> 00:01:19,000
for it.

31
00:01:19,000 --> 00:01:20,000
Okay.

32
00:01:20,000 --> 00:01:25,000
So quantization basically means conversion from.

33
00:01:27,000 --> 00:01:31,000
Higher memory format.

34
00:01:33,000 --> 00:01:39,000
To a lower memory format.

35
00:01:40,000 --> 00:01:41,000
Right.

36
00:01:43,000 --> 00:01:47,000
Now, I have written a very generic definition.

37
00:01:47,000 --> 00:01:49,000
What exactly does quantization mean?

38
00:01:49,000 --> 00:01:53,000
It is nothing but conversion from a higher memory format to a lower memory format.

39
00:01:54,000 --> 00:01:59,000
Now, when I say higher memory format, let's consider any data.

40
00:01:59,000 --> 00:01:59,000
Right.

41
00:01:59,000 --> 00:02:02,000
And if I probably consider any neural network okay.

42
00:02:02,000 --> 00:02:04,000
So let's say if I have neural network.

43
00:02:04,000 --> 00:02:05,000
Right.

44
00:02:05,000 --> 00:02:09,000
And when we train this neural network, all of its neurons are interconnected.

45
00:02:09,000 --> 00:02:13,000
At the end of the day, what are the parameters involved here?

46
00:02:13,000 --> 00:02:15,000
It is nothing but weights, right?

47
00:02:15,000 --> 00:02:17,000
We specifically have weights.

48
00:02:17,000 --> 00:02:18,000
Now.

49
00:02:18,000 --> 00:02:20,000
Weights are usually in the form of a matrix.

50
00:02:21,000 --> 00:02:21,000
Right.

51
00:02:21,000 --> 00:02:24,000
Let's say that I have a 3 by 3 weight matrix, okay.

52
00:02:24,000 --> 00:02:28,000
I'm just taking an example: in one of the layers I have 3 by 3 weights.

53
00:02:28,000 --> 00:02:41,000
And here every value is stored in memory using 32 bits, right, 32

54
00:02:41,000 --> 00:02:42,000
bits.

55
00:02:42,000 --> 00:02:48,000
We also denote this as something like FP32.

56
00:02:48,000 --> 00:02:50,000
Now what exactly is FP32?

57
00:02:50,000 --> 00:02:55,000
So FP32, see, FP:

58
00:02:55,000 --> 00:02:57,000
the full form is floating point.

59
00:02:57,000 --> 00:02:58,000
Okay, so

60
00:02:58,000 --> 00:03:01,000
I'm just writing floating point, 32 bits, right?

61
00:03:02,000 --> 00:03:03,000
And when I say FP32,

62
00:03:03,000 --> 00:03:07,000
it is nothing but what we call full precision.

63
00:03:08,000 --> 00:03:08,000
Right?

64
00:03:08,000 --> 00:03:10,000
Full precision or single precision.

65
00:03:11,000 --> 00:03:14,000
Okay, so this is the terminology that is commonly used, right.

66
00:03:14,000 --> 00:03:17,000
But in short, this is a floating-point number, okay.

67
00:03:17,000 --> 00:03:21,000
So over here, let's say my number is somewhere around 7.23.

68
00:03:21,000 --> 00:03:27,000
Now this number is stored using 32 bits in memory, right.

69
00:03:28,000 --> 00:03:35,000
Now understand, when you have a very big neural network or you have LLMs, right.

70
00:03:35,000 --> 00:03:37,000
As you see with the different LLMs.

71
00:03:37,000 --> 00:03:38,000
Right.

72
00:03:38,000 --> 00:03:40,000
The number of parameters keeps on increasing.

73
00:03:40,000 --> 00:03:40,000
Right.

74
00:03:40,000 --> 00:03:42,000
Some may have 70 billion parameters.

75
00:03:42,000 --> 00:03:49,000
If I consider Llama 2 with 70 billion parameters, that basically means it has 70 billion

76
00:03:49,000 --> 00:03:51,000
parameters in terms of weights and biases.

77
00:03:52,000 --> 00:03:52,000
Okay.

78
00:03:52,000 --> 00:03:59,000
Now, let's say I want to use this particular model, and I want to

79
00:03:59,000 --> 00:04:03,000
do some fine-tuning on the ordinary GPU that I have.

80
00:04:03,000 --> 00:04:03,000
Right.

81
00:04:03,000 --> 00:04:06,000
And let's say I have very limited RAM in my system.

82
00:04:06,000 --> 00:04:09,000
Let's say the RAM I have is somewhere around 32 GB, right?

83
00:04:09,000 --> 00:04:16,000
I cannot directly download this model and load it into my RAM, right.

84
00:04:16,000 --> 00:04:22,000
Or, let's say, load it into the VRAM that is available on the GPU, because the GPU also has limited memory,

85
00:04:22,000 --> 00:04:22,000
right.

86
00:04:23,000 --> 00:04:26,000
And that is not possible; you cannot directly load it.

87
00:04:26,000 --> 00:04:31,000
Obviously it will require space. The other way is that I can take cloud resources somewhere.

88
00:04:31,000 --> 00:04:34,000
Let's say in AWS, I can create my instance.

89
00:04:34,000 --> 00:04:39,000
I can say, hey, give me this much RAM, 64 GB of RAM, and I want this much GPU.

90
00:04:39,000 --> 00:04:43,000
And then I will load the model over there. That you can do.

91
00:04:43,000 --> 00:04:49,000
But over there, a lot of cost is involved, right?

92
00:04:49,000 --> 00:04:53,000
The cost depends on the resources you use.

93
00:04:53,000 --> 00:04:54,000
Right.

94
00:04:54,000 --> 00:04:54,000
right.

95
00:04:55,000 --> 00:05:01,000
So in this scenario, why does this 70 billion parameter model need so much memory?

96
00:05:01,000 --> 00:05:01,000
Why is it so large?

97
00:05:01,000 --> 00:05:07,000
Because every weight and every bias in this model may be getting stored

98
00:05:07,000 --> 00:05:08,000
in 32 bits.

99
00:05:09,000 --> 00:05:19,000
So what we can do is convert these 32 bits, as I said over here, see: conversion

100
00:05:19,000 --> 00:05:21,000
from a higher memory format to a lower memory format.

101
00:05:21,000 --> 00:05:28,000
Let's say I can convert this 32-bit format into INT8 and then download the model, or then use

102
00:05:28,000 --> 00:05:33,000
the model. After doing this, what will happen on my system?

103
00:05:33,000 --> 00:05:36,000
I will be able to run inference with it, right?

104
00:05:36,000 --> 00:05:42,000
Obviously, if I want to fine-tune it on a new dataset, I will require a GPU,

105
00:05:42,000 --> 00:05:48,000
but when it comes to inferencing, it becomes quite easy because now all my values that

106
00:05:48,000 --> 00:05:52,000
were stored in the form of 32 bits will be stored in the form of 8 bits.
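
To put rough numbers on this, here is a minimal Python sketch of how much memory the weights alone would need at each precision. The helper name `weight_memory_gb` and the choice of 1 GB = 10^9 bytes are assumptions made for this illustration; real usage is higher, because activations and other runtime buffers also need memory.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Size of the weights alone, in gigabytes (assuming 1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e9

# A 70 billion parameter model, as in the Llama 2 example.
params = 70_000_000_000
for dtype in ("fp32", "fp16", "int8"):
    print(dtype, weight_memory_gb(params, dtype), "GB")
```

Under these assumptions the weights take about 280 GB in FP32, 140 GB in FP16, and 70 GB in INT8, which is why a lower memory format makes the model so much easier to load.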

107
00:05:53,000 --> 00:05:59,000
So what we are specifically doing over here, we are converting from a high memory format to a low memory

108
00:05:59,000 --> 00:05:59,000
format.

109
00:05:59,000 --> 00:06:03,000
And this is what is called quantization.

110
00:06:04,000 --> 00:06:06,000
A very important thing.

111
00:06:06,000 --> 00:06:12,000
Now, why is quantization important? Because you will be able to run inference quickly.

112
00:06:12,000 --> 00:06:15,000
See, what does inferencing basically mean? If I have an LLM and

113
00:06:15,000 --> 00:06:19,000
I give it any input, I should be able to get an output, right.

114
00:06:19,000 --> 00:06:21,000
I should be able to get a response, right.

115
00:06:21,000 --> 00:06:25,000
Now, when I give any input, all the calculations with respect to the different weights will happen,

116
00:06:26,000 --> 00:06:26,000
right?

117
00:06:26,000 --> 00:06:32,000
And obviously if I have a bigger GPU, this inferencing will happen quickly, right?

118
00:06:32,000 --> 00:06:37,000
But if I have a GPU with fewer cores, let's say, then what will happen?

119
00:06:37,000 --> 00:06:45,000
This calculation will take time. But if I convert my 32 bits to 8 bits, every weight is

120
00:06:45,000 --> 00:06:46,000
converted into 8 bits.

121
00:06:46,000 --> 00:06:48,000
Now just imagine the calculation.

122
00:06:48,000 --> 00:06:50,000
Will there be a difference?

123
00:06:50,000 --> 00:06:52,000
Yes, it will happen quicker.

124
00:06:53,000 --> 00:06:57,000
So quantization is very important for inferencing.

125
00:06:57,000 --> 00:07:02,000
And some of the examples that I can probably talk about is that and obviously you may have heard about

126
00:07:02,000 --> 00:07:03,000
this.

127
00:07:03,000 --> 00:07:04,000
It's not only for LLMs.

128
00:07:04,000 --> 00:07:08,000
We do this in different computer vision models as well.

129
00:07:08,000 --> 00:07:13,000
In NLP models it is the very same thing: there are a lot of weights involved, and all these weights,

130
00:07:13,000 --> 00:07:18,000
if I want to quantize them, I can actually do it. Now,

131
00:07:18,000 --> 00:07:23,000
This inferencing, let's say I want to use a specific deep learning model in my mobile phone.

132
00:07:24,000 --> 00:07:31,000
So if I want to use it on my mobile phone, in a specific app, then what will I do? I will

133
00:07:31,000 --> 00:07:32,000
try to quantize, right?

134
00:07:32,000 --> 00:07:37,000
Whatever deep learning model I have created from 32 bits to eight bit, and then I will try to deploy

135
00:07:37,000 --> 00:07:40,000
it in my mobile phone or any edge device.

136
00:07:41,000 --> 00:07:43,000
Any edge device, right.

137
00:07:43,000 --> 00:07:46,000
It is not possible to deploy this big model over there.

138
00:07:47,000 --> 00:07:47,000
Right.

139
00:07:47,000 --> 00:07:48,000
It is not possible.

140
00:07:48,000 --> 00:07:50,000
It has so many parameters.

141
00:07:50,000 --> 00:07:53,000
So what we do we basically perform quantization.

142
00:07:53,000 --> 00:07:57,000
So I hope you are able to understand that quantization is nothing

143
00:07:57,000 --> 00:08:04,000
but lowering the memory used by whatever weights we have, like from

144
00:08:04,000 --> 00:08:07,000
32 bits to probably 16 bits,

145
00:08:07,000 --> 00:08:10,000
INT8 or, let's say, FP16.

146
00:08:10,000 --> 00:08:11,000
We can also say FP16, right.

147
00:08:11,000 --> 00:08:19,000
Let's say I have FP32, where 32 bits are required to store any value in memory;

148
00:08:19,000 --> 00:08:23,000
I can also convert this into FP16, 16 bits.

149
00:08:23,000 --> 00:08:26,000
This is also quantization, right?

150
00:08:28,000 --> 00:08:31,000
Usually all these values are stored in floating point, right.

151
00:08:31,000 --> 00:08:33,000
We call this FP32, 32 bits.

152
00:08:33,000 --> 00:08:36,000
We say single precision or full precision.

153
00:08:36,000 --> 00:08:38,000
And for FP16, 16 bits,

154
00:08:38,000 --> 00:08:41,000
if I convert like this, it is called half precision.
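
To see the difference between full and half precision on an actual number, here is a small Python sketch using only the standard library `struct` module, whose `"f"` and `"e"` format codes store a value as 32-bit and 16-bit IEEE 754 floats. The helper name `roundtrip` is made up for this illustration, and 7.23 is just the example value from earlier.

```python
import struct

def roundtrip(value, fmt):
    """Store `value` in the given struct format, then read it back.

    Returns the value as it survives in that format and the bytes used.
    """
    raw = struct.pack(fmt, value)
    return struct.unpack(fmt, raw)[0], len(raw)

x = 7.23
as_fp32, fp32_bytes = roundtrip(x, "<f")  # single (full) precision
as_fp16, fp16_bytes = roundtrip(x, "<e")  # half precision

print(fp32_bytes, "bytes ->", as_fp32)
print(fp16_bytes, "bytes ->", as_fp16)
```

Half precision uses 2 bytes instead of 4, and the value read back is a little further from 7.23 than the single-precision one. That small gap is the precision we give up for the memory saving.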

155
00:08:42,000 --> 00:08:42,000
Right.

156
00:08:42,000 --> 00:08:45,000
So you should be able to understand all these technical terms.

157
00:08:45,000 --> 00:08:46,000
Right.

158
00:08:46,000 --> 00:08:48,000
And in short, these are nothing

159
00:08:48,000 --> 00:08:50,000
but floating-point numbers. Now,

160
00:08:50,000 --> 00:08:55,000
similarly, when we work with TensorFlow, you'll find

161
00:08:55,000 --> 00:08:58,000
tf.float32, right.

162
00:08:58,000 --> 00:09:01,000
The numbers are stored in this particular data type.

163
00:09:01,000 --> 00:09:02,000
Right.

164
00:09:02,000 --> 00:09:04,000
And it is important okay.

165
00:09:04,000 --> 00:09:07,000
These are all terminologies that are super important.

166
00:09:07,000 --> 00:09:09,000
But I hope you got an idea why.

167
00:09:09,000 --> 00:09:11,000
What is the main aim?

168
00:09:11,000 --> 00:09:16,000
The main motivation for quantization is that if I have a bigger model, I should be able

169
00:09:16,000 --> 00:09:22,000
to quantize it and make it a smaller model so that I can use it for faster inferencing,

170
00:09:22,000 --> 00:09:29,000
both on mobile phones and on edge devices, let's say even on smartwatches.

171
00:09:29,000 --> 00:09:30,000
I want to use it over here.

172
00:09:30,000 --> 00:09:38,000
I can actually do that. Now, if I talk about LLMs, quantization helps there too.

173
00:09:38,000 --> 00:09:44,000
See, once we compress the model, later on we can also perform fine-tuning.

174
00:09:44,000 --> 00:09:45,000
Right.

175
00:09:45,000 --> 00:09:46,000
Fine tuning.

176
00:09:46,000 --> 00:09:50,000
But here there is one disadvantage when we quantize.

177
00:09:50,000 --> 00:09:51,000
Right.

178
00:09:51,000 --> 00:09:57,000
When we perform quantization, since we are converting from 32 bits to INT8, let's say as an example,

179
00:09:57,000 --> 00:10:00,000
there is some loss of information also.

180
00:10:00,000 --> 00:10:02,000
And because of this there will be some loss of accuracy.

181
00:10:03,000 --> 00:10:05,000
Now how to overcome this?

182
00:10:05,000 --> 00:10:06,000
We will talk about it.

183
00:10:06,000 --> 00:10:10,000
There are different techniques for overcoming it.

184
00:10:10,000 --> 00:10:12,000
But I hope you got an idea of

185
00:10:12,000 --> 00:10:14,000
what exactly quantization is.

186
00:10:14,000 --> 00:10:15,000
What is full precision?

187
00:10:15,000 --> 00:10:16,000
And half precision?

188
00:10:16,000 --> 00:10:18,000
The example is something like this.

189
00:10:18,000 --> 00:10:22,000
Now let's talk about what exactly is calibration.

190
00:10:23,000 --> 00:10:29,000
Now, calibration basically means how we convert this 32-bit value into INT8.

191
00:10:29,000 --> 00:10:31,000
Like, what is the formula?

192
00:10:31,000 --> 00:10:34,000
What is the mathematical intuition that is required?

193
00:10:34,000 --> 00:10:36,000
Let's go ahead and discuss that.

194
00:10:37,000 --> 00:10:42,000
So guys now let's go ahead and try to understand how to perform quantization.

195
00:10:42,000 --> 00:10:50,000
And this is super important in terms of mathematical concept that I'm probably going to talk about because

196
00:10:50,000 --> 00:10:54,000
with the help of TensorFlow, just by writing four lines of code, I will be able to

197
00:10:54,000 --> 00:10:57,000
perform quantization, but it is important

198
00:10:57,000 --> 00:10:59,000
that you know how to do it manually.

199
00:11:00,000 --> 00:11:05,000
Now, what types of quantization do we have?

200
00:11:05,000 --> 00:11:08,000
So we have two different types of quantization.

201
00:11:08,000 --> 00:11:13,000
One is symmetric quantization and the other is asymmetric quantization.

202
00:11:13,000 --> 00:11:17,000
Now just by showing you an example, you will be able to understand what is the exact difference between

203
00:11:17,000 --> 00:11:18,000
them okay.

204
00:11:18,000 --> 00:11:25,000
Let's say I have a task, and the first task that I am going to talk about is symmetric quantization.

205
00:11:25,000 --> 00:11:26,000
Now understand this.

206
00:11:26,000 --> 00:11:30,000
I hope that in deep learning you have heard of something called batch normalization.

207
00:11:31,000 --> 00:11:34,000
So if you have heard about batch normalization,

208
00:11:34,000 --> 00:11:39,000
it gives a useful intuition for symmetric quantization, although it is not a quantization technique itself.

209
00:11:39,000 --> 00:11:39,000
Right.

210
00:11:39,000 --> 00:11:45,000
Whenever we do forward propagation and backward propagation, between

211
00:11:45,000 --> 00:11:51,000
the layers we apply batch normalization so that the values flowing through the network are zero-centered, that

212
00:11:51,000 --> 00:11:53,000
is, near zero.

213
00:11:53,000 --> 00:11:56,000
And the entire distribution of those values will be centered near zero.

214
00:11:56,000 --> 00:11:56,000
Okay.

215
00:11:56,000 --> 00:12:01,000
So symmetric quantization works with that same kind of zero-centered distribution.

216
00:12:01,000 --> 00:12:03,000
So let's go ahead and see one example.

217
00:12:03,000 --> 00:12:07,000
So this will be my first example over here I will go ahead and write it down.

218
00:12:07,000 --> 00:12:11,000
And now you'll be able to understand how symmetric quantization is performed.

219
00:12:11,000 --> 00:12:16,000
Now, what quantization is you have understood: from a higher memory format to a lower memory format

220
00:12:16,000 --> 00:12:17,000
we try to convert.

221
00:12:17,000 --> 00:12:18,000
Okay.

222
00:12:18,000 --> 00:12:20,000
So here we are trying to understand the mathematical intuition.

223
00:12:20,000 --> 00:12:31,000
So let's go ahead and talk about one technique, which is called symmetric unsigned

224
00:12:31,000 --> 00:12:33,000
int8 quantization, okay.

225
00:12:33,000 --> 00:12:35,000
So we will see this technique first.

226
00:12:36,000 --> 00:12:39,000
Now, what is our main aim here?

227
00:12:39,000 --> 00:12:44,000
Let's say I have a floating point number okay.

228
00:12:44,000 --> 00:12:46,000
So let's go ahead and write it down.

229
00:12:46,000 --> 00:12:51,000
Let's say I have floating-point numbers between 0 and 1000.

230
00:12:51,000 --> 00:12:54,000
Now just imagine that these are my weights, right.

231
00:12:54,000 --> 00:12:59,000
Whatever weight matrix I have, my values range between 0 and 1000.

232
00:12:59,000 --> 00:13:01,000
Let's consider it this way.

233
00:13:01,000 --> 00:13:01,000
Right.

234
00:13:02,000 --> 00:13:07,000
And let's say these are the weights of my larger model, okay.

235
00:13:07,000 --> 00:13:10,000
Larger model okay.

236
00:13:11,000 --> 00:13:15,000
Now one very important thing that you really need to understand, right.

237
00:13:15,000 --> 00:13:19,000
When I talk about any larger model, let's consider any LLM, okay.

238
00:13:19,000 --> 00:13:22,000
So in an LLM you have a lot of parameters.

239
00:13:22,000 --> 00:13:25,000
Let's consider this is one kind of LLM.

240
00:13:25,000 --> 00:13:31,000
Now, when I say all these weights are there, they may be getting stored in 32 bits, okay.

241
00:13:31,000 --> 00:13:36,000
Usually, guys, the weights will not be in this range at all, okay.

242
00:13:36,000 --> 00:13:38,000
They will be in a much smaller range.

243
00:13:38,000 --> 00:13:41,000
So I will just treat these as some numbers, okay.

244
00:13:41,000 --> 00:13:45,000
That way you don't get confused: these are just some numbers.

245
00:13:45,000 --> 00:13:48,000
I will not consider an LLM over here.

246
00:13:48,000 --> 00:13:53,000
And let's say these numbers are stored in the form of 32 bits, okay.

247
00:13:54,000 --> 00:14:01,000
Now my main aim is to convert this into unsigned int8.

248
00:14:01,000 --> 00:14:05,000
That basically means 8 bits, right.

249
00:14:05,000 --> 00:14:09,000
So 8 bits basically means what? 2 raised to 8 is 256 possible values.

250
00:14:09,000 --> 00:14:14,000
And when we say unsigned, that basically means my values will range between 0 and 255.

251
00:14:14,000 --> 00:14:19,000
So I want to quantize from these values to these values.

252
00:14:19,000 --> 00:14:19,000
Okay.

253
00:14:19,000 --> 00:14:20,000
What is my aim?

254
00:14:20,000 --> 00:14:26,000
I want to quantize my range of values from 0 to 1000 into 0 to 255.

255
00:14:26,000 --> 00:14:30,000
Okay, this is what is my aim with respect to this okay.

256
00:14:30,000 --> 00:14:33,000
So this is what is my target now.

257
00:14:33,000 --> 00:14:35,000
Let's see okay.

258
00:14:35,000 --> 00:14:40,000
So if I draw a simple number line, and we will do the same thing with the weights over there,

259
00:14:40,000 --> 00:14:46,000
whatever quantization process we are specifically doing, let's say I have values between zero and this

260
00:14:46,000 --> 00:14:47,000
is basically 1000.

261
00:14:47,000 --> 00:14:47,000
Okay.

262
00:14:48,000 --> 00:14:57,000
Then what I want to do here is convert this into 0 to 255, okay.

263
00:14:57,000 --> 00:14:59,000
Now let me talk about one very important thing,

264
00:14:59,000 --> 00:14:59,000
guys.

265
00:14:59,000 --> 00:15:04,000
Whenever we have, let's consider this one,

266
00:15:04,000 --> 00:15:13,000
a single-precision 32-bit floating-point number.

267
00:15:14,000 --> 00:15:15,000
Right.

268
00:15:15,000 --> 00:15:21,000
If we have this number, how is it stored? Do you know that one bit will specifically

269
00:15:21,000 --> 00:15:25,000
be used for the sign?

270
00:15:25,000 --> 00:15:25,000
Right.

271
00:15:25,000 --> 00:15:26,000
Let's say positive or negative.

272
00:15:26,000 --> 00:15:28,000
If the number is positive, this bit will be 0.

273
00:15:28,000 --> 00:15:29,000
If the number is negative, it will be 1.

274
00:15:29,000 --> 00:15:33,000
So every bit holds one of two values, right.

275
00:15:33,000 --> 00:15:35,000
It can be 0 or 1, okay.

276
00:15:35,000 --> 00:15:47,000
Then the next eight bits are used for the exponent, okay: 1, 2, 3, 4, 5, 6, 7, 8.

277
00:15:47,000 --> 00:15:48,000
Okay.

278
00:15:48,000 --> 00:15:57,000
So this is for the sign, and all these bits that you see are used for the exponent.

279
00:15:59,000 --> 00:16:04,000
This is how it is stored in memory, and the remaining 23 bits,

280
00:16:05,000 --> 00:16:10,000
all these 23 bits, will be used for the mantissa.

281
00:16:11,000 --> 00:16:13,000
This is specifically for the fraction.

282
00:16:13,000 --> 00:16:20,000
So if I have a number like 7.32, it is a positive number.

283
00:16:20,000 --> 00:16:24,000
So the sign bit will be 0 over here.

284
00:16:25,000 --> 00:16:30,000
Then the number is written as 1.83 times 2 to the power 2, and that power goes into these eight exponent bits, stored with a bias as 129.

285
00:16:30,000 --> 00:16:34,000
And the remaining 0.83 goes into the mantissa.
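
If you want to check this layout yourself, here is a minimal Python sketch that stores 7.32 as a 32-bit float and splits the bits into the three fields. The function name `fp32_fields` is made up for this illustration; it assumes the standard IEEE 754 single-precision layout of 1 sign bit, 8 exponent bits with a bias of 127, and 23 mantissa bits.

```python
import struct

def fp32_fields(value):
    """Split a number, stored as a 32-bit float, into (sign, exponent, mantissa)."""
    # Reinterpret the 4 bytes of the float as one unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # top bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF   # next 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # last 23 bits: fraction of the 1.xxx form
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(7.32)
print("sign bit:", sign)
print("stored exponent:", exponent, "-> power of two:", exponent - 127)
print("significand:", 1 + mantissa / 2**23)
```

For 7.32 the sign bit is 0, the stored exponent is 129 (that is, 2 to the power 2), and the significand is roughly 1.83, because 1.83 times 4 is 7.32.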

286
00:16:35,000 --> 00:16:35,000
Right.

287
00:16:36,000 --> 00:16:40,000
So this is how the numbers are basically stored in the memory, right.

288
00:16:40,000 --> 00:16:49,000
If I consider FP16, half-precision 16-bit floating point, then you'll

289
00:16:49,000 --> 00:16:51,000
see there is one bit for the sign.

290
00:16:52,000 --> 00:16:55,000
There will be five bits for the exponent.

291
00:16:55,000 --> 00:16:59,000
We basically say exponent: 1, 2, 3, 4, 5.

292
00:16:59,000 --> 00:17:00,000
Let's consider these five.

293
00:17:02,000 --> 00:17:04,000
Let's let's draw till here okay.

294
00:17:04,000 --> 00:17:09,000
So let's say these are five: 1, 2, 3, 4, 5, okay.

295
00:17:09,000 --> 00:17:11,000
So this will be five.

296
00:17:11,000 --> 00:17:16,000
And the remaining ten bits will be used for the mantissa.

297
00:17:17,000 --> 00:17:18,000
Mantissa.

298
00:17:18,000 --> 00:17:20,000
So we basically call this the fraction, right.

299
00:17:20,000 --> 00:17:22,000
It is the part after the point once the number is written in its 1.xxx form.

300
00:17:22,000 --> 00:17:23,000
Right.

301
00:17:23,000 --> 00:17:26,000
And this is how you can see that FP16 will take less memory.

302
00:17:26,000 --> 00:17:28,000
FP32 will take more memory.

303
00:17:28,000 --> 00:17:30,000
Right. Now what is our main aim over here?

304
00:17:30,000 --> 00:17:32,000
I already have a 32 bit number.

305
00:17:32,000 --> 00:17:37,000
I need to convert this into the range of unsigned int8.

306
00:17:37,000 --> 00:17:40,000
Unsigned int8 basically means I will not take any negative numbers.

307
00:17:40,000 --> 00:17:42,000
It will be between 0 and 255.

308
00:17:42,000 --> 00:17:43,000
Right?

309
00:17:43,000 --> 00:17:45,000
This is what I really want to do okay.

310
00:17:45,000 --> 00:17:45,000
Okay.

311
00:17:45,000 --> 00:17:47,000
Now for this.

312
00:17:47,000 --> 00:17:48,000
What will be the equation?

313
00:17:48,000 --> 00:17:52,000
And I hope you have heard about something called the min-max scaler.

314
00:17:53,000 --> 00:17:57,000
I have repeated this many times in my machine learning sections as well.

315
00:17:57,000 --> 00:18:04,000
So for any number over here, how will I be able to convert from this floating point to this

316
00:18:04,000 --> 00:18:06,000
uint8, right.

317
00:18:06,000 --> 00:18:07,000
Unsigned int8, right?

318
00:18:07,000 --> 00:18:09,000
How will I be able to do it?

319
00:18:09,000 --> 00:18:12,000
Now for this, let's go ahead and calculate it.

320
00:18:12,000 --> 00:18:14,000
And see what equation is required.

321
00:18:14,000 --> 00:18:14,000
Okay.

322
00:18:15,000 --> 00:18:21,000
So that basically means over here what we are going to do 0.0 will be converted into a quantized value

323
00:18:21,000 --> 00:18:24,000
of zero. Okay, 0.0 will be zero.

324
00:18:24,000 --> 00:18:29,000
And similarly, 1000 should be converted to a quantized value of 255.

325
00:18:29,000 --> 00:18:31,000
At the end of the day the bits are decreasing.

326
00:18:31,000 --> 00:18:33,000
So quantization is basically happening.

327
00:18:33,000 --> 00:18:36,000
But we have to probably come up with a scale factor.

328
00:18:36,000 --> 00:18:37,000
Now what exactly is a scale factor?

329
00:18:37,000 --> 00:18:39,000
So let me define a scale over here.

330
00:18:40,000 --> 00:18:46,000
So here the scale formula will be x max minus x min,

331
00:18:46,000 --> 00:18:50,000
and that will be divided by q max minus q min.

332
00:18:51,000 --> 00:18:53,000
Here q stands for the quantized range.

333
00:18:53,000 --> 00:18:57,000
Now what is the x max over here? 1000, right.

334
00:18:57,000 --> 00:18:59,000
So this range I will consider as my x.

335
00:18:59,000 --> 00:19:02,000
And this range I will consider as my q, right.

336
00:19:02,000 --> 00:19:06,000
I'm showing you how quantization happens in a symmetric distribution.

337
00:19:06,000 --> 00:19:11,000
Symmetric basically means all the data is evenly distributed okay.

338
00:19:11,000 --> 00:19:15,000
And we really need to convert this based on this itself.

339
00:19:15,000 --> 00:19:16,000
These are evenly distributed.

340
00:19:16,000 --> 00:19:21,000
Now what is x max minus x min? It is nothing but 1000 minus 0.

341
00:19:22,000 --> 00:19:28,000
Then q max minus q min basically means 255 minus 0.

342
00:19:28,000 --> 00:19:28,000
Right.

343
00:19:28,000 --> 00:19:32,000
So if I probably go ahead with this specific division right.

344
00:19:33,000 --> 00:19:36,000
Then what value will I get?

345
00:19:36,000 --> 00:19:36,000
Right.

346
00:19:36,000 --> 00:19:40,000
1000 divided by 255 is approximately 3.92.

347
00:19:40,000 --> 00:19:42,000
So this is nothing

348
00:19:42,000 --> 00:19:44,000
but what is called the scale factor.

349
00:19:44,000 --> 00:19:45,000
Right.

350
00:19:45,000 --> 00:19:46,000
Scale factor.

351
00:19:46,000 --> 00:19:55,000
So for any number that I have over here, if I want to convert from FP32 to uint8, I just

352
00:19:55,000 --> 00:20:00,000
need to use this scale along with one function, which is called round.

353
00:20:00,000 --> 00:20:00,000
Okay.

354
00:20:00,000 --> 00:20:03,000
So I will apply round to any number.

355
00:20:03,000 --> 00:20:14,000
Let's say I take 500 divided by 3.92, or let's just take 250 divided by 3.92.

356
00:20:14,000 --> 00:20:18,000
So if I want to see what will happen to 250, right.

357
00:20:18,000 --> 00:20:21,000
What will the number be in uint8?

358
00:20:21,000 --> 00:20:23,000
So I can probably go ahead and divide it.

359
00:20:23,000 --> 00:20:28,000
It will be nothing but 250 divided by 3.92.

360
00:20:28,000 --> 00:20:32,000
So if I go and calculate it, it is approximately 63.77.

361
00:20:32,000 --> 00:20:38,000
So if I do the rounding, this will be 64.

362
00:20:38,000 --> 00:20:44,000
So in short, any number that is over here, let's say it is 250, will get converted to

363
00:20:44,000 --> 00:20:47,000
a quantized value of 64.

364
00:20:47,000 --> 00:20:47,000
Right.

365
00:20:47,000 --> 00:20:50,000
So the code will be doing the same thing.

366
00:20:50,000 --> 00:20:54,000
And this is for symmetric unsigned int8.
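A minimal Python sketch of the calculation just described, assuming the range 0 to 1000 and a uint8 target of 0 to 255 from the example. The function name quantize_unsigned and the clamping step are illustrative assumptions, not something shown in the video.
```python
# Sketch of the worked example: map floats in [0, 1000] onto uint8 [0, 255].
# quantize_unsigned is a hypothetical helper name used only for illustration.
def quantize_unsigned(x, x_min=0.0, x_max=1000.0, q_min=0, q_max=255):
    scale = (x_max - x_min) / (q_max - q_min)  # 1000 / 255, about 3.92
    q = round(x / scale)                       # divide by the scale, then round
    return max(q_min, min(q_max, q)), scale    # clamp into the uint8 range (assumed step)
q, scale = quantize_unsigned(250.0)
print(q, round(scale, 2))  # 64 3.92
```
Here 250 divided by the unrounded scale is 63.75, which also rounds to 64, so the result matches the hand calculation with 3.92.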

367
00:20:54,000 --> 00:20:55,000
Okay.

368
00:20:55,000 --> 00:21:01,000
If I want quantization for some other case, let's talk about this, okay.

369
00:21:01,000 --> 00:21:05,000
So let's say uh I have another kind of distribution.

370
00:21:05,000 --> 00:21:08,000
And this time it is asymmetric.

371
00:21:08,000 --> 00:21:11,000
And I want this as uint8.

372
00:21:12,000 --> 00:21:17,000
So if I want to perform this quantization so what will happen now in this particular case let's say

373
00:21:17,000 --> 00:21:23,000
if I have values between -20.0 and 1000.

374
00:21:24,000 --> 00:21:24,000
Okay.

375
00:21:24,000 --> 00:21:25,000
So these are my floating point.

376
00:21:25,000 --> 00:21:31,000
Now I want to perform quantization and convert this into 0 to 255.

377
00:21:32,000 --> 00:21:37,000
Right now in case of asymmetric what will happen is that in my real number section right.

378
00:21:37,000 --> 00:21:40,000
These numbers are not symmetrically distributed.

379
00:21:40,000 --> 00:21:41,000
It may be right skewed.

380
00:21:42,000 --> 00:21:43,000
It may be left skewed.

381
00:21:43,000 --> 00:21:44,000
Okay.

382
00:21:44,000 --> 00:21:49,000
So in this scenario you will be able to see that my values are ranging between -20 to 1000.

383
00:21:49,000 --> 00:21:51,000
I want to convert this into this.

384
00:21:51,000 --> 00:21:56,000
Now in this scenario if I apply the same formula, x max minus x min.

385
00:21:56,000 --> 00:21:59,000
So how will it be? 1000 minus -20.

386
00:21:59,000 --> 00:22:05,000
So minus of -20 is nothing but plus 20, so 1020 divided by 255.

387
00:22:05,000 --> 00:22:10,000
So if I probably do the calculation then you will be able to see that I will be getting somewhere around

388
00:22:10,000 --> 00:22:11,000
4.0.

389
00:22:12,000 --> 00:22:12,000
Okay.

390
00:22:13,000 --> 00:22:18,000
Now very important thing that basically means this 20.0.

391
00:22:19,000 --> 00:22:22,000
If I quantize it right.

392
00:22:22,000 --> 00:22:27,000
If I quantize it, it will get converted to something like 4.00.

393
00:22:27,000 --> 00:22:28,000
Sorry.

394
00:22:28,000 --> 00:22:31,000
This is my scale factor, okay.

395
00:22:31,000 --> 00:22:31,000
Scale factor.

396
00:22:31,000 --> 00:22:36,000
Now if I take any number and try to convert it let's consider -20.

397
00:22:36,000 --> 00:22:39,000
If I try to convert it by dividing by 4.0.

398
00:22:39,000 --> 00:22:40,000
Right.

399
00:22:40,000 --> 00:22:41,000
And if I do the round.

400
00:22:42,000 --> 00:22:43,000
So what it will become.

401
00:22:43,000 --> 00:22:44,000
Minus five.

402
00:22:44,000 --> 00:22:45,000
Right.

403
00:22:45,000 --> 00:22:46,000
Minus five of round.

404
00:22:47,000 --> 00:22:49,000
This much I will probably get minus five.

405
00:22:49,000 --> 00:22:50,000
Right.

406
00:22:50,000 --> 00:22:55,000
Now you can understand that this -20.0 is getting converted to -5.0.

407
00:22:55,000 --> 00:22:59,000
But you can see over here my distribution starts from 0 to 255.

408
00:23:00,000 --> 00:23:05,000
So how can I force this -20.0 to map to 0?

409
00:23:05,000 --> 00:23:09,000
All you have to do is go ahead and add the same number as a positive value.

410
00:23:09,000 --> 00:23:14,000
So in this case the number that you see this five right.

411
00:23:14,000 --> 00:23:18,000
This is basically called the zero point, right.

412
00:23:19,000 --> 00:23:25,000
So there are two important parameters that we specifically talk about with respect to quantization.

413
00:23:25,000 --> 00:23:29,000
One is the zero point and the other is the scale. For the above one,

414
00:23:29,000 --> 00:23:38,000
since we have a symmetrical distribution here, the zero point was zero only and the scale was 3.92.

415
00:23:38,000 --> 00:23:44,000
In this particular case, since it is an asymmetric distribution, here we have the zero point as nothing

416
00:23:44,000 --> 00:23:45,000
but five.

417
00:23:45,000 --> 00:23:48,000
But scale is 4.0.

418
00:23:48,000 --> 00:23:55,000
So these two parameters we usually require to perform quantization.
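The asymmetric example above can be sketched in Python as follows, assuming the range -20 to 1000 and a uint8 target. The helper names asymmetric_params, quantize and dequantize are hypothetical, and the clamping and dequantize steps are common additions rather than something stated in the video.
```python
# Sketch: asymmetric quantization of floats in [-20, 1000] onto uint8 [0, 255].
def asymmetric_params(x_min, x_max, q_min=0, q_max=255):
    scale = (x_max - x_min) / (q_max - q_min)    # (1000 - (-20)) / 255 = 4.0
    zero_point = q_min - round(x_min / scale)    # 0 - round(-5.0) = 5
    return scale, zero_point
def quantize(x, scale, zero_point, q_min=0, q_max=255):
    q = round(x / scale) + zero_point            # round, then shift by the zero point
    return max(q_min, min(q_max, q))             # clamp into the target range (assumed step)
def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale              # approximate recovery of the float
scale, zp = asymmetric_params(-20.0, 1000.0)
print(scale, zp)                                                # 4.0 5
print(quantize(-20.0, scale, zp), quantize(1000.0, scale, zp))  # 0 255
```
With the zero point of 5 added, -20.0 lands on 0 and 1000 lands on 255, which is the full uint8 range.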

419
00:23:55,000 --> 00:23:55,000
Okay.

420
00:23:55,000 --> 00:23:58,000
And these are some of the examples that I have shown you.

421
00:23:58,000 --> 00:24:02,000
To just give you an idea like how quantization basically happens.

422
00:24:02,000 --> 00:24:07,000
And super important in terms of understanding the simple equations.

423
00:24:07,000 --> 00:24:12,000
You'll be able to understand how things are basically working right at the end of the day.

424
00:24:13,000 --> 00:24:20,000
Understand, quantization is a simple process of converting high precision, that is single precision or

425
00:24:20,000 --> 00:24:24,000
full precision 32-bit floating point, into fewer bits.

426
00:24:24,000 --> 00:24:28,000
You know, it can be unsigned 8-bit integer.

427
00:24:28,000 --> 00:24:29,000
It can be signed 8-bit integer.

428
00:24:29,000 --> 00:24:32,000
If we say signed 8-bit integer, then what will happen?

429
00:24:32,000 --> 00:24:36,000
It will be ranging from -128 to 127.

430
00:24:36,000 --> 00:24:39,000
And based on that, you can specifically apply the formula.

431
00:24:39,000 --> 00:24:42,000
Right now let's go ahead.

432
00:24:42,000 --> 00:24:45,000
See, we had already discussed these two topics.

433
00:24:45,000 --> 00:24:47,000
One is this.

434
00:24:47,000 --> 00:24:51,000
And second one we wanted to discuss about calibration.

435
00:24:51,000 --> 00:24:59,000
Now this squeezing that you could see right from here to here to here, here to here we are squeezing

436
00:24:59,000 --> 00:25:00,000
it right.

437
00:25:00,000 --> 00:25:03,000
This squeezing process is basically called calibration.

438
00:25:03,000 --> 00:25:07,000
Whatever process we are basically applying in this quantization process, it is nothing.

439
00:25:07,000 --> 00:25:12,000
But it is called as calibration because we are squeezing those values from a higher format to a lower

440
00:25:12,000 --> 00:25:13,000
format.

441
00:25:13,000 --> 00:25:14,000
Okay.

442
00:25:14,000 --> 00:25:16,000
So that is nothing but calibration: finding the range of the values, and from it the scale and zero point, for this squeezing.

443
00:25:16,000 --> 00:25:18,000
So we have completed both these things.

444
00:25:18,000 --> 00:25:19,000
Okay.

445
00:25:19,000 --> 00:25:22,000
Now let's see what are the different modes of quantization.

446
00:25:22,000 --> 00:25:27,000
One is called post-training quantization and the other is quantization aware training.

447
00:25:27,000 --> 00:25:28,000
Right.

448
00:25:28,000 --> 00:25:29,000
I will talk about this.

449
00:25:29,000 --> 00:25:31,000
Why both these techniques are super important, okay.

450
00:25:31,000 --> 00:25:33,000
So you will get an idea about it over here.

451
00:25:33,000 --> 00:25:40,000
So first of all we will go ahead and say post training quantization.

452
00:25:43,000 --> 00:25:46,000
So what exactly is post-training quantization.

453
00:25:46,000 --> 00:25:50,000
Here we already have a pre-trained model.

454
00:25:52,000 --> 00:25:55,000
So we already have a pre-trained model.

455
00:25:55,000 --> 00:26:00,000
Now if I want to use this pre-trained model, obviously the weights are in high precision.

456
00:26:00,000 --> 00:26:02,000
Here we apply.

457
00:26:05,000 --> 00:26:06,000
Calibration.

458
00:26:08,000 --> 00:26:08,000
Right.

459
00:26:08,000 --> 00:26:13,000
When I say calibration that basically means squeezing the value from high format to a lower format,

460
00:26:13,000 --> 00:26:14,000
right?

461
00:26:14,000 --> 00:26:20,000
And then, for performing this particular calibration, what kind of data do we take? We take the weights

462
00:26:20,000 --> 00:26:25,000
data, whatever weights data is basically there in this particular pre-trained model.

463
00:26:25,000 --> 00:26:30,000
And then we convert this into a quantized model.

464
00:26:33,000 --> 00:26:34,000
Okay.

465
00:26:34,000 --> 00:26:39,000
So once we apply this process then only we will be able to get the quantized model.

466
00:26:39,000 --> 00:26:44,000
And then we can use this entire model for any use cases okay.

467
00:26:44,000 --> 00:26:46,000
For any use cases right.

468
00:26:46,000 --> 00:26:49,000
This is a simple mechanism with respect to post-training quantization.

469
00:26:49,000 --> 00:26:54,000
See, understand: post-training basically means I already have a pre-trained model where my weights are

470
00:26:54,000 --> 00:26:55,000
fixed.

471
00:26:55,000 --> 00:26:58,000
I don't need to change those weights.

472
00:26:58,000 --> 00:27:00,000
I will just take or download those weights.

473
00:27:00,000 --> 00:27:06,000
I will take these weights, apply the calibration and then convert this into a quantized model.
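A rough Python sketch of this post-training flow, assuming a tiny list of made-up weights and a uint8 target. The names ptq and dequantize and the weight values are hypothetical, and real toolkits work per tensor or per channel with more careful range selection. The sketch also assumes the weights are not all identical, so the scale is not zero.
```python
# Sketch: post-training quantization of fixed weights, then a check of the rounding loss.
def ptq(weights, q_min=0, q_max=255):
    w_min, w_max = min(weights), max(weights)   # calibration: observe the weight range
    scale = (w_max - w_min) / (q_max - q_min)   # assumes w_max > w_min
    zero_point = q_min - round(w_min / scale)
    q = [max(q_min, min(q_max, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point
def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]
weights = [-0.8, -0.1, 0.0, 0.35, 1.2]          # hypothetical pre-trained weights
q, scale, zp = ptq(weights)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)                                # quantized weights and the worst-case error
```
The weights themselves are never retrained here. The gap between weights and restored is the information lost to rounding.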

474
00:27:06,000 --> 00:27:07,000
Right.

475
00:27:07,000 --> 00:27:13,000
The second technique that we have written over here is quantization aware training okay.

476
00:27:14,000 --> 00:27:15,000
Quantization aware training.

477
00:27:15,000 --> 00:27:17,000
So let's talk about this.

478
00:27:19,000 --> 00:27:24,000
Quantization aware training.

479
00:27:24,000 --> 00:27:27,000
This is also called QAT.

480
00:27:28,000 --> 00:27:31,000
So this is also called QAT.

481
00:27:31,000 --> 00:27:32,000
Okay.

482
00:27:32,000 --> 00:27:34,000
So we can write it as QAT.

483
00:27:35,000 --> 00:27:35,000
Okay.

484
00:27:35,000 --> 00:27:37,000
Quantization aware training.

485
00:27:37,000 --> 00:27:40,000
And this one, post-training quantization, is basically called PTQ, okay.

486
00:27:41,000 --> 00:27:44,000
So over here what is the exact difference.

487
00:27:44,000 --> 00:27:46,000
We'll try to see it between these two, okay.

488
00:27:47,000 --> 00:27:50,000
Now in quantization aware training what happens.

489
00:27:50,000 --> 00:27:51,000
See over here.

490
00:27:51,000 --> 00:27:58,000
What is the problem? If I perform calibration and create a quantized model, there is a loss

491
00:27:58,000 --> 00:27:58,000
of data.

492
00:27:59,000 --> 00:28:06,000
And because of this what will happen is that the accuracy will also decrease okay for any use cases.

493
00:28:07,000 --> 00:28:14,000
But in the case of quantization aware training okay, you will be able to see that we will be taking

494
00:28:14,000 --> 00:28:15,000
our trained model.

495
00:28:15,000 --> 00:28:17,000
Whatever trained model is there.

496
00:28:19,000 --> 00:28:20,000
Trained model is there.

497
00:28:21,000 --> 00:28:21,000
Okay.

498
00:28:21,000 --> 00:28:23,000
Let me just go ahead and write it down.

499
00:28:25,000 --> 00:28:26,000
Trained model is there.

500
00:28:26,000 --> 00:28:29,000
Then we perform quantization.

501
00:28:30,000 --> 00:28:34,000
Again, quantization is what? The same calibration process will apply over there.

502
00:28:34,000 --> 00:28:37,000
We will probably do all these things okay.

503
00:28:37,000 --> 00:28:46,000
And then once we do this, the next step is that we will go ahead and perform fine tuning.

504
00:28:49,000 --> 00:28:50,000
See, we know that.

505
00:28:52,000 --> 00:28:58,000
We know that with PTQ, as you'll be seeing over here on the top, some loss of data and accuracy

506
00:28:58,000 --> 00:28:59,000
is there.

507
00:28:59,000 --> 00:29:04,000
But here, with respect to fine tuning, we will take new training data.

508
00:29:08,000 --> 00:29:10,000
New training data.

509
00:29:10,000 --> 00:29:16,000
Now, once we specifically take new training data, we will be fine tuning this model.

510
00:29:16,000 --> 00:29:18,000
And then we create a quantized model.

511
00:29:19,000 --> 00:29:21,000
Quantized model.

512
00:29:21,000 --> 00:29:27,000
So with respect to any fine tuning technique that you will be seeing, we don't use post-training quantization.

513
00:29:27,000 --> 00:29:31,000
We specifically use quantization aware training technique.

514
00:29:31,000 --> 00:29:37,000
So that basically means we are not losing as much accuracy or data over here, because we are in turn adding

515
00:29:37,000 --> 00:29:39,000
more data for the training purpose.

516
00:29:39,000 --> 00:29:44,000
And through this we will be fine tuning our model and then we create our quantized models.

517
00:29:44,000 --> 00:29:45,000
Right.

518
00:29:45,000 --> 00:29:47,000
So this is the basic difference.

519
00:29:47,000 --> 00:29:52,000
So all the fine tuning technique that I will probably show you in the future will be of this type that

520
00:29:52,000 --> 00:29:56,000
is quantization aware training so that we do not lose much data or accuracy.

521
00:29:56,000 --> 00:30:01,000
So I hope you got an idea with respect to all these three topics, guys. Going ahead, there are two

522
00:30:01,000 --> 00:30:04,000
important techniques that we really need to understand.

523
00:30:04,000 --> 00:30:07,000
One is QLoRA and one is LoRA.

524
00:30:08,000 --> 00:30:13,000
So these techniques specifically we will also be understanding with respect to fine tuning.

525
00:30:13,000 --> 00:30:15,000
So I hope you got an idea.

526
00:30:15,000 --> 00:30:17,000
Just get to know about all these things.

527
00:30:17,000 --> 00:30:17,000
Guys.

528
00:30:17,000 --> 00:30:22,000
This is important because if someone asks in the interview what exactly it is, then you'll

529
00:30:22,000 --> 00:30:24,000
be able to explain it very easily.

530
00:30:24,000 --> 00:30:28,000
And again, explaining all these things will be important if you are really interested, because in

531
00:30:28,000 --> 00:30:32,000
generative AI, what I feel is the most important thing is fine tuning.

532
00:30:32,000 --> 00:30:34,000
So I hope you like this particular video.

533
00:30:35,000 --> 00:30:36,000
Uh, this was it from my side.

534
00:30:36,000 --> 00:30:37,000
I'll see you on the next video.

535
00:30:37,000 --> 00:30:37,000
Have a great day.

536
00:30:37,000 --> 00:30:38,000
Thank you all.

537
00:30:38,000 --> 00:30:38,000
Take care.

538
00:30:38,000 --> 00:30:38,000
Bye bye.