1
00:00:00,000 --> 00:00:05,000
We are into part two of the fine-tuning series, and in this particular video we are going to discuss

2
00:00:05,000 --> 00:00:08,000
LoRA and QLoRA, with the in-depth

3
00:00:08,000 --> 00:00:10,000
intuition.

4
00:00:10,000 --> 00:00:14,000
Already in the fine-tuning playlist I've discussed quantization, and I hope you have seen the

5
00:00:14,000 --> 00:00:15,000
earlier video.

6
00:00:15,000 --> 00:00:18,000
Many people were requesting LoRA and QLoRA also.

7
00:00:18,000 --> 00:00:25,000
So let's discuss, and we will try to cover the complete in-depth maths intuition.

8
00:00:25,000 --> 00:00:30,000
The best thing, guys, is that I'll try to explain the in-depth maths

9
00:00:30,000 --> 00:00:31,000
intuition.

10
00:00:31,000 --> 00:00:36,000
I've already seen the research paper, and there are a lot of complicated things that are probably

11
00:00:36,000 --> 00:00:38,000
there in the research paper.

12
00:00:38,000 --> 00:00:44,000
But I will try to teach you in such a way that at least you should be able to understand, you know,

13
00:00:44,000 --> 00:00:46,000
what exactly is LoRA?

14
00:00:46,000 --> 00:00:47,000
What exactly is QLoRA?

15
00:00:47,000 --> 00:00:50,000
And then we'll also see one example.

16
00:00:50,000 --> 00:00:55,000
You know, how with the help of code you will be able to do it. And trust me, these are some of the

17
00:00:55,000 --> 00:00:57,000
very important things in fine tuning.

18
00:00:57,000 --> 00:01:04,000
Because tomorrow, if you go to any interview, you are going to get asked these kinds of questions

19
00:01:04,000 --> 00:01:11,000
that may be coming, because at the end of the day, in any generative AI project, specifically with LLMs,

20
00:01:11,000 --> 00:01:17,000
right, LLM models, if you are working in a company, they will be giving you fine-tuning tasks.

21
00:01:17,000 --> 00:01:18,000
Now, to talk about the research paper.

22
00:01:18,000 --> 00:01:19,000
So what does LoRA mean?

23
00:01:19,000 --> 00:01:23,000
LoRA basically means low-rank adaptation of large language models.

24
00:01:24,000 --> 00:01:26,000
This is this amazing research paper.

25
00:01:26,000 --> 00:01:32,000
And probably you'll be seeing a lot of these kinds of equations as you go ahead.

26
00:01:32,000 --> 00:01:34,000
There'll be different performance metrics.

27
00:01:34,000 --> 00:01:37,000
But as usual, I will do what I'm good at.

28
00:01:37,000 --> 00:01:43,000
I will try to break down all these things and probably explain them to you with respect to examples, with respect

29
00:01:43,000 --> 00:01:45,000
to code and many more things.

30
00:01:46,000 --> 00:01:47,000
So quickly, let's go ahead.

31
00:01:49,000 --> 00:01:53,000
So why are LoRA and QLoRA used?

32
00:01:53,000 --> 00:01:58,000
LoRA basically means low-rank adaptation.

33
00:01:58,000 --> 00:02:03,000
It is specifically used in fine-tuning of LLM models.

34
00:02:03,000 --> 00:02:03,000
Okay.

35
00:02:03,000 --> 00:02:05,000
So let's go ahead and discuss this.

36
00:02:05,000 --> 00:02:07,000
And I've created this amazing diagram over here.

37
00:02:08,000 --> 00:02:16,000
Initially, whenever you have a pre-trained model, that basically means, let's say, that there is

38
00:02:16,000 --> 00:02:20,000
a model something like GPT-4 or GPT-4 Turbo, right?

39
00:02:21,000 --> 00:02:24,000
And this model has been created by OpenAI.

40
00:02:24,000 --> 00:02:33,000
So we basically call this model the base model, okay, GPT-4 Turbo, and this model is trained with

41
00:02:33,000 --> 00:02:34,000
a huge amount of data, right.

42
00:02:34,000 --> 00:02:38,000
So the data sources will be the internet.

43
00:02:38,000 --> 00:02:41,000
It can be books, it can be multiple sources.

44
00:02:41,000 --> 00:02:46,000
Right. At the end of the day, all these models, how can you measure them?

45
00:02:46,000 --> 00:02:51,000
They may be saying that, hey, it supports 1.5 million tokens.

46
00:02:51,000 --> 00:02:54,000
You know, it has been trained with this many number of words, right?

47
00:02:54,000 --> 00:02:56,000
So many tokens of words.

48
00:02:56,000 --> 00:03:01,000
Now what will happen is that all these models, you know, probably to predict the next word, it will

49
00:03:01,000 --> 00:03:06,000
have the context of all those tokens and then it will be able to give you the response.

50
00:03:06,000 --> 00:03:09,000
So all these models are basically base models.

51
00:03:09,000 --> 00:03:12,000
And we also call this a pre-trained model, okay.

52
00:03:12,000 --> 00:03:13,000
Some of the examples.

53
00:03:13,000 --> 00:03:20,000
Again, I'll be telling you: GPT-4, GPT-4 Turbo, right, GPT-3, GPT-3.5, something like that.

54
00:03:20,000 --> 00:03:23,000
So all these models are specifically pre-trained models.

55
00:03:23,000 --> 00:03:28,000
Now further, we can take this model and there are various ways of fine tuning it.

56
00:03:28,000 --> 00:03:31,000
Please make sure that you watch this video till the end.

57
00:03:31,000 --> 00:03:37,000
If you watch this video, you will understand everything that is actually required with respect to fine

58
00:03:37,000 --> 00:03:40,000
tuning and there are multiple ways of fine tuning also.

59
00:03:40,000 --> 00:03:42,000
Now let's say that I take this model.

60
00:03:42,000 --> 00:03:44,000
I do some amount of fine tuning okay.

61
00:03:44,000 --> 00:03:48,000
And this fine tuning is done on all the weights of this specific model.

62
00:03:48,000 --> 00:03:49,000
Right.

63
00:03:49,000 --> 00:03:54,000
So some of the applications that we may generate are like ChatGPT, you know, we may generate like Claude,

64
00:03:54,000 --> 00:03:56,000
Claude, ChatGPT.

65
00:03:56,000 --> 00:03:57,000
Right.

66
00:03:57,000 --> 00:04:00,000
Claude, ChatGPT itself, like the chatbots that we specifically use.

67
00:04:00,000 --> 00:04:01,000
Some of the examples okay.

68
00:04:01,000 --> 00:04:06,000
So this is one way of fine-tuning, where we train all our parameters.

69
00:04:06,000 --> 00:04:12,000
So here we specifically say full parameter training, okay.

70
00:04:12,000 --> 00:04:14,000
So here you can see full parameter fine tuning.

71
00:04:15,000 --> 00:04:19,000
So here I'm going to write full parameter fine tuning.

72
00:04:19,000 --> 00:04:25,000
Now this is one way of fine-tuning, where we train all our parameters based on the data that we have,

73
00:04:25,000 --> 00:04:26,000
okay.

74
00:04:26,000 --> 00:04:32,000
And after training it we can develop applications like ChatGPT or any custom GPT that you specifically

75
00:04:32,000 --> 00:04:33,000
create.

76
00:04:33,000 --> 00:04:42,000
As we go ahead, you can also take these models and perform domain specific fine tuning.

77
00:04:42,000 --> 00:04:48,000
Okay, so one type of fine-tuning is this. The other fine-tuning technique that you can

78
00:04:48,000 --> 00:04:50,000
specifically use is called domain-specific fine-tuning.

79
00:04:50,000 --> 00:04:58,000
As an example, let's say that I'm fine-tuning a chatbot model, you know, which will be for finance.

80
00:04:58,000 --> 00:05:00,000
It can be for sales.

81
00:05:00,000 --> 00:05:03,000
It can be for different domains.

82
00:05:03,000 --> 00:05:03,000
Right.

83
00:05:03,000 --> 00:05:05,000
So here the main important word is domain.

84
00:05:05,000 --> 00:05:08,000
So this fine tuning also we can perform okay.

85
00:05:08,000 --> 00:05:10,000
Again why I'm saying all these things.

86
00:05:10,000 --> 00:05:13,000
Because there are various ways of fine tuning things.

87
00:05:13,000 --> 00:05:13,000
Right.

88
00:05:13,000 --> 00:05:20,000
One more category of fine-tuning is something called specific task fine-tuning.

89
00:05:20,000 --> 00:05:22,000
Now in case of a specific task fine tuning.

90
00:05:22,000 --> 00:05:24,000
These are my different tasks.

91
00:05:24,000 --> 00:05:27,000
Let's say this is task A, B, C, D.

92
00:05:27,000 --> 00:05:31,000
This task can be something related to a Q&A chatbot.

93
00:05:31,000 --> 00:05:36,000
This task can be something related to a document Q&A chatbot, different applications.

94
00:05:36,000 --> 00:05:41,000
So that is the reason why we are specifically saying over here as specific task okay.

95
00:05:41,000 --> 00:05:43,000
Specific task fine tuning.

96
00:05:43,000 --> 00:05:44,000
Now perfect.

97
00:05:44,000 --> 00:05:45,000
Now this is good.

98
00:05:45,000 --> 00:05:48,000
You have seen all the different ways of fine tuning.

99
00:05:48,000 --> 00:05:52,000
Okay, now let's talk about full parameter fine tuning.

100
00:05:52,000 --> 00:05:53,000
Again I'll repeat it.

101
00:05:53,000 --> 00:05:55,000
This is my base model right now.

102
00:05:55,000 --> 00:06:02,000
If you take an example, as I said, GPT-4 Turbo, GPT-3.5, you know, Gemini.

103
00:06:02,000 --> 00:06:05,000
Gemini 1.5, different models can be there.

104
00:06:05,000 --> 00:06:06,000
We can take this base model.

105
00:06:06,000 --> 00:06:09,000
We can fine tune and create applications like ChatGPT.

106
00:06:09,000 --> 00:06:14,000
We can create other applications like Stable Diffusion, you know, not specific to LLMs,

107
00:06:14,000 --> 00:06:16,000
but large image models, we can actually do it.

108
00:06:16,000 --> 00:06:21,000
Then we can also further fine tune this based on domain specific fine tuning, right?

109
00:06:21,000 --> 00:06:25,000
Based on different different domains like finance, sales, retail.

110
00:06:25,000 --> 00:06:32,000
We can also take this model and do more specific task fine tuning like task A, task B, task D.

111
00:06:32,000 --> 00:06:35,000
Let's say I want to convert this into text to SQL.

112
00:06:35,000 --> 00:06:41,000
I want to have this as document Q&A so I can further fine tune it based on specific task.

113
00:06:41,000 --> 00:06:44,000
Now let's talk about this full parameter fine tuning okay.

114
00:06:44,000 --> 00:06:47,000
And what are the challenges with full parameter fine tuning.

115
00:06:47,000 --> 00:06:50,000
And that is where see I'm building up the story.

116
00:06:50,000 --> 00:06:55,000
Later on I'll be explaining to you where LoRA will be used. Now, in full parameter fine-tuning:

117
00:06:55,000 --> 00:07:00,000
The major challenge is that we really need to update all the model weights.

118
00:07:00,000 --> 00:07:02,000
Let's say that I have.

119
00:07:02,000 --> 00:07:07,000
I have a model which has somewhere around 175 billion parameters.

120
00:07:07,000 --> 00:07:10,000
That basically means 175 billion weights.

121
00:07:10,000 --> 00:07:16,000
Now, in this particular case, whenever I fine tune this model, I need to update all the weights.

122
00:07:16,000 --> 00:07:17,000
Great.

123
00:07:17,000 --> 00:07:23,000
Now when I'm saying updating all the model weights and when we talk about this many number of parameters,

124
00:07:23,000 --> 00:07:28,000
it can be a challenge because there will be hardware resource constraints.

125
00:07:28,000 --> 00:07:29,000
Right.

126
00:07:29,000 --> 00:07:36,000
So with respect to different tasks, if I really want to use this particular model, that much

127
00:07:36,000 --> 00:07:41,000
RAM I really require; for inferencing purposes, that much GPU I really require, right.

128
00:07:41,000 --> 00:07:47,000
So for downstream task it becomes very difficult.

129
00:07:47,000 --> 00:07:49,000
Now what is a downstream task?

130
00:07:49,000 --> 00:07:50,000
Downstream task.

131
00:07:50,000 --> 00:07:52,000
One example is model monitoring.

132
00:07:53,000 --> 00:07:55,000
Right, model monitoring.

133
00:07:56,000 --> 00:08:00,000
The other task can be like model inferencing.

134
00:08:01,000 --> 00:08:02,000
Right.

135
00:08:02,000 --> 00:08:04,000
Model inferencing.

136
00:08:04,000 --> 00:08:10,000
Similarly, the GPU constraint that we may have, the RAM constraint that we may have.

137
00:08:10,000 --> 00:08:17,000
So we may face multiple challenges when we go for full parameter fine

138
00:08:17,000 --> 00:08:18,000
tuning.

139
00:08:18,000 --> 00:08:25,000
Now, in order to overcome this challenge, we will specifically use LoRA and QLoRA.

140
00:08:25,000 --> 00:08:26,000
Okay.

141
00:08:26,000 --> 00:08:27,000
What exactly is LoRA?

142
00:08:27,000 --> 00:08:32,000
As I said, low-rank adaptation, and then there is QLoRA.

143
00:08:32,000 --> 00:08:34,000
It is also called LoRA 2.0.

144
00:08:34,000 --> 00:08:39,000
So we'll discuss both of them with respect to mathematical intuition, and you'll get a complete

145
00:08:39,000 --> 00:08:41,000
idea what I'm actually trying to say.

146
00:08:41,000 --> 00:08:43,000
Now what does LoRA do?

147
00:08:43,000 --> 00:08:45,000
Okay, now let's read the first point.

148
00:08:45,000 --> 00:08:51,000
And this is very much important because in the research paper you will find this equation okay.

149
00:08:51,000 --> 00:08:52,000
This equation.

150
00:08:52,000 --> 00:08:54,000
Now, what exactly will LoRA do?

151
00:08:54,000 --> 00:09:00,000
LoRA says that instead of updating all the weights as in full parameter fine-tuning, right,

152
00:09:00,000 --> 00:09:04,000
Instead of updating all the weights, it will not update them.

153
00:09:04,000 --> 00:09:08,000
Instead it will track the changes.

154
00:09:08,000 --> 00:09:11,000
It will track the changes.

155
00:09:11,000 --> 00:09:12,000
Now what changes?

156
00:09:12,000 --> 00:09:13,000
This is basically tracking.

157
00:09:13,000 --> 00:09:18,000
It will track the changes of the new weights based on fine tuning.

158
00:09:18,000 --> 00:09:18,000
Okay.

159
00:09:18,000 --> 00:09:20,000
This is very much important to understand.

160
00:09:20,000 --> 00:09:27,000
So, based on the new weights, how are we going to combine these weights with the pre-trained weights?

161
00:09:27,000 --> 00:09:31,000
Okay, so here you can see these are my pre-trained weights from the base model.

162
00:09:31,000 --> 00:09:34,000
Like, let's say that this model is Llama 2.

163
00:09:34,000 --> 00:09:42,000
Now if you are performing fine-tuning using LoRA, then LoRA will track the new weights over here,

164
00:09:42,000 --> 00:09:44,000
which will be of the same size.

165
00:09:44,000 --> 00:09:50,000
Okay, so let's say this weight matrix is three cross three; then the new weights, when it is probably

166
00:09:50,000 --> 00:09:56,000
doing the forward and the backward propagation, those new weights will be tracked in a separate matrix.

167
00:09:56,000 --> 00:10:01,000
And then these two weights will be combined wherein we will get the fine tuned weights.

168
00:10:02,000 --> 00:10:06,000
Now this way what will happen is that this tracking will happen in a separate way.

169
00:10:06,000 --> 00:10:12,000
But still you may be thinking, Krish, here also we are updating all the weights, right?

170
00:10:12,000 --> 00:10:16,000
So here also the resource constraint will definitely happen.

171
00:10:16,000 --> 00:10:18,000
Yeah fine I'm talking about three cross three.

172
00:10:18,000 --> 00:10:22,000
But what about weights and parameters where there are 175 billion?

173
00:10:22,000 --> 00:10:22,000
Right.

174
00:10:22,000 --> 00:10:25,000
175 billion parameters or 7 billion parameters.

175
00:10:25,000 --> 00:10:28,000
That time I'll be having a huge matrix.

176
00:10:28,000 --> 00:10:29,000
Right.

177
00:10:29,000 --> 00:10:37,000
So in that scenario you need to understand how LoRA works, because these weights, how they are getting

178
00:10:37,000 --> 00:10:38,000
tracked.

179
00:10:38,000 --> 00:10:40,000
They will not just get tracked in this three cross three matrix.

180
00:10:40,000 --> 00:10:48,000
Instead, for all these weights that are getting tracked, a simple mathematical equation will happen.

181
00:10:48,000 --> 00:10:54,000
Or I'll not say mathematical equation will happen, but there'll be a technique that will happen which

182
00:10:54,000 --> 00:10:56,000
is called matrix decomposition.

183
00:10:56,000 --> 00:11:03,000
That basically means the same three cross three matrix is saved in two smaller matrices.

184
00:11:04,000 --> 00:11:09,000
Now in these two smaller matrices you can see this is nothing but one cross three and this is nothing but

185
00:11:09,000 --> 00:11:10,000
three cross one.

186
00:11:10,000 --> 00:11:11,000
Right.

187
00:11:11,000 --> 00:11:12,000
Sorry.

188
00:11:12,000 --> 00:11:14,000
This is three cross one and this is one cross three.

189
00:11:14,000 --> 00:11:15,000
Right.

190
00:11:15,000 --> 00:11:18,000
So this is three cross one and this is one cross three.

191
00:11:18,000 --> 00:11:23,000
When we multiply both these matrices, then I will be getting this weight matrix only, right.

192
00:11:23,000 --> 00:11:31,000
So over here, if I count, I have around nine weights: four, five, six, seven, eight, nine.

193
00:11:31,000 --> 00:11:31,000
Right.

194
00:11:32,000 --> 00:11:39,000
You'll be able to see that I will be able to get all these nine weights from how many number of parameters?

195
00:11:39,000 --> 00:11:41,000
Just six parameters.

196
00:11:42,000 --> 00:11:43,000
Right.

197
00:11:43,000 --> 00:11:48,000
Because when we multiply this then you'll be able to see that I'll get all these nine parameters or

198
00:11:48,000 --> 00:11:49,000
nine weights that I have.
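
The three cross one times one cross three multiplication described above can be checked with a few lines of plain Python. This is only a sketch: the numbers in `B` and `A` are made up for illustration, and `matmul` is a small helper written for this example, not a library function.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

# Hypothetical values, chosen only to illustrate the shapes.
B = [[1.0], [2.0], [3.0]]   # three cross one -> 3 stored values
A = [[4.0, 5.0, 6.0]]       # one cross three -> 3 stored values

delta_W = matmul(B, A)      # three cross three -> 9 weights rebuilt from 6 values

stored = sum(len(row) for row in B) + sum(len(row) for row in A)
rebuilt = sum(len(row) for row in delta_W)

for row in delta_W:
    print(row)
print(stored, "stored parameters rebuild", rebuilt, "weights")
```

Every row of `delta_W` is a multiple of the single row in `A`, which is what it means for the product to have rank one.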

199
00:11:50,000 --> 00:11:50,000
Right.

200
00:11:50,000 --> 00:11:56,000
So in short, what LoRA is doing is that it is performing this matrix decomposition where a big matrix

201
00:11:56,000 --> 00:12:04,000
(and this matrix can be of any size) is decomposed into two smaller matrices based on a parameter, which

202
00:12:04,000 --> 00:12:05,000
is called rank.

203
00:12:06,000 --> 00:12:08,000
How do you calculate the rank of a matrix?

204
00:12:08,000 --> 00:12:11,000
You can definitely check out any YouTube channel.

205
00:12:11,000 --> 00:12:16,000
It is simple linear algebra: the rank is the number of linearly independent rows or columns of a matrix.

206
00:12:16,000 --> 00:12:18,000
How we calculate the rank.

207
00:12:18,000 --> 00:12:22,000
But let's say that this matrix that I have, which is a three cross one over here, the rank of this

208
00:12:22,000 --> 00:12:24,000
particular matrix is one.

209
00:12:24,000 --> 00:12:25,000
Right.

210
00:12:25,000 --> 00:12:32,000
And if I use these two matrices, you can obviously see that the number of parameters that I'm storing

211
00:12:32,000 --> 00:12:34,000
over here is less when compared to this, right?

212
00:12:34,000 --> 00:12:40,000
Yes, there will be a loss in precision, but it is making sure that when we combine both these matrices

213
00:12:40,000 --> 00:12:44,000
we will be able to get the entire updated weights.

214
00:12:44,000 --> 00:12:50,000
Now just imagine, guys: let's say that I have 7 billion parameters. Now I'm trying

215
00:12:50,000 --> 00:12:53,000
to perform fine tuning on those parameters.

216
00:12:53,000 --> 00:13:00,000
So whenever I track those weights, this huge matrix will be decomposed into two smaller matrices.

217
00:13:00,000 --> 00:13:07,000
And when we are decomposing this matrix, this updated tracked weights matrix, into two smaller

218
00:13:07,000 --> 00:13:11,000
matrices, obviously we will be requiring fewer parameters to store all these values.

219
00:13:12,000 --> 00:13:12,000
Right.

220
00:13:12,000 --> 00:13:16,000
And this way your fine tuning becomes very much efficient.

221
00:13:16,000 --> 00:13:20,000
And this really solves the resource constraint.

222
00:13:20,000 --> 00:13:21,000
Right.

223
00:13:21,000 --> 00:13:23,000
This is the most important thing.

224
00:13:23,000 --> 00:13:23,000
Right.

225
00:13:23,000 --> 00:13:30,000
And so in any research paper that you go through, you'll be seeing this equation: W0.

226
00:13:30,000 --> 00:13:32,000
This is my pre-trained weights.

227
00:13:32,000 --> 00:13:37,000
Plus the tracked weight changes, delta W, is nothing but my pre-trained weights

228
00:13:37,000 --> 00:13:39,000
plus B multiplied by A.

229
00:13:39,000 --> 00:13:40,000
What is B?

230
00:13:40,000 --> 00:13:42,000
B is this. A

231
00:13:42,000 --> 00:13:43,000
is this.

232
00:13:43,000 --> 00:13:49,000
So when we multiply this you'll be able to see that we are able to get all the tracked weight changes.
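
Written out, the equation being described looks roughly like this. The symbols d, k and r for the matrix dimensions and the rank follow the usual LoRA notation and are not spoken in the video, so treat them as assumptions:

```latex
W = W_0 + \Delta W = W_0 + BA,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

In the three cross three example above, d = k = 3 and r = 1, so B is three cross one and A is one cross three.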

233
00:13:49,000 --> 00:13:50,000
Right.

234
00:13:50,000 --> 00:13:55,000
And obviously this requires fewer parameters: if you're decomposing a bigger matrix into two

235
00:13:55,000 --> 00:13:57,000
smaller matrices, fewer parameters are required.

236
00:13:58,000 --> 00:14:01,000
Now what will happen if we keep on increasing the ranks?

237
00:14:01,000 --> 00:14:08,000
If we keep on increasing the ranks, this parameter count will also keep on increasing, but for small ranks it will always

238
00:14:08,000 --> 00:14:10,000
be less than this, right?

239
00:14:10,000 --> 00:14:17,000
If I have 7 billion parameters, if I try to decompose that into two smaller matrices

240
00:14:17,000 --> 00:14:22,000
with increasing rank, then also the parameters that will be required will be less.
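
To see how the count grows with rank, here is a rough sketch for a single weight matrix. The 4096 by 4096 size is a hypothetical layer shape picked for illustration, not a number from the video or the paper, and both function names are invented for this example.

```python
def full_params(d, k):
    """Weights in one full d x k matrix."""
    return d * k

def lora_params(d, k, r):
    """Weights in the two decomposed matrices: B is d x r, A is r x k."""
    return r * (d + k)

d = k = 4096  # hypothetical layer size, for illustration only
full = full_params(d, k)
for r in (1, 2, 8, 64, 512):
    small = lora_params(d, k, r)
    print(f"rank {r:>3}: {small:>9,} trainable vs {full:,} full ({100 * small / full:.2f}%)")
```

The decomposed pair stays smaller than the full matrix only while r is below d * k / (d + k), which is 2048 for this shape, so "always less" really means "less for the small ranks used in practice".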

241
00:14:22,000 --> 00:14:28,000
How am I saying this? Because in the research paper also they have tried multiple trainable parameter counts.

242
00:14:28,000 --> 00:14:31,000
Now let's see over here there are multiple techniques of fine tuning.

243
00:14:31,000 --> 00:14:37,000
Some of the techniques that are there are called prefix-embed, prefix-layer, and adapter.

244
00:14:37,000 --> 00:14:41,000
Adapter is one very famous technique that was used before LoRA.

245
00:14:41,000 --> 00:14:41,000
Right?

246
00:14:41,000 --> 00:14:46,000
You can see as the rank is increasing the parameters also increases, right?

247
00:14:46,000 --> 00:14:50,000
Initially the trainable parameters are 175 billion.

248
00:14:50,000 --> 00:14:52,000
But when I use techniques like adapter right.

249
00:14:52,000 --> 00:14:58,000
So initially it will have 7.1 million parameters with rank equal to one.

250
00:14:58,000 --> 00:15:03,000
But as I keep on increasing the rank, as I keep on increasing the rank, you'll be able to see that

251
00:15:03,000 --> 00:15:07,000
this parameter count is also increasing, but you can see, out of 175 billion parameters,

252
00:15:07,000 --> 00:15:11,000
if I compare 7.1 million weights, the percentage is very small.

253
00:15:11,000 --> 00:15:17,000
Now similarly, in LoRA, because of that matrix decomposition, you'll be able to see that

254
00:15:17,000 --> 00:15:20,000
as I keep on increasing my ranks.

255
00:15:20,000 --> 00:15:26,000
So these are my ranks with respect to Q, K, V, because in a transformer you have these three matrices: Q, K, V.

256
00:15:26,000 --> 00:15:30,000
Then all the matrix multiplication will happen with respect to these three matrices.

257
00:15:30,000 --> 00:15:35,000
And then, as we keep on increasing the rank, here you can see four, here you can see eight.

258
00:15:35,000 --> 00:15:36,000
Here you can see 64.

259
00:15:36,000 --> 00:15:40,000
Then you'll be able to see initially we got 4.7 million parameters.

260
00:15:41,000 --> 00:15:44,000
Compare from 175 billion to 4.7 million.

261
00:15:44,000 --> 00:15:49,000
How was this possible? Because of the matrix decomposition.

262
00:15:51,000 --> 00:15:54,000
Because of the matrix decomposition.

263
00:15:54,000 --> 00:15:55,000
Right.

264
00:15:55,000 --> 00:16:00,000
And as we keep on increasing the rank, you'll be seeing that the parameters are increasing.

265
00:16:00,000 --> 00:16:01,000
Right?

266
00:16:01,000 --> 00:16:02,000
The parameters are obviously increasing.

267
00:16:02,000 --> 00:16:10,000
But if I compare it with 175 billion parameters, 9.4 million is very small, if you just see the percentage.

268
00:16:10,000 --> 00:16:10,000
Right.

269
00:16:11,000 --> 00:16:17,000
So here also, when rank is equal to eight, 37.7 million; then rank is equal to 64, 301.9 million.

270
00:16:17,000 --> 00:16:18,000
Right.

271
00:16:18,000 --> 00:16:19,000
Parameters are there.

272
00:16:19,000 --> 00:16:25,000
So still it is making sure that the parameter count is not that much, not like 175 billion or

273
00:16:25,000 --> 00:16:26,000
not near to this.

274
00:16:26,000 --> 00:16:29,000
If I talk with respect to percentage, it is very, very small.

275
00:16:29,000 --> 00:16:31,000
I've also made another table.

276
00:16:31,000 --> 00:16:32,000
Right.

277
00:16:32,000 --> 00:16:35,000
Just to show you, if I have different models.

278
00:16:35,000 --> 00:16:40,000
Number of trainable parameters.

279
00:16:43,000 --> 00:16:45,000
Number of trainable parameters.

280
00:16:45,000 --> 00:16:48,000
Here you can see I have one LLM model with 7 billion.

281
00:16:48,000 --> 00:16:53,000
If I use rank equal to one, then I will be having 167K parameters to fine-tune.

282
00:16:53,000 --> 00:16:59,000
Fine-tuned weights, based on fine-tuned weights. And this 167K parameters basically means what?

283
00:16:59,000 --> 00:17:04,000
These decomposed matrices that I have, right, the two matrices, have that many parameters.

284
00:17:04,000 --> 00:17:10,000
So in the first case it is nothing but 167K parameters that are available in this decomposed

285
00:17:10,000 --> 00:17:11,000
matrix.

286
00:17:11,000 --> 00:17:11,000
Okay.

287
00:17:11,000 --> 00:17:17,000
When we combine them, then we will be able to get how many? 7 billion parameters.

288
00:17:17,000 --> 00:17:23,000
If we combine both these matrices, then we will be getting 7 billion parameters, okay. Then similarly you

289
00:17:23,000 --> 00:17:26,000
can see in 13 billion you have 228K parameters.

290
00:17:26,000 --> 00:17:29,000
In 70 billion you have 529K parameters.

291
00:17:29,000 --> 00:17:32,000
In 180 billion you have 849K parameters.

292
00:17:32,000 --> 00:17:35,000
So as you keep on increasing the number of weights, this parameter count will keep on increasing.

293
00:17:35,000 --> 00:17:38,000
But it is not increasing by that huge an amount.

294
00:17:38,000 --> 00:17:38,000
Right?

295
00:17:38,000 --> 00:17:41,000
Even when we keep the rank as 512, right.

296
00:17:41,000 --> 00:17:43,000
So here you can see 86 million parameters

297
00:17:43,000 --> 00:17:45,000
are there, compared to billions.

298
00:17:45,000 --> 00:17:45,000
Right.

299
00:17:46,000 --> 00:17:51,000
Microsoft, you know, came up with this LoRA technique in one of its research

300
00:17:51,000 --> 00:17:54,000
papers, and it has used rank equal to eight, okay.

301
00:17:54,000 --> 00:17:56,000
To probably do the fine tuning.

302
00:17:56,000 --> 00:17:57,000
And it has performed absolutely well.

303
00:17:57,000 --> 00:18:01,000
So most of the time we select this particular value.

304
00:18:01,000 --> 00:18:05,000
But at the end of the day how to select this.

305
00:18:05,000 --> 00:18:06,000
Right.

306
00:18:06,000 --> 00:18:11,000
It won't matter much, you know, because the parameters are increasing by a very small number over here as we go

307
00:18:11,000 --> 00:18:11,000
ahead.

308
00:18:11,000 --> 00:18:16,000
So usually you can select rank 1 to 8 while you're performing fine tuning.

309
00:18:16,000 --> 00:18:20,000
Now there may also be a scenario: when should we use a very high rank?

310
00:18:20,000 --> 00:18:22,000
When to use high rank?

311
00:18:23,000 --> 00:18:25,000
When to use high rank?

312
00:18:26,000 --> 00:18:33,000
Note this answer, because in the interview they may ask you: if the model needs to learn complex things,

313
00:18:36,000 --> 00:18:37,000
Complex things.

314
00:18:37,000 --> 00:18:40,000
Then you can specifically use high rank, right?

315
00:18:40,000 --> 00:18:47,000
Let's say some model is not trained to interact in a certain way or perform some

316
00:18:47,000 --> 00:18:48,000
of the behavior at that point of time.

317
00:18:48,000 --> 00:18:53,000
Those complex things can be handled when you are probably increasing the number of ranks.

318
00:18:53,000 --> 00:18:53,000
Okay.

319
00:18:53,000 --> 00:18:58,000
So this can be a very simple question that may be asked in the interview, but I hope you got a complete

320
00:18:58,000 --> 00:18:59,000
idea.

321
00:18:59,000 --> 00:19:03,000
At the end of the day, this is the equation that you'll be able to see in most of the research paper.

322
00:19:03,000 --> 00:19:06,000
What LoRA is doing is nothing but something very simple.

323
00:19:06,000 --> 00:19:11,000
All the tracked weights are decomposed into two smaller matrices with different ranks.

324
00:19:11,000 --> 00:19:12,000
It can be different ranks.

325
00:19:12,000 --> 00:19:17,000
When you're fine tuning, the first thing is that you really need to set that rank.

326
00:19:17,000 --> 00:19:17,000
Okay?

327
00:19:17,000 --> 00:19:21,000
If you set that rank, like in this particular case, if I probably see if I go ahead and calculate

328
00:19:21,000 --> 00:19:25,000
with all the mathematical stuff, I will get that the rank is equal to one.

329
00:19:25,000 --> 00:19:25,000
Okay.

330
00:19:26,000 --> 00:19:27,000
For this also, rank is equal to one.

331
00:19:27,000 --> 00:19:28,000
Right.

332
00:19:28,000 --> 00:19:29,000
So similarly if you have rank two.

333
00:19:29,000 --> 00:19:31,000
So one of the matrices can be something like this.

334
00:19:31,000 --> 00:19:33,000
So, decomposed matrices.

335
00:19:33,000 --> 00:19:36,000
So this is based on rank two okay.

336
00:19:37,000 --> 00:19:39,000
So if I probably combine this right.

337
00:19:39,000 --> 00:19:40,000
So how many.

338
00:19:40,000 --> 00:19:43,000
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.

339
00:19:43,000 --> 00:19:43,000
Right.

340
00:19:43,000 --> 00:19:47,000
If I multiply these, I'll be getting a matrix of many more parameters.

341
00:19:47,000 --> 00:19:47,000
Right.

342
00:19:47,000 --> 00:19:51,000
But at the end of the day, in this particular case, it is a smaller number of parameters.

343
00:19:51,000 --> 00:19:52,000
Right.

344
00:19:52,000 --> 00:19:54,000
So this is what LoRA is all about.

345
00:19:54,000 --> 00:19:59,000
And because of this technique, fine-tuning is done with fewer weights; the trainable parameters become

346
00:19:59,000 --> 00:19:59,000
less.

347
00:19:59,000 --> 00:20:02,000
So this is how the main resource constraint is solved.

348
00:20:02,000 --> 00:20:06,000
And with respect to all the downstream tasks, it becomes much easier.
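
Putting the pieces together, a LoRA layer can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the paper's or any library's implementation: the class name `LoRALinear` and its methods are invented here, the weights are toy values, and the scaling factor that the paper applies to B times A is left out. Starting B at zero, so that fine-tuning begins exactly at the pre-trained weights, does follow the paper.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped nested lists."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

class LoRALinear:
    """Frozen pre-trained weight W0 plus a trainable low-rank update B @ A."""

    def __init__(self, W0, r, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0                                  # frozen, never updated
        self.B = [[0.0] * r for _ in range(d)]        # d x r, starts at zero
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]  # r x k

    def forward(self, x):
        # W0 @ x + B @ (A @ x): the full d x k update is never built.
        return add(matmul(self.W0, x), matmul(self.B, matmul(self.A, x)))

    def merged_weight(self):
        # W0 + B @ A, the "fine-tuned weights" obtained by combining both parts.
        return add(self.W0, matmul(self.B, self.A))

    def trainable_params(self):
        return sum(len(row) for row in self.B) + sum(len(row) for row in self.A)

# Toy three cross three example with rank 1 (all values are illustrative).
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer = LoRALinear(W0, r=1)
x = [[1.0], [2.0], [3.0]]

h_before = layer.forward(x)        # B is zero, so this equals W0 @ x

# Pretend training has moved B and A to these values.
layer.B = [[1.0], [2.0], [3.0]]
layer.A = [[4.0, 5.0, 6.0]]
h_after = layer.forward(x)

print(h_before, h_after, layer.trainable_params())
```

Only `B` and `A` would receive gradient updates during fine-tuning; `W0` stays as it was in the base model.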

349
00:20:06,000 --> 00:20:09,000
Now one more thing that I really want to talk about is QLoRA.

350
00:20:10,000 --> 00:20:10,000
Okay.

351
00:20:11,000 --> 00:20:12,000
QLoRA.

352
00:20:13,000 --> 00:20:15,000
QLoRA basically means quantized

353
00:20:16,000 --> 00:20:17,000
Quantized.

354
00:20:17,000 --> 00:20:17,000
LoRA.

355
00:20:18,000 --> 00:20:23,000
Okay, now you have already learned from the first video what quantized means.

356
00:20:23,000 --> 00:20:23,000
Quantized basically means.

357
00:20:23,000 --> 00:20:31,000
Now in our case, what will happen is that all these parameters that are probably stored in float

358
00:20:31,000 --> 00:20:32,000
16 bit.

359
00:20:32,000 --> 00:20:33,000
Okay.

360
00:20:33,000 --> 00:20:35,000
We will try to convert this into four bit.

361
00:20:35,000 --> 00:20:36,000
That's it.

362
00:20:36,000 --> 00:20:37,000
Okay.

363
00:20:37,000 --> 00:20:42,000
Once we do this, you will be able to see that we reduce the precision.

364
00:20:42,000 --> 00:20:44,000
And then we try to reduce these values.

365
00:20:44,000 --> 00:20:47,000
Also, by this you won't require as much memory.

366
00:20:47,000 --> 00:20:50,000
So that is the reason we say quantized LoRA technique, okay.

367
00:20:50,000 --> 00:20:57,000
And the best thing about this is that QLoRA also has one amazing algorithm which will be able

368
00:20:57,000 --> 00:21:02,000
to take care of both these parts. Let's say there is a float 16-bit value and I quantize it to four bit.

369
00:21:02,000 --> 00:21:05,000
I can also convert this back into 16 bit.
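
The round trip from higher precision to four bit and back can be sketched as follows. Note the assumption: this uses simple uniform "absmax" quantization to the integer range -8 to 7, which is not the actual QLoRA algorithm (the QLoRA paper uses a special 4-bit NormalFloat data type). The function names and weight values are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes (-8..7) with one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Convert the 4-bit codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical weights, standing in for values stored in 16-bit floats.
weights = [0.12, -0.53, 0.98, -0.05, 0.33]

codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:+d} -> {r:+.3f}")
```

Each code fits in four bits, so storage drops to about a quarter of 16-bit storage plus one scale value, and the restored values differ from the originals by at most half a quantization step. That difference is the loss of precision mentioned above.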

370
00:21:05,000 --> 00:21:10,000
Okay, so with respect to this explanation guys, I have already spoken about many things over here.

371
00:21:10,000 --> 00:21:11,000
LoRA and QLoRA.