1
00:00:00,000 --> 00:00:01,000
Hello guys!

2
00:00:01,000 --> 00:00:07,000
Before we start implementing some of the amazing generative AI applications with the help of large language

3
00:00:07,000 --> 00:00:13,000
models or multimodal models, it is very much necessary that you really understand how these LLM models

4
00:00:13,000 --> 00:00:15,000
are specifically trained.

5
00:00:15,000 --> 00:00:20,000
What is the step by step mechanism to properly train this particular model completely from scratch?

6
00:00:20,000 --> 00:00:24,000
Yes, with respect to practical implementation, it will not be possible because you definitely require

7
00:00:24,000 --> 00:00:27,000
a huge amount of resources, data, and many more things.

8
00:00:27,000 --> 00:00:32,000
But in this video, what I'm actually going to do is that I'll be taking some of the LLM models, like,

9
00:00:32,000 --> 00:00:34,000
uh, OpenAI ChatGPT models.

10
00:00:34,000 --> 00:00:37,000
I'll be talking about Meta, uh, Meta Llama 3 models.

11
00:00:37,000 --> 00:00:38,000
All right.

12
00:00:38,000 --> 00:00:44,000
And when I read the research papers of all these specific models, they usually follow a specific pattern

13
00:00:44,000 --> 00:00:46,000
for training all these LLM models from scratch.

14
00:00:46,000 --> 00:00:51,000
And in this video I will be talking about all of these steps.

15
00:00:51,000 --> 00:00:54,000
Uh, step by step, we'll try to understand it completely.

16
00:00:54,000 --> 00:00:59,000
Right. Again, uh, the main aim, the main goal of this specific video is just to make you understand

17
00:00:59,000 --> 00:01:05,000
how these, uh, you know, LLM models are basically trained, and this is just a theoretical intuition.

18
00:01:05,000 --> 00:01:10,000
A practical implementation, where you train your model from scratch, will definitely not be possible.

19
00:01:10,000 --> 00:01:12,000
It definitely requires a lot of resources.

20
00:01:12,000 --> 00:01:13,000
Right.

21
00:01:13,000 --> 00:01:15,000
So, yes, uh, let's go ahead and enjoy this particular session.

22
00:01:15,000 --> 00:01:19,000
And please make sure that you watch this session till the end because it is important.

23
00:01:19,000 --> 00:01:20,000
Thank you.

24
00:01:20,000 --> 00:01:25,000
Here I have actually created all these diagrams to just make you understand.

25
00:01:25,000 --> 00:01:32,000
But, uh, just to understand how ChatGPT was trained, you know, I have explored a lot of research

26
00:01:32,000 --> 00:01:38,000
papers over the past month, you know, a lot of different resources, materials that are available on

27
00:01:38,000 --> 00:01:43,000
the internet about ChatGPT blogs or even I have explored the ChatGPT websites.

28
00:01:43,000 --> 00:01:49,000
The research paper that was basically generated for this, everything I explored and I found out multiple

29
00:01:49,000 --> 00:01:51,000
things to probably explain about this.

30
00:01:51,000 --> 00:01:57,000
I'll try to break this down, but before I go ahead, there is an amazing article written by Pradeep

31
00:01:57,000 --> 00:01:57,000
Menon.

32
00:01:57,000 --> 00:01:59,000
You should definitely check out this.

33
00:01:59,000 --> 00:02:05,000
I will be providing you the link in the description of this particular video and it looks completely

34
00:02:05,000 --> 00:02:05,000
amazing.

35
00:02:05,000 --> 00:02:07,000
He has explained in an amazing way.

36
00:02:07,000 --> 00:02:10,000
And these last two diagrams that you will be seeing over here, right?

37
00:02:10,000 --> 00:02:14,000
If you probably explore this, these last two diagrams I have also copied and pasted.

38
00:02:14,000 --> 00:02:21,000
So definitely a lot of credit goes to uh, to this specific article, uh, which is written by, um,

39
00:02:21,000 --> 00:02:22,000
you know, Pradeep.

40
00:02:22,000 --> 00:02:28,000
So I will just try to explain you again, just but not, uh, if you probably read this, you may get

41
00:02:28,000 --> 00:02:32,000
some amount of understanding, but, uh, currently I will.

42
00:02:32,000 --> 00:02:35,000
What I will do is that I'll give a lot of examples over here, right.

43
00:02:35,000 --> 00:02:39,000
So that you will be able to relate it, like how ChatGPT is trained.

44
00:02:39,000 --> 00:02:44,000
Now, to start with, the entire ChatGPT model is basically trained in three stages.

45
00:02:44,000 --> 00:02:45,000
Okay.

46
00:02:45,000 --> 00:02:49,000
The first stage we basically say it as generative pre-training.

47
00:02:49,000 --> 00:02:50,000
Okay.

48
00:02:50,000 --> 00:02:56,000
And before you probably really want to understand about ChatGPT, you should also check out my video

49
00:02:56,000 --> 00:02:59,000
in YouTube regarding Transformers.

50
00:02:59,000 --> 00:03:00,000
Okay, because this is the base.

51
00:03:00,000 --> 00:03:04,000
So if you search for Transformers, right?

52
00:03:04,000 --> 00:03:08,000
So here you will be seeing that I have created a live session over here.

53
00:03:08,000 --> 00:03:13,000
This specific video you should definitely refer to, because this is the video that I probably uploaded

54
00:03:13,000 --> 00:03:14,000
two years back.

55
00:03:14,000 --> 00:03:21,000
And if I talk about models like ChatGPT or Bart, you know they are specifically using transformer architecture,

56
00:03:21,000 --> 00:03:23,000
which, in its original form, has an encoder and a decoder (GPT-style models use only the decoder part).

57
00:03:23,000 --> 00:03:28,000
Okay, so definitely check out this particular video if you really want to know about Transformers.

58
00:03:28,000 --> 00:03:28,000
Okay.

59
00:03:28,000 --> 00:03:34,000
So in the stage one you probably train a generative pre-trained model, through which you get a

60
00:03:34,000 --> 00:03:35,000
base GPT model.

61
00:03:35,000 --> 00:03:39,000
Then in the stage two you do supervised fine tuning.

62
00:03:39,000 --> 00:03:44,000
I will be talking about each and every thing of this, like how this generative pre-training is basically

63
00:03:44,000 --> 00:03:47,000
happening, supervised fine tuning is happening.

64
00:03:47,000 --> 00:03:54,000
And then finally in the stage three, you basically do the reinforcement learning uh, by using the

65
00:03:54,000 --> 00:03:55,000
human inputs.

66
00:03:55,000 --> 00:03:55,000
Right?

67
00:03:55,000 --> 00:04:00,000
So that kind of uh, not human input, but I'll say say human feedback.

68
00:04:00,000 --> 00:04:00,000
Right?

69
00:04:00,000 --> 00:04:04,000
So that is the reason it is written as reinforcement learning from human feedback.

70
00:04:04,000 --> 00:04:07,000
And finally you get a ChatGPT model.

71
00:04:07,000 --> 00:04:10,000
Now this first step that you probably see the stage one, right.

72
00:04:10,000 --> 00:04:12,000
Generative pre-training.

73
00:04:12,000 --> 00:04:16,000
What kind of data is basically required? The entire internet data.

74
00:04:16,000 --> 00:04:17,000
Right.

75
00:04:17,000 --> 00:04:25,000
It can be website articles, books, public forums, uh, like uh, a website, which is probably having

76
00:04:25,000 --> 00:04:26,000
tutorials, a lot of things.

77
00:04:26,000 --> 00:04:26,000
Right.

78
00:04:26,000 --> 00:04:30,000
So all of that internet data is basically taken to train ChatGPT.

79
00:04:30,000 --> 00:04:34,000
So let's go ahead and deep dive and talk about this okay.

80
00:04:34,000 --> 00:04:38,000
But let me just give you a basic example like what a learner model looks like.

81
00:04:38,000 --> 00:04:44,000
Let's say that I am a person over here and I am really much interested to know about dogs.

82
00:04:44,000 --> 00:04:51,000
Okay, so what I did is that I have explored five, six different types of this big, huge 500 pages

83
00:04:51,000 --> 00:04:52,000
of books regarding dogs.

84
00:04:52,000 --> 00:04:55,000
And I have learned, I have probably read about it.

85
00:04:55,000 --> 00:04:56,000
So I know many things about the dog.

86
00:04:56,000 --> 00:05:03,000
Now you can ask me any questions, so I will try to answer any new thing about that specific dog that

87
00:05:03,000 --> 00:05:04,000
is only present in the book.

88
00:05:04,000 --> 00:05:06,000
Okay, that is only present in the book.

89
00:05:06,000 --> 00:05:13,000
So this gives a brief idea about a model where you are able to answer something after probably getting

90
00:05:13,000 --> 00:05:16,000
trained or reading it from multiple books.

91
00:05:16,000 --> 00:05:18,000
Okay, so this is just an example.

92
00:05:18,000 --> 00:05:21,000
But when we talk about the stage one generative pre-training model.

93
00:05:21,000 --> 00:05:23,000
So what exactly is basically happening over here.

94
00:05:23,000 --> 00:05:28,000
So first of all let me make this as a full screen here.

95
00:05:28,000 --> 00:05:31,000
You have huge input data from the internet, okay?

96
00:05:31,000 --> 00:05:34,000
So here you basically have an internet huge input text data.

97
00:05:34,000 --> 00:05:37,000
You pass this data to the transformers okay.

98
00:05:37,000 --> 00:05:42,000
Transformers, when you pass this huge amount of data to the transformers, in transformers you have

99
00:05:42,000 --> 00:05:43,000
encoder decoder.

100
00:05:43,000 --> 00:05:46,000
I've explained about transformers a lot in my live session.

101
00:05:46,000 --> 00:05:50,000
Please do make sure that you watch that particular video. After this particular data,

102
00:05:50,000 --> 00:05:53,000
this huge data, is basically trained with the transformers,

103
00:05:53,000 --> 00:05:56,000
it basically creates a base GPT model.
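
To make the idea of generative pre-training a bit more concrete, here is a minimal sketch of the underlying objective, which is predicting the next token from raw text. It is only an illustration under loud assumptions: a tiny bigram counter stands in for the transformer, and the three-sentence corpus and the function names `train_bigram` and `predict_next` are hypothetical, not anything from the video or from a real GPT training pipeline.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which token follows which one across the raw text.

    Assumption: this toy counter replaces the transformer; a real base
    model learns the same "what comes next" signal with a neural network.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Every adjacent pair (previous token, next token) is one training example.
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequently observed follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical stand-in for "the entire internet data".
corpus = [
    "dogs love to play fetch",
    "dogs love to eat",
    "cats love to sleep",
]
model = train_bigram(corpus)
print(predict_next(model, "dogs"))
print(predict_next(model, "zebra"))
```

Just like the person who read the dog books, this toy model can only continue text it has seen; a token that never appeared in the corpus gives no prediction at all.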

104
00:05:56,000 --> 00:06:03,000
Now, as you know, what all tasks will a transformer be able to do? Transformers can, uh, like, they

105
00:06:03,000 --> 00:06:08,000
are able to do this kind of task like language translation, text summarization, text completion,

106
00:06:08,000 --> 00:06:09,000
sentiment analysis.

107
00:06:09,000 --> 00:06:13,000
So this is the kind of task that Transformers can easily do.

108
00:06:13,000 --> 00:06:18,000
And just by training this huge amount of data, we are able to get this kind of task and we are able

109
00:06:18,000 --> 00:06:19,000
to solve these problems.

110
00:06:19,000 --> 00:06:19,000
Right?

111
00:06:19,000 --> 00:06:24,000
Even, uh, in Transformers there is a concept called attention, from the paper "Attention Is All You Need", right.

112
00:06:24,000 --> 00:06:27,000
And based on that, all these tasks can be easily implemented.

113
00:06:27,000 --> 00:06:31,000
And I've also shown a practical implementation with respect to that particular live session also.

114
00:06:31,000 --> 00:06:32,000
Right.

115
00:06:32,000 --> 00:06:35,000
Uh, just in, uh, Google Colab notebook.

116
00:06:35,000 --> 00:06:35,000
Right.

117
00:06:35,000 --> 00:06:38,000
If you have good amount of data, you will also be able to implement this.

118
00:06:38,000 --> 00:06:38,000
Okay.

119
00:06:38,000 --> 00:06:47,000
So once this task is implemented, but our main aim is basically to use this ChatGPT model for conversation.

120
00:06:47,000 --> 00:06:47,000
Right.

121
00:06:47,000 --> 00:06:51,000
What we want we want basically a conversation chat bot.

122
00:06:51,000 --> 00:06:52,000
Right.

123
00:06:52,000 --> 00:06:57,000
So we want this functionality where we are giving a request and a chat bot is giving the response.

124
00:06:57,000 --> 00:06:59,000
That is the functionality we want.

125
00:06:59,000 --> 00:07:01,000
We don't want this independent functionality.

126
00:07:01,000 --> 00:07:05,000
This functionality can be combined in the response part.

127
00:07:05,000 --> 00:07:05,000
Right.

128
00:07:05,000 --> 00:07:12,000
So in short, when we probably create a generative pre-training model, what we are specifically doing

129
00:07:12,000 --> 00:07:17,000
over here is that we are able to get this sub task kind of task over here, right?

130
00:07:17,000 --> 00:07:19,000
All this language translation, text summarization and all.

131
00:07:19,000 --> 00:07:24,000
Now we need to convert this task in the form of request and response, right?

132
00:07:24,000 --> 00:07:26,000
So that is the reason why three stages are basically required.

133
00:07:26,000 --> 00:07:27,000
Right?

134
00:07:27,000 --> 00:07:30,000
So now what we do is that we go to the next step.

135
00:07:30,000 --> 00:07:33,000
That is next stage supervised fine tuning.

136
00:07:33,000 --> 00:07:38,000
Now, what exactly is this supervised fine-tuning, that is, SFT?

137
00:07:38,000 --> 00:07:39,000
It is also called SFT.

138
00:07:40,000 --> 00:07:44,000
Now in SFT what happens is that on one side a human being will be there.

139
00:07:44,000 --> 00:07:45,000
Okay.

140
00:07:45,000 --> 00:07:47,000
Let's say this is a human being over here.

141
00:07:47,000 --> 00:07:49,000
Let's say I am sitting over here, right?

142
00:07:49,000 --> 00:07:54,000
And I'll talk about a very important role nowadays, which is very much famous, which is called as

143
00:07:54,000 --> 00:07:55,000
prompt engineering role.

144
00:07:55,000 --> 00:07:55,000
Okay.

145
00:07:55,000 --> 00:07:57,000
We'll also talk about that.

146
00:07:57,000 --> 00:07:57,000
Okay.

147
00:07:57,000 --> 00:08:01,000
So here on the left hand side, one human will be there on the right hand side, another human will

148
00:08:01,000 --> 00:08:02,000
be there.

149
00:08:02,000 --> 00:08:06,000
And this human will be acting like a chat bot agent.

150
00:08:06,000 --> 00:08:07,000
Okay.

151
00:08:07,000 --> 00:08:13,000
So what happens is that whenever this human being sends a request, let's say it asks a question like,

152
00:08:13,000 --> 00:08:14,000
hello, how are you?

153
00:08:14,000 --> 00:08:16,000
So the other human will say that.

154
00:08:16,000 --> 00:08:17,000
Yeah, I'm very good.

155
00:08:17,000 --> 00:08:18,000
I'm fine.

156
00:08:18,000 --> 00:08:19,000
Okay.

157
00:08:19,000 --> 00:08:21,000
Something some response will be there.

158
00:08:21,000 --> 00:08:23,000
Then again, this human will send another request.

159
00:08:23,000 --> 00:08:26,000
Then again, the other human will probably send the response.

160
00:08:27,000 --> 00:08:31,000
So like this it will be having the request and response continuously.

161
00:08:31,000 --> 00:08:33,000
So these are some real conversations.

162
00:08:33,000 --> 00:08:37,000
And all these conversations will be getting captured.

163
00:08:38,000 --> 00:08:46,000
So guys, now this kind of real conversation will be converted into an SFT training data corpus.

164
00:08:47,000 --> 00:08:54,000
Right. Now this SFT training data corpus will basically be in the form of request and response.

165
00:08:54,000 --> 00:08:58,000
Request will be your input and response will be your output.

166
00:08:59,000 --> 00:08:59,000
Okay.

167
00:08:59,000 --> 00:09:03,000
So like this, a lot of requests, a lot of different responses.

168
00:09:03,000 --> 00:09:08,000
It can be that for a similar kind of request there can be multiple responses also.

169
00:09:08,000 --> 00:09:12,000
So they will try to first of all create this kind of training data corpus.

170
00:09:12,000 --> 00:09:17,000
And it will just not be 1 or 2 records but millions of records, right?

171
00:09:17,000 --> 00:09:19,000
Millions of records.

172
00:09:19,000 --> 00:09:24,000
Now once this is basically getting created you can see request is conversation history and response

173
00:09:24,000 --> 00:09:27,000
is the best ideal response itself.

174
00:09:27,000 --> 00:09:27,000
Right.

175
00:09:27,000 --> 00:09:33,000
So this this format of training data corpus will basically get created.

176
00:09:33,000 --> 00:09:39,000
And then it will then be sent to the base GPT model right for the training purpose.
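
As a rough illustration of what this SFT training data corpus could look like, the sketch below turns one captured conversation into request and response records, where the request is the conversation history so far and the response is the ideal reply. The record layout, the speaker labels "user" and "agent", and the function name `build_sft_examples` are assumptions made for this example, not the actual format used for ChatGPT.

```python
def build_sft_examples(conversation):
    """Turn a list of (speaker, text) turns into request/response records.

    Assumption: speakers are labeled "user" (the human asking) and
    "agent" (the human acting like the chat bot).
    """
    examples = []
    history = []
    for speaker, text in conversation:
        if speaker == "agent":
            # Input = everything said so far, output = the agent's ideal reply.
            examples.append({"request": "\n".join(history), "response": text})
        history.append(f"{speaker}: {text}")
    return examples


# Hypothetical captured conversation between the two humans.
conversation = [
    ("user", "Hello, how are you?"),
    ("agent", "I'm fine."),
    ("user", "Tell me about dogs."),
    ("agent", "Dogs are loyal."),
]
examples = build_sft_examples(conversation)
for record in examples:
    print(record)
```

Each agent turn becomes one record, so a single long conversation already contributes several training examples to the corpus.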

177
00:09:40,000 --> 00:09:47,000
Now once it is sent to the base GPT model for the training purpose, then over here, the optimizer,

178
00:09:47,000 --> 00:09:52,000
according to the research paper, uh, the optimizer that is used over here

179
00:09:52,000 --> 00:09:54,000
is called as stochastic gradient descent.

180
00:09:54,000 --> 00:09:55,000
Right.

181
00:09:55,000 --> 00:09:59,000
And from this you are basically getting an SFT ChatGPT model.

182
00:09:59,000 --> 00:10:00,000
What is SFT?

183
00:10:00,000 --> 00:10:02,000
SFT I have already written over here.

184
00:10:02,000 --> 00:10:06,000
It is a supervised fine tuning model with respect to ChatGPT.

185
00:10:06,000 --> 00:10:07,000
Right.

186
00:10:07,000 --> 00:10:14,000
So the SFT ChatGPT model I will be able to get after I probably do the stage two, or supervised fine-tuning.

187
00:10:14,000 --> 00:10:15,000
Now,

188
00:10:15,000 --> 00:10:17,000
still, there are a lot of problems with this.

189
00:10:17,000 --> 00:10:19,000
Obviously it will be able to give you the answers right.

190
00:10:19,000 --> 00:10:28,000
But this model, this SFT ChatGPT model, will be able to give you the output based on the data it is

191
00:10:28,000 --> 00:10:29,000
basically trained with.

192
00:10:29,000 --> 00:10:36,000
If I probably ask some other questions to this particular ChatGPT model that may not be there in this

193
00:10:36,000 --> 00:10:37,000
particular training data,

194
00:10:38,000 --> 00:10:43,000
then it will start giving you some awkward answers which you may not have seen before.

195
00:10:43,000 --> 00:10:43,000
Right?

196
00:10:43,000 --> 00:10:50,000
So this ChatGPT will start behaving in a way you'll also not know what exactly it is trying to say.

197
00:10:50,000 --> 00:10:54,000
And all these problems were also faced by the researchers when they were actually creating this.

198
00:10:55,000 --> 00:11:01,000
And that is the reason they came up with the stage three, which is called as reinforcement learning

199
00:11:01,000 --> 00:11:02,000
through Human Feedback.

200
00:11:02,000 --> 00:11:11,000
Now, because of this step, the model that was now created was called as ChatGPT model.

201
00:11:11,000 --> 00:11:17,000
And right now, whatever things you're using with respect to ChatGPT 3.5 or 4 is basically using this

202
00:11:17,000 --> 00:11:20,000
reinforcement learning through this human feedback.

203
00:11:20,000 --> 00:11:23,000
Now let's understand what exactly is happening over here.

204
00:11:23,000 --> 00:11:24,000
And here.

205
00:11:24,000 --> 00:11:29,000
I'm also going to give you a lot of amazing examples to just make you understand, because the most

206
00:11:29,000 --> 00:11:31,000
complex thing is not this.

207
00:11:31,000 --> 00:11:35,000
See, data creation is always a task that we probably do as a data scientist.

208
00:11:35,000 --> 00:11:40,000
Now, this is also not a very, uh, very difficult step, because there are transformers.

209
00:11:40,000 --> 00:11:41,000
We are using the same architecture.

210
00:11:41,000 --> 00:11:44,000
We are just taking huge amount of data and we are training with it.

211
00:11:44,000 --> 00:11:45,000
Right.

212
00:11:45,000 --> 00:11:48,000
The most important step is this.

213
00:11:48,000 --> 00:11:54,000
Because of this, the accuracy of the ChatGPT has been increased in a tremendous way.

214
00:11:54,000 --> 00:11:58,000
Right now, what exactly is this reinforcement learning through human feedback.

215
00:11:58,000 --> 00:12:05,000
So over here, when we have this SFT-trained model, let's say a human agent gives a request, then

216
00:12:05,000 --> 00:12:08,000
the SFT ChatGPT will give some kind of response.

217
00:12:08,000 --> 00:12:15,000
Now similarly, for this kind of request, we may have multiple responses also.

218
00:12:16,000 --> 00:12:16,000
Right.

219
00:12:16,000 --> 00:12:18,000
We may have different different response.

220
00:12:19,000 --> 00:12:22,000
Now this is based on this particular response.

221
00:12:22,000 --> 00:12:25,000
Over here you can see these are all my alternative response.

222
00:12:25,000 --> 00:12:31,000
Now, when I say through human feedback, this is where this new human has been put up over here, right?

223
00:12:31,000 --> 00:12:37,000
So once we probably get multiple different responses, now this human agent, what it will do is that

224
00:12:37,000 --> 00:12:44,000
it will try to rank all these responses, saying which is the most suitable response, right?

225
00:12:44,000 --> 00:12:50,000
Which is the most suitable response, or which is the best response? Based on that, a ranking will get assigned.

226
00:12:50,000 --> 00:12:54,000
So here you can see that response B is the best.

227
00:12:54,000 --> 00:12:55,000
Then response A is the next best.

228
00:12:55,000 --> 00:12:58,000
Then comes response D, and then response C.

229
00:12:58,000 --> 00:13:04,000
We are ranking all the responses. Now, based on this response ranking,

230
00:13:04,000 --> 00:13:07,000
What we do is that we create a reward model.

231
00:13:07,000 --> 00:13:13,000
Basically, the researcher created a reward model wherein based on every response, they probably provide

232
00:13:13,000 --> 00:13:16,000
a score, right? For every response, they

233
00:13:16,000 --> 00:13:18,000
provide a score.

234
00:13:18,000 --> 00:13:20,000
And this score will be based on probability.

235
00:13:20,000 --> 00:13:25,000
So it becomes a binary classification. If the probability is high, right?

236
00:13:25,000 --> 00:13:30,000
If the probability is very, very high (probability ranges between 0 and 1),

237
00:13:30,000 --> 00:13:34,000
That basically means that particular response is a very good response.

238
00:13:34,000 --> 00:13:35,000
If the probability is low,

239
00:13:35,000 --> 00:13:40,000
that basically means that for that particular response the score is less.
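
One way to picture how a ranking like B, A, D, C becomes a training signal for the reward model is sketched below. The ranking is split into "preferred versus rejected" pairs, and each pair is treated as a binary classification with a cross-entropy style loss on the score difference. This pairwise form is a common way to describe reward-model training and is an assumption here; the function names `ranking_to_pairs` and `pairwise_loss` and the example scores are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranking):
    """Split a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first item of each pair
    is always the one the human ranked higher.
    """
    return list(combinations(ranking, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Binary cross-entropy on "the preferred response should score higher".

    The sigmoid of the score difference is read as the probability that
    the preferred response really is the better one.
    """
    probability = 1.0 / (1.0 + math.exp(-(score_preferred - score_rejected)))
    return -math.log(probability)


# Human ranking from the diagram: B is best, then A, then D, then C.
pairs = ranking_to_pairs(["B", "A", "D", "C"])
print(pairs)

# Hypothetical reward scores: agreeing with the human ranking costs less.
print(pairwise_loss(2.0, 0.0))
print(pairwise_loss(0.0, 2.0))
```

A reward model that scores the human-preferred response higher gets a small loss, and one that scores it lower gets a large loss, which is what pushes the scores toward the human ranking.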

240
00:13:40,000 --> 00:13:42,000
Right. Now, this reward model.

241
00:13:42,000 --> 00:13:45,000
Now obviously if I'm explaining it to you like this, it is very difficult to understand.

242
00:13:45,000 --> 00:13:47,000
So let me give you an example over here.

243
00:13:47,000 --> 00:13:49,000
Let's say that there is a chef.

244
00:13:49,000 --> 00:13:50,000
Okay.

245
00:13:50,000 --> 00:13:54,000
Now this chef knows how to cook any kind of food okay.

246
00:13:55,000 --> 00:13:56,000
Any kind of food.

247
00:13:56,000 --> 00:14:02,000
Now suddenly in a restaurant there is a request saying that from a customer, hey, I want to have a

248
00:14:02,000 --> 00:14:04,000
very good non vegetarian food, okay?

249
00:14:04,000 --> 00:14:06,000
It can be chicken right.

250
00:14:06,000 --> 00:14:07,000
Something I'm just giving some things right.

251
00:14:07,000 --> 00:14:08,000
Like chicken.

252
00:14:08,000 --> 00:14:09,000
Right.

253
00:14:09,000 --> 00:14:15,000
And I want to have it at dinner. Right, now if probably the chef gets this kind of request, the chef will

254
00:14:15,000 --> 00:14:16,000
not initially know.

255
00:14:16,000 --> 00:14:17,000
Like, what food should I create?

256
00:14:17,000 --> 00:14:18,000
It depends.

257
00:14:18,000 --> 00:14:18,000
Right?

258
00:14:18,000 --> 00:14:23,000
If the chef is from India or he's from some other foreign countries, they'll try to create that kind

259
00:14:23,000 --> 00:14:23,000
of thing

260
00:14:23,000 --> 00:14:26,000
that is actually liked by the chef.

261
00:14:26,000 --> 00:14:26,000
Right.

262
00:14:26,000 --> 00:14:28,000
So what chef will do now?

263
00:14:28,000 --> 00:14:35,000
He will first of all ask many, many people, like, what kind of food would you like?

264
00:14:35,000 --> 00:14:38,000
So this is the initial response that is taken.

265
00:14:38,000 --> 00:14:41,000
So these are all my responses that are basically coming up.

266
00:14:41,000 --> 00:14:41,000
Right.

267
00:14:41,000 --> 00:14:43,000
So these are all my response.

268
00:14:43,000 --> 00:14:46,000
You will be able to see that I'm picking it up right.

269
00:14:46,000 --> 00:14:50,000
So chef initially will put up all the responses.

270
00:14:50,000 --> 00:14:50,000
Right.

271
00:14:50,000 --> 00:14:53,000
And it will also ask that okay fine.

272
00:14:53,000 --> 00:14:55,000
Then what will happen based on this response.

273
00:14:55,000 --> 00:14:56,000
It will try to rank it right.

274
00:14:56,000 --> 00:15:00,000
How many people have given a similar kind of response?

275
00:15:00,000 --> 00:15:00,000
Right.

276
00:15:00,000 --> 00:15:04,000
If many people have said, okay, I like this specific food, obviously he can try to rank it, right.

277
00:15:04,000 --> 00:15:05,000
This can be greater than this.

278
00:15:05,000 --> 00:15:09,000
This can be greater than this, this can be greater than this, this can be less than this.

279
00:15:09,000 --> 00:15:15,000
Now once we try to rank this specific response now, chef, what it will do is that it will try to create

280
00:15:15,000 --> 00:15:16,000
a reward model.

281
00:15:16,000 --> 00:15:19,000
And this reward model will be very, very simple.

282
00:15:19,000 --> 00:15:21,000
It will be a binary classification.

283
00:15:21,000 --> 00:15:24,000
Over here cross entropy is also used okay.

284
00:15:24,000 --> 00:15:26,000
Cross entropy is used now based on this.

285
00:15:26,000 --> 00:15:32,000
What happens is that whenever the chef gives any response right, it should be able to consider that

286
00:15:32,000 --> 00:15:38,000
whether we should go ahead with this particular output or not, or whether should I cook this particular

287
00:15:38,000 --> 00:15:39,000
food or not like that.

288
00:15:39,000 --> 00:15:39,000
Right.

289
00:15:40,000 --> 00:15:44,000
So this is what is the reward model that is basically getting created.

290
00:15:44,000 --> 00:15:46,000
And this is based on the feedback.

291
00:15:46,000 --> 00:15:49,000
The feedback is probably coming from the human beings over here.

292
00:15:49,000 --> 00:15:50,000
Right?

293
00:15:50,000 --> 00:15:52,000
So I hope you got this specific idea about it.

294
00:15:52,000 --> 00:15:58,000
Now once the reward model is basically created, then reinforcement is basically applied by a technique

295
00:15:58,000 --> 00:16:01,000
which is called as proximal policy optimization.

296
00:16:01,000 --> 00:16:08,000
Now based on this proximal policy optimization, all the things that is basically happening is that

297
00:16:08,000 --> 00:16:14,000
the reward model, first of all, assigns the reward based on the response that is probably coming from

298
00:16:14,000 --> 00:16:16,000
the ChatGPT model, right?

299
00:16:16,000 --> 00:16:25,000
And then it will also make sure that it will update the policy, that is, the ChatGPT model, by using this proximal policy

300
00:16:25,000 --> 00:16:26,000
optimization technique.

301
00:16:26,000 --> 00:16:33,000
Again, in this technique, uh, probably I'll not explain about proximal policy

302
00:16:33,000 --> 00:16:36,000
optimization right now, but I'll make a dedicated video about this.

303
00:16:36,000 --> 00:16:42,000
But again, understand that this is a reinforcement technique wherein we will be able to improve the

304
00:16:42,000 --> 00:16:48,000
ChatGPT response, and based on that, whatever human feedback or response is coming up, it will try

305
00:16:48,000 --> 00:16:53,000
to make sure that it will try to increase the reward if the response is properly correct.

306
00:16:53,000 --> 00:16:53,000
Right.

307
00:16:53,000 --> 00:16:56,000
So that is the thing that is basically happening over here.
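
Since proximal policy optimization is not explained in this session, here is only a small sketch of its central piece, the clipped objective for a single sample, so the name is not a complete black box. The function name `ppo_clipped_objective`, the default `epsilon` of 0.2, and the example numbers are assumptions for illustration; a real RLHF setup adds much more around this, such as batching and a full training loop.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective of PPO for one sample.

    ratio     : new policy probability / old policy probability for the response
    advantage : how much better than expected the response was; here it
                would be derived from the reward model's score
    epsilon   : how far the policy may move in one update (assumed 0.2)
    """
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to move the policy too far.
    return min(ratio * advantage, clipped_ratio * advantage)


# A well-rewarded response: the gain is capped once the ratio passes 1.2.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))

# A poorly rewarded response that became less likely: the benefit is capped at ratio 0.8.
print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))
```

The clipping is the "proximal" part: the updated model is rewarded for giving better responses, but only while it stays close to the model it started from.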

308
00:16:56,000 --> 00:17:04,000
And finally we basically get a ChatGPT model, and this reward update and the policy model that

309
00:17:04,000 --> 00:17:11,000
is probably getting updated using the proximal policy optimization will happen continuously as the conversation

310
00:17:11,000 --> 00:17:12,000
is basically happening.

311
00:17:12,000 --> 00:17:13,000
Right.

312
00:17:13,000 --> 00:17:19,000
And this is how the entire process of reinforcement learning happens through human feedback.

313
00:17:19,000 --> 00:17:25,000
See guys I know from this, whatever things you know, you are very much familiar with stage one and

314
00:17:25,000 --> 00:17:25,000
stage two.

315
00:17:25,000 --> 00:17:29,000
You may also know how you can probably create this particular data set, right?

316
00:17:29,000 --> 00:17:33,000
That can be a manual approach, but yes, you can definitely do it right.

317
00:17:33,000 --> 00:17:37,000
The only thing, the most complex thing that usually happens, is over here.

318
00:17:37,000 --> 00:17:43,000
But understand, if you are able to understand things right, writing code for this is also very, very

319
00:17:43,000 --> 00:17:43,000
easy.

320
00:17:43,000 --> 00:17:48,000
Probably you can do it, but it will not be possible for just any company to do this, because they definitely

321
00:17:48,000 --> 00:17:51,000
require a huge amount of data, right?