1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:04,000
So we are going to continue the discussion with respect to RNN.

3
00:00:05,000 --> 00:00:08,000
Uh, already we have seen about forward propagation in RNN.

4
00:00:08,000 --> 00:00:10,000
We have seen about backward propagation in RNN.

5
00:00:11,000 --> 00:00:16,000
Now we are going to understand what are some of the problems with RNN okay.

6
00:00:16,000 --> 00:00:21,000
And because of this, you know, uh, the next neural network after RNN, which we are specifically

7
00:00:21,000 --> 00:00:27,000
going to discuss is LSTM RNN and GRU RNN.

8
00:00:27,000 --> 00:00:28,000
Right.

9
00:00:28,000 --> 00:00:33,000
Why do we require these variants of RNN when we already have a simple RNN?

10
00:00:33,000 --> 00:00:35,000
Okay, so we'll be discussing about that.

11
00:00:35,000 --> 00:00:40,000
And in the upcoming videos we will be discussing LSTM RNN and GRU RNN.

12
00:00:41,000 --> 00:00:47,000
So let me just go back and probably revise some of the concepts with respect to RNN.

13
00:00:47,000 --> 00:00:51,000
So here is what my RNN looks like right?

14
00:00:51,000 --> 00:00:58,000
A simple RNN: I have a feedback loop, and then when we unfold this, you will be able to see that

15
00:00:58,000 --> 00:01:04,000
I will be getting something like this and this will basically be my.

16
00:01:04,000 --> 00:01:07,000
The output will go to my hidden layer.

17
00:01:07,000 --> 00:01:10,000
Again with respect to the specific neuron.

18
00:01:10,000 --> 00:01:12,000
This is my input.

19
00:01:13,000 --> 00:01:14,000
This is my input.

20
00:01:15,000 --> 00:01:17,000
And again it will go like this.

21
00:01:17,000 --> 00:01:25,000
And it'll keep on going to the next stage till we get the last word with respect to the time stamp.

22
00:01:25,000 --> 00:01:26,000
Right.

23
00:01:29,000 --> 00:01:36,000
And once we get this, what happens once we get the output here we are just going to apply a sigmoid

24
00:01:36,000 --> 00:01:38,000
or softmax activation function.

25
00:01:39,000 --> 00:01:42,000
I'll say softmax or sigmoid.

26
00:01:43,000 --> 00:01:46,000
And finally we will be getting a y hat.

27
00:01:46,000 --> 00:01:50,000
And later on we go ahead and calculate our loss.

28
00:01:50,000 --> 00:01:52,000
Y minus y hat.

29
00:01:52,000 --> 00:01:54,000
And our main aim is to reduce this loss.

30
00:01:55,000 --> 00:01:58,000
So what we do we basically do the back propagation.

31
00:01:58,000 --> 00:02:05,000
And if I just talk about the weights, over here I will be having w of i, here I will be having w of h.

32
00:02:05,000 --> 00:02:05,000
Okay.

33
00:02:05,000 --> 00:02:07,000
this will also be w of h.

34
00:02:07,000 --> 00:02:14,000
And here I will be having one input which will be initialized to zero.

35
00:02:14,000 --> 00:02:17,000
Then similarly this will also be w of h.

36
00:02:17,000 --> 00:02:21,000
Here I will be getting w of i, w of i, w of h.

37
00:02:21,000 --> 00:02:23,000
Similarly w of i.

38
00:02:24,000 --> 00:02:27,000
And here finally we get w of o.

39
00:02:27,000 --> 00:02:35,000
And with respect to this particular output here, you will be seeing that I will be having o1, o2, o3 like

40
00:02:35,000 --> 00:02:37,000
this up to o n, right.

41
00:02:37,000 --> 00:02:42,000
So we discussed about this entire process and we have discussed about forward and backward propagation.

42
00:02:42,000 --> 00:02:46,000
Now let's understand what is the problem with RNN okay.

43
00:02:46,000 --> 00:02:52,000
And I hope in ANN, when we were discussing, we discussed a very important topic which is called

44
00:02:52,000 --> 00:02:57,000
as vanishing gradient problem.

45
00:02:58,000 --> 00:03:00,000
Vanishing gradient problem.

46
00:03:01,000 --> 00:03:01,000
Okay.

47
00:03:01,000 --> 00:03:04,000
So let's go ahead and discuss about this again.

48
00:03:04,000 --> 00:03:10,000
And here also in our simple RNN we may face this problem.

49
00:03:10,000 --> 00:03:14,000
Now the thing is that you really need to understand how do we face this particular problem okay.

50
00:03:14,000 --> 00:03:21,000
Now let's consider in in my data set I have some text I have some output okay.

51
00:03:22,000 --> 00:03:25,000
Till now we just saw smaller sentences.

52
00:03:25,000 --> 00:03:31,000
Let's say I have four words like the food is good right.

53
00:03:31,000 --> 00:03:34,000
So my output will be one.

54
00:03:34,000 --> 00:03:41,000
The food is bad so the output is zero okay.

55
00:03:41,000 --> 00:03:45,000
When we were using all these kinds of words, right,

56
00:03:45,000 --> 00:03:48,000
or all these kinds of text or sentences,

57
00:03:48,000 --> 00:03:50,000
these sentences were small.

58
00:03:51,000 --> 00:03:53,000
These sentences were specifically small.

59
00:03:54,000 --> 00:03:55,000
Right.

60
00:03:55,000 --> 00:04:00,000
And instead of taking this particular use case let's consider one more use case which is called as text

61
00:04:00,000 --> 00:04:01,000
generation.

62
00:04:02,000 --> 00:04:03,000
Okay.

63
00:04:04,000 --> 00:04:07,000
Let's see an example with respect to text generation.

64
00:04:07,000 --> 00:04:08,000
I will go ahead and write.

65
00:04:08,000 --> 00:04:12,000
Hey, um, I'll take a very simple example.

66
00:04:13,000 --> 00:04:20,000
I like to play ____.

67
00:04:20,000 --> 00:04:21,000
Okay.

68
00:04:21,000 --> 00:04:25,000
So here I need to probably go ahead and predict my next word.

69
00:04:25,000 --> 00:04:27,000
It can be cricket, it can be football.

70
00:04:27,000 --> 00:04:28,000
It can be anything.

71
00:04:28,000 --> 00:04:28,000
Right.

72
00:04:29,000 --> 00:04:36,000
And over here this word will be predicted based on the context or based on the previous words that we

73
00:04:36,000 --> 00:04:37,000
have.

74
00:04:37,000 --> 00:04:37,000
Right.

75
00:04:37,000 --> 00:04:46,000
For this particular sentence, if I have this sentence S1 over here, you can see this word may be actually

76
00:04:46,000 --> 00:04:51,000
dependent on this word, or it can be on this word, or it can be on this word, or it can be on this

77
00:04:51,000 --> 00:04:51,000
word.

78
00:04:51,000 --> 00:04:52,000
Right.

79
00:04:52,000 --> 00:04:58,000
If I consider and if I try to train my neural network right here, you can basically consider that this

80
00:04:58,000 --> 00:05:02,000
has a kind of short term dependencies.

81
00:05:03,000 --> 00:05:06,000
Now when I say short term dependency, what does this basically mean?

82
00:05:06,000 --> 00:05:07,000
Right.

83
00:05:08,000 --> 00:05:11,000
Let's understand this with a very simple example.

84
00:05:11,000 --> 00:05:16,000
Right now here I have specifically written that I like to play.

85
00:05:16,000 --> 00:05:17,000
I need to predict the next word.

86
00:05:18,000 --> 00:05:22,000
Now this word may be dependent on this word or this word or this word or this word.

87
00:05:22,000 --> 00:05:23,000
Okay.

88
00:05:23,000 --> 00:05:28,000
So here you can see based on the length of this entire sentence.

89
00:05:29,000 --> 00:05:29,000
Right.

90
00:05:29,000 --> 00:05:35,000
Based on the length of this entire sentence, there are hardly four words.

91
00:05:35,000 --> 00:05:36,000
Right.

92
00:05:37,000 --> 00:05:41,000
So I have a dependency within this shorter sentence.

93
00:05:41,000 --> 00:05:42,000
Okay.

94
00:05:42,000 --> 00:05:45,000
But let's say I will go ahead and write a new sentence.

95
00:05:45,000 --> 00:05:46,000
Let's say I'll write.

96
00:05:46,000 --> 00:05:49,000
Hey, my name is Krish.

97
00:05:52,000 --> 00:05:57,000
And I like sports.

98
00:05:57,000 --> 00:06:05,000
Like cricket, volleyball okay.

99
00:06:06,000 --> 00:06:07,000
And I.

100
00:06:09,000 --> 00:06:10,000
Like.

101
00:06:12,000 --> 00:06:20,000
Let's say I'll go ahead and write and also like to make.

102
00:06:22,000 --> 00:06:27,000
And this will be a blank over here; I need to predict this particular word.

103
00:06:27,000 --> 00:06:29,000
Now this word can be videos.

104
00:06:29,000 --> 00:06:30,000
I like to make food.

105
00:06:30,000 --> 00:06:31,000
It can be different.

106
00:06:31,000 --> 00:06:32,000
Different words.

107
00:06:32,000 --> 00:06:36,000
Now here, in order to determine this specific output.

108
00:06:36,000 --> 00:06:41,000
Now the dependency is on this word right.

109
00:06:41,000 --> 00:06:43,000
It may be on this word.

110
00:06:43,000 --> 00:06:47,000
It may be on this or this or it may also be on other words.

111
00:06:47,000 --> 00:06:50,000
But here you can see this sentence.

112
00:06:51,000 --> 00:06:55,000
The sentence is long, is really long.

113
00:06:55,000 --> 00:06:59,000
Right. Now when this kind of sentence is long,

114
00:06:59,000 --> 00:07:02,000
Just imagine if I am passing through my RNN.

115
00:07:03,000 --> 00:07:06,000
So how many timestamps will I be requiring? I will be requiring t is equal to one,

116
00:07:06,000 --> 00:07:09,000
I may require t is equal to three four.

117
00:07:09,000 --> 00:07:10,000
Like this.

118
00:07:10,000 --> 00:07:17,000
If I keep on counting one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen.

119
00:07:17,000 --> 00:07:21,000
So at t is equal to 16 I will be sending this word for my forward propagation.

120
00:07:22,000 --> 00:07:22,000
Okay.

121
00:07:22,000 --> 00:07:24,000
Now similarly this is just a smaller set.

122
00:07:24,000 --> 00:07:27,000
This is still I'll say a sentence with a good length.

123
00:07:27,000 --> 00:07:32,000
But let's say if I have a sentence which may have 100 words.

124
00:07:34,000 --> 00:07:35,000
Which may have 100 words.

125
00:07:35,000 --> 00:07:41,000
So at t is equal to 100, I will be sending my last word for the forward and the backward propagation

126
00:07:41,000 --> 00:07:42,000
right to calculate the loss.

127
00:07:43,000 --> 00:07:48,000
Now what will happen as the sentence keeps on getting longer and longer, right?

128
00:07:49,000 --> 00:07:52,000
The final output that I really want to get right.

129
00:07:52,000 --> 00:07:57,000
This may be dependent on some words that may have started earlier, right?

130
00:07:57,000 --> 00:07:59,000
that may have come up at time

131
00:07:59,000 --> 00:08:01,000
step t is equal to 4, 3, 2.

132
00:08:02,000 --> 00:08:08,000
And this dependency cannot be captured by simple RNN okay.

133
00:08:08,000 --> 00:08:10,000
Okay, so let me just go ahead and write it down.

134
00:08:10,000 --> 00:08:11,000
Why it cannot be captured,

135
00:08:11,000 --> 00:08:13,000
I will be explaining about it.

136
00:08:13,000 --> 00:08:13,000
Right.

137
00:08:13,000 --> 00:08:16,000
The long term dependency.

138
00:08:21,000 --> 00:08:36,000
cannot be captured by RNN, and hence RNN cannot provide a better accuracy in these kinds of use cases where

139
00:08:36,000 --> 00:08:43,000
it has this kind of dependency.

140
00:08:43,000 --> 00:08:49,000
Now, you may be thinking, Krish, uh, whatever things you are actually explaining over here, it

141
00:08:49,000 --> 00:08:51,000
it hardly makes sense.

142
00:08:51,000 --> 00:08:52,000
Please try to give some example.

143
00:08:52,000 --> 00:08:56,000
Obviously I'm not going to stop over here right now.

144
00:08:56,000 --> 00:08:57,000
Let's take one example.

145
00:08:57,000 --> 00:09:01,000
Let's say I'm going to use this entire neural network again.

146
00:09:01,000 --> 00:09:01,000
Okay.

147
00:09:02,000 --> 00:09:05,000
I will just go ahead and copy this.

148
00:09:06,000 --> 00:09:13,000
I will go ahead and copy this and let's paste it over here okay.

149
00:09:15,000 --> 00:09:16,000
Let's go and do this okay.

150
00:09:16,000 --> 00:09:22,000
Let's say for the first instance I give my

151
00:09:22,000 --> 00:09:26,000
x i1, x i2, x i3.

152
00:09:26,000 --> 00:09:31,000
Okay, so in one of the sentences I just had four words right?

153
00:09:32,000 --> 00:09:33,000
Or I just had three words.

154
00:09:33,000 --> 00:09:37,000
And after that I got my output and I passed it to my sigmoid.

155
00:09:37,000 --> 00:09:43,000
So if I really want to show you with respect to, let's say in one of the sentences, I just had three

156
00:09:43,000 --> 00:09:43,000
words.

157
00:09:43,000 --> 00:09:46,000
So my unfolding will basically happen like this.

158
00:09:46,000 --> 00:09:55,000
So at t is equal to one I will be giving my x11; at t is equal to two, I'll be giving x12.

159
00:09:55,000 --> 00:09:56,000
Then my x13.

160
00:09:56,000 --> 00:10:00,000
And finally when I get my output this will be passed.

161
00:10:01,000 --> 00:10:03,000
This will be my output here.

162
00:10:03,000 --> 00:10:07,000
Instead of writing hidden I will be writing something like this.

163
00:10:07,000 --> 00:10:10,000
I will be sending it to my.

164
00:10:11,000 --> 00:10:14,000
I'll be sending it to my, let's say,

165
00:10:14,000 --> 00:10:16,000
Consider this as sigmoid.

166
00:10:16,000 --> 00:10:20,000
Okay, so I'll be sending it to my sigmoid.

167
00:10:20,000 --> 00:10:22,000
Let me just draw this properly.

168
00:10:22,000 --> 00:10:22,000
Okay.

169
00:10:22,000 --> 00:10:32,000
So here I'll be sending it to my sigmoid to calculate the y hat.

170
00:10:32,000 --> 00:10:33,000
Okay.

171
00:10:33,000 --> 00:10:33,000
okay.

172
00:10:35,000 --> 00:10:36,000
So this is pretty much easy.

173
00:10:36,000 --> 00:10:39,000
And here I will be saying that this will be my w o, okay.

174
00:10:40,000 --> 00:10:51,000
Now in this case, when I try to find out the weights, right, I need to update the weights over here, that is w of

175
00:10:51,000 --> 00:10:53,000
h, w of i and w of o.

176
00:10:54,000 --> 00:10:59,000
Now in order to calculate these weights, again the generic formula is nothing but: w new is equal to

177
00:10:59,000 --> 00:11:05,000
w old minus learning rate multiplied by derivative of loss with respect to derivative of w old, right.
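
In symbols, the update rule just stated can be sketched as below. The symbol \eta for the learning rate is an assumption for illustration; the lecture only names it in words.

```latex
w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w_{\text{old}}}
```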

178
00:11:06,000 --> 00:11:10,000
And out of this, what is the main thing that I really need to calculate is this particular value.

179
00:11:10,000 --> 00:11:11,000
Right.

180
00:11:11,000 --> 00:11:12,000
This.

181
00:11:12,000 --> 00:11:18,000
And this value needs to be calculated. First of all, I will go ahead and write it for w of o, right.

182
00:11:18,000 --> 00:11:23,000
This is the loss with respect to w o. Here I will go ahead and calculate my loss, y minus y hat.

183
00:11:23,000 --> 00:11:29,000
Okay, now, when I calculate derivative of loss with respect to derivative of w o,

184
00:11:29,000 --> 00:11:32,000
So here you will be able to see the equation will be very much simple.

185
00:11:32,000 --> 00:11:37,000
So I will go ahead and write derivative of loss with respect to derivative of y hat, multiplied by derivative

186
00:11:37,000 --> 00:11:42,000
of y hat with respect to derivative of w o.

187
00:11:42,000 --> 00:11:43,000
Okay.

188
00:11:43,000 --> 00:11:46,000
So for this the calculation will be very much easier.

189
00:11:46,000 --> 00:11:48,000
But now let's go ahead and consider one example.

190
00:11:48,000 --> 00:11:56,000
Let's say I want to update this hidden weight, okay, this w h, okay.

191
00:11:56,000 --> 00:12:01,000
So let's go ahead and update this hidden weight okay I'll take this example.

192
00:12:01,000 --> 00:12:07,000
So for this I will go ahead and write derivative of L with respect to derivative of w h old.

193
00:12:07,000 --> 00:12:13,000
Now you know that my loss my loss, which is getting calculated, is dependent on.

194
00:12:13,000 --> 00:12:20,000
In this particular case, it is dependent on o3, o3 is dependent on o2, and o2 is finally dependent on

195
00:12:20,000 --> 00:12:20,000
w h.

196
00:12:20,000 --> 00:12:21,000
Okay.

197
00:12:21,000 --> 00:12:27,000
So here what I will do I'll write simply like derivative of loss with respect to derivative of y hat.

198
00:12:27,000 --> 00:12:27,000
Sorry.

199
00:12:27,000 --> 00:12:33,000
Loss is dependent on y hat, and y hat is dependent on o3.

200
00:12:33,000 --> 00:12:34,000
Okay.

201
00:12:34,000 --> 00:12:41,000
So now I will go ahead and write derivative of y hat with respect to derivative of o3.

202
00:12:42,000 --> 00:12:43,000
Okay.

203
00:12:43,000 --> 00:12:47,000
And this is simple chain rule that we have actually discussed.

204
00:12:47,000 --> 00:12:49,000
Derivative of o3.

205
00:12:50,000 --> 00:12:51,000
Okay.

206
00:12:51,000 --> 00:12:57,000
Then we'll go and write derivative of o3 with respect to derivative of o2.

207
00:12:57,000 --> 00:13:05,000
Then we'll go ahead and write derivative of o2 with respect to derivative of w h old.
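
Written out in symbols, the chain rule just described for the three-word sentence can be sketched as follows. The subscript notation (w_h for the hidden weight, \hat{y} for y hat) is an assumed rendering of what is drawn on the board.

```latex
\frac{\partial L}{\partial w_h}
  = \frac{\partial L}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial o_3}
    \cdot \frac{\partial o_3}{\partial o_2}
    \cdot \frac{\partial o_2}{\partial w_h}
```

This is one path through the unfolded network; the contributions from the other time steps are added to it.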

208
00:13:05,000 --> 00:13:08,000
Now remember one thing: here we are

209
00:13:08,000 --> 00:13:11,000
calculating with timestamp t is equal to one.

210
00:13:11,000 --> 00:13:12,000
Right.

211
00:13:12,000 --> 00:13:16,000
If I really want to calculate all the derivatives I need to keep on adding for t is equal to two.

212
00:13:16,000 --> 00:13:20,000
I need to add so I can go ahead and take this entire value.

213
00:13:20,000 --> 00:13:21,000
Okay.

214
00:13:21,000 --> 00:13:25,000
So at t is equal to two I can go ahead and add that.

215
00:13:25,000 --> 00:13:26,000
So at t is equal to two.

216
00:13:26,000 --> 00:13:28,000
Let's say I want to go ahead and do this.

217
00:13:28,000 --> 00:13:31,000
So I will be having a different derivative.

218
00:13:31,000 --> 00:13:35,000
Or I'll be having a separate chain rule over here.

219
00:13:35,000 --> 00:13:36,000
I'll be able to get it.

220
00:13:36,000 --> 00:13:39,000
Then at t is equal to three I will be able to get a different chain rule, okay.

221
00:13:39,000 --> 00:13:46,000
But I just want to focus over here right now. Here, when you are seeing this, this

222
00:13:46,000 --> 00:13:50,000
is a specific chain rule that I have okay.

223
00:13:51,000 --> 00:13:54,000
And here you can see that with respect to t is equal to one.

224
00:13:54,000 --> 00:14:00,000
When my total number of words were just three, when t is equal to one, I was able to get this big,

225
00:14:01,000 --> 00:14:02,000
uh, chain rule of derivatives.

226
00:14:02,000 --> 00:14:02,000
Okay.

227
00:14:03,000 --> 00:14:11,000
Now, if I go ahead and consider that the length of one of my sentences

228
00:14:11,000 --> 00:14:13,000
is 50 words.

229
00:14:14,000 --> 00:14:22,000
Now, in the case of 50 words, just imagine what will happen to this particular w of h. Like, if I

230
00:14:22,000 --> 00:14:27,000
go ahead and try to calculate the derivative of loss with respect to derivative of w hidden, please

231
00:14:27,000 --> 00:14:27,000
focus on this.

232
00:14:27,000 --> 00:14:28,000
Okay?

233
00:14:28,000 --> 00:14:33,000
If the length of my sentence is 50 words, then at t is equal to one, how will my equation look?

234
00:14:33,000 --> 00:14:35,000
So I will go ahead and write like this.

235
00:14:36,000 --> 00:14:41,000
So at t is equal to 50, or sorry, the total number of timestamps will be 50.

236
00:14:41,000 --> 00:14:46,000
And I want to find out derivative of loss with respect to derivative of w old with respect to t is equal

237
00:14:46,000 --> 00:14:48,000
to one right.

238
00:14:48,000 --> 00:14:50,000
Now this same equation will come, right.

239
00:14:50,000 --> 00:14:52,000
And here what it will become.

240
00:14:52,000 --> 00:14:57,000
See now my derivative of loss will be dependent on derivative of y hat.

241
00:14:57,000 --> 00:15:03,000
Next step will be that derivative of y hat will be dependent on what derivative of O.

242
00:15:04,000 --> 00:15:06,000
Whether it will be three or whether it will be 50.

243
00:15:06,000 --> 00:15:11,000
So now I'll go ahead and write 50 because I'm going to go ahead with till t is equal to 50.

244
00:15:11,000 --> 00:15:15,000
So my last step over here will basically be o50.

245
00:15:15,000 --> 00:15:16,000
The output 50.

246
00:15:16,000 --> 00:15:18,000
That we will be calculating, right.

247
00:15:18,000 --> 00:15:25,000
Then similarly we will go ahead and calculate derivative of o50 with respect to derivative of o49.

248
00:15:25,000 --> 00:15:30,000
Similarly, we'll go ahead and write derivative of o49 with respect to derivative of o48.

249
00:15:30,000 --> 00:15:31,000
And similarly.

250
00:15:31,000 --> 00:15:32,000
Like this.

251
00:15:32,000 --> 00:15:35,000
We'll keep on going till what step.

252
00:15:35,000 --> 00:15:44,000
We'll be going till we get this one, right, till we get derivative of o2 with respect to derivative

253
00:15:44,000 --> 00:15:48,000
of w h old right.
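
For the 50-word sentence, the same chain can be sketched compactly with a product over the intermediate outputs. The product notation is an assumed shorthand for the long multiplication written on the board.

```latex
\frac{\partial L}{\partial w_h}
  = \frac{\partial L}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial o_{50}}
    \cdot \left( \prod_{t=3}^{50} \frac{\partial o_t}{\partial o_{t-1}} \right)
    \cdot \frac{\partial o_2}{\partial w_h}
```

The product contains 48 factors, from the derivative of o50 with respect to o49 down to the derivative of o3 with respect to o2.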

254
00:15:48,000 --> 00:15:50,000
So at t is equal to one.

255
00:15:50,000 --> 00:15:52,000
This will be my first step.

256
00:15:53,000 --> 00:15:54,000
Right at t is equal to one.

257
00:15:54,000 --> 00:15:56,000
This will be the value that I really need to compute.

258
00:15:56,000 --> 00:15:58,000
Then again at t is equal to two,

259
00:15:58,000 --> 00:16:00,000
I'll keep on writing it right.

260
00:16:00,000 --> 00:16:01,000
And again over there.

261
00:16:01,000 --> 00:16:02,000
Hardly.

262
00:16:02,000 --> 00:16:03,000
From 49 it will start.

263
00:16:03,000 --> 00:16:05,000
From 49 it will start and it will end till here.

264
00:16:05,000 --> 00:16:09,000
Then for t is equal to three it will start from 48 and then it will end till here.

265
00:16:09,000 --> 00:16:10,000
Right.

266
00:16:10,000 --> 00:16:14,000
Same thing will basically be happening along with this right now.

267
00:16:14,000 --> 00:16:17,000
Let's focus on these important things, right.

268
00:16:17,000 --> 00:16:19,000
Let's focus on these values.

269
00:16:19,000 --> 00:16:20,000
These values okay.

270
00:16:21,000 --> 00:16:23,000
Let's focus on these values.

271
00:16:23,000 --> 00:16:24,000
These are very important values.

272
00:16:24,000 --> 00:16:26,000
And we will take this up okay.

273
00:16:26,000 --> 00:16:28,000
Let's consider one of the equations.

274
00:16:28,000 --> 00:16:34,000
Like uh out of all this I will go ahead and take one more over here derivative of O3 with respect to

275
00:16:34,000 --> 00:16:39,000
derivative of O2 plus uh, sorry, multiplied by this last one.

276
00:16:39,000 --> 00:16:40,000
Okay.

277
00:16:40,000 --> 00:16:47,000
So what I'm saying is that if I keep on doing this, right, if I keep on going like this, it will

278
00:16:47,000 --> 00:16:48,000
continue till this particular value.

279
00:16:48,000 --> 00:16:48,000
Right.

280
00:16:48,000 --> 00:16:54,000
And before this value, I also have something like derivative of O3 divided by derivative of O2.

281
00:16:54,000 --> 00:16:57,000
Now let's go ahead and calculate this okay.

282
00:16:57,000 --> 00:16:58,000
We will just go ahead and calculate it.

283
00:16:59,000 --> 00:17:00,000
Now derivative of O3.

284
00:17:00,000 --> 00:17:02,000
What is O3 over here.

285
00:17:02,000 --> 00:17:05,000
O3 is nothing but it is very much simple.

286
00:17:05,000 --> 00:17:08,000
We basically take this w of h. See,

287
00:17:08,000 --> 00:17:11,000
Let's go ahead and see what exactly is O3 okay.

288
00:17:11,000 --> 00:17:13,000
O3 is nothing but derivative of.

289
00:17:13,000 --> 00:17:17,000
Let's say I'm going to apply a sigmoid activation function on this node.

290
00:17:17,000 --> 00:17:18,000
Okay.

291
00:17:18,000 --> 00:17:21,000
Let's consider this for o3.

292
00:17:21,000 --> 00:17:23,000
I'm just considering okay I'm just taking an example.

293
00:17:23,000 --> 00:17:24,000
So O3 is nothing.

294
00:17:24,000 --> 00:17:27,000
But first of all, I'll just go ahead and write about o3.

295
00:17:28,000 --> 00:17:28,000
Okay.

296
00:17:28,000 --> 00:17:30,000
What is o3 over here?

297
00:17:30,000 --> 00:17:34,000
o3 is nothing but how we are getting this particular output.

298
00:17:34,000 --> 00:17:36,000
It is very much simple.

299
00:17:36,000 --> 00:17:42,000
First of all, I will go ahead and multiply x of i3 by w of i.

300
00:17:42,000 --> 00:17:42,000
Right.

301
00:17:42,000 --> 00:17:52,000
So let me just go ahead and write it down over here: x of i3 multiplied by w of i, plus.

302
00:17:53,000 --> 00:17:59,000
Okay, plus we are multiplying w of h by o2.

303
00:17:59,000 --> 00:17:59,000
Right?

304
00:17:59,000 --> 00:18:03,000
So here I will go ahead and say o2 multiplied by w of h.

305
00:18:04,000 --> 00:18:04,000
Right.

306
00:18:04,000 --> 00:18:13,000
So this is my derivative of this o3 value, okay, divided by derivative

307
00:18:13,000 --> 00:18:14,000
of o2.

308
00:18:14,000 --> 00:18:20,000
Now once we perform this derivative, since we are doing it with respect to O2, this will become one.

309
00:18:20,000 --> 00:18:22,000
Okay, this will become one.

310
00:18:22,000 --> 00:18:26,000
So I will be getting this w h and this will be considered as a.

311
00:18:26,000 --> 00:18:28,000
This will be considered as a constant.

312
00:18:28,000 --> 00:18:35,000
But before this you will be able to see that when I am applying derivative over here on top of it,

313
00:18:35,000 --> 00:18:39,000
I also have to perform a sigmoid activation function on this node.

314
00:18:39,000 --> 00:18:41,000
I have to perform it on top of it.

315
00:18:41,000 --> 00:18:46,000
I also need to perform sigmoid activation function because there will be a sigmoid activation function

316
00:18:46,000 --> 00:18:46,000
over here.

317
00:18:47,000 --> 00:18:51,000
Along with this, you'll also be seeing that I will also have one bias parameter.

318
00:18:51,000 --> 00:18:53,000
So this will be plus bias okay.

319
00:18:53,000 --> 00:18:55,000
Now this is how the equation will look.

320
00:18:55,000 --> 00:19:02,000
So I'm saying, hey, derivative of sigmoid of this x of i3 multiplied by w of i plus o2 multiplied

321
00:19:02,000 --> 00:19:05,000
by w of h, plus b, okay.

322
00:19:05,000 --> 00:19:08,000
And this is entirely my o3, right.

323
00:19:08,000 --> 00:19:11,000
This is what I have replaced o3 by.

324
00:19:11,000 --> 00:19:11,000
Right.

325
00:19:11,000 --> 00:19:19,000
So o3 over here means what? o3 is nothing but sigmoid of x i3 multiplied

326
00:19:19,000 --> 00:19:26,000
by w of i plus o2 multiplied by w of h plus b, right?
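
The expression for o3 above is one step of the forward pass. A minimal sketch of that recurrence in plain Python follows, assuming scalar inputs and weights for readability (a real RNN uses vectors and matrices). The function name rnn_forward and the numeric values are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def rnn_forward(xs, w_i, w_h, w_o, b=0.0):
    """Run a scalar simple RNN over xs and return (outputs, y_hat).

    Assumption: every hidden node uses a sigmoid, as in the lecture's example.
    """
    o_prev = 0.0  # the initial hidden input is initialized to zero
    outputs = []
    for x_t in xs:
        # o_t = sigmoid(x_t * w_i + o_{t-1} * w_h + b)
        o_t = sigmoid(x_t * w_i + o_prev * w_h + b)
        outputs.append(o_t)
        o_prev = o_t
    # The last hidden output goes through w_o and a sigmoid to give y hat.
    y_hat = sigmoid(outputs[-1] * w_o)
    return outputs, y_hat


# Three made-up inputs standing in for x i1, x i2, x i3.
outputs, y_hat = rnn_forward([0.5, -1.0, 2.0], w_i=0.8, w_h=0.5, w_o=1.2)
print(outputs, y_hat)
```

Here outputs[2] plays the role of o3, and it depends on o2 only through the term o2 multiplied by w of h.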

327
00:19:27,000 --> 00:19:30,000
Now if I go ahead and find out the derivative how it will work.

328
00:19:30,000 --> 00:19:30,000
Derivative.

329
00:19:30,000 --> 00:19:36,000
If you know, I will get this as sigma dash, I will write, because this is the derivative.

330
00:19:36,000 --> 00:19:40,000
Now further going ahead you will be able to see that.

331
00:19:41,000 --> 00:19:45,000
It's like if you use the derivative for this entire thing okay.

332
00:19:45,000 --> 00:19:49,000
If you probably use this entirely, this particular derivative, this will be nothing.

333
00:19:49,000 --> 00:19:58,000
But here you can specifically write one multiplied by w of h, okay, one multiplied by w of h.

334
00:19:58,000 --> 00:20:00,000
Because this will be a constant, this will be a constant.

335
00:20:00,000 --> 00:20:01,000
This will become zero.

336
00:20:01,000 --> 00:20:02,000
This will become zero.

337
00:20:02,000 --> 00:20:02,000
Okay.

338
00:20:02,000 --> 00:20:05,000
So I have something called as one.

339
00:20:05,000 --> 00:20:06,000
multiplied by w of h.

340
00:20:06,000 --> 00:20:07,000
That's it.

341
00:20:07,000 --> 00:20:07,000
Okay.

342
00:20:07,000 --> 00:20:11,000
So here is what I'm actually going to get okay.

343
00:20:11,000 --> 00:20:14,000
So this is nothing but derivative of sigmoid.

344
00:20:14,000 --> 00:20:17,000
And here I'm just going to get one multiplied by w of h.
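
As a worked equation, the derivative just computed can be sketched as below. The name z_3 for the pre-activation is an assumption introduced here for readability; it is not used in the lecture.

```latex
z_3 = x_{i3}\, w_i + o_2\, w_h + b, \qquad o_3 = \sigma(z_3)

\frac{\partial o_3}{\partial o_2} = \sigma'(z_3) \cdot w_h,
\qquad \text{where } \sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```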

345
00:20:17,000 --> 00:20:19,000
Now this will be nothing.

346
00:20:19,000 --> 00:20:22,000
But whenever we try to find out the derivative of sigmoid.

347
00:20:22,000 --> 00:20:26,000
So I have actually explained this in my activation function.

348
00:20:26,000 --> 00:20:30,000
So derivative of sigmoid ranges between 0 to 0.25.

349
00:20:30,000 --> 00:20:33,000
The output of the sigmoid ranges between 0 to 1.

350
00:20:33,000 --> 00:20:37,000
But whenever I try to find out the derivative it will be ranging between 0 to 0.25.

351
00:20:37,000 --> 00:20:41,000
So this value that you are going to get is nothing but 0 to 0.25.
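
The 0 to 0.25 range can be checked numerically. The short sketch below evaluates the sigmoid derivative on a grid of points; the grid from -10 to 10 is an arbitrary choice for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Evaluate the derivative at z = -10.0, -9.9, ..., 10.0.
zs = [i / 10.0 for i in range(-100, 101)]
derivs = [sigmoid_derivative(z) for z in zs]

max_deriv = max(derivs)  # reached at z = 0, where sigmoid(z) = 0.5
min_deriv = min(derivs)  # reached at the ends of the grid, close to zero
print(max_deriv, min_deriv)
```

The largest value appears at z equal to zero, and the derivative shrinks toward zero as z moves away from zero in either direction.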

352
00:20:41,000 --> 00:20:43,000
In between this values you will be getting.

353
00:20:43,000 --> 00:20:46,000
Now here you can see that you observe one thing, right?

354
00:20:46,000 --> 00:20:50,000
One of the value that I calculated in this multiplication, which was this one okay.

355
00:20:50,000 --> 00:20:55,000
This one was ranging between 0 to 0.25 okay.

356
00:20:56,000 --> 00:20:59,000
Now as we go ahead right now this is a decimal.

357
00:20:59,000 --> 00:21:01,000
Let's say that we went and calculated here.

358
00:21:01,000 --> 00:21:04,000
We got a decimal as 0.015.

359
00:21:04,000 --> 00:21:06,000
Okay I just calculated it randomly.

360
00:21:06,000 --> 00:21:09,000
Then let's say for this I got 0.02.

361
00:21:09,000 --> 00:21:10,000
Right.

362
00:21:10,000 --> 00:21:12,000
For this I got 0.25.

363
00:21:12,000 --> 00:21:12,000
Right.

364
00:21:12,000 --> 00:21:13,000
For this I got point.

365
00:21:13,000 --> 00:21:15,000
Uh, not for this.

366
00:21:15,000 --> 00:21:19,000
For one of the internal one, I got 0.01.

367
00:21:19,000 --> 00:21:24,000
Now, when I keep on doing this multiplication, you'll be seeing that after some point, this will

368
00:21:24,000 --> 00:21:26,000
become a very small value.

369
00:21:26,000 --> 00:21:28,000
And it will be approximately equal to zero.
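
To see how quickly such a product shrinks, here is a small sketch that multiplies the four example values mentioned above, and also computes the largest possible product of 49 sigmoid derivatives. The second part assumes the hidden weight has magnitude at most one, so each factor is bounded by 0.25; with a larger weight the bound would not hold.

```python
# Example factors taken from the values quoted in the lecture.
factors = [0.015, 0.02, 0.25, 0.01]

product = 1.0
for f in factors:
    product *= f  # each multiplication by a number below 1 shrinks the product

# Upper bound for a chain of 49 factors, each at most 0.25
# (assumption: the hidden weight w_h has magnitude <= 1).
upper_bound_49 = 0.25 ** 49

print(product, upper_bound_49)
```

Even with only four factors the product is already below one millionth, and for a 50-word sentence the best case is many orders of magnitude smaller.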

370
00:21:29,000 --> 00:21:36,000
Now when this is approximately equal to zero, that basically means what at t is equal to one.

371
00:21:37,000 --> 00:21:44,000
where I had this first word, the word that I'm actually giving is not participating much in updating the weight

372
00:21:44,000 --> 00:21:45,000
value.

373
00:21:45,000 --> 00:21:45,000
Right.

374
00:21:45,000 --> 00:21:47,000
So what is happening over here?

375
00:21:47,000 --> 00:21:48,000
The t is equal to one.

376
00:21:48,000 --> 00:21:50,000
The word.

377
00:21:51,000 --> 00:21:54,000
is not participating.

378
00:21:56,000 --> 00:22:04,000
Participating to update the weights.

379
00:22:04,000 --> 00:22:04,000
Value.

380
00:22:04,000 --> 00:22:06,000
Do you agree or not?

381
00:22:06,000 --> 00:22:11,000
Just see, this word is not participating because this entire operation will be approximately equal to

382
00:22:11,000 --> 00:22:12,000
zero.

383
00:22:12,000 --> 00:22:14,000
Because we are just multiplying with these numbers, right?

384
00:22:14,000 --> 00:22:15,000
These numbers.

385
00:22:15,000 --> 00:22:19,000
And as you know that if you keep on multiplying with decimal numbers, you will be seeing that, hey,

386
00:22:19,000 --> 00:22:22,000
we'll be getting a value which will be approximately equal to zero.

387
00:22:22,000 --> 00:22:25,000
It will become smaller value, a very, very small value.

388
00:22:26,000 --> 00:22:32,000
Now, since this entire value is becoming zero, you'll be able to see that when I will try to add this

389
00:22:32,000 --> 00:22:38,000
number, or when I will try to add this with my next timestamp, it is not probably going to impact

390
00:22:38,000 --> 00:22:39,000
the weights, right?

391
00:22:39,000 --> 00:22:42,000
It is not going to change this specific value, right?

392
00:22:43,000 --> 00:22:47,000
You will be seeing so what is basically happening over here, here.

393
00:22:47,000 --> 00:22:54,000
In short, it is saying that the word that you initially gave in our simple RNN, when compared to the

394
00:22:54,000 --> 00:23:03,000
last word, this word is not playing a good or amazing impact in the simple RNN to probably predict

395
00:23:03,000 --> 00:23:05,000
the outcome or the output.

396
00:23:05,000 --> 00:23:06,000
Right.

397
00:23:06,000 --> 00:23:08,000
It is very much simple over here, right?

398
00:23:08,000 --> 00:23:14,000
Whichever will be the nearest value with respect to the time stamp, that may create a bigger impact.

399
00:23:14,000 --> 00:23:16,000
The reason is very much simple.

400
00:23:16,000 --> 00:23:17,000
Why?

401
00:23:17,000 --> 00:23:20,000
Because when you perform the operation, the operation will be very small.

402
00:23:20,000 --> 00:23:25,000
So here when I am trying to calculate, let's let's calculate for O3.

403
00:23:25,000 --> 00:23:26,000
Right.

404
00:23:26,000 --> 00:23:28,000
Just just for this last time timestamp.

405
00:23:28,000 --> 00:23:33,000
If I'm calculating if I'm trying to find out O3 loss is dependent on y hat, Y hat is dependent on O3,

406
00:23:34,000 --> 00:23:34,000
right.

407
00:23:34,000 --> 00:23:40,000
Or loss is dependent on y hat y hat is dependent on w o and w o is dependent on further dependent on

408
00:23:40,000 --> 00:23:41,000
W. If I.

409
00:23:41,000 --> 00:23:44,000
If I want to update this, the chain rule will become very smaller.

410
00:23:44,000 --> 00:23:46,000
But in this case the chain rule is big.

411
00:23:46,000 --> 00:23:51,000
Now when the chain rule is big, when the chain rule is big.

412
00:23:53,000 --> 00:23:53,000
right?

413
00:23:53,000 --> 00:23:56,000
Big, big basically means the chain is very big over here.

414
00:23:56,000 --> 00:24:02,000
Right over here you can see that whichever is our first word, that will not play a very big impact

415
00:24:02,000 --> 00:24:05,000
to finally predict the last output.

416
00:24:06,000 --> 00:24:06,000
Right.

417
00:24:06,000 --> 00:24:11,000
And that is what we will be seeing because here we have a long term dependency.

418
00:24:11,000 --> 00:24:12,000
Long term dependency.

419
00:24:12,000 --> 00:24:12,000
Why?

420
00:24:12,000 --> 00:24:15,000
Because in this case I took 50 words.

421
00:24:15,000 --> 00:24:16,000
This was my first word.

422
00:24:16,000 --> 00:24:19,000
This was my first word that I passed with respect to time stamp.

423
00:24:19,000 --> 00:24:24,000
And if I'm updating these weights, it's almost zero and it is not creating any impact for my final weight.

424
00:24:24,000 --> 00:24:25,000
Right.

425
00:24:25,000 --> 00:24:27,000
So this is the major problem.

426
00:24:27,000 --> 00:24:30,000
And whenever we face this kind of problem right.

427
00:24:30,000 --> 00:24:40,000
This is basically called as vanishing gradient problem okay.

428
00:24:40,000 --> 00:24:40,000
Okay.

429
00:24:40,000 --> 00:24:41,000
Vanishing gradient problem.

430
00:24:42,000 --> 00:24:42,000
Right.

431
00:24:43,000 --> 00:24:45,000
So this is very much important.

432
00:24:45,000 --> 00:24:48,000
Let's consider at t is equal to 50.

433
00:24:48,000 --> 00:24:49,000
What is going to happen okay.

434
00:24:49,000 --> 00:24:50,000
So at t is equal to 50.

435
00:24:50,000 --> 00:24:55,000
You'll be seeing if I go ahead and calculate my derivative of loss with respect to derivative of w h

436
00:24:55,000 --> 00:24:56,000
old.

437
00:24:56,000 --> 00:25:00,000
Now first I know my derivative of loss will be dependent on derivative of y hat.

438
00:25:00,000 --> 00:25:03,000
Then I have derivative of y hat.

439
00:25:03,000 --> 00:25:10,000
It will be dependent on what? If I really want to just play with the W h, the last W h, at 50.

440
00:25:10,000 --> 00:25:10,000
Right.

441
00:25:10,000 --> 00:25:16,000
So this will basically be dependent on uh derivative of y hat will be dependent on my final output.

442
00:25:16,000 --> 00:25:18,000
My final output will be o 50.

443
00:25:18,000 --> 00:25:24,000
And derivative of O 50 will be dependent on derivative of w h old.

444
00:25:24,000 --> 00:25:25,000
Right.

445
00:25:25,000 --> 00:25:29,000
This is going to this is what is going to happen at t is equal to 50.

446
00:25:29,000 --> 00:25:35,000
Now here you can see this is the right.

447
00:25:35,000 --> 00:25:37,000
So this is basically t is equal to 50.

448
00:25:37,000 --> 00:25:38,000
Right.

449
00:25:38,000 --> 00:25:41,000
This is what I'm actually going to add at t is equal to 50.

450
00:25:41,000 --> 00:25:41,000
Right.

451
00:25:41,000 --> 00:25:45,000
So this here you can see my chain rule is very small.

452
00:25:45,000 --> 00:25:49,000
And here I may be getting some value because the chain rule is small.

453
00:25:49,000 --> 00:25:54,000
I don't have to probably do so much of calculation, so much of multiplication, like when we did it

454
00:25:54,000 --> 00:25:55,000
for t is equal to one.

455
00:25:55,000 --> 00:25:58,000
So similarly I will go ahead and add it with t is equal to 49.

456
00:25:58,000 --> 00:26:00,000
Then we will have t is equal to 48.

457
00:26:00,000 --> 00:26:02,000
We'll keep on adding it till t is equal to one.

458
00:26:02,000 --> 00:26:04,000
Finally we'll be having t is equal to one.

459
00:26:04,000 --> 00:26:08,000
Here you could see that almost my value was approximately equal to zero.

460
00:26:09,000 --> 00:26:09,000
Right.

461
00:26:09,000 --> 00:26:14,000
And as we keep on going towards this timestamp, which is the last timestamp, you'll be seeing that

462
00:26:14,000 --> 00:26:19,000
these words will play a very important role while predicting the next output.

463
00:26:20,000 --> 00:26:21,000
Right.

464
00:26:21,000 --> 00:26:25,000
So here you'll be able to see that these all values will never be equal to zero.

465
00:26:25,000 --> 00:26:26,000
Right.

466
00:26:26,000 --> 00:26:30,000
It it will have some values because this will be creating an impact on this.

467
00:26:30,000 --> 00:26:36,000
And it this nearest words with respect to the timestamp will be responsible in updating the weights.

468
00:26:36,000 --> 00:26:40,000
But some of the weights that are present in t is equal to one, t is equal to two.

469
00:26:40,000 --> 00:26:43,000
These all values will be approximately equal to zero.

470
00:26:43,000 --> 00:26:45,000
And that is what vanishing gradient says.
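
The per-timestamp sum described here can be sketched with a tiny scalar RNN. Everything in this sketch is an assumption made for illustration: a single hidden value, sigmoid activation, the weights 0.9 and 0.5, an initial hidden state of 0.5, and inputs that are all 1.0. It computes how much each timestamp contributes to the derivative of the last hidden value with respect to the shared recurrent weight.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

T = 50               # sentence length used in the lecture
w_h, w_x = 0.9, 0.5  # hypothetical recurrent and input weights
xs = [1.0] * T       # hypothetical inputs, one per timestamp

# Forward pass: h[t] = sigmoid(w_h * h[t-1] + w_x * x[t]).
h = [0.5]            # h[0] is an assumed initial hidden state
local = []           # local[t-1] holds sigmoid'(z_t)
for t in range(1, T + 1):
    z = w_h * h[t - 1] + w_x * xs[t - 1]
    h_t = sigmoid(z)
    h.append(h_t)
    local.append(h_t * (1.0 - h_t))

# Contribution of timestamp t to d h[T] / d w_h:
# direct effect of w_h at step t, times the chain of factors back from step T.
contributions = []
for t in range(1, T + 1):
    direct = local[t - 1] * h[t - 1]
    chain = 1.0
    for k in range(t + 1, T + 1):
        chain *= local[k - 1] * w_h
    contributions.append(direct * chain)

total = sum(contributions)
print("contribution from t = 1: ", contributions[0])
print("contribution from t = 50:", contributions[-1])
print("total gradient:", total)
```

Under these assumptions the term for t = 1 is many orders of magnitude smaller than the term for t = 50, so the total is made up almost entirely of the last few timestamps.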

471
00:26:45,000 --> 00:26:47,000
And this is the problem in simple RNN.

472
00:26:47,000 --> 00:26:53,000
It is just focusing on it is just having a dependency in this entire chain rule.

473
00:26:53,000 --> 00:26:58,000
It is having a dependency on the nearest word, not on the farthest word that is available in that entire

474
00:26:58,000 --> 00:26:59,000
sentence.

475
00:26:59,000 --> 00:27:00,000
Right.

476
00:27:00,000 --> 00:27:01,000
And this is the major problem.

477
00:27:02,000 --> 00:27:04,000
Now how do we solve this problem?

478
00:27:04,000 --> 00:27:06,000
See, as I told, here we are using sigmoid.

479
00:27:06,000 --> 00:27:08,000
The derivative of sigmoid is 0 to 0.25.

480
00:27:08,000 --> 00:27:11,000
We can also change this to tanh activation function.

481
00:27:11,000 --> 00:27:15,000
But again in tanh you will be getting a derivative between 0 to 1.

482
00:27:15,000 --> 00:27:19,000
But as the chain rule keeps on increasing, the multiplication will keep on happening and your small

483
00:27:19,000 --> 00:27:20,000
you'll get a smaller value.

484
00:27:20,000 --> 00:27:21,000
Okay.

485
00:27:21,000 --> 00:27:29,000
So in order to solve this, uh, you can use other activation functions like ReLU okay, or leaky ReLU.

486
00:27:31,000 --> 00:27:36,000
So this, uh, here, it makes sure that the derivative is, uh, one for positive inputs.
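
To compare the activation functions mentioned here, the sketch below multiplies each derivative with itself over 50 steps. It is a simplified illustration under stated assumptions: the pre-activation value is fixed at 1.0 at every step, and the recurrent weight factor is left out, so only the effect of the activation derivative is shown.

```python
import math

def sigmoid_derivative(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_derivative(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2

def relu_derivative(x: float) -> float:
    # 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

steps = 50  # chain length, matching the 50-word example
z = 1.0     # assumed pre-activation value at every step

sig_chain = sigmoid_derivative(z) ** steps
tanh_chain = tanh_derivative(z) ** steps
relu_chain = relu_derivative(z) ** steps

print("sigmoid chain:", sig_chain)
print("tanh chain:   ", tanh_chain)
print("ReLU chain:   ", relu_chain)
```

The sigmoid and tanh chains both shrink toward zero, while the ReLU chain stays at one as long as the inputs are positive. For inputs that are not positive the ReLU derivative is zero, which is the case leaky ReLU is meant to soften.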

487
00:27:36,000 --> 00:27:37,000
Okay.

488
00:27:38,000 --> 00:27:43,000
The other case is that researcher came up with an amazing one more network, which is called as LSTM

489
00:27:43,000 --> 00:27:44,000
RNN.

490
00:27:44,000 --> 00:27:45,000
Okay.

491
00:27:45,000 --> 00:27:53,000
Now this LSTM RNN solves this problem of long short term memory.

492
00:27:53,000 --> 00:27:55,000
And that is why it is called as long short term memory RNN.

493
00:27:56,000 --> 00:27:57,000
Okay.

494
00:27:57,000 --> 00:28:00,000
Not only this, uh, you'll also be seeing.

495
00:28:00,000 --> 00:28:02,000
Along with this, we'll discuss about GRU RNN.

496
00:28:04,000 --> 00:28:05,000
Okay, GRU.

497
00:28:05,000 --> 00:28:11,000
And so both these RNNs will solve the problems with respect to a simple RNN.

498
00:28:11,000 --> 00:28:15,000
And that is what we are going to discuss about in the upcoming videos.

499
00:28:15,000 --> 00:28:17,000
But I hope you have got an idea.

500
00:28:17,000 --> 00:28:17,000
Right.

501
00:28:17,000 --> 00:28:17,000
right?

502
00:28:18,000 --> 00:28:25,000
Uh, if I have a long term dependency with respect to the sentences, then your simple RNN will not

503
00:28:25,000 --> 00:28:26,000
be working well, right?

504
00:28:26,000 --> 00:28:28,000
That is the major problem.

505
00:28:28,000 --> 00:28:30,000
And I have also explained you why?

506
00:28:30,000 --> 00:28:32,000
Because at time is equal to 50 steps.

507
00:28:32,000 --> 00:28:35,000
You can see my chain rule was so, so big.

508
00:28:35,000 --> 00:28:41,000
And when we did just one of the derivative problem statement, when we try to solve it here, you could

509
00:28:41,000 --> 00:28:46,000
see that, hey, I'm getting this particular value with respect to weight, so this will be nothing,

510
00:28:46,000 --> 00:28:47,000
but this will be weight.

511
00:28:47,000 --> 00:28:52,000
Over here we are multiplying with the specific weight, but we are multiplying a smaller value okay.

512
00:28:52,000 --> 00:28:53,000
Smaller value.

513
00:28:53,000 --> 00:28:58,000
So I hope uh you are able to understand this particular video.

514
00:28:58,000 --> 00:29:01,000
This was about the problems of RNN.

515
00:29:01,000 --> 00:29:02,000
Yes.

516
00:29:02,000 --> 00:29:06,000
I will see you all in the next video where we will be discussing about LSTM RNN and GRU RNN.

517
00:29:06,000 --> 00:29:07,000
Thank you.

