1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:05,000
So we are going to continue the discussion with respect to our deep learning and NLP.

3
00:00:06,000 --> 00:00:13,000
We have already discussed these four important neural networks, that is simple RNN, LSTM RNN,

4
00:00:13,000 --> 00:00:15,000
GRU RNN, and bidirectional RNN.

5
00:00:15,000 --> 00:00:16,000
Right.

6
00:00:16,000 --> 00:00:21,000
And here you should definitely know what is the problem in simple RNN.

7
00:00:21,000 --> 00:00:26,000
The first problem in simple RNN is about vanishing gradient problem, right?

8
00:00:26,000 --> 00:00:33,000
And because of this vanishing gradient problem, we looked at the different variants

9
00:00:33,000 --> 00:00:38,000
of RNN, specifically LSTM RNN and GRU RNN.

10
00:00:38,000 --> 00:00:38,000
Right.

11
00:00:38,000 --> 00:00:47,000
Here you could see that by using these two techniques we were able to add long short-term memory.

12
00:00:47,000 --> 00:00:48,000
Right.

13
00:00:48,000 --> 00:00:50,000
Long short term memory.

14
00:00:50,000 --> 00:00:56,000
And because of that, you will be able to see that this was very much efficient in solving problems

15
00:00:56,000 --> 00:01:01,000
which were of the many-to-one RNN type.

16
00:01:01,000 --> 00:01:05,000
You know many to one like sentiment analysis or predicting the next word.

17
00:01:05,000 --> 00:01:06,000
And many more things as such, right?

18
00:01:06,000 --> 00:01:14,000
Then we understood there were also some problems with LSTM RNN and GRU RNN, wherein, if I really

19
00:01:14,000 --> 00:01:20,000
want to do a prediction of a word, if there is some context of the further words that is required,

20
00:01:20,000 --> 00:01:24,000
or if there is some kind of dependency of that particular output.

21
00:01:24,000 --> 00:01:30,000
With respect to the upcoming words, at that point we cannot actually use LSTM or GRU.

22
00:01:30,000 --> 00:01:34,000
So for that we had this bidirectional RNN.

23
00:01:34,000 --> 00:01:34,000
Right.

24
00:01:34,000 --> 00:01:36,000
And we also understood the architecture.

25
00:01:36,000 --> 00:01:43,000
Now let me talk about one type of RNN which is specifically called many-to-many.

26
00:01:44,000 --> 00:01:45,000
Right.

27
00:01:45,000 --> 00:01:52,000
So many to many RNN specifically indicates let's say if I have this specific RNN like this, right?

28
00:01:53,000 --> 00:02:00,000
Many-to-many RNN basically says that I have multiple inputs, right?

29
00:02:00,000 --> 00:02:05,000
I have multiple inputs and with respect to that I have multiple outputs.

30
00:02:07,000 --> 00:02:09,000
y1, y2, y3.

31
00:02:09,000 --> 00:02:12,000
Right. Now, in this scenario, you will be able to see.

32
00:02:12,000 --> 00:02:14,000
Let's talk about some of the use cases.

33
00:02:14,000 --> 00:02:24,000
Some of the examples let's say machine language translation okay.

34
00:02:24,000 --> 00:02:32,000
Now in this specific example let's say if I give a sentence in English I need to convert this into French.

35
00:02:32,000 --> 00:02:36,000
So here I know there will be many words as input in English language.

36
00:02:36,000 --> 00:02:41,000
Similarly, when I try to convert this back into French, you'll be able to see multiple words with

37
00:02:41,000 --> 00:02:43,000
respect to French language.

38
00:02:43,000 --> 00:02:46,000
So this basically becomes many to many type of RNN right.

39
00:02:46,000 --> 00:02:50,000
So here I will just write many to many RNN.

40
00:02:50,000 --> 00:02:55,000
Now whenever I talk with respect to this kind of architecture or this kind of use case.

41
00:02:56,000 --> 00:02:57,000
This is nothing.

42
00:02:57,000 --> 00:03:01,000
But this is called as sequence to sequence neural network, right?

43
00:03:01,000 --> 00:03:06,000
Now why do I say this as sequence to sequence neural network.

44
00:03:06,000 --> 00:03:09,000
This is very much important to understand right.

45
00:03:09,000 --> 00:03:14,000
I will not say neural network itself, but yeah we can also say sequence to sequence neural network.

46
00:03:14,000 --> 00:03:22,000
But here we will be seeing that we are trying to solve a problem which involves sequential data as in

47
00:03:22,000 --> 00:03:22,000
my input.

48
00:03:22,000 --> 00:03:25,000
And the output is also sequential data.

49
00:03:26,000 --> 00:03:26,000
Right.

50
00:03:26,000 --> 00:03:32,000
So in this kind of use case, where my input and output are both sequential data,

51
00:03:32,000 --> 00:03:36,000
I cannot directly use a single LSTM, GRU, or simple RNN.

52
00:03:36,000 --> 00:03:39,000
It will not give us very good accuracy, right?

53
00:03:39,000 --> 00:03:47,000
So considering this, we will be learning a new architecture which is called as encoder decoder architecture.

54
00:03:47,000 --> 00:03:52,000
Okay, again, let me repeat what I am specifically talking about.

55
00:03:52,000 --> 00:03:52,000
Right.

56
00:03:52,000 --> 00:03:58,000
So everybody knows about this one type of RNN, which is called as many to many RNN, right?

57
00:03:58,000 --> 00:04:00,000
In many to many RNN.

58
00:04:00,000 --> 00:04:08,000
If I take some of the examples, let's say I want to create a model which will be able to convert one

59
00:04:08,000 --> 00:04:10,000
language into another, right?

60
00:04:10,000 --> 00:04:11,000
One language to another.

61
00:04:12,000 --> 00:04:16,000
So some of the examples with respect to this is Google translation, right?

62
00:04:16,000 --> 00:04:21,000
So let's say if I give a sentence with respect to English this should be converted into French.

63
00:04:21,000 --> 00:04:26,000
Now here, if I'm giving multiple inputs over here, let's say over here, I give a sentence saying

64
00:04:26,000 --> 00:04:31,000
that I like to eat food and in French it will automatically get converted right in French language.

65
00:04:31,000 --> 00:04:37,000
So in this kind of case, this is a kind of use case which is called as many to many RNN, right?

66
00:04:37,000 --> 00:04:44,000
Where I have multiple sequences as the input, multiple sequences as my input.

67
00:04:44,000 --> 00:04:49,000
Similarly, my output is also a sequence of words, right?

68
00:04:50,000 --> 00:04:56,000
Another example, which I can probably consider is in LinkedIn chat bot or in LinkedIn chat.

69
00:04:56,000 --> 00:04:57,000
Right.

70
00:04:57,000 --> 00:04:59,000
Whenever you're chatting with your friends, right?

71
00:04:59,000 --> 00:05:02,000
Let's say if I say hi, how are you?

72
00:05:04,000 --> 00:05:06,000
Now with respect to this, hi, how are you?

73
00:05:06,000 --> 00:05:08,000
You will be able to see that you'll be getting some suggestion.

74
00:05:08,000 --> 00:05:09,000
I'm good.

75
00:05:09,000 --> 00:05:12,000
You know which LinkedIn will automatically provide, right?

76
00:05:12,000 --> 00:05:12,000
Okay.

77
00:05:12,000 --> 00:05:13,000
Thanks.

78
00:05:13,000 --> 00:05:14,000
Something like that.

79
00:05:14,000 --> 00:05:20,000
So that is also a kind of sequence to sequence problem statement.

80
00:05:20,000 --> 00:05:20,000
Right.

81
00:05:20,000 --> 00:05:25,000
Because my output automatically it is being able to suggest some kind of sentence.

82
00:05:25,000 --> 00:05:26,000
Right.

83
00:05:26,000 --> 00:05:31,000
So this is some example of sequence to sequence data set.

84
00:05:31,000 --> 00:05:31,000
Right.

85
00:05:31,000 --> 00:05:37,000
Whenever I have this kind of use case where I have multiple inputs and I want to probably get multiple

86
00:05:37,000 --> 00:05:43,000
outputs right with respect to sequence, then we cannot use simple RNN or LSTM or GRU.

87
00:05:43,000 --> 00:05:45,000
It will not be that much efficient.

88
00:05:45,000 --> 00:05:49,000
So for that we specifically use a new architecture.

89
00:05:49,000 --> 00:05:52,000
Even we cannot use bidirectional RNN, right?

90
00:05:52,000 --> 00:05:59,000
If I have use cases that are many-to-one or one-to-one, I can use these kinds of neural

91
00:05:59,000 --> 00:05:59,000
networks.

92
00:05:59,000 --> 00:06:04,000
So for this we specifically use a new architecture which is called as encoder and decoder.

93
00:06:04,000 --> 00:06:10,000
Now we will try to understand what an encoder and decoder architecture looks like.

94
00:06:10,000 --> 00:06:15,000
And how is it different from all these RNNs that we specifically have?

95
00:06:15,000 --> 00:06:15,000
Okay.

96
00:06:15,000 --> 00:06:25,000
So let us go ahead, and let me just give you a simple working of encoder and decoder.

97
00:06:25,000 --> 00:06:28,000
So as the name suggests encoder.

98
00:06:28,000 --> 00:06:32,000
So there will be one encoder block over here okay.

99
00:06:32,000 --> 00:06:39,000
There will be one decoder block over here right now with respect to this encoder and decoder.

100
00:06:39,000 --> 00:06:40,000
Right.

101
00:06:40,000 --> 00:06:42,000
We will be giving our input sentence.

102
00:06:42,000 --> 00:06:45,000
Obviously we will be applying an embedding layer over here.

103
00:06:45,000 --> 00:06:51,000
Once we give this input sentence, each word will be converted into an array of numbers, which we basically

104
00:06:51,000 --> 00:06:52,000
call embeddings.

105
00:06:52,000 --> 00:06:56,000
Then we create a hidden state.

106
00:06:56,000 --> 00:07:02,000
This hidden state is basically what we call the context vector.

107
00:07:03,000 --> 00:07:05,000
Context vectors okay.

108
00:07:05,000 --> 00:07:07,000
We will talk more about this.

109
00:07:07,000 --> 00:07:08,000
What exactly is context vector.

110
00:07:08,000 --> 00:07:11,000
And then this is forwarded to the decoder layer.

111
00:07:12,000 --> 00:07:18,000
Now the main aim of the decoder layer is that it generates the output word by word and

112
00:07:18,000 --> 00:07:20,000
keeps feeding the words.

113
00:07:20,000 --> 00:07:23,000
Keep feeding the previous words into the decoder again.

114
00:07:24,000 --> 00:07:27,000
Okay, so this is what exactly happens in the decoder.

115
00:07:27,000 --> 00:07:34,000
The decoder generates the output word by word and keeps feeding the previous word into the decoder.

116
00:07:34,000 --> 00:07:35,000
Again okay.

117
00:07:35,000 --> 00:07:39,000
So this is a simple working about encoder and decoder right.
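The word-by-word loop described above can be sketched roughly as follows. This is only an illustration: `NEXT_WORD` and `decoder_step` are hypothetical stand-ins for a trained decoder, and the `<sos>` / `<eos>` marker strings are assumed names for the start and end tokens.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical lookup table standing in for a trained decoder network.
NEXT_WORD = {SOS: "i'm", "i'm": "good", "good": EOS}

def decoder_step(prev_word, context):
    """Pretend decoder step: a real one would combine the context vector
    with the previous word inside an LSTM and pick the next word."""
    return NEXT_WORD.get(prev_word, EOS)

def greedy_decode(context, max_len=10):
    """Generate words one at a time, feeding each word back into the decoder."""
    output = []
    prev = SOS
    for _ in range(max_len):
        word = decoder_step(prev, context)
        if word == EOS:
            break
        output.append(word)
        prev = word  # the word just generated becomes the next input
    return output

print(greedy_decode(context=None))
```

The `max_len` cap is a design choice so that the loop stops even if the end marker is never produced.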

118
00:07:40,000 --> 00:07:43,000
Just get this particular understanding that I have an encoder.

119
00:07:44,000 --> 00:07:48,000
This encoder's work is to take an input, right?

120
00:07:48,000 --> 00:07:55,000
And with respect to this particular input, its work is to probably finally generate a context vector.

121
00:07:57,000 --> 00:08:03,000
And if I take this context vector and give it to my decoder now okay.

122
00:08:03,000 --> 00:08:08,000
Or if you want some other diagram, I will just try to draw it for you.

123
00:08:08,000 --> 00:08:13,000
So if I give you one more example with respect to this encoder decoder diagram.

124
00:08:13,000 --> 00:08:15,000
So here I will just show you one more diagram over here.

125
00:08:15,000 --> 00:08:16,000
Okay.

126
00:08:16,000 --> 00:08:19,000
Now see in this diagram what we are basically doing.

127
00:08:19,000 --> 00:08:22,000
This is my input text I am giving it to the encoder.

128
00:08:22,000 --> 00:08:24,000
It is generating a context vector.

129
00:08:24,000 --> 00:08:30,000
And this context vector will be given back to the decoder along with the output text.

130
00:08:30,000 --> 00:08:33,000
And finally we should be able to get the output okay.

131
00:08:33,000 --> 00:08:37,000
Instead of summary I will not say summary, but I'll probably talk about this as an output.

132
00:08:37,000 --> 00:08:38,000
Okay.

133
00:08:38,000 --> 00:08:44,000
Now I know this diagram looks very much simple, and I hope you may be getting more confused just by

134
00:08:44,000 --> 00:08:45,000
seeing this.

135
00:08:45,000 --> 00:08:50,000
So what I am actually going to do is that in order to show you one more better diagram, this is how

136
00:08:50,000 --> 00:08:51,000
it exactly happens.

137
00:08:51,000 --> 00:08:55,000
So I need to show you what exactly happens in an encoder.

138
00:08:55,000 --> 00:08:59,000
So as I said, this entire architecture has one encoder.

139
00:08:59,000 --> 00:09:06,000
Encoder takes the input and converts all this input into a context vector.

140
00:09:07,000 --> 00:09:15,000
This context vector has the entire information about that particular input in the form

141
00:09:15,000 --> 00:09:15,000
of vectors.

142
00:09:15,000 --> 00:09:16,000
Right?

143
00:09:17,000 --> 00:09:22,000
Then the second thing is that once we have this particular vector, we have to pass all this vector

144
00:09:22,000 --> 00:09:25,000
to the decoder to solve our use case.

145
00:09:25,000 --> 00:09:31,000
And this context vector will also get passed to this particular decoder.

146
00:09:31,000 --> 00:09:34,000
And finally we will be able to create our output over here.

147
00:09:34,000 --> 00:09:37,000
What kind of use case will be specifically used over here.

148
00:09:37,000 --> 00:09:42,000
Some of the use cases are like language translation.

149
00:09:42,000 --> 00:09:44,000
Language translation.

150
00:09:46,000 --> 00:09:48,000
Text generation.

151
00:09:48,000 --> 00:09:49,000
Right.

152
00:09:49,000 --> 00:09:52,000
Text generation or sentence generation?

153
00:09:52,000 --> 00:09:56,000
I'll say not word generation, but the complete sentence or text generation.

154
00:09:56,000 --> 00:09:57,000
Right.

155
00:09:57,000 --> 00:10:05,000
And some more use cases you will be able to see, like how we have this LinkedIn chatbot suggestion.

156
00:10:05,000 --> 00:10:05,000
Right.

157
00:10:05,000 --> 00:10:09,000
Or I can also go ahead and write it as text suggestion.

158
00:10:09,000 --> 00:10:15,000
So all these kind of use cases will be able to get solved with the help of encoder and decoder.

159
00:10:15,000 --> 00:10:17,000
Now let me do one thing.

160
00:10:17,000 --> 00:10:23,000
Let us go ahead, one step ahead and try to understand what is inside this encoder and what is inside

161
00:10:23,000 --> 00:10:24,000
this decoder.

162
00:10:24,000 --> 00:10:26,000
You really need to understand that.

163
00:10:26,000 --> 00:10:30,000
So for that we have this amazing diagram again.

164
00:10:30,000 --> 00:10:35,000
And this we basically see it as sequence to sequence encoder and decoder neural network.

165
00:10:36,000 --> 00:10:42,000
Usually inside the encoder and decoder we have LSTM RNN. Okay, why LSTM RNN?

166
00:10:42,000 --> 00:10:43,000
Why not simple RNN?

167
00:10:43,000 --> 00:10:49,000
Because simple RNN already has a problem which is called the vanishing gradient problem, right?

168
00:10:49,000 --> 00:10:52,000
Vanishing gradient problem.

169
00:10:53,000 --> 00:10:56,000
Now because of this problem we cannot use simple RNN.

170
00:10:56,000 --> 00:11:00,000
Instead you can go ahead and use an LSTM RNN or GRU inside this, okay?

171
00:11:00,000 --> 00:11:01,000
Now see this one.

172
00:11:02,000 --> 00:11:03,000
Very much simple.

173
00:11:03,000 --> 00:11:06,000
And here you will be able to get the complete meaning.

174
00:11:06,000 --> 00:11:09,000
What does encoder actually do and what does decoder actually do okay.

175
00:11:10,000 --> 00:11:13,000
So here you see that it is an LSTM layer.

176
00:11:13,000 --> 00:11:16,000
And this is basically unrolled based on the output.

177
00:11:16,000 --> 00:11:16,000
Right.

178
00:11:16,000 --> 00:11:18,000
Or sorry based on the input that I have.

179
00:11:19,000 --> 00:11:25,000
So here this line that you see this is basically your long term memory right.

180
00:11:25,000 --> 00:11:28,000
If you remember in LSTM it is your long term memory.

181
00:11:28,000 --> 00:11:30,000
And this is your hidden state right.

182
00:11:30,000 --> 00:11:35,000
So along with this input whatever operation basically happens inside the LSTM.

183
00:11:35,000 --> 00:11:40,000
If you remember, in the LSTM we have something called the forget gate.

184
00:11:40,000 --> 00:11:43,000
We have this input gate, right?

185
00:11:43,000 --> 00:11:46,000
Uh, we have something called as candidate input.

186
00:11:47,000 --> 00:11:48,000
Candidate input.

187
00:11:48,000 --> 00:11:50,000
We have discussed all about this.

188
00:11:50,000 --> 00:11:52,000
Third thing that we specifically have is our output gate.

189
00:11:52,000 --> 00:11:53,000
Right.

190
00:11:53,000 --> 00:11:58,000
So all this gate, whatever operation, what operation it basically does, it does like what context

191
00:11:58,000 --> 00:12:03,000
needs to be added in the long-term memory cell and what context needs to be removed from the memory cell.

192
00:12:03,000 --> 00:12:08,000
Along with that, what needs to be retained in the long short term, sorry short term memory cell,

193
00:12:08,000 --> 00:12:10,000
which is nothing but the hidden state.

194
00:12:10,000 --> 00:12:18,000
As I said, this is my this is my long term memory, right?

195
00:12:18,000 --> 00:12:21,000
And this is my short term memory.

196
00:12:23,000 --> 00:12:24,000
Right.

197
00:12:24,000 --> 00:12:26,000
We use this long term and short term memory.

198
00:12:26,000 --> 00:12:30,000
Now whatever operation is basically going to take with respect to that input.

199
00:12:30,000 --> 00:12:32,000
Now see there is an input which is called as thank you.

200
00:12:32,000 --> 00:12:33,000
Right.

201
00:12:33,000 --> 00:12:35,000
And I need to convert this word.

202
00:12:35,000 --> 00:12:40,000
Thank you let's say into French which is basically written as gracias.

203
00:12:40,000 --> 00:12:44,000
Okay, now in order to convert this how my input data will look like.

204
00:12:44,000 --> 00:12:47,000
So this will specifically be my data set.

205
00:12:47,000 --> 00:12:53,000
In my data set I will be having my English word, and I will be having my French word right now with

206
00:12:53,000 --> 00:12:58,000
respect to this English and French word, the first word will be something like thank you.

207
00:12:58,000 --> 00:13:03,000
And the conversion that you specifically have is something like gracias.

208
00:13:03,000 --> 00:13:06,000
Okay, I hope this is written in French.

209
00:13:06,000 --> 00:13:07,000
Okay.

210
00:13:07,000 --> 00:13:12,000
Now over here you will be able to see that inside this LSTM,

211
00:13:12,000 --> 00:13:16,000
I will be passing my entire sentence.

212
00:13:16,000 --> 00:13:17,000
Okay.

213
00:13:17,000 --> 00:13:21,000
I will be passing my entire sentence along with this sentence.

214
00:13:21,000 --> 00:13:26,000
What I will do, I will add special tokens like SOS and EOS.

215
00:13:28,000 --> 00:13:34,000
SOS basically means it is just to indicate it is the start of this particular sentence, and EOS basically

216
00:13:34,000 --> 00:13:36,000
indicates it is the end of the statement, right?

217
00:13:36,000 --> 00:13:41,000
Similarly, for this French word, I will also go ahead and write SOS.

218
00:13:41,000 --> 00:13:45,000
Gracias, and we will write it as EOS.

219
00:13:45,000 --> 00:13:50,000
This is important just to make sure to indicate the neural network.

220
00:13:50,000 --> 00:13:52,000
What exactly is the start of the statement and end of the statement?
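As a small illustration of the start and end markers just described, here is a minimal sketch. The marker strings `<sos>` and `<eos>`, the helper name `add_markers`, and the whitespace tokenization are assumptions made for the example.

```python
SOS, EOS = "<sos>", "<eos>"  # assumed spellings for the start/end markers

def add_markers(sentence):
    """Lowercase, split on whitespace, and wrap the tokens with SOS and EOS."""
    return [SOS] + sentence.lower().split() + [EOS]

# One training pair in the lecture's example data set (input, target).
pairs = [("Thank you", "Gracias")]
prepared = [(add_markers(src), add_markers(tgt)) for src, tgt in pairs]

print(prepared[0][0])
print(prepared[0][1])
```

With these markers the input becomes four tokens and the target becomes three, which matches the counts used in the rest of the walkthrough.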

221
00:13:53,000 --> 00:13:53,000
Okay.

222
00:13:53,000 --> 00:13:58,000
Now this thank you is basically getting converted into a vector okay.

223
00:13:58,000 --> 00:14:01,000
So let's say over here by using all these characters.

224
00:14:01,000 --> 00:14:05,000
Now how many characters do I have? One, two, three, four.

225
00:14:05,000 --> 00:14:08,000
Now if I am using a one hot encoding right.

226
00:14:08,000 --> 00:14:11,000
So first word that is probably going to go into the LSTM.

227
00:14:11,000 --> 00:14:17,000
So what we are basically going to do this will basically be my LSTM in my encoder okay.

228
00:14:17,000 --> 00:14:20,000
So let's go ahead and draw this.

229
00:14:20,000 --> 00:14:23,000
So if you want I will draw it properly okay.

230
00:14:23,000 --> 00:14:29,000
So this is my entire encoder just to give you an idea how the entire training basically happens.

231
00:14:29,000 --> 00:14:31,000
So this is my encoder.

232
00:14:31,000 --> 00:14:36,000
Now inside my encoder you know that what I use I use LSTM right.

233
00:14:36,000 --> 00:14:39,000
So this is my LSTM inside my LSTM.

234
00:14:39,000 --> 00:14:42,000
You know that I should have two memory cell.

235
00:14:42,000 --> 00:14:44,000
One is for long term and one is for short term.

236
00:14:44,000 --> 00:14:46,000
This is indicated by CT.

237
00:14:46,000 --> 00:14:51,000
This is indicated by HT. Okay, here we have the forget gate, input gate and output gate.

238
00:14:51,000 --> 00:14:53,000
All the operation is basically going to happen right.

239
00:14:54,000 --> 00:14:57,000
Whatever output that is probably coming from here I will not be taking that.

240
00:14:58,000 --> 00:14:58,000
You know.

241
00:14:58,000 --> 00:15:02,000
So what we do in the encoder stage is that very simple.

242
00:15:02,000 --> 00:15:07,000
We give every word one by one to the input over here to the LSTM.

243
00:15:07,000 --> 00:15:08,000
Right.

244
00:15:08,000 --> 00:15:12,000
So let's say the first word that I'm actually going to give is nothing but SOS.

245
00:15:12,000 --> 00:15:16,000
So this SOS word how do I give it okay.

246
00:15:16,000 --> 00:15:19,000
Let's say if I'm passing SOS word right.

247
00:15:19,000 --> 00:15:20,000
First word is SOS.

248
00:15:20,000 --> 00:15:27,000
So in the first instance SOS when I pass this entire embedding will basically happen.

249
00:15:27,000 --> 00:15:29,000
Right. Here we can use Word2Vec.

250
00:15:29,000 --> 00:15:31,000
Here we can use any kind of embedding techniques.

251
00:15:31,000 --> 00:15:32,000
Right.

252
00:15:32,000 --> 00:15:37,000
But the most efficient one is the embedding technique with respect to embedding layers.

253
00:15:37,000 --> 00:15:37,000
Right.

254
00:15:37,000 --> 00:15:39,000
So here I have SOS.

255
00:15:39,000 --> 00:15:44,000
So for SOS since I have total four words this will be one and remaining all will be zero.

256
00:15:44,000 --> 00:15:47,000
Let's say that I'm just using one hot encoding right.

257
00:15:48,000 --> 00:15:53,000
So in this particular word my SOS is getting passed okay.

258
00:15:53,000 --> 00:15:57,000
Then in the next step what it will do in time is equal to one.

259
00:15:57,000 --> 00:16:00,000
It will go to my next LSTM.

260
00:16:01,000 --> 00:16:02,000
Right.

261
00:16:02,000 --> 00:16:05,000
Now in this LSTM what I'm actually going to pass my next word.

262
00:16:05,000 --> 00:16:06,000
My next word is nothing.

263
00:16:06,000 --> 00:16:09,000
But thank you.

264
00:16:09,000 --> 00:16:13,000
Right. Now in the case of thank you this will be 0, 1, 0 and 0, right?

265
00:16:13,000 --> 00:16:16,000
So this will basically be thank you according to one hot encoding.

266
00:16:16,000 --> 00:16:17,000
Right.

267
00:16:17,000 --> 00:16:20,000
If I directly use Word2Vec I will be getting a different vector dimension.

268
00:16:20,000 --> 00:16:21,000
Right.

269
00:16:21,000 --> 00:16:24,000
Then coming to the next third word okay.

270
00:16:24,000 --> 00:16:26,000
So this operation is specifically happening.

271
00:16:26,000 --> 00:16:30,000
Then in the third instance I will pass my third word right.

272
00:16:30,000 --> 00:16:33,000
In this particular third word, I will just go ahead and pass you.

273
00:16:33,000 --> 00:16:38,000
You is nothing, but it will give this particular input as like this.

274
00:16:38,000 --> 00:16:42,000
And here, instead of writing thank you, I will just pass you over here.

275
00:16:42,000 --> 00:16:45,000
This will be my third word right now.

276
00:16:45,000 --> 00:16:47,000
Similarly my fourth word what it will happen again.

277
00:16:47,000 --> 00:16:51,000
It will go to the next LSTM.

278
00:16:52,000 --> 00:16:55,000
And this is my entire encoder now okay.

279
00:16:56,000 --> 00:16:57,000
Encoder.

280
00:16:57,000 --> 00:16:57,000
Right.

281
00:16:57,000 --> 00:17:04,000
So here I'm actually going to pass my fourth word, which will be my end of sentence: 0, 0, 0, 1.

282
00:17:04,000 --> 00:17:08,000
This is nothing but end of sentence right now.

283
00:17:08,000 --> 00:17:11,000
This is how we go ahead and pass it over here.
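The one-hot vectors for the four input tokens can be reproduced with a short sketch. The vocabulary order and the `<sos>` / `<eos>` spellings are assumptions chosen to match the vectors mentioned above.

```python
# Assumed vocabulary order: start marker, "thank", "you", end marker.
vocab = ["<sos>", "thank", "you", "<eos>"]
index = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """Return a list with 1 at the token's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec

for token in vocab:
    print(token, one_hot(token))
```

An embedding layer or Word2Vec would replace these sparse vectors with dense ones of a chosen dimension, but the step-by-step feeding stays the same.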

284
00:17:11,000 --> 00:17:17,000
And after this, the operation that we are basically going to do, we will be getting our CT and HT.

285
00:17:17,000 --> 00:17:19,000
So this will be my CT and HT.

286
00:17:19,000 --> 00:17:21,000
What does CT indicate?

287
00:17:21,000 --> 00:17:30,000
CT indicates my long term memory and HT indicates my short term memory, right?

288
00:17:31,000 --> 00:17:32,000
Short term memory.

289
00:17:33,000 --> 00:17:40,000
Now, this long term and short term memory combined is basically called the context vector.

290
00:17:42,000 --> 00:17:44,000
Context vectors.

291
00:17:44,000 --> 00:17:49,000
In short, all these vector operations will probably be taking place inside this all the operations.

292
00:17:49,000 --> 00:17:51,000
Because inside this we have this forget gate.

293
00:17:51,000 --> 00:17:53,000
We have this input gate, we have this output gate.

294
00:17:53,000 --> 00:17:58,000
So with respect to all this time stamp, when I'm passing with t is equal to one, t is equal to two,

295
00:17:58,000 --> 00:18:00,000
t is equal to three, t is equal to four.

296
00:18:00,000 --> 00:18:01,000
Right.

297
00:18:01,000 --> 00:18:04,000
I'm passing this particular sentence one by one right.

298
00:18:04,000 --> 00:18:08,000
So this is what exactly happens inside the encoder.
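To make the encoder pass concrete, here is a rough sketch of an LSTM cell stepping over the four one-hot inputs and returning the final short-term state `h_t` and long-term state `c_t`. Everything numeric here is an assumption: the weights are random and untrained, biases are left out, and the hidden size of 3 is arbitrary.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def make_gate(rng, hidden, inputs):
    """Random weights for one gate acting on [h_prev ; x_t] (untrained, illustrative)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
            for _ in range(hidden)]

def lstm_step(params, x, h_prev, c_prev):
    z = h_prev + x                                       # concatenate previous h and input
    f = [sigmoid(a) for a in matvec(params["f"], z)]     # forget gate
    i = [sigmoid(a) for a in matvec(params["i"], z)]     # input gate
    g = [math.tanh(a) for a in matvec(params["g"], z)]   # candidate input
    o = [sigmoid(a) for a in matvec(params["o"], z)]     # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def encode(params, one_hot_inputs, hidden):
    """Feed the tokens one by one; only the final (h, c) pair is kept."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in one_hot_inputs:
        h, c = lstm_step(params, x, h, c)
    return h, c

rng = random.Random(0)
HIDDEN, VOCAB = 3, 4
params = {name: make_gate(rng, HIDDEN, VOCAB) for name in ("f", "i", "g", "o")}
sentence = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

h_t, c_t = encode(params, sentence, HIDDEN)
print("h_t:", h_t)
print("c_t:", c_t)
```

The per-step outputs are discarded on purpose; the pair `(h_t, c_t)` is what gets handed to the decoder as the context vector.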

299
00:18:09,000 --> 00:18:09,000
Okay.

300
00:18:09,000 --> 00:18:12,000
It is not like you have to learn decoder separately.

301
00:18:12,000 --> 00:18:14,000
Decoder will also happen at the same time.

302
00:18:14,000 --> 00:18:15,000
Now see this okay.

303
00:18:15,000 --> 00:18:20,000
So encoder what it has done is that it has taken all the words inside the sentence.

304
00:18:20,000 --> 00:18:22,000
And it has helped us to generate the context vector.

305
00:18:22,000 --> 00:18:23,000
How did it generate the context vector?

306
00:18:23,000 --> 00:18:28,000
Whatever operation is present inside this with respect to forget gate with respect to input gate with

307
00:18:28,000 --> 00:18:29,000
respect to output gate.

308
00:18:30,000 --> 00:18:32,000
Right, all these three operations that it has, right?

309
00:18:32,000 --> 00:18:34,000
And with respect to that I need to combine this.

310
00:18:34,000 --> 00:18:36,000
I need to add some input in the CT.

311
00:18:36,000 --> 00:18:39,000
Otherwise remove some context from the CT right?

312
00:18:39,000 --> 00:18:42,000
We finally get this entire CT and HT, right?

313
00:18:42,000 --> 00:18:46,000
Now we will go ahead and create our decoders.

314
00:18:47,000 --> 00:18:49,000
Now in the case of decoder what will happen.

315
00:18:49,000 --> 00:18:51,000
See this okay.

316
00:18:51,000 --> 00:18:57,000
So in the case of decoder, let's say in our training data set the French word is gracias, right?

317
00:18:57,000 --> 00:19:03,000
So what we will do over here again we will go ahead and use one LSTM.

318
00:19:03,000 --> 00:19:04,000
Right.

319
00:19:05,000 --> 00:19:08,000
Now in this particular case of LSTM what is basically going to happen.

320
00:19:08,000 --> 00:19:09,000
Now see this in decoder stage.

321
00:19:09,000 --> 00:19:11,000
What is basically going to happen.

322
00:19:11,000 --> 00:19:14,000
This entire CT and HT will get passed.

323
00:19:14,000 --> 00:19:16,000
We are going to pass this entire vector to this.

324
00:19:17,000 --> 00:19:22,000
Now you know that we have transformed our training data.

325
00:19:22,000 --> 00:19:25,000
So this I'm considering as my training data okay.

326
00:19:26,000 --> 00:19:27,000
Now for the training data.

327
00:19:27,000 --> 00:19:28,000
Also my output.

328
00:19:28,000 --> 00:19:29,000
This is my output, right?

329
00:19:30,000 --> 00:19:32,000
And this is my true output.

330
00:19:32,000 --> 00:19:35,000
We basically say it as y truth, right?

331
00:19:35,000 --> 00:19:36,000
Not y hat.

332
00:19:36,000 --> 00:19:37,000
Y hat is your prediction.

333
00:19:38,000 --> 00:19:42,000
Now in your y truth you know that there is SOS, right?

334
00:19:42,000 --> 00:19:44,000
There is gracias and there is EOS.

335
00:19:44,000 --> 00:19:47,000
Along with this context vector, what I will do, I'll train it.

336
00:19:47,000 --> 00:19:54,000
You know, so for the first instance I will go ahead and pass my input in this case first one that I'm

337
00:19:54,000 --> 00:19:55,000
going to pass is my SOS.
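One common way to line up decoder inputs and expected outputs during training is to shift the true sequence by one position. This is an assumption for illustration (the technique is often called teacher forcing); the lecture itself only states that SOS is passed in first.

```python
# True target sequence (y truth) with the assumed marker spellings.
y_truth = ["<sos>", "gracias", "<eos>"]

decoder_inputs = y_truth[:-1]   # what is fed in at each step
decoder_targets = y_truth[1:]   # what the decoder should predict at each step

for step, (fed, expected) in enumerate(zip(decoder_inputs, decoder_targets), start=1):
    print(f"step {step}: feed {fed!r}, expect {expected!r}")
```

So when SOS is fed together with the context vector, the expected prediction is the first real target word.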

338
00:19:55,000 --> 00:20:01,000
Now, since there are only three words in the output like that in the entire data set, you may have

339
00:20:01,000 --> 00:20:03,000
many words, but I'm just considering.

340
00:20:03,000 --> 00:20:08,000
As an example, there are three words, so SOS is indicated by 1, 0, 0.

341
00:20:08,000 --> 00:20:11,000
In this particular case this will basically get passed over here, right?

342
00:20:11,000 --> 00:20:12,000
How did I convert this?

343
00:20:12,000 --> 00:20:15,000
Again by using the same one hot encoding technique.

344
00:20:15,000 --> 00:20:17,000
We can replace this entire technique by embedding layer.

345
00:20:17,000 --> 00:20:22,000
But just to give you an idea what exactly is happening, I'm actually writing it over here right then

346
00:20:22,000 --> 00:20:24,000
with respect to this particular output.

347
00:20:24,000 --> 00:20:31,000
What we do over here, we pass it to a fully connected layer with softmax activation function.

348
00:20:31,000 --> 00:20:31,000
Okay.

349
00:20:31,000 --> 00:20:34,000
So let me just go ahead and talk about it.

350
00:20:34,000 --> 00:20:35,000
Right.

351
00:20:35,000 --> 00:20:37,000
You know what operation is basically going to happen over here.

352
00:20:37,000 --> 00:20:39,000
Because here also there will be forget gate.

353
00:20:39,000 --> 00:20:41,000
There will be input gate and there will be output gate.

354
00:20:41,000 --> 00:20:47,000
After we get the output we basically pass it to softmax.

355
00:20:47,000 --> 00:20:48,000
Why softmax?

356
00:20:48,000 --> 00:20:53,000
Because you will be able to see that I am having this three words, right?

357
00:20:53,000 --> 00:20:59,000
I'll be showing that I'll not be having three words, but I have added this start of sentence and end

358
00:20:59,000 --> 00:21:00,000
of sentence.

359
00:21:00,000 --> 00:21:00,000
Right?

360
00:21:00,000 --> 00:21:07,000
So with respect to softmax, I will be getting let's say three outputs over here one, two and three.

361
00:21:07,000 --> 00:21:08,000
Okay.

362
00:21:08,000 --> 00:21:14,000
Now with respect to this output, let's say my vector looks something like this. I will get the

363
00:21:14,000 --> 00:21:16,000
vector something like this.

364
00:21:16,000 --> 00:21:17,000
Values let's say over here.

365
00:21:20,000 --> 00:21:21,000
Or let me define it something like this.

366
00:21:21,000 --> 00:21:24,000
So this will basically be my y hat.

367
00:21:24,000 --> 00:21:26,000
Y hat basically means predicted value.

368
00:21:26,000 --> 00:21:27,000
And this y hat.

369
00:21:27,000 --> 00:21:32,000
Why are we using softmax? Because this is becoming a multi-class classification.

370
00:21:32,000 --> 00:21:38,000
So with respect to the SOS that I'm actually giving, what should be the output that I should get along

371
00:21:38,000 --> 00:21:40,000
with this context vector.

372
00:21:40,000 --> 00:21:42,000
So let's say that here I'm going to get three coordinates.

373
00:21:42,000 --> 00:21:49,000
Let's say I'm going to get 0.5, I'm sorry, I'm going to get 0.1.

374
00:21:49,000 --> 00:21:52,000
I'm going to probably get 0.4.

375
00:21:53,000 --> 00:21:54,000
I'm going to get 0.5.

376
00:21:54,000 --> 00:21:56,000
And I'm going to get 0.4.

377
00:21:56,000 --> 00:21:56,000
Right.

378
00:21:56,000 --> 00:22:02,000
So when I got this three vectors, we will just go ahead and see which value is the maximum in this

379
00:22:02,000 --> 00:22:03,000
vector.

380
00:22:03,000 --> 00:22:09,000
If it says 0.5, that basically means it is going to give us the output as gracias.

381
00:22:10,000 --> 00:22:10,000
Okay.

382
00:22:10,000 --> 00:22:16,000
So here what we do, we will convert this back into this particular output because the second

383
00:22:17,000 --> 00:22:20,000
index or second value that is there is for this particular word, gracias.

384
00:22:20,000 --> 00:22:22,000
Because it is present over here, right?

385
00:22:23,000 --> 00:22:25,000
So this is how we are able to get the y hat.

386
00:22:25,000 --> 00:22:26,000
Okay.

387
00:22:26,000 --> 00:22:30,000
Yes, there may be a scenario where I may not get this particular value.

388
00:22:30,000 --> 00:22:33,000
Also I may get that third value having a highest value.

389
00:22:33,000 --> 00:22:35,000
Highest value with respect to probability.

390
00:22:35,000 --> 00:22:35,000
Right.

391
00:22:35,000 --> 00:22:41,000
Because when we are using softmax here we are basically going to get the probability of y hat right.

392
00:22:41,000 --> 00:22:46,000
And right now since this is a multi-class classification based on three different vectors, whichever

393
00:22:46,000 --> 00:22:50,000
vector value is high, uh, with respect to the probability, let's say vector value.

394
00:22:50,000 --> 00:22:52,000
But I'll say this is the probability value, right.

395
00:22:52,000 --> 00:22:54,000
Again let me repeat this.

396
00:22:54,000 --> 00:22:59,000
So here once I get the output I'm going to pass this through a softmax activation function.

397
00:22:59,000 --> 00:23:03,000
And we know that over here there are three words: SOS, gracias and EOS.

398
00:23:03,000 --> 00:23:07,000
With respect to this particular SOS along with the context vector, it should predict something.

399
00:23:07,000 --> 00:23:13,000
And let's say if it gives this three output that I have got over here, let's say if it gives this three

400
00:23:13,000 --> 00:23:21,000
output okay, since we are actually getting this y hat, if it gets this three output, one is 0.1,

401
00:23:21,000 --> 00:23:25,000
one is 0.6, one is 0.3.

402
00:23:25,000 --> 00:23:28,000
Now, in this case, the probability of 0.1 is.

403
00:23:28,000 --> 00:23:29,000
For this SOS word.

404
00:23:29,000 --> 00:23:34,000
The probability of 0.6 is for gracias, and this probability of 0.3 is for EOS.

405
00:23:34,000 --> 00:23:38,000
So whichever will be the highest probability that will be my predicted output.

406
00:23:38,000 --> 00:23:42,000
So here I will be specifically getting gracias, okay.

407
00:23:42,000 --> 00:23:43,000
Yeah.

408
00:23:43,000 --> 00:23:48,000
There may be scenario that I may get another probability I may get 0.1 over here, 0.3 over here and

409
00:23:48,000 --> 00:23:49,000
0.6 over here.

410
00:23:49,000 --> 00:23:50,000
In this particular case, what will happen?

411
00:23:50,000 --> 00:23:52,000
The prediction will be end of sentence.

412
00:23:52,000 --> 00:23:57,000
So if I am getting end of sentence that basically means I will not go further.
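
A small sketch of the softmax-then-pick-the-maximum step described above, assuming the vocabulary order SOS, gracias, EOS. The raw scores in `logits` are made-up numbers for illustration, and the helper names `softmax` and `predict_word` are not from the lecture.

```python
import math

# Toy target vocabulary; the ordering is an assumption.
vocab = ["SOS", "gracias", "EOS"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_word(probs, vocab):
    """Pick the word whose probability is the highest."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best_index]


# Hypothetical raw scores coming out of the fully connected layer.
logits = [0.2, 2.0, 1.3]
probs = softmax(logits)
print(probs, predict_word(probs, vocab))

# The two probability vectors mentioned in the lecture.
print(predict_word([0.1, 0.6, 0.3], vocab))  # highest value is at index 1
print(predict_word([0.1, 0.3, 0.6], vocab))  # highest value is at index 2
```

The first lecture vector selects gracias and the second selects EOS, which is the case where decoding stops.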

413
00:23:57,000 --> 00:23:57,000
Right.

414
00:23:57,000 --> 00:24:00,000
But again during the back propagation I will propagate.

415
00:24:00,000 --> 00:24:03,000
I will update all the weights that is involved over here.

416
00:24:03,000 --> 00:24:06,000
And then we will again do the forward propagation.

417
00:24:06,000 --> 00:24:13,000
And we will make sure that this y minus y hat should be less right or it should be reducing right.

418
00:24:13,000 --> 00:24:15,000
So let's say I got gracias over here.

419
00:24:15,000 --> 00:24:19,000
So this is my start of the sentence I got.

420
00:24:19,000 --> 00:24:23,000
I did all the calculation inside this whatever happens in the LSTM layer.

421
00:24:23,000 --> 00:24:26,000
And finally I got the softmax and I got y hat.

422
00:24:26,000 --> 00:24:34,000
So in the next step what happens is that it will probably go to my next LSTM over here because I've

423
00:24:34,000 --> 00:24:36,000
got the right output, that is, gracias.

424
00:24:36,000 --> 00:24:42,000
And now with respect to this particular gracias, what I will do, I will pass this output back to

425
00:24:42,000 --> 00:24:44,000
my next LSTM.

426
00:24:44,000 --> 00:24:45,000
Right.

427
00:24:45,000 --> 00:24:49,000
And you know that this ht and ct will also be going over here.

428
00:24:49,000 --> 00:24:54,000
Right then along with this again, we will go ahead and pass this to our softmax.

429
00:24:55,000 --> 00:25:02,000
Now with respect to the softmax, if my training is going on well, I should be able to get my y hat

430
00:25:02,000 --> 00:25:05,000
over here with respect to the probability that I have.

431
00:25:05,000 --> 00:25:08,000
And here my output should be end of sentence.

432
00:25:08,000 --> 00:25:09,000
Right.

433
00:25:09,000 --> 00:25:13,000
Because after gracias, I don't have any other word.

434
00:25:13,000 --> 00:25:13,000
Right.

435
00:25:13,000 --> 00:25:19,000
But here my next input that will be going in will be nothing but the gracias vector.

436
00:25:19,000 --> 00:25:20,000
Okay.

437
00:25:20,000 --> 00:25:23,000
Whatever vectors is over here, that is zero, one, zero.

438
00:25:24,000 --> 00:25:24,000
Right?

439
00:25:25,000 --> 00:25:29,000
And finally, when I get end of the sentence, that basically means I have to stop training over there.
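
The step-by-step flow above (pass SOS, predict a word, feed that word into the next LSTM step, stop at end of sentence) can be sketched as a loop. This is only an illustration: `decoder_step` is a hypothetical stand-in with hard-coded probabilities, not a real LSTM, and the state is passed through unchanged. Many implementations feed the ground-truth word during training (teacher forcing); this sketch follows the lecture and feeds back the predicted word.

```python
# Toy target vocabulary; the ordering is an assumption.
vocab = ["SOS", "gracias", "EOS"]

# Hypothetical stand-in for one decoder LSTM step plus softmax.
# A trained model would compute these from the previous word and the
# hidden/cell state; here they are hard-coded for illustration.
FAKE_PROBS = {
    "SOS": [0.1, 0.6, 0.3],      # after SOS, "gracias" is most likely
    "gracias": [0.1, 0.3, 0.6],  # after "gracias", EOS is most likely
}


def decoder_step(prev_word, state):
    """Return (probabilities, new_state) for one time step."""
    return FAKE_PROBS[prev_word], state


def greedy_decode(context_vector, max_steps=10):
    """Start from SOS and keep feeding the predicted word back in."""
    word, state, output = "SOS", context_vector, []
    for _ in range(max_steps):
        probs, state = decoder_step(word, state)
        word = vocab[probs.index(max(probs))]  # highest probability wins
        if word == "EOS":                      # end of sentence: stop
            break
        output.append(word)
    return output


print(greedy_decode(context_vector=None))
```

The `max_steps` limit is a safety guard added here so the loop ends even if EOS is never predicted.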

440
00:25:29,000 --> 00:25:30,000
Okay.

441
00:25:30,000 --> 00:25:34,000
Now this is the best way of training, right?

442
00:25:34,000 --> 00:25:36,000
While I'm getting the exactly accurate value.

443
00:25:36,000 --> 00:25:40,000
But here you will be able to see that if I try to break this up.

444
00:25:40,000 --> 00:25:41,000
Right.

445
00:25:41,000 --> 00:25:44,000
With respect to the decoder, I will be having y y hat.

446
00:25:44,000 --> 00:25:45,000
Right?

447
00:25:45,000 --> 00:25:46,000
I will be having y hat.

448
00:25:46,000 --> 00:25:52,000
This y hat will again be happening with respect to the time step t equal to one, two, three.

449
00:25:52,000 --> 00:25:54,000
Right now with respect to t is equal to one.

450
00:25:54,000 --> 00:25:59,000
Let's say over here with respect to t is equal to one, t is equal to two, t is equal to three.

451
00:25:59,000 --> 00:26:01,000
We will be having our y truth value.

452
00:26:01,000 --> 00:26:03,000
We will be having our y hat.

453
00:26:03,000 --> 00:26:03,000
Right.

454
00:26:03,000 --> 00:26:05,000
There will be this three vectors.

455
00:26:05,000 --> 00:26:07,000
There will be this three vectors.

456
00:26:07,000 --> 00:26:07,000
Right.

457
00:26:07,000 --> 00:26:09,000
Again here will be three vectors.

458
00:26:09,000 --> 00:26:11,000
Here there will be three vectors.

459
00:26:11,000 --> 00:26:13,000
Again there will be three vectors.

460
00:26:13,000 --> 00:26:18,000
Here there will be three vectors right now during this particular scenario, after we do the forward

461
00:26:18,000 --> 00:26:19,000
propagation, what we do.

462
00:26:19,000 --> 00:26:21,000
We basically calculate the loss.

463
00:26:21,000 --> 00:26:24,000
And loss is nothing but y minus y hat whole square right.

464
00:26:24,000 --> 00:26:25,000
It can be whole square.

465
00:26:25,000 --> 00:26:27,000
Or we are just trying to find out the difference.

466
00:26:27,000 --> 00:26:30,000
And our main aim is to reduce this loss.
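
A short sketch comparing two ways of measuring the gap between y and y hat for the example above. The lecture describes a squared difference; with a softmax output, categorical cross-entropy is the loss more commonly used in practice, so both are shown. The function names are illustrative, and the numbers are the ones from the lecture example.

```python
import math

y_true = [0, 1, 0]        # one-hot vector for "gracias"
y_hat = [0.1, 0.6, 0.3]   # softmax output from the example


def squared_error(y, p):
    """Mean of (y - y_hat)^2 over the vocabulary entries."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def cross_entropy(y, p):
    """Negative log probability assigned to the true word."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)


print("squared error:", squared_error(y_true, y_hat))
print("cross-entropy:", cross_entropy(y_true, y_hat))

# A better prediction puts more probability on the true word,
# and both losses go down.
better = [0.1, 0.8, 0.1]
print(squared_error(y_true, better), cross_entropy(y_true, better))
```

Either way, the optimizer's job is the same: update the weights so that this number gets smaller on the next forward pass.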

467
00:26:31,000 --> 00:26:31,000
Right.

468
00:26:31,000 --> 00:26:36,000
So what we do in this particular case, once we find out the difference, we will try to calculate the

469
00:26:36,000 --> 00:26:36,000
average.

470
00:26:36,000 --> 00:26:40,000
And then what we will do is that we will try to reduce this particular loss.

471
00:26:40,000 --> 00:26:42,000
And when we are reducing this loss, how do we reduce it.

472
00:26:42,000 --> 00:26:47,000
We again have to use something called as optimizer right.

473
00:26:47,000 --> 00:26:53,000
Now when we are using this specific optimizer it is just going to update all these weights that needs

474
00:26:53,000 --> 00:26:55,000
to be updated inside all these things.

475
00:26:55,000 --> 00:26:58,000
Then in the back propagation all the weights will get updated.

476
00:26:58,000 --> 00:27:00,000
Then again we will do the forward propagation.

477
00:27:00,000 --> 00:27:07,000
But you understand one thing over here is that over here during the input we need to pass all the inputs

478
00:27:07,000 --> 00:27:14,000
first, then the output from the last layer or output from the last LSTM with respect to time.

479
00:27:14,000 --> 00:27:15,000
That context vector.

480
00:27:15,000 --> 00:27:17,000
Gets passed to the decoder.

481
00:27:17,000 --> 00:27:19,000
So this basically becomes the encoder.

482
00:27:19,000 --> 00:27:22,000
And this is my decoder right.

483
00:27:23,000 --> 00:27:25,000
So this way the entire training.

484
00:27:25,000 --> 00:27:27,000
Of the sentences happens.

485
00:27:27,000 --> 00:27:31,000
And that is what is basically shown over here we have the softmax activation function.

486
00:27:31,000 --> 00:27:34,000
Again we pass it to the same softmax activation function.

487
00:27:34,000 --> 00:27:36,000
And we are unrolling it right.

488
00:27:36,000 --> 00:27:39,000
If the loss is high then we have to again do the back propagation.

489
00:27:39,000 --> 00:27:44,000
Update all the weights in this LSTM, all the way back to the first stage of the encoder.

490
00:27:44,000 --> 00:27:49,000
And again we have to probably do the forward propagation and do the same thing again.

491
00:27:49,000 --> 00:27:52,000
Find out the softmax activation function, what values I'm actually getting.

492
00:27:52,000 --> 00:27:56,000
The most important thing is that how do we convert this into vectors okay.

493
00:27:56,000 --> 00:28:00,000
So if I really want to probably draw this particular diagram in a much more easier way.

494
00:28:00,000 --> 00:28:03,000
So this actually becomes my encoder.

495
00:28:03,000 --> 00:28:04,000
Okay.

496
00:28:05,000 --> 00:28:10,000
Then this is my decoder okay.

497
00:28:10,000 --> 00:28:17,000
And with respect to the encoder and decoder, I will be having my LSTM RNN unrolled with respect to

498
00:28:17,000 --> 00:28:17,000
time.

499
00:28:18,000 --> 00:28:25,000
Okay, now you know, in this LSTM I have two important lines that gets passed and this two important

500
00:28:25,000 --> 00:28:28,000
line will be responsible in creating my context vector.

501
00:28:28,000 --> 00:28:30,000
This is the inputs that I'm actually going to give.

502
00:28:30,000 --> 00:28:37,000
Here we are going to use an embedding layer which we have already discussed about embedding layer.

503
00:28:38,000 --> 00:28:39,000
Right.

504
00:28:39,000 --> 00:28:44,000
And with respect to this embedding layer I will be passing my X11, X1.

505
00:28:44,000 --> 00:28:45,000
Sorry.

506
00:28:45,000 --> 00:28:52,000
Here, to the embedding layer, I will be passing my X11, X12, X13.

507
00:28:52,000 --> 00:28:56,000
And let's say my X13 is end of sentence okay.

508
00:28:56,000 --> 00:28:59,000
Then whatever output that I get this is nothing.

509
00:28:59,000 --> 00:29:02,000
But this is my context vectors right?

510
00:29:02,000 --> 00:29:05,000
The same output is my context vector.

511
00:29:05,000 --> 00:29:10,000
Then here we again go ahead and create my LSTM for the conversion sake.

512
00:29:10,000 --> 00:29:15,000
Whatever we need to generate this will be passed towards this right.

513
00:29:15,000 --> 00:29:22,000
Along with this we will start over here with SOS, right? Here again we will be using another embedding

514
00:29:22,000 --> 00:29:23,000
layer.

515
00:29:23,000 --> 00:29:26,000
In this embedding layer I will be passing my SOS first.

516
00:29:26,000 --> 00:29:26,000
Okay.

517
00:29:26,000 --> 00:29:29,000
So here also first SOS needs to be passed.

518
00:29:29,000 --> 00:29:30,000
Okay.

519
00:29:30,000 --> 00:29:34,000
Then whatever output I get after performing my softmax.

520
00:29:34,000 --> 00:29:37,000
So let's say here I'm going to write my softmax.

521
00:29:37,000 --> 00:29:38,000
Right?

522
00:29:38,000 --> 00:29:43,000
The same thing will probably get passed to my input.

523
00:29:43,000 --> 00:29:49,000
Okay, so here I will just rub this because this input will be only till here.

524
00:29:49,000 --> 00:29:50,000
Here only I'll be using my embedding layer.

525
00:29:50,000 --> 00:29:53,000
But the same vectors will get passed over here.

526
00:29:53,000 --> 00:29:55,000
Then again I will be getting an output again.

527
00:29:55,000 --> 00:29:58,000
The same vectors will be passed over here and here.

528
00:29:58,000 --> 00:30:01,000
Finally we get our EOS, end of sentence.

529
00:30:02,000 --> 00:30:06,000
And this is how we probably go with the entire forward and backward propagation.

530
00:30:06,000 --> 00:30:09,000
This part is basically called as encoder.

531
00:30:09,000 --> 00:30:12,000
This part is specifically called as decoder.

532
00:30:12,000 --> 00:30:13,000
Right.

533
00:30:13,000 --> 00:30:17,000
So I hope you got an idea of encoder and decoder.

534
00:30:17,000 --> 00:30:19,000
Uh, what exactly it does.

535
00:30:19,000 --> 00:30:22,000
And usually we use an LSTM over here.

536
00:30:22,000 --> 00:30:23,000
Context vectors are nothing

537
00:30:23,000 --> 00:30:27,000
but arrays themselves with respect to the calculation.

538
00:30:27,000 --> 00:30:33,000
But whatever sentence we are probably passing right, it probably will give you the entire summary of

539
00:30:33,000 --> 00:30:36,000
this particular sentence or context of the sentence.

540
00:30:36,000 --> 00:30:39,000
Context of this sentence.

541
00:30:40,000 --> 00:30:41,000
Right.

542
00:30:41,000 --> 00:30:48,000
So, uh, I hope, uh, I know there are so many things written, but understand the context, but other

543
00:30:48,000 --> 00:30:51,000
than this, the forward and backward propagation will be almost same.

544
00:30:51,000 --> 00:30:53,000
So yes, this was it from my side.

545
00:30:54,000 --> 00:30:57,000
And here we have actually discussed about the encoder and decoder architecture.

546
00:30:57,000 --> 00:31:03,000
In the next video we will talk about uh the problems of encoder and decoder architecture.

547
00:31:03,000 --> 00:31:05,000
So yes this was it from my side.

548
00:31:05,000 --> 00:31:05,000
I'll see you in the next video.

549
00:31:05,000 --> 00:31:06,000
Thank you.