1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue our discussion on Transformers.

3
00:00:03,000 --> 00:00:07,000
And, uh, in this video we are going to discuss the architecture.

4
00:00:08,000 --> 00:00:12,000
Uh, on the right-hand side you can see this complex architecture.

5
00:00:12,000 --> 00:00:12,000
Okay.

6
00:00:12,000 --> 00:00:17,000
And obviously, uh, you may be thinking this is very, very difficult.

7
00:00:17,000 --> 00:00:22,000
Uh, but instead of directly going through this architecture, I

8
00:00:22,000 --> 00:00:25,000
will try to break down the entire architecture step by step.

9
00:00:25,000 --> 00:00:28,000
And as I said, what all things we are going to learn, right?

10
00:00:28,000 --> 00:00:33,000
That entire plan of action that I've actually created to understand the architecture of Transformers,

11
00:00:33,000 --> 00:00:36,000
then we'll be understanding self-attention, positional encoding.

12
00:00:36,000 --> 00:00:38,000
We'll go step by step with respect to this.

13
00:00:38,000 --> 00:00:43,000
To start with a basic transformer architecture, let's consider this particular block diagram.

14
00:00:43,000 --> 00:00:44,000
So this is my transformer.

15
00:00:45,000 --> 00:00:52,000
Uh, this transformer will be used to solve a sequence-to-sequence task, okay.

16
00:00:52,000 --> 00:00:59,000
And in this case, the task that I am probably considering will be, uh, language translation.

17
00:01:00,000 --> 00:01:01,000
Language translation.

18
00:01:01,000 --> 00:01:08,000
And I will be translating, let's say, from my English sentence to my French sentence.

19
00:01:08,000 --> 00:01:09,000
Okay.

20
00:01:10,000 --> 00:01:12,000
So I am planning to do this particular task.

21
00:01:12,000 --> 00:01:13,000
Okay.

22
00:01:13,000 --> 00:01:17,000
So here, when I say input, that basically means I'm going to give my English sentence.

23
00:01:17,000 --> 00:01:21,000
And this is going to be converted into a French sentence, okay.

24
00:01:22,000 --> 00:01:26,000
So this is what a transformer does just by seeing this block diagram.

25
00:01:26,000 --> 00:01:30,000
But the main question is what is inside this particular block diagram.

26
00:01:30,000 --> 00:01:33,000
So let's go ahead and first of all see this.

27
00:01:33,000 --> 00:01:35,000
What is inside this particular block diagram.

28
00:01:35,000 --> 00:01:40,000
So if you go down here inside this block diagram that is the transformer.

29
00:01:40,000 --> 00:01:43,000
You have something called encoders and decoders.

30
00:01:44,000 --> 00:01:50,000
So this also follows an encoder-decoder architecture.

31
00:01:51,000 --> 00:01:55,000
Decoder architecture okay.

32
00:01:55,000 --> 00:01:58,000
It follows this specific architecture.

33
00:01:58,000 --> 00:02:00,000
Now what is inside this encoder.

34
00:02:00,000 --> 00:02:02,000
When we say encoder, is there only one encoder?

35
00:02:02,000 --> 00:02:05,000
Or when we say decoder is there only one decoder?

36
00:02:05,000 --> 00:02:06,000
Nothing as such.

37
00:02:06,000 --> 00:02:11,000
Inside this encoder you may have multiple encoders like this okay.

38
00:02:11,000 --> 00:02:12,000
Like this.

39
00:02:12,000 --> 00:02:20,000
You may have multiple encoders and your text input that you are specifically giving over here in this

40
00:02:20,000 --> 00:02:27,000
encoder will be going from one encoder to the other encoders okay something like this.

41
00:02:27,000 --> 00:02:29,000
So this is my entire encoder over here.

42
00:02:29,000 --> 00:02:34,000
So this entire encoder I have actually drawn in this particular way so that you will be able to understand.

43
00:02:34,000 --> 00:02:35,000
Okay.

44
00:02:36,000 --> 00:02:42,000
So here I will be having multiple encoders step by step like in the research paper.

45
00:02:42,000 --> 00:02:44,000
That is, "Attention Is All You Need".

46
00:02:44,000 --> 00:02:46,000
Uh, let me just show you the research paper.

47
00:02:46,000 --> 00:02:50,000
So this is the research paper that we will be looking at.

48
00:02:50,000 --> 00:02:51,000
You know, "Attention Is All You Need".

49
00:02:51,000 --> 00:02:57,000
And this particular research paper, if you probably go ahead and explore it right here specifically,

50
00:02:57,000 --> 00:03:02,000
we are going to talk about this particular transformer where we'll be learning about positional encoding,

51
00:03:02,000 --> 00:03:06,000
self-attention, multi-head attention, feedforward, all these things.

52
00:03:06,000 --> 00:03:06,000
Right.

53
00:03:06,000 --> 00:03:11,000
So, uh, what is scaled dot-product attention, what is multi-head attention.

54
00:03:11,000 --> 00:03:16,000
So based on this particular research paper, I will be probably explaining each and everything.

55
00:03:16,000 --> 00:03:20,000
So as I said, uh, when we talk about an encoder.

56
00:03:20,000 --> 00:03:25,000
So here I will be having multiple encoders where my information will be passed from bottom to top completely.

57
00:03:25,000 --> 00:03:29,000
And then from this entire encoder, I will be passing all the information to the decoder.

58
00:03:29,000 --> 00:03:32,000
Similarly, the decoder will also have,

59
00:03:32,000 --> 00:03:34,000
let's say, not just one block.

60
00:03:34,000 --> 00:03:36,000
It may have multiple decoders.

61
00:03:37,000 --> 00:03:37,000
Okay.

62
00:03:37,000 --> 00:03:37,000
Like this.

63
00:03:37,000 --> 00:03:41,000
It may have multiple decoders one by one okay.

64
00:03:41,000 --> 00:03:42,000
What is inside this decoder?

65
00:03:42,000 --> 00:03:44,000
We will get to know in some time.

66
00:03:44,000 --> 00:03:44,000
Okay.

67
00:03:44,000 --> 00:03:48,000
And one by one we'll be passing this decoder information up to the top.

68
00:03:48,000 --> 00:03:52,000
And finally we should be able to get our output okay.

69
00:03:52,000 --> 00:03:54,000
So this is what exactly it is.

70
00:03:54,000 --> 00:03:55,000
Okay.

71
00:03:55,000 --> 00:04:00,000
And you'll be able to see that from the encoder we'll be passing the information to the decoder also.

72
00:04:00,000 --> 00:04:00,000
Okay.

73
00:04:00,000 --> 00:04:06,000
And this entire uh uh architecture, when I say right in uh, in the research paper it is shown that

74
00:04:06,000 --> 00:04:08,000
we are going to use six encoders.

75
00:04:09,000 --> 00:04:09,000
Okay.

76
00:04:09,000 --> 00:04:11,000
Six encoders.

77
00:04:11,000 --> 00:04:13,000
And correspondingly there will be six decoders.

78
00:04:13,000 --> 00:04:14,000
Okay.

79
00:04:14,000 --> 00:04:20,000
So what is there in the research paper also, uh, it's not like it needs to be six, but in the research

80
00:04:20,000 --> 00:04:25,000
paper when they created the research they have used six encoders and six decoders to do this entire

81
00:04:25,000 --> 00:04:26,000
task.
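As a rough sketch of the stacking just described, the flow could look like this in Python. The function names are hypothetical and the layers are placeholders that only record the path taken; this is not the paper's code.
```python
# Hypothetical sketch: placeholder layers that only record which blocks the data went through.
NUM_LAYERS = 6  # the count used in the paper; it does not have to be six

def encoder_layer(i, x):
    # Stand-in for one encoder block (self-attention + feed-forward).
    return x + [f"enc{i}"]

def decoder_layer(i, y, memory):
    # Stand-in for one decoder block; it also receives the encoder stack's final output.
    return y + [f"dec{i}(mem={memory[-1]})"]

def transformer(src, tgt):
    memory = src
    for i in range(1, NUM_LAYERS + 1):
        memory = encoder_layer(i, memory)    # bottom to top through the encoders
    out = tgt
    for i in range(1, NUM_LAYERS + 1):
        out = decoder_layer(i, out, memory)  # bottom to top through the decoders
    return memory, out

memory, out = transformer(["how are you"], ["<start>"])
print(memory)
print(out)
```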

82
00:04:26,000 --> 00:04:26,000
Okay.

83
00:04:27,000 --> 00:04:30,000
Now here in the encoder I will be giving "How are you?"

84
00:04:30,000 --> 00:04:36,000
And then the decoder should be able to give me the French translation of this English sentence.

85
00:04:36,000 --> 00:04:37,000
Okay.

86
00:04:37,000 --> 00:04:41,000
Now the question arises what is inside this encoder.

87
00:04:41,000 --> 00:04:43,000
We should definitely understand.

88
00:04:43,000 --> 00:04:51,000
So inside this encoder, if I go to my next block diagram here, you will be able to see in a single

89
00:04:51,000 --> 00:04:53,000
one encoder in a single encoder.

90
00:04:53,000 --> 00:05:00,000
So inside this particular encoder, you'll be seeing that there will be one self-attention layer and

91
00:05:00,000 --> 00:05:03,000
there will be one feed forward neural network layer.

92
00:05:03,000 --> 00:05:04,000
Okay.

93
00:05:04,000 --> 00:05:08,000
Again, try to understand this basic level of understanding.

94
00:05:08,000 --> 00:05:10,000
I have drawn this particular diagram in a very simple way.

95
00:05:10,000 --> 00:05:11,000
So there will be one.

96
00:05:11,000 --> 00:05:15,000
Self-attention layer. We'll be understanding what exactly self-attention is.

97
00:05:15,000 --> 00:05:16,000
We'll try to see.

98
00:05:16,000 --> 00:05:21,000
Like what is the exact working of this as we go ahead. Then this self-attention is connected

99
00:05:21,000 --> 00:05:23,000
to the feed forward neural network.

100
00:05:23,000 --> 00:05:27,000
And if I consider just a single decoder, right.

101
00:05:27,000 --> 00:05:31,000
So inside a single decoder here you'll be able to see that I will be having a self-attention.

102
00:05:31,000 --> 00:05:35,000
Also I'll be having a feed-forward neural network.

103
00:05:35,000 --> 00:05:42,000
But along with this I also have an additional layer, which is called encoder-decoder attention.

104
00:05:42,000 --> 00:05:45,000
Okay, so we will be understanding this entirely.

105
00:05:45,000 --> 00:05:46,000
Don't worry.

106
00:05:46,000 --> 00:05:50,000
But we just need to understand this basic architecture what exactly it is.

107
00:05:51,000 --> 00:05:57,000
Now let's go ahead and let's try to see what this self-attention actually does.

108
00:05:57,000 --> 00:05:59,000
And that is what we are going to discuss in this particular video.

109
00:05:59,000 --> 00:06:00,000
Okay.

110
00:06:00,000 --> 00:06:03,000
Why this self-attention is basically used okay.

111
00:06:04,000 --> 00:06:10,000
Now with respect to this self-attention you will be able to see that if I go down and if I still explore

112
00:06:10,000 --> 00:06:12,000
more about the encoder.

113
00:06:12,000 --> 00:06:14,000
So here this is my encoder okay.

114
00:06:14,000 --> 00:06:16,000
This is my entire encoder.

115
00:06:16,000 --> 00:06:20,000
So as I said in my encoder I have a self-attention.

116
00:06:20,000 --> 00:06:23,000
And then I have a feed forward neural network.

117
00:06:23,000 --> 00:06:26,000
Now let's say that I have a sentence over here.

118
00:06:26,000 --> 00:06:28,000
How are you?

119
00:06:28,000 --> 00:06:32,000
Okay, so this is my this is my sentence that I'm actually giving it over here.

120
00:06:32,000 --> 00:06:34,000
How are you? Then,

121
00:06:34,000 --> 00:06:35,000
This self attention.

122
00:06:36,000 --> 00:06:40,000
First of all, we need to convert all these words into vectors.

123
00:06:40,000 --> 00:06:40,000
Right.

124
00:06:40,000 --> 00:06:42,000
So these are my vectors.

125
00:06:42,000 --> 00:06:43,000
This is basically getting converted.

126
00:06:43,000 --> 00:06:46,000
And here we can use vector embeddings.

127
00:06:46,000 --> 00:06:50,000
Or we can use an embedding layer whatever layer we specifically want.

128
00:06:50,000 --> 00:06:50,000
Right.

129
00:06:50,000 --> 00:06:57,000
Let's say here we have an embedding layer.

130
00:06:58,000 --> 00:07:02,000
And this embedding layer is responsible for converting words into vectors.

131
00:07:02,000 --> 00:07:05,000
So each and every word is basically converted into a vector over here.

132
00:07:05,000 --> 00:07:14,000
Now as I pass these word vectors to the self-attention layer, it will convert each vector into

133
00:07:14,000 --> 00:07:16,000
a different vector.

134
00:07:16,000 --> 00:07:22,000
And this vector is basically called a contextual vector.

135
00:07:23,000 --> 00:07:23,000
Okay.

136
00:07:25,000 --> 00:07:32,000
So let's understand the differences between this vector versus the contextual vector.

137
00:07:32,000 --> 00:07:38,000
This is really important to understand, okay. We'll try to understand this in this coming

138
00:07:38,000 --> 00:07:39,000
section itself.

139
00:07:39,000 --> 00:07:39,000
Right.

140
00:07:39,000 --> 00:07:43,000
So here the self-attention is doing a very simple task.

141
00:07:43,000 --> 00:07:48,000
It is taking the vectors of the words, and it is converting each vector into another vector, like

142
00:07:48,000 --> 00:07:49,000
z one, z two, z three.

143
00:07:49,000 --> 00:07:53,000
And this vector is nothing but what is called a contextual vector.

144
00:07:54,000 --> 00:07:55,000
There are.

145
00:07:55,000 --> 00:07:57,000
Why do we say this as contextual vector?

146
00:07:57,000 --> 00:08:00,000
Because these vectors are not the same as the original vectors.

147
00:08:00,000 --> 00:08:07,000
Instead, you will see that these vectors take into account the context of the other words.

148
00:08:07,000 --> 00:08:10,000
Each one will consider the context of the other words also.

149
00:08:10,000 --> 00:08:11,000
Okay.

150
00:08:11,000 --> 00:08:16,000
And that is the reason we will be getting different vectors over here, which we specifically

151
00:08:16,000 --> 00:08:18,000
call contextual vectors.
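To make the fixed-versus-contextual idea concrete, here is a simplified sketch in Python. The 2-d embeddings are made up, and there are no learned query, key, or value weights and no scaling, so treat it as an illustration of the idea only, not the full self-attention from the paper.
```python
import math

# Made-up 2-d embeddings for illustration; a real embedding layer would learn these.
embeddings = {"how": [1.0, 0.0], "are": [0.0, 1.0], "you": [1.0, 1.0]}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_vectors(vectors):
    # Simplified self-attention: each output is a weighted average of ALL input vectors.
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]  # dot-product similarity
        weights = softmax(scores)
        z = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(q))]
        out.append(z)
    return out

words = ["how", "are", "you"]
x = [embeddings[w] for w in words]   # fixed vectors, no context
z = contextual_vectors(x)            # z1, z2, z3: each one mixes in the other words
for word, xi, zi in zip(words, x, z):
    print(word, xi, "->", [round(v, 3) for v in zi])
```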

152
00:08:18,000 --> 00:08:18,000
Okay.

153
00:08:19,000 --> 00:08:21,000
So I hope you got an idea about this.

154
00:08:21,000 --> 00:08:22,000
Okay.

155
00:08:22,000 --> 00:08:26,000
And once we get this contextual vector we send it to the feed forward neural network.

156
00:08:26,000 --> 00:08:32,000
So this feed forward neural network here you can see over here I will be passing this particular vector

157
00:08:32,000 --> 00:08:32,000
over here.

158
00:08:33,000 --> 00:08:34,000
here, here.

159
00:08:34,000 --> 00:08:35,000
Okay.

160
00:08:35,000 --> 00:08:40,000
And each and every vector will be passed to this particular feed forward neural network.
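A small sketch of that step in Python, assuming a two-layer network with a ReLU in between. The weights and the input vectors are made up for illustration; a real layer learns its weights during training.
```python
# Made-up weights for illustration only.
W1 = [[1.0, -1.0], [0.5, 0.5]]   # first layer: 2 inputs -> 2 hidden units
W2 = [[1.0, 0.0], [0.0, 1.0]]    # second layer: 2 hidden units -> 2 outputs

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def feed_forward(v):
    hidden = [max(0.0, h) for h in matvec(W1, v)]   # ReLU activation
    return matvec(W2, hidden)

# Hypothetical contextual vectors z1, z2, z3 coming out of self-attention.
z = [[0.8, 0.4], [0.3, 0.7], [0.6, 0.6]]
outputs = [feed_forward(zi) for zi in z]   # the same network is applied to each vector separately
print(outputs)
```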

161
00:08:40,000 --> 00:08:49,000
And I should be able to get my output, let's say this specific output that I am, I am getting right.

162
00:08:49,000 --> 00:08:52,000
I will be considering it as another set of vectors.

163
00:08:53,000 --> 00:08:56,000
We will discuss why exactly this is required.

164
00:08:57,000 --> 00:09:02,000
And then this will be sent to my next encoder.

165
00:09:04,000 --> 00:09:07,000
So I will just try to draw this.

166
00:09:09,000 --> 00:09:12,000
Or I will just try to draw in this way.

167
00:09:12,000 --> 00:09:16,000
So this will be sent to my encoder two.

168
00:09:17,000 --> 00:09:19,000
So this is my encoder two.

169
00:09:19,000 --> 00:09:22,000
This is my encoder one okay.

170
00:09:22,000 --> 00:09:24,000
And here it will

171
00:09:24,000 --> 00:09:25,000
be sent to my encoder two.

172
00:09:25,000 --> 00:09:29,000
Then again the encoder two will be having a self-attention.

173
00:09:29,000 --> 00:09:30,000
Okay.

174
00:09:30,000 --> 00:09:35,000
So here it will specifically have this self-attention again.

175
00:09:35,000 --> 00:09:39,000
And along with the self-attention it will also have a feed-forward neural network.

176
00:09:40,000 --> 00:09:42,000
So it will still keep on continuing.

177
00:09:42,000 --> 00:09:48,000
So I will not draw this diagram, because you know what will be available in my encoder two.

178
00:09:48,000 --> 00:09:48,000
Okay.

179
00:09:49,000 --> 00:09:50,000
Perfect.

180
00:09:50,000 --> 00:09:52,000
So this will be passed over here.

181
00:09:53,000 --> 00:09:56,000
So here you can see that I will rub this.

182
00:09:58,000 --> 00:10:01,000
And this information is going to pass over here.

183
00:10:03,000 --> 00:10:04,000
Okay.

184
00:10:04,000 --> 00:10:06,000
And this is my neural network.

185
00:10:06,000 --> 00:10:09,000
This is my feed forward neural network.

186
00:10:09,000 --> 00:10:10,000
Network.

187
00:10:11,000 --> 00:10:12,000
Neural network.

188
00:10:12,000 --> 00:10:13,000
Okay.

189
00:10:13,000 --> 00:10:19,000
And here you can see all this information is basically passed to this is passed to this okay.

190
00:10:20,000 --> 00:10:26,000
So this is the entire basic architecture about the encoder itself.

191
00:10:26,000 --> 00:10:28,000
So let me just go ahead and repeat it.

192
00:10:28,000 --> 00:10:30,000
So initially I had a transformer.

193
00:10:30,000 --> 00:10:32,000
The transformer is nothing but it is a combination.

194
00:10:32,000 --> 00:10:34,000
It has this encoder-decoder architecture.

195
00:10:34,000 --> 00:10:36,000
So here I will be having my encoder.

196
00:10:36,000 --> 00:10:41,000
Here I will be having my decoder now inside my encoder like this encoder.

197
00:10:41,000 --> 00:10:44,000
It can have multiple encoders like this: one, two, three, four, five.

198
00:10:44,000 --> 00:10:48,000
And even in the research paper they have used six encoders.

199
00:10:48,000 --> 00:10:52,000
Similarly in the decoder you will be seeing that I'll be having six decoders over here which are stacked

200
00:10:52,000 --> 00:10:53,000
one by one.

201
00:10:53,000 --> 00:10:53,000
Okay.

202
00:10:53,000 --> 00:10:57,000
Then what happens inside this encoder that we have.

203
00:10:57,000 --> 00:11:00,000
It has two important layers.

204
00:11:00,000 --> 00:11:01,000
One is the self-attention layer.

205
00:11:01,000 --> 00:11:03,000
And then you have the feed forward neural network layer.

206
00:11:04,000 --> 00:11:05,000
The self-attention layer.

207
00:11:06,000 --> 00:11:12,000
You know, we don't use LSTM or RNN like how we used to in the earlier encoder-decoder architecture.

208
00:11:12,000 --> 00:11:15,000
Instead we have this self-attention layer. In the decoder

209
00:11:15,000 --> 00:11:18,000
also, you'll be able to see that I will be having self-attention and feed-forward.

210
00:11:18,000 --> 00:11:22,000
But between them there'll be something called encoder-decoder attention.

211
00:11:22,000 --> 00:11:29,000
Okay, then if I go forward and try to understand what exactly the self-attention does, it takes a

212
00:11:29,000 --> 00:11:35,000
fixed vector of a word, right, and then it converts this vector into a contextual vector.

213
00:11:35,000 --> 00:11:36,000
The contextual vector is very simple.

214
00:11:36,000 --> 00:11:43,000
Why are we changing it? Just to give an idea: a contextual vector means that all these vectors

215
00:11:43,000 --> 00:11:47,000
that are getting converted over here, whatever vectors we are getting,

216
00:11:47,000 --> 00:11:53,000
will be defined in such a way that each one is based on the context of the

217
00:11:53,000 --> 00:11:54,000
other words.

218
00:11:54,000 --> 00:11:54,000
Okay.

219
00:11:55,000 --> 00:11:56,000
Here.

220
00:11:56,000 --> 00:12:00,000
The difference between the input vectors and these contextual vectors is that the input vectors don't have any context.

221
00:12:00,000 --> 00:12:02,000
Right here they are fixed vectors.

222
00:12:02,000 --> 00:12:08,000
If you are using a word embedding layer, however the word "how" is defined as a vector, that is what it will

223
00:12:08,000 --> 00:12:09,000
get converted to.

224
00:12:09,000 --> 00:12:09,000
Right.

225
00:12:09,000 --> 00:12:13,000
However "are" is defined, that is what it will get converted to; however "you" is defined, that is what it will get converted to.

226
00:12:13,000 --> 00:12:18,000
But by using self-attention, the vector that I get for "how" will be a vector

227
00:12:18,000 --> 00:12:21,000
that considers the context of the other two words.

228
00:12:22,000 --> 00:12:22,000
Okay.

229
00:12:22,000 --> 00:12:27,000
Similarly, this particular vector that I'm actually getting it will be based on the context of other

230
00:12:27,000 --> 00:12:27,000
words.

231
00:12:27,000 --> 00:12:28,000
"How" and "you".

232
00:12:28,000 --> 00:12:30,000
Similarly, the third word right.

233
00:12:30,000 --> 00:12:32,000
And that is nothing but my contextual vector.

234
00:12:32,000 --> 00:12:34,000
Now why do we require contextual vector.

235
00:12:34,000 --> 00:12:40,000
As I said, whenever you have a longer sentence, it definitely helps.

236
00:12:40,000 --> 00:12:44,000
It helps you to get a model that gives you better accuracy.

237
00:12:44,000 --> 00:12:45,000
Okay.

238
00:12:45,000 --> 00:12:48,000
And that is the reason we are using this, right?

239
00:12:48,000 --> 00:12:53,000
And the next important thing, as I said when we were discussing in encoder decoder, we cannot pass

240
00:12:53,000 --> 00:12:55,000
the words in parallel, right?

241
00:12:55,000 --> 00:12:58,000
One by one we pass it, then we generate the context.

242
00:12:58,000 --> 00:13:07,000
But in this case, since we are using the self-attention layer here, all the words will

243
00:13:07,000 --> 00:13:08,000
be passed.

244
00:13:11,000 --> 00:13:14,000
Passed in parallel.

245
00:13:15,000 --> 00:13:16,000
Okay.

246
00:13:16,000 --> 00:13:17,000
All the words.

247
00:13:17,000 --> 00:13:18,000
So at a time we are passing it.

248
00:13:18,000 --> 00:13:21,000
And this makes this entire architecture scalable.
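One way to see why the words can be passed in parallel is the sketch below, in Python. The attend and rnn_style functions and the toy vectors are made up for illustration: each attention output depends only on the inputs, so the positions can be computed in any order, while the recurrent-style loop needs the previous state at every step.
```python
import math

def attend(vectors, i):
    # The output for position i depends only on the inputs, never on another position's output.
    q = vectors[i]
    exps = [math.exp(sum(a * b for a, b in zip(q, k))) for k in vectors]
    total = sum(exps)
    return [sum(e / total * v[d] for e, v in zip(exps, vectors)) for d in range(len(q))]

def rnn_style(vectors):
    # Each step needs the previous state, so the words must be processed one by one.
    state, states = 0.0, []
    for v in vectors:
        state = 0.5 * state + sum(v)
        states.append(state)
    return states

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # toy vectors for "how", "are", "you"
in_order = [attend(x, i) for i in range(len(x))]
any_order = {i: attend(x, i) for i in (2, 0, 1)}      # shuffled order, same results
print(all(in_order[i] == any_order[i] for i in range(len(x))))
print(rnn_style(x))
```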

249
00:13:22,000 --> 00:13:30,000
Now my next important discussion in this particular video will be that let's discuss more about self-attention.

250
00:13:30,000 --> 00:13:31,000
Okay.

251
00:13:31,000 --> 00:13:37,000
Now what exactly self-attention does, we will try to discuss this at a higher level.

252
00:13:38,000 --> 00:13:44,000
Self-attention at a higher level.

253
00:13:45,000 --> 00:13:48,000
And that is what we are going to discuss in our next video.

254
00:13:48,000 --> 00:13:50,000
But I hope you got an idea.

255
00:13:50,000 --> 00:13:53,000
With respect to the architecture, what exactly it is.

256
00:13:53,000 --> 00:13:58,000
Uh, the first thing that we discussed is how each word vector is getting converted

257
00:13:58,000 --> 00:14:03,000
into a contextual vector, and why contextual vector is required because it gives a proper context of

258
00:14:03,000 --> 00:14:05,000
the words that I'm actually passing.

259
00:14:05,000 --> 00:14:08,000
And all the words are basically passed in parallel, right?

260
00:14:08,000 --> 00:14:09,000
So yes, this was it.

261
00:14:09,000 --> 00:14:12,000
I will see you all in the next video.

262
00:14:12,000 --> 00:14:12,000
Thank you.