1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue our discussion with respect to the Transformers.

3
00:00:03,000 --> 00:00:07,000
So this is the first topic that we are going to pick up and we are going to understand.

4
00:00:07,000 --> 00:00:10,000
So here I've also given the definition.

5
00:00:10,000 --> 00:00:14,000
So that is the reason I've written what and why in Transformers.

6
00:00:14,000 --> 00:00:18,000
So if you really just want to understand a simple definition about this particular architecture, it

7
00:00:18,000 --> 00:00:18,000
is nothing but this.

8
00:00:18,000 --> 00:00:25,000
Transformers in natural language processing are a type of deep learning model that uses a self-attention

9
00:00:25,000 --> 00:00:26,000
mechanism.

10
00:00:26,000 --> 00:00:32,000
Please keep a note of this, because it is really important. It is used to analyze and process natural language

11
00:00:32,000 --> 00:00:33,000
data.

12
00:00:33,000 --> 00:00:40,000
They are encoder-decoder models that can be used for many applications, including machine translation.

13
00:00:40,000 --> 00:00:45,000
Now, if I talk about machine translation, this is specifically something called a sequence

14
00:00:45,000 --> 00:00:47,000
to sequence task.

15
00:00:47,000 --> 00:00:53,000
Right. Now, when we say sequence-to-sequence tasks, let's say I want to solve a problem statement.

16
00:00:53,000 --> 00:00:59,000
One of the examples that we can take is language translation, right?

17
00:00:59,000 --> 00:01:01,000
Language translation.

18
00:01:01,000 --> 00:01:07,000
And if you probably go ahead and see an example of Google Translate, right.

19
00:01:07,000 --> 00:01:13,000
This is a sequence-to-sequence task.

20
00:01:13,000 --> 00:01:14,000
It is a sequence to sequence task.

21
00:01:14,000 --> 00:01:21,000
Because if I really want to convert from English to French, right, English to French.

22
00:01:21,000 --> 00:01:27,000
So in this scenario you will be able to see that my input, with respect to English, will be many

23
00:01:27,000 --> 00:01:27,000
words.

24
00:01:27,000 --> 00:01:31,000
I will be having many inputs over here, it will be many.

25
00:01:31,000 --> 00:01:35,000
And my output will also be consisting of many words.

26
00:01:35,000 --> 00:01:40,000
So this basically becomes a many-to-many sequence-to-sequence task, right?

27
00:01:40,000 --> 00:01:48,000
Now, when we have a sequence-to-sequence task, obviously the length of the sentences is also a meaningful

28
00:01:48,000 --> 00:01:48,000
thing over here.

29
00:01:48,000 --> 00:01:49,000
Right.

30
00:01:49,000 --> 00:01:50,000
Length of the sentence.

31
00:01:51,000 --> 00:01:57,000
Now, as the length of the sentences increases, we should be able to solve this particular problem

32
00:01:57,000 --> 00:01:59,000
again with a very good accuracy.

33
00:01:59,000 --> 00:02:03,000
So in this kind of task, transformers are usually used.

34
00:02:03,000 --> 00:02:03,000
Okay.

35
00:02:03,000 --> 00:02:05,000
But you may be thinking, Krish.

36
00:02:05,000 --> 00:02:06,000
Fine.

37
00:02:06,000 --> 00:02:13,000
You have also told us that, in the previous architecture of encoder-decoder, you

38
00:02:13,000 --> 00:02:15,000
told us that this is also used for sequence-to-sequence tasks.

39
00:02:16,000 --> 00:02:21,000
And if I talk about the encoder and decoder, if I just go ahead with this

40
00:02:21,000 --> 00:02:22,000
particular diagram.

41
00:02:22,000 --> 00:02:24,000
So I had an encoder over here.

42
00:02:25,000 --> 00:02:26,000
Right.

43
00:02:26,000 --> 00:02:27,000
And I had a decoder over here.

44
00:02:27,000 --> 00:02:28,000
Right.

45
00:02:28,000 --> 00:02:34,000
So in the encoder and decoder, what we used to do: we had our LSTM right over here.

46
00:02:35,000 --> 00:02:41,000
And with this LSTM, we used to go ahead and pass the entire sentence, okay.

47
00:02:42,000 --> 00:02:45,000
Please remember we do not take the output in this.

48
00:02:45,000 --> 00:02:46,000
Right.

49
00:02:46,000 --> 00:02:53,000
So if I just go ahead with a basic architecture of encoder-decoder, we'll be able to see that

50
00:02:53,000 --> 00:02:57,000
if I used to give my words over here, let's say x1, x2, x3.

51
00:02:57,000 --> 00:03:02,000
And these words we used to give based on time steps: t is equal to one, t is equal to two, t is equal

52
00:03:02,000 --> 00:03:03,000
to three.

53
00:03:03,000 --> 00:03:07,000
So let's say this is my sentence one, which has these particular words.

54
00:03:07,000 --> 00:03:12,000
And again, we used to convert all these words by using some embedding layer, okay.

55
00:03:12,000 --> 00:03:14,000
Convert them into vectors, okay.

56
00:03:14,000 --> 00:03:18,000
And then pass them to the LSTM.

57
00:03:18,000 --> 00:03:20,000
So this was my LSTM over here.

58
00:03:20,000 --> 00:03:20,000
Right.

59
00:03:20,000 --> 00:03:23,000
And this is my encoder and this is my decoder.

60
00:03:25,000 --> 00:03:25,000
Right.

61
00:03:25,000 --> 00:03:32,000
So after doing this, finally you could see that we were able to generate one vector C, that is, the context

62
00:03:32,000 --> 00:03:33,000
vector right.

63
00:03:33,000 --> 00:03:36,000
So here I will just go ahead and generate this context vector.

64
00:03:36,000 --> 00:03:41,000
And this context vector we used to provide to our next layer, the decoder.

65
00:03:41,000 --> 00:03:41,000
Right.

66
00:03:42,000 --> 00:03:45,000
The decoder layer, which also used to have this LSTM.

67
00:03:45,000 --> 00:03:50,000
Okay, so here is my decoder completely I will pass it over here.

68
00:03:50,000 --> 00:03:51,000
I'll pass it over here.

69
00:03:52,000 --> 00:03:56,000
And then, with respect to this particular decoder, I used to also get my output.

70
00:03:56,000 --> 00:03:58,000
And here we used to use some kind of softmax.

71
00:03:58,000 --> 00:03:59,000
Right.

72
00:03:59,000 --> 00:04:03,000
Now this was the basic architecture of the encoder decoder.

73
00:04:03,000 --> 00:04:03,000
Right.

74
00:04:03,000 --> 00:04:05,000
And this is what we used to do.

75
00:04:05,000 --> 00:04:11,000
This context used to get built as we passed every word based on the time step, and the final

76
00:04:11,000 --> 00:04:16,000
context that we used to get at the end of this particular sentence used to represent this entire sentence

77
00:04:16,000 --> 00:04:17,000
itself, right?

78
00:04:17,000 --> 00:04:19,000
Which consists of all these words.

79
00:04:19,000 --> 00:04:22,000
And this context was further sent to our decoder model.

80
00:04:22,000 --> 00:04:27,000
And this decoder model used to do the prediction based on this context that we have.

81
00:04:27,000 --> 00:04:30,000
And along with that, we used to also calculate the loss.

82
00:04:30,000 --> 00:04:32,000
Everything was discussed, right, based on the encoder-decoder architecture.

83
00:04:33,000 --> 00:04:39,000
But one problem we understood from the encoder decoder was that this context was not sufficient to represent

84
00:04:39,000 --> 00:04:40,000
this entire sentence.

85
00:04:40,000 --> 00:04:46,000
If the sentence length increases, if

86
00:04:46,000 --> 00:04:47,000
this increases,

87
00:04:47,000 --> 00:04:51,000
the C context was not at all sufficient, right?

88
00:04:51,000 --> 00:04:54,000
And this was the problem with respect to the encoder decoder architecture.

89
00:04:54,000 --> 00:05:02,000
And that is the reason you could see that our BLEU score was decreasing as the length of the sentences

90
00:05:02,000 --> 00:05:04,000
was increasing.

91
00:05:04,000 --> 00:05:07,000
So this I have covered in my previous video.

92
00:05:07,000 --> 00:05:11,000
I would suggest if you want to quickly revise it, you can go ahead and do the revision.

93
00:05:11,000 --> 00:05:13,000
Okay, now for this.

94
00:05:13,000 --> 00:05:15,000
We also had seen an architecture.

95
00:05:15,000 --> 00:05:17,000
So here was my sequence to sequence learning.

96
00:05:17,000 --> 00:05:20,000
This was the entire research paper of this.

97
00:05:20,000 --> 00:05:23,000
You know how the entire working actually happens.

98
00:05:23,000 --> 00:05:25,000
Each and every thing was explained clearly over here.

99
00:05:25,000 --> 00:05:25,000
Right.

100
00:05:26,000 --> 00:05:33,000
So after this, what we did is that in order to solve this problem of the context that we had, like

101
00:05:33,000 --> 00:05:38,000
our main idea was that, okay, we should not create the entire context at once and probably send it

102
00:05:38,000 --> 00:05:40,000
to the decoder model.

103
00:05:40,000 --> 00:05:45,000
Instead, what we can actually do is use something called an attention mechanism, and that is where

104
00:05:45,000 --> 00:05:47,000
this research paper came.

105
00:05:47,000 --> 00:05:47,000
Right.

106
00:05:47,000 --> 00:05:54,000
So based on this research paper, what we exactly did was this: the plan was very simple.

107
00:05:54,000 --> 00:06:00,000
Instead of just giving a single context, we have to actually provide an additional context to our decoder.

108
00:06:00,000 --> 00:06:03,000
And this was the architecture we specifically discussed.

109
00:06:03,000 --> 00:06:10,000
And if you remember in our previous videos, we have discussed the entire mechanism of how this attention

110
00:06:10,000 --> 00:06:11,000
mechanism actually works.

111
00:06:11,000 --> 00:06:11,000
Right?

112
00:06:11,000 --> 00:06:14,000
And we discussed this entire research paper over here.

113
00:06:14,000 --> 00:06:20,000
So here was the entire working, wherein along with a single context, we also have to provide additional

114
00:06:20,000 --> 00:06:26,000
context, create our alignment scores, create our attention weights and then pass it to the decoder.

115
00:06:26,000 --> 00:06:26,000
Okay.

116
00:06:26,000 --> 00:06:32,000
And that is what we specifically solved with the help of the attention mechanism.

117
00:06:32,000 --> 00:06:33,000
So let me do one thing.

118
00:06:33,000 --> 00:06:36,000
Let me just take this screenshot okay.

119
00:06:36,000 --> 00:06:40,000
So that you will be able to see this reference over here.

120
00:06:40,000 --> 00:06:42,000
So here I will just put this reference.

121
00:06:42,000 --> 00:06:45,000
And here also I have added this additional reference.

122
00:06:45,000 --> 00:06:52,000
So this was all possible because of the attention mechanism.

123
00:06:52,000 --> 00:06:53,000
Right.

124
00:06:53,000 --> 00:06:56,000
So this is the entire working, right?

125
00:06:56,000 --> 00:07:01,000
And with the help of this, what we were able to do is that we were able to provide additional context,

126
00:07:02,000 --> 00:07:06,000
additional context to the decoders.

127
00:07:07,000 --> 00:07:12,000
And then these decoders were able to do the prediction.

128
00:07:12,000 --> 00:07:16,000
And the problem with respect to the long sentences.

129
00:07:17,000 --> 00:07:18,000
right.

130
00:07:18,000 --> 00:07:24,000
The problem of accuracy that we were facing was reduced; accuracy started increasing, right, because of this research

131
00:07:24,000 --> 00:07:24,000
paper.

132
00:07:25,000 --> 00:07:29,000
But still, let's understand one thing about this attention mechanism.

133
00:07:29,000 --> 00:07:29,000
Okay.

134
00:07:29,000 --> 00:07:33,000
So here we are using a bidirectional LSTM RNN.

135
00:07:33,000 --> 00:07:35,000
Here also we are using LSTM itself.

136
00:07:35,000 --> 00:07:43,000
One problem that we see with this attention mechanism or encoder decoder okay, attention mechanism

137
00:07:44,000 --> 00:07:50,000
or encoder-decoder is that we pass every word based on time step, right.

138
00:07:50,000 --> 00:08:04,000
So here we cannot send all the words in a sentence in parallel.

139
00:08:06,000 --> 00:08:06,000
right?

140
00:08:07,000 --> 00:08:13,000
And again, because of this C over here, you'll be able to see that even in encoder

141
00:08:13,000 --> 00:08:16,000
decoder, we used to send the words based on time step.

142
00:08:17,000 --> 00:08:21,000
Right. Now, when we are sending the words based on time step, at t is equal to one I'm sending one word, at t

143
00:08:21,000 --> 00:08:21,000
is equal to two

144
00:08:21,000 --> 00:08:23,000
I'm sending another word, right.

145
00:08:23,000 --> 00:08:25,000
Similarly, over here in this bidirectional LSTM

146
00:08:25,000 --> 00:08:28,000
also, I'm sending one word at each time step.

147
00:08:28,000 --> 00:08:35,000
We won't be able to do the entire execution, or this entire training, in parallel, right?

148
00:08:35,000 --> 00:08:41,000
And because of this, your attention model is not scalable.

149
00:08:41,000 --> 00:08:49,000
When I say scalable, that basically means if my data set is huge, right?

150
00:08:50,000 --> 00:08:59,000
Still the encoder decoder attention mechanism will not be scalable with respect to training, with respect

151
00:08:59,000 --> 00:09:00,000
to training.

152
00:09:02,000 --> 00:09:08,000
And this is really important to understand because the new models that we'll be seeing with respect

153
00:09:08,000 --> 00:09:09,000
to Transformers.

154
00:09:10,000 --> 00:09:16,000
Transformers don't use this.

155
00:09:16,000 --> 00:09:20,000
They never use an LSTM RNN in the encoder or decoder.

156
00:09:21,000 --> 00:09:22,000
Right.

157
00:09:22,000 --> 00:09:29,000
What they use is something called a self-attention module.

158
00:09:30,000 --> 00:09:38,000
And because of this self-attention module, you'll be able to see that all the words, all the words

159
00:09:38,000 --> 00:09:48,000
will be sent in parallel to the encoder for further processing.

160
00:09:48,000 --> 00:09:49,000
Right.

161
00:09:49,000 --> 00:09:53,000
And that is where this architecture comes into existence. Over here, in the input embeddings,

162
00:09:53,000 --> 00:09:53,000
Right?

163
00:09:53,000 --> 00:09:56,000
You're sending all the inputs at once.

164
00:09:56,000 --> 00:09:59,000
So it is supporting this entire parallel execution.

165
00:09:59,000 --> 00:10:04,000
And because of this, we'll also be learning about one more topic, which is called positional

166
00:10:04,000 --> 00:10:04,000
encoding.

167
00:10:07,000 --> 00:10:08,000
Positional encoding.

168
00:10:08,000 --> 00:10:10,000
We'll discuss this as we go ahead.

169
00:10:10,000 --> 00:10:10,000
Right.

170
00:10:10,000 --> 00:10:16,000
And this will also play a very important role when we are sending all these words in parallel,

171
00:10:16,000 --> 00:10:21,000
and how each and every word vector is going to get computed, we'll discuss this: what exactly

172
00:10:22,000 --> 00:10:24,000
this self-attention module will be doing.

173
00:10:24,000 --> 00:10:26,000
We'll be discussing it.

174
00:10:26,000 --> 00:10:28,000
Right now, what you should definitely understand over here

175
00:10:28,000 --> 00:10:30,000
is that in the attention mechanism or encoder-decoder,

176
00:10:30,000 --> 00:10:33,000
we are not able to do this; hence it is not scalable.

177
00:10:33,000 --> 00:10:38,000
Now, when we say scalable, that basically means this: the reason these transformers are performing really well is

178
00:10:38,000 --> 00:10:46,000
that as we keep on increasing the data set, you will be able to see that we are able to get some amazing

179
00:10:46,000 --> 00:10:49,000
models, amazing state of the art models.

180
00:10:49,000 --> 00:10:51,000
I say SOTA models, right?

181
00:10:51,000 --> 00:10:55,000
So state-of-the-art models we are able to get, specifically with respect to NLP tasks.

182
00:10:55,000 --> 00:10:58,000
Now, this is not just restricted to NLP tasks.

183
00:10:58,000 --> 00:11:01,000
With the help of transfer learning, right?

184
00:11:01,000 --> 00:11:09,000
With the help of transfer learning, transformers are now even used in multimodal tasks, okay.

185
00:11:09,000 --> 00:11:11,000
They're used in multi-modal tasks.

186
00:11:12,000 --> 00:11:14,000
Multimodal tasks basically means tasks

187
00:11:15,000 --> 00:11:21,000
that have both NLP plus images, you know. So here also they are able to perform really

188
00:11:21,000 --> 00:11:22,000
really well okay.

189
00:11:22,000 --> 00:11:27,000
How they are able to perform so well, we will understand once we understand the entire architecture.

190
00:11:27,000 --> 00:11:33,000
But this is the major, important thing that you really need to understand: why Transformers, and what

191
00:11:33,000 --> 00:11:37,000
was the problem that we had in our previous models, like the encoder-decoder.

192
00:11:37,000 --> 00:11:40,000
And this is one of the very important problems that we have.

193
00:11:40,000 --> 00:11:40,000
Right.

194
00:11:40,000 --> 00:11:40,000
right?

195
00:11:41,000 --> 00:11:48,000
So, just defining all these things: since it solves all these problems, you'll be able to

196
00:11:48,000 --> 00:11:49,000
see that now, everywhere,

197
00:11:50,000 --> 00:11:51,000
everywhere,

198
00:11:51,000 --> 00:11:53,000
Transformers

199
00:11:56,000 --> 00:11:59,000
have really changed the AI space.

200
00:12:00,000 --> 00:12:04,000
Now you'll be able to see a lot of SOTA models.

201
00:12:04,000 --> 00:12:04,000
Right.

202
00:12:04,000 --> 00:12:10,000
And these SOTA models are nothing but, let's say, some of the transformer models that we have, which

203
00:12:10,000 --> 00:12:16,000
are already trained on huge data sets, like BERT and GPT. Right now, what companies are doing is, if they

204
00:12:16,000 --> 00:12:19,000
really want to create their own model, they don't have to train it from scratch.

205
00:12:19,000 --> 00:12:26,000
And these models are trained with huge data, right?

206
00:12:26,000 --> 00:12:33,000
They can directly just use transfer learning and with the help of this transfer learning they are creating

207
00:12:33,000 --> 00:12:36,000
these amazing SOTA models.

208
00:12:36,000 --> 00:12:36,000
Right.

209
00:12:36,000 --> 00:12:38,000
State of the art models.

210
00:12:39,000 --> 00:12:40,000
Right.

211
00:12:40,000 --> 00:12:44,000
And the architecture is completely based on Transformers itself.

212
00:12:45,000 --> 00:12:47,000
And as I said, multimodal tasks.

213
00:12:47,000 --> 00:12:50,000
The same architecture is also used with respect to images.

214
00:12:50,000 --> 00:12:57,000
So if you see some of OpenAI's applications like DALL-E, right, just based on a

215
00:12:57,000 --> 00:13:00,000
text, they are able to generate images, right?

216
00:13:00,000 --> 00:13:02,000
It is entirely based on transformers.

217
00:13:03,000 --> 00:13:09,000
So just to give you an idea, this will also be very important, because based on this

218
00:13:09,000 --> 00:13:13,000
architecture, various LLMs are also used in generative AI.

219
00:13:14,000 --> 00:13:14,000
Okay.

220
00:13:14,000 --> 00:13:18,000
Generative AI. LLM basically means large language model.

221
00:13:19,000 --> 00:13:21,000
So I hope you got an idea.

222
00:13:21,000 --> 00:13:23,000
Just take this into consideration.

223
00:13:23,000 --> 00:13:26,000
This is a very important thing that we have discussed:

224
00:13:26,000 --> 00:13:31,000
we cannot send all the words of a sentence in parallel in the attention mechanism or encoder-decoder, but with the

225
00:13:31,000 --> 00:13:34,000
help of transformers, we are not using an LSTM RNN at all.

226
00:13:34,000 --> 00:13:39,000
Instead, we are using a self-attention module, and this self-attention module's functionality will be that

227
00:13:39,000 --> 00:13:45,000
we will be able to send all the words in parallel for further processing. What exactly

228
00:13:45,000 --> 00:13:47,000
happens in the self-attention module?

229
00:13:47,000 --> 00:13:48,000
We'll discuss more about it.

230
00:13:48,000 --> 00:13:49,000
Okay.

231
00:13:49,000 --> 00:13:52,000
So I hope you got an idea about this.

232
00:13:52,000 --> 00:13:57,000
There is one more very important problem that Transformers solve.

233
00:13:57,000 --> 00:14:00,000
That is with respect to contextual embedding.

234
00:14:02,000 --> 00:14:05,000
Contextual embeddings.

235
00:14:05,000 --> 00:14:06,000
Right.

236
00:14:07,000 --> 00:14:10,000
Now what exactly happens in contextual embedding?

237
00:14:10,000 --> 00:14:12,000
We will discuss this, okay.

238
00:14:14,000 --> 00:14:18,000
Now with respect to the contextual embedding, let me just give you an example.

239
00:14:18,000 --> 00:14:21,000
In order to make you understand this okay.

240
00:14:22,000 --> 00:14:23,000
Let's say I have a sentence.

241
00:14:23,000 --> 00:14:35,000
"My name is Krish and I play cricket."

242
00:14:37,000 --> 00:14:48,000
Okay, now let's say over here when I pass this entire information right in my embedding layer.

243
00:14:48,000 --> 00:14:50,000
Let's say this is my embedding layer.

244
00:14:51,000 --> 00:14:57,000
Now, you know, in embedding layer when we pass it, our main task will be that from this embedding

245
00:14:57,000 --> 00:15:01,000
layer for every word, we should be able to get our vectors right.

246
00:15:01,000 --> 00:15:02,000
Some vectors.

247
00:15:02,000 --> 00:15:10,000
So if I pass "my", I should be getting a vector; "name", I should be getting another vector; "is", I should be getting another vector; "Krish",

248
00:15:10,000 --> 00:15:15,000
I should be getting another vector; "and", I should be getting another vector. Based on time steps,

249
00:15:15,000 --> 00:15:20,000
I will be getting these kinds of vectors. And understand, if you are specifically using an embedding layer,

250
00:15:20,000 --> 00:15:25,000
and let's say this embedding layer is using some word embedding technique like

251
00:15:25,000 --> 00:15:29,000
word2vec, then for every word we are going to get a fixed vector.

252
00:15:29,000 --> 00:15:29,000
Okay.

253
00:15:29,000 --> 00:15:33,000
So if "my" is there, then I will be getting a fixed vector of some dimension.

254
00:15:33,000 --> 00:15:34,000
Similarly, "name" is there.

255
00:15:34,000 --> 00:15:35,000
I will be getting a fixed vector.

256
00:15:35,000 --> 00:15:37,000
Okay, "is" is there.

257
00:15:37,000 --> 00:15:38,000
I will be getting a fixed vector.

258
00:15:38,000 --> 00:15:39,000
"Krish" is there.

259
00:15:39,000 --> 00:15:41,000
I'll be getting a fixed vector. "And" is there.

260
00:15:41,000 --> 00:15:42,000
I'll be getting a fixed vector.

261
00:15:42,000 --> 00:15:43,000
"I" is there.

262
00:15:43,000 --> 00:15:45,000
I'll be getting a fixed vector right.

263
00:15:45,000 --> 00:15:47,000
And similarly, "play" is there.

264
00:15:47,000 --> 00:15:49,000
I'll be getting a fixed vector as well, right.

265
00:15:49,000 --> 00:15:52,000
All these words will be represented in some kind of vectors.

266
00:15:52,000 --> 00:15:58,000
Now I need to make you understand contextual vectors.

267
00:15:58,000 --> 00:16:01,000
What exactly are these contextual vectors?

268
00:16:02,000 --> 00:16:06,000
See, if you are able to get a vector for every word, that is fine, right?

269
00:16:06,000 --> 00:16:13,000
But whenever we have longer sentences, we should always try to get a vector based on the relationship

270
00:16:13,000 --> 00:16:15,000
with other words.

271
00:16:15,000 --> 00:16:19,000
Now here you can see my name is Krish and I play cricket, okay?

272
00:16:19,000 --> 00:16:21,000
"I" is obviously related to "Krish".

273
00:16:22,000 --> 00:16:24,000
There is some kind of relationship, right?

274
00:16:24,000 --> 00:16:28,000
So if I give these same words

275
00:16:30,000 --> 00:16:35,000
to my contextual vector embedding,

276
00:16:38,000 --> 00:16:40,000
contextual vector embedding,

277
00:16:40,000 --> 00:16:43,000
then all the vectors that I will be getting,

278
00:16:44,000 --> 00:16:51,000
these vectors should have some kind of relationship with respect to the other words. Like "Krish" and

279
00:16:51,000 --> 00:16:55,000
"I": here there is a very strong correlation, right?

280
00:16:55,000 --> 00:16:56,000
So some context will be there.

281
00:16:56,000 --> 00:16:59,000
I play cricket, so cricket is also there.

282
00:16:59,000 --> 00:17:02,000
So for this "cricket" vector, I should not be getting just a fixed vector.

283
00:17:02,000 --> 00:17:06,000
I should be getting a vector which should be related to "Krish".

284
00:17:06,000 --> 00:17:11,000
Based on this particular relation, there should be some changes in this particular vector which suggest

285
00:17:11,000 --> 00:17:16,000
that hey, there is some contextual dependency in this particular word.

286
00:17:16,000 --> 00:17:17,000
right?

287
00:17:17,000 --> 00:17:23,000
And this entire problem is solved by our self-attention.

288
00:17:25,000 --> 00:17:31,000
And because of this, you'll be able to see that our transformers will be more accurate.

289
00:17:31,000 --> 00:17:32,000
Right.

290
00:17:32,000 --> 00:17:35,000
And we'll be talking more about the self-attention module in the upcoming video.

291
00:17:35,000 --> 00:17:42,000
But if I talk about the two main overall reasons why Transformers are specifically used:

292
00:17:42,000 --> 00:17:47,000
One is that here, in Transformers, you can send all the words in parallel for processing, because of which

293
00:17:47,000 --> 00:17:51,000
the entire model becomes scalable with respect to training with huge data sets.

294
00:17:51,000 --> 00:17:52,000
Okay.

295
00:17:52,000 --> 00:17:56,000
The second thing is that it has this contextual embedding thing, right?

296
00:17:56,000 --> 00:18:01,000
Which was basically missing in the encoder-decoder, because in the encoder-decoder we just used to

297
00:18:01,000 --> 00:18:07,000
use an embedding layer, pass each and every word, get a vector, and do the further processing

298
00:18:07,000 --> 00:18:08,000
with respect to this.

299
00:18:08,000 --> 00:18:13,000
But now, with the help of the self-attention, you'll be seeing that we'll be able to even create contextual

300
00:18:13,000 --> 00:18:14,000
vector embedding.

301
00:18:14,000 --> 00:18:20,000
Now in our next video, we are going to deep dive more into this and try to understand how this entire

302
00:18:20,000 --> 00:18:22,000
architecture basically works.

303
00:18:22,000 --> 00:18:26,000
And step by step, we'll try to cover each and every thing as we go ahead.

304
00:18:26,000 --> 00:18:26,000
Right.

305
00:18:26,000 --> 00:18:28,000
So yes, this was it from my side.

306
00:18:28,000 --> 00:18:29,000
I will see you all in the next video.

307
00:18:29,000 --> 00:18:30,000
Thank you.