1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue our discussion with respect to Transformers.

3
00:00:03,000 --> 00:00:05,000
Until now we have discussed so many things.

4
00:00:05,000 --> 00:00:06,000
That is self-attention layer.

5
00:00:06,000 --> 00:00:09,000
We have discussed what multi-head attention is.

6
00:00:09,000 --> 00:00:14,000
You know, all the operations that specifically happen inside the self-attention layer itself.

7
00:00:14,000 --> 00:00:15,000
We have actually discussed okay.

8
00:00:16,000 --> 00:00:20,000
Now in this video we are going to discuss a very important topic, which is called positional

9
00:00:20,000 --> 00:00:21,000
encoding.

10
00:00:21,000 --> 00:00:24,000
That is nothing but representing the order of a sequence.

11
00:00:24,000 --> 00:00:24,000
Okay.

12
00:00:25,000 --> 00:00:31,000
Now see this order of sequence is a very important thing.

13
00:00:31,000 --> 00:00:31,000
Okay.

14
00:00:33,000 --> 00:00:45,000
One of the major advantages of using a transformer is that, okay, the major advantage, all these tokens.

15
00:00:45,000 --> 00:00:45,000
Right.

16
00:00:45,000 --> 00:00:48,000
All these word tokens, word tokens.

17
00:00:52,000 --> 00:00:55,000
It can process.

18
00:00:58,000 --> 00:00:59,000
It can be processed.

19
00:00:59,000 --> 00:01:00,000
In parallel.

20
00:01:01,000 --> 00:01:02,000
Right.

21
00:01:02,000 --> 00:01:09,000
This is the most important, most important advantage because at that time I can just directly go ahead

22
00:01:09,000 --> 00:01:11,000
and pass my x1, x2 words.

23
00:01:12,000 --> 00:01:15,000
And all this operation that specifically happens in the self-attention layer.

24
00:01:16,000 --> 00:01:19,000
Uh, it will just happen in parallel, right?

25
00:01:19,000 --> 00:01:26,000
And if I consider because of this advantage, there is also a specific drawback.

26
00:01:28,000 --> 00:01:38,000
So now if I talk about the drawback because of this advantage it lacks the sequential

27
00:01:39,000 --> 00:01:46,000
sequential structure of the words.

28
00:01:48,000 --> 00:01:50,000
Now, what does that basically mean, right?

29
00:01:50,000 --> 00:01:54,000
Now, this sequential structure basically means the order, right?

30
00:01:54,000 --> 00:02:01,000
Obviously, since all the words are specifically taken by the self-attention layer in a parallel mode,

31
00:02:01,000 --> 00:02:06,000
all the word tokens are taken all at once, and all the processing is basically happening.

32
00:02:06,000 --> 00:02:11,000
It will not have any idea whether the word x1 has come first or the word x2 has come first, right?

33
00:02:11,000 --> 00:02:14,000
Let's say I take an example over here, right?

34
00:02:14,000 --> 00:02:21,000
I say hey, lion eats tiger okay.

35
00:02:21,000 --> 00:02:24,000
Or I can basically say lion kills tiger.

36
00:02:24,000 --> 00:02:24,000
Okay.

37
00:02:25,000 --> 00:02:27,000
So we can take this example.

38
00:02:28,000 --> 00:02:30,000
So this is just one sentence.

39
00:02:30,000 --> 00:02:35,000
The other sentence can be tiger kills lion.

40
00:02:36,000 --> 00:02:42,000
Now because of this ordering of the words in sentences one and two, the meaning of the sentence completely

41
00:02:42,000 --> 00:02:43,000
changes, right?

42
00:02:44,000 --> 00:02:48,000
Anyhow, in both sentences, all three words are basically considered.

43
00:02:48,000 --> 00:02:54,000
And at the end of the day, if I pass this through the self-attention layer, you know, and since right

44
00:02:54,000 --> 00:02:59,000
now my self-attention layer is also not taking care of the ordering part, both these sentences will be

45
00:02:59,000 --> 00:03:03,000
returned with the same vectors, right?

46
00:03:03,000 --> 00:03:04,000
With respect to all the words.

47
00:03:04,000 --> 00:03:10,000
But as you all know, these sentences are completely different because of the ordering of the words.

48
00:03:10,000 --> 00:03:18,000
So this is what right now my self-attention layer or this entire transformer architecture is missing,

49
00:03:18,000 --> 00:03:19,000
right?

50
00:03:19,000 --> 00:03:21,000
It is missing this information.
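
To make the missing-order problem concrete, here is a minimal sketch. It assumes a simplified self-attention in which the queries, keys and values are the raw embeddings themselves (no learned weight matrices), and the 3-dimensional embeddings are made-up values for illustration only. Under those assumptions, each word gets the same output vector in both sentences, regardless of word order.

```python
import math

def self_attention(vectors):
    """Simplified self-attention (assumption: Q = K = V = the raw embeddings,
    no learned weights and no positional information)."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs

# Hypothetical embeddings, invented for this illustration.
emb = {"lion": [1.0, 0.0, 0.5], "kills": [0.2, 0.9, 0.1], "tiger": [0.0, 1.0, 0.7]}

s1 = ["lion", "kills", "tiger"]
s2 = ["tiger", "kills", "lion"]

out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))

for word in s1:
    same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(out1[word], out2[word]))
    print(word, "gets the same vector in both sentences:", same)
```

The output vectors only move around with the words; nothing in them records which word came first.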

51
00:03:21,000 --> 00:03:28,000
So in order to prevent this, what we specifically do is that we use the concept of positional encoding.

52
00:03:28,000 --> 00:03:34,000
And this positional encoding will be responsible for representing the order of a sequence.

53
00:03:34,000 --> 00:03:34,000
Okay.

54
00:03:34,000 --> 00:03:37,000
And that is what we are going to learn about.

55
00:03:38,000 --> 00:03:43,000
Uh, because without this positional encoding it will not know which word is probably coming first.

56
00:03:43,000 --> 00:03:46,000
Okay, so the concept is very simple.

57
00:03:46,000 --> 00:03:53,000
What we will do is that, uh, according to the research paper "Attention Is All You Need", right?

58
00:03:53,000 --> 00:03:55,000
"Attention Is All You Need."

59
00:03:55,000 --> 00:04:01,000
So what they have actually done is that, see, if this is my embedding vector.

60
00:04:01,000 --> 00:04:02,000
Right.

61
00:04:02,000 --> 00:04:03,000
For this particular word.

62
00:04:03,000 --> 00:04:04,000
For this particular word.

63
00:04:04,000 --> 00:04:08,000
Similarly, we will go ahead and create one more vector.

64
00:04:09,000 --> 00:04:12,000
This vector and we will see how we can generate this particular vector.

65
00:04:12,000 --> 00:04:17,000
And this vector will be nothing but the positional encoding vector.

66
00:04:17,000 --> 00:04:20,000
This is nothing but the positional encoding vector.

67
00:04:21,000 --> 00:04:31,000
And this positional encoding vector will be added, will be added to this vector.

68
00:04:32,000 --> 00:04:35,000
Similarly, this positional encoding vector will be added to this vector.

69
00:04:36,000 --> 00:04:42,000
And this vector will be responsible for telling which word is in which position.

70
00:04:42,000 --> 00:04:42,000
Exactly.

71
00:04:42,000 --> 00:04:43,000
Okay.

72
00:04:43,000 --> 00:04:46,000
And that is the entire idea behind it.

73
00:04:46,000 --> 00:04:47,000
Right.

74
00:04:47,000 --> 00:04:50,000
So there are multiple ways in which we can fix this particular problem.

75
00:04:50,000 --> 00:04:57,000
One of the ways is that, why not, with these embedding vectors that we are passing, let's say over here

76
00:04:57,000 --> 00:04:58,000
I've just taken three dimensions.

77
00:04:58,000 --> 00:05:03,000
Why not just add one more dimension and probably talk about the position of the words that are coming?

78
00:05:03,000 --> 00:05:08,000
Let's say for the first sentence, tiger comes first, so this position will be one.

79
00:05:08,000 --> 00:05:14,000
Then for the second vector that we are giving, this position will be two, along with the vectors

80
00:05:14,000 --> 00:05:15,000
that we are giving.

81
00:05:15,000 --> 00:05:18,000
And the third word basically is about lion.

82
00:05:18,000 --> 00:05:21,000
Let's say that we are giving one more word over here with the third index.

83
00:05:22,000 --> 00:05:22,000
Right.

84
00:05:22,000 --> 00:05:26,000
So once I'm sending this information here we are giving some kind of position.

85
00:05:26,000 --> 00:05:27,000
Right.

86
00:05:27,000 --> 00:05:30,000
But do you think right now for a smaller text it is fine.

87
00:05:30,000 --> 00:05:38,000
But let's say if I have a book or if I have a journal, or if I have a novel, right, there may be

88
00:05:38,000 --> 00:05:40,000
more than one lakh (100,000) words.

89
00:05:40,000 --> 00:05:46,000
Now, if there are more than one lakh words, obviously we are not limiting how many positions

90
00:05:46,000 --> 00:05:48,000
we'll keep on writing, right?

91
00:05:48,000 --> 00:05:53,000
And when we have so many words, and just imagine we are trying to add a vector with such a

92
00:05:53,000 --> 00:05:53,000
big number.

93
00:05:53,000 --> 00:05:59,000
So this usually causes a problem while we are doing backpropagation, while updating the weights.

94
00:05:59,000 --> 00:06:04,000
And if we consider huge, huge data, right, there is no limit on the number of words.

95
00:06:04,000 --> 00:06:10,000
So that basically means the number right now with respect to the position, it is unbounded.

96
00:06:10,000 --> 00:06:12,000
This position can be any number.

97
00:06:12,000 --> 00:06:13,000
Right.

98
00:06:13,000 --> 00:06:14,000
So this is one of the problems.

99
00:06:14,000 --> 00:06:17,000
And we cannot directly use this specific position.

100
00:06:17,000 --> 00:06:21,000
You may also say hey we can probably apply some kind of mathematical rule and we can try to replace

101
00:06:21,000 --> 00:06:22,000
it.

102
00:06:22,000 --> 00:06:27,000
But if we go according to the research paper, that is "Attention Is All You Need", here

103
00:06:27,000 --> 00:06:28,000
The idea is very simple.

104
00:06:28,000 --> 00:06:33,000
Here we will create a positional encoding vector which will be of the same size.

105
00:06:33,000 --> 00:06:38,000
And then we will try to add this particular vector to this embedding vector itself.

106
00:06:38,000 --> 00:06:38,000
Right.

107
00:06:38,000 --> 00:06:44,000
And that will actually give the uh, position information with respect to the word that comes in the

108
00:06:44,000 --> 00:06:45,000
sentence.

109
00:06:45,000 --> 00:06:47,000
But now the question comes right.

110
00:06:47,000 --> 00:06:52,000
How do we, how do we create these positional encoding vectors?

111
00:06:52,000 --> 00:06:52,000
Okay.

112
00:06:52,000 --> 00:06:58,000
So for this, uh, there are two different ways, out of which we will be discussing one of the ways,

113
00:06:58,000 --> 00:07:06,000
which is even spoken in the, uh, research paper or which is written in the, uh, research paper.

114
00:07:06,000 --> 00:07:08,000
So types of positional encoding.

115
00:07:08,000 --> 00:07:09,000
Okay.

116
00:07:09,000 --> 00:07:12,000
So first one is nothing but sinusoidal.

117
00:07:13,000 --> 00:07:18,000
Sinusoidal position encoding.

118
00:07:21,000 --> 00:07:23,000
Sinusoidal position encoding.

119
00:07:23,000 --> 00:07:30,000
The second one is something called learned positional encoding, okay.

120
00:07:31,000 --> 00:07:38,000
So out of these two, okay, I will give a detailed idea about what exactly sinusoidal positional encoding is.

121
00:07:38,000 --> 00:07:40,000
We'll try to understand this okay.

122
00:07:40,000 --> 00:07:43,000
And uh sinusoidal.

123
00:07:43,000 --> 00:07:44,000
Uh see.

124
00:07:44,000 --> 00:07:46,000
So let's go with the first one.

125
00:07:46,000 --> 00:07:59,000
That is nothing but sinusoidal positional encoding, and what exactly is this? This is basically

126
00:07:59,000 --> 00:08:02,000
a technique for creating those positional, uh, encoding vectors.

127
00:08:03,000 --> 00:08:11,000
So here it uses sine and cosine functions.

128
00:08:14,000 --> 00:08:19,000
Sine and cosine functions of different frequencies.

129
00:08:19,000 --> 00:08:27,000
We'll try to understand what exactly it is okay of different frequencies to create.

130
00:08:29,000 --> 00:08:32,000
Positional encodings.

131
00:08:32,000 --> 00:08:35,000
And we'll solve this entirely with a good example also.

132
00:08:35,000 --> 00:08:36,000
Okay.

133
00:08:36,000 --> 00:08:40,000
And now because of this positional encoding.

134
00:08:40,000 --> 00:08:41,000
So sorry.

135
00:08:41,000 --> 00:08:43,000
Because of this positional encoding, we will not have this particular problem.

136
00:08:43,000 --> 00:08:48,000
Also, it will not be unbounded, because in the sinusoidal encoding all the values will be ranging

137
00:08:48,000 --> 00:08:55,000
between minus one to plus one, and at any point of time with respect to any number of vectors.

138
00:08:55,000 --> 00:08:59,000
Okay, you'll be able to see that we'll always be getting values between minus one and plus one.
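
A small sketch of the boundedness point, assuming we simply compare a raw position index with the sine of that index (the sample positions are arbitrary, and the frequency scaling of the full formula is left out here). The raw index grows with the length of the text, while the sine value always stays between minus one and plus one.

```python
import math

# Arbitrary sample positions, up to one lakh (100,000).
positions = [1, 10, 1_000, 100_000]

# Raw integer positions grow without bound as the text gets longer.
raw = [float(p) for p in positions]

# Sine of the position is always inside [-1, 1].
bounded = [math.sin(p) for p in positions]

print("largest raw position:", max(raw))
print("all sine values in [-1, 1]:", all(-1.0 <= v <= 1.0 for v in bounded))
```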

139
00:08:59,000 --> 00:09:00,000
Okay.

140
00:09:01,000 --> 00:09:04,000
Um, so let's see how exactly this works.

141
00:09:04,000 --> 00:09:08,000
And uh, then we'll try to understand with a basic example.

142
00:09:08,000 --> 00:09:15,000
First of all, in order to compute the sinusoidal position encodings, the formula is very simple okay.

143
00:09:15,000 --> 00:09:20,000
And here what we are doing basically whatever vectors we really want to create right.

144
00:09:20,000 --> 00:09:23,000
Positional encoding vectors, all these values.

145
00:09:23,000 --> 00:09:28,000
Let's say if I have four dimensions, all these values will be replaced, right?

146
00:09:28,000 --> 00:09:33,000
All these values based on the position of the words will be replaced by this formula.

147
00:09:33,000 --> 00:09:34,000
So this is positional encoding.

148
00:09:35,000 --> 00:09:39,000
That is nothing but PE of pos, uh, position, comma 2i.

149
00:09:39,000 --> 00:09:41,000
It is nothing but sine of

150
00:09:43,000 --> 00:09:50,000
position divided by 10,000 to the power of 2i

151
00:09:51,000 --> 00:09:52,000
divided by d_model.

152
00:09:52,000 --> 00:09:54,000
We will try to understand each and everything.

153
00:09:54,000 --> 00:09:54,000
Okay.

154
00:09:55,000 --> 00:09:59,000
This is for the sinusoidal uh, sine function.

155
00:09:59,000 --> 00:10:04,000
One more positional encoding that we will try to calculate is nothing but PE of pos comma 2i plus

156
00:10:04,000 --> 00:10:05,000
one.

157
00:10:06,000 --> 00:10:09,000
Sorry, 2i plus one.

158
00:10:09,000 --> 00:10:13,000
Yeah, this is nothing but cosine of the same formula.

159
00:10:13,000 --> 00:10:19,000
That is, position divided by 10,000 to the power of 2i divided by d_model.

160
00:10:20,000 --> 00:10:20,000
Right.

161
00:10:20,000 --> 00:10:22,000
So this is the formula.

162
00:10:22,000 --> 00:10:24,000
Now here let us break down this particular formula.

163
00:10:24,000 --> 00:10:29,000
Because see, at the end of the day we need to create these positional encoding vectors in such a way that

164
00:10:29,000 --> 00:10:34,000
the dimensions that we specifically have with respect to the embedding vector, the same dimension,

165
00:10:34,000 --> 00:10:35,000
they should have.

166
00:10:35,000 --> 00:10:44,000
So here we will go ahead and write: where p, sorry, where pos is nothing but,

167
00:10:44,000 --> 00:10:45,000
it is the position.

168
00:10:48,000 --> 00:10:52,000
Okay, I is the dimension.

169
00:10:55,000 --> 00:10:57,000
Okay D of model.

170
00:11:00,000 --> 00:11:00,000
Is the.

171
00:11:01,000 --> 00:11:03,000
Let me just clearly write this down.

172
00:11:04,000 --> 00:11:05,000
Okay.

173
00:11:06,000 --> 00:11:13,000
So here I will just go ahead and write where POS is the position.

174
00:11:14,000 --> 00:11:18,000
Everything will make sense once I solve a specific problem and show it to you.

175
00:11:19,000 --> 00:11:27,000
And then here i is the dimension index.

176
00:11:27,000 --> 00:11:27,000
Okay.

177
00:11:28,000 --> 00:11:30,000
And then you have this d_model.

178
00:11:31,000 --> 00:11:33,000
It is the.

179
00:11:35,000 --> 00:11:37,000
Dimensionality

180
00:11:39,000 --> 00:11:41,000
of the embeddings.
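
Written out, the two formulas from "Attention Is All You Need" that are being described here are as follows. The layout is a reconstruction from the spoken description; the symbols pos, i and d_model are the ones just defined, with i running from 0 to d_model/2 - 1.

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
\end{aligned}
```

Even dimension indices (2i) use sine and odd dimension indices (2i + 1) use cosine, and each sine and cosine pair shares the same value of i.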

181
00:11:45,000 --> 00:11:46,000
Okay.

182
00:11:47,000 --> 00:11:50,000
Now what exactly this is and how we will go ahead and compute it.

183
00:11:50,000 --> 00:11:54,000
I will just show you this with a specific example.

184
00:11:54,000 --> 00:11:54,000
Okay.

185
00:11:55,000 --> 00:12:04,000
Let's say, uh, here I have an example which says "the cat sat".

186
00:12:04,000 --> 00:12:04,000
Okay.

187
00:12:05,000 --> 00:12:09,000
Uh, first of all, we'll convert this into vectors.

188
00:12:09,000 --> 00:12:09,000
Okay.

189
00:12:09,000 --> 00:12:18,000
So let's say that this is my "the"; "the" is represented by a vector like this: 0.1, 0.2, I'm just initializing,

190
00:12:18,000 --> 00:12:21,000
0.3, 0.4, okay.

191
00:12:21,000 --> 00:12:23,000
Similarly I have this cat.

192
00:12:23,000 --> 00:12:25,000
Cat is represented by

193
00:12:25,000 --> 00:12:31,000
0.5, 0.6, 0.7, 0.8

194
00:12:31,000 --> 00:12:32,000
okay.

195
00:12:32,000 --> 00:12:34,000
And then you have "sat".

196
00:12:34,000 --> 00:12:40,000
"Sat" is nothing but 0.9, 1.0, 1.1, 1.2.

197
00:12:40,000 --> 00:12:43,000
So these are the embedding vectors that we have.

198
00:12:44,000 --> 00:12:51,000
Now we will go ahead and create the positional encodings for this because before giving it to the self-attention

199
00:12:51,000 --> 00:12:54,000
layer, with the same dimension; this is just four dimensions.

200
00:12:54,000 --> 00:12:58,000
So this entirely is like four dimensions for each and every word.

201
00:12:58,000 --> 00:13:03,000
Similarly, we have to go ahead and create our positional encoding with the same four dimensions.

202
00:13:03,000 --> 00:13:08,000
And we are going to specifically use this technique, which is called positional encoding with the

203
00:13:08,000 --> 00:13:09,000
sinusoidal functions.

204
00:13:09,000 --> 00:13:09,000
Okay.

205
00:13:10,000 --> 00:13:11,000
So here what we will do.

206
00:13:11,000 --> 00:13:14,000
Again we'll go ahead and probably use the formula okay.

207
00:13:15,000 --> 00:13:18,000
Uh with respect to the positional encoding.

208
00:13:18,000 --> 00:13:22,000
So here I'll write P or let me use another color.

209
00:13:23,000 --> 00:13:32,000
So here I will just go ahead and write PE with respect to pos comma 2i is nothing but sine of

210
00:13:34,000 --> 00:13:35,000
pos divided by

211
00:13:38,000 --> 00:13:43,000
10,000 to the power of 2i divided by d_model.

212
00:13:44,000 --> 00:13:46,000
Okay, the same formula.

213
00:13:46,000 --> 00:13:56,000
And here I'm just going to write PE, that is, positional encoding of pos comma 2i plus one is equal

214
00:13:56,000 --> 00:13:59,000
to cos of pos.

215
00:14:02,000 --> 00:14:08,000
divided by 10,000 to the power of 2i divided by d_model, okay.

216
00:14:08,000 --> 00:14:09,000
Perfect.

217
00:14:09,000 --> 00:14:15,000
Now understand over here, okay, I have to generate four positional encoding values.

218
00:14:15,000 --> 00:14:16,000
Okay.

219
00:14:16,000 --> 00:14:20,000
So here for our example.

220
00:14:23,000 --> 00:14:24,000
Okay.

221
00:14:24,000 --> 00:14:28,000
d_model basically means what? The dimension of the model.

222
00:14:28,000 --> 00:14:30,000
So dimension of model is four right.

223
00:14:31,000 --> 00:14:32,000
Four four over here.

224
00:14:32,000 --> 00:14:36,000
So I will just go ahead and write for position.

225
00:14:39,000 --> 00:14:41,000
POS is equal to zero for this position.

226
00:14:41,000 --> 00:14:44,000
For this position, how do I place a value?

227
00:14:44,000 --> 00:14:47,000
Okay, so for this I will go ahead and calculate p.

228
00:14:48,000 --> 00:14:51,000
My position is zero comma.

229
00:14:51,000 --> 00:14:53,000
What is I over here.

230
00:14:53,000 --> 00:14:57,000
What is i? 2i, and then 2i plus one.

231
00:14:57,000 --> 00:15:01,000
Okay so for the first word you'll be able to see right.

232
00:15:01,000 --> 00:15:05,000
For the first word I have to represent this with four positional encoding values.

233
00:15:05,000 --> 00:15:06,000
Right.

234
00:15:06,000 --> 00:15:10,000
So I will iterate through this specific I right.

235
00:15:10,000 --> 00:15:13,000
So here what I will say for position zero.

236
00:15:14,000 --> 00:15:20,000
If I want to calculate the positional encoding for zero comma zero it is nothing but sine of zero divided

237
00:15:20,000 --> 00:15:21,000
by whatever.

238
00:15:21,000 --> 00:15:25,000
is there, 10,000 to the power of, sorry, to the power of,

239
00:15:25,000 --> 00:15:26,000
uh, 2i.

240
00:15:26,000 --> 00:15:28,000
So I is zero over here, right.

241
00:15:28,000 --> 00:15:30,000
So this is zero, right.

242
00:15:30,000 --> 00:15:32,000
So obviously this will entirely become zero.

243
00:15:32,000 --> 00:15:34,000
That is nothing but, uh, zero itself.

244
00:15:34,000 --> 00:15:38,000
So here I will just go ahead and write 10,000 to the power of zero.

245
00:15:38,000 --> 00:15:38,000
Right.

246
00:15:38,000 --> 00:15:41,000
Because I is zero divided by d of model.

247
00:15:41,000 --> 00:15:42,000
Again it is nothing but four.

248
00:15:42,000 --> 00:15:46,000
So this entirely will become sine of zero, right?

249
00:15:46,000 --> 00:15:48,000
So here I'm actually going to get zero itself.

250
00:15:48,000 --> 00:15:49,000
Then.

251
00:15:49,000 --> 00:15:53,000
Similarly if I go ahead and say positional encoding for zero position.

252
00:15:53,000 --> 00:15:57,000
But for the next index, index one, okay.

253
00:15:58,000 --> 00:16:03,000
How many index positions do I really have to compute? It is 0, 1, 2, 3.

254
00:16:03,000 --> 00:16:04,000
Right.

255
00:16:04,000 --> 00:16:10,000
So out of the four, for index one, again I will go ahead and compute sine of zero divided by, zero divided by

256
00:16:10,000 --> 00:16:14,000
anything over here will be zero, so sine of zero.

257
00:16:14,000 --> 00:16:19,000
Again it will be zero or sorry should not be sine of zero.

258
00:16:19,000 --> 00:16:21,000
Uh, just a second.

259
00:16:22,000 --> 00:16:23,000
It should be cos.

260
00:16:25,000 --> 00:16:30,000
Because for every position you'll be able to see that I will be applying this position and then I will

261
00:16:30,000 --> 00:16:31,000
be applying this position.

262
00:16:31,000 --> 00:16:31,000
Right.

263
00:16:32,000 --> 00:16:40,000
So here you'll be able to see, PE, the positional encoding of zero comma one, okay, is nothing but

264
00:16:40,000 --> 00:16:46,000
cos of, whatever cosine function we are specifically writing, the position is nothing but zero, and zero divided

265
00:16:46,000 --> 00:16:47,000
by anything will be zero.

266
00:16:47,000 --> 00:16:50,000
So cos zero is nothing but one.

267
00:16:50,000 --> 00:16:51,000
Okay.

268
00:16:51,000 --> 00:16:55,000
Similarly I will go ahead and compute for positional encoding for zero comma two.

269
00:16:55,000 --> 00:16:56,000
Now zero comma two.

270
00:16:56,000 --> 00:16:58,000
Again I will be using sine, okay.

271
00:16:58,000 --> 00:17:07,000
See, uh, you'll be able to see that if I have these embeddings, like if I have these four dimensions, whenever

272
00:17:07,000 --> 00:17:09,000
I apply a sinusoidal wave.

273
00:17:09,000 --> 00:17:10,000
Right.

274
00:17:10,000 --> 00:17:13,000
Like this sinusoidal wave looks something like this.

275
00:17:13,000 --> 00:17:16,000
And the, uh, cosine wave looks something like this.

276
00:17:16,000 --> 00:17:22,000
So for each vector, we have to probably go ahead and find out the sinusoidal wave, uh, sinusoidal

277
00:17:22,000 --> 00:17:25,000
values for that and the cosine values for that.

278
00:17:25,000 --> 00:17:31,000
That basically means if I want to find out the positional encoding for this four, uh, dimensional vector

279
00:17:31,000 --> 00:17:36,000
here, this one will basically be calculated as a combination of two.

280
00:17:37,000 --> 00:17:39,000
So let's say this is my first combination.

281
00:17:39,000 --> 00:17:41,000
This is my second combination.

282
00:17:41,000 --> 00:17:44,000
That is the reason we are using this formula.

283
00:17:44,000 --> 00:17:47,000
This combination will be calculated as follows.

284
00:17:47,000 --> 00:17:54,000
The first one will be calculated with this formula and the second one will be calculated with this formula.

285
00:17:54,000 --> 00:17:59,000
And again when we go to the next set of two, this will be calculated by this formula.

286
00:17:59,000 --> 00:18:01,000
And this will get calculated by this formula.

287
00:18:01,000 --> 00:18:04,000
That is how it exactly works okay.

288
00:18:04,000 --> 00:18:11,000
The reason is very simple why we are using in this way, because we should not be getting a similar

289
00:18:11,000 --> 00:18:13,000
kind of encoded values.

290
00:18:13,000 --> 00:18:22,000
If we get a similar kind of encoded values then we will miss the order of the, we will miss the order

291
00:18:22,000 --> 00:18:25,000
of the elements.

292
00:18:25,000 --> 00:18:26,000
Right?

293
00:18:26,000 --> 00:18:27,000
We will miss the order of the elements.

294
00:18:27,000 --> 00:18:31,000
This is really important to specifically understand.

295
00:18:32,000 --> 00:18:37,000
Let's say you say, hey Chris, why use this cosine? Instead just try to use sine, okay?

296
00:18:37,000 --> 00:18:42,000
The amazing thing about sine will be that all the values will be ranging between minus one to plus one.

297
00:18:42,000 --> 00:18:51,000
Okay, but if I just use sine, it may be that for two different values, I may get the same

298
00:18:51,000 --> 00:18:53,000
sine value, right? For this.

299
00:18:53,000 --> 00:18:55,000
Also, I may get the same sine value for this.

300
00:18:55,000 --> 00:18:57,000
Also, I may get the same sine value.

301
00:18:57,000 --> 00:19:01,000
So that is the reason we may miss the order of the elements or of the tokens.

302
00:19:01,000 --> 00:19:03,000
And we don't want this to happen.

303
00:19:03,000 --> 00:19:07,000
So what we have done is that we have combined this with another cosine equation.
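
A quick numeric check of this argument, as a sketch with two arbitrarily chosen angles: sine alone cannot tell x apart from pi minus x, but once the cosine is also looked at, the two angles are no longer confused.

```python
import math

# Two different angles, chosen for illustration: x and (pi - x).
a = 0.5
b = math.pi - 0.5

# Sine alone gives the same value for both angles.
same_sine = math.isclose(math.sin(a), math.sin(b))

# The cosine values differ (they have opposite signs), so the
# (sine, cosine) pair separates the two angles.
same_cosine = math.isclose(math.cos(a), math.cos(b))

print("same sine:", same_sine)      # True
print("same cosine:", same_cosine)  # False
```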

304
00:19:07,000 --> 00:19:08,000
Okay.

305
00:19:08,000 --> 00:19:10,000
Now what will happen for every vector?

306
00:19:10,000 --> 00:19:14,000
First, if this element is computed by sine, the next element will be computed by cosine.

307
00:19:14,000 --> 00:19:19,000
So let's continue this particular, uh, calculation with respect to the positional encoding.

308
00:19:19,000 --> 00:19:24,000
So now my third one that I'm actually going to compute over here will be the positional encoding of zero comma two, with sine.

309
00:19:25,000 --> 00:19:27,000
See, the index will go through, uh, 0, 1, 2.

310
00:19:27,000 --> 00:19:30,000
At this position I have to use sine; here

311
00:19:30,000 --> 00:19:31,000
I have to use cosine.

312
00:19:31,000 --> 00:19:31,000
Right.

313
00:19:31,000 --> 00:19:34,000
So this will basically become sine of zero.

314
00:19:35,000 --> 00:19:39,000
And then we are going to use 10,000 to the power of two by four, okay.

315
00:19:39,000 --> 00:19:42,000
But again here I'm actually going to get sine of zero.

316
00:19:42,000 --> 00:19:43,000
Then again this will be zero.

317
00:19:43,000 --> 00:19:47,000
Then finally we'll go ahead and compute the positional encoding of zero comma three.

318
00:19:47,000 --> 00:19:49,000
Again we are going to use cosine.

319
00:19:49,000 --> 00:19:51,000
Again this entire value will become zero and this will be one.

320
00:19:51,000 --> 00:19:59,000
So with respect to our first word right, what is the positional encoding that we are getting?

321
00:20:00,000 --> 00:20:04,000
It is nothing but 0, 1, 0, 1, right?
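
The position-zero calculation above can be sketched in a few lines of Python. The helper name `pe_value` is hypothetical and introduced only for this illustration; it evaluates one element at a time, the same way the values were worked out by hand, using sine for even dimension indices and cosine for odd ones.

```python
import math

def pe_value(pos, k, d_model):
    """Encoding value at dimension index k for position pos.
    Even k uses sine, odd k uses cosine; both share i = k // 2."""
    i = k // 2
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

d_model = 4  # dimension used in the example from the lecture

# Position 0 (the first word): the angle is always 0, so sin -> 0 and cos -> 1.
pe_pos0 = [pe_value(0, k, d_model) for k in range(d_model)]
print(pe_pos0)  # [0.0, 1.0, 0.0, 1.0]
```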

322
00:20:04,000 --> 00:20:10,000
Now similarly if we go with respect to the position one right position one is with respect to our second

323
00:20:10,000 --> 00:20:10,000
word, right.

324
00:20:10,000 --> 00:20:13,000
So if we go ahead and calculate it.

325
00:20:16,000 --> 00:20:19,000
So let me just go ahead and write for position one.

326
00:20:19,000 --> 00:20:23,000
So for position is equal to one.

327
00:20:23,000 --> 00:20:25,000
Now I'm just going to use the same formula.

328
00:20:25,000 --> 00:20:28,000
First of all I'll write the positional encoding for one comma zero.

329
00:20:28,000 --> 00:20:31,000
One comma zero basically means my second vector.

330
00:20:31,000 --> 00:20:33,000
So this is for this.

331
00:20:33,000 --> 00:20:38,000
And this one will be my second vector which will be represented by another four dimension.

332
00:20:38,000 --> 00:20:43,000
The positional encoding will be a positional encoding will be shown by.

333
00:20:43,000 --> 00:20:44,000
And this is for this particular word.

334
00:20:44,000 --> 00:20:46,000
The positional encoding is for this word.

335
00:20:46,000 --> 00:20:47,000
Okay.

336
00:20:47,000 --> 00:20:49,000
Again I have that same operation that I have to do.

337
00:20:49,000 --> 00:20:53,000
And when I say position one, that basically means we are going for the second word, right?

338
00:20:53,000 --> 00:20:56,000
And finally after that we'll go for the third word.

339
00:20:56,000 --> 00:20:57,000
So position one.

340
00:20:57,000 --> 00:20:58,000
Then again we go ahead and compute it.

341
00:20:58,000 --> 00:21:05,000
It will be sine again; it will be one divided by 10,000 to the power of, okay,

342
00:21:05,000 --> 00:21:07,000
Zero divided by four.

343
00:21:07,000 --> 00:21:07,000
Right?

344
00:21:07,000 --> 00:21:08,000
Why zero divided by four?

345
00:21:08,000 --> 00:21:10,000
Because I is zero.

346
00:21:10,000 --> 00:21:12,000
So again I'm going to get sine of one.

347
00:21:12,000 --> 00:21:13,000
Sine of one is nothing.

348
00:21:13,000 --> 00:21:17,000
But it is approximately, if you go ahead and calculate it, 0.8415.

349
00:21:19,000 --> 00:21:26,000
So similarly we will go ahead and calculate the positional encoding for position one.

350
00:21:26,000 --> 00:21:28,000
So over here we will go ahead and compute.

351
00:21:28,000 --> 00:21:35,000
Now the positional encoding of position one with the dimension index equal to one.

352
00:21:35,000 --> 00:21:37,000
So again here what formula we are going to use.

353
00:21:37,000 --> 00:21:40,000
We are just going to write cos and let's see the formula.

354
00:21:40,000 --> 00:21:46,000
So I will also make sure that I will try to specify this formula over here so that you will be able

355
00:21:46,000 --> 00:21:47,000
to refer to this.

356
00:21:47,000 --> 00:21:47,000
Okay.

357
00:21:47,000 --> 00:21:49,000
So here I'm just going to copy it.

358
00:21:49,000 --> 00:21:51,000
I'm going to paste it over here.

359
00:21:51,000 --> 00:21:54,000
The next formula again I will select this.

360
00:21:55,000 --> 00:21:56,000
and I'll paste it over here.

361
00:21:56,000 --> 00:21:57,000
Okay.

362
00:21:58,000 --> 00:22:00,000
So once I'm calculating this okay.

363
00:22:00,000 --> 00:22:02,000
Please try to understand over here.

364
00:22:02,000 --> 00:22:03,000
So.

365
00:22:06,000 --> 00:22:09,000
So I'll select this right now.

366
00:22:10,000 --> 00:22:10,000
Okay.

367
00:22:10,000 --> 00:22:12,000
So I'll keep it over here.

368
00:22:12,000 --> 00:22:16,000
And similarly this I will keep it here for my reference okay.

369
00:22:17,000 --> 00:22:20,000
So here when I am calculating with position one and I is equal to one.

370
00:22:20,000 --> 00:22:22,000
So if I really want to use the cosine value.

371
00:22:22,000 --> 00:22:26,000
So this i that you will be specifically getting, right.

372
00:22:26,000 --> 00:22:34,000
So here I will write position value one okay one divided by 10,000 okay.

373
00:22:34,000 --> 00:22:36,000
To the power of 2i by d model.

374
00:22:36,000 --> 00:22:36,000
Right.

375
00:22:36,000 --> 00:22:37,000
So 2i is nothing.

376
00:22:37,000 --> 00:22:42,000
But over here the cosine at index one shares its exponent with the sine at index zero, so it is two multiplied by zero divided by four.

377
00:22:42,000 --> 00:22:46,000
And this is whatever approximate value we will be getting.

378
00:22:46,000 --> 00:22:48,000
It's 0.5403.

379
00:22:48,000 --> 00:22:52,000
Similarly, you can go ahead and calculate the positional encoding for one comma two.

380
00:22:52,000 --> 00:22:53,000
Again we have to use the sine formula.

381
00:22:54,000 --> 00:22:59,000
And with respect to the sine formula just go ahead and write over here one by 10,000 to the power of two by four.

382
00:22:59,000 --> 00:23:02,000
This will be approximately again equal to 0.01.

383
00:23:02,000 --> 00:23:06,000
And again once we go ahead and compute it for one comma three.

384
00:23:06,000 --> 00:23:08,000
again, I have to use my cos value.

385
00:23:08,000 --> 00:23:10,000
Just try to do the computation.

386
00:23:10,000 --> 00:23:11,000
Okay I'll leave it up to you.

387
00:23:11,000 --> 00:23:17,000
So here I will be getting my value as approximately 0.9999, okay.
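
A minimal sketch of the calculation above, assuming the sinusoidal formula from the research paper with a model dimension of four. The function name `positional_encoding` and its parameters are only illustrative. In the paper's formula, dimensions 0 and 1 share the pair index i = 0 and dimensions 2 and 3 share i = 1, which is why index one comes out as cos(1) and index three as cos(1/100).

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding of one position as a list of d_model floats."""
    encoding = []
    for k in range(d_model):
        i = k // 2  # pair index: dimensions 2i and 2i+1 use the same angle
        angle = pos / (base ** (2 * i / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

pe_the = positional_encoding(0, 4)  # position zero, the word "the"
pe_cat = positional_encoding(1, 4)  # position one, the word "cat"
print([round(v, 5) for v in pe_the])
print([round(v, 5) for v in pe_cat])
```

For position one this gives roughly 0.8415, 0.5403, 0.01 and 0.99995, and for position zero it gives 0, 1, 0, 1.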

388
00:23:17,000 --> 00:23:24,000
So in short here what we have actually done is that, the word that we have specifically taken.

389
00:23:24,000 --> 00:23:24,000
Right.

390
00:23:24,000 --> 00:23:25,000
Cat.

391
00:23:25,000 --> 00:23:29,000
First of all, I will just go ahead and write with respect to the word "the".

392
00:23:29,000 --> 00:23:35,000
So here you can see, for the word "the", which was given by the vectors.

393
00:23:37,000 --> 00:23:37,000
Right.

394
00:23:38,000 --> 00:23:43,000
The vector for "the" that we have actually used is 0.1, 0.2, 0.3, 0.4.

395
00:23:44,000 --> 00:23:50,000
So here if I go ahead and write, this is 0.1, 0.2, 0.3, 0.4.

396
00:23:50,000 --> 00:23:56,000
For this we have converted this into some other encoding.

397
00:23:58,000 --> 00:24:01,000
And here the positional encoding is nothing.

398
00:24:01,000 --> 00:24:05,000
But this value, that is 0, 1, 0, 1, okay.

399
00:24:05,000 --> 00:24:08,000
So here we have specifically got as zero one and zero one.

400
00:24:08,000 --> 00:24:15,000
So this is the positional encoding that we are able to find out for this particular vector.

401
00:24:15,000 --> 00:24:15,000
Right.

402
00:24:16,000 --> 00:24:17,000
For this vector.

403
00:24:17,000 --> 00:24:19,000
Similarly for the word cat.

404
00:24:20,000 --> 00:24:24,000
For the word cat, initially we had this vector.

405
00:24:24,000 --> 00:24:31,000
And before applying the positional encoding, you'll be able to see it was 0.5, 0.6, 0.7, 0.8.

406
00:24:31,000 --> 00:24:32,000
Right.

407
00:24:32,000 --> 00:24:37,000
So here I will go ahead and write 0.5, 0.6, 0.7, 0.8.

408
00:24:38,000 --> 00:24:46,000
For this we were able to get an encoding from the values that we got over here, and the

409
00:24:46,000 --> 00:24:51,000
values were nothing but all these values: 0.8415, or 0.84.

410
00:24:51,000 --> 00:24:59,000
Then you got 0.54 and then 0.01 and 0.9999.

411
00:24:59,000 --> 00:25:00,000
Okay.

412
00:25:00,000 --> 00:25:04,000
Now this is how you have actually got the positional encoding.

413
00:25:04,000 --> 00:25:10,000
And through this way what you are actually doing is that we have to probably sum up these values, right?

414
00:25:10,000 --> 00:25:16,000
So finally, you'll be able to see whatever self-attention I will be giving right over here, whatever

415
00:25:16,000 --> 00:25:19,000
vectors I will be giving in this self-attention layer, okay.

416
00:25:20,000 --> 00:25:29,000
It will be the combination of your vectors that we are specifically getting with the positional encoding.

417
00:25:29,000 --> 00:25:31,000
That is nothing but 0.5.

418
00:25:31,000 --> 00:25:37,000
Let's say, sorry, 0.5, 0.6, but instead I will write 0.84.

419
00:25:37,000 --> 00:25:44,000
Let me just go ahead and remove this and let's use a positional encoding here.

420
00:25:44,000 --> 00:25:50,000
It will be four dimensions, again based on the problem statement that we have done; in the research

421
00:25:50,000 --> 00:25:52,000
paper it is taken as 512 dimensions.

422
00:25:52,000 --> 00:25:52,000
Okay.

423
00:25:52,000 --> 00:25:54,000
You can go ahead and check out the research paper.

424
00:25:54,000 --> 00:25:54,000
So

425
00:25:54,000 --> 00:26:01,000
0.84, 0.54, 0.01, 0.9999

426
00:26:01,000 --> 00:26:01,000
okay.

427
00:26:01,000 --> 00:26:08,000
And this we have to sum up with our vectors that we have actually calculated.

428
00:26:08,000 --> 00:26:09,000
Right.

429
00:26:09,000 --> 00:26:14,000
So for the word cat, this will be the vector that gets added: plus 0.5, plus 0.6, plus

430
00:26:14,000 --> 00:26:16,000
0.7, plus 0.8.

431
00:26:16,000 --> 00:26:24,000
Similarly for the other vector that we have, right, again the ordering can be placed over here.

432
00:26:24,000 --> 00:26:28,000
Also, it can be placed over here, but we are handling it this way because all the words will

433
00:26:28,000 --> 00:26:29,000
be going parallelly.

434
00:26:29,000 --> 00:26:35,000
Right now, this positional encoding is making sure that whatever values we are putting, it knows the

435
00:26:35,000 --> 00:26:36,000
order now, right?

436
00:26:36,000 --> 00:26:41,000
So for this, let's say the value was 0.1, 0.2, 0.3, 0.4.

437
00:26:41,000 --> 00:26:46,000
So for this you are actually getting nothing but 0, 1, 0, 1.

438
00:26:46,000 --> 00:26:51,000
And this is also going to get summed up. After the summing up happens, this entire value

439
00:26:51,000 --> 00:26:53,000
is going to go to the self-attention.
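
A small sketch of the summing step described above, using the word vectors and positional encodings from this example. The names `word_vectors`, `positional` and `add_position` are only illustrative, and the encoding values are rounded.

```python
# Word vectors and positional encodings from the example in this video.
word_vectors = {"the": [0.1, 0.2, 0.3, 0.4], "cat": [0.5, 0.6, 0.7, 0.8]}
positional = {
    "the": [0.0, 1.0, 0.0, 1.0],            # position zero
    "cat": [0.8415, 0.5403, 0.01, 0.99995],  # position one (rounded)
}

def add_position(word_vec, pos_vec):
    """Element-wise sum of a word vector and its positional encoding."""
    return [w + p for w, p in zip(word_vec, pos_vec)]

# These summed vectors are what would be passed to the self-attention layer.
attention_inputs = {
    word: add_position(word_vectors[word], positional[word])
    for word in word_vectors
}
for word, vec in attention_inputs.items():
    print(word, [round(v, 4) for v in vec])
```

Because the position is already added into each vector, the vectors can be processed in parallel and the order information still travels with them.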

440
00:26:53,000 --> 00:26:57,000
And finally we are going to give this entirely to our feedforward neural network.

441
00:26:57,000 --> 00:26:59,000
And here we are going to get our z one.

442
00:26:59,000 --> 00:27:03,000
Here we are going to get our z two considering the multi-head attention is applied.

443
00:27:03,000 --> 00:27:12,000
But these values that we have actually got.

444
00:27:12,000 --> 00:27:12,000
Sorry.

445
00:27:12,000 --> 00:27:14,000
Here I just need to reverse this okay.

446
00:27:15,000 --> 00:27:20,000
So this needs to go down and this needs to come up okay.

447
00:27:20,000 --> 00:27:23,000
Because this is the positional encoding okay.

448
00:27:26,000 --> 00:27:28,000
So these were my word vectors.

449
00:27:28,000 --> 00:27:34,000
So if you probably go ahead and see with respect to the cat this is the vector that is getting created.

450
00:27:34,000 --> 00:27:37,000
This is with respect to "the".

451
00:27:37,000 --> 00:27:39,000
This is my positional encoding.

452
00:27:41,000 --> 00:27:45,000
Positional encoding is making sure we know which word has gone in which order.

453
00:27:45,000 --> 00:27:50,000
But in this case, you know, "the" has to come here and "cat" has to go over there.

454
00:27:50,000 --> 00:27:51,000
Right.

455
00:27:51,000 --> 00:27:54,000
And similarly over here, this is my positional encoding.

456
00:27:55,000 --> 00:27:56,000
Please don't get confused.

457
00:27:56,000 --> 00:27:58,000
I've just changed the order here itself.

458
00:27:58,000 --> 00:27:59,000
Okay, purposely.

459
00:27:59,000 --> 00:28:04,000
But it won't matter a lot because anyhow, I'm giving the positional encoding information so as to

460
00:28:04,000 --> 00:28:06,000
make sure which word is coming after what.

461
00:28:06,000 --> 00:28:07,000
Right.

462
00:28:07,000 --> 00:28:09,000
And then we finally give it to the self-attention.

463
00:28:09,000 --> 00:28:15,000
And here if multi-head attention is basically there, multi-head attention is there, you know what

464
00:28:15,000 --> 00:28:20,000
is exactly happening in the self-attention, then all these multi-head attention outputs will be given to the neural

465
00:28:20,000 --> 00:28:21,000
network.

466
00:28:21,000 --> 00:28:22,000
Okay.

467
00:28:22,000 --> 00:28:22,000
okay.

468
00:28:23,000 --> 00:28:24,000
That is feed forward neural network.

469
00:28:24,000 --> 00:28:30,000
And finally we get this that is z one and z two which will be our contextual vector.

470
00:28:30,000 --> 00:28:34,000
So this was the overall idea about positional encoding.

471
00:28:34,000 --> 00:28:42,000
But in short, if you really want to know about positional encoding, it's more about, you know,

472
00:28:42,000 --> 00:28:46,000
it's more about making sure that the order of the words is taken into account.

473
00:28:46,000 --> 00:28:51,000
The representation of the order of the sequence is given to the self-attention layer in some way.

474
00:28:51,000 --> 00:28:51,000
Right.

475
00:28:51,000 --> 00:28:53,000
We need to make them know.

476
00:28:53,000 --> 00:28:54,000
Okay, this is the first word.

477
00:28:54,000 --> 00:28:56,000
This is the second word in the sentence.

478
00:28:56,000 --> 00:29:01,000
Obviously this problem will not happen in an RNN because in an RNN we give one word at a time with respect

479
00:29:01,000 --> 00:29:02,000
to time steps, right.

480
00:29:02,000 --> 00:29:06,000
So this was one of the most efficient techniques that we used.

481
00:29:06,000 --> 00:29:09,000
One more technique is something called learned positional encoding.

482
00:29:09,000 --> 00:29:14,000
Now in the case of learned positional encoding, what we do is that, in this case,

483
00:29:15,000 --> 00:29:17,000
Positional encodings.

484
00:29:18,000 --> 00:29:24,000
Positional encodings are learned during training, okay.

485
00:29:27,000 --> 00:29:28,000
During training.

486
00:29:29,000 --> 00:29:31,000
Now when we say it is learned during training,

487
00:29:31,000 --> 00:29:36,000
That basically means we again we need to create a positional encoding matrix over here, and it needs

488
00:29:36,000 --> 00:29:38,000
to get updated through backpropagation.
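
A rough sketch of the learned variant, assuming a plain table with one trainable row per position and a simple gradient descent update. The class `LearnedPositionalEncoding`, its methods, and the gradient values are hypothetical and only for illustration; a real model would get the gradient from backpropagation through the network.

```python
import random

class LearnedPositionalEncoding:
    """One trainable vector per position, stored as rows of a matrix."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        # Small random starting values; training is what makes them useful.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(d_model)]
            for _ in range(max_len)
        ]

    def add_to(self, word_vec, pos):
        """Sum a word vector with the learned row for its position."""
        return [w + p for w, p in zip(word_vec, self.table[pos])]

    def sgd_step(self, pos, grad, lr=0.1):
        # The input is word_vec + table[pos], so the gradient with respect
        # to table[pos] equals the gradient with respect to that input.
        self.table[pos] = [p - lr * g for p, g in zip(self.table[pos], grad)]

pe = LearnedPositionalEncoding(max_len=2, d_model=4)
before = list(pe.table[1])
pe.sgd_step(pos=1, grad=[1.0, 0.0, -1.0, 0.0])  # made-up gradient
after = pe.table[1]
print(before)
print(after)
```

Only the row for the position that was used gets updated here; the row for position zero stays as it was.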

489
00:29:38,000 --> 00:29:43,000
Again, I don't want to discuss more about this because I think in the research paper also it is

490
00:29:43,000 --> 00:29:45,000
mentioned about sinusoidal positional encoding.

491
00:29:45,000 --> 00:29:45,000
Right.

492
00:29:45,000 --> 00:29:51,000
And this is nothing but a combination of sin and cos, where we keep on

493
00:29:51,000 --> 00:29:58,000
creating this such that, with respect to our positions, no two positions should get the same

494
00:29:58,000 --> 00:29:59,000
value over here.

495
00:29:59,000 --> 00:30:00,000
With respect to the encoding.

496
00:30:00,000 --> 00:30:03,000
That is the reason we specifically use this.

497
00:30:03,000 --> 00:30:07,000
And this is how the entire calculation is basically done in order to calculate the positional encoding

498
00:30:07,000 --> 00:30:08,000
of a specific word.

499
00:30:08,000 --> 00:30:10,000
So I hope you like this particular video.

500
00:30:10,000 --> 00:30:12,000
This was it about positional encoding.

501
00:30:13,000 --> 00:30:18,000
In the next video we are going to discuss something called layer normalization.

502
00:30:18,000 --> 00:30:21,000
And that is what we are going to discuss in the next video.

503
00:30:21,000 --> 00:30:23,000
So yeah, I will see you all in the next video.

504
00:30:23,000 --> 00:30:23,000
Thank you.

505
00:30:23,000 --> 00:30:24,000
Take care.

