1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to self-attention.

3
00:00:04,000 --> 00:00:09,000
We have already seen the basic architecture of the transformer.

4
00:00:09,000 --> 00:00:16,000
Now we are going to deep dive into the self-attention layer and we'll try to understand how exactly

5
00:00:16,000 --> 00:00:17,000
this works.

6
00:00:17,000 --> 00:00:23,000
In this session and the upcoming sessions, we are going to discuss self-attention in a lot of detail,

7
00:00:23,000 --> 00:00:28,000
where we'll also work through how this entire self-attention layer actually works.

8
00:00:28,000 --> 00:00:31,000
We'll cover all the mathematical intuition behind it.

9
00:00:31,000 --> 00:00:37,000
Through this, I am sure that you'll get to understand

10
00:00:37,000 --> 00:00:40,000
how the contextual embedding is specifically created.

11
00:00:40,000 --> 00:00:41,000
Okay.

12
00:00:41,000 --> 00:00:46,000
So again, what exactly is the idea behind self-attention?

13
00:00:46,000 --> 00:00:51,000
So first of all, let's say I have the self-attention layer here, okay.

14
00:00:52,000 --> 00:00:58,000
Now with respect to this self-attention layer, whenever we give any vectors.

15
00:01:00,000 --> 00:01:07,000
Let's say the words are 'the cat sat', right.

16
00:01:07,000 --> 00:01:10,000
Let's say if this is my sentence that I'm actually giving.

17
00:01:10,000 --> 00:01:11,000
Okay.

18
00:01:11,000 --> 00:01:17,000
Now, with respect to this particular sentence, when I am giving this specific

19
00:01:17,000 --> 00:01:22,000
sentence, the words that I'm giving, one by one,

20
00:01:22,000 --> 00:01:26,000
will first of all get converted into vectors.

21
00:01:26,000 --> 00:01:30,000
So let's say I'm converting this into vectors.

22
00:01:33,000 --> 00:01:34,000
Vectors.

23
00:01:34,000 --> 00:01:37,000
And here also we will convert this into vectors.

24
00:01:37,000 --> 00:01:49,000
Let's say that right now the default vector for 'the' is [1, 0, 0], for 'cat' is [0, 1, 0], and for 'sat' is [0, 0, 1].

25
00:01:49,000 --> 00:01:49,000
Okay.

26
00:01:50,000 --> 00:01:59,000
Now as soon as we pass through this embedding layer or sorry, not embedding layer, it's the self-attention

27
00:01:59,000 --> 00:01:59,000
layer.

28
00:02:04,000 --> 00:02:05,000
Okay.

29
00:02:05,000 --> 00:02:12,000
As soon as we pass through the self-attention layer, then from here, we should be getting the output.

30
00:02:15,000 --> 00:02:24,000
Now, the output that we should be getting over here will be another

31
00:02:24,000 --> 00:02:25,000
vector.

32
00:02:26,000 --> 00:02:28,000
And this vector that we specifically get.

33
00:02:30,000 --> 00:02:33,000
this is our contextual embedding.

34
00:02:33,000 --> 00:02:35,000
And why do we call this a contextual embedding?

35
00:02:36,000 --> 00:02:37,000
Because.

36
00:02:41,000 --> 00:02:48,000
It is a contextual embedding because here we are going to take into account the importance of the different tokens

37
00:02:48,000 --> 00:02:50,000
in the input sequence.

38
00:02:50,000 --> 00:02:52,000
And we are going to create this particular vector.

39
00:02:52,000 --> 00:03:03,000
So if I just consider this as an example, let's say this vector gets converted to [0.8, 0.3, 0.4].

40
00:03:05,000 --> 00:03:15,000
Then this vector gets converted to, let's say, [0.4, 0.6, 0.2].

41
00:03:15,000 --> 00:03:16,000
Okay, I'm just saying as an example.

42
00:03:17,000 --> 00:03:23,000
And this vector gets converted to [0.3, 0.4, 0.8].

43
00:03:24,000 --> 00:03:28,000
Now, what is the importance of getting these specific vectors?

44
00:03:28,000 --> 00:03:37,000
You can simply consider that here the self-attention layer,

45
00:03:37,000 --> 00:03:43,000
what it is basically doing is that it is also taking into account the importance of the other tokens, or other

46
00:03:43,000 --> 00:03:49,000
words, when it creates these vectors.

47
00:03:49,000 --> 00:03:49,000
Okay.

48
00:03:49,000 --> 00:03:52,000
So first of all, let's see this entire definition.

49
00:03:52,000 --> 00:03:57,000
So here it says: self-attention, also known as scaled dot-product

50
00:03:57,000 --> 00:04:03,000
attention, is a crucial mechanism in the transformer architecture that allows the model to weigh the importance

51
00:04:03,000 --> 00:04:06,000
of different tokens in the input sequence relative to each other.

52
00:04:07,000 --> 00:04:11,000
So here we are not just going to keep our embedding vectors fixed.

53
00:04:11,000 --> 00:04:18,000
Instead, we are going to change all these embedding vectors into contextual embedding vectors based

54
00:04:18,000 --> 00:04:20,000
on the importance of all the other words.

55
00:04:20,000 --> 00:04:21,000
Right.

56
00:04:21,000 --> 00:04:23,000
And this is what we are doing.

57
00:04:23,000 --> 00:04:24,000
And why do we do this.

58
00:04:24,000 --> 00:04:29,000
Because, see, consider different sentences and different paragraphs.

59
00:04:29,000 --> 00:04:29,000
Right.

60
00:04:30,000 --> 00:04:33,000
All the paragraphs, all the sentences will have different context.

61
00:04:33,000 --> 00:04:35,000
There may be importance attached to some other word.

62
00:04:35,000 --> 00:04:38,000
One word may be dependent on another word.

63
00:04:38,000 --> 00:04:39,000
Right.

64
00:04:39,000 --> 00:04:45,000
And because of this, we will be able to efficiently create applications

65
00:04:45,000 --> 00:04:49,000
like language translation, right.

66
00:04:49,000 --> 00:04:51,000
Language translation and.

67
00:04:51,000 --> 00:04:54,000
Language translation or text summarization.

68
00:04:54,000 --> 00:04:58,000
These kinds of use cases we will be able to build.

69
00:05:00,000 --> 00:05:02,000
Now, how do we convert this to this?

70
00:05:02,000 --> 00:05:03,000
Right.

71
00:05:03,000 --> 00:05:09,000
The entire self-attention works on the concept of scaled dot-product attention.

72
00:05:09,000 --> 00:05:09,000
Right.

73
00:05:09,000 --> 00:05:12,000
So this is the entire operation that it specifically takes.
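Written out, the operation being pointed to here is most likely the standard scaled dot-product attention formula. The symbol d_k, the dimension of the key vectors, is an assumption, since it is not named at this point:
```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```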

74
00:05:13,000 --> 00:05:18,000
Here, one by one, we will be discussing what these Q, K, and V are.

75
00:05:18,000 --> 00:05:18,000
Right.

76
00:05:18,000 --> 00:05:19,000
These are some vectors.

77
00:05:19,000 --> 00:05:21,000
Why do we use these specific vectors?

78
00:05:21,000 --> 00:05:25,000
But at the end of the day, this is the entire operation that we are doing.

79
00:05:25,000 --> 00:05:30,000
Now what we are going to do is see how the self-attention layer is going to convert

80
00:05:30,000 --> 00:05:34,000
these vectors to these vectors, right, which are the contextual embeddings.

81
00:05:35,000 --> 00:05:41,000
But here, just to give an idea, these vectors have been changed by considering the importance

82
00:05:41,000 --> 00:05:41,000
of different tokens.

83
00:05:41,000 --> 00:05:42,000
Okay.

84
00:05:42,000 --> 00:05:46,000
Now let's go ahead and let's talk about self-attention in detail.

85
00:05:46,000 --> 00:05:52,000
So let's go ahead and discuss self-attention in much more detail.

86
00:05:52,000 --> 00:05:52,000
Right.

87
00:05:52,000 --> 00:05:58,000
Over here, our main aim is to convert these vectors into contextual embedding vectors.

88
00:05:58,000 --> 00:05:59,000
Right?

89
00:05:59,000 --> 00:06:06,000
Now, see one thing: to convert a particular word into a vector, into a dense vector like this,

90
00:06:06,000 --> 00:06:10,000
for this particular scenario we can use embedding layers, right?

91
00:06:10,000 --> 00:06:14,000
So in embedding layers what happens based on the dimensions.

92
00:06:14,000 --> 00:06:15,000
Right.

93
00:06:15,000 --> 00:06:20,000
We will be able to get a fixed vector for each and every word.

94
00:06:21,000 --> 00:06:21,000
Right.

95
00:06:21,000 --> 00:06:23,000
Like for 'the cat sat', right.

96
00:06:23,000 --> 00:06:25,000
We can actually get the fixed vector.

97
00:06:26,000 --> 00:06:34,000
But getting this contextual embedding vector based on the context of other words, it depends on the

98
00:06:34,000 --> 00:06:39,000
entire sentence, or it depends on the data set that we have.

99
00:06:39,000 --> 00:06:39,000
Right?

100
00:06:40,000 --> 00:06:48,000
So how, with the help of this sentence or data set, will we be able to convert from this vector to this

101
00:06:48,000 --> 00:06:49,000
vector.

102
00:06:49,000 --> 00:06:49,000
Right.

103
00:06:49,000 --> 00:06:54,000
So for this there are multiple steps that we really need to follow okay.

104
00:06:54,000 --> 00:06:57,000
So let's talk about this specific step.

105
00:06:57,000 --> 00:07:02,000
And that is where this entire scaled dot-product attention will come into the picture.

106
00:07:02,000 --> 00:07:10,000
So the first important thing that we really need to do is that we need to create or we need to derive

107
00:07:10,000 --> 00:07:12,000
three important vectors.

108
00:07:12,000 --> 00:07:15,000
So I will first of all talk about these inputs.

109
00:07:16,000 --> 00:07:20,000
So in the first step we need to derive our three important vectors.

110
00:07:20,000 --> 00:07:22,000
One is queries.

111
00:07:22,000 --> 00:07:25,000
Then I have keys.

112
00:07:25,000 --> 00:07:29,000
And then we have something called values.

113
00:07:29,000 --> 00:07:29,000
Okay.

114
00:07:30,000 --> 00:07:33,000
So queries keys and values.

115
00:07:33,000 --> 00:07:37,000
Right. Now, what is the importance of these queries,

116
00:07:37,000 --> 00:07:42,000
keys, and values? That we will try to understand, okay.

117
00:07:44,000 --> 00:07:53,000
For every token that we are trying to convert into a contextual vector, we specifically

118
00:07:53,000 --> 00:07:54,000
create a model.

119
00:07:54,000 --> 00:07:55,000
Right.

120
00:07:55,000 --> 00:08:01,000
And this model will be computing these three important vectors.

121
00:08:01,000 --> 00:08:10,000
One is queries, one is keys and one is values vectors.

122
00:08:10,000 --> 00:08:17,000
Okay guys, before we go ahead, let's understand the basic definition of these

123
00:08:17,000 --> 00:08:19,000
query, key, and value vectors.

124
00:08:19,000 --> 00:08:21,000
And what is the importance of this okay.

125
00:08:21,000 --> 00:08:25,000
So first we will understand the query vector.

126
00:08:25,000 --> 00:08:27,000
Now what exactly is a query vector?

127
00:08:27,000 --> 00:08:32,000
A query vector represents the token for which we are calculating the attention.

128
00:08:32,000 --> 00:08:35,000
Let's say the first word that I'm passing is 'the'.

129
00:08:35,000 --> 00:08:42,000
If I want to calculate the attention for this, which gives the output vector,

130
00:08:42,000 --> 00:08:42,000
right.

131
00:08:42,000 --> 00:08:45,000
Which we also call the contextual embedding vector.

132
00:08:45,000 --> 00:08:45,000
Right.

133
00:08:45,000 --> 00:08:51,000
So in order to calculate this, we will use this token's vector as our query vector, okay.

134
00:08:51,000 --> 00:08:57,000
Queries help to determine the importance of the other tokens in the context of the current token, right?

135
00:08:57,000 --> 00:09:03,000
So for this particular token, the importance of the other tokens will also shape the final vector

136
00:09:03,000 --> 00:09:04,000
that we will be able to get.

137
00:09:04,000 --> 00:09:06,000
Now what is the importance over here.

138
00:09:06,000 --> 00:09:07,000
Focus determination.

139
00:09:07,000 --> 00:09:09,000
This is the first importance.

140
00:09:09,000 --> 00:09:15,000
Queries help the model decide which part of the sequence to focus on for each specific token, because

141
00:09:15,000 --> 00:09:18,000
there will be multiple tokens in a sentence.

142
00:09:18,000 --> 00:09:25,000
By calculating the dot product between a query vector and all key vectors, the model assesses how much

143
00:09:25,000 --> 00:09:30,000
attention to give to each token relative to the current token, right?

144
00:09:30,000 --> 00:09:37,000
So what we do is that we take this query vector and, let's not say multiply, we do a dot product.

145
00:09:37,000 --> 00:09:40,000
We do a dot product of the query vector with the key vectors.

146
00:09:40,000 --> 00:09:47,000
And this helps us to determine how much attention to give to each token relative to the current token.
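A minimal sketch of that query-against-keys dot product, assuming NumPy and made-up four-dimensional vectors for 'the', 'cat', and 'sat' (the names keys and query_the are illustrative only):
```python
import numpy as np
# Hypothetical 4-dimensional key vectors, one row per token (illustrative values only).
keys = np.array([[1.0, 0.0, 1.0, 0.0],   # key for "the"
                 [0.0, 1.0, 0.0, 1.0],   # key for "cat"
                 [1.0, 1.0, 1.0, 1.0]])  # key for "sat"
# Hypothetical query vector for the current token "the".
query_the = np.array([1.0, 0.0, 1.0, 0.0])
# One dot product per key: a raw score for every token relative to "the".
scores = keys @ query_the
print(scores)  # [2. 0. 2.]
```
A larger score means that token's key is more aligned with the query of the current token.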

147
00:09:47,000 --> 00:09:47,000
Right.

148
00:09:47,000 --> 00:09:51,000
So if I have 'the', if I have 'cat', right,

149
00:09:51,000 --> 00:09:54,000
if I have 'sat', let's say these are my three words.

150
00:09:54,000 --> 00:09:57,000
So this initially becomes my query vector, for this particular token,

151
00:09:57,000 --> 00:10:03,000
if I'm calculating the attention for it. And to get the final vector,

152
00:10:03,000 --> 00:10:07,000
when I take the dot product of this query vector with the key vectors, right,

153
00:10:07,000 --> 00:10:14,000
the model will be able to assess how much attention to give to all the other tokens that

154
00:10:14,000 --> 00:10:16,000
are there, relative to this particular token.

155
00:10:16,000 --> 00:10:17,000
Right.

156
00:10:17,000 --> 00:10:20,000
So that is what query vectors are about.

157
00:10:20,000 --> 00:10:21,000
Right?

158
00:10:21,000 --> 00:10:26,000
And as we see more examples, I will be working step by step through how the

159
00:10:26,000 --> 00:10:30,000
query vectors will be calculated and how the key vectors will be calculated.

160
00:10:30,000 --> 00:10:36,000
The second important thing is contextual understanding: queries contribute to understanding

161
00:10:36,000 --> 00:10:38,000
the relationship.

162
00:10:38,000 --> 00:10:44,000
This is very important: the relationship between the current token and the rest of the sequence.

163
00:10:44,000 --> 00:10:46,000
This is the most important thing.

164
00:10:46,000 --> 00:10:51,000
Query vectors actually contribute to understanding the relationship between the current token and the

165
00:10:51,000 --> 00:10:55,000
rest of the tokens, which is essential for capturing dependencies and context.

166
00:10:55,000 --> 00:10:56,000
Okay.

167
00:10:56,000 --> 00:10:59,000
Then coming to the key vectors, what exactly are key vectors?

168
00:10:59,000 --> 00:11:05,000
Key vectors represent all the tokens in the sequence and are used to compare with the query vector to

169
00:11:05,000 --> 00:11:07,000
calculate the attention scores.

170
00:11:08,000 --> 00:11:11,000
Now, what exactly are the key vectors?

171
00:11:11,000 --> 00:11:14,000
They cover all the tokens that are there in the sentence, right?

172
00:11:14,000 --> 00:11:21,000
And basically a query will be compared to these key vectors to calculate the attention score.

173
00:11:21,000 --> 00:11:25,000
What exactly the attention score is, and how to calculate it,

174
00:11:25,000 --> 00:11:26,000
we will also discuss.

175
00:11:26,000 --> 00:11:30,000
So let's talk about the importance of key vectors.

176
00:11:30,000 --> 00:11:36,000
Keys are compared with queries to measure the relevance or compatibility of each token with the current

177
00:11:36,000 --> 00:11:36,000
token.

178
00:11:36,000 --> 00:11:42,000
This comparison helps in determining how much attention each token should

179
00:11:42,000 --> 00:11:42,000
receive.

180
00:11:42,000 --> 00:11:43,000
Okay.

181
00:11:43,000 --> 00:11:44,000
Information retrieval.

182
00:11:44,000 --> 00:11:49,000
Keys also play a critical role in retrieving the most relevant information from the sequence by providing

183
00:11:49,000 --> 00:11:50,000
a basis for the attention

184
00:11:50,000 --> 00:11:53,000
mechanism to compute similarity scores.

185
00:11:53,000 --> 00:11:55,000
So what we do is that we take a query vector.

186
00:11:55,000 --> 00:11:59,000
We can directly query against these key vectors.

187
00:11:59,000 --> 00:12:05,000
And based on similarity scores we get the most relevant information based on the dependency of all the

188
00:12:05,000 --> 00:12:06,000
other words also.

189
00:12:06,000 --> 00:12:07,000
Okay.

190
00:12:07,000 --> 00:12:13,000
And finally, the value vectors. Value vectors hold the actual information that will be aggregated

191
00:12:13,000 --> 00:12:16,000
to form the output of the attention mechanism.

192
00:12:16,000 --> 00:12:16,000
Okay.

193
00:12:16,000 --> 00:12:21,000
So values contain the data that will be weighted by the attention scores. The weighted sum of the

194
00:12:21,000 --> 00:12:24,000
values forms the output of the self-attention mechanism.
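A rough sketch of that weighted sum, assuming NumPy, made-up scores and value vectors, and softmax normalization as in standard scaled dot-product attention (the scaling step is left out here to keep it short):
```python
import numpy as np
# Hypothetical raw attention scores of one query against three tokens (assumed values).
scores = np.array([2.0, 0.0, 2.0])
# Hypothetical value vectors, one row per token (illustrative values only).
values = np.array([[1.0, 0.0, 1.0, 0.0],   # value for "the"
                   [0.0, 1.0, 0.0, 1.0],   # value for "cat"
                   [1.0, 1.0, 1.0, 1.0]])  # value for "sat"
# Softmax turns the scores into positive weights that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()
# The output for this token is the weighted sum of the value vectors.
context = weights @ values
print(weights.round(3), context.round(3))
```
Tokens with higher scores contribute more of their value vector to the output.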

195
00:12:24,000 --> 00:12:27,000
We will discuss all these things as we go ahead.

196
00:12:27,000 --> 00:12:33,000
But I really wanted to give a basic definition of the query, key, and value vectors.

197
00:12:33,000 --> 00:12:33,000
Okay.

198
00:12:34,000 --> 00:12:37,000
Now let's take one specific example.

199
00:12:37,000 --> 00:12:42,000
And by that you will be able to understand how we are going to create these query, key, and value vectors,

200
00:12:42,000 --> 00:12:43,000
okay.

201
00:12:43,000 --> 00:12:45,000
So let's say I have an input sequence.

202
00:12:45,000 --> 00:12:49,000
So let's consider this is my input sequence.

203
00:12:49,000 --> 00:12:52,000
And the input sequence is nothing

204
00:12:52,000 --> 00:12:54,000
but the words 'the',

205
00:12:56,000 --> 00:13:00,000
'cat', 'sat'.

206
00:13:00,000 --> 00:13:07,000
Okay, so this is my input sequence right now.

207
00:13:07,000 --> 00:13:15,000
Let's say we are going to fix the embedding dimension for the embedding layer.

208
00:13:15,000 --> 00:13:19,000
So the embedding size I'm going to take as four.

209
00:13:19,000 --> 00:13:23,000
That basically means every word will be converted into a vector of four dimensions.

210
00:13:23,000 --> 00:13:23,000
Okay.

211
00:13:24,000 --> 00:13:33,000
And even for the Q, K, V vectors that we are going to compute, let's consider these also

212
00:13:33,000 --> 00:13:35,000
as four-dimensional, okay.

213
00:13:35,000 --> 00:13:38,000
So I'm just going to consider all of these as four-dimensional.

214
00:13:38,000 --> 00:13:43,000
So here we will see how we can generate these Q, K, V.

215
00:13:43,000 --> 00:13:45,000
We will try to see okay.

216
00:13:45,000 --> 00:13:51,000
So first of all, the first step over here is something called token embeddings, okay.

217
00:13:52,000 --> 00:13:57,000
Token embeddings basically means whatever sentence we have we will try to convert that into some kind

218
00:13:57,000 --> 00:13:58,000
of vectors.

219
00:13:58,000 --> 00:14:02,000
So let's say I will be writing E of 'the', the embedding of 'the'.

220
00:14:02,000 --> 00:14:03,000
Okay.

221
00:14:03,000 --> 00:14:08,000
Here there are three words, but I am considering four dimensions.

222
00:14:08,000 --> 00:14:12,000
So I will just go ahead and mention that 'the' is represented like this.

223
00:14:12,000 --> 00:14:15,000
Okay, then I will have E of 'cat'.

224
00:14:15,000 --> 00:14:18,000
Let's say cat is represented something like this.

225
00:14:19,000 --> 00:14:26,000
And then I have E of 'sat': one comma one comma one comma one.

226
00:14:26,000 --> 00:14:32,000
Okay, so this is how the tokens are represented for these words.

227
00:14:32,000 --> 00:14:34,000
And these are my fixed vectors okay.

228
00:14:34,000 --> 00:14:41,000
Now, before the second important step, see guys, there is one very simple thing that you really need to understand.

229
00:14:41,000 --> 00:14:46,000
If I want to convert one vector into a context vector okay.

230
00:14:46,000 --> 00:14:48,000
By using this self-attention.

231
00:14:48,000 --> 00:14:49,000
See this is my one vector, right.

232
00:14:49,000 --> 00:14:55,000
Let's say this is [1, 0, 0]; I need to convert this into [0.8, 0.2, 0.3].

233
00:14:55,000 --> 00:14:55,000
Right.

234
00:14:56,000 --> 00:15:01,000
This vector is basically getting created based on the context of all the other words also.

235
00:15:01,000 --> 00:15:01,000
Right.

236
00:15:02,000 --> 00:15:06,000
So there is a huge dependency on the entire sentence.

237
00:15:06,000 --> 00:15:07,000
Right.

238
00:15:07,000 --> 00:15:12,000
Because with different data sets there will be different sentences.

239
00:15:12,000 --> 00:15:15,000
And based on the different contexts and different dependencies,

240
00:15:15,000 --> 00:15:17,000
this vector will keep on changing.

241
00:15:17,000 --> 00:15:17,000
Right.

242
00:15:17,000 --> 00:15:24,000
So we should definitely come up with a model, and I will call this model self-attention,

243
00:15:25,000 --> 00:15:29,000
where I give one kind of input and I should be able to get my contextual embedding based

244
00:15:29,000 --> 00:15:30,000
on the sentence.

245
00:15:30,000 --> 00:15:37,000
This will have information about all the other words also, and that is the reason we are

246
00:15:37,000 --> 00:15:41,000
using these query vectors, key vectors, and value vectors.

247
00:15:41,000 --> 00:15:46,000
That is the reason for that, so that it has some information about the other words and sentences also.

248
00:15:46,000 --> 00:15:47,000
Okay.

249
00:15:47,000 --> 00:15:50,000
So now we have done the token embedding.

250
00:15:50,000 --> 00:15:53,000
So that is what we have done. Now we will try to create this entire model.

251
00:15:53,000 --> 00:15:56,000
And I will also show you how this model will be trained.

252
00:15:56,000 --> 00:15:59,000
The second step is something called linear transformation.

253
00:16:00,000 --> 00:16:03,000
Now, what does this linear transformation basically do?

254
00:16:04,000 --> 00:16:05,000
Okay.

255
00:16:06,000 --> 00:16:19,000
So here you will be able to see we will create query key value vectors by multiplying.

256
00:16:20,000 --> 00:16:24,000
See, I am telling you we have to create a model. When we create a model, right,

257
00:16:24,000 --> 00:16:31,000
the model definitely needs to have some kind of weights. If I just want to convert a word into

258
00:16:31,000 --> 00:16:34,000
a fixed vector, we can directly use word2vec or some kind of embedding layer.

259
00:16:34,000 --> 00:16:35,000
Right.

260
00:16:35,000 --> 00:16:39,000
But here our scenario is different.

261
00:16:39,000 --> 00:16:42,000
We need to create a contextual embedding based on the other words.

262
00:16:42,000 --> 00:16:44,000
And we need to do it at runtime.

263
00:16:44,000 --> 00:16:44,000
Right.

264
00:16:44,000 --> 00:16:50,000
So we have to train a model which will be able to take this particular fixed vector

265
00:16:50,000 --> 00:16:53,000
and convert it into a contextual embedding vector.

266
00:16:53,000 --> 00:16:54,000
Okay.

267
00:16:54,000 --> 00:16:54,000
okay.

268
00:16:54,000 --> 00:16:58,000
So for that we perform the second step, which is called linear transformation.

269
00:16:58,000 --> 00:17:08,000
Here we create queries, keys, and values by multiplying the

270
00:17:08,000 --> 00:17:17,000
embeddings by learned weight

271
00:17:17,000 --> 00:17:19,000
matrices.

272
00:17:19,000 --> 00:17:29,000
I will denote these matrices as W of Q, W of K, and W of V, okay.

273
00:17:29,000 --> 00:17:30,000
Weights of queries.

274
00:17:30,000 --> 00:17:33,000
Weights of keys and weights of values okay.

275
00:17:34,000 --> 00:17:36,000
Now what does this basically mean.

276
00:17:36,000 --> 00:17:38,000
So this is very simple.

277
00:17:38,000 --> 00:17:43,000
Let's say I have a vector; let's say this is for my word 'cat'.

278
00:17:44,000 --> 00:17:44,000
Okay.

279
00:17:45,000 --> 00:17:47,000
I will take this 'cat' vector.

280
00:17:47,000 --> 00:17:48,000
Let's say it is [1, 0, 0].

281
00:17:49,000 --> 00:17:54,000
I will probably take a weight.

282
00:17:57,000 --> 00:18:03,000
Let's say this is my weight w of Q right.

283
00:18:03,000 --> 00:18:12,000
If I take this vector and do a dot operation with this, I should be able

284
00:18:12,000 --> 00:18:13,000
to get my

285
00:18:16,000 --> 00:18:17,000
query vector.

286
00:18:17,000 --> 00:18:18,000
Okay.

287
00:18:19,000 --> 00:18:22,000
I should be able to get my query vector.

288
00:18:22,000 --> 00:18:25,000
So this will basically be my query vector.

289
00:18:25,000 --> 00:18:26,000
Okay.

290
00:18:26,000 --> 00:18:27,000
So this is one cross three.

291
00:18:27,000 --> 00:18:28,000
This is three cross three.

292
00:18:28,000 --> 00:18:31,000
If I do a dot product it will be nothing but one cross three okay.

293
00:18:31,000 --> 00:18:40,000
Similarly, if I consider another weight matrix, W of K, in order to compute the key vector,

294
00:18:40,000 --> 00:18:46,000
I will multiply by that weight matrix, or rather do a dot operation with that weight matrix, for

295
00:18:46,000 --> 00:18:50,000
this particular vector.

296
00:18:50,000 --> 00:18:55,000
And I should be able to get my key vector.

297
00:18:56,000 --> 00:18:58,000
Okay, so this basically becomes my key vector.

298
00:18:58,000 --> 00:19:08,000
And similarly, if I consider my other weight matrix, the one for values, W of V, I should

299
00:19:08,000 --> 00:19:12,000
be able to get my value vector.

300
00:19:12,000 --> 00:19:14,000
Okay, so this is what we are doing.

301
00:19:14,000 --> 00:19:18,000
So we create Q, K, and V by multiplying the embeddings.

302
00:19:18,000 --> 00:19:21,000
Each embedding goes through a dot product:

303
00:19:21,000 --> 00:19:25,000
a dot product with this particular weight matrix,

304
00:19:25,000 --> 00:19:28,000
then this matrix and this matrix, okay.

305
00:19:28,000 --> 00:19:31,000
And then we'll be able to get these Q, K, V.

306
00:19:31,000 --> 00:19:35,000
But understand here we have written learned weights.

307
00:19:35,000 --> 00:19:37,000
That basically means initially we'll go ahead and initialize the weights.

308
00:19:37,000 --> 00:19:42,000
And later on, with the help of backpropagation, these need to be learned to get the right kind

309
00:19:42,000 --> 00:19:46,000
of query, key, and value vectors.
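A minimal sketch of this linear transformation, assuming NumPy, a four-dimensional embedding, and randomly initialized weights (the names E, W_Q, W_K, W_V and the seed are illustrative assumptions, and no training loop is shown):
```python
import numpy as np
rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
d_model = 4
# Fixed token embeddings, one row per token ("the", "cat", "sat"); values are illustrative.
E = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])
# Randomly initialized weight matrices; in training, backpropagation would update them.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))
# Linear transformation: each row of Q, K, V belongs to one token.
Q, K, V = E @ W_Q, E @ W_K, E @ W_V
print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```
Stacking the embeddings as rows lets one matrix product compute the vectors for all tokens at once.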

310
00:19:46,000 --> 00:19:46,000
Okay.

311
00:19:46,000 --> 00:19:48,000
And that is what we are specifically doing.

312
00:19:48,000 --> 00:19:54,000
Now, since I want to show you one practical example, let's go ahead and do

313
00:19:54,000 --> 00:19:54,000
one thing.

314
00:19:54,000 --> 00:19:55,000
Okay.

315
00:19:55,000 --> 00:19:57,000
So here is my token embedding which I have actually considered.

316
00:19:57,000 --> 00:20:11,000
Now let's consider that I have initialized W of Q, W of K, and W of V equal

317
00:20:11,000 --> 00:20:12,000
to the identity matrix.

318
00:20:12,000 --> 00:20:14,000
So identity matrix.

319
00:20:14,000 --> 00:20:18,000
For example, if I'm taking three cross three, this will have all the diagonal elements as

320
00:20:18,000 --> 00:20:22,000
one and all the remaining as zeros, okay. For our four-dimensional embeddings it is four cross four.

321
00:20:22,000 --> 00:20:25,000
So this can definitely be an identity matrix.

322
00:20:25,000 --> 00:20:27,000
So let's take this identity matrix.

323
00:20:27,000 --> 00:20:35,000
Let's initialize it. Now, if I want to compute the query

324
00:20:35,000 --> 00:20:36,000
of 'the', right.

325
00:20:36,000 --> 00:20:40,000
So 'the' is nothing but, you'll be able to see,

326
00:20:40,000 --> 00:20:41,000
Uh, right.

327
00:20:41,000 --> 00:20:47,000
So in this particular scenario, if I just take this, 'the' is nothing but this [1, 0, 1, 0].

328
00:20:47,000 --> 00:20:48,000
Okay.

329
00:20:48,000 --> 00:20:51,000
So here I'm going to write [1, 0, 1, 0].

330
00:20:51,000 --> 00:20:59,000
If I do a dot operation with the identity matrix, ones on the diagonal and zeros elsewhere, right?

331
00:20:59,000 --> 00:21:06,000
So in this scenario, if I do this dot operation, then I know that with

332
00:21:06,000 --> 00:21:11,000
the identity matrix I'm going to get the same value. Similarly, the key of 'the', right.

333
00:21:11,000 --> 00:21:15,000
If I want to find out that vector, again I will get the same thing.

334
00:21:15,000 --> 00:21:19,000
Then the value of 'the': again, I'm going to get the same thing, right.

335
00:21:19,000 --> 00:21:24,000
So in this scenario, with respect to the query, key, and value,

336
00:21:24,000 --> 00:21:31,000
if I am initializing the weights as an identity matrix and multiplying by it, I'm going to

337
00:21:31,000 --> 00:21:32,000
get the same value.

338
00:21:32,000 --> 00:21:32,000
Okay.

339
00:21:32,000 --> 00:21:34,000
So this is the first step right?

340
00:21:34,000 --> 00:21:40,000
So this is my first thing, where I have computed

341
00:21:40,000 --> 00:21:49,000
Q of 'the', K of 'the', and V of 'the', right.

342
00:21:49,000 --> 00:21:52,000
Each will be nothing but [1, 0, 1, 0].

343
00:21:52,000 --> 00:21:54,000
And how have we computed this?

344
00:21:54,000 --> 00:21:55,000
Very simple.

345
00:21:55,000 --> 00:22:00,000
We have taken our embedding vectors and multiplied them by the identity matrix.

346
00:22:00,000 --> 00:22:00,000
Okay.

347
00:22:00,000 --> 00:22:02,000
Now we have done the dot operation.

348
00:22:02,000 --> 00:22:11,000
Now similarly for the other word, 'cat', we compute Q of 'cat', K of 'cat', and V

349
00:22:11,000 --> 00:22:12,000
of 'cat'.

350
00:22:13,000 --> 00:22:16,000
It is not compulsory that you need to initialize it as an identity matrix.

351
00:22:16,000 --> 00:22:17,000
You can initialize anything.

352
00:22:17,000 --> 00:22:20,000
But just for my computation, I've made it like this right now.

353
00:22:20,000 --> 00:22:22,000
Here you'll be able to see for Cat what exactly it is.

354
00:22:22,000 --> 00:22:24,000
For "cat" it is [0, 1, 0, 1].

355
00:22:24,000 --> 00:22:24,000
Okay.

356
00:22:24,000 --> 00:22:28,000
So here I'm going to go ahead and write [0, 1, 0, 1].

357
00:22:28,000 --> 00:22:28,000
Right.

358
00:22:28,000 --> 00:22:34,000
So this basically becomes my query, uh, sorry query vectors, key vectors and value vectors.

359
00:22:34,000 --> 00:22:35,000
Right.

360
00:22:35,000 --> 00:22:35,000
For cat.

361
00:22:35,000 --> 00:22:43,000
Similarly, the third one that we are going to compute is Q of "sat", K of "sat", and V of "sat".

362
00:22:43,000 --> 00:22:49,000
So in this case we'll be able to see that I'm going to get [1, 1, 1, 1], right.

363
00:22:49,000 --> 00:22:51,000
So all these vectors are there.
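
A minimal sketch of this step in plain Python, assuming the toy 4-dimensional embeddings from the walkthrough and identity weight matrices. The names `identity`, `project`, `W_q`, `W_k`, and `W_v` are made up for this illustration and are not from any library.

```python
# Toy 4-dimensional embeddings used in the walkthrough.
embeddings = {
    "the": [1, 0, 1, 0],
    "cat": [0, 1, 0, 1],
    "sat": [1, 1, 1, 1],
}


def identity(n):
    """Return an n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def project(vector, weights):
    """Multiply a row vector by a weight matrix (vector @ weights)."""
    return [
        sum(vector[i] * weights[i][j] for i in range(len(vector)))
        for j in range(len(weights[0]))
    ]


# Assumption for this example only: all three weight matrices start as identity.
# In a real model they are initialized differently and learned in training.
W_q, W_k, W_v = identity(4), identity(4), identity(4)

Q = {token: project(e, W_q) for token, e in embeddings.items()}
K = {token: project(e, W_k) for token, e in embeddings.items()}
V = {token: project(e, W_v) for token, e in embeddings.items()}

for token in embeddings:
    print(token, Q[token], K[token], V[token])
```

With identity weights, every query, key, and value vector simply equals the token embedding, which is why the same numbers keep showing up in the hand calculation.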

364
00:22:51,000 --> 00:22:51,000
Perfect.

365
00:22:52,000 --> 00:22:57,000
Now once we have actually done this this is the first step, right?

366
00:22:57,000 --> 00:22:59,000
Uh, the first step is basically to get the token embedding.

367
00:22:59,000 --> 00:23:04,000
Then we create Q, K, and V by multiplying the embeddings by learned weight matrices.

368
00:23:04,000 --> 00:23:06,000
Right now it is still not learned.

369
00:23:06,000 --> 00:23:10,000
We have initialized it, but with the help of backpropagation these weights will keep getting learned, okay.

370
00:23:10,000 --> 00:23:19,000
Now the third step that we are going to basically do is compute attention scores.

371
00:23:21,000 --> 00:23:24,000
We compute the attention scores.

372
00:23:24,000 --> 00:23:25,000
Okay.

373
00:23:25,000 --> 00:23:31,000
Now, as I told you, how do we compute the attention score?

374
00:23:31,000 --> 00:23:31,000
It is very simple.

375
00:23:31,000 --> 00:23:37,000
And over here we have told queries help the model to decide which part of the sequence to focus on for

376
00:23:37,000 --> 00:23:38,000
each token.

377
00:23:38,000 --> 00:23:44,000
By calculating the dot product between a query and all the key vectors, the model assesses how much

378
00:23:44,000 --> 00:23:45,000
attention to give to each token.

379
00:23:45,000 --> 00:23:51,000
For doing this, we need to do a dot product of q query vector and all key vectors, and that is how

380
00:23:51,000 --> 00:23:54,000
we calculate our attention scores.

381
00:23:54,000 --> 00:23:56,000
And why do we do that?

382
00:23:56,000 --> 00:23:58,000
The reason is very simple.

383
00:23:58,000 --> 00:24:00,000
See why do we do that.

384
00:24:00,000 --> 00:24:05,000
Because the model assesses how much attention to give to each token relative to the current token.

385
00:24:05,000 --> 00:24:10,000
So relative to this particular token, that is "the", how much importance do I need to give to the "cat" and

386
00:24:10,000 --> 00:24:11,000
"sat" tokens?

387
00:24:11,000 --> 00:24:11,000
Okay.

388
00:24:11,000 --> 00:24:14,000
And that is how we basically go ahead and compute our score okay.

389
00:24:14,000 --> 00:24:17,000
So for computing the score right.

390
00:24:17,000 --> 00:24:24,000
So let's say for the word "the", if I go ahead and compute the score, the score will be nothing but the query

391
00:24:24,000 --> 00:24:26,000
of "the" and the key of "the".

392
00:24:26,000 --> 00:24:29,000
So first of all we'll go ahead and compute this.

393
00:24:29,000 --> 00:24:32,000
So I know my query is nothing but [1, 0, 1, 0].

394
00:24:32,000 --> 00:24:40,000
And then if I want to multiply with K, right, K is also [1, 0, 1, 0].

395
00:24:40,000 --> 00:24:43,000
So I will take the transpose of this.

396
00:24:43,000 --> 00:24:43,000
Right.

397
00:24:44,000 --> 00:24:51,000
Because if I really want to do the dot operation, and my Q vector is like this, [1, 0, 1, 0], I will

398
00:24:51,000 --> 00:24:53,000
do the dot operation by taking the transpose of the key vector.

399
00:24:53,000 --> 00:24:57,000
Then, and only then, will I be able to do the dot operation.

400
00:24:57,000 --> 00:24:57,000
Right.

401
00:24:57,000 --> 00:25:00,000
So once I do this, I will be able to get the value as two right.

402
00:25:00,000 --> 00:25:05,000
So one multiplied by one, plus zero, plus one multiplied by one, plus zero, right.

403
00:25:05,000 --> 00:25:06,000
So finally you will be able to get two.

404
00:25:06,000 --> 00:25:14,000
So with respect to the query of "the" and the key of "the", when I do the dot operation I'm actually

405
00:25:14,000 --> 00:25:15,000
going to get the value of two.

406
00:25:15,000 --> 00:25:22,000
Similarly, if I go ahead and calculate the score with respect to the query of "the" and the key of "cat", right.

407
00:25:22,000 --> 00:25:27,000
If I do this particular dot operation, because we need to understand the context with respect to the

408
00:25:27,000 --> 00:25:27,000
other words also.

409
00:25:27,000 --> 00:25:28,000
Right.

410
00:25:28,000 --> 00:25:30,000
And here is what I'm actually going to get.

411
00:25:30,000 --> 00:25:32,000
So I'll be writing [1, 0, 1, 0].

412
00:25:32,000 --> 00:25:36,000
And I'm going to do a dot operation with the transpose of what? The key of "cat".

413
00:25:37,000 --> 00:25:39,000
See, what is the key of "cat"?

414
00:25:39,000 --> 00:25:43,000
It is nothing but [0, 1, 0, 1], and this will be the transposed value.

415
00:25:44,000 --> 00:25:49,000
Now, if I try to go ahead and find out, I will be getting the output as zero, right?

416
00:25:49,000 --> 00:25:58,000
Similarly, if I go ahead and calculate the score of the query of "the" with respect to the key of "sat",

417
00:25:58,000 --> 00:25:59,000
right, the key vector of "sat".

418
00:25:59,000 --> 00:26:06,000
When we do the dot operation of Q of "the" and the key vector of "sat", what are we going to get?

419
00:26:07,000 --> 00:26:09,000
Again I'll go and write [1, 0, 1, 0].

420
00:26:09,000 --> 00:26:11,000
And again, what is the key of "sat"?

421
00:26:11,000 --> 00:26:14,000
It is nothing but [1, 1, 1, 1].

422
00:26:14,000 --> 00:26:16,000
This will be the transposed value.

423
00:26:16,000 --> 00:26:19,000
So here also we are going to get the value as two right.

424
00:26:19,000 --> 00:26:22,000
So this is how we compute the score values right.

425
00:26:22,000 --> 00:26:23,000
And why we are doing this.

426
00:26:23,000 --> 00:26:25,000
Again why we are doing this.

427
00:26:25,000 --> 00:26:27,000
It is very simple over here.

428
00:26:28,000 --> 00:26:32,000
It is basically to assess how much attention to give each token relative to the current token.

429
00:26:32,000 --> 00:26:35,000
Okay, so we have got some kind of values over here.

430
00:26:35,000 --> 00:26:37,000
And this will actually help us to understand

431
00:26:37,000 --> 00:26:43,000
how much focus I really need to give to the other

432
00:26:43,000 --> 00:26:46,000
tokens, relative to the current token that I have considered.

433
00:26:46,000 --> 00:26:47,000
Right.

434
00:26:47,000 --> 00:26:49,000
So here I have actually got some score values.

435
00:26:49,000 --> 00:26:49,000
Right.

436
00:26:49,000 --> 00:26:51,000
2, 0, 2, okay.

437
00:26:51,000 --> 00:26:55,000
Similarly, we go ahead and compute the scores for the token "cat".

438
00:26:55,000 --> 00:26:59,000
Now for the token cat we need to do the same thing right.

439
00:27:00,000 --> 00:27:01,000
We need to do the same thing.

440
00:27:01,000 --> 00:27:03,000
That is go ahead and compute the score.

441
00:27:03,000 --> 00:27:10,000
Now the first score will be Q of "cat" with respect to K of "the", right.

442
00:27:10,000 --> 00:27:14,000
Now I have to probably consider whether there is a dependency, how much dependency I need to keep with

443
00:27:14,000 --> 00:27:16,000
respect to this.

444
00:27:16,000 --> 00:27:16,000
Right.

445
00:27:16,000 --> 00:27:19,000
So here you'll be able to see that I will be able to compute it.

446
00:27:19,000 --> 00:27:22,000
Now here I will have [0, 1, 0, 1].

447
00:27:22,000 --> 00:27:23,000
Right.

448
00:27:23,000 --> 00:27:25,000
And this time I'm going to do a dot operation with a transpose.

449
00:27:25,000 --> 00:27:27,000
With what? With the transpose of K of "the", [1, 0, 1, 0].

450
00:27:27,000 --> 00:27:30,000
Basically I need to transpose this and do a dot operation.

451
00:27:30,000 --> 00:27:31,000
Right.

452
00:27:31,000 --> 00:27:33,000
So here you'll be able to see if I do the computation.

453
00:27:33,000 --> 00:27:35,000
Um I'll be getting the value as zero.

454
00:27:35,000 --> 00:27:44,000
Similarly, if I go ahead and compute the score of Q of "cat" with K of "cat".

455
00:27:44,000 --> 00:27:45,000
Right.

456
00:27:46,000 --> 00:27:53,000
So here it will be nothing but [0, 1, 0, 1], in a dot operation with the transpose of [0, 1, 0, 1].

457
00:27:53,000 --> 00:27:54,000
Right.

458
00:27:54,000 --> 00:27:56,000
And here I'm actually going to get the value as two.

459
00:27:56,000 --> 00:28:03,000
Then again, if I go ahead and compute Q of "cat" with K of "sat", right.

460
00:28:03,000 --> 00:28:05,000
In short I'm just doing a dot operation right.

461
00:28:05,000 --> 00:28:11,000
So here I will be having [0, 1, 0, 1] dot the transpose of [1, 1, 1, 1].

462
00:28:11,000 --> 00:28:13,000
And again I'm going to get the value as two okay.

463
00:28:15,000 --> 00:28:15,000
Right.

464
00:28:15,000 --> 00:28:20,000
So just by seeing this you can understand that "sat" is important for "cat".

465
00:28:20,000 --> 00:28:26,000
So that is the reason why I'm getting equal values for this score, for "cat" with "cat" and

466
00:28:26,000 --> 00:28:26,000
"cat" with "sat".

467
00:28:26,000 --> 00:28:27,000
Right.

468
00:28:27,000 --> 00:28:29,000
This kind of dependency is definitely there.

469
00:28:29,000 --> 00:28:31,000
And yes, I've used an identity matrix here.

470
00:28:31,000 --> 00:28:35,000
But all these weight parameters will also get trained, right.

471
00:28:36,000 --> 00:28:37,000
Then similarly for the token.

472
00:28:38,000 --> 00:28:40,000
For the token "sat".

473
00:28:42,000 --> 00:28:45,000
For the token "sat", how do we go ahead and compute it?

474
00:28:45,000 --> 00:28:48,000
So we will go ahead and write the score.

475
00:28:48,000 --> 00:28:52,000
And here you'll be able to see Q of "sat".

476
00:28:53,000 --> 00:28:55,000
With K of "sat", right.

477
00:28:55,000 --> 00:29:01,000
Sorry, K of "the". First of all we need to do the dot operation of the query vector

478
00:29:01,000 --> 00:29:04,000
of "sat" with the key vector of "the", right.

479
00:29:04,000 --> 00:29:11,000
So here I will be getting [1, 1, 1, 1], which is my query vector of "sat", in a dot operation

480
00:29:11,000 --> 00:29:15,000
with [1, 0, 1, 0], which is my transposed key vector.

481
00:29:15,000 --> 00:29:17,000
And here I'm going to get two.

482
00:29:17,000 --> 00:29:26,000
Similarly, if I go ahead and compute this with Q of "sat" and K of "cat", I will be getting two over here.

483
00:29:26,000 --> 00:29:28,000
You can go ahead and do the computation.

484
00:29:28,000 --> 00:29:32,000
And then here, with Q of "sat" and K of "sat",

485
00:29:32,000 --> 00:29:34,000
I'm going to get four, right.
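
All nine scores can be checked with a few lines of plain Python. This is only a sketch, assuming the identity weights used in this example, so each query and key equals the token embedding; the helper name `dot` is made up for this illustration.

```python
tokens = ["the", "cat", "sat"]

# With identity weights, queries and keys are just the embeddings.
Q = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
K = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# scores[i][j] is the score of query token i against key token j.
scores = [[dot(q, k) for k in K] for q in Q]

for token, row in zip(tokens, scores):
    print(token, row)
```

Each row lists the scores of one query token against the keys of "the", "cat", and "sat", in that order.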

486
00:29:35,000 --> 00:29:39,000
So this is in turn the values that you will be able to see.

487
00:29:39,000 --> 00:29:40,000
Right.

488
00:29:40,000 --> 00:29:41,000
And this is my third step.

489
00:29:41,000 --> 00:29:47,000
So initial step what we do is that first of all we go ahead and compute.

490
00:29:47,000 --> 00:29:48,000
We take the token embeddings.

491
00:29:48,000 --> 00:29:55,000
Then we do a linear transformation and create the Q, K, and V vectors by multiplying by the learned

492
00:29:55,000 --> 00:29:59,000
weight matrices W of Q, W of K, and W of V.

493
00:29:59,000 --> 00:30:04,000
In this scenario, I've just taken these to be identity matrices, so let's go with that.

494
00:30:05,000 --> 00:30:09,000
Okay then I've shown you that.

495
00:30:09,000 --> 00:30:15,000
How do we probably go ahead and compute this right when we do this particular dot operation, dot operation,

496
00:30:15,000 --> 00:30:16,000
dot operation.

497
00:30:16,000 --> 00:30:19,000
So I'll be getting the same values, and this is what I finally get.

498
00:30:19,000 --> 00:30:21,000
Then we compute the attention scores.

499
00:30:21,000 --> 00:30:26,000
And this attention score has a specific purpose: it tells me how much focus I really need to give to

500
00:30:26,000 --> 00:30:27,000
the other tokens.

501
00:30:27,000 --> 00:30:28,000
Right.

502
00:30:28,000 --> 00:30:35,000
So here I get all my token score values, right. Then my next step after this is something called

503
00:30:35,000 --> 00:30:37,000
scaling.

504
00:30:38,000 --> 00:30:39,000
Okay.

505
00:30:39,000 --> 00:30:42,000
Now what exactly is scaling okay.

506
00:30:42,000 --> 00:30:46,000
See whatever scores that we are getting right.

507
00:30:46,000 --> 00:30:48,000
Whatever score that we are getting over here.

508
00:30:49,000 --> 00:30:50,000
Right.

509
00:30:50,000 --> 00:30:54,000
You'll be able to see that I'm getting some values like 2, 2, 4, right.

510
00:30:54,000 --> 00:30:56,000
0, 2, 2 and 2, 0, 2.

511
00:30:56,000 --> 00:30:57,000
Right.

512
00:30:57,000 --> 00:31:00,000
And you can clearly see that.

513
00:31:00,000 --> 00:31:04,000
Why does this scaling need to be done?

514
00:31:04,000 --> 00:31:07,000
Why are we specifically performing scaling?

515
00:31:07,000 --> 00:31:08,000
Okay.

516
00:31:08,000 --> 00:31:12,000
First of all I'll just make you understand what exactly scaling means, okay.

517
00:31:12,000 --> 00:31:19,000
So in scaling, what we do is that we take up these scores.

518
00:31:23,000 --> 00:31:25,000
And scale.

519
00:31:27,000 --> 00:31:33,000
Scale down by dividing.

520
00:31:35,000 --> 00:31:37,000
By dividing the scores.

521
00:31:40,000 --> 00:31:42,000
By dividing the scores.

522
00:31:44,000 --> 00:31:51,000
By the square root of dimensions of key vectors.

523
00:31:51,000 --> 00:31:52,000
Okay.

524
00:31:52,000 --> 00:31:58,000
So in our case our dimension of the key vector is four.

525
00:31:58,000 --> 00:32:03,000
So if I do square root of d of k it is basically going to become two.
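
As a rough sketch of the scaling step in plain Python, assuming the score values computed earlier in this example and a key dimension of four. The variable names are made up for this illustration.

```python
import math

# Raw attention scores from the walkthrough (rows: "the", "cat", "sat").
scores = [[2, 0, 2], [0, 2, 2], [2, 2, 4]]

d_k = 4                  # dimension of each key vector in this toy example
scale = math.sqrt(d_k)   # square root of d_k, which is 2.0 here

# Divide every score by the square root of d_k.
scaled_scores = [[s / scale for s in row] for row in scores]

for row in scaled_scores:
    print(row)
```

Every score is simply halved here, because the square root of four is two.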

526
00:32:03,000 --> 00:32:07,000
So guys, now let's understand why we really need to consider scaling, okay.

527
00:32:07,000 --> 00:32:09,000
So scaling.

528
00:32:10,000 --> 00:32:12,000
So let me just write it down.

529
00:32:12,000 --> 00:32:16,000
Scaling in the attention mechanism.

530
00:32:20,000 --> 00:32:30,000
In the attention mechanism is crucial to prevent.

531
00:32:33,000 --> 00:32:36,000
The dot product.

532
00:32:40,000 --> 00:32:46,000
Dot product from growing too large.

533
00:32:48,000 --> 00:32:50,000
From growing too large.

534
00:32:50,000 --> 00:32:52,000
Now when we say dot product.

535
00:32:52,000 --> 00:32:53,000
Dot product between what?

536
00:32:54,000 --> 00:33:00,000
The dot product between the query and key vectors that we have computed over here, right.

537
00:33:01,000 --> 00:33:02,000
Now.

538
00:33:03,000 --> 00:33:08,000
See, why do we need to prevent this dot product from growing too large?

539
00:33:08,000 --> 00:33:08,000
Okay.

540
00:33:08,000 --> 00:33:18,000
And in most of the research papers the reason given is to ensure stable gradients during training.

541
00:33:22,000 --> 00:33:25,000
So let's go ahead and understand this with an example okay.

542
00:33:26,000 --> 00:33:31,000
And if we don't scale, then what kind of problems will we be facing?

543
00:33:31,000 --> 00:33:38,000
As you know, d_k basically means the dimension of the key vectors, and if it

544
00:33:38,000 --> 00:33:45,000
keeps on getting larger, you'll be able to see that we will be facing two different kinds of problems.

545
00:33:45,000 --> 00:33:45,000
Okay.

546
00:33:45,000 --> 00:33:51,000
And as the dimension keeps on getting larger, if we don't try to scale with this value, and if the

547
00:33:51,000 --> 00:33:53,000
dimension is getting larger and larger, right.

548
00:33:53,000 --> 00:33:55,000
We face two different types of problem.

549
00:33:55,000 --> 00:34:01,000
One is exploding gradients, right.

550
00:34:01,000 --> 00:34:06,000
Here you'll be able to see that during backpropagation the gradient will become very, very large.

551
00:34:06,000 --> 00:34:10,000
Uh, and because of that, your training will not be stable.

552
00:34:10,000 --> 00:34:10,000
Right.

553
00:34:10,000 --> 00:34:14,000
And the second one is something called as softmax saturation.

554
00:34:17,000 --> 00:34:23,000
So when I say softmax saturation, this basically means the softmax function has become saturated:

555
00:34:23,000 --> 00:34:31,000
most of the attention weight is assigned to just a single token, and for the other

556
00:34:31,000 --> 00:34:31,000
tokens,

557
00:34:32,000 --> 00:34:35,000
the weight assigned will be near zero.

558
00:34:36,000 --> 00:34:39,000
The weight will be approximately equal to zero. We will discuss it.

559
00:34:39,000 --> 00:34:42,000
Don't worry, I will try to explain it to you with an example.

560
00:34:42,000 --> 00:34:45,000
Okay, so two problems we are going to face over here.

561
00:34:45,000 --> 00:34:47,000
One is exploding gradients and the other is softmax saturation.

562
00:34:47,000 --> 00:34:48,000
Okay.

563
00:34:48,000 --> 00:34:51,000
Now let's uh see one example.

564
00:34:51,000 --> 00:34:54,000
Let's say that I am going to consider three vectors.

565
00:34:54,000 --> 00:34:56,000
One is the query vector q.

566
00:34:56,000 --> 00:34:58,000
This is nothing but [2, 3, 4, 1].

567
00:34:58,000 --> 00:35:04,000
And let's say I have a key vector k1, right, [1, 0, 1, 0].

568
00:35:04,000 --> 00:35:09,000
And another key vector, k2, that is nothing but [0, 1, 0, 1].

569
00:35:09,000 --> 00:35:09,000
Okay.

570
00:35:11,000 --> 00:35:17,000
Now, without scaling, let's do the dot product, okay.

571
00:35:17,000 --> 00:35:19,000
When we do the dot product.

572
00:35:19,000 --> 00:35:24,000
So here what I will do is use q along with the transpose of k1.

573
00:35:24,000 --> 00:35:31,000
So if I do the dot operation how it will be two multiplied by one plus three multiplied by zero plus

574
00:35:31,000 --> 00:35:35,000
four multiplied by one, four multiplied by zero.

575
00:35:35,000 --> 00:35:39,000
So here overall I'm actually going to get this value okay okay.

576
00:35:40,000 --> 00:35:41,000
Uh, one multiplied by zero.

577
00:35:41,000 --> 00:35:47,000
The last term should not be four multiplied by zero; it is four multiplied by one and then one multiplied

578
00:35:47,000 --> 00:35:47,000
by zero.

579
00:35:47,000 --> 00:35:52,000
So overall, if you see, it is two plus four, which will be six, okay.

580
00:35:53,000 --> 00:35:58,000
And similarly if I do the dot operation with the next key vector over here.

581
00:35:58,000 --> 00:36:06,000
So this will be two multiplied by zero, three multiplied by one, four multiplied by zero, one multiplied

582
00:36:06,000 --> 00:36:06,000
by one.

583
00:36:06,000 --> 00:36:07,000
Okay.

584
00:36:07,000 --> 00:36:11,000
And if you see over here zero plus three plus zero plus one, it is nothing but four.

585
00:36:12,000 --> 00:36:18,000
So with respect to these two dot operations, you know, initially my key values were this much, okay,

586
00:36:19,000 --> 00:36:23,000
my key vectors were this much, but now after the dot operation I'm getting six and four.

587
00:36:23,000 --> 00:36:23,000
Okay.

588
00:36:24,000 --> 00:36:31,000
Now, you know, after the scaling, let's say that I'm going to apply a softmax, because that

589
00:36:31,000 --> 00:36:33,000
is what happens here also.

590
00:36:33,000 --> 00:36:35,000
So first of all we need to apply scaling.

591
00:36:35,000 --> 00:36:40,000
And then after that we'll apply a softmax activation function to get the attention weights for the final contextual embedding,

592
00:36:40,000 --> 00:36:40,000
okay.

593
00:36:40,000 --> 00:36:42,000
That is what happens in the next step.

594
00:36:42,000 --> 00:36:44,000
So let's say here I have got two scores.

595
00:36:45,000 --> 00:36:47,000
One is the score of six.

596
00:36:47,000 --> 00:36:49,000
The other one is the score of four.

597
00:36:49,000 --> 00:36:49,000
Okay.

598
00:36:50,000 --> 00:36:52,000
Now let me do one thing here.

599
00:36:52,000 --> 00:36:54,000
We are not going to apply any scaling.

600
00:36:54,000 --> 00:36:57,000
I'll just go ahead and write that down.

601
00:36:57,000 --> 00:36:58,000
You'll understand the problem.

602
00:36:58,000 --> 00:37:00,000
What problem you are going to face.

603
00:37:00,000 --> 00:37:01,000
Scaling not applied.

604
00:37:03,000 --> 00:37:09,000
So if I don't apply the scaling, you will be able to see over here that I will just go ahead and apply

605
00:37:09,000 --> 00:37:10,000
my softmax.

606
00:37:11,000 --> 00:37:16,000
Now softmax will get applied to both of these numbers, six comma four, okay.

607
00:37:17,000 --> 00:37:22,000
And uh here uh when we are specifically applying the softmax.

608
00:37:22,000 --> 00:37:22,000
Okay.

609
00:37:23,000 --> 00:37:29,000
So here you will be able to see that, if you remember the softmax formula, it is nothing but,

610
00:37:29,000 --> 00:37:34,000
let's say, e to the power of six divided by e to the power of six plus e to the power of four.

611
00:37:34,000 --> 00:37:36,000
Okay, comma.

612
00:37:38,000 --> 00:37:43,000
E to the power of four divided by e to the power of six, plus e to the power of four.

613
00:37:43,000 --> 00:37:45,000
When we specifically apply it on these two.

614
00:37:45,000 --> 00:37:46,000
Okay.

615
00:37:46,000 --> 00:37:47,000
And.

616
00:37:50,000 --> 00:37:54,000
Here, if I just go ahead and simplify this.

617
00:37:54,000 --> 00:37:56,000
So let's say e to the power of six.

618
00:37:56,000 --> 00:38:00,000
And here I will take E to the power of six as common.

619
00:38:00,000 --> 00:38:02,000
And I'll write one plus e to the power of minus two.

620
00:38:02,000 --> 00:38:04,000
This is how we can represent it.

621
00:38:04,000 --> 00:38:10,000
Similarly, if I go ahead and write E to the power of four, and if I take common as e to the power

622
00:38:10,000 --> 00:38:13,000
of four, I'm going to get E to the power of two plus one.

623
00:38:13,000 --> 00:38:19,000
Okay, so from here, if we do the forward calculation, since we are applying a softmax activation

624
00:38:19,000 --> 00:38:23,000
function, this will be nothing, but it will be.

625
00:38:23,000 --> 00:38:25,000
Or let me do one thing.

626
00:38:26,000 --> 00:38:32,000
Let me just go ahead and write the first one: e to the power of six divided by e to the power of six will go

627
00:38:32,000 --> 00:38:36,000
off, so it will be one divided by one plus e to the power of minus two, comma.

628
00:38:37,000 --> 00:38:41,000
And this will be one divided by e to the power of two plus one.

629
00:38:41,000 --> 00:38:44,000
Right, this e to the power of four and e to the power of four will cancel.

630
00:38:45,000 --> 00:38:53,000
Now, once we are calculating this, uh, here, you will be able to see that, uh, the final output.

631
00:38:53,000 --> 00:38:56,000
If I do the calculation with the help of calculator, it is nothing.

632
00:38:56,000 --> 00:39:00,000
But it is approximately equal to 0.88 and 0.12.

633
00:39:00,000 --> 00:39:08,000
Okay, so what has happened is that when we applied this softmax activation function on a score of six

634
00:39:08,000 --> 00:39:17,000
comma four, which we got as the dot product of the query vector with the key vectors of a specific word,

635
00:39:17,000 --> 00:39:18,000
of any word.

636
00:39:18,000 --> 00:39:20,000
So here we had this key one vector, and

637
00:39:20,000 --> 00:39:22,000
here we had this key two vector, right.

638
00:39:22,000 --> 00:39:26,000
And when we did this dot product we got six comma four.

639
00:39:26,000 --> 00:39:28,000
We calculated the score here.

640
00:39:28,000 --> 00:39:29,000
We did not apply any scaling.

641
00:39:29,000 --> 00:39:34,000
And finally you could see that when we applied a softmax activation function here

642
00:39:34,000 --> 00:39:38,000
you got values of 0.88 and 0.12, okay.
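
The unscaled case can be reproduced with a short sketch in plain Python. It assumes the query and the two key vectors from this example; the helper names `dot` and `softmax` are made up for this illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Softmax over a list of scores (max subtracted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


q = [2, 3, 4, 1]
k1 = [1, 0, 1, 0]
k2 = [0, 1, 0, 1]

raw_scores = [dot(q, k1), dot(q, k2)]   # no scaling applied
weights = softmax(raw_scores)

print(raw_scores)
print([round(w, 2) for w in weights])
```

Rounded to two decimals, the weights match the 0.88 and 0.12 from the hand calculation.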

643
00:39:38,000 --> 00:39:41,000
Now what does this value basically mean?

644
00:39:41,000 --> 00:39:44,000
First of all this value actually means that.

645
00:39:45,000 --> 00:40:05,000
most of the attention weight, most of the attention weight, okay, is assigned to the first key vector,

646
00:40:06,000 --> 00:40:06,000
right?

647
00:40:06,000 --> 00:40:08,000
Because the value is higher.

648
00:40:08,000 --> 00:40:09,000
This is higher than this, right?

649
00:40:10,000 --> 00:40:12,000
And very little.

650
00:40:14,000 --> 00:40:17,000
Very little to the second.

651
00:40:19,000 --> 00:40:20,000
Second vector.

652
00:40:21,000 --> 00:40:22,000
Okay.

653
00:40:22,000 --> 00:40:24,000
Now let's understand what this sentence means.

654
00:40:24,000 --> 00:40:30,000
Most of the attention weight is assigned to the first key vector, that is 0.88, and very little

655
00:40:30,000 --> 00:40:32,000
to the second vector okay.

656
00:40:32,000 --> 00:40:32,000
Okay.

657
00:40:32,000 --> 00:40:41,000
And let's understand. See, one of the properties I think you should know if you have knowledge of

658
00:40:42,000 --> 00:40:43,000
deep learning.

659
00:40:43,000 --> 00:40:43,000
Right.

660
00:40:43,000 --> 00:40:50,000
So one of the properties of softmax shows up whenever we apply a softmax to two numbers, right.

661
00:40:50,000 --> 00:40:51,000
Two scores are there.

662
00:40:52,000 --> 00:40:56,000
Here you can see the difference between the scores is very small, six comma four.

663
00:40:56,000 --> 00:41:00,000
But with respect to the output here you are able to see that there is a huge difference.

664
00:41:00,000 --> 00:41:04,000
0.88 I'm getting for the score six and 0.12 I'm getting for the score four.

665
00:41:04,000 --> 00:41:05,000
Right.

666
00:41:05,000 --> 00:41:11,000
And similarly, if I try to apply a softmax activation function on numbers like ten comma one.

667
00:41:11,000 --> 00:41:16,000
Now in this case ten comma one, you'll be seeing that there is a huge difference.

668
00:41:16,000 --> 00:41:17,000
Right.

669
00:41:17,000 --> 00:41:20,000
And these values are basically coming from the dot product of q and k.

670
00:41:20,000 --> 00:41:21,000
Right.

671
00:41:21,000 --> 00:41:21,000
right.

672
00:41:21,000 --> 00:41:23,000
And here you can see that I have not applied scaling.

673
00:41:23,000 --> 00:41:24,000
Right.

674
00:41:24,000 --> 00:41:29,000
So, one of the properties: if I do not apply scaling, whenever I try to find out the softmax for this,

675
00:41:29,000 --> 00:41:34,000
again, if I try to find out the output, you will be able to see that one value will be about 0.9999 and

676
00:41:34,000 --> 00:41:36,000
one value will be about 0.0001.

677
00:41:36,000 --> 00:41:40,000
Now in this scenario, again this value is very high.

678
00:41:40,000 --> 00:41:42,000
This value is very low, okay.

679
00:41:42,000 --> 00:41:47,000
You should understand that all the scores will be used for updating the weights

680
00:41:47,000 --> 00:41:49,000
that we are using, right?

681
00:41:49,000 --> 00:41:53,000
The weights like w of q, w of k, w of v.

682
00:41:53,000 --> 00:41:59,000
Now when we are doing backpropagation, right, just imagine when this small value comes

683
00:41:59,000 --> 00:42:07,000
into the picture: obviously we are going to face the problem of softmax saturation.

684
00:42:07,000 --> 00:42:08,000
Right.

685
00:42:08,000 --> 00:42:13,000
So there will be one problem, softmax saturation, where my weights are not going to get updated much

686
00:42:13,000 --> 00:42:17,000
because the value that I'm actually going to get is becoming very small.

687
00:42:17,000 --> 00:42:21,000
And during backpropagation, I can also describe this as the vanishing gradient problem.

688
00:42:22,000 --> 00:42:22,000
Right.

689
00:42:22,000 --> 00:42:25,000
Vanishing gradient problem where my weights are not getting updated.

690
00:42:27,000 --> 00:42:28,000
Vanishing gradient problem.

691
00:42:28,000 --> 00:42:31,000
So this is the property with respect to softmax.
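
One way to see the saturation numerically is sketched below in plain Python. For a two-score softmax, the derivative of the first weight p with respect to its own score is p times (1 minus p), which is a standard result. The function names here are made up for this illustration.

```python
import math


def softmax(scores):
    """Softmax over a list of scores (max subtracted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def first_weight_and_gradient(scores):
    """Return the first softmax weight p and its derivative p * (1 - p)."""
    p = softmax(scores)[0]
    return p, p * (1 - p)


for scores in ([6, 4], [10, 1]):
    p, grad = first_weight_and_gradient(scores)
    print(scores, round(p, 4), round(grad, 6))
```

The larger the gap between the scores, the closer the top weight gets to one and the smaller this derivative becomes, which is the vanishing gradient behaviour described here.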

692
00:42:31,000 --> 00:42:34,000
See right now I did not do any scaling.

693
00:42:34,000 --> 00:42:37,000
And I got this kind of value, like 0.88 and 0.12.

694
00:42:37,000 --> 00:42:40,000
Even though my value of scores was six comma four.

695
00:42:40,000 --> 00:42:43,000
Not much difference, hardly a difference of two.

696
00:42:43,000 --> 00:42:48,000
Similarly, if I do a dot product of query and key vectors, if I'm getting a very high value over here,

697
00:42:48,000 --> 00:42:52,000
if I apply the softmax activation function, just try to apply the softmax activation function.

698
00:42:52,000 --> 00:42:57,000
You'll be able to see that I'll be getting a very high value at one of the vector, and the other value

699
00:42:57,000 --> 00:42:57,000
will be very, very slow.

700
00:42:57,000 --> 00:42:59,000
Uh, very, very small.

701
00:42:59,000 --> 00:42:59,000
Right.

702
00:42:59,000 --> 00:43:06,000
So this scenario we usually face, you'll be able to see, when your

703
00:43:06,000 --> 00:43:12,000
dot product, the dot product that you compute, is coming out as a large value.

704
00:43:13,000 --> 00:43:14,000
Okay.

705
00:43:14,000 --> 00:43:15,000
Now let me do one thing.

706
00:43:15,000 --> 00:43:21,000
Let me just show this by scaling these values, okay.

707
00:43:21,000 --> 00:43:21,000
Okay.

708
00:43:21,000 --> 00:43:25,000
So now I will just go ahead and write with scaling.

709
00:43:25,000 --> 00:43:28,000
With scaling, we'll try to apply the softmax activation function.

710
00:43:28,000 --> 00:43:30,000
Now here you can see with scaling.

711
00:43:30,000 --> 00:43:36,000
First of all I will just go ahead and compute my scaled dot product.

712
00:43:39,000 --> 00:43:40,000
Okay.

713
00:43:40,000 --> 00:43:47,000
So, compute the scaled dot products. Now, obviously, you know my scores.

714
00:43:47,000 --> 00:43:49,000
My score one.

715
00:43:49,000 --> 00:43:49,000
Right?

716
00:43:50,000 --> 00:43:52,000
My score one was six and four.

717
00:43:52,000 --> 00:43:55,000
Right now for this six and four.

718
00:43:55,000 --> 00:44:03,000
First of all, if I really want to scale it, I need to divide it by root of d_k, right?

719
00:44:03,000 --> 00:44:05,000
I need to divide this number by root of d k.

720
00:44:05,000 --> 00:44:07,000
So I will take each and every number.

721
00:44:08,000 --> 00:44:09,000
I'll divide by root of d_k.

722
00:44:09,000 --> 00:44:15,000
Root of d_k is nothing but root of four, which is nothing but two.

723
00:44:15,000 --> 00:44:19,000
So if I divide six by two and four by two, I'm going to get what?

724
00:44:20,000 --> 00:44:21,000
Three comma two.

725
00:44:21,000 --> 00:44:22,000
Understand?

726
00:44:22,000 --> 00:44:28,000
My scores were initially six comma four, but now I have scaled it with this value.

727
00:44:28,000 --> 00:44:30,000
That is root of d_k.

728
00:44:30,000 --> 00:44:32,000
And then I'm actually getting three comma two.

729
00:44:32,000 --> 00:44:42,000
Now if I go ahead and apply softmax on three comma two, you just go ahead and do this calculation how

730
00:44:42,000 --> 00:44:43,000
much you will be getting okay.

731
00:44:43,000 --> 00:44:50,000
So again I will do the calculation over here: e to the power of three divided by (e to the power of three

732
00:44:50,000 --> 00:44:57,000
plus e to the power of two), comma, and in the second numerator, not e to the power of three,

733
00:44:57,000 --> 00:44:58,000
sorry, e to the power of two,

734
00:44:59,000 --> 00:45:03,000
divided by (e to the power of three plus e to the power of two).

735
00:45:03,000 --> 00:45:08,000
Similarly, uh, if I go ahead and probably do this computation.

736
00:45:08,000 --> 00:45:14,000
Okay, you can factor out whatever common term you want. Let's say here I'm going to

737
00:45:14,000 --> 00:45:18,000
factor out e to the power of three from the numerator and the denominator.

738
00:45:18,000 --> 00:45:20,000
I will take it as common.

739
00:45:20,000 --> 00:45:22,000
This will be one divided by (one plus e to the power of minus one).

740
00:45:23,000 --> 00:45:27,000
Okay, comma, E to the power of two, e to the power of two.

741
00:45:27,000 --> 00:45:28,000
I will take a common.

742
00:45:28,000 --> 00:45:31,000
Then here I will write one divided by (e to the power of one plus one).

743
00:45:31,000 --> 00:45:32,000
Okay.

744
00:45:32,000 --> 00:45:33,000
And this will cancel out.

745
00:45:33,000 --> 00:45:34,000
This will get cancelled.

746
00:45:34,000 --> 00:45:37,000
So finally, what output will I be getting?

747
00:45:37,000 --> 00:45:38,000
If I do the computation.

748
00:45:38,000 --> 00:45:48,000
It is nothing but 0.73 comma 0.27. Before, when we did not apply the scaling, you could see

749
00:45:48,000 --> 00:45:50,000
that I was getting the values 0.88 and 0.12.

750
00:45:50,000 --> 00:45:51,000
The difference is very huge.

751
00:45:51,000 --> 00:45:54,000
But now here you can see the difference is smaller.
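The comparison above can be checked with a few lines of code. This is only a minimal sketch: the `softmax` helper is written here for illustration (it is not from the video), and it uses the scores 6 and 4 with d_k = 4 from this example.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of raw scores."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 4
raw_scores = [6.0, 4.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [3.0, 2.0]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print([round(w, 2) for w in unscaled_weights])  # [0.88, 0.12]
print([round(w, 2) for w in scaled_weights])    # [0.73, 0.27]
```

The gap between the two weights shrinks from about 0.76 to about 0.46 once the scores are divided by root of d_k.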

752
00:45:54,000 --> 00:46:00,000
So now, when we are backpropagating, we will not face the vanishing gradient problem.

753
00:46:00,000 --> 00:46:00,000
Right.

754
00:46:00,000 --> 00:46:06,000
And the output you see here is also called the attention weights.

755
00:46:06,000 --> 00:46:08,000
This is my attention weights.

756
00:46:08,000 --> 00:46:10,000
This Attention weights.

757
00:46:11,000 --> 00:46:13,000
Let me write a point over here.

758
00:46:14,000 --> 00:46:15,000
Here.

759
00:46:16,000 --> 00:46:19,000
The attention weights.

760
00:46:22,000 --> 00:46:27,000
Here, the attention weights are more balanced.

761
00:46:29,000 --> 00:46:36,000
They are more balanced compared to the unscaled case.

762
00:46:38,000 --> 00:46:41,000
So if we are scaling now, we are getting a proper balance.

763
00:46:41,000 --> 00:46:43,000
Before the difference was huge.

764
00:46:43,000 --> 00:46:48,000
I know here also you can see the difference is there, but at least in the back propagation with respect

765
00:46:48,000 --> 00:46:49,000
to the other vector that we have.

766
00:46:49,000 --> 00:46:49,000
Right.

767
00:46:49,000 --> 00:46:54,000
And when we try to update the weights, at least we won't face the vanishing gradient problem okay.
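To see why a saturated softmax hurts the weight updates, here is a small sketch. It assumes the standard softmax derivative, d p_i / d s_i = p_i * (1 - p_i); the helper names and the larger score pair (60, 40) are hypothetical and only added for illustration.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_self_gradients(scores):
    """Diagonal of the softmax Jacobian: d p_i / d s_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in softmax(scores)]

# (3, 2) is the scaled case, (6, 4) the unscaled case from the example,
# and (60, 40) is a hypothetical case with much larger dot products.
for scores in ([3.0, 2.0], [6.0, 4.0], [60.0, 40.0]):
    grads = softmax_self_gradients(scores)
    print(scores, [f"{g:.2e}" for g in grads])
```

The more one-sided the softmax output becomes, the closer these derivatives get to zero, which is the vanishing gradient effect described here.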

768
00:46:54,000 --> 00:46:57,000
So this is the most important thing right?

769
00:46:57,000 --> 00:47:04,000
And because of this, if I just try to write down the summary for you okay, let me just note down all

770
00:47:04,000 --> 00:47:07,000
the summary points that I really want to specify over here.

771
00:47:08,000 --> 00:47:12,000
Because of this, there are two important things that you need to take a note of.

772
00:47:12,000 --> 00:47:12,000
Okay.

773
00:47:14,000 --> 00:47:20,000
So the first thing is that here you are stabilizing.

774
00:47:20,000 --> 00:47:22,000
You are stabilizing the training.

775
00:47:22,000 --> 00:47:26,000
Scaling prevents extremely large dot products, which helps in stabilizing the gradient during back

776
00:47:26,000 --> 00:47:30,000
propagation, making the training process more stable and efficient.

777
00:47:30,000 --> 00:47:35,000
And second, you are preventing saturation: by scaling the dot product, the softmax produces more balanced

778
00:47:35,000 --> 00:47:37,000
attention weights.

779
00:47:37,000 --> 00:47:39,000
Now there is a more balanced attention weight.

780
00:47:39,000 --> 00:47:41,000
This token is also having some importance.

781
00:47:41,000 --> 00:47:46,000
This token is also having some kind of importance, right? Before, whatever values we gave

782
00:47:46,000 --> 00:47:50,000
over here, the gap was very high when we applied the softmax.

783
00:47:50,000 --> 00:47:55,000
And that is what is basically like creating a stable gradient, right?

784
00:47:55,000 --> 00:47:57,000
We specifically divide by this number.

785
00:47:58,000 --> 00:48:03,000
Now you may be thinking, Krish, how did we arrive at this particular number?

786
00:48:03,000 --> 00:48:03,000
Right.

787
00:48:03,000 --> 00:48:06,000
Why are we specifically dividing by root of d_k?

788
00:48:06,000 --> 00:48:14,000
So guys, for this there is one simple piece of research with respect to variance.

789
00:48:14,000 --> 00:48:15,000
Okay.

790
00:48:15,000 --> 00:48:23,000
Whenever we do the dot product, as we keep on increasing the dimension, the variance of the

791
00:48:23,000 --> 00:48:26,000
dot product will also keep on increasing, okay.

792
00:48:26,000 --> 00:48:34,000
So as this keeps on increasing (and there is research on this, guys),

793
00:48:34,000 --> 00:48:39,000
Here you will be able to see that if we divide by root of d k.

794
00:48:39,000 --> 00:48:46,000
This makes sure that even as the dimension increases, whether the dimension is two or three or more,

795
00:48:46,000 --> 00:48:50,000
the variance of the scaled dot product will stay almost the same.

797
00:48:53,000 --> 00:48:53,000
Right.

798
00:48:53,000 --> 00:48:55,000
And this is a small research in variance.

799
00:48:55,000 --> 00:48:59,000
I have just mentioned it so that you can refer to it whenever you want.

800
00:48:59,000 --> 00:49:03,000
But I think for our understanding, this much is more than sufficient.

801
00:49:03,000 --> 00:49:04,000
Okay.

802
00:49:04,000 --> 00:49:11,000
So because of this scaling, one more point that I can specify over here is that scaling ensures the

803
00:49:11,000 --> 00:49:17,000
dot products are kept within a range that allows the softmax activation function to operate effectively,

804
00:49:17,000 --> 00:49:19,000
providing a more balanced distribution.

805
00:49:20,000 --> 00:49:24,000
Okay, so if I go again back with respect to all the steps.

806
00:49:24,000 --> 00:49:26,000
So my fourth step was scaling.

807
00:49:26,000 --> 00:49:30,000
And after scaling, uh, we will just try to understand what is the next step okay.

808
00:49:30,000 --> 00:49:33,000
But here I've shown you multiple examples with respect to scaling.

809
00:49:33,000 --> 00:49:34,000
Right.

810
00:49:34,000 --> 00:49:40,000
So if I just go ahead and show you what all things we discussed till now, first of all we go ahead

811
00:49:40,000 --> 00:49:43,000
and you know, do the token embedding.

812
00:49:43,000 --> 00:49:48,000
And then with respect to the token embedding we do the linear transformation where we are going to find

813
00:49:48,000 --> 00:49:50,000
out three important vectors: query, key, and value.

814
00:49:51,000 --> 00:49:54,000
I've told you about the discussion with query key value pairs.

815
00:49:54,000 --> 00:49:55,000
Why it is important.

816
00:49:55,000 --> 00:49:58,000
Then we go ahead and compute this.

817
00:49:58,000 --> 00:50:00,000
Then we compute the attention scores.

818
00:50:00,000 --> 00:50:03,000
And this attention scores also play a very important role.

819
00:50:03,000 --> 00:50:06,000
Then we probably do the scaling part with respect to this okay.

820
00:50:06,000 --> 00:50:15,000
Now once this scaling is done, right, once we specifically do the scaling, then we apply softmax and calculate the weighted

821
00:50:15,000 --> 00:50:18,000
sum of all the values, okay.

822
00:50:18,000 --> 00:50:21,000
That is what the next step is all about right.

823
00:50:21,000 --> 00:50:27,000
And I know this may be somewhat confusing, but it is always good that we try to understand things.

824
00:50:27,000 --> 00:50:27,000
Okay.

825
00:50:27,000 --> 00:50:31,000
So let me go back to my previous problem statement here.

826
00:50:31,000 --> 00:50:35,000
If you remember, we had calculated these scores, right?

827
00:50:35,000 --> 00:50:36,000
All these scores, right.

828
00:50:36,000 --> 00:50:43,000
In the compute attention scores step, the scores of the query of "the" with respect to the keys were 2, 0, 2.

829
00:50:43,000 --> 00:50:46,000
Like, these scores we have actually computed.

830
00:50:46,000 --> 00:50:47,000
Right.

831
00:50:47,000 --> 00:50:48,000
So let me do one thing.

832
00:50:48,000 --> 00:50:52,000
Let me just go ahead and apply softmax, but before that we will do the scaling, okay.

833
00:50:52,000 --> 00:50:54,000
And then we will compute.

834
00:50:54,000 --> 00:50:56,000
And I will take only one word for this okay.

835
00:50:56,000 --> 00:50:59,000
Or one word I'll just try to show you with respect to one word.

836
00:50:59,000 --> 00:51:01,000
And then you should be able to understand.

837
00:51:01,000 --> 00:51:05,000
So the fourth step that we are specifically going to do is nothing but scaling.

838
00:51:05,000 --> 00:51:07,000
I've shown you one example of scaling.

839
00:51:07,000 --> 00:51:11,000
So for scaling, what we will do is divide by root of d_k, okay.

840
00:51:11,000 --> 00:51:13,000
In this case d_k is four.

841
00:51:13,000 --> 00:51:15,000
So this will basically become two right.

842
00:51:15,000 --> 00:51:17,000
Root of four which will become two right.

843
00:51:17,000 --> 00:51:21,000
Then we are just going to calculate our scaled score.

844
00:51:21,000 --> 00:51:30,000
So if I go ahead and calculate with the query of "the" and the key of "the", we had this value initially as

845
00:51:30,000 --> 00:51:30,000
two.

846
00:51:30,000 --> 00:51:32,000
And we are dividing this by two.

847
00:51:32,000 --> 00:51:34,000
So I'm going to get one right.

848
00:51:34,000 --> 00:51:44,000
Similarly, if I go ahead and calculate our scaled score with Q of "the" and K of "cat",

849
00:51:44,000 --> 00:51:48,000
here we are just going to divide zero by two, which is equal to zero.

850
00:51:49,000 --> 00:51:54,000
Then I have my scaled score.

851
00:51:54,000 --> 00:51:58,000
This is nothing but Q of "the" comma K of "sat".

852
00:51:58,000 --> 00:52:00,000
So initially this was two.

853
00:52:00,000 --> 00:52:03,000
So I'm going to divide by two which is nothing but one.

854
00:52:03,000 --> 00:52:06,000
Now once I do this division you'll be able to see that again.

855
00:52:06,000 --> 00:52:09,000
This value has now decreased, right?

856
00:52:09,000 --> 00:52:12,000
So similarly all the scaling will be done.

857
00:52:12,000 --> 00:52:16,000
Similarly you will be seeing that scaling will be done.

858
00:52:17,000 --> 00:52:26,000
Similarly scaling will be done for all other tokens.
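The scaling of every token's scores can be sketched in one go. This assumes the raw scores used in this walkthrough (2, 0, 2 for "the"; 0, 2, 2 for "cat"; 2, 2, 4 for "sat") and d_k = 4; the variable names are only illustrative.

```python
import math

tokens = ["the", "cat", "sat"]
d_k = 4

# Raw attention scores: one row per query token, one column per key token.
scores = [
    [2.0, 0.0, 2.0],  # query "the"
    [0.0, 2.0, 2.0],  # query "cat"
    [2.0, 2.0, 4.0],  # query "sat"
]

scale = math.sqrt(d_k)  # root of d_k = 2.0
scaled_scores = [[s / scale for s in row] for row in scores]

for token, row in zip(tokens, scaled_scores):
    print(token, row)
# the [1.0, 0.0, 1.0]
# cat [0.0, 1.0, 1.0]
# sat [1.0, 1.0, 2.0]
```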

859
00:52:28,000 --> 00:52:29,000
Perfect right.

860
00:52:30,000 --> 00:52:36,000
Now after this, as I have already told you, the next operation will be to apply softmax.

861
00:52:38,000 --> 00:52:41,000
Now, when I apply softmax here, you'll be able to see that.

862
00:52:42,000 --> 00:52:48,000
And this softmax will be applied to the scores that we have actually got, the scaled

863
00:52:48,000 --> 00:52:48,000
scores.

864
00:52:48,000 --> 00:52:54,000
And then whenever we apply the softmax we basically create our attention weights.

865
00:52:54,000 --> 00:52:59,000
So the attention weights is what we get.

866
00:52:59,000 --> 00:53:01,000
And these attention weights are nothing but this.

867
00:53:01,000 --> 00:53:09,000
So for the word "the", the attention weights will be softmax applied on my values, like one comma zero

868
00:53:09,000 --> 00:53:10,000
comma one.

869
00:53:10,000 --> 00:53:14,000
So here, once I apply the softmax, you can go and check it out.

870
00:53:14,000 --> 00:53:15,000
I've done the calculation.

871
00:53:15,000 --> 00:53:23,000
It is nothing but 0.4223, 0.1554, 0.4223.

872
00:53:23,000 --> 00:53:23,000
Okay.

873
00:53:23,000 --> 00:53:26,000
So here initially your value was this.

874
00:53:28,000 --> 00:53:29,000
Right, with respect to the word "the".

875
00:53:29,000 --> 00:53:31,000
What was your initial vector?

876
00:53:31,000 --> 00:53:37,000
See this, your initial vector was [1, 0, 1, 0], okay.

877
00:53:37,000 --> 00:53:39,000
[1, 0, 1, 0].

878
00:53:39,000 --> 00:53:39,000
Okay.

879
00:53:39,000 --> 00:53:41,000
I missed out one.

880
00:53:41,000 --> 00:53:41,000
Oh.

881
00:53:41,000 --> 00:53:41,000
No.

882
00:53:41,000 --> 00:53:43,000
We got it exactly right.

883
00:53:43,000 --> 00:53:53,000
So [1, 0, 1, 0] was your initial vector, and now I am able to get these attention weights, which will give me my contextual embedding vector.

884
00:53:53,000 --> 00:53:59,000
Similarly, when I go ahead and calculate for the attention weights of cat.

885
00:54:02,000 --> 00:54:04,000
So this was the cat right here.

886
00:54:04,000 --> 00:54:10,000
If I apply the softmax activation function and the softmax activation function is applied on zero comma

887
00:54:10,000 --> 00:54:11,000
two comma two, scaled to zero comma one comma one.

888
00:54:12,000 --> 00:54:23,000
Then I'm going to get 0.1554, 0.4223, and 0.4223.

889
00:54:23,000 --> 00:54:29,000
Now, whatever vectors we are getting, this is considering the other vectors in context okay.

890
00:54:29,000 --> 00:54:34,000
Similarly, when we go ahead and calculate the attention weights.

891
00:54:37,000 --> 00:54:40,000
For the word "sat".

892
00:54:40,000 --> 00:54:42,000
So again we have to go ahead and pass our softmax.

893
00:54:42,000 --> 00:54:46,000
And this will be applied to our two comma two comma four, scaled to one comma one comma two, okay.

894
00:54:47,000 --> 00:54:55,000
And here we are specifically going to get 0.2119, 0.2119.

895
00:54:55,000 --> 00:54:58,000
And this is 0.5761, right.

896
00:54:58,000 --> 00:54:58,000
Right.

897
00:54:59,000 --> 00:55:02,000
So this is how we go ahead and apply the softmax activation function.

898
00:55:02,000 --> 00:55:04,000
And we get the attention weights.
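If you want to validate these attention weights yourself, here is a minimal sketch. It assumes the scaled scores from this walkthrough (raw scores divided by root of d_k = 2); the `softmax` helper and the dictionary layout are written only for illustration.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scaled scores per query token; columns are the keys "the", "cat", "sat".
scaled_scores = {
    "the": [1.0, 0.0, 1.0],
    "cat": [0.0, 1.0, 1.0],
    "sat": [1.0, 1.0, 2.0],
}

attention_weights = {tok: softmax(row) for tok, row in scaled_scores.items()}

for tok, weights in attention_weights.items():
    # Each row of attention weights should sum to 1.
    print(tok, [round(w, 4) for w in weights], "sum =", round(sum(weights), 4))
```

A quick sanity check is that every row sums to 1, since softmax always returns a probability distribution over the keys.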

899
00:55:04,000 --> 00:55:04,000
Okay.

900
00:55:04,000 --> 00:55:09,000
Now finally see this is just our attention weights.

901
00:55:09,000 --> 00:55:12,000
Like what are the attention weights with respect to all the other values.

902
00:55:12,000 --> 00:55:15,000
But you remember, right, initially we gave four-dimensional vectors.

903
00:55:15,000 --> 00:55:18,000
And finally the output also should be a four dimensional vector.

904
00:55:18,000 --> 00:55:28,000
So the final step that we are specifically doing is the weighted sum of values.

905
00:55:31,000 --> 00:55:35,000
Now in the weighted sum of values what we do is that we multiply.

906
00:55:38,000 --> 00:55:39,000
The attention weights.

907
00:55:42,000 --> 00:55:51,000
Attention weights by the corresponding value vector, because somewhere that value vector needs to be

908
00:55:51,000 --> 00:55:51,000
used, right?

909
00:55:52,000 --> 00:55:55,000
Value vectors to get the final output.

910
00:55:55,000 --> 00:55:57,000
So for the token.

911
00:55:58,000 --> 00:56:00,000
For the token "the", right.

912
00:56:00,000 --> 00:56:03,000
If I want to go ahead and calculate the weighted sum.

913
00:56:03,000 --> 00:56:05,000
So I will just go ahead and write.

914
00:56:05,000 --> 00:56:11,000
The output of "the" will be nothing but this.

915
00:56:11,000 --> 00:56:13,000
What is the attention weight?

916
00:56:13,000 --> 00:56:15,000
0.4223.

917
00:56:15,000 --> 00:56:18,000
So first one I'm getting 0.4223.

918
00:56:18,000 --> 00:56:23,000
This will be multiplied by the value vector of "the", right.

919
00:56:24,000 --> 00:56:29,000
Then the attention weight of cat right.

920
00:56:29,000 --> 00:56:30,000
We will just go ahead and do this.

921
00:56:30,000 --> 00:56:35,000
See the output for "the".

922
00:56:35,000 --> 00:56:37,000
See, the vector of "the" needs to be converted into some other vector.

923
00:56:37,000 --> 00:56:38,000
Right.

924
00:56:38,000 --> 00:56:41,000
And it should be converted with respect to the context of cat and sat.

925
00:56:41,000 --> 00:56:43,000
So attention weight of cat.

926
00:56:43,000 --> 00:56:49,000
So what we are going to do, we are going to basically use the next weight that is 0.1554.

927
00:56:49,000 --> 00:56:52,000
So here I will go ahead and write 0.1554.

928
00:56:52,000 --> 00:56:57,000
And this much attention needs to be given to the value vector of cat.

929
00:56:57,000 --> 00:57:01,000
Similarly, the last one is 0.4223.

930
00:57:01,000 --> 00:57:07,000
This much attention needs to be given to the value vector of "sat".

931
00:57:07,000 --> 00:57:09,000
So 0.4223, right.

932
00:57:09,000 --> 00:57:13,000
Similarly, once I go and do the further calculation.

933
00:57:13,000 --> 00:57:17,000
So here you'll be able to see I will use 0.4223, and the value of "the" is what?

934
00:57:17,000 --> 00:57:21,000
So if you remember, what did we consider as the value of "the"?

935
00:57:21,000 --> 00:57:23,000
If you see this.

936
00:57:23,000 --> 00:57:27,000
So this is my Q, K, and value.

937
00:57:27,000 --> 00:57:31,000
So the value is nothing but the initial embedding that we have used, right.

938
00:57:31,000 --> 00:57:32,000
So let's see the value.

939
00:57:32,000 --> 00:57:35,000
So for "the" it is [1, 0, 1, 0].

940
00:57:35,000 --> 00:57:37,000
For "cat" it is [0, 1, 0, 1].

941
00:57:37,000 --> 00:57:40,000
And for "sat" it is [1, 1, 1, 1], okay.

942
00:57:40,000 --> 00:57:43,000
So let's go ahead and write this I will do the calculation over here.

943
00:57:43,000 --> 00:57:49,000
I know there are many steps, guys, but please focus, I'm solving it completely step by step: 0.4223 multiplied by [1, 0, 1, 0].

944
00:57:50,000 --> 00:57:51,000
Then we will multiply.

945
00:57:51,000 --> 00:57:52,000
Sorry.

946
00:57:52,000 --> 00:57:52,000
We will add

947
00:57:54,000 --> 00:57:59,000
0.1554 multiplied by V of "cat".

948
00:57:59,000 --> 00:57:59,000
Right.

949
00:57:59,000 --> 00:58:03,000
So V of "cat" is nothing but [0, 1, 0, 1].

950
00:58:03,000 --> 00:58:04,000
Then you have

951
00:58:04,000 --> 00:58:10,000
0.4223 multiplied by [1, 1, 1, 1].

952
00:58:10,000 --> 00:58:11,000
Okay.

953
00:58:11,000 --> 00:58:16,000
So now, once we do this multiplication, what values are we going to get over here?

954
00:58:16,000 --> 00:58:17,000
Very simple.

955
00:58:17,000 --> 00:58:19,000
Just go ahead and do the computation.

956
00:58:19,000 --> 00:58:25,000
So first of all, for the further calculation, once

957
00:58:25,000 --> 00:58:32,000
we do this multiplication here, we are going to get 0.4223, 0,

958
00:58:32,000 --> 00:58:35,000
then again 0.4223,

959
00:58:35,000 --> 00:58:35,000
and then

960
00:58:35,000 --> 00:58:36,000
zero.

961
00:58:36,000 --> 00:58:37,000
Right.

962
00:58:37,000 --> 00:58:38,000
This will be my first vector.

963
00:58:38,000 --> 00:58:42,000
Then I have this vector: 0.1554.

964
00:58:43,000 --> 00:58:44,000
Sorry, it will be zero.

965
00:58:44,000 --> 00:58:47,000
Initially it will be zero because we are multiplying by zero.

966
00:58:47,000 --> 00:58:55,000
Then I have 0.1554, then again zero, then 0.1554.

967
00:58:55,000 --> 00:58:56,000
Okay.

968
00:58:56,000 --> 00:58:58,000
And similarly here I will be having

969
00:58:58,000 --> 00:59:07,000
0.4223, 0.4223, 0.4223, 0.4223,

970
00:59:07,000 --> 00:59:07,000
okay.

971
00:59:08,000 --> 00:59:09,000
Done.

972
00:59:09,000 --> 00:59:12,000
So once I get this, it is nothing but the sum.

973
00:59:12,000 --> 00:59:18,000
Finally I'll be able to get 0.8446 and 0.5777

974
00:59:18,000 --> 00:59:22,000
if I just do the plus operation with respect to all the numbers,

975
00:59:22,000 --> 00:59:23,000
okay,

976
00:59:23,000 --> 00:59:28,000
and then again 0.8446 and 0.5777.

977
00:59:28,000 --> 00:59:30,000
You can validate it if I'm doing all the computation right.

978
00:59:30,000 --> 00:59:33,000
If not, you can let me know in the comment section.

979
00:59:33,000 --> 00:59:33,000
Okay.

980
00:59:33,000 --> 00:59:35,000
So this is basically my output.
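The weighted sum for the token "the" can be validated with a short sketch. It assumes the rounded attention weights 0.4223, 0.1554, 0.4223 and the value vectors [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1] quoted in this walkthrough; the variable names are only illustrative.

```python
# Attention weights for the query "the" over the keys "the", "cat", "sat".
weights_the = [0.4223, 0.1554, 0.4223]

# Value vectors, which in this example equal the initial embeddings.
values = [
    [1, 0, 1, 0],  # V("the")
    [0, 1, 0, 1],  # V("cat")
    [1, 1, 1, 1],  # V("sat")
]

# Scale each value vector by its weight, then add them position by position.
weighted = [[w * v_i for v_i in v] for w, v in zip(weights_the, values)]
output_the = [sum(column) for column in zip(*weighted)]

print([round(x, 4) for x in output_the])
```

Because the weights sum to 1, every entry of the output has to lie between the smallest and largest entry of the value vectors at that position, which is a handy way to spot an arithmetic slip.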

981
00:59:35,000 --> 00:59:42,000
So initially you'll be able to see that when I gave the word "the", which had an embedding vector

982
00:59:42,000 --> 00:59:49,000
of [1, 0, 1, 0], after passing through the self-attention,

983
00:59:52,000 --> 00:59:55,000
it is giving us this vector,

984
00:59:55,000 --> 00:59:56,000
that is, the contextual vector

985
00:59:56,000 --> 01:00:06,000
[0.8446, 0.5777, 0.8446, 0.5777].

986
01:00:06,000 --> 01:00:10,000
Okay, so this is what is my contextual vector.

987
01:00:12,000 --> 01:00:12,000
Okay.

988
01:00:12,000 --> 01:00:16,000
To get this contextual vector, here is what we did.

989
01:00:16,000 --> 01:00:19,000
First we calculated our Q, K, and V vectors.

990
01:00:19,000 --> 01:00:25,000
For this we initialized the W_Q, W_K, and W_V weight matrices.

991
01:00:25,000 --> 01:00:26,000
And again, these will be trained.

992
01:00:26,000 --> 01:00:31,000
Then after this we went and calculated our attention score.

993
01:00:32,000 --> 01:00:36,000
After calculating attention score we scaled the value.

994
01:00:37,000 --> 01:00:41,000
Then after doing this, we passed it through a softmax activation function.

995
01:00:41,000 --> 01:00:49,000
And finally we calculated the weighted sum of values.

996
01:00:49,000 --> 01:00:50,000
And how do we calculate this?

997
01:00:50,000 --> 01:00:51,000
It is very simple.

998
01:00:52,000 --> 01:01:02,000
After performing the softmax, whatever weights we get, we have to multiply with V, right?

999
01:01:02,000 --> 01:01:05,000
And that is how we specifically get all these values.
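All the steps recapped above can be put together in one small sketch. This is not the trained model: it assumes W_Q, W_K, and W_V are identity matrices, which is an assumption that reproduces the scores and value vectors of this example, whereas in practice these matrices are learned. All function names are hypothetical helpers.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # step 1: Q, K, V
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                           # step 2: attention scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 3: scaling
    weights = [softmax(row) for row in scaled]                 # step 4: softmax
    return matmul(weights, v), weights                         # step 5: weighted sum of values

# Embeddings for "the", "cat", "sat" as used in the example.
embeddings = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
# Assumption: identity projections instead of learned weight matrices.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

outputs, weights = self_attention(embeddings, identity, identity, identity)
for token, row in zip(["the", "cat", "sat"], outputs):
    print(token, [round(x, 4) for x in row])
```

Each output row is the contextual vector for one token, and it has the same four dimensions as the input embedding.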

1000
01:01:06,000 --> 01:01:07,000
Okay.

1001
01:01:07,000 --> 01:01:10,000
I hope you are able to understand this.

1002
01:01:10,000 --> 01:01:16,000
And similarly, for the token "cat" and the others we can also do the calculation. I will leave it up to you how you

1003
01:01:16,000 --> 01:01:18,000
can probably do the calculation.

1004
01:01:18,000 --> 01:01:23,000
But again, for the token "cat" you can follow the same mechanism.

1005
01:01:23,000 --> 01:01:25,000
So here you have 0.1554.

1006
01:01:25,000 --> 01:01:30,000
Then you multiply it by the value vector of "the".

1007
01:01:30,000 --> 01:01:34,000
Then you have the second one, 0.4223, multiplied with the value vector of "cat".

1008
01:01:34,000 --> 01:01:37,000
And similarly when you do this you will be able to get all these things right.

1009
01:01:37,000 --> 01:01:43,000
So this, in short, was your entire self-attention uh, step by step.

1010
01:01:43,000 --> 01:01:48,000
What all things we did, how the calculation was done, everything I've explained to you. All of these weights

1011
01:01:48,000 --> 01:01:51,000
will be trained.

1012
01:01:51,000 --> 01:01:53,000
These are trained, learned weights.

1013
01:01:53,000 --> 01:01:55,000
I will say it as learned weights.

1014
01:01:55,000 --> 01:02:02,000
And we have to make it learn based on the back propagation, like how we specifically do for other neural

1015
01:02:02,000 --> 01:02:02,000
networks.

1016
01:02:02,000 --> 01:02:06,000
So I hope you are able to understand this particular video.

1017
01:02:06,000 --> 01:02:07,000
This is it from my side.

1018
01:02:07,000 --> 01:02:10,000
I will see you all in the next video.

1019
01:02:10,000 --> 01:02:10,000
Thank you.

1020
01:02:10,000 --> 01:02:11,000
Take care.

