1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:04,000
So we are going to continue the discussion with respect to our Transformers.

3
00:00:04,000 --> 00:00:09,000
Already in our previous video we have discussed till Self-attention, and we have also noted down all

4
00:00:09,000 --> 00:00:12,000
the steps that we specifically use in self-attention.

5
00:00:12,000 --> 00:00:18,000
So first of all, we go ahead and calculate our query, key, and value vectors.

6
00:00:18,000 --> 00:00:23,000
And these come from the learned weights, that is W_Q, W_K, and W_V.

7
00:00:23,000 --> 00:00:25,000
Then we calculate the attention score.

8
00:00:25,000 --> 00:00:27,000
Then we calculate our scaled scores.

9
00:00:28,000 --> 00:00:31,000
After doing that, we apply a softmax activation function.

10
00:00:31,000 --> 00:00:34,000
And finally we calculate the weighted sum of the values.

11
00:00:35,000 --> 00:00:39,000
And this is just done by multiplying with the value vectors V, okay.
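The five steps recapped above can be sketched in NumPy. This is only a minimal illustration, not the exact setup from the video: the sizes (3 tokens, embedding size 8, d_k = 4) and the random weights are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # step 1: query, key, value vectors
    scores = Q @ K.T                         # step 2: attention scores
    scaled = scores / np.sqrt(K.shape[-1])   # step 3: scale by sqrt(d_k)
    weights = softmax(scaled, axis=-1)       # step 4: softmax over each row
    return weights @ V, weights              # step 5: weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                  # hypothetical: 3 token embeddings of size 8
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))  # random stand-ins for learned weights
Z, A = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                               # one contextual vector per token
```

Each row of A sums to 1, so each row of Z is a weighted average of the value vectors.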

12
00:00:39,000 --> 00:00:44,000
So overall uh you'll be able to see that initially when we are giving some kind of embedding vectors,

13
00:00:44,000 --> 00:00:47,000
we should be able to get some kind of contextual vectors over here.

14
00:00:47,000 --> 00:00:48,000
Right.

15
00:00:48,000 --> 00:00:53,000
And this contextual vector is based on the context and the dependency of the other words that

16
00:00:53,000 --> 00:00:55,000
are present in that sentence.

17
00:00:55,000 --> 00:00:55,000
Okay.

18
00:00:55,000 --> 00:01:01,000
So this is one of the blogs that has some amazing representation diagrams.

19
00:01:01,000 --> 00:01:09,000
Uh, and uh, the URL specifically, if I probably go ahead and talk about the URL, I will just also

20
00:01:09,000 --> 00:01:10,000
show you the URL over here.

21
00:01:10,000 --> 00:01:12,000
Again, I really want to give the credits.

22
00:01:12,000 --> 00:01:15,000
So it is nothing but Jay Alammar.

23
00:01:15,000 --> 00:01:19,000
So these are his amazing diagrams.

24
00:01:19,000 --> 00:01:23,000
He has basically covered the working of the entire transformer.

25
00:01:23,000 --> 00:01:25,000
So you can definitely check it out and you can have a look onto this.

26
00:01:25,000 --> 00:01:28,000
Okay, now my idea is very much simple over here.

27
00:01:28,000 --> 00:01:34,000
What I really want to do is show again, with the help of a diagram, all the things we

28
00:01:34,000 --> 00:01:34,000
did.

29
00:01:34,000 --> 00:01:35,000
Right.

30
00:01:35,000 --> 00:01:38,000
So here you can see, this is one of the diagrams.

31
00:01:38,000 --> 00:01:44,000
And here in this video I'm also going to discuss something called multi-head attention.

32
00:01:45,000 --> 00:01:46,000
Okay.

33
00:01:46,000 --> 00:01:50,000
So this is what we are going to discuss: multi-head attention, okay.

34
00:01:50,000 --> 00:01:54,000
Now initially I have my words over here.

35
00:01:54,000 --> 00:01:55,000
Let's say this is my word one.

36
00:01:56,000 --> 00:01:57,000
What is the word like?

37
00:01:57,000 --> 00:02:01,000
This can be 'the', this can be 'cat', right.

38
00:02:01,000 --> 00:02:02,000
This will be first.

39
00:02:02,000 --> 00:02:05,000
Uh, all my vectors will be available over here.

40
00:02:05,000 --> 00:02:09,000
We are multiplying, or doing a dot operation, with W_Q.

41
00:02:09,000 --> 00:02:11,000
And this is how we get our query vectors.

42
00:02:11,000 --> 00:02:12,000
We get our key vectors.

43
00:02:12,000 --> 00:02:18,000
And remember, for each and every word that I have, I get separate query and key vectors.

44
00:02:18,000 --> 00:02:19,000
Right.

45
00:02:19,000 --> 00:02:21,000
And similarly separate value vectors.

46
00:02:21,000 --> 00:02:23,000
We separately get all these things right.

47
00:02:23,000 --> 00:02:32,000
And after we get these, what we do is, first of all, a multiplication operation with the query and

48
00:02:32,000 --> 00:02:33,000
the key, right?

49
00:02:33,000 --> 00:02:37,000
Then we divide it by the square root of d_k, which is the scaling step.

50
00:02:38,000 --> 00:02:40,000
Then we apply a softmax activation function.

51
00:02:40,000 --> 00:02:43,000
And finally we do a dot operation with our value vector.

52
00:02:43,000 --> 00:02:52,000
And that is how we get our z value, which is nothing but my self-attention output

53
00:02:52,000 --> 00:02:55,000
for the word 'the', right, the vector that we are specifically getting.

54
00:02:56,000 --> 00:02:59,000
Now, till here,

55
00:02:59,000 --> 00:03:00,000
I think everybody has followed this.

56
00:03:00,000 --> 00:03:04,000
And step by step, I have actually explained it, right.

57
00:03:04,000 --> 00:03:11,000
And we can actually call this one our attention head.

58
00:03:12,000 --> 00:03:17,000
Now, the idea with multi-head attention, what exactly is that?

59
00:03:17,000 --> 00:03:21,000
We will have a self-attention.

60
00:03:23,000 --> 00:03:27,000
Self-attention with multiple heads.

61
00:03:29,000 --> 00:03:30,000
With multi heads.

62
00:03:31,000 --> 00:03:31,000
Okay.

63
00:03:32,000 --> 00:03:33,000
Now this is fine.

64
00:03:33,000 --> 00:03:37,000
Here you can see I got one kind of vector for 'the', right.

65
00:03:37,000 --> 00:03:46,000
Similarly, can't you think that I can probably go ahead and create multiple attention heads for the

66
00:03:46,000 --> 00:03:47,000
same words.

67
00:03:47,000 --> 00:03:48,000
Right?

68
00:03:48,000 --> 00:03:49,000
So here you can see.

69
00:03:51,000 --> 00:03:56,000
I have taken 'Thinking Machines' as an example over here.

70
00:03:56,000 --> 00:03:58,000
So these are my embedding vectors.

71
00:03:58,000 --> 00:04:06,000
Initially we went and initialized some weights like W0_Q, W0_K, W0_V,

72
00:04:06,000 --> 00:04:07,000
right.

73
00:04:07,000 --> 00:04:13,000
And we calculated Q0, K0, and V0, right.

74
00:04:13,000 --> 00:04:19,000
And after this, what will happen is that, again with the same method, it

75
00:04:19,000 --> 00:04:24,000
will calculate the score, then it will go to the softmax.

76
00:04:24,000 --> 00:04:30,000
Then we multiply by v and finally we get our attention head.

77
00:04:30,000 --> 00:04:31,000
Attention head.

78
00:04:31,000 --> 00:04:32,000
Right.

79
00:04:32,000 --> 00:04:36,000
Or we get our final vectors after performing all these operations.

80
00:04:36,000 --> 00:04:37,000
Right.

81
00:04:37,000 --> 00:04:41,000
So we get our contextual vectors here.

82
00:04:41,000 --> 00:04:42,000
We get it right.

83
00:04:42,000 --> 00:04:49,000
Now these vectors may capture dependencies with respect to other words.

84
00:04:49,000 --> 00:04:51,000
Let's say in this example 'Thinking Machines' is there.

85
00:04:51,000 --> 00:04:51,000
Or.

86
00:04:51,000 --> 00:04:56,000
I can also take another example: 'the cat sat mat'.

87
00:04:56,000 --> 00:05:01,000
Okay, if these three are there, right, now finally you'll be able to see whatever vectors I get over

88
00:05:01,000 --> 00:05:02,000
here.

89
00:05:02,000 --> 00:05:05,000
This will be with respect to dependencies on this and this.

90
00:05:05,000 --> 00:05:05,000
Right.

91
00:05:05,000 --> 00:05:09,000
Let's consider that it has captured with respect to this and with respect to this.

92
00:05:09,000 --> 00:05:11,000
And it has probably given me this.

93
00:05:11,000 --> 00:05:17,000
Similarly, it can go and create another attention head where it will be able to capture some other words'

94
00:05:17,000 --> 00:05:18,000
importance.

95
00:05:18,000 --> 00:05:18,000
Right.

96
00:05:18,000 --> 00:05:24,000
And for this they will initialize another set of weights, like W1_Q, W1_K, and

97
00:05:24,000 --> 00:05:24,000
W1_V.

98
00:05:24,000 --> 00:05:28,000
And similarly they will be getting vectors like Q1, K1, and V1.

99
00:05:28,000 --> 00:05:32,000
So whatever vectors we are getting out of this, this may be different than this.

100
00:05:32,000 --> 00:05:34,000
This may be different than this.

101
00:05:34,000 --> 00:05:38,000
Here, some other context importance is getting captured.

102
00:05:38,000 --> 00:05:39,000
Some more other context.

103
00:05:40,000 --> 00:05:42,000
Uh, the importance is basically getting captured.

104
00:05:42,000 --> 00:05:44,000
So this can actually become my z0.

105
00:05:44,000 --> 00:05:45,000
This can become my z1.
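A minimal NumPy sketch of this idea: the same embeddings pass through two heads, each with its own weights, and give two different outputs z0 and z1. The two-token input, the sizes, and the random weights are assumptions for illustration only; in a real model the weights are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # One attention head: project, score, scale, softmax, weighted sum.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(42)
X = rng.normal(size=(2, 8))   # hypothetical embeddings for two tokens, e.g. "Thinking", "Machines"

heads = []
for i in range(2):
    # Each head gets its own randomly initialized W_Q, W_K, W_V (W0_* for head 0, W1_* for head 1).
    W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
    heads.append(attention_head(X, W_Q, W_K, W_V))

z0, z1 = heads
print(z0.shape, z1.shape)     # same shape, different contents
```

Because each head uses different weights, z0 and z1 describe the same tokens from different representation subspaces.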

106
00:05:45,000 --> 00:05:51,000
So similarly I can go ahead and create multiple attention heads.

107
00:05:51,000 --> 00:05:55,000
What is the main aim of these multiple attention heads?

108
00:05:55,000 --> 00:06:01,000
It will expand the model's ability to focus on different positions.

109
00:06:01,000 --> 00:06:03,000
So here I will go ahead and write.

110
00:06:03,000 --> 00:06:21,000
It expands the model's ability to focus on different positions of words, or different positions of tokens.

111
00:06:22,000 --> 00:06:23,000
This is what it says.

112
00:06:23,000 --> 00:06:23,000
Yeah.

113
00:06:24,000 --> 00:06:26,000
So here you can probably see right.

114
00:06:26,000 --> 00:06:33,000
z0 may contain a little bit of information from every other encoding, but it can be dominated by

115
00:06:33,000 --> 00:06:36,000
one word like 'sat', right here.

116
00:06:37,000 --> 00:06:39,000
So z0 may be like that.

117
00:06:39,000 --> 00:06:42,000
z1 may be different: it may focus more on different words.

118
00:06:42,000 --> 00:06:44,000
It can be dominated by a word like 'mat'.

119
00:06:44,000 --> 00:06:50,000
So if we have all this kind of combination right then what will happen is that we will be able to get

120
00:06:50,000 --> 00:06:52,000
more and more information to our models.

121
00:06:52,000 --> 00:06:52,000
Right?

122
00:06:53,000 --> 00:07:00,000
So here this attention layer provides multiple representation subspaces, okay.

123
00:07:00,000 --> 00:07:08,000
And in the case of multi-head attention, we have not only one but multiple sets of query, key, and value weight matrices.

124
00:07:09,000 --> 00:07:11,000
Multiple sets of query, key, and value weight matrices.

125
00:07:11,000 --> 00:07:12,000
Right.

126
00:07:12,000 --> 00:07:15,000
And each of these sets is randomly initialized.

127
00:07:15,000 --> 00:07:20,000
Then, after training, each set is used to project the input embeddings into a different representation

128
00:07:20,000 --> 00:07:22,000
subspace, okay.

129
00:07:22,000 --> 00:07:25,000
That basically means it will take a vector and try to transform it.

130
00:07:25,000 --> 00:07:31,000
So overall you will be able to see that by using this we may get this kind of diagram.

131
00:07:31,000 --> 00:07:32,000
See.

132
00:07:32,000 --> 00:07:33,000
So here.

133
00:07:36,000 --> 00:07:45,000
With a different set of Q, K, V we will be able to get this attention z0, with a different set of Q,

134
00:07:45,000 --> 00:07:46,000
K, V.

135
00:07:46,000 --> 00:07:54,000
And along with that, obviously, W_K, W_Q, W_V, these weights will also be different.

136
00:07:54,000 --> 00:07:58,000
Similarly, we can go ahead and calculate this attention, which will capture other important things in the

137
00:07:58,000 --> 00:08:00,000
token right with respect to the positional encoding.

138
00:08:00,000 --> 00:08:05,000
And then here you will be able to see different outputs.

139
00:08:05,000 --> 00:08:06,000
Right.

140
00:08:06,000 --> 00:08:13,000
So this is the entire information about Multi-head attention.

141
00:08:14,000 --> 00:08:20,000
So here we are not focusing on just one set of vectors with respect to this.

142
00:08:20,000 --> 00:08:25,000
But we'll have multiple sets of vectors, and each one of the vectors that we are getting, right,

143
00:08:25,000 --> 00:08:26,000
which we also call an attention head.

144
00:08:26,000 --> 00:08:31,000
It will have importance with respect to the other tokens that are available over here in the sentence.

145
00:08:31,000 --> 00:08:31,000
Okay.

146
00:08:32,000 --> 00:08:36,000
So yeah, this was it about Multi-head attention.

147
00:08:37,000 --> 00:08:39,000
The concept is very similar.

148
00:08:39,000 --> 00:08:44,000
Whatever steps you are doing, instead of just getting one attention head, we will try to get multiple

149
00:08:44,000 --> 00:08:49,000
attention heads, wherein we will try to train with multiple

150
00:08:49,000 --> 00:08:56,000
sets of queries, keys, and values by initializing W1_Q, W1_K, W1_V, right, the weights

151
00:08:56,000 --> 00:09:02,000
of Q, K, and V. And as many attention heads as there are, that many sets of weights

152
00:09:02,000 --> 00:09:03,000
can be computed in parallel, right.
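One way to picture the heads being computed in parallel is to stack every head's weights along a leading axis and let NumPy handle all heads in one call. This is only a sketch under assumed sizes (6 heads, 3 tokens, embedding size 8, d_k = 4) with random weights; it is not taken from the video.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, W_Q, W_K, W_V):
    # X: (tokens, d_model); W_Q, W_K, W_V: (heads, d_model, d_k)
    Q = np.einsum("nd,hdk->hnk", X, W_Q)   # queries for every head at once
    K = np.einsum("nd,hdk->hnk", X, W_K)   # keys for every head at once
    V = np.einsum("nd,hdk->hnk", X, W_V)   # values for every head at once
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(W_K.shape[-1])  # (heads, tokens, tokens)
    return softmax(scores, axis=-1) @ V    # (heads, tokens, d_k)

rng = np.random.default_rng(1)
n_heads, n_tokens, d_model, d_k = 6, 3, 8, 4   # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))

Z = multi_head(X, W_Q, W_K, W_V)
print(Z.shape)   # one (tokens, d_k) output per head
```

Z[0] plays the role of z0, Z[1] the role of z1, and so on; the result matches running each head separately in a loop.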

153
00:09:03,000 --> 00:09:08,000
So this was more about multi-head attention.

154
00:09:08,000 --> 00:09:12,000
So, multi-head attention, and yeah, in the next video...

155
00:09:12,000 --> 00:09:15,000
Now here you can see all the things we have specifically discussed.

156
00:09:15,000 --> 00:09:16,000
Right.

157
00:09:16,000 --> 00:09:17,000
If I go up.

158
00:09:17,000 --> 00:09:19,000
Oh my God I have written so many things for you all.

159
00:09:19,000 --> 00:09:25,000
So here you can see after performing the self attention, I need to provide all this information to

160
00:09:25,000 --> 00:09:27,000
my feedforward neural network.

161
00:09:27,000 --> 00:09:28,000
Right.

162
00:09:28,000 --> 00:09:32,000
But here you can probably see now how my diagram will change with multi-head attention.

163
00:09:32,000 --> 00:09:38,000
Right after this self-attention, I will be able to get something like this.

164
00:09:38,000 --> 00:09:40,000
Multiple heads of vectors.

165
00:09:40,000 --> 00:09:41,000
Right.

166
00:09:41,000 --> 00:09:42,000
So this will be my attention

167
00:09:42,000 --> 00:09:43,000
head zero.

168
00:09:43,000 --> 00:09:45,000
Attention head one.

169
00:09:45,000 --> 00:09:46,000
Attention head two.

170
00:09:46,000 --> 00:09:46,000
Attention

171
00:09:46,000 --> 00:09:47,000
head three.

172
00:09:47,000 --> 00:09:48,000
Attention

173
00:09:48,000 --> 00:09:49,000
head four.

174
00:09:49,000 --> 00:09:50,000
Attention head five.

175
00:09:50,000 --> 00:09:50,000
Right.

176
00:09:50,000 --> 00:09:51,000
Similarly, for this

177
00:09:51,000 --> 00:09:52,000
word,

178
00:09:52,000 --> 00:09:53,000
I may get other attention heads.

179
00:09:53,000 --> 00:09:57,000
Now all this I need to pass it to my feedforward neural network.

180
00:09:57,000 --> 00:10:02,000
So obviously the feedforward neural network expects just one encoding per word.

181
00:10:02,000 --> 00:10:04,000
So we need to combine these.

182
00:10:04,000 --> 00:10:09,000
And that is what we are going to see: how we are going to pass all these things to a feedforward

183
00:10:09,000 --> 00:10:12,000
neural network, and how the forward and the backward propagation is going to take place.

184
00:10:12,000 --> 00:10:16,000
That is what we are going to see in our next video.

185
00:10:16,000 --> 00:10:17,000
So I hope you like this particular video.

186
00:10:17,000 --> 00:10:18,000
I will see you all in the next video.

187
00:10:18,000 --> 00:10:19,000
Thank you.