1
00:00:00,000 --> 00:00:04,000
So guys, we are going to continue the discussion with respect to our Transformers.

2
00:00:04,000 --> 00:00:09,000
So already, inside an encoder, we have discussed exactly how self-attention works.

3
00:00:10,000 --> 00:00:17,000
And again, in our previous video we spoke about the query, key, and value vectors, and then how our

4
00:00:17,000 --> 00:00:21,000
attention score is calculated, and how the

5
00:00:21,000 --> 00:00:24,000
softmax activation function is applied.

6
00:00:24,000 --> 00:00:28,000
Everything we discussed, right? Now, after we get the vectors,

7
00:00:28,000 --> 00:00:36,000
Now see, in this scenario, when I spoke about multi-head attention, for every word I'll be

8
00:00:36,000 --> 00:00:38,000
getting like attention head one.

9
00:00:38,000 --> 00:00:46,000
So I'll be writing Z1, Z2, Z3, Z4, Z5, and so on; it can be seven to

10
00:00:46,000 --> 00:00:47,000
eight Z's, right?

11
00:00:47,000 --> 00:00:54,000
So if I say there are eight attention heads, I'll be getting eight vectors, Z1, Z2, Z3, Z4, and so on,

12
00:00:54,000 --> 00:00:57,000
these kinds of vectors, by passing this specific word.

13
00:00:58,000 --> 00:01:03,000
Now the next step will be that I need to pass all this to my feed forward network.

14
00:01:03,000 --> 00:01:09,000
The feed-forward neural network, so that I'll be able to do the further steps inside this particular

15
00:01:09,000 --> 00:01:10,000
encoder.

16
00:01:10,000 --> 00:01:16,000
Now, the problem is that in the diagram we drew before, I said that, hey, we are

17
00:01:16,000 --> 00:01:17,000
just passing one word.

18
00:01:17,000 --> 00:01:18,000
I'll be getting one vector,

19
00:01:18,000 --> 00:01:19,000
right?

20
00:01:19,000 --> 00:01:21,000
But here we are getting multiple vectors.

21
00:01:21,000 --> 00:01:26,000
So before we pass through the feed forward neural network, we need to combine all these specific vectors

22
00:01:26,000 --> 00:01:28,000
and send it to the feed forward neural network.

23
00:01:28,000 --> 00:01:31,000
And then probably do the forward and the backward propagation.

24
00:01:32,000 --> 00:01:38,000
So the feed-forward neural network is expecting a single matrix.

25
00:01:38,000 --> 00:01:44,000
So we need to find a way to condense all these attention heads into a single matrix, along

26
00:01:44,000 --> 00:01:49,000
with a single weight matrix, and then do the entire task.

27
00:01:49,000 --> 00:01:53,000
So here you'll be able to see that right now I have multiple attention heads, like this:

28
00:01:53,000 --> 00:01:56,000
attention head zero, attention head one, up to seven.

29
00:01:56,000 --> 00:01:58,000
And here I have all the values, right?

30
00:01:58,000 --> 00:02:02,000
I have over here, I have over here, I have over here.

31
00:02:02,000 --> 00:02:04,000
All these values are specifically there.

32
00:02:04,000 --> 00:02:04,000
Okay.

33
00:02:05,000 --> 00:02:08,000
And when I say values, here you have this Q, K, V.

34
00:02:08,000 --> 00:02:13,000
Here you have a separate Q, K, V, right? Here also you have a separate Q, K, V.

35
00:02:13,000 --> 00:02:19,000
Now the problem is: how do we combine all these things before we forward them to the feed-forward

36
00:02:19,000 --> 00:02:20,000
neural network.

37
00:02:20,000 --> 00:02:22,000
So for that here you can see an amazing diagram.

38
00:02:22,000 --> 00:02:25,000
We concatenate all the attention heads like this.

39
00:02:25,000 --> 00:02:28,000
So Z0, Z1, Z2, Z3, Z4, Z5, Z6,

40
00:02:28,000 --> 00:02:35,000
Z7. We are considering that there are eight attention heads, okay? So we combine all of them.

41
00:02:35,000 --> 00:02:40,000
We multiply it, that is, we do a matrix product with W^O.

42
00:02:41,000 --> 00:02:47,000
So this will be my W^O, the output weight matrix, a new weight that I initialize and apply before my

43
00:02:48,000 --> 00:02:50,000
feed-forward neural network.

44
00:02:51,000 --> 00:02:54,000
And after this we go ahead and calculate our final Z.
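
The concatenate-then-multiply step can be sketched in a few lines. This is only a minimal illustration, assuming NumPy; the sizes (`seq_len`, `d_head`, `d_model`) and the random values are made up for the example, and `w_o` stands for the output weight matrix (W^O in the original Transformer notation), which in a real model is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

num_heads = 8   # Z0 ... Z7, as in the diagram
seq_len = 2     # number of words (illustrative)
d_head = 3      # assumed size of each head's output vector (illustrative)
d_model = 4     # assumed size of the vector the next layer expects (illustrative)

# One Z matrix per attention head, each of shape (seq_len, d_head).
# Random values stand in for the real attention outputs.
z_heads = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]

# Concatenate the heads side by side: (seq_len, num_heads * d_head).
z_concat = np.concatenate(z_heads, axis=-1)

# The output weight matrix is trainable; here it is only randomly initialized.
w_o = rng.normal(size=(num_heads * d_head, d_model))

# One matrix multiplication gives the final Z, with one row per word.
z_final = z_concat @ w_o

print(z_concat.shape)  # (2, 24)
print(z_final.shape)   # (2, 4)
```

Whatever the number of heads, the result is again a single matrix with one row per word, which is what the next layer expects.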

45
00:02:55,000 --> 00:02:55,000
Okay.

46
00:02:55,000 --> 00:02:59,000
So after this, whatever Z we are getting over here:

47
00:02:59,000 --> 00:03:01,000
This will be a single Z.

48
00:03:01,000 --> 00:03:03,000
This will be another Z.

49
00:03:03,000 --> 00:03:05,000
This will be another Z.

50
00:03:05,000 --> 00:03:05,000
Right.

51
00:03:05,000 --> 00:03:09,000
And this is what we are going to pass on to the feed-forward network in our encoder.

52
00:03:09,000 --> 00:03:09,000
Right.

53
00:03:09,000 --> 00:03:15,000
So all the multiple attention heads, we are just going to combine them together to make sure that we capture

54
00:03:15,000 --> 00:03:17,000
all the specific importance.

55
00:03:17,000 --> 00:03:19,000
We give it to the feed forward neural network.

56
00:03:19,000 --> 00:03:23,000
And along with this, we train a specific weight matrix.

57
00:03:23,000 --> 00:03:24,000
And this is just for one word.

58
00:03:24,000 --> 00:03:28,000
Just imagine the one word that I've given is "the", right?

59
00:03:28,000 --> 00:03:29,000
Or it can be cat.

60
00:03:29,000 --> 00:03:35,000
So from this "cat", by using self-attention, I've created so many different heads, right?

61
00:03:35,000 --> 00:03:36,000
By using.

62
00:03:36,000 --> 00:03:39,000
And all these heads will have separate

63
00:03:39,000 --> 00:03:40,000
query, key, and value vectors.

64
00:03:40,000 --> 00:03:42,000
Everything will be separate, right?

65
00:03:42,000 --> 00:03:43,000
Then.

66
00:03:43,000 --> 00:03:47,000
Along with this, I'm taking a weight matrix and forwarding the result to the feed-forward neural network.

67
00:03:47,000 --> 00:03:51,000
Then we'll train it to finally get our output, which is called Z.

68
00:03:51,000 --> 00:03:51,000
Okay.

69
00:03:52,000 --> 00:03:56,000
And this is pretty much all there is to multi-head self-attention.

70
00:03:56,000 --> 00:03:57,000
It is a handful.

71
00:03:57,000 --> 00:03:59,000
It is a handful of matrices.

72
00:03:59,000 --> 00:03:59,000
Okay.

73
00:03:59,000 --> 00:04:02,000
And you'll be able to do it in this way.

74
00:04:03,000 --> 00:04:07,000
Overall, if you really want to check one more amazing diagram again from the same blog, I will also

75
00:04:07,000 --> 00:04:09,000
draw it for you over here.

76
00:04:09,000 --> 00:04:13,000
And this is how the things happen right from the entire self-attention till the feed forward neural

77
00:04:13,000 --> 00:04:14,000
network.

78
00:04:14,000 --> 00:04:23,000
Okay, so here you can see that from this part to this part, right from here or from here, the entire

79
00:04:23,000 --> 00:04:25,000
feed forward neural network actually takes place.

80
00:04:25,000 --> 00:04:30,000
So here are the two words that I have given: "Thinking Machines".

81
00:04:30,000 --> 00:04:34,000
Then for all these words you have these separate weights,

82
00:04:34,000 --> 00:04:39,000
so that we'll be able to get our separate Q, K, and V vectors.

83
00:04:39,000 --> 00:04:40,000
And finally you get your Z0.

84
00:04:40,000 --> 00:04:42,000
This is for the first attention head.

85
00:04:42,000 --> 00:04:44,000
Then again, for this you have separate vectors.

86
00:04:44,000 --> 00:04:47,000
For this you have separate vectors, right?

87
00:04:47,000 --> 00:04:53,000
And for all the attention heads, you will have separate

88
00:04:53,000 --> 00:04:55,000
query, key, and value vectors.

89
00:04:55,000 --> 00:05:00,000
And finally you take this weight matrix, which is trained along with the feed-forward neural network, to finally

90
00:05:00,000 --> 00:05:01,000
get your Z.

91
00:05:01,000 --> 00:05:03,000
So this was my initial vector.

92
00:05:03,000 --> 00:05:06,000
This is what is my final vector that we are specifically getting.

93
00:05:06,000 --> 00:05:07,000
Okay.

94
00:05:07,000 --> 00:05:08,000
So this is our input sentence.

95
00:05:08,000 --> 00:05:10,000
We embed each word, then split into eight heads.

96
00:05:10,000 --> 00:05:14,000
We multiply X or R with the weight matrices.

97
00:05:14,000 --> 00:05:17,000
You calculate the attention using the resulting Q, K, and V matrices.

98
00:05:17,000 --> 00:05:23,000
Then you concatenate the resulting Z matrices and multiply with the weight matrix W^O to produce the

99
00:05:23,000 --> 00:05:24,000
output of the layer.
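
The steps just listed can be put together in one short sketch. It is a minimal NumPy illustration under stated assumptions, not a reference implementation: the function name `multi_head_self_attention`, the sizes, and the random weights are all made up for the example, and the scores are scaled by the square root of the head size as in the original Transformer formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq_len, d_model); w_q, w_k, w_v: (num_heads, d_model, d_head);
    w_o: (num_heads * d_head, d_model). Returns (seq_len, d_model)."""
    z_heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        # Each head has its own weights, hence its own Q, K, and V.
        q, k, v = x @ wq, x @ wk, x @ wv
        # Scaled dot-product scores, then softmax over the words attended to.
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        z_heads.append(weights @ v)  # this head's Z, shape (seq_len, d_head)
    # Concatenate all Z matrices and multiply with the output weight matrix.
    return np.concatenate(z_heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 2, 16, 8  # two words, eight heads (sizes assumed)
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))  # stand-in embeddings for the two words
w_q = rng.normal(size=(num_heads, d_model, d_head))
w_k = rng.normal(size=(num_heads, d_model, d_head))
w_v = rng.normal(size=(num_heads, d_model, d_head))
w_o = rng.normal(size=(num_heads * d_head, d_model))

z = multi_head_self_attention(x, w_q, w_k, w_v, w_o)
print(z.shape)  # (2, 16)
```

The output has the same shape as the input, one vector per word, so it can be handed to the next layer unchanged in size.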

100
00:05:24,000 --> 00:05:27,000
So with respect to one word I will be able to get this.

101
00:05:28,000 --> 00:05:29,000
Like, let's say "Thinking" is there;

102
00:05:29,000 --> 00:05:32,000
then I'll be able to get this vector. "Machines" is there;

103
00:05:32,000 --> 00:05:34,000
I will be able to get this vector, right?

104
00:05:34,000 --> 00:05:39,000
So this overall explains the entire multi-head attention.

105
00:05:39,000 --> 00:05:47,000
The reason for having multiple attention heads is something I really want to show you with one practical example.

106
00:05:47,000 --> 00:05:47,000
Okay.

107
00:05:47,000 --> 00:05:49,000
So here you can see.

108
00:05:50,000 --> 00:05:53,000
I have over here one visualization diagram.

109
00:05:53,000 --> 00:05:55,000
I'll be giving this link in the resources section.

110
00:05:55,000 --> 00:05:56,000
Okay.

111
00:05:56,000 --> 00:05:59,000
Here you'll be able to see that with respect to every word.

112
00:05:59,000 --> 00:05:59,000
Right.

113
00:06:00,000 --> 00:06:01,000
And uh okay.

114
00:06:01,000 --> 00:06:02,000
This will not be a good diagram.

115
00:06:02,000 --> 00:06:04,000
So here you can see I have multiple attention.

116
00:06:04,000 --> 00:06:05,000
So this can be my attention

117
00:06:05,000 --> 00:06:07,000
head one, attention head two,

118
00:06:07,000 --> 00:06:07,000
attention head three.

119
00:06:07,000 --> 00:06:13,000
If I click on each and every attention head, you'll be able to see that every attention head gives importance to

120
00:06:13,000 --> 00:06:15,000
different words.

121
00:06:15,000 --> 00:06:15,000
Right?

122
00:06:15,000 --> 00:06:19,000
It is trying to capture much importance with respect to all the words that we have.

123
00:06:20,000 --> 00:06:25,000
So if I go ahead and click this in my other attention, this can be my other attention.

124
00:06:25,000 --> 00:06:27,000
This can be another attention head.

125
00:06:27,000 --> 00:06:30,000
You know, it shows the importance of each and every word.

126
00:06:30,000 --> 00:06:37,000
So if I just go and highlight over here, see, "lay" is very highly dependent on "on".

127
00:06:37,000 --> 00:06:37,000
Right.

128
00:06:37,000 --> 00:06:41,000
So the vector that defines this "lay" will depend specifically on "on", right?

129
00:06:41,000 --> 00:06:47,000
And similarly, if I go ahead and click on other vectors, let's say "on" is there, "cat" is there; see, "mat" is

130
00:06:47,000 --> 00:06:51,000
there, the separator is there; "cat" and "mat" are highly correlated.

131
00:06:51,000 --> 00:06:51,000
Right.

132
00:06:51,000 --> 00:06:55,000
Correlated basically means there is a strong contextual dependency between them.

133
00:06:55,000 --> 00:06:56,000
Right.

134
00:06:57,000 --> 00:06:59,000
So this actually gives an idea.

135
00:06:59,000 --> 00:07:03,000
And here also you can go ahead and check it out with respect to sentence A to A, or sentence A to B.

136
00:07:03,000 --> 00:07:06,000
Let's say "The cat sat on the mat",

137
00:07:06,000 --> 00:07:08,000
"The cat lay on the rug".

138
00:07:08,000 --> 00:07:11,000
If I want to see whether there are any dependencies:

139
00:07:11,000 --> 00:07:12,000
so "cat" has a dependency on "cat".

140
00:07:12,000 --> 00:07:16,000
If I consider sentence A to A, "The cat sat on the mat",

141
00:07:16,000 --> 00:07:21,000
see, for "on", "sat" is the first word that it has a dependency on.

142
00:07:21,000 --> 00:07:25,000
So the main idea of self-attention is to get all of this, right?

143
00:07:25,000 --> 00:07:28,000
And with respect to the different heads, we will be able to get this.

144
00:07:28,000 --> 00:07:31,000
There are some more examples, which I will be putting here.

145
00:07:31,000 --> 00:07:32,000
Right.

146
00:07:32,000 --> 00:07:33,000
You can probably check it out.

147
00:07:33,000 --> 00:07:37,000
So this is one attention head which you will be able to see.

148
00:07:37,000 --> 00:07:39,000
So here I have a sentence.

149
00:07:39,000 --> 00:07:46,000
"The animal didn't cross the street because it was too tired." Didn't cross the street; it was too

150
00:07:46,000 --> 00:07:46,000
tired.

151
00:07:46,000 --> 00:07:52,000
So here you can see, when I highlight "it", it is dependent on "street", "tired", and "animal".

152
00:07:52,000 --> 00:07:52,000
Right.

153
00:07:52,000 --> 00:07:54,000
So some more words are specifically here.

154
00:07:54,000 --> 00:07:58,000
Similarly, if I go ahead and look at the other attention heads, right,

155
00:07:58,000 --> 00:08:01,000
Once we calculate, you will also be able to see this kind of context.

156
00:08:02,000 --> 00:08:07,000
So here "it" is dependent on "the animal" and, shown in a different color, on "cross", right?

157
00:08:07,000 --> 00:08:12,000
So multiple attention heads basically means you are able to get information from multiple heads

158
00:08:12,000 --> 00:08:15,000
with respect to the dependency on all the other words.
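
To see that each head really produces its own attention pattern, here is a small sketch, again assuming NumPy. The embeddings and weights are random stand-ins (nothing is trained), so which word gets the highest weight here is meaningless; the only point is that every head yields a different weight distribution over the same words.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
d_model, d_head, num_heads = 8, 4, 2  # illustrative sizes, not from the video

# Random stand-in embeddings, one row per token.
x = rng.normal(size=(len(tokens), d_model))

head_weights = []
for _ in range(num_heads):
    # Each head gets its own (here random, normally learned) Q and K weights.
    w_q = rng.normal(size=(d_model, d_head))
    w_k = rng.normal(size=(d_model, d_head))
    q, k = x @ w_q, x @ w_k
    # Row i holds how much token i attends to every token in the sentence.
    head_weights.append(softmax(q @ k.T / np.sqrt(d_head), axis=-1))

row = tokens.index("cat")
for h, w in enumerate(head_weights):
    top = tokens[int(np.argmax(w[row]))]
    print(f"head {h}: 'cat' attends most to '{top}'")
```

In a trained model these per-head weights are what the visualization tool colors in when you highlight a word.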

159
00:08:15,000 --> 00:08:22,000
So this was it, where we have specifically discussed multi-head attention.

160
00:08:22,000 --> 00:08:24,000
We have discussed self-attention and all.

161
00:08:24,000 --> 00:08:26,000
Still, there are many more videos that are going to come.

162
00:08:26,000 --> 00:08:27,000
Couple of videos.

163
00:08:27,000 --> 00:08:30,000
One is positional encoding, which we are going to discuss.

164
00:08:30,000 --> 00:08:35,000
And after that we will also go ahead and discuss the entire Transformer architecture, because

165
00:08:35,000 --> 00:08:38,000
we are still just understanding some of the components of the Transformer.

166
00:08:38,000 --> 00:08:38,000
Right.

167
00:08:38,000 --> 00:08:40,000
So yes, this was it from my side.

168
00:08:40,000 --> 00:08:41,000
I hope you liked this particular video.

169
00:08:41,000 --> 00:08:42,000
I'll see you all in the next video.

170
00:08:42,000 --> 00:08:42,000
Thank you.