1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue our discussion of Transformers.

3
00:00:03,000 --> 00:00:08,000
Now it's time to understand how decoders work.

4
00:00:08,000 --> 00:00:12,000
In our previous video we already covered encoders.

5
00:00:12,000 --> 00:00:17,000
We looked at the architecture in depth, the intuition, and how exactly the encoder works.

6
00:00:17,000 --> 00:00:24,000
Now let's go ahead and try to understand the decoder in Transformers.

7
00:00:24,000 --> 00:00:29,000
Here I will give you a plan of action for how we are going to cover these topics.

8
00:00:30,000 --> 00:00:34,000
So here you can see on the right hand side you have this entire decoder working.

9
00:00:34,000 --> 00:00:39,000
Now, inside just one decoder you have these three important components, okay.

10
00:00:39,000 --> 00:00:45,000
Masked multi-head attention, multi-head attention, and a feed-forward neural network.

11
00:00:45,000 --> 00:00:50,000
I hope you know about multi-head attention and how exactly it works in the encoder.

12
00:00:50,000 --> 00:00:55,000
You also know about the feed-forward neural network, but there is one piece of information that you will be

13
00:00:55,000 --> 00:00:57,000
able to see from the encoders.

14
00:00:57,000 --> 00:00:58,000
Right.

15
00:00:58,000 --> 00:01:01,000
And in the research paper we know that there are six encoders like this.

16
00:01:01,000 --> 00:01:05,000
So the output of the encoder is basically passed to the decoder's

17
00:01:05,000 --> 00:01:07,000
Multi-head attention.

18
00:01:07,000 --> 00:01:07,000
Okay.

19
00:01:07,000 --> 00:01:10,000
You will be able to see that one line over here.

20
00:01:10,000 --> 00:01:13,000
We will discuss what exactly this is in more detail, okay.

21
00:01:13,000 --> 00:01:16,000
So first of all let's go ahead and see the definition.

22
00:01:16,000 --> 00:01:22,000
So here it shows the transformer decoder is responsible for generating the output sequence one token

23
00:01:22,000 --> 00:01:28,000
at a time, using the encoder output and the previously generated tokens.

24
00:01:28,000 --> 00:01:31,000
So from here also we will be generating some tokens okay.

25
00:01:32,000 --> 00:01:39,000
When I talk about decoder there are three main components and we will be discussing them step by step.

26
00:01:39,000 --> 00:01:48,000
The first main component is masked multi-head self-attention.

27
00:01:49,000 --> 00:01:52,000
So we will try to understand about this.

28
00:01:52,000 --> 00:01:59,000
How does masked multi-head self-attention work, and why do we use it here? Okay, so this is the first

29
00:01:59,000 --> 00:02:00,000
important component.

30
00:02:00,000 --> 00:02:05,000
The second important component is multi-head attention.

31
00:02:05,000 --> 00:02:11,000
I hope everybody knows what multi-head attention is from the encoder, but we also refer to this

32
00:02:11,000 --> 00:02:15,000
as encoder-decoder attention.

33
00:02:18,000 --> 00:02:20,000
Encoder-decoder attention.

34
00:02:20,000 --> 00:02:26,000
So we will also try to understand this. Then comes the third one, which is called the feed

35
00:02:26,000 --> 00:02:28,000
forward neural network.

36
00:02:30,000 --> 00:02:38,000
So these are the most important components that you need to know about if you really want to

37
00:02:38,000 --> 00:02:39,000
understand decoders.

38
00:02:39,000 --> 00:02:44,000
Just to give you an idea: how exactly do the encoder and decoder work here?

39
00:02:44,000 --> 00:02:48,000
So let's say I am giving some information to my encoder.

40
00:02:48,000 --> 00:02:53,000
So this is my encoder, and you know what exactly is inside the encoder.

41
00:02:53,000 --> 00:03:01,000
So if I'm giving these inputs, let's say I want to translate this entire information into some

42
00:03:01,000 --> 00:03:02,000
other language.

43
00:03:02,000 --> 00:03:05,000
So what will happen whenever I give some kind of input?

44
00:03:05,000 --> 00:03:10,000
The output from the encoders will go to the decoder.

45
00:03:11,000 --> 00:03:15,000
And it should be able to give you the output one by one.

46
00:03:15,000 --> 00:03:19,000
So this will be my y1 word.

47
00:03:19,000 --> 00:03:20,000
This can be your y2 word.

48
00:03:20,000 --> 00:03:22,000
This can be your y3 word.

49
00:03:22,000 --> 00:03:25,000
This is the basic difference between encoder and decoder.

50
00:03:25,000 --> 00:03:28,000
Here x1, x2, x3.

51
00:03:28,000 --> 00:03:33,000
Whichever words I'm actually passing in, in the form of

52
00:03:33,000 --> 00:03:37,000
embedding vectors, these are all passed at once, okay.

53
00:03:37,000 --> 00:03:39,000
All at once.

54
00:03:40,000 --> 00:03:46,000
But when it comes to generation, you'll see that my output will be generated

55
00:03:46,000 --> 00:03:47,000
based on time step.

56
00:03:47,000 --> 00:03:50,000
So one word at a time. At t = 1,

57
00:03:50,000 --> 00:03:51,000
this will get generated. At t = 2,

58
00:03:51,000 --> 00:03:53,000
this will get generated. At t = 3,

59
00:03:53,000 --> 00:03:54,000
this will get generated, right.

60
00:03:54,000 --> 00:04:00,000
So that is the basic difference between encoder and decoder and how this is getting generated.
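Just as a rough sketch of this difference, assuming a toy stand-in for the decoder (toy_decoder_step, generate and the "<eos>" marker are hypothetical names, not a real model), the input goes in all at once and the output comes out one token per time step:
```python
# Toy sketch only: `toy_decoder_step` is a hypothetical stand-in for a real decoder.
def toy_decoder_step(encoder_output, generated):
    # Pretend "translation": emit the next source token in upper case.
    t = len(generated)
    return encoder_output[t].upper() if t < len(encoder_output) else "<eos>"
def generate(source_tokens, max_steps=10):
    encoder_output = list(source_tokens)  # the whole input is processed in one go
    generated = []
    for _ in range(max_steps):  # t = 1, 2, 3, ...
        next_token = toy_decoder_step(encoder_output, generated)
        if next_token == "<eos>":  # assumed end-of-sequence marker
            break
        generated.append(next_token)  # fed back to the decoder at the next step
    return generated
outputs = generate(["x1", "x2", "x3"])
```
Each new token depends on the encoder output and on the tokens generated so far.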

61
00:04:00,000 --> 00:04:01,000
We will talk about this.

62
00:04:01,000 --> 00:04:06,000
So we will try to understand the decoder through two mechanisms.

63
00:04:06,000 --> 00:04:15,000
One is the training mechanism and the other is the inference mechanism, okay.

64
00:04:15,000 --> 00:04:23,000
That is, during training, how this entire decoder is trained, and during inference,

65
00:04:23,000 --> 00:04:29,000
if I give any input to my encoder, how it translates it directly into

66
00:04:29,000 --> 00:04:30,000
some other words.

67
00:04:30,000 --> 00:04:35,000
Because while training we will also be giving our real output over here.

68
00:04:35,000 --> 00:04:35,000
Right.

69
00:04:35,000 --> 00:04:40,000
So let's say these are my y1 hat, y2 hat, y3 hat.

70
00:04:40,000 --> 00:04:43,000
Now I will give y1 over here, y2 over here, and y3 over here.

71
00:04:43,000 --> 00:04:47,000
And we will also do some kind of masking, or rather padding and masking.
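As a minimal sketch of what giving the real output during training can look like, under assumptions: the token names "<sos>", "<eos>" and "<pad>" and the helper make_decoder_pair are made up here for illustration and are not taken from this video.
```python
# Sketch of training-time decoder inputs; all token names are assumed for illustration.
def make_decoder_pair(target_tokens, max_len):
    decoder_input = ["<sos>"] + target_tokens  # real outputs, shifted right by one
    labels = target_tokens + ["<eos>"]         # what the decoder should predict at each step
    pad = max_len - len(decoder_input)
    padding_mask = [1] * len(decoder_input) + [0] * pad  # 0 marks padded positions
    return decoder_input + ["<pad>"] * pad, labels + ["<pad>"] * pad, padding_mask
decoder_input, labels, padding_mask = make_decoder_pair(["y1", "y2", "y3"], max_len=6)
```
Here the padding mask only marks which positions are real tokens; the masking inside attention is a separate step.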

72
00:04:47,000 --> 00:04:49,000
That is what we are going to discuss.

73
00:04:49,000 --> 00:04:49,000
Okay.

74
00:04:50,000 --> 00:04:56,000
So this is what we are specifically going to discuss in the next video.

75
00:04:56,000 --> 00:05:01,000
We are going to start with something called masking.

76
00:05:01,000 --> 00:05:05,000
The first important thing: masked multi-head attention.

77
00:05:05,000 --> 00:05:10,000
So let me just go ahead and write down the topics we will be discussing here.

78
00:05:10,000 --> 00:05:16,000
So here the first component that we are going to discuss is something called

79
00:05:18,000 --> 00:05:19,000
Masked.

80
00:05:21,000 --> 00:05:23,000
Multi-head.

81
00:05:24,000 --> 00:05:26,000
Self-attention.

82
00:05:27,000 --> 00:05:32,000
Okay, so we are going to discuss this topic, which is the first component in the decoder.

83
00:05:32,000 --> 00:05:32,000
Okay.

84
00:05:32,000 --> 00:05:37,000
And remember over here we will also go ahead and divide this.

85
00:05:37,000 --> 00:05:42,000
Let's say the first topic will be input embedding okay.

86
00:05:42,000 --> 00:05:46,000
And then we will again have this positional embedding.

87
00:05:49,000 --> 00:05:51,000
Positional embedding.

88
00:05:51,000 --> 00:05:57,000
In the second step, we will do the linear projection.

89
00:05:57,000 --> 00:06:03,000
Now I hope you know why linear projection is used: for calculating your Q, K, and V, okay.

90
00:06:03,000 --> 00:06:10,000
Third, we will be doing something called scaled dot-product attention.

91
00:06:12,000 --> 00:06:22,000
You also know why we need to do this, okay. Then over here we will do mask application.

92
00:06:22,000 --> 00:06:29,000
And this is the additional step in masked multi-head self-attention when compared to self-attention.

93
00:06:29,000 --> 00:06:39,000
In the fifth step, we will go ahead and create this multi-head attention mechanism.

94
00:06:41,000 --> 00:06:46,000
Then, coming to the sixth step, it is nothing but concatenation.

95
00:06:49,000 --> 00:06:54,000
And final linear projection.

96
00:06:58,000 --> 00:07:10,000
And the seventh important step is nothing but residual connection and layer normalization.

97
00:07:11,000 --> 00:07:19,000
So if you look at a self-attention model and if you look at masked multi-head self-attention.

98
00:07:19,000 --> 00:07:23,000
If I take multi-head self-attention, see, in your encoder

99
00:07:23,000 --> 00:07:25,000
Also, you had multi-head attention right?

100
00:07:25,000 --> 00:07:27,000
Multi-head self-attention.

101
00:07:27,000 --> 00:07:32,000
So what is the difference between multi-head self-attention and masked multi-head self-attention?

102
00:07:32,000 --> 00:07:37,000
The difference is that one more step, called mask application, is added.
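To make mask application concrete, here is a small single-head sketch in plain Python. It assumes the usual causal (look-ahead) mask, where position i may only attend to positions up to i; the function names and the toy q and k values are made up for illustration.
```python
import math
def causal_mask(n):
    # Position i may attend to positions 0..i only (lower-triangular mask).
    return [[j <= i for j in range(n)] for i in range(n)]
def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]
def masked_attention_weights(q, k):
    n, d_k = len(q), len(q[0])
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Scaled dot product of query i with every key.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k) for j in range(n)]
        # Mask application: future positions get -inf, so softmax gives them weight 0.
        scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values, assumed for the example
weights = masked_attention_weights(q, k)
```
Without the mask line this is ordinary scaled dot-product attention; with it, each row puts zero weight on later positions.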

103
00:07:37,000 --> 00:07:46,000
We will try to understand the importance of this.

104
00:07:46,000 --> 00:07:53,000
Okay, so in our next video let's go ahead and explore this masked multi-head self-attention step by

105
00:07:53,000 --> 00:07:56,000
step, by taking an example and seeing

106
00:07:56,000 --> 00:07:59,000
what the use of this masked multi-head self-attention is.

107
00:07:59,000 --> 00:07:59,000
Okay.

108
00:08:00,000 --> 00:08:06,000
And here you'll be able to understand the basic difference between these.

109
00:08:06,000 --> 00:08:06,000
Right.

110
00:08:06,000 --> 00:08:10,000
Because, see, over here you have multi-head attention, and here also you have multi-head attention.

111
00:08:10,000 --> 00:08:12,000
But what is the use of this masked multi-head

112
00:08:12,000 --> 00:08:13,000
attention?

113
00:08:13,000 --> 00:08:13,000
Right.

114
00:08:13,000 --> 00:08:17,000
And first of all we will try to understand it with respect to the training mechanism.

115
00:08:17,000 --> 00:08:17,000
Okay.

116
00:08:17,000 --> 00:08:22,000
And then I will also show how, during inference on new data, this masking

117
00:08:22,000 --> 00:08:24,000
will separately happen.

118
00:08:24,000 --> 00:08:26,000
So yes, this was it from my side.

119
00:08:26,000 --> 00:08:30,000
I will see you all in the next video where we'll discuss all these points.