1
00:00:00,000 --> 00:00:05,000
So guys, now let me show you an example that covers all the steps that

2
00:00:05,000 --> 00:00:07,000
we have discussed in layer normalization.

3
00:00:07,000 --> 00:00:07,000
Okay.

4
00:00:07,000 --> 00:00:12,000
So for this example, as I've already mentioned, we will be taking a single token in a sequence

5
00:00:12,000 --> 00:00:16,000
going through a layer normalization process within a transformer layer.

6
00:00:16,000 --> 00:00:17,000
Okay.

7
00:00:17,000 --> 00:00:20,000
So as the initial step, we have initialized some values over here, okay.

8
00:00:20,000 --> 00:00:21,000
Token embeddings.

9
00:00:21,000 --> 00:00:23,000
Let's say this is the token for cat.

10
00:00:23,000 --> 00:00:25,000
It is represented by this vector.

11
00:00:25,000 --> 00:00:27,000
We have initialized some parameters okay.

12
00:00:27,000 --> 00:00:35,000
And about these parameters: this gamma is your learned

13
00:00:35,000 --> 00:00:37,000
scale parameter.

14
00:00:39,000 --> 00:00:39,000
Okay.

15
00:00:39,000 --> 00:00:42,000
And beta is basically your shift parameter.

16
00:00:42,000 --> 00:00:47,000
So that is the reason we call it scale and shift.

17
00:00:47,000 --> 00:00:47,000
Right.

18
00:00:47,000 --> 00:00:54,000
Together, these are called the scale and shift parameters.

19
00:00:56,000 --> 00:00:59,000
Now let's go ahead with the step by step calculation.

20
00:01:00,000 --> 00:01:05,000
So for this vector, first of all, we will go ahead and compute the mean.

21
00:01:06,000 --> 00:01:13,000
Because if you remember the z-score formula, the z-score is equal to x of i minus mu, divided

22
00:01:13,000 --> 00:01:15,000
by the standard deviation, okay.

23
00:01:15,000 --> 00:01:17,000
So here if I go ahead and compute the mean.

24
00:01:17,000 --> 00:01:19,000
So it will be nothing but one by four times the sum of the values.

25
00:01:19,000 --> 00:01:26,000
Let us take all the vector's values: 2.0, 4.0, 6.0, 8.0.

26
00:01:26,000 --> 00:01:33,000
And let's go ahead and compute this: 20.0 divided by four, which is nothing but 5.0.

27
00:01:33,000 --> 00:01:33,000
Okay.

28
00:01:34,000 --> 00:01:39,000
Next, let's go ahead and compute the variance.

29
00:01:42,000 --> 00:01:45,000
Compute the variance.

30
00:01:45,000 --> 00:01:47,000
So variance is nothing but your sigma square.

31
00:01:47,000 --> 00:01:49,000
So for that how do we compute it.

32
00:01:49,000 --> 00:01:50,000
It is very simple.

33
00:01:50,000 --> 00:01:53,000
Sigma square is equal to one by four.

34
00:01:53,000 --> 00:01:57,000
Then we take each and every number, okay.

35
00:01:57,000 --> 00:02:00,000
And we subtract the mean from it and square the result.

36
00:02:00,000 --> 00:02:12,000
So the first term is 2.0 minus 5.0, which is my mean, whole square, plus 4.0 minus 5.0 whole square, plus 6.0 minus 5.0

37
00:02:12,000 --> 00:02:17,000
whole square, plus 8.0 minus 5.0 whole square.

38
00:02:17,000 --> 00:02:19,000
So this is how you calculate the variance.

39
00:02:19,000 --> 00:02:23,000
So overall, nine plus one plus one plus nine is 20, divided by four, so you get your variance as 5.0, okay.

40
00:02:23,000 --> 00:02:28,000
So now here you can see your mean and your variance are both 5.0, okay.
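
The mean and variance steps above can be checked with a few lines of Python. This is only a minimal sketch using the example vector from this walkthrough; the variable names are chosen here for illustration.

```python
# Example token vector from the walkthrough (the "cat" embedding).
x = [2.0, 4.0, 6.0, 8.0]

# Step 1: mean over the feature dimension.
mean = sum(x) / len(x)

# Step 2: variance, dividing by N (not N - 1), as in the walkthrough.
variance = sum((v - mean) ** 2 for v in x) / len(x)

print(mean, variance)  # 5.0 5.0
```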

41
00:02:28,000 --> 00:02:33,000
So now in the third step we are going to normalize all our inputs okay.

42
00:02:33,000 --> 00:02:34,000
Third step.

43
00:02:34,000 --> 00:02:39,000
Let's go ahead and normalize the inputs.

44
00:02:39,000 --> 00:02:48,000
Now in the case of normalizing the input, it is x i hat, which is nothing but x of i minus the mean, divided by

45
00:02:48,000 --> 00:02:55,000
the root, or see, what we will basically be doing is that we will be dividing

46
00:02:55,000 --> 00:02:57,000
this by the standard deviation.

47
00:02:57,000 --> 00:03:01,000
But instead of just writing standard deviation, I will just go ahead and give a new representation

48
00:03:01,000 --> 00:03:04,000
the square root of the variance plus an epsilon value, okay.

49
00:03:04,000 --> 00:03:08,000
Now for this epsilon, we will be using a very small value.

50
00:03:08,000 --> 00:03:11,000
Let's say 1e-5, that is, ten to the power of minus five, okay.

51
00:03:11,000 --> 00:03:18,000
And this is used to avoid division by zero okay.

52
00:03:18,000 --> 00:03:18,000
Okay.

53
00:03:19,000 --> 00:03:21,000
This is basically used to avoid division by zero.

54
00:03:21,000 --> 00:03:32,000
So here if I just go ahead and see and calculate this, it is nothing but root of 5.0 plus one e minus

55
00:03:32,000 --> 00:03:42,000
five, which is nothing but the root of 5.00001, which is approximately 2.236.

56
00:03:43,000 --> 00:03:47,000
Okay, so this is my 2.236 over here.

57
00:03:48,000 --> 00:03:51,000
And this is the value that goes in my denominator.

58
00:03:51,000 --> 00:03:55,000
Now we will go ahead and normalize each and every value in the vector.

59
00:03:55,000 --> 00:04:05,000
So the first normalized value will be x one hat, okay, and I will write 2.0 minus 5.0, which

60
00:04:05,000 --> 00:04:05,000
is my mean.

61
00:04:05,000 --> 00:04:09,000
And here we are just going to divide by 2.236.

62
00:04:09,000 --> 00:04:11,000
This is approximately equal to -1.34.

63
00:04:11,000 --> 00:04:21,000
Similarly, for x two hat, I will take 4.0 minus 5.0 divided by 2.236, which will be approximately

64
00:04:21,000 --> 00:04:24,000
equal to -0.45.

65
00:04:24,000 --> 00:04:29,000
And similarly, coming to the next value, that is x three hat, the third value.

66
00:04:29,000 --> 00:04:37,000
Here again we will go ahead and write 6.0 minus 5.0 divided by 2.236, which will be approximately equal

67
00:04:37,000 --> 00:04:50,000
to 0.45, and x four hat will be nothing but 8.0 minus 5.0 divided by 2.236, which will be approximately equal to 1.34.

68
00:04:50,000 --> 00:04:53,000
Okay, so these are my outputs.

69
00:04:53,000 --> 00:04:55,000
And this is my normalized vector.

70
00:04:55,000 --> 00:04:58,000
So here I have my normalized vector.

71
00:04:59,000 --> 00:05:01,000
But still we need to add something very important.

72
00:05:01,000 --> 00:05:08,000
So the normalized vector, my x hat, will be nothing but -1.34, -0.45,

73
00:05:09,000 --> 00:05:13,000
0.45, 1.34.

74
00:05:14,000 --> 00:05:14,000
Okay.
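
The normalization step can be reproduced in plain Python as well. This is a minimal sketch that assumes the same example vector and the epsilon of 1e-5 mentioned above; the names `x_hat` and `std` are chosen here for illustration.

```python
import math

x = [2.0, 4.0, 6.0, 8.0]
eps = 1e-5  # small constant that avoids division by zero

mean = sum(x) / len(x)
variance = sum((v - mean) ** 2 for v in x) / len(x)

# Denominator: sqrt(variance + eps), roughly 2.236 for this vector.
std = math.sqrt(variance + eps)

# Step 3: subtract the mean and divide by the denominator.
x_hat = [(v - mean) / std for v in x]

print([round(v, 2) for v in x_hat])  # [-1.34, -0.45, 0.45, 1.34]
```

Note that the normalized values are symmetric around zero, so they sum to approximately zero.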

75
00:05:16,000 --> 00:05:20,000
Finally, let's go ahead and apply our fourth step.

76
00:05:20,000 --> 00:05:25,000
We still have not applied the gamma and beta parameters, which are the scale and shift.

77
00:05:25,000 --> 00:05:30,000
So here we will go ahead and apply scale and shift.

78
00:05:31,000 --> 00:05:41,000
Now with respect to scale and shift, I will go ahead and write y of i equals gamma of i times x hat of i plus beta of

79
00:05:41,000 --> 00:05:41,000
i.

80
00:05:41,000 --> 00:05:41,000
Right.

81
00:05:41,000 --> 00:05:48,000
You know, your gamma is nothing but 1.0, 1.0, 1.0, 1.0.

82
00:05:48,000 --> 00:05:50,000
So we are just going to scale and shift.

83
00:05:50,000 --> 00:05:56,000
Okay, beta is nothing but 0.0, 0.0, 0.0, 0.0.

84
00:05:57,000 --> 00:05:58,000
Okay.

85
00:05:58,000 --> 00:06:03,000
And again, in order to just do the scaling and shifting, obviously just by seeing this particular

86
00:06:03,000 --> 00:06:12,000
values, you are going to get the same x hat as the final y output, which is nothing but -1.34, -0.45,

87
00:06:12,000 --> 00:06:16,000
0.45 and 1.34.
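
All four steps can be put together in one small function. This is a sketch under stated assumptions, not a library implementation: `layer_norm` is a name chosen here, and the second call uses made-up gamma and beta values (2.0 and 0.5) purely to show what a non-trivial scale and shift would do.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization of one token vector, followed by scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(variance + eps)
    # y_i = gamma_i * x_hat_i + beta_i
    return [g * (v - mean) / std + b for v, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]

# Gamma of ones and beta of zeros, as in the walkthrough: output equals x_hat.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 2) for v in y])         # [-1.34, -0.45, 0.45, 1.34]

# Hypothetical learned values, for illustration only.
y_scaled = layer_norm(x, gamma=[2.0] * 4, beta=[0.5] * 4)
print([round(v, 2) for v in y_scaled])  # [-2.18, -0.39, 1.39, 3.18]
```

With gamma at 1.0 and beta at 0.0 the distribution is left as it is; any other values stretch and move it.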

88
00:06:16,000 --> 00:06:20,000
So this is the scenario when we initialize gamma and beta with these particular values, and this is what I explained.

89
00:06:20,000 --> 00:06:26,000
If I say, hey, this distribution is important, then these values will remain as they are

90
00:06:26,000 --> 00:06:28,000
and we will be getting the same value.

91
00:06:28,000 --> 00:06:32,000
Let's say you want to change this distribution; then some other gamma and beta values will be present over

92
00:06:32,000 --> 00:06:32,000
here.

93
00:06:32,000 --> 00:06:35,000
And based on that, we multiply by and add those particular values.

94
00:06:35,000 --> 00:06:36,000
Right.

95
00:06:36,000 --> 00:06:42,000
So this was the entire understanding of how layer normalization works.

96
00:06:42,000 --> 00:06:47,000
And we have applied this to one specific vector. As you know, on this vector we apply positional

97
00:06:47,000 --> 00:06:48,000
encodings.

98
00:06:48,000 --> 00:06:53,000
Then, after that, we go ahead with self-attention, multi-head attention and all.

99
00:06:53,000 --> 00:06:57,000
And then finally, whatever output we are getting, we add it, right.

100
00:06:57,000 --> 00:07:01,000
At the end of the day, that is what we see in this particular architecture.

101
00:07:01,000 --> 00:07:07,000
Now similarly, once we do this normalization, you can see over here we are passing this information

102
00:07:07,000 --> 00:07:08,000
to the feed forward neural network.

103
00:07:08,000 --> 00:07:16,000
But along with that, we are also going to pass this entire signal to the next add and normalization,

104
00:07:16,000 --> 00:07:20,000
where I will be adding it back to the output of the feed forward neural network.

105
00:07:20,000 --> 00:07:22,000
And again we will be performing normalization.
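
The add and normalization step described here can be sketched as follows. This is only an illustration: the sublayer output values are made up, the function names are chosen here, and gamma and beta are left at 1.0 and 0.0 so they drop out of the formula.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer normalization with gamma = 1 and beta = 0."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(variance + eps)
    return [(v - mean) / std for v in x]

def add_and_norm(x, sublayer_out):
    """Add the sublayer output back to its input, then normalize."""
    summed = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(summed)

x = [2.0, 4.0, 6.0, 8.0]             # input to the sublayer
sublayer_out = [0.5, -0.5, 1.0, -1.0]  # hypothetical feed forward output

out = add_and_norm(x, sublayer_out)
print([round(v, 2) for v in out])
```

The same pattern would apply after the attention block and after the feed forward block.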

106
00:07:22,000 --> 00:07:28,000
So after doing this, this entire information will probably go to the decoder which we'll discuss in

107
00:07:28,000 --> 00:07:28,000
the later stages.

108
00:07:28,000 --> 00:07:34,000
But I hope you got an idea of the entire working of add and normalization, multi-head

109
00:07:34,000 --> 00:07:38,000
attention, positional encoding, input encoding, almost each and everything has been explained.

110
00:07:38,000 --> 00:07:43,000
Now in the upcoming video, we are going to discuss the encoder architecture, what is remaining,

111
00:07:43,000 --> 00:07:47,000
and how the rest of the operations are performed. That I will be discussing in my next video.

112
00:07:47,000 --> 00:07:48,000
Thank you.

