1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue the discussion with respect to NLP.

3
00:00:03,000 --> 00:00:05,000
We have already seen that.

4
00:00:05,000 --> 00:00:06,000
Now what is our next step?

5
00:00:06,000 --> 00:00:11,000
You know, after text pre-processing, where we specifically performed stemming, lemmatization, and stopword removal.

6
00:00:11,000 --> 00:00:12,000
And we have cleaned the data.

7
00:00:12,000 --> 00:00:13,000
Right.

8
00:00:13,000 --> 00:00:16,000
Now the main thing is that we really need to convert the text into vectors.

9
00:00:16,000 --> 00:00:18,000
And there are multiple ways.

10
00:00:18,000 --> 00:00:22,000
The first way that we are going to discuss is something called one-hot encoding.

11
00:00:22,000 --> 00:00:22,000
Okay.

12
00:00:22,000 --> 00:00:27,000
Now we'll try to understand how, with the help of one-hot encoding, we convert the words

13
00:00:27,000 --> 00:00:29,000
into vectors.

14
00:00:29,000 --> 00:00:32,000
So let's take this specific example.

15
00:00:32,000 --> 00:00:35,000
So here I have a text and a specific output right.

16
00:00:35,000 --> 00:00:39,000
Our main aim is that first of all we need to take this particular text.

17
00:00:39,000 --> 00:00:41,000
And we need to convert this into vectors okay.

18
00:00:41,000 --> 00:00:47,000
Now with respect to this, obviously, I'm not going to lowercase all the sentences or words again,

19
00:00:47,000 --> 00:00:52,000
but let me focus more on how you can implement this one-hot encoding.

20
00:00:52,000 --> 00:00:57,000
And what is the theoretical intuition behind it and how the vectors are actually created.

21
00:00:57,000 --> 00:01:02,000
So, to begin with, let's say that this is my text: document one is "The food is

22
00:01:02,000 --> 00:01:04,000
good", document two is "The food is bad",

23
00:01:04,000 --> 00:01:05,000
and document three is "Pizza is amazing".

24
00:01:05,000 --> 00:01:06,000
Okay.

25
00:01:06,000 --> 00:01:08,000
So this is basically "Pizza is

26
00:01:08,000 --> 00:01:09,000
amazing", okay.

27
00:01:09,000 --> 00:01:14,000
Now how do you find out how many unique vocabulary words there are?

28
00:01:14,000 --> 00:01:19,000
Because see the unique vocabulary will play a very important role.

29
00:01:19,000 --> 00:01:21,000
Okay, while creating your vectors.

30
00:01:21,000 --> 00:01:21,000
Okay.

31
00:01:21,000 --> 00:01:27,000
So to begin with, what I'm actually going to do is that we just need to find out how many unique words

32
00:01:27,000 --> 00:01:27,000
are there.

33
00:01:27,000 --> 00:01:29,000
So let me go ahead and write it down.

34
00:01:29,000 --> 00:01:34,000
And obviously you know that if I combine all these three documents it becomes a paragraph or corpus.

35
00:01:34,000 --> 00:01:38,000
So let me just go ahead and write down all the unique words.

36
00:01:38,000 --> 00:01:42,000
That is, not unique documents, but the unique vocabulary.

37
00:01:42,000 --> 00:01:45,000
"The" is there, "food" is there, right.

38
00:01:45,000 --> 00:01:49,000
"Is" is there, right, and "good" is there.

39
00:01:49,000 --> 00:01:51,000
So over here you can see "good" is also there.

40
00:01:51,000 --> 00:01:56,000
Then from the second sentence you can see "the food is" is getting repeated.

41
00:01:56,000 --> 00:01:59,000
So I'm just going to write "bad", okay.

42
00:01:59,000 --> 00:02:06,000
And if I go to the third sentence or document, here you will be able to see "pizza" is there,

43
00:02:06,000 --> 00:02:09,000
which is a unique word or unique vocabulary.

44
00:02:09,000 --> 00:02:10,000
"Is" is again getting repeated.

45
00:02:10,000 --> 00:02:13,000
And finally we have "amazing".

46
00:02:13,000 --> 00:02:19,000
So all these words that are present over here, these are my unique vocabulary.

47
00:02:19,000 --> 00:02:20,000
Right.

48
00:02:21,000 --> 00:02:22,000
Vocabulary.

49
00:02:22,000 --> 00:02:30,000
So these are nothing but the unique words that are available in this entire paragraph

50
00:02:30,000 --> 00:02:33,000
or sentence, or in this particular data set.
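
The vocabulary step described above can be sketched in Python. This is only an illustration under a few assumptions that are not from the video: the helper name `build_vocabulary` is hypothetical, words are split on whitespace, and everything is lowercased so that "The" and "the" count as the same word.

```python
def build_vocabulary(documents):
    """Collect unique words in order of first appearance (hypothetical helper)."""
    vocabulary = []
    seen = set()
    for document in documents:
        # Assumption: lowercase and split on whitespace as a stand-in for tokenization.
        for word in document.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


documents = ["The food is good", "The food is bad", "Pizza is amazing"]
vocabulary = build_vocabulary(documents)

print(vocabulary)       # ['the', 'food', 'is', 'good', 'bad', 'pizza', 'amazing']
print(len(vocabulary))  # 7
```

Keeping the words in order of first appearance matches the order in which they are written down in the lecture, so the positions line up with the vectors that follow.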

51
00:02:33,000 --> 00:02:38,000
Now, based on these unique words, what does one-hot encoding mean?

52
00:02:38,000 --> 00:02:43,000
Okay, now let's say I consider this particular document one, D1, okay.

53
00:02:43,000 --> 00:02:45,000
So let's say this is my D1.

54
00:02:45,000 --> 00:02:47,000
Now what do we do with D1?

55
00:02:47,000 --> 00:02:56,000
If we apply one-hot encoding, we will convert each of these specific words into

56
00:02:56,000 --> 00:02:57,000
a vector representation.

57
00:02:57,000 --> 00:03:02,000
That basically means, let's say I'm considering the word "the"; then the vector representation for the

58
00:03:02,000 --> 00:03:08,000
word will be [1, 0, 0, 0, 0, 0, 0].

59
00:03:08,000 --> 00:03:08,000
Why?

60
00:03:08,000 --> 00:03:10,000
Because "the" is present over there.

61
00:03:10,000 --> 00:03:13,000
That basically means "the" is present at this specific position.

62
00:03:13,000 --> 00:03:15,000
So this position becomes one.

63
00:03:15,000 --> 00:03:20,000
If I consider the next word, that is "food", then how will we be able to represent it?

64
00:03:20,000 --> 00:03:24,000
It will be [0, 1, 0, 0, 0, 0, 0].

65
00:03:24,000 --> 00:03:25,000
Right?

66
00:03:25,000 --> 00:03:32,000
So here you can clearly see that each specific word will be represented by

67
00:03:32,000 --> 00:03:36,000
a vector of dimension V.

68
00:03:36,000 --> 00:03:37,000
Okay.

69
00:03:37,000 --> 00:03:42,000
Now what is this V? V is nothing but the number of unique words that are present over here.

70
00:03:42,000 --> 00:03:46,000
So here you can see one word, two words, three words, four words, five words, six words, seven

71
00:03:46,000 --> 00:03:46,000
words.

72
00:03:46,000 --> 00:03:49,000
So the unique vocabulary size I have is seven.

73
00:03:49,000 --> 00:03:55,000
So each word is going to be represented by a vector of length seven, where the position of that specific word is set

74
00:03:55,000 --> 00:03:57,000
to one and all the remaining positions are set to zero.

75
00:03:57,000 --> 00:03:58,000
Right.
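
The rule just described (a vector of length V with a single one at the word's position) can be sketched as follows. This is a minimal illustration, not code from the video; the function name `one_hot` is hypothetical, and the vocabulary list is typed in by hand in the order used in the lecture.

```python
def one_hot(word, vocabulary):
    """Return a list of length V with a 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    # list.index raises ValueError for a word that is not in the vocabulary.
    vector[vocabulary.index(word)] = 1
    return vector


# Assumed vocabulary order: order of first appearance in the three documents.
vocabulary = ["the", "food", "is", "good", "bad", "pizza", "amazing"]

print(one_hot("the", vocabulary))   # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("food", vocabulary))  # [0, 1, 0, 0, 0, 0, 0]
```

Note that this sketch simply fails on a word outside the vocabulary; how to handle unseen words is a separate design decision that is not covered here.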

76
00:03:58,000 --> 00:04:05,000
So now consider how I can represent this D1, that is, "The food is good", with respect

77
00:04:05,000 --> 00:04:06,000
to these particular vectors, okay.

78
00:04:06,000 --> 00:04:11,000
So now understand that in this particular sentence there are four words, right.

79
00:04:11,000 --> 00:04:12,000
One, two, three, four.

80
00:04:12,000 --> 00:04:21,000
So over here, with respect to D1, the first word will have a representation of [1, 0, 0, 0, 0, 0, 0].

81
00:04:21,000 --> 00:04:22,000
So how many zeros are there?

82
00:04:22,000 --> 00:04:24,000
One, two, three, four, five, six.

83
00:04:24,000 --> 00:04:25,000
Right.

84
00:04:25,000 --> 00:04:28,000
One, two, three, four, five, six.

85
00:04:28,000 --> 00:04:29,000
Right.

86
00:04:29,000 --> 00:04:32,000
So this will be my first word. Okay, then coming to the second word.

87
00:04:32,000 --> 00:04:34,000
Now the second word over here is "food".

88
00:04:34,000 --> 00:04:35,000
Right.

89
00:04:35,000 --> 00:04:39,000
So here you'll be able to see [0, 1, 0, 0, 0, 0, 0].

90
00:04:39,000 --> 00:04:40,000
Right.

91
00:04:40,000 --> 00:04:41,000
So this will be my second word.

92
00:04:41,000 --> 00:04:44,000
Now if I come to my third word, it is "is", right.

93
00:04:44,000 --> 00:04:49,000
So "is" is represented by [0, 0, 1, 0, 0, 0, 0].

94
00:04:49,000 --> 00:04:50,000
So this is my third word.

95
00:04:50,000 --> 00:04:53,000
And coming to the fourth word, which is "good".

96
00:04:53,000 --> 00:04:58,000
So here it will become [0, 0, 0, 1, 0, 0, 0].

97
00:04:58,000 --> 00:04:58,000
Right.

98
00:04:58,000 --> 00:05:05,000
So this is the one-hot encoded representation of sentence one with respect to this particular

99
00:05:05,000 --> 00:05:06,000
text.

100
00:05:06,000 --> 00:05:12,000
So if I apply one-hot encoding to this text, I'm going to get this kind of representation.

101
00:05:12,000 --> 00:05:17,000
So guys, now let's talk about the shape of this D1 sentence, which has four words, right.

102
00:05:17,000 --> 00:05:20,000
So here you can see the first word's vector representation is this one.

103
00:05:20,000 --> 00:05:24,000
The second word's vector representation is this one.

104
00:05:24,000 --> 00:05:26,000
The third word's vector representation is this one.

105
00:05:26,000 --> 00:05:28,000
And the fourth word's vector representation is this one.

106
00:05:28,000 --> 00:05:37,000
So if I look at the shape, it is 4 x 7: four words, one, two, three, four, okay.

107
00:05:37,000 --> 00:05:44,000
And seven is the vocabulary size, right: one, two, three, four, five, six, seven, okay.
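
The 4 x 7 shape for D1 can be checked with a short sketch. The name `encode_document` is hypothetical, and lowercasing plus whitespace splitting are assumptions made here rather than steps shown in the video.

```python
def encode_document(document, vocabulary):
    """Return one one-hot row per word, in the order the words appear."""
    position = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for word in document.lower().split():
        row = [0] * len(vocabulary)
        row[position[word]] = 1
        matrix.append(row)
    return matrix


# Assumed vocabulary order: order of first appearance in the three documents.
vocabulary = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
d1 = encode_document("The food is good", vocabulary)

for row in d1:
    print(row)
# [1, 0, 0, 0, 0, 0, 0]
# [0, 1, 0, 0, 0, 0, 0]
# [0, 0, 1, 0, 0, 0, 0]
# [0, 0, 0, 1, 0, 0, 0]

print((len(d1), len(d1[0])))  # (4, 7)
```

The number of rows is the number of words in the document, and the number of columns is the vocabulary size.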

108
00:05:44,000 --> 00:05:48,000
Now, what we are going to do is the same thing for D2.

109
00:05:48,000 --> 00:05:48,000
Okay.

110
00:05:48,000 --> 00:05:50,000
So let's go ahead and try with respect to D2.

111
00:05:50,000 --> 00:05:55,000
I would suggest you just pause the video, try it by yourself, and then continue the video.

112
00:05:55,000 --> 00:05:58,000
Anyhow, I'll be showing you the entire representation again, okay.

113
00:05:58,000 --> 00:06:02,000
So over here you'll be able to see "the", which is my first word.

114
00:06:02,000 --> 00:06:05,000
Again, in D2 I have "The food is bad", right.

115
00:06:05,000 --> 00:06:07,000
So again these two are the same.

116
00:06:07,000 --> 00:06:08,000
These three words are the same.

117
00:06:08,000 --> 00:06:10,000
So I'm going to write the same representation.

118
00:06:10,000 --> 00:06:18,000
It will be [1, 0, 0, 0, 0, 0, 0]. I think there are six zeros: one, two, three, four, five, six zeros, okay.

119
00:06:18,000 --> 00:06:20,000
So this will be my first word.

120
00:06:20,000 --> 00:06:21,000
Coming to the second word.

121
00:06:21,000 --> 00:06:25,000
It is this one, okay, I'm just going to write it over here.

122
00:06:25,000 --> 00:06:26,000
So this is my second word.

123
00:06:26,000 --> 00:06:34,000
So here I'm going to write [0, 1, 0, 0, 0, 0, 0]. I hope the size is seven: one, two, three, four, five, six, seven.

124
00:06:34,000 --> 00:06:34,000
Perfect.

125
00:06:34,000 --> 00:06:41,000
So coming to the third word now, here you'll be able to see [0, 0, 1, 0, 0, 0, 0].

126
00:06:41,000 --> 00:06:42,000
Okay.

127
00:06:42,000 --> 00:06:46,000
Now here you can see "the food is" is the same as in D1.

128
00:06:46,000 --> 00:06:54,000
So the first three are just a replica. Coming to the last word, in D1

129
00:06:54,000 --> 00:06:54,000
you can see "good".

130
00:06:54,000 --> 00:06:56,000
In D2 you can see "bad".

131
00:06:56,000 --> 00:07:00,000
Now whenever you have "bad", the position of this particular word will become one, right?

132
00:07:00,000 --> 00:07:06,000
So you'll be having four zeros and then 1, 0, 0.

133
00:07:06,000 --> 00:07:13,000
So I'll just go ahead and write [0, 0, 0, 0, 1, 0, 0].

134
00:07:13,000 --> 00:07:13,000
Perfect.

135
00:07:13,000 --> 00:07:17,000
So this is how D2 is represented.

136
00:07:17,000 --> 00:07:22,000
And again over here you'll be able to see the shape will be 4 x 7.
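
As a quick check of the D2 walkthrough, the sketch below encodes both documents and compares them. It reuses the same hypothetical `encode_document` helper and the same assumptions (lowercasing, whitespace splitting, vocabulary in order of first appearance).

```python
def encode_document(document, vocabulary):
    """Return one one-hot row per word, in the order the words appear."""
    position = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for word in document.lower().split():
        row = [0] * len(vocabulary)
        row[position[word]] = 1
        matrix.append(row)
    return matrix


vocabulary = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
d1 = encode_document("The food is good", vocabulary)
d2 = encode_document("The food is bad", vocabulary)

print(d1[:3] == d2[:3])       # True: "the food is" is encoded identically
print(d2[3])                  # [0, 0, 0, 0, 1, 0, 0], the vector for "bad"
print((len(d2), len(d2[0])))  # (4, 7)
```

Only the last row differs between the two documents, because "good" and "bad" occupy different positions in the vocabulary.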

137
00:07:22,000 --> 00:07:23,000
Okay.

138
00:07:23,000 --> 00:07:24,000
Now uh, yes.

139
00:07:24,000 --> 00:07:26,000
This was the technique.

140
00:07:26,000 --> 00:07:27,000
The simple technique of one-hot encoding.

141
00:07:27,000 --> 00:07:30,000
That is how we converted text into vectors.

142
00:07:31,000 --> 00:07:37,000
Over here you can see each and every word is represented by a one-hot encoded vector whose length is

143
00:07:37,000 --> 00:07:41,000
the vocabulary size, whatever that vocabulary size happens to be.

144
00:07:41,000 --> 00:07:44,000
So here you can see that I have a vocabulary size of seven.

145
00:07:44,000 --> 00:07:48,000
So based on that, a one-hot representation is given for each and every word.

146
00:07:48,000 --> 00:07:49,000
Okay.

147
00:07:49,000 --> 00:07:54,000
Now, going forward, in the next video we are going to discuss all the advantages and disadvantages

148
00:07:54,000 --> 00:07:55,000
of this.

149
00:07:55,000 --> 00:08:00,000
And again, one-hot encoding is generally not used for these NLP use cases.

150
00:08:00,000 --> 00:08:03,000
We have techniques like Bag of Words and TF-IDF.

151
00:08:03,000 --> 00:08:05,000
We'll also understand what the advantages and disadvantages are.

152
00:08:05,000 --> 00:08:07,000
Why should we not use this?

153
00:08:07,000 --> 00:08:10,000
And then we'll try to understand Bag of Words and TF-IDF.

154
00:08:10,000 --> 00:08:10,000
Okay.

155
00:08:10,000 --> 00:08:12,000
So yes, this was it.

156
00:08:12,000 --> 00:08:13,000
I will see you all in the next video.

157
00:08:13,000 --> 00:08:14,000
Thank you.

