1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
We are going to continue our discussion of NLP.

3
00:00:03,000 --> 00:00:10,000
In our previous video, we already discussed how one-hot encoding works, right?

4
00:00:10,000 --> 00:00:16,000
And if I talk about the different ways we can convert text into vectors,

5
00:00:16,000 --> 00:00:19,000
we have already completed one-hot encoding.

6
00:00:19,000 --> 00:00:22,000
Now we are on to the second one, that is, bag of words.

7
00:00:22,000 --> 00:00:26,000
Now let's go ahead and understand how bag of words actually works.

8
00:00:26,000 --> 00:00:32,000
And this is super important because this is the technique with which you can do simple tasks like sentiment

9
00:00:32,000 --> 00:00:38,000
classification or, basically, in short, any text classification kind of task, which you

10
00:00:38,000 --> 00:00:43,000
will easily be able to solve. Okay, some famous applications, like whether a mail

11
00:00:43,000 --> 00:00:46,000
is spam or ham, you'll be able to solve all of them.

12
00:00:46,000 --> 00:00:49,000
Now let's say that I have a data set.

13
00:00:49,000 --> 00:00:49,000
Okay.

14
00:00:49,000 --> 00:00:54,000
And in this particular data set, I'm just saying that these are positive or negative statements.

15
00:00:54,000 --> 00:00:58,000
Now, in this particular data set, I have three sentences.

16
00:00:58,000 --> 00:00:59,000
Let's say: "He is a good boy."

17
00:00:59,000 --> 00:01:01,000
"She is a good girl."

18
00:01:01,000 --> 00:01:04,000
"Boy and girl are good." Okay.

19
00:01:05,000 --> 00:01:07,000
And all of these are positive statements.

20
00:01:07,000 --> 00:01:09,000
So the output for all of these is one, right?

21
00:01:09,000 --> 00:01:13,000
In supervised machine learning we really need to know the output also.

22
00:01:13,000 --> 00:01:18,000
Now let me go step by step and show you how bag of words is implemented.

23
00:01:18,000 --> 00:01:20,000
So this is basically the step one right?

24
00:01:20,000 --> 00:01:22,000
I have the data set.

25
00:01:22,000 --> 00:01:23,000
Now let's go to the step two.

26
00:01:23,000 --> 00:01:29,000
Now what happens in step two is that, once I go over here, right, there are multiple steps that

27
00:01:29,000 --> 00:01:30,000
actually occur.

28
00:01:30,000 --> 00:01:32,000
And initially we should also go through these.

29
00:01:32,000 --> 00:01:38,000
The first thing is that, if I consider it, two things are basically going to happen.

30
00:01:38,000 --> 00:01:45,000
First of all, we will lowercase all the words, and usually these steps we will be doing for all the

31
00:01:45,000 --> 00:01:46,000
other techniques also.

32
00:01:46,000 --> 00:01:51,000
First we will lowercase all the words and then we will remove the stop words, right?

33
00:01:51,000 --> 00:01:53,000
And this also I have already shown you.

34
00:01:53,000 --> 00:01:59,000
Now when I lowercase all the words, let's say my sentence one, right, my sentence one now becomes... I'm going

35
00:01:59,000 --> 00:02:00,000
to take only this text data.

36
00:02:00,000 --> 00:02:04,000
I don't have to worry about the output because later on, once we convert this text into vectors, we

37
00:02:04,000 --> 00:02:06,000
can apply it to the machine learning algorithm.

38
00:02:06,000 --> 00:02:10,000
So in sentence one here, you can see, as soon as I lowercase the words, what will happen?

39
00:02:10,000 --> 00:02:12,000
These will all become lowercase.

40
00:02:12,000 --> 00:02:13,000
Now why am I doing this?

41
00:02:13,000 --> 00:02:15,000
Because there may be some repeated words.

42
00:02:15,000 --> 00:02:20,000
Now in this particular case, with a capital B you have "Boy" here, and with a small b you have "boy".

43
00:02:20,000 --> 00:02:25,000
So both these words are the same, but since one is capitalized, it will be treated as a separate word.

44
00:02:25,000 --> 00:02:28,000
So we really need to lowercase all the words.

45
00:02:28,000 --> 00:02:34,000
So I'm just going to write "lowercase all the words". Now, in sentence one, after lowercasing.

46
00:02:34,000 --> 00:02:36,000
So what will happen is that these will all become lowercase letters.

47
00:02:36,000 --> 00:02:38,000
Then we apply stop word removal.

48
00:02:38,000 --> 00:02:39,000
Now what happens in stop words?

49
00:02:39,000 --> 00:02:45,000
These are words like "he", "she", "is", "a"; you know, they will get deleted, right?

50
00:02:45,000 --> 00:02:50,000
Because we don't require these particular words for a task like sentiment analysis.

51
00:02:50,000 --> 00:02:53,000
Important words like "good", "boy", and "girl" are what is basically required.

52
00:02:53,000 --> 00:02:57,000
So all these things will go; "and" and "are" will also go.

53
00:02:57,000 --> 00:02:59,000
Now what will happen with respect to this?

54
00:02:59,000 --> 00:03:02,000
Sentence one will become only this:

55
00:03:02,000 --> 00:03:04,000
Good boy.

56
00:03:04,000 --> 00:03:08,000
Okay, I'm showing you step by step, with the help of Python.

57
00:03:08,000 --> 00:03:10,000
When we do it with the help of libraries, it is very simple.

58
00:03:10,000 --> 00:03:13,000
We don't have to worry that much about it.

59
00:03:13,000 --> 00:03:13,000
Okay.

60
00:03:14,000 --> 00:03:16,000
We just have to use one library.

61
00:03:16,000 --> 00:03:17,000
So sentence two.

62
00:03:17,000 --> 00:03:20,000
What will it become once the stop words get removed?

63
00:03:20,000 --> 00:03:22,000
It will become good girl, right?

64
00:03:23,000 --> 00:03:31,000
And similarly with sentence three, I will be having these specific words, that is, boy, girl,

65
00:03:31,000 --> 00:03:34,000
good, right?

66
00:03:34,000 --> 00:03:35,000
So all these things are there.

67
00:03:35,000 --> 00:03:37,000
I have sentence one, sentence two, sentence three.

68
00:03:37,000 --> 00:03:38,000
Perfect.
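The two preprocessing steps above (lowercasing, then stop word removal) can be sketched in plain Python. This is only a minimal illustration: the `STOP_WORDS` set and the `preprocess` helper are assumptions made up for this example, not a standard list or a library function.
```python
# Assumption: a tiny hand-written stop word list, just enough for this example.
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}
def preprocess(sentence):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]
sentences = ["He is a good boy", "She is a good girl", "Boy and girl are good"]
cleaned = [preprocess(s) for s in sentences]
print(cleaned)  # [['good', 'boy'], ['good', 'girl'], ['boy', 'girl', 'good']]
```
In practice a library-provided stop word list would usually replace the hand-written set.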

69
00:03:38,000 --> 00:03:42,000
Now what we do is go ahead and work out the vocabulary.

70
00:03:42,000 --> 00:03:44,000
Now, how many words are there in this vocabulary?

71
00:03:44,000 --> 00:03:44,000
We have:

72
00:03:44,000 --> 00:03:45,000
good, boy,

73
00:03:45,000 --> 00:03:47,000
girl, boy, and girl.

74
00:03:47,000 --> 00:03:49,000
And again it is getting repeated.

75
00:03:49,000 --> 00:03:56,000
So if I consider how many unique words there are in the vocabulary, I will be able to write

76
00:03:56,000 --> 00:03:56,000
them out.

77
00:03:56,000 --> 00:03:58,000
The first word is nothing but good.

78
00:03:58,000 --> 00:04:00,000
So I'll go ahead and write good.

79
00:04:00,000 --> 00:04:01,000
Good is my first word.

80
00:04:01,000 --> 00:04:06,000
And one more thing that I will write is the frequency of this specific word.

81
00:04:06,000 --> 00:04:10,000
Like how many times this word occurs across the different sentences.

82
00:04:10,000 --> 00:04:13,000
And obviously you can see one, two, three, three are there.

83
00:04:13,000 --> 00:04:15,000
So I'm going to keep the count as three.

84
00:04:15,000 --> 00:04:18,000
Then we have the word "boy" over here.

85
00:04:18,000 --> 00:04:20,000
You will be able to see how many times "boy" is there.

86
00:04:20,000 --> 00:04:23,000
So in sentence three "boy" is there, and in sentence one also "boy" is there.

87
00:04:23,000 --> 00:04:28,000
So I'm just going to make the count two, and coming to the next one, which is "girl".

88
00:04:28,000 --> 00:04:31,000
So again, the word "girl" is also there in this vocabulary.

89
00:04:31,000 --> 00:04:36,000
And again, for "girl" also, you will be able to see it is there two times, right?

90
00:04:36,000 --> 00:04:41,000
First of all, you need to see that the frequencies of the different vocabulary

91
00:04:41,000 --> 00:04:41,000
words,

92
00:04:41,000 --> 00:04:43,000
are they in ascending order?

93
00:04:43,000 --> 00:04:46,000
Sorry, they are in descending order.

94
00:04:46,000 --> 00:04:49,000
So the word with the maximum frequency is put first.

95
00:04:49,000 --> 00:04:53,000
Right, then "boy" is present two times and "girl" is present two times.

96
00:04:53,000 --> 00:04:55,000
So these two can be moved up or down.

97
00:04:55,000 --> 00:05:00,000
But just understand what I have done is that based on this descending order I have just ordered all

98
00:05:00,000 --> 00:05:01,000
these words.

99
00:05:01,000 --> 00:05:05,000
So it is basically from maximum to minimum, okay.

100
00:05:06,000 --> 00:05:08,000
So that is that. Perfect.
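The vocabulary table above (each unique word with its frequency, sorted from maximum to minimum) can be sketched with `collections.Counter` from the Python standard library. The `cleaned` list is assumed to be the output of the earlier lowercasing and stop word steps.
```python
from collections import Counter
# Assumption: the sentences have already been lowercased and had stop words removed.
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
# Count how often each word occurs across all sentences.
frequency = Counter(word for sentence in cleaned for word in sentence)
# most_common() returns (word, count) pairs from highest to lowest count;
# words with equal counts keep the order in which they were first seen.
vocabulary = frequency.most_common()
print(vocabulary)  # [('good', 3), ('boy', 2), ('girl', 2)]
```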

101
00:05:08,000 --> 00:05:13,000
Now, before applying bag of words, you have already seen

102
00:05:13,000 --> 00:05:14,000
how big the vocabulary is.

103
00:05:14,000 --> 00:05:16,000
What is the vocabulary size?

104
00:05:16,000 --> 00:05:18,000
The vocabulary size is three, right?

105
00:05:18,000 --> 00:05:22,000
And in a bigger dataset, I can have a lot of words like these.

106
00:05:22,000 --> 00:05:26,000
So I will be having all these words like this and frequency will also be there.

107
00:05:26,000 --> 00:05:30,000
And one important step is that it is not necessary that you use all the words.

108
00:05:30,000 --> 00:05:33,000
This option is also there in bag of words itself.

109
00:05:33,000 --> 00:05:37,000
Let's say that over here in the vocabulary there are 100 unique words.

110
00:05:37,000 --> 00:05:40,000
And let's say some of the words are just present once, right?

111
00:05:40,000 --> 00:05:45,000
So if some of the words are present just once, we do not even take them.

112
00:05:45,000 --> 00:05:50,000
So when we do the coding for bag of words, you know, we also

113
00:05:50,000 --> 00:05:54,000
have an option to select just the top 10 features or top 20 features.

114
00:05:54,000 --> 00:05:56,000
That is, whichever words are getting repeated the most.

115
00:05:56,000 --> 00:05:59,000
And that is the importance of this particular frequency.
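Keeping only the most frequent words can be sketched as follows. The frequency table, the extra words "nice" and "tall", and the `top_k` value are all made up for illustration.
```python
from collections import Counter
# Assumption: a made-up frequency table; "nice" and "tall" are hypothetical rare words.
frequency = Counter({"good": 3, "boy": 2, "girl": 2, "nice": 1, "tall": 1})
top_k = 3  # hypothetical cut-off; in practice this might be 10 or 20
# Keep only the top_k most frequent words as features.
features = [word for word, _ in frequency.most_common(top_k)]
print(features)  # ['good', 'boy', 'girl']
```
Libraries typically expose a similar cut-off; in scikit-learn, for example, `CountVectorizer` has a `max_features` parameter for this purpose.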

116
00:05:59,000 --> 00:05:59,000
Okay.

117
00:06:00,000 --> 00:06:00,000
Now perfect.

118
00:06:00,000 --> 00:06:02,000
We have come till here.

119
00:06:02,000 --> 00:06:07,000
Now the next step is very, very simple. Based on the topmost frequencies, what I'm

120
00:06:07,000 --> 00:06:10,000
actually going to do is keep these words as my features.

121
00:06:10,000 --> 00:06:15,000
So "good" will come over here, "boy" will come over here, and "girl" will come over here.

122
00:06:15,000 --> 00:06:16,000
Okay?

123
00:06:17,000 --> 00:06:18,000
Now you already know

124
00:06:18,000 --> 00:06:20,000
what sentence one is, right?

125
00:06:20,000 --> 00:06:23,000
So for sentence one, what will happen? Sentence one has

126
00:06:23,000 --> 00:06:24,000
"good boy".

127
00:06:24,000 --> 00:06:27,000
Now see how this will get converted into a vector.

128
00:06:27,000 --> 00:06:29,000
Wherever good is present, that will become one.

129
00:06:29,000 --> 00:06:33,000
Wherever boy is present, that will become one, and the remaining will become zero.

130
00:06:33,000 --> 00:06:40,000
So for this entire sentence, you will be able to see that it is getting converted to [1, 1, 0] as a vector.

131
00:06:40,000 --> 00:06:40,000
Okay.

132
00:06:40,000 --> 00:06:41,000
So this was the text.

133
00:06:41,000 --> 00:06:44,000
Now this is getting converted to [1, 1, 0] as a vector.

134
00:06:44,000 --> 00:06:49,000
Now similarly if I go with respect to sentence two, wherever there is a word like good that will become

135
00:06:49,000 --> 00:06:54,000
one, wherever there is "girl" that will become one, and all the remaining will become zero.

136
00:06:54,000 --> 00:06:57,000
I'll talk about the advantages and disadvantages, and why we are doing this.

137
00:06:57,000 --> 00:06:59,000
Previously, in one-hot encoding,

138
00:06:59,000 --> 00:07:03,000
you saw that for every word we are creating a vector.

139
00:07:03,000 --> 00:07:07,000
But here, one vector is created for the entire sentence, okay.

140
00:07:07,000 --> 00:07:12,000
And there are a lot of advantages to this, which I'll discuss. First of all,

141
00:07:12,000 --> 00:07:15,000
let's understand what more we'll be having in this.

142
00:07:15,000 --> 00:07:20,000
So S3: you'll be able to see that in sentence three here I have boy, girl, and good.

143
00:07:20,000 --> 00:07:25,000
So wherever "boy" is there, that will become one; where "girl" is there, that will become one; and where "good" is there,

144
00:07:25,000 --> 00:07:26,000
that will become one.

145
00:07:26,000 --> 00:07:32,000
Okay, now these are my entire vectors, and obviously I will also have my output variable.

146
00:07:32,000 --> 00:07:34,000
The output variable can be ones or zeros,

147
00:07:34,000 --> 00:07:35,000
anything as such, right?

148
00:07:35,000 --> 00:07:41,000
If you are solving sentiment analysis or something like that. Okay, now these are my entire

149
00:07:41,000 --> 00:07:43,000
vectors okay.

150
00:07:43,000 --> 00:07:46,000
And this is the vector for the entire sentence okay.

151
00:07:46,000 --> 00:07:48,000
Entire sentence.

152
00:07:48,000 --> 00:07:53,000
And this is how bag of words converts text into vectors.
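The conversion of each sentence into a vector can be sketched as below. The `to_vector` helper is a hypothetical name for this example; the feature order follows the descending-frequency vocabulary from the walkthrough.
```python
# Assumption: features are ordered by descending frequency, as in the walkthrough.
features = ["good", "boy", "girl"]
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
def to_vector(words, features):
    """Return one count per feature for a single preprocessed sentence."""
    return [words.count(feature) for feature in features]
vectors = [to_vector(words, features) for words in cleaned]
print(vectors)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
```
Each row is one sentence and each column is one vocabulary word, so the three sentences become a 3 x 3 matrix.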

153
00:07:53,000 --> 00:07:56,000
Okay, now what we can do is take these particular vectors.

154
00:07:56,000 --> 00:08:00,000
We can train a machine learning model with them and will be able to get the output.

155
00:08:00,000 --> 00:08:06,000
Now one more important thing that I really want to point out: let's say that I have some words over here.

156
00:08:06,000 --> 00:08:06,000
Right.

157
00:08:06,000 --> 00:08:07,000
Good girl.

158
00:08:07,000 --> 00:08:07,000
Girl.

159
00:08:07,000 --> 00:08:07,000
Right.

160
00:08:07,000 --> 00:08:10,000
So let's say I have one more word, like "good".

161
00:08:11,000 --> 00:08:13,000
Now in this case, what will happen?

162
00:08:13,000 --> 00:08:13,000
Okay.

163
00:08:13,000 --> 00:08:18,000
In this case, usually in bag of words, since "good" is repeated two times,

164
00:08:18,000 --> 00:08:22,000
I will increase the count to two, okay.

165
00:08:22,000 --> 00:08:23,000
In this particular case.

166
00:08:23,000 --> 00:08:25,000
So what I will do is increase the count to two.

167
00:08:25,000 --> 00:08:26,000
Okay.

168
00:08:26,000 --> 00:08:27,000
Now there are two things.

169
00:08:27,000 --> 00:08:35,000
One is binary bag of words and one is normal bag of words.

170
00:08:35,000 --> 00:08:39,000
Now in the case of binary bag of words, even though the count is two, what it is going to do, it

171
00:08:39,000 --> 00:08:42,000
is going to force it to become one.

172
00:08:42,000 --> 00:08:46,000
So if the word is present, it may be present any number of times;

173
00:08:46,000 --> 00:08:49,000
the value is either one or zero.

174
00:08:49,000 --> 00:08:54,000
And in normal bag of words, I can increase the count to two, three, four, based on the number of times the word

175
00:08:54,000 --> 00:08:54,000
is there.

176
00:08:54,000 --> 00:08:59,000
So that is the basic difference between binary bag of words and bag of words.

177
00:08:59,000 --> 00:09:03,000
That basically means here you will be having only ones and zeros.

178
00:09:03,000 --> 00:09:07,000
And here, the count will get updated based on the frequency.

179
00:09:08,000 --> 00:09:13,000
Count will get updated based on frequency.

180
00:09:13,000 --> 00:09:17,000
So this is the basic difference with respect to this.
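The difference between the two variants can be sketched on a single sentence. The sentence here is made up for illustration, chosen so that "good" appears twice.
```python
features = ["good", "boy", "girl"]
# Assumption: a made-up preprocessed sentence in which "good" appears twice.
words = ["good", "girl", "good"]
# Normal bag of words: each entry is how many times the word occurs.
count_vector = [words.count(f) for f in features]
# Binary bag of words: each entry only records presence (1) or absence (0).
binary_vector = [1 if f in words else 0 for f in features]
print(count_vector)   # [2, 0, 1]
print(binary_vector)  # [1, 0, 1]
```
In scikit-learn, `CountVectorizer` offers a `binary` parameter that switches between these two behaviours.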

181
00:09:17,000 --> 00:09:17,000
Okay.

182
00:09:19,000 --> 00:09:24,000
So I hope you have understood how, with the help of bag of words, we are converting text into

183
00:09:24,000 --> 00:09:25,000
vectors.

184
00:09:26,000 --> 00:09:31,000
Now in the next video I'm going to discuss the advantages and disadvantages with respect

185
00:09:31,000 --> 00:09:34,000
to this, like we discussed for one-hot encoding.

186
00:09:34,000 --> 00:09:35,000
Okay.

187
00:09:35,000 --> 00:09:38,000
And first of all, I'll talk about which problems are getting solved.

188
00:09:38,000 --> 00:09:39,000
All right.

189
00:09:39,000 --> 00:09:41,000
So yes, I will see you all in the next video.

190
00:09:41,000 --> 00:09:46,000
I hope you were able to understand it. Just try with some other data set of your own.

191
00:09:46,000 --> 00:09:52,000
Just try to create a text, try to perform all these steps, and then try to convert that into vectors.

192
00:09:52,000 --> 00:09:54,000
So yes, I will see you all in the next video.

193
00:09:54,000 --> 00:09:55,000
Thank you.