1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:05,000
So we are going to continue our discussion of Bag of Words. We have already understood

3
00:00:05,000 --> 00:00:10,000
the intuition behind Bag of Words and how it converts text into vectors.

4
00:00:10,000 --> 00:00:15,000
Now, as usual, let's go ahead and discuss the advantages and disadvantages.

5
00:00:15,000 --> 00:00:18,000
So first of all I will go ahead and write the advantages.

6
00:00:19,000 --> 00:00:23,000
And then I will go ahead and write the disadvantages.

7
00:00:25,000 --> 00:00:26,000
Okay.

8
00:00:26,000 --> 00:00:32,000
And obviously we have also discussed the advantages and disadvantages with respect to one

9
00:00:32,000 --> 00:00:33,000
hot encoding.

10
00:00:33,000 --> 00:00:38,000
We'll compare with that and see which problems are getting fixed, okay?

11
00:00:38,000 --> 00:00:39,000
First of all yes.

12
00:00:39,000 --> 00:00:43,000
Again, this is easy to implement and it is intuitive.

13
00:00:43,000 --> 00:00:46,000
So I will just write something like simple and intuitive.

14
00:00:48,000 --> 00:00:52,000
Simple and intuitive.

15
00:00:52,000 --> 00:00:52,000
Okay.

16
00:00:53,000 --> 00:00:55,000
The second point with respect to advantages.

17
00:00:55,000 --> 00:00:58,000
Now here, what did we see in one hot encoding?

18
00:00:58,000 --> 00:01:03,000
We have seen that something important is needed for machine learning algorithms.

19
00:01:03,000 --> 00:01:03,000
Right.

20
00:01:04,000 --> 00:01:04,000
Okay.

21
00:01:04,000 --> 00:01:08,000
With respect to sparse matrix, semantic meaning, out of vocabulary,

22
00:01:08,000 --> 00:01:09,000
everything I'll be discussing.

23
00:01:09,000 --> 00:01:15,000
First let's consider this second topic, which is that for ML algorithms we give fixed

24
00:01:15,000 --> 00:01:16,000
size inputs.

25
00:01:16,000 --> 00:01:18,000
Now over here with respect to bag of words.

26
00:01:18,000 --> 00:01:19,000
Any statement.

27
00:01:19,000 --> 00:01:23,000
Now here you can see that some of the sentences may be three words, five words, ten words.

28
00:01:23,000 --> 00:01:29,000
At the end of the day, based on the vocabulary size, you are able to get all the sentences converted

29
00:01:29,000 --> 00:01:32,000
into vectors with that many dimensions.

30
00:01:32,000 --> 00:01:35,000
So here the vector size is getting fixed.

31
00:01:35,000 --> 00:01:39,000
The inputs are getting fixed because here our vocabulary is getting fixed.

32
00:01:39,000 --> 00:01:42,000
So this particular problem is getting solved.
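To make this concrete, here is a minimal Python sketch, assuming the small vocabulary good, boy, girl from our example. The helper name bow_vector is made up for illustration; it is not from any library.
```python
# Sketch: sentences of different lengths become vectors of one fixed length.
# The vocabulary and the helper name bow_vector are assumptions for illustration.
vocabulary = ["good", "boy", "girl"]
def bow_vector(sentence, vocab):
    """Binary bag of words: 1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocab]
short_vec = bow_vector("good boy", vocabulary)
long_vec = bow_vector("boy and girl are good and good is good", vocabulary)
print(short_vec, long_vec)
```
Both sentences come out as vectors of length three, whatever their word count.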

33
00:01:42,000 --> 00:01:42,000
Okay.

34
00:01:42,000 --> 00:01:46,000
In one hot encoding you do not have fixed size inputs.

35
00:01:46,000 --> 00:01:48,000
Since we are creating words for every vectors.

36
00:01:48,000 --> 00:01:51,000
Sorry, we are creating a vector for every word.

37
00:01:51,000 --> 00:01:53,000
Okay, so what we are going to do over here.

38
00:01:53,000 --> 00:01:54,000
The second point that you'll be seeing.

39
00:01:54,000 --> 00:01:58,000
Yes you have a fixed size input.

40
00:01:58,000 --> 00:01:59,000
Right.

41
00:01:59,000 --> 00:02:07,000
And this will really help you when training ML algorithms, okay?

42
00:02:07,000 --> 00:02:10,000
Now these are the two major advantages.

43
00:02:10,000 --> 00:02:15,000
Now if I talk about the disadvantages, see, the first disadvantage with respect to one hot encoding

44
00:02:15,000 --> 00:02:16,000
is sparse matrix.

45
00:02:16,000 --> 00:02:19,000
And I've already told you what exactly a sparse matrix is.

46
00:02:19,000 --> 00:02:21,000
It is nothing but ones and zeros, and mostly zeros.

47
00:02:21,000 --> 00:02:24,000
Let's say your vocabulary size is 50,000. Then what will happen?

48
00:02:24,000 --> 00:02:29,000
Every sentence will get converted into a vector of that vocabulary size, right?

49
00:02:29,000 --> 00:02:32,000
So the sparse matrix problem is still there.
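As a rough sketch of how sparse this gets, assume a sentence with ten distinct words that are all in the 50,000-word vocabulary. The ten is my assumption, just to have a number to work with.
```python
# Sketch: fraction of zero entries in one bag-of-words vector.
# vocab_size comes from the example above; the sentence length is an assumption.
vocab_size = 50_000
unique_words_in_sentence = 10
zeros = vocab_size - unique_words_in_sentence
sparsity = zeros / vocab_size
print(f"{sparsity:.4%} of the entries are zero")
```
Almost every entry in the vector is zero, which is exactly the sparsity problem.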

50
00:02:32,000 --> 00:02:38,000
So with respect to disadvantage again I'm going to write it as sparse matrix.

51
00:02:38,000 --> 00:02:38,000
Or array.

52
00:02:41,000 --> 00:02:43,000
A sparse matrix or array is still there.

53
00:02:43,000 --> 00:02:48,000
And this can actually lead to overfitting, okay?

54
00:02:48,000 --> 00:02:51,000
Now second major disadvantage.

55
00:02:51,000 --> 00:02:56,000
Again, at the end of the day, whatever statements you have, like good boy, good girl,

56
00:02:56,000 --> 00:02:58,000
or it can be boy girl good.

57
00:02:59,000 --> 00:02:59,000
Okay.

58
00:02:59,000 --> 00:03:00,000
Something like this.

59
00:03:01,000 --> 00:03:03,000
You'll be seeing that based on this sentence, right?

60
00:03:03,000 --> 00:03:08,000
And based on this vocabulary, and based on the frequency of the vocabulary, the ordering of the words

61
00:03:08,000 --> 00:03:09,000
is changing.

62
00:03:09,000 --> 00:03:13,000
Now see, understand: what if the ordering of the words in a sentence changes?

63
00:03:14,000 --> 00:03:18,000
And based on that, this vector is getting created because, see, based on the frequency, we have

64
00:03:18,000 --> 00:03:23,000
written all the vocabulary words, right? Over here, good was present the maximum number

65
00:03:23,000 --> 00:03:23,000
of times.

66
00:03:23,000 --> 00:03:26,000
So we wrote it first. Boy was present the second most.

67
00:03:26,000 --> 00:03:28,000
So we wrote it over here.

68
00:03:28,000 --> 00:03:28,000
Right.

69
00:03:28,000 --> 00:03:31,000
And girl was present two times.

70
00:03:31,000 --> 00:03:33,000
And we have written it last.

71
00:03:33,000 --> 00:03:37,000
Right? Now over here, consider the third statement.

72
00:03:37,000 --> 00:03:38,000
Boy.

73
00:03:38,000 --> 00:03:38,000
Girl.

74
00:03:38,000 --> 00:03:38,000
Good.

75
00:03:38,000 --> 00:03:39,000
Right.

76
00:03:39,000 --> 00:03:42,000
But here you can see that the entire word order is getting changed.

77
00:03:42,000 --> 00:03:44,000
Like, it is completely changed, right?

78
00:03:44,000 --> 00:03:46,000
The ordering of the words is completely changed.

79
00:03:46,000 --> 00:03:48,000
So I'm having 1 1 0, and for sentence three

80
00:03:48,000 --> 00:03:49,000
I'm having 1 1 1.

81
00:03:49,000 --> 00:03:49,000
Now.

82
00:03:49,000 --> 00:03:54,000
When word ordering is changed, the meaning of the sentence also changes.

83
00:03:54,000 --> 00:03:58,000
And because of that, some of the semantic information is not getting captured.

84
00:03:58,000 --> 00:04:03,000
I'll talk more about semantic information, but here you'll be able to see that ordering of the words

85
00:04:03,000 --> 00:04:04,000
is getting changed.

86
00:04:04,000 --> 00:04:05,000
This is super important.

87
00:04:05,000 --> 00:04:13,000
The ordering of the words is getting changed, and if this is getting changed, the meaning of the sentence

88
00:04:13,000 --> 00:04:16,000
also changes, right?
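A small sketch of this point, again assuming the vocabulary good, boy, girl and a hypothetical helper bow_vector: two sentences with the same words in a different order end up with exactly the same vector, so the order is lost.
```python
# Sketch: bag of words ignores word order.
# The vocabulary and the helper name bow_vector are assumptions for illustration.
vocabulary = ["good", "boy", "girl"]
def bow_vector(sentence, vocab):
    """Binary bag of words: 1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocab]
vec_a = bow_vector("boy girl good", vocabulary)
vec_b = bow_vector("good girl boy", vocabulary)
same_vector = vec_a == vec_b
print(vec_a, vec_b, same_vector)
```
The positions in the vector follow the vocabulary, not the sentence, so the model cannot tell these two sentences apart.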

89
00:04:16,000 --> 00:04:22,000
So this was the second disadvantage. Now let me talk about the third disadvantage, okay?

90
00:04:22,000 --> 00:04:23,000
Third disadvantage.

91
00:04:23,000 --> 00:04:26,000
Again we'll go and see over here with respect to out of vocabulary.

92
00:04:26,000 --> 00:04:31,000
Now what happens if I add a new word to a sentence like boy girl good?

93
00:04:31,000 --> 00:04:34,000
And let's say I'm going to add the word school.

94
00:04:35,000 --> 00:04:39,000
Now here you will see that the word school is not present in the vocabulary.

95
00:04:39,000 --> 00:04:42,000
So what is it going to do for this specific word?

96
00:04:42,000 --> 00:04:44,000
It is going to get rejected, right?

97
00:04:44,000 --> 00:04:44,000
It is.

98
00:04:44,000 --> 00:04:47,000
It is not at all getting considered in this training data.

99
00:04:48,000 --> 00:04:55,000
Let's say that in our new test data we have included the word school, and we

100
00:04:55,000 --> 00:04:58,000
need to predict the output for this particular sentence.

101
00:04:58,000 --> 00:05:01,000
So the first step will be that we will do text preprocessing.

102
00:05:01,000 --> 00:05:07,000
And then we'll try to convert this into a bag of words using the same technique we used on the training

103
00:05:07,000 --> 00:05:08,000
data set.

104
00:05:08,000 --> 00:05:14,000
But here you see that in my training data set I do not have a vocabulary word called school.

105
00:05:14,000 --> 00:05:18,000
So what it is going to do, it is just going to ignore the specific word, and it is just going to see

106
00:05:18,000 --> 00:05:20,000
where good, boy, and girl are present,

107
00:05:21,000 --> 00:05:21,000
right?

108
00:05:21,000 --> 00:05:27,000
So the out of vocabulary problem still exists, because this word may be an important word for the sentence,

109
00:05:27,000 --> 00:05:31,000
but it is getting removed because we don't have that in the vocabulary, right?
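Here is a minimal sketch of this, assuming the training vocabulary is good, boy, girl and the test sentence adds the word school. The variable names are only for illustration.
```python
# Sketch: an out-of-vocabulary word leaves no trace in the bag-of-words vector.
# The vocabulary and test sentence are assumptions taken from the running example.
vocabulary = ["good", "boy", "girl"]
test_sentence = "boy girl good school"
tokens = test_sentence.split()
vector = [1 if word in tokens else 0 for word in vocabulary]
ignored = [word for word in tokens if word not in vocabulary]
print(vector, ignored)
```
The vector only has slots for the three known words, so school is dropped even if it matters for the sentence.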

110
00:05:31,000 --> 00:05:32,000
Major problem.

111
00:05:32,000 --> 00:05:37,000
So yes, out of vocabulary is obviously an issue over here.

112
00:05:39,000 --> 00:05:39,000
Right.

113
00:05:39,000 --> 00:05:42,000
This still persists okay.

114
00:05:42,000 --> 00:05:43,000
OOV.

115
00:05:44,000 --> 00:05:45,000
Now this was the third.

116
00:05:46,000 --> 00:05:48,000
Now one more important thing.

117
00:05:48,000 --> 00:05:53,000
Semantic meaning is still not getting captured, and I'll tell you why.

118
00:05:55,000 --> 00:05:56,000
Semantic meaning.

119
00:06:00,000 --> 00:06:01,000
Is still not getting captured.

120
00:06:01,000 --> 00:06:04,000
And there are multiple things to explain this.

121
00:06:06,000 --> 00:06:11,000
Okay, now first of all, obviously you know that I'm having only ones and zeros.

122
00:06:11,000 --> 00:06:11,000
Okay.

123
00:06:12,000 --> 00:06:13,000
Now in this particular case, good

124
00:06:13,000 --> 00:06:18,000
and boy are getting the same importance, right?

125
00:06:18,000 --> 00:06:18,000
For girl,

126
00:06:18,000 --> 00:06:21,000
obviously, if the word is not present, I'm getting zero.

127
00:06:21,000 --> 00:06:26,000
A small amount of semantic information is getting captured when compared to the one hot encoding format.

128
00:06:26,000 --> 00:06:31,000
But here you see that when we have a large vocabulary, my values will be either ones or zeros.

129
00:06:31,000 --> 00:06:34,000
One is just indicating whether the word is present or not.

130
00:06:34,000 --> 00:06:37,000
But which is the most important word?

131
00:06:37,000 --> 00:06:39,000
What is the most important context in that particular sentence?

132
00:06:39,000 --> 00:06:41,000
That is obviously not getting captured.

133
00:06:41,000 --> 00:06:44,000
And if that is not getting captured, semantic meaning in turn will not get captured.

134
00:06:45,000 --> 00:06:49,000
Now there is another very important thing over here.

135
00:06:49,000 --> 00:06:51,000
Let's say that I'm having two sentences, okay?

136
00:06:52,000 --> 00:06:54,000
The first is: the food is good.

137
00:06:56,000 --> 00:06:58,000
Let's say in my data set I have this sentence.

138
00:06:58,000 --> 00:07:00,000
The food is not good.

139
00:07:02,000 --> 00:07:03,000
Not good.

140
00:07:03,000 --> 00:07:07,000
Now let's say I don't remove the stop words.

141
00:07:07,000 --> 00:07:14,000
For this I will be having a vocabulary like 1 or 1.

142
00:07:14,000 --> 00:07:16,000
Let's say all these words are there, okay?

143
00:07:16,000 --> 00:07:19,000
And the is also a separate vocabulary word.

144
00:07:19,000 --> 00:07:21,000
Food is also a separate vocabulary word.

145
00:07:21,000 --> 00:07:22,000
Is is also a separate one.

146
00:07:22,000 --> 00:07:24,000
How many unique vocabulary are there for?

147
00:07:24,000 --> 00:07:25,000
Because not is also there, right?

148
00:07:25,000 --> 00:07:30,000
So is will also become one, not will be zero, and good will be one, right?

149
00:07:30,000 --> 00:07:33,000
So this is how we convert from this to this, right?

150
00:07:33,000 --> 00:07:35,000
Similarly from here to here.

151
00:07:35,000 --> 00:07:40,000
If I really need to convert, then what will happen? 1 1 1 1 1, because not is also present.

152
00:07:40,000 --> 00:07:41,000
So I'm writing one.

153
00:07:42,000 --> 00:07:45,000
Now let's say this is my vector one and this is my vector two.

154
00:07:45,000 --> 00:07:51,000
If I try to find out the difference, or how similar these vectors are, just by plotting some points,

155
00:07:51,000 --> 00:07:56,000
let's say that I have reduced these dimensions to two dimensions using PCA, and I've

156
00:07:56,000 --> 00:07:58,000
plotted it based on this.

157
00:07:58,000 --> 00:07:58,000
Right?

158
00:07:58,000 --> 00:08:00,000
Only one value is getting changed, the one for not, right?

159
00:08:00,000 --> 00:08:04,000
So I will get both these vectors very much near to each other.

160
00:08:05,000 --> 00:08:08,000
And we can measure this with something called cosine similarity.

161
00:08:09,000 --> 00:08:11,000
So let's say this is my vector one.

162
00:08:11,000 --> 00:08:12,000
This is my vector one.

163
00:08:12,000 --> 00:08:14,000
This is my vector two.

164
00:08:14,000 --> 00:08:16,000
So vector one is basically present over here.

165
00:08:16,000 --> 00:08:21,000
Vector two is present over here. If they are near to each other,

166
00:08:21,000 --> 00:08:24,000
or if the angle between them is very small,

167
00:08:24,000 --> 00:08:29,000
I may say that both these sentences are almost the same, or similar, right?

168
00:08:29,000 --> 00:08:30,000
This is almost similar.

169
00:08:31,000 --> 00:08:35,000
But do you think both these sentences are similar? They are the complete opposite of each other,

170
00:08:35,000 --> 00:08:36,000
right?

171
00:08:36,000 --> 00:08:42,000
But since only one word is getting changed, only one of the values

172
00:08:42,000 --> 00:08:44,000
is getting changed over here, right?

173
00:08:44,000 --> 00:08:46,000
Like zeros and ones are happening.

174
00:08:46,000 --> 00:08:50,000
And when we plot this, it is becoming kind of a similar sentence.

175
00:08:50,000 --> 00:08:51,000
But this should not be a similar sentence.

176
00:08:51,000 --> 00:08:53,000
This is a completely opposite sentence.

177
00:08:53,000 --> 00:08:54,000
Right.

178
00:08:54,000 --> 00:08:58,000
So this kind of situation is also not handled well by bag of words.
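As a small sketch of the comparison, assume the five-word vocabulary the, food, is, not, good with stop words kept, and binary vectors for the two sentences. The function cosine_similarity here is written by hand for illustration, not taken from a library.
```python
import math
# Sketch: cosine similarity between the two opposite sentences.
# Vocabulary order (the, food, is, not, good) is an assumption for illustration.
v1 = [1, 1, 1, 0, 1]  # "the food is good"
v2 = [1, 1, 1, 1, 1]  # "the food is not good"
def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
similarity = cosine_similarity(v1, v2)
print(round(similarity, 3))  # 0.894
```
A value this close to 1 says the two sentences are nearly the same, even though their meanings are opposite.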

179
00:08:59,000 --> 00:09:03,000
And later on, the techniques that we'll be learning, like Word2Vec, will be solving

180
00:09:03,000 --> 00:09:04,000
all these problems.

181
00:09:05,000 --> 00:09:06,000
Right?

182
00:09:06,000 --> 00:09:11,000
So I hope you are able to understand the advantages and disadvantages of bag of words.

183
00:09:11,000 --> 00:09:18,000
This is super important for interviews, and if your basics here are strong, trust me, you

184
00:09:18,000 --> 00:09:21,000
will be able to understand bag of words, average Word2Vec.

185
00:09:21,000 --> 00:09:26,000
Sorry, you'll be able to understand Word2Vec and average Word2Vec in a very easy manner.

186
00:09:26,000 --> 00:09:29,000
And there are techniques in deep learning which are also going to come, which are called embedding

187
00:09:29,000 --> 00:09:30,000
techniques.

188
00:09:30,000 --> 00:09:33,000
Word embeddings and all of those will become very easy to follow.

189
00:09:34,000 --> 00:09:34,000
Right.

190
00:09:34,000 --> 00:09:38,000
So I hope you are able to understand. This was with respect to bag of words.

191
00:09:38,000 --> 00:09:42,000
And now in the next video, what we are going to do is that we are going to do some practicals and we'll

192
00:09:42,000 --> 00:09:45,000
try to see, with the help of sklearn, how we can perform bag of words.

193
00:09:45,000 --> 00:09:45,000
Right.

194
00:09:45,000 --> 00:09:47,000
So yes, this was it from my side.

195
00:09:47,000 --> 00:09:48,000
I will see you all in the next video.

196
00:09:48,000 --> 00:09:49,000
Thank you.