1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue our discussion on one-hot encoding in NLP.

3
00:00:04,000 --> 00:00:11,000
And I hope you have understood how we can convert words into vectors and probably the entire process.

4
00:00:11,000 --> 00:00:16,000
I have actually explained that to you. In this video, we are going to talk about the advantages

5
00:00:16,000 --> 00:00:18,000
and disadvantages of using one hot encoding.

6
00:00:18,000 --> 00:00:21,000
So let me go ahead and let me write it down.

7
00:00:21,000 --> 00:00:24,000
So here are the advantages okay.

8
00:00:24,000 --> 00:00:27,000
And here are the disadvantages.

9
00:00:31,000 --> 00:00:32,000
Okay.

10
00:00:32,000 --> 00:00:38,000
Now first of all, the basic advantage is that it is very easy to implement with Python programming

11
00:00:38,000 --> 00:00:40,000
language, right?

12
00:00:41,000 --> 00:00:43,000
Easy to implement with Python.

13
00:00:43,000 --> 00:00:50,000
Now, why am I saying that? Because in sklearn we have a specific class with which we can easily implement

14
00:00:50,000 --> 00:00:50,000
this.

15
00:00:50,000 --> 00:00:54,000
And it is called OneHotEncoder, okay.

16
00:00:54,000 --> 00:01:01,000
And in pandas, if you are familiar with the pandas library, we have a function called

17
00:01:01,000 --> 00:01:03,000
pd.get_dummies, right?

18
00:01:03,000 --> 00:01:10,000
So this function will help you create the entire one-hot encoding based on the words, right?
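
A minimal sketch of the two library calls mentioned here, assuming pandas and scikit-learn are installed. The word list is just the first example sentence, and the variable names are illustrative. Note that both libraries order the columns alphabetically rather than in order of appearance.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Words of one example sentence (illustrative input).
words = ["the", "food", "is", "good"]

# pandas: one column per distinct word, one row per input word.
dummies = pd.get_dummies(pd.Series(words), dtype=int)
print(dummies)

# scikit-learn: expects 2-D input of shape (n_samples, n_features),
# so each word is wrapped in its own single-element list.
encoder = OneHotEncoder()
encoded = encoder.fit_transform([[w] for w in words]).toarray()
print(encoder.categories_[0])
print(encoded)
```

OneHotEncoder returns a SciPy sparse matrix by default, which is why the sketch calls toarray() before printing.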

19
00:01:10,000 --> 00:01:16,000
And again, the second advantage... and obviously we'll see these kinds of examples as we go ahead.

20
00:01:16,000 --> 00:01:20,000
But again, I'm not going to implement one-hot encoding separately as its own video, because

21
00:01:20,000 --> 00:01:23,000
we mostly don't use this technique in NLP.

22
00:01:23,000 --> 00:01:24,000
And I'll tell you why.

23
00:01:24,000 --> 00:01:26,000
Because there are a lot of disadvantages also.

24
00:01:26,000 --> 00:01:29,000
Now let's go ahead and talk about the disadvantage.

25
00:01:29,000 --> 00:01:31,000
The first disadvantage over here.

26
00:01:31,000 --> 00:01:33,000
You see that over here.

27
00:01:33,000 --> 00:01:38,000
At the end of the day, if I consider "the food is good", right?

28
00:01:38,000 --> 00:01:40,000
At the end of the day, for the first word I'm getting 1 followed by zeros.

29
00:01:41,000 --> 00:01:43,000
And for the next one I'm getting 0, then 1, then all zeros.

30
00:01:43,000 --> 00:01:50,000
And there are so many zeros and so few ones. In linear algebra we basically

31
00:01:50,000 --> 00:01:53,000
call this a sparse matrix.

32
00:01:54,000 --> 00:01:55,000
Okay.

33
00:01:55,000 --> 00:02:00,000
So this basically creates a sparse matrix right.

34
00:02:00,000 --> 00:02:02,000
Now what is a sparse matrix, exactly?

35
00:02:02,000 --> 00:02:03,000
To talk about.

36
00:02:03,000 --> 00:02:08,000
A sparse matrix is a matrix in which most of the entries are zeros and only a few are ones, okay.
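
To make the sparsity concrete, here is a small sketch assuming NumPy and the seven-word vocabulary from the lecture's example. The vocabulary order and the helper name one_hot are assumptions for illustration.

```python
import numpy as np

# Vocabulary order assumed from the lecture's example sentences.
vocab = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(sentence):
    """Return one row per word, with a single 1 in that word's column."""
    words = sentence.lower().split()
    matrix = np.zeros((len(words), len(vocab)), dtype=int)
    for row, word in enumerate(words):
        matrix[row, index[word]] = 1
    return matrix

d1 = one_hot("the food is good")
zero_fraction = 1 - d1.sum() / d1.size  # share of entries that are zero
print(d1)
print(f"{zero_fraction:.0%} of the entries are zero")
```

Even with only seven words in the vocabulary, 24 of the 28 entries are zero.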

37
00:02:08,000 --> 00:02:11,000
And what is the problem with respect to sparse matrix?

38
00:02:11,000 --> 00:02:14,000
We can also convert this into an array.

39
00:02:14,000 --> 00:02:17,000
So we can also call these sparse arrays.

40
00:02:17,000 --> 00:02:20,000
But understand what is the disadvantage with respect to sparse matrix.

41
00:02:20,000 --> 00:02:26,000
Whenever we apply any machine learning algorithm, you know,

42
00:02:26,000 --> 00:02:29,000
after we convert the text into vectors, right?

43
00:02:29,000 --> 00:02:34,000
In many machine learning algorithms, this can lead to something called overfitting.

44
00:02:35,000 --> 00:02:37,000
Right. Now, what exactly is overfitting?

45
00:02:37,000 --> 00:02:42,000
Overfitting is a situation wherein you get very good accuracy on the training data, but with respect

46
00:02:42,000 --> 00:02:46,000
to any new data, the model will not be able to give you good accuracy.

47
00:02:46,000 --> 00:02:51,000
So this sparse matrix can lead to something called overfitting.

48
00:02:51,000 --> 00:02:51,000
Right.

49
00:02:51,000 --> 00:02:53,000
So this is one of the major disadvantages.

50
00:02:53,000 --> 00:02:57,000
Now let me talk about the next disadvantage over here okay.

51
00:02:57,000 --> 00:03:02,000
Now over here, let's say that I have these sentences: the food is good, the pizza is... sorry, the

52
00:03:02,000 --> 00:03:03,000
food is good, the food is bad.

53
00:03:03,000 --> 00:03:05,000
Pizza is amazing. Right.

54
00:03:05,000 --> 00:03:07,000
See one thing over here, right?

55
00:03:07,000 --> 00:03:10,000
In any machine learning algorithm, whenever you give the data.

56
00:03:10,000 --> 00:03:11,000
Right.

57
00:03:11,000 --> 00:03:14,000
And in this case also if I probably say this vocabulary right.

58
00:03:14,000 --> 00:03:16,000
These are all my features.

59
00:03:16,000 --> 00:03:17,000
The, food, is, good,

60
00:03:17,000 --> 00:03:18,000
bad, pizza,

61
00:03:18,000 --> 00:03:18,000
amazing.

62
00:03:18,000 --> 00:03:25,000
Right. Now over here you can see that for every word that is getting converted

63
00:03:25,000 --> 00:03:28,000
into a vector the size is seven, right?

64
00:03:28,000 --> 00:03:32,000
1, 2, 3, 4, 5, 6, 7, right?

65
00:03:32,000 --> 00:03:36,000
And based on the number of words here, you will be seeing that I'm having four words.

66
00:03:36,000 --> 00:03:38,000
So I'm getting four cross seven here.

67
00:03:38,000 --> 00:03:41,000
Here also I have four words, so I'm again getting four cross seven.

68
00:03:41,000 --> 00:03:46,000
But if I consider with respect to the third statement or third sentence, right.

69
00:03:46,000 --> 00:03:49,000
So if I consider D3, right?

70
00:03:49,000 --> 00:03:57,000
And if I start creating a one-hot encoded format for this, what will it be? See over here.

71
00:03:57,000 --> 00:03:57,000
Right.

72
00:03:57,000 --> 00:04:00,000
So for pizza, it is one at this particular position.

73
00:04:00,000 --> 00:04:05,000
So I will probably create something like this 0000010.

74
00:04:05,000 --> 00:04:05,000
Right.

75
00:04:05,000 --> 00:04:07,000
So this is pizza right.

76
00:04:07,000 --> 00:04:10,000
Then "is": again, we have "is" over here. "Is" is nothing

77
00:04:10,000 --> 00:04:12,000
but this will be one, and all the remaining will be zero.

78
00:04:12,000 --> 00:04:17,000
So I'll just write 0010000.

79
00:04:17,000 --> 00:04:18,000
Right.

80
00:04:18,000 --> 00:04:24,000
And the third word is amazing, which will be the last one, as far as I remember, from the vocabulary.

81
00:04:24,000 --> 00:04:28,000
So this is how my D3 looks in one-hot encoded format, right?

82
00:04:28,000 --> 00:04:33,000
But one thing you really need to understand over here: the size is three cross seven.

83
00:04:34,000 --> 00:04:37,000
Right. Now, in machine learning you need to understand one thing.

84
00:04:38,000 --> 00:04:44,000
Whenever we perform NLP or let's say any machine learning use case, the number of features should be

85
00:04:44,000 --> 00:04:46,000
fixed in length for every input.

86
00:04:46,000 --> 00:04:48,000
But here you can see this is four cross seven.

87
00:04:48,000 --> 00:04:49,000
This is four cross seven.

88
00:04:49,000 --> 00:04:50,000
This is three cross seven.
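
The shape mismatch can be checked with a short sketch, assuming NumPy and the lecture's seven-word vocabulary. The helper name one_hot and the vocabulary order are assumptions for illustration.

```python
import numpy as np

# Vocabulary order assumed from the lecture's example sentences.
vocab = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(sentence):
    """Return one row per word, with a single 1 in that word's column."""
    words = sentence.lower().split()
    matrix = np.zeros((len(words), len(vocab)), dtype=int)
    for row, word in enumerate(words):
        matrix[row, index[word]] = 1
    return matrix

sentences = ["the food is good", "the food is bad", "pizza is amazing"]
shapes = [one_hot(s).shape for s in sentences]
print(shapes)

# Stacking the three matrices into one training array fails,
# because the third sentence has three rows instead of four.
try:
    np.stack([one_hot(s) for s in sentences])
    stacked = True
except ValueError as err:
    stacked = False
    print("cannot stack:", err)
```

The number of columns is always the vocabulary size, but the number of rows follows the sentence length, which is why the three documents do not line up.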

89
00:04:51,000 --> 00:04:57,000
So I cannot train a machine learning algorithm on this particular data.

90
00:04:57,000 --> 00:05:01,000
Because over here we still don't have a fixed text size.

91
00:05:01,000 --> 00:05:01,000
Right.

92
00:05:01,000 --> 00:05:07,000
So this is one of the major disadvantages: with the help of one-hot encoding,

93
00:05:07,000 --> 00:05:11,000
we are not getting a fixed input size. Right, over here I have got four cross seven and four cross seven.

94
00:05:11,000 --> 00:05:15,000
If this one were also four cross seven, I could have trained on it, right?

95
00:05:15,000 --> 00:05:23,000
So over here I'll write that for an ML algorithm we need a fixed size.

96
00:05:24,000 --> 00:05:27,000
We need a fixed-size input, right?

97
00:05:27,000 --> 00:05:29,000
And right now it is not there.

98
00:05:29,000 --> 00:05:32,000
So this is again a major disadvantage right.

99
00:05:32,000 --> 00:05:36,000
We'll be seeing in the upcoming lectures how, with the help of bag of words and TF-IDF, we will be getting a

100
00:05:36,000 --> 00:05:37,000
fixed-size input.

101
00:05:37,000 --> 00:05:38,000
Okay.

102
00:05:38,000 --> 00:05:40,000
Now the third one that you will be seeing.

103
00:05:41,000 --> 00:05:44,000
Most of the times we are finding zeros and ones, right?

104
00:05:44,000 --> 00:05:48,000
Just zeros and ones, most of the time.

105
00:05:48,000 --> 00:05:52,000
If a specific word is there, that will become one, and all the remaining will be zero.

106
00:05:52,000 --> 00:06:01,000
But if I talk about the semantic meaning between these two words like dough and food, right, we are

107
00:06:01,000 --> 00:06:07,000
not able to calculate how far apart they are, or how similar one word is to the other.

108
00:06:07,000 --> 00:06:09,000
And this notion of similarity is what we call semantic meaning.

109
00:06:09,000 --> 00:06:21,000
So here I will write that no semantic meaning is getting captured.

110
00:06:21,000 --> 00:06:23,000
Now let me explain this with a good example.

111
00:06:24,000 --> 00:06:24,000
Okay.

112
00:06:24,000 --> 00:06:31,000
Let's say I have something like this: food, pizza, burger.

113
00:06:31,000 --> 00:06:32,000
Okay.

114
00:06:32,000 --> 00:06:35,000
Now, let's say that in my vocabulary I have just these three words.

115
00:06:35,000 --> 00:06:40,000
So for food, I will write the representation as 100. For pizza,

116
00:06:40,000 --> 00:06:45,000
Let's say I'm going to write it as 010, and for burger, I'm just going to write it as 001.

117
00:06:46,000 --> 00:06:48,000
Now understand these are my vectors.

118
00:06:49,000 --> 00:06:53,000
And right now, since there are three words, I have three vectors, each of size three.

119
00:06:55,000 --> 00:07:00,000
Now, you have probably heard of something called cosine similarity. Or, if you really want to

120
00:07:00,000 --> 00:07:05,000
find out the distance between this vector to this vector and then this vector to this vector, if I

121
00:07:05,000 --> 00:07:09,000
consider this, let's say I'm just going to draw three dimensions, okay?

122
00:07:10,000 --> 00:07:14,000
Let's say this is being determined by something like burger.

123
00:07:14,000 --> 00:07:17,000
And this is being determined by something like food.

124
00:07:17,000 --> 00:07:20,000
And this is being determined by something like pizza.

125
00:07:21,000 --> 00:07:26,000
Now if I talk about all these things right, like let's say in the case of food it is 100.

126
00:07:26,000 --> 00:07:29,000
So in this axis I will probably be getting one.

127
00:07:29,000 --> 00:07:33,000
So this will be represented by (1, 0, 0) in three dimensions.

128
00:07:33,000 --> 00:07:36,000
And if I talk about pizza, it is 010.

129
00:07:36,000 --> 00:07:37,000
So this will be my another point.

130
00:07:37,000 --> 00:07:43,000
Let's say I'm just going to denote this by another point, (0, 1, 0).

131
00:07:43,000 --> 00:07:45,000
And let's talk about the third point over here.

132
00:07:45,000 --> 00:07:49,000
The third point will be somewhere here, at the same distance.

133
00:07:49,000 --> 00:07:51,000
So this will be (0, 0, 1).

134
00:07:51,000 --> 00:07:57,000
Now if I calculate the distance between all these points, it will be exactly equal, right?

135
00:07:57,000 --> 00:08:02,000
So it is not able to tell the actual difference between food, pizza and burger.

136
00:08:02,000 --> 00:08:07,000
It is simply considering that all these words are at an equal distance.
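
The equal-distance claim can be verified with a short sketch, assuming NumPy. The function name cosine_similarity is a local helper written for this illustration, not a library call.

```python
from itertools import combinations

import numpy as np

# One-hot vectors for the three-word vocabulary from the example.
vectors = {
    "food": np.array([1.0, 0.0, 0.0]),
    "pizza": np.array([0.0, 1.0, 0.0]),
    "burger": np.array([0.0, 0.0, 1.0]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

results = {}
for w1, w2 in combinations(vectors, 2):
    a, b = vectors[w1], vectors[w2]
    distance = float(np.linalg.norm(a - b))  # Euclidean distance
    results[(w1, w2)] = (distance, cosine_similarity(a, b))
    print(f"{w1} vs {w2}: distance={distance:.3f}, cosine={results[(w1, w2)][1]:.1f}")
```

Every pair comes out with a Euclidean distance of the square root of 2 and a cosine similarity of 0, so pizza is no closer to food than it is to burger.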

137
00:08:07,000 --> 00:08:11,000
So we cannot understand how this particular word is different from that one.

138
00:08:12,000 --> 00:08:12,000
Right.

139
00:08:12,000 --> 00:08:14,000
This is super important to understand.

140
00:08:14,000 --> 00:08:20,000
So in short, what I am actually trying to say over here is that no semantic meaning is basically getting

141
00:08:20,000 --> 00:08:21,000
captured.

142
00:08:21,000 --> 00:08:25,000
That basically means in this particular sentence, I'm not able to understand which is the most important

143
00:08:25,000 --> 00:08:31,000
word, how this word is related to this word, or how this word is much more similar to this word.

144
00:08:31,000 --> 00:08:35,000
So that information is not getting captured, because at the end of the day, I'm getting ones and zeros.

145
00:08:35,000 --> 00:08:36,000
Okay.

146
00:08:36,000 --> 00:08:39,000
So this was the third major disadvantage.

147
00:08:39,000 --> 00:08:42,000
Now talking about the fourth disadvantage.

148
00:08:42,000 --> 00:08:44,000
And this is also super important.

149
00:08:44,000 --> 00:08:48,000
And this particular concept is called out of vocabulary.

150
00:08:48,000 --> 00:08:51,000
Out of vocabulary.

151
00:08:52,000 --> 00:08:54,000
And what does this basically mean? OOV.

152
00:08:55,000 --> 00:08:55,000
Okay.

153
00:08:55,000 --> 00:08:57,000
Now, what is this all about?

154
00:08:57,000 --> 00:09:02,000
Let's say that right now I have this many words in my vocabulary.

155
00:09:02,000 --> 00:09:06,000
Let's say that after I train my model, I want to test it on a new data set.

156
00:09:06,000 --> 00:09:16,000
So for testing on my new data set, this will be my test data. Let's say it is "burger is bad", and I

157
00:09:16,000 --> 00:09:17,000
need to predict this.

158
00:09:17,000 --> 00:09:20,000
So this is my test data and I need to predict this.

159
00:09:20,000 --> 00:09:26,000
Now you know that over here with respect to this particular sentences okay.

160
00:09:26,000 --> 00:09:30,000
Burger is nowhere present in this particular vocabulary.

161
00:09:31,000 --> 00:09:32,000
So what will be the problem?

162
00:09:32,000 --> 00:09:34,000
We will not find out a way.

163
00:09:34,000 --> 00:09:39,000
We will not have any way to represent this burger in the form of vectors, right?

164
00:09:39,000 --> 00:09:43,000
We will not be able to form a vector here, because the word is not in our vocabulary.

165
00:09:43,000 --> 00:09:45,000
So what will happen in this particular case.

166
00:09:45,000 --> 00:09:50,000
And obviously we will not be able to perform this one hot encoding because in our vocabulary I just

167
00:09:50,000 --> 00:09:51,000
have this many words.

168
00:09:51,000 --> 00:09:57,000
So this will not work when a new word is actually coming, wherein it is not present in the vocabulary

169
00:09:57,000 --> 00:09:59,000
with respect to the test data.
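
A plain-Python sketch of the out-of-vocabulary problem, using the training vocabulary from the lecture's example. Returning None for an unknown word is an assumption made for illustration; a real pipeline might raise an error or map the word to a reserved "unknown" slot instead.

```python
# Vocabulary built from the training sentences (order assumed).
vocab = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
index = {word: i for i, word in enumerate(vocab)}

def encode_word(word):
    """Return the one-hot list for a word, or None if it is out of vocabulary."""
    if word not in index:
        return None
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector

test_sentence = "burger is bad"
encoded = {word: encode_word(word) for word in test_sentence.split()}
for word, vector in encoded.items():
    print(word, "->", vector if vector is not None else "out of vocabulary")
```

The words "is" and "bad" get vectors, but "burger" has no column in the vocabulary, so there is nothing to set to 1.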

170
00:09:59,000 --> 00:10:02,000
And obviously this is again a major disadvantage, right?

171
00:10:02,000 --> 00:10:06,000
So this problem is called out of vocabulary.

172
00:10:06,000 --> 00:10:10,000
So this was in short about the advantages and disadvantages of this.

173
00:10:10,000 --> 00:10:14,000
Understand, a sparse matrix basically means mostly zeros with only a few ones.

174
00:10:14,000 --> 00:10:17,000
It can lead to overfitting with various machine learning algorithms.

175
00:10:17,000 --> 00:10:22,000
Then for any machine learning algorithm I definitely require a fixed size input.

176
00:10:22,000 --> 00:10:23,000
Right now we are not able to get it.

177
00:10:23,000 --> 00:10:26,000
That basically means all my sentences should be of fixed size.

178
00:10:26,000 --> 00:10:27,000
It will not work.

179
00:10:27,000 --> 00:10:28,000
No semantic meaning.

180
00:10:29,000 --> 00:10:30,000
Sorry meaning is getting captured.

181
00:10:30,000 --> 00:10:35,000
I have already told you because if you try to calculate the distance, all these words are equal distance.

182
00:10:35,000 --> 00:10:41,000
It is not saying whether pizza is very similar to food, or how similar burger is to

183
00:10:41,000 --> 00:10:46,000
food, right? That similarity is not there, and because of that, semantic meaning is not getting captured

184
00:10:46,000 --> 00:10:51,000
because here we are getting only ones and zeros, so we are not able to provide it.

185
00:10:51,000 --> 00:10:57,000
Right, over here, take the information about which is the most important word.

186
00:10:57,000 --> 00:10:59,000
That information is also not getting captured.

187
00:10:59,000 --> 00:11:02,000
So this was in short about the advantages and disadvantage.

188
00:11:02,000 --> 00:11:04,000
I hope you have understood this.

189
00:11:04,000 --> 00:11:10,000
And again, the next technique with respect to converting words to vectors is something called

190
00:11:10,000 --> 00:11:11,000
as bag of words.

191
00:11:11,000 --> 00:11:15,000
We'll try to see how it fixes some of the disadvantages from here.

192
00:11:15,000 --> 00:11:18,000
And yes, one more major disadvantage.

193
00:11:18,000 --> 00:11:19,000
I'll show you that.

194
00:11:19,000 --> 00:11:22,000
Let's say that right now I just have three sentences.

195
00:11:22,000 --> 00:11:25,000
Let's say I have a vocabulary of just seven words.

196
00:11:25,000 --> 00:11:29,000
Let's say tomorrow I will have a vocabulary of 50k unique words.

197
00:11:30,000 --> 00:11:32,000
Vocabulary size.

198
00:11:32,000 --> 00:11:34,000
Now what will happen in this particular case.

199
00:11:35,000 --> 00:11:36,000
Right.

200
00:11:36,000 --> 00:11:38,000
I'll be getting a huge number of zeros and very few ones.

201
00:11:38,000 --> 00:11:41,000
So in short, again, this leads to a sparse matrix.
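
A back-of-the-envelope sketch of what a 50k vocabulary means for one document. The 50,000 figure is from the lecture; the document length of 1,000 words and the 8 bytes per cell (a float64 dense array) are assumptions for illustration.

```python
vocab_size = 50_000  # vocabulary size mentioned in the lecture
num_words = 1_000    # hypothetical document length

total_cells = num_words * vocab_size  # cells in the dense one-hot matrix
nonzero_cells = num_words             # exactly one 1 per word
zero_fraction = 1 - nonzero_cells / total_cells
dense_megabytes = total_cells * 8 / 1e6  # 8 bytes per float64 cell

print(f"cells: {total_cells:,}")
print(f"zeros: {zero_fraction:.3%}")
print(f"dense storage: {dense_megabytes:.0f} MB")
```

Under these assumptions, only 1,000 of 50 million cells carry any information, which is the sparsity problem at realistic scale.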

202
00:11:41,000 --> 00:11:45,000
But in a real-world scenario I will not have just three sentences, right?

203
00:11:45,000 --> 00:11:49,000
I'll be having bigger sentences and I'll also be having many sentences as such.

204
00:11:49,000 --> 00:11:54,000
So this is also one of the things that, if they ask you in an interview, you need to talk about

205
00:11:54,000 --> 00:11:57,000
or explain like this with respect to advantages and disadvantages.

206
00:11:57,000 --> 00:11:59,000
So I hope you have understood this.

207
00:11:59,000 --> 00:12:00,000
I will see you all in the next video.

208
00:12:00,000 --> 00:12:04,000
And in the next video I'm going to discuss Bag of Words.

209
00:12:04,000 --> 00:12:04,000
Thank you.