1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to natural language processing.

3
00:00:04,000 --> 00:00:07,000
We have already seen in our previous video about Bag of Words.

4
00:00:07,000 --> 00:00:08,000
We have understood the intuition.

5
00:00:08,000 --> 00:00:12,000
We have also made sure that we understood the advantages and disadvantages.

6
00:00:12,000 --> 00:00:17,000
Now let's go ahead towards the practical part and see how we can implement bag of words.

7
00:00:17,000 --> 00:00:21,000
So over here I have an amazing data set which is called as spam classification.

8
00:00:21,000 --> 00:00:23,000
So if I probably open this right.

9
00:00:23,000 --> 00:00:26,000
So this is a data set which is having a label like ham.

10
00:00:26,000 --> 00:00:28,000
And then I have a message right.

11
00:00:28,000 --> 00:00:31,000
Then again I have ham, then again I have a message.

12
00:00:31,000 --> 00:00:33,000
So here you can see spam.

13
00:00:33,000 --> 00:00:34,000
And then again I have a message.

14
00:00:34,000 --> 00:00:37,000
So like this I have this entire data set.

15
00:00:37,000 --> 00:00:42,000
What I'm actually going to do is that I'm going to basically apply on this particular data set, uh,

16
00:00:42,000 --> 00:00:43,000
bag of words model.

17
00:00:43,000 --> 00:00:47,000
And later on, as we go ahead, we'll also try to see all the different techniques.

18
00:00:47,000 --> 00:00:51,000
So over here I'm just going to create a new Python file okay.

19
00:00:51,000 --> 00:00:53,000
And uh yeah.

20
00:00:53,000 --> 00:00:57,000
Let me just go ahead and write, uh, bag of words.

21
00:00:59,000 --> 00:01:00,000
Practicals.

22
00:01:00,000 --> 00:01:01,000
Right.

23
00:01:01,000 --> 00:01:02,000
Perfect.

24
00:01:02,000 --> 00:01:05,000
So we will go line by line and we'll try to understand everything.

25
00:01:05,000 --> 00:01:08,000
And, uh, first of all, we'll read the data set itself.

26
00:01:08,000 --> 00:01:08,000
Okay.

27
00:01:08,000 --> 00:01:11,000
Now, before I go ahead, uh, I really need some of the libraries.

28
00:01:11,000 --> 00:01:15,000
So I'm going to import pandas as PD.

29
00:01:15,000 --> 00:01:16,000
Okay.

30
00:01:16,000 --> 00:01:18,000
And then I will just try to read this messages.

31
00:01:18,000 --> 00:01:21,000
So I'll write PD dot read underscore CSV.

32
00:01:22,000 --> 00:01:28,000
And here you can see that uh, obviously it is present inside a folder which is called a spam classification

33
00:01:28,000 --> 00:01:29,000
master.

34
00:01:29,000 --> 00:01:32,000
Or and then it is having spam.

35
00:01:32,000 --> 00:01:36,000
So let's see I'm just going to write like this spam classification master.

36
00:01:36,000 --> 00:01:41,000
And then spam classifier dot p y.

37
00:01:41,000 --> 00:01:45,000
So I have something like spam SMS uh, folder.

38
00:01:45,000 --> 00:01:50,000
And then I can basically go ahead and write my data set name.

39
00:01:50,000 --> 00:01:55,000
So inside this you'll be able to see that I have this SMS spam collection.

40
00:01:55,000 --> 00:01:56,000
Right.

41
00:01:56,000 --> 00:02:01,000
So I'm going to write SMS spam a SMS spam collection.

42
00:02:01,000 --> 00:02:02,000
So this is my file.

43
00:02:02,000 --> 00:02:08,000
You'll be able to see it okay SMS spam

44
00:02:10,000 --> 00:02:10,000
collection.

45
00:02:11,000 --> 00:02:14,000
Okay, so this is my data set over here.

46
00:02:14,000 --> 00:02:19,000
But remember, as I opened this particular data set, there are two things that you really need to focus

47
00:02:20,000 --> 00:02:20,000
over here.

48
00:02:20,000 --> 00:02:25,000
Uh, the space between the dependent and independent feature.

49
00:02:25,000 --> 00:02:26,000
That is my text.

50
00:02:26,000 --> 00:02:29,000
It is basically a tab okay.

51
00:02:29,000 --> 00:02:33,000
so I will basically be using a separator which will be a tab.

52
00:02:33,000 --> 00:02:36,000
So in the next step I will write separator.

53
00:02:36,000 --> 00:02:40,000
And this will basically be a backslash t okay.

54
00:02:40,000 --> 00:02:48,000
And then uh along with this, what I'm actually going to do is that I'm going to also assign uh a names

55
00:02:48,000 --> 00:02:55,000
as my like what all features it should basically get assigned to one is label and the one is message,

56
00:02:55,000 --> 00:02:56,000
right?

57
00:02:56,000 --> 00:03:00,000
So I'm basically providing my own custom, uh, column name in short.

58
00:03:00,000 --> 00:03:01,000
Right.

59
00:03:01,000 --> 00:03:03,000
So once I do this, let's go ahead and execute it.

60
00:03:03,000 --> 00:03:08,000
So you will be able to understand I'm just trying to read the data set over here.

61
00:03:08,000 --> 00:03:10,000
So if I probably go and see the messages.

62
00:03:10,000 --> 00:03:12,000
So here you'll be able to see label and message.

63
00:03:12,000 --> 00:03:13,000
Right.

64
00:03:13,000 --> 00:03:18,000
So I've just read this entire data set using a backslash t that is a tab separator.

65
00:03:18,000 --> 00:03:22,000
And then I've also provided two names Label and message, which is my column name.
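The read step described above can be sketched like this. This is only a minimal sketch: the two sample rows are made up for illustration, and an in-memory string stands in for the real file so that it runs anywhere. With the actual data set you would pass your own file path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Assumption: two illustrative rows in the same layout as the file,
# that is: label, a tab character, then the message text.
sample = (
    "ham\tOk, see you at the market later\n"
    "spam\tFree entry in 2 a wkly comp to win tickets\n"
)

# sep="\t" splits each line on the tab.
# names=[...] supplies our own column names, because the file has no header row.
messages = pd.read_csv(io.StringIO(sample), sep="\t", names=["label", "message"])

print(messages.shape)
print(messages["label"].tolist())
```

Without `names`, pandas would treat the first message as the header row, which is why the custom column names are passed here.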

66
00:03:22,000 --> 00:03:23,000
Perfect.

67
00:03:23,000 --> 00:03:24,000
Till here it is quite good.

68
00:03:24,000 --> 00:03:29,000
We are able to see the data set and clearly we are able to see the data set itself right.

69
00:03:29,000 --> 00:03:34,000
Now the first step obviously before applying bag of words we really need to clean the data.

70
00:03:34,000 --> 00:03:38,000
So I'll write data cleaning and preprocessing.

71
00:03:39,000 --> 00:03:39,000
Okay.

72
00:03:40,000 --> 00:03:42,000
Let me make my mic a little bit straight.

73
00:03:42,000 --> 00:03:43,000
Okay.

74
00:03:43,000 --> 00:03:45,000
Data cleaning and pre-processing.

75
00:03:45,000 --> 00:03:46,000
Now what

76
00:03:46,000 --> 00:03:49,000
all things can we actually clean in the data?

77
00:03:49,000 --> 00:03:49,000
Right.

78
00:03:49,000 --> 00:03:55,000
What all processing we can basically do: obviously lowering the sentences, applying stopwords if

79
00:03:55,000 --> 00:03:58,000
I want applying um, uh lemmatization if I want.

80
00:03:58,000 --> 00:04:01,000
So there are various ways what all things we can basically do.

81
00:04:01,000 --> 00:04:01,000
Right.

82
00:04:01,000 --> 00:04:05,000
So, uh, what I'm actually going to do is that I'm going to import some of the libraries for this.

83
00:04:05,000 --> 00:04:10,000
And again, I also need to make sure that any special characters that are present in these particular

84
00:04:10,000 --> 00:04:14,000
sentences, I should remove them, you know, because see these kinds of words.

85
00:04:14,000 --> 00:04:17,000
Uh, it has two dots.

86
00:04:17,000 --> 00:04:21,000
You know, these are like special words, uh, hardly being used for this kind of use cases, which

87
00:04:21,000 --> 00:04:24,000
we can directly do it with the help of Stopwords itself.

88
00:04:24,000 --> 00:04:27,000
So I'm going to import regular expression, I'm going to import NLTK.

89
00:04:28,000 --> 00:04:32,000
And along with that I will just write nltk dot download stop words.

90
00:04:33,000 --> 00:04:33,000
Okay.

91
00:04:34,000 --> 00:04:39,000
And already we have done the downloading of this particular stop words if you are following my previous

92
00:04:39,000 --> 00:04:42,000
videos, but let's say that you have you have directly jumped over here.

93
00:04:42,000 --> 00:04:45,000
I'm just going to download it once again for you.

94
00:04:45,000 --> 00:04:50,000
Okay, so here you will be able to see I've written stopwords, so it is already up to date.

95
00:04:50,000 --> 00:04:51,000
Uh, all these things are there.

96
00:04:52,000 --> 00:04:57,000
Now, uh, the next thing what I'm actually going to do is that I'm going to write from NLTK dot corpus

97
00:04:58,000 --> 00:05:00,000
import Stopwords.

98
00:05:03,000 --> 00:05:04,000
Import Stopwords.

99
00:05:04,000 --> 00:05:12,000
And um, I'm also going to make sure that I download from NLTK dot stem I import the.

100
00:05:14,000 --> 00:05:18,000
NLTK dot stem dot porter.

101
00:05:19,000 --> 00:05:20,000
Import.

102
00:05:20,000 --> 00:05:21,000
Porter.

103
00:05:21,000 --> 00:05:21,000
Stemmer.

104
00:05:21,000 --> 00:05:22,000
Right.

105
00:05:22,000 --> 00:05:26,000
And I hope everybody knows why Porter stemmer is basically required.

106
00:05:26,000 --> 00:05:27,000
So I'm just going to write it.

107
00:05:27,000 --> 00:05:28,000
This one.

108
00:05:28,000 --> 00:05:32,000
Porter stemmer this is specifically for the stemming purpose.

109
00:05:32,000 --> 00:05:35,000
And along with that I will also make sure that I'll initialize this.

110
00:05:35,000 --> 00:05:35,000
Okay.

111
00:05:35,000 --> 00:05:41,000
So I'll write Porter Stemmer and I will just initialize it so that we will be able to do this.

112
00:05:42,000 --> 00:05:43,000
Um, okay.

113
00:05:43,000 --> 00:05:45,000
I really need to write from.

114
00:05:45,000 --> 00:05:45,000
Okay.

115
00:05:45,000 --> 00:05:48,000
So I'm not going to edit any errors as such.

116
00:05:48,000 --> 00:05:52,000
So everything whatever I am facing the error I think you should also be facing.

117
00:05:52,000 --> 00:05:55,000
Or you may be also writing with respect to some mistakes, right?

118
00:05:55,000 --> 00:05:57,000
Whenever we are trying to solve this. Perfect.

119
00:05:57,000 --> 00:06:01,000
Uh, we have imported Porter Stemmer because we really want to do the stemming.
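As a small sketch of the import and initialization just described (note the `from` at the start of the import line, which is the mistake that was corrected above). The sample words are made up for illustration; `ps` is the same variable name used later for stemming.

```python
from nltk.stem.porter import PorterStemmer

# Create one stemmer object and reuse it for every word.
ps = PorterStemmer()

# Illustrative words only, to see what stemming does.
for word in ["running", "played", "messages"]:
    print(word, "->", ps.stem(word))
```

The stemmer itself needs no download. Only the stopword list has to be fetched once with `nltk.download('stopwords')`, as mentioned earlier.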

120
00:06:01,000 --> 00:06:05,000
Now, what I'm actually going to do is that I'm just going to create a list which is like corpus.

121
00:06:05,000 --> 00:06:09,000
Okay, since I need to clean all the data right at the end of the day, after I clean the data, it

122
00:06:09,000 --> 00:06:20,000
will be stored in this variable and let me iterate it for I in range of zero comma length of messages.

123
00:06:20,000 --> 00:06:24,000
Now what I'm actually doing is that I will make sure that I go through each and every sentence.

124
00:06:25,000 --> 00:06:31,000
I remove all the special character, I lower down all the sentences, I apply the stopwords and then

125
00:06:31,000 --> 00:06:33,000
I probably apply stemming.

126
00:06:33,000 --> 00:06:38,000
So there are many things that we are going to do, and this task is common and generic with respect

127
00:06:38,000 --> 00:06:39,000
to all the use cases.

128
00:06:39,000 --> 00:06:42,000
Let's say that later on you want to solve any other use cases, right?

129
00:06:42,000 --> 00:06:46,000
You can follow the same approach and that will actually help you to solve this problem statement.

130
00:06:46,000 --> 00:06:47,000
Okay.

131
00:06:47,000 --> 00:06:51,000
So what I'm actually going to do over here is that I'm just going to write colon okay.

132
00:06:51,000 --> 00:06:54,000
And then first of all let me go ahead and take the review.

133
00:06:54,000 --> 00:06:54,000
Right.

134
00:06:54,000 --> 00:06:59,000
So here I will say, uh, the first thing that I'm going to do is that after I go with respect to each

135
00:06:59,000 --> 00:07:03,000
and every review, I first have to remove the special character.

136
00:07:03,000 --> 00:07:09,000
That basically means I don't want anything else other than the, uh, words or letters over there.

137
00:07:09,000 --> 00:07:14,000
Okay, so over here I'll be using re and there is a function which is called as dot sub.

138
00:07:15,000 --> 00:07:15,000
Okay.

139
00:07:15,000 --> 00:07:19,000
Now in this particular function I will just apply my regular expression.

140
00:07:19,000 --> 00:07:21,000
Now how do I apply my regular expression?

141
00:07:21,000 --> 00:07:26,000
You can definitely try out different different things, but I'll give a simple regular expression.

142
00:07:26,000 --> 00:07:32,000
And there is also a website where you can easily, uh, see that what all regular expression you need

143
00:07:32,000 --> 00:07:34,000
to apply to select those similar kind of words.

144
00:07:34,000 --> 00:07:35,000
Okay.

145
00:07:35,000 --> 00:07:40,000
So I'll say, uh, I'll just give one simple, uh, thing with respect to this.

146
00:07:40,000 --> 00:07:42,000
You know, I will say that, okay.

147
00:07:42,000 --> 00:07:48,000
With respect to only if you if you find A to Z, okay.

148
00:07:49,000 --> 00:07:56,000
Or I can also say that apart from this word like a to Z, that small A to Z or capital A to Z.

149
00:07:56,000 --> 00:08:01,000
Okay, apart from this word, any words, any special character that you find, just replace it with

150
00:08:01,000 --> 00:08:02,000
blank.

151
00:08:02,000 --> 00:08:07,000
Okay, I'm trying to use this because there are a lot of special characters in the text which I've basically

152
00:08:07,000 --> 00:08:08,000
analyzed.

153
00:08:08,000 --> 00:08:13,000
And here what I am saying is that apart from the text that is small a to z and capital A to Z, replace

154
00:08:13,000 --> 00:08:14,000
everything by blank.

155
00:08:14,000 --> 00:08:18,000
So here I will be having message and this same thing.

156
00:08:18,000 --> 00:08:20,000
I will try to apply it.

157
00:08:20,000 --> 00:08:24,000
Now messages obviously has all the messages over here right.

158
00:08:24,000 --> 00:08:26,000
But inside this I have a label called as message.

159
00:08:26,000 --> 00:08:29,000
So I'm just going to apply that message.

160
00:08:29,000 --> 00:08:34,000
And with respect to any index that is I, I'm just going to apply this.

161
00:08:34,000 --> 00:08:34,000
Okay.

162
00:08:34,000 --> 00:08:37,000
So review what I'm doing is that I'm just substituting right.

163
00:08:37,000 --> 00:08:42,000
This sub is basically nothing but substituting A to z capital A to Z along like apart from this A to

164
00:08:42,000 --> 00:08:49,000
Z, any special character that you get, just replace it with blank and try to replace it in our message

165
00:08:49,000 --> 00:08:50,000
and store it in a review variable.
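The substitution just described can be tried on a single string. This is a minimal sketch with a made-up message. The pattern `[^a-zA-Z]` means "any character that is not a small or capital letter", and each such character is replaced with a space.

```python
import re

# Assumption: an illustrative message with digits and punctuation in it.
text = "Free entry in 2 a wkly comp!! Call 08001234567 now..."

# Everything except a-z and A-Z becomes a blank space.
review = re.sub("[^a-zA-Z]", " ", text)

print(review)
```

Note that digits are removed as well, not only punctuation, so phone numbers and amounts disappear from the text at this step.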

166
00:08:50,000 --> 00:08:52,000
Okay, this is perfect.

167
00:08:52,000 --> 00:08:57,000
Now the next step what I'm actually going to do I'm also going to make sure that I lower all these words

168
00:08:57,000 --> 00:08:59,000
because I really need to lower this okay.

169
00:08:59,000 --> 00:09:05,000
If I don't lower this then again it will be a problem after lowering the word what I will do since I

170
00:09:05,000 --> 00:09:08,000
will also be having spaces there,

171
00:09:08,000 --> 00:09:11,000
I will convert this entire review into a list of words.

172
00:09:11,000 --> 00:09:13,000
So I will write review dot split.

173
00:09:14,000 --> 00:09:16,000
Okay, just to create a list of words okay.

174
00:09:17,000 --> 00:09:24,000
And then finally for each and every for each and every words, like, let's say if I probably just try

175
00:09:24,000 --> 00:09:26,000
to, uh, execute this review okay.

176
00:09:26,000 --> 00:09:27,000
How will I get.

177
00:09:28,000 --> 00:09:30,000
So print review okay.

178
00:09:31,000 --> 00:09:31,000
Okay.

179
00:09:32,000 --> 00:09:32,000
It's okay.

180
00:09:32,000 --> 00:09:33,000
We'll be able to see it.

181
00:09:33,000 --> 00:09:34,000
Okay.

182
00:09:34,000 --> 00:09:37,000
So what I have done is that I have substituted it, I have made it lower.

183
00:09:37,000 --> 00:09:41,000
I have done a split with respect to each and every reviews.

184
00:09:41,000 --> 00:09:42,000
Inside the sentence.

185
00:09:42,000 --> 00:09:43,000
I'll get a list of words over here.
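A quick sketch of the lowering and splitting step, using a made-up string that already has the special characters replaced by spaces:

```python
# Assumption: an illustrative sentence, as it looks after the substitution step.
review = "Free entry in   a wkly comp"

# Lower-case first, so that "Free" and "free" count as the same word.
review = review.lower()

# split() with no argument splits on any run of spaces,
# so the extra blanks left by the substitution do not create empty words.
words = review.split()

print(words)
```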

186
00:09:44,000 --> 00:09:49,000
Now, after I get a list of words, what I'll do, I will apply the same concept that we did before,

187
00:09:49,000 --> 00:09:49,000
right?

188
00:09:50,000 --> 00:09:53,000
Uh, in, uh, lemmatization and all same process.

189
00:09:53,000 --> 00:09:56,000
So over here, you'll be able to see that, uh, what kind of process?

190
00:09:56,000 --> 00:09:58,000
Uh, with respect to this things here.

191
00:09:58,000 --> 00:09:59,000
Same process.

192
00:09:59,000 --> 00:10:01,000
I will try to apply the same combination.

193
00:10:01,000 --> 00:10:02,000
Whatever I did.

194
00:10:02,000 --> 00:10:06,000
If it is not present in Stopwords, then only you try to stem it, right?

195
00:10:06,000 --> 00:10:08,000
Like how I did it in text pre-processing.

196
00:10:08,000 --> 00:10:08,000
Same thing.

197
00:10:08,000 --> 00:10:11,000
I will be doing it over here with respect to bag of words.

198
00:10:11,000 --> 00:10:20,000
So I will say for word in review that basically means review has a list of words with respect to a specific

199
00:10:20,000 --> 00:10:21,000
sentence.

200
00:10:21,000 --> 00:10:27,000
I'm iterating through every word, if not word in.

201
00:10:27,000 --> 00:10:34,000
If that word is not present in stop words dot words of English if it is not present in English.

202
00:10:34,000 --> 00:10:37,000
Because this is what I showed you with respect to theoretical intuition.

203
00:10:37,000 --> 00:10:37,000
Right?

204
00:10:37,000 --> 00:10:42,000
I'm going to take that specific word and I'm going to basically apply PS dot stem.

205
00:10:42,000 --> 00:10:44,000
Very simple okay.

206
00:10:44,000 --> 00:10:45,000
Nothing so complicated.

207
00:10:45,000 --> 00:10:46,000
Whatever.

208
00:10:46,000 --> 00:10:47,000
I have done the same thing.

209
00:10:47,000 --> 00:10:49,000
I've explained you with respect to this okay.

210
00:10:49,000 --> 00:10:55,000
And this I will be taking it entirely into my review variable again because I here I'm basically applying

211
00:10:55,000 --> 00:10:57,000
stopwords and stemming.

212
00:10:57,000 --> 00:10:59,000
I will get all my review variable over here.

213
00:10:59,000 --> 00:11:04,000
And finally you'll be able to see that I'll be writing review.

214
00:11:06,000 --> 00:11:12,000
And then I'll be joining that all these words into a sentence again.

215
00:11:12,000 --> 00:11:12,000
Okay.

216
00:11:12,000 --> 00:11:18,000
And later on what I will do, instead of printing it, you know, the list that I have actually made,

217
00:11:18,000 --> 00:11:19,000
I will just add it over there.

218
00:11:19,000 --> 00:11:24,000
So I will say corpus dot append and I will append all this review.

219
00:11:24,000 --> 00:11:30,000
Okay, so once I probably execute it you'll be able to see it will take some time.

220
00:11:30,000 --> 00:11:32,000
But at the end of the day we are going to clean it.

221
00:11:32,000 --> 00:11:36,000
We are going to make sure that all the lowering of the words will happen.

222
00:11:37,000 --> 00:11:42,000
And finally, we'll be able to get the entire corpus, because after this only we will be able to apply

223
00:11:42,000 --> 00:11:43,000
the bag of words.

224
00:11:43,000 --> 00:11:44,000
Now this is perfect.

225
00:11:44,000 --> 00:11:45,000
It is completely done.

226
00:11:45,000 --> 00:11:50,000
Now if I probably go and see the corpus now here you can see everything is basically done right.

227
00:11:50,000 --> 00:11:53,000
And obviously you know the disadvantages of stemming.

228
00:11:53,000 --> 00:11:54,000
Some of the words will not come right.

229
00:11:54,000 --> 00:11:57,000
But uh, maximum number of words will come right.

230
00:11:57,000 --> 00:12:00,000
If you want to make it more accurate, you can definitely apply Lemmatization.

231
00:12:00,000 --> 00:12:05,000
And if you also want to make sure that you apply snowball stemmer, you can also do that.

232
00:12:05,000 --> 00:12:07,000
That will be a task that I really want to give it to you as an assignment.

233
00:12:07,000 --> 00:12:09,000
Okay, you can try both of them.

234
00:12:09,000 --> 00:12:11,000
But again, understand with Lemmatization it will take more time.
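For the Snowball part of the assignment, a starting point might look like this. It is only a sketch with made-up words, not a full solution: the idea is that `stemmer.stem(word)` takes the place of `ps.stem(word)` in the cleaning loop.

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball needs the language name when it is created.
stemmer = SnowballStemmer("english")

# Illustrative words only; compare the results with the Porter stemmer yourself.
for word in ["running", "fairly", "messages"]:
    print(word, "->", stemmer.stem(word))
```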

235
00:12:11,000 --> 00:12:15,000
Okay, so this is my entire text, uh, my input data.

236
00:12:15,000 --> 00:12:16,000
Okay.

237
00:12:16,000 --> 00:12:18,000
Input data with respect to all the things.

238
00:12:18,000 --> 00:12:23,000
And uh, let's see what will happen in the next step that we are going to basically apply bag of words.

239
00:12:23,000 --> 00:12:28,000
Now let's go ahead and create the bag of words.

240
00:12:28,000 --> 00:12:31,000
As you know I already know what is my output variable?

241
00:12:31,000 --> 00:12:35,000
Okay, my output variable is nothing but the spam and ham the label column, right?

242
00:12:35,000 --> 00:12:37,000
And this is specifically my input features.

243
00:12:37,000 --> 00:12:39,000
I need to convert this into vectors.

244
00:12:39,000 --> 00:12:42,000
So now we are going to create the bag of words model.

245
00:12:42,000 --> 00:12:45,000
Now this is super important guys how the bag of words model is created.

246
00:12:45,000 --> 00:12:51,000
So just go and search for sklearn bag of words okay.

247
00:12:51,000 --> 00:12:59,000
And uh here uh with respect to sklearn, you have a library which is called as countvectorizer.

248
00:12:59,000 --> 00:13:04,000
Now, this countvectorizer I'm going to specifically use for the bag of words.

249
00:13:04,000 --> 00:13:09,000
And here we are going to discuss about some of the important keywords.

250
00:13:09,000 --> 00:13:09,000
Okay.

251
00:13:09,000 --> 00:13:11,000
Then only we'll be able to understand okay.

252
00:13:11,000 --> 00:13:15,000
There are so many keywords that we will try to cover.

253
00:13:15,000 --> 00:13:15,000
Okay.

254
00:13:15,000 --> 00:13:20,000
And uh, with respect to that only we'll try to, uh, solve more of the problem statement as we go

255
00:13:20,000 --> 00:13:20,000
ahead.

256
00:13:20,000 --> 00:13:21,000
Okay.

257
00:13:21,000 --> 00:13:23,000
So, uh, let's go ahead and create it.

258
00:13:23,000 --> 00:13:26,000
So first of all, I will just import countvectorizer.

259
00:13:26,000 --> 00:13:32,000
So I'll say from sklearn dot feature extraction dot text we are going to import Countvectorizer.

260
00:13:32,000 --> 00:13:32,000
Okay.

261
00:13:32,000 --> 00:13:40,000
So here what I'm actually going to do is that I'm just going to write something like this from sklearn

262
00:13:40,000 --> 00:13:43,000
dot feature extraction.

263
00:13:44,000 --> 00:13:50,000
Dot text import count vectorizer.

264
00:13:50,000 --> 00:13:50,000
Okay.

265
00:13:50,000 --> 00:13:54,000
So and then I will initialize the CountVectorizer.

266
00:13:54,000 --> 00:13:58,000
Now you need to understand some of the things with respect to the parameters okay.

267
00:13:59,000 --> 00:14:04,000
Now here you can see that if I probably zoom in there is also something called as lowercase is equal

268
00:14:04,000 --> 00:14:04,000
to true.

269
00:14:04,000 --> 00:14:07,000
See we did not have to do the lowercase manually, right?

270
00:14:07,000 --> 00:14:12,000
You can also use this particular library and make sure that you just put lowercase is equal to true.

271
00:14:12,000 --> 00:14:14,000
It will also do for you okay.

272
00:14:14,000 --> 00:14:16,000
And there is also stop words.

273
00:14:16,000 --> 00:14:17,000
You can also apply stopwords.

274
00:14:17,000 --> 00:14:19,000
But we have already done that, so I'm not going to focus on that.

275
00:14:20,000 --> 00:14:24,000
This n gram range I will talk about it uh, in the upcoming sessions.

276
00:14:24,000 --> 00:14:25,000
Right now.

277
00:14:25,000 --> 00:14:30,000
Let's go ahead and focus more on this particular parameter, which is called max underscore features.

278
00:14:30,000 --> 00:14:35,000
Now in this I have already told you see based on frequency, right.

279
00:14:35,000 --> 00:14:39,000
We can pick up the top ten features, top 20 features, top 30 features.

280
00:14:39,000 --> 00:14:40,000
Right.

281
00:14:40,000 --> 00:14:47,000
And you know that whenever we try to solve any problem with respect to sentiment analysis and all,

282
00:14:47,000 --> 00:14:50,000
we obviously get some kind of vocabulary size.

283
00:14:50,000 --> 00:14:54,000
But when I see right how frequent those vocabulary is basically present.

284
00:14:54,000 --> 00:14:55,000
Right.

285
00:14:55,000 --> 00:15:00,000
And it is not necessary that every one will be available for 100 times or 200 times,

286
00:15:00,000 --> 00:15:03,000
there will be some of the words that will be just present for once and twice.

287
00:15:03,000 --> 00:15:06,000
And sometimes those words will not play a very important role.

288
00:15:06,000 --> 00:15:06,000
Okay.

289
00:15:06,000 --> 00:15:10,000
So what we basically do is that we can set some max features.

290
00:15:10,000 --> 00:15:11,000
Okay.

291
00:15:11,000 --> 00:15:15,000
I'll just give you one idea that how we can set a max features, uh, with respect to this.

292
00:15:15,000 --> 00:15:20,000
But other than that, ngram underscore range we'll discuss later on.

293
00:15:20,000 --> 00:15:25,000
Not right now, but I really want to show you an example with respect to max underscore features by

294
00:15:25,000 --> 00:15:26,000
default.

295
00:15:26,000 --> 00:15:30,000
Other things we have already done it so we don't have to really worry about it okay.

296
00:15:30,000 --> 00:15:31,000
And one by one we'll try to complete it.

297
00:15:31,000 --> 00:15:36,000
And if you are not able to understand it, uh, don't worry, everything will be covered as we go step

298
00:15:36,000 --> 00:15:37,000
by step.

299
00:15:37,000 --> 00:15:39,000
So here you'll be seeing that Max features.

300
00:15:39,000 --> 00:15:45,000
Let's say I'm saying that out of all these words, out of all these words, right from the vocabulary,

301
00:15:45,000 --> 00:15:51,000
take the top 2500 words, which are having the maximum frequency.

302
00:15:51,000 --> 00:15:54,000
Okay, so this is what I'm actually creating.

303
00:15:54,000 --> 00:16:01,000
I'm creating a bag of words which is saying that take the top 2500 words from this particular

304
00:16:01,000 --> 00:16:04,000
data set, which is having the maximum number of frequency.

305
00:16:04,000 --> 00:16:05,000
So once I execute it.

306
00:16:05,000 --> 00:16:07,000
So this is my CV okay.

307
00:16:07,000 --> 00:16:14,000
Now after this all I will do is that I will just say CV dot fit underscore transform and I will apply

308
00:16:14,000 --> 00:16:17,000
it into my entire data set like corpus.

309
00:16:17,000 --> 00:16:17,000
Right.

310
00:16:17,000 --> 00:16:20,000
The corpus is basically having my entire data set.

311
00:16:20,000 --> 00:16:25,000
And I will try to convert this into an array and let me save this in my x variable.
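Here is a small sketch of the same step on a made-up corpus, so the effect of `max_features` is easy to see. The four sentences and the value 3 are assumptions for illustration; the video uses 2500 on the full cleaned corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: a tiny, already cleaned corpus.
corpus = [
    "free prize claim free",
    "go market",
    "claim prize now",
    "free free free entry",
]

# Keep only the 3 words with the highest total count across the corpus.
cv = CountVectorizer(max_features=3)

# fit_transform learns the vocabulary and returns a sparse matrix;
# toarray() turns it into a normal NumPy array.
X = cv.fit_transform(corpus).toarray()

print(X.shape)
print(sorted(cv.vocabulary_))
print(X)
```

Words such as "go", "market", "now" and "entry" appear only once, so they are dropped, and the second sentence becomes a row of all zeros. That is the trade-off when you make `max_features` small.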

312
00:16:25,000 --> 00:16:28,000
Now if I probably go and see my x values.

313
00:16:28,000 --> 00:16:33,000
So here you'll be able to see that I'm getting so many different different x values, and some or the

314
00:16:33,000 --> 00:16:36,000
other way you'll be able to find ones and zeros.

315
00:16:36,000 --> 00:16:38,000
Now see, I've taken the Max 2500 words.

316
00:16:38,000 --> 00:16:39,000
Right.

317
00:16:39,000 --> 00:16:40,000
And because of that you will be able to see this.

318
00:16:40,000 --> 00:16:45,000
So in short, if I probably see x dot shape I've taken that okay.

319
00:16:45,000 --> 00:16:49,000
Take 2500 max number of features.

320
00:16:49,000 --> 00:16:53,000
So obviously the number of vocabulary size, the number of columns that I'm going to have is 2500.

321
00:16:53,000 --> 00:16:59,000
Let's say if I say take the max 100 features, and I just try to do the transformation with respect

322
00:16:59,000 --> 00:17:01,000
to the on the X itself.

323
00:17:01,000 --> 00:17:05,000
So if you remember this, I'm just going to execute this on the corpus.

324
00:17:05,000 --> 00:17:10,000
And now if I go and see this will be my shape with the max 100 vocabulary size.

325
00:17:10,000 --> 00:17:11,000
Right.

326
00:17:11,000 --> 00:17:13,000
And if I again go and see my x.

327
00:17:13,000 --> 00:17:18,000
So here also you'll be able to see some where ones, ones will come and remaining all will be zero,

328
00:17:18,000 --> 00:17:20,000
ones will come and remaining all will zero.

329
00:17:20,000 --> 00:17:22,000
Once will come remaining all will be zero.

330
00:17:22,000 --> 00:17:25,000
There will be multiple ones wherever the specific word is present.

331
00:17:25,000 --> 00:17:28,000
Now you need to understand one more thing.

332
00:17:28,000 --> 00:17:28,000
Over here.

333
00:17:28,000 --> 00:17:29,000
I'm getting two.

334
00:17:29,000 --> 00:17:33,000
So this basically means that this is not binary bag of words.

335
00:17:33,000 --> 00:17:35,000
This is normal bag of words.

336
00:17:35,000 --> 00:17:39,000
Now how do I make sure that I also enable this binary bag of words.

337
00:17:39,000 --> 00:17:44,000
So for that you just go back to sklearn and there will be a feature which is called as binary.

338
00:17:45,000 --> 00:17:46,000
Let's see where it is.

339
00:17:46,000 --> 00:17:49,000
Uh, over here we are not able to find out okay.

340
00:17:49,000 --> 00:17:50,000
Here it is binary.

341
00:17:50,000 --> 00:17:51,000
Okay.

342
00:17:51,000 --> 00:17:55,000
So what I'm actually going to do, I'm going to just go over here along with this Max feature.

343
00:17:55,000 --> 00:18:01,000
I'm just going to write a comma and I'm going to write binary is equal to true right now.

344
00:18:01,000 --> 00:18:07,000
Once I do this... Right now you'll be able to see that you're having twos, right? After I apply

345
00:18:07,000 --> 00:18:08,000
this binary is equal to true,

346
00:18:08,000 --> 00:18:12,000
now you will not be able to find any values that will be greater than one, because even if the count

347
00:18:12,000 --> 00:18:15,000
is greater than one, it is always going to set it as one.

348
00:18:15,000 --> 00:18:15,000
Okay?

349
00:18:15,000 --> 00:18:20,000
And once I execute it now here you'll be able to see either I'll be getting ones and zeros.

350
00:18:20,000 --> 00:18:23,000
Okay, so all this specific thing has got updated.

351
00:18:23,000 --> 00:18:24,000
You'll have only ones and zeros.

352
00:18:24,000 --> 00:18:26,000
Now it is up to you which you want to basically use.

353
00:18:26,000 --> 00:18:31,000
And again it depends from problem statement to problem statement whether you want to go with binary bag of words

354
00:18:31,000 --> 00:18:31,000
or not okay.

355
00:18:32,000 --> 00:18:34,000
So I'm just going to write it over here.

356
00:18:35,000 --> 00:18:43,000
For binary bag of words enable binary is equal to true.
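The difference between the normal and the binary bag of words can be seen side by side in a small sketch. The two sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: a tiny corpus where "free" appears twice in the first sentence.
corpus = ["free free prize", "go market"]

# Normal bag of words: each cell holds how many times the word appears.
counts = CountVectorizer().fit_transform(corpus).toarray()

# Binary bag of words: each cell is 1 if the word is present, else 0.
binary = CountVectorizer(binary=True).fit_transform(corpus).toarray()

print(counts)
print(binary)
```

The columns are in alphabetical order of the vocabulary (free, go, market, prize), so the 2 in the first row of `counts` is the word "free", and in `binary` that same cell is 1.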

357
00:18:44,000 --> 00:18:44,000
Okay.

358
00:18:44,000 --> 00:18:45,000
So this is done.

359
00:18:45,000 --> 00:18:46,000
Very good.

360
00:18:46,000 --> 00:18:50,000
Uh, we are in the right state and we are able to convert this into this.

361
00:18:50,000 --> 00:18:52,000
And here I'm just going to execute it.

362
00:18:52,000 --> 00:18:54,000
Let me keep binary is equal to true.

363
00:18:54,000 --> 00:18:57,000
Otherwise I don't want you all to miss this.

364
00:18:57,000 --> 00:18:58,000
So done.

365
00:18:58,000 --> 00:19:00,000
And this is what we are able to get.

366
00:19:00,000 --> 00:19:04,000
Now understand the array shape.

367
00:19:04,000 --> 00:19:12,000
Since I've taken the top max 100 features which are having the maximum frequency, the column size will

368
00:19:12,000 --> 00:19:13,000
be 100, right?

369
00:19:13,000 --> 00:19:17,000
And this 5572 is basically the length of the entire list.
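To check how the shape comes out, here is a small sketch on a made-up corpus. The rows always equal the number of sentences in the list. The columns equal `max_features` only when the vocabulary has at least that many words; with a tiny corpus like this one you get fewer columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: three illustrative, already cleaned sentences with 5 distinct words.
corpus = ["free prize claim", "go market", "claim prize"]

cv = CountVectorizer(max_features=100, binary=True)
X = cv.fit_transform(corpus).toarray()

# rows = number of sentences, columns = min(max_features, vocabulary size)
print(X.shape)
print(len(corpus), len(cv.vocabulary_))
```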

370
00:19:17,000 --> 00:19:23,000
Okay, so here we have got a complete idea about how to basically convert this into bag of words.

371
00:19:23,000 --> 00:19:27,000
You can play with max features, you can play with binary, you can play with other things.

372
00:19:27,000 --> 00:19:32,000
But the most important thing is that you really need to understand what is ngram underscore range,

373
00:19:32,000 --> 00:19:34,000
which I will show you in the next video.

374
00:19:34,000 --> 00:19:40,000
First, we'll understand the theoretical intuition behind it and then we will try to also implement

375
00:19:40,000 --> 00:19:40,000
it.

376
00:19:40,000 --> 00:19:45,000
But right now I feel that yes we have done a bag of words easily.

377
00:19:45,000 --> 00:19:50,000
We have converted these words into vectors, which is a very, very good thing.

378
00:19:50,000 --> 00:19:54,000
Very amazing thing because now we know at least how to convert a text into the vectors.

379
00:19:54,000 --> 00:19:59,000
Now what I can take, I can take all these arrays as my independent features and all this particular

380
00:19:59,000 --> 00:20:02,000
output like label ham and spam and will try to solve the problems.
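The video stops before the model-building part, so the following is only one possible way to prepare the output variable, not something shown on screen. It assumes the `label` column holds the strings "ham" and "spam", as in the data set above, and uses made-up rows.

```python
import pandas as pd

# Stand-in data with the same two columns as the real DataFrame.
messages = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["see you soon", "free prize claim now", "ok thanks"],
})

# Assumption: encode spam as 1 and ham as 0 so a classifier can use it.
y = (messages["label"] == "spam").astype(int).values

print(y)
```

The bag of words array `X` would then be the independent features and `y` the output, one entry per message.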

381
00:20:02,000 --> 00:20:05,000
Okay, so yes, uh, this was it from my side.

382
00:20:05,000 --> 00:20:07,000
I hope you like this particular video.

383
00:20:07,000 --> 00:20:10,000
Please make sure that you keep on practicing with different, different data sets.

384
00:20:10,000 --> 00:20:15,000
But one assignment that I really want to give you over here is that please make sure that you try to

385
00:20:15,000 --> 00:20:19,000
do it with the help of, uh, lemmatization.

386
00:20:19,000 --> 00:20:25,000
Also, uh, instead of doing stemming, just try to do lemmatization because you may get a better accuracy

387
00:20:25,000 --> 00:20:26,000
with respect to that also.

388
00:20:26,000 --> 00:20:26,000
Okay.

389
00:20:26,000 --> 00:20:28,000
So that will be an assignment.

390
00:20:28,000 --> 00:20:30,000
So yes, uh, this was it from my side.

391
00:20:30,000 --> 00:20:33,000
And yes, I will see you all in the next video.

392
00:20:33,000 --> 00:20:33,000
Thank you.