1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue our discussion with respect to natural language processing.

3
00:00:04,000 --> 00:00:10,000
And now we are going to move towards some more techniques with respect to text pre-processing.

4
00:00:10,000 --> 00:00:14,000
Already in our previous video we have seen tokenization.

5
00:00:14,000 --> 00:00:20,000
We have seen that how we can convert a paragraph into sentences and then probably a paragraph into words,

6
00:00:20,000 --> 00:00:23,000
or converting a sentence into words.

7
00:00:23,000 --> 00:00:23,000
Right?

8
00:00:23,000 --> 00:00:28,000
So in short, we have seen that how we can actually do tokenization with the help of NLTK.

9
00:00:28,000 --> 00:00:34,000
So in this video we are going to focus on something called as stemming, which is a very important process

10
00:00:34,000 --> 00:00:35,000
altogether.

11
00:00:35,000 --> 00:00:37,000
And uh, what exactly is stemming?

12
00:00:37,000 --> 00:00:39,000
I'll also provide you the definition.

13
00:00:39,000 --> 00:00:42,000
We'll also see a lot of examples with respect to that.

14
00:00:42,000 --> 00:00:44,000
And we'll also try to see the different types of stemming.

15
00:00:44,000 --> 00:00:46,000
So I have opened a file over here.

16
00:00:46,000 --> 00:00:48,000
So here you can see regarding the stemming.

17
00:00:48,000 --> 00:00:50,000
And I have also given the definition.

18
00:00:50,000 --> 00:00:54,000
Now let's understand what exactly this is with some good examples.

19
00:00:54,000 --> 00:00:57,000
Now first of all we'll see the definition over here.

20
00:00:57,000 --> 00:01:01,000
It defines stemming, and this definition is taken from Wikipedia.

21
00:01:01,000 --> 00:01:07,000
So stemming is the process of reducing a word to its word stem okay.

22
00:01:07,000 --> 00:01:15,000
This is super important, guys: to its word stem that affixes to suffixes and prefixes, or to the roots of

23
00:01:15,000 --> 00:01:17,000
words, known as a lemma, okay.

24
00:01:17,000 --> 00:01:22,000
Stemming is important in natural language understanding and natural language processing.

25
00:01:22,000 --> 00:01:23,000
Now what exactly this is?

26
00:01:23,000 --> 00:01:24,000
I'll tell you some examples.

27
00:01:24,000 --> 00:01:28,000
Let's say I want to just create some cell below okay.

28
00:01:28,000 --> 00:01:32,000
Over here if I have a lot of text.

29
00:01:32,000 --> 00:01:32,000
Right.

30
00:01:32,000 --> 00:01:39,000
Let's say that I'm trying to solve a sentiment or I'm just trying to solve a classification problem.

31
00:01:39,000 --> 00:01:41,000
Classification problem.

32
00:01:41,000 --> 00:01:43,000
And the classification problem is very simple.

33
00:01:43,000 --> 00:01:56,000
We basically need to find out whether the comments on the product is a positive review or negative review.

34
00:01:56,000 --> 00:01:57,000
Right.

35
00:01:57,000 --> 00:02:00,000
So when we are solving this kind of problem statement.

36
00:02:00,000 --> 00:02:07,000
So in this what we'll be having in our data set, we'll be having the comments or reviews I can say

37
00:02:07,000 --> 00:02:12,000
I'll be having reviews and based on this particular reviews and this reviews will obviously be some

38
00:02:12,000 --> 00:02:18,000
kind of text data and we need to basically create a model where then we can basically classify whether

39
00:02:18,000 --> 00:02:20,000
it is a positive review or negative review.

40
00:02:20,000 --> 00:02:21,000
That is very simple.

41
00:02:21,000 --> 00:02:27,000
Now usually in this reviews let's say that I have some of the words like eating okay.

42
00:02:27,000 --> 00:02:33,000
Or it can be eat right or it can be like eaten right.

43
00:02:33,000 --> 00:02:38,000
It can be different kind of words, but at the end of the day, it actually represents the same thing

44
00:02:38,000 --> 00:02:39,000
regarding eating.

45
00:02:39,000 --> 00:02:39,000
Right.

46
00:02:39,000 --> 00:02:41,000
So this is basically eat.

47
00:02:41,000 --> 00:02:42,000
Eat is the root word.

48
00:02:42,000 --> 00:02:46,000
Or I can also say it is the stem word, the word stem of all these words.

49
00:02:46,000 --> 00:02:46,000
Right.

50
00:02:46,000 --> 00:02:48,000
Because eat is very much common.

51
00:02:48,000 --> 00:02:56,000
And having this variety of words for a problem statement will not impact much with respect to finding

52
00:02:56,000 --> 00:02:58,000
the output like positive or negative review.

53
00:02:58,000 --> 00:03:01,000
Try to understand what I'm actually trying to say over here.

54
00:03:01,000 --> 00:03:07,000
I may have different kind of words like eating, eat, eaten, or there may be also other words like

55
00:03:07,000 --> 00:03:14,000
I can also make a combination like going, gone, goes, right?

56
00:03:15,000 --> 00:03:18,000
At the end of the day, it is basically talking about go right.

57
00:03:18,000 --> 00:03:22,000
So go is a word stem of all these words that are present, right?

58
00:03:23,000 --> 00:03:29,000
So it is not necessary that we need to have similar kind of words again and again, because this increases

59
00:03:29,000 --> 00:03:30,000
the number of input features.

60
00:03:30,000 --> 00:03:36,000
In short, because each and every word represents a vector, as we'll see as we'll go ahead, you know,

61
00:03:36,000 --> 00:03:39,000
after text pre-processing we'll try to convert this text into vectors.
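A minimal sketch of why stemming reduces the number of input features, assuming NLTK is installed; the word list and the names reviews_vocab and stemmed_vocab are illustrative and not from the video.
```python
from nltk.stem import PorterStemmer
# Hypothetical vocabulary: seven surface forms of three underlying words
reviews_vocab = ["eating", "eats", "eat", "writing", "writes", "programming", "programs"]
stemmer = PorterStemmer()
# Collapse every word to its stem and keep only the distinct stems
stemmed_vocab = sorted({stemmer.stem(word) for word in reviews_vocab})
print(len(reviews_vocab), "words ->", len(stemmed_vocab), "stems:", stemmed_vocab)
```
Fewer distinct tokens means fewer dimensions once the text is converted into vectors.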

62
00:03:40,000 --> 00:03:45,000
So instead of having these similar kinds of words, I can just have

63
00:03:45,000 --> 00:03:48,000
one word that is just like go right?

64
00:03:48,000 --> 00:03:51,000
And we'll try to solve the problem with respect to that.

65
00:03:51,000 --> 00:03:55,000
So stemming is actually helping us to do this same thing.

66
00:03:56,000 --> 00:04:01,000
So finding this word stem can be actually done with the help of stemming.

67
00:04:01,000 --> 00:04:05,000
And there is also a concept called lemmatization. We'll try to understand the difference as

68
00:04:05,000 --> 00:04:05,000
we go ahead.

69
00:04:05,000 --> 00:04:10,000
But first of all, let's go ahead and see that how with the help of NLTK, we can perform stemming.

70
00:04:10,000 --> 00:04:14,000
Okay, so what I'm actually going to do is that I'm just going to take some examples.

71
00:04:14,000 --> 00:04:16,000
Let's say that I have all these words.

72
00:04:16,000 --> 00:04:17,000
Okay.

73
00:04:17,000 --> 00:04:18,000
So I'm just going to remove this.

74
00:04:18,000 --> 00:04:21,000
And let's say that I have all these words now.

75
00:04:21,000 --> 00:04:28,000
Right now you have words like eating, eats, eaten, writing, writes, programming, programs, history,

76
00:04:28,000 --> 00:04:30,000
finally, and finalized, okay.

77
00:04:30,000 --> 00:04:33,000
So I'm just going to execute it and let me make some more cells.

78
00:04:35,000 --> 00:04:35,000
Okay.

79
00:04:35,000 --> 00:04:39,000
So I'm just going to delete this cell because I don't require it.

80
00:04:39,000 --> 00:04:41,000
Okay I'll just create a cell below.

81
00:04:42,000 --> 00:04:42,000
Okay.

82
00:04:43,000 --> 00:04:46,000
And let me press Escape, and let us go ahead.

83
00:04:46,000 --> 00:04:47,000
Right.

84
00:04:47,000 --> 00:04:49,000
So here you can see all these particular words are there.

85
00:04:49,000 --> 00:04:55,000
Now let's see how we can find out the word stem of all this particular words with the help of stemming

86
00:04:55,000 --> 00:05:00,000
the first stemming technique that we are probably going to use is something called as Porter stemmer.

87
00:05:01,000 --> 00:05:01,000
Okay.

88
00:05:01,000 --> 00:05:02,000
Porter stemmer.

89
00:05:02,000 --> 00:05:07,000
And there are again different types of stemming techniques, which I will probably be showing

90
00:05:07,000 --> 00:05:07,000
you.

91
00:05:07,000 --> 00:05:11,000
And then we'll be able to understand that what all things it will be able to give us.

92
00:05:11,000 --> 00:05:12,000
Okay.

93
00:05:12,000 --> 00:05:16,000
Now in order to apply this Porter stemmer, it is very simple in NLTK.

94
00:05:16,000 --> 00:05:18,000
That functionality is already present.

95
00:05:18,000 --> 00:05:24,000
So I will just write from nltk.stem

96
00:05:27,000 --> 00:05:34,000
import PorterStemmer. Okay, so here you can see I'm going to use this Porter

97
00:05:34,000 --> 00:05:37,000
stemmer which is just a kind of class.

98
00:05:37,000 --> 00:05:39,000
And for this we have to initialize it.

99
00:05:39,000 --> 00:05:41,000
We have to initialize an object for that.

100
00:05:41,000 --> 00:05:46,000
So let me just create an object and this will basically be my stemming.

101
00:05:47,000 --> 00:05:53,000
And once I do this now in the next step, what I'm actually going to do for each and every word, I'm

102
00:05:53,000 --> 00:05:54,000
just going to apply the stemming process.

103
00:05:54,000 --> 00:05:55,000
Okay.

104
00:05:55,000 --> 00:05:56,000
So it's very simple.

105
00:05:56,000 --> 00:05:57,000
How do I do it?

106
00:05:57,000 --> 00:05:59,000
I will just iterate it.

107
00:05:59,000 --> 00:06:05,000
So I'll say for word in words, right.

108
00:06:05,000 --> 00:06:11,000
And here I'm basically going to write print and let me write it down as word plus.

109
00:06:13,000 --> 00:06:15,000
I'll just give some type of marking.

110
00:06:15,000 --> 00:06:24,000
This is the word and the stemmed part will be nothing, but I'll be using the same object stemming dot

111
00:06:24,000 --> 00:06:25,000
stem.

112
00:06:25,000 --> 00:06:26,000
There is a functionality.

113
00:06:26,000 --> 00:06:32,000
There is a function called stem; whenever we pass any word into it, it will

114
00:06:32,000 --> 00:06:35,000
do the stemming. That basically means, for eating,

115
00:06:35,000 --> 00:06:38,000
it may give you eat; for eats, it may give you eat.

116
00:06:38,000 --> 00:06:38,000
Right.

117
00:06:38,000 --> 00:06:39,000
So something like that.

118
00:06:39,000 --> 00:06:41,000
So I'm just going to give my word over here.

119
00:06:42,000 --> 00:06:48,000
So once I execute, here you can see that, perfect, eating is coming as eat, eats comes as eat, eaten

120
00:06:48,000 --> 00:06:49,000
is coming as eaten.

121
00:06:49,000 --> 00:06:52,000
Writing is coming as write, which is good.

122
00:06:52,000 --> 00:06:53,000
Writes comes as write.

123
00:06:53,000 --> 00:06:58,000
Programming comes as program, and programs is nothing but program.

124
00:06:58,000 --> 00:07:00,000
Here you can see history is becoming histori.

125
00:07:00,000 --> 00:07:02,000
That is h-i-s-t-o-r-i.

126
00:07:02,000 --> 00:07:04,000
So here it is a major issue.

127
00:07:04,000 --> 00:07:04,000
Right.

128
00:07:04,000 --> 00:07:06,000
And I'll also talk about the disadvantages.

129
00:07:06,000 --> 00:07:08,000
Finally becomes final.

130
00:07:08,000 --> 00:07:09,000
Finalized becomes final.
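A minimal sketch of the cell narrated above, assuming NLTK is installed; the names words and stemming follow the narration, while porter_stems and the arrow separator are assumptions.
```python
from nltk.stem import PorterStemmer
# Word list as read out in the video
words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]
# PorterStemmer is a class, so create an object before using it
stemming = PorterStemmer()
# Map each word to its Porter stem and print them side by side
porter_stems = {word: stemming.stem(word) for word in words}
for word, stem in porter_stems.items():
    print(word + " ----> " + stem)  # e.g. history ----> histori
```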

131
00:07:09,000 --> 00:07:15,000
So this is there. Now, over here it looks good, right?

132
00:07:15,000 --> 00:07:18,000
Probably for eaten, you can see that nothing has happened.

133
00:07:18,000 --> 00:07:19,000
It has given the same word.

134
00:07:19,000 --> 00:07:24,000
But if you see some words like history, here you are actually getting histori, right?

135
00:07:24,000 --> 00:07:27,000
So the entire meaning of this particular word has actually changed.

136
00:07:27,000 --> 00:07:32,000
And this is the major disadvantage of stemming when stemming is basically applied, you know, for some

137
00:07:32,000 --> 00:07:37,000
of the words, you know, you may not get a correct exact meaning, the form of that specific word may

138
00:07:37,000 --> 00:07:38,000
change.

139
00:07:38,000 --> 00:07:41,000
So this is the major, major disadvantage with respect to stemming.

140
00:07:41,000 --> 00:07:43,000
Let me show you some more examples.

141
00:07:43,000 --> 00:07:50,000
Now suppose I call stemming.stem; I'm just going to apply this stem function on a word

142
00:07:50,000 --> 00:07:52,000
called as congratulations.

143
00:07:52,000 --> 00:07:53,000
Okay.

144
00:07:53,000 --> 00:07:56,000
And if I execute it here you can see the word.

145
00:07:56,000 --> 00:07:58,000
The meaning of the word is basically changing.

146
00:07:58,000 --> 00:08:01,000
It should have given something like congratulate, right?

147
00:08:01,000 --> 00:08:03,000
But here you can see congratul.

148
00:08:03,000 --> 00:08:06,000
It is basically changing the form of the word.

149
00:08:06,000 --> 00:08:09,000
Now the word does not have any kind of meaning.

150
00:08:09,000 --> 00:08:09,000
Right.

151
00:08:09,000 --> 00:08:12,000
So this is again one major, major disadvantage of stemming.

152
00:08:12,000 --> 00:08:12,000
Okay.

153
00:08:12,000 --> 00:08:18,000
Similarly, if I try to show you something like stemming.stem, and if I write

154
00:08:18,000 --> 00:08:19,000
something like sitting.

155
00:08:19,000 --> 00:08:21,000
So here you will be able to see.

156
00:08:22,000 --> 00:08:23,000
Let me just write it down.

157
00:08:23,000 --> 00:08:24,000
Sit.

158
00:08:24,000 --> 00:08:27,000
Now see, for sitting it is giving a very good word.

159
00:08:27,000 --> 00:08:32,000
So that basically means stemming works for uh very good number of words.

160
00:08:32,000 --> 00:08:35,000
But for some words it does not give us a good answer.
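A short sketch of the two single-word checks described above, assuming NLTK is installed; the names bad_stem and good_stem are illustrative only.
```python
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
# A case where the stem is not a real word
bad_stem = stemming.stem("congratulations")
# A case where the stem is a clean root word
good_stem = stemming.stem("sitting")
print(bad_stem, good_stem)  # congratul sit
```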

161
00:08:35,000 --> 00:08:35,000
Right.

162
00:08:35,000 --> 00:08:38,000
So this is the major disadvantage of stemming.

163
00:08:38,000 --> 00:08:40,000
And this all will get fixed with the help of lemmatization.

164
00:08:40,000 --> 00:08:46,000
But whenever you have problem statements like, uh, a classification problem, review

165
00:08:46,000 --> 00:08:52,000
classification, or we want to see whether an email is spam or ham, right,

166
00:08:52,000 --> 00:08:58,000
whether it is spam or not spam, we should definitely go ahead with using stemming, you know.

167
00:08:58,000 --> 00:09:01,000
And again, some of the words will not come in the right form.

168
00:09:01,000 --> 00:09:07,000
But yes, instead of using Porter stemmer, we have other stemming techniques

169
00:09:07,000 --> 00:09:09,000
which we can definitely use to improve it.

170
00:09:09,000 --> 00:09:10,000
Okay, now let me go.

171
00:09:10,000 --> 00:09:14,000
With respect to the second one, now with respect to the second one, I'm just going to delete this

172
00:09:14,000 --> 00:09:15,000
cell because we don't require it.

173
00:09:16,000 --> 00:09:21,000
The second one is what is called the RegexpStemmer class.

174
00:09:21,000 --> 00:09:26,000
Now this regular expression stemmer class it is nothing, but this is a class with the help of which

175
00:09:26,000 --> 00:09:29,000
we can easily implement regular expression Stemmer algorithm.

176
00:09:29,000 --> 00:09:34,000
So we can just provide a regular expression and it will apply the stemming based on that.

177
00:09:34,000 --> 00:09:39,000
Okay, so it basically takes a single regular expression and removes any prefix or suffix that matches

178
00:09:39,000 --> 00:09:40,000
the expression.

179
00:09:40,000 --> 00:09:41,000
Perfect.

180
00:09:41,000 --> 00:09:43,000
Now what I'll do, I'll just create a cell below.

181
00:09:43,000 --> 00:09:47,000
And again I will make some more cells and I'll try to show you some example.

182
00:09:47,000 --> 00:09:50,000
Now first of all we need to initialize this.

183
00:09:50,000 --> 00:09:56,000
So I will write from nltk.stem import

184
00:09:56,000 --> 00:09:58,000
What is the name? RegexpStemmer.

185
00:09:58,000 --> 00:09:58,000
Right.

186
00:09:58,000 --> 00:10:01,000
So I'm just going to write RegexpStemmer. Perfect.

187
00:10:01,000 --> 00:10:04,000
Then we basically need to initialize it.

188
00:10:04,000 --> 00:10:09,000
So I'll say reg_stemmer is equal to RegexpStemmer(), right.

189
00:10:09,000 --> 00:10:10,000
So I've initialized it.

190
00:10:10,000 --> 00:10:11,000
This is perfect.

191
00:10:12,000 --> 00:10:16,000
Uh, it is giving us an, uh, error saying that regular expression is required.

192
00:10:16,000 --> 00:10:17,000
Now, this is super important.

193
00:10:17,000 --> 00:10:20,000
Now let's press go ahead and press shift and Tab.

194
00:10:20,000 --> 00:10:24,000
Now here you can see the first parameter that goes is something called as regular expression.

195
00:10:24,000 --> 00:10:25,000
Okay.

196
00:10:25,000 --> 00:10:27,000
And this is some minimum value.

197
00:10:27,000 --> 00:10:29,000
We'll try to understand what exactly it is.

198
00:10:29,000 --> 00:10:34,000
Now here you can see: a stemmer that uses regular expressions to identify morphological affixes.

199
00:10:34,000 --> 00:10:34,000
Right.

200
00:10:35,000 --> 00:10:36,000
Understand this.

201
00:10:36,000 --> 00:10:37,000
morphological affixes.

202
00:10:37,000 --> 00:10:38,000
I'll try to show you an example.

203
00:10:38,000 --> 00:10:43,000
If you don't know the meaning of this: any substring that matches the regular expression will

204
00:10:43,000 --> 00:10:43,000
be removed.

205
00:10:44,000 --> 00:10:50,000
So this morphological affixes is basically, in short, a regular expression which will match whatever

206
00:10:50,000 --> 00:10:53,000
words we are basically giving, and if it matches, it will get removed.

207
00:10:53,000 --> 00:10:56,000
Okay, now let's take some example.

208
00:10:57,000 --> 00:10:57,000
Okay.

209
00:10:57,000 --> 00:11:04,000
Over here, uh, you can see that this min parameter is the minimum length of the string

210
00:11:04,000 --> 00:11:04,000
to stem.

211
00:11:04,000 --> 00:11:04,000
Okay.

212
00:11:04,000 --> 00:11:09,000
So if the minimum length is somewhere around four, then only you will be able to apply this.
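A minimal sketch of how the min parameter behaves, assuming NLTK is installed; the pattern "s$" and the word "his" are hypothetical examples chosen for illustration, not taken from the video.
```python
from nltk.stem import RegexpStemmer
# Same pattern, with and without a minimum word length
with_min = RegexpStemmer("s$", min=4)
without_min = RegexpStemmer("s$")  # min defaults to 0
# "his" has only 3 characters, so min=4 leaves it untouched
short_kept = with_min.stem("his")
# Without a minimum length the trailing "s" is stripped
short_cut = without_min.stem("his")
print(short_kept, short_cut)  # his hi
```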

213
00:11:09,000 --> 00:11:14,000
So I'm just going to apply this same thing and I'm going to paste it over here okay.

214
00:11:14,000 --> 00:11:17,000
So here is my regex stemmer.

215
00:11:18,000 --> 00:11:22,000
Now what I'm actually going to do is write reg_stemmer.stem.

216
00:11:22,000 --> 00:11:25,000
Now you should understand over here what I have actually given.

217
00:11:25,000 --> 00:11:27,000
I have given ing$.

218
00:11:28,000 --> 00:11:33,000
Okay, here I have given s$, e$, able$.

219
00:11:33,000 --> 00:11:37,000
Okay, so I have given all these regular expressions, and this will make complete sense when I will be

220
00:11:37,000 --> 00:11:38,000
implementing it.

221
00:11:38,000 --> 00:11:45,000
Now you see, just pause the video and just let me know what is the output that you feel will be getting

222
00:11:45,000 --> 00:11:46,000
for this.

223
00:11:46,000 --> 00:11:48,000
Eating now.

224
00:11:48,000 --> 00:11:48,000
Eating.

225
00:11:48,000 --> 00:11:50,000
You should see over here that ing

226
00:11:50,000 --> 00:11:52,000
is there in the regular expression.

227
00:11:52,000 --> 00:11:54,000
It is there.

228
00:11:55,000 --> 00:11:56,000
But dollar is also there.

229
00:11:56,000 --> 00:11:57,000
Okay, this is super important.

230
00:11:57,000 --> 00:11:58,000
Dollar is also there.

231
00:11:58,000 --> 00:12:03,000
Now if I try to execute this I am able to get something called as eat.

232
00:12:03,000 --> 00:12:04,000
Okay.

233
00:12:04,000 --> 00:12:09,000
Now in short, what is happening is that this is basically saying that wherever at the end of the word

234
00:12:09,000 --> 00:12:16,000
there is ing, or s, or e, or able, just remove that.
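A minimal sketch of this cell, assuming NLTK is installed; the pattern and min=4 follow the narration, while the variable name reg_stemmer and the second word "eats" are assumptions.
```python
from nltk.stem import RegexpStemmer
# Remove a trailing "ing", "s", "e" or "able"; skip words shorter than 4 characters
reg_stemmer = RegexpStemmer("ing$|s$|e$|able$", min=4)
stem_eating = reg_stemmer.stem("eating")
stem_eats = reg_stemmer.stem("eats")
print(stem_eating, stem_eats)  # eat eat
```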

235
00:12:16,000 --> 00:12:21,000
The next example that I want to give: let's say I have this particular word,

236
00:12:21,000 --> 00:12:22,000
ingeating.

237
00:12:23,000 --> 00:12:23,000
Okay.

238
00:12:23,000 --> 00:12:26,000
So here you can see ingeating, right.

239
00:12:26,000 --> 00:12:27,000
What do you think the output will be?

240
00:12:27,000 --> 00:12:28,000
Will it.

241
00:12:28,000 --> 00:12:30,000
Will it be eat, or will it be something else?

242
00:12:30,000 --> 00:12:31,000
Okay.

243
00:12:31,000 --> 00:12:34,000
So here, if I try to execute it, you'll be seeing ingeat.

244
00:12:34,000 --> 00:12:35,000
Why?

245
00:12:35,000 --> 00:12:39,000
Because the regular expression says to match only at the end, because of the dollar.

246
00:12:39,000 --> 00:12:41,000
If I probably remove dollar.

247
00:12:41,000 --> 00:12:43,000
If I just execute it now, see?

248
00:12:43,000 --> 00:12:44,000
Everything will get removed.

249
00:12:44,000 --> 00:12:45,000
Okay?

250
00:12:45,000 --> 00:12:47,000
Or probably I just want in the starting.

251
00:12:47,000 --> 00:12:48,000
So I'll just use this.

252
00:12:49,000 --> 00:12:51,000
So here you can see ingeating.

253
00:12:51,000 --> 00:12:53,000
So this will not work, okay.

254
00:12:53,000 --> 00:12:59,000
So we can basically have something like this uh, and we can basically check how we can actually remove

255
00:12:59,000 --> 00:12:59,000
it.

256
00:12:59,000 --> 00:13:00,000
So this is perfect.

257
00:13:00,000 --> 00:13:03,000
So here you can see wherever ing is there, it is getting removed.
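A minimal sketch of the effect of the dollar anchor, assuming NLTK is installed and assuming the test word was "ingeating"; the variable names are illustrative.
```python
from nltk.stem import RegexpStemmer
# "ing$" matches only at the end of the word
anchored = RegexpStemmer("ing$|s$|e$|able$", min=4)
# "ing" without the dollar matches anywhere in the word
unanchored = RegexpStemmer("ing|s$|e$|able$", min=4)
end_only = anchored.stem("ingeating")
everywhere = unanchored.stem("ingeating")
print(end_only, everywhere)  # ingeat eat
```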

258
00:13:03,000 --> 00:13:03,000
Perfect.

259
00:13:03,000 --> 00:13:05,000
So this is good.

260
00:13:05,000 --> 00:13:08,000
We have done this, okay, ingeat.

261
00:13:08,000 --> 00:13:12,000
Now let's say that uh I also want to try something else.

262
00:13:12,000 --> 00:13:14,000
Uh, you can definitely try with different different word.

263
00:13:14,000 --> 00:13:18,000
You can use disable e s whatever regular expression you can basically.

264
00:13:18,000 --> 00:13:19,000
Right.

265
00:13:19,000 --> 00:13:21,000
You can go ahead and write it and you can check it okay.

266
00:13:21,000 --> 00:13:25,000
So this is with respect to regular expression Stemmer class.

267
00:13:25,000 --> 00:13:32,000
Now uh, the next one that we are basically going to discuss, uh, and uh, let's see where this is.

268
00:13:33,000 --> 00:13:38,000
Porter stemmer, this is, uh, again written here, the same thing.

269
00:13:38,000 --> 00:13:39,000
I don't want to write the same thing.

270
00:13:41,000 --> 00:13:44,000
I'll go ahead with something called as Snowball Stemmer.

271
00:13:45,000 --> 00:13:48,000
And this is also an amazing technique.

272
00:13:48,000 --> 00:13:56,000
And this is a better technique when compared to okay, so snowball stemmer, I think I've given the

273
00:13:56,000 --> 00:13:58,000
definition somewhere here.

274
00:13:58,000 --> 00:13:59,000
No no no no okay.

275
00:13:59,000 --> 00:14:02,000
Snowball stemmer is again a stemming technique.

276
00:14:02,000 --> 00:14:04,000
But in this snowball stemmer it is.

277
00:14:04,000 --> 00:14:07,000
It performs better than this Porter stemmer.

278
00:14:07,000 --> 00:14:10,000
Okay, that is the reason why Snowball stemmer had actually come.

279
00:14:10,000 --> 00:14:12,000
Initially we came up with Porter Stemmer.

280
00:14:12,000 --> 00:14:13,000
We saw that a lot of things.

281
00:14:13,000 --> 00:14:16,000
A lot of words were getting messed up, you know.

282
00:14:16,000 --> 00:14:20,000
So that is the reason why we use Snowball Stemmer, because it gives a better accuracy when compared

283
00:14:20,000 --> 00:14:22,000
to the, uh, porter stemmer.

284
00:14:22,000 --> 00:14:24,000
When I say accuracy, better form of a word.

285
00:14:24,000 --> 00:14:27,000
Okay, now for using snowball stemmer.

286
00:14:27,000 --> 00:14:33,000
What I'll do, I'll write from sklearn dot stem import snowball stemmer.

287
00:14:34,000 --> 00:14:37,000
So here you can see I'll just write snowball.

288
00:14:39,000 --> 00:14:40,000
From sklearn.

289
00:14:40,000 --> 00:14:41,000
Oh, not from sklearn.

290
00:14:41,000 --> 00:14:45,000
Sorry, from NLTK, because it is basically present in NLTK, right?

291
00:14:45,000 --> 00:14:51,000
So from nltk.stem import SnowballStemmer, and I'm just going to import it.

292
00:14:51,000 --> 00:14:54,000
Then we are going to initialize with respect to snowball stemmer.

293
00:14:54,000 --> 00:14:55,000
And let's see what are the parameters.

294
00:14:55,000 --> 00:15:01,000
So here first of all snowball stemmer is also provided with different different languages like uh Arabic

295
00:15:02,000 --> 00:15:08,000
okay, Danish, uh, English, Finnish, French, German, Hungarian, Italian.

296
00:15:08,000 --> 00:15:10,000
So you can use basically all these languages.

297
00:15:10,000 --> 00:15:10,000
Okay.

298
00:15:10,000 --> 00:15:13,000
So for right now I'm just going to use English.

299
00:15:13,000 --> 00:15:18,000
So what I'm going to do is that, in quotes, I will just say english, okay.

300
00:15:18,000 --> 00:15:20,000
And uh, I'll just use this.

301
00:15:20,000 --> 00:15:23,000
And finally I will basically use a snowball stemmer over here.

302
00:15:23,000 --> 00:15:26,000
I will just create a variable okay.

303
00:15:26,000 --> 00:15:27,000
And just execute it.

304
00:15:27,000 --> 00:15:27,000
Perfect.

305
00:15:27,000 --> 00:15:33,000
Now the next thing is that I will just use another condition saying that for word in words okay.

306
00:15:34,000 --> 00:15:37,000
And I'll just write for word in words.

307
00:15:37,000 --> 00:15:44,000
So I'll write snowball stemmer dot stem on word okay.

308
00:15:44,000 --> 00:15:47,000
But I'll print this in a better way so that you'll be able to understand it.

309
00:15:47,000 --> 00:15:53,000
So print word plus.

310
00:15:55,000 --> 00:15:57,000
And this will be like this.

311
00:15:57,000 --> 00:15:58,000
Something like this.

312
00:15:58,000 --> 00:16:00,000
Just for formatting purpose.

313
00:16:00,000 --> 00:16:03,000
I'm writing this so that it'll look better for you all.

314
00:16:03,000 --> 00:16:06,000
And here I'm just going to write the word right.

315
00:16:06,000 --> 00:16:08,000
So the same thing that I did earlier.

316
00:16:08,000 --> 00:16:14,000
So here you can see that I'm getting eating as eat, eats as eat, eaten as eaten.

317
00:16:14,000 --> 00:16:15,000
Write, write.

318
00:16:15,000 --> 00:16:16,000
This is all fine right?

319
00:16:16,000 --> 00:16:17,000
For history.

320
00:16:17,000 --> 00:16:19,000
also, it is not able to give the correct form.
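A minimal sketch of the Snowball cell, assuming NLTK is installed; the word list is the one read out earlier in the video, and the names snowball_stemmer and snowball_stems are assumptions.
```python
from nltk.stem import SnowballStemmer
words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]
# The language name is passed as a lowercase string
snowball_stemmer = SnowballStemmer("english")
# Map each word to its Snowball stem and print them side by side
snowball_stems = {word: snowball_stemmer.stem(word) for word in words}
for word, stem in snowball_stems.items():
    print(word + " ----> " + stem)
```
The supported language names can be listed with SnowballStemmer.languages.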

321
00:16:19,000 --> 00:16:19,000
Right?

322
00:16:19,000 --> 00:16:21,000
Then you may be thinking, Krish.

323
00:16:21,000 --> 00:16:22,000
Then what is the difference?

324
00:16:22,000 --> 00:16:22,000
Right?

325
00:16:22,000 --> 00:16:28,000
But let's see some cases where this Porter stemmer will give some bad results

326
00:16:28,000 --> 00:16:31,000
or bad form of the word, and where the snowball will give us a better form of a word.

327
00:16:31,000 --> 00:16:35,000
So for this, I'm just going to, uh, execute this line of code.

328
00:16:35,000 --> 00:16:38,000
So here you can see when I'm applying this stemming.

329
00:16:38,000 --> 00:16:41,000
This stemming was for the Porter stemmer right.

330
00:16:41,000 --> 00:16:44,000
When I apply stemming.stem on fairly and sportingly,

331
00:16:44,000 --> 00:16:46,000
So what is the output that I'm getting?

332
00:16:46,000 --> 00:16:49,000
I'm getting fairli and sportingli, right.

333
00:16:49,000 --> 00:16:49,000
That is l-i at the end.

334
00:16:50,000 --> 00:16:57,000
But if I try to use the same word with respect to snowball right here, you will be able to see.

335
00:16:57,000 --> 00:17:01,000
Just let me write the snowball stemmer here.

336
00:17:01,000 --> 00:17:02,000
We'll be able to see.

337
00:17:02,000 --> 00:17:06,000
I am getting some amazing answers.

338
00:17:06,000 --> 00:17:10,000
Snowball stemmer, okay.

339
00:17:14,000 --> 00:17:16,000
Okay, let me just remove this.

340
00:17:16,000 --> 00:17:18,000
I think there is a problem with respect to this.

341
00:17:19,000 --> 00:17:20,000
Now it will work.

342
00:17:20,000 --> 00:17:21,000
Okay.

343
00:17:22,000 --> 00:17:29,000
Snowball stemmer comma I will just copy this and I'll apply to the same word something called as sportingly.

344
00:17:30,000 --> 00:17:30,000
Okay.

345
00:17:32,000 --> 00:17:37,000
So now if you probably go and see I'm getting a good output which is like fair and sport.
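A side-by-side sketch of this comparison, assuming NLTK is installed; the names porter, snowball, porter_out and snowball_out are illustrative.
```python
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
# Stem the same two adverbs with both stemmers
porter_out = (porter.stem("fairly"), porter.stem("sportingly"))
snowball_out = (snowball.stem("fairly"), snowball.stem("sportingly"))
print(porter_out)    # ('fairli', 'sportingli')
print(snowball_out)  # ('fair', 'sport')
```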

346
00:17:37,000 --> 00:17:44,000
So all together you will be seeing that snowball stemmer when applied to various other words performs

347
00:17:44,000 --> 00:17:46,000
better than Porter stemmer.

348
00:17:46,000 --> 00:17:47,000
Right.

349
00:17:47,000 --> 00:17:52,000
So here see Porter Stemmer is obviously a technique where you'll be able to find out the word stem.

350
00:17:52,000 --> 00:17:56,000
But for many of the words it is not able to give a good answer.

351
00:17:56,000 --> 00:17:59,000
Some of the examples are like fairly, sportingly and all.

352
00:17:59,000 --> 00:18:02,000
Now, in order to overcome this disadvantage, Snowball stemmer is basically used.

353
00:18:02,000 --> 00:18:07,000
And again guys understand these are some techniques which will actually help you to find out the word

354
00:18:07,000 --> 00:18:12,000
stem, and this is specifically used in text preprocessing.

355
00:18:12,000 --> 00:18:13,000
You really need to clean the data.

356
00:18:13,000 --> 00:18:19,000
You need to make sure that the data is ready so that we will be able to convert it into vectors in an

357
00:18:19,000 --> 00:18:19,000
efficient way.

358
00:18:19,000 --> 00:18:20,000
Right.

359
00:18:20,000 --> 00:18:22,000
So in this part we have seen about stemming.

360
00:18:22,000 --> 00:18:30,000
And again one major disadvantage of stemming is that obviously see even though Snowball Stemmer is performing

361
00:18:30,000 --> 00:18:36,000
exceptionally well, some words like history still come out wrong, and there will also be something like going, right.

362
00:18:36,000 --> 00:18:41,000
So suppose if I probably write this right snowball dot stem right.

363
00:18:41,000 --> 00:18:45,000
And if I use going, let's see whether it will give us a good word or not.

364
00:18:45,000 --> 00:18:49,000
So going is performing well. If I write goes,

365
00:18:49,000 --> 00:18:50,000
So here you can see for goes.

366
00:18:50,000 --> 00:18:52,000
Also it gives us a bad word right.

367
00:18:52,000 --> 00:18:55,000
Even stemming.stem, if I try to see this, right.
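A short sketch of this last check, assuming NLTK is installed; the variable names are illustrative.
```python
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
# "going" stems cleanly, but "goes" does not, with either stemmer
going_stem = snowball.stem("going")
goes_snowball = snowball.stem("goes")
goes_porter = porter.stem("goes")
print(going_stem, goes_snowball, goes_porter)  # go goe goe
```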

368
00:18:55,000 --> 00:18:58,000
So in short, however much we try,

369
00:18:58,000 --> 00:18:58,000
Right.

370
00:18:58,000 --> 00:19:02,000
For some of the words obviously the form of the word is changing.

371
00:19:02,000 --> 00:19:05,000
So this is one of the disadvantages with respect to stemming.

372
00:19:05,000 --> 00:19:12,000
And understand that for use cases like chatbots and all, these techniques cannot be used.

373
00:19:12,000 --> 00:19:17,000
So for that we have to go ahead with something called as lemmatization, because Lemmatization solves

374
00:19:17,000 --> 00:19:24,000
all these problems, because it has the dictionary of all the words, all the root words that

375
00:19:24,000 --> 00:19:25,000
are basically there.

376
00:19:25,000 --> 00:19:30,000
So whatever word you basically give, it will be giving you a good grammatical form of the word.

377
00:19:30,000 --> 00:19:32,000
Like if I'm giving goes, it will be go.

378
00:19:32,000 --> 00:19:35,000
If I'm giving fairly, it will be fair.

379
00:19:35,000 --> 00:19:37,000
You know, if I'm giving eating, it will be eat.

380
00:19:37,000 --> 00:19:43,000
So that kind of disadvantage is getting removed completely with the help of Lemmatization.

381
00:19:43,000 --> 00:19:45,000
And that particular part we'll be seeing in the next video.

382
00:19:45,000 --> 00:19:47,000
So I hope you have understood till here.

383
00:19:47,000 --> 00:19:51,000
Please make sure that you practice with different examples.

384
00:19:51,000 --> 00:19:53,000
And yes, this was it from my side.

385
00:19:53,000 --> 00:19:55,000
I will see you all in the next video.

386
00:19:55,000 --> 00:19:55,000
Thank you.