1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:02,000
So we'll be continuing the discussion.

3
00:00:02,000 --> 00:00:04,000
With respect to natural language processing.

4
00:00:04,000 --> 00:00:07,000
We are still in text pre-processing techniques.

5
00:00:07,000 --> 00:00:08,000
We have seen tokenization.

6
00:00:08,000 --> 00:00:09,000
We have seen stemming.

7
00:00:09,000 --> 00:00:11,000
We have also seen its different types.

8
00:00:11,000 --> 00:00:13,000
Along with that we have seen Lemmatization.

9
00:00:13,000 --> 00:00:17,000
Now we are going to consider a topic called stopwords.

10
00:00:17,000 --> 00:00:21,000
So in this video I'm going to discuss stopwords and the importance of stopwords.

11
00:00:21,000 --> 00:00:23,000
And again I'll show you with the help of NLTK.

12
00:00:24,000 --> 00:00:29,000
Now, text processing is a very important step in natural language processing because you really need

13
00:00:29,000 --> 00:00:30,000
to clean the data.

14
00:00:30,000 --> 00:00:32,000
You need to make the data in the right format.

15
00:00:32,000 --> 00:00:37,000
Later on, we'll try to convert all this text data into vectors, and then only we'll be able to train

16
00:00:37,000 --> 00:00:37,000
the model.

17
00:00:37,000 --> 00:00:43,000
Because, you know, whenever we talk about any machine learning model, internally we really need to train it

18
00:00:43,000 --> 00:00:44,000
with some mathematical equations.

19
00:00:44,000 --> 00:00:49,000
So whenever we train something with mathematical equations, we really need to give the input

20
00:00:49,000 --> 00:00:52,000
data in the form of numerical or floating-point values.

21
00:00:52,000 --> 00:00:56,000
So let us go ahead and understand what exactly stopwords are.

22
00:00:56,000 --> 00:00:58,000
So I have opened a new notebook file over here.

23
00:00:58,000 --> 00:01:03,000
So here I have one amazing speech from Dr. APJ Abdul Kalam.

24
00:01:03,000 --> 00:01:08,000
He was the former President of India and it was an amazing speech.

25
00:01:08,000 --> 00:01:12,000
You can read it completely over here, and obviously it is given in the materials.

26
00:01:12,000 --> 00:01:16,000
Now, what I'm actually going to do is that I'm going to probably talk about Stopwords and why it is

27
00:01:16,000 --> 00:01:19,000
important that we should try to remove the stop words.

28
00:01:19,000 --> 00:01:20,000
Okay.

29
00:01:20,000 --> 00:01:23,000
And, just for a definition, what exactly are stop words?

30
00:01:23,000 --> 00:01:28,000
Now, here in this particular speech, you can see that there are a lot of sentences like, "I

31
00:01:28,000 --> 00:01:34,000
have three visions for India. In 3000 years of our history, people from all over the world have come

32
00:01:34,000 --> 00:01:34,000
and invaded us."

33
00:01:34,000 --> 00:01:36,000
So this is the entire speech.

34
00:01:36,000 --> 00:01:38,000
It is an amazing speech.

35
00:01:38,000 --> 00:01:41,000
If you are learning from this, I would just like to tell you:

36
00:01:41,000 --> 00:01:42,000
Please read this.

37
00:01:42,000 --> 00:01:44,000
You'll be getting a lot of information out of it.

38
00:01:44,000 --> 00:01:47,000
Very motivational speech altogether.

39
00:01:47,000 --> 00:01:49,000
Now, from this particular speech, you can see that.

40
00:01:49,000 --> 00:01:56,000
And I can definitely call this a paragraph or a corpus. Now here there are some words like "I", "the",

41
00:01:57,000 --> 00:02:05,000
"have", and, you know, let's say "of", "the", "to", "there", "why", right?

42
00:02:05,000 --> 00:02:07,000
All these kinds of words, right?

43
00:02:07,000 --> 00:02:13,000
They will not play a very big role when we are doing a task like spam classification.

44
00:02:13,000 --> 00:02:19,000
Or let's say you are trying to do some kind of task like spam

45
00:02:19,000 --> 00:02:22,000
or ham classification; I have already talked about that.

46
00:02:22,000 --> 00:02:26,000
Along with that, to see whether this is a positive review or a negative review. But

47
00:02:26,000 --> 00:02:29,000
some words like "not" can actually play a very important role.

48
00:02:29,000 --> 00:02:30,000
"Not" and so on.

49
00:02:30,000 --> 00:02:34,000
So what we do is, with the help of a stop word list, we try to remove these particular words,

50
00:02:34,000 --> 00:02:40,000
because in these kinds of use cases, where you are specifically focusing on some of the important

51
00:02:40,000 --> 00:02:47,000
words to determine the output, words like "I", "the", "he", "she", "of", "there" are not at all required.

52
00:02:47,000 --> 00:02:52,000
So what we can do is check this entire paragraph against that stop word list

53
00:02:52,000 --> 00:02:55,000
and see which words can be removed, okay.

54
00:02:55,000 --> 00:02:57,000
And that is the importance of stopwords in short.
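
A minimal sketch of the idea described above, in plain Python. The tiny stop word set and the example sentences are hand-written for illustration (they are not NLTK's list). It also shows why "not" needs care: if it is treated as a stop word, the negation disappears.

```python
# Hypothetical, hand-written stop word set (illustrative only, not NLTK's list).
stop_words = {"i", "the", "he", "she", "of", "to", "there", "have", "is", "not"}

def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop the stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

print(remove_stop_words("I have three visions for India", stop_words))
# ['three', 'visions', 'for', 'india']

# Sentiment caveat: with "not" in the stop word set, the negation is lost.
print(remove_stop_words("The movie is not good", stop_words))
# ['movie', 'good']
```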

55
00:02:57,000 --> 00:02:59,000
So let us go ahead right now.

56
00:02:59,000 --> 00:03:00,000
I'll just go ahead and execute it.

57
00:03:01,000 --> 00:03:07,000
Let me make some cells so that it will be very easy for you all to understand how we can apply

58
00:03:07,000 --> 00:03:07,000
Stopwords.

59
00:03:07,000 --> 00:03:09,000
Along with Stopwords, you can also apply stemming.

60
00:03:09,000 --> 00:03:13,000
So I'll show you the combination of both, which will be super important for everyone.

61
00:03:13,000 --> 00:03:13,000
Okay.

62
00:03:13,000 --> 00:03:16,000
So let's go ahead and let's uh, try to do that okay.

63
00:03:16,000 --> 00:03:19,000
Now first of all, I really need to import, uh, for stemming.

64
00:03:19,000 --> 00:03:21,000
You obviously know what we need to import.

65
00:03:21,000 --> 00:03:27,000
So for stemming I'll write from nltk.stem import PorterStemmer.

66
00:03:28,000 --> 00:03:33,000
So I can write PorterStemmer over here and I'll execute it. Along with this,

67
00:03:33,000 --> 00:03:39,000
I obviously also need to import stopwords, because the stopwords for English will be different:

68
00:03:39,000 --> 00:03:43,000
you'll have that entire list of words like "the", "he", "she", "and".

69
00:03:43,000 --> 00:03:44,000
All right.

70
00:03:44,000 --> 00:03:52,000
So what I'm going to do is say from nltk.corpus import stopwords.

71
00:03:52,000 --> 00:03:56,000
So from this I will be able to use the stopwords itself.

72
00:03:56,000 --> 00:04:00,000
So now I have run from nltk.corpus import stopwords.

73
00:04:00,000 --> 00:04:04,000
Now these stopwords, you know, I also have to download.

74
00:04:04,000 --> 00:04:06,000
So for that, uh, let me do one thing.

75
00:04:06,000 --> 00:04:08,000
I'll also import nltk and execute it.

76
00:04:08,000 --> 00:04:11,000
And along with that, I'll write nltk.download.

77
00:04:11,000 --> 00:04:15,000
And here as the parameter I am going to give 'stopwords'.

78
00:04:15,000 --> 00:04:18,000
And there are stopwords for different languages also.

79
00:04:18,000 --> 00:04:19,000
And we'll also try to see that.

80
00:04:19,000 --> 00:04:24,000
So if I run this here, you'll be able to see: downloading package stopwords to this

81
00:04:24,000 --> 00:04:25,000
particular location.

82
00:04:25,000 --> 00:04:28,000
So the package stopwords is already up to date and I'm getting True.

83
00:04:28,000 --> 00:04:33,000
So in short I've downloaded the stopwords for all the different languages that are already present

84
00:04:33,000 --> 00:04:35,000
in the NLTK library.

85
00:04:35,000 --> 00:04:35,000
Perfect.

86
00:04:35,000 --> 00:04:37,000
Now we have done that.

87
00:04:37,000 --> 00:04:40,000
Now let's see which stopwords are available in English.

88
00:04:40,000 --> 00:04:42,000
So in order to do that,

89
00:04:42,000 --> 00:04:45,000
I have already run from nltk.corpus import stopwords.

90
00:04:45,000 --> 00:04:46,000
I'll take the Stopwords.

91
00:04:46,000 --> 00:04:53,000
I'll copy and paste it over here and I will say .download, okay, or rather, instead of download, I will just

92
00:04:53,000 --> 00:04:55,000
write .words.

93
00:04:55,000 --> 00:04:59,000
And here I just need to give my language.

94
00:04:59,000 --> 00:05:00,000
Like, which language

95
00:05:00,000 --> 00:05:05,000
do I want it for: English or something else, like German and so on.

96
00:05:05,000 --> 00:05:07,000
So I'm just going to write this.

97
00:05:07,000 --> 00:05:08,000
Let me just write 'english'.

98
00:05:08,000 --> 00:05:15,000
And if I execute this here, now you can see the list of all the stop words that you obviously have,

99
00:05:15,000 --> 00:05:18,000
and all the stop words can actually be removed.

100
00:05:18,000 --> 00:05:18,000
Right? Now

101
00:05:18,000 --> 00:05:23,000
you may be thinking, Krish, this may depend on the data.

102
00:05:23,000 --> 00:05:25,000
Right? Now here you can see, guys, that this is a list.

103
00:05:25,000 --> 00:05:28,000
You can also create your own stopword list in English. Like, let's say, over here

104
00:05:28,000 --> 00:05:31,000
some of the important words are like "aren't" and "couldn't".

105
00:05:31,000 --> 00:05:32,000
Right.

106
00:05:32,000 --> 00:05:35,000
These words can actually play a very important role

107
00:05:35,000 --> 00:05:39,000
in finding out whether a statement is positive or negative. Like, "not" is also there.

108
00:05:39,000 --> 00:05:42,000
If you search for "not", you will also be able to find it.

109
00:05:42,000 --> 00:05:43,000
Okay.

110
00:05:43,000 --> 00:05:47,000
So it is always a good idea to create your own stopword list and try to remove those kinds of words

111
00:05:47,000 --> 00:05:48,000
from the paragraph.

112
00:05:48,000 --> 00:05:52,000
So I hope everybody is able to understand. Now, with respect to English, you have this.

113
00:05:52,000 --> 00:05:55,000
Now let's see whether we have stopwords for other languages.

114
00:05:55,000 --> 00:05:59,000
And obviously you can go ahead and check the documentation, but I will just try to show you with respect

115
00:05:59,000 --> 00:06:00,000
to German.

116
00:06:00,000 --> 00:06:03,000
So in German also you have these specific stop words.

117
00:06:03,000 --> 00:06:05,000
Along with this you can also use French.

118
00:06:05,000 --> 00:06:07,000
You have these particular stop words.

119
00:06:07,000 --> 00:06:12,000
So depending on the text, or the language of the text, you can definitely

120
00:06:12,000 --> 00:06:14,000
apply the corresponding stopwords.

121
00:06:14,000 --> 00:06:17,000
Now you may be thinking, is there Hindi or Arabic or some other language?

122
00:06:17,000 --> 00:06:19,000
I think for Arabic it is there.

123
00:06:19,000 --> 00:06:21,000
Let's see whether it is there or not.

124
00:06:21,000 --> 00:06:25,000
Yes, for Arabic also it is there, but I do not find it for Hindi I guess.

125
00:06:25,000 --> 00:06:28,000
So again from the documentation you can check it out.

126
00:06:28,000 --> 00:06:30,000
But up to Arabic I was able to see it. Again,

127
00:06:30,000 --> 00:06:33,000
all the information will be given in the documentation.

128
00:06:33,000 --> 00:06:37,000
Now, what I'm actually going to do: my paragraph is already in English,

129
00:06:37,000 --> 00:06:39,000
right? Now I'm going to perform two important tasks.

130
00:06:39,000 --> 00:06:41,000
One is I will apply stemming.

131
00:06:41,000 --> 00:06:46,000
And before applying stemming, you know what I'm going to do: wherever I find stop words,

132
00:06:46,000 --> 00:06:50,000
I'm just going to remove the stop words from this particular paragraph so that this entire paragraph

133
00:06:50,000 --> 00:06:51,000
will be shortened up.

134
00:06:51,000 --> 00:06:52,000
Right.

135
00:06:52,000 --> 00:06:53,000
So for that, what I'm actually going to do.

136
00:06:53,000 --> 00:06:57,000
Now, see, whatever things you have learned from the start, I'm going to cover

137
00:06:57,000 --> 00:06:57,000
all of it.

138
00:06:57,000 --> 00:06:58,000
Okay.

139
00:06:58,000 --> 00:07:05,000
So first things first, I'm going to say from nltk.stem import PorterStemmer.

140
00:07:05,000 --> 00:07:06,000
PorterStemmer.

141
00:07:06,000 --> 00:07:10,000
And I'm just going to execute this, okay.

142
00:07:10,000 --> 00:07:14,000
And then what I'm going to do is write stemmer = PorterStemmer(),

143
00:07:14,000 --> 00:07:14,000
like this.

144
00:07:14,000 --> 00:07:16,000
We really need to initialize it.
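
The import and initialization described above look roughly like this. The three sample words are chosen here only to show what the Porter stemmer returns; the stemmer itself needs no corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # create the stemmer object once and reuse it

for word in ["history", "people", "invaded"]:
    print(word, "->", stemmer.stem(word))
# history -> histori
# people -> peopl
# invaded -> invad
```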

145
00:07:16,000 --> 00:07:20,000
Now that we have done this, in the next step I'm going

146
00:07:20,000 --> 00:07:23,000
to perform the tokenization on the entire paragraph.

147
00:07:23,000 --> 00:07:27,000
So for that I can use nltk.sent_tokenize.

148
00:07:27,000 --> 00:07:29,000
And here I'm just going to give my paragraph.

149
00:07:29,000 --> 00:07:35,000
Now see this, guys: here I'm going to get my entire paragraph as sentences. Like, see, "I have three

150
00:07:35,000 --> 00:07:36,000
visions for India."

151
00:07:36,000 --> 00:07:39,000
Then "In 3000 years..." All these things I am able to get.

152
00:07:39,000 --> 00:07:40,000
And this is my second sentence.

153
00:07:40,000 --> 00:07:41,000
Third sentence.

154
00:07:41,000 --> 00:07:42,000
Fourth sentence.

155
00:07:42,000 --> 00:07:42,000
Like this.

156
00:07:42,000 --> 00:07:47,000
All the sentences I'm able to get in the form of a list, just by using sent_tokenize.

157
00:07:47,000 --> 00:07:47,000
Right.

158
00:07:47,000 --> 00:07:53,000
This is a tokenization process wherein we take a paragraph and divide it into sentences.

159
00:07:53,000 --> 00:07:58,000
Okay, now let me do one thing: I'm just going to save this in a variable called

160
00:07:58,000 --> 00:08:00,000
sentences, which will later become a list.

161
00:08:00,000 --> 00:08:00,000
Right.

162
00:08:00,000 --> 00:08:02,000
So this is my sentences.

163
00:08:02,000 --> 00:08:06,000
And if you see the type of sentences, I'm just going to check this:

164
00:08:06,000 --> 00:08:08,000
It is a list now.

165
00:08:08,000 --> 00:08:09,000
Perfect till here.

166
00:08:09,000 --> 00:08:11,000
We have done it amazingly well right.

167
00:08:11,000 --> 00:08:13,000
We have done Porter Stemmer on that.

168
00:08:13,000 --> 00:08:13,000
Sorry.

169
00:08:13,000 --> 00:08:16,000
We have initialized stemmer over here and we have tokenized it.

170
00:08:16,000 --> 00:08:21,000
Now understand, what we are going to do is that I'm going to traverse through all the sentences.

171
00:08:21,000 --> 00:08:26,000
First of all, apply the stopwords: whichever words are not present in the stopwords, we are only going

172
00:08:26,000 --> 00:08:28,000
to take those and apply stemming.

173
00:08:28,000 --> 00:08:30,000
This is what we really want to do.

174
00:08:30,000 --> 00:08:42,000
So here I'm saying that first of all apply stopwords and filter and then apply tokenization.

175
00:08:42,000 --> 00:08:42,000
Right?

176
00:08:42,000 --> 00:08:43,000
Sorry.

177
00:08:43,000 --> 00:08:44,000
Then apply stemming.

178
00:08:44,000 --> 00:08:46,000
So this is the step that I'm actually going to do.

179
00:08:46,000 --> 00:08:49,000
Now see this: very simple, very important.

180
00:08:49,000 --> 00:08:54,000
So I'll write a for loop saying for i in range.

181
00:08:54,000 --> 00:08:57,000
And here I'm going to basically give the length of the sentences.

182
00:08:57,000 --> 00:08:59,000
I could also iterate over sentences directly,

183
00:08:59,000 --> 00:09:02,000
but there I'll not be getting the indexes. Here

184
00:09:02,000 --> 00:09:03,000
I'll be getting the indices.

185
00:09:03,000 --> 00:09:07,000
Okay, so range basically says that whatever length I give becomes the range of indices, right?

186
00:09:07,000 --> 00:09:09,000
Zero up to, but not including, that specific length.

187
00:09:09,000 --> 00:09:15,000
Now, what I'm going to do is take this specific i, and I'm going to write

188
00:09:15,000 --> 00:09:21,000
nltk.word_tokenize, because I'm getting the text in the form of sentences and I need to get each and every

189
00:09:21,000 --> 00:09:21,000
word, right?

190
00:09:21,000 --> 00:09:27,000
So I'm getting the words over here, and inside word_tokenize I'll give i, sorry, sentences of

191
00:09:27,000 --> 00:09:32,000
i, because this will be an index: sentences[i].

192
00:09:32,000 --> 00:09:33,000
So this will be an index.

193
00:09:33,000 --> 00:09:35,000
And here I'll be able to get the word.

194
00:09:35,000 --> 00:09:38,000
So here, what I'll do is make a list of words.

195
00:09:38,000 --> 00:09:41,000
So in short I'll be getting the list of words inside each sentence.

196
00:09:41,000 --> 00:09:41,000
Perfect.

197
00:09:41,000 --> 00:09:43,000
Now till here we have done it.

198
00:09:43,000 --> 00:09:47,000
Now after this we are going to apply one very important thing, that is:

199
00:09:47,000 --> 00:09:52,000
first of all, I need to check each and every word against the stop words and see whether it falls in the stop

200
00:09:52,000 --> 00:09:52,000
word list or not.

201
00:09:52,000 --> 00:09:55,000
Only if it does not fall in the stop word list do I have to do the stemming.

202
00:09:55,000 --> 00:09:58,000
So understand the task step by step.

203
00:09:58,000 --> 00:10:02,000
This is super important with respect to all the steps that I'm actually taking up.

204
00:10:02,000 --> 00:10:05,000
Okay, so here I will write a list comprehension.

205
00:10:05,000 --> 00:10:09,000
I will say stemmer.stem, okay.

206
00:10:09,000 --> 00:10:15,000
And here I'm going to pass word, okay, stem of the word, because this words variable

207
00:10:15,000 --> 00:10:16,000
is a list of words.

208
00:10:16,000 --> 00:10:18,000
And I have to take each and every word.

209
00:10:18,000 --> 00:10:20,000
So here I will write a for loop, okay.

210
00:10:20,000 --> 00:10:22,000
And this is called a list comprehension.

211
00:10:22,000 --> 00:10:34,000
So I'll write for word in words if word not in... See, if the word is not present in the stop words, then

212
00:10:34,000 --> 00:10:35,000
only you apply stemming.

213
00:10:35,000 --> 00:10:36,000
That is what I'm actually trying to do.

214
00:10:36,000 --> 00:10:40,000
Okay, so here you can see: if the word is not in...

215
00:10:40,000 --> 00:10:41,000
I'll use a set.

216
00:10:41,000 --> 00:10:46,000
Along with that, I will load all the stop words for English.

217
00:10:46,000 --> 00:10:47,000
Right.

218
00:10:47,000 --> 00:10:51,000
So why am I using a set? Because some of the words may get repeated.

219
00:10:51,000 --> 00:10:52,000
So I don't want that.

220
00:10:52,000 --> 00:10:56,000
So I'm going to write it over here, right? Now, through this,

221
00:10:56,000 --> 00:11:00,000
what I'm going to do is get each specific word that is not present in the stop words.

222
00:11:00,000 --> 00:11:04,000
And stemming will be applied only to that specific word.

223
00:11:04,000 --> 00:11:05,000
Perfect.

224
00:11:05,000 --> 00:11:10,000
Now here, what I'm going to do is save it in a variable called words.

225
00:11:10,000 --> 00:11:10,000
Perfect.

226
00:11:11,000 --> 00:11:13,000
I hope it is very much clear.

227
00:11:13,000 --> 00:11:17,000
After doing the stemming, I'm storing everything back in words itself.

228
00:11:17,000 --> 00:11:23,000
And then finally, what I'm going to do is take sentences, and I'm

229
00:11:23,000 --> 00:11:27,000
going to replace the entry at the same index with these words.

230
00:11:27,000 --> 00:11:31,000
But once we get these words, right, I need to join all these words together.

231
00:11:31,000 --> 00:11:32,000
So how do I join?

232
00:11:32,000 --> 00:11:35,000
There will obviously be a space.

233
00:11:35,000 --> 00:11:40,000
I'll just use .join on it so that it joins the words together and converts them back into a sentence.

234
00:11:40,000 --> 00:11:49,000
So this is exactly converting all the words into sentences, right?

235
00:11:49,000 --> 00:11:50,000
This is very much simple.

236
00:11:50,000 --> 00:11:51,000
And we have actually done this.

237
00:11:51,000 --> 00:11:52,000
Perfect.

238
00:11:52,000 --> 00:11:57,000
So what have we done here? Again, let me repeat: I'm iterating through each and every sentence.

239
00:11:57,000 --> 00:12:01,000
And then I'm doing a word_tokenize; that basically means for every sentence I'm getting the list of

240
00:12:01,000 --> 00:12:02,000
words.

241
00:12:02,000 --> 00:12:07,000
And I'm iterating over that list of words and checking whether each word is present in the stop words;

242
00:12:07,000 --> 00:12:09,000
if it is not present, I'm doing the stemming.

243
00:12:09,000 --> 00:12:14,000
After stemming, I'm again storing the result back in that same list, and then I'm converting all these words

244
00:12:14,000 --> 00:12:18,000
into sentences; I can say, converting each list of words back into a sentence.

245
00:12:18,000 --> 00:12:19,000
Perfect.

246
00:12:19,000 --> 00:12:20,000
Now, once I execute it.

247
00:12:20,000 --> 00:12:28,000
And now if I go and see my sentences here, you'll be able to see: "I three vision India", right? "In

248
00:12:28,000 --> 00:12:30,000
3000 year histori".

249
00:12:30,000 --> 00:12:30,000
Right.

250
00:12:30,000 --> 00:12:37,000
So here, h-i-s-t-o-r-y became "histori", okay? "People" became "peopl".

251
00:12:38,000 --> 00:12:41,000
"World", "come", "invaded us": "invaded" became "invad". "Captur",

252
00:12:41,000 --> 00:12:43,000
"land", "conquer", "mind", "from".

253
00:12:43,000 --> 00:12:47,000
You can see all the special words like, like "I have":

254
00:12:47,000 --> 00:12:48,000
Everything is gone.

255
00:12:48,000 --> 00:12:51,000
See, see this: "I" is there, "have" is gone.

256
00:12:51,000 --> 00:12:52,000
Right.

257
00:12:52,000 --> 00:12:55,000
"The": you will not be able to find "the" anywhere, right?

258
00:12:55,000 --> 00:12:56,000
So whatever

259
00:12:56,000 --> 00:12:56,000
stop

260
00:12:56,000 --> 00:12:58,000
words were present over here all got removed.

261
00:12:58,000 --> 00:13:00,000
And only then did we perform the stemming.

262
00:13:00,000 --> 00:13:02,000
Right? Now you may be saying, Krish,

263
00:13:02,000 --> 00:13:04,000
the stemming does not look very good, right?

264
00:13:04,000 --> 00:13:10,000
So for that, I've already taught you about the Snowball stemmer, so I

265
00:13:10,000 --> 00:13:11,000
will just import this.

266
00:13:11,000 --> 00:13:12,000
Okay.

267
00:13:13,000 --> 00:13:13,000
And it is very simple.

268
00:13:13,000 --> 00:13:14,000
Simple.

269
00:13:14,000 --> 00:13:15,000
I think you can do the same task again.

270
00:13:15,000 --> 00:13:20,000
So SnowballStemmer, and then I will initialize it for English.

271
00:13:21,000 --> 00:13:24,000
And obviously after this you'll be able to get better sentences.

272
00:13:24,000 --> 00:13:24,000
Right.

273
00:13:24,000 --> 00:13:27,000
So let me just remove this one.

274
00:13:27,000 --> 00:13:31,000
So snowball stemmer I've already done it and I'm going to copy the same thing.

275
00:13:31,000 --> 00:13:32,000
Okay.

276
00:13:32,000 --> 00:13:38,000
And here I'm just going to say: apply Snowball stemming.

277
00:13:38,000 --> 00:13:38,000
Right.

278
00:13:38,000 --> 00:13:43,000
And then instead of stemmer I will just write this one, that is, the Snowball stemmer.

279
00:13:43,000 --> 00:13:44,000
That's it.

280
00:13:44,000 --> 00:13:45,000
Yeah.

281
00:13:45,000 --> 00:13:51,000
And now, once I execute it... let me go back again to the sentences, because those sentences

282
00:13:51,000 --> 00:13:52,000
have changed now.

283
00:13:52,000 --> 00:13:54,000
So where is the sentences?

284
00:13:54,000 --> 00:13:55,000
Let's see.

285
00:13:55,000 --> 00:13:56,000
Okay.

286
00:13:56,000 --> 00:13:59,000
Now the sentences cell has been executed again.

287
00:13:59,000 --> 00:14:01,000
Now I'm just going to execute this.

288
00:14:01,000 --> 00:14:07,000
Now if I go and see my sentences here, you can see that now it is good, right? Now,

289
00:14:07,000 --> 00:14:11,000
one more important thing that Snowball has done:

290
00:14:11,000 --> 00:14:14,000
See over here, I still have capital letters, right?

291
00:14:14,000 --> 00:14:18,000
Like, there may be some sentences which have the same word in small letters also.

292
00:14:18,000 --> 00:14:20,000
So that becomes a repeated word.

293
00:14:20,000 --> 00:14:25,000
But since this is a capital letter, it will be considered as a separate word right for the model to

294
00:14:25,000 --> 00:14:25,000
understand.

295
00:14:25,000 --> 00:14:26,000
Right?

296
00:14:26,000 --> 00:14:27,000
So what Snowball does,

297
00:14:27,000 --> 00:14:31,000
one more advantage, is that it makes sure that all the letters become small.

298
00:14:31,000 --> 00:14:32,000
Right.

299
00:14:32,000 --> 00:14:34,000
So all the letters become small.

300
00:14:34,000 --> 00:14:39,000
And for some of the words it is still not giving a good result; like, "poverty" has become "poverti".
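
A small sketch of the Snowball behaviour described above. The variable name `snowball_stemmer` and the sample words are chosen here for illustration; the stemmer needs no corpus download.

```python
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer("english")  # the language name is required

for word in ["India", "History", "poverty"]:
    print(word, "->", snowball_stemmer.stem(word))
# India -> india
# History -> histori
# poverty -> poverti
```

The output is lowercased, and "poverti" is still not a dictionary word, which is the limitation that motivates trying lemmatization next.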

301
00:14:39,000 --> 00:14:44,000
But if you try to do this with the help of Lemmatization, you can also get a good word.

302
00:14:44,000 --> 00:14:47,000
Now let's try it with the help of Lemmatization.

303
00:14:47,000 --> 00:14:49,000
So I'm just going to, uh, do the same thing.

304
00:14:49,000 --> 00:14:49,000
Okay?

305
00:14:49,000 --> 00:14:52,000
I hope everybody has understood the Snowball stemmer.

306
00:14:52,000 --> 00:14:56,000
Now, what I'm going to do is again go back to my lemmatization code.

307
00:14:56,000 --> 00:15:01,000
So I'm just going to import this from nltk.stem.

308
00:15:01,000 --> 00:15:02,000
And it is very simple guys.

309
00:15:02,000 --> 00:15:07,000
I think we are just repeating things so that you also practice in a better way okay.

310
00:15:08,000 --> 00:15:11,000
So here I've got the WordNetLemmatizer; I'm going to initialize it.

311
00:15:11,000 --> 00:15:12,000
Perfect.

312
00:15:12,000 --> 00:15:13,000
This is done.

313
00:15:13,000 --> 00:15:16,000
Now I'm going to go ahead and copy the same code, right?

314
00:15:17,000 --> 00:15:19,000
And I will just write it over here.

315
00:15:19,000 --> 00:15:24,000
And instead of writing snowball, I'm just going to copy this and paste it over here.

316
00:15:24,000 --> 00:15:25,000
Perfect.

317
00:15:25,000 --> 00:15:26,000
So I've done it.

318
00:15:26,000 --> 00:15:32,000
But let me just go ahead and execute the sentence part again because I need to get the updated sentence.

319
00:15:32,000 --> 00:15:35,000
So I think it is somewhere here.

320
00:15:35,000 --> 00:15:35,000
Paragraph.

321
00:15:35,000 --> 00:15:36,000
Okay.

322
00:15:36,000 --> 00:15:38,000
And uh, here is the sentence.

323
00:15:38,000 --> 00:15:39,000
Okay.

324
00:15:40,000 --> 00:15:40,000
Perfect.

325
00:15:40,000 --> 00:15:43,000
Now let me just go ahead and execute the same thing.

326
00:15:43,000 --> 00:15:47,000
And now if you... okay, I'm getting an error: it has no attribute stem, okay.

327
00:15:47,000 --> 00:15:47,000
Sorry.

328
00:15:47,000 --> 00:15:48,000
It should be lemmatize.

329
00:15:50,000 --> 00:15:51,000
Lemmatize.

330
00:15:51,000 --> 00:15:51,000
Okay.

331
00:15:51,000 --> 00:15:53,000
So Lemmatize or Lemmatize.

332
00:15:53,000 --> 00:15:55,000
Now, as you see, it took some time.

333
00:15:55,000 --> 00:15:55,000
Right?

334
00:15:55,000 --> 00:15:57,000
Because it is basically checking from the entire corpus.

335
00:15:57,000 --> 00:16:04,000
Now if I go and see the sentences, you have proper words that

336
00:16:04,000 --> 00:16:07,000
are coming up correctly, and everything is here, right?

337
00:16:07,000 --> 00:16:10,000
So with this, you are able to get a good result.

338
00:16:10,000 --> 00:16:14,000
Now one more thing you can do is, after this word, you can also pass the POS tag.

339
00:16:14,000 --> 00:16:19,000
Now if you set the POS tag to 'v', right, now you see what the output will be.

340
00:16:19,000 --> 00:16:26,000
You'll get a better output, I guess, because it will be treating each word as, um,

341
00:16:26,000 --> 00:16:29,000
you know, as an adverb, or sorry, as a verb.

342
00:16:29,000 --> 00:16:32,000
But anyhow, we will try to understand POS tags again later,

343
00:16:32,000 --> 00:16:33,000
along with more things.

344
00:16:33,000 --> 00:16:34,000
Right.

345
00:16:34,000 --> 00:16:40,000
So: "I three visions India", "In 3000 years history people world come invade us capture land".

346
00:16:40,000 --> 00:16:42,000
So all the stop words have been deleted.

347
00:16:42,000 --> 00:16:44,000
Now we are getting a very good one.

348
00:16:44,000 --> 00:16:47,000
You know, uh, at least better than stemming.

349
00:16:47,000 --> 00:16:47,000
Okay.

350
00:16:47,000 --> 00:16:49,000
And the other one that is snowball stemming.

351
00:16:49,000 --> 00:16:50,000
Right.

352
00:16:50,000 --> 00:16:53,000
So this was the entire process with respect to text pre-processing.

353
00:16:53,000 --> 00:16:59,000
And here we have discussed stop words and how you should go ahead and do the text pre-processing.

354
00:16:59,000 --> 00:17:01,000
I hope everybody got that idea.

355
00:17:01,000 --> 00:17:05,000
Now in Lemmatization also you can see that it is not lowering it.

356
00:17:05,000 --> 00:17:09,000
So what you can actually do is that you can basically lower all the sentences.

357
00:17:09,000 --> 00:17:09,000
right.

358
00:17:09,000 --> 00:17:22,000
So let's say if you write sentences of I is equal to sentences dot of I dot two lower right.

359
00:17:22,000 --> 00:17:24,000
So you can basically use to_lower.

360
00:17:24,000 --> 00:17:25,000
Let's see whether it will work or not.

361
00:17:26,000 --> 00:17:29,000
Uh, and uh let's see whether it will completely work or not.

362
00:17:29,000 --> 00:17:34,000
I'm not sure whether .to_lower() will work with respect to a list, but I think it should work.

363
00:17:34,000 --> 00:17:36,000
I'm just going to execute this again.

364
00:17:36,000 --> 00:17:40,000
Go down okay and apply this Lemmatizer.

365
00:17:40,000 --> 00:17:41,000
Okay.

366
00:17:41,000 --> 00:17:45,000
Uh, 'str' object has no attribute 'to_lower', okay.

367
00:17:45,000 --> 00:17:45,000
So it's okay.

368
00:17:45,000 --> 00:17:46,000
Not a problem.

369
00:17:46,000 --> 00:17:47,000
Not a problem.

370
00:17:47,000 --> 00:17:49,000
What I'm actually going to do I'm just going to comment this.

371
00:17:49,000 --> 00:17:54,000
And before writing here I'll just say word.to_lower(), okay.

372
00:17:54,000 --> 00:17:56,000
You definitely have to try different things, okay.

373
00:17:56,000 --> 00:17:58,000
And it's all about Google.

374
00:17:58,000 --> 00:17:59,000
You know, just Google it.

375
00:17:59,000 --> 00:18:00,000
You'll be able to get it.

376
00:18:00,000 --> 00:18:02,000
So again I'm going to execute this.

377
00:18:02,000 --> 00:18:04,000
And now I think it will execute.

378
00:18:04,000 --> 00:18:05,000
I know I'm doing a lot of ups and downs.

379
00:18:05,000 --> 00:18:07,000
But just try to follow the lecture.

380
00:18:08,000 --> 00:18:13,000
Uh, 'str' object has no attribute 'to_lower'. Why? to_lower is there.

381
00:18:16,000 --> 00:18:17,000
to_lower.

382
00:18:17,000 --> 00:18:19,000
Um, okay.

383
00:18:19,000 --> 00:18:20,000
Let me just see.

384
00:18:20,000 --> 00:18:25,000
Uh, str to lowercase.

385
00:18:25,000 --> 00:18:25,000
Right.

386
00:18:25,000 --> 00:18:26,000
Python.

387
00:18:26,000 --> 00:18:28,000
Let me just see this.

388
00:18:28,000 --> 00:18:34,000
It's okay if I don't have any other way to see it, but I think .lower() will definitely work.

389
00:18:34,000 --> 00:18:34,000
Let's see.

390
00:18:35,000 --> 00:18:36,000
.lower().

391
00:18:37,000 --> 00:18:37,000
Okay.

392
00:18:37,000 --> 00:18:37,000
Perfect.

393
00:18:37,000 --> 00:18:38,000
It has worked.
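A minimal sketch of the fix found here: Python strings have a .lower() method, while .to_lower() does not exist, which is what raised the AttributeError. The sentences list below is made-up sample data, not the notebook's actual text.

```python
# Hypothetical sample data; the notebook's real sentences are not shown here.
sentences = ["I have Three Visions for India.",
             "People from all over the World came."]

# str has no to_lower() method, so calling it would raise AttributeError.
has_to_lower = hasattr("India", "to_lower")

# str.lower() returns a new lowercased string. Strings are immutable,
# so the result has to be assigned back into the list.
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower()

print(has_to_lower)
print(sentences)
```

Lowercasing each sentence before tokenizing is one option; calling word.lower() on each token inside the loop gives the same words in the end.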

394
00:18:38,000 --> 00:18:41,000
Now if I go and see the sentences, it has been done, okay.

395
00:18:41,000 --> 00:18:45,000
So I don't have any regrets about searching on Google.

396
00:18:45,000 --> 00:18:47,000
You should also search on Google.

397
00:18:47,000 --> 00:18:51,000
So now here you can see everything is in lowercase, along with the lemmatization.

398
00:18:51,000 --> 00:18:53,000
So this is the entire text pre-processing.

399
00:18:53,000 --> 00:18:58,000
We can also apply some regular expression before this so that we can do the cleaning of the sentence.
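As a rough sketch of that cleaning step, assuming we only want to keep letters: the regular expression pattern, the clean_sentence helper, and the tiny STOP_WORDS set below are illustrative assumptions, not what the notebook uses (the lecture takes its stop words from NLTK).

```python
import re

# Hypothetical stop-word set for illustration only;
# the lecture uses NLTK's English stop-word list instead.
STOP_WORDS = {"i", "have", "for", "the", "in", "of", "and", "to"}

def clean_sentence(sentence):
    # Replace every character that is not an ASCII letter with a space
    # (assumed pattern; adjust it if digits or accents should be kept).
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    # Lowercase, then split on whitespace to get the words.
    words = letters_only.lower().split()
    # Drop the stop words.
    return [w for w in words if w not in STOP_WORDS]

cleaned = clean_sentence("I have 3 visions for India!")
print(cleaned)
```

The lemmatizer would then be applied to each remaining word, the same way as before.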

400
00:18:58,000 --> 00:18:59,000
So I'll remove this word.

401
00:18:59,000 --> 00:19:05,000
And I think, uh, this will also work if I probably just comment this out, with to_lower or just lower.

402
00:19:05,000 --> 00:19:05,000
Okay.

403
00:19:05,000 --> 00:19:06,000
Just check it out.

404
00:19:06,000 --> 00:19:07,000
It is up to you.

405
00:19:07,000 --> 00:19:10,000
So I'll just comment this out so that you can try it out.

406
00:19:10,000 --> 00:19:10,000
Okay.

407
00:19:11,000 --> 00:19:16,000
But in short, uh, we have understood what Stopwords can actually do and how we can basically apply

408
00:19:16,000 --> 00:19:16,000
things.

409
00:19:16,000 --> 00:19:19,000
Okay, so yes, uh, this was it from my side.

410
00:19:19,000 --> 00:19:22,000
I think you are liking this entire series.

411
00:19:22,000 --> 00:19:24,000
Uh, we will be learning more things as we go ahead.

412
00:19:25,000 --> 00:19:27,000
Um, uh, we have POS tags and all.

413
00:19:27,000 --> 00:19:29,000
We have named entity recognition.

414
00:19:29,000 --> 00:19:31,000
Also, there are many more things to come.

415
00:19:31,000 --> 00:19:34,000
So yes, uh, keep on learning, keep on practicing.

416
00:19:34,000 --> 00:19:36,000
Uh, I will see you all in the next video.

417
00:19:36,000 --> 00:19:36,000
Thank you.

