1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue our discussion on natural language processing.

3
00:00:04,000 --> 00:00:07,000
In this video we are going to discuss lemmatization.

4
00:00:07,000 --> 00:00:10,000
In our previous video we have already seen something called stemming.

5
00:00:10,000 --> 00:00:15,000
We understood that stemming is the process of reducing a word to its word stem.

6
00:00:15,000 --> 00:00:16,000
Right.

7
00:00:16,000 --> 00:00:20,000
And we understood the disadvantage of stemming: for some of the words, when we

8
00:00:20,000 --> 00:00:25,000
perform stemming, we do not get the correct form of the word, and the entire meaning of the word

9
00:00:25,000 --> 00:00:26,000
actually gets changed.

10
00:00:26,000 --> 00:00:30,000
And please remember this term, which is called the word stem, right?

11
00:00:30,000 --> 00:00:35,000
So in short, stemming is just a kind of algorithm which tries to find the word stem.

12
00:00:35,000 --> 00:00:38,000
But for some of the words it is working absolutely fine.

13
00:00:38,000 --> 00:00:42,000
We have seen different stemming techniques like the Porter stemmer, the Regexp stemmer, and the Snowball

14
00:00:42,000 --> 00:00:42,000
stemmer.

15
00:00:42,000 --> 00:00:46,000
And we also found out that the Snowball stemmer was better than the Porter stemmer.

16
00:00:46,000 --> 00:00:49,000
So all these things were covered in the previous video.

17
00:00:49,000 --> 00:00:53,000
Now, in this video, we are going to discuss lemmatization, and the lemmatization technique

18
00:00:53,000 --> 00:00:56,000
that we are going to use is something called the WordNet lemmatizer.

19
00:00:56,000 --> 00:01:01,000
I've already told you the disadvantage of stemming: the words that we are getting are

20
00:01:01,000 --> 00:01:02,000
not in the correct form.

21
00:01:02,000 --> 00:01:04,000
The meaning of the word is basically changing.

22
00:01:04,000 --> 00:01:08,000
So in order to prevent that, we use something called a lemmatizer.

23
00:01:08,000 --> 00:01:10,000
And for this we have something called the WordNet lemmatizer.

24
00:01:11,000 --> 00:01:13,000
Now, what is a lemmatizer? Okay.

25
00:01:13,000 --> 00:01:19,000
The lemmatization technique is like stemming. The output we will get after

26
00:01:19,000 --> 00:01:23,000
Lemmatization is called a lemma, which is a root word.

27
00:01:23,000 --> 00:01:26,000
Understand this: it is a root word.

28
00:01:26,000 --> 00:01:29,000
Suppose I have "eating"; then it will become "eat".

29
00:01:29,000 --> 00:01:31,000
"Eat" is the root word, right?

30
00:01:31,000 --> 00:01:36,000
If I talk about "history", "history" is the root word, right?

31
00:01:36,000 --> 00:01:40,000
If I talk about "go" and "goes", "go" is the root word.

32
00:01:40,000 --> 00:01:40,000
Right.

33
00:01:40,000 --> 00:01:44,000
In stemming, we were getting the word stem, right?

34
00:01:44,000 --> 00:01:46,000
It will apply an algorithm.

35
00:01:46,000 --> 00:01:50,000
And based on that it will try to find the stem, the word stem.

36
00:01:50,000 --> 00:01:51,000
That is what we basically call it.

37
00:01:51,000 --> 00:01:53,000
But here we get the exact root word.

38
00:01:53,000 --> 00:01:59,000
So again, the main aim of the lemmatizer is to give you the exact form of the word, which is meaningful,

39
00:01:59,000 --> 00:02:04,000
and it does not change the meaning, as was happening in stemming.

40
00:02:04,000 --> 00:02:07,000
So here you get the root word rather than the root stem, which is the output of stemming.

41
00:02:07,000 --> 00:02:08,000
Now again I'll repeat the definition.

42
00:02:08,000 --> 00:02:10,000
The lemmatization technique is like stemming.

43
00:02:10,000 --> 00:02:16,000
The output we will get after lemmatization is called a lemma, which is a root word rather than the root

44
00:02:16,000 --> 00:02:16,000
stem.

45
00:02:16,000 --> 00:02:19,000
That is the output of stemming. After lemmatization,

46
00:02:19,000 --> 00:02:24,000
we will be getting a valid word that means the same thing.

47
00:02:24,000 --> 00:02:29,000
So suppose there are words like "eating", "eats", "eaten"; they will all become "eat" only, right?

48
00:02:29,000 --> 00:02:34,000
So it will give you a meaningful word which represents many words.

49
00:02:34,000 --> 00:02:40,000
Okay, so NLTK provides the WordNetLemmatizer class, and I'll try to show you with respect to this, and

50
00:02:40,000 --> 00:02:46,000
understand, guys, this lemmatization gets performed with the

51
00:02:46,000 --> 00:02:49,000
help of the WordNet corpus reader.

52
00:02:49,000 --> 00:02:53,000
So there will be a dictionary of words which it will compare against.

53
00:02:53,000 --> 00:02:56,000
And from there it will be able to do the lemmatization properly.

54
00:02:56,000 --> 00:03:00,000
So first of all, let's go ahead and see how we can implement it.

55
00:03:00,000 --> 00:03:06,000
So first of all I will write: from nltk.stem import

56
00:03:06,000 --> 00:03:07,000
WordNetLemmatizer.

57
00:03:07,000 --> 00:03:08,000
Okay.

58
00:03:08,000 --> 00:03:10,000
So I'm just going to import this, and this

59
00:03:10,000 --> 00:03:12,000
WordNetLemmatizer

60
00:03:12,000 --> 00:03:14,000
again helps us to perform lemmatization.

61
00:03:14,000 --> 00:03:16,000
So first of all, I'll create an object.

62
00:03:16,000 --> 00:03:18,000
So I will write lemmatizer.

63
00:03:19,000 --> 00:03:22,000
And here I'm just going to write WordNetLemmatizer().

64
00:03:22,000 --> 00:03:25,000
So here if I see this, it is nothing but WordNet.

65
00:03:26,000 --> 00:03:28,000
So let me just go ahead and execute it.

66
00:03:28,000 --> 00:03:30,000
So this has got executed perfectly.

67
00:03:30,000 --> 00:03:33,000
Now, there is one thing that you really need to see, okay.

68
00:03:33,000 --> 00:03:39,000
So I will just be writing something; let's try some easy word, okay.

69
00:03:39,000 --> 00:03:46,000
So lemmatizer, and I'm just going to say dot lemmatize.

70
00:03:46,000 --> 00:03:48,000
So there is a function which is called lemmatize.

71
00:03:48,000 --> 00:03:50,000
And here I have to give my word.

72
00:03:51,000 --> 00:03:56,000
So let's say I'm giving "going", and if I try to see the answer, I'm actually getting "going".

73
00:03:56,000 --> 00:04:01,000
So that basically means that it is trying to find the root word with respect to this,

74
00:04:01,000 --> 00:04:04,000
but let's go and see the functionality of lemmatize.

75
00:04:06,000 --> 00:04:09,000
In lemmatize we give two important parameters.

76
00:04:09,000 --> 00:04:12,000
One is word, and the other is something called pos, the part-of-speech (POS) tag.

77
00:04:12,000 --> 00:04:14,000
Okay, I will talk about this POS tag.

78
00:04:14,000 --> 00:04:15,000
Right now it is written as n.

79
00:04:15,000 --> 00:04:19,000
"n" basically means that this word that I'm actually passing

80
00:04:19,000 --> 00:04:20,000
is being treated as a noun.

81
00:04:20,000 --> 00:04:24,000
So you may be thinking, Krish, how many POS tags will be there?

82
00:04:24,000 --> 00:04:26,000
So let me just write down a comment over here.

83
00:04:26,000 --> 00:04:27,000
Okay.

84
00:04:27,000 --> 00:04:29,000
So here you have different POS tags. For noun,

85
00:04:29,000 --> 00:04:31,000
we give it as small n. For verb,

86
00:04:31,000 --> 00:04:36,000
we give it as v. For adjective, we give it as a. For adverb, we give it as r.

87
00:04:36,000 --> 00:04:41,000
Now, what I'm actually going to do: by default right now the POS tag is n, right?

88
00:04:41,000 --> 00:04:44,000
So with respect to n I'm getting this output.

89
00:04:44,000 --> 00:04:44,000
Okay.

90
00:04:44,000 --> 00:04:49,000
Let's say I change it to v, because I just want to show you whether this will change or not.

91
00:04:49,000 --> 00:04:55,000
Now when I give the POS tag as n, I'm just saying: consider "going" as a noun. But "going" is

92
00:04:55,000 --> 00:04:56,000
not a noun, right?

93
00:04:56,000 --> 00:05:02,000
But if I consider it as a verb, now you see, with respect to verb, I am able to get a good

94
00:05:02,000 --> 00:05:04,000
lemmatization, that is, "go", okay?

95
00:05:04,000 --> 00:05:09,000
Similarly, let me try to see with respect to an adjective.

96
00:05:09,000 --> 00:05:11,000
So here you can see I'm getting "going", right?

97
00:05:11,000 --> 00:05:14,000
And obviously this "going" is a kind of verb, right?

98
00:05:14,000 --> 00:05:15,000
It is not an adjective.

99
00:05:15,000 --> 00:05:20,000
And let's say I try with respect to adverb: here also I'm getting "going". Now, with respect

100
00:05:20,000 --> 00:05:21,000
to this going word.

101
00:05:21,000 --> 00:05:21,000
Right?

102
00:05:21,000 --> 00:05:25,000
I feel verb is the correct one, which we really need to select.

103
00:05:25,000 --> 00:05:27,000
Now let me do one thing.

104
00:05:27,000 --> 00:05:29,000
All the words that I had copied over here.

105
00:05:29,000 --> 00:05:29,000
Right.

106
00:05:29,000 --> 00:05:34,000
Let's apply this lemmatization to all of these words.

107
00:05:34,000 --> 00:05:34,000
Okay.

108
00:05:34,000 --> 00:05:38,000
So I'm just going to execute this.

109
00:05:38,000 --> 00:05:41,000
And again I'm just going to write this for loop, okay.

110
00:05:41,000 --> 00:05:47,000
And I'm just going to copy this, and instead of writing stemming dot stem I'm just going to write lemmatizer

111
00:05:47,000 --> 00:05:49,000
dot lemmatize.

112
00:05:50,000 --> 00:05:50,000
Okay.

113
00:05:50,000 --> 00:05:52,000
And I'm just going to apply it to these words.

114
00:05:52,000 --> 00:05:59,000
Now see this: "eating" is becoming "eating"; then "eats", "eaten", "writing", "writes", "programming";

115
00:05:59,000 --> 00:06:05,000
"programs" has become "program", because again, by default in this lemmatize, remember that I have

116
00:06:05,000 --> 00:06:08,000
given my POS tag as n, right?

117
00:06:08,000 --> 00:06:13,000
We are considering that all these words are basically nouns.

118
00:06:13,000 --> 00:06:15,000
Okay, so "programming", "programs", "history", "finally".

119
00:06:15,000 --> 00:06:19,000
So here you'll see that lemmatization has not occurred that much.

120
00:06:19,000 --> 00:06:24,000
We have not been able to find the root words, because they are being considered as nouns, and

121
00:06:24,000 --> 00:06:24,000
with nouns,

122
00:06:24,000 --> 00:06:25,000
whenever we give noun,

123
00:06:25,000 --> 00:06:29,000
suppose, let's say, that all my words are names; with names,

124
00:06:29,000 --> 00:06:31,000
obviously lemmatization does not occur, right?

125
00:06:31,000 --> 00:06:36,000
So if I have Krishna, if I have Sudhanshu Kumar, if I have this kind of word, which is a name,

126
00:06:36,000 --> 00:06:37,000
right.

127
00:06:37,000 --> 00:06:38,000
In short, these are nouns, right?

128
00:06:38,000 --> 00:06:43,000
We have some famous place names like Taj Mahal, India, place names and all.

129
00:06:43,000 --> 00:06:43,000
All right.

130
00:06:43,000 --> 00:06:45,000
So all of these will be considered as nouns, right?

131
00:06:45,000 --> 00:06:49,000
So that basic difference you really need to know in order to understand this.

132
00:06:49,000 --> 00:06:52,000
Let's say that I now want to treat these words as adjectives.

133
00:06:52,000 --> 00:06:55,000
So here you can see "eating" is still "eating", and so are "eats" and "writing".

134
00:06:55,000 --> 00:06:55,000
Everything is the same.

135
00:06:55,000 --> 00:06:56,000
Right.

136
00:06:56,000 --> 00:07:00,000
But now let me try with respect to verb, because most of these words are basically verbs.

137
00:07:00,000 --> 00:07:01,000
Right.

138
00:07:01,000 --> 00:07:07,000
So here you can see "eating" has become "eat", "eats" has become "eat", "writing" "write", "writes" "write", "programming" "program",

139
00:07:07,000 --> 00:07:08,000
and this "history".

140
00:07:08,000 --> 00:07:09,000
Now you can see this, right?

141
00:07:09,000 --> 00:07:11,000
"History" is staying as "history" only. In stemming,

142
00:07:11,000 --> 00:07:13,000
we used to get something like "histori".

143
00:07:13,000 --> 00:07:13,000
Right.

144
00:07:13,000 --> 00:07:15,000
So this is quite amazing, right?

145
00:07:15,000 --> 00:07:16,000
"Finally" stays "finally", and all of this.

146
00:07:16,000 --> 00:07:18,000
And we could also see that in stemming,

147
00:07:18,000 --> 00:07:18,000
Right.

148
00:07:18,000 --> 00:07:21,000
We also had some words on which it was not able to perform well.

149
00:07:21,000 --> 00:07:22,000
Right.

150
00:07:22,000 --> 00:07:25,000
Like "fairly" and "sportingly" and all, right?

151
00:07:25,000 --> 00:07:26,000
And even "goes", right?

152
00:07:26,000 --> 00:07:33,000
So suppose I go ahead and write something like lemmatizer dot lemmatize, and let's say I'm

153
00:07:33,000 --> 00:07:35,000
going to write something like "goes", right.

154
00:07:35,000 --> 00:07:37,000
So I'm going to get "go", which is good.

155
00:07:37,000 --> 00:07:37,000
Right.

156
00:07:37,000 --> 00:07:39,000
And here I can also play with the POS tag.

157
00:07:39,000 --> 00:07:40,000
Right.

158
00:07:40,000 --> 00:07:43,000
So let's say I'm using the v POS tag: I'm going to get "go".

159
00:07:43,000 --> 00:07:45,000
Let me go and try with this.

160
00:07:45,000 --> 00:07:48,000
So here I'm just going to write lemmatizer, okay.

161
00:07:48,000 --> 00:07:51,000
And I'm just going to try this with "fairly".

162
00:07:51,000 --> 00:07:53,000
Let's see what kind of output we are getting.

163
00:07:53,000 --> 00:07:54,000
Now.

164
00:07:54,000 --> 00:07:56,000
This lemmatizer is super amazing, right?

165
00:07:56,000 --> 00:07:59,000
Because it is giving you a good word form.

166
00:07:59,000 --> 00:08:02,000
And the meaning of the word is also maintained.

167
00:08:02,000 --> 00:08:02,000
Right?

168
00:08:02,000 --> 00:08:04,000
So here you can see "fairly", "sportingly".

169
00:08:04,000 --> 00:08:08,000
Now let me write a POS tag, obviously, because it was noun by default.

170
00:08:08,000 --> 00:08:10,000
So it may not give you something.

171
00:08:10,000 --> 00:08:11,000
So here, with respect to verb

172
00:08:11,000 --> 00:08:14,000
also, I'm getting the same answer, which is good, right?

173
00:08:14,000 --> 00:08:18,000
So this is just to tell you how good lemmatization is.

174
00:08:18,000 --> 00:08:24,000
But one question that I really wanted to ask you: which will take more time, the WordNet lemmatizer

175
00:08:24,000 --> 00:08:24,000
or stemming?

176
00:08:24,000 --> 00:08:27,000
The answer is simple: the WordNet lemmatizer.

177
00:08:27,000 --> 00:08:28,000
Why?

178
00:08:28,000 --> 00:08:33,000
Because, as you can see, NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet

179
00:08:33,000 --> 00:08:34,000
corpus.

180
00:08:34,000 --> 00:08:36,000
So it is going to compare from there, right?

181
00:08:36,000 --> 00:08:40,000
It uses the morphy() function of the WordNetCorpusReader class to find a lemma.

182
00:08:40,000 --> 00:08:41,000
So this will basically take time.

183
00:08:41,000 --> 00:08:43,000
Right now I just have some number of words.

184
00:08:43,000 --> 00:08:45,000
So it is happening very fast.

185
00:08:45,000 --> 00:08:49,000
But if you have a paragraph, if you have bigger sentences, lemmatization is going to take time,

186
00:08:49,000 --> 00:08:50,000
right.

187
00:08:50,000 --> 00:08:56,000
So this is the basic difference between stemming and lemmatization. For which use cases can you use this?

188
00:08:56,000 --> 00:09:01,000
One simple use case is if I really want to build Q&A systems or chatbots.

189
00:09:01,000 --> 00:09:03,000
Right, chatbots.

190
00:09:03,000 --> 00:09:07,000
These are all amazing examples for this, right?

191
00:09:07,000 --> 00:09:09,000
You can basically use this, right.

192
00:09:10,000 --> 00:09:13,000
So let me write it down: Q&A, chatbots.

193
00:09:13,000 --> 00:09:14,000
Right.

194
00:09:14,000 --> 00:09:16,000
Text summarization.

195
00:09:17,000 --> 00:09:18,000
Right.

196
00:09:19,000 --> 00:09:23,000
Q&A: these are some of the examples which you can consider.

197
00:09:24,000 --> 00:09:26,000
Text summarization is also one example.

198
00:09:26,000 --> 00:09:28,000
And in many, many companies it is being used.

199
00:09:28,000 --> 00:09:28,000
So right.

200
00:09:28,000 --> 00:09:33,000
So you can use it in all these things, and you can implement the WordNet lemmatizer over

201
00:09:33,000 --> 00:09:37,000
there because you get the exact form of the word, that is, the root form of the word, which is meaningful.

202
00:09:37,000 --> 00:09:38,000
Right.

203
00:09:38,000 --> 00:09:42,000
So I hope you were able to understand lemmatization and you are able to find out the differences

204
00:09:42,000 --> 00:09:44,000
between stemming and lemmatization.

205
00:09:44,000 --> 00:09:45,000
This was it.

206
00:09:45,000 --> 00:09:47,000
And I will see you all in the next video.

207
00:09:47,000 --> 00:09:47,000
Thank you.

