1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue the discussion with respect to NLP.

3
00:00:03,000 --> 00:00:06,000
In our previous video, we already looked at tokenization.

4
00:00:06,000 --> 00:00:10,000
We covered basic terminology like corpus and paragraph.

5
00:00:10,000 --> 00:00:13,000
We also covered vocabulary and words.

6
00:00:13,000 --> 00:00:16,000
Now let's go ahead and do some practical things.

7
00:00:16,000 --> 00:00:19,000
After all, there is only so much we can learn from theory alone.

8
00:00:19,000 --> 00:00:22,000
So first of all, I'll go ahead and open my Python 3 notebook file.

9
00:00:22,000 --> 00:00:28,000
So here, for this, we are going to use libraries like NLTK.

10
00:00:28,000 --> 00:00:30,000
Now NLTK is one amazing library.

11
00:00:30,000 --> 00:00:32,000
Let me just go through this and let me just show you.

12
00:00:32,000 --> 00:00:37,000
NLTK is a leading platform for building Python programs to work with human language data.

13
00:00:37,000 --> 00:00:45,000
So if you really want to work with NLP, things like tokenization or converting a sentence into

14
00:00:45,000 --> 00:00:46,000
vectors can be done easily

15
00:00:46,000 --> 00:00:48,000
with the help of the NLTK library.

16
00:00:48,000 --> 00:00:51,000
There is also one more library, which is called spaCy.

17
00:00:51,000 --> 00:00:54,000
So let me search for spaCy NLP.

18
00:00:54,000 --> 00:00:56,000
So here you'll be able to see it.

19
00:00:56,000 --> 00:00:58,000
Again, these are all completely open source libraries.

20
00:00:58,000 --> 00:00:59,000
You can also use spaCy.

21
00:00:59,000 --> 00:01:03,000
And I will also be showing you how to do various things with the help of spaCy.

22
00:01:03,000 --> 00:01:04,000
Right.

23
00:01:04,000 --> 00:01:05,000
So spaCy is there.

24
00:01:05,000 --> 00:01:07,000
NLTK is also there.

25
00:01:07,000 --> 00:01:12,000
One important assignment that I want to give you is to find out the differences

26
00:01:12,000 --> 00:01:18,000
between NLTK and spaCy, and let me know in the comment section of this

27
00:01:18,000 --> 00:01:20,000
video or in the upcoming videos.

28
00:01:20,000 --> 00:01:24,000
Okay, so I want to give you this one task so that you try to understand what

29
00:01:24,000 --> 00:01:27,000
the difference is between these two open source libraries.

30
00:01:27,000 --> 00:01:31,000
Now, to begin with, we are going to start with NLTK.

31
00:01:31,000 --> 00:01:34,000
So first of all, I'm going to install NLTK.

32
00:01:34,000 --> 00:01:37,000
You can install it directly from the notebook.

33
00:01:37,000 --> 00:01:41,000
Or you can also open the command prompt and directly install NLTK.

34
00:01:41,000 --> 00:01:41,000
Right.

35
00:01:41,000 --> 00:01:45,000
So what you have to do is just write pip install nltk.

36
00:01:45,000 --> 00:01:48,000
Once you do this, the installation will be done automatically.

37
00:01:48,000 --> 00:01:51,000
So I'm going to install it from here.

38
00:01:51,000 --> 00:01:53,000
And I'm going to show you a tokenization example.

39
00:01:53,000 --> 00:01:55,000
So let me write it down.

40
00:01:55,000 --> 00:01:58,000
Tokenization example, okay.

41
00:01:59,000 --> 00:02:00,000
Perfect.

42
00:02:00,000 --> 00:02:02,000
So this is the tokenization example.

43
00:02:02,000 --> 00:02:06,000
So here you can see: collecting NLTK, requirement already satisfied.

44
00:02:06,000 --> 00:02:07,000
This is done.

45
00:02:07,000 --> 00:02:08,000
The installation has been done.

46
00:02:08,000 --> 00:02:11,000
A new release of pip is available.

47
00:02:11,000 --> 00:02:15,000
I don't want to update it right now, because this version of pip covers almost everything we need.

48
00:02:15,000 --> 00:02:16,000
Right now.

49
00:02:16,000 --> 00:02:17,000
This is perfect.

50
00:02:17,000 --> 00:02:21,000
Now I'm going to show you how we can perform tokenization.

51
00:02:21,000 --> 00:02:26,000
Let's say I have a paragraph: how can I convert it into sentences, and then how can I convert it

52
00:02:26,000 --> 00:02:30,000
into words? I'll be discussing all of that, and I'll also be showing you multiple ways of tokenization.

53
00:02:30,000 --> 00:02:37,000
Okay, so let me go ahead and make a few cells so that I can directly

54
00:02:37,000 --> 00:02:38,000
go ahead and execute the code.

55
00:02:38,000 --> 00:02:40,000
Now one thing over here.

56
00:02:40,000 --> 00:02:41,000
Yeah.

57
00:02:41,000 --> 00:02:43,000
Let's go ahead and start.

58
00:02:43,000 --> 00:02:46,000
First of all, I will define my own corpus, okay?

59
00:02:46,000 --> 00:02:48,000
Corpus basically means paragraph.

60
00:02:48,000 --> 00:02:53,000
So if I want to create a multi-line string, I have to use triple

61
00:02:53,000 --> 00:02:54,000
quotes. I will write:

62
00:02:54,000 --> 00:02:55,000
Hello.

63
00:02:56,000 --> 00:02:59,000
Welcome comma.

64
00:02:59,000 --> 00:03:03,000
Okay, I'm just writing some sentences here so that you'll be able to follow along.

65
00:03:03,000 --> 00:03:08,000
So: Hello Welcome, to Krish Naik's

66
00:03:11,000 --> 00:03:12,000
Tutorials.

67
00:03:12,000 --> 00:03:20,000
Or let's say I'll just write iNeuron, right? Or I will write something like this:

68
00:03:20,000 --> 00:03:31,000
Krish Naik's NLP Tutorials. Okay, this is my first line, and I can continue with my

69
00:03:31,000 --> 00:03:31,000
second line.

70
00:03:31,000 --> 00:03:32,000
So I'll write.

71
00:03:32,000 --> 00:03:42,000
Please do watch the entire course! Here I'm adding an exclamation mark;

72
00:03:42,000 --> 00:03:50,000
I'm using different characters here. Then: to become expert in NLP.

73
00:03:50,000 --> 00:03:53,000
Okay, so this is what I have actually done.

74
00:03:53,000 --> 00:03:58,000
I've defined a simple corpus, which is just a paragraph with around two

75
00:03:58,000 --> 00:03:59,000
sentences.

76
00:03:59,000 --> 00:04:01,000
And I'm just going to use this particular corpus.

77
00:04:01,000 --> 00:04:03,000
So let's go ahead and see this corpus.

78
00:04:04,000 --> 00:04:09,000
Now when I display this corpus, you can see the \n in the output, because I'm just showing the variable, not printing it.

79
00:04:09,000 --> 00:04:10,000
I can also print it.

80
00:04:10,000 --> 00:04:12,000
Then the \n will go away.

81
00:04:12,000 --> 00:04:13,000
\n indicates a new line.

82
00:04:13,000 --> 00:04:18,000
Okay, so if I print the corpus here, you'll be able to see how the text actually

83
00:04:18,000 --> 00:04:19,000
looks.
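
As a rough sketch of this notebook step: the exact corpus text below is an assumption reconstructed from the narration, not a copy of the on-screen cell. It shows the difference between displaying the string (where the line break appears as \n) and printing it.

```python
# Assumed reconstruction of the corpus dictated in the video.
corpus = """Hello Welcome, to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

# repr() is what a notebook shows when a cell ends with the bare variable:
# the line break appears as the escape sequence \n.
print(repr(corpus))

# print() renders the line break, so the text appears on separate lines.
print(corpus)
```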

84
00:04:19,000 --> 00:04:26,000
Now with respect to tokenization, the first step that I'm going to do is to

85
00:04:26,000 --> 00:04:32,000
convert a paragraph into sentences.

86
00:04:33,000 --> 00:04:33,000
Okay.

87
00:04:33,000 --> 00:04:36,000
So I'm going to convert a paragraph into sentences.

88
00:04:36,000 --> 00:04:38,000
So how do I do this conversion?

89
00:04:38,000 --> 00:04:41,000
With the help of NLTK it is definitely possible.

90
00:04:41,000 --> 00:04:47,000
So I will write from nltk.tokenize, okay?

91
00:04:47,000 --> 00:04:49,000
And I'm going to just import.

92
00:04:50,000 --> 00:04:50,000
Okay.

93
00:04:50,000 --> 00:04:54,000
So in NLTK there is a module called tokenize.

94
00:04:54,000 --> 00:04:58,000
And from it I import sent_tokenize.

95
00:04:58,000 --> 00:05:04,000
What sent_tokenize does is convert a paragraph into sentences.

96
00:05:04,000 --> 00:05:09,000
So this is the function that we are going to use,

97
00:05:09,000 --> 00:05:12,000
which is present inside nltk.tokenize.

98
00:05:12,000 --> 00:05:12,000
Okay.

99
00:05:12,000 --> 00:05:15,000
You can think of tokenize as a kind of package inside NLTK.

100
00:05:15,000 --> 00:05:23,000
And I will use this function to convert our paragraph

101
00:05:23,000 --> 00:05:24,000
into sentences.

102
00:05:24,000 --> 00:05:29,000
So here I'm going to write sent_tokenize and pass in my corpus.

103
00:05:29,000 --> 00:05:35,000
Once I pass in my corpus, you can see that it gives us a list of sentences.

104
00:05:35,000 --> 00:05:38,000
Here you can see my sentences.

105
00:05:38,000 --> 00:05:40,000
Hello Welcome, to Krish Naik's

106
00:05:40,000 --> 00:05:41,000
NLP Tutorials.

107
00:05:41,000 --> 00:05:43,000
So here you can see this is my first sentence.

108
00:05:43,000 --> 00:05:45,000
And this is my second sentence.

109
00:05:45,000 --> 00:05:49,000
And it has also split at the exclamation mark, which is why I'm getting three sentences, right?

110
00:05:49,000 --> 00:05:50,000
So here you can see.

111
00:05:50,000 --> 00:05:52,000
Hello Welcome, to Krish Naik's NLP Tutorials.

112
00:05:52,000 --> 00:05:53,000
Full stop.

113
00:05:53,000 --> 00:05:57,000
As soon as it finds a full stop, it starts the next sentence.

114
00:05:57,000 --> 00:06:02,000
So here the first sentence ends at the full stop just before the \n, and wherever an exclamation mark is present,

115
00:06:02,000 --> 00:06:06,000
a new sentence is created there as well.

116
00:06:06,000 --> 00:06:08,000
And that is what sent_tokenize actually does.

117
00:06:08,000 --> 00:06:13,000
If you look at the definition of this function, you can see that you can also specify

118
00:06:13,000 --> 00:06:15,000
a language, so you can check which other languages it supports.

119
00:06:15,000 --> 00:06:17,000
You can go ahead and have a look at that.

120
00:06:17,000 --> 00:06:20,000
And if you don't find much documentation there,

121
00:06:20,000 --> 00:06:28,000
what you can do is just go and search for NLTK sent_tokenize, right? sent_

122
00:06:28,000 --> 00:06:29,000
tokenize.

123
00:06:29,000 --> 00:06:33,000
You will find the documentation page directly.

124
00:06:33,000 --> 00:06:34,000
And you can refer to it from there.

125
00:06:34,000 --> 00:06:35,000
Right.

126
00:06:35,000 --> 00:06:36,000
There are various tokenizers.

127
00:06:36,000 --> 00:06:37,000
See, sent_tokenize

128
00:06:37,000 --> 00:06:41,000
is there, and word_tokenize is there, which we are going to discuss shortly.

129
00:06:41,000 --> 00:06:42,000
Right now we are going to focus on this one, okay?

130
00:06:42,000 --> 00:06:46,000
So in short, it returns a sentence-tokenized copy of the text using NLTK's

131
00:06:47,000 --> 00:06:51,000
recommended sentence tokenizer, and the class

132
00:06:51,000 --> 00:06:54,000
it currently uses is called PunktSentenceTokenizer.

133
00:06:54,000 --> 00:06:59,000
So along with the full stop, it makes sure that wherever an exclamation mark appears, the text after it

134
00:06:59,000 --> 00:07:02,000
becomes another sentence.

135
00:07:02,000 --> 00:07:03,000
So this is perfect.

136
00:07:03,000 --> 00:07:05,000
We are able to get this, right?

137
00:07:05,000 --> 00:07:07,000
Now I will go on to the next tokenization.

138
00:07:07,000 --> 00:07:14,000
But before that, I also want to save this list of sentences in a variable.

139
00:07:14,000 --> 00:07:18,000
These sentences are also called documents, which I already discussed in my previous

140
00:07:18,000 --> 00:07:20,000
session.

141
00:07:20,000 --> 00:07:20,000
Right.

142
00:07:20,000 --> 00:07:24,000
So if I look at documents, this is my list.

143
00:07:24,000 --> 00:07:26,000
You can also check this with the help of type.

144
00:07:26,000 --> 00:07:29,000
So type(documents), right?

145
00:07:29,000 --> 00:07:38,000
And if I want to iterate through this, let's say

146
00:07:38,000 --> 00:07:41,000
for sentence in documents, I can print each sentence as I go.

147
00:07:42,000 --> 00:07:42,000
Right.

148
00:07:42,000 --> 00:07:45,000
So here I can just name the loop variable sentence.

149
00:07:45,000 --> 00:07:48,000
So this is my first sentence, second sentence and third sentence.

150
00:07:48,000 --> 00:07:49,000
This is perfect.

151
00:07:49,000 --> 00:07:51,000
We were able to do this for sentences.

152
00:07:51,000 --> 00:07:54,000
Now let's go ahead and do it with word_tokenize.

153
00:07:54,000 --> 00:07:56,000
So next tokenization technique.

154
00:07:56,000 --> 00:08:04,000
My next tokenization is this: I can convert a paragraph,

155
00:08:05,000 --> 00:08:09,000
I can convert a paragraph into words.

156
00:08:09,000 --> 00:08:12,000
I can also convert a sentence into words.

157
00:08:13,000 --> 00:08:14,000
Sentence into words.

158
00:08:14,000 --> 00:08:16,000
Okay, perfect.

159
00:08:16,000 --> 00:08:18,000
Now for this what I'm actually going to do.

160
00:08:18,000 --> 00:08:23,000
First of all, let's look at the paragraph case.

161
00:08:23,000 --> 00:08:29,000
To convert a paragraph into words, I'll again be using another function, so

162
00:08:29,000 --> 00:08:32,000
I can write from nltk dot, okay,

163
00:08:32,000 --> 00:08:37,000
tokenize: from nltk.tokenize.

164
00:08:37,000 --> 00:08:37,000
Okay.

165
00:08:37,000 --> 00:08:39,000
There is a spelling mistake.

166
00:08:39,000 --> 00:08:43,000
So from tokenize I'm going to import word_

167
00:08:44,000 --> 00:08:45,000
Let me just write it down.

168
00:08:45,000 --> 00:08:46,000
word_tokenize.

169
00:08:46,000 --> 00:08:47,000
Okay.

170
00:08:47,000 --> 00:08:50,000
So here we can use word_tokenize.

171
00:08:50,000 --> 00:08:52,000
Let me execute it.

172
00:08:52,000 --> 00:08:55,000
And let me go ahead and write word_tokenize.

173
00:08:55,000 --> 00:09:01,000
If I pass in my corpus directly, you can see that every word has been separated out.

174
00:09:01,000 --> 00:09:05,000
Hello, Welcome, and so on.

175
00:09:05,000 --> 00:09:13,000
You can see that characters like the comma and the full stop have been treated as separate tokens,

176
00:09:13,000 --> 00:09:14,000
that is, as separate words altogether.

177
00:09:14,000 --> 00:09:15,000
Right?

178
00:09:15,000 --> 00:09:20,000
So here you can clearly see that every word has been split out;

179
00:09:20,000 --> 00:09:23,000
there is only one place where that has not happened, right?

180
00:09:23,000 --> 00:09:25,000
So if I go and look over here:

181
00:09:25,000 --> 00:09:27,000
Hello Welcome, to Krish Naik's.

182
00:09:27,000 --> 00:09:29,000
Here the apostrophe s has not been split up.

183
00:09:29,000 --> 00:09:32,000
It is all being considered as a single word.

184
00:09:32,000 --> 00:09:37,000
But the full stop, the exclamation mark, and the comma

185
00:09:37,000 --> 00:09:39,000
have each been treated as a separate word.

186
00:09:39,000 --> 00:09:41,000
You can see that here as well.

187
00:09:41,000 --> 00:09:42,000
Right.

188
00:09:42,000 --> 00:09:47,000
So converting a paragraph into words is very, very simple.

189
00:09:47,000 --> 00:09:48,000
And why do we do this?

190
00:09:48,000 --> 00:09:51,000
Because every word will have a different importance.

191
00:09:51,000 --> 00:09:54,000
And we really need to perform some pre-processing on top of it.

192
00:09:54,000 --> 00:09:59,000
So when you first get the text, you need to know which words are important; you have to take them,

193
00:09:59,000 --> 00:10:03,000
pre-process them, and clean them, and that is the reason why we specifically

194
00:10:03,000 --> 00:10:05,000
focus on each and every word.

195
00:10:05,000 --> 00:10:05,000
Okay.

196
00:10:05,000 --> 00:10:07,000
So that was word_tokenize.

197
00:10:07,000 --> 00:10:12,000
Now, for sentences, you already know how to do it: just go up

198
00:10:12,000 --> 00:10:16,000
here, copy the loop, and paste it here: for sentence in documents.

199
00:10:16,000 --> 00:10:19,000
Right now you are printing the sentences here.

200
00:10:19,000 --> 00:10:24,000
What you can do instead is that, once you have each sentence

201
00:10:24,000 --> 00:10:27,000
as a whole, you can directly apply word_tokenize to it, right?

202
00:10:27,000 --> 00:10:32,000
So here you can write word_tokenize(sentence).

203
00:10:32,000 --> 00:10:36,000
So you will be able to print everything over here.

204
00:10:36,000 --> 00:10:36,000
Right.

205
00:10:36,000 --> 00:10:40,000
So: Hello, Welcome, to, and so on; then Please, do, watch, the, entire, course.

206
00:10:40,000 --> 00:10:42,000
This this is there.

207
00:10:42,000 --> 00:10:42,000
Right.

208
00:10:42,000 --> 00:10:43,000
Perfect.

209
00:10:43,000 --> 00:10:48,000
So here we have seen how we can convert a sentence into words. Now, you can also

210
00:10:48,000 --> 00:10:51,000
do one more thing here, which is to use one more function.

211
00:10:51,000 --> 00:10:53,000
Let me talk about one more tokenizer.

212
00:10:53,000 --> 00:10:54,000
Probably you have seen that.

213
00:10:55,000 --> 00:10:58,000
So I'm going to write wordpunct_tokenize.

214
00:10:58,000 --> 00:11:00,000
And with this wordpunct_tokenize,

215
00:11:00,000 --> 00:11:06,000
if I try to apply it, that is, if I call

216
00:11:06,000 --> 00:11:12,000
wordpunct_tokenize and pass in my corpus, just try to find the difference.

217
00:11:12,000 --> 00:11:18,000
One difference that is clearly visible is that this apostrophe s has now been split as well; before, it was

218
00:11:18,000 --> 00:11:19,000
not getting split, right?

219
00:11:19,000 --> 00:11:24,000
See, apostrophe s was a single word, but now you can see that it has been split.

220
00:11:24,000 --> 00:11:26,000
So that is the reason for the punct in the name:

221
00:11:27,000 --> 00:11:32,000
it makes sure that punctuation is also treated

222
00:11:32,000 --> 00:11:33,000
as a separate word.
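
To make the apostrophe difference concrete, here is a small sketch using wordpunct_tokenize from nltk.tokenize. The corpus text is an assumed reconstruction of the one dictated in the video. This tokenizer is regular-expression based, so it should not need any extra data download.

```python
from nltk.tokenize import wordpunct_tokenize

# Assumed reconstruction of the corpus dictated in the video.
corpus = """Hello Welcome, to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

# Runs of word characters and runs of punctuation become separate tokens,
# so the apostrophe and the s after it are two tokens here.
tokens = wordpunct_tokenize(corpus)
print(tokens)
```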

223
00:11:33,000 --> 00:11:34,000
Perfect.

224
00:11:34,000 --> 00:11:35,000
So this is good.

225
00:11:35,000 --> 00:11:39,000
There is also one more technique, which is called TreebankWordTokenizer.

226
00:11:39,000 --> 00:11:45,000
Again, I'll tell you exactly what the difference is with the Treebank word

227
00:11:45,000 --> 00:11:46,000
tokenizer.

228
00:11:46,000 --> 00:11:50,000
So I'll execute it, and you try to find the differences.

229
00:11:50,000 --> 00:11:52,000
Let's see how much you'll be able to do.

230
00:11:52,000 --> 00:12:01,000
So I will write from nltk.tokenize import TreebankWordTokenizer.

231
00:12:01,000 --> 00:12:04,000
So I will initialize this TreebankWordTokenizer.

232
00:12:05,000 --> 00:12:10,000
Okay, let's say I assign it to a variable called tokenizer, something like this.

233
00:12:10,000 --> 00:12:16,000
And then I can use a method called tokenizer.tokenize.

234
00:12:16,000 --> 00:12:21,000
Once I pass in my corpus, you'll see that I get the tokens.

235
00:12:21,000 --> 00:12:25,000
Okay, now just look at this.

236
00:12:25,000 --> 00:12:28,000
What is the difference here?

237
00:12:28,000 --> 00:12:35,000
You will definitely be able to find some difference when comparing it with the earlier output.

238
00:12:35,000 --> 00:12:38,000
Okay, so just let me know what difference you are able to see.

239
00:12:38,000 --> 00:12:41,000
I know there is a very minute difference.

240
00:12:41,000 --> 00:12:46,000
So let me tell you the answer. You can pause for some time and check it

241
00:12:46,000 --> 00:12:46,000
out yourself.

242
00:12:46,000 --> 00:12:50,000
But let me tell you the answer for TreebankWordTokenizer.

243
00:12:50,000 --> 00:12:53,000
Here you can see that a full stop in the middle of the text is not treated as a separate word.

244
00:12:53,000 --> 00:12:56,000
It is included in the previous word itself.

245
00:12:56,000 --> 00:12:58,000
In the earlier outputs, you can see that the full stop was a separate word.

246
00:12:58,000 --> 00:12:59,000
Here,

247
00:12:59,000 --> 00:13:02,000
also, you can see that the full stop is a separate word, right?

248
00:13:02,000 --> 00:13:07,000
But for the last word, the full stop will be separate, because here you can

249
00:13:07,000 --> 00:13:08,000
see that, right?

250
00:13:08,000 --> 00:13:11,000
If I look at the sentence, right after this full stop we have a new line.

251
00:13:11,000 --> 00:13:12,000
Right?

252
00:13:12,000 --> 00:13:16,000
And after this one, our text ends. Only the last full stop

253
00:13:16,000 --> 00:13:18,000
is considered a separate word.

254
00:13:18,000 --> 00:13:23,000
But this particular full stop is considered part of the word; it is not

255
00:13:23,000 --> 00:13:25,000
considered a separate word.
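
The full stop behaviour described above can be checked with a short sketch. The corpus text is an assumed reconstruction of the one dictated in the video. TreebankWordTokenizer is rule based, so it should not need any extra data download.

```python
from nltk.tokenize import TreebankWordTokenizer

# Assumed reconstruction of the corpus dictated in the video.
corpus = """Hello Welcome, to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)

# The full stop in the middle of the text stays attached to its word,
# while the one at the very end of the text becomes its own token.
print("Tutorials." in tokens)
print(tokens[-2:])
```

This is also why word_tokenize gives a different result on the same paragraph: it first splits the text into sentences, so every sentence-final full stop sits at the end of the string being tokenized.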

256
00:13:25,000 --> 00:13:27,000
That is the difference here.

257
00:13:27,000 --> 00:13:32,000
Again, it can be handy in some use cases, but not in all. In general, we

258
00:13:32,000 --> 00:13:35,000
mostly use word_tokenize or sent_tokenize.

259
00:13:35,000 --> 00:13:35,000
Right.

260
00:13:35,000 --> 00:13:37,000
So yes, that was it

261
00:13:37,000 --> 00:13:40,000
for the tokenization example. I hope you liked this video.

262
00:13:40,000 --> 00:13:43,000
Please make sure that you practice.

263
00:13:43,000 --> 00:13:46,000
You really have to practice this a lot.

264
00:13:46,000 --> 00:13:49,000
You can take your own examples and try out different sentences.

265
00:13:49,000 --> 00:13:53,000
That way you will get better at these tokenization techniques.

266
00:13:53,000 --> 00:13:56,000
So yes, I will see you all in the next video.

267
00:13:56,000 --> 00:13:56,000
Thank you.