1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to natural language processing.

3
00:00:04,000 --> 00:00:10,000
In this video, we are going to cover some of the basic terminologies that are required in NLP.

4
00:00:10,000 --> 00:00:15,000
You really need to understand these terminologies because I am going to repeat these terminologies again

5
00:00:15,000 --> 00:00:18,000
and again when we are discussing the other topics.

6
00:00:18,000 --> 00:00:25,000
So the topics that are going to be covered in this video are corpus, documents, vocabulary, and words.

7
00:00:25,000 --> 00:00:31,000
You really need to know all these terms and what exactly they are, with some basic examples.

8
00:00:31,000 --> 00:00:39,000
Now usually whenever we get a paragraph, that paragraph is called a corpus.

9
00:00:39,000 --> 00:00:39,000
Okay.

10
00:00:40,000 --> 00:00:47,000
With respect to documents, whenever you have any kind of sentences, you really need to understand

11
00:00:47,000 --> 00:00:51,000
that these sentences are also usually called documents.

12
00:00:51,000 --> 00:00:53,000
What about vocabulary?

13
00:00:53,000 --> 00:00:59,000
Vocabulary is nothing but all the unique words that are present in the paragraph; that is basically

14
00:00:59,000 --> 00:01:01,000
called the vocabulary.

15
00:01:01,000 --> 00:01:03,000
Usually we have a dictionary, right?

16
00:01:03,000 --> 00:01:04,000
We usually say that.

17
00:01:04,000 --> 00:01:07,000
What is the vocabulary in that particular dictionary?

18
00:01:07,000 --> 00:01:12,000
All the unique words, or the count of all the unique words, that are present

19
00:01:12,000 --> 00:01:15,000
in the dictionary: that is called the vocabulary.

20
00:01:15,000 --> 00:01:21,000
And with respect to words, all the words that are present in a corpus we will basically treat

21
00:01:21,000 --> 00:01:24,000
separately as individual words.

22
00:01:24,000 --> 00:01:29,000
So these are the basic terminologies that you really need to understand.

23
00:01:29,000 --> 00:01:34,000
As said, in this video we are going to discuss something called tokenization.

24
00:01:35,000 --> 00:01:42,000
And tokenization is a very important step whenever we try to solve any kind of use cases with respect

25
00:01:42,000 --> 00:01:43,000
to NLP.

26
00:01:43,000 --> 00:01:46,000
Now what exactly is tokenization?

27
00:01:46,000 --> 00:01:47,000
Right.

28
00:01:47,000 --> 00:01:56,000
So let's say that I have a paragraph I write over here that my name is Krish, my name is Krish.

29
00:01:57,000 --> 00:01:57,000
Okay.

30
00:01:58,000 --> 00:02:01,000
And I have a.

31
00:02:03,000 --> 00:02:07,000
I have an interest in teaching.

32
00:02:08,000 --> 00:02:11,000
I have an interest in teaching.

33
00:02:14,000 --> 00:02:19,000
Machine learning and NLP and deep learning and DL.

34
00:02:20,000 --> 00:02:27,000
Now, let's say that I have this specific text. This text I can basically consider as a paragraph.

35
00:02:27,000 --> 00:02:29,000
So this entire text will be the corpus.

36
00:02:29,000 --> 00:02:30,000
Okay.

37
00:02:30,000 --> 00:02:36,000
So this is my entire corpus that is available, which is nothing but a paragraph of words.

38
00:02:36,000 --> 00:02:36,000
Right.

39
00:02:36,000 --> 00:02:39,000
So if I probably combine all these words, it becomes a paragraph.

40
00:02:39,000 --> 00:02:49,000
Now tokenization is a process wherein we take either a paragraph or a sentence, and we convert it

41
00:02:49,000 --> 00:02:51,000
into tokens.

42
00:02:51,000 --> 00:02:52,000
Right.

43
00:02:52,000 --> 00:02:57,000
Now suppose let's say I want to perform a tokenization on this particular paragraph.

44
00:02:58,000 --> 00:03:04,000
And over here, from this paragraph, the tokens that are generated will basically be called

45
00:03:04,000 --> 00:03:07,000
sentences or documents.

46
00:03:07,000 --> 00:03:11,000
So let's say that I will be applying a tokenization on this.

47
00:03:12,000 --> 00:03:19,000
And with respect to this, let's say that I will try to convert this entire paragraph into sentences.

48
00:03:19,000 --> 00:03:22,000
So I may also add one more line over here.

49
00:03:22,000 --> 00:03:23,000
Let's say full stop.

50
00:03:23,000 --> 00:03:27,000
I'm just writing one more full stop over here okay.

51
00:03:27,000 --> 00:03:31,000
And I will also write that I am also a YouTuber.

52
00:03:34,000 --> 00:03:34,000
Okay.

53
00:03:34,000 --> 00:03:38,000
So these are the two sentences that are present in this paragraph.

54
00:03:38,000 --> 00:03:44,000
So with respect to this particular tokenization, if I perform a tokenization on this paragraph, it

55
00:03:44,000 --> 00:03:46,000
will basically create sentences.

56
00:03:46,000 --> 00:03:56,000
My first sentence in this particular case will be my name is Krish okay.

57
00:03:56,000 --> 00:03:59,000
And I have interest in.

58
00:04:01,000 --> 00:04:02,000
Interest in teaching.

59
00:04:07,000 --> 00:04:12,000
ML, NLP and DL.

60
00:04:12,000 --> 00:04:18,000
Okay, so this is basically my document one, or sentence one.

61
00:04:19,000 --> 00:04:24,000
My next sentence that I'm going to probably write over here because the full stop is over here.

62
00:04:24,000 --> 00:04:24,000
Right.

63
00:04:24,000 --> 00:04:32,000
So when we convert from a paragraph, when we do tokenization from a paragraph into sentences,

64
00:04:32,000 --> 00:04:37,000
it will be looking for characters like a full stop or an exclamation mark.

65
00:04:37,000 --> 00:04:41,000
I'll show you practically how this can be actually done with the help of Python programming language.

66
00:04:41,000 --> 00:04:48,000
So the second sentence that I will be having is, I am also a YouTuber, right?

67
00:04:49,000 --> 00:04:51,000
I am also a YouTuber.

68
00:04:51,000 --> 00:04:55,000
So again, if you really want to understand what exactly is tokenization?

69
00:04:55,000 --> 00:05:01,000
Tokenization is a simple process wherein we are converting

70
00:05:01,000 --> 00:05:03,000
a paragraph into sentences.
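
The paragraph-to-sentences step described above can be sketched in plain Python. This is only a minimal illustration using the standard `re` module, not the NLTK approach promised for the next video; the function name `simple_sentence_tokenize` and the rule "split after a full stop, exclamation mark or question mark followed by whitespace" are assumptions made for this example.

```python
import re

def simple_sentence_tokenize(paragraph):
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    # Drop empty strings that can appear for empty input.
    return [part for part in parts if part]

# The corpus from the example: one paragraph containing two sentences.
corpus = (
    "My name is Krish and I have an interest in teaching ML, NLP and DL. "
    "I am also a YouTuber."
)

documents = simple_sentence_tokenize(corpus)
for number, document in enumerate(documents, start=1):
    print(f"document {number}: {document}")
```

Each element of `documents` is one token at the sentence level. A rule this simple will split wrongly on abbreviations such as "Dr. Smith", which is one reason dedicated libraries exist.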

71
00:05:03,000 --> 00:05:08,000
Now there may also be a scenario that let's say that I have some sentences.

72
00:05:08,000 --> 00:05:08,000
Okay.

73
00:05:08,000 --> 00:05:13,000
And on top of this I can also perform tokenization again.

74
00:05:13,000 --> 00:05:16,000
So let's say on top of this I'm performing a tokenization.

75
00:05:17,000 --> 00:05:25,000
Now this tokenization technique that I am applying will convert these sentences into words,

76
00:05:26,000 --> 00:05:26,000
right?

77
00:05:26,000 --> 00:05:30,000
So let's say I say over here it is basically getting converted into words.

78
00:05:30,000 --> 00:05:33,000
So each and every word will be a separate word.

79
00:05:33,000 --> 00:05:35,000
So "My" will be a separate word.

80
00:05:35,000 --> 00:05:38,000
"Name" will be a separate word, "is" will be a separate word.

81
00:05:38,000 --> 00:05:42,000
"Krish" will be a separate word, "and" will be a separate word, "I" will be a separate word.

82
00:05:42,000 --> 00:05:44,000
Have interest in teaching.

83
00:05:45,000 --> 00:05:47,000
All of these will be separate words.

84
00:05:48,000 --> 00:05:48,000
Right.

85
00:05:48,000 --> 00:05:51,000
So this process is also called tokenization.

86
00:05:52,000 --> 00:05:54,000
So in short words can also be a token.

87
00:05:55,000 --> 00:05:57,000
Sentences can also be a token.
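
Word-level tokenization of one sentence can be sketched the same way. Again this is a minimal standard-library illustration under stated assumptions: the name `simple_word_tokenize` is hypothetical, and treating each punctuation mark as its own token is a choice made for this example, not something stated in the video.

```python
import re

def simple_word_tokenize(sentence):
    # Assumed rule: a token is either a run of letters, digits or apostrophes,
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)

sentence = "My name is Krish and I have an interest in teaching ML, NLP and DL."
tokens = simple_word_tokenize(sentence)
print(tokens)
```

Here every word ("My", "name", "is", "Krish", and so on) comes out as a separate token, and the comma and full stop are tokens of their own.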

88
00:05:57,000 --> 00:05:58,000
Right.

89
00:05:58,000 --> 00:06:00,000
This is very important to understand.

90
00:06:00,000 --> 00:06:01,000
And why it is required.

91
00:06:01,000 --> 00:06:05,000
Because this is a part of text preprocessing.

92
00:06:05,000 --> 00:06:10,000
Because each and every word in NLP needs to be converted into a vector.

93
00:06:11,000 --> 00:06:15,000
So we really need to take up each word and try to do this kind of pre-processing.

94
00:06:15,000 --> 00:06:19,000
And there are a lot of steps like cleaning and all, which I will also be showing you.

95
00:06:19,000 --> 00:06:23,000
But in this video we are trying to understand tokenization.

96
00:06:23,000 --> 00:06:25,000
So I hope you have got an idea about corpus.

97
00:06:25,000 --> 00:06:26,000
You have got an idea about sentences.

98
00:06:26,000 --> 00:06:31,000
Now let's go ahead and understand vocabulary, which is the set of unique words.

99
00:06:31,000 --> 00:06:32,000
Okay.

100
00:06:32,000 --> 00:06:42,000
Now let's say I have two sentences I like to eat apple juice.

101
00:06:43,000 --> 00:06:44,000
Sorry.

102
00:06:44,000 --> 00:06:46,000
How can we eat apple juice?

103
00:06:46,000 --> 00:06:51,000
I like to drink apple juice.

104
00:06:53,000 --> 00:06:54,000
Okay.

105
00:06:55,000 --> 00:06:57,000
Here I will again continue and I'll write:

106
00:06:57,000 --> 00:07:00,000
My friend likes.

107
00:07:03,000 --> 00:07:05,000
Mango juice.

108
00:07:06,000 --> 00:07:07,000
Okay.

109
00:07:07,000 --> 00:07:09,000
Now let's say that this is my entire paragraph.

110
00:07:10,000 --> 00:07:11,000
Okay.

111
00:07:11,000 --> 00:07:13,000
Now, in this paragraph, you know how many sentences are there?

112
00:07:13,000 --> 00:07:16,000
There are two sentences because there is a full stop over here.

113
00:07:16,000 --> 00:07:16,000
Right.

114
00:07:16,000 --> 00:07:20,000
So I will just divide this into tokens.

115
00:07:20,000 --> 00:07:25,000
So let's say I'm going to perform something called as tokenization over here okay.

116
00:07:25,000 --> 00:07:27,000
And this will get converted into tokens.

117
00:07:27,000 --> 00:07:32,000
And right now the tokens that are present over here will be sentences.

118
00:07:32,000 --> 00:07:33,000
Right.

119
00:07:33,000 --> 00:07:42,000
So my first sentence will be I like to drink apple juice.

120
00:07:42,000 --> 00:07:44,000
So this is my first sentence.

121
00:07:44,000 --> 00:07:46,000
And the second sentence is nothing but,

122
00:07:46,000 --> 00:07:47,000
because there is a full stop:

123
00:07:48,000 --> 00:07:54,000
My friend likes mango juice.

124
00:07:56,000 --> 00:07:57,000
Mango juice.

125
00:07:57,000 --> 00:08:02,000
Now see, when we have the sentences, obviously you can go and count each and every word,

126
00:08:02,000 --> 00:08:03,000
right?

127
00:08:03,000 --> 00:08:05,000
Let's say how many total number of words are over here.

128
00:08:05,000 --> 00:08:12,000
So if I go and count: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

129
00:08:12,000 --> 00:08:13,000
Right.

130
00:08:13,000 --> 00:08:18,000
So if I count it again: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.

131
00:08:18,000 --> 00:08:21,000
So total I have 11 words.

132
00:08:21,000 --> 00:08:28,000
But if I try to count the unique words, how many unique words are there?

133
00:08:28,000 --> 00:08:34,000
If I make the count again, "I" will be one unique word, "like" another unique word:

134
00:08:34,000 --> 00:08:41,000
1, 2, 3, 4, 5, 6, 7, 8, 9.

135
00:08:41,000 --> 00:08:43,000
See, "like" and "likes" are two different words.

136
00:08:43,000 --> 00:08:45,000
So I'll say 9, 10.

137
00:08:45,000 --> 00:08:51,000
But this "juice" is getting repeated, so the total number of unique words will basically be ten.

138
00:08:52,000 --> 00:08:52,000
Right.

139
00:08:52,000 --> 00:08:59,000
Let's say instead of this "likes" there was "like". At that point, the number

140
00:08:59,000 --> 00:09:08,000
of unique words will be 1, 2, 3, 4, 5, 6, 7, 8, 9.

141
00:09:08,000 --> 00:09:09,000
Right.

142
00:09:09,000 --> 00:09:12,000
I will not count the repeated "like" and "juice", right?

143
00:09:12,000 --> 00:09:14,000
But as written, "likes" is there.

144
00:09:14,000 --> 00:09:16,000
So it will be counted as a separate word.

145
00:09:16,000 --> 00:09:22,000
So whenever I get ten unique words, that basically means that in this

146
00:09:22,000 --> 00:09:25,000
complete paragraph, this is my vocabulary.

147
00:09:25,000 --> 00:09:29,000
So this is all the possible words that I have, right?

148
00:09:29,000 --> 00:09:34,000
That is ten words. Right now, since I have changed this to "like", I'm just going to make this

149
00:09:34,000 --> 00:09:36,000
as nine words.
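
The counting above can be checked with a short sketch. It assumes that words are compared exactly as written (so "like" and "likes" are different, and "My" keeps its capital letter) and that punctuation is ignored; the helper name `words_of` is hypothetical.

```python
import re

def words_of(text):
    # Assumed rule: a word is a run of letters; punctuation is dropped.
    return re.findall(r"[A-Za-z]+", text)

corpus = "I like to drink apple juice. My friend likes mango juice."

words = words_of(corpus)
vocabulary = set(words)  # unique words only; "juice" is stored once

print(len(words))        # 11 words in total
print(len(vocabulary))   # 10 unique words

# Variant from the example: "likes" replaced by "like".
variant = corpus.replace("likes", "like")
print(len(set(words_of(variant))))  # 9 unique words
```

The vocabulary is the set itself and its size is the count of unique words. With "likes" in the text there are ten unique words; once it becomes "like", both "like" and "juice" repeat and the vocabulary shrinks to nine.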

150
00:09:36,000 --> 00:09:43,000
I hope you are able to understand the basic differences between corpus, documents, vocabulary, and

151
00:09:43,000 --> 00:09:44,000
words.

152
00:09:44,000 --> 00:09:44,000
Right?

153
00:09:44,000 --> 00:09:49,000
So this entire thing is super important when we are learning about tokenization.

154
00:09:49,000 --> 00:09:55,000
Again, if somebody asks you what is the definition of tokenization, you can just say that tokenization

155
00:09:55,000 --> 00:10:01,000
is a process to convert either a paragraph or a sentence into tokens.

156
00:10:01,000 --> 00:10:06,000
If I convert a paragraph into tokens, that basically means I'm converting a paragraph into sentences.

157
00:10:06,000 --> 00:10:09,000
I can also convert a paragraph into words.

158
00:10:09,000 --> 00:10:10,000
Right.

159
00:10:10,000 --> 00:10:14,000
So if I'm converting into words, every single word becomes a token.

160
00:10:14,000 --> 00:10:19,000
And if I'm trying to convert a paragraph into a sentence, every sentence will be a token, right?

161
00:10:19,000 --> 00:10:22,000
So I hope you are able to understand this.

162
00:10:22,000 --> 00:10:27,000
Now, in my next video, I'll try to show you, with the help of the NLTK library, how you can perform

163
00:10:27,000 --> 00:10:29,000
tokenization with the help of Python.

164
00:10:29,000 --> 00:10:30,000
So yes, this was it.

165
00:10:30,000 --> 00:10:32,000
I will see you all in the next video.

166
00:10:32,000 --> 00:10:32,000
Thank you.