1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we're going to continue our discussion with respect to NLP.

3
00:00:03,000 --> 00:00:07,000
In our previous videos we have discussed many topics till now.

4
00:00:07,000 --> 00:00:10,000
You know, like stemming and lemmatization, and we have seen stopwords.

5
00:00:10,000 --> 00:00:12,000
We have also done these with the help of NLTK and Python.

6
00:00:13,000 --> 00:00:16,000
Now let us just go ahead and revise once more.

7
00:00:16,000 --> 00:00:22,000
Okay, I'm not revising each and every topic, but if you really want to solve a specific problem statement.

8
00:00:22,000 --> 00:00:26,000
So let's consider that over here we have a problem statement called as sentiment analysis.

9
00:00:26,000 --> 00:00:32,000
Now in order to solve sentiment analysis, you know, what all things we basically do, and how these topics

10
00:00:32,000 --> 00:00:37,000
we have learned till now, where do they fit in the life cycle of an NLP project?

11
00:00:37,000 --> 00:00:41,000
Now let's say that this is my text and this is my output okay.

12
00:00:41,000 --> 00:00:46,000
And when I write d1, d2, d3, d4, these are basically document one, document two.

13
00:00:46,000 --> 00:00:48,000
Or they can also be called sentence one, sentence two, sentence three.

14
00:00:48,000 --> 00:00:49,000
Sentence four.

15
00:00:49,000 --> 00:00:53,000
If I combine all these particular sentences, it becomes something called as corpus.

16
00:00:53,000 --> 00:00:54,000
Right?

17
00:00:54,000 --> 00:00:59,000
And uh, if I probably say corpus, another meaning can be paragraph also.

18
00:00:59,000 --> 00:00:59,000
Right.

19
00:00:59,000 --> 00:01:03,000
And over here we will be also able to see different different words.

20
00:01:03,000 --> 00:01:07,000
We'll also be able to see the unique words, which together are called the vocabulary. Perfect.
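
The terms above can be sketched in a few lines of plain Python. The three documents here are made up purely for illustration, and splitting on whitespace is an assumption that stands in for real tokenization.

```python
# Hypothetical documents; each string stands for one document (or sentence).
documents = [
    "the food is good",
    "the food is bad",
    "pizza is amazing",
]

# The corpus is all of the documents taken together.
corpus = " ".join(documents)

# The vocabulary is the set of unique words in the corpus.
vocabulary = sorted(set(corpus.split()))

print(len(documents))  # number of documents
print(vocabulary)      # unique words across the corpus
```

The corpus has twelve words in total, but the vocabulary is smaller because repeated words such as "the" and "is" are counted once.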

21
00:01:07,000 --> 00:01:12,000
Now initially whenever you are given a problem statement that is with respect to NLP, you definitely

22
00:01:12,000 --> 00:01:13,000
will be having a data set.

23
00:01:13,000 --> 00:01:17,000
So let me just create a small, uh, block diagram over here.

24
00:01:17,000 --> 00:01:23,000
So let's say this is the problem statement I really want to solve that is called as sentiment analysis

25
00:01:23,000 --> 00:01:24,000
just for an example.

26
00:01:24,000 --> 00:01:29,000
So initially what I will be having I'll be having a specific data set.

27
00:01:29,000 --> 00:01:31,000
Now with respect to this particular data set.

28
00:01:31,000 --> 00:01:35,000
The first step that I will definitely do is something called as text pre-processing.

29
00:01:35,000 --> 00:01:35,000
Right.

30
00:01:35,000 --> 00:01:40,000
So over here you will be able to see something called as text pre-processing.

31
00:01:40,000 --> 00:01:44,000
So let's say this is my text pre-processing okay.

32
00:01:45,000 --> 00:01:51,000
And I would like to give this as part one because here specifically if I talk with respect to this particular

33
00:01:51,000 --> 00:01:55,000
text pre-processing techniques, usually, right, what all things have we learnt till now?

34
00:01:55,000 --> 00:01:59,000
So first topic uh is nothing but something called as tokenization.

35
00:01:59,000 --> 00:02:00,000
Right.

36
00:02:00,000 --> 00:02:06,000
And tokenization is a process where we convert a paragraph into sentences or a sentence into words,

37
00:02:06,000 --> 00:02:07,000
and many more things.

38
00:02:07,000 --> 00:02:12,000
The second simple thing that we basically do is something called as lowering the case, right?

39
00:02:12,000 --> 00:02:19,000
Lowering the case of the words, right? Now when I say case of the words, that basically

40
00:02:19,000 --> 00:02:21,000
means or I can just say that, right?

41
00:02:21,000 --> 00:02:23,000
I'm just lowering all the words itself.

42
00:02:23,000 --> 00:02:23,000
Right.

43
00:02:24,000 --> 00:02:26,000
So here let me just write this one.

44
00:02:27,000 --> 00:02:30,000
I can just write something called as lowercase of the words.

45
00:02:30,000 --> 00:02:31,000
Okay.

46
00:02:32,000 --> 00:02:33,000
Why we require this?

47
00:02:33,000 --> 00:02:36,000
Because, understand why we lowercase the words.

48
00:02:36,000 --> 00:02:40,000
Let's say I have a capital "Though" and a small "though", right?

49
00:02:40,000 --> 00:02:43,000
Both the words are actually the same, right?

50
00:02:43,000 --> 00:02:45,000
"Though" can be present in any other sentence.

51
00:02:45,000 --> 00:02:49,000
And we really need to treat this as a single word itself.

52
00:02:49,000 --> 00:02:50,000
Right.

53
00:02:50,000 --> 00:02:54,000
So it is important that after we perform tokenization, we lowercase all the words.
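
A minimal sketch of these two steps, using only the standard library rather than NLTK. The sample paragraph and the two regular expressions are assumptions chosen for illustration; NLTK's tokenizers handle many more edge cases.

```python
import re

# Hypothetical paragraph for illustration.
paragraph = "The food is good. Though the service was slow, the pizza is amazing!"

# Naive sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())

def tokenize_words(sentence):
    # Naive word tokenization: runs of letters, digits, or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)

tokens = [tokenize_words(s) for s in sentences]

# Lowercasing makes "Though" and "though" the same token.
lowered = [[w.lower() for w in words] for words in tokens]

print(sentences)
print(lowered)
```

After lowercasing, the capitalised word at the start of a sentence and the same word in the middle of a sentence end up as one vocabulary entry.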

54
00:02:55,000 --> 00:02:55,000
Okay.

55
00:02:55,000 --> 00:03:00,000
So that is the reason what we do now with respect to text pre-processing part one, uh, we usually

56
00:03:00,000 --> 00:03:01,000
do these two simple steps.

57
00:03:01,000 --> 00:03:04,000
And along with that we can also apply our third thing.

58
00:03:04,000 --> 00:03:06,000
It is with respect to regular expression.

59
00:03:06,000 --> 00:03:12,000
And definitely I will try to show you, uh, why regular expression can be super important in this particular

60
00:03:12,000 --> 00:03:13,000
case.

61
00:03:13,000 --> 00:03:18,000
Now over here you'll be able to see regular expression can be, you know, removing the special characters.

62
00:03:18,000 --> 00:03:24,000
It can be removing, uh, you know, any characters from that particular word or the entire sentence

63
00:03:24,000 --> 00:03:26,000
based on some regular expression.

64
00:03:26,000 --> 00:03:26,000
Right.

65
00:03:26,000 --> 00:03:28,000
So that is what we basically do.

66
00:03:28,000 --> 00:03:30,000
And again, this is a cleaning process.

67
00:03:30,000 --> 00:03:33,000
When we say text pre-processing we basically say cleaning process.
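
As a rough sketch of this kind of cleaning, the snippet below uses Python's `re` module to drop everything that is not a letter. The function name `clean_text`, the pattern, and the sample string are all assumptions for illustration; which characters count as noise depends on the data set.

```python
import re

def clean_text(text):
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("The food is good!!! #yummy @chef :)")
print(cleaned)
```

Replacing with a space instead of an empty string avoids gluing two words together when the only thing between them was punctuation.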

68
00:03:33,000 --> 00:03:37,000
The next step again, uh, when we further go and probably we have discussed about all these things,

69
00:03:37,000 --> 00:03:42,000
only regular expression is left, which I will probably discuss, uh, as we go ahead.

70
00:03:42,000 --> 00:03:46,000
Okay, now in the second step, what I'm actually going to do, I'm basically going to write something

71
00:03:46,000 --> 00:03:49,000
called as Text pre-processing part two.

72
00:03:51,000 --> 00:03:52,000
Right.

73
00:03:52,000 --> 00:03:53,000
Text pre-processing.

74
00:03:53,000 --> 00:03:54,000
And this is part two.

75
00:03:54,000 --> 00:03:57,000
And what all things we have learnt here till now.

76
00:03:57,000 --> 00:04:01,000
So some of the topics like stemming right.

77
00:04:02,000 --> 00:04:06,000
Lemmatization, and I hope,

78
00:04:06,000 --> 00:04:10,000
what the advantage of stemming and lemmatization is, we have already discussed.

79
00:04:10,000 --> 00:04:10,000
Right.

80
00:04:10,000 --> 00:04:16,000
And third one, uh, we basically have something called as Stopwords, which is perfectly going on right

81
00:04:16,000 --> 00:04:16,000
now.

82
00:04:16,000 --> 00:04:19,000
All these things are absolutely fine, right?
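
To make part two concrete, here is a deliberately crude sketch in plain Python. The stopword set is a tiny hand-picked assumption (NLTK ships a much larger list), and `toy_stem` is a made-up suffix stripper, not the Porter stemmer or a lemmatizer.

```python
# Tiny hand-picked stopword set for illustration only.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "this"}

def remove_stopwords(tokens):
    # Keep only tokens that carry some content.
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(word):
    # Crude suffix stripping, NOT the Porter algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["this", "food", "is", "tasting", "good", "and", "cooked", "well"]
filtered = remove_stopwords(tokens)
stems = [toy_stem(t) for t in filtered]

print(filtered)
print(stems)
```

Notice that the stem of "tasting" comes out as "tast", which is not a real word. That is the usual trade-off with stemming, and it is the reason lemmatization exists.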

83
00:04:19,000 --> 00:04:22,000
We are doing all these things, stemming, lemmatization and all, here.

84
00:04:22,000 --> 00:04:27,000
Here also, we are focusing on cleaning the text, and once our raw text becomes very much clean,

85
00:04:27,000 --> 00:04:33,000
definitely, uh, the next step will basically be to take this particular text and try to convert

86
00:04:33,000 --> 00:04:34,000
this into vectors.

87
00:04:34,000 --> 00:04:38,000
You know, a vector is a numerical representation of a specific text.

88
00:04:38,000 --> 00:04:40,000
It can be a sentence, it can be words.

89
00:04:40,000 --> 00:04:41,000
Right?

90
00:04:41,000 --> 00:04:45,000
We try to represent each and every word with some kind of vectors, you know, which will give some

91
00:04:45,000 --> 00:04:50,000
meaningful representation of that specific word, so that we will be able to apply any kind of machine

92
00:04:50,000 --> 00:04:55,000
learning algorithm like classification, you know, to solve any kind of use cases like sentiment analysis.

93
00:04:55,000 --> 00:05:00,000
So we have actually covered till here. The next step, what will happen is this.

94
00:05:00,000 --> 00:05:03,000
So here I'm just going to write out all the steps.

95
00:05:03,000 --> 00:05:04,000
So this is my step one.

96
00:05:05,000 --> 00:05:06,000
Step two right.

97
00:05:06,000 --> 00:05:07,000
Step three.

98
00:05:07,000 --> 00:05:10,000
Now when I talk about step four now this is super important right.

99
00:05:10,000 --> 00:05:13,000
And uh I'll just give it as another color.

100
00:05:13,000 --> 00:05:15,000
So this will basically be my step four.

101
00:05:15,000 --> 00:05:21,000
After completing this step you know I will take this entire text, you know, and probably we will try

102
00:05:21,000 --> 00:05:24,000
to convert this text into vectors.

103
00:05:25,000 --> 00:05:30,000
So what we are going to do, we're going to convert this text into something called as vectors.

104
00:05:30,000 --> 00:05:34,000
Vectors are nothing but a numerical representation of the specific text.

105
00:05:34,000 --> 00:05:37,000
What are the techniques that are used? That is super important.

106
00:05:37,000 --> 00:05:44,000
And you really need to understand these techniques because as we go ahead, right, later on when

107
00:05:44,000 --> 00:05:49,000
you start learning deep learning, also the concepts like word embedding, Word2Vec and all, right,

108
00:05:49,000 --> 00:05:51,000
all these techniques are basically used.

109
00:05:51,000 --> 00:05:56,000
And even the advanced techniques like Transformers and BERT, they also use this technique of converting

110
00:05:56,000 --> 00:05:57,000
the text into vectors.

111
00:05:57,000 --> 00:05:59,000
And they have some amazing way.

112
00:05:59,000 --> 00:06:01,000
One way is word embeddings, right?

113
00:06:01,000 --> 00:06:08,000
That is, they have an amazing way to convert this text into vectors which will provide meaningful semantic

114
00:06:08,000 --> 00:06:10,000
information to the text, right?

115
00:06:10,000 --> 00:06:16,000
So just understand for right now, if you're not understanding what is text to vectors, it is just

116
00:06:16,000 --> 00:06:18,000
like I have some kind of text.

117
00:06:18,000 --> 00:06:20,000
Let's say I have "this food is good".

118
00:06:20,000 --> 00:06:26,000
This entire text will be represented by a simple numerical format okay.

119
00:06:26,000 --> 00:06:28,000
Numerical data in short.

120
00:06:28,000 --> 00:06:32,000
Now over here we are going to learn various different techniques.

121
00:06:32,000 --> 00:06:38,000
Now first thing is that we are going to see what is called as one-hot encoding, okay, one-hot encoding.

122
00:06:38,000 --> 00:06:41,000
And then probably if you know machine learning you will definitely be able to know it.

123
00:06:41,000 --> 00:06:46,000
And again, one-hot encoding is not a very efficient technique with respect to this kind of data.

124
00:06:47,000 --> 00:06:53,000
And, uh, right now one-hot encoding is rarely used specifically for text data.

125
00:06:53,000 --> 00:06:53,000
Yes.

126
00:06:53,000 --> 00:06:59,000
For machine learning problem statements, there, to convert categorical features, uh, into numerical format,

127
00:06:59,000 --> 00:07:01,000
we basically use one-hot encoding, but not in text.

128
00:07:01,000 --> 00:07:04,000
But we'll try to understand the theoretical part of this.
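
As a preview of the theory, a minimal one-hot encoding sketch in plain Python could look like the following. The sentence, the sorted vocabulary order, and the helper name `one_hot` are assumptions made for illustration.

```python
sentence = "this food is good"
tokens = sentence.split()

# Vocabulary in sorted order: ['food', 'good', 'is', 'this']
vocabulary = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector as long as the vocabulary, with a single 1 at the word's index.
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

encoded = [one_hot(w) for w in tokens]
for word, vec in zip(tokens, encoded):
    print(word, vec)
```

Every word needs a vector as long as the whole vocabulary, and almost all of it is zeros. With a vocabulary of tens of thousands of words this gets large and sparse very quickly, which is one reason it is considered inefficient for text.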

129
00:07:04,000 --> 00:07:08,000
And after this we will be going with something called as bag of words.

130
00:07:08,000 --> 00:07:09,000
Okay.

131
00:07:09,000 --> 00:07:13,000
Super important technique, which we also call BoW, right.

132
00:07:13,000 --> 00:07:16,000
Very very super important technique with respect to this.
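
A minimal bag-of-words sketch, assuming three made-up documents and whitespace tokenization. Each document becomes one vector of word counts over a shared vocabulary.

```python
from collections import Counter

# Hypothetical documents for illustration.
documents = ["the food is good", "the food is bad", "pizza is amazing"]

# Shared vocabulary across all documents, in sorted order.
vocabulary = sorted({w for d in documents for w in d.split()})

def bag_of_words(doc):
    # Count how often each vocabulary word occurs in this document.
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]

vectors = [bag_of_words(d) for d in documents]

print(vocabulary)
for v in vectors:
    print(v)
```

Unlike one-hot encoding, there is one fixed-length vector per document instead of one per word, but word order is lost.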

133
00:07:16,000 --> 00:07:21,000
Uh, third technique uh, that we are specifically going to see, uh, over here is something called

134
00:07:21,000 --> 00:07:23,000
as TF-IDF.

135
00:07:24,000 --> 00:07:30,000
TF-IDF, okay, so TF-IDF is also a very good technique, uh, to convert the text into vectors.
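
A rough TF-IDF sketch in plain Python. It assumes the simplest textbook form, term frequency = count / document length and inverse document frequency = log(N / document frequency); libraries typically use smoothed variants, so their numbers will differ.

```python
import math

# Hypothetical documents for illustration.
documents = ["the food is good", "the food is bad", "pizza is amazing"]
tokenized = [d.split() for d in documents]
vocabulary = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

def idf(word):
    # Number of documents that contain the word at least once.
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(n_docs / df)

def tfidf(doc):
    # Term frequency times inverse document frequency for each vocabulary word.
    return [doc.count(w) / len(doc) * idf(w) for w in vocabulary]

vectors = [tfidf(doc) for doc in tokenized]

for word, weight in zip(vocabulary, vectors[0]):
    print(f"{word:8s} {weight:.3f}")
```

The word "is" appears in every document, so its weight becomes zero, while "good" appears in only one document and gets the highest weight in the first vector.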

136
00:07:30,000 --> 00:07:35,000
Uh, and as we go ahead and learn different, different things, right, there will be some disadvantage

137
00:07:35,000 --> 00:07:36,000
in that specific technique.

138
00:07:36,000 --> 00:07:38,000
So we are using the next technique.

139
00:07:38,000 --> 00:07:42,000
The fourth one that we are probably going to see is something called as Word2Vec.

140
00:07:42,000 --> 00:07:45,000
Now this Word2Vec is also an amazing technique.

141
00:07:45,000 --> 00:07:48,000
Uh, there's a concept in deep learning which is called as word embedding.

142
00:07:48,000 --> 00:07:49,000
Word2Vec is basically used there.

143
00:07:49,000 --> 00:07:51,000
You can also train your own Word2Vec.

144
00:07:51,000 --> 00:07:56,000
You can also train a custom Word2Vec, or you can also use the pretrained weights to train on newer data.

145
00:07:56,000 --> 00:08:01,000
I know if it is not making sense, don't worry, I will explain each and everything as we go ahead.

146
00:08:01,000 --> 00:08:05,000
Now the fifth part, uh, that we are going to basically see is something called as average Word2Vec.

147
00:08:05,000 --> 00:08:07,000
Again, a super important topic altogether.
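
The idea behind average Word2Vec can be sketched without any library: look up a vector for each word and average them into one vector for the sentence. The three-dimensional vectors below are invented numbers, not real Word2Vec output, and ignoring unknown words is an assumption.

```python
# Hypothetical 3-dimensional word vectors; real Word2Vec vectors are learned
# from data and usually have a few hundred dimensions.
embeddings = {
    "food": [0.2, 0.8, 0.1],
    "is": [0.0, 0.1, 0.0],
    "good": [0.9, 0.3, 0.5],
}
DIM = 3

def average_vector(tokens):
    # Keep only the words we have a vector for.
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * DIM
    # Average each dimension across the known word vectors.
    return [sum(dim) / len(known) for dim in zip(*known)]

sentence_vector = average_vector(["this", "food", "is", "good"])
print(sentence_vector)
```

Whatever the sentence length, the result has the same number of dimensions as a single word vector, which is what makes it usable as input to a machine learning model.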

148
00:08:07,000 --> 00:08:10,000
And I will also try to see all these things.

149
00:08:10,000 --> 00:08:14,000
Now why I'm writing all these things, because these are all the techniques that we are going to see,

150
00:08:14,000 --> 00:08:15,000
uh, from the next video itself.

151
00:08:15,000 --> 00:08:19,000
First we'll go with one-hot encoding; we'll understand the advantages and disadvantages.

152
00:08:19,000 --> 00:08:21,000
Then we'll go and see bag of words.

153
00:08:22,000 --> 00:08:23,000
We'll see how bag of words works.

154
00:08:23,000 --> 00:08:28,000
And then we'll see the practical implementation like how we can do it with the help of NLTK.

155
00:08:28,000 --> 00:08:34,000
Right. Now after you do this, after we get the vector representation of a text, the numerical information

156
00:08:34,000 --> 00:08:39,000
about that specific text or numerical representation of the text, we take that numerical representation.

157
00:08:39,000 --> 00:08:41,000
Let's say this entire sentence.

158
00:08:41,000 --> 00:08:42,000
Right.

159
00:08:42,000 --> 00:08:44,000
I'll give you an example how it will get converted.

160
00:08:44,000 --> 00:08:49,000
Let's say, um, I'm just writing it as 1 0 1 1.

161
00:08:49,000 --> 00:08:49,000
Right.

162
00:08:49,000 --> 00:08:50,000
Something.

163
00:08:50,000 --> 00:08:50,000
Okay.

164
00:08:50,000 --> 00:08:52,000
Some some numerical representation okay.

165
00:08:52,000 --> 00:08:55,000
So this will be my numerical representation for this particular document.

166
00:08:55,000 --> 00:08:56,000
That is D1.

167
00:08:56,000 --> 00:08:56,000
D1.

168
00:08:56,000 --> 00:08:57,000
Uh, how am I getting this?

169
00:08:57,000 --> 00:09:00,000
Don't worry, I've just put some numbers as an example.

170
00:09:00,000 --> 00:09:04,000
I will explain it to you as we go ahead with the videos. With respect to this numerical representation,

171
00:09:04,000 --> 00:09:05,000
This will be my output.

172
00:09:05,000 --> 00:09:11,000
So what we can do is that we can take these vectors and we can send them to the next stage, wherein I will

173
00:09:11,000 --> 00:09:13,000
just write it out as fifth.

174
00:09:13,000 --> 00:09:19,000
And this will basically be my model training, a model getting trained.

175
00:09:20,000 --> 00:09:20,000
Okay.

176
00:09:20,000 --> 00:09:24,000
And over here you'll be able to see that once I convert my text into vectors,

177
00:09:24,000 --> 00:09:29,000
Here we will train with machine learning or DL algorithms.

178
00:09:30,000 --> 00:09:30,000
Okay.

179
00:09:31,000 --> 00:09:35,000
But right now since we are discussing NLP with machine learning I'm just going to write machine learning

180
00:09:35,000 --> 00:09:36,000
algorithms.

181
00:09:36,000 --> 00:09:36,000
Okay.

182
00:09:36,000 --> 00:09:41,000
We are going to train this uh with ML algorithms okay.

183
00:09:42,000 --> 00:09:46,000
And finally we will be able to do the prediction and find out the accuracy.

184
00:09:46,000 --> 00:09:49,000
So this is what step by step we are basically going to do it.
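
The whole flow can be sketched end to end in plain Python. Everything here is an assumption for illustration: the labelled sentences are made up, bag of words is used as the vectorizer, and a nearest-centroid rule stands in for a real machine learning algorithm.

```python
import re
from collections import Counter

# Hypothetical labelled data: 1 = positive, 0 = negative.
train_set = [("The food is good", 1), ("Pizza is amazing!", 1),
             ("The food is bad", 0), ("Service was terrible...", 0)]
test_set = [("good pizza", 1), ("terrible food, bad service", 0)]

def preprocess(text):
    # Pre-processing in miniature: regex cleaning, lowercasing, tokenizing.
    return re.sub(r"[^a-zA-Z\s]", " ", text).lower().split()

vocabulary = sorted({w for text, _ in train_set for w in preprocess(text)})

def vectorize(text):
    # Text to vectors: bag-of-words counts over the training vocabulary.
    counts = Counter(preprocess(text))
    return [counts[w] for w in vocabulary]

def centroid(vectors):
    # Column-wise mean of a list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# "Training" here is just averaging the vectors of each class.
centroids = {
    label: centroid([vectorize(t) for t, y in train_set if y == label])
    for label in (0, 1)
}

def predict(text):
    v = vectorize(text)
    def dist(c):
        # Squared Euclidean distance to a class centroid.
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

predictions = [predict(t) for t, _ in test_set]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

Two test sentences say nothing about real performance; the point is only the order of the stages: clean the text, convert it to vectors, train, predict, then measure accuracy.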

185
00:09:49,000 --> 00:09:54,000
But you really need to understand all these techniques, right.

186
00:09:54,000 --> 00:09:57,000
In machine learning, at least one-hot encoding and all.

187
00:09:57,000 --> 00:10:01,000
And uh, later on we are also going to see a library which is called as Gensim library, which will

188
00:10:01,000 --> 00:10:05,000
actually help you to do, uh, Word2Vec, because it is a huge model.

189
00:10:05,000 --> 00:10:05,000
Okay.

190
00:10:05,000 --> 00:10:06,000
Word2Vec.

191
00:10:06,000 --> 00:10:10,000
Uh, it already has many representations of the words.

192
00:10:10,000 --> 00:10:13,000
Like it can provide a numerical representation of various words.

193
00:10:13,000 --> 00:10:18,000
Uh, and again internally, Word2Vec and average Word2Vec also use deep learning techniques.

194
00:10:18,000 --> 00:10:20,000
Uh, again we will discuss about it.

195
00:10:20,000 --> 00:10:21,000
The overall brief idea.

196
00:10:21,000 --> 00:10:22,000
I'll try to give it to you.

197
00:10:22,000 --> 00:10:27,000
And then what we'll do is that we'll solve some, uh, use cases wherein we'll take this particular

198
00:10:27,000 --> 00:10:29,000
data and again, we'll follow all these steps.

199
00:10:29,000 --> 00:10:31,000
And finally we'll train our machine learning algorithm.

200
00:10:31,000 --> 00:10:31,000
Right.

201
00:10:31,000 --> 00:10:34,000
So this is what we are going to do as we go ahead.

202
00:10:34,000 --> 00:10:38,000
Uh, so I hope, uh, you're able to understand what is the flow and how we are going to solve this

203
00:10:38,000 --> 00:10:39,000
particular problem.

204
00:10:39,000 --> 00:10:41,000
The example of the data set is over here.

205
00:10:41,000 --> 00:10:45,000
We'll try to convert this into a format and then train with the machine learning algorithms.

206
00:10:45,000 --> 00:10:46,000
So yes.

207
00:10:46,000 --> 00:10:49,000
Uh, I will see you all in the next video.

208
00:10:49,000 --> 00:10:51,000
I hope you are able to understand things.

209
00:10:51,000 --> 00:10:53,000
I hope you are able to understand each and every thing.

210
00:10:53,000 --> 00:10:54,000
What I'm actually writing.

211
00:10:54,000 --> 00:10:55,000
Uh, yes.

212
00:10:55,000 --> 00:10:56,000
This was it.

213
00:10:56,000 --> 00:10:57,000
Uh, I'll see you all in the next video.

214
00:10:57,000 --> 00:10:58,000
Thank you.