1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
We are going to continue our discussion on natural language processing.

3
00:00:03,000 --> 00:00:06,000
In our previous video, we have already seen the intuition behind n-grams.

4
00:00:06,000 --> 00:00:09,000
Now we will try to implement it with the help of NLTK.

5
00:00:09,000 --> 00:00:13,000
We are going to take the same problem statement that we used for spam classification.

6
00:00:13,000 --> 00:00:19,000
So this is how we had created the bag of words, right, where I took max_features as 100.

7
00:00:19,000 --> 00:00:24,000
That is, I have taken the top 100 most frequently occurring words.

8
00:00:24,000 --> 00:00:27,000
Now, suppose you really want to find out what the top 100 words are.

9
00:00:27,000 --> 00:00:31,000
So after you do the fit_transform, right?

10
00:00:31,000 --> 00:00:36,000
You just need to take cv and write cv.vocabulary_.

11
00:00:36,000 --> 00:00:38,000
And this part I had actually missed while explaining bag of words.

12
00:00:38,000 --> 00:00:40,000
So these are my top 100 words.

13
00:00:40,000 --> 00:00:41,000
Go, great,

14
00:00:41,000 --> 00:00:50,000
go, out, what, okay, free, win, text, txt, say, already. So here you can see these are all indexes of

15
00:00:50,000 --> 00:00:51,000
the columns right.

16
00:00:51,000 --> 00:00:55,000
So if I ask which is the first column that you'll be able to find, it is nothing but

17
00:00:55,000 --> 00:00:56,000
ask, right.

18
00:00:56,000 --> 00:00:57,000
Which is the second column.

19
00:00:57,000 --> 00:00:59,000
It is something like babe, okay.

20
00:00:59,000 --> 00:01:00,000
The third column, and so on like that.

21
00:01:00,000 --> 00:01:05,000
So in total you will see that the indexes run from 0 to 99.

22
00:01:05,000 --> 00:01:05,000
Right.

23
00:01:05,000 --> 00:01:06,000
Something like that.

24
00:01:06,000 --> 00:01:08,000
So these are my top 100 words okay.

25
00:01:08,000 --> 00:01:10,000
So this is with respect to cv.vocabulary_.

26
00:01:10,000 --> 00:01:14,000
If you just execute this, you'll be able to see

27
00:01:14,000 --> 00:01:14,000
it.
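As a rough sketch of this lookup, assuming a tiny made-up corpus in place of the spam messages and max_features lowered to 5 so the output stays readable, it could look something like this:
```python
from sklearn.feature_extraction.text import CountVectorizer
# Hypothetical mini corpus; the video uses the preprocessed spam messages instead.
corpus = ["free entry win prize call", "please call customer service", "good morning call later", "free call claim prize"]
# Keep only the 5 most frequent terms (the video uses 100).
cv = CountVectorizer(max_features=5, binary=True)
X = cv.fit_transform(corpus).toarray()
# vocabulary_ maps each kept term to its column index, starting at 0.
vocab = cv.vocabulary_
# Listing the terms by index shows the column order, which is alphabetical.
columns = sorted(vocab, key=vocab.get)
print(vocab)
print(columns)
```
Note that the attribute is spelled vocabulary_ with a trailing underscore, and it is only available after fit or fit_transform has run.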

28
00:01:14,000 --> 00:01:17,000
Now let's go back to this sklearn feature extraction.

29
00:01:17,000 --> 00:01:18,000
Now this is super important.

30
00:01:18,000 --> 00:01:22,000
Now I'm talking about this ngram_range parameter, okay.

31
00:01:22,000 --> 00:01:23,000
Now see this.

32
00:01:23,000 --> 00:01:26,000
See what happens once I copy and paste the same thing.

33
00:01:26,000 --> 00:01:31,000
So let's say I'm going to copy this and I'm going to paste it over here okay.

34
00:01:32,000 --> 00:01:33,000
I'm just going to paste it over here.

35
00:01:34,000 --> 00:01:36,000
Now this is super important okay.

36
00:01:38,000 --> 00:01:41,000
So I'm just going to execute this and I'm going to paste it over here.

37
00:01:41,000 --> 00:01:44,000
Here I'm just going to type ngram, okay.

38
00:01:44,000 --> 00:01:48,000
Now you know that there is a parameter called ngram_range.

39
00:01:48,000 --> 00:01:48,000
Okay.

40
00:01:48,000 --> 00:01:52,000
Now once I set it to (1, 1), nothing is going to change.

41
00:01:52,000 --> 00:01:53,000
Now see this okay.

42
00:01:53,000 --> 00:01:54,000
Nothing is going to change.

43
00:01:54,000 --> 00:01:59,000
And I'm just going to do the same fit_transform on the corpus, okay.

44
00:01:59,000 --> 00:02:01,000
And I'm going to save it in X, okay.

45
00:02:01,000 --> 00:02:04,000
We need to call fit_transform.

46
00:02:04,000 --> 00:02:12,000
So once I execute this, if I go and write cv.vocabulary_ and show you, it will be

47
00:02:12,000 --> 00:02:12,000
the same words.

48
00:02:12,000 --> 00:02:13,000
See.

49
00:02:13,000 --> 00:02:16,000
So these are the hundred most frequently occurring words.

50
00:02:16,000 --> 00:02:20,000
And here I have given ngram_range as (1, 1).

51
00:02:20,000 --> 00:02:23,000
That basically means all my features will be single words.

52
00:02:23,000 --> 00:02:27,000
There will not be any combination of two or three words, like a bigram or trigram.

53
00:02:27,000 --> 00:02:28,000
Now see this?

54
00:02:28,000 --> 00:02:31,000
Okay, what I will do is change this to (1, 2).

55
00:02:31,000 --> 00:02:37,000
Now once I write (1, 2), it is going to be the combination of unigram and bigram.

56
00:02:37,000 --> 00:02:37,000
Okay.

57
00:02:37,000 --> 00:02:39,000
Unigram and bigram.
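A small sketch of the difference between (1, 1) and (1, 2), assuming a made-up four-message corpus and no max_features limit so that every term stays visible:
```python
from sklearn.feature_extraction.text import CountVectorizer
# Hypothetical mini corpus, not the spam dataset from the video.
corpus = ["please call customer service", "please call me later", "good morning", "good night"]
# (1, 1): only single words become features.
uni = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
# (1, 2): single words and pairs of adjacent words become features.
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
unigram_terms = sorted(uni.vocabulary_)
mixed_terms = sorted(uni_bi.vocabulary_)
print(unigram_terms)
print(mixed_terms)
```
With (1, 2) the single words are still there, and entries such as "please call" show up alongside them.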

58
00:02:39,000 --> 00:02:40,000
Now see this okay.

59
00:02:40,000 --> 00:02:42,000
And I'm also going to increase the number of max features.

60
00:02:42,000 --> 00:02:44,000
Let's say I'm going to make it as 200 okay.

61
00:02:44,000 --> 00:02:47,000
Now let me just go ahead and execute.

62
00:02:47,000 --> 00:02:48,000
And now let me see the vocabulary.

63
00:02:48,000 --> 00:02:53,000
Now in the vocabulary over here you can see all these single words are there.

64
00:02:53,000 --> 00:02:57,000
Now as you go further, right, there will be two-word entries.

65
00:02:57,000 --> 00:02:57,000
See?

66
00:02:57,000 --> 00:02:58,000
Over here.

67
00:02:58,000 --> 00:02:59,000
Please call.

68
00:02:59,000 --> 00:03:00,000
Okay.

69
00:03:00,000 --> 00:03:02,000
So here you'll be able to see.

70
00:03:02,000 --> 00:03:07,000
Please call, and you won't see the other pairs because they are not among the most frequent ones.

71
00:03:07,000 --> 00:03:10,000
Let's say I want to increase this number.

72
00:03:10,000 --> 00:03:13,000
Let's say I'm going to set it to 500.

73
00:03:13,000 --> 00:03:15,000
Now the number of words will keep on increasing.

74
00:03:16,000 --> 00:03:22,000
So yeah, here you can see that let know is at column index 228.

75
00:03:22,000 --> 00:03:22,000
Right.

76
00:03:22,000 --> 00:03:26,000
Please call is still there too, at index 325.

77
00:03:26,000 --> 00:03:31,000
The terms are picked based on how frequently they occur, while the index number follows the alphabetical order of the terms.

78
00:03:31,000 --> 00:03:32,000
Right.

79
00:03:32,000 --> 00:03:33,000
So, more word pairs are there.

80
00:03:33,000 --> 00:03:35,000
There is one called customer service.

81
00:03:35,000 --> 00:03:36,000
Right.

82
00:03:36,000 --> 00:03:42,000
And if I go and explore more: price guarantee, guarantee call, try contact.

83
00:03:42,000 --> 00:03:44,000
See all these words are definitely there.

84
00:03:44,000 --> 00:03:46,000
There are obviously some spelling mistakes, but that's fine.

85
00:03:46,000 --> 00:03:51,000
So over here lt decimal, decimal gt, something like that is there. Now you see,

86
00:03:51,000 --> 00:03:52,000
Good morn.

87
00:03:52,000 --> 00:03:53,000
Right.

88
00:03:53,000 --> 00:03:54,000
So morning.

89
00:03:54,000 --> 00:03:54,000
Something like that.

90
00:03:54,000 --> 00:03:55,000
Good night.

91
00:03:55,000 --> 00:03:56,000
Right.

92
00:03:56,000 --> 00:03:58,000
So this is basically the combination of unigrams and bigrams.

93
00:03:58,000 --> 00:04:05,000
Now what I can also do is that if I really want to see the most frequently occurring, just the bigram kind

94
00:04:05,000 --> 00:04:06,000
of words, I can just write (2, 2).

95
00:04:06,000 --> 00:04:09,000
That basically means I'm ignoring unigrams.

96
00:04:09,000 --> 00:04:16,000
So now here you will see free entry, claim call, call claim, free call, chance win; every

97
00:04:16,000 --> 00:04:21,000
combination of words is there, along with the column index.

98
00:04:21,000 --> 00:04:27,000
Now, I hope you have an idea of the basic difference between unigram, bigram, and trigram.

99
00:04:27,000 --> 00:04:27,000
Right?

100
00:04:27,000 --> 00:04:33,000
So suppose you also want trigrams on top of this; I can write (1, 3).

101
00:04:33,000 --> 00:04:34,000
Now let me do one thing.

102
00:04:34,000 --> 00:04:36,000
I'll first take just (3, 3).

103
00:04:36,000 --> 00:04:39,000
Let's see, and let's execute this.

104
00:04:39,000 --> 00:04:41,000
Now here you can see call claim code.

105
00:04:41,000 --> 00:04:42,000
So this is the sixth index.

106
00:04:42,000 --> 00:04:48,000
That basically means this kind of phrase occurs often in the messages, and it is in the column

107
00:04:48,000 --> 00:04:48,000
at index six.

108
00:04:48,000 --> 00:04:49,000
Right.

109
00:04:49,000 --> 00:04:51,000
Like lt gt, sorry.

110
00:04:51,000 --> 00:04:54,000
Call letter, please call custom, call custom service.

111
00:04:54,000 --> 00:04:55,000
Right.

112
00:04:55,000 --> 00:04:55,000
All these things are there.

113
00:04:56,000 --> 00:04:58,000
Now you can see this is the combination of three words.

114
00:04:58,000 --> 00:05:03,000
Right. Now if I want the combination of two and three, I can use (2, 3): bigram

115
00:05:03,000 --> 00:05:03,000
and trigram.
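To compare (2, 2), (3, 3) and (2, 3) side by side, a sketch along these lines may help; the corpus and the helper name ngram_terms are made up for illustration:
```python
from sklearn.feature_extraction.text import CountVectorizer
# Hypothetical mini corpus, not the spam dataset from the video.
corpus = ["please call customer service", "call customer service now", "let me know"]
def ngram_terms(ngram_range):
    """Return the sorted vocabulary for a given (min_n, max_n) range."""
    cv = CountVectorizer(ngram_range=ngram_range)
    cv.fit(corpus)
    return sorted(cv.vocabulary_)
bigrams = ngram_terms((2, 2))      # only two-word features
trigrams = ngram_terms((3, 3))     # only three-word features
bi_and_tri = ngram_terms((2, 3))   # two-word and three-word features together
print(bigrams)
print(trigrams)
print(bi_and_tri)
```
Here (2, 3) gives exactly the bigrams plus the trigrams, with no single words at all.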

116
00:05:03,000 --> 00:05:06,000
So here you can see call customer service.

117
00:05:06,000 --> 00:05:07,000
Then you can see let know.

118
00:05:07,000 --> 00:05:09,000
Then you can basically see game call.

119
00:05:09,000 --> 00:05:14,000
So you can make various combinations of unigrams, bigrams, and trigrams.

120
00:05:14,000 --> 00:05:18,000
Usually, first of all, start with (1, 1).

121
00:05:18,000 --> 00:05:20,000
Okay, I'll tell you a simple approach.

122
00:05:20,000 --> 00:05:23,000
Just try it with (1, 1).

123
00:05:23,000 --> 00:05:26,000
With (1, 1) you will get a vocabulary of the most frequent single words, right?

124
00:05:26,000 --> 00:05:28,000
Let's say with this you are not getting good accuracy.

125
00:05:28,000 --> 00:05:32,000
Then you can increase it to (1, 2) and execute it again.

126
00:05:32,000 --> 00:05:32,000
Right.

127
00:05:33,000 --> 00:05:34,000
Suppose the accuracy is still not increasing.

128
00:05:34,000 --> 00:05:36,000
Then you can try (1, 3).

129
00:05:36,000 --> 00:05:41,000
Now if even (1, 3) is not working, you can also play with max_features.

130
00:05:41,000 --> 00:05:42,000
And these are called hyperparameters.

131
00:05:42,000 --> 00:05:46,000
You can play with different max_features values. If that combination is also not working,

132
00:05:46,000 --> 00:05:51,000
Use the combination of bigram and trigram, and this may well give you a better result,

133
00:05:51,000 --> 00:05:51,000
right?
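This trial-and-error loop could be sketched roughly as below. The toy messages, the labels, the MultinomialNB classifier and the two-fold cross-validation are all assumptions made for illustration, so the scores themselves mean nothing:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Hypothetical toy data (1 = spam, 0 = ham); far too small for real conclusions.
messages = ["free entry win prize call now", "claim free prize call now", "win cash prize call customer service", "free call claim code now", "good morning see you later", "let me know when you reach", "good night call me tomorrow", "please let me know the time"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
results = {}
# Try each ngram_range together with each max_features value.
for ngram_range in [(1, 1), (1, 2), (1, 3), (2, 3)]:
    for max_features in [10, 50]:
        model = make_pipeline(CountVectorizer(ngram_range=ngram_range, max_features=max_features, binary=True), MultinomialNB())
        # Mean accuracy over 2 stratified folds for this combination.
        scores = cross_val_score(model, messages, labels, cv=2)
        results[(ngram_range, max_features)] = scores.mean()
best = max(results, key=results.get)
print(best, results[best])
```
Putting the vectorizer inside the pipeline means the vocabulary is rebuilt on each training fold, so the held-out messages do not leak into it.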

134
00:05:52,000 --> 00:05:57,000
So my suggestion would be to check each and every parameter of this and how we

135
00:05:57,000 --> 00:05:58,000
can basically use it.

136
00:05:58,000 --> 00:06:02,000
ngram_range plays a very important role.

137
00:06:02,000 --> 00:06:04,000
And this was with respect to the practical.

138
00:06:04,000 --> 00:06:08,000
I have already shown you how you can play with this and how you can actually use it.

139
00:06:08,000 --> 00:06:08,000
Right.

140
00:06:08,000 --> 00:06:13,000
So, at the end of the day, after you do this, you will see that once I make this

141
00:06:13,000 --> 00:06:14,000
combination.

142
00:06:14,000 --> 00:06:15,000
Right.

143
00:06:15,000 --> 00:06:18,000
And this will basically be my feature matrix X.

144
00:06:18,000 --> 00:06:20,000
So here, if I probably go ahead and execute.

145
00:06:20,000 --> 00:06:22,000
So this is what my X looks like.

146
00:06:22,000 --> 00:06:27,000
You will have only ones and zeros, since binary is True, and all the remaining things are also

147
00:06:27,000 --> 00:06:27,000
there.
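To see what binary=True does to X, here is a minimal sketch on a made-up two-message corpus, compared against the default counts:
```python
from sklearn.feature_extraction.text import CountVectorizer
# Hypothetical corpus where "call" repeats three times in the first message.
corpus = ["call call call now", "good night"]
# binary=True records only presence (1) or absence (0) of a term.
X_binary = CountVectorizer(binary=True).fit_transform(corpus).toarray()
# binary=False (the default) records how many times each term occurs.
X_counts = CountVectorizer(binary=False).fit_transform(corpus).toarray()
print(X_binary)
print(X_counts)
```
The columns are call, good, night, now, so the first row is 1, 0, 0, 1 in the binary case and 3, 0, 0, 1 with counts.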

148
00:06:27,000 --> 00:06:31,000
So I hope you got an idea of how to implement n-grams.

149
00:06:31,000 --> 00:06:33,000
Uh, yes, this was it.

150
00:06:33,000 --> 00:06:38,000
In the next video we'll look at the TF-IDF intuition, and we'll also try to

151
00:06:38,000 --> 00:06:40,000
solve it with the help of NLTK.

152
00:06:40,000 --> 00:06:41,000
So yes, this was it.

153
00:06:41,000 --> 00:06:42,000
I will see you all in the next video.

154
00:06:42,000 --> 00:06:43,000
Thank you.