1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
We are going to continue our discussion with respect to natural language processing.

3
00:00:03,000 --> 00:00:08,000
In this video we are going to discuss a very important topic called N-grams.

4
00:00:08,000 --> 00:00:13,000
Now, why are N-grams used? They can also be used with Bag of Words and TF-IDF.

5
00:00:13,000 --> 00:00:15,000
But till now we have just discussed about Bag of Words.

6
00:00:15,000 --> 00:00:18,000
Let me take a very, very good example.

7
00:00:18,000 --> 00:00:21,000
Suppose I have two sentences and this particular sentence is something like this.

8
00:00:21,000 --> 00:00:24,000
The food is good, okay.

9
00:00:24,000 --> 00:00:28,000
Food is good and the food is not good.

10
00:00:29,000 --> 00:00:29,000
Okay.

11
00:00:30,000 --> 00:00:35,000
Now here you will be able to see that both these words are completely different.

12
00:00:35,000 --> 00:00:38,000
That is completely exactly opposite of each other, right?

13
00:00:38,000 --> 00:00:39,000
Both the sentences.

14
00:00:39,000 --> 00:00:43,000
This is my sentence one and this is my sentence two.

15
00:00:43,000 --> 00:00:46,000
So I cannot say that these two sentences are the same.

16
00:00:46,000 --> 00:00:46,000
Okay.

17
00:00:46,000 --> 00:00:51,000
Now what I will do is that I and obviously over here.

18
00:00:51,000 --> 00:00:51,000
Right.

19
00:00:51,000 --> 00:00:53,000
Let's consider that these are my vocabulary.

20
00:00:53,000 --> 00:00:59,000
The food is good okay.

21
00:00:59,000 --> 00:01:00,000
Food is good.

22
00:01:00,000 --> 00:01:06,000
Or let me just add this vocabulary food is not good.

23
00:01:06,000 --> 00:01:09,000
Okay, these are my vocabulary with respect to feature one.

24
00:01:09,000 --> 00:01:11,000
Feature two, feature three, feature four.

25
00:01:11,000 --> 00:01:13,000
Because these are the possible vocabulary that I have.

26
00:01:13,000 --> 00:01:16,000
I'm not considering "the" because it can be a stop word.

27
00:01:16,000 --> 00:01:19,000
I can also remove "is". Let's remove these.

28
00:01:19,000 --> 00:01:22,000
Okay, so what are the possible vocabulary that is basically present?

29
00:01:22,000 --> 00:01:24,000
I'll basically be having "food".

30
00:01:24,000 --> 00:01:30,000
I'll be having this "not" keyword, because the "not" keyword will distinguish, right, whether the sentence is positive

31
00:01:30,000 --> 00:01:30,000
or negative.

32
00:01:30,000 --> 00:01:32,000
So this is super important.

33
00:01:32,000 --> 00:01:36,000
So I may have this all possible options with respect to vocabulary.

34
00:01:36,000 --> 00:01:45,000
Now for sentence one I may get a vector of something like this: 1 0 1, right.

35
00:01:45,000 --> 00:01:46,000
Because not is not present over here.

36
00:01:46,000 --> 00:01:49,000
So obviously it is going to be zero. For sentence

37
00:01:49,000 --> 00:01:53,000
two I will be getting something like 1 1 1, right.

38
00:01:53,000 --> 00:01:55,000
Understand why "the" and "is" are not present.

39
00:01:55,000 --> 00:01:59,000
Because we have removed this with the help of stop words Okay.

40
00:01:59,000 --> 00:02:01,000
We have removed this with the help of Stopwords.

41
00:02:01,000 --> 00:02:02,000
Not a problem.

42
00:02:02,000 --> 00:02:04,000
You may be thinking, Krish, "not" can also be there in the stop words.

43
00:02:04,000 --> 00:02:05,000
It can be.

44
00:02:05,000 --> 00:02:06,000
But we are not going to add it.

45
00:02:06,000 --> 00:02:09,000
Because this word is basically distinguishing the entire sentence.

46
00:02:09,000 --> 00:02:14,000
Now you need to understand that with respect to these two vectors, you will be seeing that these two

47
00:02:14,000 --> 00:02:15,000
vectors are almost similar.

48
00:02:15,000 --> 00:02:18,000
Only one change is there like zero and one.
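A minimal sketch of the two vectors described above, assuming plain whitespace tokenization and the vocabulary order food, not, good. The names STOP_WORDS, VOCABULARY, tokenize and bow_vector are made up for illustration and do not come from the video.
```python
# Unigram Bag of Words over the example vocabulary.
# "the" and "is" are treated as stop words; "not" is kept on purpose.
STOP_WORDS = {"the", "is"}
VOCABULARY = ["food", "not", "good"]
def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]
def bow_vector(sentence):
    """Binary vector: 1 if the vocabulary word occurs in the sentence."""
    tokens = set(tokenize(sentence))
    return [1 if word in tokens else 0 for word in VOCABULARY]
s1 = bow_vector("The food is good")      # [1, 0, 1]
s2 = bow_vector("The food is not good")  # [1, 1, 1]
# Count the positions where the two vectors disagree.
differing = sum(a != b for a, b in zip(s1, s2))
print(s1, s2, differing)
```
Only the "not" position differs, which is why the two opposite sentences look almost the same to a model.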

49
00:02:18,000 --> 00:02:23,000
In real world scenario, I will say that these two sentences are completely opposite of each other.

50
00:02:23,000 --> 00:02:28,000
But when I'm trying to convert this into vectors, it is just almost similar with only one change,

51
00:02:28,000 --> 00:02:28,000
right?

52
00:02:28,000 --> 00:02:31,000
And this is a bigger issue, right?

53
00:02:31,000 --> 00:02:35,000
Now what we'll do is that we'll try to solve this problem with the help of N-grams.

54
00:02:35,000 --> 00:02:36,000
Now see this is super important.

55
00:02:36,000 --> 00:02:37,000
And you try to understand this.

56
00:02:37,000 --> 00:02:40,000
Now you know that these are my possible options.

57
00:02:40,000 --> 00:02:42,000
These are my possible options with respect to words.

58
00:02:42,000 --> 00:02:42,000
Right?

59
00:02:42,000 --> 00:02:48,000
So I have my first words like "food", "not", "good", right.

60
00:02:48,000 --> 00:02:50,000
So this is my initial vocabulary.

61
00:02:50,000 --> 00:02:53,000
Now what I'll do is that I will make a combination of this.

62
00:02:54,000 --> 00:02:58,000
Let's say from the sentence one, I'm going to make a combination of something like this food good.

63
00:02:58,000 --> 00:02:58,000
Right.

64
00:02:58,000 --> 00:03:02,000
So my next word that you will be having is something called as food

65
00:03:02,000 --> 00:03:02,000
good.

66
00:03:03,000 --> 00:03:07,000
And let's say that I am going to just perform bigram.

67
00:03:07,000 --> 00:03:12,000
Bigram basically means that I will have the combination of two words, right?

68
00:03:12,000 --> 00:03:13,000
So I'll be having food

69
00:03:13,000 --> 00:03:13,000
good.

70
00:03:13,000 --> 00:03:20,000
Then similarly over here you'll be able to see I will also be having something like this: "food not", right.

71
00:03:20,000 --> 00:03:22,000
So this is one combination.

72
00:03:22,000 --> 00:03:24,000
One more combination I have over here.

73
00:03:24,000 --> 00:03:26,000
"Food not", okay.

74
00:03:26,000 --> 00:03:30,000
And then the next combination will be not good right.

75
00:03:30,000 --> 00:03:34,000
So here I will be having my next word which looks something like this.

76
00:03:34,000 --> 00:03:37,000
"Not good", right.

77
00:03:37,000 --> 00:03:38,000
Now see this.

78
00:03:38,000 --> 00:03:45,000
Now if these become the possible features, then we will be able to distinguish the sentences much more clearly

79
00:03:45,000 --> 00:03:46,000
when compared to this.

80
00:03:46,000 --> 00:03:49,000
And here I have implemented bigram.

81
00:03:49,000 --> 00:03:51,000
Bigram basically means combination of two words.

82
00:03:51,000 --> 00:03:53,000
I can also take bigram and trigram.

83
00:03:53,000 --> 00:03:57,000
I will take the combination of three words like "food not good", right?

84
00:03:57,000 --> 00:03:59,000
So that would be one combination right.

85
00:03:59,000 --> 00:04:00,000
Something like that.
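A small sketch of how these word combinations could be generated, assuming the stop words have already been removed. The helper name ngrams is hypothetical, not something from the video or from a library.
```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
# Tokens of the two example sentences after dropping "the" and "is".
s1_tokens = ["food", "good"]
s2_tokens = ["food", "not", "good"]
bigrams_s1 = ngrams(s1_tokens, 2)   # ["food good"]
bigrams_s2 = ngrams(s2_tokens, 2)   # ["food not", "not good"]
trigrams_s2 = ngrams(s2_tokens, 3)  # ["food not good"]
print(bigrams_s1, bigrams_s2, trigrams_s2)
```
Note that only consecutive words are combined, so "food good" is a bigram of sentence one but not of sentence two.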

86
00:04:00,000 --> 00:04:03,000
So let's understand with respect to bigram.

87
00:04:03,000 --> 00:04:05,000
Now how will the vectors get updated for sentence one.

88
00:04:05,000 --> 00:04:08,000
Wherever there is a food I will put it as one.

89
00:04:08,000 --> 00:04:09,000
Not is not present.

90
00:04:09,000 --> 00:04:10,000
I will put it as zero. "Good" is present,

91
00:04:10,000 --> 00:04:11,000
I will put it as one.

92
00:04:11,000 --> 00:04:14,000
Now wherever food good is present, I will put it as one.

93
00:04:14,000 --> 00:04:17,000
Wherever food not is present, I will put it as zero.

94
00:04:17,000 --> 00:04:19,000
So food not is not present over here, right?

95
00:04:20,000 --> 00:04:22,000
If food not is present over here, I will put it as one.

96
00:04:22,000 --> 00:04:24,000
But right now it is not present.

97
00:04:24,000 --> 00:04:25,000
Not good.

98
00:04:25,000 --> 00:04:27,000
Over here with respect to sentence one is also not present.

99
00:04:27,000 --> 00:04:29,000
So I will make it as zero.

100
00:04:29,000 --> 00:04:31,000
Now if I go with respect to sentence two.

101
00:04:31,000 --> 00:04:32,000
Okay.

102
00:04:32,000 --> 00:04:35,000
Now see the difference between the vectors will be very much there.

103
00:04:35,000 --> 00:04:36,000
Okay so food is there.

104
00:04:36,000 --> 00:04:39,000
I will put it as one. "Not" is there,

105
00:04:39,000 --> 00:04:40,000
I will put it as one.

106
00:04:40,000 --> 00:04:44,000
"Good" is there, I will put it as one. Now the "food good" combination.

107
00:04:44,000 --> 00:04:46,000
You can see that continuous combination is not there.

108
00:04:46,000 --> 00:04:48,000
Let's say I'm going to put it as zero.

109
00:04:48,000 --> 00:04:50,000
"Food not" is there, "food not" is there.

110
00:04:50,000 --> 00:04:51,000
Obviously I'm going to put it as one.

111
00:04:51,000 --> 00:04:53,000
And "not good" is also there.

112
00:04:53,000 --> 00:04:55,000
I'm going to put it as one.

113
00:04:55,000 --> 00:05:00,000
Now here you can see that if I try to compare these two vectors now, they are not exactly the same.

114
00:05:00,000 --> 00:05:05,000
And there is a huge difference because I have one difference over here, two difference over here,

115
00:05:05,000 --> 00:05:06,000
three difference over here, four difference over here.
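The comparison above can be sketched like this, assuming the feature order food, not, good, food good, food not, not good. The names VOCABULARY, ngrams and vectorize are illustrative only.
```python
# Unigram + bigram vocabulary from the example, in the order used above.
VOCABULARY = ["food", "not", "good", "food good", "food not", "not good"]
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def vectorize(tokens):
    """Binary vector over VOCABULARY using unigrams and bigrams."""
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return [1 if feature in present else 0 for feature in VOCABULARY]
# Tokens after stop word removal ("the" and "is" dropped).
v1 = vectorize(["food", "good"])         # [1, 0, 1, 1, 0, 0]
v2 = vectorize(["food", "not", "good"])  # [1, 1, 1, 0, 1, 1]
differing = sum(a != b for a, b in zip(v1, v2))
print(v1, v2, differing)
```
With bigrams added, four of the six positions differ instead of one of three.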

116
00:05:06,000 --> 00:05:13,000
Now my model, my model will be able to clearly distinguish these two sentences because I have implemented

117
00:05:13,000 --> 00:05:15,000
something called as N-grams.

118
00:05:15,000 --> 00:05:15,000
Right.

119
00:05:15,000 --> 00:05:19,000
Now if I probably consider trigrams, that basically means I'm going to make a combination of three

120
00:05:19,000 --> 00:05:20,000
words, right?

121
00:05:20,000 --> 00:05:28,000
So in sklearn you have a very important parameter which is called ngram_range, okay.

122
00:05:29,000 --> 00:05:31,000
And you can make a combination of something like this.

123
00:05:31,000 --> 00:05:36,000
If I make a combination of one comma one, that basically means I'm just going to implement unigram.

124
00:05:36,000 --> 00:05:39,000
Unigram basically means only one word vocabulary okay.

125
00:05:39,000 --> 00:05:44,000
If I write the combination like one comma two, then what it is going to do it is going to consider

126
00:05:44,000 --> 00:05:49,000
something like unigram and bigram, like how I have explained you in this example.

127
00:05:49,000 --> 00:05:52,000
This is a combination of unigram and bigram.

128
00:05:52,000 --> 00:05:52,000
Okay.

129
00:05:52,000 --> 00:05:56,000
And if I probably consider like this one comma three.

130
00:05:56,000 --> 00:06:03,000
Then what I'm actually going to do I'm going to get a combination of unigram bigram and trigram.

131
00:06:03,000 --> 00:06:05,000
So trigram basically means three words.

132
00:06:05,000 --> 00:06:08,000
Right, all the possible combinations with respect to this.

133
00:06:08,000 --> 00:06:13,000
And then suppose let's say if I'm writing two comma three then it is not going to take unigram.

134
00:06:13,000 --> 00:06:17,000
It is going to take the combination of bigram and trigram.
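To make the (min, max) ranges concrete, here is a plain Python sketch that mimics the idea. It is not sklearn's implementation, and ngram_vocabulary is a hypothetical helper; in scikit-learn the same pair is passed as the ngram_range parameter of CountVectorizer.
```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def ngram_vocabulary(tokens, ngram_range):
    """Collect all n-grams for every n from min_n to max_n inclusive."""
    min_n, max_n = ngram_range
    result = []
    for n in range(min_n, max_n + 1):
        result.extend(ngrams(tokens, n))
    return result
tokens = ["food", "not", "good"]  # sentence two after stop word removal
only_unigrams = ngram_vocabulary(tokens, (1, 1))
uni_and_bi = ngram_vocabulary(tokens, (1, 2))
uni_bi_tri = ngram_vocabulary(tokens, (1, 3))
bi_and_tri = ngram_vocabulary(tokens, (2, 3))
print(only_unigrams, uni_and_bi, uni_bi_tri, bi_and_tri, sep="\n")
```
With (2, 3) the single words are left out, so the features start at "food not".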

135
00:06:17,000 --> 00:06:20,000
And this is what I'm going to show you probably in the next video.

136
00:06:20,000 --> 00:06:25,000
But understand why we are specifically using N-grams over here.

137
00:06:25,000 --> 00:06:25,000
Right?

138
00:06:25,000 --> 00:06:27,000
Super important topic.

139
00:06:27,000 --> 00:06:32,000
Since we are going to use the combination of all these words, it will give us a very good contextual

140
00:06:32,000 --> 00:06:35,000
information about the sentence. And wherever

141
00:06:35,000 --> 00:06:38,000
you feel that the sentences are completely opposite

142
00:06:38,000 --> 00:06:40,000
but we are still getting similar kinds of vectors,

143
00:06:40,000 --> 00:06:41,000
this can be very, very handy.

144
00:06:41,000 --> 00:06:45,000
So I hope you are able to understand about N-grams.

145
00:06:45,000 --> 00:06:50,000
And in the next video we will try to do some practical implementation with respect to this, which will

146
00:06:50,000 --> 00:06:52,000
be superbly important.

147
00:06:52,000 --> 00:06:55,000
Okay, so let's go ahead and I'll see you all in the next video.

