1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to natural language processing.

3
00:00:04,000 --> 00:00:08,000
In our previous video we have already seen what n-grams are.

4
00:00:08,000 --> 00:00:13,000
Now what we are going to do is that we are going to see one more efficient way of converting words

5
00:00:13,000 --> 00:00:17,000
into vectors, and we specifically call it TF-IDF.

6
00:00:17,000 --> 00:00:19,000
Now what exactly is TF-IDF?

7
00:00:19,000 --> 00:00:23,000
It is nothing but term frequency and inverse document frequency.

8
00:00:23,000 --> 00:00:29,000
So we will try to understand how, with the help of TF-IDF, we will be converting all these particular

9
00:00:29,000 --> 00:00:31,000
sentences into vectors.

10
00:00:31,000 --> 00:00:36,000
Okay, and I've taken the same example that we used with bag of words, like good boy.

11
00:00:36,000 --> 00:00:43,000
And this is after lowercasing all the characters and, along with that, after

12
00:00:43,000 --> 00:00:44,000
removing the stopwords.

13
00:00:44,000 --> 00:00:50,000
So I have sentence one as good boy, sentence two as good girl, sentence three as boy girl good.

14
00:00:50,000 --> 00:00:50,000
Okay.

15
00:00:50,000 --> 00:00:54,000
And this is the same thing like, uh, from the same materials, if you will be able to see I've done

16
00:00:54,000 --> 00:00:55,000
the same thing.

17
00:00:55,000 --> 00:00:56,000
I've taken the same thing over here.

18
00:00:56,000 --> 00:00:57,000
Okay.

19
00:00:57,000 --> 00:01:00,000
Now there are two components in TF-IDF.

20
00:01:00,000 --> 00:01:05,000
One is term frequency and the other one is something called inverse document frequency.

21
00:01:05,000 --> 00:01:12,000
So whenever I talk about term frequency, the definition, or the formula for how we calculate

22
00:01:12,000 --> 00:01:21,000
it, is given by the number of repetitions of the word in the sentence divided by the number of words in the sentence.

23
00:01:21,000 --> 00:01:24,000
Okay, I'll try to show you completely.

24
00:01:24,000 --> 00:01:27,000
Taking this as an example, how we can calculate term frequency.

25
00:01:27,000 --> 00:01:31,000
And over here inverse document frequency formula is very simple.

26
00:01:31,000 --> 00:01:38,000
We basically calculate it as log to the base e of the number of sentences

27
00:01:38,000 --> 00:01:42,000
divided by number of sentences containing the word.
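
Written out as formulas, the two definitions just stated look roughly like this. The symbols t (a word) and s (a sentence) are notation assumed here for illustration; they are not used in the video.

```latex
\mathrm{TF}(t, s) = \frac{\text{number of times } t \text{ appears in } s}{\text{number of words in } s}
\qquad
\mathrm{IDF}(t) = \log_e\!\left(\frac{\text{number of sentences}}{\text{number of sentences containing } t}\right)
```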

28
00:01:42,000 --> 00:01:47,000
Now this is super important, and don't get confused with the formula right now.

29
00:01:47,000 --> 00:01:49,000
I will try to explain each and every thing to you.

30
00:01:49,000 --> 00:01:53,000
Okay, so let's go step by step and let's see how we can calculate the term frequency.

31
00:01:53,000 --> 00:01:57,000
Now the first thing that I am going to calculate is nothing

32
00:01:57,000 --> 00:01:59,000
but this term frequency.

33
00:02:00,000 --> 00:02:05,000
Now with respect to this term frequency, you know how many words there are in the vocabulary.

34
00:02:05,000 --> 00:02:05,000
Okay.

35
00:02:05,000 --> 00:02:10,000
So first of all, I will just try to show you in a different way by creating a table.

36
00:02:10,000 --> 00:02:16,000
So I have sentence S1, I have sentence S2, and I have sentence S3.

37
00:02:16,000 --> 00:02:17,000
Okay.

38
00:02:17,000 --> 00:02:23,000
And then with respect to my vocabulary of words I have something like good okay.

39
00:02:23,000 --> 00:02:27,000
Then I have boy and then I have girl.

40
00:02:27,000 --> 00:02:33,000
And already we know that only three words are present in the vocabulary. I'm trying to show

41
00:02:33,000 --> 00:02:39,000
you with a simple example so that you will be able to understand how TF-IDF will work.

42
00:02:39,000 --> 00:02:39,000
Okay.

43
00:02:39,000 --> 00:02:42,000
Now the first thing let's go back to the definition.

44
00:02:42,000 --> 00:02:48,000
Term frequency is nothing but the number of repetitions of the word in a sentence divided by the number of words

45
00:02:48,000 --> 00:02:48,000
in a sentence.

46
00:02:48,000 --> 00:02:52,000
So suppose I take S1, and with respect to S1,

47
00:02:52,000 --> 00:02:57,000
I really want to find out the term frequency of this particular word, that is, good.

48
00:02:57,000 --> 00:02:58,000
How do I calculate?

49
00:02:58,000 --> 00:03:03,000
I just need to see how many times this particular word is repeated in the sentence.

50
00:03:03,000 --> 00:03:09,000
So here you know that it is repeated just one time, and then I will be dividing by number of words

51
00:03:09,000 --> 00:03:10,000
in that specific sentence.

52
00:03:10,000 --> 00:03:14,000
So it will become one by two, because I have two words.

53
00:03:14,000 --> 00:03:16,000
Now let's go to the next word that is boy.

54
00:03:16,000 --> 00:03:18,000
So boy is also present.

55
00:03:18,000 --> 00:03:20,000
How many times? One.

56
00:03:20,000 --> 00:03:21,000
And this will be divided by two.

57
00:03:21,000 --> 00:03:23,000
Okay, I'll tell you why we are doing this.

58
00:03:23,000 --> 00:03:29,000
Because when we understand the advantages and disadvantages, you will get a clear idea about why TF-IDF

59
00:03:29,000 --> 00:03:34,000
will perform better compared to bag of words.

60
00:03:34,000 --> 00:03:36,000
Okay then I have the word girl.

61
00:03:36,000 --> 00:03:41,000
So girl, I know that over here it is not present right in the sentence one.

62
00:03:41,000 --> 00:03:43,000
So it will be zero by two.

63
00:03:43,000 --> 00:03:45,000
So which is nothing but zero okay.

64
00:03:45,000 --> 00:03:50,000
Similarly, with respect to S2, here you will be able to see how many times good is present: only one

65
00:03:50,000 --> 00:03:52,000
time, and the total number of words is two.

66
00:03:52,000 --> 00:03:55,000
So one by two. Boy is present zero times.

67
00:03:55,000 --> 00:03:58,000
So it will be zero by two, which will be nothing but zero.

68
00:03:58,000 --> 00:04:01,000
And girl is basically present over here one time.

69
00:04:01,000 --> 00:04:03,000
So again I'm going to write one by two.

70
00:04:03,000 --> 00:04:05,000
Now let's go with respect to the sentence three.

71
00:04:05,000 --> 00:04:11,000
So in sentence three, how many times is good present? One time, and the total number of words now is three,

72
00:04:11,000 --> 00:04:11,000
right? So it is one by three.

73
00:04:12,000 --> 00:04:15,000
And then if I go next to see the word

74
00:04:15,000 --> 00:04:15,000
boy.

75
00:04:15,000 --> 00:04:16,000
How many times boy is present?

76
00:04:16,000 --> 00:04:17,000
Only one time.

77
00:04:17,000 --> 00:04:21,000
So this will also be one by three, and girl will also be one by three, because the total number

78
00:04:21,000 --> 00:04:22,000
of words is three.

79
00:04:22,000 --> 00:04:28,000
Okay, so this is how simply we are able to calculate the term frequency, okay.
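
The term frequency table built above can be reproduced with a small sketch. The three sentences and the vocabulary order come from the example; the names `sentences`, `vocabulary`, `term_frequency` and `tf_table` are assumptions made up for this illustration. Fractions are used so the values read exactly like the "one by two" and "one by three" in the walkthrough.

```python
from fractions import Fraction

# The three example sentences, already lowercased with stopwords removed.
sentences = {
    "S1": ["good", "boy"],
    "S2": ["good", "girl"],
    "S3": ["boy", "girl", "good"],
}
vocabulary = ["good", "boy", "girl"]


def term_frequency(word, tokens):
    """Occurrences of the word in the sentence divided by the sentence length."""
    return Fraction(tokens.count(word), len(tokens))


# One row per sentence, one column per vocabulary word.
tf_table = {
    name: [term_frequency(word, tokens) for word in vocabulary]
    for name, tokens in sentences.items()
}

for name, row in tf_table.items():
    print(name, [str(value) for value in row])
```

Each printed row lists the values for good, boy and girl in that order, matching the columns of the table.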

80
00:04:28,000 --> 00:04:32,000
Now let's go ahead and let's try to find out the inverse document frequency.

81
00:04:32,000 --> 00:04:35,000
So I'm just going to write over here as IDF.

82
00:04:35,000 --> 00:04:40,000
Now with respect to IDF also we will be creating two fields, very simple fields.

83
00:04:40,000 --> 00:04:43,000
And remember uh, this will basically be my IDF.

84
00:04:43,000 --> 00:04:47,000
And this is with respect to my words over here.

85
00:04:47,000 --> 00:04:56,000
So my words are nothing but good, boy, or let me write it down in a better way so that it should

86
00:04:56,000 --> 00:04:57,000
look in the same order.

87
00:04:57,000 --> 00:04:59,000
So here will be my good.

88
00:05:00,000 --> 00:05:02,000
Here will be my boy.

89
00:05:02,000 --> 00:05:05,000
And here the next word is something like girl.

90
00:05:05,000 --> 00:05:10,000
Now, in order to calculate the inverse document frequency, it is very simple.

91
00:05:10,000 --> 00:05:13,000
Now all I have to do is apply this log base e.

92
00:05:13,000 --> 00:05:14,000
Right.

93
00:05:14,000 --> 00:05:18,000
How many number of sentences are there with respect to good right.

94
00:05:18,000 --> 00:05:20,000
With respect to good.

95
00:05:20,000 --> 00:05:24,000
Uh, suppose if I really want to calculate the inverse document frequency of good.

96
00:05:24,000 --> 00:05:24,000
Okay.

97
00:05:24,000 --> 00:05:30,000
So here what I'm writing: I'll write log base e of the number of sentences.

98
00:05:30,000 --> 00:05:31,000
How many sentences are there.

99
00:05:31,000 --> 00:05:32,000
There are three sentences.

100
00:05:32,000 --> 00:05:37,000
So three divided by number of sentences containing the word.

101
00:05:37,000 --> 00:05:41,000
So good is present in all these three sentences: one, two, three.

102
00:05:41,000 --> 00:05:42,000
Right.

103
00:05:42,000 --> 00:05:46,000
So I will basically be writing log to the base e of three divided by three, okay.

104
00:05:46,000 --> 00:05:51,000
And if you try to calculate this, it is nothing

105
00:05:51,000 --> 00:05:53,000
but zero, okay.

106
00:05:53,000 --> 00:05:57,000
And you can basically do this with a calculator. Now for boy, over here.

107
00:05:57,000 --> 00:06:00,000
Again, log to the base e, and the number of sentences is three.

108
00:06:00,000 --> 00:06:02,000
And how many times boy is present.

109
00:06:02,000 --> 00:06:05,000
That is, in how many sentences

110
00:06:05,000 --> 00:06:06,000
is boy present?

111
00:06:06,000 --> 00:06:08,000
It is present in one, two, right?

112
00:06:08,000 --> 00:06:10,000
Sentence one and sentence three.

113
00:06:10,000 --> 00:06:16,000
So I will be writing log base e of three by two.

114
00:06:16,000 --> 00:06:22,000
Similarly, girl will also be present in the same number of sentences, if you see in how many sentences girl

115
00:06:22,000 --> 00:06:23,000
is present, so it is also log base e of three by two.
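
The inverse document frequency values above can be checked with a short sketch instead of a calculator. It uses the plain formula from the video, natural log of the number of sentences divided by the number of sentences containing the word. The names `sentences`, `vocabulary`, `inverse_document_frequency` and `idf` are assumptions for this illustration.

```python
import math

# The three example sentences, already lowercased with stopwords removed.
sentences = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]
vocabulary = ["good", "boy", "girl"]


def inverse_document_frequency(word, docs):
    """log_e(number of sentences / number of sentences containing the word).

    Assumes the word occurs in at least one sentence, which holds for every
    vocabulary word in this example; otherwise the division would fail.
    """
    containing = sum(1 for tokens in docs if word in tokens)
    return math.log(len(docs) / containing)


idf = {word: inverse_document_frequency(word, sentences) for word in vocabulary}

for word, value in idf.items():
    print(word, round(value, 4))
```

Good appears in all three sentences, so its value is log of one, which is zero; boy and girl each appear in two sentences, so both come out as log base e of three by two, roughly 0.4055.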

116
00:06:23,000 --> 00:06:31,000
Okay, so I have independently calculated term frequency and I have independently calculated inverse

117
00:06:31,000 --> 00:06:31,000
document frequency.

118
00:06:31,000 --> 00:06:36,000
This is perfect, right? Now whenever we say TF-IDF,

119
00:06:36,000 --> 00:06:42,000
in short, what I'm actually doing is multiplying these two, okay, term frequency and inverse document frequency.

120
00:06:42,000 --> 00:06:44,000
I'm taking the combination of these two.

121
00:06:44,000 --> 00:06:47,000
Now let me go ahead and write it down in a better way.

122
00:06:47,000 --> 00:06:49,000
Still, uh, in the way that we specifically want.

123
00:06:49,000 --> 00:06:51,000
And finally how our vectors will look like.

124
00:06:51,000 --> 00:06:53,000
So this is my vocabulary.

125
00:06:53,000 --> 00:06:54,000
Good boy.

126
00:06:54,000 --> 00:06:55,000
Girl.

127
00:06:55,000 --> 00:06:57,000
And this is the final

128
00:06:57,000 --> 00:07:03,000
TF-IDF, okay, so the final TF-IDF based on this calculation.

129
00:07:03,000 --> 00:07:06,000
And this will differ from dataset to dataset.

130
00:07:06,000 --> 00:07:12,000
Okay, so first of all, with respect to sentence one, whenever I see the combination of TF-IDF with

131
00:07:12,000 --> 00:07:17,000
respect to good, all I have to do is multiply one by two with this zero, okay.

132
00:07:17,000 --> 00:07:19,000
So I'm going to multiply this with this.

133
00:07:19,000 --> 00:07:20,000
So in sentence one.

134
00:07:20,000 --> 00:07:23,000
So see this is the sentence one right.

135
00:07:23,000 --> 00:07:25,000
This is this entire thing is the sentence one.

136
00:07:25,000 --> 00:07:25,000
Right.

137
00:07:25,000 --> 00:07:28,000
So this is my sentence one okay.

138
00:07:28,000 --> 00:07:31,000
So I will be taking this combination and multiply with this.

139
00:07:31,000 --> 00:07:32,000
Right.

140
00:07:32,000 --> 00:07:35,000
So sentence one good one by two multiplied by zero.

141
00:07:35,000 --> 00:07:42,000
It is nothing but zero. With respect to boy, one by two multiplied

142
00:07:42,000 --> 00:07:47,000
by log base e of three by two will be the value that I'll be getting in sentence one.

143
00:07:47,000 --> 00:07:51,000
And with respect to girl zero multiplied by this it will be zero.

144
00:07:51,000 --> 00:07:53,000
Now let's go to the sentence two.

145
00:07:53,000 --> 00:07:56,000
In sentence two, I will go ahead and look for this.

146
00:07:56,000 --> 00:07:59,000
Now again, one by two, multiplied by zero.

147
00:07:59,000 --> 00:08:00,000
Again good will be zero.

148
00:08:00,000 --> 00:08:03,000
And for boy, the term frequency is nothing but zero by two, which is zero.

149
00:08:03,000 --> 00:08:07,000
So this zero multiplied by log base e of three by two is nothing but zero.

150
00:08:07,000 --> 00:08:12,000
And here, for girl, I have one by two multiplied by log base e of three by two.

151
00:08:13,000 --> 00:08:15,000
I will tell you the exact thing

152
00:08:15,000 --> 00:08:18,000
that we really need to know: why we are doing this specific thing.

153
00:08:18,000 --> 00:08:20,000
Everything will make sense.

154
00:08:20,000 --> 00:08:22,000
Uh, and it will make sense.

155
00:08:22,000 --> 00:08:24,000
And I'll make you definitely understand all these things.

156
00:08:24,000 --> 00:08:30,000
Okay, so coming to the next one with respect to sentence three, in sentence three, I will do this

157
00:08:30,000 --> 00:08:32,000
multiplication with this.

158
00:08:32,000 --> 00:08:32,000
Right.

159
00:08:32,000 --> 00:08:34,000
So one by three multiplied by zero.

160
00:08:34,000 --> 00:08:42,000
Again it will be zero. Then one by three multiplied by log base e of three by two, and one by three multiplied

161
00:08:42,000 --> 00:08:45,000
by log base e of three by two.

162
00:08:45,000 --> 00:08:46,000
Perfect.

163
00:08:46,000 --> 00:08:48,000
So we have got all this calculation.

164
00:08:48,000 --> 00:08:51,000
And this is how my vectors will look like.

165
00:08:51,000 --> 00:08:53,000
So here you will be seeing that for sentence one.

166
00:08:53,000 --> 00:08:58,000
We converted this entire sentence into a vector which looks like this.

167
00:08:59,000 --> 00:08:59,000
Right.

168
00:08:59,000 --> 00:09:01,000
So this is my sentence one vector.

169
00:09:01,000 --> 00:09:03,000
This is my sentence two vector.

170
00:09:03,000 --> 00:09:04,000
And this is my sentence three vector.
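
Putting the two pieces together, the three sentence vectors can be sketched as below. This follows the plain formulas used in this video (term frequency multiplied by the unsmoothed log base e inverse document frequency). All names in the code are assumptions for this illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will not match this hand calculation.

```python
import math

# The three example sentences, already lowercased with stopwords removed.
sentences = {
    "S1": ["good", "boy"],
    "S2": ["good", "girl"],
    "S3": ["boy", "girl", "good"],
}
vocabulary = ["good", "boy", "girl"]


def term_frequency(word, tokens):
    """Occurrences of the word in the sentence divided by the sentence length."""
    return tokens.count(word) / len(tokens)


def inverse_document_frequency(word, docs):
    """log_e(number of sentences / number of sentences containing the word)."""
    containing = sum(1 for tokens in docs if word in tokens)
    return math.log(len(docs) / containing)


docs = list(sentences.values())
idf = {word: inverse_document_frequency(word, docs) for word in vocabulary}

# TF-IDF vector per sentence: term frequency times inverse document frequency.
vectors = {
    name: [term_frequency(word, tokens) * idf[word] for word in vocabulary]
    for name, tokens in sentences.items()
}

for name, vector in vectors.items():
    print(name, [round(value, 4) for value in vector])
```

The column for good is zero in every vector because good appears in all three sentences, while boy and girl get non-zero weights (about 0.2027 in the two-word sentences and about 0.1352 in sentence three).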

171
00:09:04,000 --> 00:09:09,000
And obviously I'll be having an output with respect to any kind of classification that I want to do.

172
00:09:09,000 --> 00:09:13,000
And then I will train my model by passing my sentence one.

173
00:09:13,000 --> 00:09:19,000
So in short, if you see Good Boy is basically converted into a vector which will look like this.

174
00:09:19,000 --> 00:09:25,000
Okay, this entire sentence one is getting converted into a vector: zero, this value, and zero.

175
00:09:25,000 --> 00:09:25,000
Right.

176
00:09:25,000 --> 00:09:28,000
So again, you can calculate this with the help of a calculator.

177
00:09:28,000 --> 00:09:32,000
But this is the way that we have actually done it.

178
00:09:32,000 --> 00:09:35,000
We have converted all our sentences into vectors.

179
00:09:35,000 --> 00:09:42,000
And in TF-IDF, this is the technique that is used to convert the words into vectors.

180
00:09:42,000 --> 00:09:45,000
Now you may be thinking, Krish, what is so special about this?

181
00:09:45,000 --> 00:09:46,000
We have got some values.

182
00:09:46,000 --> 00:09:47,000
Okay, that is fine.

183
00:09:47,000 --> 00:09:52,000
And that is what I am going to talk about in my next video about the advantages and disadvantages of

184
00:09:52,000 --> 00:09:53,000
TF-IDF.

185
00:09:53,000 --> 00:09:57,000
At the end of the day, the best one is nothing but Word2Vec.

186
00:09:57,000 --> 00:10:01,000
But I really need to show you this also, because the majority of the use cases

187
00:10:01,000 --> 00:10:05,000
you can also solve with this technique, that is, the TF-IDF technique.

188
00:10:05,000 --> 00:10:05,000
Okay.

189
00:10:05,000 --> 00:10:08,000
So yes, this was it for this particular video.

190
00:10:08,000 --> 00:10:09,000
I hope you liked it.

191
00:10:09,000 --> 00:10:14,000
Again, you can take different sentences and just try to apply this and try to see

192
00:10:14,000 --> 00:10:17,000
whether you are able to find out TF-IDF or not.

193
00:10:17,000 --> 00:10:21,000
So yes, I will see you all in the next video, where I'll be discussing the advantages and disadvantages.

194
00:10:21,000 --> 00:10:22,000
Thank you.

