1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue the discussion with respect to TF-IDF.

3
00:00:03,000 --> 00:00:07,000
And now we are going to basically implement it with the help of Python.

4
00:00:07,000 --> 00:00:12,000
And we'll see how easily we can implement TF-IDF with respect to all the concepts that we

5
00:00:12,000 --> 00:00:13,000
have learned in our previous video.

6
00:00:13,000 --> 00:00:20,000
I'm going to take the same example, using pd.read_csv with respect to the

7
00:00:20,000 --> 00:00:23,000
data set that we specifically have over here.

8
00:00:23,000 --> 00:00:23,000
Spam collection.

9
00:00:23,000 --> 00:00:24,000
Right.

10
00:00:24,000 --> 00:00:26,000
So this is the data set that we specifically have.

11
00:00:26,000 --> 00:00:29,000
So I am just going to rename this over here just a second.

12
00:00:30,000 --> 00:00:31,000
Uh, spam classifier.

13
00:00:31,000 --> 00:00:31,000
Yeah.

14
00:00:31,000 --> 00:00:33,000
So this is my data set.

15
00:00:33,000 --> 00:00:35,000
Uh, like how we did it for the bag of words.

16
00:00:35,000 --> 00:00:36,000
Uh.

17
00:00:36,000 --> 00:00:37,000
And all the things.

18
00:00:37,000 --> 00:00:41,000
And again, this is the normal cleaning process that we specifically did.

19
00:00:41,000 --> 00:00:43,000
I think, uh, all these things are repeated, right?

20
00:00:43,000 --> 00:00:45,000
PorterStemmer and so on.

21
00:00:45,000 --> 00:00:50,000
And here also, you'll be able to see that we are basically cleaning the data, lowercasing it,

22
00:00:50,000 --> 00:00:53,000
splitting it, and applying stopwords and then stemming it.

23
00:00:53,000 --> 00:00:53,000
Okay.

24
00:00:54,000 --> 00:00:59,000
So just to make a small change, instead of doing stemming, I'll

25
00:00:59,000 --> 00:01:00,000
do lemmatization.

26
00:01:00,000 --> 00:01:03,000
So already I've shown you how to perform Lemmatization.

27
00:01:03,000 --> 00:01:06,000
And for that we import WordNetLemmatizer.

28
00:01:06,000 --> 00:01:06,000
Right.

29
00:01:06,000 --> 00:01:09,000
So I'm just going to go over here, copy and paste it over here.

30
00:01:09,000 --> 00:01:15,000
And instead of using PorterStemmer I'm just going to use WordNetLemmatizer just to give you an added

31
00:01:15,000 --> 00:01:15,000
flavor.

32
00:01:15,000 --> 00:01:20,000
So I will write WordNetLemmatizer and I'll execute this, okay.

33
00:01:20,000 --> 00:01:26,000
And then instead of calling the stemmer's stem method, I will call the lemmatizer's lemmatize method.

34
00:01:26,000 --> 00:01:29,000
Okay, so I think this is almost repeated.

35
00:01:29,000 --> 00:01:30,000
I've done it again and again.

36
00:01:30,000 --> 00:01:32,000
Now I think you'll be able to understand.

37
00:01:32,000 --> 00:01:36,000
Now this will probably take some time based on the size of the data set we have.

38
00:01:36,000 --> 00:01:39,000
Um, and again, yeah, obviously it is going to take time.

39
00:01:39,000 --> 00:01:40,000
So we'll just wait.

40
00:01:41,000 --> 00:01:44,000
Uh, but uh, let me repeat what I did instead of doing stemming.

41
00:01:44,000 --> 00:01:46,000
I'm actually using WordNetLemmatizer.

42
00:01:46,000 --> 00:01:47,000
Right.

43
00:01:47,000 --> 00:01:50,000
And this is also one of the things that you should always try.

44
00:01:50,000 --> 00:01:51,000
Try a variety of different things.

45
00:01:51,000 --> 00:01:52,000
Right?

46
00:01:52,000 --> 00:01:53,000
So finally it is done.

47
00:01:53,000 --> 00:01:54,000
And this is my corpus.

48
00:01:54,000 --> 00:01:59,000
Now in the corpus you will be able to see proper, meaningful words over here.

49
00:01:59,000 --> 00:01:59,000
Right.

50
00:01:59,000 --> 00:02:06,000
And these are all my input sentences, which I will be using to further create

51
00:02:06,000 --> 00:02:06,000
TF-IDF.

52
00:02:06,000 --> 00:02:08,000
Now to create TF-IDF.

53
00:02:08,000 --> 00:02:11,000
I have already opened this TfidfVectorizer.

54
00:02:11,000 --> 00:02:13,000
It is again present in sklearn.

55
00:02:14,000 --> 00:02:16,000
Uh, it is again a technique.

56
00:02:16,000 --> 00:02:21,000
Whatever I explained to you with respect to TF-IDF, all of those things

57
00:02:21,000 --> 00:02:24,000
will be done, but it is almost like a bag of words.

58
00:02:24,000 --> 00:02:26,000
All the features are like bag of words.

59
00:02:26,000 --> 00:02:26,000
Here.

60
00:02:26,000 --> 00:02:31,000
Also you have lowercase, you have tokenizer, you have analyzer and all these things.

61
00:02:31,000 --> 00:02:32,000
And you also have ngram_range.

62
00:02:32,000 --> 00:02:33,000
Right.

63
00:02:33,000 --> 00:02:35,000
So you can also perform n grams over here.

64
00:02:35,000 --> 00:02:40,000
Now let me go ahead and show you how you can implement TfidfVectorizer.

65
00:02:40,000 --> 00:02:55,000
So I will say from sklearn.feature_extraction.text import TfidfVectorizer, okay.

66
00:02:55,000 --> 00:02:59,000
So I'm going to basically use this and let me make my mic a little bit straight.

67
00:02:59,000 --> 00:03:00,000
Yeah.

68
00:03:00,000 --> 00:03:02,000
So over here, uh, this is done.

69
00:03:02,000 --> 00:03:08,000
Now all I have to do is initialize TfidfVectorizer.

70
00:03:08,000 --> 00:03:08,000
Right.

71
00:03:08,000 --> 00:03:11,000
Like how we did it for bag of words in the bag of words.

72
00:03:11,000 --> 00:03:16,000
If you see our previous code, we imported this and then we initialized it with max_features.

73
00:03:16,000 --> 00:03:21,000
So what I'm going to do I'm just going to initialize this with the same thing okay.

74
00:03:21,000 --> 00:03:26,000
And over here binary will not be required, because we want weighted values rather than ones and zeros.

75
00:03:26,000 --> 00:03:28,000
So it will not be used here

76
00:03:28,000 --> 00:03:30,000
the way we used it in bag of words.

77
00:03:30,000 --> 00:03:30,000
Right?

78
00:03:30,000 --> 00:03:31,000
So that's it.

79
00:03:31,000 --> 00:03:31,000
See?

80
00:03:31,000 --> 00:03:32,000
Same thing.

81
00:03:32,000 --> 00:03:33,000
I've done it over here.

82
00:03:33,000 --> 00:03:37,000
Now what I'll do, I will just use tfidf as my variable.

83
00:03:37,000 --> 00:03:41,000
And then let's say I'm going to take the top 100 most frequent words with max_features.

84
00:03:41,000 --> 00:03:42,000
Right.

85
00:03:42,000 --> 00:03:51,000
So after this I will write tfidf.fit_transform on my corpus.

86
00:03:52,000 --> 00:03:52,000
Right.

87
00:03:52,000 --> 00:03:57,000
Like how we did it in bag of words, right? fit_transform on corpus.

88
00:03:57,000 --> 00:04:00,000
And then finally I'll convert this into an array.

89
00:04:00,000 --> 00:04:00,000
Right.

90
00:04:00,000 --> 00:04:01,000
So toarray.

91
00:04:01,000 --> 00:04:05,000
And I will store this in my variable X.
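
A minimal sketch of the steps just described, assuming a tiny made-up corpus in place of the cleaned spam messages and a smaller max_features so the output stays readable. The variable names tfidf and X follow the video; the sentences are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the cleaned corpus built earlier in the video.
corpus = [
    "free prize claim now",
    "call me when you are free",
    "claim your free prize today",
]

# Keep only the 5 most frequent terms (the video uses 100 on the real data).
tfidf = TfidfVectorizer(max_features=5)

# fit_transform returns a sparse matrix; toarray() makes it a dense NumPy array.
X = tfidf.fit_transform(corpus).toarray()

print(X.shape)                    # one row per sentence, one column per kept term
print(sorted(tfidf.vocabulary_))  # the terms that were kept as features
```

Each row of X is the vector for one sentence, and the number of columns equals the number of kept terms.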

92
00:04:07,000 --> 00:04:10,000
So once I do it, now you'll be able to see X over here.

93
00:04:10,000 --> 00:04:15,000
And for that I'm also importing this numpy so that I can see all my values.

94
00:04:15,000 --> 00:04:21,000
Again, not a big issue, because I'm just trying to display all the arrays clearly over

95
00:04:21,000 --> 00:04:21,000
here.
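
The exact NumPy call used in the video is not shown in the transcript. One way to get a clear, untruncated display is np.set_printoptions; the option values below are assumptions chosen for illustration, and demo is a hypothetical array.

```python
import sys
import numpy as np

# Assumed display settings: 3 decimals, no scientific notation,
# wider lines, and no "..." summarization of large arrays.
np.set_printoptions(precision=3, suppress=True, linewidth=120, threshold=sys.maxsize)

# Hypothetical values, only to show the effect of the settings.
demo = np.array([[0.0, 0.43219, 0.0], [0.55412, 0.0, 0.36501]])
print(demo)
```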

96
00:04:21,000 --> 00:04:22,000
Right.

97
00:04:22,000 --> 00:04:25,000
So if I probably go and click on X.

98
00:04:25,000 --> 00:04:31,000
So now you will be able to see I'm getting decimal values such as 0.43, all between 0 and 1.

99
00:04:31,000 --> 00:04:36,000
Similarly, you'll be able to see these are nothing but word importance, the importance of the word that is

100
00:04:36,000 --> 00:04:38,000
given in a specific sentence.
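
To see where such decimal values can come from, here is a plain-Python sketch of the weighting that scikit-learn documents as the TfidfVectorizer default: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. This may differ slightly from the formula shown in the previous video, and the three tokenized documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [["free", "prize", "free"], ["call", "me"], ["free", "call"]]
n_docs = len(docs)

vocab = sorted({word for doc in docs for word in doc})

# Document frequency: in how many documents each term appears.
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF, as in scikit-learn's default settings.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one document."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(value, 3) for value in row])
```

A term that is absent from a sentence gets 0, and every other term gets a weight between 0 and 1 that depends on its count in the sentence and on how rare it is across the corpus.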

101
00:04:38,000 --> 00:04:42,000
So you'll be seeing different values like 0.365, 0.384.

102
00:04:42,000 --> 00:04:43,000
Now here you're not getting ones and zeros.

103
00:04:43,000 --> 00:04:47,000
So some importance is definitely given to each word over here.

104
00:04:47,000 --> 00:04:47,000
Okay.

105
00:04:47,000 --> 00:04:53,000
So this is what you can do so easily with the help of TfidfVectorizer.

106
00:04:53,000 --> 00:05:00,000
Now, along with this, I will go ahead and show you n-grams.

107
00:05:00,000 --> 00:05:08,000
You can also create word combinations, and let's say here I'm just writing ngram_range.

108
00:05:08,000 --> 00:05:12,000
Let's say I want to go ahead and try with (2, 2), so two-word combinations, okay.

109
00:05:12,000 --> 00:05:15,000
And once I execute it this is entirely executed.

110
00:05:15,000 --> 00:05:18,000
But before that let me show you the vocabulary.

111
00:05:18,000 --> 00:05:22,000
Okay, so, sorry, tfidf.

112
00:05:22,000 --> 00:05:25,000
tfidf.vocabulary_.

113
00:05:25,000 --> 00:05:27,000
So if I probably go and see the vocab.

114
00:05:27,000 --> 00:05:31,000
So these are all the word combinations that have been selected as the features. I gave (2, 2).

115
00:05:31,000 --> 00:05:38,000
You can also give (1, 2), and based on that you will be getting the top 100 most frequently occurring

116
00:05:38,000 --> 00:05:40,000
combinations of words with respect to (2, 2).
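
A short sketch of the two ngram_range settings mentioned here, again on a tiny hypothetical corpus rather than the real data set. The names bigram_tfidf and mixed_tfidf are made up for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the cleaned corpus.
corpus = [
    "free prize claim now",
    "call me when you are free",
    "claim your free prize today",
]

# (2, 2): only two-word combinations become features.
bigram_tfidf = TfidfVectorizer(max_features=100, ngram_range=(2, 2))
X_bigram = bigram_tfidf.fit_transform(corpus).toarray()
print(sorted(bigram_tfidf.vocabulary_))

# (1, 2): single words and two-word combinations together.
mixed_tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
mixed_tfidf.fit(corpus)
print(sorted(mixed_tfidf.vocabulary_))
```

vocabulary_ maps each selected term to its column index in the array, so its length matches the width of each sentence vector.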

117
00:05:40,000 --> 00:05:41,000
Right.

118
00:05:41,000 --> 00:05:41,000
Same thing.

119
00:05:41,000 --> 00:05:43,000
No changes, nothing as such.

120
00:05:43,000 --> 00:05:45,000
So that is why I like this.

121
00:05:45,000 --> 00:05:46,000
Because you don't have to do anything.

122
00:05:46,000 --> 00:05:52,000
You're just trying to understand which is the better one, what advantage it basically has right now.

123
00:05:52,000 --> 00:05:56,000
Finally, if I go and see the X value with respect to this, here also you'll be able to see

124
00:05:56,000 --> 00:06:00,000
some nonzero values. They may not be visible at first, but if you explore the array, you'll be

125
00:06:00,000 --> 00:06:01,000
able to see them.

126
00:06:01,000 --> 00:06:03,000
See over here, 0.688, 0.726.

127
00:06:03,000 --> 00:06:10,000
So all these combinations are already available, 0.588, 0.56. So this is one sentence vector.

128
00:06:10,000 --> 00:06:12,000
And this is another sentence vector.

129
00:06:12,000 --> 00:06:14,000
And this size will be 100.

130
00:06:14,000 --> 00:06:14,000
Why?

131
00:06:14,000 --> 00:06:17,000
Because your vocab size is basically 100.

132
00:06:17,000 --> 00:06:21,000
You have taken the top 100 occurring words that are there with respect to frequency.

133
00:06:21,000 --> 00:06:25,000
So yes, this was about the TF-IDF practicals.

134
00:06:25,000 --> 00:06:29,000
In the next step, all we have to do is take these vectors, apply any machine learning

135
00:06:29,000 --> 00:06:31,000
algorithm, and you will be able to get the output.

136
00:06:31,000 --> 00:06:34,000
But before that you really need to do a train-test split and so on.

137
00:06:34,000 --> 00:06:38,000
Anyhow, I'll be showing you some project implementations also, and we'll also be doing some end

138
00:06:38,000 --> 00:06:39,000
to end projects.

139
00:06:39,000 --> 00:06:39,000
Right.

140
00:06:39,000 --> 00:06:40,000
So yes, this was it.

141
00:06:40,000 --> 00:06:42,000
Uh, I hope you have understood this.

142
00:06:42,000 --> 00:06:44,000
I will see you all in the next video.

143
00:06:44,000 --> 00:06:45,000
Thank you.