1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue the discussion on TF-IDF.

3
00:00:03,000 --> 00:00:07,000
And I've already shown you the formula of TF-IDF.

4
00:00:07,000 --> 00:00:11,000
That is, TF is term frequency and IDF is inverse document frequency.

5
00:00:11,000 --> 00:00:13,000
And I've also shown you an example over here.

6
00:00:13,000 --> 00:00:13,000
Right.

7
00:00:13,000 --> 00:00:15,000
So till here, everything is fine.

8
00:00:15,000 --> 00:00:21,000
Now let's talk about the most important thing, the advantages and disadvantages, and why this is probably

9
00:00:21,000 --> 00:00:23,000
better than bag of words okay.

10
00:00:23,000 --> 00:00:29,000
So first of all, the basic advantage is that, again, this is quite intuitive.

11
00:00:29,000 --> 00:00:32,000
The implementation is also quite intuitive.

12
00:00:32,000 --> 00:00:34,000
Coming to the second advantage.

13
00:00:34,000 --> 00:00:35,000
Okay.

14
00:00:35,000 --> 00:00:36,000
Like bag of words.

15
00:00:36,000 --> 00:00:41,000
Here also our inputs are fixed size.

16
00:00:41,000 --> 00:00:44,000
And this is based on the vocabulary size.

17
00:00:44,000 --> 00:00:44,000
Right.

18
00:00:44,000 --> 00:00:50,000
And this advantage is also present in bag of words.

19
00:00:50,000 --> 00:00:54,000
But the third advantage is the one I'm actually going to talk about. See, in bag of words

20
00:00:54,000 --> 00:00:55,000
also we had fixed size, right?

21
00:00:55,000 --> 00:00:57,000
But this third advantage is a major advantage.

22
00:00:57,000 --> 00:01:00,000
Now let's talk about the third advantage okay.

23
00:01:00,000 --> 00:01:05,000
So the third advantage is that the word importance is getting captured.

24
00:01:07,000 --> 00:01:09,000
I'll explain to you what exactly this is.

25
00:01:09,000 --> 00:01:12,000
Word importance is getting captured.

26
00:01:12,000 --> 00:01:14,000
Super important point.

27
00:01:14,000 --> 00:01:18,000
And probably they may also ask you this specific thing in interviews.

28
00:01:18,000 --> 00:01:23,000
Now if I go and see my entire paragraph, let's say this is my paragraph.

29
00:01:23,000 --> 00:01:25,000
Good boy, good girl, boy girl

30
00:01:25,000 --> 00:01:26,000
good, right?

31
00:01:26,000 --> 00:01:29,000
I'm getting these TF-IDF numbers.

32
00:01:29,000 --> 00:01:30,000
Right?

33
00:01:30,000 --> 00:01:31,000
And over here

34
00:01:31,000 --> 00:01:35,000
I have also written this with the help of bag of words. In bag of words,

35
00:01:35,000 --> 00:01:36,000
I used to get either 1 or 0.

36
00:01:37,000 --> 00:01:41,000
Wherever that word is present, that is coming as one, otherwise it is zero if it is not present in

37
00:01:41,000 --> 00:01:42,000
the sentence.

38
00:01:42,000 --> 00:01:48,000
But now word importance is getting captured. In bag of words, equal importance is given to both words,

39
00:01:48,000 --> 00:01:51,000
like good and boy, right, because both are present in the sentence.

40
00:01:51,000 --> 00:01:53,000
But in TF-IDF it does not work like that.

41
00:01:53,000 --> 00:01:59,000
Here, considering the entire paragraph, what is happening is that we are focusing on two things:

42
00:01:59,000 --> 00:02:00,000
term frequency

43
00:02:00,000 --> 00:02:01,000
and inverse document frequency.

44
00:02:02,000 --> 00:02:08,000
If a word is present in all the sentences, it should be given less importance.

45
00:02:08,000 --> 00:02:09,000
Understand this okay.

46
00:02:09,000 --> 00:02:15,000
If a word is present in all the sentences in that paragraph, it should be given less importance.

47
00:02:15,000 --> 00:02:16,000
Why?

48
00:02:16,000 --> 00:02:19,000
Because all the sentences have that specific word.

49
00:02:19,000 --> 00:02:23,000
So it is not playing that important a role.

50
00:02:23,000 --> 00:02:26,000
Word importance needs to be captured from every sentence.

51
00:02:26,000 --> 00:02:28,000
That is what we specifically want.

52
00:02:28,000 --> 00:02:31,000
Now over here you can see that boy is there right over here.

53
00:02:31,000 --> 00:02:32,000
Girl is there.

54
00:02:32,000 --> 00:02:36,000
Now boy and girl are each repeated in only two of the three sentences,

55
00:02:36,000 --> 00:02:37,000
not in every sentence.

56
00:02:37,000 --> 00:02:43,000
So if it is not repeated in every sentence, we need to give this particular word a value in the sentences

57
00:02:43,000 --> 00:02:43,000
where it appears.

58
00:02:43,000 --> 00:02:47,000
So if I take the example of good, good is present in all three sentences.

59
00:02:47,000 --> 00:02:50,000
So when we calculate TF-IDF,

60
00:02:50,000 --> 00:02:53,000
you will see that we are getting all zeros for good over here.

61
00:02:53,000 --> 00:02:55,000
Major issue, right?

62
00:02:55,000 --> 00:03:00,000
No, it is not an issue but a good thing: we are ignoring the word good because it is present in all the

63
00:03:00,000 --> 00:03:00,000
sentences.

64
00:03:00,000 --> 00:03:06,000
Now if I consider boy, in sentence one boy will play a very important role, right?

65
00:03:06,000 --> 00:03:11,000
So with respect to boy here, you will be seeing that I am getting some values right.

66
00:03:11,000 --> 00:03:12,000
I'm getting some values.

67
00:03:12,000 --> 00:03:15,000
Now in the second sentence, obviously boy was not there, so it became zero.

68
00:03:15,000 --> 00:03:18,000
But if I consider girl in the second sentence.

69
00:03:18,000 --> 00:03:20,000
So here you will be seeing that I'm getting some value.

70
00:03:21,000 --> 00:03:26,000
That basically means in this particular sentence, the word girl is super important, and the context

71
00:03:26,000 --> 00:03:30,000
is based on that specific word, which is why it has a non-zero TF-IDF value.

72
00:03:30,000 --> 00:03:30,000
Right?

73
00:03:30,000 --> 00:03:34,000
So in short, what is happening is that word importance is getting captured.

74
00:03:34,000 --> 00:03:37,000
And in the third sentence it is talking about both boy and girl.

75
00:03:37,000 --> 00:03:40,000
So you'll see that both boy and girl have some values.

76
00:03:40,000 --> 00:03:45,000
So in short, we are capturing some word importance over here based on the context.

77
00:03:45,000 --> 00:03:45,000
Right.
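
To make the comparison concrete, here is a minimal sketch in plain Python of bag of words versus TF-IDF on the three sentences good boy, good girl, boy girl good. The formula is an assumption, since it was shown in the earlier video and not here: TF is taken as the count of the word in the sentence divided by the number of words in the sentence, and IDF as the natural log of the number of sentences divided by the number of sentences containing the word. The function names are only for illustration.

```python
import math

# The three sentences from the running example, already lowercased and tokenized.
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Fixed vocabulary built from all sentences: ['boy', 'girl', 'good'].
vocab = sorted({word for sentence in sentences for word in sentence})


def bag_of_words(sentence):
    # Binary bag of words: 1 if the word is present in the sentence, else 0.
    return [1 if word in sentence else 0 for word in vocab]


def idf(word):
    # Assumed IDF: log_e(number of sentences / sentences containing the word).
    containing = sum(1 for sentence in sentences if word in sentence)
    return math.log(len(sentences) / containing)


def tf_idf(sentence):
    # Assumed TF: count of the word in the sentence / words in the sentence.
    return [sentence.count(word) / len(sentence) * idf(word) for word in vocab]


for sentence in sentences:
    print(sentence, bag_of_words(sentence), [round(v, 3) for v in tf_idf(sentence)])
```

Under these assumptions the column for good is zero in every sentence, because log(3/3) is zero, while boy and girl get non-zero values only in the sentences where they appear. Bag of words, in contrast, gives good and boy the same value of 1.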

78
00:03:45,000 --> 00:03:47,000
Super important point.

79
00:03:47,000 --> 00:03:53,000
And by this our machine learning model will be able to understand that, okay, something specific we

80
00:03:53,000 --> 00:03:55,000
are basically talking about.

81
00:03:55,000 --> 00:04:01,000
And that way the mathematical model will be able to work out what kind of prediction to make.

82
00:04:01,000 --> 00:04:03,000
And through this the accuracy usually increases.

83
00:04:03,000 --> 00:04:05,000
Now let's talk about the disadvantages.

84
00:04:05,000 --> 00:04:09,000
Obviously, in this particular case also you have a large number of zeros.

85
00:04:09,000 --> 00:04:12,000
So sparsity still exists okay.

86
00:04:12,000 --> 00:04:14,000
Sparsity still exists over here.

87
00:04:14,000 --> 00:04:18,000
And again, we will try to see how we can solve sparsity using Word2Vec.
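
As a rough illustration of the sparsity point, the sketch below counts how many entries of the TF-IDF matrix are zero for the same three sentences. It assumes TF as count divided by sentence length and IDF as the natural log of sentences over sentences containing the word. The helper name tf_idf_matrix is hypothetical.

```python
import math

# Running example: three tokenized sentences.
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = sorted({word for sentence in sentences for word in sentence})


def tf_idf_matrix(sentences, vocab):
    # One row per sentence, one column per vocabulary word.
    n = len(sentences)
    rows = []
    for sentence in sentences:
        row = []
        for word in vocab:
            # Number of sentences that contain this word (never 0 here,
            # because the vocabulary is built from these same sentences).
            df = sum(1 for other in sentences if word in other)
            row.append(sentence.count(word) / len(sentence) * math.log(n / df))
        rows.append(row)
    return rows


matrix = tf_idf_matrix(sentences, vocab)
zeros = sum(1 for row in matrix for value in row if value == 0)
total = len(matrix) * len(vocab)
print(f"{zeros} of {total} entries are zero")
```

Even with a vocabulary of three words, more than half of the entries are zero under these assumptions. With a real vocabulary of thousands of words, each sentence uses only a tiny fraction of them, so the share of zeros grows much larger.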

88
00:04:18,000 --> 00:04:22,000
The second thing that we specifically discussed is something called OOV,

89
00:04:22,000 --> 00:04:23,000
out of vocabulary.

90
00:04:23,000 --> 00:04:28,000
Now here also, if I add any new words in the test data, they are going

91
00:04:28,000 --> 00:04:34,000
to get ignored, because over here also all my features are based on the training vocabulary

92
00:04:34,000 --> 00:04:34,000
size.

93
00:04:35,000 --> 00:04:35,000
Right.
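
A small sketch of the out of vocabulary problem, assuming the vocabulary and the IDF values are learned only from the training sentences. The test sentence and the word dog are made up for illustration, and dividing by the full test sentence length is just one possible choice for TF.

```python
import math

# Training sentences define the vocabulary and the IDF values.
train_sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = sorted({word for sentence in train_sentences for word in sentence})

# Assumed IDF: log_e(number of training sentences / sentences containing the word).
idf = {
    word: math.log(
        len(train_sentences) / sum(1 for s in train_sentences if word in s)
    )
    for word in vocab
}


def transform(sentence):
    # Only training-vocabulary words have a position in the vector,
    # so any unseen word simply has nowhere to go.
    return [sentence.count(word) / len(sentence) * idf[word] for word in vocab]


test_sentence = ["boy", "dog"]  # "dog" is a hypothetical unseen word
vector = transform(test_sentence)
ignored = [word for word in test_sentence if word not in vocab]

print("vocabulary:", vocab)
print("vector:", [round(v, 3) for v in vector])
print("ignored words:", ignored)
```

The vector still has one entry per training vocabulary word, so the input size stays fixed, but nothing in it records that dog was in the sentence.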

94
00:04:35,000 --> 00:04:43,000
So these are basically the advantages and disadvantages of TF-IDF. But just

95
00:04:43,000 --> 00:04:48,000
by seeing the advantages and disadvantages, we can see that TF-IDF usually performs better than

96
00:04:48,000 --> 00:04:50,000
bag of words, right?

97
00:04:50,000 --> 00:04:55,000
Now, in the next video, we'll try to see some practical implementation with the help of NLTK

98
00:04:55,000 --> 00:04:56,000
and Python.

99
00:04:56,000 --> 00:05:00,000
And again, guys, you really need to practice this on different data sets.

100
00:05:00,000 --> 00:05:04,000
We will try to provide you as many assignments as possible so that you can practice these things.

101
00:05:04,000 --> 00:05:06,000
So yes, this was it from my side.

102
00:05:06,000 --> 00:05:08,000
I will see you all in the next video.

103
00:05:08,000 --> 00:05:08,000
Thank you.