1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:06,000
So in this video we are going to see the practical implementation of Word2Vec. I really want

3
00:00:06,000 --> 00:00:12,000
to show you a Google pre-trained model and give you an idea of how Word2Vec represents a word as

4
00:00:12,000 --> 00:00:13,000
a vector, right?

5
00:00:13,000 --> 00:00:18,000
So for this tutorial I'm going to use a library called Gensim.

6
00:00:18,000 --> 00:00:22,000
So let's go ahead and let's install this particular library.

7
00:00:22,000 --> 00:00:24,000
So you just need to write pip install gensim.

8
00:00:24,000 --> 00:00:27,000
So here you can see that requirement is already satisfied.

9
00:00:27,000 --> 00:00:32,000
And what I'm going to do next is import it: import gensim.

10
00:00:32,000 --> 00:00:37,000
And from gensim.models I'm going to import Word2Vec and KeyedVectors, okay?

11
00:00:37,000 --> 00:00:39,000
And I'll talk about this.

12
00:00:39,000 --> 00:00:41,000
Why these two classes are specifically required.

13
00:00:41,000 --> 00:00:48,000
Now, one very important thing: as I said, to show you the practical implementation of

14
00:00:48,000 --> 00:00:51,000
Word2Vec here, I'm going to take a Google pre-trained model.

15
00:00:51,000 --> 00:00:57,000
In the upcoming video I will try to show you a different model which can be trained from scratch.

16
00:00:57,000 --> 00:01:03,000
But here I'm going to show you a Google Word2Vec pre-trained model and what this particular model

17
00:01:03,000 --> 00:01:04,000
is all about.

18
00:01:04,000 --> 00:01:08,000
So I'm taking word2vec-google-news-300, okay?

19
00:01:08,000 --> 00:01:17,000
These are pre-trained vectors trained on part of the Google News dataset, about 100 billion words.

20
00:01:17,000 --> 00:01:23,000
The model contains 300-dimensional vectors for 3 million words and phrases.

21
00:01:23,000 --> 00:01:23,000
Right.

22
00:01:23,000 --> 00:01:29,000
These phrases were obtained using a simple data-driven approach described in the Distributed Representations

23
00:01:29,000 --> 00:01:30,000
of Words and Phrases paper.

24
00:01:30,000 --> 00:01:33,000
The research paper and everything else is linked over here.

25
00:01:33,000 --> 00:01:38,000
Okay, so we are going to use this same model and see how easily it gives us

26
00:01:38,000 --> 00:01:41,000
a vector whenever we give it a word.

27
00:01:41,000 --> 00:01:45,000
So in Gensim you have a downloader module, which we import as api, right?

28
00:01:45,000 --> 00:01:47,000
So the module is gensim.downloader.

29
00:01:47,000 --> 00:01:51,000
So you just need to write import gensim.downloader as api.

30
00:01:51,000 --> 00:01:56,000
And then you write api.load with the model name.

31
00:01:56,000 --> 00:01:56,000
Right.

32
00:01:56,000 --> 00:01:59,000
So the model name is word2vec-google-news-300.

33
00:01:59,000 --> 00:02:01,000
That's written with dashes, okay?

34
00:02:01,000 --> 00:02:08,000
Now once you do this, when you index this wv variable with any word,

35
00:02:08,000 --> 00:02:11,000
and wv is just an instance of this specific model,

36
00:02:11,000 --> 00:02:13,000
it will give you the vector for that word.
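
A minimal sketch of the loading and lookup steps just described, assuming Gensim is installed and that a download of roughly 1.6 GB is acceptable. The helper names load_google_news_vectors and vector_for are made up for illustration; only gensim.downloader, api.load and the model name come from the video.

```python
def load_google_news_vectors():
    """Download (if needed) and return the pre-trained Google News vectors."""
    # Imported lazily so that defining this helper does not require Gensim.
    import gensim.downloader as api

    # Model name as given in the video; the returned object maps words to vectors.
    return api.load("word2vec-google-news-300")


def vector_for(wv, word):
    """Look up the stored vector for a single word."""
    # Indexing with a word raises KeyError if the word is not in the vocabulary.
    return wv[word]
```

Calling wv = load_google_news_vectors() and then vector_king = vector_for(wv, "king") should give the king vector, and vector_king.shape should report the 300 dimensions mentioned in the model description.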

37
00:02:13,000 --> 00:02:18,000
I'm not going to execute this line of code again, because I've already run it and the model size is

38
00:02:18,000 --> 00:02:21,000
1662.8 MB, right?

39
00:02:21,000 --> 00:02:24,000
So it will probably take some time for you all to download.

40
00:02:24,000 --> 00:02:28,000
So I have already downloaded this so that I can record the video directly.

41
00:02:28,000 --> 00:02:29,000
You can just go ahead and download it.

42
00:02:29,000 --> 00:02:33,000
Now let me go and see what the king vector looks like.

43
00:02:33,000 --> 00:02:34,000
Okay.

44
00:02:34,000 --> 00:02:40,000
So this is the king vector: how the word is converted into a vector, and the

45
00:02:40,000 --> 00:02:41,000
dimensions that you'll be seeing.

46
00:02:41,000 --> 00:02:41,000
Right.

47
00:02:41,000 --> 00:02:42,000
How many dimensions are there?

48
00:02:42,000 --> 00:02:44,000
As it says up here.

49
00:02:44,000 --> 00:02:44,000
Right.

50
00:02:44,000 --> 00:02:46,000
We have 300 dimensions.

51
00:02:46,000 --> 00:02:47,000
So this is what we are getting.

52
00:02:47,000 --> 00:02:51,000
This entire vector, you'll be able to see, has 300 dimensions.

53
00:02:51,000 --> 00:02:57,000
So suppose I use vector_king.shape.

54
00:02:57,000 --> 00:02:58,000
Right.

55
00:02:58,000 --> 00:03:01,000
So if I execute this, here you will be able to see 300 dimensions.

56
00:03:01,000 --> 00:03:02,000
Right.

57
00:03:02,000 --> 00:03:06,000
So in the same way, for any word in the vocabulary you will be able to get a vector.

58
00:03:06,000 --> 00:03:09,000
Now let me give you some examples, okay?

59
00:03:09,000 --> 00:03:11,000
Suppose I make another one over here.

60
00:03:11,000 --> 00:03:17,000
And for this vector, all I use is this wv variable, which holds the Word2Vec

61
00:03:17,000 --> 00:03:18,000
word vectors.

62
00:03:18,000 --> 00:03:19,000
Right.

63
00:03:19,000 --> 00:03:22,000
All I have to do is index it with any word of my choice.

64
00:03:22,000 --> 00:03:24,000
Let's say I want to give cricket.

65
00:03:24,000 --> 00:03:25,000
Right?

66
00:03:25,000 --> 00:03:30,000
So if I give cricket, you'll be able to see that the vector is returned automatically.

67
00:03:30,000 --> 00:03:35,000
So here again is the vector you get, with this same shape.

68
00:03:35,000 --> 00:03:36,000
That is, 300 dimensions.

69
00:03:36,000 --> 00:03:39,000
Now this wv variable, right,

70
00:03:39,000 --> 00:03:41,000
which holds the word vectors,

71
00:03:41,000 --> 00:03:46,000
also has some methods. You can use one called most_similar.

72
00:03:46,000 --> 00:03:49,000
Now let's say that I call most_similar.

73
00:03:49,000 --> 00:03:55,000
I'm asking: if I give the word cricket, which are the most similar words present

74
00:03:55,000 --> 00:03:56,000
in this entire vocabulary?

75
00:03:56,000 --> 00:03:56,000
Right.

76
00:03:56,000 --> 00:04:01,000
And I'm going to use the same wv object to find those words.

77
00:04:01,000 --> 00:04:06,000
So if I execute this, here you'll be able to see the most similar words with respect to cricket.

78
00:04:06,000 --> 00:04:09,000
You know, first of all it looks up the vector for the word.

79
00:04:09,000 --> 00:04:14,000
And then it compares that vector with the other word vectors to find the

80
00:04:14,000 --> 00:04:14,000
closest ones.
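
Here is a rough sketch of that lookup-and-compare idea on a tiny hand-made vocabulary. The three-dimensional toy vectors and the function names are invented for illustration and are not values from the Google News model; the sketch normalizes every vector to unit length and ranks the other words by dot product, which is one common way to rank by cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(vectors, word, topn=3):
    """Rank every other word by the dot product of unit-length vectors."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    query = unit[word]
    scores = [
        (other, sum(a * b for a, b in zip(query, vec)))
        for other, vec in unit.items()
        if other != word  # do not return the query word itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical toy vectors, chosen by hand for this example only.
toy_vectors = {
    "cricket": [0.9, 0.1, 0.0],
    "cricketer": [0.8, 0.3, 0.0],
    "hockey": [0.5, 0.1, 0.6],
    "happy": [0.0, 0.9, 0.1],
}

ranking = most_similar(toy_vectors, "cricket")
print([w for w, _ in ranking])
```

With these toy values, cricketer ranks above hockey, and happy comes last.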

81
00:04:14,000 --> 00:04:19,000
So these are the words most similar to cricket in that vocabulary.

82
00:04:19,000 --> 00:04:20,000
So you have cricketing.

83
00:04:20,000 --> 00:04:25,000
It has a similarity of about 0.83, that is, 83% similarity.

84
00:04:25,000 --> 00:04:27,000
Then cricketers has 81% similarity.

85
00:04:27,000 --> 00:04:28,000
Then you have test cricket.

86
00:04:28,000 --> 00:04:32,000
Here you can see 80% similarity. Twenty20 cricket, right?

87
00:04:32,000 --> 00:04:34,000
You have 80% similarity.

88
00:04:34,000 --> 00:04:36,000
Then you have cricket, then you have cricketer.

89
00:04:36,000 --> 00:04:41,000
All these things are, you know, they are showing some kind of similarities with respect to this Google

90
00:04:41,000 --> 00:04:42,000
News feed, right?

91
00:04:42,000 --> 00:04:46,000
Now, similarly, if I want to find the most similar words with respect to happy,

92
00:04:46,000 --> 00:04:53,000
you'll see that I get words like glad, pleased, ecstatic, overjoyed, thrilled,

93
00:04:53,000 --> 00:04:58,000
satisfied, proud, delighted. And understand that these are all similar words when compared to happy,

94
00:04:58,000 --> 00:04:59,000
right?

95
00:04:59,000 --> 00:05:04,000
So it is able to tell us that these are the similar words when compared to the word

96
00:05:04,000 --> 00:05:04,000
happy.

97
00:05:04,000 --> 00:05:07,000
And it also shows a similarity score for each one.

98
00:05:07,000 --> 00:05:10,000
And I've already shown you how this score is calculated.

99
00:05:10,000 --> 00:05:11,000
Right.

100
00:05:11,000 --> 00:05:14,000
So we use the concept of cosine similarity.
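
As a quick reminder of that calculation, here is a minimal plain-Python sketch of cosine similarity between two vectors. The function name cosine_similarity is my own choice for this example and is not taken from the video.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, perpendicular ones close to 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The score depends only on direction, not on length, which is why two word vectors of different magnitude can still come out as highly similar.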

101
00:05:14,000 --> 00:05:16,000
Similarly, I can also provide two words.

102
00:05:16,000 --> 00:05:21,000
And I can ask how similar these two words are.

103
00:05:21,000 --> 00:05:26,000
So if I execute wv.similarity here, you'll be able to see what I'm getting.

104
00:05:26,000 --> 00:05:30,000
Hockey and sports are somewhere around 53% similar, right?

105
00:05:30,000 --> 00:05:34,000
Now, this is very interesting, okay? I'm going to take the vector of king.

106
00:05:34,000 --> 00:05:37,000
And I'm going to subtract the vector of man.

107
00:05:37,000 --> 00:05:39,000
And I'm going to add the vector of woman.

108
00:05:39,000 --> 00:05:44,000
Now let's see what will be the kind of output vectors that I'll be getting.

109
00:05:44,000 --> 00:05:44,000
Okay.

110
00:05:44,000 --> 00:05:49,000
So in short, what I'm computing is king minus man plus woman.

111
00:05:49,000 --> 00:05:53,000
Obviously the answer should be queen, but I want to show you through the vectors themselves

112
00:05:53,000 --> 00:05:55,000
whether we are able to get queen or not.
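
To make the arithmetic concrete, here is a toy sketch with hand-made three-dimensional vectors whose coordinates stand for (royal, male, female). These values are invented so that the example works out exactly; the real 300-dimensional vectors are learned from data and only land near queen rather than exactly on it.

```python
# Hypothetical toy vectors: [royal, male, female].
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}


def subtract(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]


# king - man + woman
result = add(subtract(toy["king"], toy["man"]), toy["woman"])
print(result == toy["queen"])
```

Removing the male component from king and adding the female component leaves a vector that is royal and female, which in this toy setup is exactly the queen vector.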

113
00:05:55,000 --> 00:05:57,000
So here I'm going to execute this.

114
00:05:57,000 --> 00:05:58,000
I got my vector.

115
00:05:58,000 --> 00:06:00,000
So this is my entire vector.

116
00:06:00,000 --> 00:06:01,000
So this is my entire vector.

117
00:06:01,000 --> 00:06:04,000
And again, this vector has 300 dimensions, okay?

118
00:06:04,000 --> 00:06:06,000
So this is my entire vector.

119
00:06:06,000 --> 00:06:11,000
Now I'm going to use wv.most_similar.

120
00:06:11,000 --> 00:06:13,000
And I'm just going to pass this entire vector over here.

121
00:06:13,000 --> 00:06:18,000
So once I execute this, here you'll be able to see that king is the first.

122
00:06:18,000 --> 00:06:20,000
Obviously, king should be the most similar vector.

123
00:06:20,000 --> 00:06:23,000
But here I'm also getting queen, monarch.

124
00:06:23,000 --> 00:06:23,000
Princess.

125
00:06:23,000 --> 00:06:24,000
Crown prince.

126
00:06:24,000 --> 00:06:25,000
Prince.

127
00:06:25,000 --> 00:06:26,000
Sorry.

128
00:06:26,000 --> 00:06:26,000
King.

129
00:06:26,000 --> 00:06:28,000
Sultan, queen consort.

130
00:06:28,000 --> 00:06:32,000
But here you can see that the most similar word after king is queen.

131
00:06:32,000 --> 00:06:38,000
So the vector we are getting after all this arithmetic matches most closely

132
00:06:38,000 --> 00:06:40,000
the vector for queen.
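
As an alternative to building the vector by hand, Gensim's most_similar also accepts positive and negative word lists. The sketch below wraps that call in a helper named analogy, which is a made-up name for this example; wv is assumed to be the loaded word-vector object from earlier in the video.

```python
def analogy(wv, a, b, c, topn=5):
    """Return the words closest to vector(a) - vector(b) + vector(c).

    wv is assumed to be a Gensim KeyedVectors-style object.
    """
    # Words in `positive` are added and words in `negative` are subtracted.
    return wv.most_similar(positive=[a, c], negative=[b], topn=topn)
```

Calling analogy(wv, "king", "man", "woman") expresses king minus man plus woman. As far as I know, this form leaves the input words out of the results, so king itself would not appear in the list the way it does when a raw vector is passed in.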

133
00:06:40,000 --> 00:06:40,000
Right?

134
00:06:40,000 --> 00:06:44,000
And this is what we are able to get with the help of Word2Vec.

135
00:06:44,000 --> 00:06:45,000
Try to use this, guys.

136
00:06:45,000 --> 00:06:46,000
This is an amazing model altogether.

137
00:06:46,000 --> 00:06:51,000
You'll be able to see that you can use this model as it is.

138
00:06:51,000 --> 00:06:57,000
You know, you can use word2vec-google-news-300, which will be able to solve many problems

139
00:06:57,000 --> 00:06:57,000
of yours.

140
00:06:57,000 --> 00:07:01,000
And this is just a brief idea about a pre-trained model.

141
00:07:01,000 --> 00:07:04,000
You know, what a Word2Vec pre-trained model looks like.

142
00:07:04,000 --> 00:07:07,000
You can also take your own text and train it from scratch.

143
00:07:07,000 --> 00:07:10,000
But again, that is a different process altogether.

144
00:07:10,000 --> 00:07:14,000
And this entire thing I've executed in Google Colab because the model size is quite huge.

145
00:07:14,000 --> 00:07:15,000
Okay.

146
00:07:16,000 --> 00:07:20,000
So in the upcoming videos, I'm also going to show you the Word2Vec

147
00:07:20,000 --> 00:07:24,000
technique using Gensim, where you can train the model from scratch.

148
00:07:24,000 --> 00:07:24,000
Okay.

149
00:07:24,000 --> 00:07:27,000
So yes, this was it for this particular video.

150
00:07:27,000 --> 00:07:29,000
I will see you all in the next video.

151
00:07:29,000 --> 00:07:29,000
Thank you.