1
00:00:00,000 --> 00:00:05,000
So guys, let us go ahead and start the data collection mechanism over here.

2
00:00:05,000 --> 00:00:05,000
Okay.

3
00:00:05,000 --> 00:00:12,000
Now in this data collection mechanism, first of all, uh, we definitely require some kind of input

4
00:00:12,000 --> 00:00:12,000
data.

5
00:00:12,000 --> 00:00:13,000
Right.

6
00:00:13,000 --> 00:00:15,000
And without input data, it will be very difficult to work.

7
00:00:15,000 --> 00:00:15,000
Right.

8
00:00:15,000 --> 00:00:22,000
So first of all, what I will do, I will just go ahead and write import NLTK and in NLTK, uh, I have

9
00:00:22,000 --> 00:00:26,000
a very important, uh, package.

10
00:00:26,000 --> 00:00:26,000
Right.

11
00:00:26,000 --> 00:00:28,000
So I'll say NLTK dot download.

12
00:00:28,000 --> 00:00:29,000
Okay.

13
00:00:29,000 --> 00:00:33,000
We will be specifically using the corpus called Gutenberg, okay.

14
00:00:33,000 --> 00:00:38,000
And in this Gutenberg corpus we will see there is one text file, and we'll try

15
00:00:38,000 --> 00:00:40,000
to load that particular text file. So quickly,

16
00:00:40,000 --> 00:00:48,000
Uh, I will just go ahead and write from NLTK dot corpus import Gutenberg.

17
00:00:48,000 --> 00:00:49,000
Okay.

18
00:00:49,000 --> 00:00:52,000
So I'm just going to specifically use this Gutenberg library.

19
00:00:52,000 --> 00:00:55,000
Along with this I will just go ahead and import pandas as PD.

20
00:00:56,000 --> 00:01:02,000
So I'm also going to use pandas as PD along with this.

21
00:01:02,000 --> 00:01:06,000
The next thing is that we will go ahead and load the data set.

22
00:01:06,000 --> 00:01:09,000
Now in order to load the data set quickly, I will go ahead and write.

23
00:01:09,000 --> 00:01:14,000
Data is equal to Gutenberg dot raw.

24
00:01:14,000 --> 00:01:22,000
And here, inside this Gutenberg package, there is a raw data set which is called shakespeare dash hamlet

25
00:01:22,000 --> 00:01:23,000
dot txt.

26
00:01:23,000 --> 00:01:27,000
Okay, so we will try to pick up this specific data set okay.

27
00:01:27,000 --> 00:01:30,000
Now this particular data set is already present inside this package.

28
00:01:30,000 --> 00:01:37,000
Now we will take this data set and we'll save to a file okay I will go ahead and write with open.

29
00:01:38,000 --> 00:01:43,000
And let me quickly go ahead and write my hamlet dot txt.

30
00:01:43,000 --> 00:01:45,000
I will create a new file.

31
00:01:45,000 --> 00:01:50,000
I'll say, hey, open this in write mode and bind it in the with block as file.

32
00:01:50,000 --> 00:01:55,000
And I will just go ahead and write file, dot write and whatever data I'm actually getting from that

33
00:01:55,000 --> 00:01:56,000
particular txt file.

34
00:01:56,000 --> 00:01:58,000
I'm just going to write in this okay.
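
The data collection step dictated above can be sketched as follows. This is a minimal sketch, assuming `nltk` is installed and the machine can download the corpus; the function names `load_hamlet` and `save_text` are hypothetical helpers introduced here for illustration and are not from the video.

```python
def load_hamlet() -> str:
    """Download the Gutenberg corpus if needed and return Hamlet as one string.

    Assumption: nltk is installed and network access is available.
    The import sits inside the function so the rest of the block
    runs even where nltk is missing.
    """
    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg")
    return gutenberg.raw("shakespeare-hamlet.txt")


def save_text(text: str, path: str = "hamlet.txt") -> int:
    """Write text to path in write mode and return the number of characters written."""
    with open(path, "w", encoding="utf-8") as file:
        return file.write(text)
```

Calling `save_text(load_hamlet())` should create `hamlet.txt` in the working directory, which is the file the preprocessing step reads back later.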

35
00:01:58,000 --> 00:02:02,000
So this is a very simple way of doing the data collection.

36
00:02:02,000 --> 00:02:08,000
Over here in the Gutenberg package there is a txt file which is called shakespeare dash hamlet dot txt.

37
00:02:08,000 --> 00:02:14,000
Okay, so once I execute this you will be able to see there will be one txt file that will get created.

38
00:02:14,000 --> 00:02:15,000
Okay, I'm getting an error.

39
00:02:15,000 --> 00:02:16,000
Let's see what is the error.

40
00:02:17,000 --> 00:02:23,000
Partially initialized module has no attribute internals, most likely due to a circular import.

41
00:02:23,000 --> 00:02:23,000
Okay.

42
00:02:23,000 --> 00:02:25,000
Uh, let me do one thing.

43
00:02:25,000 --> 00:02:27,000
Let me restart this kernel.

44
00:02:27,000 --> 00:02:28,000
Okay.

45
00:02:29,000 --> 00:02:31,000
Sometimes this kind of issue will definitely come up.

46
00:02:31,000 --> 00:02:33,000
So I will just go ahead and execute it.

47
00:02:34,000 --> 00:02:39,000
So here you can see NLTK Gutenberg is already up to date okay.

48
00:02:39,000 --> 00:02:42,000
And now here if you see hamlet dot txt okay.

49
00:02:42,000 --> 00:02:44,000
So why did I get that particular error?

50
00:02:44,000 --> 00:02:47,000
Because before starting the recording of the video,

51
00:02:47,000 --> 00:02:50,000
I had imported it once and I was just trying to do it again.

52
00:02:50,000 --> 00:02:56,000
So now I just restarted the kernel, and now you can see over here I'm able to get this particular

53
00:02:56,000 --> 00:02:57,000
hamlet dot txt.

54
00:02:57,000 --> 00:02:59,000
Now this is what is the hamlet dot txt.

55
00:02:59,000 --> 00:03:03,000
The Tragedy of Hamlet by William Shakespeare 1599.

56
00:03:03,000 --> 00:03:05,000
And it's a huge data set.

57
00:03:05,000 --> 00:03:08,000
And we'll take this entire data set and we'll train with LSTM RNN.

58
00:03:08,000 --> 00:03:11,000
Okay, one more thing that I really want to go ahead and tell you.

59
00:03:11,000 --> 00:03:15,000
If you do not have a powerful system, please try to execute

60
00:03:15,000 --> 00:03:18,000
all this code in Google Colab.

61
00:03:18,000 --> 00:03:24,000
In Google Colab you have free GPUs. Right now, on my side, I have a powerful workstation, so

62
00:03:24,000 --> 00:03:28,000
I think I will be able to install all these things and work with this in my local machine.

63
00:03:28,000 --> 00:03:28,000
Okay.

64
00:03:29,000 --> 00:03:31,000
Now this is the first step.

65
00:03:31,000 --> 00:03:35,000
Now we will go ahead with data preprocessing okay.

66
00:03:35,000 --> 00:03:38,000
So quickly let's go ahead and do this.

67
00:03:38,000 --> 00:03:43,000
So for data preprocessing I'm going to import numpy as np, okay, one of the libraries.

68
00:03:43,000 --> 00:03:50,000
Along with this what I will say I will go ahead and write from TensorFlow dot Keras dot preprocessing

69
00:03:51,000 --> 00:03:56,000
okay dot text I'm going to import tokenizer.

70
00:03:57,000 --> 00:04:05,000
Now with respect to this uh, I will also go ahead and import from TensorFlow dot keras dot preprocessing

71
00:04:07,000 --> 00:04:09,000
uh dot sequence.

72
00:04:09,000 --> 00:04:12,000
I'm going to go ahead and import pad underscore sequences.

73
00:04:12,000 --> 00:04:13,000
Okay.

74
00:04:13,000 --> 00:04:17,000
See uh, I will talk about why we are using this particular library.

75
00:04:17,000 --> 00:04:17,000
Okay.

76
00:04:17,000 --> 00:04:22,000
Everything will make sense later. First of all, let us go ahead and import it.

77
00:04:22,000 --> 00:04:22,000
Okay.

78
00:04:23,000 --> 00:04:29,000
Then I will also go ahead and write from sklearn dot model underscore selection import train underscore test underscore split.

79
00:04:29,000 --> 00:04:29,000
Okay.

80
00:04:29,000 --> 00:04:31,000
So this is specifically for the train test split, okay.

81
00:04:31,000 --> 00:04:33,000
Now let me talk about the first one.

82
00:04:33,000 --> 00:04:39,000
Tokenizer is just a class that is present inside tensorflow.keras, which will actually

83
00:04:39,000 --> 00:04:41,000
help you to convert your text into sequences of integer indexes.

84
00:04:41,000 --> 00:04:46,000
Pad sequences specifically makes sure that all the sentence lengths will be the same

85
00:04:46,000 --> 00:04:53,000
while we are training the entire LSTM RNN, and this last one is specifically for doing your

86
00:04:53,000 --> 00:04:53,000
train test split.

87
00:04:54,000 --> 00:04:56,000
Now I have to apply.

88
00:04:56,000 --> 00:05:02,000
I will go ahead and load my data set and I have to apply this entire tokenizer and padding sequences

89
00:05:02,000 --> 00:05:04,000
specifically in our data set.

90
00:05:04,000 --> 00:05:04,000
Right.

91
00:05:04,000 --> 00:05:06,000
So I will go ahead and write with open.

92
00:05:06,000 --> 00:05:11,000
And here let me just go ahead and write hamlet dot txt.

93
00:05:12,000 --> 00:05:17,000
Here I'm going to basically use my read mode and I'll say hey as file.

94
00:05:17,000 --> 00:05:19,000
And I will just go ahead and write.

95
00:05:19,000 --> 00:05:22,000
Text is equal to file dot read.

96
00:05:23,000 --> 00:05:26,000
And here we are specifically going to use this dot lower.

97
00:05:26,000 --> 00:05:26,000
Okay.

98
00:05:26,000 --> 00:05:33,000
So first of all I'm just lowercasing all the text and storing it in this specific variable, okay.

99
00:05:33,000 --> 00:05:37,000
Uh after this I will go ahead and tokenize the text.

100
00:05:37,000 --> 00:05:41,000
So in order to tokenize it I will say tokenizer.

101
00:05:42,000 --> 00:05:44,000
Tokenizer will be my variable.

102
00:05:44,000 --> 00:05:49,000
I will go ahead and initialize this variable with my Tokenizer class.

103
00:05:49,000 --> 00:05:57,000
After this, you'll be able to see that I will quickly go ahead and write tokenizer dot fit underscore

104
00:05:57,000 --> 00:06:00,000
on underscore texts.

105
00:06:00,000 --> 00:06:04,000
Okay, so this is a function which we are specifically going to use.

106
00:06:04,000 --> 00:06:07,000
And this text needs to be given in the form of a list, okay.

107
00:06:07,000 --> 00:06:14,000
So we just need to use this fit underscore

108
00:06:14,000 --> 00:06:19,000
on underscore texts so that it will apply this entire tokenization on the text that I'm actually giving.

109
00:06:19,000 --> 00:06:19,000
Okay.

110
00:06:19,000 --> 00:06:22,000
Now we will just go ahead and calculate our total words.

111
00:06:22,000 --> 00:06:29,000
So let me just go ahead and write total words is equal to length of tokenizer dot word index.

112
00:06:29,000 --> 00:06:31,000
We will get the word index okay.

113
00:06:31,000 --> 00:06:34,000
So word underscore index.

114
00:06:34,000 --> 00:06:37,000
And since the word index starts from one, with index zero kept for padding,

115
00:06:37,000 --> 00:06:39,000
I will just go ahead and write plus one, okay.
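
To make the word index and the plus one concrete, here is a small plain-Python sketch. It is a simplified stand-in for what the Keras `Tokenizer` does after `fit_on_texts` (lowercase, strip punctuation, index words by frequency starting at 1), not the Keras implementation itself; `build_word_index` and the sample sentence are hypothetical and used only for illustration.

```python
import string
from collections import Counter


def build_word_index(text):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    # Lowercase, replace punctuation with spaces, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    counts = Counter(words)
    # most_common() keeps first-seen order for words with equal counts.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sample = "The cat and the dog, and the bird."
word_index = build_word_index(sample)
# Index 0 is never assigned to a word, so the number of distinct
# class ids (0 .. len(word_index)) is len(word_index) + 1.
total_words = len(word_index) + 1
print(word_index)
print(total_words)
```

The extra slot matters later: the padded positions use index 0, and the one-hot output layer needs one column per possible index, including 0.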

116
00:06:39,000 --> 00:06:43,000
Now if I go ahead and see my total words um, okay.

117
00:06:43,000 --> 00:06:49,000
It is saying that, hey, no module named tensorflow dot keras dot preprocessing.

118
00:06:49,000 --> 00:06:54,000
So here guys, I have just made a minor spelling mistake: instead of

119
00:06:54,000 --> 00:06:55,000
writing processing,

120
00:06:55,000 --> 00:06:56,000
I have written this.

121
00:06:56,000 --> 00:07:00,000
Okay, so it needs a double s, like how we have written it for the top one. Right, now

122
00:07:00,000 --> 00:07:02,000
Let's go ahead and execute this.

123
00:07:02,000 --> 00:07:05,000
Now, the total number of words that I'm actually getting is 4818.

124
00:07:05,000 --> 00:07:06,000
Okay.

125
00:07:06,000 --> 00:07:11,000
And, uh, you'll be able to see that if I just go ahead and write tokenizer dot word index.

126
00:07:11,000 --> 00:07:11,000
Right.

127
00:07:11,000 --> 00:07:16,000
You'll also be able to see all the word indexes, like for 'the', the index is one.

128
00:07:16,000 --> 00:07:19,000
Then for 'and', you have the index as two.

129
00:07:19,000 --> 00:07:22,000
For 'to', you have the index as three; for 'of', four.

130
00:07:22,000 --> 00:07:22,000
Like this.

131
00:07:22,000 --> 00:07:25,000
For every word you will be able to see the indexes okay.

132
00:07:26,000 --> 00:07:29,000
Now this is what my tokenizer work is specifically doing.

133
00:07:29,000 --> 00:07:36,000
Now let me quickly go ahead and create some input and output

134
00:07:36,000 --> 00:07:37,000
sequences. Right, now,

135
00:07:37,000 --> 00:07:42,000
how do I actually go ahead and create input and output sequences for this particular data set?

136
00:07:43,000 --> 00:07:47,000
But here you can see with respect to every word I am able to get an index by using this tokenizer.

137
00:07:47,000 --> 00:07:47,000
Okay.

138
00:07:48,000 --> 00:07:50,000
And that is why this tokenizer is specifically used.

139
00:07:50,000 --> 00:07:51,000
Right.

140
00:07:51,000 --> 00:08:03,000
So when we are fitting it over here, we are basically creating indexes for words.

141
00:08:03,000 --> 00:08:03,000
Okay.

142
00:08:04,000 --> 00:08:04,000
Perfect.

143
00:08:04,000 --> 00:08:05,000
Till here.

144
00:08:05,000 --> 00:08:07,000
I think everybody is clear. Right, now

145
00:08:07,000 --> 00:08:10,000
let's go ahead and create my input sequences.

146
00:08:10,000 --> 00:08:16,000
So I will go ahead and create my input sequences, okay.

147
00:08:16,000 --> 00:08:18,000
So input sequences basically means I will have an input text.

148
00:08:18,000 --> 00:08:22,000
And then I should have for that input what will be the next word.

149
00:08:22,000 --> 00:08:22,000
Right.

150
00:08:22,000 --> 00:08:23,000
Something like that.

151
00:08:23,000 --> 00:08:31,000
So I will go ahead and create a list: input underscore sequences is equal to an empty list.

152
00:08:31,000 --> 00:08:39,000
I'll say for line in text dot split, I will split this entire text with respect to a new line.

153
00:08:39,000 --> 00:08:41,000
Okay, for every line.

154
00:08:41,000 --> 00:08:41,000
Okay.

155
00:08:41,000 --> 00:08:43,000
Then I will go ahead and write token.

156
00:08:43,000 --> 00:08:48,000
Underscore list is equal to and I will use the same tokenizer.

157
00:08:48,000 --> 00:08:51,000
And there will be something called texts to sequences.

158
00:08:51,000 --> 00:08:58,000
There is a function which is called texts underscore to underscore sequences, okay.

159
00:08:58,000 --> 00:09:03,000
And I will pass every line to this, wrapped in a list, okay.

160
00:09:05,000 --> 00:09:06,000
Every line.

161
00:09:06,000 --> 00:09:11,000
And from the result for that particular line, let's take the zeroth index, okay.

162
00:09:11,000 --> 00:09:13,000
Inside this particular token list okay.

163
00:09:13,000 --> 00:09:16,000
So let's see you'll be able to understand the output okay.

164
00:09:16,000 --> 00:09:17,000
What will happen.

165
00:09:17,000 --> 00:09:18,000
Just wait for some time.

166
00:09:18,000 --> 00:09:27,000
So I will go ahead and write for I in range okay I will say one comma the length of token underscore

167
00:09:27,000 --> 00:09:28,000
list okay.

168
00:09:28,000 --> 00:09:36,000
Since I'm going to add every token inside this, okay, I'm just going to write, for every line, with respect

169
00:09:36,000 --> 00:09:36,000
to all the elements.

170
00:09:36,000 --> 00:09:45,000
I will say n underscore gram underscore sequence is equal to token underscore list with respect to colon

171
00:09:45,000 --> 00:09:47,000
I plus one okay.

172
00:09:47,000 --> 00:09:48,000
You'll understand it.

173
00:09:48,000 --> 00:09:50,000
What exactly we are doing okay.

174
00:09:50,000 --> 00:09:56,000
So here input underscore sequences dot append, and what we append will be nothing but n underscore gram underscore sequence, okay.
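
The loop dictated above builds every prefix of a line that has at least two tokens. A minimal sketch in plain Python follows; the token ids are made-up values standing in for what `tokenizer.texts_to_sequences([line])[0]` would return, and `ngram_prefixes` is a hypothetical helper name.

```python
def ngram_prefixes(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    # range(1, len) mirrors the loop in the video; token_list[:i + 1]
    # is the first i + 1 tokens of the line.
    return [token_list[: i + 1] for i in range(1, len(token_list))]


# Hypothetical tokenized lines: one list of word indexes per line of text.
lines_as_tokens = [[5, 2, 9, 4], [7, 3], [8]]

input_sequences = []
for token_list in lines_as_tokens:
    input_sequences.extend(ngram_prefixes(token_list))

print(input_sequences)
```

A line with a single token contributes nothing, because there is no next word to predict from it. In every prefix the last token is the target and the tokens before it are the input.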

175
00:09:56,000 --> 00:10:00,000
Now see what we are exactly doing over here.

176
00:10:00,000 --> 00:10:02,000
We are splitting each and every line okay.

177
00:10:02,000 --> 00:10:07,000
We take uh, that particular line and we apply to this particular tokenizer.

178
00:10:07,000 --> 00:10:09,000
And if you don't know about texts to sequences, okay,

179
00:10:09,000 --> 00:10:15,000
what it is going to do is that it will just convert that text into some sequences,

180
00:10:15,000 --> 00:10:16,000
wherein each word is replaced by its index.

181
00:10:16,000 --> 00:10:18,000
Also, I will show you.

182
00:10:18,000 --> 00:10:18,000
Okay.

183
00:10:18,000 --> 00:10:20,000
Just wait.

184
00:10:20,000 --> 00:10:22,000
Finally you'll be able to see what the output will be.

185
00:10:22,000 --> 00:10:24,000
So let me execute this.

186
00:10:24,000 --> 00:10:27,000
Now, you know, this will basically be my input sequence.

187
00:10:27,000 --> 00:10:29,000
Now if you see the input sequence.

188
00:10:29,000 --> 00:10:30,000
Right, right.

189
00:10:30,000 --> 00:10:35,000
So for the first sentence right, let's say this is my first sentence.

190
00:10:35,000 --> 00:10:36,000
Actus primus.

191
00:10:36,000 --> 00:10:37,000
Scoena Prima, right.

192
00:10:37,000 --> 00:10:40,000
So first of all see I'm able to get this 1687.

193
00:10:40,000 --> 00:10:41,000
Right.

194
00:10:41,000 --> 00:10:46,000
So in short, what this is basically doing is converting the words of every sentence into indexes, line by

195
00:10:46,000 --> 00:10:48,000
line, right, line by line.

196
00:10:48,000 --> 00:10:50,000
And we are getting that particular value.

197
00:10:50,000 --> 00:10:52,000
Now see first sentence is basically having two words.

198
00:10:52,000 --> 00:10:54,000
Second sentence is having three words.

199
00:10:54,000 --> 00:10:56,000
Well then you are having this.

200
00:10:56,000 --> 00:10:57,000
Then you're having this.

201
00:10:57,000 --> 00:10:59,000
And this is how it is basically getting divided.

202
00:10:59,000 --> 00:11:03,000
Like every sentence is basically converted into input sequences.

203
00:11:03,000 --> 00:11:03,000
Okay.

204
00:11:03,000 --> 00:11:04,000
So let me do one thing.

205
00:11:04,000 --> 00:11:09,000
Let me just rename this variable like this: input underscore sequences.

206
00:11:09,000 --> 00:11:13,000
So this is the kind of data pre-processing that we really need to do okay.

207
00:11:13,000 --> 00:11:14,000
Perfect.

208
00:11:14,000 --> 00:11:15,000
Till here.

209
00:11:15,000 --> 00:11:16,000
Uh I hope everybody is fine.

210
00:11:16,000 --> 00:11:19,000
We have just converted every sentence into a sequence of indexes.

211
00:11:19,000 --> 00:11:23,000
Now the one thing that we really need to do is apply pad sequences.

212
00:11:23,000 --> 00:11:27,000
We have to make sure that all the sentences are of equal length.

213
00:11:27,000 --> 00:11:27,000
Right.

214
00:11:27,000 --> 00:11:32,000
So for this I will just go ahead and apply my pad sequences Okay.

215
00:11:32,000 --> 00:11:37,000
Here I'm going to basically use max underscore sequence.

216
00:11:37,000 --> 00:11:41,000
Underscore length is equal to max of.

217
00:11:43,000 --> 00:11:48,000
I will just go ahead and apply this max of.

218
00:11:49,000 --> 00:12:00,000
I will use one list comprehension: length of x, for x in input underscore sequences.

219
00:12:00,000 --> 00:12:07,000
Now what I'm actually doing is that we are iterating through

220
00:12:07,000 --> 00:12:11,000
each and every sequence, for every sentence, and then we are calculating the length, right?

221
00:12:11,000 --> 00:12:16,000
And we take whatever max length we get over here.

222
00:12:16,000 --> 00:12:20,000
So if I just go ahead and execute and see this max underscore sequence underscore length, you'll be

223
00:12:20,000 --> 00:12:22,000
able to see that 14.

224
00:12:22,000 --> 00:12:22,000
Right?

225
00:12:22,000 --> 00:12:27,000
14 is the maximum length with respect to all the sentences that we have over here right now.

226
00:12:27,000 --> 00:12:33,000
What I'm actually going to do I'll take uh input sequence, something like this I will apply this input

227
00:12:33,000 --> 00:12:36,000
sequences and I'll say hey go ahead and use this NP dot array.

228
00:12:37,000 --> 00:12:39,000
Uh NP dot array okay.

229
00:12:39,000 --> 00:12:45,000
And I will apply this pad underscore sequences on all the input sequences, okay,

230
00:12:45,000 --> 00:12:52,000
all the input sequences, with maxlen equal to nothing but this max sequence length.

231
00:12:52,000 --> 00:12:53,000
Okay.

232
00:12:53,000 --> 00:12:58,000
The reason for max sequence length is that I'm basically saying, hey, make all the sentences a maximum of

233
00:12:58,000 --> 00:13:01,000
14 indexes, that is, 14 words.

234
00:13:01,000 --> 00:13:04,000
And here, what kind of padding will I be using?

235
00:13:04,000 --> 00:13:05,000
So padding will be nothing

236
00:13:05,000 --> 00:13:08,000
but something like pre padding or post padding.

237
00:13:08,000 --> 00:13:08,000
It is up to you.

238
00:13:08,000 --> 00:13:14,000
Now if you go ahead and see the input sequences, every sentence will be of the same length, and

239
00:13:14,000 --> 00:13:16,000
the maximum length will be 14 words.

240
00:13:16,000 --> 00:13:19,000
And we have added zeros in the front, right.

241
00:13:19,000 --> 00:13:22,000
So that is the reason you'll be able to find this kind of input sequence.
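
Pre padding, as described above, can be illustrated without any library. This is a plain-Python sketch of the idea behind calling `pad_sequences` with `maxlen` set to the maximum sequence length and padding set to pre; `pad_pre` is a hypothetical helper and the sequences are made-up examples, so treat it as an illustration rather than the Keras code.

```python
def pad_pre(sequences, max_len, value=0):
    """Left-pad every sequence with `value` until it has max_len items.

    Assumption: no sequence is longer than max_len, which holds when
    max_len is computed from the same sequences.
    """
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]


input_sequences = [[5, 2], [5, 2, 9], [5, 2, 9, 4]]
max_sequence_length = max(len(x) for x in input_sequences)
padded = pad_pre(input_sequences, max_sequence_length)

for row in padded:
    print(row)
```

Padding at the front keeps the target word in the last column of every row, which is what makes the later split into predictors and label a simple slice.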

242
00:13:22,000 --> 00:13:22,000
Right.

243
00:13:22,000 --> 00:13:25,000
So here we have specifically done the pre padding, okay.

244
00:13:26,000 --> 00:13:32,000
Now uh, it's time, uh, that we divide this entire data set into training and test data set.

245
00:13:32,000 --> 00:13:33,000
Now see training and test data set.

246
00:13:33,000 --> 00:13:37,000
How should we specifically divide it? For my x,

247
00:13:37,000 --> 00:13:43,000
let's consider the first row right here; I will probably take just one example.

248
00:13:43,000 --> 00:13:47,000
What I will do is that I will just take, or let me write it over here.

249
00:13:47,000 --> 00:13:47,000
Okay.

250
00:13:47,000 --> 00:13:56,000
So here I will say create predictors and label, okay.

251
00:13:57,000 --> 00:14:04,000
So here I'm going to write import TensorFlow as tf, okay.

252
00:14:04,000 --> 00:14:05,000
Just a second.

253
00:14:05,000 --> 00:14:08,000
Import TensorFlow as tf, okay.

254
00:14:11,000 --> 00:14:15,000
Now once I import TensorFlow as tf, I will go ahead and create my x and y.

255
00:14:15,000 --> 00:14:19,000
So this x will be my independent features, y will be my dependent feature.

256
00:14:19,000 --> 00:14:21,000
I will go ahead and write input sequences.

257
00:14:21,000 --> 00:14:26,000
From input underscore sequences I will take all the words and just remove the final word, okay.

258
00:14:27,000 --> 00:14:34,000
And in the output sequence, which is my y, I will take this input sequences, I will take all the

259
00:14:34,000 --> 00:14:38,000
words and I'll just take the last word okay.

260
00:14:38,000 --> 00:14:40,000
So the last word will specifically go to y.

261
00:14:40,000 --> 00:14:46,000
And all the words before the last will go to my independent features,

262
00:14:46,000 --> 00:14:46,000
okay.
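
The split into predictors and label described above is a slice on the last column. Here is a plain-Python sketch with made-up padded rows; with a NumPy array the same idea would usually be written as `input_sequences[:, :-1]` and `input_sequences[:, -1]`, which is an assumption about the exact code on screen.

```python
# Hypothetical pre-padded rows: zeros in front, target word index last.
padded = [[0, 0, 5, 2], [0, 5, 2, 9], [5, 2, 9, 4]]

# x: every index except the last one (the independent features).
x = [row[:-1] for row in padded]
# y: only the last index of each row (the dependent feature, the next word).
y = [row[-1] for row in padded]

print(x)
print(y)
```

Each row of `x` is therefore one position shorter than the padded sequence length, and `y` holds one word index per row.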

263
00:14:46,000 --> 00:14:52,000
Now similarly, if I go ahead and write y, I'll just go ahead and say, hey, tf dot keras.

264
00:14:52,000 --> 00:14:59,000
So I will just go ahead and write tf dot keras dot utils dot to underscore.

265
00:14:59,000 --> 00:15:01,000
Or first of all, let me just go ahead and execute this okay.

266
00:15:02,000 --> 00:15:05,000
So here you'll be able to see this is my x okay.

267
00:15:05,000 --> 00:15:06,000
All the input features.

268
00:15:06,000 --> 00:15:08,000
And when I go ahead and see my y.

269
00:15:08,000 --> 00:15:11,000
So here you'll be able to see these indexes, okay.

270
00:15:11,000 --> 00:15:12,000
With respect to this.

271
00:15:12,000 --> 00:15:14,000
So many indexes will be there, okay.

272
00:15:14,000 --> 00:15:19,000
Now for this y, we cannot just keep it as indexes, right.

273
00:15:19,000 --> 00:15:22,000
Based on the number of classes some words may be repeated.

274
00:15:22,000 --> 00:15:23,000
Some words may not be repeated.

275
00:15:23,000 --> 00:15:25,000
We'll try to convert this into categories.

276
00:15:25,000 --> 00:15:33,000
So in order to convert this into categories I will say tf dot keras dot utils dot to underscore categorical.

277
00:15:34,000 --> 00:15:41,000
Here I'm going to basically take y comma num underscore classes is equal to total underscore

278
00:15:41,000 --> 00:15:42,000
words okay.

279
00:15:42,000 --> 00:15:48,000
Now if I go ahead and see my y, wherever this particular index is present,

280
00:15:48,000 --> 00:15:49,000
that will be one, and the remaining

281
00:15:49,000 --> 00:15:50,000
all will be zero.

282
00:15:50,000 --> 00:15:53,000
Wherever this index will be present, that will be one and remaining all will be zero.

283
00:15:53,000 --> 00:15:56,000
Wherever this index is present, that will be one.

284
00:15:56,000 --> 00:15:57,000
Otherwise remaining all will be zero.
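
The conversion to categories described above is one-hot encoding. The following plain-Python sketch shows the shape of the result that `tf.keras.utils.to_categorical(y, num_classes=total_words)` is expected to produce; `one_hot` is a hypothetical helper and the label values are made up for illustration.

```python
def one_hot(labels, num_classes):
    """Turn each integer label into a row of zeros with a single 1.0 at that index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # the position of the word index is set to one
        rows.append(row)
    return rows


y = [2, 0, 3]      # hypothetical word indexes of the next word
total_words = 5    # hypothetical vocabulary size, including index 0

y_categorical = one_hot(y, total_words)
for row in y_categorical:
    print(row)
```

Every row has `total_words` columns, which is why the vocabulary size was computed with the extra slot for index 0.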

285
00:15:57,000 --> 00:16:04,000
So we have actually converted our output feature with respect to all the categories, right. Now it's time

286
00:16:04,000 --> 00:16:07,000
that we go ahead and split our data set into train and test.

287
00:16:07,000 --> 00:16:14,000
So here you can go and do this: I have called train underscore test underscore split with x, y, and test underscore size equal to 0.2, okay.
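
To show what an 80/20 split does to paired x and y, here is a standard-library sketch. It is an illustrative stand-in, not scikit-learn's implementation; `split_train_test`, the fixed seed, and the toy data are assumptions made for the example. In the video the call is `train_test_split(x, y, test_size=0.2)`.

```python
import math
import random


def split_train_test(x, y, test_size=0.2, seed=42):
    """Shuffle row positions once, then cut x and y at the same positions."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    # Same return order as the scikit-learn function: x_train, x_test, y_train, y_test.
    return pick(x, train_idx), pick(x, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data: row i of x belongs with label i of y.
x = [[i] for i in range(10)]
y = list(range(10))

x_train, x_test, y_train, y_test = split_train_test(x, y, test_size=0.2)
print(len(x_train), len(x_test))
```

The point of shuffling indexes rather than the lists themselves is that each input row stays paired with its own label after the split.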

288
00:16:14,000 --> 00:16:20,000
So once I execute this, we are ready to go ahead and train our LSTM

289
00:16:20,000 --> 00:16:21,000
RNN, right.

290
00:16:22,000 --> 00:16:25,000
Train our LSTM RNN.

291
00:16:25,000 --> 00:16:25,000
rnn.

292
00:16:26,000 --> 00:16:32,000
So this process of LSTM RNN training is definitely going to take time.

293
00:16:32,000 --> 00:16:37,000
So till here I think we have done the data ingestion and data preprocessing.

294
00:16:37,000 --> 00:16:40,000
We have our input features with respect to the index.

295
00:16:40,000 --> 00:16:42,000
And we have our output features. Right, now

296
00:16:42,000 --> 00:16:45,000
It's time that we go ahead and train our entire model.

297
00:16:45,000 --> 00:16:48,000
And that will be shown in our next video.

298
00:16:48,000 --> 00:16:50,000
So I hope you like this particular video.

299
00:16:50,000 --> 00:16:51,000
This was it from my side.

300
00:16:51,000 --> 00:16:52,000
I will see you in the next video.

301
00:16:52,000 --> 00:16:52,000
Thank you.