1
00:00:00,000 --> 00:00:04,000
So, guys, uh, we are going to continue the discussion with respect to this particular project.

2
00:00:04,000 --> 00:00:07,000
Uh, you can see over here all the libraries have been installed.

3
00:00:07,000 --> 00:00:08,000
Okay.

4
00:00:08,000 --> 00:00:10,000
And I even did not get any kind of error.

5
00:00:10,000 --> 00:00:11,000
Okay.

6
00:00:11,000 --> 00:00:14,000
Now, I have also activated this specific environment in my terminal.

7
00:00:14,000 --> 00:00:20,000
So let me just, uh, hide my terminal right now, because I will not be requiring it. What I will do,

8
00:00:20,000 --> 00:00:23,000
Uh, along with this, one last thing: I also need to do one more installation.

9
00:00:23,000 --> 00:00:25,000
That is nothing but ipykernel.

10
00:00:25,000 --> 00:00:31,000
So since the first experiment that I'm actually going to do, I'm going to do in my Jupyter notebook.

11
00:00:31,000 --> 00:00:37,000
And for that I require this ipykernel library so that I'll be able to attach a kernel and execute my

12
00:00:37,000 --> 00:00:38,000
Jupyter notebook file.

13
00:00:38,000 --> 00:00:43,000
So here I'm actually going to write pip install ipykernel.

14
00:00:43,000 --> 00:00:43,000
Okay.

15
00:00:43,000 --> 00:00:50,000
So once I execute this you will be able to see that the installation will take place.

16
00:00:50,000 --> 00:00:50,000
Okay.

17
00:00:50,000 --> 00:00:57,000
So once this ipykernel has been installed, then, uh, I'm going to probably start my coding.

18
00:00:57,000 --> 00:01:04,000
And the first thing that I will do is that I will read this entire CSV file and let me do also one more

19
00:01:04,000 --> 00:01:04,000
thing.

20
00:01:04,000 --> 00:01:10,000
Let me just go ahead and create my experiments.ipynb file.

21
00:01:10,000 --> 00:01:10,000
Okay.

22
00:01:10,000 --> 00:01:13,000
So I'm going to use this .ipynb file.

23
00:01:13,000 --> 00:01:16,000
And in this my entire experiment will happen okay.

24
00:01:16,000 --> 00:01:20,000
So right now here you can see my ipykernel library is basically getting installed.

25
00:01:20,000 --> 00:01:22,000
That is going to take some amount of time.

26
00:01:22,000 --> 00:01:23,000
Now it is done.

27
00:01:23,000 --> 00:01:26,000
Now I will quickly, uh, hide this terminal.

28
00:01:26,000 --> 00:01:27,000
Okay.

29
00:01:28,000 --> 00:01:30,000
Um, let me just go ahead and select my kernel.

30
00:01:30,000 --> 00:01:34,000
First of all, like here I have selected my kernel with, uh, Python 3.11.

31
00:01:34,000 --> 00:01:38,000
Okay, now I will just go ahead and make some more code cells.

32
00:01:38,000 --> 00:01:41,000
Now let's start this entire coding okay.

33
00:01:41,000 --> 00:01:47,000
So first of all, uh, what I'm actually going to do is that, uh, I'm going to import pandas.

34
00:01:47,000 --> 00:01:50,000
So let me just go ahead and import pandas as pd.

35
00:01:51,000 --> 00:01:58,000
Uh, along with this I will go ahead and import from sklearn.model_selection, because I also have

36
00:01:58,000 --> 00:01:59,000
to do the train test split.

37
00:01:59,000 --> 00:02:00,000
Right.

38
00:02:00,000 --> 00:02:05,000
So for that I will be using this model_selection: import train_test_split.

39
00:02:05,000 --> 00:02:09,000
I hope you know this from the machine learning which we have actually used.

40
00:02:10,000 --> 00:02:16,000
And then along with this I will also go ahead and import from sklearn.preprocessing.

41
00:02:16,000 --> 00:02:21,000
So here I will just go ahead and write preprocessing.

42
00:02:22,000 --> 00:02:24,000
And then I'm going to import two important classes.

43
00:02:24,000 --> 00:02:30,000
One is the standard scaler, and the other one that I'm actually going to use is something called the label

44
00:02:30,000 --> 00:02:31,000
encoder.

45
00:02:32,000 --> 00:02:32,000
Okay.

46
00:02:32,000 --> 00:02:37,000
So these two classes I'm actually going to use. Along with this, I'm going to go ahead and import pickle

47
00:02:37,000 --> 00:02:38,000
file okay.

48
00:02:38,000 --> 00:02:42,000
Pickle is specifically for pickling, because when we are using the standard scaler and label encoder

49
00:02:42,000 --> 00:02:47,000
it is important that we go ahead and pickle this file so that we can reuse it when we are doing the

50
00:02:47,000 --> 00:02:48,000
deployment.

51
00:02:48,000 --> 00:02:52,000
So these are some of the libraries that we have already installed.

52
00:02:52,000 --> 00:02:54,000
And we are just going to import it.

53
00:02:54,000 --> 00:02:54,000
Okay.

54
00:02:55,000 --> 00:02:56,000
So these are some of the basic libraries.

55
00:02:56,000 --> 00:03:00,000
Now let us go ahead and load the data set okay.

56
00:03:00,000 --> 00:03:06,000
So for this I will say data is equal to pd.read_csv.

57
00:03:06,000 --> 00:03:13,000
And here I'm going to basically read the churn underscore modeling dot csv file.

58
00:03:13,000 --> 00:03:19,000
Now over here you'll be able to see I'll just go ahead and display data dot head.

59
00:03:19,000 --> 00:03:25,000
Okay, so these are all my values or all my features that are available over here.

60
00:03:25,000 --> 00:03:28,000
I've already told you in my previous problem statement what we are going to do.

61
00:03:28,000 --> 00:03:34,000
We are going to take this credit score, geography, gender, age, tenure, balance, number of products,

62
00:03:34,000 --> 00:03:39,000
has credit card, is active member and estimated salary as my independent features.

63
00:03:39,000 --> 00:03:43,000
And I'm going to predict whether the person is going to exit the bank or not.

64
00:03:43,000 --> 00:03:43,000
Okay.

65
00:03:43,000 --> 00:03:47,000
And this is the past data from the bank which shows which person has exited or not.

66
00:03:47,000 --> 00:03:49,000
So this is just the top five records.

67
00:03:49,000 --> 00:03:55,000
Now the first important thing is that these first three features are not that important, right?

68
00:03:55,000 --> 00:03:58,000
You have row number, you have customer ID, you have surname.

69
00:03:58,000 --> 00:04:01,000
These definitely will not play a very important role.

70
00:04:01,000 --> 00:04:10,000
So first of all what we will do we will go ahead and pre-process the data.

71
00:04:10,000 --> 00:04:10,000
Okay.

72
00:04:10,000 --> 00:04:17,000
What we are going to do here, we are just going to drop irrelevant features okay.

73
00:04:17,000 --> 00:04:20,000
Which we can directly see, okay.

74
00:04:20,000 --> 00:04:22,000
So here uh I will use this data variable.

75
00:04:22,000 --> 00:04:27,000
And from this I will say hey go ahead and drop which all features we need to drop okay.

76
00:04:27,000 --> 00:04:29,000
So drop is a function over here.

77
00:04:29,000 --> 00:04:32,000
We really need to drop this row number.

78
00:04:32,000 --> 00:04:36,000
So I'll copy and paste it over here.

79
00:04:36,000 --> 00:04:41,000
Then we have something called as customer ID.

80
00:04:43,000 --> 00:04:43,000
Okay.

81
00:04:44,000 --> 00:04:47,000
Then we have something called as surname.

82
00:04:48,000 --> 00:04:48,000
Okay.

83
00:04:48,000 --> 00:04:51,000
And we are going to drop all these things with axis equal to one.

84
00:04:51,000 --> 00:04:53,000
Axis is equal to one basically means column wise.

85
00:04:53,000 --> 00:04:54,000
Okay.

86
00:04:54,000 --> 00:04:59,000
So once we do this particular drop we will be able to get the data value.

87
00:04:59,000 --> 00:05:00,000
So here is my data value.

88
00:05:00,000 --> 00:05:06,000
Now here you can see my features are starting from credit score geography gender age balance and all

89
00:05:06,000 --> 00:05:08,000
these values are there okay perfect.

90
00:05:08,000 --> 00:05:09,000
So this is good enough.

91
00:05:09,000 --> 00:05:10,000
Uh we have done this.
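The drop step just described can be sketched roughly like this. It is only a minimal sketch: the tiny data frame below is an invented stand-in for the churn CSV, and the exact column spellings (RowNumber, CustomerId, Surname) are an assumption about how the file names them.

```python
import pandas as pd

# Invented stand-in rows; the real notebook reads the churn CSV instead.
data = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
    "Exited": [1, 0, 1],
})

# axis=1 tells pandas that the labels are column names, not row labels.
data = data.drop(["RowNumber", "CustomerId", "Surname"], axis=1)
print(data.columns.tolist())
```

After this, only the columns that can plausibly help the prediction, plus the target column, are left.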

92
00:05:10,000 --> 00:05:17,000
Now here you can actually see we have a column called as, uh, geography over here.

93
00:05:17,000 --> 00:05:17,000
Right.

94
00:05:17,000 --> 00:05:18,000
So geography is there.

95
00:05:18,000 --> 00:05:19,000
Okay.

96
00:05:19,000 --> 00:05:22,000
And then, um, we also have this gender.

97
00:05:22,000 --> 00:05:25,000
We have this, uh, these two features.

98
00:05:25,000 --> 00:05:27,000
We specifically have, uh, geography and gender.

99
00:05:28,000 --> 00:05:33,000
Now geography and gender, uh, in this particular case, you can see each of them is a categorical

100
00:05:33,000 --> 00:05:34,000
variable.

101
00:05:34,000 --> 00:05:34,000
Okay.

102
00:05:34,000 --> 00:05:41,000
Now in the case of categorical variable, what I can actually do is that I can apply some kind of encoding

103
00:05:41,000 --> 00:05:42,000
for this categorical variable.

104
00:05:42,000 --> 00:05:42,000
Okay.

105
00:05:42,000 --> 00:05:44,000
And that is what I am actually going to do.

106
00:05:45,000 --> 00:05:51,000
Now let's go ahead and encode one categorical variable.

107
00:05:53,000 --> 00:06:01,000
Now in this I will say hey um let's take first of all gender I'll say label underscore encoder underscore

108
00:06:01,000 --> 00:06:10,000
gender is equal to LabelEncoder(), okay. I'm going to just use this, and I'm going to say, hey, df of gender,

109
00:06:10,000 --> 00:06:16,000
or rather data of gender, because we have used this particular variable name, data. So, data of gender.

110
00:06:16,000 --> 00:06:20,000
And I'll just use this label encoder.

111
00:06:22,000 --> 00:06:25,000
Encoder underscore gender dot fit.

112
00:06:26,000 --> 00:06:29,000
And along with fit we will also be using transform.

113
00:06:29,000 --> 00:06:32,000
And we have already discussed this why we use fit underscore transform.

114
00:06:32,000 --> 00:06:35,000
And here I'm going to take this particular data.

115
00:06:35,000 --> 00:06:39,000
I'm going to basically apply this on my gender column itself.

116
00:06:39,000 --> 00:06:39,000
Right.

117
00:06:39,000 --> 00:06:41,000
So now let's go ahead and see the data.

118
00:06:41,000 --> 00:06:44,000
And let's see with respect to this particular gender.

119
00:06:44,000 --> 00:06:48,000
So here you can see, okay, I'm getting an error. It's a spelling mistake. I'll go ahead and correct this one,

120
00:06:48,000 --> 00:06:49,000
okay.

121
00:06:49,000 --> 00:06:54,000
So once I do this, my gender column will have two main categories, right.

122
00:06:54,000 --> 00:06:55,000
One is male and the other is female.

123
00:06:55,000 --> 00:06:59,000
So that male and female with the help of label encoder has got converted to zeros and ones.

124
00:06:59,000 --> 00:07:02,000
Okay, we have also seen this in machine learning okay.

125
00:07:02,000 --> 00:07:10,000
So this is one of the encoding techniques with respect to the categorical variable okay I hope you are

126
00:07:10,000 --> 00:07:11,000
able to understand this.
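As a small illustration of the label encoding step described above, here is a minimal sketch. The four gender values are invented sample data; the variable name label_encoder_gender follows the one used in the video.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample values standing in for the Gender column.
data = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female"]})

label_encoder_gender = LabelEncoder()
# fit learns the distinct categories, transform replaces each with its integer code.
data["Gender"] = label_encoder_gender.fit_transform(data["Gender"])

print(data["Gender"].tolist())              # [0, 1, 1, 0]
print(list(label_encoder_gender.classes_))  # categories in sorted order
```

LabelEncoder numbers the categories in sorted order, so here Female becomes 0 and Male becomes 1.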

127
00:07:12,000 --> 00:07:16,000
So guys uh over here in the gender we saw that our values are male and female.

128
00:07:16,000 --> 00:07:19,000
And so that is the reason we use label encoder.

129
00:07:19,000 --> 00:07:21,000
And we changed it to zeros and ones.

130
00:07:21,000 --> 00:07:23,000
But what about the geography column?

131
00:07:23,000 --> 00:07:26,000
In this geography column you basically have two to three categories, right.

132
00:07:26,000 --> 00:07:30,000
Two to three categories like France, Spain and

133
00:07:30,000 --> 00:07:31,000
Germany.

134
00:07:31,000 --> 00:07:31,000
Right.

135
00:07:31,000 --> 00:07:33,000
So all these country names are specifically there.

136
00:07:33,000 --> 00:07:42,000
So here if we apply, just imagine that if we go ahead and uh, apply, uh, you know the label encoder.

137
00:07:42,000 --> 00:07:47,000
So let me just go ahead and open my pen so that I can explain it to you over here.

138
00:07:47,000 --> 00:07:52,000
Why, uh, why I'm specifically focusing on this particular geography column.

139
00:07:52,000 --> 00:07:52,000
Right.

140
00:07:52,000 --> 00:07:58,000
So let's say I just go ahead and label encode this categorical column.

141
00:07:58,000 --> 00:08:03,000
So here you'll be able to see I would get, say, France as zero, Spain as one, and Germany as two.

142
00:08:03,000 --> 00:08:09,000
And when these types of numbers are specifically coming up, then there is a problem because, uh,

143
00:08:09,000 --> 00:08:16,000
when we assign a value of two to Germany, at the end of the day, an ANN is all about, you know,

144
00:08:16,000 --> 00:08:17,000
numerical calculations.

145
00:08:18,000 --> 00:08:22,000
Since the number or since the label of Germany is two, it will just consider that Germany is greater

146
00:08:22,000 --> 00:08:25,000
than Spain or Spain is greater than France.

147
00:08:25,000 --> 00:08:25,000
Right.

148
00:08:25,000 --> 00:08:27,000
So this should not happen.
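To make the ordering problem concrete, here is a hedged sketch of what label encoding does to the country names. The four values are invented sample data. Note that scikit-learn assigns the codes in sorted order, so the exact numbers differ from the ones spoken above, but the issue is the same.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample values standing in for the Geography column.
countries = ["France", "Spain", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# Map each category to the integer code it received.
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)         # {'France': 0, 'Germany': 1, 'Spain': 2}
print(codes.tolist())  # [0, 2, 1, 0]
```

A network doing plain arithmetic on these codes can read them as France < Germany < Spain, which is an order the countries do not actually have.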

149
00:08:27,000 --> 00:08:33,000
So for this particular case, what we will do is that we will not directly convert this with the help

150
00:08:33,000 --> 00:08:35,000
of label encoder, but instead what we will do.

151
00:08:35,000 --> 00:08:39,000
We'll go ahead and write some code and we will use one hot encoding okay.

152
00:08:39,000 --> 00:08:43,000
Now one hot encoding will just actually give us some values with respect to zeros and ones.

153
00:08:43,000 --> 00:08:44,000
Okay.

154
00:08:44,000 --> 00:08:51,000
So now, uh, I will just go ahead and write: one hot encode the geography column.

155
00:08:51,000 --> 00:08:51,000
Okay.

156
00:08:51,000 --> 00:08:54,000
So here I'm just going to write geography column over here.

157
00:08:54,000 --> 00:08:59,000
Now quickly let's first of all go ahead and import the one hot encoder.

158
00:08:59,000 --> 00:09:03,000
So I will say from sklearn.preprocessing.

159
00:09:04,000 --> 00:09:13,000
From sklearn.preprocessing I'm going to just import OneHotEncoder.

160
00:09:13,000 --> 00:09:18,000
Let me just see this. I'm looking at the documentation side by side.

161
00:09:18,000 --> 00:09:22,000
So it will be helpful for me to also check it out directly instead of wasting time.

162
00:09:22,000 --> 00:09:23,000
Okay.

163
00:09:23,000 --> 00:09:25,000
And, uh, writing each and every thing.

164
00:09:25,000 --> 00:09:25,000
Okay.

165
00:09:25,000 --> 00:09:29,000
So here what I'm actually going to do, I'm going to specifically use one hot encoder.

166
00:09:29,000 --> 00:09:32,000
Now quickly I will just go ahead and use this one hot encoder.

167
00:09:32,000 --> 00:09:33,000
I will initialize this.

168
00:09:34,000 --> 00:09:37,000
And this time there is a parameter which is called as sparse.

169
00:09:37,000 --> 00:09:37,000
Right.

170
00:09:37,000 --> 00:09:41,000
So I will just set this parameter: sparse is equal to false.

171
00:09:41,000 --> 00:09:44,000
I will let you know why we are basically using this.

172
00:09:44,000 --> 00:09:47,000
And once you see the encoded values then you'll be able to understand.

173
00:09:48,000 --> 00:09:54,000
I'll create a variable called as one hot underscore encoder underscore geo okay.

174
00:09:54,000 --> 00:09:58,000
So this will basically be my values over here okay.

175
00:09:58,000 --> 00:10:06,000
Now uh quickly uh, the next step that uh we are going to specifically use is nothing but geo underscore

176
00:10:06,000 --> 00:10:07,000
encoded value.

177
00:10:07,000 --> 00:10:19,000
And I'll say, hey, let's apply this one hot encoder: one hot underscore encoder underscore geo dot fit, dot fit

178
00:10:19,000 --> 00:10:29,000
underscore, okay, one hot underscore encoder underscore geo dot fit underscore transform, on which column? Basically on my geography

179
00:10:29,000 --> 00:10:30,000
column.

180
00:10:30,000 --> 00:10:30,000
Right.

181
00:10:30,000 --> 00:10:36,000
So here I'm going to give my geography column right now.

182
00:10:36,000 --> 00:10:38,000
Once this is done you'll be able to see that.

183
00:10:38,000 --> 00:10:47,000
Let's go ahead and see how I'm actually going to get this geo underscore sorry geo underscore encoder.

184
00:10:48,000 --> 00:10:50,000
Um and let's execute this.

185
00:10:50,000 --> 00:10:54,000
So here it says uh got an unexpected argument.

186
00:10:54,000 --> 00:10:57,000
Sparse is equal to false okay.

187
00:10:57,000 --> 00:10:59,000
Let me just remove it for now.

188
00:10:59,000 --> 00:11:00,000
Onwards.

189
00:11:00,000 --> 00:11:00,000
Okay.

190
00:11:00,000 --> 00:11:02,000
Let's execute this.

191
00:11:02,000 --> 00:11:06,000
So here I'm actually able to get a sparse row matrix okay.

192
00:11:06,000 --> 00:11:11,000
Uh I think in one hot encoder I have something called as sparse.

193
00:11:11,000 --> 00:11:13,000
Sparse is equal to false.

194
00:11:13,000 --> 00:11:14,000
Let's see.

195
00:11:15,000 --> 00:11:19,000
Um, got an unexpected argument called as sparse.

196
00:11:19,000 --> 00:11:22,000
I think in the recent version the sparse is not present.
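For context on this error: in recent scikit-learn releases the sparse argument of OneHotEncoder was renamed to sparse_output, and the old name was later removed. The helper below is a hypothetical sketch (make_dense_one_hot_encoder is not part of the video or of scikit-learn) that asks for dense output whichever name the installed version accepts. The three country values are invented sample data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def make_dense_one_hot_encoder():
    """Hypothetical helper: return an encoder that outputs a dense array."""
    try:
        # Newer scikit-learn versions use sparse_output.
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        # Older versions only know the original sparse argument.
        return OneHotEncoder(sparse=False)


# Invented sample values, shaped as one column.
geography = np.array([["France"], ["Spain"], ["Germany"]])

one_hot_encoder_geo = make_dense_one_hot_encoder()
geo_encoder = one_hot_encoder_geo.fit_transform(geography)
print(type(geo_encoder).__name__, geo_encoder.shape)
```

The video takes the other route, which also works: leave the encoder at its default sparse output and convert the result afterwards.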

197
00:11:22,000 --> 00:11:24,000
Let me just go ahead and use directly this one.

198
00:11:24,000 --> 00:11:25,000
Okay.

199
00:11:25,000 --> 00:11:30,000
Now once I get this geo encoder, uh, over here.

200
00:11:30,000 --> 00:11:31,000
Right.

201
00:11:31,000 --> 00:11:36,000
What I will do is that, uh, if I use this geo encoder, okay.

202
00:11:36,000 --> 00:11:41,000
Or if I also go ahead and see what all features specifically we have.

203
00:11:41,000 --> 00:11:45,000
If I try to convert this into a numpy array, I will be able to see all the values.

204
00:11:45,000 --> 00:11:54,000
So first of all what I will do I'll just go ahead and write geo or sorry one hot one hot underscore

205
00:11:54,000 --> 00:12:00,000
encoder underscore geo dot get underscore features.

206
00:12:00,000 --> 00:12:03,000
Underscore names underscore out.

207
00:12:03,000 --> 00:12:04,000
Right.

208
00:12:04,000 --> 00:12:07,000
So if I go ahead and see this parameter okay.

209
00:12:07,000 --> 00:12:08,000
Sorry out.

210
00:12:08,000 --> 00:12:11,000
And here I'm just going to call the function.

211
00:12:11,000 --> 00:12:17,000
And inside this I'm just going to give my column, geography.

212
00:12:17,000 --> 00:12:22,000
So if I go ahead and execute it it says get underscore features okay.

213
00:12:22,000 --> 00:12:29,000
It should be feature sorry get underscore feature underscore names underscore out okay.

214
00:12:30,000 --> 00:12:35,000
So here you'll be able to see, with respect to geography, I have these three features that are getting

215
00:12:35,000 --> 00:12:38,000
created by the one hot encoder.

216
00:12:38,000 --> 00:12:39,000
Right.

217
00:12:39,000 --> 00:12:44,000
So, uh, with respect to this, if I try to convert, I'll use these particular feature names and I will

218
00:12:44,000 --> 00:12:49,000
be putting this, uh, sparse matrix that I've actually got into the form of a data frame.

219
00:12:49,000 --> 00:12:53,000
Now, in order to get that in the form of data frame, what I will do, I will just go ahead and write

220
00:12:53,000 --> 00:12:59,000
geo underscore encoder or geo underscore encoder.

221
00:12:59,000 --> 00:13:06,000
Uh, here I will use this particular value and I will convert this entire thing into a pd.DataFrame.

222
00:13:06,000 --> 00:13:08,000
So I'll just go ahead and write pd.DataFrame.

223
00:13:08,000 --> 00:13:12,000
And here what I'll say is, this is my data: geo underscore encoder.

224
00:13:12,000 --> 00:13:18,000
Along with that uh if I really want to assign the columns to this, it will be nothing but this entire

225
00:13:18,000 --> 00:13:18,000
values.

226
00:13:18,000 --> 00:13:22,000
Because from this I'm actually able to get the columns okay.

227
00:13:23,000 --> 00:13:26,000
So this is my entire data frame.

228
00:13:26,000 --> 00:13:30,000
And this is what I'm actually getting as my geo underscore encoded underscore df.

229
00:13:30,000 --> 00:13:30,000
Okay.

230
00:13:30,000 --> 00:13:32,000
So let me just go ahead and execute this.

231
00:13:32,000 --> 00:13:39,000
So once I execute this and if I go ahead and see my geo underscore encoded underscore df, I'm getting an

232
00:13:39,000 --> 00:13:39,000
error.

233
00:13:39,000 --> 00:13:45,000
Uh, the reason is, uh, the shape of the passed values is 10,000 comma one.

234
00:13:45,000 --> 00:13:49,000
So guys here you can see that uh I'm actually getting this particular error.

235
00:13:49,000 --> 00:13:50,000
You know, it says:

236
00:13:50,000 --> 00:13:54,000
Uh, the shape of the values that is passed is 10,000 comma one.

237
00:13:54,000 --> 00:13:59,000
And it is specifically implying that it needs 10,000 comma three because here, uh, we are giving three

238
00:13:59,000 --> 00:13:59,000
columns.

239
00:13:59,000 --> 00:14:00,000
Right.

240
00:14:00,000 --> 00:14:03,000
Now why is this specific problem happening?

241
00:14:03,000 --> 00:14:05,000
See if I go ahead and print this geo encoder.

242
00:14:05,000 --> 00:14:09,000
It shows compressed sparse row matrix of dtype right.

243
00:14:09,000 --> 00:14:16,000
Now if I convert this geo encoder with to underscore array, okay.

244
00:14:16,000 --> 00:14:19,000
And if I execute it, uh, to underscore array.

245
00:14:19,000 --> 00:14:19,000
Sorry.

246
00:14:19,000 --> 00:14:20,000
It should be toarray.

247
00:14:20,000 --> 00:14:24,000
So here you will be able to see that I am able to convert this sparse matrix into a dense array.

248
00:14:24,000 --> 00:14:25,000
Right.

249
00:14:25,000 --> 00:14:30,000
And uh that is the reason, you know before I was just giving this instead of giving this, I will just

250
00:14:30,000 --> 00:14:31,000
go ahead and give this.

251
00:14:31,000 --> 00:14:32,000
Okay.

252
00:14:32,000 --> 00:14:38,000
So here you will be able to see that if I go ahead and give it now, and if I go ahead and see this encoded

253
00:14:38,000 --> 00:14:44,000
underscore df, so here you can see that I'm actually able to get this geography France, geography

254
00:14:44,000 --> 00:14:48,000
Germany and geography Spain. Wherever France is, that value would be one, and the remaining

255
00:14:48,000 --> 00:14:50,000
all will be zero. Wherever Germany is,

256
00:14:50,000 --> 00:14:54,000
Um, that will be one like over here and wherever Spain is, that will be one.

257
00:14:54,000 --> 00:14:54,000
Okay.

258
00:14:54,000 --> 00:14:57,000
So this is how you fix this particular problem.

259
00:14:57,000 --> 00:15:03,000
Again, I'm not going to probably, uh, delete any of this or, uh, remove any part of this particular

260
00:15:03,000 --> 00:15:09,000
video because I want everyone to see the errors that we specifically face, you know, because once

261
00:15:09,000 --> 00:15:15,000
you see the error, you probably become much more confident in, uh, solving that particular error.

262
00:15:15,000 --> 00:15:15,000
Right.

263
00:15:15,000 --> 00:15:18,000
So let's go ahead and combine all the columns now.

264
00:15:18,000 --> 00:15:23,000
So here I will just go ahead and combine all the columns.

265
00:15:23,000 --> 00:15:30,000
Combine the one hot encoded columns with the original data.

266
00:15:31,000 --> 00:15:34,000
Okay, with the original data.

267
00:15:34,000 --> 00:15:41,000
So here what I'm actually now going to do is that quickly, uh, and this is really important, uh,

268
00:15:41,000 --> 00:15:45,000
because without this, trust me, the new part of the code,

269
00:15:45,000 --> 00:15:45,000
Right.

270
00:15:45,000 --> 00:15:48,000
What we did with respect to transformation, we really need to update that.

271
00:15:48,000 --> 00:15:50,000
So I will just go ahead and update it.

272
00:15:50,000 --> 00:15:54,000
And I will write data is equal to pd.concat.

273
00:15:54,000 --> 00:16:04,000
And here, first of all, I have to concatenate the new geo underscore encoded underscore df with this

274
00:16:04,000 --> 00:16:05,000
particular data.

275
00:16:05,000 --> 00:16:06,000
Right.

276
00:16:06,000 --> 00:16:09,000
So first of all I will just go ahead and do the data dot drop.

277
00:16:09,000 --> 00:16:11,000
So I will go ahead and write data dot drop.

278
00:16:11,000 --> 00:16:16,000
And here I'm going to use my column which is called as geography because I don't require it any time

279
00:16:16,000 --> 00:16:16,000
now.

280
00:16:16,000 --> 00:16:20,000
And here I'm going to just use axis is equal to one, okay.

281
00:16:21,000 --> 00:16:26,000
So once this is dropped the next thing is that I will just go ahead and concatenate with my geo underscore

282
00:16:26,000 --> 00:16:28,000
encoded underscore df.

283
00:16:28,000 --> 00:16:30,000
And this time I'm just going to give my axis equal to one.

284
00:16:30,000 --> 00:16:30,000
Right.

285
00:16:30,000 --> 00:16:36,000
So now if I go ahead and see my data dot head, you'll be seeing that, okay, gender has become zeros and ones.

286
00:16:36,000 --> 00:16:42,000
And along with this there'll be three more features, geography France, geography Germany and geography Spain, that

287
00:16:42,000 --> 00:16:43,000
have got concatenated.

288
00:16:43,000 --> 00:16:51,000
Okay, so this is how we have actually converted our categorical features into numerical features.
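The combine step can be sketched like this. Both data frames below are invented stand-ins with three rows, built by hand so that the example runs on its own.

```python
import pandas as pd

# Invented stand-in for the data after Gender has been label encoded.
data = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": [0, 0, 1],
})

# Invented stand-in for the one hot encoded geography columns.
geo_encoded_df = pd.DataFrame({
    "Geography_France": [1.0, 0.0, 0.0],
    "Geography_Germany": [0.0, 0.0, 1.0],
    "Geography_Spain": [0.0, 1.0, 0.0],
})

# Drop the original text column, then place the encoded columns side by side.
data = pd.concat([data.drop("Geography", axis=1), geo_encoded_df], axis=1)
print(data.head())
```

One thing to keep in mind: pd.concat with axis=1 lines rows up by index, so both frames should share the same index, as they do here with the default one.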

289
00:16:51,000 --> 00:16:57,000
And, uh, we have basically used two important types of encoding techniques.

290
00:16:57,000 --> 00:17:00,000
One is label encoder and the other one is uh one hot encoder.

291
00:17:02,000 --> 00:17:03,000
So, I hope

292
00:17:03,000 --> 00:17:11,000
you are able to understand this entire thing. And see, everything

293
00:17:11,000 --> 00:17:16,000
I'm trying to do step by step. We'll do it in a much slower way so that you get all the points,

294
00:17:16,000 --> 00:17:23,000
you understand things, and you see that all this is, like, the usual machine learning

295
00:17:23,000 --> 00:17:24,000
kind of feature engineering.

296
00:17:24,000 --> 00:17:25,000
We did nothing new.

297
00:17:25,000 --> 00:17:27,000
Nothing new was applied

298
00:17:27,000 --> 00:17:27,000
over here.

299
00:17:27,000 --> 00:17:31,000
Everything is same, like how we have actually performed right now.

300
00:17:31,000 --> 00:17:39,000
Uh, let's quickly go ahead and see: in one place I have used label encoding, and in one

301
00:17:39,000 --> 00:17:40,000
I have used one hot encoder.

302
00:17:40,000 --> 00:17:41,000
Right.

303
00:17:41,000 --> 00:17:46,000
So I have to probably go ahead and save all these particular encoders in the form of pickle files,

304
00:17:46,000 --> 00:17:52,000
because this same one hot encoding and same label encoding I'm actually going to use in the later stages.

305
00:17:52,000 --> 00:17:53,000
Right.

306
00:17:53,000 --> 00:17:56,000
So label encoder gender I will try to go ahead and save it.

307
00:17:56,000 --> 00:18:00,000
And then the one hot encoder with respect to this geo, I'm going to save it, okay.

308
00:18:00,000 --> 00:18:02,000
So let's go ahead and save this particular file in the pickle.

309
00:18:02,000 --> 00:18:07,000
So that in our end to end project we can use this.

310
00:18:07,000 --> 00:18:15,000
So I'll just go ahead and write: save the encoders and scaler. Okay.

311
00:18:15,000 --> 00:18:17,000
So I'll say hey with open.

312
00:18:17,000 --> 00:18:18,000
I'll open a pickle file.

313
00:18:18,000 --> 00:18:22,000
I'll say label underscore encoder.

314
00:18:23,000 --> 00:18:24,000
Okay.

315
00:18:24,000 --> 00:18:25,000
Label underscore encoder.

316
00:18:25,000 --> 00:18:28,000
Underscore geo okay.

317
00:18:29,000 --> 00:18:31,000
Label underscore encoder underscore gender.

318
00:18:31,000 --> 00:18:34,000
First of all we'll go ahead and save this pickle file.

319
00:18:34,000 --> 00:18:40,000
And for this I have already imported pickle. I'll go ahead and open this in write binary mode,

320
00:18:40,000 --> 00:18:44,000
because this is a serialized file itself.

321
00:18:44,000 --> 00:18:50,000
So we will be writing in the form of bytes, and with this "as file" I have created a temporary variable.

322
00:18:50,000 --> 00:18:57,000
I'll just go ahead and write pickle dot dump, and I will dump this entire label encoder gender inside

323
00:18:57,000 --> 00:18:59,000
this particular file okay.

324
00:18:59,000 --> 00:19:07,000
Similarly you'll also be seeing that I will go ahead and open with open, uh, one more pickle file.

325
00:19:07,000 --> 00:19:08,000
I'll just go ahead and create it.

326
00:19:08,000 --> 00:19:13,000
That is nothing but one hot encoder underscore geo dot pickle file.

327
00:19:14,000 --> 00:19:19,000
And this will also get opened in write binary mode, okay, wb mode.

328
00:19:19,000 --> 00:19:22,000
And here I can say as file, okay.

329
00:19:22,000 --> 00:19:26,000
And this also we will go ahead and dump it.

330
00:19:27,000 --> 00:19:30,000
So I'll say pickle dot dump.

331
00:19:30,000 --> 00:19:36,000
And here also I'm going to use my one hot.

332
00:19:38,000 --> 00:19:41,000
Encoder underscore Geo.

333
00:19:41,000 --> 00:19:45,000
And finally I'm just going to dump this into my file, okay.

334
00:19:46,000 --> 00:19:48,000
So let's go ahead and open it.

335
00:19:48,000 --> 00:19:52,000
So I'll say with open... okay, these two files I'll save right now.

336
00:19:52,000 --> 00:19:57,000
So once I've saved it, uh, you can see these are my pickle files that have got created.

337
00:19:58,000 --> 00:20:05,000
So till here uh the basic transformation that I have actually done right, that all has been completed.

338
00:20:06,000 --> 00:20:09,000
And, uh, we have almost done this kind of transformation, and we have saved it.
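The save step can be sketched as below, together with the matching load that a later deployment script would do. This is only a sketch: the file name label_encoder_gender.pkl and the temporary directory are assumptions made so the example runs anywhere, and the encoder is fitted on two invented values.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Invented fit data; in the notebook the encoder is fitted on the Gender column.
label_encoder_gender = LabelEncoder().fit(["Female", "Male"])

# Assumed file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "label_encoder_gender.pkl")

# "wb" opens the file for writing bytes, which is what pickle produces.
with open(path, "wb") as file:
    pickle.dump(label_encoder_gender, file)

# "rb" reads the bytes back and rebuilds the fitted encoder.
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored.transform(["Male", "Female"]).tolist())  # [1, 0]
```

The restored encoder applies exactly the same mapping as the one used in training, which is the reason for saving it. The one hot encoder can be saved the same way.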

339
00:20:09,000 --> 00:20:12,000
Okay, now, uh, let's do one thing.

340
00:20:12,000 --> 00:20:14,000
Okay, so you have done the transformation.

341
00:20:14,000 --> 00:20:17,000
Now let's quickly divide our data.

342
00:20:17,000 --> 00:20:25,000
Divide the data set into independent and dependent features.

343
00:20:28,000 --> 00:20:33,000
Now you know that my data dot head.

344
00:20:33,000 --> 00:20:35,000
If I just go ahead and execute it.

345
00:20:35,000 --> 00:20:35,000
Right.

346
00:20:35,000 --> 00:20:40,000
So this is data dot head here.

347
00:20:40,000 --> 00:20:47,000
Exited is my column, which I will be keeping as my dependent feature, and all the remaining

348
00:20:47,000 --> 00:20:48,000
columns

349
00:20:48,000 --> 00:20:49,000
I will keep as my independent features.

350
00:20:49,000 --> 00:20:50,000
Right.

351
00:20:50,000 --> 00:20:53,000
So quickly let's go ahead and do that.

352
00:20:53,000 --> 00:20:58,000
So I will just try to go ahead and divide my data set into independent and dependent features.

353
00:20:58,000 --> 00:20:59,000
So I'll say data dot drop.

354
00:21:00,000 --> 00:21:08,000
And here I will go ahead and write Exited comma axis is equal to one, okay. y is equal to data.

355
00:21:09,000 --> 00:21:12,000
And here I will just go ahead and say Exited, right.

356
00:21:12,000 --> 00:21:17,000
So y will basically be my dependent feature, and X will be my independent features.

357
00:21:17,000 --> 00:21:28,000
Then we will split the data into training and testing sets, okay.

358
00:21:29,000 --> 00:21:38,000
Now let me just go ahead and write x train comma x test comma y train comma y test quickly.

359
00:21:38,000 --> 00:21:43,000
And this is very common, because we use it all the time in machine learning: x comma y.

360
00:21:43,000 --> 00:21:46,000
Uh let's go ahead and use some test size.

361
00:21:46,000 --> 00:21:48,000
So test size will be 0.2.

362
00:21:49,000 --> 00:21:53,000
And random underscore state will be 42.
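
[Editor's sketch] Put together, the split described above could look roughly like this. The small DataFrame is made up for illustration, and the column name `Exited` is assumed from the spoken description; `test_size=0.2` and `random_state=42` are the values mentioned in the video.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real data set.
data = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850, 645, 822, 376, 501, 684],
    "Age": [42, 41, 42, 39, 43, 44, 50, 29, 44, 27],
    "Exited": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
})

# Independent features: every column except the target.
X = data.drop("Exited", axis=1)
# Dependent feature: the target column.
y = data["Exited"]

# Hold out 20% of the rows for testing; a fixed seed keeps the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With 10 rows and a test size of 0.2, 8 rows go to training and 2 to testing.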

363
00:21:55,000 --> 00:21:58,000
So here there is also something else we need to do.

364
00:21:58,000 --> 00:21:59,000
One more thing guys.

365
00:21:59,000 --> 00:22:01,000
We'll scale down these features.

366
00:22:01,000 --> 00:22:01,000
Okay.

367
00:22:01,000 --> 00:22:02,000
Scale these features.

368
00:22:02,000 --> 00:22:06,000
And the scaling of all these features

369
00:22:06,000 --> 00:22:08,000
we will also be specifically doing, okay.

370
00:22:08,000 --> 00:22:14,000
Now for scaling we will use the StandardScaler, okay.

371
00:22:14,000 --> 00:22:17,000
And, uh, once we go ahead and initialize it,

372
00:22:17,000 --> 00:22:27,000
uh, here I'm just going to write X train is equal to scaler dot fit transform on my, uh,

373
00:22:27,000 --> 00:22:28,000
entire X train data.

374
00:22:28,000 --> 00:22:34,000
And for my X underscore test data I'm just going to use scaler dot transform.

375
00:22:35,000 --> 00:22:38,000
And here I'm just going to go ahead and write x underscore test.

376
00:22:39,000 --> 00:22:44,000
So this transformation is specifically required. I will quickly go ahead and execute it.

377
00:22:46,000 --> 00:22:47,000
So.

378
00:22:50,000 --> 00:22:50,000
Okay.

379
00:22:50,000 --> 00:22:54,000
Um x comma y okay.

380
00:22:54,000 --> 00:22:55,000
Perfect.

381
00:22:55,000 --> 00:22:57,000
So let's go ahead and execute this.

382
00:22:57,000 --> 00:23:03,000
So here you'll be able to see this will be my X train data, which is basically

383
00:23:03,000 --> 00:23:06,000
getting scaled with the help of StandardScaler.

384
00:23:06,000 --> 00:23:08,000
And similarly I have my X test.
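
[Editor's sketch] The scaling step above fits the scaler on the training data only and then reuses the learned mean and standard deviation on the test data. A minimal illustration with made-up numbers follows; the `_scaled` variable names are an assumption for clarity, whereas the video assigns the results back to X train and X test.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for the real train/test split.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

scaler = StandardScaler()

# fit_transform: learn mean and std from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)

# transform only: reuse the training statistics, never refit on test data.
X_test_scaled = scaler.transform(X_test)
```

Using only transform on the test set keeps information from the test rows out of the fitted statistics.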

385
00:23:09,000 --> 00:23:14,000
So once this is done, we'll also go ahead and save this scaler, uh, in the form

386
00:23:14,000 --> 00:23:15,000
of a pickle file.

387
00:23:15,000 --> 00:23:15,000
Okay.

388
00:23:15,000 --> 00:23:18,000
We will be requiring all these pickle files.

389
00:23:18,000 --> 00:23:23,000
So with open, I will say, hey, go ahead and use my scaler dot pickle.

390
00:23:23,000 --> 00:23:29,000
And, uh, it will be in write byte mode, okay.

391
00:23:29,000 --> 00:23:31,000
Uh, write byte mode.

392
00:23:33,000 --> 00:23:38,000
Now, once we are using this write byte mode, I will just go ahead and create my context over here

393
00:23:38,000 --> 00:23:39,000
as file.

394
00:23:39,000 --> 00:23:47,000
And we'll go ahead and dump this scaler: scaler comma file, okay.

395
00:23:47,000 --> 00:23:49,000
Whatever scaler value is there.

396
00:23:49,000 --> 00:23:51,000
So let's go ahead and execute it.
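
[Editor's sketch] One reason to pickle the fitted scaler is that it can be loaded later and applied to new rows with the same statistics. A rough illustration, assuming a hypothetical file name `scaler.pkl` in a temporary directory and made-up numbers:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up training values.
scaler = StandardScaler().fit(np.array([[0.0], [2.0], [4.0]]))

# Hypothetical file name, placed in a temp directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")

# Save the fitted scaler in write byte mode.
with open(path, "wb") as file:
    pickle.dump(scaler, file)

# Later (for example at prediction time), load it back in read byte mode.
with open(path, "rb") as file:
    loaded_scaler = pickle.load(file)

# The loaded scaler applies the same mean and std as the original.
new_row_scaled = loaded_scaler.transform(np.array([[2.0]]))
```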

397
00:23:51,000 --> 00:23:53,000
So here you have each and every thing.

398
00:23:53,000 --> 00:23:54,000
Here you have label encoder pickle.

399
00:23:54,000 --> 00:23:55,000
You have one hot encoder pickle.

400
00:23:55,000 --> 00:23:56,000
We have scaler pickle.

401
00:23:56,000 --> 00:23:57,000
This is perfect.

402
00:23:58,000 --> 00:24:02,000
So in short, in this video what I have done is that I've just cleaned the data set.

403
00:24:02,000 --> 00:24:06,000
I've applied all the types of, uh, feature engineering that I wanted to do.

404
00:24:06,000 --> 00:24:06,000
Okay.

405
00:24:06,000 --> 00:24:09,000
I have divided this into dependent and independent features.

406
00:24:09,000 --> 00:24:14,000
As for the main feature engineering methods that we have applied, we have applied

407
00:24:14,000 --> 00:24:17,000
our StandardScaler.

408
00:24:17,000 --> 00:24:22,000
We have, uh, converted categorical features into numerical features using label encoder and

409
00:24:22,000 --> 00:24:24,000
one hot encoder. Everything we have specifically done.

410
00:24:25,000 --> 00:24:28,000
So, uh, now my data is specifically ready.

411
00:24:28,000 --> 00:24:28,000
Okay.

412
00:24:28,000 --> 00:24:29,000
This is my data.

413
00:24:30,000 --> 00:24:34,000
Now on this particular data, I'm going to just go ahead and train my artificial neural network.

414
00:24:34,000 --> 00:24:38,000
And that is what I'm actually going to show you in the next video.

415
00:24:38,000 --> 00:24:40,000
So I hope you like this particular video.

416
00:24:40,000 --> 00:24:41,000
Uh, I will see you in the next video.

417
00:24:41,000 --> 00:24:42,000
Thank you.