WEBVTT

00:01.320 --> 00:02.280
All right, everyone.

00:02.820 --> 00:10.950
So the next step in this spam message classification project is to separate out your data into two different

00:10.950 --> 00:15.030
buckets: a training set of data and a testing set of data.

00:15.570 --> 00:21.240
Now, why do we need to separate out this data? Because a machine learning system is a kind of training

00:21.270 --> 00:22.950
plus testing system.

00:23.310 --> 00:28.960
So model building in machine learning happens when you take some subset of data from your dataset,

00:29.220 --> 00:31.380
which is called the training dataset,

00:31.800 --> 00:37.590
create a model out of it, and then evaluate how good your model is.

00:38.250 --> 00:40.830
What is the accuracy of your model?

00:41.130 --> 00:48.720
You can apply the remaining testing dataset to your model to find out how good your

00:48.720 --> 00:49.200
model is.

00:49.470 --> 00:55.920
So the basic rule is that you segregate your data into two different buckets: a training

00:55.920 --> 00:58.860
dataset and a testing dataset.

00:58.870 --> 01:06.660
When you build your model, never, ever use your testing dataset for model training,

01:06.990 --> 01:11.820
because it is only after the model has been trained and created

01:11.910 --> 01:17.520
that we apply the testing dataset to find out how good or accurate our model is.

01:17.850 --> 01:22.090
So the first task is to take this data DataFrame object.

01:22.510 --> 01:24.350
Let me display it.

01:26.060 --> 01:32.820
We will divide it into two different buckets: a training dataset and a testing dataset.

01:35.210 --> 01:35.640
All right.

01:36.250 --> 01:40.900
So we have 1494 records that are relevant.

01:41.350 --> 01:48.970
So let us keep some 30 percent of the data for the testing bucket and 70 percent of the data for the training

01:48.970 --> 01:53.020
bucket, and for that we are going to use the scikit-learn library.

01:53.050 --> 01:57.610
Now, scikit-learn also comes preinstalled as part of this Colab environment,

01:57.640 --> 02:02.890
so no need to worry about any kind of installation. From sklearn,

02:04.430 --> 02:08.780
let's go with model_selection and let me import

02:10.970 --> 02:12.890
train_test_split.

02:13.840 --> 02:14.320
All right.

02:15.640 --> 02:21.610
So this train_test_split function will give us two different buckets.

02:21.730 --> 02:25.900
The first argument we need to supply is the data you want to split.

02:26.350 --> 02:29.110
So the first data we want to split is

02:30.710 --> 02:38.000
the data message column, because that is our training input data, and we are not going to give the length

02:38.570 --> 02:44.090
and punctuation columns for model building, because, as we have seen, they are not very

02:44.120 --> 02:48.140
useful and do not create any kind of differentiating factor.

02:48.590 --> 02:58.220
So for the feature vector, the first argument we pass is only the data

02:59.570 --> 03:00.420
message column.

03:00.860 --> 03:02.390
So that will be our input data.

03:02.850 --> 03:07.850
And our objective is to predict the label, which is nothing but our output data.

03:10.820 --> 03:12.500
And now, one more argument.

03:12.530 --> 03:20.540
You can pass either test_size or train_size; so, test_size.

03:20.930 --> 03:23.810
Here you can give a number between zero and one.

03:24.200 --> 03:25.740
Let's say I give 0.3.

03:25.760 --> 03:29.760
That means you are allocating 30 percent of the data to the testing

03:29.960 --> 03:30.350
set.

03:32.000 --> 03:32.990
Let's just make the

03:33.580 --> 03:34.310
random_state

03:38.290 --> 03:38.660
zero.

03:39.010 --> 03:44.920
Now, this zero is there so that whenever you want to recreate exactly the same result, you can also

03:44.930 --> 03:45.300
keep the

03:46.090 --> 03:47.770
same thing on your side.

03:48.400 --> 03:51.280
So zero here gives the same result.

03:51.310 --> 03:56.800
Suppose I keep some other number: you also have to keep that same number to recreate exactly

03:56.920 --> 03:57.670
the same result.

03:57.970 --> 04:00.370
And another one is whether you want to shuffle

04:00.370 --> 04:03.150
it or not; let's just make it True.

04:05.450 --> 04:06.820
All right.

04:06.850 --> 04:10.820
Now this whole function will return four values.

04:11.290 --> 04:14.070
So that will be X_train,

04:17.770 --> 04:25.190
then we have X_test, and y_train and y_test.

04:25.520 --> 04:32.470
So X refers to the feature input, whereas y refers to the label as the output.

04:32.860 --> 04:40.020
So this message column will be split into two different variables, X_train

04:40.180 --> 04:43.040
and X_test, and correspondingly this label variable

04:43.420 --> 04:45.850
will be split into y_train and y_test.

04:46.090 --> 04:49.610
So this is the standard way of naming these four variables.

04:49.690 --> 04:52.620
Generally, everyone in the community uses it.
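
NOTE
A minimal sketch of the split described above, assuming a pandas DataFrame named data with "message" and "label" columns. The ten rows are invented for illustration and are not the course dataset.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
# Hypothetical stand-in for the course DataFrame: 10 invented rows.
data = pd.DataFrame({
    "message": [f"sample message {i}" for i in range(10)],
    "label": ["spam" if i % 2 == 0 else "ham" for i in range(10)],
})
# Features (X) are the message text; the target (y) is the label.
X_train, X_test, y_train, y_test = train_test_split(
    data["message"], data["label"],
    test_size=0.3,    # 30 percent of the rows go to the testing bucket
    random_state=0,   # fixed seed so the same split can be recreated
    shuffle=True,     # shuffle the rows before splitting
)
# Each pair of train/test pieces should have matching lengths.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```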

04:53.320 --> 04:54.010
So let me

04:54.260 --> 04:55.720
execute it.

04:57.690 --> 05:02.880
So the best way to observe the result is to apply the shape attribute on top of them.

05:03.330 --> 05:11.130
So out of the 1494 records that are relevant, the 30 percent

05:11.130 --> 05:14.170
value will be about 448.

05:14.550 --> 05:21.440
So let us try to observe how many records are contained in, say, X_train.shape.

05:21.840 --> 05:22.320
Whoops.

05:27.400 --> 05:27.680
It.

05:30.160 --> 05:35.520
And we have 1045 records that have been allocated to the training bucket. Similarly,

05:35.930 --> 05:37.330
let's do this for

05:38.360 --> 05:41.660
X_test.shape.

05:44.900 --> 05:49.240
So it's 449 records, because obviously it cannot take fractional values.

05:49.640 --> 05:54.290
So 449 records have been assigned to the testing input. Similarly,

05:54.590 --> 06:00.110
if you try to observe how many records are available in the training output label or the testing output

06:00.110 --> 06:00.440
label,

06:00.470 --> 06:05.130
those will be, respectively, 1045 and 449.

06:05.210 --> 06:07.750
So you can verify it on your side.

06:07.760 --> 06:12.890
So our dataset has been split into two different buckets, training and testing.
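
NOTE
A short sketch of why the counts come out as 1045 and 449 rather than 448. It assumes the fractional test count is rounded up to a whole record, which matches the numbers reported in the video. The variable names are illustrative only.
```python
import math
n_records = 1494   # total records mentioned in the video
test_size = 0.3    # fraction requested for the testing bucket
# 30 percent of 1494 is 448.2, which is not a whole number of records.
# Assumption: the fractional count is rounded up for the testing bucket.
n_test = math.ceil(n_records * test_size)
# The training bucket receives whatever is left over.
n_train = n_records - n_test
print(n_test, n_train)
```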

06:13.190 --> 06:19.330
So in the next video, we'll create our first model and do the training on our training dataset.

06:19.820 --> 06:21.030
So see you in the next video.

06:21.310 --> 06:28.280
We'll get started with creating our first random forest model and applying it to our spam classification

06:28.280 --> 06:28.610
project.