WEBVTT

00:01.090 --> 00:06.820
Hello, everyone, and welcome to a new section on the Spam Message Classification Project.

00:07.540 --> 00:14.320
So in this section, we are going to import one spam message delimited file

00:14.830 --> 00:18.220
and we'll apply different machine learning algorithms to it.

00:18.250 --> 00:19.730
One is a random forest.

00:20.290 --> 00:21.340
And the other is SVM.

00:21.820 --> 00:28.450
So let us first have a look at the data, and then we will see what the business problem

00:28.630 --> 00:29.430
behind it is.

00:30.190 --> 00:34.840
And for that, we have to upload one local file, so you can click on Files.

00:35.560 --> 00:40.480
Click on upload and navigate to where you kept this file.

00:41.170 --> 00:47.040
So inside my folder I have kept this spam.tsv file.

00:49.270 --> 00:51.310
So it reminds us that uploaded files will get deleted

00:51.640 --> 00:53.590
when this runtime is recycled.

00:53.620 --> 00:54.140
That's OK.

00:55.820 --> 00:56.220
All right.

00:56.660 --> 00:57.840
So what can we do?

00:58.100 --> 01:01.090
Let's try to import this data first. For that,

01:01.250 --> 01:01.950
first of all,

01:02.240 --> 01:10.970
we are going to import all the basic libraries we need: NumPy, pandas, and matplotlib.

01:12.230 --> 01:13.700
And let me just

01:13.730 --> 01:15.950
hide it. Let's just run it.

01:18.700 --> 01:19.110
All right.

01:20.650 --> 01:25.300
Now we are going to use this pd.read_csv function

01:27.700 --> 01:29.380
to read this file,

01:33.210 --> 01:37.460
spam.tsv. I know everything here is tab-separated,

01:37.890 --> 01:41.210
so we are going to use this separator

01:42.840 --> 01:44.340
as backslash t.

01:45.930 --> 01:49.440
And let me assign it to a DataFrame object, df.

01:51.470 --> 01:53.720
Perhaps there is some issue.

01:56.900 --> 01:57.300
All right.

01:57.500 --> 02:00.000
It's not set, it's separator,

02:00.110 --> 02:01.790
that means sep.

02:02.750 --> 02:04.190
So that's a small typo.
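
NOTE
A minimal sketch of the read step described above, assuming a tab-separated file with columns named label, message, length, and punct. The last two column names and the two sample rows are assumptions made up for illustration. An in-memory string stands in for the uploaded file so the snippet runs on its own.
```python
import io
import pandas as pd
# Hypothetical stand-in for the uploaded tab-separated file (two made-up rows).
sample = "label\tmessage\tlength\tpunct\nham\tSee you at 5, ok?\t17\t2\nspam\tWIN a prize now!!!\t18\t3\n"
# The keyword is sep, not set; "\t" tells pandas the columns are tab-separated.
df = pd.read_csv(io.StringIO(sample), sep="\t")
# head() shows the first rows (up to five) for a quick look at the data.
print(df.head())
```
With the real upload, the first argument would be the path to the file instead of the StringIO object.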

02:04.940 --> 02:05.390
All right.

02:06.320 --> 02:12.890
Now, let us have a look at the data first, and then we will figure out what we need to do.

02:13.730 --> 02:19.550
So I'm just displaying the first five records with the help of this head function, and you will be able to

02:19.550 --> 02:23.510
see that we have one column that is very important,

02:23.540 --> 02:26.330
and that's where the whole data resides.

02:26.660 --> 02:28.760
That is nothing but our message column.

02:29.240 --> 02:34.010
And each individual message has been associated with a label, either ham

02:34.540 --> 02:39.560
or spam. And for each individual message they have given, like,

02:39.680 --> 02:41.870
what the length of that particular message is.

02:42.440 --> 02:48.470
This field looks like an already computed one, though: if you just apply the len function on

02:48.560 --> 02:55.100
top of this message column, you'll be able to get this one hundred and eleven. Similarly, for each

02:55.190 --> 02:56.350
individual message,

02:56.420 --> 02:58.690
how many punctuation marks exist.

02:59.300 --> 03:05.960
So as a business problem, our objective is to build a model which will do this binary classification between

03:06.410 --> 03:06.870
two classes.

03:07.050 --> 03:09.100
Either it will be ham or spam.

03:09.470 --> 03:17.000
So based on the message, we need to classify whether a message is spam or not spam.

03:17.750 --> 03:18.020
All right.

03:18.050 --> 03:22.430
So let's do some more basic analysis on top of this data.

03:23.000 --> 03:32.180
Something like whether we have any missing records or not. So for that, we can simply use df.isna().

03:33.920 --> 03:37.730
And if you execute it, it will return True and False values.

03:37.790 --> 03:42.560
So wherever there is something missing, it will return True.

03:42.800 --> 03:44.210
Otherwise it will return false.

03:44.450 --> 03:48.410
So what we can do on top of that is apply the sum method.

03:49.430 --> 03:51.860
And for each individual column,

03:51.990 --> 03:54.950
or what I would call a feature, it will give us

03:54.980 --> 03:56.630
how many missing values exist.

03:57.050 --> 03:59.060
So in our case, there are none.

03:59.150 --> 04:04.910
That means our data is complete in the sense that there are no missing values in it.
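
NOTE
A small sketch of the missing-value check described above, using pandas isna() followed by sum(). The three rows are made up, and one message is deliberately left missing so the count is not all zeros. The name missing_per_column is only illustrative.
```python
import pandas as pd
# Made-up rows; the second message is deliberately missing.
df = pd.DataFrame({"label": ["ham", "spam", "ham"], "message": ["hi there", None, "see you soon"]})
# isna() marks each cell True or False; sum() then counts the True cells per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```
On a complete dataset like the one in the video, every column would report zero.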

04:05.210 --> 04:05.690
After this,

04:06.930 --> 04:11.350
we can have a look at the dataset with df.tail,

04:11.630 --> 04:15.720
so from the back side of the data, so we can verify

04:17.100 --> 04:24.220
what that data is. And you can see the data is indexed from zero to 5571. That means

04:24.270 --> 04:31.380
there are a total of 5572 records, and each record has been assigned a label,

04:31.560 --> 04:32.820
ham or spam.

04:33.120 --> 04:39.720
Now we have just two columns that are numeric, though we can still do statistical analysis

04:39.720 --> 04:40.590
on top of them.

04:41.130 --> 04:45.120
So let's have a look at df.describe().

04:47.490 --> 04:52.740
And it will do the statistical analysis on just the numerical columns.

04:53.520 --> 04:55.700
So you can see, in case of length and punctuation,

04:56.160 --> 05:01.050
in both cases, we have a total of 5572 records available.

05:01.680 --> 05:07.350
And I mean, the average length of our messages is fifty point forty-eight characters.

05:07.830 --> 05:12.990
And the average number of punctuation marks in each message is about four point seventeen.

05:13.950 --> 05:17.430
And in case of standard deviation, it deviates too much from this.

05:17.910 --> 05:18.970
Like 59.

05:19.530 --> 05:23.640
Whereas in case of punctuation marks, the deviation is four point sixty-two.

05:24.390 --> 05:32.580
This information isn't really that helpful for classification on its own. But if you

05:32.730 --> 05:37.130
do all those things with a group by and apply those methods,

05:38.610 --> 05:39.940
it will be very much helpful.
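
NOTE
A hedged sketch of the two summaries mentioned above: describe() over the numeric columns, then the same kind of statistic split by label with groupby. The four rows and the column names length and punct are assumptions for illustration, not values from the real dataset.
```python
import pandas as pd
# Made-up rows; length and punct are assumed names for the two numeric columns.
df = pd.DataFrame({"label": ["ham", "spam", "ham", "spam"], "length": [17, 18, 40, 120], "punct": [2, 3, 1, 6]})
# describe() skips the text column by default and summarises only numeric ones.
summary = df.describe()
print(summary.loc[["count", "mean", "std"]])
# Grouping by label first gives one mean per class, which is easier to compare.
by_label = df.groupby("label")[["length", "punct"]].mean()
print(by_label)
```
Comparing the per-class rows is one way to see whether length or punctuation differs between ham and spam.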

05:40.450 --> 05:48.310
So what we can do on top of the label column is apply a count and try to grab how many of the records

05:48.310 --> 05:51.000
are ham and how many records are spam.

05:51.870 --> 05:56.720
So let me do this on our label column.

05:57.000 --> 06:03.310
So it will be label, and let's apply value_counts.

06:04.050 --> 06:04.530
All right.

06:04.560 --> 06:11.530
So we have a total of four thousand eight hundred and twenty-five records that are ham and just seven hundred

06:11.800 --> 06:16.160
forty-seven records that are spam. That means it's a very imbalanced dataset,

06:16.460 --> 06:21.570
I would say. We can even divide by the length of df.

06:22.050 --> 06:24.720
So in terms of percentage, we'll get those numbers.

06:25.620 --> 06:27.290
You can even multiply by a hundred.

06:33.960 --> 06:34.380
An.

06:35.900 --> 06:36.990
I have to multiply here.

06:37.370 --> 06:38.120
It's not here.

06:38.150 --> 06:40.730
So what we can do, instead of multiplying

06:43.190 --> 06:46.070
here, I could multiply here.

06:46.130 --> 06:46.710
But that's OK.

06:47.520 --> 06:49.040
So here the indication is that.

06:49.070 --> 06:51.210
Eighty four percent.

06:51.410 --> 06:56.270
86 percent of the total data lies in the basket of ham.

06:56.600 --> 07:00.190
And just fourteen, thirteen point forty percent of the total data

07:00.590 --> 07:02.750
lies in the basket of spam.

07:02.810 --> 07:05.970
So it's a very imbalanced dataset.
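
NOTE
A minimal sketch of the class-balance check described above. Only the label column is rebuilt here, from the counts quoted in the video (4825 ham out of 5572 records, which leaves 747 spam). The names labels, counts, and percentages are illustrative.
```python
import pandas as pd
# Rebuild just the label column from the counts quoted in the video.
labels = pd.Series(["ham"] * 4825 + ["spam"] * 747, name="label")
# value_counts() returns how many rows carry each label.
counts = labels.value_counts()
# Dividing by the number of rows and multiplying by 100 turns counts into percentages.
percentages = counts / len(labels) * 100
print(counts)
print(percentages.round(2))
```
Passing normalize=True to value_counts() is another way to get the fractions directly.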

07:06.290 --> 07:09.520
So we are not going to do all the training on the complete dataset,

07:09.590 --> 07:12.020
because then the model will be a little bit biased.

07:12.350 --> 07:19.570
So in the next video, we will see how you can take an equal amount of data for

07:19.770 --> 07:23.990
further analysis from both classes, whether it is spam or ham.

07:24.470 --> 07:24.790
All right.

07:24.830 --> 07:26.310
So see you in the next video.