WEBVTT

00:02.460 --> 00:03.390
All right, everyone.

00:03.450 --> 00:07.710
So let's continue our discussion of the spam message classification project.

00:08.430 --> 00:13.050
So, in the last video, we saw that this is a very imbalanced dataset.

00:13.470 --> 00:15.530
So how can we make it balanced?

00:16.080 --> 00:17.500
So if you observe the counts,

00:17.590 --> 00:20.430
let me execute this one again.

00:21.450 --> 00:24.320
So we'll get the absolute numbers.

00:24.370 --> 00:29.490
We have 4,825 ham messages, and for spam,

00:29.640 --> 00:31.730
there is a very limited number of records available.

00:32.160 --> 00:36.370
So one way to go is to try to find more data.

00:36.420 --> 00:41.360
We can try to collect more data for the spam category and make it equal.

00:41.550 --> 00:48.510
Not exactly 4,825, but roughly the same proportion of data, so that it will be,

00:48.510 --> 00:51.490
I mean, quite good for classification.

00:51.990 --> 00:54.570
But no, we are not going to go with that.

00:54.570 --> 00:56.200
We are not collecting more data.

00:56.280 --> 01:01.620
That is the option when we are really working on an industry-level problem.

01:01.980 --> 01:04.190
Now, the other way we can go is like this:

01:04.560 --> 01:08.580
we can randomly discard some data from the ham category.

01:08.910 --> 01:15.360
And whatever number of messages is available in the spam category, the same

01:15.360 --> 01:16.860
number of messages

01:16.930 --> 01:24.560
we can keep in the ham category, so that we have a balanced dataset, where 50 percent of

01:24.570 --> 01:28.380
the data belongs to the ham category and 50 percent belongs to the spam category.

01:29.040 --> 01:34.680
So first, let us pull just the ham messages out into a separate variable.

01:34.800 --> 01:37.760
So take the label column,

01:38.610 --> 01:47.240
compare it with ham, and let's just take all those records that have,

01:47.720 --> 01:49.290
as the label, ham.

01:50.900 --> 01:51.290
All right.

01:51.330 --> 01:53.460
So you can see only ham messages appear.

01:53.910 --> 01:56.690
Let's just assign it to a ham variable.

02:00.450 --> 02:02.400
And similar to ham,

02:02.820 --> 02:09.160
let's just make another bucket, like a spam message bucket.

02:09.660 --> 02:14.160
So this will be spam, and the label will be spam.

02:14.460 --> 02:14.970
All right.

02:15.310 --> 02:16.230
Let's just verify

02:16.330 --> 02:17.350
the shape of both

02:17.740 --> 02:21.410
of these two. So ham.shape,

02:22.140 --> 02:23.970
spam.shape.

02:26.080 --> 02:33.400
So in both of them we have four features, and in the case of ham, as we saw earlier,

02:33.400 --> 02:36.110
we have quite a good number of records available.

02:37.030 --> 02:41.950
So what we can do is, out of these 4,825 records,

02:43.330 --> 02:53.560
we can randomly grab some 747 samples and put them into a bucket like ham. So ham, let's say

02:53.560 --> 02:53.980
ham.

02:54.470 --> 02:59.200
We are going to use the sample function. And how many do we need to take?

02:59.470 --> 03:01.420
So that will be 747.

03:01.420 --> 03:03.280
So I can give it as a hard-coded value.

03:03.370 --> 03:08.560
Or, instead of that, I can put spam.shape[0].

03:10.460 --> 03:11.600
And let me execute it.

03:13.090 --> 03:20.730
And afterwards, if you check the shape on both of these, ham.shape and spam.shape,

03:21.680 --> 03:24.140
it will be quite balanced.
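
NOTE
Editor's sketch of the undersampling step described in the cues above. It is
a minimal illustration, not the presenter's notebook: the toy DataFrame, the
name df, the "message" column and random_state are assumptions added here.
```python
import pandas as pd
# Hypothetical toy data standing in for the course dataset (8 ham, 3 spam).
df = pd.DataFrame({
    "label": ["ham"] * 8 + ["spam"] * 3,
    "message": [f"message {i}" for i in range(11)],
})
# One bucket per class, selected with a boolean mask on the label column.
ham = df[df["label"] == "ham"]
spam = df[df["label"] == "spam"]
# Undersample ham to the spam row count instead of hard-coding the number.
# random_state only makes this sketch reproducible; it is not in the video.
ham = ham.sample(spam.shape[0], random_state=0)
print(ham.shape, spam.shape)
```
Reading the count from spam.shape[0] keeps the two buckets equal even if the
number of spam rows changes later.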

03:26.880 --> 03:33.300
And we can just append all those spam messages to the ham messages, or the ham messages to the spam messages,

03:34.240 --> 03:34.760
and for that

03:34.950 --> 03:37.230
we are going to use ham.append.

03:38.100 --> 03:40.170
Let me append spam to the ham.

03:40.890 --> 03:44.870
And now there is one thing to handle, the ignore_index option.

03:45.030 --> 03:47.090
So let's just set ignore_index

03:47.480 --> 03:53.910
equal to True, because both of these still carry their old index values.

03:55.310 --> 04:00.680
So it should not create a problem while appending, because there should not be two records that have

04:00.830 --> 04:03.260
exactly the same index.

04:03.980 --> 04:05.300
And let me assign it to a

04:05.650 --> 04:07.820
new variable, like data.

04:09.930 --> 04:13.290
Now, if you observe data.shape,

04:16.090 --> 04:21.250
you'll be able to see we have a very limited number of records available:

04:21.400 --> 04:23.830
1,494 records only. Among them,

04:24.280 --> 04:27.400
there are 747

04:27.750 --> 04:28.030
ham,

04:28.560 --> 04:32.260
and the same number of records resides in the spam category.

04:32.440 --> 04:37.120
So if you apply, let's say, value_counts on this label,

04:38.150 --> 04:42.420
now not on the original DataFrame, it will be on top of data this time,

04:43.460 --> 04:50.240
we have exactly the same number of records available in both categories of this label.
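
NOTE
Editor's sketch of combining the two buckets and checking the balance. The
toy frames and their index values are assumptions. DataFrame.append, which is
used in the video, was removed in pandas 2.0, so this sketch swaps in
pd.concat, which accepts the same ignore_index option.
```python
import pandas as pd
# Hypothetical buckets that still carry index labels from a larger frame.
ham = pd.DataFrame({"label": ["ham"] * 3, "length": [20, 35, 50]}, index=[0, 4, 7])
spam = pd.DataFrame({"label": ["spam"] * 3, "length": [120, 150, 160]}, index=[2, 5, 9])
# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
data = pd.concat([ham, spam], ignore_index=True)
print(data.shape)
# Per-class row counts; equal counts mean the dataset is balanced.
print(data["label"].value_counts())
```
On pandas versions older than 2.0, ham.append(spam, ignore_index=True) gives
the same result.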

04:53.270 --> 04:53.670
All right.

04:53.760 --> 04:57.810
So that is how we made our dataset balanced.

04:58.140 --> 05:06.560
Now let's do some quick visualization, based on whether the label is spam or the label is ham,

05:07.470 --> 05:11.160
of how the different features affect these two categories.

05:12.620 --> 05:20.960
Let's say, for ham only, let's try to visualize the other columns. So let me display the first few records.

05:23.930 --> 05:26.740
So we have the length, and we also have the punctuation marks.

05:27.380 --> 05:33.330
So let's see, for just the ham category, let's try to visualize the data in the form of, let's say,

05:33.330 --> 05:33.920
a histogram.

05:33.920 --> 05:40.040
So we get an idea of whether, for ham generally, the length of the message resides in some particular range.

05:40.370 --> 05:45.730
We may get some useful information or we may not get any useful information.

05:45.770 --> 05:47.990
So that is like an initial analysis.

05:48.020 --> 05:50.600
As a machine learning engineer, you need to do it.

05:52.120 --> 05:54.350
So let's use plt.hist here.

05:56.020 --> 06:00.080
So that will create a histogram.

06:00.520 --> 06:05.680
So the first argument, which we are going to pass, is the length.

06:06.040 --> 06:09.880
Now, the length we want to display just for the ham messages.

06:11.410 --> 06:14.530
So what we will do is filter data: take the label from data,

06:15.280 --> 06:17.230
let's just compare it to

06:22.650 --> 06:23.030
ham.

06:24.690 --> 06:28.370
And we're going to display the length column.

06:30.090 --> 06:30.600
All right.

06:31.380 --> 06:35.610
And let me just call plt.show.

06:36.240 --> 06:38.880
So this function we are using from

06:40.430 --> 06:41.180
Matplotlib.

06:42.040 --> 06:42.590
All right.

06:43.040 --> 06:51.200
So you can see the majority of our ham messages, I would say, reside in the range of zero to somewhere

06:51.200 --> 06:52.560
around 150,

06:52.580 --> 06:54.980
I would say 150, in terms of length.

06:55.820 --> 06:57.680
We can increase the bins also.

06:59.440 --> 07:02.580
Let's say bins, we will make it, let's say, 100.

07:04.260 --> 07:08.650
And let's just make it a little lighter.

07:09.670 --> 07:10.620
And if I execute it?

07:12.000 --> 07:15.150
All right, so you can see this is the histogram.

07:17.870 --> 07:22.840
Now, the same information, if you draw it for the spam message category also,

07:22.910 --> 07:28.040
then you will be able to get an idea of whether there is any differentiating factor or not.

07:28.430 --> 07:32.050
So let me just select it for another histogram.

07:32.590 --> 07:33.190
Same figure.

07:33.270 --> 07:38.700
We are going to draw that histogram for the spam category, and let's display it now.

07:39.080 --> 07:42.590
Both of these histograms will be displayed in two different colors.

07:43.010 --> 07:50.150
So now you will be able to see there is a little differentiating factor: whenever the message is higher in

07:50.150 --> 07:53.240
terms of length, there is a very high probability.

07:53.450 --> 08:00.500
You can see a longer message has a very high probability of belonging to the spam category.

08:01.400 --> 08:08.210
Although there are lengthy messages that are ham as well, the probability of putting those messages

08:08.210 --> 08:10.530
into the spam category will be higher.
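
NOTE
Editor's sketch of drawing the ham and spam length histograms on one figure.
The toy data, the alpha value of 0.7, the legend labels and the Agg backend
line are assumptions; only plt.hist, bins of 100 and plt.show come from the
narration.
```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
# Hypothetical balanced data with a message-length column.
data = pd.DataFrame({
    "label": ["ham"] * 4 + ["spam"] * 4,
    "length": [20, 35, 50, 140, 120, 150, 155, 160],
})
# Two hist calls on the same axes; Matplotlib gives each one its own color.
# alpha below 1 keeps the overlapping bars of both classes visible.
ham_counts, _, _ = plt.hist(data[data["label"] == "ham"]["length"], bins=100, alpha=0.7, label="ham")
spam_counts, _, _ = plt.hist(data[data["label"] == "spam"]["length"], bins=100, alpha=0.7, label="spam")
plt.legend()
plt.show()
```
Swapping "length" for the punctuation column would give the second comparison
with the same code.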

08:11.540 --> 08:15.140
And let's do the same thing with the punctuation marks also.

08:15.170 --> 08:17.450
So let me just select it here.

08:17.480 --> 08:19.640
We did the analysis for length.

08:19.970 --> 08:23.970
So let me do it for this particular column.

08:26.350 --> 08:27.510
And let's display.

08:29.030 --> 08:29.510
All right.

08:29.930 --> 08:36.590
So by looking at this graph, there is not much of a differentiating factor; the punctuation marks are not much

08:36.710 --> 08:41.420
affecting the classification of messages into ham or spam.

08:44.270 --> 08:47.430
All right, so that is our little initial analysis.

08:47.460 --> 08:54.120
We tried to understand the data, and we understood what we are trying to do with this

08:54.120 --> 08:54.480
data.

08:54.780 --> 08:57.520
We made our dataset balanced also.

08:57.840 --> 09:02.800
So the next task is to go ahead with building a model, and for building a model,

09:02.800 --> 09:09.120
the first step in machine learning is to separate your data into two different buckets, training

09:09.150 --> 09:09.780
and testing.

09:10.140 --> 09:12.500
So those things we will see in the next video.