WEBVTT

00:01.010 --> 00:01.500
How do you.

00:01.840 --> 00:06.070
So the next project we are going to work on is spam detection.

00:06.600 --> 00:09.310
Now, earlier also we worked on this spam detection.

00:09.890 --> 00:14.400
And we were mostly applying those machine learning related models.

00:14.860 --> 00:20.910
But in this video, we are going to apply whatever we learned in this particular section, the convolutional

00:20.940 --> 00:21.900
neural network.

00:22.530 --> 00:27.900
So for detecting spam messages, we are going to use this convolutional neural network.

00:28.380 --> 00:35.460
And all the functionality of a CNN has been implemented inside the TensorFlow and Keras libraries.

00:35.940 --> 00:43.980
So let me import all the necessary stuff. TensorFlow and Keras come by default with this

00:44.070 --> 00:44.910
Colab environment.

00:44.940 --> 00:47.310
So you do not require any kind of installation.

00:48.290 --> 00:52.000
Apart from that, from the scikit-learn library, train_test_split.

00:52.800 --> 00:58.290
These are some of the default preprocessing libraries, like NumPy and pandas, and for visualization,

00:58.410 --> 00:59.480
the Matplotlib library.

01:00.000 --> 01:06.000
Now, the new thing here is TensorFlow and Keras. So, for preprocessing,

01:07.240 --> 01:14.920
from the TensorFlow module, we are going to import Tokenizer, and the other one is pad_sequences.

01:15.010 --> 01:16.660
I will tell you what is the use of that.

01:17.380 --> 01:24.810
Next, to build the layers, we are going to import the Dense layer, Input layer, Embedding, and global max pooling.

01:25.780 --> 01:29.340
And just max pooling and a convolution 1D layer.

01:30.130 --> 01:32.470
And then it will be just Model.

01:33.280 --> 01:35.820
So let me import all those necessary stuff.

01:37.470 --> 01:40.310
So we are going to deal with this spam.csv file.

01:40.350 --> 01:43.120
So let me upload it from my local machine.

01:44.590 --> 01:45.090
Kotal.

01:46.130 --> 01:46.720
Next stop.

01:49.530 --> 01:51.410
So here is the spam.csv file.

01:55.890 --> 01:56.310
All right.

01:56.340 --> 01:56.700
So.

01:57.890 --> 01:59.670
It is successfully uploaded.

02:00.350 --> 02:02.990
The next thing is we need to read this file.

02:03.140 --> 02:04.970
So let me execute it.

02:06.240 --> 02:13.030
Just pandas, and I am using here the encoding ISO-8859-1, because otherwise you will get some error

02:13.210 --> 02:16.540
while reading this file, because some very odd characters exist.

02:17.050 --> 02:20.170
So only this type of encoding can understand those characters.

02:20.880 --> 02:23.910
Let me display the first few records of this DataFrame.

02:25.470 --> 02:27.510
And you can see we have v1, v2.

02:27.960 --> 02:32.460
And there is lots of missing data, like Unnamed: 2, Unnamed: 3, and Unnamed: 4.

02:32.820 --> 02:39.960
So in the next step, what we can do is simply remove those columns, Unnamed: 2, Unnamed: 3,

02:39.990 --> 02:40.850
and Unnamed: 4.

02:41.340 --> 02:42.390
So let me execute it.

02:43.030 --> 02:46.800
And the part of interest for us will be the two columns v1

02:46.920 --> 02:49.950
and v2. So now we have just the two columns.

02:50.460 --> 02:57.360
So based on this v2 column, we have to decide whether a message is spam or ham.

02:58.720 --> 02:59.700
Let's just rename the

03:00.570 --> 03:01.260
column names.

03:01.660 --> 03:06.000
So now one column will be data and one will be labels.

03:06.400 --> 03:10.630
So if you observe now, there is a data column and a labels column.

03:11.440 --> 03:16.720
Now, let's first do all the preprocessing required for the labels, let's say.

03:17.200 --> 03:22.780
So here the output values will be written in terms of, let's say, objects, or it will be in a form

03:22.780 --> 03:23.380
of text.

03:23.620 --> 03:26.960
But we are going to apply a neural network model on top of it.

03:26.980 --> 03:31.630
So we have to convert this ham and spam into numbers.

03:31.990 --> 03:34.990
So we are using the label encoder and.

03:36.330 --> 03:42.790
To convert every single ham into zero and spam into one, I have created here a small dictionary object.

03:43.450 --> 03:49.190
And with the help of this map function, we are going to apply it on just the labels column.

03:50.290 --> 03:58.180
And you can see, if I try to display, let's say, y, I get only 0, 0, 0, 1.

03:58.210 --> 03:59.500
So first will be ham.

03:59.590 --> 04:03.820
So ham will be mapped to zero and spam will be mapped to one.
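
NOTE A minimal plain-Python sketch of the label mapping described above.
The sample labels are hypothetical. In the video the same dictionary is
passed to the map function on the labels column of the DataFrame.
```python
# Hypothetical sample labels; the real ones come from the labels column.
labels = ["ham", "ham", "ham", "spam"]
# Small dictionary that maps each text label to a number.
label_map = {"ham": 0, "spam": 1}
# Look every label up in the dictionary, one by one.
y = [label_map[label] for label in labels]
print(y)
```
A neural network needs numeric targets, so this step comes before the split.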

04:04.960 --> 04:08.990
Next thing is we are going to split this dataset into two different buckets.

04:09.080 --> 04:12.940
Just like earlier: X_train, X_test, y_train, and y_test.

04:13.150 --> 04:16.020
So that will be a training dataset and a testing dataset.

04:16.540 --> 04:22.140
And this particular function we used from the scikit-learn library, which we imported here.

04:22.480 --> 04:24.010
So sklearn.model_selection.

04:25.180 --> 04:30.770
Now things will be a little different compared to the machine learning algorithms we applied earlier.

04:31.600 --> 04:34.850
The first thing is that we have the data available as text.

04:35.590 --> 04:39.490
So we need to apply some sort of embedding technique.

04:41.500 --> 04:44.160
But before we start applying the embedding model,

04:44.710 --> 04:49.520
here we are going to use a very simple bag-of-words kind of model.

04:50.020 --> 04:55.900
So every single unique token will be given one index position in the complete vocabulary.

04:56.530 --> 04:57.640
So what we will do.

04:57.660 --> 04:58.930
Let me create a new cell.

04:59.860 --> 05:03.310
And let me define this thing one by one.

05:04.320 --> 05:05.440
So you'll get a better idea.

05:05.770 --> 05:08.080
So here I am going to create one

05:08.170 --> 05:09.250
Tokenizer object.

05:09.940 --> 05:11.140
And I'll just pass.

05:11.140 --> 05:14.940
The number of words will be the maximum vocabulary size.

05:15.490 --> 05:20.050
Let's say I try to display the tokenizer.

05:23.920 --> 05:26.620
It will be a kind of Tokenizer class.

05:27.160 --> 05:32.770
The next thing is we need to fit the tokenizer on our input training dataset.

05:33.630 --> 05:35.890
So let me copy and put it here.

05:36.640 --> 05:38.380
So step by step, you will get an idea

05:38.680 --> 05:42.110
of what we're doing. Now, once we fit on the training dataset,

05:42.550 --> 05:50.410
we need to transform our text into sequences, and that conversion is done with texts_to_sequences.

05:50.440 --> 05:54.170
We are going to do it on both the training dataset and the testing

05:54.270 --> 05:59.130
dataset. Let me copy this one and put it into another cell.

06:01.340 --> 06:06.200
All right, now let me display the first few records of training.

06:07.780 --> 06:16.090
Let me do it with just the first record, and you can see just a set of numbers only. So the very first

06:16.720 --> 06:19.240
record in X_train.

06:19.450 --> 06:21.680
Let me display X_train also.

06:22.630 --> 06:32.710
So that will be X_train, and every word, whatever it will be, gets represented by this particular

06:32.860 --> 06:33.160
number.

06:33.700 --> 06:35.620
Now, if you try to do len

06:36.690 --> 06:37.430
on top of it,

06:39.230 --> 06:40.520
it will be twenty.

06:41.780 --> 06:42.440
Now, same thing.

06:42.470 --> 06:45.220
Let's say you display it for the next record.

06:46.860 --> 06:48.200
He's having a next one.

06:48.940 --> 06:50.040
So it is thirty.

06:50.530 --> 06:53.830
So every single record has a variable length sequence.
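
NOTE A simplified plain-Python sketch of the tokenizing idea described above,
not the Keras implementation. The function names and sample messages are
hypothetical. Each unique word gets an index starting at 1, and each message
becomes a list of indices whose length depends on the message.
```python
from collections import Counter
def fit_word_index(texts):
    """Give each unique lowercased word an index from 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]
# Hypothetical messages standing in for the training text.
train_texts = ["free prize call now", "call me later", "are you free later today"]
word_index = fit_word_index(train_texts)
sequences = texts_to_sequences(train_texts, word_index)
print(len(word_index))              # number of unique tokens
print([len(s) for s in sequences])  # one length per message
```
Index 0 is left unused here so that it stays free for padding later.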

06:54.620 --> 06:54.870
All right.

06:54.980 --> 06:57.000
So now we have converted

06:58.050 --> 07:00.750
our text data into sequences.

07:03.050 --> 07:05.770
Next, let's try to find the vocabulary.

07:07.410 --> 07:09.180
Then we will see about the padding.

07:10.210 --> 07:13.790
So vocabulary, let me know this much stuff.

07:13.880 --> 07:15.820
We can even execute this one.

07:16.930 --> 07:24.380
So the total number of unique tokens across this whole corpus is 7252, and from the Tokenizer

07:24.410 --> 07:26.760
class we get this information.

07:27.530 --> 07:33.860
So the total number of unique tokens in our whole model will be 7252.

07:35.840 --> 07:37.610
The next thing is, we need to

07:38.550 --> 07:46.400
create one big matrix, which we can pass on to our neural network model for doing the embedding stuff.

07:46.680 --> 07:50.790
And then we go for the convolutional neural network.

07:51.510 --> 07:54.120
So we have to go for pad_sequences.

07:54.150 --> 07:56.880
Now, why do we need to pad the sequences?

07:56.910 --> 07:59.340
Because here we have twenty.

08:00.030 --> 08:02.470
Now, in the second record, we have thirty numbers.

08:02.820 --> 08:05.570
But a neural network accepts fixed-length input.

08:05.970 --> 08:09.600
So in that case, we need to define one fixed-length input.

08:10.170 --> 08:12.750
And we need to pad zeros into it.

08:13.320 --> 08:13.980
So let me.

08:16.010 --> 08:16.310
Need.

08:17.350 --> 08:18.570
And we are here.

08:18.740 --> 08:21.600
We are using pad_sequences.

08:22.180 --> 08:27.730
Now, if you just try to display the very first record, let's say, of data_train.

08:29.430 --> 08:31.780
You can see there are lots of zeros padded in front.

08:32.790 --> 08:35.000
And then 12, 31 get started.

08:35.530 --> 08:42.750
Now, if you just compare with this one, or rather I have to print it again, this sequences

08:44.910 --> 08:46.690
underscore train, at zero.

08:48.970 --> 08:51.820
And you can see we have 31, 44.

08:52.170 --> 08:52.410
Well.

08:53.890 --> 08:55.340
And you can see from here.

08:55.650 --> 08:57.150
12, 31, 44 are present.

08:57.520 --> 09:00.910
So all the remaining values it has kept as zero.

09:01.390 --> 09:04.120
Now, if you try to find the length of this particular

09:07.960 --> 09:12.400
individual sequence, it will be 189.

09:12.790 --> 09:16.990
So every single record is now represented with

09:18.200 --> 09:19.930
189 numbers.

09:20.330 --> 09:26.360
If you try to display the length of the second record, that will also be the same.

09:26.780 --> 09:30.770
So that is the whole objective behind padding sequences.
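
NOTE A minimal plain-Python sketch of the padding step described above, not
the Keras pad_sequences implementation. The helper name pad_pre and the
sample sequences are hypothetical. It puts zeros in front of shorter
sequences, which matches what the video shows for the first record.
```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (or left-truncate) each sequence so all have length maxlen."""
    if maxlen is None:
        # Default to the longest sequence in the batch.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[max(len(seq) - maxlen, 0):]  # keep the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + list(kept))
    return padded
# Hypothetical index sequences of different lengths.
sequences = [[5, 2, 9], [7, 4], [3, 9, 2, 8, 1]]
padded = pad_pre(sequences)
print(padded)
print([len(row) for row in padded])  # every row now has the same length
```
With the real data the common length is 189, as shown in the video.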

09:31.250 --> 09:31.540
All right.

09:31.580 --> 09:33.980
So I'm just winding up at this point.

09:34.430 --> 09:39.120
And in the next video, we will see how we can build this model.

09:39.510 --> 09:41.080
So see you in the next video.
