WEBVTT

00:00.720 --> 00:01.630
All right, everyone.

00:01.750 --> 00:09.550
So the next step is applying the random forest machine learning algorithm on top of our message column,

00:09.580 --> 00:14.150
or I would say our training data, that is X_train and y_train.

00:15.020 --> 00:15.360
All right.

00:15.760 --> 00:23.410
So here we are dealing with text data only, and we cannot just feed all that textual data directly

00:23.410 --> 00:25.450
to a machine learning algorithm.

00:25.450 --> 00:31.630
We have to apply some sort of encoding which will convert all that textual data into some sort of

00:32.020 --> 00:32.650
numbers.

00:33.010 --> 00:39.310
Now, to do that, there are a number of methods available, like the bag-of-words model, or I would

00:39.310 --> 00:46.830
say the TF-IDF model, and some of the deep learning based advanced models like word2vec or GloVe

00:46.830 --> 00:47.320
vectors.

00:47.740 --> 00:55.630
So all those techniques eventually try to convert the textual data into numbers which

00:55.630 --> 01:00.400
try to preserve the semantic relationships that exist in the data.

01:01.230 --> 01:06.400
The deep learning based models obviously work much better in terms of accuracy.

01:06.790 --> 01:14.480
But we'll go with one encoding technique which will convert all your text data into a kind of

01:14.530 --> 01:15.370
TF-IDF vector.

01:15.410 --> 01:17.830
That is nothing but term frequency

01:18.190 --> 01:20.370
and inverse document frequency.

01:20.410 --> 01:28.150
So multiplying both of these gives you a number, a score for the presence of

01:28.150 --> 01:30.430
a particular word in a document.

01:30.820 --> 01:35.410
So you can consider every single record here as a kind of document.
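
NOTE
A minimal sketch of the term frequency times inverse document frequency idea, in plain Python.
The toy messages and the helper names tf, idf and tfidf are assumptions for illustration only.
scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ.
```python
import math
# Toy "documents": each message is treated as one document (made-up examples).
docs = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting now",
]
tokenized = [d.split() for d in docs]
def tf(word, doc_tokens):
    # Term frequency: share of the document's tokens that are this word.
    return doc_tokens.count(word) / len(doc_tokens)
def idf(word, all_docs):
    # Inverse document frequency: words found in fewer documents score higher.
    # Assumes the word occurs in at least one document.
    containing = sum(1 for tokens in all_docs if word in tokens)
    return math.log(len(all_docs) / containing)
def tfidf(word, doc_tokens, all_docs):
    # The final score is the product of the two quantities.
    return tf(word, doc_tokens) * idf(word, all_docs)
score_prize = tfidf("prize", tokenized[0], tokenized)  # "prize" occurs in 1 of 3 documents
score_free = tfidf("free", tokenized[0], tokenized)    # "free" occurs in 2 of 3 documents
print(score_prize, score_free)
```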

01:36.070 --> 01:37.060
So what will we do?

01:37.390 --> 01:41.000
Let me first import TF-IDF.

01:42.030 --> 01:42.970
So from

01:46.180 --> 01:46.900
sklearn

01:49.500 --> 01:51.180
.feature_extraction

01:52.110 --> 01:53.100
let's import.

01:58.170 --> 02:00.260
So feature_extraction, and inside that,

02:00.360 --> 02:08.820
we have to go to the text module, and it will be TfidfVectorizer.

02:09.710 --> 02:10.020
All right.

02:10.050 --> 02:11.020
So that is the one import.

02:11.610 --> 02:14.620
Another one: we are going to apply the random forest algorithm.

02:14.850 --> 02:18.420
So that is a part of the sklearn ensemble module.

02:21.710 --> 02:22.980
And let me import

02:28.170 --> 02:29.000
random forest

02:29.100 --> 02:29.660
classifier.

02:30.210 --> 02:32.280
So first we'll apply TF-IDF,

02:32.460 --> 02:35.180
and then the random forest classifier.

02:35.250 --> 02:37.830
So that is very much a pipelined process.

02:38.100 --> 02:43.770
So instead of applying both of these things individually, we can create one pipeline object.

02:44.190 --> 02:47.970
So that is also part of sklearn, in sklearn.pipeline.

02:49.560 --> 02:50.820
We will import this

02:52.930 --> 02:53.410
Pipeline.
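
NOTE
The three imports dictated above, written out as code.
These are the standard scikit-learn module paths for the classes the speaker names.
```python
from sklearn.feature_extraction.text import TfidfVectorizer  # text to TF-IDF vectors
from sklearn.ensemble import RandomForestClassifier          # the random forest model
from sklearn.pipeline import Pipeline                        # chains both steps together
```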

02:54.160 --> 03:01.220
All right, now, instead of creating these two objects, we will create a pipeline object.

03:01.600 --> 03:07.480
And inside the pipeline object, we will just pass both of these objects, the TF-IDF vectorizer and

03:07.510 --> 03:08.890
the random forest classifier.

03:09.430 --> 03:11.620
So let me make the pipeline object.

03:15.290 --> 03:18.560
And here we will pass them as a list,

03:19.190 --> 03:27.020
these two objects, the TF-IDF vectorizer object and the random forest classifier object. So we can pass

03:27.020 --> 03:29.390
both of them as tuples.

03:31.670 --> 03:39.500
So let's give the first one a name like tfidf, and then its corresponding object.

03:41.090 --> 03:41.810
Next is,

03:46.120 --> 03:52.630
obviously, a classifier, that is nothing but the random forest classifier object.

03:53.260 --> 03:55.130
Now this random forest classifier

03:55.130 --> 03:57.660
contains a lot of hyperparameters.

03:58.420 --> 04:01.300
So you can see here,

04:02.890 --> 04:08.760
there are a lot of hyperparameters, so if you want to improve this model, you can do hyperparameter

04:08.760 --> 04:09.230
tuning.

04:09.700 --> 04:14.830
So let's just set some hyperparameters, like, let's say, n_estimators.

04:17.330 --> 04:18.760
[unintelligible]

04:19.230 --> 04:21.050
Let's just make it [unintelligible].

04:22.010 --> 04:23.690
So that will be n_estimators.

04:24.260 --> 04:26.670
Let's just make it, for the time being, let's say 10.

04:28.520 --> 04:33.710
And let me assign it to classifier as an object.
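
NOTE
A sketch of the pipeline object described above: a list of (name, object) tuples.
The step names "tfidf" and "clf" are assumptions; only n_estimators=10 and the
variable name classifier come from the narration.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Each step is a (name, object) tuple; the names are illustrative labels.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10)),  # 10 trees, as chosen in the video
])
print(list(classifier.named_steps))
```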

04:35.680 --> 04:42.790
And let's just feed in our input training data, because once the object is created, you need to train it.

04:42.990 --> 04:46.570
And for training, there is a uniform set of APIs available

04:46.700 --> 04:49.130
in this sklearn library.

04:49.450 --> 04:51.370
So that will be the fit function.

04:51.940 --> 04:53.220
So here we are going to pass

04:53.380 --> 04:56.110
X_train and y_train.

04:58.270 --> 05:01.070
All right, let me execute it.
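
NOTE
A self-contained sketch of the training step. The four toy messages stand in for the
video's X_train and y_train, which are not shown in this transcript, and random_state=0
is added here only to make the run repeatable.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Hypothetical toy training data (not the dataset used in the video).
X_train = [
    "win a free prize now",
    "free cash claim your prize",
    "are we meeting for lunch",
    "see you at home tonight",
]
y_train = ["spam", "spam", "ham", "ham"]
classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
# fit() runs TF-IDF on the text, then trains the forest on the resulting vectors.
classifier.fit(X_train, y_train)
n_trees = len(classifier.named_steps["clf"].estimators_)
print(n_trees)
```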

05:01.520 --> 05:05.330
And the training process will start, and you can see, for creating

05:06.360 --> 05:10.350
ten different estimators, or I would say the 10 different

05:10.780 --> 05:12.120
decision trees it will create,

05:12.480 --> 05:18.060
it has immediately given us the result, so we can even go with, let's say, [unintelligible] also.

05:18.560 --> 05:23.910
And let me define the classifier once again, and let's just fit it once more.

05:25.460 --> 05:27.520
All right, so immediately we've got the result.

05:27.950 --> 05:30.650
That means our training is finished.

05:31.130 --> 05:35.120
Let's try to evaluate our model, how good our model is.

05:35.540 --> 05:36.680
So the model is created.

05:37.100 --> 05:41.390
Our training process is over. Now we need to use this classifier object,

05:42.710 --> 05:44.960
and we are going to use its predict method.

05:46.200 --> 05:50.370
So prediction we are going to apply on X_test.

05:50.820 --> 05:57.540
That means the input test data we will pass, and it will give us the predictions, the values our model

05:57.540 --> 05:59.510
has predicted.

06:00.260 --> 06:01.680
Let me execute it.

06:03.070 --> 06:09.740
And if you just display y_test and y_pred,

06:10.660 --> 06:13.060
you will be able to see a side-by-side comparison.
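
NOTE
A sketch of the prediction step and the side-by-side display. All data here is made up
for illustration; X_test, y_test and y_pred mirror the variable names in the narration.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Hypothetical toy data standing in for the video's train/test split.
X_train = [
    "win a free prize now",
    "free cash claim your prize",
    "are we meeting for lunch",
    "see you at home tonight",
]
y_train = ["spam", "spam", "ham", "ham"]
X_test = ["claim your free prize", "lunch at home tonight"]
y_test = ["spam", "ham"]
classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
classifier.fit(X_train, y_train)
# predict() vectorizes the raw test messages and returns one label per message.
y_pred = classifier.predict(X_test)
# Show each true label next to the predicted one.
for true_label, predicted_label in zip(y_test, y_pred):
    print(true_label, predicted_label)
```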

06:13.840 --> 06:14.260
All right.

06:14.290 --> 06:20.070
So first it has displayed y_test and then it has displayed y_pred.

06:20.110 --> 06:23.390
So first is spam, then we have a spam, then spam.

06:24.070 --> 06:24.390
All right.

06:24.430 --> 06:27.610
So analytically, how can we compare both of these?

06:27.870 --> 06:28.170
Is a.

06:29.650 --> 06:37.000
And to do that, there are different metrics available to get the accuracy of your

06:37.390 --> 06:39.040
classification problem.

06:39.700 --> 06:42.420
So one is like a full, detailed classification report.

06:42.440 --> 06:47.610
Or if you just want to know about the accuracy score, or if you want to know about

06:48.010 --> 06:50.100
one specific measurement criterion, like the

06:50.890 --> 06:51.970
confusion matrix.

06:53.920 --> 06:55.930
So let's try to find all of them.

06:55.960 --> 06:59.260
So from sklearn

07:03.270 --> 07:04.420
.metrics

07:05.130 --> 07:06.240
we are going to import

07:09.520 --> 07:14.080
classification_report, another one is accuracy_score,

07:16.920 --> 07:21.910
and one more is confusion_matrix. Right. Now,

07:22.590 --> 07:30.040
whatever the predicted data and our test data are, we will supply them and try to find all those measurement

07:30.070 --> 07:31.890
criteria, how good our model is.

07:32.310 --> 07:35.390
So first, we can use accuracy_score.

07:36.460 --> 07:39.350
So the first argument you need to pass is y_true,

07:39.400 --> 07:43.270
that means your ground-truth labels, and next is your y predicted.

07:43.870 --> 07:49.390
So that will be y_test, that is our y_true, and y predicted

07:49.480 --> 07:50.970
will be y_pred.

07:51.850 --> 07:53.050
And if you execute it,

07:54.200 --> 07:58.420
you'll be able to see that in almost 94.65 percent of the cases

07:58.780 --> 07:59.830
we got it right.

07:59.860 --> 08:02.460
That means our model has predicted

08:02.950 --> 08:04.030
very good results.
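
NOTE
A small sketch of accuracy_score(y_true, y_pred). The label lists are made up,
so the printed value is for this toy example, not the 94.65 percent from the video.
```python
from sklearn.metrics import accuracy_score
# Hypothetical labels: ground truth first, predictions second.
y_test = ["ham", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "spam", "ham"]
# Fraction of positions where the prediction matches the true label (4 of 5 here).
acc = accuracy_score(y_test, y_pred)
print(acc)
```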

08:04.120 --> 08:10.150
If you just multiply it by our total number of records, that will be around...

08:11.110 --> 08:12.410
How many testing records?

08:12.460 --> 08:14.640
That is 449 testing records.

08:15.190 --> 08:17.500
So if you multiply this number by

08:20.170 --> 08:25.900
449, we will get 425. So out of 449,

08:26.170 --> 08:27.940
we know for 425 samples

08:27.970 --> 08:28.870
we got it right.
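
NOTE
The arithmetic above as a quick check, using only the two figures quoted in the
narration (accuracy of about 0.9465 and 449 test records).
```python
accuracy = 0.9465  # accuracy quoted in the video, rounded
n_test = 449       # number of test records quoted in the video
# Accuracy times the number of records gives the count of correct predictions.
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct
print(n_correct, n_wrong)  # 425 24
```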

08:29.230 --> 08:31.240
That means our model is quite accurate.

08:31.270 --> 08:33.400
We got good enough accuracy.

08:34.090 --> 08:38.710
We can even calculate the confusion matrix also. So the confusion matrix

08:38.740 --> 08:46.600
gives us a little more detail related to our classification problem, and how accurate and

08:46.600 --> 08:47.800
good our model is.

08:49.420 --> 08:51.010
And let's just pass,

08:51.310 --> 08:52.240
same as earlier,

08:52.480 --> 08:55.480
y_test and

08:56.940 --> 08:57.090
y_pred.

08:58.210 --> 08:58.630
All right.

08:59.170 --> 09:06.080
So now, as we are dealing here with a binary classification problem, it will return us a two-by-two

09:06.250 --> 09:06.820
matrix.

09:07.030 --> 09:12.130
So in the diagonal elements, you can see what it has predicted correctly.

09:12.520 --> 09:16.000
So in the first case, let's consider it like

09:17.850 --> 09:26.340
ham, then we have spam; ham, and then we have spam. So in the off-diagonal elements, that is 24 in

09:26.340 --> 09:26.790
number,

09:27.000 --> 09:35.880
that's where we got it wrong: our data was ham and our system has predicted it as spam, whereas

09:36.170 --> 09:39.540
sometimes it has predicted ham

09:39.780 --> 09:41.280
but our data was spam.

09:41.710 --> 09:42.730
So, vice versa.

09:43.410 --> 09:47.650
But in the diagonal elements, 226 and 199,

09:48.210 --> 09:51.950
these are the cases where we got it completely right.
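
NOTE
A sketch of confusion_matrix on made-up binary labels. Rows are true labels and
columns are predicted labels, so the diagonal counts correct predictions and the
off-diagonal entries count the mistakes. The numbers are not those from the video.
```python
from sklearn.metrics import confusion_matrix
# Hypothetical labels: three true ham messages followed by three true spam messages.
y_test = ["ham", "ham", "ham", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "spam", "spam", "spam", "ham"]
# Fix the label order so row/column 0 is ham and row/column 1 is spam.
cm = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
n_correct = int(cm.trace())           # diagonal: predicted label equals true label
n_wrong = int(cm.sum()) - n_correct   # off-diagonal: misclassified messages
print(cm)
print(n_correct, n_wrong)
```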

09:52.920 --> 09:57.510
Now, if you're dealing with, let's say, a multiclass classification problem, let's say a five-class classification

09:57.510 --> 09:57.900
problem,

09:58.170 --> 10:04.800
then in that case you would have got this confusion matrix as a five-by-five matrix, because each row

10:04.920 --> 10:09.950
corresponds to a true label and each column to a predicted label.
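
NOTE
A sketch of the five-class case with made-up integer labels 0 to 4, showing that
the confusion matrix becomes five by five with true labels on the rows.
```python
from sklearn.metrics import confusion_matrix
# Hypothetical labels for a five-class problem (classes 0..4).
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 4, 1, 1, 2, 4, 4]
cm = confusion_matrix(y_true, y_pred)
# cm[i, j] counts samples whose true class is i and predicted class is j.
print(cm.shape)
print(cm)
```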

10:10.790 --> 10:15.090
Now, if you want to get more information about this confusion matrix, you can just Google it

10:15.440 --> 10:17.040
and you will get a much better idea.

10:17.890 --> 10:18.220
All right.

10:18.270 --> 10:19.980
So one more is the classification report.

10:22.170 --> 10:27.880
So, same as earlier, we can pass y_test and y_pred here.

10:29.440 --> 10:30.630
And let me execute it.

10:31.840 --> 10:38.460
All right, so it doesn't display in a nicely formatted manner, so we can just wrap it in

10:38.460 --> 10:39.400
a print function.

10:40.800 --> 10:41.290
All right.

10:41.380 --> 10:45.440
So here is a somewhat more detailed explanation of our model.

10:45.850 --> 10:48.310
It gives precision, recall,

10:48.610 --> 10:49.260
F1 score,

10:49.450 --> 10:51.340
and support for both categories,

10:51.340 --> 10:52.620
ham and spam.
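
NOTE
A sketch of classification_report on made-up labels. It returns a plain string,
which is why wrapping it in print() gives the readable table mentioned above.
```python
from sklearn.metrics import classification_report
# Hypothetical labels, not the video's test set.
y_test = ["ham", "ham", "ham", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "spam", "spam", "spam", "ham"]
# One row per class with precision, recall, f1-score and support.
report = classification_report(y_test, y_pred)
print(report)
```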

10:53.920 --> 10:54.470
All right.

10:54.520 --> 10:57.280
So that is how accurate and good our model is.

10:57.400 --> 10:58.330
with a random forest.

10:58.960 --> 11:01.630
Now, can you improve it or can you apply some other model?

11:01.660 --> 11:06.640
So in the next video, we will see how to apply another model, the support vector machine.

11:06.910 --> 11:08.190
So see you in the next video.