WEBVTT

00:01.160 --> 00:01.740
Hey, everyone.

00:01.800 --> 00:03.630
So let's continue our discussion on

00:04.580 --> 00:11.600
this text classification with spaCy project. So we have cleaned one of our original reviews.

00:11.960 --> 00:16.510
Now we have to apply that whole cleaning process on every single review.

00:16.670 --> 00:19.520
Like I said, but we are not going to do that.

00:20.150 --> 00:26.560
Instead of that, the feature engineering stuff in sklearn, or I would say the scikit-learn

00:26.560 --> 00:28.880
library, will do it for us.

00:29.270 --> 00:32.290
And for that, we are going to use this TfidfVectorizer.

00:32.450 --> 00:33.710
And that is one pipeline.

00:34.220 --> 00:39.770
And then we are going to apply this support vector machine, machine learning algorithm.

00:39.800 --> 00:46.650
So let me execute all these necessary imports, and I will tell you what this TfidfVectorizer is.

00:47.390 --> 00:54.860
So this is another feature encoding technique, where we are going to convert all our terms,

00:54.980 --> 00:58.800
or I would say the original documents, our positive and negative reviews,

00:58.870 --> 01:03.230
whatever they are, into some sort of numbers.

01:03.650 --> 01:07.760
So if you just search for TF-IDF.

01:09.260 --> 01:11.690
Yes, that's a good explanation, I guess.

01:11.930 --> 01:12.410
Let me know.

01:15.260 --> 01:22.550
So based on this particular formula, they are calculating how important an individual word is in an individual

01:23.120 --> 01:23.660
document.

01:24.260 --> 01:26.150
So for a term t in

01:27.390 --> 01:35.410
a document d, how important it is, the score of this TF-IDF will decide those things.

01:35.800 --> 01:40.960
So term frequency is how many times a word occurs in a document.

01:41.720 --> 01:44.400
And the inverse document frequency is based on

01:45.140 --> 01:45.980
in how many documents

01:45.980 --> 01:50.120
some particular term occurs, across all those documents.

01:50.420 --> 01:57.570
And if you just multiply them, it gives a kind of normalized score: the words that occur very frequently,

01:57.570 --> 02:00.860
the ones that always occur in English,

02:01.100 --> 02:02.800
it will give lower weightage.

02:03.260 --> 02:07.410
And those which are the rare words, it will give higher weightage.

02:07.760 --> 02:10.830
And that's why this IDF component is added.

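NOTE
A minimal sketch of the idea above, in plain Python, assuming the textbook
form tf * log(N / df). The toy corpus and the helper name tf_idf are invented
for illustration; scikit-learn's TfidfVectorizer uses a smoothed, normalized
variant, so its numbers will differ.
```python
import math
from collections import Counter
# Toy corpus, invented for illustration (not the reviews from the video).
docs = [
    "the food was good",
    "the service was slow",
    "the plot was brilliant",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))
def tf_idf(term, tokens):
    """Textbook tf * log(N / df) for one term in one tokenized document."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf
common_score = tf_idf("the", tokenized[0])   # "the" appears in every document
rare_score = tf_idf("food", tokenized[0])    # "food" appears in one document
print(common_score, rare_score)
```
A word found in every document scores zero here, while a rarer word scores higher.
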
02:11.210 --> 02:13.070
So with the help of this particular formula,

02:13.140 --> 02:18.380
we're converting all those reviews into numbers.

02:18.680 --> 02:24.880
And for that, we are going to use this TfidfVectorizer class from the sklearn library.

02:25.370 --> 02:32.490
Now, while converting, we are going to pass this function, whatever

02:32.630 --> 02:39.780
text data cleaning function we have created, and we will tell this TfidfVectorizer class that,

02:40.270 --> 02:46.370
before converting, when you do all that tokenization, as the tokenizer

02:47.630 --> 02:51.300
you pass everything through this function.

02:51.870 --> 02:53.640
So let me execute it.

02:55.570 --> 02:58.220
And it will create a TfidfVectorizer object.

02:58.580 --> 03:02.390
So this TfidfVectorizer class will do two things.

03:02.780 --> 03:05.540
First, it will apply this function.

03:06.170 --> 03:11.030
So tokenization will be done according to whatever steps we have defined here.

03:11.510 --> 03:18.470
And after that, the TF-IDF formula will be applied to calculate the actual TF-IDF score for each

03:18.470 --> 03:21.320
individual term in different documents.

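NOTE
A rough sketch of passing a custom tokenizer to scikit-learn's
TfidfVectorizer, as described above. The text_data_cleaning function here is
a hypothetical plain-Python stand-in for the spaCy-based cleaning function
from the video, and the two reviews are invented.
```python
import string
from sklearn.feature_extraction.text import TfidfVectorizer
def text_data_cleaning(sentence):
    """Hypothetical stand-in: split on whitespace and strip punctuation."""
    stripped = (tok.strip(string.punctuation) for tok in sentence.split())
    return [tok for tok in stripped if tok]
# Toy reviews, invented for illustration.
reviews = ["The food was great!", "The service was terrible."]
# The vectorizer lowercases each text, then calls our function to tokenize it.
tfidf = TfidfVectorizer(tokenizer=text_data_cleaning)
matrix = tfidf.fit_transform(reviews)
print(matrix.shape)
print(sorted(tfidf.vocabulary_))
```
Recent scikit-learn versions may warn that token_pattern is ignored when a tokenizer is given; that is expected.
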
03:21.860 --> 03:23.180
And let me have any.

03:24.370 --> 03:28.330
All right, let's create a support vector machine classifier.

03:29.910 --> 03:33.820
All right, so next thing is, we have the data available with us.

03:34.240 --> 03:36.010
We have a machine learning algorithm

03:36.010 --> 03:36.800
available with us.

03:37.300 --> 03:39.760
Next thing is we need to train this model.

03:40.360 --> 03:42.520
And for that, we have to split our data.

03:43.060 --> 03:44.410
We are going to split the data

03:44.440 --> 03:49.720
with the help of this train_test_split, and 20 percent of the data

03:49.750 --> 03:52.590
we are going to keep inside the testing bucket.

03:53.020 --> 03:59.010
So if you just observe the shape of the train and test datasets, 550 records are

03:59.020 --> 04:03.940
a part of the testing bucket and 2198 are a part of the training bucket.

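NOTE
A small sketch of the 80/20 split with scikit-learn's train_test_split.
The placeholder data and random_state=42 are assumptions; the size of 2748
items is chosen only so that the split matches the 2198 and 550 shapes
reported in the video.
```python
from sklearn.model_selection import train_test_split
# Placeholder texts and labels standing in for the real reviews.
X = [f"review {i}" for i in range(2748)]
y = [i % 2 for i in range(2748)]
# test_size=0.2 keeps 20 percent of the data in the testing bucket.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```
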
04:04.510 --> 04:06.700
So just observe the first few rows of the training data.

04:07.670 --> 04:08.110
All right.

04:08.590 --> 04:10.990
Next is we are going to create one pipeline object.

04:11.530 --> 04:14.170
And first, we will pass

04:14.980 --> 04:19.600
the object of this TfidfVectorizer, and then the classifier.

04:20.110 --> 04:21.980
So that will create one pipeline object.

04:22.690 --> 04:26.550
And let's just fit our data, in the sense that training is going to start

04:26.710 --> 04:26.950
now.

04:31.230 --> 04:37.710
So it's taking a good amount of time for training, so I'm just fast forwarding my video

04:37.910 --> 04:39.290
until the training

04:39.780 --> 04:40.460
is complete.

04:43.620 --> 04:49.950
All right, so now you can see that training is finished and it has returned this pipeline object, where

04:50.340 --> 04:56.940
there are two steps: one is the TF-IDF step and the other one is the support vector machine

04:56.940 --> 04:57.200
step.

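NOTE
A hedged sketch of the two-step pipeline described above: a TfidfVectorizer
followed by a LinearSVC. The step names, the variable name clf, and the tiny
training set are assumptions for illustration; the video trains on the real
reviews with a spaCy-based tokenizer, which is left out here.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# Tiny invented training set: 1 = positive, 0 = negative.
X_train = [
    "great food and lovely service",
    "wonderful movie, loved it",
    "terrible food and rude staff",
    "boring movie, hated it",
]
y_train = [1, 1, 0, 0]
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # step 1: text to TF-IDF features
    ("clf", LinearSVC()),           # step 2: linear support vector machine
])
# Fitting the pipeline fits the vectorizer first, then trains the classifier.
clf.fit(X_train, y_train)
predictions = clf.predict(["lovely wonderful service", "rude and boring"])
print(predictions)
```
Calling predict on raw strings works because the pipeline vectorizes them before classifying.
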
04:57.690 --> 04:58.960
Now, once the training is over,

04:59.040 --> 05:01.380
next thing is to predict our test

05:01.680 --> 05:02.030
results.

05:02.640 --> 05:10.980
So for prediction of testing results, we are going to import these classification criteria, like

05:10.980 --> 05:14.220
accuracy_score, classification_report, and confusion_matrix.

05:15.060 --> 05:18.810
And let's apply prediction on our testing dataset.

05:18.870 --> 05:21.690
So it will return us all those predictions.

05:22.140 --> 05:25.110
Let's look at the confusion matrix.

05:25.390 --> 05:29.090
So now here we are dealing with a binary classification problem,

05:29.100 --> 05:31.920
so it has come up with a two-by-two matrix.

05:32.340 --> 05:36.650
So 201 cases and 221 cases,

05:36.690 --> 05:45.630
those are the diagonal elements, where everything predicted was the right prediction, whereas we missed the prediction

05:45.630 --> 05:48.140
in the case of 50 plus 78 records.

05:48.510 --> 05:53.670
If you need a detailed explanation of this classification, you can use this classification_report

05:53.670 --> 05:57.690
and you can pass in all your ground truth values and predictions.

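NOTE
A short sketch of the three scikit-learn metric functions named above.
The label lists are invented for illustration and are not the results
from the video.
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Invented labels: 1 = positive, 0 = negative.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
# Fraction of predictions that match the ground truth.
acc = accuracy_score(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
print(acc)
```
The diagonal of the matrix holds the correct predictions for each class.
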
05:58.770 --> 06:03.450
So in this case, we are getting almost 77 percent accuracy.

06:04.560 --> 06:11.580
If you want to work out this accuracy score yourself, it is something like this: the total will be your total number of testing

06:11.700 --> 06:12.090
records.

06:12.180 --> 06:13.830
So we have a testing dataset.

06:13.860 --> 06:15.570
How many testing records?

06:16.200 --> 06:17.250
Five hundred and fifty.

06:17.670 --> 06:24.690
So out of 550, there are 50 plus 78 that have gone wrong.

06:25.200 --> 06:30.050
And our accuracy score is 76.72 percent.

06:30.480 --> 06:33.780
So almost 23.5,

06:33.920 --> 06:36.110
or I would say 24 percent of the cases,

06:36.120 --> 06:37.230
we got wrong.

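NOTE
The accuracy arithmetic above can be sketched in plain Python. The counts
are as reconstructed from the spoken numbers: 201 and 221 on the diagonal,
50 and 78 off it. Which off-diagonal cell holds which error type is an
assumption made only for this sketch.
```python
# Assumed layout: rows are true classes, columns are predicted classes.
confusion = [
    [201, 50],
    [78, 221],
]
correct = confusion[0][0] + confusion[1][1]          # diagonal elements
total = sum(sum(row) for row in confusion)           # all test records
wrong = total - correct                              # off-diagonal elements
accuracy = correct / total
error_rate = wrong / total
print(total, correct, wrong)
print(f"accuracy = {accuracy:.4f}, error rate = {error_rate:.4f}")
```
Accuracy is the diagonal sum divided by the total number of test records.
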
06:37.950 --> 06:40.920
Now, let's check with some sample reviews.

06:41.450 --> 06:41.950
Let's see.

06:42.210 --> 06:43.350
The review will be: Wow,

06:43.440 --> 06:43.700
I am

06:43.700 --> 06:46.710
learning natural language processing in a fun fashion.

06:47.160 --> 06:54.540
So if you just apply the classifier algorithm, or the pipeline algorithm, what it will do is, it

06:54.540 --> 06:58.120
will try to convert the text with the help of this TF-IDF

06:58.320 --> 07:07.530
vectorizer, with the text processing function we created, then apply this LinearSVC, or I would say

07:07.860 --> 07:12.450
support vector machine class, and create a prediction for us.

07:12.900 --> 07:16.820
So the prediction is one; that means it's a positive review.

07:17.670 --> 07:18.800
Let's predict it for:

07:19.020 --> 07:20.870
It's hard to learn new things.

07:21.020 --> 07:24.630
That means it's a kind of negative feeling we're giving.

07:24.690 --> 07:25.850
Let's predict.

07:26.640 --> 07:27.780
So output is zero.

07:27.900 --> 07:29.880
That means it's a negative review.

07:30.530 --> 07:30.870
All right.

07:30.900 --> 07:33.240
So that's all about this. Combined,

07:33.450 --> 07:34.550
we used these

07:34.560 --> 07:42.240
IMDb, Yelp, and Amazon reviews and applied on them this text classification, or I

07:42.240 --> 07:46.340
would say sentiment analysis, problem: positive or negative.

07:46.830 --> 07:54.560
And mainly we used different things, like different spaCy functions, and saw how we can apply all those

07:54.790 --> 08:01.300
steps as a pipeline process while using the different functions of the scikit-learn library.

08:02.070 --> 08:02.460
All right.

08:02.580 --> 08:03.930
That's all about this project.

08:03.960 --> 08:05.060
See you in the next video.