WEBVTT

00:02.210 --> 00:03.110
All right, everyone.

00:03.140 --> 00:04.630
So our text is

00:04.860 --> 00:06.230
almost clean.

00:06.620 --> 00:12.080
The next step is to apply feature engineering to this text data.

00:12.730 --> 00:13.850
Why feature engineering?

00:13.850 --> 00:21.170
Because you cannot just supply all that text data to a machine learning algorithm; for that, you need to convert

00:21.170 --> 00:23.450
this data into some sort of numbers.

00:23.770 --> 00:28.010
And for that, in this particular project, we are going to use the bag-of-words model.

00:30.240 --> 00:34.470
Now, before diving into the bag-of-words implementation,

00:35.310 --> 00:38.490
let me explain what the bag-of-words model is.

00:39.150 --> 00:44.640
And for illustration purposes, I'll just search for a very good, clear representation of the

00:44.640 --> 00:46.460
bag-of-words model

00:46.690 --> 00:48.080
that is available on Quora.

00:48.900 --> 00:50.610
The Quora topic is the bag-of-words algorithm.

00:51.240 --> 00:56.610
So this explains very well what the bag-of-words model is.

00:57.120 --> 00:59.510
So let's say you have two documents that are available.

00:59.730 --> 01:07.170
One is "The quick brown fox jumped over the lazy dog's back," and the other is "Now is the time for all good men to come

01:07.170 --> 01:08.910
to the aid of their party."

01:09.570 --> 01:10.650
Now, here we have two documents.

01:10.740 --> 01:13.680
Similarly, we have a thousand documents in our data.

01:13.830 --> 01:14.190
You can see,

01:14.510 --> 01:16.090
I'm just trying to correlate the two.

01:16.740 --> 01:19.500
So this is like our first document, or our first record.

01:19.590 --> 01:20.080
In the same way,

01:20.130 --> 01:20.940
this is one record,

01:20.970 --> 01:21.780
and this is another record.

01:21.810 --> 01:24.570
Or you can consider these as document one and document two.

01:25.170 --> 01:28.410
Likewise, we have a thousand different documents.

01:29.290 --> 01:31.020
Right. Now, from these,

01:31.200 --> 01:38.100
we have to create some sort of representation, something like a vector representation for document

01:38.100 --> 01:46.970
one, a vector representation for document two, and so on, so that we can consider it as some sort of bag-of-words

01:46.980 --> 01:49.590
representation of the input data.

01:50.490 --> 01:50.780
All right.

01:51.120 --> 01:56.160
So now, from both of these documents, the first task in a bag-of-words model is to come up with

01:56.280 --> 02:01.730
a dictionary of all the possible words that remain after removing the stop words.

02:02.130 --> 02:07.740
Now, we do not need to worry about stop-word removal, because we have already removed them, and

02:07.740 --> 02:12.900
the implementation of the bag-of-words model that is available in the scikit-learn library

02:13.260 --> 02:16.390
also has a built-in mechanism to remove stop words.

02:17.130 --> 02:17.520
All right.

02:18.030 --> 02:19.230
So let's see.

02:19.350 --> 02:22.650
These are the terms that are available across all those documents.

02:23.280 --> 02:26.850
Now, we do not need to worry about all those things, because scikit-learn will take care of it.

02:27.180 --> 02:29.640
But internally, these are the things that are going on.

02:30.150 --> 02:32.110
So the words are like "aid",

02:32.210 --> 02:32.830
"all", "back",

02:33.120 --> 02:34.000
"brown", "come".

02:35.340 --> 02:36.710
And this is one way to lay them out.

02:36.760 --> 02:39.090
These terms are what I would call the vertical axis.

02:39.390 --> 02:41.370
So these are on the vertical axis. On the horizontal axis,

02:41.400 --> 02:43.680
we have the number of documents, or the number of records.

02:44.110 --> 02:45.990
So wherever the intersection occurs,

02:46.270 --> 02:47.130
there will be a one.

02:47.460 --> 02:51.220
And this one here indicates that "aid" occurs in document two.

02:51.330 --> 02:52.240
So you can see "aid":

02:52.260 --> 02:52.760
"aid", "aid".

02:53.010 --> 02:53.340
Yes.

02:53.730 --> 03:02.250
So this is a part of document two. "All" is also a part of document two, whereas "back" is a part of document one.

03:02.970 --> 03:09.920
So this way, wherever a term occurs in an individual record or document, a one will appear;

03:11.390 --> 03:13.040
otherwise, it will be zero.

03:13.520 --> 03:19.700
Now you can see, reading down the vertical axis, that document one has been represented with this kind of values,

03:19.700 --> 03:23.160
zeros and ones, and similarly document two

03:23.310 --> 03:29.110
also has the same kind of representation: wherever some vocabulary term appears,

03:29.660 --> 03:31.360
it needs to be represented by one;

03:31.520 --> 03:33.740
otherwise, it will be represented by zero.

03:34.400 --> 03:40.100
So that is nothing but a conversion of your original text data into numbers,

03:40.340 --> 03:46.730
what I would call a vector representation of your input data, and that's what the bag-of-words model does in

03:46.740 --> 03:47.200
general.
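
NOTE
The term-by-document table described above can be sketched in plain Python.
This is a minimal illustration using only the standard library, not the
scikit-learn implementation used later in the video. The stop-word set and the
helper name tokenize are assumptions made for this example.
```python
# Minimal sketch of a binary bag-of-words model (standard library only).
# STOP_WORDS is a small illustrative assumption, not scikit-learn's list.
import re
docs = [
    "The quick brown fox jumped over the lazy dog's back",
    "Now is the time for all good men to come to the aid of their party",
]
STOP_WORDS = {"the", "is", "for", "to", "of"}
def tokenize(text):
    # Lowercase, keep runs of letters and apostrophes, drop stop words.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
tokens = [tokenize(d) for d in docs]
# The vocabulary is the sorted set of all remaining terms across documents.
vocabulary = sorted({w for doc in tokens for w in doc})
# One row per document, one column per term: 1 if the term occurs, else 0.
vectors = []
for doc in tokens:
    present = set(doc)
    vectors.append([1 if term in present else 0 for term in vocabulary])
print(vocabulary)
for doc_id, row in enumerate(vectors, start=1):
    print(f"document {doc_id}:", row)
```
Each printed row is the vector for one document, with a 1 wherever that
document contains the vocabulary term in the same position.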

03:47.710 --> 03:48.000
All right.

03:48.050 --> 03:52.580
So now let's apply the same technique to this,

03:53.390 --> 03:54.740
all of our reviews,

03:55.110 --> 03:55.590
I would say.

03:58.010 --> 04:08.150
So for that, from sklearn.feature_extraction.text

04:08.720 --> 04:09.770
we are going to import

04:13.180 --> 04:14.630
the CountVectorizer class.

04:19.140 --> 04:22.590
And first, let's create an object of this CountVectorizer class.

04:24.720 --> 04:26.880
Now, a number of options are possible.

04:27.660 --> 04:33.270
The most important one, which we are going to focus on, is

04:34.390 --> 04:36.340
max_features.

04:37.180 --> 04:38.720
You can see max_features.

04:41.310 --> 04:44.380
All right. Now, what is this max_features?

04:44.920 --> 04:51.120
If you look here, you can consider each individual term as a feature: document

04:51.120 --> 04:55.120
one has this feature, or document one does not have this feature.

04:55.540 --> 05:01.330
So it just gives the existence of a particular term in a document; it is a kind of, you could say,

05:01.360 --> 05:08.440
binary model. The same thing applies here: while creating this model, how many features do you want

05:08.440 --> 05:09.790
to take into consideration?

05:10.270 --> 05:16.780
So let's say we have our thousand records available, and each of these records has tokens of this kind.

05:17.230 --> 05:22.870
Now, tokenization is also something we do not need to take care of; CountVectorizer will take care of

05:22.870 --> 05:22.970
it

05:22.990 --> 05:25.700
once we supply this to the CountVectorizer class.

05:26.660 --> 05:30.290
So let us set it to some thousand or, let's say, 1500.

05:30.700 --> 05:38.710
So that indicates that, while creating this bag-of-words model, only 1500 words will make up the total vocabulary,

05:38.770 --> 05:46.960
or 1500 word columns; that will be the maximum number of features allowed, and all remaining features will

05:46.960 --> 05:49.250
be counted as, like, an extra token.

05:50.110 --> 05:55.930
So eventually what it will do is take into consideration the first 1499

05:55.930 --> 06:02.580
features only, and all the vocabulary apart from those 1499

06:02.920 --> 06:05.500
will be considered as an extra token.

06:05.890 --> 06:08.890
So that is how it will create the bag-of-words model.
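
NOTE
A minimal standard-library sketch of how a vocabulary cap such as max_features
can work. For comparison with the description above, the scikit-learn
documentation for CountVectorizer describes max_features as keeping only the
top terms ordered by frequency across the corpus, with all other terms ignored
rather than mapped to an extra token, and this sketch follows that behaviour.
The toy corpus, the helper name top_terms, and the alphabetical tie-breaking
are assumptions made for this example.
```python
from collections import Counter
# Hypothetical toy corpus standing in for the cleaned reviews.
corpus = ["good food good service", "bad food", "good place", "slow service bad place"]
def top_terms(texts, max_features):
    # Count every term across the whole corpus.
    counts = Counter(word for text in texts for word in text.split())
    # Rank by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only the first max_features terms and return them in sorted order.
    return sorted(term for term, _ in ranked[:max_features])
vocabulary = top_terms(corpus, max_features=3)
# Terms outside the vocabulary are simply dropped; no extra column is added.
X = [[text.split().count(term) for term in vocabulary] for text in corpus]
print(vocabulary)
print(X)
```
With max_features=3 every document becomes a vector of exactly three counts,
however many distinct words the corpus contains.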

06:09.940 --> 06:12.440
So this is the total number of terms;

06:12.570 --> 06:14.830
here we have given it as 1500.

06:15.130 --> 06:18.070
Let me assign it to a variable, cv.

06:20.860 --> 06:28.010
And now we are going to use fit_transform on this corpus.

06:30.870 --> 06:37.980
And let me apply it. It is going to become a very sparse matrix, because many times you will

06:37.980 --> 06:41.540
see that if you have, let's say, 1500 terms,

06:41.820 --> 06:42.530
they do not necessarily all occur.

06:42.720 --> 06:49.260
I mean, a single document has only two or three terms, or maybe a maximum of five to 20 terms.

06:49.530 --> 06:51.690
So all the remaining values will be zero.

06:51.900 --> 06:58.410
So the matrix we are going to produce at this stage is a very sparse matrix.

06:59.550 --> 07:00.720
And let me add toarray().

07:01.970 --> 07:05.020
Now, let us pause this video for a second.

07:06.280 --> 07:11.600
And let us think about what the shape of this X is.

07:12.460 --> 07:20.260
All right, so how are we going to define this shape? Here, the total number of rows will be

07:20.260 --> 07:23.140
defined by the total number of records.

07:23.620 --> 07:26.740
So obviously, the first value will be a thousand.

07:27.370 --> 07:34.120
And each and every record, or document, or review, will be represented by 1500

07:34.120 --> 07:35.680
different numbers.

07:35.890 --> 07:39.190
So obviously the total number of columns will be fifteen hundred.

07:39.550 --> 07:46.540
So the shape of this X variable, which we get after the bag-of-words model, will be a thousand

07:46.990 --> 07:49.940
by one thousand five hundred.

07:50.320 --> 07:54.190
And in many places the value will remain zero.
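
NOTE
The steps described above can be sketched with scikit-learn as follows. The
four-document corpus is a hypothetical stand-in for the thousand cleaned
reviews, so the printed shape is (4, 10) here rather than (1000, 1500). The
variable names cv and X follow the video.
```python
from sklearn.feature_extraction.text import CountVectorizer
# Hypothetical stand-in for the cleaned review corpus from the video.
corpus = ["wow loved place", "crust not good", "not tasty texture nasty", "loved food"]
# Cap the vocabulary at 1500 terms; this tiny corpus has far fewer.
cv = CountVectorizer(max_features=1500)
# fit_transform returns a SciPy sparse matrix; toarray() converts it to a dense NumPy array.
X = cv.fit_transform(corpus).toarray()
# Rows are documents, columns are vocabulary terms.
print(X.shape)
# Fraction of entries that are zero, which shows how sparse the matrix is.
print((X == 0).mean())
```
The number of columns equals the vocabulary size, which never exceeds
max_features.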

07:55.540 --> 07:59.830
So that is our X. As the next step, we need to define y,

07:59.980 --> 08:05.710
because in the next video, we are going to apply the naive Bayes algorithm. So for that, we can simply

08:05.710 --> 08:09.970
use y equal to, from our original DataFrame object,

08:10.150 --> 08:13.170
I can simply grab, with iloc,

08:16.530 --> 08:20.150
let's say, all the rows, which I want to take, but only for this column.

08:21.330 --> 08:25.260
Let me take its values and record it as

08:25.500 --> 08:25.800
y.

08:26.730 --> 08:29.250
Let me display y.shape.

08:31.160 --> 08:34.370
So that will be a thousand values, and of those

08:34.440 --> 08:36.200
values, let me

08:36.710 --> 08:37.930
display the first 10.

08:39.880 --> 08:42.290
And it will be like zero, one, one, zero.

08:42.390 --> 08:43.780
So one indicates that they liked it,

08:44.450 --> 08:45.680
and zero is negative.
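
NOTE
A minimal sketch of the label extraction described above, assuming pandas. The
three-row DataFrame, its column names, and the choice of column index 1 are
all assumptions made for this example; the video does not state which column
holds the labels.
```python
import pandas as pd
# Hypothetical miniature of the review DataFrame; the column names are assumptions.
dataset = pd.DataFrame({
    "Review": ["wow loved place", "crust not good", "loved food"],
    "Liked": [1, 0, 1],
})
# All rows, one column only; .values returns the labels as a NumPy array.
y = dataset.iloc[:, 1].values
print(y.shape)
# Slicing past the end is safe, so this shows up to the first 10 labels.
print(y[:10])
```
With the full dataset from the video, y would hold a thousand labels, one per
review.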

08:46.400 --> 08:46.850
All right.

08:46.880 --> 08:49.860
So now we have our bag-of-words model ready.

08:49.910 --> 08:53.900
Next, we are going to apply the naive Bayes algorithm,

08:53.930 --> 08:58.600
the real machine learning algorithm, to create a model and make predictions.

08:58.910 --> 09:00.290
So see you in the next video.