WEBVTT

00:01.370 --> 00:02.210
All right, everyone.

00:02.240 --> 00:07.670
So the next step, which we are going to apply to the review in this video, well, that is the Porter

00:07.720 --> 00:08.250
stemmer.

00:08.720 --> 00:20.690
So let me import it first: from nltk.stem.porter import PorterStemmer.

00:21.190 --> 00:22.820
And let me create a Porter

00:22.890 --> 00:23.750
stemmer object.

00:26.820 --> 00:29.530
Let's just assign it to ps.

00:30.900 --> 00:36.240
And let us apply this stemming to every single word available in this review.

00:36.720 --> 00:40.350
So, same as earlier, we can use a list comprehension.

00:41.280 --> 00:47.860
So it will be: for word in review.

00:49.740 --> 00:50.440
We'll be applying.

00:50.880 --> 00:57.740
ps.stem, and let us just supply word.

00:58.830 --> 01:00.810
Let me assign it to review.

01:01.230 --> 01:06.400
So it will come out with the stemmed version of each individual token.

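NOTE
A minimal sketch of the stemming step just described, assuming NLTK is installed.
The token list is a hypothetical stand-in, not necessarily the review shown in the video.
```python
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()  # create the stemmer object once and reuse it
# Hypothetical tokens standing in for the already tokenized review.
review = ["wow", "loved", "places"]
# Apply the stemmer to every token with a list comprehension.
review = [ps.stem(word) for word in review]
print(review)
```
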
01:09.490 --> 01:09.890
All right.

01:09.950 --> 01:14.660
So we can see the words reduced to their stems, like "love" and "place".

01:16.480 --> 01:21.190
All right, now, before applying all those things to the bag-of-words model, we have to do the

01:21.320 --> 01:23.280
joining, and so for joining

01:23.350 --> 01:30.700
we are just going to use a very simple blank string, and we will be applying the join function here.

01:30.710 --> 01:32.340
We can pass the review.

01:33.920 --> 01:36.280
And let me assign it to review.

01:39.770 --> 01:44.000
And let us print what is there inside review now.

01:45.970 --> 01:52.630
Wow love place, so that is a highly cleaned version of your textual data.

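NOTE
A small sketch of the joining step, with a hypothetical token list standing in for the stemmed review.
It assumes the "blank string" mentioned in the video is a single space used as the separator.
```python
# Hypothetical stemmed tokens, standing in for the review list from the stemming step.
review = ["wow", "love", "place"]
# Call join on a single-space string to glue the tokens back into one sentence.
review = " ".join(review)
print(review)  # wow love place
```
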
01:53.280 --> 01:53.650
All right.

01:53.680 --> 01:59.140
So let's just summarize what we have done in this whole text-cleaning process.

01:59.230 --> 02:08.710
First we imported the stop words and we just made everything lower case, then we used split, like a tokenization,

02:08.980 --> 02:10.870
then removed the stop words.

02:11.590 --> 02:20.230
Then we applied this stemming process to each individual token, and then we just joined them and created a

02:20.230 --> 02:24.400
full sentence, because this is what we are going to supply to the bag-of-words

02:24.520 --> 02:29.290
model. Now, all this process we have done for one single record.

02:29.590 --> 02:35.080
Now we are going to apply and combine all those steps and we'll be applying on every single record.

02:35.440 --> 02:41.980
So what we will do: we are going to create one function, or we can just loop over every single record

02:42.550 --> 02:47.560
and we can just keep appending into some other location.

02:47.670 --> 02:53.770
Let's say a corpus variable, which I create as a list, and let's say for,

02:53.980 --> 03:01.510
let's say for i in, let's say, range of, well, how many total records are there in our original

03:01.510 --> 03:01.870
dataset?

03:01.900 --> 03:03.960
That will be a ninety nine.

03:04.030 --> 03:04.720
Ninety nine.

03:04.750 --> 03:10.540
So we can just simply use len(data).

03:12.610 --> 03:20.800
Let me print i, and we'll process the individual records one by one, and we'll just keep it to.

03:22.380 --> 03:22.700
Oops.

03:24.370 --> 03:26.500
I was just making sure that everything is working fine.

03:27.210 --> 03:29.210
All right, so let me remove this print statement.

03:30.900 --> 03:34.170
Now, what was the first step? We can just

03:35.250 --> 03:37.870
start copy-pasting everything from here.

03:39.560 --> 03:47.660
So first thing, we need to remove all those non-alphabetic characters: punctuation marks, digits, and everything.

03:48.290 --> 03:49.340
So let me just select it.

03:51.300 --> 03:53.370
And let me remove all those outputs.

03:54.870 --> 04:00.180
All right, let me assign it to review.

04:01.930 --> 04:03.700
And now instead of zero, it will be i,

04:04.000 --> 04:11.160
because we are going to process individual records and we are iterating over every single record,

04:11.420 --> 04:11.870
like I said.

04:12.610 --> 04:13.690
So that is the first step.

04:15.280 --> 04:16.720
Next, what we need.

04:18.030 --> 04:19.240
We just made it lower case.

04:19.350 --> 04:20.700
So let me select it.

04:24.260 --> 04:25.370
After lower-casing,

04:28.180 --> 04:29.950
we split that data.

04:30.100 --> 04:32.680
So that means a kind of tokenization we did.

04:38.220 --> 04:42.360
And after tokenization, we tried to create the

04:44.960 --> 04:45.680
stems.

04:47.010 --> 04:49.430
First stop word removal, and then the stems.

04:49.890 --> 04:54.800
But I do think that first we need to find the stems, like I said,

04:55.440 --> 05:02.640
and then we can remove the stop words, because the stop words are available only as root words.

05:02.670 --> 05:05.400
And this stemming also does the same thing.

05:05.730 --> 05:06.870
So what we can do:

05:07.110 --> 05:09.650
we can just simply combine both of these steps,

05:10.220 --> 05:10.780
like the

05:10.830 --> 05:11.620
stemming also

05:12.150 --> 05:13.170
and this

05:14.430 --> 05:15.570
stop word removal also.

05:16.690 --> 05:24.160
So what I'm going to do: I'm just going to select that one statement, and let me keep it here.

05:25.690 --> 05:30.820
Before that, let's create the object of this PorterStemmer.

05:31.870 --> 05:38.000
I don't think we need to create this PorterStemmer object every time, so we can just create it outside.

05:38.300 --> 05:40.180
Although, that is not required, so

05:40.180 --> 05:41.350
it has already been created.

05:41.950 --> 05:46.900
And instead of returning word, we can just simply return ps.stem(word).

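NOTE
A hedged sketch of combining stop word removal and stemming in one list comprehension, assuming NLTK is installed.
The small stop_words set and the token list are assumptions made so the sketch runs without downloading a corpus.
Note that the filter tests the original token, and only the tokens that survive are stemmed.
```python
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()  # created once, outside any loop
# Assumption: a tiny hand-written stop word set instead of a full stop word list.
stop_words = {"this", "is", "a", "the"}
# Hypothetical tokens standing in for the lower-cased, split review.
review = ["wow", "loved", "this", "place"]
# Drop stop words and stem the surviving tokens in a single pass.
review = [ps.stem(word) for word in review if word not in stop_words]
print(review)
```
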
05:50.360 --> 05:55.840
All right, so we got the review list after the stemming process and the removal of stop words.

05:57.160 --> 05:59.230
Next, we just need to join it.

05:59.560 --> 06:02.680
So let's just use this for joining.

06:05.450 --> 06:07.800
And then we are just going to append it.

06:07.970 --> 06:14.990
So these are all the cleaning steps, and we have already created this corpus.

06:15.260 --> 06:21.160
And inside corpus, I'm just going to append review.

06:22.970 --> 06:26.960
So we'll get a list of all the reviews in our corpus, ready.

06:30.590 --> 06:31.700
All right, let me run it.

06:34.400 --> 06:36.980
And let's just print this corpus.

06:40.610 --> 06:41.090
All right.

06:41.120 --> 06:46.460
So you can see every single review, or I would say record, in textual format.

06:46.490 --> 06:52.910
That is the cleaned version of our original text, and every single text pre-processing step has been

06:52.910 --> 06:56.540
applied to all those records or the reviews.

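NOTE
A minimal sketch of the whole cleaning loop described above, assuming NLTK is installed.
The texts list, the stop_words set, and the exact regular expression are assumptions for illustration.
In the video the records come from the dataset and are indexed with i inside range(len(data)).
```python
import re
from nltk.stem.porter import PorterStemmer
# Assumption: a short list of raw strings stands in for the review column of the dataset.
texts = ["Wow... Loved this place.", "The crust is not good!"]
# Assumption: a small hand-written stop word set replaces the full stop word list.
stop_words = {"this", "is", "the", "a"}
ps = PorterStemmer()  # created once, outside the loop
corpus = []
for i in range(len(texts)):
    review = re.sub("[^a-zA-Z]", " ", texts[i])  # replace every non-letter with a space
    review = review.lower()                      # lower-case everything
    review = review.split()                      # tokenize on whitespace
    # Remove stop words and stem the remaining tokens in one comprehension.
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = " ".join(review)                    # join the tokens back into one string
    corpus.append(review)                        # collect the cleaned record
print(corpus)
```
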
06:57.160 --> 06:58.760
So all the data got cleaned.

06:59.060 --> 07:04.720
The next thing we are going to do is feature engineering, which is nothing but a bag-of-words model.

07:04.790 --> 07:06.110
We are going to create it.

07:06.350 --> 07:11.190
And for that, we are going to use the scikit-learn library. See you in the next video.
