WEBVTT

00:01.330 --> 00:04.710
All right, everyone, so we have successfully imported the data.

00:04.920 --> 00:07.330
Now it's time to clean this data.

00:07.800 --> 00:09.840
So, on whatever review column is available,

00:09.900 --> 00:13.270
we are going to apply all those text preprocessing steps,

00:13.530 --> 00:18.970
something like the removal of stop words, or making all the words lowercase,

00:19.020 --> 00:23.970
because whether a word is in capital letters or in small letters,

00:24.000 --> 00:29.010
it makes no difference as far as the interpretation of that particular word is concerned.

00:29.710 --> 00:30.210
Also,

00:30.540 --> 00:36.300
we are going to remove all the punctuation marks, because they are not going to add any

00:36.420 --> 00:40.170
value to the positive reviews or the negative reviews.

00:40.530 --> 00:43.100
So let's start with the very first one. Before that,

00:43.350 --> 00:47.970
let's quickly add a few cells in this "Cleaning text data" section.

00:48.480 --> 00:51.920
So for that, we are mainly going to use the NLTK library.

00:51.960 --> 00:55.410
So I'm going to import nltk first.

00:57.850 --> 01:01.020
And another library we are going to use is for regular expressions.

01:01.270 --> 01:02.570
That will be re.

01:03.160 --> 01:04.390
So let me just run it.

01:06.930 --> 01:10.310
Now, the first preprocessing step we are going to do is

01:11.810 --> 01:16.920
stop word removal. Now, the NLTK library already comes with a lot of stop words.

01:17.230 --> 01:18.800
So what we're going to do first:

01:18.940 --> 01:22.870
let's download the stop words from the NLTK library.

01:24.390 --> 01:28.870
nltk.download,

01:31.130 --> 01:35.500
and here we can supply "stopwords".

01:37.990 --> 01:38.380
All right.

01:38.800 --> 01:45.100
So it downloaded and unzipped it into the proper location, nltk_data.

01:45.680 --> 01:46.900
Let's check whether

01:47.940 --> 01:51.520
it is available here or somewhere else. Let me refresh it.

01:53.030 --> 01:55.350
I think it was downloaded to some other place on the machine.

01:55.940 --> 01:57.550
All right, so let's just leave it for the time being.

02:00.610 --> 02:08.270
Next, let's import the stopwords module. So, from nltk

02:10.030 --> 02:11.770
.corpus

02:14.150 --> 02:17.260
we are going to import stopwords.

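NOTE
A minimal sketch of the download-and-import step just described.
The try/except fallback and the small FALLBACK_STOP_WORDS set are assumptions
added for illustration (they are not in the video), so that the snippet
still runs where NLTK or its corpus data is unavailable.
```python
# Hypothetical fallback set, used only if NLTK or its corpus cannot be loaded.
FALLBACK_STOP_WORDS = {"this", "the", "a", "an", "is", "it", "and", "of", "to"}
try:
    import nltk
    nltk.download("stopwords", quiet=True)  # fetches the corpus into an nltk_data folder
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except Exception:
    # Assumption: fall back to the tiny hand-written set above.
    stop_words = set(FALLBACK_STOP_WORDS)
print("this" in stop_words)
```
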
02:18.020 --> 02:19.520
Let's have a look at the very first one.

02:19.640 --> 02:20.070
Wow.

02:20.600 --> 02:23.600
"Loved this place." Now here,

02:23.660 --> 02:24.240
it looks like

02:24.290 --> 02:25.710
"this" is a stop word.

02:26.630 --> 02:28.160
So it needs to be removed.

02:28.910 --> 02:30.650
Let me display the very first one.

02:30.830 --> 02:34.990
Let's say we will be focusing on just the review column.

02:36.920 --> 02:39.920
And out of that, let's concentrate on the very first one.

02:45.350 --> 02:47.210
And this is our first review.

02:47.270 --> 02:52.010
So we are going to apply this stop word removal mechanism on our first review.

02:53.030 --> 02:58.880
But before that, let's remove all the digits and punctuation marks,

02:59.180 --> 03:07.390
and let's concentrate on keeping just the alphabetic characters. And to keep only the alphabetic

03:07.430 --> 03:07.650
characters,

03:07.850 --> 03:12.350
we are going to use re, the regular expression library.

03:13.190 --> 03:16.410
So for that, we are going to use re dot

03:18.240 --> 03:18.660
sub,

03:20.020 --> 03:26.200
so that we search for some particular pattern and replace it with some other string.

03:27.460 --> 03:30.790
So first, which pattern are we going to search for?

03:31.240 --> 03:36.580
So the pattern will be something like all the capital letters, A to Z,

03:36.880 --> 03:38.780
and the small letters, a to z.

03:40.780 --> 03:44.320
And that you can describe inside the square brackets:

03:45.810 --> 03:46.170
a caret (meaning "not"),

03:48.610 --> 03:49.300
a to z,

03:49.510 --> 03:51.310
And capital, A to Z.

03:51.730 --> 03:55.450
So only those characters which are alphabetic

03:55.660 --> 03:56.560
will be kept.

03:57.010 --> 04:02.290
Apart from that, all the digits and all the punctuation marks will be removed.

04:02.930 --> 04:06.690
And let me just replace them with a blank string.

04:07.420 --> 04:10.410
And we are going to apply this on our review.

04:10.570 --> 04:14.680
So first, let us apply it on our first review at this time,

04:15.130 --> 04:21.360
and after that, in the next video, we will be applying all those rules on every single record.

04:21.850 --> 04:23.980
So let me assign it to some variable,

04:24.370 --> 04:26.840
and we'll keep applying all those steps

04:27.160 --> 04:27.680
one by one.

04:28.690 --> 04:29.410
So, review.

04:30.520 --> 04:32.550
Let me display review now.

04:33.580 --> 04:33.980
All right.

04:34.030 --> 04:40.090
So there is one issue here: it has kept only the alphabetic characters,

04:40.160 --> 04:43.300
but they are all joined together, just because the replacement

04:43.300 --> 04:45.280
I did was with a blank string.

04:45.470 --> 04:45.840
Instead,

04:46.490 --> 04:49.210
let's just keep it as one space.

04:51.350 --> 04:52.970
And let's display review.

04:53.900 --> 04:54.290
All right.

04:54.320 --> 04:59.330
So now you can see how everything else got replaced with a single space.

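NOTE
A small sketch of the substitution step above, comparing the blank-string
replacement with the single-space replacement. The review text used here is
an illustrative assumption, not necessarily the exact first row of the dataset.
```python
import re
# Illustrative review text (assumed for this example).
raw = "Wow... Loved this place."
# [^a-zA-Z] matches every character that is NOT an ASCII letter.
joined = re.sub("[^a-zA-Z]", "", raw)   # blank replacement: the words run together
spaced = re.sub("[^a-zA-Z]", " ", raw)  # single-space replacement: the words stay apart
print(repr(joined))
print(repr(spaced))
```
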
05:00.030 --> 05:00.440
All right.

05:00.470 --> 05:05.350
So the next step is that we need to convert all those words to lowercase.

05:05.960 --> 05:07.040
That will be very simple.

05:07.040 --> 05:14.390
You can simply use review = review.lower(); you can apply the lower method.

05:14.840 --> 05:19.580
It's a very simple string method. And let's just display the review.

05:22.370 --> 05:28.640
So you can see that the capitals from earlier, this L and this W, have now become lowercase.

05:31.630 --> 05:33.370
All right, so everything becomes lowercase.

05:33.880 --> 05:38.890
Now we are in a position to apply the stop word removal.

05:41.180 --> 05:43.460
So for that, we first have to tokenize.

05:43.980 --> 05:47.010
So for tokenization, we are going to use review

05:47.190 --> 05:53.530
.split(), a simple split, and let me assign it to review.

05:54.020 --> 05:56.370
Let me display what is there inside the review now.

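NOTE
A short sketch of the lowercasing and tokenization steps above. The starting
string is an assumed example of what the substitution step might produce.
```python
# Assumed output of the earlier substitution step for an illustrative review.
review = "Wow    Loved this place "
review = review.lower()  # str.lower() returns a new, all-lowercase string
review = review.split()  # split() with no arguments splits on any run of whitespace
print(review)
```
Because split() is called without arguments, the repeated spaces left behind
by the punctuation removal do not produce empty tokens.
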
05:57.830 --> 06:01.220
So now it has been tokenized and put into different

06:01.250 --> 06:05.480
tokens as a list. Now, for each particular token,

06:05.510 --> 06:11.450
we are going to check whether that particular word exists in our stop word database or not.

06:11.480 --> 06:14.230
If it exists, we have to remove it.

06:14.720 --> 06:16.880
If it doesn't exist, we will keep it.

06:17.690 --> 06:21.100
So first of all, let's just display stopwords

06:22.610 --> 06:23.850
.words,

06:24.950 --> 06:33.410
and that will be "english". So we can write a very simple loop and check, for all those

06:33.710 --> 06:37.070
words, whether each one is a part of this list or not.

06:37.880 --> 06:45.740
So let me write it like: for every word in review,

06:47.620 --> 06:48.540
I have to check:

06:49.390 --> 06:59.650
if word is not in this particular stopwords.words("english"),

07:01.570 --> 07:02.580
I have to keep it,

07:03.950 --> 07:07.010
because that is not a part of the stop word database.

07:07.430 --> 07:10.730
So what we will do, we are going to create another list.

07:11.000 --> 07:14.170
Let's say a new one, p_review.

07:14.440 --> 07:17.390
Let's say that will be the processed review.

07:20.270 --> 07:21.410
And I'm just going to

07:24.700 --> 07:27.290
add the word to p_review.

07:28.910 --> 07:29.370
All right.

07:32.240 --> 07:33.140
Something went wrong.

07:34.310 --> 07:35.290
We are adding the word

07:35.420 --> 07:36.250
to p_review.

07:39.390 --> 07:46.010
All right, so it says that this object has no attribute "add". You see, for a list it is, I think, append.

07:47.570 --> 07:47.960
All right.

07:48.080 --> 07:49.170
And let's just run it.

07:50.950 --> 07:51.450
All right.

07:51.940 --> 07:54.310
Let's just display p_review.

07:56.910 --> 08:02.130
And you can see "this" got removed, because it is a stop word.

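NOTE
A minimal sketch of the loop just described. The name p_review is inferred
from the narration, and the fallback stop word set is an assumption added so
the snippet runs even without the NLTK corpus. Building the stop word set once
before the loop, rather than calling stopwords.words("english") on every
iteration, is a design choice made here and not something shown in the video.
```python
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except Exception:
    stop_words = {"this", "the", "a", "is"}  # hypothetical fallback, not from the video
review = ["wow", "loved", "this", "place"]  # illustrative tokens (assumed)
p_review = []
for word in review:
    if word not in stop_words:
        p_review.append(word)  # lists use append(); add() is a set method
print(p_review)
```
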
08:02.580 --> 08:07.710
Now, this whole for loop we can also write as a list comprehension.

08:08.220 --> 08:11.160
So what we can do: everything

08:11.220 --> 08:13.020
we can put as a comprehension.

08:13.110 --> 08:17.020
So I'm just going to copy it and put it inside the list.

08:17.690 --> 08:21.720
So, word for word in review,

08:22.720 --> 08:23.020
if

08:25.070 --> 08:26.900
this whole condition, which I can keep as it is.

08:31.200 --> 08:32.960
And let me assign it to review

08:33.060 --> 08:33.870
itself.

08:37.230 --> 08:37.670
All right.

08:37.970 --> 08:45.150
Let's just display review. Now review doesn't contain the word "this".

08:45.950 --> 08:48.310
That means the stop word got removed.

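NOTE
The same filter rewritten as a list comprehension, as a sketch. The token list
and the fallback stop word set are illustrative assumptions.
```python
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except Exception:
    stop_words = {"this", "the", "a", "is"}  # hypothetical fallback, not from the video
review = ["wow", "loved", "this", "place"]  # illustrative tokens (assumed)
# Keep only the words that are not stop words, and reassign to review itself.
review = [word for word in review if word not in stop_words]
print(review)
```
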
08:49.600 --> 08:52.350
So there is one more step required for the preprocessing.

08:52.390 --> 08:53.770
and that will be stemming.

08:54.220 --> 08:56.850
And we'll see stemming in the next video.
