WEBVTT

00:01.820 --> 00:02.480
Here we go.

00:02.540 --> 00:06.080
So the next project we are going to work on is

00:06.560 --> 00:12.710
IMDB and Amazon review classification with the spaCy library.

00:12.980 --> 00:14.650
So all those preprocessing steps

00:14.660 --> 00:18.800
we are going to do with the built-in functions that are available in spaCy.

00:19.370 --> 00:20.010
And for that,

00:20.060 --> 00:21.310
whatever datasets

00:21.440 --> 00:22.490
I am going to use,

00:22.910 --> 00:25.280
let me just upload all those files.

00:26.170 --> 00:27.950
So from file upload.

00:28.940 --> 00:34.180
And I have Amazon cells labelled, IMDB labelled and Yelp.

00:34.260 --> 00:34.720
Wait a sec.

00:35.330 --> 00:37.000
Let me upload all of them.

00:38.290 --> 00:38.600
OK.

00:39.570 --> 00:44.670
And you can see all the datasets that are now available for this particular project.

00:44.770 --> 00:46.780
I have already created this notebook.

00:47.220 --> 00:53.040
So let's just walk through it very fast, because the majority of the steps are common with the other

00:53.040 --> 00:53.610
projects.

00:54.390 --> 00:59.250
And every single line of code has been heavily commented.

01:00.060 --> 01:00.420
All right.

01:00.450 --> 01:05.300
So first, as usual, we are going to import the required libraries. Next,

01:05.580 --> 01:08.600
this Yelp classification dataset, we are going to load it.

01:08.870 --> 01:10.420
It's a tab-separated file.

01:10.830 --> 01:11.890
So let's just load that one.

01:12.900 --> 01:15.820
And let's display the first few records.

01:16.620 --> 01:19.870
So you can see either it will be a zero or it will be a one.

01:20.700 --> 01:21.390
Next is.

01:23.090 --> 01:24.280
It has two columns.

01:24.380 --> 01:27.260
So you can see it doesn't have any kind of header.

01:27.470 --> 01:29.890
So let us assign some header names.

01:30.350 --> 01:32.420
They will be Review and Sentiment.

01:33.200 --> 01:34.160
Let me execute it.

01:35.740 --> 01:39.610
And if you just try to observe the first few records again,

01:40.870 --> 01:41.800
you will be able to see

01:41.830 --> 01:43.580
there will be Review and Sentiment.

01:44.660 --> 01:45.330
And what is the

01:45.350 --> 01:47.690
total number of rows and columns available?

01:48.590 --> 01:50.060
So we have a total of one thousand

01:50.060 --> 01:52.520
rows and two columns available.

01:53.030 --> 01:54.190
Let's go with another one.

01:54.470 --> 01:58.290
That will be the Amazon cells labelled .txt file.

01:58.700 --> 01:59.960
Let me read it.

02:01.940 --> 02:05.540
And here also I'm just going to assign my column names,

02:06.530 --> 02:10.460
what we defined earlier; that will be nothing but Review and Sentiment.

02:11.700 --> 02:17.220
So let me assign it and let's just display the first few records.

02:18.740 --> 02:19.160
Shape.

02:20.150 --> 02:24.970
So here we have another thousand rows available in this Amazon dataset.

02:25.620 --> 02:26.840
And the IMDB

02:26.840 --> 02:28.610
labelled dataset is also available.

02:29.390 --> 02:33.420
So, same as the other datasets, we have IMDB.

02:34.330 --> 02:39.970
And as usual, like the other datasets, I'm just going to assign it the two column names.

02:40.640 --> 02:42.100
So that will be Review and Sentiment.

02:42.530 --> 02:51.740
And if you try to find the shape of this IMDB dataset, it is just 748 rows and two columns.

02:53.170 --> 02:56.410
So now we have three datasets available in three different

02:57.260 --> 03:01.420
DataFrame objects. What we are going to do: we are just going to combine all those three datasets

03:01.570 --> 03:08.090
and create one mega dataset. That is, we'll just append the Amazon dataset

03:08.530 --> 03:11.670
and the IMDB dataset to this Yelp dataset.

03:12.070 --> 03:17.760
So for that, we are just going to use the append function, and we'll set ignore_index to True,

03:17.830 --> 03:19.330
because the index values repeat,

03:20.660 --> 03:22.570
so they don't create any kind of confusion.

03:23.030 --> 03:25.190
So now, combining all, we have

03:25.400 --> 03:32.360
two thousand seven hundred and forty-eight records, because one thousand are from our first dataset, Yelp,

03:33.350 --> 03:36.500
1000 are from Amazon and 748

03:37.300 --> 03:39.030
are from IMDB.

03:39.260 --> 03:40.930
Let's check the first few records:

03:42.100 --> 03:43.240
Review and Sentiment.
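
NOTE
A minimal sketch of the loading and combining steps described above, not the notebook's actual code.
The recording mentions the append function; DataFrame.append was removed in pandas 2.0, so this sketch
swaps in pd.concat with ignore_index=True, which has the same effect. The inline sample rows and the
helper name load_reviews are hypothetical stand-ins for the real tab-separated files.
```python
import io
import pandas as pd
# Tiny inline stand-ins for the three tab-separated, header-less files
# (hypothetical sample rows, not taken from the real datasets).
yelp_txt = "Great food.\t1\nSlow service.\t0\n"
amazon_txt = "Works well.\t1\nBroke in a week.\t0\n"
imdb_txt = "A wonderful film.\t1\n"
columns = ["Review", "Sentiment"]
def load_reviews(source):
    # sep="\t" for tab-separated data; header=None because the files have no header row
    frame = pd.read_csv(source, sep="\t", header=None)
    frame.columns = columns
    return frame
data_yelp = load_reviews(io.StringIO(yelp_txt))
data_amazon = load_reviews(io.StringIO(amazon_txt))
data_imdb = load_reviews(io.StringIO(imdb_txt))
# ignore_index=True renumbers the combined rows from 0, so the repeated
# per-file index values do not collide.
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)
print(data.shape)
print(data.head())
```
With the real files, a file path would be passed to load_reviews in place of the io.StringIO objects.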

03:44.210 --> 03:49.930
Let's try to find the distribution of individual positive and negative records.

03:50.620 --> 03:52.360
So here I use value_counts.

03:52.990 --> 03:59.140
So we have a total of 1386 records which are positive reviews and 1362

03:59.150 --> 03:59.770
records

03:59.980 --> 04:01.450
that are negative reviews.

04:02.050 --> 04:03.310
Are there any missing records

04:03.310 --> 04:04.270
or not?

04:04.810 --> 04:09.480
So we can simply check it with the isnull function and apply sum on it.

04:10.330 --> 04:11.890
So that will be zero zero.

04:11.920 --> 04:13.960
That means there is no missing record.

04:14.230 --> 04:17.010
Let's just segregate input and output.

04:17.170 --> 04:22.840
So the Review column will become our input dataset and the Sentiment column will become our

04:22.990 --> 04:23.700
output dataset.

04:24.310 --> 04:26.800
So that is the segregation we have made.
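
NOTE
A short sketch of the three checks just described: class distribution, missing values, and the
input/output split. The four sample rows and the variable names X and y are assumptions for
illustration; the column names Review and Sentiment follow the recording.
```python
import pandas as pd
# Hypothetical sample rows standing in for the combined dataset.
data = pd.DataFrame(
    {
        "Review": ["Great food.", "Slow service.", "Loved it.", "Not good."],
        "Sentiment": [1, 0, 1, 0],
    }
)
# Number of positive (1) and negative (0) reviews.
counts = data["Sentiment"].value_counts()
print(counts)
# Missing values per column; all zeros means nothing is missing.
missing = data.isnull().sum()
print(missing)
# Review text is the model input, the sentiment label is the output.
X = data["Review"]
y = data["Sentiment"]
```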

04:27.350 --> 04:27.690
All right.

04:27.730 --> 04:32.500
So now the first step, importing the dataset, is complete.

04:33.730 --> 04:38.500
Let me just collapse it and let's go to the next step.

04:38.950 --> 04:46.870
That will be the data cleaning step. Now here, for data cleaning, we are going to use the spaCy library

04:47.350 --> 04:51.460
and we are going to remove the stop words and punctuation marks.

04:51.880 --> 04:54.250
And we are going to apply lemmatization.

04:55.010 --> 04:55.410
All right.

04:55.480 --> 05:02.230
So whatever we learned in the very first section of this course, the NLP basics,

05:02.470 --> 05:02.960
from there

05:02.980 --> 05:07.520
we are going to use those functions from the spaCy library.

05:08.230 --> 05:14.080
Now, as a part of the Python distribution, in the string module itself, there is a punctuation

05:15.310 --> 05:19.330
attribute available, which you can access from the string module.

05:19.870 --> 05:26.230
And if you just try to display this punctuation, you will be able to see all the punctuation characters as part

05:26.230 --> 05:27.080
of one string.

05:27.550 --> 05:34.090
Something like exclamation mark, double quotes, hash, dollar, percent, ampersand, backslash,

05:34.090 --> 05:35.470
forward slash and many more.

05:36.280 --> 05:36.580
All right.

05:36.610 --> 05:38.290
So that is about the punctuation mark.
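
NOTE
A small sketch of the punctuation lookup described above. string.punctuation is a constant in
Python's standard string module; the variable name punct and the sample token list are assumptions.
```python
import string
# All ASCII punctuation characters, held in one string.
punct = string.punctuation
print(len(punct))
# Membership tests work character by character.
print("!" in punct)
# Example: drop single-character punctuation tokens from a token list.
tokens = ["hello", ",", "world", "!"]
kept = [tok for tok in tokens if tok not in punct]
print(kept)
```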

05:39.460 --> 05:45.120
Next, let's see about the stop words, and afterwards we are just going to apply all those things

05:45.200 --> 05:46.100
in one shot.

05:46.720 --> 05:52.000
So STOP_WORDS is a part of spacy.lang.en.stop_words.

05:53.230 --> 06:00.750
And let's just load all those stop words as a list in a stopwords variable. If you will, let's

06:00.770 --> 06:04.370
try to display stopwords.

06:05.800 --> 06:08.040
You can see all those stop words.
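
NOTE
A sketch of loading spaCy's English stop words as a list, as described above. It assumes spaCy is
installed; no trained model is needed for this step. The variable name stopwords follows the
recording as best it can be made out.
```python
from spacy.lang.en.stop_words import STOP_WORDS
# STOP_WORDS is a set of strings; convert it to a list as in the walkthrough.
stopwords = list(STOP_WORDS)
print(len(stopwords))
print(sorted(stopwords)[:10])
```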

06:09.030 --> 06:09.370
All right.

06:09.400 --> 06:10.290
Now for the cleaning

06:10.290 --> 06:10.530
part,

06:10.690 --> 06:16.330
we are going to create one function that will be nothing but text_data_cleaning.

06:16.840 --> 06:23.110
And it is going to accept one sentence, in the sense that one single review will be passed in here.

06:23.620 --> 06:33.040
So first, we are going to load our spaCy model and we'll be applying that model on the sentence.

06:34.000 --> 06:35.880
Let's create one empty tokens list.

06:36.470 --> 06:40.900
Now, in the first for loop, we are just going to lowercase every single token.

06:41.380 --> 06:48.660
And in the next for loop, we are just going to remove all the punctuation tokens and stop word

06:48.660 --> 06:49.000
tokens.

06:49.540 --> 06:56.170
So in the first case, there is a little twist due to English grammar. We have to make

06:56.320 --> 06:59.610
all those tokens lowercase only.

06:59.920 --> 07:03.950
But a few little things change due to English grammar.

07:04.000 --> 07:07.450
Say that the word is a proper noun.

07:07.600 --> 07:10.380
If the word is a proper noun, no lemma exists.

07:10.400 --> 07:14.480
So in that case you can just take the lowercase of the token.

07:14.950 --> 07:18.040
But if the word is not a proper noun,

07:18.400 --> 07:21.670
in that case you have to take the lemma and then the lowercase of it.

07:22.270 --> 07:29.770
And in the next loop, if the token is not a part of the stop words or the punctuation, only then do we

07:29.770 --> 07:32.240
append it to the cleaned tokens.

07:33.080 --> 07:33.620
All right.

07:33.670 --> 07:37.390
So that is the text_data_cleaning function.

07:37.670 --> 07:38.890
Let me execute it.

07:40.450 --> 07:45.800
And let's apply this thing on some simple sentences.

07:46.160 --> 07:47.330
So if I execute it,

07:49.470 --> 07:55.920
oops, it says that nlp is not defined, because we have to import spaCy also

07:55.980 --> 07:57.410
and load the model.

07:57.420 --> 07:57.710
So.

07:58.200 --> 07:59.370
So what we can do.

07:59.980 --> 08:01.140
We haven't applied it here.

08:01.200 --> 08:16.310
So before that, let me load the spaCy library: import spacy, and spacy.load, and let's

08:16.320 --> 08:20.420
keep here our small-size English core web model.

08:21.180 --> 08:23.460
And let me assign it to nlp.

08:24.120 --> 08:26.610
And let's define this function again.

08:28.080 --> 08:31.650
And then we are going to apply our nlp model on this sentence.

08:32.640 --> 08:36.210
Let's just test it with this very simple sentence.

08:36.510 --> 08:42.350
So you will be able to observe that all stop words and punctuation marks get removed after applying

08:42.420 --> 08:43.050
this function.

08:43.530 --> 08:45.380
So that will be text_data_cleaning.

08:45.810 --> 08:47.880
And you can see everything is lowercase.

08:47.970 --> 08:50.520
hello, beautiful, day

08:50.880 --> 08:51.450
and outside.

08:51.510 --> 08:52.260
So "there" got removed.

08:52.260 --> 08:58.200
The exclamation mark got removed, because that is a punctuation mark, and also the comma got removed.

08:58.890 --> 09:03.070
"It's" also got removed because that is a stop word.
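
NOTE
A hedged sketch of the text_data_cleaning function walked through above; the notebook's exact code
is not shown in the recording. Assumptions: the model name en_core_web_sm, the fallback to a blank
English pipeline when that model is not installed (it tokenizes but produces no lemmas), and the
"-PRON-" check, which is the placeholder lemma that spaCy v2 returned for pronouns and may be what
the no-lemma case in the recording refers to. The sample sentence is also an assumption.
```python
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
# Assumption: the small English model is installed
# (python -m spacy download en_core_web_sm). spacy.load raises OSError
# when it is missing, in which case this sketch uses a blank pipeline.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")
punct = string.punctuation
stopwords = set(STOP_WORDS)
def text_data_cleaning(sentence):
    doc = nlp(sentence)
    # First loop: lowercase every token, using its lemma when one exists.
    tokens = []
    for token in doc:
        lemma = token.lemma_
        if lemma and lemma != "-PRON-":
            tokens.append(lemma.lower().strip())
        else:
            tokens.append(token.lower_)
    # Second loop: keep only tokens that are neither stop words nor punctuation.
    cleaned_tokens = []
    for tok in tokens:
        if tok not in stopwords and tok not in punct:
            cleaned_tokens.append(tok)
    return cleaned_tokens
cleaned = text_data_cleaning("Hello all, it's a beautiful day outside there!")
print(cleaned)
```
Defining nlp before the function, as here, avoids the "nlp is not defined" error seen in the recording.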

09:03.950 --> 09:04.380
All right.

09:04.400 --> 09:07.280
So our data cleaning part

09:07.890 --> 09:08.070
is over.

09:08.700 --> 09:14.440
Next, we are going to apply TF-IDF, that is, term frequency and inverse document frequency.

09:14.700 --> 09:16.820
So those things we will see in the next

09:16.830 --> 09:17.110
video.