WEBVTT

00:01.110 --> 00:02.050
All right, everyone.

00:02.590 --> 00:06.320
So the very first topic in text analytics,

00:06.760 --> 00:12.690
or I would say text processing or text preprocessing, which we are going to learn in this video, is

00:12.970 --> 00:13.780
tokenization.

00:14.470 --> 00:16.960
So what is tokenization in very layman's terms?

00:17.050 --> 00:21.700
It's something like breaking your text into some kind of pieces.

00:22.300 --> 00:27.650
So let's say I have a sentence like "I like apple".

00:27.780 --> 00:31.190
So it's a very simple sentence in simple English.

00:31.570 --> 00:37.140
So if I want to tokenize this particular sentence, there will be three

00:37.270 --> 00:42.760
tokens into which it will be tokenized. So the first token will be "I", the second token will be

00:42.760 --> 00:45.220
"like", and the third token will be "apple".
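
NOTE
The three-token split described above can be sketched with plain Python. This is only a whitespace split, not spaCy's tokenizer, and both sample sentences are assumptions chosen for illustration. The second sentence hints at the kind of exceptions a real tokenizer has to handle: punctuation and contractions stay glued to the words.

```python
# A naive whitespace tokenizer: a sketch, not spaCy's algorithm.
simple_sentence = "I like tea"          # hypothetical sample sentence
tricky_sentence = "Let's go to N.Y.!"   # hypothetical sample sentence

# str.split() with no arguments splits on runs of whitespace.
simple_tokens = simple_sentence.split()
tricky_tokens = tricky_sentence.split()

print(simple_tokens)
print(tricky_tokens)
```

The first call yields three clean tokens, while the second leaves "N.Y.!" as one piece, which is the sort of case a dedicated tokenizer is meant to resolve.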

00:46.120 --> 00:51.220
But in English there are always exceptions, and things are not that easy.

00:51.490 --> 00:53.740
Life may not be so simple.

00:54.310 --> 00:59.770
So that is where different tokenization algorithms and everything got implemented inside this spaCy

00:59.950 --> 01:00.430
library.

01:01.060 --> 01:02.290
So what will we do?

01:02.920 --> 01:05.470
I have written here four sentences.

01:05.880 --> 01:09.800
And on all four, we are going to apply this tokenization.

01:10.120 --> 01:15.880
So let me assign them to, let's say, s1, s2, s3.

01:17.500 --> 01:17.780
Yes.

01:19.570 --> 01:20.780
Let me make it s4.

01:21.220 --> 01:21.670
All right.

01:22.540 --> 01:25.450
So first, let's import the spaCy library.

01:25.630 --> 01:32.710
And then we will apply tokenization individually on four different English sentences.

01:33.010 --> 01:33.670
Let me write.

01:34.260 --> 01:36.580
import spacy.

01:38.340 --> 01:39.750
And let's just execute it.

01:41.300 --> 01:47.450
Now, before diving into tokenization, I would like to discuss something called pre-trained

01:48.020 --> 01:55.840
models. So spaCy already comes with a number of pre-trained models. Let me create a new cell.

01:56.960 --> 02:04.520
And as a part of the spaCy installation, there is one kind of small-size model already implemented.

02:05.240 --> 02:07.080
So what that pre-trained model is,

02:07.430 --> 02:13.690
let me define it here. So we can load this pre-trained model like spacy dot load,

02:14.660 --> 02:18.180
and here you can pass the pre-trained model name.

02:18.950 --> 02:21.950
Now, what this pre-trained model will do, we'll see.

02:23.300 --> 02:25.610
Let's say the name I will give is like

02:31.610 --> 02:33.720
en underscore core

02:35.930 --> 02:38.880
underscore web underscore sm.

02:40.830 --> 02:44.320
And let me assign it to some variable, say nlp.
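
NOTE
The load step spoken above can be sketched as follows. spacy.load is the call named in the video; the load_pipeline helper and its return-None behaviour are assumptions added here so the snippet also runs on a machine where spaCy or the model is not installed.

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"  # small English model named in the video


def load_pipeline(name):
    """Hypothetical helper: return a loaded pipeline, or None if unavailable."""
    if importlib.util.find_spec("spacy") is None:
        return None  # spaCy itself is not installed
    import spacy
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the named model cannot be found
        return None


nlp = load_pipeline(MODEL_NAME)
print("loaded" if nlp is not None else "not available")
```

In a notebook where the model is installed, the two-line version is simply the import followed by nlp = spacy.load("en_core_web_sm").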

02:47.460 --> 02:47.830
All right.

02:47.950 --> 02:49.840
The model got loaded.

02:50.350 --> 02:54.010
Now let's go to a new tab and search for spaCy.

02:56.260 --> 02:57.880
Go to the very first link.

03:04.190 --> 03:07.930
And you'll be able to see on the top, maybe, yes,

03:08.090 --> 03:10.460
there is a link called Models.

03:11.540 --> 03:12.030
All right.

03:12.520 --> 03:18.190
So they have given a very good quickstart for the models. Now, spaCy has already implemented

03:18.250 --> 03:19.630
two kinds of models.

03:19.750 --> 03:22.930
So one is a core model and another one is a starter model.

03:23.380 --> 03:25.950
So mainly we will be dealing with this core model.

03:26.410 --> 03:31.630
And now this model is also available in a number of different languages, like English, French,

03:32.140 --> 03:32.680
Spanish.

03:33.160 --> 03:35.860
So English will be our core language.

03:36.220 --> 03:41.210
If you want to dig down a little bit more, you'll be able to read about this:

03:41.230 --> 03:43.420
what is the difference between this core model

03:43.840 --> 03:45.710
and the starter model,

03:46.540 --> 03:47.890
and how you can use it.

03:48.130 --> 03:53.580
So if you want to load some particular model of some particular language, let's say we want to load it

03:53.590 --> 03:54.970
for the English language.

03:55.480 --> 04:01.810
And there are two ways we can load it: import it as a module, or you can load it like spacy.load, as

04:01.810 --> 04:07.210
we have done here. And there is also a usage example.

04:07.360 --> 04:12.310
Also, if you want to see, there is some sample code created for you.

04:12.960 --> 04:13.330
All right.

04:13.780 --> 04:16.090
So that is all about the models.

04:16.120 --> 04:19.360
If you want to go into detail about what each model will do,

04:19.390 --> 04:21.560
you can read about it. Now,

04:21.640 --> 04:29.080
if you click on this English, en_core_web_sm is not the only model available, which is the one we

04:29.080 --> 04:29.870
have loaded

04:30.070 --> 04:30.820
in this case.

04:31.300 --> 04:35.650
So "sm" here indicates that this is a small-size model.

04:35.710 --> 04:37.660
You can see the size of this particular model.

04:37.660 --> 04:38.680
It's just eleven MB.

04:39.970 --> 04:42.550
And this model was created

04:43.570 --> 04:51.300
based on the blogs, news, and comments they acquired from some resources available on the Internet, that

04:51.310 --> 04:59.770
is, in English. And out of this model, you will be able to get POS tags, token vectors, dependency

04:59.780 --> 05:00.950
parse, and named entities.

05:01.900 --> 05:03.130
So that is the small model.

05:03.250 --> 05:06.190
It is trained on a small subset of that.

05:06.790 --> 05:10.030
But for our purposes, we are going to work with this small model.

05:10.600 --> 05:16.540
Now, suppose you want to load a bigger model; you can simply download that bigger model with

05:16.540 --> 05:24.310
this python -m spacy download en_core_web_md. Here, "md" means

05:24.640 --> 05:30.530
it's a medium-sized model. So they have their own pipeline: when

05:30.570 --> 05:31.460
you supply text,

05:32.050 --> 05:34.420
it will do tagging, parsing, and named entity recognition.

05:34.440 --> 05:34.580
And.

05:35.890 --> 05:39.700
And you can see this model is a little bigger in terms of size also.

05:39.990 --> 05:42.350
So earlier it was just an eleven MB model.

05:42.430 --> 05:48.930
Now it is a ninety-one MB model. And at the same time we have a gain in, like, syntax

05:49.090 --> 05:53.170
accuracy, and also in NER. We haven't learned about that yet.

05:53.560 --> 05:57.100
That is named entity recognition. And there is a large model.

05:57.130 --> 06:03.910
So the large model is trained on a very huge amount of text, like blogs, news, and comments, all

06:03.910 --> 06:05.050
taken into consideration.

06:05.440 --> 06:07.150
But that's a very big number.

06:07.600 --> 06:10.390
And this model is literally very big in size.

06:10.400 --> 06:13.390
That is seven eighty-nine megabytes.

06:13.990 --> 06:17.270
And you may see the NER accuracy also might increase.

06:17.710 --> 06:19.780
So that is about the English model.

06:20.140 --> 06:23.010
And if you want to load some specific model, you can do so.

06:23.140 --> 06:24.820
So let me just copy.

06:25.510 --> 06:27.550
Now, here we are loading this particular model.

06:28.030 --> 06:30.160
Let us try to load another model.

06:31.240 --> 06:38.370
So if you want to execute any Python CLI command, I would say, inside this cell you can just

06:38.380 --> 06:40.330
simply use an exclamation mark,

06:41.020 --> 06:46.360
and all those CLI commands you can type here, and it will download this particular

06:47.730 --> 06:50.250
model on your local system.
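
NOTE
The download step can also be expressed from Python. The sketch below only builds the command mentioned in the video (python -m spacy download en_core_web_md) and does not run it; the download_command helper is a hypothetical name introduced for illustration.

```python
import sys


def download_command(model_name):
    """Hypothetical helper: build the spaCy download command as an argument list."""
    # sys.executable points at the interpreter running this code, so the
    # model would be installed into the same environment.
    return [sys.executable, "-m", "spacy", "download", model_name]


cmd = download_command("en_core_web_md")
print(" ".join(cmd[1:]))

# To actually download, one option is:
#   import subprocess
#   subprocess.run(cmd, check=True)
# In a notebook cell, the equivalent is the command prefixed with "!".
```

Building the list first keeps the example safe to run anywhere; the download itself needs network access and takes a while, as the video shows.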

06:50.810 --> 07:01.200
If I try to do the same thing in some other cell with this md model, and let me name it like nlp

07:01.200 --> 07:01.950
underscore one.

07:02.640 --> 07:03.910
So if I try to execute it,

07:07.630 --> 07:13.570
you'll be able to see we got an error for this en_core_web_md.

07:13.690 --> 07:17.430
It doesn't seem to be a shortcut link, so we don't have that one.

07:17.800 --> 07:20.250
So what we can do is we can just simply fire this.

07:20.260 --> 07:20.610
command,

07:22.170 --> 07:24.110
and it will download this particular model.

07:24.230 --> 07:25.790
You know, it looked fine.

07:27.540 --> 07:28.650
So it will take a little time,

07:28.820 --> 07:29.150
maybe.

07:30.880 --> 07:37.180
So I'm just fast-forwarding my video till this completes. All right, it's almost downloaded.

07:38.440 --> 07:39.820
Requirement already satisfied: spacy.

07:40.910 --> 07:45.940
And you'll be able to see: you can now load the model via spacy.load.

07:46.270 --> 07:48.350
They have given that indication. So,

07:50.130 --> 07:53.080
So now what we can do, we can take a look at

07:53.170 --> 07:57.420
the other method which we learned, so that will be import.

07:58.260 --> 08:00.120
And you can type here

08:00.400 --> 08:03.300
the model name. Imported.

08:05.040 --> 08:06.150
So it got successfully imported.

08:06.840 --> 08:14.090
So next, we just need to use this en_core_web_md,

08:14.920 --> 08:16.980
dot load.

08:16.980 --> 08:21.180
Now let me assign it to nlp_1.
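
NOTE
The second loading method, importing the model as a module and calling its load function, can be sketched like this. The load_from_package helper is a hypothetical wrapper; it returns None when the model package is not installed so that the snippet runs either way.

```python
import importlib
import importlib.util


def load_from_package(package_name):
    """Hypothetical helper: import an installed model package and call its load()."""
    if importlib.util.find_spec(package_name) is None:
        return None  # the model package has not been downloaded yet
    module = importlib.import_module(package_name)
    # Installed spaCy model packages expose a load() function.
    return module.load()


nlp_1 = load_from_package("en_core_web_md")
print("loaded" if nlp_1 is not None else "not installed")
```

With the model installed, this is equivalent to writing import en_core_web_md and then nlp_1 = en_core_web_md.load() directly in a cell.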

08:24.640 --> 08:25.090
All right.

08:25.120 --> 08:30.640
So you can see, in nlp_1, this way our medium-size model also got loaded.

08:31.390 --> 08:31.780
All right.

08:31.790 --> 08:33.250
So that is all about it.

08:34.390 --> 08:40.920
In this video, we just saw how to load the model, how to install the different models, and what pre-

08:40.920 --> 08:41.230
trained

08:41.230 --> 08:43.390
models are available in this spaCy library.

08:43.750 --> 08:51.020
Now, the real game starts once we have the model: how to use this model for different NLP tasks like

08:51.460 --> 08:52.480
tokenization and so forth.

08:52.690 --> 08:55.630
So we'll see that tokenization in the next video.
