WEBVTT

00:01.190 --> 00:02.060
All right, everyone.

00:02.120 --> 00:07.910
So let's continue this lesson on tokenization. For that, we have already created these strings,

00:07.910 --> 00:09.100
s1 to s4.

00:09.620 --> 00:11.230
We have loaded our model as well.

00:11.290 --> 00:15.440
So now we are going to work with this nlp model.

00:15.560 --> 00:22.000
So once the model has been created, we are going to supply these strings, s1 to s4,

00:22.090 --> 00:24.140
one by one to our model.

00:24.470 --> 00:29.880
And we will tell it to tokenize this particular sentence into different tokens.

00:31.550 --> 00:32.990
So for that,

00:34.860 --> 00:38.940
we are going to apply this nlp and pass s1.

00:39.540 --> 00:41.940
So eventually it is going to return a doc.

00:42.340 --> 00:42.600
OK.

00:43.470 --> 00:46.470
Now, what just happened at this particular stage?

00:47.190 --> 00:52.370
You can see we have applied this nlp model to the s1 string.

00:52.950 --> 00:57.360
So all of that intelligence sits inside this nlp model.

00:57.780 --> 01:03.510
Now, here we are dealing with a very small model only, en_core_web_sm.

01:04.260 --> 01:07.680
And it will return a richer representation of our

01:07.710 --> 01:15.030
input text: tokenization, part-of-speech tagging, named entity recognition, lemmatization, and

01:15.030 --> 01:16.470
everything else, it will do.

01:16.770 --> 01:20.100
And it will create a Doc object for us.

01:20.780 --> 01:21.150
All right.

01:21.300 --> 01:23.010
So let me run it.

01:24.770 --> 01:31.900
And now let me write a for loop, so we can iterate through every single token inside this

01:31.930 --> 01:32.210
doc1.

01:34.940 --> 01:42.490
So let me write: for token in doc1, print token.

01:43.230 --> 01:48.110
Now, before that, let me print the whole original s1 first.

01:50.590 --> 01:57.220
All right, so now you can see it has created a tokenized version of our original input statement.

01:57.700 --> 02:01.570
So now it is like: Apple, is, looking, at, buying.

02:02.500 --> 02:07.780
Now you can see that even though there is a dot after the U, it does not assume that

02:08.560 --> 02:10.510
this is the end of the token.

02:10.810 --> 02:18.370
Instead, it has treated this United Kingdom, U.K. here, as one entity,

02:18.870 --> 02:26.530
that is, as one token. Then startup, for; the dollar sign is another token, 1 is another token, then billion.

02:26.770 --> 02:28.750
And we have an exclamation mark.

02:29.350 --> 02:32.080
So that is the beauty behind this particular

02:33.370 --> 02:36.600
tokenization implementation inside spaCy.

02:37.280 --> 02:38.780
Now let's go for the next one.

02:39.170 --> 02:44.510
So what I'm going to do is just bring everything into one cell for s2.

02:47.290 --> 02:53.490
So let me just make it s2, and here let us make it doc2.

02:54.930 --> 02:58.140
Now, what is this s2, and what are the challenges

02:58.180 --> 03:01.170
inside this s2? So if you observe here,

03:01.830 --> 03:04.640
in this first part of the sentence,

03:04.950 --> 03:10.110
there is no issue; that will be a very simple tokenization based on the whitespace character.

03:10.650 --> 03:16.040
But then there is something special, like an email address and a website address.

03:16.560 --> 03:21.330
So will spaCy treat this particular one as one entity?

03:21.790 --> 03:23.400
Or will it be split, so that "support" is separate,

03:23.400 --> 03:24.690
the @ sign is separate,

03:24.880 --> 03:25.710
the name is separate,

03:26.130 --> 03:27.240
and then "com" is separate?

03:27.600 --> 03:28.890
Let's see how it goes.

03:30.250 --> 03:32.160
So that will be... oops,

03:32.410 --> 03:34.210
it has to be doc2.

03:36.870 --> 03:44.340
All right, so now you can see it has treated the full email address as one single token.

03:44.760 --> 03:52.110
And that is the beauty of this spaCy tokenization: support@udemy.com is kept completely as one

03:52.110 --> 03:52.790
single token.

03:53.220 --> 03:56.850
At the same time, the whole URL, udemy.com,

03:57.000 --> 03:58.370
is also one token.

03:58.740 --> 04:03.070
So whenever you are dealing with this kind of email address, or you are dealing with a

04:04.170 --> 04:08.380
website address, spaCy considers it as one single token.
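
NOTE
Editor's sketch, not shown in the video: one rough way to picture why an email or website
address can stay a single token. It uses only Python's standard library, not spaCy itself.
EMAIL_RE, URL_RE, TRAILING, the function name, and the example.com addresses are all
hypothetical and much simpler than spaCy's real rules.
```python
import re
# Hypothetical, simplified patterns; spaCy's real URL and email rules are more thorough.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL_RE = re.compile(r"^(https?://)?(www\.)?[\w-]+(\.[\w-]+)+(/\S*)?$")
TRAILING = "!?,.;:"
def tokenize_keep_addresses(text):
    """Split on whitespace, peel trailing punctuation, keep address-like chunks whole."""
    tokens = []
    for chunk in text.split():
        tail = []
        # Peel trailing punctuation so "example.com!" becomes "example.com" and "!".
        while chunk and chunk[-1] in TRAILING:
            tail.insert(0, chunk[-1])
            chunk = chunk[:-1]
        if EMAIL_RE.match(chunk) or URL_RE.match(chunk):
            tokens.append(chunk)  # address-like: keep as one token
        else:
            # Otherwise separate word characters from punctuation, e.g. "$20" gives "$", "20".
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
        tokens.extend(tail)
    return tokens
print(tokenize_keep_addresses("Email support@example.com or visit http://www.example.com!"))
```
The point of the sketch is only the ordering: check for an address-like chunk first, and fall
back to punctuation splitting afterwards.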

04:08.870 --> 04:12.540
And that is the intelligence inside spaCy. OK.

04:13.070 --> 04:16.750
Let's go for the third one and see what is in the third one.

04:17.330 --> 04:22.600
So in the third one, there are some suffixes out there and there are some prefixes out here.

04:23.180 --> 04:28.700
So let's see whether it is considered as one single token, because it is joined together.

04:28.760 --> 04:30.290
There is no space in between.

04:30.680 --> 04:31.350
And here also,

04:31.400 --> 04:32.290
it is joined together.

04:32.330 --> 04:33.590
There is no space in between.

04:33.920 --> 04:38.390
So let's see how spaCy will treat it during tokenization.

04:39.020 --> 04:43.520
So let me just make it doc3 and s3.

04:44.990 --> 04:45.390
All right.

04:46.040 --> 04:47.740
Let's run it.

04:48.860 --> 04:51.230
And you'll be able to see that 10 is separate

04:51.440 --> 04:52.970
and km is separate.

04:53.150 --> 04:55.640
And here also, the dollar sign is separate and 20 is separate.

04:55.970 --> 05:03.380
And that is the real beauty behind this spaCy tokenization. Let's go for the last one.

05:03.980 --> 05:09.070
"Let's watch a movie together." So that is for s4.

05:09.820 --> 05:12.250
So let me make it doc4.

05:13.490 --> 05:13.800
Yes.

05:16.450 --> 05:16.810
All right.

05:19.840 --> 05:20.740
And you can see.

05:21.520 --> 05:26.940
"Let's" is not treated as one single token; instead, "Let" is separate here

05:27.520 --> 05:30.720
and this apostrophe-s is a separate one.

05:30.730 --> 05:36.370
After that, everything remains the same: watch, a, movie, together, and the full stop. All right.

05:36.410 --> 05:38.850
So that is all about the different kinds of sentences

05:39.080 --> 05:45.790
we have taken into consideration for tokenization, and this tokenizer is a very advanced

05:45.820 --> 05:47.410
tokenization implementation.

05:47.740 --> 05:54.370
So if you go to the documentation of spaCy, or you can just simply search for

05:56.000 --> 05:56.190
spaCy

05:57.190 --> 05:58.040
tokenization,

05:59.100 --> 06:02.370
there are some explanations out there, and those are very good tutorials.

06:03.990 --> 06:04.290
All right.

06:04.320 --> 06:05.640
So this is the tokenization example

06:06.710 --> 06:07.570
I want to show you.

06:08.210 --> 06:12.830
So let's say, in this particular statement, "Let's go to N.Y.!",

06:13.310 --> 06:14.610
we have N dot, Y dot,

06:14.760 --> 06:15.510
exclamation mark.

06:15.770 --> 06:16.950
So that is New York.

06:17.360 --> 06:18.830
As humans, we already know that.

06:19.480 --> 06:22.520
And, you know, there are these dots in it.

06:23.060 --> 06:26.610
So it does not split just based on the whitespace character.

06:26.900 --> 06:29.620
It will make all those decisions during tokenization.

06:30.050 --> 06:31.690
Now, here in this case, "Let's"

06:32.000 --> 06:38.300
and then "go": the very first step splits it like this, and in the second step it handles a prefix.

06:39.300 --> 06:40.500
Then this suffix.

06:41.040 --> 06:43.320
Then again, another suffix is split off.

06:43.810 --> 06:45.210
Then there is an exception.
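
NOTE
Editor's sketch, not shown in the video: the whitespace, prefix, suffix, and exception steps
described here can be pictured with a small loop. This is plain Python, not spaCy's actual
implementation. The PREFIXES, SUFFIXES, and EXCEPTIONS tables and the tokenize function are
hypothetical and deliberately tiny.
```python
# Hypothetical rule tables for illustration; real tokenizer tables are far larger.
PREFIXES = ('"', "(", "$")
SUFFIXES = ('"', ")", "!", "?", ",", ".")
EXCEPTIONS = {"Let's": ["Let", "'s"], "N.Y.": ["N.Y."], "U.K.": ["U.K."]}
def tokenize(text):
    tokens = []
    for chunk in text.split():  # step 1: split on whitespace
        front, back = [], []
        while chunk:
            if chunk in EXCEPTIONS:  # special cases win over prefix and suffix rules
                front.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk.startswith(PREFIXES):  # peel one prefix character
                front.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):  # peel one suffix character
                back.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:  # nothing left to peel, keep the rest as a token
                front.append(chunk)
                chunk = ""
        tokens.extend(front + back)
    return tokens
print(tokenize("Let's go to N.Y.!"))
```
Because the exception table is checked before the suffix rules, the trailing dot in N.Y. is
never peeled off, while the exclamation mark after it is.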

06:45.600 --> 06:52.050
So you can see at the end "Let" is separate, apostrophe-s is separate, then "go" and "to"; there is no

06:52.050 --> 06:53.250
issue in understanding that.

06:53.660 --> 06:57.290
Now this N.Y.: it has not split it into different tokens.

06:57.330 --> 07:01.590
Instead, based on the model and the model's intelligence,

07:01.620 --> 07:04.170
N.Y. is considered as one single token,

07:04.200 --> 07:07.650
because we already know, as humans, that it refers to New York.

07:07.980 --> 07:10.210
But this exclamation mark, obviously, is a separate token.

07:10.940 --> 07:11.310
All right.

07:11.340 --> 07:16.830
So that is the beauty of this spaCy tokenization: very advanced tokenization is already implemented

07:16.890 --> 07:17.160
here.

07:17.900 --> 07:18.260
All right.

07:18.290 --> 07:19.380
See you in the next video.

07:19.440 --> 07:24.090
We'll see some more stuff related to spaCy tokenization in the next video.
