WEBVTT

00:01.010 --> 00:02.240
Hey, everyone, welcome back.

00:02.930 --> 00:06.890
So in this video, we will see how we can train our own word embedding model.

00:07.610 --> 00:13.220
And we are going to do our own training with our custom dataset.

00:13.520 --> 00:18.940
We are going to use the Gensim library. So first, let me import TensorFlow 2.0.

00:19.070 --> 00:24.170
We are not going to use TensorFlow, but let me import it anyway with the default

00:24.250 --> 00:24.900
modules.

00:26.290 --> 00:32.480
As part of this Colab environment, you will definitely get TensorFlow 2.0 present.

00:32.920 --> 00:34.010
Now, for preprocessing,

00:34.120 --> 00:35.980
we are going to use the NLTK library.

00:35.980 --> 00:39.680
So I'm just going to import NLTK, and within NLTK

00:40.090 --> 00:41.980
there is one more package

00:41.980 --> 00:42.940
I am going to download.

00:43.000 --> 00:44.170
That is nothing but punkt.

00:44.500 --> 00:47.930
So that is basically for tokenization purposes.

00:47.960 --> 00:49.210
And we are going to use this

00:50.640 --> 00:51.000
module.

00:51.660 --> 00:52.560
All right now.

00:52.670 --> 00:54.620
The important part is Gensim.

00:55.170 --> 01:01.410
So from gensim.models, we are going to import Word2Vec and KeyedVectors.

01:01.830 --> 01:03.060
So let me import them.

01:03.870 --> 01:05.630
So these are some of the basic imports.

01:06.210 --> 01:11.180
Now, the first important task we are going to start with is data preprocessing.

01:11.760 --> 01:14.340
And for that, there is one dataset we are going to use,

01:14.540 --> 01:17.070
which is available on the Kaggle website.

01:17.130 --> 01:20.120
So let us follow this Kaggle website.

01:20.400 --> 01:22.620
And it will open a new tab.

01:23.670 --> 01:31.140
Now, to get any dataset from Kaggle, you need an account; without that, you just cannot download

01:31.160 --> 01:31.300
it.

01:31.710 --> 01:38.370
So we are going to use this World News on Reddit dataset, from 2008 to today.

01:39.150 --> 01:45.200
So there is a huge amount of data available for training our word embedding model, and

01:45.210 --> 01:47.160
it will be close to around 70 MB.

01:47.610 --> 01:52.090
So our first task is to get this particular dataset inside the Colab

01:52.350 --> 01:52.860
environment.

01:53.480 --> 01:53.850
All right.

01:54.240 --> 02:02.490
And to get it inside Colab, we are going to use the Kaggle API, and that is with the help of

02:02.490 --> 02:04.260
the kaggle package.

02:04.290 --> 02:10.450
So let me install the kaggle package with the package manager. And until this installation completes,

02:10.860 --> 02:19.890
first we have to grab the token. So you can go to your account, My Account, and just scroll down, and

02:19.890 --> 02:25.100
you will be able to see where you get the API key.

02:25.140 --> 02:32.070
So just create a new API token, just to download this dataset into our Colab environment.

02:32.880 --> 02:40.470
And I'm just going to save it to the Downloads folder right now.

02:41.220 --> 02:44.070
The next thing is, we are going to create a .kaggle

02:45.190 --> 02:45.820
directory.

02:47.110 --> 02:48.900
And our kaggle.json file,

02:48.940 --> 02:53.080
we are just going to copy it into this .kaggle directory.

02:53.680 --> 02:57.160
But before that, from our local environment, we have to upload it.

02:58.150 --> 02:58.890
So, upload.

02:59.270 --> 03:04.090
Go to Downloads, and our kaggle.json file is available.

03:04.450 --> 03:07.060
Now, if you open this kaggle.json file,

03:08.080 --> 03:12.520
there is a username given, and apart from the username,

03:12.530 --> 03:14.380
there is one API key.

03:15.300 --> 03:15.770
All right.

03:16.210 --> 03:18.260
So let's just copy it now.

03:18.610 --> 03:22.830
And after copying, let's just disable this particular API key.

03:23.110 --> 03:25.240
So we are just changing the permission.

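NOTE
A minimal sketch of the credential step described above, assuming the uploaded
file is named kaggle.json and sits in the working directory. The helper name
install_kaggle_token and its parameters are illustrative and not from the video.
In a notebook it would be called as install_kaggle_token("kaggle.json").
```python
import os
import shutil
import stat
# Hypothetical helper: copy an uploaded kaggle.json into <home>/.kaggle and
# restrict it to owner read/write, which is the equivalent of chmod 600.
def install_kaggle_token(src="kaggle.json", home=None):
    home = home or os.path.expanduser("~")
    kaggle_dir = os.path.join(home, ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)
    dst = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copyfile(src, dst)
    os.chmod(dst, stat.S_IRUSR | stat.S_IWUSR)
    return dst
```
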
03:26.140 --> 03:32.280
Let me execute it. OK, and now we are going to download this particular dataset,

03:32.740 --> 03:34.230
World News on Reddit.

03:34.810 --> 03:36.100
So let me go back.

03:36.420 --> 03:36.830
Yes.

03:37.150 --> 03:38.770
So if you click on Download,

03:39.940 --> 03:44.460
you will be able to download it to your local machine, in case you want to create this whole model on

03:44.460 --> 03:51.810
a local machine. And if you want to download it from Kaggle directly into a Colab environment, you can use

03:51.810 --> 03:53.710
this command, kaggle datasets download.

03:54.750 --> 04:01.130
So it is close to 78 MB, and it will take a little time to download.

04:02.640 --> 04:04.790
All right, so the data got downloaded.

04:05.720 --> 04:07.930
We can even check whether the data is available here.

04:08.090 --> 04:11.870
Yes, the data is available: the World News on Reddit zip file.

04:12.440 --> 04:19.840
Now, let's just unzip it, and we will keep it inside this content directory, World News on Reddit.

04:20.270 --> 04:21.100
Let me execute it.

04:21.380 --> 04:23.770
So, all right.

04:24.180 --> 04:27.770
Let's go to content, and inside content,

04:28.130 --> 04:32.840
you can see the file got unzipped into a nice CSV file format.

04:33.320 --> 04:34.340
Now, next,

04:34.430 --> 04:35.270
we are going to

04:36.690 --> 04:44.130
read this CSV file into a DataFrame object. For reading, we are going to use the pandas library.

04:44.570 --> 04:46.450
Let's display the first few records.

04:47.750 --> 04:50.380
All right, so this dataset has eight

04:50.430 --> 04:51.040
columns.

04:51.890 --> 04:57.110
And mainly we are going to use this title column for our word embedding model.

04:57.830 --> 05:00.920
How many records are available in total?

05:00.950 --> 05:03.470
We can get that from shape.

05:03.950 --> 05:10.490
So we have 509,236 records available.

05:10.520 --> 05:11.930
That's quite good enough.

05:12.350 --> 05:15.830
And this is the only column which contains all those titles.

05:16.580 --> 05:21.740
So we'll just grab this title column into another variable, news_title.

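NOTE
A rough sketch of the loading step described above, assuming pandas is
installed and the unzipped CSV has a column named title. The function name
load_titles and the csv_path parameter are hypothetical. The real file name
under the content directory is not given in the video, so it is left to the caller.
```python
import pandas as pd
# Hypothetical helper: read the unzipped CSV into a DataFrame, report its
# shape as (rows, columns), and return only the title column.
def load_titles(csv_path):
    df = pd.read_csv(csv_path)
    print(df.shape)
    return df["title"]
```
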
05:22.060 --> 05:22.490
All right.

05:22.970 --> 05:24.770
If you just try to display,

05:25.920 --> 05:27.960
let's say, the first few records,

05:28.910 --> 05:31.980
so: "Scores killed in Pakistan clashes."

05:32.330 --> 05:32.480
Yes.

05:32.580 --> 05:33.840
So this is the first record.

05:34.380 --> 05:37.650
Now we are going to first tokenize this one.

05:38.130 --> 05:42.880
So for tokenization, just like earlier, we use the NLTK library.

05:43.140 --> 05:49.670
So from NLTK, word_tokenize is what we are going to use, and it will do the tokenization

05:49.680 --> 05:51.870
on every single record.

05:52.290 --> 05:54.660
So we'll get a vector, like new_vec.

05:54.740 --> 05:55.200
Correct.

05:55.470 --> 05:58.350
Now, there is a good amount of data available,

05:58.710 --> 06:01.260
so it will take a little bit of time.

06:03.620 --> 06:06.330
All right, so now you can see the tokenization has finished.

06:07.050 --> 06:08.660
Let's display the first few records.

06:08.950 --> 06:09.720
So, new_vec.

06:10.850 --> 06:11.640
That's correct.

06:12.380 --> 06:13.340
Let's just display it.

06:14.680 --> 06:16.770
So the very first record will be a list,

06:17.080 --> 06:20.260
and its individual values will be the different words.

06:20.270 --> 06:23.320
Or, I would say, in NLP language,

06:23.440 --> 06:24.640
they will be tokens.

06:24.680 --> 06:28.660
So: "Scores", "killed", "in", "Pakistan", and "clashes".

06:28.930 --> 06:31.510
So each individual word is considered as a token.

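NOTE
A small sketch of the tokenization step, assuming NLTK is installed. The list
titles is a stand-in for the title column, and its second entry is made up.
One deviation from the video: preserve_line=True is passed so that
word_tokenize skips sentence splitting, which keeps this sketch independent
of the punkt download. For one-line headlines the tokens should be the same.
```python
from nltk.tokenize import word_tokenize
# Stand-in data: the first headline is the record shown in the video,
# the second one is invented for illustration.
titles = ["Scores killed in Pakistan clashes", "Markets steady after vote"]
# One list of tokens per headline.
tokens = [word_tokenize(t, preserve_line=True) for t in titles]
print(tokens[0])
```
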
06:31.960 --> 06:35.870
Next, we're going to build the actual Word2Vec model.

06:36.310 --> 06:40.420
And this Word2Vec model functionality is available in the Gensim library.

06:40.920 --> 06:45.210
Right now, it is expecting these mandatory arguments.

06:45.240 --> 06:47.520
So first one is for the text.

06:48.190 --> 06:53.590
So the text will be new_vec, or I would say, the text

06:53.650 --> 06:57.400
we will supply. The second argument will be the minimum count.

06:57.760 --> 07:03.040
So it will consider only those words having a minimum count of one.

07:03.070 --> 07:09.070
So obviously, every single unique word will be taken into consideration. And the size of the vector

07:09.070 --> 07:09.770
will be 32.

07:09.850 --> 07:17.110
Because after embedding, whatever output we get from the model will be of size 32

07:17.830 --> 07:19.610
numbers, in terms of dimensions.

07:19.630 --> 07:26.370
That is to say, every single word or token will be represented

07:26.570 --> 07:26.760
with

07:27.260 --> 07:30.370
32 numbers, in a 32-dimensional space.

07:30.850 --> 07:34.270
So let me execute it, and it will take a good amount of time.

07:34.570 --> 07:39.880
So I'm just winding up this video at this point.

07:40.270 --> 07:46.300
And in the next video, we will see, after the model is created, how we can do a prediction based

07:46.300 --> 07:47.680
on this particular model.

07:47.990 --> 07:49.570
So see you in the next video.