WEBVTT

00:00.910 --> 00:01.470
Hey, everyone.

00:02.020 --> 00:08.350
So the next project we are going to work on is text generation, and for that we are going to

00:08.350 --> 00:16.270
use TensorFlow, Keras, and a recurrent neural network, the LSTM, long short-term memory.

00:16.590 --> 00:19.420
So what exactly is a text generation project?

00:19.840 --> 00:29.860
Let me go to some random text on Wikipedia and show what we are trying to achieve with this text generation

00:29.920 --> 00:35.290
project and how this LSTM network, or recurrent neural net, could help us.

00:35.620 --> 00:37.630
So let's say we go to English Wikipedia.

00:39.900 --> 00:46.800
Suppose you give a sufficient amount of text to the machine, like thousands and thousands of this kind of

00:46.900 --> 00:51.630
Wikipedia articles, directly to this neural network.

00:52.470 --> 00:59.670
And then what we want from the neural network is this: suppose you give it some text data like this one.

01:00.800 --> 01:08.400
Then we're expecting that the machine, this neural network or recurrent neural network, will predict

01:08.780 --> 01:14.150
what the possible values will be, or what the possible next token would be.

01:14.280 --> 01:14.520
OK.

01:15.110 --> 01:17.130
So in this particular case, it will be.

01:18.080 --> 01:18.650
Suppose

01:19.930 --> 01:21.910
I select some other text like this.

01:22.300 --> 01:25.300
So the next token will be this one.

01:25.690 --> 01:34.210
So this is how a text generation project works: we want the machine to predict what the next

01:34.600 --> 01:36.150
token would be.

01:36.550 --> 01:39.150
So let's start with the installation and setup.

01:39.440 --> 01:41.340
I'm not going to execute any code.

01:42.070 --> 01:48.510
I just want to walk through how things are implemented and what the preprocessing steps are.

01:49.570 --> 01:54.880
So here TensorFlow and the requests library are imported.

01:55.870 --> 01:58.960
The next important thing is data preprocessing.

01:59.380 --> 02:06.700
And for preprocessing, we are going to use this Shakespeare text file as input data.

02:07.630 --> 02:09.670
So if we just follow this particular link,

02:10.910 --> 02:18.780
you can see that this is the Shakespeare text file, and it has been taken from this MIT website.

02:19.370 --> 02:23.150
So it has a good amount of English literature.

02:23.420 --> 02:26.000
So a good number of word combinations are available.

02:26.540 --> 02:35.240
And now we are going to generate sequences out of this whole set of text, and those sequences

02:35.300 --> 02:37.990
we will feed to the LSTM network.

02:38.630 --> 02:39.950
And that LSTM network,

02:40.070 --> 02:46.340
eventually, at prediction time, will tell us, based on your seed sequence,

02:46.670 --> 02:49.120
what the possible next token would be.

02:49.910 --> 02:50.250
All right.

02:50.630 --> 02:53.510
So in this particular cell, we're just getting the data.

02:53.990 --> 02:57.440
And here we are just displaying response.text.

02:58.860 --> 03:04.050
Let's split this whole text line by line.

03:04.620 --> 03:07.950
And this is the very first line we have displayed.

03:09.090 --> 03:10.620
If you go to the original text,

03:12.560 --> 03:16.670
this is the 100th Etext file presented by Project Gutenberg.

03:17.290 --> 03:20.080
So it is the 100th Etext file presented by Project Gutenberg.

03:20.700 --> 03:25.420
Now we'll take line number 253.

03:25.860 --> 03:31.540
And from that particular line onwards only, we are going to take all the data.

03:31.900 --> 03:33.070
Now, why this 253?

03:33.130 --> 03:35.170
Because before that, all the

03:36.590 --> 03:43.280
publisher-related information and everything is situated, the legal text, or I will say this licensing

03:43.280 --> 03:47.450
agreement they've written. The actual text will start after these

03:47.770 --> 03:49.530
253 lines.

03:50.270 --> 03:54.260
And after these 253 lines, all the data

03:54.290 --> 03:57.970
I'm just slicing and putting into a data variable.

03:58.460 --> 04:01.610
So after that, data[0] will be our first line.

04:01.640 --> 04:04.730
Let me just copy and paste it here.

04:06.570 --> 04:06.910
All right.

04:06.950 --> 04:11.420
So the actual text of this Shakespeare file will start at this particular location.

04:11.840 --> 04:16.310
Now, before that, there are already 253 sentences.

04:17.180 --> 04:23.030
If you try to find the length of this particular data variable, we have

04:23.090 --> 04:27.260
124,204 sentences.

04:27.830 --> 04:29.310
Or, I would say, lines.

04:29.880 --> 04:37.490
Let's just join all of them, so that whatever data we get will be just the remaining part

04:37.490 --> 04:38.000
of this one.

04:38.420 --> 04:39.320
So nothing else will

04:39.440 --> 04:42.170
be there, only this much part.

04:42.320 --> 04:46.760
We have removed the licensing part and some publisher-related information.

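NOTE
The slicing and joining step described above could be sketched roughly as follows.
This is a minimal sketch, not the code from the video: the helper name strip_header
and the toy text are made up for illustration, and joining with a space is an
assumption. Only the idea of dropping the first 253 lines comes from the walkthrough.
```python
def strip_header(raw_text, start_line=253):
    """Drop the first `start_line` lines and join the rest back together."""
    lines = raw_text.split("\n")      # one entry per line of the file
    body = lines[start_line:]         # keep everything from `start_line` on
    return body, " ".join(body)       # the kept lines and one long string
# Toy stand-in for the downloaded file: 3 header lines, then the real text.
toy_text = "license line\npublisher line\nmore legal text\nFrom fairest creatures\nwe desire increase"
data, joined = strip_header(toy_text, start_line=3)
print(data[0])    # first line of the actual text
print(len(data))  # number of lines that were kept
```
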
04:47.980 --> 04:49.250
All right, let's proceed.

04:50.000 --> 04:52.940
Next, we are removing all the punctuation marks.

04:53.300 --> 04:56.900
So we are just splitting the data, removing punctuation marks,

04:57.890 --> 04:59.060
and trying to remove

04:59.390 --> 05:02.110
any non-alphabetic characters, if there are any.

05:02.210 --> 05:05.950
So we are keeping only alphabetic characters, and then

05:06.050 --> 05:07.250
we are just lowercasing them.

05:07.730 --> 05:12.980
So this particular function will take the full text and clean it up.

05:13.340 --> 05:21.200
So it will perform these three steps: removing punctuation marks, removing non-alphabetic characters,

05:21.620 --> 05:25.010
and making every single token lowercase.

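NOTE
The three cleaning steps above might look something like this. It is a hedged
sketch using only the Python standard library. The function name clean_text and
the sample sentence are assumptions for illustration, not taken from the video.
```python
import string
def clean_text(doc):
    """Split into tokens, strip punctuation, keep alphabetic tokens, lowercase."""
    tokens = doc.split()
    table = str.maketrans("", "", string.punctuation)     # deletes punctuation
    tokens = [word.translate(table) for word in tokens]
    tokens = [word for word in tokens if word.isalpha()]  # drops numbers and empties
    return [word.lower() for word in tokens]
tokens = clean_text("From fairest creatures, we desire 2 increase!")
print(tokens)
print(len(tokens), len(set(tokens)))  # total tokens and unique tokens
```
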
05:26.450 --> 05:26.790
All right.

05:27.020 --> 05:33.610
So after cleaning the data, here we are just displaying the first 50 tokens.

05:33.800 --> 05:36.700
So the tokens start from 'from fairest creatures'.

05:36.860 --> 05:39.920
And you can see the very first sentence also starts with 'From fairest'.

05:41.940 --> 05:44.110
How many tokens got generated out of it?

05:44.430 --> 05:46.480
So that will be

05:46.640 --> 05:48.660
898,199.

05:49.440 --> 05:51.660
And how many unique tokens are there?

05:52.110 --> 05:53.980
So that will be just

05:53.990 --> 05:55.180
27,956.

05:55.320 --> 06:01.830
That's also, I mean, too many, actually, because in daily life, whatever English we use,

06:02.130 --> 06:06.920
that is way less than this much vocabulary.

06:07.440 --> 06:07.810
All right.

06:08.100 --> 06:11.460
So now we have our data available in terms of tokens.

06:11.730 --> 06:14.970
So the next thing is, we need to generate the sequences.

06:15.360 --> 06:23.880
So in this particular looping mechanism, what we are trying to do is take sequences of length 50

06:26.020 --> 06:30.190
and put them as separate values into this lines variable.

06:30.910 --> 06:35.590
So here we will get 199,951.

06:36.010 --> 06:42.240
That is just because we are taking into consideration the first 200k words only.

06:42.670 --> 06:46.930
So if you see, in the first 200k words, the last 50 words

06:47.290 --> 06:49.450
do not have anything to predict next.

06:49.550 --> 06:56.170
That's why just 199,951 have been taken into consideration.

06:56.620 --> 07:00.780
So this is our first line and you can say this is our first sequence.

07:02.240 --> 07:04.510
So that will be token 0 to token

07:04.880 --> 07:06.290
50. So 'from' to 'self'.

07:07.350 --> 07:07.680
All right.

07:08.300 --> 07:11.380
This lines[1] will be the next sequence.

07:11.740 --> 07:17.210
Now, if you consider the very first sequence and the second sequence, the second sequence will be just

07:17.210 --> 07:19.390
shifted by one token.

07:19.690 --> 07:27.550
So here it was starting from, I mean, the 'from' token, whereas the second sequence is starting

07:27.550 --> 07:29.380
from this 'fairest' token.

07:29.650 --> 07:32.650
If you go to the very end of both sequences,

07:33.680 --> 07:37.270
here, the last ones will be 'lies', 'thy', and 'self'.

07:37.640 --> 07:40.400
So here it will be 'thy self', and there will be one more.

07:40.700 --> 07:43.160
So this one is shifted by one place.

07:43.960 --> 07:46.400
So here it was token 0 and token

07:46.470 --> 07:48.020
50, you can consider.

07:48.080 --> 07:49.250
So this is token 1.

07:49.820 --> 07:52.320
And this one will be token 51.

07:53.270 --> 07:53.670
All right.

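NOTE
The sliding-window idea above, where each sequence is the previous one shifted
by one token, can be sketched like this. The helper build_sequences and the
tiny window size are assumptions chosen for readability. The video uses 50 input
tokens plus 1 target token and caps the data at roughly the first 200k tokens.
```python
def build_sequences(tokens, seq_len=50):
    """Slide a window of seq_len + 1 tokens over the list, one token at a time."""
    window = seq_len + 1                    # seq_len inputs plus 1 token to predict
    lines = []
    for end in range(window, len(tokens) + 1):
        lines.append(" ".join(tokens[end - window:end]))
    return lines
toy_tokens = "from fairest creatures we desire increase that".split()
lines = build_sequences(toy_tokens, seq_len=3)   # tiny window for readability
print(lines[0])
print(lines[1])
print(len(lines))
```
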
07:54.690 --> 08:02.720
So we have created the sequences. Next, we need to tokenize them, and we have to apply a sequence to

08:02.820 --> 08:03.840
number mapping.

08:04.260 --> 08:07.340
And for that, same as in the earlier projects,

08:08.010 --> 08:09.760
we are going to use this Tokenizer.

08:09.840 --> 08:14.500
And on top of the tokenizer, we are going to fit this lines variable.

08:15.030 --> 08:22.170
And then we will get one big object, sequences, where all those sequences are turned into numbers.

08:22.440 --> 08:30.880
So here every single sequence is represented by a vector of numbers.

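NOTE
To make the word-to-number mapping concrete, here is a small pure-Python stand-in.
It is not the Keras Tokenizer itself. It only mimics the idea that more frequent
words get smaller indices and that indices start at 1. The function names and the
toy lines are assumptions for illustration.
```python
from collections import Counter
def fit_word_index(lines):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for line in lines for word in line.split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])  # stable sort
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
def texts_to_sequences(lines, word_index):
    """Replace every word with its integer index."""
    return [[word_index[word] for word in line.split()] for line in lines]
lines = ["the cat and the dog", "the dog and the bird"]
word_index = fit_word_index(lines)
index_word = {index: word for word, index in word_index.items()}
sequences = texts_to_sequences(lines, word_index)
print(word_index)
print(sequences)
```
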
08:31.930 --> 08:37.130
So out of these sequences, we are just trying to grab X and y.

08:37.560 --> 08:45.290
So this particular part will take everything except the last one and put it into the X variable.

08:45.900 --> 08:51.330
And this particular variable will take only the last token of each sequence,

08:51.600 --> 08:52.770
which eventually we are trying to predict.

08:53.190 --> 08:59.190
So this will occupy 50 tokens, and this will be just the 51st token.

08:59.550 --> 09:05.020
So eventually, from these 50 tokens, we are trying to predict

09:05.220 --> 09:11.850
this 51st token. The next time, it will select tokens 1 to 50,

09:12.430 --> 09:15.300
those numbers, another 50 tokens.

09:15.540 --> 09:17.850
And the target will be the immediate

09:18.000 --> 09:18.720
next token.

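NOTE
The X and y split described above is plain array slicing. A minimal sketch with
NumPy follows, using made-up sequences of length 5 instead of the 51 used in the
video. The variable names are assumptions.
```python
import numpy as np
# Three made-up integer sequences of length 5 (the video uses length 51).
sequences = np.array([[1, 4, 2, 1, 3],
                      [4, 2, 1, 3, 5],
                      [2, 1, 3, 5, 6]])
X = sequences[:, :-1]   # every column except the last: the input tokens
y = sequences[:, -1]    # only the last column: the token to predict
print(X.shape, y.shape)
print(X[0], y[0])
```
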
09:19.680 --> 09:20.510
In the same way,

09:20.820 --> 09:22.020
we got X[0].

09:22.170 --> 09:23.880
So that will be an array of numbers.

09:24.300 --> 09:26.820
If you find the shape of this X[0],

09:26.870 --> 09:29.880
that will be 50 numbers. And y[0]:

09:29.910 --> 09:31.810
y[0] will be 307.

09:31.830 --> 09:36.470
So that is what we are trying to predict when we have given these numbers.

09:36.750 --> 09:40.250
Now, all those numbers are

09:41.750 --> 09:43.980
a representation of some tokens.

09:45.020 --> 09:46.730
So that token

09:48.170 --> 09:55.280
name will be given by this word index. So 'the' is one and 'and' is two. In the same way, if you try to find

09:55.420 --> 10:00.050
3806 in this index-to-word dictionary, you will get some number,

10:01.900 --> 10:02.500
some token.

10:03.550 --> 10:07.750
So there will be a total vocabulary size of 13,008.

10:08.940 --> 10:14.420
But here, the indexing is starting from zero, so the real vocabulary size you will get is

10:14.430 --> 10:15.720
13,008 plus one.

10:15.750 --> 10:17.700
So it will be 13,009.

10:18.390 --> 10:23.940
And the unique tokens, you will get like 27,956.

10:24.660 --> 10:31.980
Now, whatever the output values are, we have to convert them into some kind of categorical values.

10:32.340 --> 10:38.760
So based on the vocabulary size, one neuron should fire.

10:39.120 --> 10:42.100
So let's say we are trying to predict some token.

10:42.300 --> 10:44.340
Let's say from these first 50

10:45.270 --> 10:46.180
we got some idea.

10:46.740 --> 10:47.160
So let's say

10:47.310 --> 10:48.120
this is the first sequence,

10:48.300 --> 10:52.860
I'm just assuming, and from that I am trying to predict 'thy'.

10:53.310 --> 10:58.960
So 'thy' is represented by a vector of this vocab size.

10:59.400 --> 11:06.540
And whichever neuron fires, we will say that this particular token is the one we have predicted.

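NOTE
The categorical conversion above is a one-hot encoding with one position per
vocabulary entry. The sketch below uses NumPy as a stand-in. In Keras the
equivalent helper is to_categorical with num_classes set to the vocabulary size,
which is presumably what the video relies on. The mapping and targets are made up.
```python
import numpy as np
word_index = {"the": 1, "and": 2, "dog": 3}   # made-up mapping, indices start at 1
vocab_size = len(word_index) + 1              # +1 so the largest index fits when positions count from 0
y = np.array([1, 3, 2])                       # made-up target token indices
y_onehot = np.eye(vocab_size, dtype="float32")[y]   # one row per target
print(y_onehot.shape)
print(y_onehot[1].argmax())   # position of the single 1 in the second row
```
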
11:07.170 --> 11:13.820
So in the case of X, we have 50 values, and the sequence length is also 50.

11:14.280 --> 11:16.420
So that is all about our data preprocessing.

11:16.820 --> 11:20.400
We have created our input dataset and output dataset.

11:20.820 --> 11:23.520
So the next thing is, we need to build the model.

11:23.880 --> 11:26.100
So building the model, we will see in the next video.