1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:04,000
So we are going to continue our discussion of LangChain.

3
00:00:04,000 --> 00:00:09,000
We have already discussed data ingestion techniques, where we covered various document

4
00:00:09,000 --> 00:00:10,000
loaders.

5
00:00:10,000 --> 00:00:14,000
Now in this video we are going to go to the next step and discuss data transformation.

6
00:00:14,000 --> 00:00:20,000
Now in this data transformation, we will see how we can convert these documents into

7
00:00:20,000 --> 00:00:23,000
chunks of text, again in LangChain.

8
00:00:23,000 --> 00:00:29,000
We have a lot of different ways to convert all these documents

9
00:00:29,000 --> 00:00:31,000
into these text chunks.

10
00:00:31,000 --> 00:00:32,000
And why do we do this?

11
00:00:32,000 --> 00:00:37,000
Because every LLM has its own limit on context size.

12
00:00:37,000 --> 00:00:42,000
To take care of that, we have to divide these documents into smaller text

13
00:00:42,000 --> 00:00:43,000
chunks.

14
00:00:43,000 --> 00:00:48,000
Now let me go ahead and show you how we can actually do this.

15
00:00:48,000 --> 00:00:52,000
I will keep my face hidden so that your focus will be completely on the screen.

16
00:00:52,000 --> 00:00:53,000
Okay.

17
00:00:53,000 --> 00:00:56,000
If you remember, in our last class, I showed you how to read a PDF file.

18
00:00:56,000 --> 00:00:57,000
A text file, too.

19
00:00:57,000 --> 00:01:02,000
Along with that, a web-based loader, to read from a website, or arXiv.

20
00:01:02,000 --> 00:01:04,000
Then you also have Wikipedia.

21
00:01:04,000 --> 00:01:05,000
So many different things we saw.

22
00:01:05,000 --> 00:01:05,000
Right?

23
00:01:05,000 --> 00:01:07,000
Let's take one example.

24
00:01:07,000 --> 00:01:11,000
So let's read this particular PDF file in this particular file itself.

25
00:01:11,000 --> 00:01:13,000
Right, I will go ahead and select the kernel.

26
00:01:13,000 --> 00:01:16,000
So let's go ahead and select it now once I execute it.

27
00:01:16,000 --> 00:01:21,000
So these are my docs that you will be able to see from this attention.pdf file.

28
00:01:22,000 --> 00:01:26,000
Now what we are going to do is that we are going to take this entire PDF.

29
00:01:26,000 --> 00:01:30,000
And then we are going to, uh, convert this into text.

30
00:01:30,000 --> 00:01:30,000
Okay.

31
00:01:30,000 --> 00:01:38,000
So here first method that we are going to do is that we'll see how to recursively okay.

32
00:01:38,000 --> 00:01:44,000
Or let me just go ahead and make this as a markdown and write the title over here.

33
00:01:44,000 --> 00:01:48,000
How to recursively split text by characters okay.

34
00:01:48,000 --> 00:01:50,000
So our first method is this particular technique.

35
00:01:50,000 --> 00:01:52,000
So here let me just go and do it.

36
00:01:52,000 --> 00:01:57,000
So here uh what we are going to do is that we are going to recursively split, take up all these documents

37
00:01:58,000 --> 00:02:01,000
and split this text by characters itself.

38
00:02:01,000 --> 00:02:06,000
So first of all, before we go ahead, you know, we require a library.

39
00:02:06,000 --> 00:02:08,000
Now what is a library that will be required over here?

40
00:02:08,000 --> 00:02:10,000
I will just go ahead and write it down.

41
00:02:10,000 --> 00:02:15,000
So this is nothing but langchain-text-splitters, okay.

42
00:02:15,000 --> 00:02:21,000
So this is the library that we will be requiring in order to use all the different types of text splitting

43
00:02:21,000 --> 00:02:22,000
techniques.

44
00:02:22,000 --> 00:02:26,000
So I will go ahead and uh quickly do pip install.

45
00:02:26,000 --> 00:02:29,000
So here you can see pip install -r requirements.txt.

46
00:02:29,000 --> 00:02:32,000
Now this way you can actually open your terminal.

47
00:02:32,000 --> 00:02:36,000
The other way is directly go to this particular option and click on terminal right.

48
00:02:36,000 --> 00:02:38,000
So you will be able to open this right.

49
00:02:38,000 --> 00:02:43,000
I'm using a shortcut, which is nothing but the backtick key along with Control.

50
00:02:43,000 --> 00:02:45,000
So Control and backtick.

51
00:02:45,000 --> 00:02:48,000
When you do that you will be able to open this particular terminal okay.

52
00:02:48,000 --> 00:02:51,000
So my installation of langchain-text-splitters is already done.

53
00:02:51,000 --> 00:02:54,000
Now I will go ahead and write my code.

54
00:02:54,000 --> 00:03:02,000
Now first what we will do is that we will go ahead and import from

55
00:03:02,000 --> 00:03:03,000
langchain_text_splitters.

56
00:03:03,000 --> 00:03:04,000
Okay.

57
00:03:04,000 --> 00:03:08,000
And first of all we are going to import what is called RecursiveCharacterTextSplitter.

58
00:03:08,000 --> 00:03:13,000
Now this is responsible for recursively splitting the text by characters.

59
00:03:13,000 --> 00:03:18,000
So here we have already loaded the document okay I have my entire document over here.

60
00:03:18,000 --> 00:03:19,000
Right.

61
00:03:19,000 --> 00:03:21,000
So this is my docs variable.

62
00:03:21,000 --> 00:03:21,000
Okay.

63
00:03:21,000 --> 00:03:27,000
Now what I'm actually going to do is that I'll just go ahead and write text_splitter, okay.

64
00:03:28,000 --> 00:03:29,000
And here will be my variable.

65
00:03:29,000 --> 00:03:32,000
And we will go ahead and use this RecursiveCharacterTextSplitter.

66
00:03:33,000 --> 00:03:37,000
Inside this there are some parameters which you really need to know.

67
00:03:37,000 --> 00:03:39,000
The first parameter is called chunk_size.

68
00:03:39,000 --> 00:03:42,000
So here you basically define the chunk.

69
00:03:42,000 --> 00:03:43,000
Let's say I go ahead and define.

70
00:03:43,000 --> 00:03:45,000
My chunk is somewhere around 100.

71
00:03:45,000 --> 00:03:51,000
Or, based on this chunk size, let's make it 500.

72
00:03:51,000 --> 00:03:57,000
And here you'll be able to see that I will keep my second parameter, which is called chunk overlap.

73
00:03:57,000 --> 00:03:57,000
Okay.

74
00:03:57,000 --> 00:04:00,000
So this will basically be my chunk_overlap.

75
00:04:00,000 --> 00:04:02,000
You'll understand each and every thing.

76
00:04:02,000 --> 00:04:06,000
First of all, let me just write the code so that you get to know exactly what I'm doing.

77
00:04:06,000 --> 00:04:12,000
So here I'm saying, hey, try to split all these things by considering the chunk size of 500.

78
00:04:12,000 --> 00:04:17,000
That basically means when I'm dividing my documents into smaller chunks of text, each chunk

79
00:04:17,000 --> 00:04:20,000
of text should have a maximum chunk size of 500 characters.

80
00:04:20,000 --> 00:04:23,000
And there can be an overlap of 50 characters.

81
00:04:23,000 --> 00:04:23,000
Okay.

82
00:04:23,000 --> 00:04:28,000
And here I'll use these two specific parameters, okay.

83
00:04:29,000 --> 00:04:31,000
After this, uh, we will use this text splitter.

84
00:04:32,000 --> 00:04:34,000
And we have a lot of options over here.

85
00:04:34,000 --> 00:04:37,000
So the first option is nothing but create_documents.

86
00:04:37,000 --> 00:04:42,000
You also have options like, let's see, create_documents is there.

87
00:04:42,000 --> 00:04:42,000
Okay.

88
00:04:42,000 --> 00:04:45,000
So let's go ahead and use this create_documents, okay.

89
00:04:46,000 --> 00:04:50,000
So here what I'm actually going to do is just go ahead and write create_documents.

90
00:04:50,000 --> 00:04:54,000
And to this create_documents I'm just going to give my docs.

91
00:04:54,000 --> 00:04:54,000
Okay.

92
00:04:55,000 --> 00:05:00,000
So once I do the split okay I'm actually going to get some values over here.

93
00:05:00,000 --> 00:05:05,000
And this will basically be my final_documents, okay.

94
00:05:05,000 --> 00:05:11,000
Once I get this over here you will be able to see that I'll go ahead and execute it.

95
00:05:12,000 --> 00:05:17,000
So here it says, hey, expected string or bytes-like object.

96
00:05:17,000 --> 00:05:24,000
Now see there is one problem that you will be seeing over here when we are dividing this entire text

97
00:05:24,000 --> 00:05:26,000
into documents.

98
00:05:26,000 --> 00:05:27,000
Okay.

99
00:05:27,000 --> 00:05:30,000
So here I have used create_documents.

100
00:05:30,000 --> 00:05:36,000
But you can actually see my return type of this particular docs right.

101
00:05:36,000 --> 00:05:38,000
What is the return type of this particular docs.

102
00:05:38,000 --> 00:05:40,000
It is nothing but document.

103
00:05:40,000 --> 00:05:43,000
So if I go ahead and see the type of docs.

104
00:05:43,000 --> 00:05:48,000
So here you'll be able to see that, hey, uh, you'll be able to see, hey, it is a list.

105
00:05:48,000 --> 00:05:49,000
Okay.

106
00:05:49,000 --> 00:05:53,000
If I go ahead and see docs[0], here you'll be able to see it is a Document.

107
00:05:53,000 --> 00:05:54,000
Okay.

108
00:05:54,000 --> 00:05:57,000
But the function that we have used is create_documents.

109
00:05:57,000 --> 00:05:58,000
Okay.

110
00:05:58,000 --> 00:06:05,000
So create_documents is helpful when we have plain text, for example from a text file or any other

111
00:06:05,000 --> 00:06:05,000
file.

112
00:06:05,000 --> 00:06:07,000
I'll also show you an example with respect to the text file.

113
00:06:07,000 --> 00:06:08,000
Okay.

114
00:06:08,000 --> 00:06:12,000
Instead of using create_documents, we will just go ahead and directly use

115
00:06:12,000 --> 00:06:13,000
split_documents.

116
00:06:13,000 --> 00:06:17,000
Because see already we know here I'm going to pass all my documents over here.

117
00:06:17,000 --> 00:06:18,000
And based on this I'm going to split it.

118
00:06:18,000 --> 00:06:23,000
So now if I go ahead and execute it now here is my entire final documents.

119
00:06:23,000 --> 00:06:27,000
Right, now this final_documents is again a list of documents.

120
00:06:27,000 --> 00:06:29,000
So let me do one thing okay.

121
00:06:29,000 --> 00:06:31,000
Let me just go ahead and print.

122
00:06:31,000 --> 00:06:31,000
Okay.

123
00:06:31,000 --> 00:06:36,000
final_documents[0].

124
00:06:36,000 --> 00:06:42,000
And then I will go ahead and print final_documents[1], okay.

125
00:06:43,000 --> 00:06:46,000
So here you will be able to see this is my page content.

126
00:06:46,000 --> 00:06:51,000
Now focus on these words University of Toronto with all this information okay.

127
00:06:52,000 --> 00:06:53,000
And if I go last.

128
00:06:53,000 --> 00:06:54,000
Okay.

129
00:06:54,000 --> 00:06:59,000
So here also you can see University of Toronto right now why we are actually getting this.

130
00:06:59,000 --> 00:07:01,000
Because there is an overlap, right.

131
00:07:01,000 --> 00:07:03,000
So this is my first document.

132
00:07:03,000 --> 00:07:06,000
The first document with 500 characters.

133
00:07:06,000 --> 00:07:09,000
And as I said there will be an overlap of how many characters.

134
00:07:09,000 --> 00:07:10,000
There will be an overlap of 50.

135
00:07:10,000 --> 00:07:16,000
So that is the reason you will be able to see that some of the last characters in this chunk will be

136
00:07:16,000 --> 00:07:17,000
repeated again at the start of the next chunk.

137
00:07:17,000 --> 00:07:18,000
Right?

138
00:07:18,000 --> 00:07:21,000
So University of Toronto is there then here you are actually getting this information.

139
00:07:21,000 --> 00:07:22,000
You can compare it.

140
00:07:22,000 --> 00:07:23,000
Okay.

141
00:07:23,000 --> 00:07:27,000
So that entire information is basically getting displayed over here.
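To make the overlap idea concrete, here is a simplified fixed-window sketch in plain Python. It is not the algorithm RecursiveCharacterTextSplitter uses (that one splits on separators such as newlines and spaces first); the sliding_chunks helper and the sample text are hypothetical, written only for illustration.
```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows whose edges overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last window is never made up of overlap alone.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
# Hypothetical 1200-character sample text.
sample_text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample_text, chunk_size=500, chunk_overlap=50)
# Compare the last 50 characters of one chunk with the first 50 of the next.
print(chunks[0][-50:] == chunks[1][:50])
```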

142
00:07:27,000 --> 00:07:27,000
Right.

143
00:07:27,000 --> 00:07:34,000
So this is one way you can split the documents, using this

144
00:07:34,000 --> 00:07:34,000
RecursiveCharacterTextSplitter.

145
00:07:34,000 --> 00:07:35,000
This is nothing.

146
00:07:35,000 --> 00:07:39,000
But here you are trying to recursively split all the text by characters.

147
00:07:39,000 --> 00:07:39,000
Okay.

148
00:07:39,000 --> 00:07:40,000
Now this is one way.

149
00:07:40,000 --> 00:07:41,000
Let me do one thing.

150
00:07:41,000 --> 00:07:46,000
If you remember, I had actually created this speech.txt.

151
00:07:46,000 --> 00:07:48,000
Okay, so I'll say reveal in File Explorer.

152
00:07:48,000 --> 00:07:52,000
I will copy this data, I'll put it in my next folder in Data Transformer.

153
00:07:53,000 --> 00:07:55,000
And let's go ahead and read this also.

154
00:07:55,000 --> 00:07:59,000
So in order to read it uh, you know, how do we read it right here.

155
00:07:59,000 --> 00:08:06,000
You can convert this entire thing into documents, or

156
00:08:06,000 --> 00:08:08,000
however you specifically want.

157
00:08:08,000 --> 00:08:09,000
So let's go ahead and read in this way.

158
00:08:09,000 --> 00:08:13,000
So here I'm going to use this here.

159
00:08:13,000 --> 00:08:15,000
Let's go ahead and write loader.load().

160
00:08:15,000 --> 00:08:18,000
And finally this basically becomes my documents okay.

161
00:08:18,000 --> 00:08:19,000
So this will be my documents.

162
00:08:19,000 --> 00:08:23,000
And here we will go ahead and write loader.load(), okay.

163
00:08:23,000 --> 00:08:29,000
Now I will go ahead and write print or let's go ahead and display this particular docs.

164
00:08:29,000 --> 00:08:34,000
Now whenever I do this by default I'll be getting the return type as document okay.

165
00:08:34,000 --> 00:08:39,000
Let's say I will try to read this document, that is speech.txt, in another way.

166
00:08:39,000 --> 00:08:39,000
Okay.

167
00:08:39,000 --> 00:08:44,000
So here I'm going to write with open. Right, the open function I hope everybody knows in Python.

168
00:08:44,000 --> 00:08:47,000
And uh here what I'm going to do.

169
00:08:47,000 --> 00:08:50,000
I'm going to basically write my speech.txt, okay.

170
00:08:50,000 --> 00:08:56,000
And then we are just going to create a file variable f as the context over here.

171
00:08:56,000 --> 00:08:59,000
And let's say I will go ahead and create a variable which is called speech.

172
00:09:00,000 --> 00:09:01,000
Speech.

173
00:09:02,000 --> 00:09:02,000
Okay.

174
00:09:03,000 --> 00:09:06,000
And finally we will go ahead and read this entire speech.

175
00:09:06,000 --> 00:09:06,000
Okay.

176
00:09:06,000 --> 00:09:15,000
So once we read this let's uh, you can also basically, um, you know, see, like, what kind of document

177
00:09:15,000 --> 00:09:16,000
this is.

178
00:09:16,000 --> 00:09:16,000
Exactly.

179
00:09:16,000 --> 00:09:17,000
Okay.

180
00:09:17,000 --> 00:09:23,000
And, uh, let's say for here, I'll go ahead and initialize this particular speech variable to something

181
00:09:23,000 --> 00:09:24,000
like blank okay.

182
00:09:25,000 --> 00:09:27,000
Blank right now okay.

183
00:09:27,000 --> 00:09:29,000
Now let's see this particular speech.

184
00:09:29,000 --> 00:09:32,000
You'll be able to see that I'm able to read the document directly.

185
00:09:32,000 --> 00:09:33,000
So this is not giving me a document type.

186
00:09:33,000 --> 00:09:39,000
This is directly giving me the uh text that is present inside this txt file by just reading this.
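A small sketch of reading the file this way. So that it runs anywhere, it first writes a hypothetical one-line speech.txt into a temporary directory; in the video the file already exists next to the notebook.
```python
import os
import tempfile
# Write a small hypothetical speech.txt so the sketch does not depend on local files.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "speech.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("The world must be made safe for democracy.")
# Initialize the variable to blank, then read the whole file into it.
speech = ""
with open(path, encoding="utf-8") as f:
    speech = f.read()
# A plain Python string, not a LangChain Document.
print(type(speech))
```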

187
00:09:39,000 --> 00:09:44,000
Now I'm just going to use this text splitter with recursive character text splitter.

188
00:09:45,000 --> 00:09:45,000
Okay.

189
00:09:45,000 --> 00:09:50,000
Now here I'm going to give a smaller chunk size because I know that there is very little text.

190
00:09:50,000 --> 00:09:53,000
So let's go ahead and give the chunk size of 100.

191
00:09:53,000 --> 00:09:58,000
And then I will give the chunk overlap of 20.

192
00:09:58,000 --> 00:09:58,000
Okay.

193
00:09:59,000 --> 00:10:01,000
So these are the two pieces of information that I will give.

194
00:10:01,000 --> 00:10:03,000
Then I will take this text.

195
00:10:03,000 --> 00:10:07,000
You know my return type is specifically text from this particular speech variable, right.

196
00:10:07,000 --> 00:10:12,000
Let me remove this right now here what I am actually going to do, I'm going to basically go ahead and

197
00:10:12,000 --> 00:10:13,000
write text_splitter dot.

198
00:10:14,000 --> 00:10:20,000
Now if I want to probably see after this particular step, you know that I have to convert all this

199
00:10:20,000 --> 00:10:25,000
text into, uh, vectors and then afterwards store all these vectors into a vector database.

200
00:10:26,000 --> 00:10:31,000
It is a good practice because over here, whenever we work with LangChain, it makes sure that

201
00:10:31,000 --> 00:10:34,000
hey, it tries to convert everything into a document type.

202
00:10:34,000 --> 00:10:39,000
Because if we try to convert a text into a document type, we will have more features which we can specifically

203
00:10:39,000 --> 00:10:40,000
implement.

204
00:10:40,000 --> 00:10:45,000
Now in order to convert this text splitter, whatever text split we are basically doing.

205
00:10:45,000 --> 00:10:50,000
And if I use this particular function, which is called create_documents.

206
00:10:50,000 --> 00:10:51,000
Okay.

207
00:10:51,000 --> 00:10:56,000
Now to this create_documents, if I give my speech, okay, in the form of a list, okay.

208
00:10:56,000 --> 00:11:00,000
This will basically give you your entire text as documents.

209
00:11:00,000 --> 00:11:03,000
So I will just go ahead and print my text.

210
00:11:03,000 --> 00:11:03,000
Okay.

211
00:11:03,000 --> 00:11:09,000
So here you can see it has got converted into this particular Document.

212
00:11:09,000 --> 00:11:09,000
Right.

213
00:11:09,000 --> 00:11:11,000
And finally you are able to see this okay.

214
00:11:11,000 --> 00:11:14,000
So text_splitter.create_documents, okay.

215
00:11:14,000 --> 00:11:16,000
Right now the 100 chunk size was there.

216
00:11:16,000 --> 00:11:18,000
I think it is less than 100.

217
00:11:18,000 --> 00:11:20,000
So I'll just go ahead and make it 50.

218
00:11:20,000 --> 00:11:21,000
Let's go ahead and make it ten.

219
00:11:21,000 --> 00:11:22,000
Okay.

220
00:11:22,000 --> 00:11:24,000
So now if I go ahead and execute it let's see.

221
00:11:24,000 --> 00:11:25,000
Let's see this okay.

222
00:11:25,000 --> 00:11:29,000
So I will just go ahead and write print text[0], okay.

223
00:11:29,000 --> 00:11:32,000
So this is my first one that I'm actually getting.

224
00:11:32,000 --> 00:11:34,000
The world must be made safe for democracy okay.

225
00:11:34,000 --> 00:11:37,000
So let's increase this size.

226
00:11:37,000 --> 00:11:39,000
So here you can see the world must be safe for democracy.

227
00:11:39,000 --> 00:11:43,000
Its peace must be planted upon the tested foundation of.

228
00:11:43,000 --> 00:11:51,000
Okay, now, if I go ahead and print my second document, text[1], you'll be able to see.

229
00:11:52,000 --> 00:11:52,000
See this.

230
00:11:52,000 --> 00:11:56,000
The world must be safe for democracy.

231
00:11:56,000 --> 00:11:58,000
So here, let me just increase the chunk size.

232
00:11:58,000 --> 00:11:59,000
Then it will be much better.

233
00:11:59,000 --> 00:12:03,000
So here you can see the world must be safe for democracy.

234
00:12:03,000 --> 00:12:09,000
It must be planted upon the tested foundation of a foundation of.

235
00:12:09,000 --> 00:12:10,000
See, it is getting repeated.

236
00:12:10,000 --> 00:12:13,000
This chunk overlap of 20 is basically getting repeated.

237
00:12:13,000 --> 00:12:16,000
And this is my second chunk that you will be able to see, right?

238
00:12:16,000 --> 00:12:23,000
So for all these things, we actually use something called RecursiveCharacterTextSplitter.

239
00:12:23,000 --> 00:12:26,000
I've also shown you if you have a text, if you just have a text, how you can actually convert this

240
00:12:26,000 --> 00:12:27,000
into a document type.

241
00:12:27,000 --> 00:12:31,000
And if I just go ahead and see this type, it is nothing but the Document type.

242
00:12:31,000 --> 00:12:36,000
So if I go ahead and write type of this, it is nothing but a Document type.

243
00:12:36,000 --> 00:12:36,000
Okay.

244
00:12:37,000 --> 00:12:43,000
So these were some basic examples of using RecursiveCharacterTextSplitter.

245
00:12:44,000 --> 00:12:48,000
Now in my next video, I will be showing you more splitting techniques.

246
00:12:48,000 --> 00:12:51,000
You can also do it by characters.

247
00:12:51,000 --> 00:12:52,000
You can do it by HTML headers and all.

248
00:12:52,000 --> 00:12:55,000
So that is what I'm actually going to discuss in the next video.

249
00:12:55,000 --> 00:12:56,000
Thank you.