1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:04,000
So we are going to continue the discussion with respect to LangChain.

3
00:00:04,000 --> 00:00:09,000
In our previous video, I had actually shown you an example of what we will specifically be

4
00:00:09,000 --> 00:00:15,000
doing step by step, you know, from data ingestion to data transformation, then converting the text

5
00:00:15,000 --> 00:00:21,000
into vectors, finally storing it in a vector store db, and then probably creating a retrieval chain

6
00:00:21,000 --> 00:00:25,000
along with a prompt template, we send that entire data to the LLM.

7
00:00:25,000 --> 00:00:27,000
And finally we get the response.

8
00:00:27,000 --> 00:00:33,000
Now over here what we will be doing is that we will be exploring each and every step and how we can

9
00:00:33,000 --> 00:00:35,000
actually implement it with the help of LangChain.

10
00:00:35,000 --> 00:00:39,000
The first step that we are going to discuss is data ingestion.

11
00:00:39,000 --> 00:00:40,000
Right?

12
00:00:40,000 --> 00:00:46,000
And when I say data ingestion, it's more about loading a data set from a specific source.

13
00:00:46,000 --> 00:00:47,000
Right.

14
00:00:47,000 --> 00:00:51,000
We also call these document loaders in LangChain.

15
00:00:51,000 --> 00:00:53,000
So we'll be discussing this.

16
00:00:53,000 --> 00:00:55,000
We will be doing some practical implementation.

17
00:00:55,000 --> 00:01:00,000
I'll be taking different types of files, and I'll show you how we can read that particular data

18
00:01:00,000 --> 00:01:02,000
from that and convert it into a document.

19
00:01:02,000 --> 00:01:03,000
Okay.

20
00:01:03,000 --> 00:01:05,000
So all those things will be probably discussed.

21
00:01:05,000 --> 00:01:07,000
So first step we'll go with this.

22
00:01:07,000 --> 00:01:12,000
Then, as we go ahead, we'll be seeing other modules also.

23
00:01:12,000 --> 00:01:19,000
So let me quickly show you: over here, I have actually created a data ingestion .ipynb file,

24
00:01:19,000 --> 00:01:21,000
and I have selected my virtual environment.

25
00:01:22,000 --> 00:01:25,000
Um, one very important thing that you should remember.

26
00:01:25,000 --> 00:01:31,000
Whenever I need to work with this .ipynb file, I need to have ipykernel.

27
00:01:31,000 --> 00:01:35,000
Okay, so that is the reason I have already done the installation in the requirements.txt.

28
00:01:36,000 --> 00:01:37,000
So this is perfect till here.

29
00:01:37,000 --> 00:01:39,000
Now let me do one thing.

30
00:01:39,000 --> 00:01:42,000
Let me also go ahead and show you the documentation.

31
00:01:42,000 --> 00:01:42,000
Okay.

32
00:01:43,000 --> 00:01:46,000
Now it's really important that you understand.

33
00:01:46,000 --> 00:01:50,000
Or you have the habit of reading the LangChain documentation.

34
00:01:50,000 --> 00:01:55,000
Because this documentation is just like you have each and every thing available over here.

35
00:01:55,000 --> 00:02:01,000
And the only problem I see while following the documentation is that it is completely scattered.

36
00:02:01,000 --> 00:02:06,000
So that is the reason when I am actually creating this course, I will create in such a way that it

37
00:02:06,000 --> 00:02:12,000
will have the videos in an ordered way, so that you understand step by step how things are basically

38
00:02:12,000 --> 00:02:17,000
getting implemented, specifically for generative AI applications.

39
00:02:17,000 --> 00:02:21,000
So here, as I said, we are going to discuss data ingestion.

40
00:02:21,000 --> 00:02:24,000
Within data ingestion, we are going to discuss document loaders.

41
00:02:24,000 --> 00:02:26,000
Now what exactly is a document loader?

42
00:02:26,000 --> 00:02:31,000
It's just the different ways to load different data sources.

43
00:02:31,000 --> 00:02:37,000
And here if I just look on the left-hand side, there are multiple different ways: you can load Airtable.

44
00:02:37,000 --> 00:02:40,000
You can load an Apify dataset, you can load arXiv.

45
00:02:40,000 --> 00:02:48,000
You can probably go ahead and see, let's say, Browserbase is there, Cassandra is there.

46
00:02:48,000 --> 00:02:50,000
You'll be able to load from this particular data set.

47
00:02:50,000 --> 00:02:51,000
You want to load it from CSV.

48
00:02:51,000 --> 00:02:56,000
You have this particular option if you want to go ahead and load it from web based.

49
00:02:56,000 --> 00:02:58,000
So many different options are there.

50
00:02:58,000 --> 00:03:01,000
If you want a Hugging Face dataset, you can go ahead and load it.

51
00:03:01,000 --> 00:03:04,000
If you want to load images you can go ahead and load it, you know.

52
00:03:04,000 --> 00:03:08,000
So that is why I said that LangChain is an amazing framework.

53
00:03:08,000 --> 00:03:15,000
And whatever source you really want to load the data from, you can just

54
00:03:15,000 --> 00:03:16,000
go ahead and refer to this.

55
00:03:16,000 --> 00:03:17,000
Okay.

56
00:03:17,000 --> 00:03:22,000
In this video I will be showing you some of the good examples, most common examples that we use with

57
00:03:22,000 --> 00:03:25,000
respect to data ingestion using document loaders.

58
00:03:25,000 --> 00:03:30,000
Anyhow, tomorrow, if you have any other requirements, you can definitely check this documentation.

59
00:03:30,000 --> 00:03:34,000
So here I have already given the link for the documentation.

60
00:03:34,000 --> 00:03:36,000
You can start the coding from here okay.

61
00:03:36,000 --> 00:03:43,000
Now let's go ahead and start this, and I'll be showing you multiple ways of doing the data ingestion

62
00:03:43,000 --> 00:03:44,000
technique itself.

63
00:03:44,000 --> 00:03:47,000
Basically, we'll be seeing multiple ways how you can load the data set.

64
00:03:47,000 --> 00:03:48,000
Okay.

65
00:03:48,000 --> 00:03:53,000
So the first technique that I'm actually going to use is something called TextLoader.

66
00:03:53,000 --> 00:03:53,000
Okay.

67
00:03:53,000 --> 00:03:58,000
Now TextLoader actually helps you to load any type of .txt file.

68
00:03:58,000 --> 00:04:01,000
So I have created this folder called data ingestion.

69
00:04:01,000 --> 00:04:05,000
And inside this I have created this particular speech.txt file.

70
00:04:05,000 --> 00:04:05,000
Okay.

71
00:04:05,000 --> 00:04:12,000
And in this I have just put some information of a speech over here from Wikipedia, and I've just copied

72
00:04:12,000 --> 00:04:13,000
and pasted it okay.

73
00:04:13,000 --> 00:04:15,000
Now what I'm actually going to do inside this.

74
00:04:15,000 --> 00:04:17,000
And again I also have a PDF file.

75
00:04:17,000 --> 00:04:18,000
I also have an XML file.

76
00:04:18,000 --> 00:04:20,000
We'll also try to read that as we go ahead.

77
00:04:20,000 --> 00:04:27,000
So this TextLoader, what it does is that it actually helps you to load any file like

78
00:04:27,000 --> 00:04:28,000
speech.txt, any

79
00:04:28,000 --> 00:04:29,000
.txt file in short.

80
00:04:29,000 --> 00:04:30,000
Right.

81
00:04:30,000 --> 00:04:33,000
So here I've created this speech.txt file.

82
00:04:33,000 --> 00:04:35,000
So it will actually allow you to load this.

83
00:04:35,000 --> 00:04:36,000
Now how do you do that.

84
00:04:36,000 --> 00:04:43,000
So for that, what I will do is go ahead and write from langchain_community, okay.

85
00:04:43,000 --> 00:04:49,000
Now this langchain_community, we need to make sure that we go ahead and install it via

86
00:04:49,000 --> 00:04:50,000
requirements.txt, okay.

87
00:04:50,000 --> 00:04:53,000
So I will quickly go ahead and press shift tab.

88
00:04:53,000 --> 00:04:59,000
And let me just quickly run this requirements.txt file.

89
00:04:59,000 --> 00:05:00,000
Okay.

90
00:05:00,000 --> 00:05:02,000
So I have to be in my virtual environment.

91
00:05:02,000 --> 00:05:03,000
Please remember that.

92
00:05:03,000 --> 00:05:08,000
And then I will go ahead and write pip install -r requirements.txt.

93
00:05:08,000 --> 00:05:11,000
So once I do this I've already done the installation.

94
00:05:11,000 --> 00:05:14,000
So it shows me "Requirement already satisfied", okay.

95
00:05:14,000 --> 00:05:18,000
So here, langchain_community is what we are specifically going to use.

96
00:05:18,000 --> 00:05:21,000
Now I will go ahead and import from langchain_community.

97
00:05:21,000 --> 00:05:24,000
We have something called document_loaders.

98
00:05:24,000 --> 00:05:24,000
Okay.

99
00:05:24,000 --> 00:05:29,000
Now inside this you have something called TextLoader.

100
00:05:29,000 --> 00:05:34,000
Now on the website we saw different types of loaders.

101
00:05:34,000 --> 00:05:34,000
Right.

102
00:05:34,000 --> 00:05:35,000
Document loaders.

103
00:05:35,000 --> 00:05:40,000
Everything is available inside this langchain_community.document_loaders, okay.

104
00:05:40,000 --> 00:05:46,000
Please remember that. Now, after I import this TextLoader,

105
00:05:46,000 --> 00:05:46,000
Right.

106
00:05:46,000 --> 00:05:49,000
What I really need to do is initialize the TextLoader.

107
00:05:49,000 --> 00:05:52,000
And here I will give my file path.

108
00:05:53,000 --> 00:05:55,000
File path is nothing but inside this.

109
00:05:55,000 --> 00:05:58,000
Since I'm working inside this folder, I have something called speech.txt.

110
00:05:59,000 --> 00:06:06,000
So here what we can basically do is go ahead and write speech.txt, okay, speech.txt.

111
00:06:06,000 --> 00:06:13,000
Now here, what I'm going to do is assign this to a variable called loader, okay.

112
00:06:13,000 --> 00:06:18,000
Now when we are reading it okay let's let's go ahead and execute this.

113
00:06:18,000 --> 00:06:23,000
And if I go ahead and execute, it shows that loader is nothing but a document loader, a

114
00:06:23,000 --> 00:06:25,000
TextLoader object at this specific memory location.

115
00:06:25,000 --> 00:06:26,000
This is fine okay.

116
00:06:26,000 --> 00:06:31,000
Now when we are loading this, the entire content inside speech.txt will be converted

117
00:06:31,000 --> 00:06:38,000
into documents only when we go ahead and call the loader.load() function.

118
00:06:38,000 --> 00:06:44,000
So the loader.load() function will basically help you to create these text documents.

119
00:06:44,000 --> 00:06:45,000
Okay we'll see to it okay.

120
00:06:45,000 --> 00:06:49,000
And then finally I will go ahead and print the text documents.

121
00:06:49,000 --> 00:06:52,000
So here you can see this actually is my document.

122
00:06:52,000 --> 00:06:54,000
And it has this page_content attribute.

123
00:06:54,000 --> 00:06:59,000
And all the text inside that .txt file is inside this page_content, okay.

124
00:06:59,000 --> 00:07:03,000
And here you can see that the metadata source is nothing but speech.txt.

125
00:07:03,000 --> 00:07:11,000
So in short, what it has done is that it has loaded this entire speech.txt into one document that is

126
00:07:11,000 --> 00:07:12,000
available over here.

127
00:07:12,000 --> 00:07:17,000
Okay, so this is one very easy way of loading a .txt file.
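
The TextLoader steps described above can be sketched roughly as follows. This is only an illustration, assuming langchain-community is installed and a speech.txt file sits next to the notebook. The helper name load_text_file is made up here; only TextLoader and load() come from the walkthrough.

```python
from pathlib import Path


def load_text_file(path: str):
    """Load one .txt file and return the list of Documents TextLoader produces.

    Hypothetical helper for illustration; only TextLoader and .load()
    are taken from the walkthrough.
    """
    # Fail early with a clear error if the file is missing.
    if not Path(path).is_file():
        raise FileNotFoundError(path)

    # Imported inside the function so the import only runs when a file is
    # actually loaded (assumes `pip install langchain-community`).
    from langchain_community.document_loaders import TextLoader

    loader = TextLoader(path)
    return loader.load()
```

In a notebook cell, docs = load_text_file("speech.txt") should give a list with a single Document, whose page_content holds the file's text and whose metadata records the source path.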

128
00:07:17,000 --> 00:07:18,000
Okay.

129
00:07:18,000 --> 00:07:21,000
Now let's go ahead and do one more thing.

130
00:07:21,000 --> 00:07:27,000
I will just try to show you, with respect to PyPDF, how you read a PDF file.

131
00:07:27,000 --> 00:07:30,000
So reading a PDF file.

132
00:07:31,000 --> 00:07:37,000
Okay, so here I'll again go ahead and write from

133
00:07:37,000 --> 00:07:39,000
langchain_community.document_loaders, okay.

134
00:07:39,000 --> 00:07:39,000
Again.

135
00:07:40,000 --> 00:07:45,000
And this time I'm going to import PyPDFLoader.

136
00:07:45,000 --> 00:07:51,000
So PyPDFLoader basically loads the PDF, any PDF that you give it.

137
00:07:51,000 --> 00:07:54,000
So I'll go ahead and initialize PyPDFLoader.

138
00:07:54,000 --> 00:07:58,000
And here I'm just going to go ahead and write attention.pdf.

139
00:07:58,000 --> 00:08:05,000
So this is the PDF name that I have, and for this PDF I have actually taken the Transformer research paper,

140
00:08:05,000 --> 00:08:06,000
"Attention Is All You Need".

141
00:08:06,000 --> 00:08:08,000
I hope you may have heard about it.

142
00:08:08,000 --> 00:08:10,000
And again, it was available on arXiv.

143
00:08:10,000 --> 00:08:12,000
The arXiv website.

144
00:08:12,000 --> 00:08:12,000
Okay.

145
00:08:13,000 --> 00:08:17,000
Now here what I'm actually going to do is that I'll go ahead and create a variable called loader.

146
00:08:17,000 --> 00:08:22,000
And again, to get the documents, I will just go ahead and write loader.load().

147
00:08:22,000 --> 00:08:26,000
So once I do that so finally you'll be able to see this will be my docs.

148
00:08:26,000 --> 00:08:28,000
So it will go and read from this particular PDF.

149
00:08:28,000 --> 00:08:31,000
Now it says hey I am getting an error.

150
00:08:31,000 --> 00:08:34,000
It's saying, hey, please go ahead and install pypdf.

151
00:08:34,000 --> 00:08:39,000
Now let me quickly go to requirements.txt again, for the error that we are seeing.

152
00:08:39,000 --> 00:08:40,000
We really need to fix the error.

153
00:08:40,000 --> 00:08:43,000
So here I'll go ahead and add this pypdf.

154
00:08:43,000 --> 00:08:44,000
So quickly.

155
00:08:44,000 --> 00:08:48,000
Let's go ahead and write pip install -r requirements.txt.

156
00:08:48,000 --> 00:08:53,000
Now the installation will start, and here you can see that pypdf is installed.

157
00:08:53,000 --> 00:08:55,000
So I will close this quickly.

158
00:08:55,000 --> 00:08:58,000
And now let's go ahead and execute it once again.

159
00:08:58,000 --> 00:09:01,000
So now I think it should be loading it.

160
00:09:01,000 --> 00:09:06,000
And here you can see that, for the whole PDF, it has been able to read each and every thing.

161
00:09:06,000 --> 00:09:06,000
Right.

162
00:09:06,000 --> 00:09:10,000
So here was my first page I guess then this was my second page.

163
00:09:10,000 --> 00:09:11,000
Third page, fourth page.

164
00:09:11,000 --> 00:09:13,000
Like this I'm able to see all these things.

165
00:09:13,000 --> 00:09:16,000
And the return value, docs, is basically documents.

166
00:09:16,000 --> 00:09:22,000
So if I go ahead and check type(docs) here, you will be able to see that it is a list.

167
00:09:22,000 --> 00:09:24,000
A list of documents.

168
00:09:24,000 --> 00:09:30,000
So let's say I go ahead and write type(docs[0]); here you can see that it is a Document type,

169
00:09:30,000 --> 00:09:30,000
okay.

170
00:09:30,000 --> 00:09:34,000
And inside the Document type you have this attribute called page_content.

171
00:09:34,000 --> 00:09:37,000
And at the end you'll be able to see some kind of metadata okay.

172
00:09:37,000 --> 00:09:38,000
It's a very big PDF.

173
00:09:38,000 --> 00:09:41,000
So let's not explore all of that.

174
00:09:41,000 --> 00:09:45,000
So that at the end if you probably scroll till the end okay.

175
00:09:45,000 --> 00:09:49,000
If I just go ahead and scroll it till the end.

176
00:09:49,000 --> 00:09:49,000
Right.

177
00:09:49,000 --> 00:09:52,000
You'll be able to see the metadata: source is attention.pdf.

178
00:09:52,000 --> 00:09:53,000
And this is page one.

179
00:09:53,000 --> 00:09:59,000
So every page is basically converted into a document over here okay.

180
00:09:59,000 --> 00:10:00,000
Perfect.

181
00:10:00,000 --> 00:10:04,000
So this is another way of reading a PDF file okay.
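
The PDF steps above might look roughly like this. It is a sketch, assuming langchain-community and pypdf are installed and attention.pdf is in the working folder. The names load_pdf_pages and page_index are invented for illustration; PyPDFLoader and load() are the calls used in the video.

```python
def load_pdf_pages(path: str):
    """Return the Documents PyPDFLoader builds, one per PDF page.

    Hypothetical helper; assumes `pip install langchain-community pypdf`.
    """
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path)
    return loader.load()


def page_index(docs):
    """Collect (source, page) pairs from each Document's metadata."""
    return [(doc.metadata.get("source"), doc.metadata.get("page")) for doc in docs]
```

After docs = load_pdf_pages("attention.pdf"), calling type(docs) and type(docs[0]) should show the list and the Document type mentioned above, and page_index(docs) gives a compact view of the per-page metadata without printing the whole paper.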

182
00:10:04,000 --> 00:10:08,000
So I hope you got an idea with respect to this.

183
00:10:08,000 --> 00:10:09,000
Uh till here.

184
00:10:09,000 --> 00:10:12,000
Let's go ahead and do one more.

185
00:10:13,000 --> 00:10:14,000
Amazing.

186
00:10:14,000 --> 00:10:18,000
Uh, we will just try to use another document loader technique, and this time we are just going to

187
00:10:18,000 --> 00:10:20,000
use something called WebBaseLoader.

188
00:10:20,000 --> 00:10:25,000
So here we are going to discuss WebBaseLoader, okay.

189
00:10:25,000 --> 00:10:30,000
Again, we are just discussing how we can load the data set from different data sources.

190
00:10:30,000 --> 00:10:31,000
Okay.

191
00:10:31,000 --> 00:10:37,000
Now I will go ahead and write from langchain_community.document_loaders.

192
00:10:37,000 --> 00:10:41,000
And here we are going to import WebBaseLoader.

193
00:10:42,000 --> 00:10:48,000
Now WebBaseLoader is like this: let's say that you want to give it any website, okay.

194
00:10:48,000 --> 00:10:53,000
It should be able to read the content from that particular website okay.

195
00:10:53,000 --> 00:10:55,000
That is what we specifically do in this.

196
00:10:55,000 --> 00:11:02,000
So let's take one example and this example URL I will show you by opening the browser itself.

197
00:11:02,000 --> 00:11:02,000
Okay.

198
00:11:02,000 --> 00:11:06,000
So let me quickly show this over here.

199
00:11:06,000 --> 00:11:08,000
So let's say I'll go ahead and copy and paste it.

200
00:11:08,000 --> 00:11:10,000
And let's consider this particular website.

201
00:11:10,000 --> 00:11:17,000
This is a website which is saying, hey, "LLM Powered Autonomous Agents", I have so and so information.

202
00:11:17,000 --> 00:11:18,000
Everything is over here.

203
00:11:18,000 --> 00:11:19,000
Okay.

204
00:11:19,000 --> 00:11:23,000
Now, this is my entire website, and I want to load this entire content.

205
00:11:23,000 --> 00:11:24,000
Okay.

206
00:11:24,000 --> 00:11:30,000
So let me quickly go over here and let's go ahead and initialize this WebBaseLoader.

207
00:11:30,000 --> 00:11:35,000
Now inside this WebBaseLoader, the first parameter is something called web_path.

208
00:11:35,000 --> 00:11:36,000
Okay.

209
00:11:36,000 --> 00:11:39,000
Now inside this we will be giving one path.

210
00:11:39,000 --> 00:11:42,000
Let's say my URL is something like this.

211
00:11:42,000 --> 00:11:44,000
And I'll paste it over here.

212
00:11:44,000 --> 00:11:46,000
Okay so this is my URL okay.

213
00:11:46,000 --> 00:11:52,000
And uh over here you'll be also able to see that when I'm using this particular URL.

214
00:11:52,000 --> 00:11:53,000
Okay.

215
00:11:53,000 --> 00:11:57,000
And let's let's try to execute this directly okay.

216
00:11:57,000 --> 00:12:01,000
Here what I'm doing I'll just give this web path and let me go ahead and create a variable.

217
00:12:01,000 --> 00:12:03,000
So let's see whether it will execute fast or not.

218
00:12:03,000 --> 00:12:06,000
So here it says, hey, USER_AGENT environment variable not set,

219
00:12:06,000 --> 00:12:10,000
consider setting it to identify your requests. Okay, no worries.

220
00:12:10,000 --> 00:12:12,000
So what I will do over here?

221
00:12:13,000 --> 00:12:14,000
I will go ahead and.

222
00:12:14,000 --> 00:12:15,000
Okay, just a second.

223
00:12:15,000 --> 00:12:19,000
I will just go ahead and put a comma and provide more parameters.

224
00:12:19,000 --> 00:12:20,000
Okay.

225
00:12:20,000 --> 00:12:25,000
Now this time the parameters that I am actually going to provide is with respect to the element.

226
00:12:25,000 --> 00:12:27,000
So let's execute this first of all okay.

227
00:12:27,000 --> 00:12:30,000
So here you can see that I have executed it over here okay.

228
00:12:30,000 --> 00:12:35,000
And the reason why I have given a comma is because you can put any number of URLs over here, since I have

229
00:12:35,000 --> 00:12:39,000
used web_paths; if you just use web_path, you can give just one URL.

230
00:12:39,000 --> 00:12:40,000
Okay.

231
00:12:40,000 --> 00:12:44,000
So here I have just given web_paths, and I've given this particular URL.

232
00:12:44,000 --> 00:12:44,000
Okay.

233
00:12:44,000 --> 00:12:49,000
Now let's go ahead and take this loader okay loader okay.

234
00:12:49,000 --> 00:12:51,000
And I'll do dot load.

235
00:12:51,000 --> 00:12:57,000
Once I do loader.load(), you'll be able to see that I have actually got the page content of the entire

236
00:12:57,000 --> 00:12:58,000
page right.

237
00:12:58,000 --> 00:13:00,000
So this is my entire page content.

238
00:13:00,000 --> 00:13:04,000
So if I just go ahead and see my browser okay.

239
00:13:04,000 --> 00:13:08,000
So here I've got this entire page content inside this okay.

240
00:13:08,000 --> 00:13:10,000
Along with the newlines, everything as such.

241
00:13:10,000 --> 00:13:10,000
Okay.

242
00:13:10,000 --> 00:13:12,000
And this is so amazing right.

243
00:13:12,000 --> 00:13:15,000
Just by writing these two lines, you are able to get it.
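
Those two lines, plus the USER_AGENT message seen earlier, can be sketched as below. This assumes langchain-community and bs4 are installed. The helper names and the default user agent string are placeholders invented for illustration; WebBaseLoader, web_paths and load() are the names used in the video.

```python
import os


def as_web_paths(urls):
    """Turn one URL string or an iterable of URLs into a tuple for web_paths."""
    if isinstance(urls, str):
        return (urls,)
    return tuple(urls)


def load_web_pages(urls, user_agent: str = "data-ingestion-notebook"):
    """Fetch each URL and return the resulting Documents.

    Hypothetical helper; the user_agent default is a made-up placeholder.
    """
    # Setting USER_AGENT before the import is intended to avoid the
    # "USER_AGENT environment variable not set" warning.
    os.environ.setdefault("USER_AGENT", user_agent)

    # Assumes `pip install langchain-community bs4`.
    from langchain_community.document_loaders import WebBaseLoader

    loader = WebBaseLoader(web_paths=as_web_paths(urls))
    return loader.load()
```

Passing a single URL string or a list of URLs both work here, because as_web_paths always hands WebBaseLoader a tuple, which mirrors the trailing comma used in the notebook.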

244
00:13:15,000 --> 00:13:22,000
One important thing that you need to remember is that here you also require Beautiful Soup.

245
00:13:22,000 --> 00:13:23,000
Okay.

246
00:13:23,000 --> 00:13:30,000
Beautiful Soup is a library that you definitely require whenever you are doing a

247
00:13:30,000 --> 00:13:32,000
scraping task in any project.

248
00:13:32,000 --> 00:13:32,000
Okay.

249
00:13:32,000 --> 00:13:38,000
So, in order to install Beautiful Soup, you can just go ahead and write

250
00:13:38,000 --> 00:13:39,000
this over here.

251
00:13:39,000 --> 00:13:40,000
bs4.

252
00:13:40,000 --> 00:13:41,000
Okay.

253
00:13:41,000 --> 00:13:43,000
And I will just go ahead and press this.

254
00:13:43,000 --> 00:13:46,000
Now let's quickly go ahead and install from requirements.txt.

255
00:13:46,000 --> 00:13:48,000
My bs4 will basically get installed.

256
00:13:48,000 --> 00:13:50,000
That is nothing but Beautiful Soup.

257
00:13:50,000 --> 00:13:55,000
Now what I will do is go ahead and close that particular terminal.

258
00:13:55,000 --> 00:14:02,000
Now, what I am actually going to do over here is that let's make this and provide more parameters,

259
00:14:02,000 --> 00:14:03,000
based on our needs.

260
00:14:03,000 --> 00:14:04,000
Okay.

261
00:14:04,000 --> 00:14:12,000
Now, in order to provide more parameters over here, I can play with this web page based on what content

262
00:14:12,000 --> 00:14:12,000
I want.

263
00:14:12,000 --> 00:14:14,000
Right now, everything has been just.

264
00:14:15,000 --> 00:14:18,000
Everything has been extracted like post archive, search tags, everything.

265
00:14:19,000 --> 00:14:21,000
Let's say I will go ahead and inspect this okay.

266
00:14:22,000 --> 00:14:27,000
Now when you go ahead and inspect this, let's say that I want some of the classes.

267
00:14:27,000 --> 00:14:33,000
Let's consider I want post-title, because I am interested in this, okay; I want post-content.

268
00:14:33,000 --> 00:14:37,000
post-content basically means I want this entire content from that.

269
00:14:37,000 --> 00:14:40,000
And finally, after this, I also want post-header.

270
00:14:40,000 --> 00:14:41,000
Okay.

271
00:14:41,000 --> 00:14:46,000
So let's say, with respect to these three classes, I want the information from this particular page.

272
00:14:46,000 --> 00:14:46,000
Okay.

273
00:14:46,000 --> 00:14:48,000
So how do I give that.

274
00:14:48,000 --> 00:14:50,000
That is what I'm actually going to show you now okay.

275
00:14:50,000 --> 00:14:52,000
Just by using the same functionality.

276
00:14:52,000 --> 00:14:56,000
So here, what I'll be doing is using this web_paths; along with the path,

277
00:14:56,000 --> 00:14:58,000
I will give more parameters.

278
00:14:58,000 --> 00:15:04,000
The next parameter that we are going to use is something called bs_kwargs.

279
00:15:04,000 --> 00:15:05,000
Okay.

280
00:15:05,000 --> 00:15:07,000
Here I will be giving it in the form of a dictionary.

281
00:15:07,000 --> 00:15:16,000
The first thing I will try to give is parse_only, which will be nothing but... I'll

282
00:15:16,000 --> 00:15:18,000
just go ahead and import bs4.

283
00:15:18,000 --> 00:15:21,000
Let's go ahead and import bs4, okay.

284
00:15:21,000 --> 00:15:29,000
Beautiful Soup 4. I will say, hey, parse_only is equal to bs4.SoupStrainer.

285
00:15:30,000 --> 00:15:30,000
Okay.

286
00:15:30,000 --> 00:15:34,000
And just by giving this parameter I will be able to give my class.

287
00:15:34,000 --> 00:15:38,000
So here I will write class_ equal to.

288
00:15:39,000 --> 00:15:48,000
Let me just go ahead and give which classes I want: I want post-title, okay.

289
00:15:49,000 --> 00:15:59,000
post-title, then I also want post-content, and I also want post-header.

290
00:16:00,000 --> 00:16:02,000
Okay, I want these three pieces of information.

291
00:16:02,000 --> 00:16:07,000
And based on these three pieces of information, I need to retrieve the content from this particular

292
00:16:07,000 --> 00:16:08,000
URL.

293
00:16:08,000 --> 00:16:09,000
So I will just go ahead and execute it.

294
00:16:09,000 --> 00:16:12,000
And this will now be my loader.load().

295
00:16:12,000 --> 00:16:17,000
And here you can see the \n characters have all been removed, because I just want the relevant text.

296
00:16:17,000 --> 00:16:24,000
And this is how I'm also able to clean some of the information while I am loading all the information

297
00:16:24,000 --> 00:16:24,000
over here.

298
00:16:24,000 --> 00:16:25,000
Right.

299
00:16:25,000 --> 00:16:29,000
And I hope you are able to understand with respect to this particular parameter.

300
00:16:29,000 --> 00:16:34,000
And here you can see that the power of Beautiful Soup is amazing, because you are able to retrieve

301
00:16:34,000 --> 00:16:36,000
all the details itself.

302
00:16:36,000 --> 00:16:36,000
Right.

303
00:16:36,000 --> 00:16:40,000
So this was with respect to WebBaseLoader.
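
The class-filtered version described above can be sketched like this, assuming langchain-community and bs4 are installed. The constant POST_CLASSES and the helper load_post_sections are names made up for illustration; bs_kwargs, parse_only and bs4.SoupStrainer are the ones named in the video, and the class names only make sense for a page whose HTML actually uses them.

```python
# CSS class names picked out in the browser inspector during the walkthrough.
POST_CLASSES = ("post-title", "post-content", "post-header")


def load_post_sections(url: str, classes=POST_CLASSES):
    """Load only the parts of a page whose elements carry the given classes.

    Hypothetical helper; assumes `pip install langchain-community bs4`.
    """
    import bs4
    from langchain_community.document_loaders import WebBaseLoader

    loader = WebBaseLoader(
        web_paths=(url,),
        # parse_only tells Beautiful Soup to keep just the matching elements,
        # so navigation, archive and tag links never reach the Document.
        bs_kwargs={"parse_only": bs4.SoupStrainer(class_=tuple(classes))},
    )
    return loader.load()
```

Filtering at parse time is a design choice: the unwanted markup is dropped before any text is extracted, rather than being cleaned out of page_content afterwards.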

304
00:16:40,000 --> 00:16:40,000
Okay.

305
00:16:40,000 --> 00:16:47,000
Now there are also other data ingestion techniques, which I will be showing you over here.

306
00:16:47,000 --> 00:16:51,000
Let's go ahead and I'll show you two more important things okay.

307
00:16:51,000 --> 00:16:56,000
One is with respect to the most famous one, because see, this is arXiv.

308
00:16:56,000 --> 00:16:57,000
Okay.

309
00:16:57,000 --> 00:17:03,000
Now, arXiv is something, you know, if you don't know about arXiv, if I go ahead and search

310
00:17:03,000 --> 00:17:05,000
attention is all you need.

311
00:17:05,000 --> 00:17:06,000
Research paper.

312
00:17:06,000 --> 00:17:07,000
Okay.

313
00:17:07,000 --> 00:17:12,000
The default website that it will go to will be this particular website, which is called arXiv.

314
00:17:12,000 --> 00:17:12,000
Right.

315
00:17:12,000 --> 00:17:15,000
And here we can go ahead and see the PDF.

316
00:17:15,000 --> 00:17:19,000
What if I have arXiv as the entire data source?

317
00:17:19,000 --> 00:17:19,000
Right.

318
00:17:19,000 --> 00:17:21,000
I will be able to directly communicate with this.

319
00:17:21,000 --> 00:17:23,000
So how do I do that okay.

320
00:17:23,000 --> 00:17:29,000
So first of all, what I really need to do is go ahead and install this, that is, pip install

321
00:17:29,000 --> 00:17:29,000
arxiv.

322
00:17:29,000 --> 00:17:30,000
Okay.

323
00:17:30,000 --> 00:17:32,000
So for that I will go back to my code.

324
00:17:32,000 --> 00:17:37,000
Go ahead and update my requirements.txt with arxiv.

325
00:17:37,000 --> 00:17:43,000
Okay, I will go ahead and open my terminal and quickly write pip install -r requirements.txt.

326
00:17:44,000 --> 00:17:48,000
Now here you can see it will quickly get installed, and my arxiv package also got installed.

327
00:17:49,000 --> 00:17:53,000
Now, in order to check whether arxiv is working fine or not, I will go back to my data ingestion

328
00:17:53,000 --> 00:17:56,000
and this time I'll just go ahead and write arXiv.

329
00:17:56,000 --> 00:17:57,000
Okay.

330
00:17:57,000 --> 00:18:00,000
And let's go ahead and copy this.

331
00:18:00,000 --> 00:18:01,000
Let's say I want to copy this.

332
00:18:01,000 --> 00:18:06,000
First of all, I will go ahead and import from my library, document_loaders, okay.

333
00:18:06,000 --> 00:18:11,000
And then we will go ahead and load this particular paper.

334
00:18:11,000 --> 00:18:14,000
The paper is nothing but the number is 1605.

335
00:18:14,000 --> 00:18:17,000
And here you will be able to see that I will just go ahead and give this.

336
00:18:17,000 --> 00:18:18,000
So here what I am doing.

337
00:18:18,000 --> 00:18:20,000
I am initializing ArxivLoader.

338
00:18:21,000 --> 00:18:23,000
The query parameter will be the paper number.

339
00:18:23,000 --> 00:18:26,000
Let's see if I go ahead and search this okay.

340
00:18:27,000 --> 00:18:29,000
If I go ahead and search this over here okay.

341
00:18:29,000 --> 00:18:30,000
And press enter.

342
00:18:30,000 --> 00:18:34,000
So here you can see hey this is the research paper that we are talking about right.

343
00:18:35,000 --> 00:18:38,000
Similarly if you go ahead and see attention is all you need right.

344
00:18:38,000 --> 00:18:41,000
So attention is all you need will also have some kind of numbers.

345
00:18:41,000 --> 00:18:47,000
So if I go ahead and see hey attention is all you need okay.

346
00:18:47,000 --> 00:18:49,000
And here is the number for this right.

347
00:18:49,000 --> 00:18:50,000
Attention is all you need.

348
00:18:50,000 --> 00:18:52,000
What is the number that we saw.

349
00:18:52,000 --> 00:18:55,000
It was nothing but this one right.

350
00:18:55,000 --> 00:18:57,000
So let's go ahead and search this okay.

351
00:18:57,000 --> 00:19:02,000
So here I'm going to just go ahead and copy and paste that particular number over here.

352
00:19:02,000 --> 00:19:07,000
Now I'm saying hey go ahead and take this particular query and load maximum two documents.

353
00:19:07,000 --> 00:19:08,000
And I'm calling .load().

354
00:19:08,000 --> 00:19:10,000
And here is my length of documents.

355
00:19:10,000 --> 00:19:12,000
So here I'm getting an error.

356
00:19:12,000 --> 00:19:14,000
Let's see what is the error.

357
00:19:14,000 --> 00:19:17,000
It says the PyMuPDF package was not found.

358
00:19:17,000 --> 00:19:25,000
Please install it by using this command: pip install pymupdf.

359
00:19:25,000 --> 00:19:26,000
Right.

360
00:19:26,000 --> 00:19:28,000
So here I will just go ahead and copy this.

361
00:19:28,000 --> 00:19:30,000
I will go ahead and update it over here.

362
00:19:30,000 --> 00:19:32,000
And this is how you basically do it.

363
00:19:32,000 --> 00:19:34,000
You can also do pip install right directly.

364
00:19:34,000 --> 00:19:37,000
I'll quickly go ahead and do this installation.

365
00:19:37,000 --> 00:19:38,000
Now this installation will take place.

366
00:19:40,000 --> 00:19:44,000
And once this installation is complete, we will be able to go ahead and use it.

367
00:19:44,000 --> 00:19:45,000
So I will close this now.

368
00:19:45,000 --> 00:19:48,000
And now again I'll go ahead and execute it.

369
00:19:48,000 --> 00:19:49,000
That's it.

370
00:19:50,000 --> 00:19:51,000
I'm searching for this particular query.

371
00:19:51,000 --> 00:19:56,000
It should be able to give me the documents from that entire research paper.

372
00:19:56,000 --> 00:19:58,000
So see: "Provided proper attribution is provided,

373
00:19:58,000 --> 00:19:59,000
Google hereby grants" and so on.

374
00:19:59,000 --> 00:20:01,000
All the information is probably here, right?

375
00:20:01,000 --> 00:20:05,000
The entire research paper is just given in the document.

376
00:20:05,000 --> 00:20:06,000
And this is so amazing.

377
00:20:06,000 --> 00:20:12,000
That is the reason I like LangChain: it has document loaders for almost everything.

378
00:20:12,000 --> 00:20:18,000
If you want any kind of data source, Excel, CSV, TSV, whatever you want,

379
00:20:18,000 --> 00:20:20,000
there is a document loader for that.

380
00:20:20,000 --> 00:20:22,000
Okay, now there is also one more.

381
00:20:22,000 --> 00:20:25,000
If you want to go ahead and explore this, there is a loader for Wikipedia.

382
00:20:26,000 --> 00:20:26,000
Okay.

383
00:20:26,000 --> 00:20:29,000
You can also go ahead and see Wikipedia okay.

384
00:20:29,000 --> 00:20:30,000
So let me show you over here.

385
00:20:30,000 --> 00:20:34,000
So for Wikipedia, you need to go ahead and install the wikipedia package over here.

386
00:20:34,000 --> 00:20:35,000
Right.

387
00:20:35,000 --> 00:20:38,000
So I will go ahead and install Wikipedia.

388
00:20:38,000 --> 00:20:41,000
Quickly open this and install it.

389
00:20:41,000 --> 00:20:42,000
Right.

390
00:20:42,000 --> 00:20:47,000
So I'm giving you a complete generic way how you can access it along with this.

391
00:20:47,000 --> 00:20:47,000
Right.

392
00:20:47,000 --> 00:20:50,000
So I will copy this quickly okay.

393
00:20:51,000 --> 00:20:53,000
I will close it over here.

394
00:20:53,000 --> 00:20:56,000
So I will go ahead and say WikipediaLoader.

395
00:20:56,000 --> 00:21:01,000
Then it works based on any query that you search.

396
00:21:01,000 --> 00:21:06,000
So I will just go ahead and execute. Instead of writing "Hunter x Hunter", I will say, hey,

397
00:21:06,000 --> 00:21:10,000
Let's find out some information about generative AI, okay?

398
00:21:11,000 --> 00:21:12,000
And I want these particular docs.

399
00:21:12,000 --> 00:21:15,000
And let's go ahead and print my docs.

400
00:21:15,000 --> 00:21:16,000
Okay.

401
00:21:16,000 --> 00:21:20,000
And here you have this automatically.

402
00:21:20,000 --> 00:21:21,000
It will be loading two documents.

403
00:21:21,000 --> 00:21:24,000
And here is the entire page content right.

404
00:21:25,000 --> 00:21:26,000
Entire page content.

405
00:21:26,000 --> 00:21:30,000
If I just go ahead and display docs you will also be able to see all this information.

406
00:21:33,000 --> 00:21:33,000
Right.

407
00:21:33,000 --> 00:21:38,000
So everything is over here and I think it should be giving you the page URL.

408
00:21:38,000 --> 00:21:41,000
See, the page URL is also given.

409
00:21:41,000 --> 00:21:43,000
This is the URL that is searched.

410
00:21:43,000 --> 00:21:45,000
And just by using this.

411
00:21:45,000 --> 00:21:46,000
Wow!

412
00:21:46,000 --> 00:21:52,000
I think you should be able to feel very much happy because now any kind of data you'll be able to get

413
00:21:52,000 --> 00:21:52,000
it.

414
00:21:52,000 --> 00:21:56,000
And this is completely free. Wherever an API is required, you can use the API.

415
00:21:56,000 --> 00:21:58,000
Let's say I want to have weather data.

416
00:21:58,000 --> 00:22:00,000
So I need an API for this.

417
00:22:00,000 --> 00:22:03,000
You can go to the WeatherDataLoader and create your API key.

418
00:22:03,000 --> 00:22:04,000
You can do that right.

419
00:22:04,000 --> 00:22:10,000
So I hope you are able to understand the entire data ingestion technique with the help of document loaders.

420
00:22:10,000 --> 00:22:12,000
I've shown you multiple examples.

421
00:22:12,000 --> 00:22:15,000
Don't forget to check out the documentation again.

422
00:22:15,000 --> 00:22:17,000
I've given the link over here.

423
00:22:17,000 --> 00:22:23,000
Okay, at the end of the day, LangChain has actually included a lot of document loaders through which

424
00:22:23,000 --> 00:22:26,000
you can ingest data from many different data sources.

425
00:22:26,000 --> 00:22:26,000
Right.

426
00:22:26,000 --> 00:22:28,000
So go ahead and explore.

427
00:22:28,000 --> 00:22:34,000
Uh, I will see you all in the next video where I will be talking about once we take this entire document,

428
00:22:34,000 --> 00:22:37,000
the next thing, which is how we will be able to do the text splitting.

429
00:22:37,000 --> 00:22:38,000
Right?

430
00:22:38,000 --> 00:22:41,000
That is what I'm actually going to discuss in my next video.

431
00:22:41,000 --> 00:22:42,000
So yes, I will see you in the next video.

432
00:22:42,000 --> 00:22:43,000
Thank you.