1
00:00:00,000 --> 00:00:05,000
So guys, in this video we are going to create one amazing LLM project, which is nothing but PDF query

2
00:00:05,000 --> 00:00:07,000
with LangChain and Cassandra DB.

3
00:00:08,000 --> 00:00:12,000
The Cassandra DB we will be creating on a platform called DataStax.

4
00:00:12,000 --> 00:00:17,000
So you have probably heard about this particular platform, which is called DataStax, which

5
00:00:17,000 --> 00:00:23,000
will actually help you to create Cassandra DB in the cloud itself, and why this platform is quite amazing

6
00:00:23,000 --> 00:00:27,000
because from this you will be able to perform vector search.

7
00:00:27,000 --> 00:00:32,000
And whenever we talk about these kinds of documents, or if you want to create Q&A applications

8
00:00:32,000 --> 00:00:37,000
from huge PDFs itself, vector search is the thing that you really need to implement.

9
00:00:37,000 --> 00:00:42,000
Now, before I go ahead, let's first of all understand the entire architecture.

10
00:00:42,000 --> 00:00:45,000
We will be solving this entirely step by step.

11
00:00:45,000 --> 00:00:51,000
What are the steps specifically that will be taken to probably complete this specific project that we

12
00:00:51,000 --> 00:00:52,000
really need to understand?

13
00:00:52,000 --> 00:00:54,000
So let's begin with the architecture.

14
00:00:54,000 --> 00:00:55,000
Initially.

15
00:00:55,000 --> 00:00:57,000
Let's say you have a specific PDF.

16
00:00:57,000 --> 00:01:01,000
This PDF can be of any size and any number of pages.

17
00:01:01,000 --> 00:01:05,000
First of all, we will read the documents and understand.

18
00:01:05,000 --> 00:01:11,000
Here we are going to use LangChain, as I said, because LangChain has some amazing functionalities

19
00:01:11,000 --> 00:01:16,000
which will actually help you to perform all the necessary tasks to create this specific application.

20
00:01:16,000 --> 00:01:23,000
Now first we will go ahead and read the documents that is specifically the PDF and the first step.

21
00:01:23,000 --> 00:01:28,000
Usually when we, whenever we work with this kind of data set is with respect to some kind of transformation

22
00:01:28,000 --> 00:01:29,000
we really need to do.

23
00:01:29,000 --> 00:01:34,000
So after reading these documents, we will convert them into various text chunks.

24
00:01:34,000 --> 00:01:38,000
That basically means we will split the data set into some kind of packets.

25
00:01:39,000 --> 00:01:39,000
Right.

26
00:01:39,000 --> 00:01:46,000
So this text chunk will be of some specific size based on the tokens that we are probably going to use.

27
00:01:46,000 --> 00:01:49,000
So over here you can see some example reading the document.

28
00:01:49,000 --> 00:01:52,000
And then we have divided this into some chunks.

29
00:01:52,000 --> 00:01:56,000
Then we will convert all this chunk into text embeddings.

30
00:01:56,000 --> 00:02:04,000
Now here we will be specifically using OpenAI embeddings. Okay, OpenAI embeddings actually help you

31
00:02:04,000 --> 00:02:07,000
to convert text into vectors.

32
00:02:07,000 --> 00:02:09,000
Now why you specifically require these vectors?

33
00:02:09,000 --> 00:02:14,000
I hope you have heard about text embedding techniques in machine learning, right? There,

34
00:02:14,000 --> 00:02:19,000
we have specifically used bag of words, TF-IDF, and many more things that are already present on

35
00:02:19,000 --> 00:02:20,000
my YouTube channel.

36
00:02:20,000 --> 00:02:23,000
We have also used Word2Vec and average Word2Vec.

37
00:02:23,000 --> 00:02:25,000
What is the main aim of all these techniques?

38
00:02:25,000 --> 00:02:28,000
To convert text into vectors.

39
00:02:28,000 --> 00:02:34,000
Because once we probably convert into vectors, we can perform various tasks like classification algorithms

40
00:02:34,000 --> 00:02:36,000
or similarity search algorithms, and many more.

41
00:02:36,000 --> 00:02:42,000
So that is the reason we will specifically be using OpenAI embeddings, which will be responsible in

42
00:02:42,000 --> 00:02:45,000
converting a text into vectors itself.
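To make the idea of turning text into vectors concrete, here is a minimal sketch using the bag-of-words technique mentioned above rather than OpenAI embeddings. The sample sentences and the helper names bag_of_words and cosine_similarity are made up for illustration; real embeddings are dense and learned, but the similarity comparison works the same way.
```python
import math
from collections import Counter
# Toy bag-of-words "embedding": one dimension per vocabulary word.
def bag_of_words(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]
# Cosine similarity: 1.0 for the same direction, 0.0 when no words are shared.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
# Illustrative data only; not taken from the budget PDF.
documents = ["the budget increases capital expenditure", "the weather is sunny today"]
query = "what is the capital expenditure in the budget"
vocabulary = sorted({w for t in documents + [query] for w in t.lower().split()})
query_vec = bag_of_words(query, vocabulary)
scores = [cosine_similarity(query_vec, bag_of_words(d, vocabulary)) for d in documents]
# The document with the highest score is the most similar to the query.
best = documents[scores.index(max(scores))]
print(best)
```
Once text is in vector form, finding the most relevant document is just a matter of comparing numbers, and that is what the vector database does at scale.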

43
00:02:45,000 --> 00:02:50,000
Now, once we convert every text into vectors, we will also say this as text embeddings.

44
00:02:50,000 --> 00:02:55,000
Once we get this embeddings, what we are specifically going to do now this will be quite amazing because

45
00:02:55,000 --> 00:02:58,000
understand if we have a huge PDF document, right?

46
00:02:58,000 --> 00:03:01,000
So definitely the vector size will keep on increasing.

47
00:03:01,000 --> 00:03:06,000
So it is better we store entirely all these vectors into some kind of database.

48
00:03:06,000 --> 00:03:09,000
And for this we are going to use Cassandra DB.

49
00:03:10,000 --> 00:03:16,000
So in short, what we are basically going to do is that we will take all the specific vectors and save

50
00:03:16,000 --> 00:03:18,000
it in a vector database here.

51
00:03:18,000 --> 00:03:20,000
Currently we are going to use Cassandra DB.

52
00:03:20,000 --> 00:03:22,000
Now what exactly is Cassandra DB?

53
00:03:22,000 --> 00:03:28,000
So in order to understand about Cassandra DB, I have opened the entire documentation page over here.

54
00:03:28,000 --> 00:03:29,000
Cassandra.

55
00:03:29,000 --> 00:03:29,000
Apache.

56
00:03:29,000 --> 00:03:37,000
Apache Cassandra is an open source NoSQL database, and it can definitely be used for saving massive

57
00:03:37,000 --> 00:03:41,000
amount of data, so it manages massive amount of data fast without losing sleep.

58
00:03:41,000 --> 00:03:42,000
Right?

59
00:03:42,000 --> 00:03:47,000
So again, understand this is a NoSQL database, and for this kind of vector data, we definitely have

60
00:03:47,000 --> 00:03:48,000
to save it.

61
00:03:48,000 --> 00:03:50,000
In this kind of database itself.

62
00:03:50,000 --> 00:03:54,000
Many bigger companies are basically using this Cassandra DB for this specific purpose.

63
00:03:54,000 --> 00:03:59,000
So if you really want to read more about Apache Cassandra, you can probably see over here.

64
00:03:59,000 --> 00:04:04,000
Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for

65
00:04:04,000 --> 00:04:06,000
scalability and high availability.

66
00:04:06,000 --> 00:04:12,000
This is the most important point for scalability and high availability without compromising performance.

67
00:04:12,000 --> 00:04:18,000
Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it

68
00:04:18,000 --> 00:04:22,000
the perfect platform for mission-critical data.

69
00:04:22,000 --> 00:04:25,000
Now how we are going to probably create this specific database.

70
00:04:25,000 --> 00:04:31,000
For that, we will be using the DataStax platform, which will actually help you to create

71
00:04:31,000 --> 00:04:37,000
this vector-enabled Cassandra DB, so that you can store all these vectors in this specific DB.

72
00:04:37,000 --> 00:04:42,000
And at any point of time, if a person is trying to query from this particular DB, you will be able

73
00:04:42,000 --> 00:04:45,000
to get that specific response from that right.

74
00:04:45,000 --> 00:04:48,000
And the most similar response that you will be able to get it.

75
00:04:48,000 --> 00:04:49,000
Now that is the next step.

76
00:04:49,000 --> 00:04:54,000
What we are basically going to do, all this vectors, we are going to save it in some kind of vector

77
00:04:54,000 --> 00:04:54,000
database.

78
00:04:54,000 --> 00:04:59,000
As I said, we are going to use Cassandra DB, or we can also say Astra DB, and this will be created

79
00:04:59,000 --> 00:05:06,000
on the DataStax platform, wherein you can actually perform vector search.

80
00:05:06,000 --> 00:05:12,000
Now the next thing is that after you save all your vectors in the database itself,

81
00:05:12,000 --> 00:05:18,000
then whenever a human tries to query anything that is related to that particular PDF

82
00:05:18,000 --> 00:05:24,000
document, it is going to probably apply similarity search along with text embeddings, and it is going

83
00:05:24,000 --> 00:05:26,000
to get that specific response.

84
00:05:26,000 --> 00:05:32,000
So this is the entire architecture that we are specifically going to perform in this specific project.

85
00:05:32,000 --> 00:05:36,000
All the steps will be shown step by step.

86
00:05:36,000 --> 00:05:41,000
Everything will be explained in an amazing way along with the code and along with the explanation.

87
00:05:41,000 --> 00:05:47,000
Now let's go ahead and start our project for this PDF query with LangChain and Cassandra DB.

88
00:05:47,000 --> 00:05:51,000
So guys now let's go ahead and implement this specific project.

89
00:05:51,000 --> 00:05:52,000
I will be going step by step.

90
00:05:52,000 --> 00:05:58,000
I will also be showing you how you can create the Cassandra DB, specifically on the DataStax

91
00:05:58,000 --> 00:06:00,000
platform itself.

92
00:06:00,000 --> 00:06:05,000
We'll be going step by step; all the comments regarding this code are given over here.

93
00:06:05,000 --> 00:06:09,000
I will also be providing you the code in the description of this particular video.

94
00:06:09,000 --> 00:06:15,000
So first of all, what exactly are we doing? We are going to query a PDF with Astra DB and

95
00:06:15,000 --> 00:06:16,000
LangChain.

96
00:06:16,000 --> 00:06:20,000
Understand that it is powered by vector search.

97
00:06:20,000 --> 00:06:24,000
So first of all you need to understand what exactly is vector search.

98
00:06:24,000 --> 00:06:28,000
So there is an amazing explanation given in the DataStax documentation itself.

99
00:06:28,000 --> 00:06:32,000
So vector search enhances machine learning models by allowing similarity

100
00:06:32,000 --> 00:06:33,000
comparison of the embeddings.

101
00:06:33,000 --> 00:06:37,000
Embeddings basically means whatever text is basically converted into the vectors.

102
00:06:37,000 --> 00:06:39,000
That is basically embedding, right?

103
00:06:39,000 --> 00:06:43,000
And over there you can definitely apply multiple algorithms, right.

104
00:06:43,000 --> 00:06:44,000
Machine learning algorithms on the fly.

105
00:06:44,000 --> 00:06:45,000
Right?

106
00:06:45,000 --> 00:06:49,000
As a capability of Astra DB vector search supports various large language models.

107
00:06:49,000 --> 00:06:53,000
So large language models are supported in an amazing way.

108
00:06:53,000 --> 00:06:56,000
Here the integration is very smooth and easy, right?

109
00:06:56,000 --> 00:07:01,000
Since these LLMs are stateless, they rely on a vector database like Astra DB to store the embeddings.

110
00:07:01,000 --> 00:07:02,000
See, understand.

111
00:07:02,000 --> 00:07:06,000
Because when we say stateless that basically means what.

112
00:07:06,000 --> 00:07:10,000
Suppose we have embeddings: once we lose them, we cannot query them again.

113
00:07:10,000 --> 00:07:10,000
Right.

114
00:07:10,000 --> 00:07:14,000
So we definitely require a database to store all these things.

115
00:07:14,000 --> 00:07:17,000
And after that, you can query any number of times.

116
00:07:17,000 --> 00:07:20,000
So let us go step by step and let us see okay.

117
00:07:20,000 --> 00:07:24,000
So first of all we need to create a database on Astra DB.

118
00:07:24,000 --> 00:07:27,000
So I will probably click this specific link.

119
00:07:27,000 --> 00:07:28,000
Everything is basically given over here.

120
00:07:28,000 --> 00:07:33,000
For this we will be going to astra.datastax.com.

121
00:07:33,000 --> 00:07:33,000
Right.

122
00:07:33,000 --> 00:07:38,000
So first of all it will probably ask you to sign in right.

123
00:07:38,000 --> 00:07:42,000
And here you can sign in with your GitHub or with your Google account.

124
00:07:42,000 --> 00:07:45,000
So here I'm going to go ahead and sign in with GitHub.

125
00:07:45,000 --> 00:07:52,000
And once I sign in over here, you will be able to see that. I will be providing

126
00:07:52,000 --> 00:07:56,000
you the link along with everything in the code itself.

127
00:07:56,000 --> 00:07:56,000
Right.

128
00:07:56,000 --> 00:07:58,000
So it will be very much easy for you.

129
00:07:58,000 --> 00:08:04,000
So once you go to astra.datastax.com, the next step is basically to create a database.

130
00:08:04,000 --> 00:08:05,000
Right.

131
00:08:05,000 --> 00:08:08,000
So what kind of database are we going to create?

132
00:08:08,000 --> 00:08:10,000
It will be serverless vector.

133
00:08:10,000 --> 00:08:12,000
And this is specifically a Cassandra DB.

134
00:08:12,000 --> 00:08:13,000
Okay.

135
00:08:13,000 --> 00:08:16,000
So here I will probably give my database name.

136
00:08:16,000 --> 00:08:18,000
Let's say I want to do PDF query.

137
00:08:18,000 --> 00:08:18,000
Right.

138
00:08:18,000 --> 00:08:21,000
So this will basically be my PDF query DB okay.

139
00:08:21,000 --> 00:08:23,000
This will basically be my database name.

140
00:08:23,000 --> 00:08:25,000
You can give anything as you want.

141
00:08:25,000 --> 00:08:29,000
And here I'll be giving langchain_db as the keyspace name.

142
00:08:29,000 --> 00:08:30,000
It should be unique.

143
00:08:30,000 --> 00:08:33,000
The provider that you can specifically use.

144
00:08:33,000 --> 00:08:36,000
You have multiple like Amazon Web Services, Microsoft Azure.

145
00:08:36,000 --> 00:08:40,000
But here I'm going to probably use Google Cloud, which is the default that is selected.

146
00:08:40,000 --> 00:08:45,000
In the next step, we will go ahead and select the region, which is by default us-east1.

147
00:08:45,000 --> 00:08:51,000
So as soon as you fill in all these details, and as you know that we are specifically going to use

148
00:08:51,000 --> 00:08:56,000
this vector database itself, because at the end of the day, the algorithms that we are probably going

149
00:08:56,000 --> 00:09:00,000
to apply it will be easy with respect to this kind of database, right?

150
00:09:00,000 --> 00:09:03,000
So finally we will go ahead and create the database.

151
00:09:03,000 --> 00:09:09,000
Now once we create the database you will be able to see that my database is basically created over here.

152
00:09:09,000 --> 00:09:10,000
Right.

153
00:09:10,000 --> 00:09:13,000
So this is what my database looks like, right?

154
00:09:13,000 --> 00:09:15,000
pdf_query_db.

155
00:09:15,000 --> 00:09:19,000
Now if I probably go to my dashboard I have already created this kind of database a lot.

156
00:09:19,000 --> 00:09:22,000
So let me consider one database which I have already created.

157
00:09:22,000 --> 00:09:26,000
And over here some important information that you really need to take.

158
00:09:26,000 --> 00:09:29,000
First of all I will go and click on connect okay.

159
00:09:29,000 --> 00:09:35,000
So when I probably click on connect what some information you will definitely require one is generate

160
00:09:35,000 --> 00:09:36,000
token right.

161
00:09:36,000 --> 00:09:38,000
And the other one is the DB ID.

162
00:09:38,000 --> 00:09:41,000
So the DB ID is basically present over here, right?

163
00:09:41,000 --> 00:09:43,000
The token is basically present over here.

164
00:09:43,000 --> 00:09:47,000
Now I'll talk about where this specific information will be required okay.

165
00:09:47,000 --> 00:09:47,000
Okay.

166
00:09:47,000 --> 00:09:50,000
So here I will go with respect to my code.

167
00:09:50,000 --> 00:09:52,000
Now let's start our coding.

168
00:09:52,000 --> 00:09:58,000
Initially we will be requiring some of the important libraries like cassio, datasets, langchain, openai,

169
00:09:58,000 --> 00:09:59,000
and tiktoken.

170
00:09:59,000 --> 00:10:05,000
So here I will go ahead and execute it, and I will go ahead and install all this specific libraries.

171
00:10:05,000 --> 00:10:07,000
So it will probably take some time.

172
00:10:07,000 --> 00:10:07,000
Right.

173
00:10:07,000 --> 00:10:09,000
I've already done that installation.

174
00:10:09,000 --> 00:10:11,000
So for me it has happened very quickly.

175
00:10:11,000 --> 00:10:16,000
Now the next thing is that as we know that we are specifically going to use Cassandra DB.

176
00:10:16,000 --> 00:10:21,000
So in LangChain you have all these libraries which will actually help you to connect with Cassandra

177
00:10:21,000 --> 00:10:27,000
DB and perform all the necessary tasks like text embeddings or creating vectors and storing

178
00:10:27,000 --> 00:10:28,000
it in the database itself.

179
00:10:28,000 --> 00:10:33,000
So here I'm going to import all these libraries. From langchain.vectorstores.cassandra,

180
00:10:33,000 --> 00:10:35,000
I'm going to import Cassandra.

181
00:10:35,000 --> 00:10:38,000
Along with this I'm also going to use VectorStoreIndexWrapper.

182
00:10:38,000 --> 00:10:44,000
It is going to wrap all those particular vectors in one specific package so that it can be used quickly.

183
00:10:44,000 --> 00:10:49,000
Then I'm also going to import OpenAI, because OpenAI is the thing that we really need to use.

184
00:10:49,000 --> 00:10:54,000
Along with this, we are also going to use OpenAI embeddings, which will be responsible for converting

185
00:10:54,000 --> 00:10:56,000
your text into vectors.

186
00:10:56,000 --> 00:11:00,000
Along with this, if you want some kind of data set from hugging face, you can also use this.

187
00:11:00,000 --> 00:11:04,000
And one more important library that we are going to use is cassio.

188
00:11:04,000 --> 00:11:10,000
Now cassio actually helps you to integrate with Astra DB right in LangChain.

189
00:11:10,000 --> 00:11:13,000
And it will also help you to initialize the DB connection.

190
00:11:13,000 --> 00:11:17,000
So all these libraries we are going to use, I'm going to execute this step by step.

191
00:11:17,000 --> 00:11:18,000
You're going to probably see.

192
00:11:18,000 --> 00:11:22,000
And this is the first step in installing the libraries and initializing all the libraries that we are

193
00:11:22,000 --> 00:11:24,000
specifically going to use.

194
00:11:24,000 --> 00:11:29,000
Along with this, we are also going to import one PDF library, which is called PyPDF2.

195
00:11:29,000 --> 00:11:35,000
This will actually help you to read any PDF, that is, read the text inside the PDF itself.

196
00:11:35,000 --> 00:11:38,000
So this is one amazing library to probably use.

197
00:11:38,000 --> 00:11:41,000
Okay, so here I have basically used pip install Pypdf2.

198
00:11:41,000 --> 00:11:43,000
So let me just go ahead and execute it.

199
00:11:43,000 --> 00:11:46,000
And here you will be able to see it shows requirement

200
00:11:46,000 --> 00:11:52,000
already satisfied, because I have already installed it over here. Then from PyPDF2 we are going to use PdfReader,

201
00:11:52,000 --> 00:11:57,000
because this will be the functionality that will be used in order to read the document.

202
00:11:57,000 --> 00:12:00,000
Here is the document that I have specifically taken.

203
00:12:00,000 --> 00:12:02,000
So this is one budget speech PDF.

204
00:12:02,000 --> 00:12:06,000
So this is the Indian budget that is probably of 2023.

205
00:12:06,000 --> 00:12:10,000
It's a big document, a file of somewhere around 461 KB.

206
00:12:10,000 --> 00:12:12,000
It has around 30 pages.

207
00:12:12,000 --> 00:12:18,000
So I'm going to specifically read this PDF and then convert it into vector, store it in the database

208
00:12:18,000 --> 00:12:22,000
itself, and then query the database for any information about that particular

209
00:12:22,000 --> 00:12:23,000
PDF.

210
00:12:23,000 --> 00:12:26,000
Now let's go ahead with the setup okay.

211
00:12:26,000 --> 00:12:30,000
Now with respect to the setup, you require three important pieces of information.

212
00:12:30,000 --> 00:12:35,000
One is the Astra DB application token, one is the Astra DB ID, okay?

213
00:12:35,000 --> 00:12:38,000
So where can you get these two pieces of information?

214
00:12:38,000 --> 00:12:41,000
So go to your vector database.

215
00:12:41,000 --> 00:12:43,000
The vector database in DataStax.

216
00:12:43,000 --> 00:12:50,000
So here is what you have specifically logged in okay as I said inside your DB just go and click on connect

217
00:12:50,000 --> 00:12:51,000
here.

218
00:12:51,000 --> 00:12:53,000
You need to click on Generate Token.

219
00:12:53,000 --> 00:12:58,000
As soon as you probably click on Generate Token then you will be getting some code which looks like

220
00:12:58,000 --> 00:12:59,000
this.

221
00:12:59,000 --> 00:13:01,000
This is the token you will be getting.

222
00:13:01,000 --> 00:13:05,000
So this will be probably found in your token JSON file.

223
00:13:05,000 --> 00:13:08,000
So it will probably show you a JSON file which will have this information.

224
00:13:08,000 --> 00:13:08,000
Okay.

225
00:13:08,000 --> 00:13:13,000
The first information that you have over here is the Astra DB application token.

226
00:13:13,000 --> 00:13:17,000
So here you can see it starts with AstraCS.

227
00:13:17,000 --> 00:13:20,000
So what you need to do just click on the generate token and you will be able to see it.

228
00:13:20,000 --> 00:13:23,000
This is the first information you just need to copy and paste it over here.

229
00:13:23,000 --> 00:13:27,000
The second piece of information is the Astra DB ID, right?

230
00:13:27,000 --> 00:13:29,000
The Astra DB ID is nothing but this specific information.

231
00:13:29,000 --> 00:13:32,000
That is your database ID. Where do you get it?

232
00:13:32,000 --> 00:13:34,000
You just need to copy it from here.

233
00:13:34,000 --> 00:13:37,000
So this is the information with respect to your Astra DB ID.

234
00:13:37,000 --> 00:13:42,000
So these two pieces of information, once you have them, you paste them over here. I've already pasted them.

235
00:13:42,000 --> 00:13:45,000
And then you can also see that I've also used an OpenAI API key.

236
00:13:45,000 --> 00:13:47,000
And this specific API key,

237
00:13:47,000 --> 00:13:49,000
don't use this one, because I have made some changes to it.

238
00:13:49,000 --> 00:13:51,000
I've already executed the code also.

239
00:13:51,000 --> 00:13:54,000
Okay, so I'm going to take these three pieces of information.

240
00:13:54,000 --> 00:13:58,000
The first two are used to connect to your Astra DB, right?

241
00:13:58,000 --> 00:14:01,000
Which has a Cassandra DB hosted over there in the cloud.

242
00:14:01,000 --> 00:14:02,000
Right.

243
00:14:02,000 --> 00:14:06,000
And the other piece of information is to use the OpenAI API features.

244
00:14:06,000 --> 00:14:06,000
Right.

245
00:14:06,000 --> 00:14:08,000
So all this information is basically there.
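Since pasting keys directly into a notebook makes them easy to leak, one possible alternative is to read the three values from environment variables. This is only a sketch: the variable names ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_ID and OPENAI_API_KEY and the helpers require_env and load_settings are assumptions for illustration, not something shown in the video.
```python
import os
# Hypothetical helper: fetch a required setting or fail with a clear message.
def require_env(name, environ=os.environ):
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value
# Collect the three values the setup step needs into one dictionary.
def load_settings(environ=os.environ):
    return {
        "astra_token": require_env("ASTRA_DB_APPLICATION_TOKEN", environ),
        "astra_db_id": require_env("ASTRA_DB_ID", environ),
        "openai_api_key": require_env("OPENAI_API_KEY", environ),
    }
```
Calling load_settings() with no arguments reads the real environment; passing a plain dictionary instead makes the helper easy to try out without real credentials.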

246
00:14:09,000 --> 00:14:11,000
I'm going to probably execute this.

247
00:14:11,000 --> 00:14:15,000
And then we will go ahead and read our budget speech PDF.

248
00:14:15,000 --> 00:14:16,000
So this is the first step.

249
00:14:16,000 --> 00:14:18,000
According to this we are reading the specific document.

250
00:14:18,000 --> 00:14:22,000
Before that we have initialized everything with respect to this okay.

251
00:14:22,000 --> 00:14:28,000
So once we do this, I will be reading this. Now, after reading, as I said, we are

252
00:14:28,000 --> 00:14:32,000
going to divide all our content into some kind of chunks, right?

253
00:14:32,000 --> 00:14:35,000
So here is the chunking we are going to do.

254
00:14:35,000 --> 00:14:37,000
Now first of all, I will read all the raw text.

255
00:14:37,000 --> 00:14:43,000
So for this I'm going to use, from typing_extensions, Concatenate. I'm going to read from each of the

256
00:14:43,000 --> 00:14:44,000
pages.

257
00:14:44,000 --> 00:14:45,000
I will extract all the text.

258
00:14:45,000 --> 00:14:54,000
So here you can see: for i, page in enumerate(pdfreader.pages), page.extract_text()

259
00:14:54,000 --> 00:14:59,000
will basically take out all the text from those pages, and it will concatenate in this particular variable

260
00:14:59,000 --> 00:15:00,000
that is raw_text.
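The page-by-page extraction described here can be sketched roughly as follows. PdfReader, .pages and .extract_text() come from PyPDF2; the function names collect_text and read_pdf_text and the splitting into two helpers are assumptions made for illustration.
```python
# Concatenate the text of every page; pages that yield no text are skipped.
def collect_text(pages):
    raw_text = ""
    for page in pages:
        content = page.extract_text()
        if content:
            raw_text += content
    return raw_text
# Read a PDF from disk and return all of its text as one string.
def read_pdf_text(path):
    # Imported lazily so this sketch loads even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader
    return collect_text(PdfReader(path).pages)
```
Keeping the loop separate from the file reading means it can be tried on any objects that have an extract_text() method, without needing a real PDF.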

261
00:15:00,000 --> 00:15:07,000
So once I probably execute this, what will happen is that you will be able to get all the text inside

262
00:15:07,000 --> 00:15:08,000
this particular variable.

263
00:15:08,000 --> 00:15:13,000
So you can see over here raw_text has the entire text.

264
00:15:13,000 --> 00:15:16,000
So this is the entire text from that specific PDF.

265
00:15:16,000 --> 00:15:16,000
Right.

266
00:15:16,000 --> 00:15:18,000
Slash n (\n) basically means new line.

267
00:15:18,000 --> 00:15:20,000
So this step is basically done.

268
00:15:20,000 --> 00:15:24,000
Just imagine: before, if you did not have this specific library, it was very difficult to read a PDF.

269
00:15:24,000 --> 00:15:24,000
Right.

270
00:15:24,000 --> 00:15:28,000
And we have actually done this just with writing 4 to 5 lines of code.

271
00:15:28,000 --> 00:15:33,000
Now the next step is that we will initialize the connection to your database.

272
00:15:33,000 --> 00:15:39,000
I have all my database information, right, like our token and the database ID. I'm going to use that

273
00:15:39,000 --> 00:15:39,000
cassio.

274
00:15:39,000 --> 00:15:44,000
cassio is basically used as the library over there for initializing this particular database.

275
00:15:44,000 --> 00:15:45,000
So cassio.init.

276
00:15:45,000 --> 00:15:50,000
Here I'll be giving one parameter, which is called token, which will be nothing but the Astra DB application

277
00:15:50,000 --> 00:15:51,000
token.

278
00:15:51,000 --> 00:15:54,000
And then your database_id, which is nothing but the Astra DB ID, right?

279
00:15:54,000 --> 00:15:56,000
So I've taken these two pieces of information.

280
00:15:56,000 --> 00:15:57,000
I will execute this.

281
00:15:57,000 --> 00:15:58,000
You will get some kind of warnings.

282
00:15:58,000 --> 00:16:00,000
So don't worry about the warnings.

283
00:16:00,000 --> 00:16:05,000
It is just showing you some warnings with respect to some driver issues and so on.

284
00:16:05,000 --> 00:16:07,000
But this will basically get executed.

285
00:16:07,000 --> 00:16:11,000
And now I have initialized my DB itself.

286
00:16:11,000 --> 00:16:16,000
Right. Now we are going to create the LangChain embedding and LLM objects for later use.

287
00:16:16,000 --> 00:16:21,000
So for that I'm going to initialize OpenAI with my OpenAI key and embeddings.

288
00:16:21,000 --> 00:16:23,000
Also OpenAI embeddings with my OpenAI key.

289
00:16:23,000 --> 00:16:25,000
So I have my LLM.

290
00:16:25,000 --> 00:16:26,000
I have my embeddings okay.

291
00:16:27,000 --> 00:16:32,000
Now comes the main step: I need to create my LangChain vector store.

292
00:16:32,000 --> 00:16:36,000
So over here this is what we are basically going to create now right.

293
00:16:36,000 --> 00:16:39,000
And for that, you know, we have initialized Cassandra.

294
00:16:39,000 --> 00:16:39,000
Right.

295
00:16:39,000 --> 00:16:41,000
We have imported Cassandra.

296
00:16:41,000 --> 00:16:45,000
Now, what we will do is that in this Cassandra, we will provide some important information.

297
00:16:45,000 --> 00:16:48,000
What is the kind of embeddings we are going to use?

298
00:16:48,000 --> 00:16:51,000
What is the table name inside this particular database? Then session is

299
00:16:51,000 --> 00:16:52,000
None,

300
00:16:52,000 --> 00:16:52,000
keyspace is None.

301
00:16:52,000 --> 00:16:55,000
So these are the default parameters we have specifically used.

302
00:16:55,000 --> 00:16:59,000
qa_mini_demo is my table name, okay?

303
00:16:59,000 --> 00:17:03,000
Just like a question answer table name and what kind of embeddings we are going to specifically use.

304
00:17:03,000 --> 00:17:09,000
That basically means whenever we push any data into my Cassandra DB, into my Astra

305
00:17:09,000 --> 00:17:14,000
DB itself, what it is going to do, it is going to probably convert all the text using this embeddings

306
00:17:14,000 --> 00:17:15,000
into vectors.

307
00:17:15,000 --> 00:17:15,000
Right?

308
00:17:15,000 --> 00:17:18,000
And this is the embeddings that we have initialized over here.

309
00:17:18,000 --> 00:17:19,000
So here is the next step.

310
00:17:19,000 --> 00:17:20,000
We will go ahead and execute this.

311
00:17:20,000 --> 00:17:22,000
So this is my Astra vector store.
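Put together, the connection and vector store setup described so far might look like the sketch below. The import paths match older LangChain releases that shipped langchain.vectorstores.cassandra and are an assumption here; newer releases have moved these classes. The wrapper function build_astra_vector_store is hypothetical and exists only to keep the steps in one place.
```python
def build_astra_vector_store(astra_token, astra_db_id, openai_api_key, table_name="qa_mini_demo"):
    # Lazy imports: these module paths match older LangChain releases and are
    # an assumption here; newer releases moved them to other packages.
    import cassio
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.llms import OpenAI
    from langchain.vectorstores.cassandra import Cassandra
    # Open the connection to the Astra DB database.
    cassio.init(token=astra_token, database_id=astra_db_id)
    # The LLM answers questions later; the embeddings turn text into vectors.
    llm = OpenAI(openai_api_key=openai_api_key)
    embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
    # session=None and keyspace=None fall back to what cassio.init set up.
    store = Cassandra(embedding=embedding, table_name=table_name, session=None, keyspace=None)
    return llm, store
```
Nothing is embedded at this point; the store only records which embedding object to call when text is inserted later.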

312
00:17:22,000 --> 00:17:29,000
But I have still not converted my text into vectors. Only when I'm pushing my data inside

313
00:17:29,000 --> 00:17:29,000
my DB.

314
00:17:29,000 --> 00:17:35,000
at that time, these embeddings will convert that particular data into vectors.

315
00:17:35,000 --> 00:17:42,000
Then what we are going to do is that we will take this data; we will

316
00:17:42,000 --> 00:17:47,000
take this entire data, convert it into chunks, and also do the text embedding,

317
00:17:47,000 --> 00:17:47,000
right.

318
00:17:47,000 --> 00:17:49,000
Text embedding while inserting.

319
00:17:49,000 --> 00:17:49,000
Right.

320
00:17:49,000 --> 00:17:55,000
So here you first of all, we are dividing the data or the entire data entire document into text chunk.

321
00:17:55,000 --> 00:18:00,000
So for this we are using character text splitter which is basically present in lang chain dot text splitter.

322
00:18:00,000 --> 00:18:03,000
We need to split the text using character text splitter.

323
00:18:03,000 --> 00:18:05,000
It should not increase the token size.

324
00:18:05,000 --> 00:18:08,000
So here I have given character text splitter.

325
00:18:08,000 --> 00:18:13,000
I'm saying use the separator slash in use chunk size some chunk size of 800 characters.

326
00:18:13,000 --> 00:18:14,000
Chunk overlap can be 200.

327
00:18:14,000 --> 00:18:17,000
And how the length should be measured,

328
00:18:17,000 --> 00:18:18,000
you can also provide over here.

329
00:18:18,000 --> 00:18:19,000
Right.

330
00:18:19,000 --> 00:18:24,000
And once I do this, you can see text_splitter.split_text here.

331
00:18:24,000 --> 00:18:26,000
You will be able to get all the text.
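As a rough sketch of the chunking idea described here, assuming a simplified stand-in and not LangChain's actual CharacterTextSplitter: the hypothetical `split_text` function below splits on a separator, packs pieces into chunks of at most 800 characters, and carries up to 200 trailing characters over into the next chunk. The sample text is made up for illustration.

```python
def split_text(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Split text on `separator` into chunks of at most `chunk_size` characters.

    Up to `chunk_overlap` trailing characters of one chunk are repeated at the
    start of the next. A single piece longer than `chunk_size` is kept whole.
    This is an illustrative stand-in, not LangChain's implementation.
    """

    def joined_len(parts):
        # Length of the parts once joined back together with the separator.
        return sum(len(p) for p in parts) + len(separator) * max(len(parts) - 1, 0)

    chunks, current = [], []
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the kept tail fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Made-up sample text: 200 lines of 50 characters each.
sample = "\n".join(f"line {i:04d} " + "x" * 40 for i in range(200))
texts = split_text(sample)
print(len(texts), "chunks; longest is", max(len(t) for t in texts), "characters")
```

The overlap means the end of one chunk reappears at the start of the next, so a sentence that straddles a chunk boundary is still found whole in at least one chunk.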

332
00:18:26,000 --> 00:18:32,000
And if I see the top 50 texts, you can see that I'll be able to see all the top 50 texts over

333
00:18:32,000 --> 00:18:33,000
here.

334
00:18:33,000 --> 00:18:33,000
Right.

335
00:18:33,000 --> 00:18:34,000
All the data itself.

336
00:18:34,000 --> 00:18:35,000
This is amazing.

337
00:18:35,000 --> 00:18:35,000
Right.

338
00:18:35,000 --> 00:18:38,000
And this is basically from the PDF.

339
00:18:38,000 --> 00:18:38,000
Right.

340
00:18:38,000 --> 00:18:40,000
All the data.

341
00:18:40,000 --> 00:18:40,000
Right.

342
00:18:40,000 --> 00:18:42,000
It is basically taking the top 50.

343
00:18:42,000 --> 00:18:43,000
Right.

344
00:18:43,000 --> 00:18:48,000
And understand that the size of each text over here follows the chunk size, which is somewhere around 800.

345
00:18:48,000 --> 00:18:48,000
Okay.

346
00:18:49,000 --> 00:18:50,000
Now this is done.

347
00:18:50,000 --> 00:18:51,000
We have the text.

348
00:18:51,000 --> 00:18:56,000
I'm going to just use the top 50 and store them in the vector database to see if everything is working

349
00:18:56,000 --> 00:18:57,000
fine.

350
00:18:57,000 --> 00:18:59,000
Now, how do we add these specific texts?

351
00:18:59,000 --> 00:19:04,000
Now what will happen when I add this text inside my Cassandra DB? See, astra_vector_store.

352
00:19:04,000 --> 00:19:05,000
What is this?

353
00:19:05,000 --> 00:19:09,000
This is basically initialized with respect to the Cassandra library, right?

354
00:19:09,000 --> 00:19:12,000
So here you will be able to see that I have used embedding.

355
00:19:12,000 --> 00:19:17,000
So now when I'm inserting inside the Cassandra DB, what it is going to do is apply these

356
00:19:17,000 --> 00:19:18,000
specific embeddings also.

357
00:19:18,000 --> 00:19:23,000
So that is the reason you will be able to see that when we write

358
00:19:23,000 --> 00:19:25,000
astra_vector_store.add_texts,

359
00:19:25,000 --> 00:19:29,000
and I'm taking the top 50 texts over there,

360
00:19:29,000 --> 00:19:31,000
This will also perform embeddings.

361
00:19:31,000 --> 00:19:36,000
So that basically means if I see over here it is going to perform this task and it is going to insert

362
00:19:36,000 --> 00:19:39,000
in Astra DB, which has Cassandra over there.

363
00:19:39,000 --> 00:19:39,000
Right.

364
00:19:39,000 --> 00:19:43,000
So it is going to do both of these steps with respect to this particular code.
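To make the "embed while inserting" idea concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: `toy_embedding` is a word-count stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not the Cassandra or LangChain vector store. It only shows that the store applies the embedding when texts are added, and again on the query when searching.

```python
import math
from collections import Counter


def toy_embedding(text):
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Illustrative in-memory store; not the real Astra DB / Cassandra store."""

    def __init__(self, embedding):
        self.embedding = embedding
        self.rows = []  # list of (vector, original text)

    def add_texts(self, texts):
        # The embedding is applied here, at insert time.
        for text in texts:
            self.rows.append((self.embedding(text), text))

    def similarity_search_with_score(self, query, k=4):
        # The query is embedded the same way, then compared to every stored row.
        query_vector = self.embedding(query)
        scored = [(text, cosine(query_vector, vec)) for vec, text in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embedding)
store.add_texts(
    [
        "agriculture credit target will be increased to 20 lakh crore",
        "Hyderabad will be supported as center of excellence",
        "the current GDP is estimated to be 7%",
    ]
)
for text, score in store.similarity_search_with_score("agriculture credit target", k=2):
    print("[%0.4f] %s ..." % (score, text[:84]))
```

A real vector database replaces the word counts with dense vectors from an embedding model and the full scan with an index, but the insert-then-search flow is the same.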

365
00:19:43,000 --> 00:19:44,000
So we are going to add this text.

366
00:19:44,000 --> 00:19:49,000
And then we are also going to wrap this entire thing inside a wrapper, okay.

367
00:19:49,000 --> 00:19:51,000
So this is the information.

368
00:19:51,000 --> 00:19:54,000
This is the index that we will be getting with respect to those texts.

369
00:19:54,000 --> 00:19:58,000
So once I probably execute it, you'll be seeing that in the same database.

370
00:19:58,000 --> 00:20:01,000
It is going to insert all these headlines, okay.

371
00:20:02,000 --> 00:20:04,000
Now finally let's go ahead and test it.

372
00:20:04,000 --> 00:20:07,000
That basically means I have my vectors inside my database.

373
00:20:07,000 --> 00:20:11,000
Now it's time that we just query and we ask some kind of questions.

374
00:20:11,000 --> 00:20:13,000
Now I have read this entire PDF guys.

375
00:20:13,000 --> 00:20:17,000
I could find out some of the questions, like what is the current GDP, how much the agriculture target will

376
00:20:17,000 --> 00:20:18,000
be increased and all.

377
00:20:18,000 --> 00:20:20,000
So I will take this particular example.

378
00:20:20,000 --> 00:20:23,000
And let's say I'm writing first_question = True.

379
00:20:23,000 --> 00:20:26,000
while True: if first_question, I just call input, okay.

380
00:20:26,000 --> 00:20:29,000
It will just ask like what kind of question you want to type.

381
00:20:29,000 --> 00:20:33,000
Else, it is just asking you to put more questions.

382
00:20:33,000 --> 00:20:35,000
If I write quit it is going to break.

383
00:20:35,000 --> 00:20:37,000
Otherwise it is going to continue.
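The loop just described can be sketched as below. This is a minimal version under stated assumptions: the function name `run_query_loop`, the prompt wording, and the `answer_fn` callback (standing in for the call that queries the index with the LLM) are all hypothetical. Input is passed in as a parameter so the loop can be exercised without a keyboard.

```python
def run_query_loop(answer_fn, read_input=input):
    """Ask for questions until the user types 'quit'; return the collected answers.

    `answer_fn` is a hypothetical callback that turns a question into an answer.
    """
    first_question = True
    answers = []
    while True:
        if first_question:
            query_text = read_input("\nEnter your question (or type 'quit' to exit): ")
        else:
            query_text = read_input("\nWhat's your next question (or type 'quit' to exit): ")
        query_text = query_text.strip()

        if query_text.lower() == "quit":
            break  # typing quit ends the loop
        if query_text == "":
            continue  # ignore empty input and ask again

        first_question = False
        answers.append(answer_fn(query_text))
    return answers


# Scripted demo: the "user" asks one question, presses enter once, then quits.
scripted = iter(["What is the current GDP?", "", "quit"])
demo_answers = run_query_loop(
    answer_fn=lambda question: "You asked: " + question,
    read_input=lambda prompt: next(scripted),
)
print(demo_answers)
```

In an interactive session you would leave `read_input` at its default so the prompts appear in the notebook or terminal.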

384
00:20:37,000 --> 00:20:38,000
Now see, this is the most important.

385
00:20:38,000 --> 00:20:43,000
As soon as I give my first question, it will go ahead with astra_vector_index and it will

386
00:20:43,000 --> 00:20:49,000
run the query with whatever query text we are using and the LLM model that we have initialized.

387
00:20:49,000 --> 00:20:52,000
And after that we will be getting the answer.

388
00:20:52,000 --> 00:20:58,000
Along with this we'll also be providing some information, right, like for each doc, the similarity score,

389
00:20:58,000 --> 00:21:00,000
and some other information also.

390
00:21:00,000 --> 00:21:01,000
Right.

391
00:21:01,000 --> 00:21:03,000
So let's go ahead and execute it.

392
00:21:03,000 --> 00:21:05,000
And as I said, I'm going to use this question, okay.

393
00:21:05,000 --> 00:21:11,000
How much is the agriculture target to be increased and what will the focus be?

394
00:21:11,000 --> 00:21:12,000
Okay.

395
00:21:12,000 --> 00:21:13,000
So I'm going to paste it over here.

396
00:21:13,000 --> 00:21:15,000
I'm going to press enter.

397
00:21:15,000 --> 00:21:19,000
So as soon as I press enter you can see that it is now taking the information.

398
00:21:19,000 --> 00:21:20,000
See this.

399
00:21:20,000 --> 00:21:24,000
You can see over here we are querying this particular DB, right.

400
00:21:24,000 --> 00:21:26,000
And it is going to give me the top four results okay.

401
00:21:26,000 --> 00:21:31,000
So here you can see that agriculture credit target will be increased to 20 lakh crore with the focus

402
00:21:31,000 --> 00:21:34,000
on animal husbandry, dairy and fisheries.

403
00:21:34,000 --> 00:21:34,000
Right.

404
00:21:34,000 --> 00:21:36,000
Why is it giving only this much data?

405
00:21:36,000 --> 00:21:37,000
Because I have told it:

406
00:21:37,000 --> 00:21:40,000
Take the first 84 characters.

407
00:21:40,000 --> 00:21:44,000
Take the text up to 84 characters and give the results, right.

408
00:21:44,000 --> 00:21:46,000
If I increase this, it will give you more of the result.

409
00:21:46,000 --> 00:21:50,000
Along with this, you can see that it is giving me the top k results,

410
00:21:50,000 --> 00:21:51,000
that is, the top four results.

411
00:21:51,000 --> 00:21:54,000
Hyderabad will be supported as a center of excellence.

412
00:21:54,000 --> 00:22:00,000
Some more information, but the most suitable answer that you have got is this one, right?

413
00:22:00,000 --> 00:22:05,000
And this is what you will find if you go ahead and search in the PDF. If you give the same question, you

414
00:22:00,000 --> 00:22:05,000
will be able to see the same answer, right? Along with this,

415
00:22:07,000 --> 00:22:12,000
if I want to see what the current GDP is, if this information is present over there,

416
00:22:12,000 --> 00:22:15,000
it will also be giving you that specific answer.

417
00:22:15,000 --> 00:22:17,000
It will just do the similarity search right.

418
00:22:17,000 --> 00:22:21,000
So here you can see the current GDP is estimated to be 7%.

419
00:22:21,000 --> 00:22:22,000
Isn't it amazing?

420
00:22:22,000 --> 00:22:27,000
Now you can take any huge data, because at the end of the day you are using a DB,

421
00:22:27,000 --> 00:22:27,000
right?

422
00:22:27,000 --> 00:22:30,000
And finally, if you want to quit, I will just go ahead and write quit.

423
00:22:30,000 --> 00:22:32,000
And it has basically quit, right?

424
00:22:32,000 --> 00:22:35,000
So in short, we have performed each and every step.

425
00:22:35,000 --> 00:22:36,000
Now this is what is happening.

426
00:22:36,000 --> 00:22:39,000
Whenever a human is giving a text query, text embeddings will happen.

427
00:22:39,000 --> 00:22:41,000
And based on that, similarity search will happen.

428
00:22:41,000 --> 00:22:43,000
And then you will get the output, right.

429
00:22:43,000 --> 00:22:47,000
And these are all the steps that we have done, step by step.

430
00:22:47,000 --> 00:22:54,000
Now an amazing shout-out to DataStax Astra DB, because you can also create your own free account

431
00:22:47,000 --> 00:22:54,000
with vector search, which is super important. Just understand, any type of Q&A application

432
00:23:00,000 --> 00:23:02,000
you can definitely develop with the help of vector search.

433
00:23:02,000 --> 00:23:02,000
Right.

434
00:23:02,000 --> 00:23:07,000
And that is where DataStax Astra DB is used, right?

435
00:23:07,000 --> 00:23:10,000
And internally it is specifically using Cassandra DB.