1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:04,000
We are going to continue the discussion with respect to text summarization.

3
00:00:05,000 --> 00:00:12,000
Now in this video we are going to see the most important text summarization techniques that we can use

4
00:00:12,000 --> 00:00:12,000
with LangChain.

5
00:00:13,000 --> 00:00:20,000
The first one is called stuff document chain text summarization.

6
00:00:21,000 --> 00:00:24,000
So we will discuss one by one what exactly it is.

7
00:00:25,000 --> 00:00:25,000
Right.

8
00:00:25,000 --> 00:00:30,000
So first of all, we are going to go ahead with stuff document chain text summarization.

9
00:00:30,000 --> 00:00:40,000
The second one that we use is called the map reduce summarization technique.

10
00:00:43,000 --> 00:00:52,000
And the most useful thing about map reduce is that it is especially important

11
00:00:52,000 --> 00:00:54,000
for larger files.

12
00:00:54,000 --> 00:00:55,000
Okay.

13
00:00:55,000 --> 00:01:01,000
If I have a larger file or a larger data source, I will be able to use this because

14
00:01:01,000 --> 00:01:04,000
let's say one of my PDF files has 100 pages.

15
00:01:04,000 --> 00:01:07,000
So at that point I cannot directly use the stuff document chain.

16
00:01:07,000 --> 00:01:09,000
And what exactly is the stuff document chain?

17
00:01:09,000 --> 00:01:17,000
I will talk about it shortly. Now, in this map reduce summarization technique, we also have two important

18
00:01:17,000 --> 00:01:17,000
types.

19
00:01:17,000 --> 00:01:21,000
One is with the help of a single prompt.

20
00:01:21,000 --> 00:01:23,000
Single prompt template.

21
00:01:27,000 --> 00:01:28,000
Single prompt template.

22
00:01:28,000 --> 00:01:32,000
And the second one is with the help of multiple prompt templates.

23
00:01:32,000 --> 00:01:34,000
We will discuss this.

24
00:01:34,000 --> 00:01:38,000
I will go ahead and probably show you multiple examples also.

25
00:01:38,000 --> 00:01:46,000
And the third technique that you will be seeing as we go ahead is called

26
00:01:46,000 --> 00:01:48,000
refine chain summarization.

27
00:01:49,000 --> 00:01:53,000
Refine chain summarization.

28
00:01:56,000 --> 00:02:00,000
So what we are basically going to do is that we are going to first of all understand one by one, right?

29
00:02:00,000 --> 00:02:05,000
What exactly is the stuff document chain, and what is the problem with the stuff document chain?

30
00:02:05,000 --> 00:02:10,000
And because of that, why do we use MapReduce, and when should we use the refine chain?

31
00:02:10,000 --> 00:02:10,000
Okay.

32
00:02:10,000 --> 00:02:13,000
So we will go ahead and discuss all these things.

33
00:02:14,000 --> 00:02:22,000
So let's go ahead and talk about the first one over here, which is called stuff document chain

34
00:02:22,000 --> 00:02:23,000
summarization.

35
00:02:28,000 --> 00:02:35,000
Now, this is the most basic type of summarization technique, okay?

36
00:02:35,000 --> 00:02:41,000
Let's say that we have our PDF something like this.

37
00:02:41,000 --> 00:02:43,000
This is my external data source.

38
00:02:43,000 --> 00:02:44,000
It can be PDF.

39
00:02:44,000 --> 00:02:45,000
It can be text file.

40
00:02:45,000 --> 00:02:46,000
It can be anything.

41
00:02:46,000 --> 00:02:47,000
It can be a website.

42
00:02:47,000 --> 00:02:48,000
Right.

43
00:02:48,000 --> 00:02:53,000
What do we do with this PDF? Or it can be a web URL also.

44
00:02:54,000 --> 00:02:56,000
First of all, I will take this particular website.

45
00:02:56,000 --> 00:02:58,000
Uh, I will take this particular content.

46
00:02:58,000 --> 00:02:59,000
I will read this entire content.

47
00:02:59,000 --> 00:03:07,000
And then we go ahead and use a prompt template. Let's say in this prompt template, I go ahead and

48
00:03:07,000 --> 00:03:14,000
say, hey, I want you to summarize this entire PDF, this PDF information.

49
00:03:14,000 --> 00:03:21,000
So I give the entire text in the form of documents, let's say inside this prompt template.

50
00:03:21,000 --> 00:03:24,000
And then we pass it along with this prompt template.

51
00:03:24,000 --> 00:03:27,000
We pass it to the LLM model.

52
00:03:27,000 --> 00:03:32,000
And with the help of the LLM model we get the response or the output.

53
00:03:32,000 --> 00:03:34,000
And this is what we did earlier, right?

54
00:03:34,000 --> 00:03:45,000
So over here in the stuff document chain you will be seeing that if this PDF consists of 100 documents

55
00:03:45,000 --> 00:03:51,000
also, or if it is of ten different documents, ten different documents, let's consider it.

56
00:03:51,000 --> 00:03:52,000
Okay.

57
00:03:53,000 --> 00:04:01,000
What happens in the stuff document chain is that these documents all get combined together.

58
00:04:03,000 --> 00:04:06,000
And then it is sent to this particular prompt template okay.

59
00:04:06,000 --> 00:04:10,000
Let's say this entire PDF has ten documents.

60
00:04:10,000 --> 00:04:12,000
This is my doc one.

61
00:04:12,000 --> 00:04:14,000
This is my doc two, right.

62
00:04:14,000 --> 00:04:16,000
This is my doc three.

63
00:04:17,000 --> 00:04:18,000
Doc four.

64
00:04:18,000 --> 00:04:23,000
Like this, the content of all these documents will be combined.

65
00:04:23,000 --> 00:04:24,000
Combined together.

66
00:04:24,000 --> 00:04:27,000
And then it will be sent to the prompt template.

67
00:04:27,000 --> 00:04:28,000
Okay.

68
00:04:28,000 --> 00:04:34,000
And in the prompt template, when I say ten different documents and when they are combined,

69
00:04:34,000 --> 00:04:39,000
we get the text from all these documents, the combined documents, and then

70
00:04:39,000 --> 00:04:42,000
we send it to our prompt template.

71
00:04:42,000 --> 00:04:47,000
So in the prompt template we will have a placeholder called text, right?

72
00:04:48,000 --> 00:04:51,000
This entire combined text will be substituted there.
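As a rough sketch of the stuffing step just described, the combine-then-fill flow could look like this in Python. The prompt wording and the `fake_llm` stand-in are assumptions made for illustration and are not LangChain's API; only the text placeholder comes from the talk.
```python
from typing import Callable, List
# Hypothetical prompt wording; only the {text} placeholder comes from the talk.
PROMPT_TEMPLATE = "Write a concise summary of the following:\n{text}\nSummary:"
def stuff_summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    """Combine every document into one string and make a single LLM call."""
    combined = "\n\n".join(docs)  # all documents are stuffed together
    prompt = PROMPT_TEMPLATE.format(text=combined)  # {text} is replaced by the combined content
    return llm(prompt)
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: reports the prompt size instead of summarizing."""
    return f"summary of {len(prompt)} characters"
docs = ["doc one content", "doc two content", "doc three content"]
result = stuff_summarize(docs, fake_llm)
```
Whatever the number of documents, the model is called exactly once, which is why this approach only works while the combined text stays small.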

73
00:04:51,000 --> 00:04:55,000
Now this is fine, but let's understand what the challenges are.

74
00:04:57,000 --> 00:05:03,000
Now, as for the challenges, if this particular PDF or any document is of a smaller size, then

75
00:05:03,000 --> 00:05:04,000
it is fine.

76
00:05:04,000 --> 00:05:14,000
But if I have a specific PDF, or it can be a website, which has, let's say, more than 1000 documents.

77
00:05:14,000 --> 00:05:15,000
At that point of time.

78
00:05:15,000 --> 00:05:15,000
What happens?

79
00:05:15,000 --> 00:05:22,000
This document becomes very large, becomes very large, right?

80
00:05:22,000 --> 00:05:24,000
Very large or very big.

81
00:05:24,000 --> 00:05:26,000
I can basically say it becomes very big.

82
00:05:26,000 --> 00:05:32,000
Now when it becomes very big, obviously it will not be possible to send it directly to the prompt template

83
00:05:32,000 --> 00:05:38,000
along with the LLM model, because there is a limitation on the context size, right?

84
00:05:39,000 --> 00:05:40,000
Context size.

85
00:05:40,000 --> 00:05:48,000
So let's say in the case of GPT-3.5, if I use this, there is a context size of 4096 tokens which I can

86
00:05:48,000 --> 00:05:49,000
actually send.

87
00:05:49,000 --> 00:05:49,000
Right.

88
00:05:49,000 --> 00:05:51,000
There is a limitation with respect to this.

89
00:05:51,000 --> 00:05:54,000
So it will not work that well, right?

90
00:05:54,000 --> 00:05:59,000
If we directly send all the documents to the prompt template and then to the LLM model.
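To make the context-size limitation concrete, here is a minimal sketch of a check that decides between the two approaches. The 4096 figure is the one quoted in the talk; the four-characters-per-token estimate and the `reserved_for_output` budget are crude assumptions, and a real tokenizer would give different counts.
```python
CONTEXT_LIMIT_TOKENS = 4096  # the GPT-3.5 figure quoted in the talk
def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token; a real tokenizer will differ."""
    return max(1, len(text) // 4)
def fits_in_context(docs, reserved_for_output: int = 256) -> bool:
    """True if the combined documents plausibly fit alongside the model's reply."""
    combined = "\n\n".join(docs)
    return estimate_tokens(combined) + reserved_for_output <= CONTEXT_LIMIT_TOKENS
small = ["short page"] * 10        # ten tiny documents
large = ["x" * 2000] * 1000        # a thousand larger documents
strategy = "stuff" if fits_in_context(large) else "map_reduce"
```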

91
00:05:59,000 --> 00:06:05,000
So considering this, what we do is use the other technique, which is called the map reduce summarization

92
00:06:05,000 --> 00:06:06,000
technique.

93
00:06:06,000 --> 00:06:10,000
Now, in the map reduce summarization technique, what exactly happens?

94
00:06:10,000 --> 00:06:18,000
Okay, so for the second one, we will talk about the MapReduce summarization technique.

95
00:06:20,000 --> 00:06:29,000
Now, in the MapReduce summarization technique, let's say I have a document, okay? All these are my documents.

96
00:06:29,000 --> 00:06:34,000
Let's consider this. We do not combine everything and give it to the prompt template at once.

97
00:06:34,000 --> 00:06:37,000
First of all, we will divide this into chunks.

98
00:06:37,000 --> 00:06:39,000
This can be a smaller chunk.

99
00:06:39,000 --> 00:06:40,000
This can be a smaller chunk.

100
00:06:41,000 --> 00:06:43,000
This can be a smaller chunk okay.

101
00:06:43,000 --> 00:06:45,000
As the name suggests MapReduce okay.

102
00:06:46,000 --> 00:06:47,000
This can be a smaller chunk.

103
00:06:47,000 --> 00:06:49,000
So all of these will be smaller chunks.

104
00:06:49,000 --> 00:06:56,000
Then what we do is pass each smaller chunk to a prompt template.

105
00:06:57,000 --> 00:06:58,000
Right.

106
00:06:58,000 --> 00:07:00,000
We pass it to a prompt template over here.

107
00:07:02,000 --> 00:07:03,000
Right.

108
00:07:03,000 --> 00:07:05,000
And we get the summary.

109
00:07:07,000 --> 00:07:12,000
Along with the prompt template we'll also go ahead and pass it to our LLM model.

110
00:07:14,000 --> 00:07:14,000
Okay.

111
00:07:14,000 --> 00:07:17,000
We will go ahead and pass it to our LLM model.

112
00:07:17,000 --> 00:07:20,000
And here you will be able to see that I will get a summary one.

113
00:07:21,000 --> 00:07:24,000
Similarly I will get summary two for this.

114
00:07:25,000 --> 00:07:27,000
Then I will get summary three.

115
00:07:28,000 --> 00:07:30,000
Then I will get summary four.

116
00:07:31,000 --> 00:07:39,000
So after getting all these summaries, we finally combine them and get our final summary.

117
00:07:41,000 --> 00:07:48,000
This is what happens with the help of MapReduce. Okay, now let me just explain it once again.

118
00:07:48,000 --> 00:07:50,000
Let's say if I have a document, what will happen?

119
00:07:50,000 --> 00:07:53,000
First of all, we will divide this into smaller chunks.

120
00:07:54,000 --> 00:07:58,000
Then we will pass every chunk to the prompt template along with the

121
00:07:58,000 --> 00:08:02,000
LLM to get summary one, summary two, summary three, summary four.

122
00:08:02,000 --> 00:08:07,000
And we finally combine all these summaries to get the final summary, okay?

123
00:08:07,000 --> 00:08:10,000
That is exactly what we do.

124
00:08:10,000 --> 00:08:11,000
Right?

125
00:08:11,000 --> 00:08:15,000
And this final summary will be the summary of this entire document that we are going to get.

126
00:08:15,000 --> 00:08:15,000
Okay.

127
00:08:16,000 --> 00:08:23,000
Now, see, I told you already, right, over here in MapReduce summarization I have

128
00:08:23,000 --> 00:08:25,000
one option of single prompt template.

129
00:08:25,000 --> 00:08:28,000
I have another option of multiple prompt templates.

130
00:08:28,000 --> 00:08:30,000
Now let me just go ahead and explain it okay.

131
00:08:31,000 --> 00:08:35,000
So here you will be able to see the one prompt template that I have applied.

132
00:08:35,000 --> 00:08:35,000
Right.

133
00:08:35,000 --> 00:08:39,000
So this prompt template will be the same for all the chunks that we have.

134
00:08:39,000 --> 00:08:39,000
Right.

135
00:08:39,000 --> 00:08:41,000
And we will get all these summaries.

136
00:08:41,000 --> 00:08:47,000
Then I can take all the summaries and finally apply another prompt template,

137
00:08:48,000 --> 00:08:54,000
another prompt template to get the final summary okay.

138
00:08:54,000 --> 00:08:56,000
See, I can also use this same prompt over here.

139
00:08:56,000 --> 00:08:59,000
But let's say from this summary I want more information.

140
00:08:59,000 --> 00:09:02,000
Let's say I want the title of the entire page over there.

141
00:09:03,000 --> 00:09:03,000
Right.

142
00:09:03,000 --> 00:09:05,000
I want the title from this entire summary.

143
00:09:05,000 --> 00:09:10,000
I want to say that, hey, based on this entire summary, please try to find out the motivational quotes

144
00:09:10,000 --> 00:09:11,000
from it, right.

145
00:09:11,000 --> 00:09:15,000
So that can be another prompt template with another prompt message.

146
00:09:15,000 --> 00:09:18,000
Okay, so here I will be having another prompt message.

147
00:09:19,000 --> 00:09:23,000
Right here I will be having another prompt message.

148
00:09:25,000 --> 00:09:27,000
Another prompt message.

149
00:09:27,000 --> 00:09:28,000
Right.

150
00:09:28,000 --> 00:09:30,000
So this is basically called MapReduce.

151
00:09:30,000 --> 00:09:32,000
Why do we call this MapReduce?

152
00:09:32,000 --> 00:09:36,000
Because, see, initially we are mapping these documents into chunks.

153
00:09:36,000 --> 00:09:39,000
We are doing an individual task on each chunk.

154
00:09:39,000 --> 00:09:41,000
I'm getting the summary of each individual chunk.

155
00:09:41,000 --> 00:09:46,000
It's just like mapping over the content and then reducing the results.

156
00:09:46,000 --> 00:09:46,000
Right.

157
00:09:46,000 --> 00:09:51,000
So here when I get summary one, summary two, summary three, summary four, these are summaries with

158
00:09:51,000 --> 00:09:52,000
respect to all these particular chunks.

159
00:09:52,000 --> 00:09:58,000
And then finally we combine all the summaries with another prompt template, which has a different

160
00:09:58,000 --> 00:10:02,000
prompt message, and get the final summary from the LLM model.
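The map and reduce steps described above can be sketched in plain Python as follows. Everything here is illustrative: the prompt wordings, the fixed-width `split_into_chunks` helper, and the `fake_llm` stand-in are assumptions, not LangChain's API. Leaving `reduce_prompt` unset reuses the same template for the final step (the single prompt option), while passing a different one gives the multiple prompt option.
```python
from typing import Callable, List, Optional
# Hypothetical prompt wordings, chosen only for illustration.
MAP_PROMPT = "Summarize the following text:\n{text}"
TITLE_PROMPT = "Give a title and a final summary for these partial summaries:\n{text}"
def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Naive fixed-width split; real splitters usually respect sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
def map_reduce_summarize(text: str, llm: Callable[[str], str], chunk_size: int = 1000,
                         map_prompt: str = MAP_PROMPT, reduce_prompt: Optional[str] = None) -> str:
    """Summarize each chunk separately, then summarize the joined partial summaries."""
    reduce_prompt = reduce_prompt or map_prompt  # single prompt option when unset
    chunks = split_into_chunks(text, chunk_size)
    # Map step: one LLM call per chunk gives summary one, summary two, ...
    partial = [llm(map_prompt.format(text=chunk)) for chunk in chunks]
    # Reduce step: one more call over the combined partial summaries.
    return llm(reduce_prompt.format(text="\n".join(partial)))
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the first 20 characters of the last prompt line."""
    return prompt.splitlines()[-1][:20]
single = map_reduce_summarize("a" * 2500, fake_llm)
multiple = map_reduce_summarize("a" * 2500, fake_llm, reduce_prompt=TITLE_PROMPT)
```
With 2500 characters and a chunk size of 1000 this makes three map calls and one reduce call, so no single prompt ever has to hold the whole document.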

161
00:10:02,000 --> 00:10:02,000
Right.

162
00:10:03,000 --> 00:10:05,000
So this way we will be able to handle this challenge.

163
00:10:05,000 --> 00:10:10,000
We will be able to solve this problem, right, where the document is very big and you have a limitation

164
00:10:10,000 --> 00:10:11,000
of the context size.

165
00:10:11,000 --> 00:10:11,000
Right?

166
00:10:12,000 --> 00:10:15,000
So, uh, let me just go ahead and show you the documentation also over here.

167
00:10:15,000 --> 00:10:16,000
Right.

168
00:10:16,000 --> 00:10:21,000
So here is one useful diagram which you can refer to.

169
00:10:21,000 --> 00:10:23,000
See, these are my docs, right?

170
00:10:24,000 --> 00:10:25,000
Please focus over here.

171
00:10:25,000 --> 00:10:25,000
Okay.

172
00:10:25,000 --> 00:10:28,000
Uh, let me just go ahead and use this.

173
00:10:31,000 --> 00:10:33,000
I will copy this entire thing.

174
00:10:33,000 --> 00:10:33,000
Okay?

175
00:10:33,000 --> 00:10:35,000
And I'll paste it over here.

176
00:10:36,000 --> 00:10:38,000
These two techniques, what we have discussed.

177
00:10:38,000 --> 00:10:38,000
Right.

178
00:10:38,000 --> 00:10:42,000
It will be over here so that you will also have all these materials.

179
00:10:43,000 --> 00:10:43,000
Right.

180
00:10:44,000 --> 00:10:46,000
So over here let's focus on this diagram.

181
00:10:46,000 --> 00:10:51,000
What did we basically do? We have these documents, okay?

182
00:10:52,000 --> 00:10:58,000
Here you can see that if it fits in the LLM context window, then what do we do? We directly make a prompt.

183
00:10:58,000 --> 00:11:00,000
Then we give it to the LLM model.

184
00:11:00,000 --> 00:11:04,000
So this is basically done by using stuff where we combine all the documents.

185
00:11:04,000 --> 00:11:07,000
And then we are going to get the final summary as an output.

186
00:11:07,000 --> 00:11:12,000
If it does not fit in the LLM context window, please see this, okay?

187
00:11:12,000 --> 00:11:17,000
So what we'll do is create another prompt which says summarize the themes in a group of documents.

188
00:11:17,000 --> 00:11:17,000
Right.

189
00:11:17,000 --> 00:11:21,000
So here are my document one, document two, document three, the chunks of documents.

190
00:11:21,000 --> 00:11:25,000
So first of all I will go ahead and create chunks of documents over here.

191
00:11:26,000 --> 00:11:26,000
Okay.

192
00:11:26,000 --> 00:11:28,000
Then pass it to the model.

193
00:11:28,000 --> 00:11:29,000
Get the summary.

194
00:11:29,000 --> 00:11:30,000
So these are all my summaries.

195
00:11:31,000 --> 00:11:34,000
Then for these summaries I will take another prompt template.

196
00:11:34,000 --> 00:11:36,000
Extract a final summary of the list.

197
00:11:36,000 --> 00:11:37,000
I will pass it to the model.

198
00:11:37,000 --> 00:11:39,000
And finally I will get the summary.

199
00:11:39,000 --> 00:11:44,000
So this is basically the map reduce technique.

200
00:11:44,000 --> 00:11:47,000
Okay so these are the two types.

201
00:11:47,000 --> 00:11:49,000
For the first type, I have discussed stuff.

202
00:11:49,000 --> 00:11:51,000
And for the second type, we have discussed MapReduce.

203
00:11:51,000 --> 00:11:54,000
Now let's go ahead and quickly see the implementation.

204
00:11:54,000 --> 00:11:57,000
And obviously you need to see the implementation.

205
00:11:58,000 --> 00:11:59,000
Um the implementation part.

206
00:11:59,000 --> 00:12:05,000
I will not make the video long, so I will try to show you with some examples

207
00:12:05,000 --> 00:12:06,000
in the next video.

208
00:12:06,000 --> 00:12:12,000
But I hope everybody is able to understand the difference between stuff and MapReduce.

209
00:12:12,000 --> 00:12:13,000
Again, let me repeat it.

210
00:12:13,000 --> 00:12:18,000
If it fits in the LLM context window, I will directly give all the documents along with the

211
00:12:18,000 --> 00:12:21,000
prompt, go to the LLM and get the final summary.

212
00:12:21,000 --> 00:12:24,000
If it does not fit in the LLM context window, then what do we do?

213
00:12:24,000 --> 00:12:26,000
We divide that into chunks.

214
00:12:26,000 --> 00:12:28,000
We get multiple summaries and then we combine them with another prompt.

215
00:12:28,000 --> 00:12:31,000
And then we get the final summary.

216
00:12:31,000 --> 00:12:35,000
So yes, in the next video we will go ahead and see the practical implementation.

217
00:12:35,000 --> 00:12:36,000
Thank you.

