1
00:00:00,000 --> 00:00:00,000
Hello guys.

2
00:00:00,000 --> 00:00:03,000
So we are going to continue the discussion with respect to text splitters.

3
00:00:03,000 --> 00:00:07,000
Now we are going to see how we can split by HTML header.

4
00:00:07,000 --> 00:00:08,000
Okay.

5
00:00:08,000 --> 00:00:12,000
So there is an amazing class again, which is called HTMLHeaderTextSplitter.

6
00:00:12,000 --> 00:00:19,000
It is a structure-aware chunker that splits text at the HTML element level and adds metadata for

7
00:00:19,000 --> 00:00:22,000
each header relevant to any given chunk.

8
00:00:22,000 --> 00:00:27,000
It can return chunks element by element, or combine elements with the same metadata, with the objective

9
00:00:27,000 --> 00:00:34,000
of keeping related text grouped more or less semantically and preserving the context-rich information

10
00:00:34,000 --> 00:00:36,000
encoded in the document structure.
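
To make that description concrete, here is a minimal sketch of header-aware chunking written with only the Python standard library. It is not LangChain's implementation; the names `HeaderChunker` and `chunk_by_headers` are hypothetical and exist only for this illustration. It shows the two ideas from the description: text is cut whenever a chosen header tag appears, and each chunk carries the headers that are in scope as metadata.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker (illustration only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)        # e.g. {"h1": "Header 1"}
        self.levels = [tag for tag, _ in headers_to_split_on]
        self.active = {}         # header metadata currently in scope
        self.in_header = None    # tag name while we are inside a header element
        self.header_text = []    # text pieces of the header being read
        self.buffer = []         # body text collected under the current headers
        self.chunks = []         # list of (text, metadata) pairs

    def _flush(self):
        # Normalise whitespace and store the chunk with a copy of the metadata.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            self.chunks.append((text, dict(self.active)))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self._flush()            # a header closes the previous chunk
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # A new header replaces itself and every deeper header level.
            idx = self.levels.index(tag)
            for deeper in self.levels[idx:]:
                self.active.pop(self.names[deeper], None)
            self.active[self.names[tag]] = "".join(self.header_text).strip()
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_text.append(data)
        else:
            self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()                # emit whatever text is left at the end


def chunk_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>Details.</p>"
for text, meta in chunk_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(meta, text)
```

The real splitter handles far more cases (nesting, attributes, malformed markup); this sketch only shows why each chunk ends up with header metadata attached.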

11
00:00:36,000 --> 00:00:42,000
Okay, so yes, I know you will not be able to get much from this description alone,

12
00:00:42,000 --> 00:00:45,000
but don't worry, I will just go ahead and show you this particular code.

13
00:00:45,000 --> 00:00:46,000
Okay.

14
00:00:46,000 --> 00:00:52,000
So first of all, I will write from langchain_text_splitters.

15
00:00:52,000 --> 00:00:52,000
Okay.

16
00:00:52,000 --> 00:00:58,000
We are going to import HTMLHeaderTextSplitter, okay.

17
00:00:58,000 --> 00:01:02,000
Now after this what we are going to do is that we'll take one HTML string okay.

18
00:01:02,000 --> 00:01:05,000
So let's go ahead and copy this HTML string.

19
00:01:05,000 --> 00:01:09,000
So here I'll be putting this information.

20
00:01:09,000 --> 00:01:15,000
So here you can see doctype HTML, right, body, Foo, some intro about Foo, Bar main section.

21
00:01:15,000 --> 00:01:19,000
Some tags we have given, like header tag, p tag, div tag, right.

22
00:01:19,000 --> 00:01:21,000
So different tags are there.

23
00:01:21,000 --> 00:01:27,000
Even the break tag, br, is used right over here, and the paragraph is there.

24
00:01:27,000 --> 00:01:28,000
Again we are closing the body.

25
00:01:28,000 --> 00:01:29,000
We are closing all the divs and HTML.

26
00:01:29,000 --> 00:01:30,000
Okay.

27
00:01:30,000 --> 00:01:34,000
So let's consider that this is my entire HTML string okay.

28
00:01:34,000 --> 00:01:41,000
Now what I'm going to do over here is that I will just go ahead and define on which

29
00:01:41,000 --> 00:01:44,000
all headers I need to split, right?

30
00:01:44,000 --> 00:01:51,000
So I'll say, hey, headers_to_split_on, okay.

31
00:01:51,000 --> 00:02:00,000
And this will basically be my list of headers to split on. Here I'm going to write the h1 header, okay.

32
00:02:00,000 --> 00:02:03,000
So let's say this will basically be my header one okay.

33
00:02:04,000 --> 00:02:07,000
Header one I'm just renaming it over here.

34
00:02:07,000 --> 00:02:08,000
Something like this.

35
00:02:08,000 --> 00:02:10,000
So this will be my H1 tag.

36
00:02:10,000 --> 00:02:11,000
Whatever H1 tag is there.

37
00:02:12,000 --> 00:02:18,000
Then let's go ahead and write, okay, I also want to split on my h2 tag.

38
00:02:18,000 --> 00:02:19,000
So I will just go ahead and write over here.

39
00:02:19,000 --> 00:02:25,000
So this will be my Header 2, okay. I'm giving some custom name over here.

40
00:02:25,000 --> 00:02:26,000
Header two.

41
00:02:26,000 --> 00:02:29,000
And finally, I'll also go ahead and use h3.

42
00:02:29,000 --> 00:02:31,000
Let's say h3 tag.

43
00:02:32,000 --> 00:02:34,000
And this is nothing but my header three.

44
00:02:35,000 --> 00:02:40,000
Okay, so these are the header tags that I'm using to split, okay.
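
Written out, the list described above could look like the following sketch. Each entry pairs an HTML tag with the custom name that will appear as a metadata key; the exact names "Header 1", "Header 2" and "Header 3" are assumed from the narration. The helper variable `tag_to_key` is hypothetical and only there to show the mapping.

```python
# Each tuple is (HTML tag to split on, custom metadata name for that tag).
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# Hypothetical helper for illustration: look up the metadata name for a tag.
tag_to_key = dict(headers_to_split_on)
print(tag_to_key["h2"])
```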

45
00:02:40,000 --> 00:02:45,000
Now in order to use these tags and do the splitting, what I will do is go ahead and write

46
00:02:45,000 --> 00:02:51,000
html_splitter is equal to HTMLHeaderTextSplitter.

47
00:02:51,000 --> 00:02:56,000
And here I'm going to basically pass this headers_to_split_on, okay.

48
00:02:56,000 --> 00:02:58,000
So this is the parameter also.

49
00:02:58,000 --> 00:03:01,000
So you can just go ahead and assign this parameter over here.

50
00:03:01,000 --> 00:03:04,000
Also, headers_to_split_on is equal to this, okay.

51
00:03:05,000 --> 00:03:08,000
Otherwise, I know I am going to use that.

52
00:03:08,000 --> 00:03:09,000
So I'm just going to use this okay.

53
00:03:10,000 --> 00:03:12,000
Now this is fine.

54
00:03:12,000 --> 00:03:20,000
Now what we are going to do, we are going to basically write html_splitter dot

55
00:03:20,000 --> 00:03:20,000
split_text.

56
00:03:20,000 --> 00:03:24,000
So we are going to use split_text.

57
00:03:24,000 --> 00:03:26,000
Or there are also other options.

58
00:03:26,000 --> 00:03:27,000
Uh yeah.

59
00:03:27,000 --> 00:03:30,000
split_text is what we are going to use over here.

60
00:03:30,000 --> 00:03:32,000
And here I'm going to give my HTML string.

61
00:03:33,000 --> 00:03:40,000
Finally, I will save all these things into my variable, which is called

62
00:03:40,000 --> 00:03:41,000
html_header_splits.

63
00:03:42,000 --> 00:03:42,000
Okay.

64
00:03:43,000 --> 00:03:48,000
So now I can go ahead and write html_header_splits.

65
00:03:48,000 --> 00:03:50,000
And let's go ahead and display this.
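
Put together, the steps narrated so far can be sketched as below. The HTML string here is a stand-in reconstructed from the narration (the exact string used in the video is not in the transcript), so treat its contents as an assumption. The sketch also assumes the `langchain-text-splitters` package and its HTML parsing dependency are installed; if they are not, it prints a note instead of failing.

```python
# Stand-in HTML, reconstructed from the narration (an assumption).
html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

# (tag, metadata name) pairs, as described in the narration.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_header_splits = []
try:
    from langchain_text_splitters import HTMLHeaderTextSplitter

    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    # split_text returns a list of Document objects.
    html_header_splits = html_splitter.split_text(html_string)
except ImportError:
    print("langchain-text-splitters (or its HTML parser dependency) is not installed")

for doc in html_header_splits:
    # Each Document has page_content (the text) and metadata (the headers in scope).
    print(doc.metadata, "->", doc.page_content[:60])
```

The exact chunk boundaries depend on the installed version of the library, so no expected output is shown here.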

66
00:03:50,000 --> 00:03:56,000
So once I see this, here you can see I've used these three tags: h1, h2 and h3.

67
00:03:56,000 --> 00:03:59,000
So let's see, Foo was there in the h1 tag.

68
00:03:59,000 --> 00:03:59,000
Right.

69
00:03:59,000 --> 00:04:03,000
So based on this h1 tag I was able to get another Document, right.

70
00:04:03,000 --> 00:04:07,000
Some intro about Foo, which is present in the p tag.

71
00:04:07,000 --> 00:04:08,000
Right.

72
00:04:08,000 --> 00:04:08,000
Some intro about foo.

73
00:04:08,000 --> 00:04:11,000
So this has also become another chunk over here.

74
00:04:11,000 --> 00:04:11,000
Right.

75
00:04:12,000 --> 00:04:17,000
Header 1 in the metadata you can see over here: Foo. Then Bar main section.

76
00:04:17,000 --> 00:04:20,000
Then you can see something about the first subject.

77
00:04:20,000 --> 00:04:21,000
Something was written.

78
00:04:21,000 --> 00:04:22,000
Right.

79
00:04:22,000 --> 00:04:26,000
So here again, with the h3 tag, another division is made.

80
00:04:26,000 --> 00:04:26,000
Right.

81
00:04:26,000 --> 00:04:32,000
So here you can see all the chunks have been created based on the headers that we have actually

82
00:04:32,000 --> 00:04:36,000
used, and metadata information has also been included over here.

83
00:04:37,000 --> 00:04:40,000
Based on the tags, we have split this entire content.

84
00:04:40,000 --> 00:04:44,000
And finally, you can see that when we are splitting this text, I am able to get the output in the form of

85
00:04:44,000 --> 00:04:45,000
documents.

86
00:04:45,000 --> 00:04:45,000
Okay.

87
00:04:45,000 --> 00:04:47,000
So this can be handy.

88
00:04:47,000 --> 00:04:52,000
Again, it is up to you how you want to use it and where you really want to

89
00:04:52,000 --> 00:04:53,000
use it.

90
00:04:53,000 --> 00:04:53,000
Okay.

91
00:04:54,000 --> 00:04:58,000
Not only this, let's say that I have some kind of URL.

92
00:04:58,000 --> 00:05:02,000
Okay, so let's pick up some URL over here.

93
00:05:02,000 --> 00:05:06,000
So I will just open my browser.

94
00:05:07,000 --> 00:05:08,000
Let's see.

95
00:05:08,000 --> 00:05:09,000
So this is my tag.

96
00:05:09,000 --> 00:05:10,000
Let's see.

97
00:05:10,000 --> 00:05:10,000
Okay.

98
00:05:10,000 --> 00:05:14,000
Let me just go ahead and open my browser over here okay.

99
00:05:14,000 --> 00:05:17,000
And with respect to this particular browser.

100
00:05:18,000 --> 00:05:23,000
So this will basically be my browser here I'm going to take one URL.

101
00:05:23,000 --> 00:05:26,000
Let's say this is my URL over here.

102
00:05:26,000 --> 00:05:29,000
I'll just copy and paste it over here.

103
00:05:29,000 --> 00:05:36,000
So once I load this particular URL, this URL provides some information over here.

104
00:05:36,000 --> 00:05:36,000
Right.

105
00:05:36,000 --> 00:05:38,000
So here you can see some information is there.

106
00:05:38,000 --> 00:05:42,000
And if you go ahead and inspect all these elements there will be some kind of header tag.

107
00:05:42,000 --> 00:05:46,000
There should be something, right: an h1 tag, h2 tag and h3 tag.

108
00:05:46,000 --> 00:05:49,000
So what I will do is copy this particular URL.

109
00:05:49,000 --> 00:05:51,000
Go back to my code okay.

110
00:05:51,000 --> 00:05:56,000
And let's take this URL, and I'm saying, hey, headers_to_split_on.

111
00:05:56,000 --> 00:06:02,000
And then I will try to split with that same configuration, whatever I have written over here.

112
00:06:02,000 --> 00:06:02,000
Right.

113
00:06:02,000 --> 00:06:05,000
So I will go ahead and say hey, split this particular URL.

114
00:06:05,000 --> 00:06:08,000
And finally I will be also able to see the splits.

115
00:06:08,000 --> 00:06:11,000
So if I go ahead and write html_header_splits.

116
00:06:11,000 --> 00:06:15,000
So here you'll be able to see it will take some time because it will be a bigger website.

117
00:06:15,000 --> 00:06:18,000
And here are all my documents that have got separated.
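
The URL variant can be sketched as below. The narration does not name the method, so using `split_text_from_url` here is an assumption about what the video calls; the wrapper `split_url_by_headers` is a hypothetical helper for this sketch, and the URL from the video is not in the transcript, so none is filled in. The function is only defined, not called, because calling it needs network access and the `langchain-text-splitters` package.

```python
# Same (tag, metadata name) pairs as used for the HTML string.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]


def split_url_by_headers(url, headers=headers_to_split_on):
    """Fetch a page and split it on the given header tags (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the package.
    from langchain_text_splitters import HTMLHeaderTextSplitter

    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers)
    # Downloads the page and returns a list of Document objects.
    return html_splitter.split_text_from_url(url)


# Usage, with your own URL in place of my_url:
#   html_header_splits = split_url_by_headers(my_url)
#   html_header_splits[:5]
```

For a large page this can take a little while, since the whole document is downloaded and parsed before any chunk is returned.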

118
00:06:18,000 --> 00:06:19,000
Right.

119
00:06:19,000 --> 00:06:24,000
So this can be really, really handy wherever you specifically want to use it okay.

120
00:06:24,000 --> 00:06:27,000
Whenever you have something like HTML and.

121
00:06:27,000 --> 00:06:27,000
All right.

122
00:06:28,000 --> 00:06:31,000
So I hope you like this particular video.

123
00:06:31,000 --> 00:06:32,000
Uh, yes.

124
00:06:32,000 --> 00:06:32,000
This was it.

125
00:06:32,000 --> 00:06:38,000
Again, there are many, many techniques with respect to character text splitting,

126
00:06:38,000 --> 00:06:40,000
which you should keep on exploring.

127
00:06:40,000 --> 00:06:44,000
Also, make sure that you always check the LangChain documentation to see what new techniques

128
00:06:44,000 --> 00:06:46,000
are specifically coming.

129
00:06:46,000 --> 00:06:48,000
So I hope you like this particular video.

130
00:06:48,000 --> 00:06:52,000
Again, guys, we are still in this particular phase.

131
00:06:52,000 --> 00:06:53,000
That is the second one.

132
00:06:53,000 --> 00:06:55,000
Now we are going to go into the third one.

133
00:06:55,000 --> 00:06:56,000
Right?

134
00:06:56,000 --> 00:06:58,000
That is nothing but the embedding techniques.

135
00:06:58,000 --> 00:07:00,000
So yes, I will see you all in the next video.

136
00:07:00,000 --> 00:07:01,000
Thank you.

