1
00:00:00,000 --> 00:00:01,000
Hello guys.

2
00:00:01,000 --> 00:00:04,000
So we are going to continue our discussion of text splitting techniques.

3
00:00:04,000 --> 00:00:08,000
In our previous video we looked at the recursive character text splitter.

4
00:00:08,000 --> 00:00:10,000
We saw all the code over here.

5
00:00:10,000 --> 00:00:14,000
I forgot to talk about these specific points, which I have mentioned again over here.

6
00:00:14,000 --> 00:00:15,000
The text.

7
00:00:15,000 --> 00:00:20,000
This text splitter, that is, the recursive character text splitter, is the recommended one for generic

8
00:00:20,000 --> 00:00:21,000
text okay.

9
00:00:21,000 --> 00:00:24,000
It is parameterized by a list of characters.

10
00:00:24,000 --> 00:00:28,000
It tries to split on them in order until the chunks are small enough.

11
00:00:28,000 --> 00:00:33,000
Okay, the default list, that basically means they use this specific list, like "\n\n", then "\n", a

12
00:00:33,000 --> 00:00:36,000
new line, and then a blank space, right?

13
00:00:36,000 --> 00:00:38,000
They use these characters to do the splitting.

14
00:00:38,000 --> 00:00:43,000
Okay, this has the effect of trying to keep all the paragraphs and then sentences and then words together

15
00:00:43,000 --> 00:00:44,000
as long as possible.
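To make the "try each separator in order" idea concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the names merge and recursive_split are hypothetical, and chunk overlap is left out to keep it short.
```python
def merge(pieces, sep, chunk_size):
    """Greedily join neighbouring pieces while the result fits in chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # adding this piece would overflow: flush
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
#
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    """Split on the first separator; retry oversized pieces with the next one."""
    sep, rest = separators[0], separators[1:]
    raw = list(text) if sep == "" else text.split(sep)
    pieces = [p for p in raw if p]
    chunks, small = [], []
    for piece in pieces:
        if len(piece) <= chunk_size or not rest:
            small.append(piece)  # fits (or no finer separator is left)
        else:
            chunks.extend(merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, rest, chunk_size))
    chunks.extend(merge(small, sep, chunk_size))
    return chunks
#
words = " ".join(["word"] * 30)  # one long "paragraph" with no newlines
chunks = recursive_split("para one is short\n\n" + words, chunk_size=50)
print([len(c) for c in chunks])  # [17, 49, 49, 49]
```
The short paragraph stays whole, while the long one falls back to splitting on spaces and is then packed back up to the size limit.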

16
00:00:44,000 --> 00:00:49,000
Okay, there are also two important things: how the text is split, by the list of characters, which is

17
00:00:49,000 --> 00:00:54,000
used over here, and then how the chunk size is measured, by the number of characters.

18
00:00:54,000 --> 00:00:54,000
Okay.

19
00:00:55,000 --> 00:00:58,000
Similarly, in this video we are going to discuss the character text splitter.

20
00:00:58,000 --> 00:00:59,000
Okay.

21
00:00:59,000 --> 00:01:03,000
So this is another way: this one splits based on a given character sequence.

22
00:01:03,000 --> 00:01:04,000
Okay.

23
00:01:05,000 --> 00:01:10,000
Right now the default separator is "\n\n". Over there, in the list of characters, you had "\n\n",

24
00:01:10,000 --> 00:01:11,000
"\n" and blank.

25
00:01:11,000 --> 00:01:14,000
Okay, here you have "\n\n".

26
00:01:14,000 --> 00:01:15,000
Similarly here also you have two points.

27
00:01:15,000 --> 00:01:18,000
How the text is split: by a single character separator.

28
00:01:18,000 --> 00:01:22,000
This single character separator is the one we will be providing.

29
00:01:22,000 --> 00:01:22,000
Okay.

30
00:01:22,000 --> 00:01:25,000
How the chunk size is measured: by the number of characters.

31
00:01:25,000 --> 00:01:28,000
So let me go ahead and show you this example also.

32
00:01:28,000 --> 00:01:35,000
So here you can see that I have this TextLoader, okay, and I'm reading this speech.txt, okay.

33
00:01:35,000 --> 00:01:39,000
And again it is in the same folder location that I have actually created over here in the data transformer.

34
00:01:39,000 --> 00:01:40,000
Okay.

35
00:01:40,000 --> 00:01:46,000
So once I go ahead and execute this, you'll be able to see that this is my entire document page, okay.

36
00:01:46,000 --> 00:01:50,000
Just by using loader.load(), you'll be able to get the documents, which we have seen in our previous

37
00:01:50,000 --> 00:01:51,000
video.

38
00:01:51,000 --> 00:01:56,000
Now in order to import it I will be using from langchain_text_splitters

39
00:01:56,000 --> 00:01:58,000
import CharacterTextSplitter.

40
00:01:58,000 --> 00:02:01,000
Before, we imported RecursiveCharacterTextSplitter, okay.

41
00:02:02,000 --> 00:02:05,000
Then we'll go ahead and initialize this CharacterTextSplitter.

42
00:02:05,000 --> 00:02:08,000
The first parameter that we need to give is a separator.

43
00:02:08,000 --> 00:02:09,000
So here I'm giving new line,

44
00:02:09,000 --> 00:02:11,000
new line, that is "\n\n", as the separator.

45
00:02:11,000 --> 00:02:17,000
You can also give blank if you want, okay. Then you have chunk size: I have given a chunk size of 100 and a chunk

46
00:02:18,000 --> 00:02:18,000
overlap of 20.

47
00:02:18,000 --> 00:02:22,000
Okay, then we are just splitting all of these documents with the help of this splitter.

48
00:02:22,000 --> 00:02:26,000
So here you'll be able to see, uh, these are my documents.

49
00:02:26,000 --> 00:02:27,000
The world must be safe for this.

50
00:02:27,000 --> 00:02:30,000
And all these documents will be over here.

51
00:02:30,000 --> 00:02:33,000
Okay, now let's focus on this.

52
00:02:33,000 --> 00:02:38,000
It says that it created a chunk of size 470, which is longer than the specified 100.

53
00:02:38,000 --> 00:02:38,000
Okay.

54
00:02:38,000 --> 00:02:41,000
Why might it have done this?

55
00:02:41,000 --> 00:02:44,000
The reason is very simple: it was not able to find the separator within that stretch of text.

56
00:02:44,000 --> 00:02:44,000
Okay.

57
00:02:44,000 --> 00:02:45,000
So that is the reason.

58
00:02:45,000 --> 00:02:49,000
What it has done is that wherever it was able to find the separator, based on that it has basically

59
00:02:49,000 --> 00:02:51,000
created the chunks.
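A small pure-Python sketch of why a single-separator splitter can return a chunk longer than the requested size. This is a simplified illustration, not LangChain's code: split_by_separator is a hypothetical name and chunk overlap is ignored.
```python
def split_by_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator only, then merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # flush before the chunk would overflow
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
#
# The first paragraph has 150 characters and contains no "\n\n",
# so there is nowhere to cut it and it stays as one oversized chunk.
text = "a" * 150 + "\n\n" + "b" * 30 + "\n\n" + "c" * 30
chunks = split_by_separator(text, chunk_size=100)
print([len(c) for c in chunks])  # [150, 62]
```
The two short paragraphs are merged into one chunk of 62 characters, but the 150-character paragraph cannot be reduced because the only allowed cut point never occurs inside it.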

60
00:02:51,000 --> 00:02:51,000
Okay.

61
00:02:51,000 --> 00:02:53,000
Now this is very simple okay.

62
00:02:53,000 --> 00:02:57,000
Now coming to the next one, again we will try to open speech.txt.

63
00:02:57,000 --> 00:02:59,000
We are again using this character text splitter.

64
00:02:59,000 --> 00:03:02,000
By default, if you are not giving a separator, then it will be "\n\n".

65
00:03:02,000 --> 00:03:05,000
And here we are saying chunk size is equal to 100.

66
00:03:05,000 --> 00:03:07,000
Chunk overlap is equal to 20.

67
00:03:07,000 --> 00:03:11,000
And then we'll use this splitter's create_documents on speech.

68
00:03:11,000 --> 00:03:13,000
Then we'll look at text[0] and text[1].

69
00:03:13,000 --> 00:03:15,000
So I have already written this code over here.

70
00:03:15,000 --> 00:03:19,000
That is the reason I'm just showing you so that you can actually go ahead and check it out.

71
00:03:19,000 --> 00:03:21,000
So once I execute it this is my page content.

72
00:03:21,000 --> 00:03:27,000
And here you will be able to see that all the information with respect to this page content is visible.

73
00:03:27,000 --> 00:03:30,000
Now again the question comes: Krish, which one should we use?

74
00:03:30,000 --> 00:03:35,000
Should we use the recursive character text splitter or the character text splitter? The recursive one, because it is the most generic one,

75
00:03:35,000 --> 00:03:35,000
okay.

76
00:03:35,000 --> 00:03:40,000
So whenever you have a choice between these two and need to select one,

77
00:03:40,000 --> 00:03:43,000
go ahead with the recursive character text splitter.

78
00:03:43,000 --> 00:03:45,000
So I hope you like this particular video.

79
00:03:45,000 --> 00:03:47,000
I will see you all in the next video.

80
00:03:47,000 --> 00:03:50,000
And there we are going to talk about the HTML text splitter.

81
00:03:50,000 --> 00:03:52,000
Okay, so yes, I will see you all in the next video.

82
00:03:52,000 --> 00:03:53,000
Thank you.

