1
00:00:00,000 --> 00:00:05,000
So guys, now let's go ahead and proceed with the discussion with respect to reciprocal rank fusion

2
00:00:05,000 --> 00:00:06,000
and hybrid search.

3
00:00:06,000 --> 00:00:12,000
As I said, whenever we query, or whenever a user queries, this particular vector database,

4
00:00:12,000 --> 00:00:18,000
which supports both vector search and keyword search, which is a combination of semantic search and

5
00:00:18,000 --> 00:00:19,000
exact search over here.

6
00:00:19,000 --> 00:00:24,000
First of all, for the keyword search, we have to convert the query into sparse vectors, and the

7
00:00:24,000 --> 00:00:28,000
sparse vectors will be used to search this particular vector database.

8
00:00:28,000 --> 00:00:30,000
And we get the top k results.

9
00:00:30,000 --> 00:00:34,000
Similarly, with the help of dense vectors, we also do a vector search.

10
00:00:34,000 --> 00:00:38,000
And here too we are going to get a result, that is, the top k results, okay.

11
00:00:38,000 --> 00:00:42,000
Now, combining both of them, we have to produce a single response.

12
00:00:42,000 --> 00:00:42,000
Right.

13
00:00:42,000 --> 00:00:44,000
How do we get the final response?

14
00:00:44,000 --> 00:00:46,000
That is what I'm actually going to discuss now.

15
00:00:46,000 --> 00:00:47,000
Okay.

16
00:00:47,000 --> 00:00:54,000
So here, with the help of reciprocal rank fusion and hybrid search, we will be considering two

17
00:00:54,000 --> 00:00:55,000
important things.

18
00:00:55,000 --> 00:00:59,000
Okay, let's say this is my vector database.

19
00:00:59,000 --> 00:00:59,000
Okay.

20
00:01:02,000 --> 00:01:06,000
And from this vector database, you know that whenever a user queries,

21
00:01:06,000 --> 00:01:15,000
based on this query, let's say these are my dense vectors; first we are going to do the vector search.

22
00:01:15,000 --> 00:01:18,000
And we are going to get the top k results, right.

23
00:01:18,000 --> 00:01:21,000
So when I say top k results, let's say that this is my first document.

24
00:01:22,000 --> 00:01:23,000
Right.

25
00:01:23,000 --> 00:01:25,000
And this can be my second document.

26
00:01:27,000 --> 00:01:27,000
Right.

27
00:01:27,000 --> 00:01:29,000
And this can be my third document.

28
00:01:30,000 --> 00:01:38,000
Like this I will be getting a lot of documents. Similarly, over here, based on my sparse vector search,

29
00:01:38,000 --> 00:01:44,000
that is nothing but the keyword search,

30
00:01:46,000 --> 00:01:49,000
I'm actually going to get my top K documents.

31
00:01:49,000 --> 00:01:51,000
So here also I will go ahead and draw this.

32
00:01:52,000 --> 00:01:53,000
So let's say.

33
00:01:55,000 --> 00:01:58,000
Here I'm going to get some other sentences okay.

34
00:01:58,000 --> 00:02:00,000
And the ordering will be different.

35
00:02:00,000 --> 00:02:00,000
Right.

36
00:02:00,000 --> 00:02:04,000
So here, let's say this is the order that I've actually got.

37
00:02:04,000 --> 00:02:10,000
But I got document one, two, three, four, five, let's say the five documents I've actually got.

38
00:02:10,000 --> 00:02:16,000
And here also we have got some other documents, so let's say some other documents.

39
00:02:16,000 --> 00:02:18,000
But here the numbering is different.

40
00:02:18,000 --> 00:02:25,000
Let's say the numbering is like 4, 3, 2, 5, 1, okay.

41
00:02:25,000 --> 00:02:26,000
The numbering is different.

42
00:02:26,000 --> 00:02:33,000
This is my result one, the top k documents, and this is my result

43
00:02:33,000 --> 00:02:35,000
two, the top k documents.

44
00:02:35,000 --> 00:02:45,000
Okay, now, considering this, see, here I've got this ordering, that is, 1, 2, 3, 4, 5.

45
00:02:45,000 --> 00:02:49,000
That is what I got in my top k: document one, document two, document three, document four,

46
00:02:49,000 --> 00:02:50,000
document five.

47
00:02:50,000 --> 00:02:52,000
Similarly over here document four, document three.

48
00:02:52,000 --> 00:02:53,000
Document two.

49
00:02:53,000 --> 00:02:53,000
Document five.

50
00:02:53,000 --> 00:02:54,000
Document one.

51
00:02:54,000 --> 00:02:56,000
These are the ranks that we have got.

52
00:02:57,000 --> 00:03:00,000
Now, how do we calculate the final score?

53
00:03:00,000 --> 00:03:04,000
Because here this is based on my semantic search.

54
00:03:04,000 --> 00:03:06,000
This is based on my keyword search.

55
00:03:06,000 --> 00:03:13,000
When I say semantic search, that is basically based on cosine similarity, okay.

56
00:03:13,000 --> 00:03:18,000
So we are trying to find out the similar vectors and then we are trying to find out our final score.

57
00:03:18,000 --> 00:03:22,000
So in order to calculate the final score the formula is very simple.

58
00:03:22,000 --> 00:03:33,000
Here we'll write the summation of one divided by a constant plus the rank of the document, okay.
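
Written out, the formula described here is commonly given in the form below. This is a sketch of the usual notation rather than something shown in the lecture: $R$ (the set of result lists being fused), $r$ (one result list) and $\mathrm{rank}_r(d)$ (the 1-based position of document $d$ in list $r$) are symbols assumed for illustration, and $c$ is the constant in the denominator.

```latex
\mathrm{score}(d) = \sum_{r \in R} \frac{1}{c + \mathrm{rank}_r(d)}
```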

59
00:03:34,000 --> 00:03:39,000
Now this particular constant, that is c, depends on the database.

60
00:03:40,000 --> 00:03:43,000
It depends on various

61
00:03:45,000 --> 00:03:46,000
databases.

62
00:03:47,000 --> 00:03:47,000
Okay.

63
00:03:47,000 --> 00:03:53,000
Now with respect to various databases, this may range from 1 to 60, okay.

64
00:03:53,000 --> 00:03:55,000
And this is just a constant, okay.

65
00:03:55,000 --> 00:03:59,000
In various databases they'll set this value to 10 or 20.

66
00:04:00,000 --> 00:04:03,000
And in some of the databases I've seen the value set to 60, okay.

67
00:04:03,000 --> 00:04:07,000
So since this is a constant, we are not going to focus much on it.

68
00:04:07,000 --> 00:04:09,000
Instead, I will just take up this formula.

69
00:04:09,000 --> 00:04:15,000
Now, the summation over i of one divided by rank of d.

70
00:04:15,000 --> 00:04:23,000
Okay, since this is a constant, when calculating with the rank of every document

71
00:04:23,000 --> 00:04:27,000
it is applied in the same way to each one, so it has the same kind of impact

72
00:04:27,000 --> 00:04:28,000
on all of them.

73
00:04:28,000 --> 00:04:29,000
Okay.

74
00:04:29,000 --> 00:04:31,000
Now how do we compute the rank.

75
00:04:31,000 --> 00:04:31,000
Right.

76
00:04:31,000 --> 00:04:35,000
So for this particular sentence you'll be seeing one divided by rank of D.

77
00:04:35,000 --> 00:04:39,000
So if I go ahead and compute this this is nothing but one by one okay.

78
00:04:39,000 --> 00:04:42,000
This is nothing but one by two because this is rank two, right?

79
00:04:42,000 --> 00:04:43,000
This is rank three.

80
00:04:43,000 --> 00:04:45,000
So I'll get one by three.

81
00:04:45,000 --> 00:04:46,000
It is nothing but 0.33.

82
00:04:46,000 --> 00:04:49,000
Then here I have 0.25.

83
00:04:49,000 --> 00:04:51,000
And this is my fifth document.

84
00:04:51,000 --> 00:04:53,000
Let's say the rank is five.

85
00:04:53,000 --> 00:04:55,000
I'm actually going to get point two okay.

86
00:04:55,000 --> 00:04:59,000
Similarly, in this particular case, this is my document four.

87
00:04:59,000 --> 00:04:59,000
Right.

88
00:04:59,000 --> 00:05:06,000
And if I compute the scores for these ranks, it will again be one by one, one by two, one

89
00:05:06,000 --> 00:05:09,000
by three, one by four, one by five.

90
00:05:09,000 --> 00:05:17,000
So that basically means document D4 is getting a rank of one in this keyword search,

91
00:05:17,000 --> 00:05:18,000
whereas here, in the vector search,

92
00:05:18,000 --> 00:05:20,000
D4 is getting 0.25.

93
00:05:20,000 --> 00:05:26,000
Okay, now when we try to combine the final score okay, this is important.

94
00:05:26,000 --> 00:05:27,000
See this.

95
00:05:27,000 --> 00:05:30,000
So if I finally go ahead and take up document one.

96
00:05:30,000 --> 00:05:37,000
So if this is my document one and if I try to calculate the overall rank then what we have to do we

97
00:05:37,000 --> 00:05:38,000
have to do the summation.

98
00:05:38,000 --> 00:05:42,000
So let's see, document one is giving one over here and one by five over here.

99
00:05:42,000 --> 00:05:46,000
So I will just go ahead and write one plus point two.

100
00:05:46,000 --> 00:05:48,000
So this will be 1.2 right.

101
00:05:48,000 --> 00:05:53,000
Similarly if I go ahead and calculate for document two right.

102
00:05:54,000 --> 00:05:57,000
So document two over here gives 0.5 right.

103
00:05:57,000 --> 00:06:03,000
So here I'm actually going to get 0.5 plus the document two over here is one by three.

104
00:06:03,000 --> 00:06:06,000
That is nothing but 0.33 right?

105
00:06:06,000 --> 00:06:10,000
So here you can see I'm actually going to get 0.83.

106
00:06:11,000 --> 00:06:13,000
So this is my score that I'm actually going to get.

107
00:06:13,000 --> 00:06:16,000
And obviously this score, 1.2, is better than this score, 0.83.

108
00:06:16,000 --> 00:06:23,000
So I will be getting document one ahead of document two, based on 50%

109
00:06:23,000 --> 00:06:28,000
focus on keyword search, right,

110
00:06:28,000 --> 00:06:32,000
and 50% on semantic search, right?

111
00:06:32,000 --> 00:06:34,000
If you change the weightage, right.

112
00:06:34,000 --> 00:06:38,000
If you change the weightage, then you'll be able to see that some values may also change.

113
00:06:38,000 --> 00:06:40,000
Similarly you have this document three.

114
00:06:40,000 --> 00:06:45,000
Now in the case of document three you'll be able to see that here I have .33.

115
00:06:46,000 --> 00:06:53,000
So let's calculate this: 0.33, plus document three over here is nothing but 0.5.

116
00:06:53,000 --> 00:06:55,000
Again here I'm going to get the same value.

117
00:06:55,000 --> 00:06:56,000
See .83.

118
00:06:56,000 --> 00:06:58,000
Now both these scores are the same, right?
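
The arithmetic in this walkthrough can be reproduced with a short script. This is a minimal sketch, not any particular database's implementation: the function name `rrf_scores` and the parameter `c` are hypothetical names chosen for illustration, and `c=0` mirrors the simplified 1 / rank form used in the lecture.

```python
from collections import defaultdict


def rrf_scores(result_lists, c=0):
    """Sum 1 / (c + rank) over every list a document appears in (ranks start at 1)."""
    scores = defaultdict(float)
    for ranked_docs in result_lists:
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] += 1.0 / (c + rank)
    return dict(scores)


# Orderings taken from the walkthrough above.
vector_results = ["D1", "D2", "D3", "D4", "D5"]   # semantic (dense) search
keyword_results = ["D4", "D3", "D2", "D5", "D1"]  # keyword (sparse) search

scores = rrf_scores([vector_results, keyword_results])

# Print documents from highest to lowest fused score.
for doc, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc, round(score, 2))
```

Running the same function over all five documents also scores the two that the walkthrough does not compute: D4 comes to 0.25 + 1 = 1.25 and D5 to 0.2 + 0.25 = 0.45. Passing `c=60` instead applies the larger constant mentioned earlier.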

119
00:06:58,000 --> 00:07:03,000
So if I want to retrieve the top k results, among these three, document one will obviously be shown first.

120
00:07:03,000 --> 00:07:08,000
And then for these two, it depends on where the focus is more, whether it is on keyword search or semantic

121
00:07:08,000 --> 00:07:08,000
search.

122
00:07:08,000 --> 00:07:14,000
Let's say 70% of the focus is on the keyword search; then over here you can see the 0.5

123
00:07:14,000 --> 00:07:15,000
value is higher than 0.33.

124
00:07:15,000 --> 00:07:20,000
So it is going to display document three ahead of document two.

125
00:07:20,000 --> 00:07:20,000
Right.

126
00:07:20,000 --> 00:07:26,000
So this way a weightage can also be given to keyword search and semantic search, okay.
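
The lecture does not give an exact formula for this weightage, and databases differ in how they apply it. One common approach, assumed here purely for illustration, is to multiply each list's contribution by its weight. The function name `weighted_rrf` and the 0.3 / 0.7 split are hypothetical.

```python
def weighted_rrf(result_lists, weights, c=0):
    """Weighted variant: each list contributes weight / (c + rank) per document."""
    scores = {}
    for ranked_docs, weight in zip(result_lists, weights):
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    return scores


# Same orderings as in the walkthrough above.
vector_results = ["D1", "D2", "D3", "D4", "D5"]   # semantic (dense) search
keyword_results = ["D4", "D3", "D2", "D5", "D1"]  # keyword (sparse) search

# Assumed split: 30% weight on semantic search, 70% on keyword search.
scores = weighted_rrf([vector_results, keyword_results], weights=[0.3, 0.7])

# D2 and D3 tie at equal weights; with more weight on keyword search,
# D3 (keyword rank 2) moves ahead of D2 (keyword rank 3).
print(scores["D3"] > scores["D2"])
```

Because the weights rescale every document's score, positions other than the tied pair can shift as well, which matches the earlier remark that some values may change when the weightage changes.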

127
00:07:26,000 --> 00:07:29,000
And similarly we go ahead and calculate the score for each and every document.

128
00:07:29,000 --> 00:07:35,000
So finally you can see that here we can combine the result two and result one top k documents.

129
00:07:35,000 --> 00:07:40,000
And then we can use this reciprocal rank fusion in hybrid search by assigning ranks.

130
00:07:40,000 --> 00:07:44,000
And we will be able to calculate a score which will be finally given as a response.

131
00:07:44,000 --> 00:07:47,000
And here we can also assign weightage.

132
00:07:47,000 --> 00:07:49,000
And this can be combined with a prompt template.

133
00:07:49,000 --> 00:07:51,000
And we can summarize with our LLM.

134
00:07:51,000 --> 00:07:51,000
Right.

135
00:07:52,000 --> 00:07:59,000
So this gives an idea about the entire hybrid search, right? Now, in the next section,

136
00:07:59,000 --> 00:08:04,000
what we are going to do is see a practical example of how a hybrid search

137
00:08:04,000 --> 00:08:05,000
is basically done.

138
00:08:05,000 --> 00:08:09,000
But before that, you should know about one more search, right?

139
00:08:09,000 --> 00:08:13,000
That is based on graph knowledge search okay.

140
00:08:14,000 --> 00:08:16,000
Graph knowledge search.

141
00:08:17,000 --> 00:08:23,000
So in this graph knowledge search, I hope you may have heard about graph DB.

142
00:08:23,000 --> 00:08:27,000
One of the most popular open source graph DBs is Neo4j.

143
00:08:28,000 --> 00:08:28,000
Okay.

144
00:08:28,000 --> 00:08:36,000
Now if we are specifically using a graph DB to store the vectors, let's

145
00:08:36,000 --> 00:08:41,000
say this is my database, and this database is nothing but my graph DB.

146
00:08:41,000 --> 00:08:48,000
In the case of graph DB, I will have nodes like this, which are connected by relationships.

147
00:08:48,000 --> 00:08:48,000
Right.

148
00:08:48,000 --> 00:08:51,000
So this can be my node one, this can be my node two.

149
00:08:51,000 --> 00:08:53,000
And it will be connected with relationships.

150
00:08:53,000 --> 00:08:59,000
And whenever a user tries to query,

151
00:08:59,000 --> 00:09:03,000
there are three types of queries that can happen.

152
00:09:03,000 --> 00:09:10,000
One is keyword search.

153
00:09:10,000 --> 00:09:15,000
This can also be called syntactic search,

154
00:09:15,000 --> 00:09:17,000
or it can also be called exact search.

155
00:09:18,000 --> 00:09:22,000
Then, along with this, it also supports semantic search.

156
00:09:23,000 --> 00:09:23,000
Right.

157
00:09:24,000 --> 00:09:30,000
And finally, you will also be able to see that it supports keyword search,

158
00:09:30,000 --> 00:09:30,000
semantic search,

159
00:09:30,000 --> 00:09:35,000
and along with these there is one more search, which is called graph knowledge search.

160
00:09:36,000 --> 00:09:38,000
Graph knowledge search.

161
00:09:38,000 --> 00:09:41,000
So it supports all these techniques which is quite amazing.

162
00:09:41,000 --> 00:09:45,000
Just imagine, hybrid search supports just these two types.

163
00:09:45,000 --> 00:09:50,000
But now, with the help of graph knowledge search, we will be able to create a very good, efficient

164
00:09:50,000 --> 00:09:53,000
RAG application, right?

165
00:09:53,000 --> 00:09:55,000
With the help of graph DB.

166
00:09:56,000 --> 00:10:03,000
And we will be seeing more about graph DB in the upcoming sections.

167
00:10:03,000 --> 00:10:07,000
But first, let's go ahead and see a practical example of hybrid search.