AI Tutorial: Hybrid Search in Detail

Way back in this post, I explained conceptually how to do a “Hybrid” search, where you combine the results of both a Lexical search (i.e. word search) and a Semantic search. It’s time to make good on my promise to explain in detail how I did it.

The easiest way to do a Hybrid search is to use the built-in Haystack DocumentJoiner component. Unfortunately (and you’re probably sick of hearing me say this), there seems to be a bug in this component. I found that including it in my Haystack pipeline caused the pipeline to start streaming results from the Large Language Model (LLM) component before the documents were sent to it: it would first spit out a response that ignored all the RAG documents and only later spit out the intended result. I’ll do a post on this problem later and see whether it has been fixed in more recent versions of Haystack.
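For reference, here is roughly what the easy way would look like. This is just a minimal sketch assuming Haystack 2.x, where DocumentJoiner’s “concatenate” mode de-duplicates documents by id and keeps the highest-scoring copy of each:

from haystack import Pipeline
from haystack.components.joiners import DocumentJoiner

# Minimal sketch of the "easy way": DocumentJoiner accepts documents from any
# number of retrievers and, in "concatenate" mode, drops duplicates, keeping
# the highest-scoring copy of each document.
rag_pipeline = Pipeline()
# ... add the retrievers and the rest of the pipeline here ...
rag_pipeline.add_component("joiner", DocumentJoiner(join_mode="concatenate"))
rag_pipeline.connect("semantic_retriever.documents", "joiner.documents")
rag_pipeline.connect("lex_retriever.documents", "joiner.documents")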

So, we’re not going to do it the easy way today. Instead, I’ll build a custom component that takes the results of a Lexical search and a Semantic search and combines them.

The Document / Query Collector

After the Lexical and/or Semantic search runs, the results are all collected in a custom component that looks like this:

import itertools
from collections import defaultdict
from math import inf
from typing import Any, Callable, Dict, List, Optional

from haystack import Document, component


@component
class DocumentQueryCollector:
    def __init__(self, do_stream: bool = False, callback_func: Optional[Callable] = None) -> None:
        self._do_stream: bool = do_stream
        self._callback_func: Optional[Callable] = callback_func

    @component.output_types(documents=List[Document], query=str, llm_top_k=int)
    def run(self, query: str,
            llm_top_k: int = 5,
            semantic_documents: Optional[List[Document]] = None,
            lexical_documents: Optional[List[Document]] = None
            ) -> Dict[str, Any]:
        documents: List[Document] = []
        # Check for semantic documents vs lexical documents and, if both exist, merge them
        if semantic_documents is not None and lexical_documents is not None:
            # Combine semantic and lexical documents, including each document only once
            # and keeping the highest-scoring copy of each
            output: List[Document] = []
            document_lists: List[list] = [semantic_documents, lexical_documents]
            docs_per_id: defaultdict = defaultdict(list)
            doc: Document
            for doc in itertools.chain.from_iterable(document_lists):
                docs_per_id[doc.id].append(doc)
            docs: list
            for docs in docs_per_id.values():
                # Take the copy of the document with the best score
                doc_with_best_score = max(docs, key=lambda a_doc: a_doc.score if a_doc.score else -inf)
                # Give a slight boost for each duplicate: add the duplicate's own score,
                # clamped to the range [0.0, 0.1], to the best copy's score
                if len(docs) > 1:
                    for doc in docs:
                        if doc is not doc_with_best_score:
                            boost: float = min(max(doc.score or 0.0, 0.0), 0.1)
                            doc_with_best_score.score = (doc_with_best_score.score or 0.0) + boost
                output.append(doc_with_best_score)
            output.sort(key=lambda a_doc: a_doc.score if a_doc.score else -inf, reverse=True)
            documents = output
        elif semantic_documents is not None:
            documents = semantic_documents
        elif lexical_documents is not None:
            documents = lexical_documents
        if self._do_stream:
            print()
            print("Retrieved Documents:")
            # print_documents is a helper defined elsewhere in the project that pretty-prints the documents
            print_documents(documents)
        if self._callback_func is not None:
            self._callback_func()
        return {"documents": documents, "query": query, "llm_top_k": llm_top_k}

It is called and hooked into the pipeline like this:

doc_checker: DocumentQueryCollector = DocumentQueryCollector(do_stream=self._can_stream(),
                                                             callback_func=lambda: doc_collector_completed())
rag_pipeline.add_component("doc_query_collector", doc_checker)

Connect it to the semantic retriever:

rag_pipeline.connect("semantic_retriever.documents", "doc_query_collector.semantic_documents")

Connect it to the lexical retriever:

rag_pipeline.connect("lex_retriever.documents", "doc_query_collector.lexical_documents")

Connect it to the prompt builder:

rag_pipeline.connect("doc_query_collector.query", "prompt_builder.query")
rag_pipeline.connect("doc_query_collector.llm_top_k", "prompt_builder.llm_top_k")
rag_pipeline.connect("doc_query_collector.documents", "prompt_builder.documents")

We then kick everything off from our starting node by calling the pipeline directly:

inputs: Dict[str, Any] = {
    "query_input": {"query": query, "llm_top_k": self._llm_top_k},
}
results: Dict[str, Any] = self._rag_pipeline.run(inputs, include_outputs_from=self._include_outputs_from)
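Note that “query_input” above is the pipeline’s entry-point component, which I haven’t shown. A minimal (and hypothetical) sketch of such a component would simply fan the query out to the retrievers and pass query and llm_top_k through to the collector:

@component
class QueryInput:
    # Hypothetical entry point: forwards the query to the retrievers and passes
    # query and llm_top_k through to the doc_query_collector
    @component.output_types(query=str, llm_top_k=int)
    def run(self, query: str, llm_top_k: int = 5) -> Dict[str, Any]:
        return {"query": query, "llm_top_k": llm_top_k}

rag_pipeline.add_component("query_input", QueryInput())
rag_pipeline.connect("query_input.query", "lex_retriever.query")
rag_pipeline.connect("query_input.query", "query_embedder.text")
rag_pipeline.connect("query_input.query", "doc_query_collector.query")
rag_pipeline.connect("query_input.llm_top_k", "doc_query_collector.llm_top_k")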

The Power of Hybrid Search

Here is the key thing that really makes Hybrid search shine:

# Give a slight boost for each duplicate: add the duplicate's own score,
# clamped to the range [0.0, 0.1], to the best copy's score
if len(docs) > 1:
    for doc in docs:
        if doc is not doc_with_best_score:
            boost: float = min(max(doc.score or 0.0, 0.0), 0.1)
            doc_with_best_score.score = (doc_with_best_score.score or 0.0) + boost

What I’m doing here is checking whether any documents were returned by both the Lexical and the Semantic search. If a document was, I give its score a small boost.

Now typically (as discussed in this post) a query will either score well Lexically or it will score well Semantically, but rarely both. For example, if I search on ‘induction’ (a single word), the Lexical search will probably score well. But if I search on ‘What is induction and how does it relate to testability?’, the Lexical search will do poorly while the Semantic search will do well.

But if a document shows up in both searches, that is still a good sign. So I allow a boost of up to 0.1 for showing up in both. Imagine a document gets a Semantic score of, say, 0.89 and a Lexical score of 0.01. I then add those together for a final score of 0.90. (I cap the boost at 0.1 to avoid strange results far above a score of 1.0, though it would be rare for that to happen.)
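To make the arithmetic concrete, here is a standalone sketch of that boost using the numbers above:

from math import inf

from haystack import Document

# The same document comes back from both searches with different scores
semantic_copy = Document(id="doc1", content="...", score=0.89)
lexical_copy = Document(id="doc1", content="...", score=0.01)

docs = [semantic_copy, lexical_copy]
best = max(docs, key=lambda d: d.score if d.score else -inf)
for dup in docs:
    if dup is not best:
        best.score += min(max(dup.score or 0.0, 0.0), 0.1)  # boost capped at 0.1

print(best.score)  # 0.9 (0.89 + 0.01, since 0.01 is under the 0.1 cap)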

Conclusion

And that is how I built a custom hybrid search for the Book Search Archive. There is not much to it: you run both searches and merge the results, giving a slight boost to any document that shows up in both.

In a future post I’ll cover how to use a Re-Ranker to then sort these results for the Large Language Model.
