Hybrid Search for Retrieval Augmented Generation

In my last post I introduced the “Book Search Archive.” You can find the repository for this code here, and the codebase as it was at the time of this blog post here. (If you want to recreate what is in this blog post, use the link tied to this specific post; if you want the latest version of my code, use the first link.)

Now, let’s talk about some of the improvements I built into my new code release for the Book Search Archive. One of these is a custom-built ‘hybrid search.’ What’s a hybrid search? Glad you asked!

Semantic Search vs Lexical Search

I've extensively discussed semantic search in previous posts. If you need a primer, see the links section below.

Simply put, semantic search retrieves results based on meaning rather than exact keywords. For example, using the Book Search Archive to search across several works by philosopher Karl Popper, I might ask, "Is induction a myth?" Now, Karl Popper—widely regarded as one of the greatest philosophers to ever live—asserted that induction is indeed a myth. Let's see how well our semantic search archive surfaces relevant results.
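To make “retrieval by meaning” concrete, here is a toy sketch of how embedding similarity works. The three-dimensional vectors below are made up purely for illustration; a real system would obtain high-dimensional embeddings from a model such as BAAI/llm-embedder:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative values only).
query_vec = [0.9, 0.1, 0.2]      # "Is induction a myth?"
passage_a = [0.85, 0.15, 0.25]   # a passage about induction
passage_b = [0.1, 0.9, 0.3]      # an unrelated passage

# The semantically closer passage scores higher.
print(cosine_similarity(query_vec, passage_a) > cosine_similarity(query_vec, passage_b))  # True
```

The retriever ranks stored passages by a similarity measure like this between the query embedding and each passage embedding, so results can match on meaning even when they share few exact words with the query.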

To set this up, I specify semantic search in the RAG pipeline configuration, like this:

    rag_processor: RagPipeline = RagPipeline(table_name="popper_archive",
                                             generator_model=model,
                                             postgres_user_name='postgres',
                                             postgres_password=postgres_password,
                                             postgres_host='localhost',
                                             postgres_port=5432,
                                             postgres_db_name='postgres',
                                             use_streaming=True,
                                             verbose=False,
                                             llm_top_k=5,
                                             retriever_top_k_docs=None,
                                             include_outputs_from=include_outputs_from,
                                             search_mode=SearchMode.SEMANTIC,
                                             embedder_model_name="BAAI/llm-embedder")

Then set the query and run it:

    query: str = "Is induction a myth?"
    rag_processor.generate_response(query)

Here is the top result returned from a broad search across Karl Popper's works:

Document 1:
Score: 0.8961648406884185
Item Id: Ch01
Item Num: 2
Book Title: Conjectures and Refutations
Page Number: 71
Section Name: VIII
Chapter Title: 1 Science: Conjectures and Refutations
Paragraph Num: 78
Content: I may summarize some of my conclusions as follows: (1) Induction, i.e.
inference based on many observations, is a myth. It is neither a psychological
fact, nor a fact of ordinary life, nor one of scientific procedure. (2) The
actual procedure of science is to operate with conjectures: to jump to
conclusions—often after one single observation (as noticed for example by Hume
and Born).

Not bad, right? This result directly addresses our question.

Of course, we might have found that result without semantic search by simply searching for specific keywords. This approach is known as a "Lexical Search." Let’s try it by adjusting our pipeline as follows:

    rag_processor: RagPipeline = RagPipeline(table_name="popper_archive",
                                             generator_model=model,
                                             postgres_user_name='postgres',
                                             postgres_password=postgres_password,
                                             postgres_host='localhost',
                                             postgres_port=5432,
                                             postgres_db_name='postgres',
                                             use_streaming=True,
                                             verbose=False,
                                             llm_top_k=5,
                                             retriever_top_k_docs=None,
                                             include_outputs_from=include_outputs_from,
                                             search_mode=SearchMode.LEXICAL,
                                             embedder_model_name="BAAI/llm-embedder")

Now, simplify the query to just the two keywords:

    query: str = "induction myth"

This time, the well-matched paragraph that directly answers our question doesn’t even appear in the top 5 results! Here’s my new top result:

Document 1:
Score: 0.10909091
Item Id: ch32
Item Num: 32
Book Title: Unended Quest
Page Number: 163
Chapter Title: 32. Induction; Deduction; Objective Truth
Paragraph Num: 1
Content: There is perhaps a need here for a few words about the myth of
induction, and about some of my arguments against induction. And since at
present the most fashionable forms of the myth connect induction with an
untenable subjectivist philosophy of deduction, I must first say a little more
about the objective theory of deductive inference, and about the objective
theory of truth.

From this result, you can still infer that Karl Popper considered induction a myth, but the answer is less direct. This makes sense, as we were searching for "induction" and "myth," so it returned the paragraph with the highest occurrence of those words rather than truly answering the question.
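That ranking behavior is easy to see with a toy term-frequency scorer. (A real lexical search, such as PostgreSQL full-text search or BM25, is considerably more sophisticated, but the intuition is the same; the passages below are abridged from the results above.)

```python
import re

def lexical_score(query: str, passage: str) -> int:
    """Count how many times the query's keywords occur in the passage."""
    keywords = re.findall(r"\w+", query.lower())
    words = re.findall(r"\w+", passage.lower())
    return sum(words.count(k) for k in keywords)

direct_answer = "Induction, i.e. inference based on many observations, is a myth."
keyword_heavy = ("There is a need for a few words about the myth of induction, "
                 "and about my arguments against induction; the myth connects "
                 "induction with a subjectivist philosophy.")

# The keyword-dense paragraph wins, even though the other answers more directly.
print(lexical_score("induction myth", keyword_heavy))  # 5
print(lexical_score("induction myth", direct_answer))  # 2
```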

Sometimes, though, it’s helpful to search by specific words, while other times a semantic search is more effective. The Book Search Archive repository allows you to easily switch between a traditional word-based search (i.e., a lexical search) and an AI-driven semantic search using text embeddings and language models.
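The switch itself can be as simple as an enum flag on the pipeline. The sketch below mirrors the `SearchMode` values used in the configuration snippets in this post, but the selection logic is illustrative, not the repository's actual implementation:

```python
from enum import Enum, auto

class SearchMode(Enum):
    SEMANTIC = auto()
    LEXICAL = auto()
    HYBRID = auto()

def select_retrievers(mode: SearchMode) -> list[str]:
    """Return which retriever components the pipeline would wire up."""
    if mode is SearchMode.SEMANTIC:
        return ["semantic_retriever"]
    if mode is SearchMode.LEXICAL:
        return ["lex_retriever"]
    return ["semantic_retriever", "lex_retriever"]

print(select_retrievers(SearchMode.LEXICAL))  # ['lex_retriever']
```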

But what if you’re unsure which is best? Why not use both at once?

Hybrid Search

Ideally, we want to perform both a lexical and semantic search, then take the best results from each.

I've modified our Haystack pipeline to enable this. The top portion of the Retrieval-Augmented Generation (RAG) pipeline now looks like this:

[Figure: a flowchart of the pipeline. It branches into several parallel paths that come back together at the end.]

Notice that we now have both a ‘semantic_retriever’ and a ‘lex_retriever’ component in the pipeline. The results from both are sent to the ‘doc_query_collector’ component, which merges them.

In a future post, I’ll cover the best practices for merging results and explain how I coded the ‘doc_query_collector’ component. For now, assume I rank the results from both searches and sort them together.
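One common way to “rank and sort together” is Reciprocal Rank Fusion (RRF). To be clear, this is a sketch of roughly how such a collector could work, not the actual ‘doc_query_collector’ implementation, and the document IDs are invented for illustration:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists. Documents near the top of either list --
    especially documents present in both lists -- bubble up in the fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked IDs from each retriever.
semantic = ["doc_conjectures_78", "doc_unended_23", "doc_other_1"]
lexical = ["doc_unended_1", "doc_unended_23", "doc_other_2"]

# "doc_unended_23" appears in both lists, so it wins the fused ranking.
print(reciprocal_rank_fusion([semantic, lexical])[0])  # doc_unended_23
```

Note how a document that scores well in both searches beats a document that tops only one list; this is exactly the promotion behavior described in the hybrid results below.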

I've found hybrid search to be a powerful approach. When searching by a keyword or two, the ‘lex_retriever’ generally outperforms the ‘semantic_retriever,’ so its results rise to the top. However, when I pose a longer question, the ‘semantic_retriever’ usually outshines the ‘lex_retriever,’ bringing its results to the forefront. This hybrid approach really does deliver the best of both worlds.

To demonstrate this, set up the RAG pipeline like this:

    rag_processor: RagPipeline = RagPipeline(table_name="popper_archive",
                                             generator_model=model,
                                             postgres_user_name='postgres',
                                             postgres_password=postgres_password,
                                             postgres_host='localhost',
                                             postgres_port=5432,
                                             postgres_db_name='postgres',
                                             use_streaming=True,
                                             verbose=False,
                                             llm_top_k=5,
                                             retriever_top_k_docs=None,
                                             include_outputs_from=include_outputs_from,
                                             search_mode=SearchMode.HYBRID,
                                             embedder_model_name="BAAI/llm-embedder")

First, try the query "Is induction a myth?" Interestingly, this is now my top result:

Document 1:
Score: 0.9725419972234968
Item Id: ch32
Item Num: 32
Book Title: Unended Quest
Page Number: 171
Chapter Title: 32. Induction; Deduction; Objective Truth
Paragraph Num: 23
Content: But this was to be expected. Since there can be no theory-free
observation, and no theory-free language, there can of course be no theory-free
rule or principle of induction; no rule or principle on which all theories
should be based. Thus induction is a myth. No “inductive logic” exists. And
although there exists a “logical” interpretation of the probability calculus,
there is no good reason to assume that this “generalized logic” (as it may be
called) is a system of “inductive logic”.

Notice that the hybrid search provided an even more straightforward answer to our question! Our previous top result is now in the second position. Why is that? Because my ‘doc_query_collector’ (which I’ll cover in a future post) recognized that this paragraph appeared in both the lexical and semantic search results, scoring well in each, and promoted it to the top.

Now, switch the query to simply "induction myth," resembling a lexical search. This is now my top result:

Document 1:
Score: 0.9886585235595703
Item Id: ch32
Item Num: 32
Book Title: Unended Quest
Page Number: 163
Chapter Title: 32. Induction; Deduction; Objective Truth
Paragraph Num: 1
Content: There is perhaps a need here for a few words about the myth of
induction, and about some of my arguments against induction. And since at
present the most fashionable forms of the myth connect induction with an
untenable subjectivist philosophy of deduction, I must first say a little more
about the objective theory of deductive inference, and about the objective
theory of truth.

Our previous top result—the one that directly stated, “Thus induction is a myth”—is now in the second position. The current top result is preferred because it contains more occurrences of the words "induction" and "myth."

Notice that with hybrid search, we obtained better results regardless of whether we wanted more lexical or semantic outcomes. This is because we’re truly getting the "best of both worlds" from our search.

In a future post, I’ll explain how I coded the new ‘lex_retriever’ node. There are some interesting aspects to how I implemented it to provide streaming results, ensuring we don’t have to wait on the Large Language Model (LLM). However, I should mention that a lexical search is built into both Haystack and PostgreSQL, so adding a lexical search is not difficult. Here’s the Haystack command to create a ‘keyword retriever’ component:

    PgvectorKeywordRetriever(document_store=self._document_store, top_k=self._retriever_top_k)

You can read more about the PgvectorKeywordRetriever here.

Links:

Semantic Search

Hybrid Retrieval
