Implementing a Lexical Search
- By Bruce Nielson
- ML & AI Specialist
In my last post, I went over the idea of a hybrid search where we merge the results of both a lexical search (i.e. a traditional keyword search) and a semantic search into a single result. The end result was better than either search individually because we could give a bonus to a passage that was a good match in both the lexical and the semantic search.
Let’s now talk about how to actually implement a Lexical Search. You can find the version of the code at the time of this blog post here. If you want the most up-to-date version of the code, that is found here.
Lexical, Semantic, and Hybrid Search
I’ve rewritten my “Book Search Archive” to offer three search modes (Lexical, Semantic, or Hybrid):
class SearchMode(Enum):
    LEXICAL = 1
    SEMANTIC = 2
    HYBRID = 3
You can now specify which search to use by simply changing a parameter when initializing the RagPipeline instance, like this:
rag_processor: RagPipeline = RagPipeline(table_name="book_archive",
                                         generator_model=model,
                                         postgres_user_name='postgres',
                                         postgres_password=postgres_password,
                                         postgres_host='localhost',
                                         postgres_port=5432,
                                         postgres_db_name='postgres',
                                         use_streaming=True,
                                         verbose=False,
                                         llm_top_k=5,
                                         retriever_top_k_docs=None,
                                         include_outputs_from=include_outputs_from,
                                         search_mode=SearchMode.HYBRID,
                                         embedder_model_name="BAAI/llm-embedder")
Replace that with SearchMode.LEXICAL or SearchMode.SEMANTIC as you see fit.
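As a rough illustration of what the three modes imply, here is a minimal sketch of which retrievers each mode would switch on. The helper retrievers_for and the name "semantic_retriever" are assumptions made up for this example; only the SearchMode enum and the "lex_retriever" name come from the actual code.

```python
from enum import Enum
from typing import List


class SearchMode(Enum):
    LEXICAL = 1
    SEMANTIC = 2
    HYBRID = 3


def retrievers_for(mode: SearchMode) -> List[str]:
    """Hypothetical helper: list which retrievers a given mode would use."""
    names: List[str] = []
    # Lexical retrieval is used for a pure lexical search and for hybrid search
    if mode in (SearchMode.LEXICAL, SearchMode.HYBRID):
        names.append("lex_retriever")
    # Assumption: semantic retrieval mirrors the same rule for SEMANTIC and HYBRID
    if mode in (SearchMode.SEMANTIC, SearchMode.HYBRID):
        names.append("semantic_retriever")  # assumed name, for illustration only
    return names
```

Hybrid mode is simply the case where both retrievers run and their results are merged afterwards.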
The Lexical Search In Detail
My Retrieval Augmented Generation (RAG) pipeline now looks like this:
You can see that the pipeline takes a query string and an integer telling it how many ‘top’ k results to keep, and that the query is sent off to both the semantic search part of the pipeline and the lexical search part.
The ‘lex_retriever’ node looks like this:
# Add the retriever component(s) depending on search mode
if self._search_mode == SearchMode.LEXICAL or self._search_mode == SearchMode.HYBRID:
    lex_retriever: RetrieverWrapper = RetrieverWrapper(
        PgvectorKeywordRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    rag_pipeline.add_component("lex_retriever", lex_retriever)
    rag_pipeline.connect("query_input.query", "lex_retriever.query")
    rag_pipeline.connect("lex_retriever.documents", "doc_query_collector.lexical_documents")
Notice that the underlying component is a custom component called “RetrieverWrapper” into which I pass a built-in Haystack component called “PgvectorKeywordRetriever”. PgvectorKeywordRetriever is doing all the real work. It simply uses the built-in ability in PostgreSQL to do keyword searches within a string contained in the database.
Under the hood it uses PostgreSQL’s full-text search (the tsvector and tsquery types and their related functions) to do the search. We can emulate this directly in PgAdmin (see this post for details) by using a query like this:
SELECT ts_rank_cd(to_tsvector('english', content), to_tsquery('english', 'induction')) AS rank
, content
, meta->>'book_title' AS book_title
, meta->>'section_title' AS section_title
, *
FROM popper_archive
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'induction')
ORDER BY rank DESC;
The above query searches the content field of my document store for every passage containing the word ‘induction’ (keeping with my Karl Popper theme), ranks the matches, and also pulls the ‘book_title’ and ‘section_title’ fields out of the JSON stored in the meta field. Note that only the content field is searched; the meta data fields are just returned alongside each match. If you need to do fancy queries, this is a good example of how to write your own customized queries, which you can then run using Psycopg. The PgvectorKeywordRetriever simply uses the same tsquery functionality to do a keyword search.
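To make the Psycopg idea concrete, here is a minimal sketch of running a parameterized version of that query from Python. It assumes psycopg 3 is installed and that the popper_archive table exists. The helpers to_tsquery_text and keyword_search, and the LIMIT clause, are additions for illustration and are not part of the Book Search Archive code.

```python
import re
from typing import Any, List, Tuple

# The same query as above, with the search term replaced by %s placeholders so
# psycopg passes it as a parameter instead of splicing it into the SQL text.
# The LIMIT clause is an addition for this sketch.
KEYWORD_SQL: str = """
SELECT ts_rank_cd(to_tsvector('english', content), to_tsquery('english', %s)) AS rank
     , content
     , meta->>'book_title' AS book_title
     , meta->>'section_title' AS section_title
FROM popper_archive
WHERE to_tsvector('english', content) @@ to_tsquery('english', %s)
ORDER BY rank DESC
LIMIT %s;
"""


def to_tsquery_text(user_input: str) -> str:
    """Hypothetical helper: AND together the words of a free-text query.

    to_tsquery expects operators between terms, so 'induction problem'
    becomes 'induction & problem'.
    """
    words = re.findall(r"[A-Za-z0-9]+", user_input)
    return " & ".join(words)


def keyword_search(conninfo: str, user_input: str, top_k: int = 5) -> List[Tuple[Any, ...]]:
    """Run the keyword query and return rows of (rank, content, book_title, section_title)."""
    import psycopg  # psycopg 3; imported here so the helper above works without it

    term = to_tsquery_text(user_input)
    with psycopg.connect(conninfo) as conn:
        return conn.execute(KEYWORD_SQL, (term, term, top_k)).fetchall()
```

You would call it with a connection string for your own database, for example keyword_search("host=localhost port=5432 dbname=postgres user=postgres password=...", "induction"), filling in your own password.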
RetrieverWrapper for Streaming Retrieved Results
So why do I wrap the PgvectorKeywordRetriever component in a custom component called RetrieverWrapper? Well, let’s take a look at the code for this custom component:
@component
class RetrieverWrapper:
    def __init__(self, retriever: Union[PgvectorEmbeddingRetriever, PgvectorKeywordRetriever],
                 do_stream: bool = False) -> None:
        self._retriever: Union[PgvectorEmbeddingRetriever, PgvectorKeywordRetriever] = retriever
        self._do_stream: bool = do_stream
        # Alternatively, you can set the input types:
        # component.set_input_types(self, query_embedding=List[float], query=Optional[str])

    @component.output_types(documents=List[Document])
    def run(self, query: Union[List[float], str]) -> Dict[str, Any]:
        documents: List[Document] = []
        if isinstance(query, list):
            documents = self._retriever.run(query_embedding=query)['documents']
        elif isinstance(query, str):
            documents = self._retriever.run(query=query)['documents']
        if self._do_stream:
            print()
            if isinstance(self._retriever, PgvectorEmbeddingRetriever):
                print("Semantic Retriever Results:")
            elif isinstance(self._retriever, PgvectorKeywordRetriever):
                print("Lexical Retriever Results:")
            print_documents(documents)
        # Return a dictionary with documents
        return {"documents": documents}
When you initialize the component, it takes another component (either PgvectorEmbeddingRetriever or PgvectorKeywordRetriever) and a Boolean value that controls whether or not to stream the results. Then, when you run the component, it calls the component you passed in to do the real work. If streaming is turned on, it prints those results out to the console before it passes them on to the next node in the RAG pipeline. The idea is that we get to view the retrieved documents / paragraphs before the Large Language Model (LLM) receives them, so that we don’t have to wait for the LLM to generate a response. In other words, RetrieverWrapper is a streaming version of whichever retriever we decide to wrap.
Note that I also set a streaming option for the LLM, but I’ll cover this in a future post.
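The wrapping idea itself does not depend on Haystack. Here is a stripped-down sketch of the same pattern in plain Python, where with_preview and fake_lexical_retriever are hypothetical stand-ins invented for this example rather than anything from the real pipeline:

```python
from typing import Callable, List


def with_preview(retrieve: Callable[[str], List[str]], label: str,
                 do_stream: bool = True) -> Callable[[str], List[str]]:
    """Wrap a retrieval function so its results are printed before being returned."""
    def wrapped(query: str) -> List[str]:
        documents = retrieve(query)  # the wrapped retriever does the real work
        if do_stream:
            print()
            print(f"{label}:")
            for doc in documents:
                print(doc)
        return documents  # handed on unchanged to the next step
    return wrapped


def fake_lexical_retriever(query: str) -> List[str]:
    """Stand-in retriever: keeps the passages that contain the query word."""
    passages = ["The problem of induction", "Conjectures and refutations"]
    return [p for p in passages if query.lower() in p.lower()]


lexical = with_preview(fake_lexical_retriever, "Lexical Retriever Results")
results = lexical("induction")
```

The caller sees exactly what the wrapped retriever returned; the only difference is the side effect of printing the results first.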
The Query Input Component
You might notice that everything starts with the ‘query_input’ component. To be honest I just got tired of making changes to my pipelines and changing which node was the start of the pipeline. On top of that, as I drew parallel pipelines it was inconvenient to try to have two starts that I had to invoke at once. So, I made a single custom component like this:
@component
class QueryComponent:
    @component.output_types(query=str, llm_top_k=int)
    def run(self, query: str, llm_top_k: int) -> Dict[str, Any]:
        return {"query": query, "llm_top_k": llm_top_k}
This takes a query string and the number of ‘top’ k documents to send to the LLM, and then passes them along to the appropriate components. You set it up like this:
rag_pipeline.add_component("query_input", QueryComponent())
Then connect it to all the appropriate places in the pipeline (depending on what kind of search you are doing) like this…
Connection to Semantic Search:
rag_pipeline.connect("query_input.query", "query_embedder.text")
Connection to the Lexical Retriever:
rag_pipeline.connect("query_input.query", "lex_retriever.query")
To the Document Query Collector (so that it receives the query and top k input parameter):
rag_pipeline.connect("query_input.query", "doc_query_collector.query")
rag_pipeline.connect("query_input.llm_top_k", "doc_query_collector.llm_top_k")
I’ll cover the other custom component, the “document / query collector”, in a future post, as it is more related to the details of the Hybrid search. What it does is take the results of the parallel searches (lexical and semantic) as well as the query input parameters and join them all together to be passed to the Large Language Model.
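Until then, here is one plausible shape for such a joining step, written as a plain function. This is only a sketch under my own assumptions: the collect function is hypothetical, it uses strings in place of Haystack documents, and it simply concatenates and de-duplicates rather than doing any hybrid ranking.

```python
from typing import Dict, List


def collect(query: str, llm_top_k: int,
            lexical_documents: List[str],
            semantic_documents: List[str]) -> Dict[str, object]:
    """Hypothetical collector: join both result lists, drop duplicates, keep the top k."""
    merged: List[str] = []
    for doc in lexical_documents + semantic_documents:
        if doc not in merged:  # a passage found by both searches is kept once
            merged.append(doc)
    return {"query": query, "documents": merged[:llm_top_k]}


joined = collect("induction", 2, ["passage A", "passage B"], ["passage B", "passage C"])
```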
Conclusions
Lexical Search is built into both PostgreSQL and Haystack, so it is easy to implement Lexical searching for our Book Search Archive. Semantic Search is great, but sometimes you want to search for a specific word or two rather than the semantic meaning of a sentence or question, so having Lexical search as part of our Book Search Archive was an important feature to add. We also covered how to use lexical search directly through SQL queries in PostgreSQL for building custom queries, and we showed how to stream the retrieved results if you want to view them before they go to the LLM for processing.