
Reranking Documents Using Cross-Encoders for Retrieval Augmented Generation (RAG)
- By Bruce Nielson
- ML & AI Specialist
A while back I snuck a reranker into the Book Search Archive (Mindfire's open source project for testing out our open source AI stack). I forgot to do a write-up for this blog, so I'm doing that now.
But what is a reranker? And why should you care?
To understand this, we have to harken back to the concept of cosine similarity and how we use it to do a semantic search on our document fragments. Recall that cosine similarity works great for a semantic search (i.e. a search on the meanings of words rather than a straight word search, i.e. a lexical search) but is quite slow, because we have to compute a cosine similarity between the query and every single document. To speed this up we use a Hierarchical Navigable Small World (HNSW) index. That works great and the results come back very quickly. But it only gives an approximate 'nearest neighbor search' result. That is to say, it is fast, but at the expense that it may not find the best semantic search matches.
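To make that cost concrete, here is a minimal sketch (with illustrative names, not actual Book Search Archive code) of what a brute-force cosine-similarity search has to do:

    import numpy as np

    # Brute-force semantic search: compare the query against EVERY document.
    # This is the O(n) cost that an HNSW index avoids, at the price of
    # returning only approximate nearest neighbors.
    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def exhaustive_search(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 5) -> np.ndarray:
        scores = np.array([cosine_similarity(query_emb, d) for d in doc_embs])
        return np.argsort(scores)[::-1][:top_k]  # indices of the best matches, best first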
What the reranker does is take the top documents found by the HNSW index (as implemented in Postgres or Neo4j in this case) and rerank them in memory. This may sound like a waste since the HNSW index already pulled back the best hits. What use is it to rerank them?
Imagine you use the HNSW index to pull back, say, the top 100 hits, then rerank those and take the top 5. The results would, ideally, include a solid hit that the HNSW index missed due to being only an approximate match. Plus, the reranker has all 100 candidates available, so in theory it has more information to work with when coming up with the best results.
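As a sketch of the idea (the names here are illustrative, not the actual Book Search Archive code), the two-stage pattern looks like this:

    # Hypothetical two-stage retrieval; hnsw_index and reranker are illustrative names.
    candidates = hnsw_index.search(query_embedding, top_k=100)  # fast but approximate
    reranked = reranker.rank(query, candidates)                 # slower, more accurate
    best_docs = reranked[:5]                                    # what the LLM actually sees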
I got the idea to use a reranker while reviewing this white paper by Wang et al., 2024. (See also this excellent little article summarizing the results of Wang's paper.) This paper goes over the best practices and best open-source models for a superior RAG pipeline. It hadn't occurred to me to use a reranker prior to that point. And I admit I was skeptical that it would make much of a difference until I tried it and saw the positive results for myself.
Cross Encoding
A reranker utilizes a cross-encoder. Sbert.net describes a cross-encoder as follows:
- Calculates a similarity score given pairs of texts.
- Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model.
- Often slower than a Sentence Transformer model, as it requires computation for each pair rather than each text.
- Due to the previous 2 characteristics, Cross Encoders are often used to re-rank the top-k results from a Sentence Transformer model.
A cross-encoder is specially trained for this task and so often gives better results than regular cosine similarity via a sentence transformer model.
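As a concrete illustration (using the sentence-transformers library directly rather than Haystack; the toy passages are mine), scoring pairs with a cross-encoder looks like this:

    from sentence_transformers import CrossEncoder

    # The cross-encoder reads the query and passage TOGETHER and emits one score,
    # rather than comparing two independently computed embeddings.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [
        ("what does a reranker do?", "A reranker re-scores retrieved documents against the query."),
        ("what does a reranker do?", "Bananas are a good source of potassium."),
    ]
    scores = model.predict(pairs)  # the first pair should score far higher than the second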
I tried out a reranker and it successfully surfaced an additional, very good hit on a query I'd run multiple times, one the HNSW rankings had completely missed. Plus, it took the results the HNSW index came up with and put them into a new order that seemed, to me, to be an improvement. So I'm sold that a good reranker using cross-encoding can really help your RAG pipeline.
Implementing a Reranker
The code for a reranker is pretty simple. It was so easy to implement that I forgot to make a blog post about it. 😊 I'm currently just using the TransformersSimilarityRanker component built into Haystack, which defaults to the cross-encoder/ms-marco-MiniLM-L-6-v2 model. (Wang and company actually recommended a different model. I should probably switch to that one but haven't yet.)
I have a parameter for my RagPipeline class called use_reranker that, if set to True, will incorporate the reranker into the RAG pipeline.
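For example, turning it on at construction time would look something like this (a hypothetical instantiation; any other constructor arguments are omitted):

    # Hypothetical usage; other RagPipeline arguments are omitted for brevity.
    rag_pipeline = RagPipeline(use_reranker=True)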
First, in the __init__ method for the RagPipeline class, I added this code to warm up the reranker:
    if self._use_reranker:
        # Warm up the reranker model
        ranker = TransformersSimilarityRanker(device=self._component_device,
                                              top_k=self._llm_top_k,
                                              score_threshold=0.20)
        ranker.warm_up()
        self._ranker = ranker
I set the score threshold to 0.20, which means it won't return any documents with a similarity score below 20%. That's a pretty low bar, so you may want to raise the threshold to ensure only quality documents are passed to the LLM.
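If you want to see what the ranker does on its own, outside the pipeline, a minimal standalone run looks something like this (the toy documents are mine, not from the Book Search Archive):

    from haystack import Document
    from haystack.components.rankers import TransformersSimilarityRanker

    ranker = TransformersSimilarityRanker(top_k=3, score_threshold=0.20)
    ranker.warm_up()

    docs = [
        Document(content="HNSW gives fast but approximate nearest-neighbor search."),
        Document(content="Cross-encoders score the query and document together."),
        Document(content="Postgres is a relational database."),
    ]
    result = ranker.run(query="How does a cross-encoder rerank documents?", documents=docs)
    for doc in result["documents"]:
        print(doc.score, doc.content)  # documents scoring below the 0.20 threshold are dropped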
Next, in the _create_rag_pipeline method I add the reranker into the actual RAG pipeline next to the doc_query_collector:
    if self._use_reranker:
        # Reranker
        rag_pipeline.add_component("reranker", self._ranker)
        rag_pipeline.connect("doc_query_collector.documents", "reranker.documents")
        rag_pipeline.connect("doc_query_collector.query", "reranker.query")
        rag_pipeline.connect("doc_query_collector.llm_top_k", "reranker.top_k")
        # Stream the reranked documents
        rag_pipeline.add_component("reranker_streamer", DocumentStreamer(do_stream=self._can_stream()))
        rag_pipeline.connect("reranker.documents", "reranker_streamer.documents")
And then connect it to the prompt_builder:
    if self._use_reranker:
        # Connect the reranker documents to the prompt builder
        rag_pipeline.connect("reranker_streamer.documents", "prompt_builder.documents")
    else:
        # Connect the doc collector documents to the prompt builder
        rag_pipeline.connect("doc_query_collector.documents", "prompt_builder.documents")
And that is all there is to it.
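Pieced together from the connect calls above, the reranker's path through the pipeline is:

    doc_query_collector.documents  ->  reranker.documents
    doc_query_collector.query      ->  reranker.query
    doc_query_collector.llm_top_k  ->  reranker.top_k
    reranker.documents             ->  reranker_streamer.documents
    reranker_streamer.documents    ->  prompt_builder.documents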
Conclusions
A reranker is a great addition to your RAG pipeline. It compensates for the shortcomings of an HNSW index, which skips doing a cosine similarity across every document and can therefore miss a good match, and it tends to do a better job of rating and ranking the documents.