Letting Your Chatbot Determine Relevance of Documents

In a previous post we set up a chatbot: an AI version of the philosopher Karl Popper. I gave the chatbot a large database of Karl Popper’s writings and asked Google’s Gemini to take on Popper’s persona, using quotes from that database (found via semantic search) to answer questions.

The result was very entertaining, and Gemini did a great job synthesizing the quotes from the real Karl Popper (as found in our document database) into pretty good answers.

The Problem of Low-Relevance Questions

One problem we sometimes bump into is that the user asks a question that has no good answer in our database. Our Reranker will still find something to return, but maybe the top relevance score is like 0.0015. What then? How do we solve this problem?
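
As an aside, here is roughly where a score like 0.0015 comes from. This is only a sketch using a cross-encoder model from the sentence-transformers library as a stand-in – not necessarily the same Reranker my pipeline uses, and the passages are placeholders – but it shows the kind of query/passage scoring involved:

from sentence_transformers import CrossEncoder

# A stand-in cross-encoder reranker; this model's scores land roughly between 0 and 1.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Tell me about Bozo the clown"
# In the real pipeline these would be the retrieved document contents,
# e.g. [doc.content for doc in retrieved_docs].
passages = [
    "First candidate passage from the document store ...",
    "Second candidate passage from the document store ...",
]

# Score each (query, passage) pair; for an off-topic question the top score can be tiny.
scores = reranker.predict([(query, p) for p in passages])
print(max(scores))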

One interesting idea is to let the LLM solve the problem for you. Basically, let AI solve a problem created by AI, so to speak. It’s a seductive idea and one worth pursuing.

AI Popper Meets His Match?

Let’s suppose that I start with the following question to my Popper chatbot:

“Tell me about induction”

This first question is short and focused, and it turns up many strong hits in the database of Popper’s writings. That means the Retrieval-Augmented Generation (RAG) pipeline serves the chatbot high-quality quotes to work with.

But then suppose we follow that question with this question:

“Tell me about Bozo the clown”

We want our chatbot to give as relevant an answer as it can (in the persona of Popper) despite the seeming irrelevance of the question. In fact, the chatbot does okay with this question. The quotes returned include this very long one (because it happens to contain the word ‘clown’):

These are a few episodes in the career of the man whose ‘windbaggery’ has given rise to modern nationalism as well as to modern Idealist philosophy, erected upon the perversion of Kant’s teaching. (I follow Schopenhauer in distinguishing between Fichte’s ‘windbaggery’ and Hegel’s ‘charlatanry’, although I must admit that to insist on this distinction is perhaps a little pedantic.) The whole story is interesting mainly because of the light it throws upon the ‘history of philosophy’ and upon ‘history’ in general. I mean not only the perhaps more humorous than scandalous fact that such clowns are taken seriously, and that they are made the objects of a kind of worship, of solemn although often boring studies (and of examination papers to match. I mean not only the appalling fact that the windbag Fichte and the charlatan Hegel are treated on a level with men like Democritus, Pascal, Descartes, Spinoza, Locke, Hume, Kant, J. S. Mill, and Bertrand Russell, and that their moral teaching is taken seriously and perhaps even considered superior to that of these other men.

That’s a good enough quote to work with, and the answer from AI Popper ends up seeming at once humorous and somehow also on point because it can work this quote into the answer.

But if you check the relevance score of this quote, it’s 0.0015! Hardly a strong hit. And the other quotes returned from the RAG pipeline are entirely useless. Won’t handing the chatbot a bunch of useless quotes risk confusing it?

Let the LLM Decide Relevance

So, let’s come up with a way to let Gemini determine if quotes returned from the pipeline are really relevant to the question. Let’s write a new method to do that:

    def ask_llm_for_quote_relevance(self, message: str, docs: List[Document]) -> str:
        # Build a prompt that lists each retrieved quote with a number and asks
        # the model to pick out only the quotes that help answer the question.
        prompt = (
            f"Given the question: '{message}', review the following numbered quotes and "
            "return a comma-separated list of the numbers for the quotes that you believe will help answer the "
            "question. If there are no quotes relevant to the question, return an empty string. "
            "Answer with only the numbers or an empty string, for example: '1,3,5' or ''.\n\n"
        )
        for i, doc in enumerate(docs, start=1):
            prompt += f"{i}. {doc.content}\n\n"

        # Ask a fresh Gemini session (no chat history) to judge relevance.
        return self.ask_llm_question(prompt)

This method takes the user’s query (message) and a list of Haystack Documents returned from the RAG pipeline. It builds a prompt asking which of the numbered quotes are relevant to the user’s question and sends it to a fresh Gemini session with no chat history. Gemini then returns either an empty string – if it finds none of the quotes relevant – or a comma-separated list of the numbers of the quotes it judged relevant.

If we then use this list of indexes we can remove all quotes except the one that actually mentions clowns. This makes things less confusing for Gemini and it tends to give a smarter answer.

This is our first trick: let the LLM decide which quotes are relevant.

By the way, the ask_llm_question method is a generic way to spin up chat sessions with Google Gemini:

def ask_llm_question(self, prompt: str, chat_history: Optional[List[Dict[str, Any]]] = None) -> str:
    if chat_history is None:
        chat_history = []
    # Start a chat session (with no history unless one is supplied).
    chat_session = self.model.start_chat(history=chat_history)
    chat_response = chat_session.send_message(prompt)
    # Return the model's response text.
    return chat_response.text.strip()

The Problem of Increased Queries

However, there is a downside. I’m using the free tier of Gemini, so it is severely rate limited, and this approach turns one query into two. So I’m risking overrunning my rate limit. Even if I didn’t have a rate limit, I’d be roughly doubling the number of tokens used, which raises my costs quickly.

Later on I will add try/except blocks to handle the rate limit problem. The idea is that if the rate limit is exceeded, I’ll wait a moment and then just proceed to the next step.
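
Something along these lines is what I have in mind. This is only a sketch: the wrapper name is hypothetical, I’m catching a generic Exception rather than a specific Gemini error class, and I’ve chosen to fall back to keeping every quote if the check fails (one could also choose to drop them all):

import time

def ask_llm_for_quote_relevance_safely(self, message: str, docs: List[Document]) -> str:
    # Hypothetical wrapper: try the extra relevance-check query, and if it fails
    # (for example because the rate limit was exceeded), wait a moment and fall
    # back to treating every quote as relevant.
    try:
        return self.ask_llm_for_quote_relevance(message, docs)
    except Exception as e:
        print(f"Relevance check failed ({e}); keeping all quotes.")
        time.sleep(1)
        return ",".join(str(i) for i in range(1, len(docs) + 1))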

But for now, just note that using the LLM to solve this problem increases the number of queries and tokens we use.

One way we might mitigate this problem is to call this method only when the quotes (document fragments) returned fail to score high enough with the Reranker. This makes sense, right? If the Reranker finds lots of high-scoring documents, why do we need Gemini to determine relevance? But if the Reranker says none of the documents are particularly relevant, then it makes sense to let Gemini – which is hopefully smarter than the Reranker – decide which quotes to keep.

I have in mind something like this:

# Find the largest score
max_score: float = self.get_max_score(retrieved_docs)
if max_score is not None and max_score < 0.30:
    # If there are no quotes with a score at least 0.30,
    # then we ask Gemini in one go which quotes are relevant.
    response_text = self.ask_llm_for_quote_relevance(message, retrieved_docs)
    # Split by commas, remove any extra spaces, and convert to integers.
    try:
        relevant_numbers = [int(num.strip()) for num in response_text.split(',') if num.strip().isdigit()]
    except Exception as parse_e:
        print(f"Error parsing Gemini response: {parse_e}")
        time.sleep(1)
        relevant_numbers = []

    # Filter docs based on the numbered positions.
    ranked_docs = [doc for idx, doc in enumerate(retrieved_docs, start=1) if idx in relevant_numbers]

Basically, we take the max score of the documents the RAG pipeline returns and use it to decide whether to ask Gemini to judge relevance. If the score is high, we don’t bother. If it is low, we ask Gemini to remove the irrelevant quotes.

If the max score is high, I will later remove less relevant quotes like this:

        else:
            # If at least 3 quotes score at or above 0.20, drop anything below 0.20;
            # otherwise, drop only the quotes scoring below 0.10.
            threshold: float = 0.20
            num_high = len([doc for doc in retrieved_docs if hasattr(doc, 'score') and doc.score >= threshold])
            threshold = 0.20 if num_high >= 3 else 0.10
            ranked_docs = [doc for doc in retrieved_docs if hasattr(doc, 'score') and doc.score >= threshold]

This keeps only the most relevant quotes, to avoid confusing the Gemini chatbot.

For completeness, here is the get_max_score method:

@staticmethod
def get_max_score(docs: Optional[List[Document]]) -> float:
    # Find the largest reranker score, defaulting to 0.0 if no docs carry a score.
    max_score: float = 0.0
    if docs:
        max_score = max(
            (doc.score for doc in docs if hasattr(doc, 'score') and doc.score is not None),
            default=0.0,
        )
    return max_score

Conclusions

This idea that we improve our RAG pipeline using the LLM to ‘reason’ about how to improve the results (in this case by removing irrelevant or low-quality results) is a powerful idea that we’re going to explore further in the next post. What else might we be able to do by letting the LLM make choices like this?

My code for this post can be found here.
