
Giving Your RAG Chatbot Some Agency
- By Bruce Nielson
- ML & AI Specialist
In a previous post, we introduced a new idea: using the chatbot itself to help solve problems with our RAG pipeline. We first used Google Gemini to determine the relevance of quotes returned from our RAG pipeline. To avoid spending too many tokens, we only did this if the Reranker didn’t score any of the quotes from the RAG pipeline very highly.
The Problem of Long-Winded Questions
Another problem we sometimes bump into is that users can get long-winded when chatting. And while Google’s Gemini (with its very large context window) can typically handle that pretty well, our semantic search can’t. Suppose a user asks a long, rambling question with lots of extra words. The resulting embedding will be diluted to the point where the quotes returned aren’t likely to be strong hits.
How might we address this problem? We already know (from the last post) we can use Gemini itself to address problems like this, so how might we go about solving this particular problem?
AI Popper Meets His Match This Time?
Let’s suppose that I start with the following question to my Popper chatbot:
“Tell me about induction”
This first question is short and concise and will find many really good hits out of the database of Popper’s writings. That means the Retrieval-Augmented Generation (RAG) pipeline will serve up high-quality quotes for the chatbot to use.
But then suppose we follow that question with this question:
“Tell me about Bozo the clown”
We want our chatbot to give as relevant an answer (in the persona of Popper) as it can despite the seeming irrelevance of the question. In fact, the chatbot does okay with this question. We covered in the last post how to make sure the quote it retrieves is relevant.
But let’s try out a better idea: what if we ask Gemini to improve the user’s query directly?
Improving the User’s Query
Let’s start with some code that shows what I have in mind:
def ask_llm_for_improved_query(self, message: str, chat_history: List[Dict[str, Any]]) -> str:
    prompt = (
        f"Given the query: '{message}' and the current chat history, the database of relevant quotes found none "
        f"that were a strong match. This might be due to poor wording on the user's part. "
        f"Reviewing the query and chat history, determine if you can provide a better wording for the query "
        f"that might yield better results. If you can improve the query, return the improved "
        f"query. If you cannot improve the question, return an empty string (without quotes around it) and we'll "
        f"continue with the user's original query. There is no need to explain your thinking if you want to return "
        f"an empty string. Do not return quotes around your answer.\n\n"
        f"You must either return a single sentence or phrase that is the new query (without quotes around it) or "
        f"an empty string (without quotes around it). Keep the new query as concise as possible to improve matches."
        f"\n\n"
    )
    return self.ask_llm_question(prompt, chat_history)
Really, all this code does is create an (admittedly overly complex) prompt that asks the Large Language Model (LLM) to take the user’s query and, using the chat history for context, see if it can reword it so that it finds better matches in the document/quote database.
This is a remarkably simple approach, though I had to spend quite a bit of time coming up with a prompt that actually did what I wanted, and it still seems unwieldy and too long to me. Even with multiple reminders about format, Gemini will sometimes ignore my instructions and do its own thing, such as returning the words ‘empty string’ instead of an actual empty string. But this works well enough, so we’ll go with it for now. We can talk about how to improve it using something like DSPy or LangChain in a future post.
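If you don’t have the earlier posts handy, here is a minimal sketch of what the ask_llm_question() helper might look like using Google’s google-generativeai SDK. The model name and the shape of the chat history here are my assumptions for illustration, not the exact code the chatbot uses:

from typing import List, Dict, Any
import google.generativeai as genai

def ask_llm_question(self, prompt: str, chat_history: List[Dict[str, Any]]) -> str:
    # Assumed helper (a method on the chatbot class): send the prompt plus prior turns
    # to Gemini and return the raw text reply. Model name and history format are placeholders.
    model = genai.GenerativeModel("gemini-1.5-flash")
    chat = model.start_chat(history=chat_history)  # e.g. [{"role": "user", "parts": ["..."]}, ...]
    response = chat.send_message(prompt)
    return response.text.strip()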
Giving the Chatbot Some Agency
Now let’s integrate this into the actual chatbot with this code, which runs just before the code we wrote in the last post:
if max_score is not None and max_score < 0.50:
    # If we don't have any good quotes, ask the LLM if it wants to do its own search
    improved_query: str = self.ask_llm_for_improved_query(message, gemini_chat_history)
    # The LLM is sometimes stupid and takes my example too literally and returns "''" instead of "" for an
    # empty string. So we need to check for that and convert it to an empty string.
    # Unfortunately, dropping that instruction tends to cause it to think out loud before returning an empty
    # string at the end. Which sort of defeats the purpose.
    # Strip off double or single quotes if the improved query starts and ends with them.
    if improved_query.startswith(('"', "'")) and improved_query.endswith(('"', "'")):
        improved_query = improved_query[1:-1]
    if improved_query.lower() == "empty string":
        improved_query = ""
    new_retrieved_docs: List[Document]
    temp_all_docs: List[Document]
    if improved_query != "":
        new_retrieved_docs, temp_all_docs = self.doc_pipeline.generate_response(improved_query)
        new_max_score: float = self.get_max_score(new_retrieved_docs)
        if new_max_score > max(max_score * 1.1, max_score + 0.05):
            # If the new max score is better than the old one, use the new docs
            retrieved_docs = new_retrieved_docs
            all_docs = temp_all_docs
            max_score = new_max_score
Let’s walk through this code. First, we only look at improving the user’s query if the top relevance score of the returned documents is below 0.50. Otherwise, we just go with the documents found by the RAG pipeline.
Next, we call our ask_llm_for_improved_query() method to get a better query to use. If it returns an empty string, we just move on; that is the LLM’s way of saying it couldn’t improve the query.
We also do a few checks in case the LLM got too creative and returned something inappropriate, such as returning a string wrapped in quotes instead of an empty string or, worse, the literal string ‘empty string’ instead of an empty string. Yeah, it does stuff like that, unfortunately. I need to see if I can improve the prompt to fix that, but these extra checks are valuable regardless.
Finally, I take the improved query and run the RAG pipeline again to get a new set of quotes/documents back. But are they really an improvement? We measure the new top score and only keep the new results if they beat the old ones by some margin.
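One helper I haven’t shown here is get_max_score(). Conceptually, it just pulls the top reranker score out of the retrieved documents. A rough sketch, assuming each Document carries its reranker relevance in a score attribute (an assumption on my part), might look like this:

from typing import List, Optional

def get_max_score(self, docs: List[Document]) -> Optional[float]:
    # Hypothetical sketch: return the highest reranker relevance score among the retrieved
    # documents, or None if nothing came back (which the calling code checks for).
    scores = [doc.score for doc in docs if doc.score is not None]
    return max(scores) if scores else None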
How Well Does It Work?
Remember in the previous post where we asked AI Karl Popper about Bozo the clown?
“Tell me about Bozo the clown”
Now the chatbot may come back with an improved query, something like this:
“philosophy of clowns”
That will likely find the very same passage we found before. And sure enough, that’s what happens. So this improved query didn’t really help much.
But then consider this rather ridiculous query that I tried next:
“My grandma once took me to the circus to see Bozo. There he was honking his nose and tossing rings while wearing giant shoes. Several of his compatriots all piled out of the tiny car they were in. And that was when Grandma whispered to me that Bozo's view of the topic we're discussing was based on Kant rather than Hume. Do you recall what topic we were discussing? She insisted that since fire engines were red and Trump was president that Bozo was right to accept the philosophy under discussion even though you rejected it. Was Bozo correct about this philosophy or not?”
What is interesting about this query is just how long-winded and irrelevant most of it is. Further, I never directly ask my question. For example, I never mention ‘induction’, even though it is implied by how I worded the query (e.g. “Do you recall what topic we were discussing?”).
When the reranker comes back, the top score is: 0.01331. Not good!
So, Gemini improves the query to this:
'Induction, Kant, Hume, and falsification'
Whoa! That’s spot on! Really, that long-winded question boiled down to asking whether Kant or Hume had the correct view of induction. The new top score with this improved query is 0.9876!
The Power of a Good Large Language Model (LLM)
What’s interesting is that I only use the ‘improved query’ to find better hits from the RAG pipeline. I still feed the user’s entire query to the Gemini chatbot. And the result is hilarious but on point. AI Popper responded not only about Kant’s and Hume’s views of induction (and how he had improved on their views) but also carefully applied that answer to Bozo the clown, my Grandma, and how fire engines may or may not be red.
The key here is that the Gemini chatbot plus the improved RAG results together give a first-rate answer, whereas if I had just fed that ridiculous query to the RAG pipeline on its own, the embedding would be so diluted that the answers would be close to random.
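To make that division of labor concrete, here is a rough sketch of the overall flow. The final prompt wording and the doc.content attribute are my own illustration rather than the chatbot’s actual code:

# Sketch of the overall flow: the improved query only drives retrieval,
# while Gemini always answers the user's ORIGINAL message.
retrieved_docs, all_docs = self.doc_pipeline.generate_response(message)
max_score = self.get_max_score(retrieved_docs)

if max_score is not None and max_score < 0.50:
    # ... the query-improvement and rescoring logic shown above, which may
    # swap in a better set of retrieved_docs ...
    pass

# Build the final answer from the original message plus whichever quotes won out.
quotes = "\n\n".join(doc.content for doc in retrieved_docs)
final_prompt = (
    f"Answer in the persona of Karl Popper, drawing on these quotes where relevant:\n\n"
    f"{quotes}\n\nUser's question: {message}"
)
answer = self.ask_llm_question(final_prompt, gemini_chat_history)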
Conclusions
What’s beautiful about this is that we’re starting to really give the chatbot its own ‘agency’. It gets to decide what to query in the database if the user’s query wasn’t very good. This is a minimal step toward the idea of AI Agents, which we’ll cover in a future post.