Haystack, Google, and Gemma: A Tutorial

In previous posts, I looked at both the API to Google’s Gemini (Google’s latest Large Language Model aka LLM) and the Wikipedia module for Python. My intentions were to use those for a post on how to use Haystack. But unfortunately, things did not go as planned.

This post will explain my woes trying to get Haystack to work with Google Gemini. Failing that, I’ll instead offer a short tutorial (based on this one) that I did finally get to work… sort of... using Google’s Gemma. (What is Google’s Gemma? See below!)

I’ve wanted to do a tutorial on Haystack for a while. But I found that Haystack seems to be optimized for OpenAI and frankly their Gemini integration just doesn’t seem to work right. Since I try to make my tutorials work with free LLMs, that posed a problem that took me a while to deal with.

But first, let me explain what Haystack is and why it is of interest to AI programmers.

What is Haystack and Why Should You Care?

Haystack is a library from DeepSet AI. As their overview page explains, “Haystack is the open-source Python framework by Deepset for building custom apps with large language models (LLMs).”

Recall our post on Semantic Search. You may have wondered something like “Wouldn’t it be nice if some application did all this for us?” Well, that’s Haystack! It basically does what I did in the Semantic Search post, plus a lot more. It sets up a pipeline that you can insert documents into, and then it does all the hard work of splitting them up, embedding them, and even letting you ask questions about them using a Large Language Model (LLM). Plus, it can do a lot more than that:

  • Integrates with other popular document stores
  • Has an annotation tool
  • Has evaluation pipelines
  • Contains a REST API for deployment

In other words, Haystack is an important tool in your NLP arsenal.

The Problem of Haystack Tutorials

Given how cool Haystack is, I really wanted to do a number of tutorials on it. But these blog posts are supposed to (where possible) stick with freely available tools. The best LLM is, I admit, OpenAI’s ChatGPT. But OpenAI is not free, so it isn’t my first pick for these tutorials. Instead, I often use Google’s Gemini because it has a free tier, and its performance is starting to catch up to ChatGPT.

So naturally I thought I’d just utilize Haystack 2.0’s Google modules, which I expected to work more or less identically to the OpenAI versions.


It turns out it’s quite difficult to find a good working tutorial for the Haystack Gemini module, even on Deepset AI’s own website!

Here is the official tutorial for Haystack’s Google integration. It throws an error that I describe in detail in this Stackoverflow post. I tried to find other tutorials for Gemini to follow, and all of them seemed to hit this error at some point. If you look at the comments on my Stackoverflow post, someone from Haystack responded and said it may be a bug. I will report it as a bug and see what happens. But for now, I think something is wrong with Haystack’s Gemini integration, so my original plans are ruined.

Because of these problems, I decided to work on a similar tutorial (found here) that uses the Hugging Face module and loads up Google’s Gemma instead. As we’ll see, this tutorial has a problem as well that I’ll discuss below. But first let’s introduce Google’s Gemma.

Introducing Google’s Gemma

Google’s state-of-the-art model is Gemini and, as discussed in this previous post, it has a free API that allows up to 60 queries per minute. But Google didn’t stop there. They also released into the Hugging Face ecosystem an open source LLM called Gemma. Here is what the official Google blog post on Gemma says about it:

  • “Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.”
  • “We’re releasing model weights in two sizes: Gemma 2B and Gemma 7B. Each size is released with pre-trained and instruction-tuned variants.”
  • “Gemma models share technical and infrastructure components with Gemini, our largest and most capable AI model widely available today. This enables Gemma 2B and 7B to achieve best-in-class performance for their sizes compared to other open models. And Gemma models are capable of running directly on a developer laptop or desktop computer.”

Here is a great tutorial on getting started with Gemma that includes a link to a Colab Notebook so that you can try it out yourself. See also the Kaggle version of the same tutorial.

In short, Gemma is a lightweight version of Gemini that doesn’t require so much hardware.

Haystack with Google’s Gemma Tutorial

Let’s now dive into the actual tutorial. We’re going to explore a simple in-memory document store in Haystack, using ‘documents’ that are just Wikipedia pages. In keeping with the fantasy theme of past posts, we’re going to build a document store containing information about Lord of the Rings. If you want to follow along with this tutorial in my Colab notebook, you can find the one I used here.

Let’s start with some installs:

!pip install haystack-ai==2.0.0 transformers==4.38.0
!pip install wikipedia

This will install everything we need for Haystack, the necessary Hugging Face modules, and the Wikipedia module we played with in a past post.

Now we need to go through authorization for the Hugging Face ecosystem:

from google.colab import userdata
import os

# Read the Hugging Face token from the Colab secret named "HFSecret"
hf_api_key = userdata.get('HFSecret')
os.environ["HF_API_TOKEN"] = hf_api_key

This part is similar to how we logged in for Google Gemini in our previous post. But in this case, create a Colab secret called “HFSecret” (or whatever you want to name it; just adjust the code to match) and insert your Hugging Face token there.

If you do not have a Hugging Face token, you’ll need to first sign up for Hugging Face here. Then you can obtain a token here.

Loading Google’s Gemma

Now that we’re authenticated with the Hugging Face API, you’ll need to sign up for Gemma by accepting its license. You can find Gemma’s model card here, where you will find the option to accept the license. You must do this before we can continue. Once you’ve got rights to the Gemma model, enter this code to attempt (and I emphasize ‘attempt’) to load the model:

from haystack.components.generators import HuggingFaceTGIGenerator

generator = HuggingFaceTGIGenerator(
    model="google/gemma-7b-it",
    generation_kwargs={"max_new_tokens": 500})


There is a good chance you’ll get an error at this point. Here is the one I often get:

ValueError: The model google/gemma-7b-it is not deployed on the free tier of the HF inference API. To use free tier models provide the model ID and the token. Valid models are:…

After so many errors in other Haystack tutorials I was about to pull my hair out. This message is claiming that you can’t use Gemma through the Hugging Face ecosystem – even if you have rights to it – unless you are on the paid tier. See this post for a discussion about a similar error for Llama.

However, there is a trick that should get you past this. In your Colab notebook, go to Runtime -> Manage Sessions and cancel the current session. (This might also work if you use Runtime -> Restart session.) Then rerun all of the above until you get a successful load of the Gemma model.

If this doesn’t work, try this instead:

from haystack.components.generators import HuggingFaceTGIGenerator

# Substitute a model ID from the free-tier list shown in the error message
generator = HuggingFaceTGIGenerator(
    model="<free-tier-model-id>",
    generation_kwargs={"max_new_tokens": 500})


I still got pretty good results with that model and it is listed as being part of the ‘free tier.’

Once you are able to load the Gemma model, you are ready to do a few imports:

from IPython.display import Image
from pprint import pprint
import rich
import random
import wikipedia
from haystack.dataclasses import Document

Wikipedia Module

Now let’s do a search using the Wikipedia module to get the top 14 results searching on “Lord of the Rings”:

# Do a Wikipedia search on "Lord of the Rings" for top 14 results
search_results = wikipedia.search("Lord of the Rings", results=14)

Here is the result I got:

['The Lord of the Rings',
 'The Lord of the Rings: The Rings of Power',
 'The Lord of the Rings (film series)',
 'The Lord of the Rings: The Fellowship of the Ring',
 'The Lord of the Rings: The Return of the King',
 'The Lord of the Rings: The Rings of Power season 2',
 'The Lord of the Rings: The Rings of Power season 1',
 'The Lord of the Rings: The Two Towers',
 'The Lord of the Rings (1978 film)',
 'The Lord of the Rings: Gollum',
 'The Lord of the Rings: The War of the Rohirrim',
 'The Lord of the Rings Online',
 'Alloyed (The Lord of the Rings: The Rings of Power)',
 'The Lord of the Rings: Return to Moria']

Those are the documents we’ll put into the document store. Hopefully you catch the vision here: you can simply grab ‘documents’ off the web and store them in Haystack. Haystack will then have access to these documents for queries to your Large Language Model. This is more or less identical to what we did when we directly programmed a semantic search back in this post.

Let’s now store these documents into a Python list:


raw_docs = []
for title in search_results:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)

Setting Up the Haystack Document Pipeline

Now let’s set up the Haystack pipeline for our documents:

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=2))
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "writer")

You should get back a graph showing your Pipeline:

[Pipeline graph: documents (List[Document]) → cleaner (DocumentCleaner) → splitter (DocumentSplitter) → writer (DocumentWriter) → documents_written (int)]

You can tell from this image that this is a pretty simple pipeline: it starts with the list of documents we created, runs them through a document cleaner, splits the documents up into right-sized chunks, and then runs them through a final document writer. Go ahead and run the pipeline:

indexing.run({"cleaner": {"documents": raw_docs}})
And you’ll get this back:

{'writer': {'documents_written': 1659}}

Note that we passed in 14 pages from Wikipedia and we now have 1659 ‘documents’. This is similar to what we did back in our semantic search post where we split each PDF into paragraphs or pages first to run the cosine similarity against. The same thing is going on here.

Let’s explore our pipeline a bit. Try asking the pipeline what inputs it expects:

indexing.inputs()
{'cleaner': {'documents': {'type': typing.List[haystack.dataclasses.document.Document],
   'is_mandatory': True}},
 'writer': {'policy': {'type': typing.Optional[haystack.document_stores.types.policy.DuplicatePolicy],
   'is_mandatory': False,
   'default_value': None}}}

Or try walking through the nodes:

for node in indexing.walk():
    print(node)

('cleaner', <haystack.components.preprocessors.document_cleaner.DocumentCleaner object at 0x7cb7e1e3abf0>
  - documents: List[Document]
  - documents: List[Document])
('splitter', <haystack.components.preprocessors.document_splitter.DocumentSplitter object at 0x7cb7e1e3bcd0>
  - documents: List[Document]
  - documents: List[Document])
('writer', <haystack.components.writers.document_writer.DocumentWriter object at 0x7cb7e1e80f10>
  - documents: List[Document]
  - policy: Optional[DuplicatePolicy]
  - documents_written: int)

Or let’s try actually looking at the first 5 ‘documents’ that are now contained in the document store:

docs = document_store.filter_documents()
# Print first 5 documents
for i in range(0,5):
  print(docs[i].meta['title'], docs[i].meta['url'], docs[i].content)

I get this back:

The Lord of the Rings https://en.wikipedia.org/wiki/The_Lord_of_the_Rings The Lord of the Rings is an epic high fantasy novel by the English author and scholar J. R.
The Lord of the Rings https://en.wikipedia.org/wiki/The_Lord_of_the_Rings  R. Tolkien.
The Lord of the Rings https://en.wikipedia.org/wiki/The_Lord_of_the_Rings  Set in Middle-earth, the story began as a sequel to Tolkien's 1937 children's book The Hobbit, but eventually developed into a much larger work. Written in stages between 1937 and 1949, The Lord of the Rings is one of the best-selling books ever written, with over 150 million copies sold.

You can see that a ‘document’ is a small bite-sized chunk of text that will then be used for the semantic search. You want these to follow the Goldilocks’ rule: not too big and not too small. The first two seem a bit small to me. But after that it is looking pretty good. (I had a similar problem when I programmed it directly myself.)
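To make the chunking concrete, here is a rough, dependency-free sketch of what a splitter configured like our DocumentSplitter(split_by='sentence', split_length=2) does conceptually. (The real Haystack component is far more careful; the naive period-splitting below is my own simplification.)

```python
def split_by_sentence(text, split_length=2):
    """Naively split text into sentences, then group them into
    chunks of `split_length` sentences each."""
    # Very naive sentence detection: split on periods.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    return [
        " ".join(sentences[i:i + split_length])
        for i in range(0, len(sentences), split_length)
    ]

chunks = split_by_sentence(
    "The Lord of the Rings is an epic high fantasy novel. "
    "It was written by J. R. R. Tolkien. "
    "It began as a sequel to The Hobbit. "
    "It grew into a much larger work.",
    split_length=2,
)
# chunks[1] is "R. R." -- the initials got treated as sentences
```

Notice how the naive period split chops “J. R. R. Tolkien” into fragments. That is essentially the same artifact you can see in the first two ‘documents’ printed above.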

What is Retrieval Augmented Generation?

Before we move on let’s take a moment to discuss what Retrieval Augmented Generation (RAG) is.

Think of RAG as a set of techniques where you combine an LLM with some sort of query process that inserts relevant text into the prompt sent to the LLM. Think of it like this. Imagine you wanted to use an LLM to answer financial questions about your company. Obviously, your company’s current financial information wasn’t available when, say, ChatGPT was trained. So it can’t realistically know the answers you seek. And even if the information had been around, that is such a specific set of questions that ChatGPT is far more likely to hallucinate an answer than give the correct one.

So how might we deal with that problem? One idea is to intentionally insert the relevant financial information right into the prompt sent to the LLM. That way the LLM has the relevant information right there in front of it. That greatly decreases the chances that it will hallucinate the answer.
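In other words, the core move is just string assembly: retrieve the passages most relevant to the question and paste them into the prompt. A minimal, library-free sketch of the idea (the function name and sample data here are my own illustration, not Haystack’s API):

```python
def build_rag_prompt(question, retrieved_passages):
    """Assemble an LLM prompt that embeds retrieved context
    ahead of the user's question."""
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "Using the information contained in the context, "
        "answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What was Q3 revenue?",
    ["Q3 revenue was $4.2M, up 12% year over year."],
)
```

The LLM now sees the relevant facts directly in the prompt, so it can answer from them instead of guessing.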

The Haystack Retrieval Augmented Generation (RAG) Pipeline

Now that we’ve got the document pipeline set up, we need a prompt template that we’re going to send to the model. I took this from the Haystack tutorial, which I thought was pretty good:

from haystack.components.builders import PromptBuilder

prompt_template = """
Using the information contained in the context, give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.

  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};

Question: {{query}}<end_of_turn>
"""

prompt_builder = PromptBuilder(template=prompt_template)

What’s going on here is probably obvious, but let me explain anyhow. We’ve created a prompt that first explains to the model what we want an answer to look like. In this case, we instruct it to use the ‘context’ (which is where the best-matching documents will be inserted) to answer the user’s question. It also asks it to reference the URL. (We’ll see that the model sometimes forgets to do this! But it generally works.) It also instructs the model to refuse to answer if the answer isn’t available. This should reduce the chances it will just hallucinate an answer.

Then you see this code:

  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};

This loops over the retrieved documents, writing out the text of each ‘document’ along with its URL. Then we insert the actual question:

Question: {{query}}<end_of_turn>
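If Jinja templating is unfamiliar, here is a plain-Python approximation of what that loop renders for a couple of retrieved documents (the sample documents are made up for illustration):

```python
documents = [
    {"content": "The Lord of the Rings is a novel.",
     "meta": {"url": "https://en.wikipedia.org/wiki/The_Lord_of_the_Rings"}},
    {"content": "The film trilogy was directed by Peter Jackson.",
     "meta": {"url": "https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)"}},
]

# Mimics: {% for doc in documents %} {{ doc.content }} URL:{{ doc.meta['url'] }} {% endfor %}
rendered = "\n".join(
    f"{doc['content']} URL:{doc['meta']['url']}" for doc in documents
)
```

Each retrieved chunk ends up in the prompt as its text followed by its source URL, which is what lets the model cite sources.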

Retrieving the Results

Now we need some code to retrieve an answer back from the model:

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

rag = Pipeline()
rag.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=5))
rag.add_component("prompt_builder", prompt_builder)
rag.add_component("llm", generator)

rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

You should get back a graph showing the nodes you just setup:

[Pipeline graph: query (str) → retriever (InMemoryBM25Retriever) → documents (List[Document]) → prompt_builder (PromptBuilder) → prompt (str) → llm (HuggingFaceTGIGenerator) → replies (List[str])]

Roughly speaking, we’re asking Haystack to retrieve the top 5 documents that match the query, build a prompt out of it using our template, then send that to the LLM to generate a response.
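Under the hood, BM25 is a keyword-based relevance score. As a rough illustration only (real BM25 also weights rare terms more heavily and normalizes for document length), here is a toy retriever that ranks documents by query-term overlap and keeps the top k:

```python
def toy_retrieve(query, documents, top_k=5):
    """Rank documents by how many query words they contain
    (a crude stand-in for BM25 scoring)."""
    q_words = set(query.lower().split())
    scored = [
        (len(q_words & set(doc.lower().split())), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only documents that matched at least one query word
    return [doc for score, doc in scored[:top_k] if score > 0]

docs = [
    "Glorfindel is an elf who rides with Frodo.",
    "The film grossed nearly three billion dollars.",
    "Sauron forged the One Ring.",
]
top = toy_retrieve("who is glorfindel", docs, top_k=2)
# top contains only the Glorfindel document
```

The InMemoryBM25Retriever does the real version of this scoring over our 1659 chunks and hands the winners to the prompt builder.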

Finally, we’re ready to ask questions. Let’s build a function to generate answers using our RAG pipeline:

def get_generative_answer(query):

  results = rag.run({
      "retriever": {"query": query},
      "prompt_builder": {"query": query}
  })

  answer = results["llm"]["replies"][0]
  rich.print(answer)

This function takes the user’s question and feeds it into the rag pipeline and gets a result back from the Gemma model.

Now let’s make a list of questions to ask the model:

questions="""Is the Lord of the Rings a trilogy?
When was Lord of the Rings published?
Who is Glorfindel?
How much did the Lord of the Rings films gross?
Which actor played Gollum / Sméagol in the Lord of the Rings films?
What was the Lord of the Rings television series called?
What was the reception of the television show Rings of Power?""".split('\n')

And finally, we’ll iterate over the list of questions and try each one:

for q in questions:
  get_generative_answer(q)

I’ve intentionally made some of these questions harder and some easier. This will help illustrate the limits of what a Large Language Model using RAG techniques can do.

Model Results

Let’s see how the model does with each question:

  • Question: Is the Lord of the Rings a trilogy?
  • Answer: Yes, the Lord of the Rings is a trilogy.
  • Source URL:
    • en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Return_of_the_King
    • en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)

This is a technically correct answer, of course. But I would note that the Wikipedia page for “Lord of the Rings” mentions it wasn’t intended to be a trilogy. I was hoping it would catch that fact, but it didn’t. And looking at its sources, you can see why: it actually looked up the films and the third book (The Return of the King) in the series.

  • Question: When was Lord of the Rings published?
  • Answer: The text states that the Second Edition of The Lord of the Rings was published in 1965.

This is a correct answer for the second edition, but not really the answer I was hoping for. It also didn’t source the answer.

  • Question: Who is Glorfindel?
  • Answer: Glorfindel is an elf who rides with Frodo, now deathly ill, towards Rivendell in the Lord of the Rings film series.
  • Source URL:
    • [The Lord of the Rings (1978 film)] (en.wikipedia.org/wiki/The_Lord_of_the_Rings_(1978_film))
    • [The Lord of the Rings] (en.wikipedia.org/wiki/The_Lord_of_the_Rings)
    • [The Lord of the Rings (film series)] (en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series))

This is a correct answer! Strangely, it sources the films for this answer even though this character didn’t appear in the films!

  • Question: Which actor played Gollum / Sméagol in the Lord of the Rings films?
  • Answer: The text does not specify the actor who played Gollum / Sméagol in the Lord of the Rings films, therefore I cannot answer this question.
  • Source URL:
    • en.wikipedia.org/wiki/The_Lord_of_the_Rings:_Gollum
    • en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Two_Towers

I picked this question precisely because I didn’t see it answered on the Wikipedia page for the films. And it correctly refused to give an answer. Then it went on to source its non-answer. That’s a bit strange.

Feel free to go on and try out the other questions and see how it does. I think it should be obvious that RAG techniques are not a panacea. The model does okay with the questions, but they have to be fairly obvious questions with fairly obvious answers within the document store.

Hopefully you are catching the vision of just how powerful Haystack really is and how it could be used in your applications.

