Retrieval Augmented Generation with Haystack and pgvector - Part 2
- By Bruce Nielson
- ML & AI Specialist
In my last post, we finally pulled together everything we’ve learned to do Retrieval Augmented Generation (RAG) using Haystack, PostgreSQL, and pgvector. We also used Google’s Gemma as our free open-source Large Language Model (LLM), obtained through the Hugging Face ecosystem. The code for all of this is in my git repository here. (My post on how to set up the environment is here.)
In the previous post we covered how to build both the document conversion pipeline and the RAG pipeline itself using Haystack pipelines. In this post I’m going to briefly cover the remaining parts of my code and explain how they work.
Initializing the Document Store
As far back as our post on installing Haystack, we created some code to either load or initialize the pgvector document store. That code was minimal, so let’s take a look at a more complete version:
def _initialize_document_store(self) -> None:
    connection_token: Secret = Secret.from_token(self._postgres_connection_str)
    document_store: PgvectorDocumentStore = PgvectorDocumentStore(
        connection_string=connection_token,
        table_name=self._table_name,
        embedding_dimension=self.sentence_embed_dims,
        vector_function="cosine_similarity",
        recreate_table=self._recreate_table,
        search_strategy="hnsw",
        hnsw_recreate_index_if_exists=True,
        hnsw_index_name=self._table_name + "_haystack_hnsw_index",
        keyword_index_name=self._table_name + "_haystack_keyword_index",
    )

    self._document_store = document_store

    if document_store.count_documents() == 0 and self._book_file_path is not None:
        sources: List[ByteStream]
        meta: List[Dict[str, str]]
        print("Loading document file")
        sources, meta = self._load_epub()
        print("Writing documents to document store")
        self._doc_converter_pipeline()
        results: Dict[str, Any] = self._doc_convert_pipeline.run(
            {"converter": {"sources": sources, "meta": meta}})
        print(f"\n\nNumber of documents: {results['writer']['documents_written']}")
This code first instantiates a PgvectorDocumentStore object. The key trick here is that if the PostgreSQL database already contains data, we just load what is already there; if it is empty, we populate it with data from the EPUB document. To accomplish this we first instantiate the PgvectorDocumentStore object and then check whether it contains any documents using the ‘count_documents()’ method.
Passing a Connection String
You may notice that we passed a connection string for PostgreSQL to the PgvectorDocumentStore constructor. Back in this post, I talked about how to create a connection string for PostgreSQL and make it an environment variable. I mentioned in passing the ability to instead pass the connection string as a parameter. Here is how we build the connection string that gets passed to the constructor:
self._postgres_connection_str: str = (f"postgresql://{postgres_user_name}:{postgres_password}@"
f"{postgres_host}:{postgres_port}/{postgres_db_name}")
Above we take the parameters passed in (as covered in my previous post) and build a connection string out of them. We can then turn this string into a Haystack ‘Secret’ using this line of code:
connection_token: Secret = Secret.from_token(self._postgres_connection_str)
I know the name ‘from_token’ is misleading: we are passing a connection string, not a token. The function was clearly built to take an OpenAI (or Hugging Face) API token and wrap it in a Secret, but all it really does is turn a string into a Secret, so it works fine here despite the strange name.
Learn more about Haystack Secret Management here.
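As a quick sketch of the alternatives, here is how the same connection string could be wrapped as a Secret either from a raw string or from an environment variable; the variable name PG_CONN_STR is just a placeholder I made up for this example:

from haystack.utils import Secret

# Option 1: wrap a plain string (what the class above does)
connection_secret = Secret.from_token("postgresql://user:password@localhost:5432/postgres")

# Option 2: resolve the value from an environment variable at runtime
# (PG_CONN_STR is a hypothetical variable name used here for illustration)
connection_secret = Secret.from_env_var("PG_CONN_STR")

# Either Secret can then be passed to PgvectorDocumentStore(connection_string=...)

The environment-variable form has the advantage that the password never appears in your source code.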
Populating the pgvector Document Store
If the document store is empty:
if document_store.count_documents() == 0 and self._book_file_path is not None:
We want to load the document store up. First, we declare a list to hold the ByteStream objects:
sources: List[ByteStream]
We’ll need a list to store the metadata as well. Each metadata entry is a dictionary of strings:
meta: List[Dict[str, str]]
Then we load data from the epub file:
sources, meta = self._load_epub()
Once we have the EPUB file loaded as a list of sources and metadata, we first build the document conversion pipeline (as described in the last post):
self._doc_converter_pipeline()
Then we run that pipeline on our sources and metadata:
results: Dict[str, Any] = self._doc_convert_pipeline.run({"converter": {"sources": sources, "meta": meta}})
Loading the EPUB File
We previously went over how to load an EPUB file in a past post, but here is the current version of the code:
def _load_epub(self) -> Tuple[List[ByteStream], List[Dict[str, str]]]:
    docs: List[ByteStream] = []
    meta: List[Dict[str, str]] = []
    book: epub.EpubBook = epub.read_epub(self._book_file_path)
    section_num: int = 1
    for i, section in enumerate(book.get_items_of_type(ITEM_DOCUMENT)):
        section_html: str = section.get_body_content().decode('utf-8')
        section_soup: BeautifulSoup = BeautifulSoup(section_html, 'html.parser')
        headings = [heading.get_text().strip() for heading in section_soup.find_all('h1')]
        title = ' '.join(headings)
        paragraphs: List[Any] = section_soup.find_all('p')
        temp_docs: List[ByteStream] = []
        temp_meta: List[Dict[str, str]] = []
        total_text: str = ""
        for p in paragraphs:
            p_str: str = str(p)
            # Concatenate paragraphs to form a single document string
            total_text += p_str
            p_html: str = f"<html><head><title>Converted Epub</title></head><body>{p_str}</body></html>"
            byte_stream: ByteStream = ByteStream(p_html.encode('utf-8'))
            meta_node: Dict[str, str] = {"section_num": section_num, "title": title}
            temp_docs.append(byte_stream)
            temp_meta.append(meta_node)
        # If the total text length is greater than the minimum section size, add the section to the list
        if len(total_text) > self._min_section_size:
            docs.extend(temp_docs)
            meta.extend(temp_meta)
            section_num += 1
    return docs, meta
This is a slightly improved version of what we had previously. One obvious improvement is that we now create metadata recording which section each paragraph comes from.
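That metadata becomes useful at query time. As a rough sketch (not part of the class above), here is how the stored section_num field could be used to restrict retrieval to one section of the book, assuming Haystack 2.x’s filter syntax for the pgvector embedding retriever:

from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Hypothetical usage: only retrieve paragraphs from section 3 of the book
retriever = PgvectorEmbeddingRetriever(
    document_store=document_store,
    filters={"field": "meta.section_num", "operator": "==", "value": 3},
    top_k=5,
)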
Obtaining Dimensions for the LLM and Sentence Embedders
What if you want to use a different model than I am using? I allow the user to pass in the name of any Hugging Face model, but each model has its own specifications. For example, what is the size of its embedding vector? We need to know that or we can’t create the right-sized table in PostgreSQL.
To handle that I created a property that will obtain the size of the embedding vector returned by the sentence embedder:
@property
def sentence_embed_dims(self) -> Optional[int]:
    if self._sentence_embedder is not None and self._sentence_embedder.embedding_backend is not None:
        return self._sentence_embedder.embedding_backend.model.get_sentence_embedding_dimension()
    else:
        return None
This property was then used (above) to set the size of the embedding dimensions to be stored in the PostgreSQL database:
embedding_dimension=self.sentence_embed_dims,
I’ve created a similar property to get the size of the embedding dimensions for the LLM used to generate responses to the user’s queries:
@property
def llm_embed_dims(self) -> Optional[int]:
    return HaystackPgvector._get_embedding_dimensions(self._llm_model_name)
This second property in turn calls a static method that provides a generic way to get the embedding vector size for any Hugging Face model using the AutoConfig.from_pretrained method. That method takes a model name, loads the model’s configuration, and we read the hidden size from it:
@staticmethod
def _get_embedding_dimensions(model_name: str) -> Optional[int]:
    config: AutoConfig = AutoConfig.from_pretrained(model_name)
    embedding_dims: Optional[int] = getattr(config, 'hidden_size', None)
    return embedding_dims
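For instance, calling this helper directly might look like the sketch below. The model names are just examples, and the dimensions in the comments are what I’d expect from those models’ configs; check your own model to be sure:

# Hypothetical standalone check of embedding dimensions for two models
dims_sentence = HaystackPgvector._get_embedding_dimensions("sentence-transformers/all-MiniLM-L6-v2")
dims_llm = HaystackPgvector._get_embedding_dimensions("google/gemma-1.1-2b-it")
print(dims_sentence)  # expected 384 for this sentence transformer
print(dims_llm)       # expected 2048 for the 2B Gemma model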
Getting the Context Length
Though I don’t use them yet, I also have properties to get the context length for both the sentence transformer and the LLM. The context length is how many tokens the model in question can ‘see’ in its window. This is important for two reasons:
- It determines how long our stored documents should be. If they are longer than the embedder’s context length, then we’re effectively truncating some of the text before doing the embedding.
- It determines whether the full prompt you built using the prompt builder, plus the response, fits into the LLM’s context length. If it doesn’t, then the LLM won’t see the first part of the prompt when generating a response.
As I mentioned, we’re currently not utilizing this information, but we will in future posts, so here is how to get the context length. First for the sentence embedder:
@property
def sentence_context_length(self) -> Optional[int]:
    return HaystackPgvector._get_context_length(self._sentence_embedder.model)
Then for the LLM:
@property
def llm_context_length(self) -> Optional[int]:
    return HaystackPgvector._get_context_length(self._llm_model_name)
Both of these just call a static method I wrote that does the real work by calling AutoConfig.from_pretrained to get the properties of the model:
@staticmethod
def _get_context_length(model_name: str) -> Optional[int]:
    config: AutoConfig = AutoConfig.from_pretrained(model_name)
    context_length: Optional[int] = getattr(config, 'max_position_embeddings', None)
    if context_length is None:
        context_length = getattr(config, 'n_positions', None)
    if context_length is None:
        context_length = getattr(config, 'max_sequence_length', None)
    return context_length
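To show why this will matter, here is a rough sketch (not yet part of the class) of how the context length could be used to warn when a prompt is too long. Counting tokens with the model’s own tokenizer is my own assumption about how we would measure it:

from transformers import AutoTokenizer

def check_prompt_fits(model_name: str, prompt: str, max_new_tokens: int = 500) -> bool:
    # Count the prompt's tokens using the model's own tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompt_tokens = len(tokenizer(prompt)["input_ids"])
    context_length = HaystackPgvector._get_context_length(model_name)
    if context_length is None:
        return True  # unknown context length; assume it fits
    # Leave room for the generated response as well as the prompt
    return prompt_tokens + max_new_tokens <= context_length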
Conclusion
That should cover most of the remaining code in my Haystack RAG class. In future posts we’ll add to this class and improve it with additional features.