Avoiding Text Truncations in RAG
- By Bruce Nielson
- ML & AI Specialist
In past blog posts, we’ve been building a cost-efficient Retrieval Augmented Generation (RAG) pipeline using open-source tools such as the PostgreSQL database, pgvector, open-source models from Hugging Face, and Haystack from Deepset.AI to build the pipelines. You can find the post on environment setup here and the posts on building the pipeline code here and here. Most recently, we added to the code base the ability to use Google Gemini as the Large Language Model (LLM) and even to use the Hugging Face API to host an open-source LLM.
One thing that has bothered me is that the sentence transformer/embedder model I’ve been using is Haystack’s default, sentence-transformers/all-mpnet-base-v2. This model isn’t the best at embedding sentences, and its context window (explained below) isn’t very large. The danger is that we’ll break the documents we send to the database into chunks too large for the model, so it will truncate the text and the truncated portion won’t be part of the saved vector. This could degrade the performance of our document retrieval system.
In this post we’re going to make sure we embed our documents such that the text isn’t truncated. We’ll also try out models with larger context windows, such as Alibaba-NLP/gte-large-en-v1.5.
A Note on the Updated Code Base
Up to this point I’ve tried to keep all my code in one file. But that is starting to get unwieldy, so this is the last post for which I’ll offer a single file of code, which for this post is found here. Instead, I’ve now split the code into three files:
- The Generator Models: generator_model.py (Latest version: may include changes from future posts)
  - This file contains a virtual (abstract) class called GeneratorModel that lets you wrap any LLM for the RAG pipeline, with certain default properties and methods the pipeline can rely on existing.
  - It also contains concrete classes such as HuggingFaceLocalModel, HuggingFaceAPIModel, and GoogleGeminiModel, so that you can switch to whichever model you wish to use.
- The Document Processor: document_processor.py (Latest version)
  - This file contains the class DocumentProcessor, which contains the Haystack pipeline to build the document store from an EPUB file.
- The RAG Pipeline: rag_pipeline.py (Latest version)
  - This file contains the class RagPipeline, which contains the actual RAG pipeline used to query the document store.
Most of this code is pretty similar to what we’ve previously developed, just broken up into separate files to make things more manageable. So I’m not going to go over all the code in detail. Instead, I’ll just go over the code specifically relevant to making the document processor intelligently break up EPUB documents.
Haystack Support for SentenceTransformer
In past posts, we loaded the Hugging Face ecosystem, which includes the SentenceTransformer module. The SentenceTransformer module is actually from sbert.net. (API Documentation found here). As the sbert.net website explains:
“Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models or to calculate similarity scores using Cross-Encoder models. This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.”
Haystack supports SentenceTransformer indirectly through two important classes:
- SentenceTransformersDocumentEmbedder: A component for embedding text in a Haystack Document class instance. We’ll use this to create embedding vectors for our document store. (API Documentation found here).
- SentenceTransformersTextEmbedder: A component for embedding text in a string. This is used for embedding queries in the RAG pipeline. (API Documentation found here).
Here is the general Haystack documentation on sentence embedders.
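To make that concrete, here’s a minimal sketch of how these two components are typically used on their own (this assumes Haystack 2.x import paths; it isn’t code from our pipeline files):

from haystack import Document
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)

# Document-side embedder: attaches an embedding vector to each Document.
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-mpnet-base-v2")
doc_embedder.warm_up()
docs = doc_embedder.run(documents=[Document(content="The Federalist Papers were written in 1787-88.")])["documents"]
print(len(docs[0].embedding))  # dimensionality of the stored vector

# Query-side embedder: turns a query string into a vector for retrieval.
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-mpnet-base-v2")
text_embedder.warm_up()
query_embedding = text_embedder.run(text="Who wrote the Federalist Papers?")["embedding"]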
Just What Is the Context Length for sentence-transformers/all-mpnet-base-v2?
In previous posts, I had some code to figure out the context window length for the sentence embedder we’re using. Turns out, that code was incorrect, and we’ll need to fix that.
But first, what is the ‘context length’ for a sentence transformer/embedder, you ask? Well, all language models have a context length, which is basically how much text (measured in tokens) the model can process at one time. Think of it like when you’re chatting with ChatGPT: it’s great at keeping track of the recent conversation, but as the text scrolls farther up the screen, it starts forgetting. That’s because there’s a limited window of text the model can "see," and anything outside of that gets lost in the void.
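If you want a quick feel for what "measured in tokens" means, here’s a tiny standalone snippet using the transformers tokenizer for the default embedder (not part of our pipeline, just an illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
sample = "A context window is measured in tokens, not characters or words."
print(len(tokenizer.encode(sample)))  # token count, including the special start/end tokens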
So, what’s the context window for our sentence embedder that generates the vectors we’re saving to the PostgreSQL database? In the past, I used something like the code below to figure it out:
@staticmethod
def _get_context_length(model_name: str) -> Optional[int]:
    config: AutoConfig = AutoConfig.from_pretrained(model_name)
    context_length: Optional[int] = getattr(config, 'max_position_embeddings', None)
    if context_length is None:
        context_length = getattr(config, 'n_positions', None)
    if context_length is None:
        context_length = getattr(config, 'max_sequence_length', None)
    return context_length
This uses Hugging Face’s AutoConfig component to grab the properties (contained in the config.json file) for a model. When I checked the context length for the default sentence embedder used by Haystack (i.e. sentence-transformers/all-mpnet-base-v2), it came back as 514 tokens. This means (or is supposed to mean) it can handle up to 514 tokens before it starts truncating text. Just to be thorough, I double-checked: AutoConfig pulls these numbers straight from the model’s config.json file, which confirms that 514 is indeed the max_position_embeddings (i.e. this model’s property for the context length).
Now, up until recently, I figured this was perfectly fine. After all, 514 tokens can cover a decent chunk of text, so I wasn’t too concerned about cutting anything off when embedding paragraphs. But just to be extra careful, I set up the document processing pipeline with a DocumentSplitter node, instructing it to split text after 10 sentences. I thought, “This should work like a charm.”
Naturally, I was wrong.
Lesson learned: always test these things, especially when you’re building a production-ready app.
There is a more direct way to get the context length for the sentence embedder. Here is my revised code:
@property
def context_length(self) -> Optional[int]:
    self._setup_embedder()
    if self._sentence_embedder is not None and self._sentence_embedder.embedding_backend is not None:
        return self._sentence_embedder.embedding_backend.model.get_max_seq_length()
    else:
        return None
Here I’m not bothering with config.json (as read by AutoConfig); instead I’m grabbing the backend model directly and using its get_max_seq_length() method to find out what the real context length is. It turns out that sentence-transformers/all-mpnet-base-v2 actually only has a context length of 384 tokens.
That’s short enough that I’m probably regularly truncating my document fragments as I embed them for my document store.
Turns out (according to this post) that you can’t trust the config.json file. (Compare to here, here, and here).
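If you want to sanity-check this yourself outside of the pipeline code, a minimal standalone check (assuming the sentence-transformers package is installed) looks like this:

from sentence_transformers import SentenceTransformer

# Ask the loaded model itself rather than trusting config.json.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.get_max_seq_length())  # reports 384, not the 514 from config.json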
Avoiding Truncation
So now that we know we only have 384 tokens for each document fragment, how can we guarantee that every single document fragment stored in PostgreSQL will fit within that context window?
Moreover, it would be nice if our solution could dynamically resize the document fragments if we later switch to a sentence embedder with a larger context window (which we’re going to do in this post!).
There is a larger issue here that I’m not really addressing. Ideally, how should we break up a document (such as an EPUB or PDF)? By page? By paragraph? By sentence? Multiple sentences?
What we want is for the document fragments we store in PostgreSQL to be semantically ‘self-contained’ as much as possible. We want each document fragment to carry a single thought. That sounds like a paragraph to me, though with a large enough context length it might make sense to try splitting by page instead. (But see below for an argument against embedding a full page.)
Luckily, we are using my EPUB code to load EPUB files, which allows us to automatically break up the document into paragraphs. Most paragraphs probably fit even into the 384-token context window, though a few are definitely too long for it. In those cases, we want to break the paragraph up by sentences until it fits within the context window, so that we don’t truncate any of the text, while staying as close to a full paragraph as possible. (Note: this assumes we’ve decided that, for our purposes, we want to semantically break documents into paragraphs. Your own app may have different needs.)
Strategically Breaking Paragraphs to Avoid Truncation
Our proposed strategy to avoid truncation is as follows:
- First, try embedding the entire paragraph. Does the number of tokens exceed the context length (384 tokens for the default model)? If it fits, the paragraph will be embedded and saved as a vector to our document store.
- If the paragraph is too large, we’ll try grabbing 10 sentences at a time (with an overlap of 1 sentence) and then check the number of tokens required. If it fits, we’ll save it to the document store.
- If it still doesn’t fit, we’ll reduce to 9 sentences and try again. If it still doesn’t fit, we’ll drop an additional sentence, and so on.
Note that this means we’re going to have to check the actual token length using the sentence transformer model’s tokenizer, which will slow down our pipeline quite a bit. But the document loading pipeline (for our toy app, anyway) runs only once, so this isn’t a big deal for us.
A Custom Component To Avoid Truncation
I don’t really want to write my own replacement for DocumentSplitter, but the built-in DocumentSplitter doesn’t check whether the embedder will truncate the text. So we’re going to write a custom component to do the job, though it will still use Haystack’s DocumentSplitter under the hood. Here is my proposed code:
from typing import List

from haystack import Document, component
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from sentence_transformers import SentenceTransformer


@component
class _CustomDocumentSplitter:
    def __init__(self, embedder: SentenceTransformersDocumentEmbedder):
        self.embedder: SentenceTransformersDocumentEmbedder = embedder
        self.model: SentenceTransformer = embedder.embedding_backend.model
        self.tokenizer = self.model.tokenizer
        self.max_seq_length = self.model.get_max_seq_length()

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> dict:
        processed_docs = []
        for doc in documents:
            processed_docs.extend(self.process_document(doc))
        print(f"Processed {len(documents)} documents into {len(processed_docs)} documents")
        return {"documents": processed_docs}

    def process_document(self, document: Document) -> List[Document]:
        token_count = self.count_tokens(document.content)
        if token_count <= self.max_seq_length:
            # Document fits within max sequence length, no need to split
            return [document]
        # Document exceeds max sequence length, find optimal split_length
        split_docs = self.find_optimal_split(document)
        return split_docs

    def find_optimal_split(self, document: Document) -> List[Document]:
        split_length = 10  # Start with 10 sentences
        while split_length > 0:
            splitter = DocumentSplitter(
                split_by="sentence",
                split_length=split_length,
                split_overlap=min(1, split_length - 1),
                split_threshold=min(3, split_length)
            )
            split_docs = splitter.run(documents=[document])["documents"]
            # Check if all split documents fit within max_seq_length
            if all(self.count_tokens(doc.content) <= self.max_seq_length for doc in split_docs):
                return split_docs
            # If not, reduce split_length and try again
            split_length -= 1
        # If we get here, even single sentences exceed max_seq_length,
        # so just let the embedder truncate the document,
        # but give a warning that the document will be truncated.
        print(f"Document was truncated to fit within max sequence length of {self.max_seq_length}: "
              f"Actual length: {self.count_tokens(document.content)}")
        print(f"Problem Document: {document.content}")
        return [document]

    def count_tokens(self, text: str) -> int:
        return len(self.tokenizer.encode(text))
This code does exactly what we discussed above. It first checks whether the whole paragraph fits without truncation. If it doesn’t, it splits the paragraph into chunks of 10 sentences and checks again, dropping one sentence at a time until every chunk fits within the context length of the sentence embedder. It also automatically adjusts to whatever the actual context length is for whatever model we are using. So, if we want to swap out the default sentence transformer model for a better one, this code will adapt.
You might notice that my code allows for an overlap of 1 sentence. This means that when I split up the text, I allow one sentence to overlap between fragments. This helps ensure we don’t lose too much context when we split up the text of a paragraph into document fragments.
This is still a fairly primitive way to split up documents. There are more advanced techniques available that we’ll cover in future posts. But for now, this should work pretty well. And in any case, this technique is a useful tool in our toolbox for when we use more advanced techniques.
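For reference, here’s a rough sketch of how this custom splitter might be wired into a Haystack indexing pipeline. To keep the sketch self-contained I use an in-memory document store and a single hard-coded paragraph; the actual wiring (with the pgvector document store and the EPUB loader) lives in document_processor.py:

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# In-memory store used here just to keep the sketch runnable;
# the real code writes to the pgvector document store instead.
document_store = InMemoryDocumentStore()

embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-mpnet-base-v2")
embedder.warm_up()  # the splitter reads embedder.embedding_backend, so warm up first

indexing = Pipeline()
indexing.add_component("splitter", _CustomDocumentSplitter(embedder))
indexing.add_component("embedder", embedder)
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("splitter.documents", "embedder.documents")
indexing.connect("embedder.documents", "writer.documents")

# One Document per paragraph, as produced by the EPUB loader.
paragraph_docs = [Document(content="This is a sample paragraph from the book.")]
indexing.run({"splitter": {"documents": paragraph_docs}})
print(document_store.count_documents())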
Using an Improved Sentence Embedding Model
For the Federalist Papers EPUB we’re using, there are 1482 paragraphs. So if every single one fits within the context length, we should end up with 1482 ‘documents’ (or rather, document fragments). Using the default model (sentence-transformers/all-mpnet-base-v2), here is the result that is reported when we use our custom document splitter:
Processed 1482 documents into 1597 documents
That’s not too bad, but let’s see if we can use a better model that will allow every single paragraph to be stored whole in our document database.
There is a Hugging Face Leaderboard for sentence transformers found here. Most of the really good sentence transformers/embedders are much too large for my laptop, but I do see there are some smaller ones available. I selected Alibaba-NLP/gte-large-en-v1.5, which is quite a ways down the list, but through testing I confirmed it works out of the box in my environment (some others do not!). I like the fact that it has a context window of 8192 tokens. That is huge compared to the 384 we’ve been playing with. (Note also that this model’s embedding vector has 1024 dimensions.)
I first switch my code to use this new model. Something like this will work:
model: GeneratorModel = HuggingFaceAPIModel(password=hf_secret,
                                            model_name="HuggingFaceH4/zephyr-7b-alpha")

rag_processor: HaystackPgvector = HaystackPgvector(table_name="federalist_papers",
                                                   recreate_table=False,
                                                   book_file_path=epub_file_path,
                                                   generator_model=model,
                                                   embedder_model_name="Alibaba-NLP/gte-large-en-v1.5")
A note of caution: you must use the same sentence transformer/embedder model in both the document pipeline and the RAG pipeline or else you’ll get bad results.
Since we coded this to check for the size of the context window and the size of the embedding vector dynamically, our code will automatically adjust to this improved model.
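As an illustration of that dynamic sizing, here’s a hedged sketch of how the context window and embedding dimension might be read from the model at runtime and used to size the pgvector table. The PgvectorDocumentStore parameters and the trust_remote_code flag are my assumptions (based on the Haystack pgvector integration and this model’s card), not necessarily how the actual DocumentProcessor does it:

from sentence_transformers import SentenceTransformer
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

# trust_remote_code is needed for this model's custom architecture (my assumption
# based on the model card); the default all-mpnet-base-v2 model doesn't need it.
st_model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
print(st_model.get_max_seq_length())  # context window: 8192 for this model

# Size the pgvector table to match whatever model we loaded (1024 here, 768 for the default).
# The connection string is read from the PG_CONN_STR environment variable by default.
document_store = PgvectorDocumentStore(
    table_name="federalist_papers",
    embedding_dimension=st_model.get_sentence_embedding_dimension(),
    recreate_table=True,  # rebuild the table when the vector size changes
)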
With this new model, I get the following result:
Processed 1482 documents into 1482 documents
Perfect! Now every single paragraph is its own document fragment! We have successfully embedded our document into fragments that all fit into the model’s context window.
What Is the Optimal Number of Tokens to Embed?
A good question is: what is the optimal number of tokens to embed? In this post we’re sticking with the simple assumption that a paragraph is a semantically meaningful unit and will be close to appropriately sized. We also compared the default sentence embedder, with its 384-token window, to a replacement embedder that allows 8192 tokens. But does it even make sense to embed such a huge chunk of text? A context window of 8192 is huge and would significantly dilute any sort of search via cosine similarity.
It turns out a recent paper (Wang et al., 2024) empirically studied this question and found that between 256 and 512 tokens is generally the right size to avoid diluting your search. Surprise! The default sentence transformer’s 384 tokens sit right in the middle of that range. I doubt that is a coincidence; the default sentence transformer was undoubtedly picked because it works well in real life. By comparison, our improved model’s context window of 8192 is way out of proportion (though we are not utilizing such a large window, since we’re sticking with paragraphs). This is why you should probably not embed full pages of text.