Loading EPUB Files Using Haystack - A Haystack with pgvector Tutorial
- By Bruce Nielson
- ML & AI Specialist
In our last post, we did a Psycopg tutorial using pgvector to place documents and their embeddings directly into PostgreSQL. In a previous post we also did a semantic search (using cosine similarity) against The Federalist Papers from Project Gutenberg. Let's take everything we've learned up to this point and create a Haystack demo that recreates the Federalist Papers semantic search, only this time using our AI stack: Haystack and PostgreSQL (with the pgvector extension) as the datastore.
I will be assuming in this tutorial that you have the necessary environment already set up. See this post here for details.
You can find my code here if you want a full copy of the code discussed in this post.
When this tutorial is completed, you'll have a data store in Postgres of all the Federalist Papers that you can run a semantic search against. Unlike our previous demo, this version is faster thanks to pgvector's HNSW indexing. The only thing you'll have left to do for Retrieval Augmented Generation (RAG) is to create the pipeline that takes the results and sends them to a Large Language Model (LLM) of your choice. (I'll show you how to do this in an upcoming post.)
But First, A Few Problems…
Consistent with my other Haystack demos (as discussed in this post) there are a number of problems with Haystack that we’ll need to work around. But that will give us a chance to learn about how Haystack really works. To deal with this in a way that is easy to follow, I won’t be populating the PostgreSQL datastore using a Haystack pipeline and will instead directly call each component so that I can finesse it as needed. (In a future post I’ll show you how to do the same thing using a custom pipeline.)
Loading an EPUB File Using Haystack
I really wanted this demo to be almost identical to our previous one that we built from scratch. That meant I wanted to use an EPUB file as the basis for loading our document store. But there is no native support in Haystack for doing this! (Come on guys! Let's get that implemented!) But no worries, I'll show you a quick and easy way (stealing my code from the previous post) to handle importing the EPUB file format into Haystack.
Why EPUBs and not PDFs? I mean, Haystack has built-in support for reading PDFs, right?
Well, the truth is that I don’t much trust PDFs. They are meant for humans to visually look at, not for a machine to read. I’ve found that many PDFs don’t parse well into Haystack. (Despite that fact, I’ll do a future tutorial on how to do this since PDFs are so common.)
A really nice feature of EPUBs is that they are broken into sections and paragraphs. That means I can use the built-in formatting to help break the text into good semantic chunks for Haystack to use with an LLM when doing RAG.
Plus, just go Google "Haystack and EPUB files" and you'll find a dearth of good information on the internet about how to handle EPUB files, so I thought it might be valuable to create a simple solution for that.
A Few Imports
First, let’s start with a few imports for Haystack (Note that this is Haystack 2.0, not 1.5):
from haystack import Pipeline, Document
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
We’ll need a few extra components imports to handle the EPUB files correctly:
from haystack.components.converters import HTMLToDocument
from haystack.dataclasses import ByteStream
from bs4 import BeautifulSoup
from ebooklib import epub, ITEM_DOCUMENT
Yes, that's right: we're going to use BeautifulSoup as a way to 'cheat' and get Haystack to load an EPUB file by making it look like an HTML file, then parsing it into paragraphs that we can consume as 'documents'.
You may need to install some of these components into your Python environment. Here is a list of documentation for each:
- Ebooklib is a Python package for managing EPUB files.
  - PyPI documentation: https://pypi.org/project/EbookLib/
  - Official documentation: https://docs.sourcefabric.org/projects/ebooklib/en/latest/
- BeautifulSoup is a popular Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML.
  - PyPI documentation: https://pypi.org/project/beautifulsoup4/
  - Official documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
Loading an EPUB File
Let’s write a function to load an EPUB file:
def load_epub(epub_file_path):
    docs = []
    book = epub.read_epub(epub_file_path)
    # Find all paragraphs across sections
    for section in book.get_items_of_type(ITEM_DOCUMENT):
        section_html = section.get_body_content().decode('utf-8')
        section_soup = BeautifulSoup(section_html, 'html.parser')
        paragraphs = section_soup.find_all('p')
        byte_stream: ByteStream
        for p in paragraphs:
            p_str = str(p)
            p_html = f"<html><head><title>Converted Epub</title></head><body>{p_str}</body></html>"
            # https://docs.haystack.deepset.ai/docs/data-classes#bytestream
            byte_stream = ByteStream(p_html.encode('utf-8'))
            docs.append(byte_stream)
    return docs
This one takes a bit of explanation, so let's go over it step by step. First, the function takes a path to an EPUB file. Be sure you've already downloaded the Federalist Papers EPUB from Project Gutenberg, or use your own file (so long as it is an EPUB) if you wish. The function takes the path to the file and loads it using the EPUB reader:
docs = []
book = epub.read_epub(epub_file_path)
Next, we’re going to loop over the sections of the file:
# Find all paragraphs across sections
for section in book.get_items_of_type(ITEM_DOCUMENT):
For each section we’ll first load the section:
section_html = section.get_body_content().decode('utf-8')
Now use BeautifulSoup to parse that HTML:
section_soup = BeautifulSoup(section_html, 'html.parser')
Grab the paragraphs out of the result:
paragraphs = section_soup.find_all('p')
You now have an object that has all the paragraph tags BeautifulSoup created out of the EPUB file! So you’ve literally already broken your text up into paragraphs!
Now comes the tricky part that took me forever to figure out but is actually pretty easy once you know the trick. The trick is this: we're going to use Haystack's HTMLToDocument component to import the document into Haystack. So, to repeat: we used an EPUB reader to read in the EPUB file, then we used BeautifulSoup to make it look (more or less) like HTML, and then we used Haystack's HTMLToDocument component to import the result. Pretty tricky, right?
One gotcha that I struggled with at first: the HTMLToDocument component really wants a list of HTML URLs to be passed to it. But it will apparently also accept a ByteStream. So, we need to take our HTML and convert it to a ByteStream before storing it in the list of documents to convert.
byte_stream: ByteStream
for p in paragraphs:
    p_str = str(p)
    p_html = f"<html><head><title>Converted Epub</title></head><body>{p_str}</body></html>"
    # https://docs.haystack.deepset.ai/docs/data-classes#bytestream
    byte_stream = ByteStream(p_html.encode('utf-8'))
    docs.append(byte_stream)
The ByteStream class is a Haystack class that, strangely, won't accept strings, so we have to convert our strings to bytes first. Wrapping the HTML in a ByteStream is what lets the HTMLToDocument component accept our HTML directly instead of requiring URLs. Once that is all done, we store the 'document' (basically one paragraph) in a list that eventually gets passed back out of the function with return docs.
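As a quick sanity check, here is a minimal sketch of calling the function (the file name is just an example); each item in the returned list is a ByteStream, and its data attribute holds the wrapped HTML bytes:
sources = load_epub("Federalist Papers.epub")
print(f"Found {len(sources)} paragraphs")
print(sources[0].data[:80])  # raw HTML bytes wrapping the first paragraph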
Initializing Your DataStore
Here is the code for initializing our database:
def initialize_and_load_documents(epub_file_path, recreate_table=False):
    document_store = PgvectorDocumentStore(
        table_name="federalist_papers",
        embedding_dimension=768,
        vector_function="cosine_similarity",
        recreate_table=recreate_table,
        search_strategy="hnsw",
        hnsw_recreate_index_if_exists=True
    )
    if document_store.count_documents() == 0:
        # Convert EPUB to text documents
        sources = load_epub(epub_file_path)
        # https://docs.haystack.deepset.ai/docs/htmltodocument
        converter = HTMLToDocument()
        results = converter.run(sources=sources)
        converted_docs = results["documents"]
        # Remove documents with empty content
        converted_docs = [Document(content=doc.content) for doc in converted_docs if doc.content is not None]
        # Remove duplicate documents (identical content produces identical ids)
        converted_docs = list({doc.id: doc for doc in converted_docs}.values())
        # Clean the documents
        # https://docs.haystack.deepset.ai/docs/documentcleaner
        cleaner = DocumentCleaner()
        cleaned_docs = cleaner.run(documents=converted_docs)["documents"]
        # Split the documents
        # https://docs.haystack.deepset.ai/docs/documentsplitter
        splitter = DocumentSplitter(split_by="word",
                                    split_length=400,
                                    split_overlap=0,
                                    split_threshold=100)
        split_docs = splitter.run(documents=cleaned_docs)["documents"]
        docs_with_embeddings = create_embeddings(split_docs)["documents"]
        document_store.write_documents(docs_with_embeddings)
    return document_store
So, there is a bit of a trick here worth explaining. Calling this function will initialize the datastore using this call:
document_store = PgvectorDocumentStore(
    table_name="federalist_papers",
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=recreate_table,
    search_strategy="hnsw",
    hnsw_recreate_index_if_exists=True
)
If you accepted the function's default of recreate_table=False, then this call will first check whether the table already exists from a previous run. If it does, it grabs a connection to that table and passes it back; if it doesn't find it, it creates the table. Note also that we're specifying that we want this table to use cosine similarity with an HNSW index. I always have it recreate the index even if it exists because I've had problems when I don't, though that does seem to slow down the initialization.
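One environment note, since it isn't visible in the code above: as far as I know, the pgvector integration reads its connection string from the PG_CONN_STR environment variable, so something like the following (with placeholder credentials) needs to be set before PgvectorDocumentStore is created:
import os
# Placeholder connection string; substitute your own user, password, host, port, and database
os.environ["PG_CONN_STR"] = "postgresql://postgres:password@localhost:5432/postgres"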
A problem I bumped into: if you already have another similar table in use via Haystack, it tries to name the indexes the same regardless, and you end up with an error because the index name is already in use. It seems like a pretty obvious oversight that Haystack doesn't just increment the name automatically.
Once we've run this, we can check whether the table is empty to see if we need to put data into it or whether that was already done on a previous run:
if document_store.count_documents() == 0:
Converting an EPUB file to Haystack Documents
If the table is empty, then we’ll need to load it. Call our load_epub function to kick things off:
sources = load_epub(epub_file_path)
'sources' now contains the list of ByteStreams that hold our paragraphs. Let's turn those into Haystack Documents (i.e., instances of the Haystack Document class):
# https://docs.haystack.deepset.ai/docs/htmltodocument
converter = HTMLToDocument()
results = converter.run(sources=sources)
converted_docs = results["documents"]
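If you want to see what came out of the converter, a quick peek (just a sketch) shows that each result is a Haystack Document with an id and a content string:
# Inspect the first few converted Documents
for doc in converted_docs[:3]:
    print(doc.id, repr(doc.content)[:80])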
Removing Duplicate and Empty Documents
Now here is where I bumped into a very strange problem that is a bit hard to believe even happens. The Haystack Document class creates an id for each document as a hash of the document's contents. So, if you have two documents with the same text, you'll end up with a duplicate id – which PostgreSQL will reject!
If this happens, you’ll see an error something like this when you try to write out to the datastore:
File "D:\Documents\AI\LLMs\.venv\lib\site-packages\haystack_integrations\document_stores\pgvector\document_store.py", line 445, in write_documents
raise DuplicateDocumentError from ie
haystack.document_stores.errors.errors.DuplicateDocumentError
Or This:
File "D:\Documents\AI\LLMs\.venv\lib\site-packages\haystack_integrations\document_stores\pgvector\document_store.py", line 442, in write_documents
self.cursor.executemany(sql_insert, db_documents, returning=True)
File "D:\Documents\AI\LLMs\.venv\lib\site-packages\psycopg\cursor.py", line 767, in executemany
raise ex.with_traceback(None)
psycopg.errors.UniqueViolation: duplicate key value violates unique constraint "federalist_papers_pkey"
DETAIL: Key (id)=(62f78e9443181a70192969a29e836994a2b55ce7dc6cea724f355ca5192926b6) already exists.
I was a bit surprised Haystack didn’t handle this by default. It isn’t that uncommon for some line in the document (as often happens with the Federalist papers!) to be identical to somewhere else in the document.
And why didn’t they just use a guid or some other similar id that was guaranteed to be unique instead of making it a hash of the document’s content? This virtually guaranteed duplicates! And if they were going to use a hash as their unique id, then why not just automatically remove duplicates?
Haystack's choices here make very little sense to me. Luckily, they did give us a flag in the DocumentWriter component (as we used back in this post), but for this post we're not using that component, so we'll have to deal with the problem ourselves by removing the duplicate documents manually.
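For reference, here is a rough sketch of what that flag looks like if you do use the DocumentWriter component; I believe the parameter is policy with the DuplicatePolicy enum, but verify the exact names against the Haystack docs:
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

# Overwrite (or skip) documents whose ids already exist instead of raising an error
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)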
Along the same lines, I've also bumped into problems where the converted document simply has no content at all (because the EPUB file, when converted to HTML, contained a blank paragraph). That will also result in an error when you try to clean or split the documents, looking something like this:
DocumentCleaner only cleans text documents but document.content for document ID d4675c57fcfe114db0b95f1da46eea3c5d6f5729c17d01fb5251ae19830a3455 is None.
Or This:
ValueError: DocumentSplitter only works with text documents but document.content for document ID d4675c57fcfe114db0b95f1da46eea3c5d6f5729c17d01fb5251ae19830a3455 is None.
So, let’s deal with both of the problems by removing empty and duplicate documents:
# Remove documents with empty content
converted_docs = [Document(content=doc.content) for doc in converted_docs if doc.content is not None]
# Remove duplicate documents (identical content produces identical ids)
converted_docs = list({doc.id: doc for doc in converted_docs}.values())
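If you want to convince yourself that identical text really does produce identical ids, a tiny experiment (a sketch, using the salutation that opens every Federalist paper) does the trick:
from haystack import Document

# Two Documents with the same content (and no metadata) hash to the same id
a = Document(content="To the People of the State of New York:")
b = Document(content="To the People of the State of New York:")
print(a.id == b.id)  # True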
Cleaning and Splitting Documents
Now we're ready to move on and go through the cleaning process to remove problematic characters (if any exist) using the Haystack DocumentCleaner component:
# Clean the documents
# https://docs.haystack.deepset.ai/docs/documentcleaner
cleaner = DocumentCleaner()
cleaned_docs = cleaner.run(documents=converted_docs)["documents"]
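The defaults are all we need here, but the component does take a few flags if you ever need more (or less) aggressive cleaning; a sketch with the parameter names as I recall them from the Haystack docs (worth double-checking):
# Spelled-out equivalent of the defaults
cleaner = DocumentCleaner(remove_empty_lines=True,
                          remove_extra_whitespaces=True,
                          remove_repeated_substrings=False)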
And finally, we can split the documents into chunks small enough that the sentence embedding model doesn't truncate the text. We already did most of the splitting ourselves by breaking the text into paragraphs. (That's one of the advantages of using an EPUB file in the first place!) But we'll still run everything through the document splitter built into Haystack (the DocumentSplitter component) as an added precaution. (By experiment, we did in fact need this to avoid issues with the occasional really long paragraph.)
# Split the documents
# https://docs.haystack.deepset.ai/docs/documentsplitter
splitter = DocumentSplitter(split_by="word",
                            split_length=400,
                            split_overlap=0,
                            split_threshold=100)
split_docs = splitter.run(documents=cleaned_docs)["documents"]
docs_with_embeddings = create_embeddings(split_docs)["documents"]
document_store.write_documents(docs_with_embeddings)
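As an aside (my own addition, not part of the original code), a quick check on the chunk sizes after splitting can confirm the splitter behaved as expected:
# Sanity check: confirm the splitter kept chunks near the 400-word target
lengths = [len(doc.content.split()) for doc in split_docs]
print(f"chunks: {len(lengths)}, longest: {max(lengths)} words")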
Finally, we return the document store. If it already existed, the function simply returns a handle to the existing document store; otherwise, it creates it and returns it:
return document_store
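Putting it together, a call looks like this (a sketch; the file name matches the example used in main below):
# First run creates the table and loads the documents; later runs just reconnect
document_store = initialize_and_load_documents("Federalist Papers.epub", recreate_table=False)
print(document_store.count_documents())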
Creating Embeddings
Let’s write a function to create embeddings:
def create_embeddings(documents):
    document_embedder = SentenceTransformersDocumentEmbedder()
    document_embedder.warm_up()
    documents_with_embeddings = document_embedder.run(documents)
    return documents_with_embeddings
Note our use of the Haystack SentenceTransformersDocumentEmbedder component to do the embedding.
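With no arguments, the embedder falls back to Haystack's default sentence-transformers model. If you would rather pin the model explicitly (and whatever you pick must produce vectors matching the embedding_dimension of 768 we gave the document store), a sketch looks like this; the model name is an example and I believe the parameter is called model in Haystack 2.x:
# Pin the embedding model explicitly; all-mpnet-base-v2 produces 768-dimensional vectors
document_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-mpnet-base-v2"
)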
Query Pipeline
We aren't going to use an actual LLM to do RAG just yet, but let's write a Haystack pipeline to at least do a semantic search query for us. This will also give us some experience with writing Haystack pipelines:
def create_query_pipeline(document_store):
    query_pipeline = Pipeline()
    query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
    # https://docs.haystack.deepset.ai/docs/pgvectorembeddingretriever
    query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store, top_k=5))
    query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
    return query_pipeline
Key things to note are:
- How to set up a Haystack Pipeline using the Pipeline class. If desired, you can even visualize the pipeline (see the sketch after this list).
- How to add a component to the pipeline to embed the query using the SentenceTransformersTextEmbedder component.
- How to retrieve results using the PgvectorEmbeddingRetriever component.
- How to pass the actual embedding from one node to the next to use it for the retrieval from the DocumentStore.
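On the visualization point, Haystack 2.x pipelines have (as far as I know) a draw method that writes a diagram to an image file; a quick sketch, with a file name of my choosing:
# Write a diagram of the pipeline's components and connections to a file
query_pipeline.draw("query_pipeline.png")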
Bring It All Together
Finally let’s write a main function to bring it all together:
def main():
    epub_file_path = "Federalist Papers.epub"
    document_store = initialize_and_load_documents(epub_file_path)
    query_pipeline = create_query_pipeline(document_store)
    query = "What is the role of the judiciary in a democracy?"
    result = query_pipeline.run({"text_embedder": {"text": query}})
    documents = result['retriever']['documents']
    for doc in documents:
        print(doc.content)
        print(f"Score: {doc.score}")
        print("")
This basically takes a path to the EPUB file, initializes the datastore with it, creates the query pipeline, then runs a query and prints out the results.
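To actually run the script, the usual Python entry-point guard (not shown in the listing above) kicks things off:
if __name__ == "__main__":
    main()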
Let's try asking it the same question we asked when we built a semantic search by hand: "Are we a democracy or a republic?"
A republic, by which I mean a government in which the scheme of representation takes place, opens a different prospect, and promises the cure for which we are seeking. Let us examine the points in which it varies from pure democracy, and we shall comprehend both the nature of the cure and the efficacy which it must derive from the Union.
Score: 0.5751381173647583
That’s the same answer we got last time! So, the HNSW index is working great! Plus, we got that response far faster than when we manually did the cosine similarity.
And that’s it! You now have both a working version of Haystack using PostgreSQL as your datastore as well as an example of how to get Haystack to load an EPUB file into the datastore. Even if you don’t use this with an LLM you’ll find this a useful way to do a semantic search for your applications.