PDFs vs HTML: The Importance of Metadata for Retrieval Augmented Generation
- By Bruce Nielson
- ML & AI Specialist
Up to this point, we've been loading EPUB documents instead of PDFs, and I had a good personal reason for that. EPUBs are essentially HTML files, so (as discussed in this post) the key to loading EPUBs, since they aren't built into Haystack, is treating them as HTML. However, in the latest release of my code I've also added support for loading PDFs into the document store database.
The Advantages of HTML
The real advantage of coding for HTML documents (whether EPUBs or actual web pages) is that I can later reuse the same code for web pages. Additionally, HTML header tags carry valuable structural information, which I use to generate richer metadata for each document fragment.
Take, for example, the document I retrieved from my Karl Popper document store (as discussed in the last post):
Document 7:
Score: 0.05
Item #: 17
Page #: 205
Item Id: Ch05
Book Title: Conjectures and Refutations
Paragraph #: 69
Section Name: XII
Chapter Title: 5 Back to the Presocratics
Content: This, I believe, is the true theory of knowledge (which I wish to
submit for your criticism): the true description of a practice which arose in
Ionia and which is incorporated in modern science (though there are many
scientists who still believe in the Baconian myth of induction): the theory that
knowledge proceeds by way of conjectures and refutations.
Notice how this document fragment includes a page number, book title, chapter title, and even a section title. When I'm using my "Karl Popper Archive" (as discussed in the previous post) for research, having metadata like this is incredibly helpful. It also allows answers from a Large Language Model (LLM) to cite their sources. Rich metadata like this is not available in most PDFs, or at least not without complex, custom data processing. But by reading header tags in HTML, I was able to extract this metadata directly from the document as I parsed it.
This is why I prefer EPUBs over PDFs: they provide an easy, rich source of metadata.
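To make the idea concrete, here's a minimal sketch (not my actual parser, which earlier posts cover) of how header tags can be turned into metadata. It assumes the BeautifulSoup library and uses a made-up HTML snippet, but it shows the basic pattern: carry the most recent chapter and section headings along as metadata for each paragraph.

from bs4 import BeautifulSoup

html = """
<h1>5 Back to the Presocratics</h1>
<h2>XII</h2>
<p>This, I believe, is the true theory of knowledge...</p>
"""

soup = BeautifulSoup(html, "html.parser")
current_meta = {}
fragments = []
for tag in soup.find_all(["h1", "h2", "p"]):
    if tag.name == "h1":
        current_meta["chapter_title"] = tag.get_text(strip=True)
    elif tag.name == "h2":
        current_meta["section_name"] = tag.get_text(strip=True)
    else:
        # Each paragraph inherits the most recent headings as metadata
        fragments.append({"content": tag.get_text(strip=True),
                          "meta": dict(current_meta)})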
The Advantages of PDFs
However, EPUBs aren't exactly a common document format. For instance, not every book by Karl Popper is available as an EPUB. As a result, my archive also needs to be able to load PDF files. PDFs are such a widely used format for documents, manuals, and tables of useful data that we need a way to convert PDF files into document fragments we can store in the archive as well.
How to Load PDFs: The Simple Way
Loading PDFs is actually built into Haystack! Initially, I used the built-in Haystack component to load PDFs into the Book Search Archive, and you can see an example of that in this particular commit of my code. However, I later decided I wasn't satisfied with the results and believed I could improve on them, so I replaced the built-in component with a custom Haystack component that (currently) loads PDFs using PyPDF. This gave me more control over how the document fragments and related metadata were extracted from the PDFs. I'll cover my custom approach in a future post. Even this improved version isn't exactly what I want, so I'll likely move away from PyPDF at some point and explore a more fine-grained solution.
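As an aside, if all you need is to turn a PDF into Haystack Documents, the built-in component can also be used standalone, outside of a pipeline. A minimal sketch (the file name here is just a placeholder):

from haystack.components.converters import PyPDFToDocument

pdf_converter = PyPDFToDocument()
# 'sources' takes a list of file paths; the result is a dict with a "documents" list
result = pdf_converter.run(sources=["some_book.pdf"])
pdf_docs = result["documents"]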
For now, though, let's go over how to use the built-in component inside the full conversion pipeline. Below is a revised version of my document conversion pipeline that uses the PyPDFToDocument component built into Haystack:
As you can see, I’ve set up two parallel pipes in my pipeline. The “epub_vs_pdf_splitter” takes a list of file paths and routes the EPUB books to a pipeline for EPUBs (using the “html_converter” we discussed in previous posts) and the PDFs to a pipeline with the PyPDFToDocument converter built into Haystack. Then, the “epub_pdf_merger” combines everything back into a final list of documents, which are then passed through the remaining part of the pipeline for cleaning, splitting into document fragments, etc.
Pretty simple, right?
But how do you actually create the custom components needed to perform this kind of split?
Splitting and Merging
Here is the code that builds this pipeline, starting with the custom splitter component:
# Imports shared by the custom components below
from typing import Dict, List

from haystack import Document, component


@component
class EpubVsPdfSplitter:
    @component.output_types(epub_paths=List[str], pdf_paths=List[str])
    def run(self, file_paths: List[str]) -> Dict[str, List[str]]:
        # Route each file path into the EPUB or PDF bucket by extension
        epub_paths: List[str] = []
        pdf_paths: List[str] = []
        for file_path in file_paths:
            if file_path.lower().endswith('.epub'):
                epub_paths.append(file_path)
            elif file_path.lower().endswith('.pdf'):
                pdf_paths.append(file_path)
            else:
                raise ValueError(f"File type not supported: {file_path}")
        return {"epub_paths": epub_paths, "pdf_paths": pdf_paths}
Not much here. It takes in a list of file paths (as strings) and outputs them split into two lists: one of EPUB paths and one of PDF paths.
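Run on its own (with hypothetical file names), it behaves like this:

splitter = EpubVsPdfSplitter()
result = splitter.run(file_paths=["conjectures.epub", "open_society.pdf"])
# result == {"epub_paths": ["conjectures.epub"],
#            "pdf_paths": ["open_society.pdf"]}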
@component
class EPubPdfMerger:
    @component.output_types(documents=List[Document])
    def run(self, epub_docs: List[Document], pdf_docs: List[Document]) -> Dict[str, List[Document]]:
        # Combine the EPUB and PDF document lists into one list
        documents: List[Document] = []
        for doc in epub_docs:
            documents.append(doc)
        for doc in pdf_docs:
            documents.append(doc)
        return {"documents": documents}
This component takes a list of EPUB Haystack Documents (i.e. the Document class) and a list of PDF Haystack Documents and merges them into a single combined list. Again, very simple.
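Called directly (with throwaway Documents just for illustration), it looks like this:

merger = EPubPdfMerger()
result = merger.run(epub_docs=[Document(content="from an EPUB")],
                    pdf_docs=[Document(content="from a PDF")])
# result["documents"] now holds both Documents in a single list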
Finally, we just need to connect it all together like this:
doc_convert_pipe: Pipeline = Pipeline()
doc_convert_pipe.add_component("epub_vs_pdf_splitter", EpubVsPdfSplitter())
doc_convert_pipe.add_component("pdf_loader", PyPDFToDocument())
doc_convert_pipe.add_component("epub_loader", EPubLoader(verbose=self._verbose))
doc_convert_pipe.add_component("html_parser",
                               HTMLParserComponent(min_paragraph_size=self._min_paragraph_size,
                                                   min_section_size=self._min_section_size,
                                                   verbose=self._verbose))
doc_convert_pipe.add_component("html_converter", HTMLToDocument())
doc_convert_pipe.add_component("epub_pdf_merger", EPubPdfMerger()) …
# EPUB branch: splitter -> EPubLoader -> HTMLParserComponent -> HTMLToDocument
# PDF branch: splitter -> PyPDFToDocument
doc_convert_pipe.connect("epub_vs_pdf_splitter.epub_paths", "epub_loader.file_paths")
doc_convert_pipe.connect("epub_vs_pdf_splitter.pdf_paths", "pdf_loader.sources")
doc_convert_pipe.connect("epub_loader.html_pages", "html_parser.html_pages")
doc_convert_pipe.connect("epub_loader.meta", "html_parser.meta")
doc_convert_pipe.connect("html_parser.sources", "html_converter.sources")
doc_convert_pipe.connect("html_parser.meta", "html_converter.meta")
# Merge the two branches: EPUB docs come from html_converter, PDF docs from pdf_loader
doc_convert_pipe.connect("html_converter.documents", "epub_pdf_merger.epub_docs")
doc_convert_pipe.connect("pdf_loader.documents", "epub_pdf_merger.pdf_docs")
doc_convert_pipe.connect("epub_pdf_merger.documents", "remove_illegal_docs") …
self._doc_convert_pipeline = doc_convert_pipe
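With everything wired up, running the pipeline is just a matter of handing the splitter a list of file paths. A minimal sketch (the file names are placeholders, and the components elided above would run after the merger):

doc_convert_pipe.run(
    {"epub_vs_pdf_splitter": {"file_paths": ["conjectures.epub", "open_society.pdf"]}}
)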
Don't forget to run:
pip install -r requirements.txt
If you need help with environment setup, this blog post will help.
To get Haystack’s PyPDFToDocument component to work you’ll probably need to do:
pip install pypdf
Conclusions
And that's the simplest approach to adding PDF documents to our Book Search Archive. I'll cover my improved, more customized approach in a future post, as well as explore other alternatives, such as Haystack's built-in PDFMinerToDocument component.