AI Tutorial: What is the Best Way to Load PDFs?

In a previous post, we talked about using the built-in Haystack component for loading PDF documents called PyPDFToDocument. I personally found the results a bit underwhelming. Sure, it gets the job done, but it doesn’t parse the document well. For example, you end up with page numbers or page headers baked into your text.

Is there a better approach?

And the answer is… Maybe?

The Woes of PyMuPDF4LLM

I kept hearing about an amazing new open-source PDF reader called PyMuPDF4LLM. Here is the PyPI link for how to install it. There are numerous articles (e.g. here and here) about how it’s a great alternative to LlamaParse, which can be costly.

Unfortunately, I’ve had bad luck with PyMuPDF4LLM. On the first PDF I tried it on, it inserted extra newline characters in front of a word and then started repeating that word every so often:

I came to Prague with the corrected page proofs of my book,
It
_Logik der Forschung._ was published three months later in
…
Tarski and Godel arrived, independently at almost the same
It
time. was first published by Tarski in 1930, whereupon
It
Godel, of course, accepted Tarski's priority. is a theory of

The first “It” is correct, but the PDF had no newline character there. The other two “It”s don’t exist in the PDF I was using at all. I reported this as a bug, and they could not replicate it. I then recreated the bug in Google Colab and sent it to them. I have yet to hear back at the time of this writing. So, for now, PyMuPDF4LLM is a non-starter for me at least. (Update: just before this went out, they managed to replicate the bug! So, they are now working on it.)
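
One rough way to check your own output for this kind of artifact would be to look for single words that keep showing up alone on a line. The sketch below is only an illustration under that assumption; the helper name repeated_single_word_lines and the min_repeats threshold are made up here and are not part of PyMuPDF4LLM.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical helper (not part of PyMuPDF4LLM): flag words that keep
# appearing alone on a line, which is what the artifact above looks like.
_SINGLE_WORD = re.compile(r"^[A-Za-z]+$")


def repeated_single_word_lines(text: str, min_repeats: int = 2) -> Dict[str, int]:
    """Return each word that sits alone on a line at least min_repeats times."""
    counts = Counter(
        line.strip()
        for line in text.splitlines()
        if _SINGLE_WORD.match(line.strip())
    )
    return {word: n for word, n in counts.items() if n >= min_repeats}


# The fragment quoted above, minus the elided portion.
sample = "\n".join([
    "I came to Prague with the corrected page proofs of my book,",
    "It",
    "_Logik der Forschung._ was published three months later in",
    "Tarski and Godel arrived, independently at almost the same",
    "It",
    "time. was first published by Tarski in 1930, whereupon",
    "It",
    "Godel, of course, accepted Tarski's priority. is a theory of",
])
print(repeated_single_word_lines(sample))
```

This will also flag legitimate one-word lines if a document happens to repeat them, so treat a hit as a reason to look at the PDF, not as proof of a bug.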

However, I love the underlying idea of PyMuPDF4LLM and hopefully they’ll fix this in a future version. Other than this ‘extra word’ problem, I’d say the results were pretty good. It removed headers from pages, removed page numbers, and put the whole thing into Markdown. Theoretically I could then go on and parse this Markdown in a number of ways. This will be my preferred choice once it is working. Consider this pipeline I created that uses PyMuPDF4LLM:

A flowchart diagram. Will add further details at a future date.

“PDFToMarkdown” is my custom Haystack component that wraps PyMuPDF4LLM. It takes a PDF document, turns it into Markdown, and then sends the results to the built-in Haystack component called MarkdownToDocument. If PyMuPDF4LLM had really worked for me, I could theoretically even have turned the Markdown it creates into HTML, sent the results to my already existing HTMLParserComponent in the BookSearchArchive, and obtained valuable metadata that way. (I covered this in the previous post.)

Here is the code for my PDFToMarkdown component:

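from typing import Dict, List

import pymupdf4llm
from haystack import component
from haystack.dataclasses import ByteStream


@component
class PDFToMarkdown:
    def __init__(self, min_page_size: int = 1000):
        # Note: min_page_size is stored but not used in run() below
        self._min_page_size = min_page_size

    @component.output_types(sources=List[ByteStream])
    def run(self, sources: List[str]) -> Dict[str, List[ByteStream]]:
        markdown_docs: List[ByteStream] = []
        for source in sources:
            # Convert the whole PDF at this path into one Markdown string
            markdown_doc: str = pymupdf4llm.to_markdown(source)
            # MarkdownToDocument accepts ByteStream sources, so wrap the text
            byte_stream: ByteStream = ByteStream(markdown_doc.encode('utf-8'))
            markdown_docs.append(byte_stream)
        return {"sources": markdown_docs}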

You’ll need to do the following install for it to work:

pip install pymupdf4llm

However, with PyMuPDF4LLM not working well enough for now, I moved on to another approach.

PDFReader To the Rescue?

So, I had an idea. What if I just use pypdf’s PdfReader directly? Recall from the previous post that to use Haystack’s PyPDFToDocument component we needed to:

pip install pypdf

Why not just use that module and write our own PDF reading component?

Here is the end result:

A flowchart diagram. Will add further details at a future date.

I created a custom Haystack component that takes a list of paths to PDF files and converts them to Haystack Document class instances.

Here is the code for the component:

from pathlib import Path
from typing import Any, Dict, List

from haystack import Document, component
from pypdf import PdfReader


@component
class PDFReader:
    def __init__(self, min_page_size: int = 1000):
        # Pages with fewer characters than this are skipped
        self._min_page_size = min_page_size

    @component.output_types(documents=List[Document])
    def run(self, sources: List[str]) -> Dict[str, List[Document]]:
        documents: List[Document] = []
        for source in sources:
            pdf_reader = PdfReader(source)
            for page_num, page in enumerate(pdf_reader.pages):
                page_text = page.extract_text()
                if len(page_text) < self._min_page_size:
                    continue
                # Build a fresh metadata dict for each page
                meta_properties: List[str] = ["author", "title", "subject"]
                meta: Dict[str, Any] = PDFReader._create_meta_data(pdf_reader.metadata, meta_properties)
                meta["page_#"] = page_num + 1  # enumerate() is zero-based
                if not meta.get("title"):
                    # Use file name for title if none found in metadata
                    source_title: str = Path(source).stem
                    meta["title"] = source_title

                documents.append(Document(content=page_text, meta=meta))

        return {"documents": documents}

This component takes a list of strings, which are paths to PDF files. It iterates over them and reads each one using pypdf’s PdfReader object. I chose to enumerate over each page. This allows me to embed one page at a time (as opposed to one paragraph at a time, as I do for EPUBs). It also lets me grab page numbers as metadata. Pages with fewer than min_page_size characters are skipped. In addition, I attempt to grab as much metadata as I can out of the PDF file itself: I check for author, title, and subject. I have a utility function called _create_meta_data that attempts to build metadata from the pdf_reader.metadata property.

# Defined inside the PDFReader class; needs: from pypdf import DocumentInformation
@staticmethod
def _create_meta_data(pdf_meta_data: DocumentInformation, meta_data_titles: List[str]) -> Dict[str, str]:
    meta_data: Dict[str, str] = {}
    for title in meta_data_titles:
        # getattr with a default also covers a missing attribute or missing metadata
        value = getattr(pdf_meta_data, title, None)
        if value:  # skip None and empty strings
            meta_data[title] = value
    return meta_data
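
To make the metadata handling concrete without needing a real PDF, here is a small standalone sketch. It assumes a stand-in object in place of pdf_reader.metadata (a SimpleNamespace with made-up values) and a made-up file path; create_meta_data mirrors the utility function above but is written as a plain function for illustration.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Dict, List


def create_meta_data(pdf_meta_data: Any, meta_data_titles: List[str]) -> Dict[str, str]:
    """Copy only the non-empty attributes into a plain dict."""
    meta_data: Dict[str, str] = {}
    for title in meta_data_titles:
        value = getattr(pdf_meta_data, title, None)
        if value:
            meta_data[title] = value
    return meta_data


# Stand-in for pdf_reader.metadata (hypothetical values):
# author is set, title is empty, subject is missing entirely.
fake_pdf_meta = SimpleNamespace(author="Karl Popper", title="")
source = "books/A World of Propensities by Karl Popper (1997).pdf"  # made-up path

meta: Dict[str, Any] = create_meta_data(fake_pdf_meta, ["author", "title", "subject"])
meta["page_#"] = 6
if not meta.get("title"):
    # Same fallback as in the component: use the file name as the title
    meta["title"] = Path(source).stem

print(meta)
```

The empty title and the missing subject are both dropped, so the file-name fallback kicks in and the dict only ever contains plain, non-empty values.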

I would note that you can’t just pass the pdf_reader.metadata property directly as metadata for Haystack as the format is wrong. So, this utility function puts it into the right format. The end result isn’t bad. Probably inferior to what PyMuPDF4LLM could do – if it worked correctly. But I end up with results that look like this:

Document 1:
Score: 0.016666668
Title: A World of Propensities by Karl Popper (1997)
Page #: 6
Content: Ladies and Gentlemen, I shall begin with some personal memories and a
personal confession of faith, and only then turn to the topic of my lecture. …

Note that we have a title for the book and a page number as metadata!

One unfortunate side effect is that the results may contain page numbers or page header info right inside the text. Like this:

Document 7:
Score: 0.2
Title: A World of Propensities by Karl Popper (1997)
Page #: 7
Page Number: 1
Content: 4 A World of Propensities all fashions. And it allows us to speak of
falsity and its elimination; of our fallibility; and of the fact that we can…

You can see here that the content of the document fragment starts with a page number and the name of the book (“4 A World of Propensities”) because that text sits in the page header. This is not the most helpful.
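
One possible workaround, which I have not built into the component, is to strip a known running header off the front of each page’s text. The sketch below is a heuristic under the assumption that you know the header title in advance and that it appears next to a page number at the very start of the page; strip_running_header and its parameters are hypothetical names.

```python
import re


def strip_running_header(page_text: str, header_title: str) -> str:
    """Remove a leading '<page number> <title>' or '<title> <page number>'."""
    title = re.escape(header_title)
    pattern = re.compile(rf"^\s*(?:\d+\s+{title}|{title}\s+\d+)\s*")
    # Only touch the very start of the page; leave the text alone if no match
    return pattern.sub("", page_text, count=1)


page = "4 A World of Propensities all fashions. And it allows us to speak of"
cleaned = strip_running_header(page, "A World of Propensities")
print(cleaned)
```

Note that the header title (“A World of Propensities”) is not the same as the title in the metadata, which came from the file name, so it has to be supplied separately.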

Use PyMuPDF Directly?

One possibility is to not use pymupdf4llm but instead to just use pymupdf. (See the PyPI page for pymupdf here. And you can find the GitHub repo for pymupdf here.)

PyMuPDF4LLM is built on top of PyMuPDF, so if you installed PyMuPDF4LLM you also installed PyMuPDF. If PyMuPDF4LLM is giving you fits (as it is for me), why not just use PyMuPDF directly?

Here is the result:

A flowchart diagram. Will add further details at a future date.

And here is the code:

from typing import Dict, List

import pymupdf
from haystack import Document, component


@component
class PyMuPDFReader:
    def __init__(self, min_page_size: int = 1000):
        # Pages with fewer characters than this are skipped
        self._min_page_size = min_page_size

    @component.output_types(documents=List[Document])
    def run(self, sources: List[str]) -> Dict[str, List[Document]]:
        documents: List[Document] = []
        for source in sources:
            doc = pymupdf.open(source)
            for page_num in range(len(doc)):
                page = doc.load_page(page_num)
                # "text" returns the page as plain text, headers included
                page_text = page.get_text("text")
                if len(page_text) < self._min_page_size:
                    continue
                documents.append(Document(content=page_text))
        return {"documents": documents}

This isn’t very different from PDFReader (above), and in fact the results are identical. You still get page numbers and page headers in the content. So, this approach didn’t improve much.
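
One direction I may explore is that PyMuPDF can return positioned text blocks via page.get_text("blocks"), where each block is a tuple of (x0, y0, x1, y1, text, block_no, block_type). The sketch below is untested against real PDFs and rests on the assumption that headers and footers live in the top and bottom 8% of the page; body_text, header_ratio, and footer_ratio are hypothetical names, and the sample blocks are made up.

```python
from typing import List, Tuple

# Shape of one item from PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def body_text(blocks: List[Block], page_height: float,
              header_ratio: float = 0.08, footer_ratio: float = 0.08) -> str:
    """Join text blocks that lie entirely between the header and footer bands."""
    top = page_height * header_ratio
    bottom = page_height * (1.0 - footer_ratio)
    kept = [
        text
        for (_x0, y0, _x1, y1, text, _block_no, block_type) in blocks
        if block_type == 0 and y0 >= top and y1 <= bottom  # 0 means a text block
    ]
    return "".join(kept)


# Made-up blocks for an 800-point-tall page: header, body, footer.
sample_blocks: List[Block] = [
    (50, 20, 400, 35, "4 A World of Propensities\n", 0, 0),
    (50, 100, 400, 300, "all fashions. And it allows us to speak of\n", 1, 0),
    (50, 770, 400, 785, "4\n", 2, 0),
]
result = body_text(sample_blocks, page_height=800)
print(result)
```

Inside PyMuPDFReader.run, this would mean calling body_text(page.get_text("blocks"), page.rect.height) in place of page.get_text("text"). The right ratios will vary from book to book, so this is a starting point to tune, not a fix.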

Conclusions

So, really, none of these approaches truly satisfies me. I really wish PyMuPDF4LLM worked correctly, as that was the best option. I plan to play around with PyMuPDF more and see if I can improve how well it works.

Note: I checked in code that allows you to pick any of these different approaches in this code here. There is now a parameter that lets you choose which approach you want.

If you have trouble with your environment, don’t forget about this environment setup post and don’t forget to run:

pip install -r requirements.txt
