Docling for PDF to Markdown Conversion

In an earlier post, I went over several ways to convert a PDF into document fragments for our Book Search Archive (the Mindfire toy app for our open-source AI stack). None of them worked great. One reader of these posts suggested I try out Docling. (GitHub repo for Docling. Documentation for Docling.)

Docling is IBM’s open-source library for reading popular document formats, including PDF, and exporting them to Markdown. It is similar to PyMuPDF4LLM in that it attempts to remove headers, footers, and other extraneous information from a PDF so that you are only creating embeddings for the ‘good parts’ of the document.

If you’ll recall, on my first attempt to use PyMuPDF4LLM it didn’t work well for me and kept repeating words. Will Docling work better? Unfortunately, the answer turned out to be – nope! In fact, Docling was considerably worse than PyMuPDF4LLM in terms of results. Most of the book I tried just disappeared into a giant list of repeating words.

This did make me wonder if maybe the problem was this particular PDF. Perhaps, by dumb luck, I happened to grab a PDF that was really awful. In a future post, I will try other PDFs using both PyMuPDF4LLM and Docling and report back the results.

But for this post, let’s just go over how to install Docling plus the code I added to the Book Search Archive to make it a new PDF option. You can find my code for the Book Search Archive at the time of this post here.

Installing Docling

Installing Docling is very easy. Just run this command:

pip install docling

One word of warning here: Docling needs to install a large number of libraries to work. Far more than I recall getting installed for PyMuPDF4LLM. So, I wouldn’t recommend doing what I did and installing both Docling and PyMuPDF4LLM. Pick one that works best for you and go with it.
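
If you aren’t sure which of the two is already in your environment, a quick check like the one below can tell you before you install anything else. This is only a sketch: is_installed is a helper name made up for this example, and it assumes the import names are docling and pymupdf4llm.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # Report on the two PDF-to-Markdown libraries discussed in this post
    for name in ("docling", "pymupdf4llm"):
        status = "installed" if is_installed(name) else "not installed"
        print(f"{name}: {status}")
```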

Integrating Docling into the Book Search Archive

That being said, my intentions for the Book Search Archive are to test out various open-source software libraries, so it made sense for me to try out both. Further, I like the idea of having both available in case one works better on some PDFs and one on other PDFs. If in testing I find this to be the case I’ll write a way to specify which PDFs use which library.

So here is how I integrated Docling into my code. First, I wrote a new custom component that utilizes Docling to convert PDFs to Markdown (custom_haystack_components.py):

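from typing import Dict, List

from docling.document_converter import DocumentConverter
from haystack import component
from haystack.dataclasses import ByteStream


@component
class DoclingToMarkdown:
    def __init__(self, min_page_size: int = 1000):
        self._min_page_size = min_page_size  # stored but not used by run() yet
        # Build the converter once so it is reused for every source
        self._converter = DocumentConverter()

    @component.output_types(sources=List[ByteStream])
    def run(self, sources: List[str]) -> Dict[str, List[ByteStream]]:
        markdown_docs: List[ByteStream] = []
        for source in sources:
            # Convert the file with Docling and export the result as Markdown text
            markdown_doc: str = self._converter.convert(source).document.export_to_markdown()
            # Wrap the text in a ByteStream so MarkdownToDocument can consume it
            byte_stream: ByteStream = ByteStream(markdown_doc.encode('utf-8'))
            markdown_docs.append(byte_stream)
        return {"sources": markdown_docs}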

We need to instantiate a Docling DocumentConverter in the __init__ method, save it to an instance variable, and then use it to create a Markdown document:

markdown_doc: str = self._converter.convert(source).document.export_to_markdown()
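
If you want to try that one line outside of Haystack first, something like the following should do it. Treat it as a sketch rather than code from the Book Search Archive: pdf_to_markdown is a made-up helper name, and the optional converter argument is an assumption added here so a single DocumentConverter can be reused across calls (or swapped for a stand-in when Docling isn’t installed).

```python
from typing import Any, Optional


def pdf_to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert one document to Markdown text using a Docling-style converter."""
    if converter is None:
        # Imported lazily so this module still loads where Docling is absent
        from docling.document_converter import DocumentConverter
        converter = DocumentConverter()
    # Same call chain as in the component: convert, then export as Markdown
    return converter.convert(source).document.export_to_markdown()
```

Calling pdf_to_markdown("book.pdf"), with book.pdf standing in for your own file, returns the Markdown as a string you can inspect before wiring anything into a pipeline. Creating a new converter on every call is wasteful, which is why the component above builds it once in __init__.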

Now we need to use this component to build our document conversion pipeline (document_processor.py). First let’s add a new PDF conversion strategy to our enum:

class PDFReadingStrategy(Enum):
    PyPDFToDocument = 1
    PDFReader = 2
    PyMuPdf4LLM = 3
    PyMuPDFReader = 4
    Docling = 5

Then we need to add some code to the _doc_converter_pipeline method. First add two new components, conditional on the Docling strategy:

elif pdf_reading_strategy == PDFReadingStrategy.Docling:
    doc_convert_pipe.add_component("pdf_loader", DoclingToMarkdown())
    doc_convert_pipe.add_component("markdown_converter", MarkdownToDocument())

And then connect the components:

elif pdf_reading_strategy == PDFReadingStrategy.Docling:
    doc_convert_pipe.connect("pdf_loader.sources", "markdown_converter.sources")
    doc_convert_pipe.connect("markdown_converter.documents", "epub_pdf_merger.pdf_docs")

We connect the pdf_loader (i.e. the Docling custom component) to a Haystack built-in MarkdownToDocument component. The rest of the pipeline can stay the same. Here is the updated pipeline:

Image 1: The updated pipeline. (Detailed description to be added at a later date.)

Note that I integrated Docling directly rather than using the built-in Haystack Docling integration via their DoclingConverter component. (See the GitHub repo here.) I’ll try that out in some future post. And that’s it! We now have Docling added to our open-source stack! Now we can try out comparisons between Docling and PyMuPDF4LLM or even use whichever one works best for a particular document.
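
As for choosing a library per document, one simple way to do it might be a lookup table keyed by file name. This is only a sketch of the idea, not code from the Book Search Archive: STRATEGY_OVERRIDES, strategy_for, and scanned_book.pdf are all hypothetical names, and the enum is trimmed to the two members that matter here.

```python
from enum import Enum
from pathlib import Path
from typing import Dict


class PDFReadingStrategy(Enum):
    # Only the two members needed for this sketch; values match the enum above
    PyMuPdf4LLM = 3
    Docling = 5


# Hypothetical per-file overrides, keyed by file name
STRATEGY_OVERRIDES: Dict[str, PDFReadingStrategy] = {
    "scanned_book.pdf": PDFReadingStrategy.Docling,
}


def strategy_for(
    pdf_path: str,
    default: PDFReadingStrategy = PDFReadingStrategy.PyMuPdf4LLM,
) -> PDFReadingStrategy:
    """Pick the reading strategy for one PDF, falling back to the default."""
    return STRATEGY_OVERRIDES.get(Path(pdf_path).name, default)
```

The result could then be passed in as pdf_reading_strategy when building the conversion pipeline for that file.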
