
IBM’s Docling for Superior Text Loading from PDFs
- By Bruce Nielson
- ML & AI Specialist
Now that we have NLTK installed and some code written to remove hyphens and dashes from PDF text, it’s time to put this to work for a first-class PDF loader. In this post, I’ll go over the basics of using Docling to get an improved text load from a PDF.
I previously covered how to use IBM’s Docling to turn a PDF into Markdown (see this post). At the time, my plan was to convert a PDF to Markdown with Docling, translate that to HTML, and then run it through my existing HTML parser (as covered in a past post). But when I actually tried it, the results were less than impressive.
As discussed in this post, PDFs are much more challenging than EPUBs because EPUBs are essentially HTML. They contain convenient tags for titles and section headers, the text is always unhyphenated, and it's easy to determine when a paragraph ends—even if it continues onto the next page.
PDFs, on the other hand, lack these advantages because they are designed primarily for visual layout, not for extracting readable text and storing it in an AI document database.
The challenge is that PDFs are far more common than EPUBs, so we need a reliable way to convert them into high-quality text for our document store.
IBM’s Docling to the Rescue
I’ve struggled to find an open-source PDF parser that I really like. In this post, I tested several popular open-source tools for extracting text from PDFs and found them all lacking. Then I discovered Docling—and at first, I didn’t like it either.
But after digging deeper, I was pretty impressed. Docling uses machine learning to tag text with various labels, allowing it to distinguish between the main body text and elements like section headers, page headers, or footers. Since it relies on machine learning, it’s not perfect—more on that later.
Still, this is the closest I’ve seen to treating a PDF like an EPUB, so I’ve decided to build a Docling parsing class around this capability and refine the results to my satisfaction.
The one thing Docling didn’t do for me was remove those pesky hyphens and dashes that break words at the end of a line in a PDF. This made the extracted text look terrible.
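As a refresher, the fix from that earlier post boils down to a dictionary check: only join a hyphenated line break if the joined word is a real word. Here is a minimal sketch of the idea using NLTK’s word list; the helper name join_hyphenated and the exact heuristic are illustrative, not the actual code from that post:
import nltk
from nltk.corpus import words

nltk.download("words", quiet=True)
english = {w.lower() for w in words.words()}

def join_hyphenated(first: str, second: str) -> str:
    # "exam-" + "ple" -> "example" if the joined form is a real word;
    # otherwise keep the hyphen (e.g., "well-" + "known" -> "well-known").
    candidate = first.rstrip("-") + second
    if candidate.lower() in english:
        return candidate
    return first + second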
Docling also struggled with identifying paragraph breaks, performing only so-so. And when it came to detecting whether a paragraph continued onto the next page, especially with a page header or footnotes in between, it failed outright.
Still, Docling is a huge step up compared to most PDF readers, so it will now serve as the foundation for loading PDFs into the Book Search Archive—Mindfire’s toy app for testing our growing open-source stack (and the basis for most of these blog posts). You can find the code as it was at the time of this blog post inside the Book Search Archive repository.
The key to using Docling effectively is not to rely on its PDF-to-Markdown conversion (as most beginner tutorials suggest). Instead, it’s better to iterate over the text using Docling’s built-in objects and leverage its labels to determine how to handle the extracted content.
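Here’s a minimal sketch of what that looks like (the file name is just a placeholder):
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("my_book.pdf")  # placeholder path
doc = result.document  # a DoclingDocument

# Walk the labeled text items instead of exporting to Markdown.
for item in doc.texts:
    print(item.label, repr(item.text[:60]))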
The Power of Docling: A Code Example
Let’s look first at some of the code in the custom_haystack_components.py file, specifically the new component called DoclingParserComponent. This component is pretty simple: it takes a list of PDFs (already parsed into DoclingDocument objects), converts them into a list of documents (as ByteStreams) plus a list of metadata, and then passes those lists on to the next step in the Haystack pipeline:
@component.output_types(sources=List[ByteStream], meta=List[Dict[str, str]])
def run(self, sources: List[DoclingDocument], meta: List[Dict[str, str]]) -> Dict[str, Any]:
    docs_list: List[ByteStream] = []
    meta_list: List[Dict[str, str]] = []
    for i, doc in enumerate(sources):
        meta_data: Dict[str, str] = meta[i]
        parser: DoclingParser
        start_page: Optional[int] = None
        end_page: Optional[int] = None
        if doc.name in self._valid_pages:
            start_page, end_page = self._valid_pages[doc.name]
        parser = DoclingParser(doc, meta_data,
                               min_paragraph_size=self._min_paragraph_size,
                               start_page=start_page,
                               end_page=end_page,
                               double_notes=True)
        temp_docs: List[ByteStream]
        temp_meta: List[Dict[str, str]]
        temp_docs, temp_meta = parser.run()
        book_title: str = meta_data.get("book_title", "")
        # Unlike EPUB we don't have sections or chapters, so we don't need a total length.
        # TODO: Add a way to skip pages instead.
        self._print_verbose(f"Book: {book_title};")
        docs_list.extend(temp_docs)
        meta_list.extend(temp_meta)
    return {"sources": docs_list, "meta": meta_list}
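For context, this is roughly how such a component slots into a Haystack 2.x pipeline. The wiring below is a hypothetical sketch, not the actual Book Search Archive pipeline; the component names, constructor arguments, and downstream connections are assumptions:
from haystack import Pipeline

pipe = Pipeline()
pipe.add_component("docling_parser", DoclingParserComponent())
# Downstream, the ByteStream sources and their metadata would feed whatever
# converter/cleaner/splitter chain the pipeline uses, for example:
# pipe.connect("docling_parser.sources", "next_step.sources")
# pipe.connect("docling_parser.meta", "next_step.meta")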
All the real work is done by the parser:
temp_docs, temp_meta = parser.run()
So, let’s dig into the DoclingParser class as found in the docling_parser.py file. I’ll explain this file fully in future posts, but for now let’s keep it simple: you create a parser for a specific PDF and then call run() to get the results:
parser = DoclingParser(doc, meta_data,
                       min_paragraph_size=self._min_paragraph_size,
                       start_page=start_page,
                       end_page=end_page,
                       double_notes=True)
temp_docs, temp_meta = parser.run()
This may not have been the best approach for me. I started down one path, thinking this would be a Haystack component on its own, but then changed my mind halfway through. I’ll clean it up in the future, but for now, it works just fine.
With the DoclingParser, you can specify various parameters, like whether to enable ‘double_notes’ (i.e., doubling the minimum paragraph size for footnotes so they don’t dominate the semantic search we’ll do later). You can also pass starting and ending pages, which I’ll specify via a CSV file, just like we did for sections in our EPUB parser. This setup allows you to exclude elements like the Introduction, Table of Contents, and Index, none of which are helpful for semantic search over the document. But all the real work is in the DoclingParser class itself.
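Something like this sketch could load those page ranges from a CSV into the valid-pages dictionary the component checks; the column layout here is my illustration, not necessarily the exact file format the Book Search Archive uses:
import csv
from typing import Dict, Optional, Tuple

def load_valid_pages(csv_path: str) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    # Each row: document name, first page to include, last page to include.
    valid_pages: Dict[str, Tuple[Optional[int], Optional[int]]] = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for name, start, end in csv.reader(f):
            valid_pages[name] = (int(start), int(end))
    return valid_pages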
The DoclingParser Class: Using Docling Labels to Improve Text
The key work loop is as follows:
- Convert the DoclingDocument passed in the constructor to a list of text items. (See also the full documentation here.)
- Enumerate over that list and process it.
- Collect metadata (e.g., the page number in this example) to save off.
texts = self._get_processed_texts()
for i, text in enumerate(texts):
    next_text = get_next_text(texts, i)
    page_no = get_current_page(text, combined_paragraph, page_no)
But what does _get_processed_texts do? Honestly, not much:
def _get_processed_texts(self) -> List:
    # Before we begin, we need to find all footnotes and move them to the end of the texts list.
    # This is because footnotes are often interspersed with the text, and we want to process them all at once.
    # Split texts into regular content and notes (footnotes + bottom notes).
    regular = [t for t in self._doc.texts if not (is_footnote(t) or is_bottom_note(t))]
    notes = [t for t in self._doc.texts if is_footnote(t) or is_bottom_note(t)]
    return regular + notes
All I’m doing here is taking the ‘texts’ attribute of the DoclingDocument passed in the constructor (stored in self._doc) and creating two lists. One (‘regular’) holds all texts in the DoclingDocument that aren’t footnotes or bottom notes; the other (‘notes’) holds just the footnotes and bottom notes. I then move the footnotes and bottom notes to the end of the list of text items.
One thing to note here (though it isn’t obvious from the annotations; I need to fix that) is that ‘texts’ is a list whose elements can be any of several classes, for example:
Union[SectionHeaderItem, ListItem, TextItem]
A TextItem is just regular text in the body of the document. A ListItem is text in a bullet point style list. A SectionHeaderItem is a section header. All of these are kinds of DocItems. There are others, but these are the ones I’m currently playing with. In addition, all these items have a ‘label’ on them that contains a text label like one of these:
- "section_header"
- "page_footer"
- "page_header"
- "footnote"
- "list_item"
- "formula"
- Etc.
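To see which labels actually show up in your own PDF, a quick tally helps. A minimal sketch, assuming doc is the DoclingDocument from the converter; the labels are string-backed enums, so they compare equal to plain strings like "section_header":
from collections import Counter

# Tally the labels Docling assigned to each text item.
label_counts = Counter(item.label for item in doc.texts)
for label, count in label_counts.most_common():
    print(label, count)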
Next, I wrote a number of helper functions to determine, based on the label attribute, which type of text an item is:
def is_section_header(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    if text is None:
        return False
    return text.label == "section_header"

def is_page_footer(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return text.label == "page_footer"

def is_page_header(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return text.label == "page_header"

def is_footnote(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return text.label == "footnote"

def is_page_not_text(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return text.label not in ["text", "list_item", "formula"]

def is_page_text(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return not is_page_not_text(text)

def is_text_item(item: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return not (is_section_header(item)
                or is_page_footer(item)
                or is_page_header(item))
You should get the idea of where I’m going with this. It’s now very similar to our HTML parser from the EPUB posts. I’ll explain how I fully utilize these features in a future post. For now, the key takeaway is that Docling provides labels that help you determine what type of text you’re working with. I can even detect things like tables. This is where the real power of Docling lies.
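To illustrate, here’s a simplified sketch of the kind of dispatch loop these helpers enable; it shows the shape of the logic, not the exact code from docling_parser.py:
current_section = ""
paragraphs = []

for item in texts:
    if is_page_header(item) or is_page_footer(item):
        continue  # skip page furniture like running heads and page numbers
    if is_section_header(item):
        current_section = item.text  # capture as metadata for subsequent text
    elif is_page_text(item):
        paragraphs.append(item.text)  # capture as body text for the document store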
One problem, though: it isn’t perfect. Page headers can end up as section headers. Paragraphs may break in the middle or across pages, sometimes even with footnotes in between. Occasionally, a paragraph might be inserted right in the middle of another one. Or it may think footnotes are regular text.
Because of this, I had to (for my PDFs) create some custom code to find, say, bottom notes that Docling thought were regular TextItems:
def is_bottom_note(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    if text is None or not is_page_text(text):
        return False
    # Check for · at the beginning of the line. This is often how OCR represents a footnote number.
    if text.text.startswith("·") and not text.text.startswith("· "):
        return True
    return bool(re.match(r"^\d+\S.*", text.text))
Here, you can see that I’m taking whatever text item is passed and—admittedly somewhat simplistically—I’m assuming that if it starts with a number or a ‘dot’ (because Docling sometimes OCRs footnote subscripts and interprets them as dots), then it’s likely a footnote. This is obviously very specific to the PDFs I’m working with and isn’t a general rule, but you get the idea.
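For a concrete sense of what that regex catches, and where it can overreach, consider a few examples (regex backtracking means any line starting with two or more digits will match):
import re

def matches_note_number(s: str) -> bool:
    return bool(re.match(r"^\d+\S.*", s))

print(matches_note_number("1This note number fused with its text."))  # True
print(matches_note_number("1984 was a pivotal year."))  # True: "198" then "4" also satisfies \d+\S
print(matches_note_number("1 A space after the digit."))  # False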
Conclusions
This should give you the basic idea of the power—and limitations—of using Docling to parse your PDFs. We still have quite a few issues to address, and I’ll tackle those in the next post. But for now, we have a solid way to iterate over the text in the PDF and determine if it's text we want to skip (like page headers or footers with page numbers), if it’s text we want to capture into our document store (such as TextItem or ListItem), or if it’s text we want to capture as metadata (like SectionHeaderItem). We’re on our way to replicating our success with EPUBs, but this time with PDFs.
Also, don't forget that Mindfire TECH is here not just to provide these free articles but also to offer our services to help your business apply these AI principles and concepts. If you would like to learn more about how we can help you get your TECH moving with AI, please do reach out for a free consultation or discussion via our contact-us page!