Finding Paragraphs in PDFs - Using IBM’s Docling

In a past post, I argued that chunking text by paragraphs in a book is likely a good idea because it means a human has already grouped the text in a way that is topically relevant. After all, a paragraph is a set of sentences that are topically grouped.

In fact, I’ve found that most human-created paragraphs in a book fit neatly into an ideal-sized context window for a text embedder. (This paper on the best open-sourced RAG stack suggests this context window should be no more than 512 tokens.)

On top of that, I’ve already covered how to handle paragraphs that are too large to fit into the embedder’s context window by chunking them in a way that guarantees they will fit.
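As a quick refresher, the guarantee can come from accumulating words (or sentences) up to a token budget. The sketch below is illustrative only: `split_to_fit` and the naive whitespace token count are stand-ins, not the code from that earlier post; a real implementation would count tokens with the embedder's own tokenizer.

```python
def split_to_fit(paragraph: str, max_tokens: int = 512) -> list[str]:
    """Split an oversized paragraph into chunks guaranteed to fit a token budget.

    Uses whitespace-separated words as a naive proxy for tokens.
    """
    words = paragraph.split()
    chunks: list[str] = []
    current: list[str] = []
    for word in words:
        # Flush the current chunk before it would exceed the budget
        if len(current) + 1 > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```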

However, one could argue that human chunking (i.e., paragraphs) isn’t the best approach. There are surely more scientific methods, like measuring semantic similarity between chunks through trial and error until we find the ideal chunks (i.e., semantic chunking).

But there’s another reason I prefer to keep human-created paragraphs together as a single unit: When I feed the text back to a human to read, it’s just easier for them to receive a full paragraph from a book rather than one that’s chunked up mid-paragraph or across paragraphs.

But that only works if we have a way to read paragraph by paragraph from a document. We had that with EPUB files, thanks to them being HTML and using the <p> tag for each paragraph, which made it trivially easy to read one paragraph at a time. But what about PDF documents?
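For comparison, here is a minimal sketch of that EPUB-style extraction using only Python's standard library. (Real EPUB handling would also unzip the archive and iterate its XHTML chapter files; `ParagraphExtractor` is a hypothetical name for illustration.)

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element as one paragraph."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: list[str] = []
        self._in_p = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

parser = ParagraphExtractor()
parser.feed("<p>First paragraph.</p><p>Second paragraph.</p>")
```

Each entry in `parser.paragraphs` is then exactly one human-authored paragraph.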

As discussed in our previous post, this is not an easy feat. IBM’s Docling can help—it at least attempts to break the text into paragraphs. But it falls short at times. What we’ll need is a program that can pull broken paragraphs back together. But how might that work?

Removing Interrupting Text

We’ve already got two parts of the puzzle in place:

One thing I covered in the last post, but didn’t make explicit, was that I removed footnotes and bottom notes from the PDF’s text. I then inserted them back at the end to avoid losing them. This way, those interrupting texts are out of the way of the main text. By doing this, they no longer break up the text flow! (Yet we can still embed them into the document store.)

This is the first step towards bringing paragraphs back together.

Finding Ends of Sentences

The next trick is to determine whether a TextItem that Docling feeds us terminates correctly with end-of-sentence punctuation, such as a period, question mark, or exclamation point. Let's write some code to determine that: (All code examples found in this commit at time of writing this blog post.)

def is_sentence_end(text: str) -> bool:
    has_end_punctuation: bool = is_ends_with_punctuation(text)
    # Does it end with a closing bracket, quote, etc.?
    ends_with_bracket: bool = (text.endswith(")")
                               or text.endswith("]")
                               or text.endswith("}")
                               or text.endswith("\"")
                               or text.endswith("\'"))
    return (has_end_punctuation or
            (ends_with_bracket and is_ends_with_punctuation(text[0:-1])))

def is_ends_with_punctuation(text: str) -> bool:
    return text.endswith(".") or text.endswith("?") or text.endswith("!")

This method takes a text input and checks for a series of terminations that it considers a proper sentence ending. The is_ends_with_punctuation method simply checks for a period, question mark, or exclamation point. However, my testing showed that this alone often wasn’t reliable.

To improve accuracy, I use this method inside is_sentence_end, which also checks for sentence endings inside quotes, brackets, and similar cases. Together, these two methods work pretty well. Now, we have a way to detect if a text item ends with a sentence. This doesn’t guarantee we’ll always find the end of a paragraph—since a paragraph might break across two pages while still ending with a sentence—but it should get us pretty close.
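A few quick checks, reproducing the two methods above so the example is self-contained, illustrate the cases they catch:

```python
def is_sentence_end(text: str) -> bool:
    has_end_punctuation: bool = is_ends_with_punctuation(text)
    # Does it end with a closing bracket, quote, etc.?
    ends_with_bracket: bool = (text.endswith(")")
                               or text.endswith("]")
                               or text.endswith("}")
                               or text.endswith("\"")
                               or text.endswith("\'"))
    return (has_end_punctuation or
            (ends_with_bracket and is_ends_with_punctuation(text[0:-1])))

def is_ends_with_punctuation(text: str) -> bool:
    return text.endswith(".") or text.endswith("?") or text.endswith("!")

# Plain sentence endings
assert is_sentence_end("It ends here.")
assert is_sentence_end("Does it end here?")
# Ending punctuation tucked inside a quote or bracket
assert is_sentence_end('He said "stop!"')
assert is_sentence_end("(as shown in the appendix.)")
# A line that breaks mid-sentence
assert not is_sentence_end("a criterion of demarcation between")
```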

I’ve wondered whether NLTK has something built in to do the same; I didn’t think of it until after I had written and debugged this code, so it’s something to explore further.

Next, we need a method to combine paragraphs together:

def combine_paragraphs(p1_str: str, p2_str: str) -> str:
    # If the first paragraph ends with final punctuation, the two are
    # distinct paragraphs; otherwise the second continues the first
    if is_sentence_end(p1_str):
        return p1_str + "\n" + p2_str
    else:
        return p1_str + " " + p2_str

Note that I use this both for combining parts of a single paragraph and for combining two short paragraphs. That is why it checks is_sentence_end: if the first paragraph ends a sentence, it puts a newline between them (because they are two distinct short paragraphs I want to combine). Otherwise, I join them with a single space because they are presumed to be two parts of a single paragraph split across two pages.

It should be obvious now where we’re going with this. We’ll have code that does something like this:

# If the paragraph does not end with final punctuation, accumulate it
if not is_sentence_end(p_str):
    combined_paragraph = combine_paragraphs(combined_paragraph, p_str)
    combined_chars += p_str_chars
    continue

If a Docling text item doesn’t end like a regular sentence, then we’ll try to combine it to the next one. So, something like this in a PDF:

[Image: a PDF excerpt showing the quoted paragraph split across a page break, with a hyphenated word interrupted by the page footer and header.]

Now becomes:

“Hence I suggested that testability or refutability or falsifiability should be accepted as a criterion of the scientific character of theoretical systems; that is to say, as a criterion of demarcation between empirical science on the one hand and pure mathematics, logic, metaphysics, and pseudo-science on the other.”

The page footer (page number) and page header that interrupted the paragraph are removed, and the hyphenation dashes are stripped so the split words are rejoined.
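Putting the pieces together, the accumulation pass might look something like the sketch below. It assumes the text items arrive as plain strings with footnotes and headers/footers already removed, and it compresses is_sentence_end for brevity; merge_text_items is an illustrative name, not the final DoclingParser.

```python
def is_sentence_end(text: str) -> bool:
    # Compressed version: peel off one trailing bracket/quote, then
    # check for terminal punctuation
    if text.endswith((")", "]", "}", "\"", "'")):
        text = text[:-1]
    return text.endswith((".", "?", "!"))

def merge_text_items(items: list[str]) -> list[str]:
    """Accumulate text items until each accumulated piece ends a sentence."""
    paragraphs: list[str] = []
    pending = ""
    for text in items:
        text = text.strip()
        if not text:
            continue
        if pending.endswith("-"):
            # Rejoin a word hyphenated across a line or page break
            pending = pending[:-1] + text
        elif pending:
            pending = pending + " " + text
        else:
            pending = text
        if is_sentence_end(pending):
            paragraphs.append(pending)
            pending = ""
    if pending:  # flush any trailing fragment
        paragraphs.append(pending)
    return paragraphs

merged = merge_text_items(["Hence I suggested that testa-",
                           "bility should be accepted as a criterion."])
```

Running this on the two fragments above yields the single rejoined sentence, which is exactly the behavior we want for paragraphs broken across pages.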

Conclusion

We’re ready to pull it all together in the next blog post, where we’ll finally get a working version of our DoclingParser.
