
Pulling It All Together: Docling for Loading PDFs
- By Bruce Nielson
- ML & AI Specialist
Up to this point I’ve written a number of blog posts on the use of Docling to load PDFs. Below is the list of relevant blog posts plus the functions and methods (covered under each post) that we’ll be using in this post.
- Using Docling for PDF to Markdown
- Using NLTK to remove dashes and hyphens
  - is_valid_word()
  - combine_hyphenated_words()
- Using Docling to parse the PDF based on Docling-created text labels
  - _get_processed_texts()
  - is_section_header()
  - is_page_footer()
  - is_page_header()
  - is_footnote()
  - is_text_break()
  - is_page_not_text()
  - is_page_text()
  - is_ends_with_punctuation()
  - is_text_item()
  - is_bottom_note()
- Using Docling to fix paragraphs broken across page boundaries
  - is_sentence_end() <part 3>
  - is_ends_with_punctuation() <part 3>
  - combine_paragraphs() <part 3>
Now we’re going to pull together everything we’ve done into a Docling parser comparable to our EPub/HTML Parser, complete with metadata such as section names. This post will cover the rest of what you need to know.
A Few More Methods
Here is the commit at the time of writing this post so that you can follow along with the actual code.
I’m going to define a few more useful functions. First, a function to get the next text in the list. Think of it as looking ahead one text item so that we can make choices based on the label (e.g. ‘section header’) of the text that follows:
def get_next_text(texts: List[Union[SectionHeaderItem, ListItem, TextItem]], i: int) \
        -> Optional[Union[ListItem, TextItem]]:
    # Seek through the list of texts to find the next text item using is_text_item
    # Should return None if no more text items are found
    for j in range(i + 1, len(texts)):
        if j < len(texts) and is_text_item(texts[j]):
            return texts[j]
    return None
Now let’s create some functions to clean up text:
def remove_extra_whitespace(text: str) -> str:
    # Remove extra whitespace in the middle of the text
    return ' '.join(text.split())
def clean_text(p_str: str) -> str:
    p_str = str(p_str).strip()  # Convert text to a string and remove leading/trailing whitespace
    p_str = p_str.encode('utf-8').decode('utf-8')
    p_str = re.sub(r'\s+', ' ', p_str).strip()  # Replace multiple whitespace with single space
    p_str = re.sub(r"([.!?]) '", r"\1'", p_str)  # Remove the space between punctuation (.!?) and '
    p_str = re.sub(r'([.!?]) "', r'\1"', p_str)  # Remove the space between punctuation (.!?) and "
    p_str = re.sub(r'\s+\)', ')', p_str)  # Remove whitespace before a closing parenthesis
    p_str = re.sub(r'\s+]', ']', p_str)  # Remove whitespace before a closing square bracket
    p_str = re.sub(r'\s+}', '}', p_str)  # Remove whitespace before a closing curly brace
    p_str = re.sub(r'\s+,', ',', p_str)  # Remove whitespace before a comma
    p_str = re.sub(r'\(\s+', '(', p_str)  # Remove whitespace after an opening parenthesis
    p_str = re.sub(r'\[\s+', '[', p_str)  # Remove whitespace after an opening square bracket
    p_str = re.sub(r'\{\s+', '{', p_str)  # Remove whitespace after an opening curly brace
    p_str = re.sub(r'(?<=\s)\.([a-zA-Z])', r'\1', p_str)  # Remove a period that follows a whitespace and comes before a letter
    p_str = re.sub(r'\s+\.', '.', p_str)  # Remove any whitespace before a period
    # Remove footnote numbers at end of a sentence. Check for a digit at the end and drop it
    # until there are no more digits or the sentence is now a valid end of a sentence.
    while p_str and p_str[-1].isdigit() and not is_sentence_end(p_str):
        p_str = p_str[:-1].strip()
    return p_str
Without going into too much detail, the first function removes extra spaces, so word1<space, space>word2 becomes word1<space>word2.
The second function is a series of regexes that clean up the text in a variety of ways. See the inline comments for what each one does.
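To make that concrete, here is a quick hypothetical before-and-after (the input string is made up, and the result assumes is_sentence_end() from part 3 treats only terminal punctuation as a sentence end):
# Hypothetical example of clean_text in action. The trailing "12" mimics a
# footnote number stuck to the end of a sentence.
messy = "A sentence with  extra   spaces ( and stray brackets ) , plus a footnote.12"
print(clean_text(messy))
# -> "A sentence with extra spaces (and stray brackets), plus a footnote."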
Because many section titles start with a roman numeral, it’s often helpful to be able to detect if a word is really a roman numeral:
def is_roman_numeral(s: str) -> bool:
    roman_numeral_pattern = r'(?i)^(M{0,3})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
    return bool(re.match(roman_numeral_pattern, s.strip()))
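For example (these inputs are just illustrations):
is_roman_numeral("XIV")      # True
is_roman_numeral("xiv")      # True - the (?i) flag makes the match case-insensitive
is_roman_numeral("Chapter")  # False
# Note: the pattern also matches an empty string, so callers should make sure
# the input isn't blank if that matters.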
Docling tracks page numbers, so let’s write a function to read the page number:
def get_current_page(text: Union[SectionHeaderItem, ListItem, TextItem],
                     combined_paragraph: str,
                     current_page: Optional[int]) -> Optional[int]:
    return text.prov[0].page_no if current_page is None or combined_paragraph == "" else current_page
Here is a helper function that checks whether an element is one we don’t want to include in the document fragments we’ll run semantic search over, either because of its Docling label or because it is just a roman numeral:
def should_skip_element(text: Union[SectionHeaderItem, ListItem, TextItem]) -> bool:
    return any([
        is_page_footer(text),
        is_page_header(text),
        is_roman_numeral(text.text)
    ])
At Last: The Main Loop!
Finally, we’re ready to go over the (now fairly simple) DoclingParser class I wrote that does all the magic. First let’s declare the class and the constructor:
class DoclingParser:
    def __init__(self, doc: DoclingDocument,
                 meta_data: dict[str, str],
                 min_paragraph_size: int = 300,
                 start_page: Optional[int] = None,
                 end_page: Optional[int] = None,
                 double_notes: bool = False):
        self._doc: DoclingDocument = doc
        self._min_paragraph_size: int = min_paragraph_size
        self._docs_list: List[ByteStream] = []
        self._meta_list: List[Dict[str, str]] = []
        self._meta_data: dict[str, str] = meta_data
        self._start_page: Optional[int] = start_page
        self._end_page: Optional[int] = end_page
        self._double_notes: bool = double_notes
And let’s create a final helper method to append a document fragment and its associated metadata into our growing list:
    def _add_paragraph(self, text: str, para_num: int, section: str,
                       page: Optional[int], docs: List[ByteStream], meta: List[Dict]):
        docs.append(ByteStream(text.encode('utf-8')))
        meta.append({
            **self._meta_data,
            "paragraph_#": str(para_num),
            "section_name": section,
            "page_#": str(page)
        })
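So if the parser were constructed with, say, meta_data = {"title": "My Book"} (a hypothetical value), each fragment’s metadata entry would end up looking something like this:
# Hypothetical metadata entry for one document fragment:
{
    "title": "My Book",        # copied from self._meta_data
    "paragraph_#": "12",
    "section_name": "I. Introduction",
    "page_#": "42"
}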
And now the ‘run()’ method should make perfect sense:
    def run(self) -> Tuple[List[ByteStream], List[Dict[str, str]]]:
        temp_docs: List[ByteStream] = []
        temp_meta: List[Dict[str, str]] = []
        combined_paragraph: str = ""
        i: int
        combined_chars: int = 0
        para_num: int = 0
        section_name: str = ""
        page_no: Optional[int] = None
        first_note: bool = False
        texts = self._get_processed_texts()

        for i, text in enumerate(texts):
            next_text = get_next_text(texts, i)
            page_no = get_current_page(text, combined_paragraph, page_no)

            # Check if the current page is within the valid range
            if self._start_page is not None and page_no is not None and page_no < self._start_page:
                page_no = None
                continue
            if self._end_page is not None and page_no is not None and page_no > self._end_page:
                if self._double_notes and not first_note:
                    self._min_paragraph_size *= 2
                    first_note = True
                continue

            # Update section header if the element is a section header
            if is_section_header(text):
                section_name = text.text
                continue
            if should_skip_element(text):
                continue

            p_str = clean_text(text.text)
            p_str_chars = len(p_str)

            # If the paragraph does not end with final punctuation, accumulate it
            if not is_sentence_end(p_str):
                combined_paragraph = combine_paragraphs(combined_paragraph, p_str)
                combined_chars += p_str_chars
                continue

            # p_str ends with a sentence end; decide whether to process or accumulate it
            total_chars = combined_chars + p_str_chars
            if is_section_header(next_text):
                # Immediately process if the next text is a section header
                p_str = combine_paragraphs(combined_paragraph, p_str)
                combined_paragraph, combined_chars = "", 0
            elif total_chars < self._min_paragraph_size:
                # Not enough characters accumulated yet; decide based on next_text
                if next_text is None or (not is_page_text(next_text) and is_sentence_end(p_str)):
                    # End of document or next text item is not a text item and current paragraph ends with punctuation
                    # Process the paragraph and reset the accumulator even though this is a short paragraph
                    p_str = combine_paragraphs(combined_paragraph, p_str)
                    combined_paragraph, combined_chars = "", 0
                else:
                    # Combine with next paragraph
                    combined_paragraph = combine_paragraphs(combined_paragraph, p_str)
                    combined_chars = total_chars
                    continue
            else:
                # Sufficient characters: process the paragraph and reset the accumulator
                p_str = combine_paragraphs(combined_paragraph, p_str)
                combined_paragraph, combined_chars = "", 0

            p_str = combine_hyphenated_words(p_str)
            if p_str:  # Only add non-empty content
                para_num += 1
                self._add_paragraph(p_str, para_num, section_name, page_no, temp_docs, temp_meta)
                page_no = None

        return temp_docs, temp_meta
Let’s break this down. First, the loop itself:
for i, text in enumerate(texts):
    next_text = get_next_text(texts, i)
    page_no = get_current_page(text, combined_paragraph, page_no)
Skip the text if its page falls outside the valid range (as determined by the associated valid pages csv):
# Check if the current page is within the valid range
if self._start_page is not None and page_no is not None and page_no < self._start_page:
    page_no = None
    continue
if self._end_page is not None and page_no is not None and page_no > self._end_page:
    if self._double_notes and not first_note:
        self._min_paragraph_size *= 2
        first_note = True
    continue
Is this text something we can skip, or a section header we should record?
# Update section header if the element is a section header
if is_section_header(text):
    section_name = text.text
    continue
if should_skip_element(text):
    continue
Clean the text:
p_str = clean_text(text.text)
p_str_chars = len(p_str)
The rest of the code is where the real magic happens. It decides whether to turn this paragraph into a document fragment or try to combine it with another one. We combine either because the paragraph is too short on its own (we don’t want paragraphs that are too short to dominate the semantic search) or because it isn’t really a complete paragraph, say because it was split by a page break. Here is the basic logic:
- Check whether the text ends a sentence. If it doesn’t, accumulate it so it can be combined with the next text. If it does, it is now (possibly combined with previously accumulated text) a full paragraph we may save as a document fragment.
- If the next text is a section header, process the paragraph immediately, even if it is short, so that we never combine text across a section boundary.
- If it’s too small (smaller than the given minimum size for a paragraph), accumulate it to combine with the next one.
Finally, we save the accumulated paragraph off with its metadata and move to the next text.
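To make the flow concrete, here is a hypothetical trace (the texts are invented, and min_paragraph_size is assumed to be the default 300) of how a few consecutive text items would move through the loop:
# Hypothetical trace through run():
#   "The war dragged on for"   -> no sentence end: accumulate and continue
#   "four more years."         -> sentence end, but the total is still under the
#                                 minimum size and the next text is ordinary page
#                                 text: accumulate and continue
#   "It ended in 1918."        -> sentence end and the next text is a section
#                                 header: flush the combined paragraph as one
#                                 document fragment, short or not
#   "II. The Aftermath"        -> section header: update section_name and continue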
Conclusion
And that’s it! We now have a way to parse a PDF, clean up the text, and provide metadata such as section headers.
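To tie it together, here is a minimal usage sketch. The file path, metadata, and page range are hypothetical, and it assumes Docling’s DocumentConverter for loading the PDF along with the DoclingParser class defined above:
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("my_book.pdf")  # hypothetical PDF
parser = DoclingParser(result.document,
                       meta_data={"title": "My Book", "author": "Jane Doe"},
                       min_paragraph_size=300,
                       start_page=5,      # hypothetical valid page range
                       end_page=250)
docs, meta = parser.run()
print(f"Created {len(docs)} document fragments")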
There is undoubtedly more to do here. For example, PDF extraction often gets words wrong or puts spaces between the letters of a single word. I haven’t addressed deeper problems like that yet. I plan to in the future, but it will probably require a lot more processing, or possibly even an LLM or other language model to figure out from context what the erroneous word was meant to be.
And why go through this much trouble? Isn’t it true that semantic search will probably mostly figure it out anyhow?
Ah, but this isn’t just about doing a better semantic search (though that is part of it). It’s also because we’re going to explore how to read the text aloud via text-to-speech (TTS). To really do a good job of that, you need to fix all the words or they will get mispronounced. But that is the subject of a future post.
Also, if you are interested in learning how Docling and our other AI tutorials can be put to use within your business, reach out through our Contact-Us page for a free consultation and/or a discussion on AI.