
Using NLTK to Improve RAG (Retrieval Augmented Generation) Text Quality
- By Bruce Nielson
- ML & AI Specialist
We’ve been working with Docling (and also PyMuPDF4LLM) to read and parse text from a PDF document. But PDFs present a challenge because, unlike EPUBs, they are designed for pure visual layout, not as a source of readable text. That means what looks great on a screen in a PDF might turn into a mess when you extract the text.
For example, consider this paragraph in a PDF:
This paragraph looks just fine when you read it in a PDF. But note the highlighted hyphens at the end of the line—your eyes are trained not to even notice them. However, when Docling processes this text, it’s going to look something like this:
It is not surprising to find such views in Bishop Berkeley. Indeed, Berkeley made it clear that he published them largely in the hope of defending religion against the onslaught of science and of 'free- thinking'; against the claim that reason, unaided by divine revela- tion, can discover a world behind the world of appearance. But it is a little surprising to find support for these instrumentalist views in the camp of the admirers of science.
I doubt it affects a semantic search too much (though it might a little), but it’s not the most pleasant to read. It would be nice if we could fix problems like this.
In addition, Docling sometimes inserts odd spaces in the wrong places or introduces other artifacts due to PDFs not really being meant to serve as a source of text the way we’re trying to use them. PDFs are, first and foremost, designed to be read on a screen—not fed into an AI for processing by a Large Language Model (LLM).
How to Remove Unnecessary Hyphens
How might we go about removing unnecessary hyphens? This isn’t a simple problem. You might think you could just remove all hyphens, but in some cases, hyphens are necessary—for example, in compound words or to set off part of a sentence for emphasis. There’s no perfect set of rules that removes only the hyphens you don’t want while keeping the ones you do.
You might consider doing it based on whether there’s a space or not. But hyphens used to break up a sentence often have no spaces, while hyphens that split a word at the end of a line often do. Unfortunately, this isn’t consistent either way, so that approach won’t work.
What we really need is to determine whether, once spaces are removed, the hyphen splits a single word or if there are full words on both sides of it.
For example, revela-tion is clearly part of a single word—revelation—because revela and tion aren’t words on their own.
But how would we accomplish that?
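Conceptually, the test we want looks something like the sketch below. (This is just a toy illustration — the known_words set and the should_join() helper are hypothetical stand-ins; the real version, built on NLTK, follows.)
# Toy word set purely for illustration; the real check uses NLTK's word corpus.
known_words = {"revelation", "free", "thinking"}

def should_join(left, right):
    """Join the two halves only when they aren't words themselves but their join is."""
    whole = left + right
    return whole in known_words and not (left in known_words and right in known_words)

print(should_join("revela", "tion"))    # True  -> drop the hyphen: "revelation"
print(should_join("free", "thinking"))  # False -> keep the hyphen: "free-thinking"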
Installing NLTK
Luckily there is an existing library called NLTK (Natural Language Toolkit) that can help us out here.
First, we need to install it into our Python environment by doing the following:
pip install nltk
I’m assuming here that you already have NumPy installed – which you do if you’ve already done the environment setup for the “Book Search Archive”.
You can find the version of the Book Search Archive as of this post (which includes the code below) at this link.
Using NLTK
To make this work we’ll first need to download a few word lists and corpora for NLTK to use:
import nltk

nltk.download('words')
nltk.download('wordnet')
nltk.download('omw-1.4')
Then you’ll need to create a Python set of words:
words_list: set = set(nltk.corpus.words.words())
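As a quick sanity check (the exact contents of the 'words' corpus can vary a little between NLTK versions, but common words like this should be present):
print("revelation" in words_list)  # True: a complete English word
print("revela" in words_list)      # False: just the fragment left by a line-break hyphen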
And then set up the lemmatizer and stemmer:
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
What are a lemmatizer and a stemmer? A lemmatizer returns the base form (the lemma) of a word. So:
word1, word2, word3 = "caring", "geese", "troubled"

print("Lemmatization:")
print("caring →", lemmatizer.lemmatize(word1, pos="v"))   # caring → care
print("geese →", lemmatizer.lemmatize(word2))             # geese → goose
print("troubled →", lemmatizer.lemmatize(word3, pos="a")) # troubled → troubled (lemma is the same)
A stemmer, on the other hand, just chops the end off a word:
print("\nStemming:")
print("caring →", stemmer.stem(word1)) # caring → care
print("geese →", stemmer.stem(word2)) # geese → gees (incorrect root)
print("troubled →", stemmer.stem(word3)) # troubled → troubl
In other words:
- Lemmatization considers the word’s meaning and grammar (e.g., geese → goose).
- Stemming blindly chops off word endings (e.g., geese → gees, troubled → troubl).
Unfortunately, even with the lemmatizer and stemmer, I found this still didn’t always identify the best places to remove hyphens. So I also built some ‘custom lemmas’ of my own for common problems:
suffixes = {
    "ability": "able",  # testability -> testable
    "ibility": "ible",  # possibility -> possible
    "iness": "y",       # happiness -> happy
    "ity": "e",         # creativity -> creative
    "tion": "e",        # production -> produce
    "able": "",         # testable -> test
    "ible": "",         # possible -> poss
    "ing": "",          # reading -> read
    "ed": "",           # tested -> test
    "s": ""             # tests -> test
}
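Applying these is simple string surgery: strip the suffix, append the replacement, and re-check the result. Here’s a minimal sketch of that step (strip_suffix is just an illustrative helper name, not part of the final code):
def strip_suffix(word, suffixes):
    """Return the word with the first matching suffix swapped for its replacement."""
    for suffix, replacement in suffixes.items():
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(strip_suffix("testability", suffixes))  # testable
print(strip_suffix("happiness", suffixes))    # happy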
Finding a Valid Word Using NLTK
With this all in place, here is my is_valid_word() function that I created to check whether the two sides of a hyphen contain valid words or not:
def is_valid_word(word):
    """Check if a word is valid by comparing it directly and via stemming."""
    stem = stemmer.stem(word)
    if (word.lower() in words_list
            or word in words_list):
        return True
    elif (stem in words_list
            or stem.lower() in words_list):
        return True
    # Check all lemmatizations of the word
    options = ['n', 'v', 'a', 'r', 's']
    for option in options:
        lemma = lemmatizer.lemmatize(word, pos=option)
        if lemma in words_list:
            return True
    # Check for custom lemmatizations
    suffixes = {
        "ability": "able",  # testability -> testable
        "ibility": "ible",  # possibility -> possible
        "iness": "y",       # happiness -> happy
        "ity": "e",         # creativity -> creative
        "tion": "e",        # production -> produce
        "able": "",         # testable -> test
        "ible": "",         # possible -> poss
        "ing": "",          # reading -> read
        "ed": "",           # tested -> test
        "s": ""             # tests -> test
    }
    for suffix, replacement in suffixes.items():
        if word.endswith(suffix):
            stripped_word = word[:-len(suffix)] + replacement
            if is_valid_word(stripped_word):
                # A non-empty string is truthy, so callers can treat this as "valid"
                return stripped_word
    return False
The code does these checks:
- Is the text passed found in the list of words we previously downloaded?
- Is the stem of the text passed in the word list?
- Is the lemma of the text passed in the word list?
Note that there are five part-of-speech options for lemmatizing, called n, v, a, r, and s. We won’t worry, for now, about what those are. We’re just going to check every single kind.
Finally, we try replacing various suffixes from my custom list with alternatives more likely to properly lemmatize and then try again.
If any of these find it to be a word, we count the text passed as a word.
I’ve found in practice this function works pretty well. It probably has some room for improvement.
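For example, calling it on the fragments from the earlier paragraph (exact results depend on what’s in the 'words' corpus, so treat the comments as what I’d expect rather than a guarantee):
print(is_valid_word("revela"))      # False: just a fragment, and no stem, lemma, or suffix rule rescues it
print(is_valid_word("revelation"))  # True: found in the word list (directly or via lemmatization)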
Now we’ve got what we need to check whether a hyphen sits between two actual words or is splitting a single word in two. (Obviously, there is more logic required to actually use this function.)
Pulling it All Together: The “De-Hyphenator”
Now let’s pull it all together and write a function that takes text for a paragraph and rewrites it to remove unwanted hyphens:
import re

def combine_hyphenated_words(p_str):
    # This regular expression looks for cases where a dash separates two parts of a word.
    # The idea is to combine the two parts and check if they form a valid word.
    def replace_dash(match):
        word1, word2 = match.group(1), match.group(2)
        combined = word1.strip() + word2.strip()
        # does word2 start with a space after the hyphen?
        if word2.startswith(" ") and is_valid_word(combined):
            # When there is a space after the hyphen, it is likely that the hyphen is separating two parts
            # of a single word
            return combined
        # else check for each part individually being a word. If so, this is probably a compound word
        elif is_valid_word(word1.strip()) and is_valid_word(word2.strip()):
            return word1.strip() + '-' + word2.strip()
        # else if the combined word is a valid word, then we probably had one word broken in two
        elif is_valid_word(combined):
            return combined  # Combine the parts if they form a valid word
        # if the combined word starts with a capital letter, then it is likely a proper noun. Combine the parts.
        elif combined[0].isupper() and not word2.strip()[0].isupper() and not is_valid_word(word2.strip()):
            return combined
        # Default - assume the hyphen is separating two words
        return word1.strip() + '-' + word2.strip()

    # Replace soft hyphen characters (¬) with a regular dash
    p_str = p_str.replace("¬", "-")
    # p_str = p_str.replace("- ", "-")
    # Look for dashes separating word parts (an optional space may follow the dash)
    p_str = re.sub(r'(\w+)-(\s?\w+)', replace_dash, p_str)
    return p_str
You can see that I have a sub-function called replace_dash() that does most of the work. The main function first replaces the “¬” character with a regular dash:
p_str = p_str.replace("¬", "-")
That may look like an odd substitution, but “¬” is how the PDF’s soft hyphens often come through in the extracted text, so we normalize them to regular dashes before doing anything else.
Next, we use re.sub (a regular expression substitution function) to find all strings of letters separated by dashes and run the replace_dash() function on each match:
p_str = re.sub(r'(\w+)-(\s?\w+)', replace_dash, p_str)
return p_str
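To make the pattern concrete, here is what it captures on a small snippet (illustrative only; re.findall is used here just to show the two capture groups that replace_dash() receives):
import re

matches = re.findall(r'(\w+)-(\s?\w+)', "divine revela- tion and free-thinking")
print(matches)  # [('revela', ' tion'), ('free', 'thinking')]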
The replace_dash() function does the real work. It does the following:
- Set word1 to be the text before the dash and word2 to be the text after the dash
- Set the ‘combined’ variable to be a concatenation of the two.
- Check if word2 starts with a space (i.e., the original text was word-dash-space, which usually means the word was split at the end of a line) and the combined word is valid. If so, return the combined word.
- Check if word1 and word2 are each individually valid. If so, return them with a dash in between.
- Check if the combined word is valid. If so, return combined word. (Similar to #3 but lower priority because there is no sign that it happened at the end of the text line in the PDF)
- If the combined word starts with an uppercase letter, word2 does not, and word2 is not valid on its own, return the combined word. This is likely a proper noun that was split up, so we recombine it.
- Finally, the default is to just return it back as word-dash-word.
Many of these rules probably sound quite similar, but the order of operations matters here. This function doesn’t always get it right, but it works pretty well most of the time. If you have ideas on how to improve it, let me know.
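To see it in action, here’s a quick run on a snippet of the example paragraph from earlier (the exact behavior depends on the contents of NLTK’s 'words' corpus, so treat the output as illustrative):
sample = "reason, unaided by divine revela- tion, can discover a world behind the world of appearance"
print(combine_hyphenated_words(sample))
# "revela- tion" should be rejoined into "revelation", since "revela" and "tion"
# aren't valid words on their own but "revelation" is.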
Conclusions
NLTK is another tool in the toolbox for AI development. LLMs deal with language, and NLTK is there to assist you. We only used it for a very simple purpose – checking whether a word is valid so we could remove unneeded hyphens – but NLTK has a lot more to offer.