Refactoring the Book2Audio Parsers
- By Bruce Nielson
- ML & AI Specialist
This is a progress update on Book2Audio — a tool that converts PDF and EPUB books into audio files using text-to-speech.
What We Did
Book2Audio has two parsers: DoclingParser for PDFs and EpubParser for EPUBs. Both do similar things — extract text from a document, clean it up, and chunk it into paragraphs — but they were implemented quite differently under the hood.
This update aligns them in three areas: paragraph accumulation, text cleaning, and overall parser design.
Shared Paragraph Accumulation
We extracted the shared paragraph accumulation logic into a new TextProcessor class. Both parsers now produce RawChunk objects and hand them off to TextProcessor, which handles the decisions about when to accumulate and when to emit a paragraph. Chunking behavior is now consistent across PDF and EPUB sources.
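To make the hand-off concrete, here is a minimal sketch of the idea. The RawChunk fields, the min_chars threshold, and the emit rule are assumptions made for illustration; the real TextProcessor may decide differently.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RawChunk:
    text: str
    is_header: bool = False  # hypothetical label set by the parser


class TextProcessor:
    """Accumulates raw chunks into paragraphs (illustrative rules only)."""

    def __init__(self, min_chars: int = 200):
        self.min_chars = min_chars  # assumed threshold, not taken from the project

    def process(self, chunks: Iterable[RawChunk]) -> List[str]:
        paragraphs: List[str] = []
        buffer = ""
        for chunk in chunks:
            if chunk.is_header:
                # A header is a hard boundary: flush pending text first.
                if buffer:
                    paragraphs.append(buffer)
                    buffer = ""
                paragraphs.append(chunk.text)
                continue
            buffer = f"{buffer} {chunk.text}".strip()
            # Emit once the buffer is long enough and ends on sentence punctuation.
            if len(buffer) >= self.min_chars and buffer.endswith((".", "!", "?")):
                paragraphs.append(buffer)
                buffer = ""
        if buffer:
            # Whatever is left at the end becomes the final paragraph.
            paragraphs.append(buffer)
        return paragraphs
```

Because both parsers only have to produce a list of RawChunk objects, the same process() call serves PDF and EPUB input alike.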
Unified Text Cleaning
Previously, PDF cleaning happened in DoclingParser and EPUB cleaning happened in EpubParser, with two separate cleaning functions that overlapped significantly. We merged these into a single clean_text function in general_utils.py and moved all cleaning into TextProcessor itself. Parsers now pass completely raw text — they extract and label, nothing more.
Cleaning now happens in a single upfront pass over all chunks before the accumulation loop runs. This gives the cleaner full context and sets up nicely for the LLM cleaning step described below.
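The upfront pass could look roughly like the sketch below. The specific cleaning rules and the clean_all helper are hypothetical stand-ins; only the name clean_text comes from the project, and its real rules may differ.

```python
import re
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class RawChunk:
    text: str
    is_header: bool = False  # hypothetical label set by the parser


def clean_text(text: str) -> str:
    # Illustrative rules only; the real clean_text in general_utils.py may differ.
    text = text.replace("\u00ad", "")                    # drop soft hyphens
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)   # rejoin words split across a line break
    text = re.sub(r"\s+", " ", text)                     # collapse runs of whitespace
    return text.strip()


def clean_all(chunks: List[RawChunk]) -> List[RawChunk]:
    # Single pass over every chunk before any accumulation happens.
    return [replace(chunk, text=clean_text(chunk.text)) for chunk in chunks]
```

Keeping the pass separate from the accumulation loop means a later cleaning stage can be slotted in at the same point without touching the parsers.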
We also took the opportunity to fix a long-standing bug in is_sentence_end — it wasn't recognizing curly quote characters as sentence-ending, which caused incorrect paragraph accumulation in some cases.
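One plausible shape for the fix is shown below. This is a sketch under assumptions: the constant names and the exact set of trailing characters are illustrative, and the real is_sentence_end may apply further rules.

```python
SENTENCE_PUNCTUATION = ".!?"
# Straight and curly closing quotes, plus closing brackets, may trail the punctuation.
TRAILING_CLOSERS = "\"'\u201d\u2019)]"


def is_sentence_end(text: str) -> bool:
    # Strip trailing whitespace, then any closing quotes or brackets,
    # and check whether what remains ends on sentence punctuation.
    stripped = text.rstrip().rstrip(TRAILING_CLOSERS)
    return bool(stripped) and stripped[-1] in SENTENCE_PUNCTUATION
```

With only straight quotes in the trailing set, a chunk ending in a curly closing quote would look unfinished and be merged into the next paragraph.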
Consistent Parser Design
We also made EpubParser consistent with DoclingParser in several other ways:
- Both now accept either a file path or a pre-loaded document object, making unit testing much simpler and eliminating the need for patch() in tests
- Both are configured entirely at construction time; no configuration is passed into run()
- Both live in a parsers/ package and inherit from a shared BaseParser abstract base class
- The CSV-based section skipping was removed from EpubParser: the caller loads the CSV and passes the result in, consistent with how DoclingParser handles page ranges
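The design points above can be sketched as follows. The constructor signature, the _load hook, the skip_sections parameter, and the FakeEpubParser subclass are all hypothetical; only the BaseParser name and the run() method come from the project.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, List, Optional, Union


class BaseParser(ABC):
    def __init__(self, source: Union[str, Path, Any],
                 skip_sections: Optional[List[str]] = None):
        # Accept a path (load it) or an already-loaded document (use it directly).
        if isinstance(source, (str, Path)):
            self.document = self._load(Path(source))
        else:
            self.document = source
        # All configuration is fixed here, so run() takes no arguments.
        self.skip_sections = set(skip_sections or [])

    @abstractmethod
    def _load(self, path: Path) -> Any:
        """Load a document from disk."""

    @abstractmethod
    def run(self) -> List[str]:
        """Return the extracted raw text pieces."""


class FakeEpubParser(BaseParser):
    """Stand-in subclass; its document is a list of (section, text) tuples."""

    def _load(self, path: Path) -> Any:
        raise NotImplementedError("file loading is omitted from this sketch")

    def run(self) -> List[str]:
        return [text for section, text in self.document
                if section not in self.skip_sections]


# A test can pass a pre-built document straight in, with no patching of file I/O.
parser = FakeEpubParser([("Cover", "x"), ("Chapter 1", "Hello")],
                        skip_sections=["Cover"])
result = parser.run()
```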
Why It Matters
The immediate benefit is cleaner, more testable code and consistent behavior between the two parsers. But the real motivation is what comes next: an LLM-based text cleaning step.
The plan is to add a DSPy module that operates on a sliding window of raw chunks before accumulation. It will:
- Remove footnotes that slipped into paragraph text
- Fix OCR errors and encoding artifacts that rule-based cleaning can't handle contextually
- Join words and paragraphs that were incorrectly split across page breaks
The window is bounded by the same logic already used for accumulation — section headers are hard boundaries, so the LLM never sees text across a section break. This keeps the LLM's job well-scoped and the context meaningful.
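A rough sketch of section-bounded windowing is shown below. The function names, the window size, and the choice to leave header chunks out of the windows are assumptions; the LLM call itself is omitted because that module is still only planned.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RawChunk:
    text: str
    is_header: bool = False  # hypothetical label set by the parser


def split_sections(chunks: List[RawChunk]) -> List[List[RawChunk]]:
    # Group body chunks by section; each header starts a new group.
    sections: List[List[RawChunk]] = []
    current: List[RawChunk] = []
    for chunk in chunks:
        if chunk.is_header:
            if current:
                sections.append(current)
            current = []
        else:
            current.append(chunk)
    if current:
        sections.append(current)
    return sections


def sliding_windows(chunks: List[RawChunk], size: int = 3) -> Iterator[List[RawChunk]]:
    # Windows slide within one section and never span a header.
    for section in split_sections(chunks):
        if len(section) <= size:
            yield section
            continue
        for start in range(len(section) - size + 1):
            yield section[start:start + size]
```

Each yielded window would then be handed to the cleaner, so a short section produces a single small window rather than borrowing text from its neighbor.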
Having both parsers produce output through the same TextProcessor pipeline means the LLM cleaner will work identically regardless of whether the source was a PDF or an EPUB. And for books available in both formats, the cleaner EPUB output can serve as reference data for training the LLM to clean up the noisier PDF version.