Refactoring the Book2Audio Parsers

This is a progress update on Book2Audio — a tool that converts PDF and EPUB books into audio files using text-to-speech.

What We Did

Book2Audio has two parsers: DoclingParser for PDFs and EpubParser for EPUBs. Both do similar things — extract text from a document, clean it up, and chunk it into paragraphs — but they were implemented quite differently under the hood.

This update aligns the two parsers in three areas: paragraph accumulation, text cleaning, and overall parser design.

Shared Paragraph Accumulation

We extracted the shared paragraph accumulation logic into a new TextProcessor class. Both parsers now produce RawChunk objects and hand them off to TextProcessor, which handles the decisions about when to accumulate and when to emit a paragraph. Chunking behavior is now consistent across PDF and EPUB sources.
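To make the hand-off concrete, here is a minimal sketch of what such an accumulator might look like. The RawChunk fields (text, is_header), the process method, and the simple "ends with punctuation" rule are assumptions for illustration only, not the actual Book2Audio code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RawChunk:
    """Hypothetical shape of a parser's output: a piece of text plus a label."""
    text: str
    is_header: bool = False


class TextProcessor:
    """Illustrative accumulator: joins chunks until a sentence ends."""

    def process(self, chunks: Iterable[RawChunk]) -> List[str]:
        paragraphs: List[str] = []
        pending: List[str] = []

        def flush() -> None:
            # Emit whatever has been accumulated as one paragraph.
            if pending:
                paragraphs.append(" ".join(pending))
                pending.clear()

        for chunk in chunks:
            text = chunk.text.strip()
            if not text:
                continue
            if chunk.is_header:
                # Headers are hard boundaries: emit pending text, then the header.
                flush()
                paragraphs.append(text)
                continue
            pending.append(text)
            # Assumed rule: emit once the latest chunk ends a sentence.
            if text.endswith((".", "!", "?")):
                flush()
        flush()
        return paragraphs
```

Because the parsers only label and hand over chunks, a sentence split across two PDF pages and a sentence split across two EPUB elements would go through exactly the same code path in a design like this.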

Unified Text Cleaning

Previously, PDF cleaning happened in DoclingParser and EPUB cleaning happened in EpubParser, with two separate cleaning functions that overlapped significantly. We merged these into a single clean_text function in general_utils.py and moved all cleaning into TextProcessor itself. Parsers now pass completely raw text — they extract and label, nothing more.

Cleaning now happens in a single upfront pass over all chunks before the accumulation loop runs. This gives the cleaner full context and sets up nicely for the LLM cleaning step described below.
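The upfront pass can be pictured roughly as follows. The body of clean_text here is a stand-in (the real rules are not listed in this post), and clean_all is a hypothetical helper name used only for this sketch.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Stand-in for the shared cleaner; the actual rules are assumptions."""
    text = text.replace("\u00ad", "")   # drop soft hyphens
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()


def clean_all(raw_texts: List[str]) -> List[str]:
    """Clean every chunk up front, before any accumulation decisions are made."""
    return [clean_text(t) for t in raw_texts]
```

Running the cleaner over the whole list first means any later step, rule-based or LLM-based, sees the full set of cleaned chunks rather than one chunk at a time.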

We also took the opportunity to fix a long-standing bug in is_sentence_end — it wasn't recognizing curly quote characters as sentence-ending, which caused incorrect paragraph accumulation in some cases.
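As a rough illustration of the kind of fix involved, the sketch below treats trailing closing quotes, straight or curly, as transparent when checking for sentence-ending punctuation. The exact character sets and the logic are assumptions; the real is_sentence_end may differ.

```python
# Closing quotes and brackets that may trail the final punctuation mark.
CLOSERS = "\"'\u201d\u2019)]"
# Characters assumed to terminate a sentence (including the ellipsis).
TERMINATORS = (".", "!", "?", "\u2026")


def is_sentence_end(text: str) -> bool:
    """Return True if text ends a sentence, ignoring trailing closing quotes."""
    stripped = text.rstrip().rstrip(CLOSERS)
    return stripped.endswith(TERMINATORS)
```

With a check like this, a chunk ending in a period followed by a closing curly quote counts as a complete sentence, so the accumulator emits the paragraph instead of gluing the next chunk onto it.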

Consistent Parser Design

We also made EpubParser consistent with DoclingParser in several other ways:

  • Both now accept either a file path or a pre-loaded document object, making unit testing much simpler and eliminating the need for patch() in tests
  • Both are configured entirely at construction time — no configuration passed into run()
  • Both live in a parsers/ package and inherit from a shared BaseParser abstract base class
  • The CSV-based section skipping was removed from EpubParser — the caller loads the CSV and passes the result in, consistent with how DoclingParser handles page ranges
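The design points above could be sketched along these lines. The constructor signature, the skip parameter, the _load hook, and the ToyEpubParser subclass are all hypothetical; only the names BaseParser and run() come from the actual project.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Iterable, List, Optional, Union


class BaseParser(ABC):
    """Illustrative base class: all configuration arrives at construction time."""

    def __init__(self, source: Union[str, Path, Any],
                 skip: Optional[Iterable[str]] = None) -> None:
        # A str or Path is loaded from disk; anything else is treated
        # as an already-loaded document object.
        if isinstance(source, (str, Path)):
            self.document = self._load(Path(source))
        else:
            self.document = source
        # The caller decides what to skip and passes the result in.
        self.skip = frozenset(skip or ())

    @abstractmethod
    def _load(self, path: Path) -> Any:
        """Load a document from disk."""

    @abstractmethod
    def run(self) -> List[str]:
        """Extract raw text; takes no configuration arguments."""


class ToyEpubParser(BaseParser):
    """Toy subclass: the 'document' is just a dict of section name to text."""

    def _load(self, path: Path) -> Any:
        raise NotImplementedError("file loading is out of scope for this sketch")

    def run(self) -> List[str]:
        return [text for name, text in self.document.items()
                if name not in self.skip]
```

A test can then build a parser from an in-memory object, for example ToyEpubParser({"cover": "Cover page", "ch1": "Call me Ishmael."}, skip={"cover"}).run(), with no file on disk and no patch() call.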

Why It Matters

The immediate benefit is cleaner, more testable code and consistent behavior between the two parsers. But the real motivation is what comes next: an LLM-based text cleaning step.

The plan is to add a DSPy module that operates on a sliding window of raw chunks before accumulation. It will:

  • Remove footnotes that slipped into paragraph text
  • Fix OCR errors and encoding artifacts that rule-based cleaning can't handle contextually
  • Join words and paragraphs that were incorrectly split across page breaks

The window is bounded by the same logic already used for accumulation — section headers are hard boundaries, so the LLM never sees text across a section break. This keeps the LLM's job well-scoped and the context meaningful.
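One way to picture the windowing is the sketch below. It is an assumption about how the bounding could work, not the planned implementation: chunks are modelled as (text, is_header) tuples, the function names and the window size are made up for illustration, and the LLM call itself is left out.

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = Tuple[str, bool]  # (text, is_header), a stand-in for RawChunk


def split_sections(chunks: Iterable[Chunk]) -> Iterator[List[Chunk]]:
    """Group chunks into sections; each header starts a new section."""
    section: List[Chunk] = []
    for chunk in chunks:
        if chunk[1] and section:
            yield section
            section = []
        section.append(chunk)
    if section:
        yield section


def windows(chunks: Iterable[Chunk], size: int = 3) -> Iterator[List[str]]:
    """Yield sliding windows of body text that never cross a section header."""
    for section in split_sections(chunks):
        body = [text for text, is_header in section if not is_header]
        if not body:
            continue
        # Short sections produce a single, smaller window.
        for start in range(max(1, len(body) - size + 1)):
            yield body[start:start + size]
```

Each yielded window would be what the cleaner sees in one call, so text from one chapter can never be merged with text from the next.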

Having both parsers produce output through the same TextProcessor pipeline means the LLM cleaner will work identically regardless of whether the source was a PDF or an EPUB. And for books available in both formats, the cleaner EPUB output can serve as reference data for training the LLM to clean up the noisier PDF version.
