Book2Audio: Reviving My PDF-to-Audiobook Project (and Fighting Dependency Hell Along the Way)

A while back, I wrote about using Kokoro to convert a PDF into an audiobook. The core idea was straightforward: take a PDF, strip away the noise—headers, footers, page numbers, figure captions—and feed clean text into a text-to-speech model so the result actually sounds like someone reading a book, not someone reading a formatted document aloud.

To get there, I leaned on IBM's Docling for PDF-to-Markdown conversion, which does an impressive job of understanding page layout, reading order, and document structure. I also used NLTK to clean up hyphenated words that PDF extraction loves to leave behind—the kind where "understand-\ning" gets split across a line break and ends up in your audio as two separate words.
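The de-hyphenation idea can be sketched in a few lines. This is an illustrative version, not the project's actual code: it takes a caller-supplied set of known words (in the real pipeline, NLTK's word lists could fill that role) and only joins a line-break hyphen when the merged form is a real word, so genuine compounds like "state-of-the-art" survive.

```python
import re

# Matches a word, a hyphen, a line break, then the continuation of the word.
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")

def fix_hyphenation(text: str, known_words: set[str]) -> str:
    """Rejoin words split across line breaks, e.g. 'understand-\\ning' ->
    'understanding'. Joins only when the merged form is in known_words,
    so legitimate hyphenated compounds are left alone."""
    def join(match: re.Match) -> str:
        merged = match.group(1) + match.group(2)
        if merged.lower() in known_words:
            return merged
        # Not a known word: keep the hyphen, just remove the line break.
        return match.group(1) + "-" + match.group(2)
    return HYPHEN_BREAK.sub(join, text)
```

For example, `fix_hyphenation("understand-\ning", {"understanding"})` rejoins the word, while a split like `"state-\nof"` keeps its hyphen.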

Picking It Back Up

Unfortunately, I never got back to that project. The original code was embedded inside my "Book Search Archive" app, which had grown large and unwieldy enough that even small changes felt like a chore. So I decided to start fresh with a dedicated repository focused entirely on the PDF-to-audio pipeline: Book2Audio on GitHub.

Clean slate, single purpose, no distractions. Simple, right?

The Docling Regression

It was not simple.

The first thing I discovered was that the latest version of IBM Docling simply could not read my test PDF past page 17. Every time it hit that page, the conversion would die with an error like this:

Stage preprocess failed for run 1, pages [17]: std::bad_alloc

That's a memory allocation failure deep inside the C++ PDF parsing backend. It appears to be a known but unfixed error. After a fair amount of troubleshooting—trying different configurations, different PDFs, different environments—I came to accept that the newer version of Docling just didn't handle my test document as well as the original version I'd used months ago. Software updates aren't always upgrades, especially when the underlying parser has been rearchitected.

Python Dependency Hell

Rather than keep fighting the latest release, I decided to roll back to the older version that had worked before. This turned out to be its own adventure. Pinning docling==2.14.0 is one thing; getting it to coexist peacefully with Kokoro, NLTK, and all of their transitive dependencies is another.

The pip resolver tried its best, but ultimately I had to hand-tune the versions myself. After a lot of trial and error, here's the incantation that finally worked:

pip install docling==2.14.0 kokoro==0.7.15 misaki==0.7.15 typer==0.12.5 numpy==1.26.4 opencv-python-headless==4.10.0.84
pip install nltk==3.9.1
pip install soundfile==0.13.1

There's a requirements.txt in the repo as well, but I'll be honest—your mileage may vary. This is one of the perennial frustrations of the Python ecosystem. The pip installer does a reasonable job checking dependencies in isolation, but it often can't resolve conflicts across packages that each have their own strong opinions about which version of NumPy or OpenCV they need. And some of these packages really want Python 3.12 or lower, which adds yet another variable.

If you've spent any time in Python-land, none of this will surprise you. But it's worth documenting, if only so future-me doesn't have to rediscover it.

But It Works!

If you follow the instructions above, it will work. I'm actively using even this primitive version to convert PDFs into audiobooks, and it's surprising how well it does given how unsophisticated it is.

Clone the GitHub repo and run it like this:

python book_to_audio.py "BookTitle.pdf"

The README explains usage in full detail. Even in this early state, it's a decently working app, and the codebase has full unit tests.

What's Next

The current state of the repo is a working starting point: it can take a PDF, parse it with Docling, clean the text, and generate audio with Kokoro. But I have bigger plans.

Upgrading the voice with Qwen3-TTS. Kokoro is capable, but I want to try Alibaba's recently open-sourced Qwen3-TTS. What makes it especially interesting for audiobook generation is its contextual understanding—it can adapt tone, pacing, and emphasis based on the meaning of the text, not just its phonemes. It supports 10 languages, voice cloning from just 3 seconds of audio, and even voice design from natural language descriptions ("a warm, measured narrator voice"). For long-form audio like audiobooks, that kind of semantic awareness could make a real difference in listenability.

Smarter text cleanup with local LLMs. Right now, NLTK handles the text smoothing—fixing hyphenation, removing artifacts. But I want to experiment with running a local model through Ollama to do more intelligent cleanup: identifying and removing headers, footers, page numbers, figure references, and other extraneous text that shouldn't appear in an audiobook. An LLM can understand context in a way that regex and rule-based approaches can't. Maybe the ideal approach is a combination of both—NLTK for the mechanical stuff, and a small fine-tuned model for the judgment calls. This might even be a good candidate for fine-tuning a smaller model specifically for the task of stripping non-narrative content from documents.
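To make the hybrid idea concrete, here's a rough sketch of how it might look: a mechanical pass that strips bare page-number lines, plus a call to a local Ollama server via its `/api/generate` endpoint for the judgment calls. The model name, prompt wording, and host are all assumptions for illustration, and the LLM call obviously requires Ollama running locally.

```python
import json
import re
import urllib.request

# Lines that are nothing but a page number.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")

# Prompt wording is a placeholder, not tuned.
CLEANUP_PROMPT = (
    "Remove headers, footers, page numbers, and figure references from the "
    "following book text. Return only the cleaned narrative text.\n\n"
)

def mechanical_cleanup(text: str) -> str:
    """Rule-based pass: drop lines that are only a page number."""
    kept = [ln for ln in text.splitlines() if not PAGE_NUMBER.match(ln)]
    return "\n".join(kept)

def build_cleanup_prompt(chunk: str) -> str:
    """Compose the instruction plus the text chunk for the LLM pass."""
    return CLEANUP_PROMPT + chunk

def llm_cleanup(chunk: str, model: str = "llama3.2",
                host: str = "http://localhost:11434") -> str:
    """Contextual pass via a local Ollama server (must be running).
    Uses Ollama's /api/generate endpoint with stream=False."""
    payload = {"model": model, "prompt": build_cleanup_prompt(chunk),
               "stream": False}
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The mechanical pass is cheap enough to run on everything; the LLM pass would only be worth invoking on chunks the rules can't confidently classify.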

The Book2Audio repo is public if you want to follow along or contribute. It's early days, but the foundation is in place.

If you need help with your Artificial Intelligence solutions, we're here to help.
