Adding EPUB Support to Book2Audio

Book2Audio started as a PDF-to-audiobook converter, but a lot of the best books come as EPUBs. This post covers how we added EPUB support by migrating code from our earlier BookSearchArchive project — a RAG-based book search tool discussed in a previous post.

The code referenced in this post can be found at this specific commit in the Book2Audio repository.

Where the Code Came From

BookSearchArchive included an EPubLoader Haystack component that loaded EPUB files and returned raw HTML sections, and an HTMLParser class that parsed those HTML sections into text chunks suitable for embedding and semantic search. Neither was designed for audio — they were optimised for RAG pipelines, returned Haystack ByteStream objects, and included features like double_notes that made sense for search but not for listening.

The goal was to strip out the Haystack dependency, clean up the interface, and make EPUB parsing a drop-in replacement for PDF parsing in Book2Audio.

What Changed

New File: epub_parser.py

This is the heart of the EPUB support. The EpubParser class takes a path to an .epub file and produces the same output as DoclingParser — a list of cleaned paragraph strings and a list of metadata dicts.

Single file, consistent interface. Rather than a loader/parser split like BookSearchArchive, EpubParser handles everything in one class:

parser = EpubParser("my_book.epub", meta_data={}, min_paragraph_size=300)
docs, meta = parser.run()

HTML parsing via BeautifulSoup. An EPUB is essentially a ZIP archive of HTML (XHTML) files plus some metadata. We use ebooklib to read the EPUB and extract each section's HTML, then BeautifulSoup to traverse the tag tree. The recursive_yield_tags function walks the HTML and yields the leaf tags that contain text, descending through structural elements such as divs rather than yielding them.
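
The traversal idea can be sketched roughly as follows. This is not the project's recursive_yield_tags: it swaps BeautifulSoup for the standard library's ElementTree so that it runs without dependencies, and the function name and the set of structural tags are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: tags treated as pure containers. The real parser may use a different set.
STRUCTURAL_TAGS = {"html", "body", "div", "section"}


def yield_text_tags(element):
    """Yield elements that carry text, descending through structural containers."""
    if element.tag in STRUCTURAL_TAGS:
        # Containers are never yielded themselves; only their children are inspected.
        for child in element:
            yield from yield_text_tags(child)
    elif "".join(element.itertext()).strip():
        # Non-structural element with visible text: treat it as one paragraph.
        yield element


# A namespace-free XHTML fragment keeps the tag names simple for this sketch.
xhtml = (
    "<body><div><h1>Chapter 1</h1>"
    "<div><p>First <em>paragraph</em>.</p><p> </p></div></div></body>"
)
root = ET.fromstring(xhtml)
tags = [(el.tag, "".join(el.itertext()).strip()) for el in yield_text_tags(root)]
```

Here the empty paragraph is dropped and the nested divs are never emitted. Text placed directly inside a div (outside any child tag) would be lost by this simplified version.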

Chapter and section titles are emitted as paragraphs. In the RAG version, titles were stored only in metadata. For audio they need to be read aloud, so chapter titles and section headers are emitted as their own standalone paragraphs before the section content.
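
As a minimal sketch of that behaviour, assuming a chapter arrives as a title plus a list of (section title, paragraphs) pairs: the function name and the metadata keys below are hypothetical and are not taken from epub_parser.py.

```python
def emit_with_titles(chapter_title, sections):
    """Flatten one chapter into parallel lists of paragraphs and metadata.

    `sections` is a list of (section_title, paragraphs) pairs. A section
    title of None means the section has no header of its own.
    """
    docs, meta = [], []

    def emit(text, section_title):
        docs.append(text)
        # Hypothetical metadata keys, chosen for this sketch only.
        meta.append({"chapter_title": chapter_title, "section_title": section_title})

    emit(chapter_title, None)  # the chapter title is read aloud first
    for section_title, paragraphs in sections:
        if section_title:
            emit(section_title, section_title)  # header becomes its own paragraph
        for paragraph in paragraphs:
            emit(paragraph, section_title)
    return docs, meta


docs, meta = emit_with_titles("Chapter 1", [("Origins", ["Para A.", "Para B."])])
```

The two lists stay the same length, so each spoken paragraph, including a title, can still be traced back to its chapter and section.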

Footnote removal. When the remove_footnotes parameter is enabled, superscript tags are stripped from paragraphs. The exception is a superscript that appears as the first content of a paragraph: that usually marks the start of a footnote paragraph rather than an inline citation, so it is left in place.
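
The rule can be illustrated with a small sketch. It works on a paragraph's HTML string with a regular expression instead of BeautifulSoup, purely to stay dependency-free, and the function name is an assumption rather than anything in the repository.

```python
import re

# Matches a whole <sup>...</sup> element, attributes included.
SUP_RE = re.compile(r"<sup\b[^>]*>.*?</sup>", re.IGNORECASE | re.DOTALL)


def strip_inline_superscripts(paragraph_html):
    """Remove <sup> elements, but keep one that opens the paragraph."""

    def replace(match):
        # Only whitespace before the match means the superscript is the first content.
        if paragraph_html[: match.start()].strip() == "":
            return match.group(0)
        return ""

    return SUP_RE.sub(replace, paragraph_html)


inline = strip_inline_superscripts("The claim.<sup>3</sup> More text.")
leading = strip_inline_superscripts("<sup>3</sup> Smith, 1999.")
```

In the first call the inline marker is removed; in the second the leading marker is preserved because it opens a footnote paragraph.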

Sections to skip. Two mechanisms are supported: a sections_to_skip.csv file in the same directory as the EPUB, and a sections_to_skip parameter passed directly to run(). The two are additive: a section is skipped if it is named by either source.
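
A rough sketch of how the two sources might be merged is shown below. The CSV layout (one section ID per row, first column, no header) and both function names are assumptions; the project's own load_sections_to_skip may read the file differently.

```python
import csv
from pathlib import Path


def load_skip_ids(csv_path):
    """Read section IDs from the first column of a CSV file, if the file exists."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        # Blank rows and empty first cells are ignored.
        return {row[0].strip() for row in csv.reader(handle) if row and row[0].strip()}


def combined_skip_ids(epub_path, extra_ids=None):
    """Union of IDs from sections_to_skip.csv next to the EPUB and from the caller."""
    csv_path = Path(epub_path).parent / "sections_to_skip.csv"
    return load_skip_ids(csv_path) | set(extra_ids or [])
```

Using a set union makes the additive behaviour explicit: duplicates across the two sources collapse, and a missing CSV file simply contributes nothing.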

Debug output. Calling run(generate_text_file=True) writes two files alongside the EPUB:

  • <n>_processed_paragraphs.txt — the cleaned paragraph text
  • <n>_processed_meta.txt — metadata alongside each paragraph, useful for verifying chapter and section attribution

Updated: general_utils.py

The refactor revealed that a lot of text cleaning logic was duplicated or misplaced. We moved reusable utilities from docling_utils.py into general_utils.py:

  • is_sentence_end and is_ends_with_punctuation — pure string functions with no DocItem dependency
  • is_roman_numeral and enhance_title — migrated from parse_utils.py in BookSearchArchive
  • load_sections_to_skip — CSV loading logic, shared between EPUB and potentially other formats
  • The full clean_text pipeline — whitespace, hyphens, quotes, punctuation spacing, bracket spacing, apostrophes
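
To give a feel for the kind of logic that moved, here is a simplified sketch of a cleaning pipeline and a Roman numeral check. These are illustrative stand-ins written for this post, not the code in general_utils.py; the names carry a _sketch suffix to make that clear, and the real functions cover more cases.

```python
import re

# Standard pattern for Roman numerals from 1 to 3999.
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def is_roman_numeral_sketch(text):
    """Return True for a non-empty, well-formed Roman numeral (case-insensitive)."""
    candidate = text.strip().upper()
    return bool(candidate) and ROMAN_RE.fullmatch(candidate) is not None


def clean_text_sketch(text):
    """Apply a few of the cleaning steps: quotes, hyphens, whitespace, spacing."""
    # Normalise curly quotes and apostrophes to straight ones.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of whitespace, including newlines, to single spaces.
    text = re.sub(r"\s+", " ", text)
    # Remove spaces before punctuation and just inside brackets.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    return text.strip()


cleaned = clean_text_sketch("The  exam-\nple ( here ) is \u201cfine\u201d .")
```

Because both functions take and return plain strings, they have no dependency on DocItem and can be shared by the PDF and EPUB paths.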

docling_utils.py now focuses on what it should: DocItem inspection helpers and clean_pdf_text, which extends clean_text with PDF-specific steps for ligature normalisation, encoding artifact correction, and footnote number stripping.

Updated: book_converter.py

convert_to_audio now dispatches by file extension:

if suffix == '.pdf':
    ...  # parse with DoclingParser
elif suffix == '.epub':
    ...  # parse with EpubParser
elif suffix == '.txt':
    ...  # read the text and convert it directly

The sections_to_skip parameter threads all the way from the command line through main() and convert_to_audio to EpubParser.run().
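
A self-contained sketch of extension-based dispatch might look like this. The function name, the lookup table, and the error raised for unknown extensions are assumptions; the post does not say how convert_to_audio handles unsupported files.

```python
from pathlib import Path

# Maps a lowercase file extension to a label for the handler that would be used.
HANDLERS = {
    ".pdf": "DoclingParser",
    ".epub": "EpubParser",
    ".txt": "plain text",
}


def choose_handler(file_path):
    """Pick a handler label from the file extension, ignoring case."""
    suffix = Path(file_path).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        # Assumed behaviour: fail loudly rather than guess at a format.
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


handler = choose_handler("documents/My_Book.EPUB")
```

Lowercasing the suffix means that files named with .EPUB or .Pdf are routed the same way as their lowercase equivalents.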

Using It

From the Command Line

Convert an EPUB to audio:

python book_to_audio.py "documents/my_book.epub"

Dry run with debug output to inspect what the parser extracted:

python book_to_audio.py "documents/my_book.epub" --dry-run --generate-text-file

Skip front matter and navigation sections:

python book_to_audio.py "documents/my_book.epub" --sections-to-skip cover titlepage toc

Command Line Parameters

  • file_path — path to the EPUB file
  • --dry-run — parse the document but skip audio generation
  • --generate-text-file — save processed paragraph and metadata files alongside the source EPUB
  • --sections-to-skip — one or more section IDs to skip, separated by spaces
  • --engine — TTS engine to use: kokoro (default) or qwen
  • --voice — Kokoro voice identifier (default: af_heart)
  • --speaker — Qwen speaker name (default: vivian)
  • --output-file — path to the output WAV file
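
For readers who want to wire up a similar interface, the flags above map naturally onto argparse. The sketch below is an assumption about how such a parser could be declared, not a copy of book_to_audio.py; the help strings and the None default for --output-file are illustrative.

```python
import argparse


def build_arg_parser():
    """Declare the command-line options listed above."""
    parser = argparse.ArgumentParser(description="Convert a book to audio.")
    parser.add_argument("file_path", help="path to the source document")
    parser.add_argument("--dry-run", action="store_true",
                        help="parse the document but skip audio generation")
    parser.add_argument("--generate-text-file", action="store_true",
                        help="save processed paragraph and metadata files")
    # nargs="*" collects space-separated section IDs into a list.
    parser.add_argument("--sections-to-skip", nargs="*", default=[],
                        help="section IDs to skip")
    parser.add_argument("--engine", choices=["kokoro", "qwen"], default="kokoro")
    parser.add_argument("--voice", default="af_heart")
    parser.add_argument("--speaker", default="vivian")
    parser.add_argument("--output-file", default=None)
    return parser


args = build_arg_parser().parse_args(
    ["documents/my_book.epub", "--dry-run", "--sections-to-skip", "cover", "titlepage", "toc"]
)
```

argparse converts dashes in option names to underscores, so the parsed values are available as args.dry_run, args.sections_to_skip, and so on.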

From Python

from epub_parser import EpubParser

parser = EpubParser(
    source="documents/my_book.epub",
    meta_data={"source": "my_book.epub"},
    min_paragraph_size=300,
    remove_footnotes=True
)
docs, meta = parser.run(
    generate_text_file=True,
    sections_to_skip=["cover", "titlepage", "toc"]
)

The --generate-text-file flag is especially useful for EPUBs where section IDs vary between publishers and you need to identify which ones to skip. Run with --dry-run --generate-text-file first, inspect the output, then add unwanted section IDs to --sections-to-skip.
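
If you just want a quick look at the IDs a publisher used, the EPUB's manifest can also be read directly with the standard library. The sketch below assumes that the section IDs Book2Audio matches against correspond to the manifest item IDs, which the post does not confirm; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def list_manifest_documents(epub_path):
    """Return (id, href) pairs for the XHTML documents listed in the EPUB manifest."""
    with zipfile.ZipFile(epub_path) as archive:
        # container.xml points at the package (OPF) file.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        package = ET.fromstring(archive.read(rootfile.get("full-path")))
    items = package.findall("opf:manifest/opf:item", OPF_NS)
    # Keep content documents only; stylesheets, images, and fonts are ignored.
    return [
        (item.get("id"), item.get("href"))
        for item in items
        if item.get("media-type") == "application/xhtml+xml"
    ]
```

Calling list_manifest_documents("documents/my_book.epub") returns the document IDs in manifest order, which can then be compared against the metadata file produced by --generate-text-file.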

What We Left Behind

A few things from BookSearchArchive did not make the cut for this version:

  • double_notes — in the RAG version, sections titled "Notes" got double the minimum paragraph size to avoid dominating search results. This does not apply to audio.
  • min_section_size — the RAG pipeline skipped sections below a minimum total length. For audio we want everything.
  • The Haystack component wrapper: EPubLoader and HTMLParserComponent were Haystack-specific. All of that is gone.
  • Multiple file paths: EPubLoader accepted a list of files. EpubParser takes one file, consistent with DoclingParser.

Files Added or Modified

  • epub_parser.py — new
  • utils/general_utils.py — significantly expanded
  • utils/docling_utils.py — slimmed down, PDF-specific logic only
  • book_converter.py — EPUB dispatch in convert_to_audio
  • book_to_audio.py: sections_to_skip parameter in main()
  • tests/test_epub_parser.py — new unit tests
  • tests/test_general_utils.py — new unit tests for moved utilities
  • tests/test_document_output.py — extended to cover EPUB integration tests
