Adding EPUB Support to Book2Audio
- By Bruce Nielson
- ML & AI Specialist
Book2Audio started as a PDF-to-audiobook converter, but a lot of the best books come as EPUBs. This post covers how we added EPUB support by migrating code from our earlier BookSearchArchive project — a RAG-based book search tool discussed in a previous post.
The code referenced in this post can be found at this specific commit in the Book2Audio repository.
Where the Code Came From
BookSearchArchive included an EPubLoader Haystack component that loaded EPUB files and returned raw HTML sections, and an HTMLParser class that parsed those HTML sections into text chunks suitable for embedding and semantic search. Neither was designed for audio — they were optimised for RAG pipelines, returned Haystack ByteStream objects, and included features like double_notes that made sense for search but not for listening.
The goal was to strip out the Haystack dependency, clean up the interface, and make EPUB parsing a drop-in replacement for PDF parsing in Book2Audio.
What Changed
New File: epub_parser.py
This is the heart of the EPUB support. The EpubParser class takes a path to an .epub file and produces the same output as DoclingParser — a list of cleaned paragraph strings and a list of metadata dicts.
Single file, consistent interface. Rather than a loader/parser split like BookSearchArchive, EpubParser handles everything in one class:
parser = EpubParser("my_book.epub", meta_data={}, min_paragraph_size=300)
docs, meta = parser.run()
HTML parsing via BeautifulSoup. An EPUB is essentially a ZIP archive of HTML (XHTML) files plus packaging metadata. We use ebooklib to read the EPUB and extract each section's HTML, then BeautifulSoup to traverse the tag tree. The recursive_yield_tags function walks the HTML recursively and yields the leaf tags that contain text, skipping purely structural elements such as divs.
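The recursive walk can be illustrated with a minimal sketch. This is not the project's recursive_yield_tags: to stay dependency-free it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, and the names iter_text_tags and STRUCTURAL_TAGS are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these container tags are walked into but never yielded themselves.
STRUCTURAL_TAGS = {"html", "body", "div", "section"}


def iter_text_tags(element):
    """Yield (tag, text) for the outermost non-structural elements that hold text."""
    if element.tag in STRUCTURAL_TAGS:
        # Structural container: recurse into its children instead of yielding it.
        for child in element:
            yield from iter_text_tags(child)
        return
    # Gather all nested text (including inline tags such as <i>) and tidy whitespace.
    text = " ".join("".join(element.itertext()).split())
    if text:
        yield element.tag, text


xhtml = """<html><body>
  <div>
    <h1>Chapter 1</h1>
    <div><p>First <i>paragraph</i>.</p><p>   </p></div>
    <p>Second paragraph.</p>
  </div>
</body></html>"""

tags = list(iter_text_tags(ET.fromstring(xhtml)))
for tag, text in tags:
    print(tag, "|", text)
```

Nested divs are flattened away and the whitespace-only paragraph is dropped, so only the heading and the two real paragraphs come out.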
Chapter and section titles are emitted as paragraphs. In the RAG version, titles were stored only in metadata. For audio they need to be read aloud, so chapter titles and section headers are emitted as their own standalone paragraphs before the section content.
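As a rough sketch of the idea, titles can be pushed onto the paragraph list ahead of the body text, with a metadata dict created for each emitted paragraph. The function name emit_section and the "section_title" metadata key are assumptions made for illustration, not the parser's actual internals.

```python
def emit_section(section_title, body_paragraphs, base_meta, chapter_title=None):
    """Return (docs, meta) with any titles emitted as standalone paragraphs first."""
    # Pass chapter_title only for the first section of a chapter so it is read once.
    titles = [title for title in (chapter_title, section_title) if title]
    docs, meta = [], []
    for text in [*titles, *body_paragraphs]:
        docs.append(text)
        # Hypothetical metadata layout: one dict per paragraph, parallel to docs.
        meta.append({**base_meta, "section_title": section_title})
    return docs, meta


docs, meta = emit_section(
    "The Problem",
    ["Body text of the section."],
    {"source": "my_book.epub"},
    chapter_title="Chapter 1",
)
```

Because the titles are ordinary entries in docs, the text-to-speech stage needs no special handling to read them aloud.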
Footnote removal. When remove_footnotes is enabled, superscript tags are stripped from each paragraph unless the superscript is the first content in the paragraph. A leading superscript usually marks the start of a footnote paragraph rather than an inline citation, so it is kept.
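The rule can be sketched as follows. This is an illustration under simplifying assumptions, not the EpubParser code: it uses xml.etree.ElementTree instead of BeautifulSoup, only looks at sup tags that are direct children of the paragraph, and the function name paragraph_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def paragraph_text(p, remove_footnotes=True):
    """Return paragraph text, dropping inline <sup> markers but keeping a leading one."""
    parts = [p.text or ""]
    for index, child in enumerate(p):
        # A <sup> is "leading" when it is the first child and no text precedes it.
        leading = index == 0 and not (p.text or "").strip()
        drop = remove_footnotes and child.tag == "sup" and not leading
        if not drop:
            parts.append("".join(child.itertext()))
        # Text following the child belongs to the paragraph either way.
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


inline = ET.fromstring("<p>Knowledge grows by conjecture.<sup>12</sup> It also needs criticism.</p>")
footnote = ET.fromstring("<p><sup>12</sup> See chapter 3.</p>")

print(paragraph_text(inline))
print(paragraph_text(footnote))
```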
Sections to skip. Two mechanisms are supported: a sections_to_skip.csv file in the same directory as the EPUB, and a sections_to_skip parameter passed directly to run(). The two are additive: a section listed in either source is skipped.
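A minimal sketch of the additive behaviour is shown below. The CSV layout is an assumption (one section ID in the first column of each row), and read_skip_csv and combined_sections_to_skip are hypothetical names rather than the project's load_sections_to_skip.

```python
import csv
from pathlib import Path


def read_skip_csv(epub_path):
    """Read section IDs from sections_to_skip.csv next to the EPUB, if it exists."""
    csv_path = Path(epub_path).parent / "sections_to_skip.csv"
    if not csv_path.exists():
        return set()
    with csv_path.open(newline="", encoding="utf-8") as handle:
        # Assumed layout: the section ID is the first column of each row.
        return {row[0].strip() for row in csv.reader(handle) if row and row[0].strip()}


def combined_sections_to_skip(epub_path, sections_to_skip=None):
    """Union of the CSV entries and the IDs passed in directly."""
    return read_skip_csv(epub_path) | set(sections_to_skip or [])
```

Using a set union means an ID listed in both places is skipped once, and a missing CSV file simply contributes nothing.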
Debug output. Calling run(generate_text_file=True) writes two files alongside the EPUB:
- <n>_processed_paragraphs.txt: the cleaned paragraph text
- <n>_processed_meta.txt: metadata alongside each paragraph, useful for verifying chapter and section attribution
Updated: general_utils.py
The refactor revealed that a lot of text cleaning logic was duplicated or misplaced. We moved reusable utilities from docling_utils.py into general_utils.py:
- is_sentence_end and is_ends_with_punctuation: pure string functions with no DocItem dependency
- is_roman_numeral and enhance_title: migrated from parse_utils.py in BookSearchArchive
- load_sections_to_skip: CSV loading logic, shared between EPUB and potentially other formats
- The full clean_text pipeline: whitespace, hyphens, quotes, punctuation spacing, bracket spacing, apostrophes
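To give a feel for what such a cleaning pipeline involves, here is a small sketch covering a few of the listed steps. It is not the project's clean_text: the function name clean_text_sketch, the exact rules, and the reading of "hyphens" as rejoining words split across line breaks are all assumptions.

```python
import re

# Assumption: curly quotes and apostrophes are mapped to their ASCII forms.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def clean_text_sketch(text):
    """Apply a few illustrative cleaning steps in a fixed order."""
    text = text.translate(QUOTE_MAP)
    # Rejoin words hyphenated across a line break: "exam-\nple" becomes "example".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove stray spaces before punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Remove spaces just inside brackets.
    text = re.sub(r"([(\[])\s+", r"\1", text)
    text = re.sub(r"\s+([)\]])", r"\1", text)
    return text


cleaned = clean_text_sketch("The  exam-\nple ( see note ) was \u201cfine\u201d .")
print(cleaned)
```

Order matters in a pipeline like this: dehyphenation has to run before whitespace is collapsed, because it relies on the line break still being present.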
docling_utils.py now focuses on what it should: DocItem inspection helpers and clean_pdf_text, which extends clean_text with PDF-specific steps for ligature normalisation, encoding artifact correction, and footnote number stripping.
Updated: book_converter.py
convert_to_audio now dispatches by file extension:
if suffix == '.pdf':
    # DoclingParser
elif suffix == '.epub':
    # EpubParser
elif suffix == '.txt':
    # read and convert directly
The sections_to_skip parameter threads all the way from the command line through main() and convert_to_audio to EpubParser.run().
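The dispatch can be sketched in runnable form as below. The helper pick_handler is hypothetical and only returns a label for the branch taken; lowercasing the suffix and raising ValueError for unknown extensions are assumptions, since the post does not say how convert_to_audio treats those cases.

```python
from pathlib import Path


def pick_handler(file_path):
    """Return a label for the parsing route chosen by file extension."""
    # Assumption: the comparison is case-insensitive.
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return "DoclingParser"
    elif suffix == ".epub":
        return "EpubParser"
    elif suffix == ".txt":
        return "plain text"
    # Assumption: unsupported extensions are rejected rather than guessed at.
    raise ValueError(f"Unsupported file type: {suffix!r}")


print(pick_handler("documents/my_book.epub"))
```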
Using It
From the Command Line
Convert an EPUB to audio:
python book_to_audio.py "documents/my_book.epub"
Dry run with debug output to inspect what the parser extracted:
python book_to_audio.py "documents/my_book.epub" --dry-run --generate-text-file
Skip front matter and navigation sections:
python book_to_audio.py "documents/my_book.epub" --sections-to-skip cover titlepage toc
Command Line Parameters
- file_path: path to the EPUB file
- --dry-run: parse the document but skip audio generation
- --generate-text-file: save processed paragraph and metadata files alongside the source EPUB
- --sections-to-skip: one or more section IDs to skip, separated by spaces
- --engine: TTS engine to use: kokoro (default) or qwen
- --voice: Kokoro voice identifier (default: af_heart)
- --speaker: Qwen speaker name (default: vivian)
- --output-file: path to the output WAV file
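For readers who want to see how flags like these map onto argparse, here is a sketch that mirrors the parameter list. It is an approximation and not the actual book_to_audio.py parser: the flag names and defaults come from the list above, while details such as nargs="*" and the help strings are assumptions.

```python
import argparse


def build_arg_parser():
    """Build a parser mirroring the documented flags (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(description="Convert a book to audio.")
    parser.add_argument("file_path", help="path to the source file")
    parser.add_argument("--dry-run", action="store_true",
                        help="parse the document but skip audio generation")
    parser.add_argument("--generate-text-file", action="store_true",
                        help="save processed paragraph and metadata files")
    # Assumption: IDs are collected as a space-separated list, empty by default.
    parser.add_argument("--sections-to-skip", nargs="*", default=[],
                        help="section IDs to skip")
    parser.add_argument("--engine", choices=["kokoro", "qwen"], default="kokoro")
    parser.add_argument("--voice", default="af_heart")
    parser.add_argument("--speaker", default="vivian")
    parser.add_argument("--output-file", default=None)
    return parser


args = build_arg_parser().parse_args(
    ["documents/my_book.epub", "--dry-run", "--sections-to-skip", "cover", "titlepage", "toc"]
)
print(args.sections_to_skip)
```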
From Python
from epub_parser import EpubParser
parser = EpubParser(
    source="documents/my_book.epub",
    meta_data={"source": "my_book.epub"},
    min_paragraph_size=300,
    remove_footnotes=True
)
docs, meta = parser.run(
    generate_text_file=True,
    sections_to_skip=["cover", "titlepage", "toc"]
)
The --generate-text-file flag is especially useful for EPUBs where section IDs vary between publishers and you need to identify which ones to skip. Run with --dry-run --generate-text-file first, inspect the output, then add unwanted section IDs to --sections-to-skip.
What We Left Behind
A few things from BookSearchArchive did not make the cut for this version:
- double_notes: in the RAG version, sections titled "Notes" got double the minimum paragraph size to avoid dominating search results. This does not apply to audio.
- min_section_size: the RAG pipeline skipped sections below a minimum total length. For audio we want everything.
- The Haystack component wrapper: EPubLoader and HTMLParserComponent were Haystack-specific. All of that is gone.
- Multiple file paths: EPubLoader accepted a list of files. EpubParser takes one file, consistent with DoclingParser.
Files Added or Modified
- epub_parser.py: new
- utils/general_utils.py: significantly expanded
- utils/docling_utils.py: slimmed down, PDF-specific logic only
- book_converter.py: EPUB dispatch in convert_to_audio
- book_to_audio.py: sections_to_skip parameter in main()
- tests/test_epub_parser.py: new unit tests
- tests/test_general_utils.py: new unit tests for moved utilities
- tests/test_document_output.py: extended to cover EPUB integration tests