Using Kokoro-82M to Convert a PDF to an Audiobook

Now that we’re familiar with the power of IBM’s Docling, let’s explore how we can use it to automatically convert a PDF into an audiobook. Sometimes, I just don’t have the time to read every paper I want, so this could be a game-changer. With this approach, I can take a paper, convert it into an audiobook, and listen to it while working.

There are already plenty of free tools for converting a PDF to audio, but I have been unsatisfied with the results. So, I wanted to build my own version.

Introducing Kokoro-82M

In a past post, we explored Suno Bark as a way to convert text to speech (TTS). Suno Bark isn’t a bad choice, especially if you have a more powerful laptop than I do! But on my underpowered machine, it was slow to generate audio from text. So, I started searching for a better solution. That’s when I came across the incredibly impressive Kokoro-82M. (GitHub repo found here.)

Kokoro has only 82 million parameters, making it lightweight, yet the results are surprisingly good. To my ears, it sounds quite human-like. Plus, it’s open-source and freely available on Hugging Face.

Of course, there are some gotchas with Kokoro (as there are with all TTS models), which we’ll address in future posts. But for now, let’s just demonstrate how easy it is to use Docling to read a PDF, convert it to speech with Kokoro, and save it as a WAV file—just like an audiobook.

Setting up the Environment

First, let’s check out the usage section of the Kokoro model card. To get started, we’ll need to install Kokoro itself, along with SoundFile for handling audio. So, let’s begin with a couple of installations:

pip install kokoro

pip install soundfile

Then let’s do some imports:

from docling.document_converter import DocumentConverter, ConversionResult
from docling_core.types import DoclingDocument
from pathlib import Path
from kokoro import KPipeline
import soundfile as sf
import numpy as np
from docling_parser import DoclingParser
from custom_haystack_components import load_valid_pages

A Simple Approach

The simplest way to do this is to load a PDF using Docling, export its text, and convert that text to audio:

def load_pdf_text(file_path: str) -> str:
    """Load a PDF, caching as JSON if needed, and export its text."""
    json_path = Path(file_path).with_suffix('.json')
    if json_path.exists():
        # Reuse the cached conversion so the PDF is not parsed again
        book = DoclingDocument.load_from_json(json_path)
    else:
        # Only build the converter when a conversion is actually needed
        result = DocumentConverter().convert(file_path)
        book = result.document
        book.save_as_json(json_path)
    return book.export_to_text()


def simple_generate_and_save_audio(text: str,
                                   output_file: str,
                                   voice: str = 'af_heart',
                                   sample_rate: int = 24000):
    """Generate audio from text using Kokoro and save the combined audio to a WAV file."""
    # lang_code 'a' selects American English
    pipeline = KPipeline(lang_code='a')
    audio_segments = []

    # The pipeline splits the text on newlines and yields one audio chunk per segment
    for i, (gs, ps, audio) in enumerate(pipeline(text, voice=voice, speed=1, split_pattern=r'\n+')):
        print(f"Segment {i}: Graphemes: {gs} | Phonemes: {ps}")
        audio_segments.append(audio)

    if not audio_segments:
        # np.concatenate raises on an empty list, so stop early
        print("No audio was generated.")
        return

    combined_audio = np.concatenate(audio_segments)
    sf.write(output_file, combined_audio, sample_rate)
    print(f"Audio saved to {output_file}")

And then let’s pull it all together with something like this:

def simple_pdf_to_audio():
    file_path = r"D:\Documents\AI\BookSearchArchive\documents\Realism and the Aim of Science -- Karl Popper -- 2017.pdf"
    text = load_pdf_text(file_path)
    print("Extracted text from PDF.")
    simple_generate_and_save_audio(text, "output.wav")

Here I’m giving it a path to a PDF, loading the text using Docling, and then generating the audio file.
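One caveat with this simple version: np.concatenate keeps every segment in memory until the very end, which can add up for a book-length PDF. Below is a minimal sketch of writing each segment to disk as it arrives instead. To keep it dependency-free it uses Python's standard-library wave module rather than SoundFile, and the function name write_segments_to_wav is hypothetical. It also assumes each segment is a one-dimensional sequence of floats in roughly the range [-1, 1], which should be checked against what your Kokoro version actually yields.

```python
import sys
import wave
from array import array
from typing import Iterable, Sequence


def write_segments_to_wav(segments: Iterable[Sequence[float]],
                          output_file,
                          sample_rate: int = 24000) -> int:
    """Write mono float segments to a 16-bit PCM WAV file, one segment at a time.

    Returns the total number of samples written.
    """
    total_samples = 0
    with wave.open(output_file, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        for segment in segments:
            # Clamp each sample to [-1, 1] and scale it to the int16 range
            pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767)
                              for s in segment))
            if sys.byteorder == "big":
                pcm.byteswap()         # WAV data is little-endian
            wav.writeframes(pcm.tobytes())
            total_samples += len(pcm)
    return total_samples
```

With something like this, the loop over the pipeline could hand over a generator of audio chunks rather than building a list, so only one segment is held in memory at a time. The trade-off is that the pure-Python sample conversion is slow; for real use, SoundFile or a NumPy-based conversion would likely be the better fit.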

A Better Approach

Of course, the result is rather lackluster. It does convert the PDF to an audio file, but it reads page headers, page numbers, etc. Not a very impressive result.
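A crude first pass at filtering that noise might look like the sketch below. It works on the plain text exported from the PDF and drops lines that contain only a page number, plus any line that repeats often enough to look like a running header. The function name strip_page_noise and the min_repeats threshold are hypothetical, and none of this is part of Docling or Kokoro.

```python
import re
from collections import Counter


def strip_page_noise(text: str, min_repeats: int = 3) -> str:
    """Drop bare page numbers and lines that repeat often enough to look like running headers."""
    lines = [line.strip() for line in text.splitlines()]
    counts = Counter(line for line in lines if line)

    kept = []
    for line in lines:
        if not line:
            continue                      # skip blank lines
        if re.fullmatch(r"\d{1,4}", line):
            continue                      # a line holding only a page number
        if counts[line] >= min_repeats:
            continue                      # repeated line, likely a header or footer
        kept.append(line)
    return "\n".join(kept)
```

A heuristic like this is fragile, though: it can drop legitimate short lines that happen to repeat, and it knows nothing about the document's layout.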

So, I had an idea: what if I utilize my DoclingParser from past posts to clean up the text? Here is the result:

def docling_parser_pdf_to_audio(file_path: str,
                                voice: str = 'af_heart',
                                sample_rate: int = 24000):
    """Clean a PDF with DoclingParser, generate audio with Kokoro, and save it as a WAV file."""
    converter = DocumentConverter()
    result: ConversionResult = converter.convert(file_path)
    book: DoclingDocument = result.document
    # Look up an optional page range for this document
    valid_pages = load_valid_pages("documents/pdf_valid_pages.csv")
    start_page = None
    end_page = None
    if book.name in valid_pages:
        start_page, end_page = valid_pages[book.name]

    parser = DoclingParser(book, {},
                           min_paragraph_size=300,
                           start_page=start_page,
                           end_page=end_page,
                           double_notes=True)
    paragraphs, meta = parser.run()

    # lang_code 'a' selects American English
    pipeline = KPipeline(lang_code='a')
    audio_segments = []
    for i, paragraph in enumerate(paragraphs):
        print(f"Generating audio for paragraph {i+1}/{len(paragraphs)}")
        # Convert the paragraph, which is a ByteStream, back to regular text
        text = paragraph.to_string('utf-8')

        for j, (gs, ps, audio) in enumerate(pipeline(text, voice=voice, speed=1, split_pattern=r'\n+')):
            print(f"Segment {j}: Graphemes: {gs} | Phonemes: {ps}")
            audio_segments.append(audio)

    if not audio_segments:
        # np.concatenate raises on an empty list, so stop early
        print("No audio was generated.")
        return

    combined_audio = np.concatenate(audio_segments)
    # Swap only the file extension, even if '.pdf' appears elsewhere in the path
    output_file = str(Path(file_path).with_suffix('.wav'))
    sf.write(output_file, combined_audio, sample_rate)
    print(f"Audio saved to {output_file}")

We load the PDF with Docling and pass it through my DoclingParser object, which is already built to remove page headers and footers. Then I generate the audio for each paragraph using Kokoro and pass the combined result to SoundFile to write the WAV file. The results are pretty impressive, though there are still a few strange parts I need to tweak. But this approach seems promising!
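The valid_pages lookup in that function comes from my own load_valid_pages helper, which isn't shown in this post. If you want to try the code without it, here is a hypothetical stand-in. The function name load_valid_pages_sketch and the CSV layout (a header row with name, start_page, and end_page columns) are assumptions made for illustration and may not match the real file.

```python
import csv
from typing import Dict, TextIO, Tuple


def load_valid_pages_sketch(csv_file: TextIO) -> Dict[str, Tuple[int, int]]:
    """Map a document name to its (start_page, end_page) range.

    Assumes a header row with the columns: name, start_page, end_page.
    """
    valid_pages: Dict[str, Tuple[int, int]] = {}
    for row in csv.DictReader(csv_file):
        name = row["name"].strip()
        valid_pages[name] = (int(row["start_page"]), int(row["end_page"]))
    return valid_pages
```

It takes an open file object rather than a path, so it would be called as something like load_valid_pages_sketch(open(path, newline='')). Any document not listed in the CSV simply falls through with start_page and end_page left as None.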

Conclusions

There is more work to do, but this is a promising way to turn a PDF into an audiobook.
