Implementing Qwen3-TTS in My PDF-to-Audiobook Pipeline (Qwen3-TTS, Part 1)

In my last post, I walked through building a PDF-to-audiobook pipeline using Kokoro for text-to-speech. The pipeline worked well enough that I've been actively using it to listen to books that only exist as PDFs. But I mentioned wanting to try Alibaba's recently open-sourced Qwen3-TTS as an alternative voice engine, and I've now done exactly that. (The code is in my GitHub repo.)

This post covers how to generate speech with Qwen3-TTS, both from Python and from my command-line tool, and how I refactored the code to support multiple TTS engines. A follow-up post will walk through the Qwen engine code itself.

Trying Qwen3-TTS

Before touching any of my existing code, I wanted to hear what Qwen3-TTS actually sounded like. The setup is straightforward. Install the package:

pip install -U qwen-tts

The first time you use a model, the weights download automatically from Hugging Face. The 0.6B model is roughly 1.2GB and the 1.7B model is around 3.4GB.

Once installed, generating speech from Python is only a few lines:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_custom_voice(
    text="The philosopher argued that all knowledge is provisional.",
    language="English",
    speaker="ryan",
)

sf.write("output.wav", wavs[0], sr)

Code found here.

You load a model with from_pretrained, call generate_custom_voice with your text, a language, and a speaker name, and you get back a list of waveform arrays and a sample rate. Write the first waveform to a file and you have a WAV you can play.

Qwen3-TTS comes with nine built-in speakers: aiden, dylan, eric, onoanna, ryan, serena, sohee, unclefu, and vivian. They vary quite a bit in tone and accent, so I'd recommend generating a short sample with each one to find what works for your use case. In my experience, though, the same speaker sounds somewhat different from one generation to the next, which makes it less than ideal for narrating an audiobook.
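
One way to audition all nine speakers is a small loop around generate_custom_voice. This is only a sketch: render_speaker_samples and its write_wav parameter are hypothetical names, the speaker strings are copied from the list above, and it assumes a model loaded as in the earlier snippet.

```python
from pathlib import Path

# Speaker names as listed in this post.
SPEAKERS = ["aiden", "dylan", "eric", "onoanna", "ryan",
            "serena", "sohee", "unclefu", "vivian"]


def render_speaker_samples(model, text, out_dir="speaker_samples", write_wav=None):
    """Generate one WAV per built-in speaker and return the output paths."""
    if write_wav is None:
        # Same writer the earlier snippet uses; imported lazily so the
        # function can also be exercised with a stand-in writer.
        import soundfile as sf
        write_wav = sf.write

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for speaker in SPEAKERS:
        wavs, sr = model.generate_custom_voice(
            text=text,
            language="English",
            speaker=speaker,
        )
        path = out / f"{speaker}.wav"
        write_wav(str(path), wavs[0], sr)
        paths.append(path)
    return paths
```

With the model from the earlier snippet, calling render_speaker_samples(model, "The philosopher argued that all knowledge is provisional.") should leave one file per speaker in speaker_samples/ for side-by-side listening.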

The 1.7B CustomVoice model also supports an instruct parameter that lets you control the delivery style with natural language:

wavs, sr = model.generate_custom_voice(
    text="The philosopher argued that all knowledge is provisional.",
    language="English",
    speaker="ryan",
    instruct="Read in a calm, steady audiobook narration style",
)

This is a genuinely interesting feature, but be aware that instruction control only works on the 1.7B models. The 0.6B models silently ignore the instruct parameter.
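
Because the 0.6B model ignores instruct without complaint, a small guard can make the mismatch visible. The sketch below is an illustration, not part of the qwen-tts package: effective_instruct is a hypothetical helper, and it assumes the model size can be read from the "1.7B" substring of the model ID.

```python
import warnings


def effective_instruct(model_id, instruct):
    """Return instruct only when the model is expected to honor it."""
    # Assumption: 1.7B checkpoints carry "1.7B" in their Hugging Face ID,
    # mirroring the "0.6B" in the ID used earlier in this post.
    if instruct and "1.7B" not in model_id:
        warnings.warn(
            f"{model_id} is not a 1.7B model; the instruct text will be ignored."
        )
        return None
    return instruct
```

Calling effective_instruct("Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice", "Read calmly") emits a warning and returns None, so the caller knows the style instruction will not take effect.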

Find a list of all the available models on Hugging Face here.

Refactoring for Multiple Engines

My original code for Book2Audio had the Kokoro TTS model wired directly into the AudioGenerator class. To support Qwen3-TTS as an alternative, I needed to pull the model-specific logic out and make it swappable. (Apologies for naming the repo Book2Audio and the Python file book_to_audio. I need to rename the repo at some point to match.)

The approach was a straightforward application of the strategy pattern. I created a TTSEngine abstract base class with two members: a generate method, which takes text and returns a numpy audio array, and a sample_rate property. Then I wrote two concrete implementations: KokoroEngine, wrapping the existing Kokoro pipeline, and QwenCustomVoiceEngine, wrapping Qwen3-TTS.
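
The shape of that interface can be sketched as follows. This is an illustration of the pattern rather than the code in the repo: ToneEngine is a hypothetical stand-in that emits beeps, and the sketch returns a plain list of floats instead of a numpy array so it runs with no dependencies.

```python
from abc import ABC, abstractmethod
import math


class TTSEngine(ABC):
    """Strategy interface: anything that can turn text into audio samples."""

    @abstractmethod
    def generate(self, text: str) -> list[float]:
        """Return mono audio samples for the given text."""

    @property
    @abstractmethod
    def sample_rate(self) -> int:
        """Samples per second of the audio that generate() returns."""


class ToneEngine(TTSEngine):
    """Hypothetical stand-in engine: one short 440 Hz beep per word, not speech."""

    def __init__(self, sample_rate: int = 24000):
        self._sample_rate = sample_rate

    @property
    def sample_rate(self) -> int:
        return self._sample_rate

    def generate(self, text: str) -> list[float]:
        # One-twentieth of a second of tone for every word in the text.
        n = (self._sample_rate // 20) * len(text.split())
        return [
            0.3 * math.sin(2 * math.pi * 440 * i / self._sample_rate)
            for i in range(n)
        ]
```

Any class that fills in both members can be dropped in wherever a TTSEngine is expected, which is what lets a Kokoro-backed and a Qwen-backed engine sit behind the same calling code.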

AudioGenerator, which previously owned the Kokoro pipeline directly, now takes any TTSEngine. It delegates audio generation to whatever engine it's given and handles only the model-agnostic work of saving WAV files. BookToAudio, the class that orchestrates document parsing and paragraph-by-paragraph generation, didn't need to change at all. It still talks to AudioGenerator the same way it always did.
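
A rough sketch of that delegation, under stated assumptions: the method name generate_to_file and the SilenceEngine placeholder are hypothetical, and the WAV is written with the standard-library wave module from a plain list of floats, which may differ from how the repo saves audio.

```python
import struct
import wave


class AudioGenerator:
    """Model-agnostic wrapper: delegates synthesis, handles WAV output."""

    def __init__(self, engine):
        # Any object with generate(text) and a sample_rate attribute works.
        self.engine = engine

    def generate_to_file(self, text: str, path: str) -> int:
        """Synthesize text with the engine and write 16-bit mono PCM."""
        samples = self.engine.generate(text)
        # Clamp to [-1, 1] and scale to signed 16-bit integers.
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(self.engine.sample_rate)
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
        return len(pcm)


class SilenceEngine:
    """Hypothetical placeholder engine: 100 silent samples per character."""

    sample_rate = 24000

    def generate(self, text):
        return [0.0] * (100 * len(text))
```

AudioGenerator never mentions Kokoro or Qwen here; swapping the engine passed to the constructor is the only change needed to switch voices.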

I also split the single book_to_audio.py file into several files. The engines live in their own directory, AudioGenerator and BookToAudio each got their own module, and book_to_audio.py became a thin CLI entry point. This makes it easy to add more engines later without everything piling up in one file.

From the command line, switching engines is just a flag:

python book_to_audio.py "documents/MyBook.pdf" --engine kokoro --voice af_heart

python book_to_audio.py "documents/MyBook.pdf" --engine qwen --speaker ryan --language English

You can also convert plain text directly without a PDF:

python book_to_audio.py --text "Hello world" --engine qwen --speaker vivian

By default, the Qwen engine uses the 0.6B model. To use the larger 1.7B model, which supports instruction control:

python book_to_audio.py "documents/MyBook.pdf" --engine qwen --speaker ryan --language English --model-size 1.7b --instruct "Read in a calm, steady audiobook narration style"

For long documents, you can limit the page range to test on a small section before committing to a full run:

python book_to_audio.py "documents/MyBook.pdf" --engine qwen --speaker ryan --start-page 10 --end-page 15

To process the document without generating audio, which is useful for inspecting the extracted text before spending time on generation:

python book_to_audio.py "documents/MyBook.pdf" --dry-run --generate-text-file

And to specify an output file name instead of the default:

python book_to_audio.py "documents/MyBook.pdf" --engine qwen --speaker ryan --output-file "my_audiobook.wav"

The Kokoro path works exactly as before. The Qwen path adds a few extra options for speaker, language, model size, and style instructions.
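
For reference, the flags used in the commands above could be declared with argparse roughly like this. The flag names come from those commands and the 0.6b default from the text; every other default, the choices lists, and the help strings are assumptions made for the sketch, not the actual CLI definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert a PDF or raw text to audio.")
    # The PDF path is optional because --text can replace it.
    parser.add_argument("pdf", nargs="?", help="path to the source PDF")
    parser.add_argument("--text", help="convert this string instead of a PDF")
    # Assumed default engine; the post does not say which one is the default.
    parser.add_argument("--engine", choices=["kokoro", "qwen"], default="kokoro")
    parser.add_argument("--voice", default="af_heart", help="Kokoro voice name")
    parser.add_argument("--speaker", default="ryan", help="Qwen speaker name")
    parser.add_argument("--language", default="English", help="Qwen language")
    # The Qwen engine defaults to the 0.6B model, as described above.
    parser.add_argument("--model-size", choices=["0.6b", "1.7b"], default="0.6b")
    parser.add_argument("--instruct", help="style instruction (1.7B model only)")
    parser.add_argument("--start-page", type=int, help="first page to process")
    parser.add_argument("--end-page", type=int, help="last page to process")
    parser.add_argument("--dry-run", action="store_true", help="skip audio generation")
    parser.add_argument("--generate-text-file", action="store_true",
                        help="write the extracted text to a file")
    parser.add_argument("--output-file", help="name of the WAV file to write")
    return parser
```

Parsing ["--text", "Hello world", "--engine", "qwen", "--speaker", "vivian"] with this parser leaves pdf as None and model_size at "0.6b".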

Kokoro vs. Qwen3-TTS for Audiobooks

After testing both engines on the same material, I have to be honest: I still prefer Kokoro for most audiobook listening. The Qwen3-TTS voices, while technically impressive, tend to have cadence patterns and occasional accent shifts that can be fatiguing over long listening sessions. The ryan speaker comes closest to a natural audiobook narrator in English, and I may switch to it in the future as I experiment more with the instruct parameter on the 1.7B model. But for now, Kokoro's more neutral delivery wins for extended listening.

Hardware Considerations

One thing that surprised me during this process was discovering that my laptop had been running Kokoro on CPU the whole time. PyTorch had been installed without CUDA support, which meant torch.cuda.is_available() returned False and everything silently fell through to CPU inference. It worked, just slower than it needed to be.

If you're running this on a machine with an NVIDIA GPU, make sure you install the CUDA-enabled version of PyTorch:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

You can verify it worked with:

python -c "import torch; print(torch.cuda.is_available())"

In any case, my code will use your GPU if it's available and CUDA is properly installed.
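
A minimal sketch of that kind of device selection (an illustration, not the exact code in the repo; pick_device is a hypothetical name). Unlike the silent fallback described above, it warns when it lands on the CPU.

```python
import warnings


def pick_device() -> str:
    """Return "cuda:0" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        warnings.warn("PyTorch is not installed; defaulting to CPU.")
        return "cpu"

    if torch.cuda.is_available():
        return "cuda:0"

    # This is the case a CPU-only PyTorch build ends up in.
    warnings.warn("CUDA is not available to PyTorch; running on CPU.")
    return "cpu"
```

The returned string can be passed as device_map when loading the Qwen model, in place of the hard-coded "cuda:0" from the first snippet.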

For Qwen3-TTS specifically, the 0.6B model needs roughly 1.5GB of VRAM and runs comfortably on a 6GB laptop GPU. The 1.7B model needs 4-6GB and may be tight on consumer hardware, especially if other applications are using the GPU. I'd recommend starting with the 0.6B model and only moving to the 1.7B if you want instruction control or find the quality insufficient. Theoretically the 1.7B should work, but I haven't really tested it yet on my laptop. I'll do that in a future post.
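
Those rough figures can be turned into a quick pre-flight check. The helper below is hypothetical and simply encodes the numbers quoted above, taking the upper end of the 4-6GB range for the 1.7B model to stay on the safe side.

```python
# Approximate VRAM needs from this post, in GB.
# 6.0 is the conservative end of the 4-6GB range quoted for the 1.7B model.
VRAM_NEEDED_GB = {"0.6b": 1.5, "1.7b": 6.0}


def model_fits(model_size: str, free_vram_gb: float) -> bool:
    """Return True if the given model size should fit in the free VRAM."""
    return free_vram_gb >= VRAM_NEEDED_GB[model_size]
```

On a CUDA-enabled PyTorch install, torch.cuda.mem_get_info() returns free and total memory in bytes, so dividing the first value by 1024**3 gives a number to pass in as free_vram_gb.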

What's Next

In the next post, I'll walk through the actual Qwen engine code, explaining how it works and how to use the Qwen3-TTS API. I also plan to explore Qwen3-TTS voice cloning, which lets you train the model on a specific narrator's voice from just a short audio clip and then generate an entire audiobook in that style.
