Inside the Qwen3-TTS Engine Code (Qwen3-TTS, Part 2)

This is the follow-up to my previous post on adding Qwen3-TTS to Book2Audio. That post covered how to use Qwen3-TTS from the command line and how I refactored the code to support multiple TTS engines. This post walks through the actual engine code — how it's structured, what each piece does, and how Qwen3-TTS works under the hood.

All the code discussed here is available in my GitHub repo.

The Engine Abstraction

The starting point is a simple abstract base class that defines what any TTS engine needs to do:

from abc import ABC, abstractmethod
import numpy as np

class TTSEngine(ABC):
    @abstractmethod
    def generate(self, text: str) -> np.ndarray:
        ...

    @property
    @abstractmethod
    def sample_rate(self) -> int:
        ...

One method and one property, that's it. generate takes a string of text and returns a numpy array of audio samples. sample_rate returns the sample rate in Hz so the caller knows how to save the audio correctly. Any TTS backend (Kokoro, Qwen, or something else entirely) just needs to implement these two things.
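To make the contract concrete, here is a minimal sketch of a toy engine that satisfies it. ToneEngine is hypothetical and not part of Book2Audio; it just emits a 440 Hz tone whose length depends on the text, which is enough to show what a backend has to provide.

```python
from abc import ABC, abstractmethod

import numpy as np


class TTSEngine(ABC):
    @abstractmethod
    def generate(self, text: str) -> np.ndarray: ...

    @property
    @abstractmethod
    def sample_rate(self) -> int: ...


class ToneEngine(TTSEngine):
    """Hypothetical stand-in engine: a 440 Hz tone, 0.1 s per character."""

    def generate(self, text: str) -> np.ndarray:
        n_samples = int(round(self.sample_rate * 0.1 * len(text)))
        t = np.arange(n_samples) / self.sample_rate
        return (0.2 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

    @property
    def sample_rate(self) -> int:
        return 24000


engine = ToneEngine()
audio = engine.generate("hello")  # 5 characters -> 0.5 s of audio
```

Because both members are abstract, Python refuses to instantiate TTSEngine itself or any subclass that forgets one of them, so a half-finished engine fails at construction time rather than in the middle of a book.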

AudioGenerator then wraps any engine and handles the model-agnostic parts:

import numpy as np
import soundfile as sf

class AudioGenerator:
    def __init__(self, engine: TTSEngine) -> None:
        self._engine = engine

    def generate(self, text: str) -> np.ndarray:
        # Delegate synthesis to whichever engine was injected.
        return self._engine.generate(text)

    def save(self, audio: np.ndarray, output_file: str) -> None:
        # Ask the engine for its sample rate instead of hardcoding one.
        sf.write(output_file, audio, self._engine.sample_rate)

    def generate_and_save(self, text: str, output_file: str) -> None:
        self.save(self.generate(text), output_file)

The key thing here is save — it pulls sample_rate from the engine rather than hardcoding it. Different engines could theoretically produce audio at different sample rates, and this handles that transparently. generate_and_save is just a convenience method that chains the two together.
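A small illustration of why the rate has to come from the engine: the same samples written with the wrong rate in the file header play back at the wrong speed. This sketch uses only the standard library wave module rather than the soundfile call in the post, and the 48000 value is a deliberately wrong rate chosen for the demonstration.

```python
import io
import wave


def write_wav(samples: list[int], sample_rate: int) -> bytes:
    """Write 16-bit mono PCM samples to an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(
            b''.join(s.to_bytes(2, 'little', signed=True) for s in samples)
        )
    return buf.getvalue()


def duration_seconds(wav_bytes: bytes) -> float:
    """Playback duration implied by the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), 'rb') as wav:
        return wav.getnframes() / wav.getframerate()


samples = [0] * 24000  # one second of silence at 24 kHz

correct = duration_seconds(write_wav(samples, 24000))
wrong = duration_seconds(write_wav(samples, 48000))
```

With the matching rate the file lasts one second. With the header claiming 48 kHz, the same samples last half a second, which for speech means double speed and a higher pitch.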

Loading the Model

The QwenCustomVoiceEngine constructor handles model loading. If you don't pass in a pre-loaded model, it figures everything out from the model_size parameter:

QWEN_MODEL_SIZES = {
    '0.6b': 'Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice',
    '1.7b': 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
}

def __init__(self,
             speaker: str = 'vivian',
             language: str = 'Auto',
             instruct: str | None = None,
             model_size: str = '0.6b',
             model: Qwen3TTSModel | None = None) -> None:
    if model is None:
        # Unknown sizes fall back to the 0.6B checkpoint.
        model_id = QWEN_MODEL_SIZES.get(
            model_size.lower(), QWEN_MODEL_SIZES['0.6b']
        )
        # Prefer FlashAttention 2 when the package is installed; otherwise use SDPA.
        attn_impl = 'sdpa'
        try:
            import flash_attn  # noqa: F401 (imported only to check availability)
            attn_impl = 'flash_attention_2'
        except ImportError:
            pass
        device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
        model = Qwen3TTSModel.from_pretrained(
            model_id,
            device_map=device,
            dtype=torch.bfloat16,
            attn_implementation=attn_impl,
        )

The QWEN_MODEL_SIZES dictionary maps friendly size names to full Hugging Face model identifiers. This way the rest of the code just passes around "0.6b" or "1.7b" instead of long model strings. It also means that if Qwen releases new checkpoints, there's only one place to update. One thing to be aware of: because of the fallback in the .get call, an unrecognized size silently loads the 0.6B model instead of raising an error. Note to self: I should really allow you to pass a full model name here and only use this dictionary for shorthands. I still need to implement that.
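One possible shape for that note-to-self is sketched below. resolve_model_id is a hypothetical helper, not something in the repo, and the rule that anything containing a slash is a full model ID is an assumption made for the example.

```python
QWEN_MODEL_SIZES = {
    '0.6b': 'Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice',
    '1.7b': 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
}


def resolve_model_id(model: str) -> str:
    """Hypothetical helper: expand a shorthand, or pass a full model ID through."""
    name = model.strip()
    shorthand = name.lower()
    if shorthand in QWEN_MODEL_SIZES:
        return QWEN_MODEL_SIZES[shorthand]
    if '/' in name:
        # Assumption: anything shaped like "org/name" is a full Hugging Face ID.
        return name
    known = ', '.join(sorted(QWEN_MODEL_SIZES))
    raise ValueError(
        f"Unknown model {model!r}; expected one of {known} or a full model ID"
    )
```

Raising on an unknown shorthand is a design choice: a typo such as "17b" then fails loudly instead of quietly loading a different checkpoint.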

There are a few things worth noting about the from_pretrained call.

device_map controls where the model runs. The code checks torch.cuda.is_available() and uses the GPU if present, falling back to CPU otherwise. This is the same pattern that the Kokoro engine uses, so both engines behave consistently.

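dtype=torch.bfloat16 halves the memory footprint compared to full float32 precision with negligible quality loss. For the 1.7B model, that is roughly 3.4 GB of weights instead of 6.8 GB (about 1.7 billion parameters at 2 bytes each rather than 4), which can be the difference between fitting on a smaller consumer GPU and not fitting at all.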

The attention implementation check is about GPU memory efficiency. FlashAttention 2 is an optimized attention algorithm that reduces VRAM usage during inference. But it requires the flash-attn package, which compiles from source and needs the CUDA Toolkit and a C++ compiler installed — a nontrivial setup on Windows. If it's not available, the engine falls back to PyTorch's built-in scaled dot product attention (sdpa), which works fine but uses a bit more VRAM. The code handles this gracefully: try the import, use it if it's there, move on if it's not. To be frank, I never got flash attention working — getting the CUDA Toolkit and C++ compiler set up on Windows was more than I wanted to take on right now. So the code is there for when I get around to it, but for now I use sdpa.
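As an alternative to the try/import check, the same decision can be made without importing the package at all. This is a sketch rather than the code in the repo: pick_attn_implementation is a hypothetical name, and unlike the constructor it also skips FlashAttention when no GPU is present.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return 'flash_attention_2' only when it can plausibly be used."""
    if not cuda_available:
        # FlashAttention 2 runs on CUDA GPUs, so it is no help on CPU.
        return 'sdpa'
    # find_spec locates the package without importing (and initializing) it.
    if importlib.util.find_spec('flash_attn') is not None:
        return 'flash_attention_2'
    return 'sdpa'
```

In the constructor this would be called as pick_attn_implementation(torch.cuda.is_available()). Note that find_spec only confirms the package can be found; a broken build could still fail when the model actually loads, so the try/import version is the stricter check.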

The constructor also accepts a pre-loaded model via the model parameter. This is useful for testing — you can inject a mock — and it also means you could share a single model instance across multiple engine objects if you needed to.

Generating Speech

The generate method is where text actually becomes audio:

def generate(self, text: str) -> np.ndarray:
    kwargs = {
        'text': text,
        'language': self._language,
        'speaker': self._speaker,
    }
    if self._instruct is not None:
        kwargs['instruct'] = self._instruct

    wavs, sr = self._model.generate_custom_voice(**kwargs)
    self._sample_rate = sr
    return wavs[0]

It assembles the keyword arguments for generate_custom_voice, conditionally including the instruct parameter, makes the call, and returns the first waveform.

The instruct parameter is only included when it's not None. This matters because the 0.6B model doesn't support instruction control — only the 1.7B CustomVoice model does. Passing instruct to the 0.6B model won't cause an error, but it will be silently ignored.
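The conditional handling of instruct is easy to check without loading a real model, using the model injection described earlier. In this sketch, EngineUnderTest is a stand-in that mirrors the generate logic above; it is not the real QwenCustomVoiceEngine, and the dummy waveform is a plain list rather than a numpy array.

```python
from unittest.mock import MagicMock


class EngineUnderTest:
    """Stand-in mirroring the generate logic from the post; not the real class."""

    def __init__(self, model, speaker='vivian', language='Auto', instruct=None):
        self._model = model
        self._speaker = speaker
        self._language = language
        self._instruct = instruct
        self._sample_rate = None

    def generate(self, text):
        kwargs = {
            'text': text,
            'language': self._language,
            'speaker': self._speaker,
        }
        if self._instruct is not None:
            kwargs['instruct'] = self._instruct
        wavs, sr = self._model.generate_custom_voice(**kwargs)
        self._sample_rate = sr
        return wavs[0]


# The mock plays the role of a loaded model and returns (waveforms, sample_rate).
fake_model = MagicMock()
fake_model.generate_custom_voice.return_value = ([[0.0, 0.1, 0.0]], 24000)

plain = EngineUnderTest(fake_model)
plain.generate('Hello.')
plain_kwargs = fake_model.generate_custom_voice.call_args.kwargs

styled = EngineUnderTest(fake_model, instruct='Speak slowly.')
styled.generate('Hello.')
styled_kwargs = fake_model.generate_custom_voice.call_args.kwargs
```

Inspecting plain_kwargs shows no instruct key at all, while styled_kwargs carries the instruction through unchanged.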

generate_custom_voice returns a list of waveforms because the Qwen3-TTS API supports batched generation — you can pass a list of strings and get back multiple audio arrays in one call. For our paragraph-at-a-time use case, we always pass a single string and take wavs[0]. However, I should really change things to allow this code to handle everything at once as an option. As I mentioned in the previous post, Qwen3-TTS produces a somewhat different voice each time you call it, which makes for a questionable audiobook experience when you're generating paragraph by paragraph. Batching everything into a single call might help with that consistency.
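A batched variant might look like the sketch below. generate_batch and FakeModel are hypothetical, and the way per-item settings are passed (one list per argument, aligned with the list of texts) is an assumption that should be checked against the Qwen3-TTS documentation before relying on it. Whether batching actually improves voice consistency is also untested here.

```python
class FakeModel:
    """Stand-in for a loaded model: returns one dummy waveform per input text."""

    def generate_custom_voice(self, **kwargs):
        self.last_kwargs = kwargs
        return [[0.0] * 4 for _ in kwargs['text']], 24000


def generate_batch(model, texts, speaker='vivian', language='Auto', instruct=None):
    """Hypothetical helper: synthesize all paragraphs in a single call."""
    texts = list(texts)
    kwargs = {
        'text': texts,
        # Assumption: per-item settings are lists aligned with `text`.
        'language': [language] * len(texts),
        'speaker': [speaker] * len(texts),
    }
    if instruct is not None:
        kwargs['instruct'] = [instruct] * len(texts)
    wavs, sr = model.generate_custom_voice(**kwargs)
    if len(wavs) != len(texts):
        raise RuntimeError(f'expected {len(texts)} waveforms, got {len(wavs)}')
    return wavs, sr


paragraphs = ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
model = FakeModel()
wavs, sr = generate_batch(model, paragraphs)
```

A real implementation would probably also need to cap the batch size, since sending an entire book in one call is likely to exceed GPU memory.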

The sample rate is captured from the return value rather than hardcoded, though in practice Qwen3-TTS always returns 24000 Hz. By reading it from the response, the code stays correct even if a future model version changes the rate.

Wiring It Together

The CLI entry point in book_to_audio.py ties everything together. When the user passes --engine qwen, a small factory function creates the right engine:

def _create_engine(args):
    if args.engine == 'qwen':
        return QwenCustomVoiceEngine(
            speaker=args.speaker,
            language=args.language,
            instruct=args.instruct,
            model_size=args.model_size,
        )
    else:
        return KokoroEngine(voice=args.voice)

That engine gets wrapped in an AudioGenerator, which gets handed to BookToAudio, which does the actual document processing. BookToAudio doesn't know or care whether it's using Kokoro or Qwen — it just calls generate and gets audio back.
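For context, here is a rough sketch of how the command-line flags could feed that factory. Only --engine is confirmed above; the other flag names are inferred from the attribute names on args, the defaults are placeholders, and both engine classes are stubs so nothing is actually loaded.

```python
import argparse


class KokoroEngine:
    """Stub standing in for the real Kokoro engine."""

    def __init__(self, voice):
        self.voice = voice


class QwenCustomVoiceEngine:
    """Stub standing in for the real Qwen engine; no model is loaded."""

    def __init__(self, speaker, language, instruct, model_size):
        self.speaker = speaker
        self.language = language
        self.instruct = instruct
        self.model_size = model_size


def build_parser():
    parser = argparse.ArgumentParser(prog='book_to_audio')
    parser.add_argument('--engine', choices=['kokoro', 'qwen'], default='kokoro')
    # Everything below is an assumed flag name with a placeholder default.
    parser.add_argument('--voice', default='af_heart')
    parser.add_argument('--speaker', default='vivian')
    parser.add_argument('--language', default='Auto')
    parser.add_argument('--instruct', default=None)
    parser.add_argument('--model-size', default='0.6b')  # becomes args.model_size
    return parser


def _create_engine(args):
    if args.engine == 'qwen':
        return QwenCustomVoiceEngine(
            speaker=args.speaker,
            language=args.language,
            instruct=args.instruct,
            model_size=args.model_size,
        )
    return KokoroEngine(voice=args.voice)


qwen_args = build_parser().parse_args(['--engine', 'qwen', '--model-size', '1.7b'])
qwen_engine = _create_engine(qwen_args)
default_engine = _create_engine(build_parser().parse_args([]))
```

Restricting --engine with choices means an unsupported engine name is rejected by argparse before the factory ever runs.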

This is the payoff of the strategy pattern. Adding a third engine later — say, for voice cloning with the Qwen3-TTS Base model — means writing a new engine class, adding an option to _create_engine, and nothing else changes.

What's Next

Voice cloning is the natural next step. The Qwen3-TTS Base model can clone a voice from just a few seconds of reference audio, which opens up the possibility of generating an entire audiobook in a specific narrator's voice. That will be a separate engine class since it uses a different model and a different API (generate_voice_clone instead of generate_custom_voice), but the abstraction is already in place to support it.
