A Local Text-to-Speech Model Using Suno Bark

In our last post, we looked at how to stream a text-to-speech model using the Hugging Face API. We had mixed results at best. So, in this post, I’ll cover how to do text-to-speech (TTS) using a local TTS model.

Introducing Suno Bark… Again

In our last post I tried out several different models via the Hugging Face API, including suno/bark and suno/bark-small. Both are excellent, relatively small and manageable, open-source text-to-speech models available on Hugging Face. I’ll be working with Suno Bark Small, but it is easy enough to pass in Suno Bark or another text-to-speech model if you prefer.

Suno Bark Small has the main features I’m looking for: it is simple, it doesn’t require additional installs (unlike parler-tts), and, most importantly, it has built-in voices that stay consistent.

Many text-to-speech models sound different every time you use them: you get a random voice, which might be a man one time and a woman the next. Suno Bark Small has built-in voices you can specify so the output sounds the same every time.

As it turns out, you can even specify a voice with an accent. For the BookSearchArchive I decided to give the voice a German accent by using a German voice preset with English text.

The TextToSpeechLocal Class

I built a custom Haystack TTS component for you to use called TextToSpeechLocal, which can be found in the customhaystackcomponents.py file. (Here is the code base at the time of this blog post. The most recent code can be found here.)

Let’s start with the imports we’ll need, then declare the class and create an initialization method:

import re
import numpy as np
import sounddevice as sd
import torch
from typing import Any, Dict, List
from haystack import component
from transformers import AutoProcessor, BarkModel

@component
class TextToSpeechLocal:
    def __init__(self, model_name_or_path: str = "suno/bark-small"):
        # Pick a device and initialize the processor and model
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.processor = AutoProcessor.from_pretrained(model_name_or_path, torch_dtype=torch.float16)
        self.model = BarkModel.from_pretrained(model_name_or_path, torch_dtype=torch.float16).to(self.device)

You can see that I allow you to pass a model name, defaulting to suno/bark-small. However, note that as of today probably no other model will work except suno/bark itself; you’ll see why in a moment. Also note that we use the Hugging Face AutoProcessor and the ‘from_pretrained’ method to create the processor and model.
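For example, if you want the full-sized model (and have the memory for it), you can just pass its name when you construct the component. A quick sketch:

# Default: the small model
tts = TextToSpeechLocal()

# Or, if you have the memory for it, the full-sized model
tts = TextToSpeechLocal(model_name_or_path="suno/bark")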

Next let’s create a method for playing audio:

@staticmethod
def _play_audio(audio_data: np.ndarray, sample_rate: int = 24000) -> None:
    audio_data = audio_data.astype("float32")
    sd.play(audio_data, samplerate=sample_rate)
    sd.wait()  # Wait until the audio finishes playing

This is simple enough. You pass in a NumPy array of audio data and a sampling rate. I make sure it’s in float32 format (I’ll explain why this is important later) and then use the SoundDevice library to play it. Finally it ‘waits’ for the audio to finish.
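If you want to sanity-check your audio setup before involving the model at all, here is a minimal sketch that plays a one-second 440 Hz tone through the same helper (it’s a static method, so you can call it directly on the class):

import numpy as np

# One second of a 440 Hz tone at Bark's 24 kHz sample rate
sample_rate = 24000
t = np.linspace(0, 1.0, sample_rate, endpoint=False)
tone = 0.2 * np.sin(2 * np.pi * 440 * t)

# _play_audio casts to float32 and blocks until playback finishes
TextToSpeechLocal._play_audio(tone, sample_rate=sample_rate)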

Of course, the ‘run’ method is where all the real work happens. We pass in a string that (presumably) was created by the Large Language Model (LLM), and it chops the string up into sentences and then plays each sentence:

@component.output_types(text=str)
def run(self, reply: str) -> Dict[str, Any]:
    # Split the input text into sentences using regular expression
    sentences: List[str] = re.split(r'(?<=[.!?])\s+', reply.strip())

    # Process each sentence
    sentence: str
    for sentence in sentences:
        # Use the v2/de_speaker_0 voice preset
        voice_preset: str = "v2/de_speaker_0"

        # Prepare the inputs for the model
        inputs: dict = self.processor(sentence,
                                      voice_preset=voice_preset,
                                      return_tensors="pt",
                                      return_attention_mask=True)

        # Ensure inputs are moved to the correct device
        inputs = {key: value.to(self.device) for key, value in inputs.items()}

        audio_array = self.model.generate(**inputs).to(self.device)
        audio_array = audio_array.cpu().numpy().squeeze()

        # Play the generated audio immediately
        self._play_audio(audio_array)

    # After all sentences are processed, return the full text
    return {"text": reply}

The input is a string called ‘reply’ (from the LLM) and the output of this component is a string called ‘text’. I use a regular expression to split up the text into sentences by looking for punctuation like ‘.’, ‘!’, or ‘?’. Then I feed each sentence to the ‘processor’ created in the init method:

        inputs: dict = self.processor(sentence,
                                      voice_preset=voice_preset,
                                      return_tensors="pt",
                                      return_attention_mask=True)

This takes each sentence and turns it into an appropriate list of inputs to be fed into the actual model. The inputs will include an attention mask, encodings, etc.
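Backing up a step: if you are curious what the sentence-splitting regular expression at the top of ‘run’ actually produces, here is a minimal sketch:

import re

reply = "Hello there! I found three matching books. Would you like a summary?"
sentences = re.split(r'(?<=[.!?])\s+', reply.strip())
print(sentences)
# ['Hello there!', 'I found three matching books.', 'Would you like a summary?']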

Finally, we generate the actual audio using the model:

audio_array = self.model.generate(**inputs).to(self.device)

And then we move it back to the CPU, convert it to a NumPy array, and squeeze it down to a vector (a single-dimensional array) so it is in the right format for SoundDevice.

audio_array = audio_array.cpu().numpy().squeeze()

Finally, we call our function to play the audio. This is also where the float32 cast mentioned earlier matters: because the model was loaded in float16, the generated samples come back as float16, which SoundDevice can’t play directly, so _play_audio casts them to float32 first. On my laptop, the result is pretty slow. If you have a GPU it will probably be able to keep up. But in a future post I’ll explore ways to make this run better.

Voice Presets

You may have noticed I set the voice to "v2/de_speaker_0". This preset is obviously meant for a German speaker, not an English speaker, but I thought it would be cool to give the BookSearchArchive a German accent. I wasn’t sure whether a German voice preset could read English text, nor whether it would do so with a German accent if it could. As it turns out, it reads English just fine, and I think it has a nice German accent.

Specifying a voice preset is how you get Suno Bark to use a consistent voice. There is one problem with doing it this way, though: I currently hard code the preset, and this ‘speaker’ is tied specifically to suno/bark (it also works with suno/bark-small). So this custom component will fail on any model you pass in that doesn’t have this voice, which is, presumably, almost all of them. For now this is really a Suno Bark specific component; I’ll fix that in a future version and make it generic enough to accept any TTS model.

You can find a list of voice presets here.
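If you would rather have an English voice, just swap the hard-coded preset string in ‘run’. For example, "v2/en_speaker_6" is one of the English speakers in Bark’s preset library:

# Inside run(), swap the German preset for an English one:
voice_preset: str = "v2/en_speaker_6"

# Or keep the German accent but try a different German speaker:
# voice_preset: str = "v2/de_speaker_3"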

Using TTS Component in Your Pipeline

Here is my revised pipeline using the TTS component:

A flow chart of the revised pipeline. I’ll write a more detailed explanation of it at a later date.

That is looking pretty crazy, isn’t it? I wrote a new “MergeResults” component that takes the documents list from the doc_query_collector as well as the results from the LLM and collates them together to make it easy to find the final results. I then connect the ‘reply’ from that component to the Text-to-Speech (TTS) component.

This is all probably more complicated than it needs to be, but it works; I’ll clean it up later. The upside is that it lets me connect the TTS component only if the user asks for it via the new ‘use_voice’ parameter:

if self._use_voice and not self._can_stream():
    # Add the text to speech component
    tts_node = TextToSpeechLocal()
    rag_pipeline.add_component("tts", tts_node)
    rag_pipeline.connect("merger.reply", "tts.reply")

I know that no matter what other parameters are set, the ‘merger’ will always have what I need.
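If you just want to hear the component without running the whole pipeline, you can also call it directly:

# Try the TTS component on its own, outside the pipeline
tts = TextToSpeechLocal()
result = tts.run(reply="Hello! I am the BookSearchArchive. How may I help you?")
print(result["text"])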

Conclusions

This code adds a voice to your Haystack pipeline using a local model, allowing your LLM to speak to you. We covered how to use a preset so the model speaks with a consistent voice, and even how to give it an accent.
