Testing Hugging Face Serverless Text-To-Speech Models

I’d like to add a voice to BookSearchArchive, my test project for the Mindfire open-source AI stack, which I first discussed in this post. I thought it would be fun to let the Large Language Model (LLM) you converse with actually chat with you via audio.

There are two possible ways to do this: run a local text-to-speech model, or use the Hugging Face API. (*)

Here is a link to the code at the time of this blog post. The most up-to-date code is found here.

Local vs Serverless API Text-To-Speech

Since my laptop has a sucky GPU, my preference would be to use a Text-To-Speech model on the Hugging Face rate-limited Serverless API – similar to what we did for BookSearchArchive’s LLM. (*)

Unfortunately, things didn’t go as planned.

The Hugging Face rate-limited server does have some Text-To-Speech models available on it. My first attempt was to just try the default, which is suno/bark. It turns out this model is available on the Hugging Face serverless API but, according to the error message I received back, it is reserved for paying Pro customers.

Bad request:

Model requires a Pro subscription; check out hf.co/pricing to learn more. Make sure to include your HF token in your query.
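For reference, a raw call to the serverless endpoint looks roughly like this (a sketch; the endpoint URL matches the one in the error messages below, and YOUR_HF_TOKEN is a placeholder for your own token):

import requests

# Minimal sketch of a raw serverless inference call. The token goes in the
# Authorization header; "YOUR_HF_TOKEN" is a placeholder for your own token.
API_URL = "https://api-inference.huggingface.co/models/suno/bark"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello there!"})
print(response.status_code)    # 400 here, with the Pro-subscription message
print(response.content[:200])  # audio bytes on success, a JSON error otherwise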

I also tried out suno/bark-small, but it always seems to return an internal server error:

Error generating audio for suno/bark-small: 500 Server Error: Internal Server Error for url: https://api-inference.huggingface.co/models/suno/bark-small (Request ID: OaHuye9GkoJeQqir0WW4d)

You can tell that these models exist on the Hugging Face server because you can navigate to them through your browser and see them for yourself. (Try suno/bark, suno/bark-small, or microsoft/speecht5_tts, and you’ll see the server returns a proper result.) I can understand why suno/bark doesn’t work, since it requires a paid subscription, but why do you get an error on suno/bark-small?

After doing some research, it turns out that the Hugging Face server has set up suno/bark-small incorrectly (found on the Hugging Face discussion board):

“The Serverless Inference API is currently turned off for most models. This is probably because the shared resources on the server have been exhausted. It is turned on for microsoft/speecht5_tts, but it does not actually work because the configuration settings are not properly set in README.md or json.”

Well, that’s disappointing.

Finding a Working Serverless Model

I wrote some code to try out several different models on the Hugging Face server and attempt to generate text-to-speech. By trying several models at once, I can see if any of them are set up correctly to work.

Here is my test code:

from huggingface_hub import InferenceClient
import requests
from pathlib import Path

# Load the Hugging Face API secret - put the secret in a text file and read it
hf_secret = Path(r'D:\Documents\Secrets\huggingface_secret.txt').read_text().strip()


def get_model_details(model_id: str, token: str):
    """Fetch model details, including sample rate, using the Hugging Face API."""
    url = f"https://huggingface.co/api/models/{model_id}"
    headers = {"Authorization": f"Bearer {token}"}
    try:
        response = requests.get(url, headers=headers)
        print(f"\n>>> Retrieving details for model: {model_id}")
        print(f"Status Code: {response.status_code}")
        if response.status_code == 200:
            model_info = response.json()
            return {
                "id": model_info.get("id"),
                "modelType": model_info.get("modelType"),
                "pipeline_tag": model_info.get("pipeline_tag"),
                "library_name": model_info.get("library_name"),
                # "sample_rate": model_info.get("config", {}).get("sampling_rate", 16000),
            }
        else:
            print(f"Failed to retrieve details for {model_id}.")
            return None
    except Exception as e:
        print(f"Error fetching model details for {model_id}: {e}")
        return None


def generate_audio(model_id: str, token: str, text: str):
    """Generate audio using InferenceClient."""
    client = InferenceClient(api_key=token)
    try:
        print(f"\n>>> Generating audio with model: {model_id}")
        audio_data = client.text_to_speech(text, model=model_id)
        if isinstance(audio_data, bytes):
            file_name = model_id.replace("/", "_")
            audio_file = Path(f"{file_name}_test_sentence.flac")
            audio_file.write_bytes(audio_data)
            print(f"Audio saved to {audio_file}")
            return True
        else:
            print(f"Unexpected response type from {model_id}.")
            return False
    except Exception as e:
        print(f"Error generating audio for {model_id}: {e}")
        return False


def try_models(models, text, token):
    """Test models and generate a summary report."""
    results = {}
    for model in models:
        print("\n" + "=" * 80)
        print(f"Processing model: {model}")
        print("=" * 80)

        model_details = get_model_details(model, token)
        if model_details:
            print("Model Details:")
            print(model_details)
        else:
            print(f"Skipping {model} due to missing details.")

        success = generate_audio(model, token, text)
        results[model] = "Success" if success else "Failed"

    print("\n" + "=" * 80)
    print("SUMMARY REPORT")
    print("=" * 80)
    for model, result in results.items():
        print(f"Model: {model}\n  Result: {result}")
    print("=" * 80)


if __name__ == "__main__":
    models = [
        "suno/bark",
        "suno/bark-small",
        "facebook/mms-tts-eng",
        "microsoft/speecht5_tts",
    ]
    text = "Hello, welcome to the world of text to speech!"
    try_models(models, text, hf_secret)

This code runs through several models and tries them all out.

In truth, I tried a lot more models. Most weren’t on the server at all. The four above all exist on the server, but most of them throw an internal server error. My code tests whether each model is there, tries to get info on it, and then tries to create an audio file using Hugging Face’s InferenceClient method.

“get_model_details” is a function that gets details on a model – if it exists – and prints out information about it.

“generate_audio” is the function that actually calls a model using Hugging Face’s InferenceClient method and saves the resulting audio to a file.

“try_models” loops over the list of models, running both functions on each one and printing a summary report.

My code shows that all four of these exist on the Hugging Face server, but suno/bark-small and microsoft/speecht5_tts both get internal server errors. Only facebook/mms-tts-eng actually produces an audio file (called facebook_mms-tts-eng_test_sentence.flac).

Model: suno/bark
  Result: Failed
Model: suno/bark-small
  Result: Failed
Model: facebook/mms-tts-eng
  Result: Success
Model: microsoft/speecht5_tts
  Result: Failed

Environment Setup

To make this code work I upgraded tokenizers and transformers and installed sounddevice. (PyPI SoundDevice page.) You can see the updated environment in the requirements.txt file. Try running:

pip install tokenizers==0.19.1

pip install transformers==4.43.2

pip install sounddevice

So, if you want to use the Hugging Face serverless API, it looks like facebook/mms-tts-eng is the only option I was able to find.
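If you just want that one working model without the whole test harness, the call boils down to this (same InferenceClient call as in the test code above, with hf_secret loaded the same way):

from pathlib import Path
from huggingface_hub import InferenceClient

# Generate speech with the one serverless model that worked for me.
client = InferenceClient(api_key=hf_secret)
audio_bytes = client.text_to_speech(
    "Hello, welcome to the world of text to speech!",
    model="facebook/mms-tts-eng",
)
Path("facebook_mms-tts-eng_test_sentence.flac").write_bytes(audio_bytes)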

Playing Audio Immediately

Of course, my code saves the audio out to a file rather than playing it directly. You could then play the audio from the file, but that’s not the most useful. Unfortunately, the facebook/mms-tts-eng model seems to (at least by default) return flac files, which are compressed, so you can’t play them via the SoundDevice library directly. I will work out how to deal with that problem and publish a solution in a future post.
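For what it’s worth, the approach I’d expect to work is to decode the flac bytes into raw samples first and then hand those to SoundDevice. Here is a minimal sketch of that idea (though, as the update below explains, soundfile currently has a bug with these particular flac files):

import io

import sounddevice as sd
import soundfile as sf

# Decode the flac bytes (from client.text_to_speech above) into a NumPy
# array of samples, then play the samples directly. In principle this
# works; in practice soundfile's flac handling hits the bug described in
# the update below, so treat this as a sketch of the idea.
data, sample_rate = sf.read(io.BytesIO(audio_bytes))
sd.play(data, samplerate=sample_rate)
sd.wait()  # block until playback finishes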

Using a Local Model

Alternatively, you could just use a local model. If you check my latest code, I do have a working version of a local text-to-speech model. You can activate it via the RagPipeline’s new parameter:

use_voice=True

Check out the TextToSpeechLocal class in customhaystackcomponents.py. I’ll cover this code in detail in a future post.
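In the meantime, here is a rough idea of what running a model locally looks like – a minimal sketch using the transformers text-to-speech pipeline, not the actual TextToSpeechLocal code:

import sounddevice as sd
from transformers import pipeline

# Minimal local text-to-speech sketch (an illustration of the general
# idea, not the actual TextToSpeechLocal class from BookSearchArchive).
tts = pipeline("text-to-speech", model="facebook/mms-tts-eng")
result = tts("Hello, welcome to the world of text to speech!")

# The pipeline returns the raw audio array plus its sampling rate, so we
# can play it directly without any file round-trip (or flac decoding).
sd.play(result["audio"].squeeze(), samplerate=result["sampling_rate"])
sd.wait()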

Update: The problem with facebook/mms-tts-eng turned out to be that the Hugging Face server sets it up to return flac files, and soundfile (which underlies most Python libraries that play audio) currently has a bug where it won’t play these flac files correctly. See this bug I reported, complete with a simple way to replicate the problem.

Notes

(*) Recall (back in this post) that we used Haystack’s HuggingFaceAPIGenerator component with the HuggingFaceH4/zephyr-7b-alpha model, which has been optimized for Hugging Face’s serverless API.
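If memory serves, that setup looked roughly like this in Haystack (a sketch, not the exact BookSearchArchive code):

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

# Rough sketch of the serverless LLM setup from the earlier post.
generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "HuggingFaceH4/zephyr-7b-alpha"},
    token=Secret.from_token(hf_secret),
)
result = generator.run(prompt="What is retrieval-augmented generation?")
print(result["replies"][0])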
