Using Hugging Face API Generators for RAG

Artificial Intelligence and Large Language Models (LLMs) like ChatGPT can quickly become expensive, especially when costs are incurred on a per-token basis. At Mindfire TECH, we are dedicated to providing cost-effective AI solutions for our customers. In our previous post, we modified our HaystackPgvector class to work with the GoogleAIGeminiGenerator, enabling us to integrate Google Gemini into our Haystack Retrieval Augmented Generation (RAG) pipeline. Prior to that, we explored how to utilize a local Hugging Face model with our RAG pipeline (see Part 1 and Part 2).

Hugging Face Inference Endpoints

For those focused on cost efficiency, one of the most effective strategies is to access a Hugging Face model via an API call. Hugging Face offers a rate-limited serverless solution that allows us to validate this approach. While it may not be robust enough for full production use, it provides a valuable opportunity to experiment with our solutions before we commit to paying Hugging Face for a dedicated inference endpoint.

Since most of the Hugging Face ecosystem and libraries are free, Hugging Face makes its money by renting out CPUs and GPUs for use with your AI applications. A Hugging Face inference endpoint (documentation found here) is a dedicated virtual server running a CPU or GPU for your applications, and you pay per hour instead of per token.

Haystack’s HuggingFaceAPIGenerator Component

Haystack has a built-in way to call a Hugging Face generator via an API rather than running it locally. The component is called HuggingFaceAPIGenerator.
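
Before wrapping it in a class, here is a minimal standalone sketch of the component in action. The model name and the HF_API_TOKEN environment variable are illustrative assumptions, not values from my actual code.

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

# Point the generator at the rate-limited serverless (free-tier) API.
generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "google/gemma-1.1-2b-it"},
    token=Secret.from_env_var("HF_API_TOKEN"))

# run() returns a dictionary with a "replies" list of generated strings.
result = generator.run(prompt="Summarize Federalist No. 10 in one sentence.")
print(result["replies"][0])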

In a previous post, we attempted to use a precursor to this Haystack component (then called HuggingFaceTGIGenerator) and bumped into some problems. That version had a strange tendency to give you a list of models available in the rate-limited free tier and then, once you used one of them, change the list of available models to no longer include that model. Even stranger, it tended to do this after only a use or two. It was difficult to figure out how to handle that, so at the time I abandoned the HuggingFaceTGIGenerator component in favor of local models. But I still believe the idea of hosting your LLM behind an API makes good sense, even if that particular Haystack component was problematic.

Since then, HuggingFaceTGIGenerator has been retired in favor of a new component called HuggingFaceAPIGenerator, which works much better and gives far more helpful feedback when there is a problem.

Of course, in typical Haystack fashion, their website is out of date and still lists HuggingFaceTGIGenerator as a valid component without the slightest hint that it is now deprecated. This led to much needless pulling out of my already thin hair as I desperately Googled around – to no avail – for an explanation of why it was missing from my up-to-date Haystack installation. Finally, someone on the Haystack Discord pointed me to the release notes for version 2.3, which say the following:

“Deprecated HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator have been removed. Use HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator instead.”

And that is the sole place on the entire internet (at least the part accessible via Google) that tells you to stop using HuggingFaceTGIGenerator and use HuggingFaceAPIGenerator instead. Well, I guess Haystack is open source, so it's not like it cost me anything. But dang, it was harder than it needed to be to figure out which component is the correct one for accessing the Hugging Face inference endpoint.

Using the HuggingFaceAPIGenerator Component

Now that we have modified our code to use three different Haystack components as our LLM (HuggingFaceLocalGenerator, GoogleAIGeminiGenerator, and HuggingFaceAPIGenerator), I'm starting to feel that my original code example (back in this post) has far too much LLM-specific detail inside our HaystackPgvector class.

However, I noticed that Haystack does not implement these three classes as subclasses of a single shared superclass! Instead, it relies on Python's dynamic typing to let you pass whichever component you need to your pipeline. Worse yet, these components have slight (or sometimes not-so-slight) differences in how you call them, what parameters are available, how you warm them up and run them, and whether or not they work with AutoConfig (as discussed in this post).
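
To make those differences concrete, here is a rough sketch of how the local generator gets constructed; compare it with the HuggingFaceAPIGenerator construction shown earlier (and again in the class below). The parameter values are illustrative, not the exact ones from my class.

from haystack.components.generators import HuggingFaceLocalGenerator

# Local generator: the model name and task are top-level arguments, not
# entries in an api_params dictionary.
local_llm = HuggingFaceLocalGenerator(
    model="google/gemma-1.1-2b-it",
    task="text-generation",
    generation_kwargs={"max_new_tokens": 500, "temperature": 0.6, "do_sample": True})

# Unlike the API-backed component, the local generator must be warmed up
# (weights downloaded and loaded into memory) before you can call run().
local_llm.warm_up()
reply = local_llm.run("Who wrote Federalist No. 51?")["replies"][0]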

To get around this problem, I decided to yank the model-specific code out of my HaystackPgvector class and create a new LanguageModel class that I can then subclass for each model type. If that seems too burdensome to you, I also still allow you to pass a Haystack component in directly and take your chances that there aren't incompatibilities.

The LanguageModel class now contains the code common to all the language models. You then inherit from that class and write your own init method specific to the model you want to use.
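
The real base class lives in the linked repository; as a simplified sketch (method and property names here are illustrative, not necessarily the ones in my actual code), it looks something like this:

class LanguageModel:
    """Thin wrapper that hides which Haystack generator component is in use."""

    def __init__(self, verbose: bool = True) -> None:
        self._verbose: bool = verbose
        self._model = None  # the Haystack generator, set by each subclass's __init__

    @property
    def generator(self):
        """The underlying Haystack generator component, for wiring into a pipeline."""
        return self._model

    def warm_up(self) -> None:
        # Only some components (e.g. HuggingFaceLocalGenerator) need warming up.
        if hasattr(self._model, "warm_up"):
            self._model.warm_up()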

For example, here is how I now initialize the HuggingFaceAPIGenerator:

from typing import Optional

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret


class HuggingFaceAPIModel(LanguageModel):
    def __init__(self,
                 model_name: str = 'google/gemma-1.1-2b-it',
                 max_new_tokens: int = 500,
                 password: Optional[str] = None,
                 temperature: float = 0.6,
                 verbose: bool = True) -> None:
        """
        Initialize a LanguageModel backed by the Hugging Face serverless Inference API.

        Args:
            model_name (str): Name of the Hugging Face model to use.
            max_new_tokens (int): Maximum number of tokens to generate per reply.
            password (Optional[str]): Hugging Face API token.
            temperature (float): Sampling temperature for generation.
            verbose (bool): Whether to print progress information.
        """
        super().__init__(verbose)

        self._max_new_tokens: int = max_new_tokens
        self._temperature: float = temperature
        self._model_name: str = model_name

        # Wrap Haystack's HuggingFaceAPIGenerator, pointed at the rate-limited
        # serverless (free-tier) Inference API rather than a local model.
        self._model: HuggingFaceAPIGenerator = HuggingFaceAPIGenerator(
            api_type="serverless_inference_api",
            api_params={
                "model": self._model_name,
            },
            token=Secret.from_token(password),
            generation_kwargs={
                "max_new_tokens": self._max_new_tokens,
                "temperature": self._temperature,
                "do_sample": True,
            })

That last part, the call to HuggingFaceAPIGenerator, is the important part. Note that, unlike with HuggingFaceLocalGenerator, you now pass the model name inside the api_params dictionary, and there is no longer a place to pass a task. The generation_kwargs stayed the same. You can call this as follows to create your wrapped component:

model: LanguageModel = HuggingFaceAPIModel(password=hf_secret)

Haystack then does all the necessary magic behind the scenes. It calls the Hugging Face serverless API (free tier) and the component then acts just like a local model from there. Very convenient. Also, on my very slow laptop at least, it is quite a bit faster. Plus, as we'll see, I can use larger models without running out of memory. Well, that's not quite true. See below for details.
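
Under the hood, the wrapped generator drops into the RAG pipeline like any other Haystack component. Here is a stripped-down sketch of that wiring, not the full HaystackPgvector pipeline (which also includes the pgvector retriever); the generator property comes from the illustrative LanguageModel sketch above, and the document content is just a stand-in.

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.dataclasses import Document

# A minimal prompt template that stuffs retrieved documents ahead of the question.
template = """Answer the question using only the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

# Stand-in for documents returned by the pgvector retriever.
retrieved_docs = [Document(content="Federalist No. 10, written by Madison, warns of the dangers of faction.")]

rag = Pipeline()
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", model.generator)  # the HuggingFaceAPIModel wrapper created above
rag.connect("prompt_builder", "llm")

result = rag.run({"prompt_builder": {"documents": retrieved_docs,
                                     "question": "What danger does Federalist No. 10 warn about?"}})
print(result["llm"]["replies"][0])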

You can find the code for the revised HaystackPgvector class here. And the most up-to-date version of HaystackPgvector is always found here.

If you need instructions on how to setup the environment, you can find the instructions here.

Trying Out Models

The code, by default, still uses google/gemma-1.1-2b-it because that will run on my laptop. However, now that we're offloading the LLM to an API, there is no real reason not to try a larger model. Keep in mind that this is the rate-limited free tier we're playing with right now, so there are hard limits on which models you are allowed to use. I tried out the following models with these results:

That last one (zephyr-7b-alpha) is surprising. It's a moderately large model, about the same size as gemma-7b-it, which I was told was too large for the free tier. After a bit of checking, I found that Zephyr is a model tuned by Hugging Face themselves, which they intentionally allow in the free tier despite its size. Zephyr is a fine-tuned version of Mistral-7B-v0.1, so it's relatively powerful.

Some of these models are gated, and you have to have permission to use them. You can see which permissions you have on this page. The model page itself (linked above) is usually where you request permission.

Conclusions

The HuggingFaceAPIGenerator is a Haystack component that does all the hard work of connecting to the Hugging Face inference endpoint, which is a Hugging Face-hosted CPU and/or GPU for use with your AI projects. We have now extended our sample HaystackPgvector class to allow our Federalist Papers RAG pipeline to utilize the Hugging Face inference API. This is an important tool in our toolbox for developing low-cost AI solutions.
