
Llama.cpp for Large Language Models
- By Bruce Nielson
- ML & AI Specialist
In a previous post, we tried the Ollama software to run our Large Language Models (LLMs). Ollama seemed to be an improvement over loading the model using the Haystack Hugging Face component – probably mostly because it pre-loads and caches the models when the service starts up.
Llama.cpp (pronounced Llama C++) is another way to run LLMs, similar to Ollama. However, it is written from the ground up in C++ for efficient LLM inference. And Haystack has a built-in integration for it: the LlamaCppGenerator component.
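Here is a minimal sketch of that component on its own, loosely based on the example in the Haystack integration documentation. The GGUF file name is the one their example uses and is assumed to already be downloaded to your working directory (we'll get to downloading it below):

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# Assumes openchat-3.5-1210.Q3_K_S.gguf has already been downloaded locally
generator = LlamaCppGenerator(
    model="openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=2048,
    n_batch=512,
    generation_kwargs={"max_tokens": 128, "temperature": 0.6},
)
generator.warm_up()

result = generator.run("Briefly explain what a RAG pipeline is.")
print(result["replies"][0])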
The Advantages of Llama.cpp
Llama.cpp uses the efficient quantized GGUF format. While this reduces memory requirements and accelerates inference, it means you have to download a GGUF file directly to run the model. Yes, that’s right. Unlike the Ollama or Hugging Face interfaces, where you can just pass a model name, you must download the GGUF file yourself first.
To abstract this away a bit, I’ve created a model wrapper (as I did with all the other model interfaces we’ve tried out) that automatically downloads the model if it isn’t already downloaded. The Haystack Llama.cpp integration documentation suggests how to do this; I just integrated it into my code and simplified it a bit. Here is the wrapper I wrote that my RAG pipeline uses:
import os
import urllib.request
from typing import Optional

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator


# GeneratorModel is the shared base class for the model wrappers in this series
class LlamaCppModel(GeneratorModel):
    def __init__(self,
                 model_link: str = 'https://huggingface.co/TheBloke/openchat-3.5-1210-GGUF/resolve/main/openchat-3.5-1210.Q3_K_S.gguf',  # noqa: E501
                 context_length: int = 2048,
                 max_tokens: int = 512,
                 temperature: float = 0.6,
                 verbose: bool = True) -> None:
        super().__init__(verbose=verbose)
        self._warmed_up: bool = False
        self._model_link = model_link
        # Take the name of the model from the link: everything after the last /
        self._model_name = model_link.split("/")[-1]
        self._context_length = context_length
        self._max_tokens = max_tokens
        self._temperature = temperature
        if self._verbose:
            print("Warming up LlamaCPP Large Language Model: " + self._model_name)
        # Check if the model is already downloaded and download it if necessary
        self._download_model()
        self._model: LlamaCppGenerator = LlamaCppGenerator(
            model=self._model_name,
            n_ctx=self._context_length,
            n_batch=512,
            model_kwargs={"n_gpu_layers": -1},
            generation_kwargs={"max_tokens": self._max_tokens, "temperature": self._temperature},
        )

    def generate(self, prompt: str) -> str:
        # run() returns a dict with a "replies" list; return the first reply as a string
        return self._model.run(prompt)["replies"][0]

    @property
    def context_length(self) -> Optional[int]:
        return self._context_length

    def warm_up(self) -> None:
        if not self._warmed_up:
            self._model.warm_up()
            self._warmed_up = True

    def _download_model(self):
        # Check if the file already exists before downloading
        if not os.path.isfile(self._model_name):
            urllib.request.urlretrieve(self._model_link, self._model_name)
            print("Model file downloaded successfully: " + self._model_name)
        else:
            print("Model file already exists: " + self._model_name)
You pass the model wrapper the full URL of the GGUF file you want to use, like this:
model: gen.GeneratorModel = gen.LlamaCppModel(model_link="https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf")
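Once it’s constructed, the wrapper works like the other generator wrappers in this series: warm it up, then generate. A quick sketch outside the RAG pipeline (the prompt is made up, just for illustration):

model.warm_up()  # loads the GGUF file into the underlying LlamaCppGenerator
answer = model.generate("Briefly explain what a GGUF file is.")
print(answer)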
You’ll also need to install the Haystack component for the Llama.cpp integration:
pip install llama-cpp-haystack
This will also automatically install llama-cpp-python, which is a Python wrapper for llama.cpp (which is, of course, written in C++ for speed. Note: I remember back when C++ was considered the slow language. 😊). You can find the GitHub repo for llama-cpp-python here.
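If you’re curious what that lower layer looks like, here is a hedged sketch of calling llama-cpp-python directly, without Haystack. The Llama class and its arguments come from the llama-cpp-python documentation; the model path assumes you’ve already downloaded a GGUF file, and offloading layers to the GPU requires a GPU-enabled build of the library:

from llama_cpp import Llama

# Load a local GGUF file directly with llama-cpp-python (no Haystack involved)
llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",  # assumed to be downloaded already
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers to the GPU if the build supports it
)

output = llm("Q: What is quantization? A:", max_tokens=128)
print(output["choices"][0]["text"])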
Picking a Model
You might be wondering how to find a GGUF model file to use. Not all Hugging Face models have one. In fact, most official models don’t. But the community often quantizes official models and puts them up on the Hugging Face website. “TheBloke” is famous for this, so let’s try out one of his, and I’ll walk you through how to find the full path to the file on the Hugging Face website.
The Haystack integration documentation suggests this model:
https://huggingface.co/TheBloke/openchat-3.5-1210-GGUF
But I wanted to try a quantized version of the zephyr-7B-beta model because that is the model we use via the Hugging Face API (as discussed back in this post). It can be found here:
https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF
What is Quantization?
Perhaps you are wondering what quantization is. When an LLM is initially trained, floating-point numbers are used to represent the weights, usually with 16 bits each, or possibly 32. To quantize the model, we compress the weights down to a smaller number of bits. The weights keep roughly the same values, but some of the precision of the floating-point numbers is lost.
The result is the same model with (approximately) the same weights – just with a bit of precision lost – compressed down to a much smaller size. This means the model takes less memory and will generate responses (i.e. run inference) faster.
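To put some rough numbers on that, here is the back-of-the-envelope arithmetic for a 7-billion-parameter model like zephyr-7b-beta (real GGUF files differ a bit, because some tensors keep higher precision and the file includes metadata):

# Approximate size of just the weights for a 7B-parameter model at various precisions
params = 7_000_000_000

for bits in (32, 16, 4):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.1f} GB")

# Prints roughly: 32-bit ~28.0 GB, 16-bit ~14.0 GB, 4-bit ~3.5 GB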
How to Find the GGUF Model File
Let’s start by navigating to the link of the model we are going to use:
Scroll down a bit and there is a list and description of your options:
I’m going to target the 4-bit quantized model. That means we’re squeezing 16 bits down to 4 bits! But notice the comment next to the zephyr-7b-beta.Q4_K_M.gguf file: “medium, balanced quality – recommended”. Sounds promising. Click the link to that file:
https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/blob/main/zephyr-7b-beta.Q4_K_M.gguf
Then click the copy download link button and it will copy the direct download link to your clipboard:
For me that was:
https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf
That is the link you want to pass to my model wrapper. The wrapper will then download the GGUF file, or, if it was previously downloaded, use the existing local copy.
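As an aside, if you would rather not copy raw URLs around at all, the huggingface_hub library can fetch the same file by repository id and file name. This isn’t what my wrapper does (it sticks with urllib); it’s just a hedged alternative:

from huggingface_hub import hf_hub_download

# Downloads into the Hugging Face cache and returns the local file path
local_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_M.gguf",
)
print(local_path)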
You can find the updated code as of writing this blog here.
The Disadvantages of Llama.cpp
So, Llama.cpp is faster than other methods of running an LLM, but it has some definite disadvantages – beyond having to find your own GGUF model file. While Llama.cpp itself does support streaming (the LangChain integration takes advantage of it), Haystack’s Llama.cpp integration does not support streaming yet.
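If you need streaming today, one workaround is to drop below Haystack and call llama-cpp-python directly, which does support it. A hedged sketch, with the same assumption as before that the GGUF file is already downloaded locally:

from llama_cpp import Llama

llm = Llama(model_path="zephyr-7b-beta.Q4_K_M.gguf", n_ctx=2048)

# stream=True yields partial completions as they are generated
for chunk in llm("Q: Explain GGUF in one sentence. A:", max_tokens=128, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()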
Conclusions
Llama.cpp is a great way to run a production LLM locally so that you don’t have to expose your company’s private data to OpenAI or Microsoft. Its support for quantized models and its C++ implementation make it a superior choice. However, it requires a bit more work to get up and running.
Links:
- Llama.cpp Tutorial: https://www.datacamp.com/tutorial/llama-cpp-tutorial
- Llama-cpp-python documentation (API reference): https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama
- Llama.cpp GitHub Repo: https://github.com/ggerganov/llama.cpp
- A Llama.cpp Python Interface Repo: https://github.com/abetlen/llama-cpp-python
- Llama.cpp Haystack Integration: https://haystack.deepset.ai/integrations/llama_cpp
- My feature request for streaming callbacks: https://github.com/deepset-ai/haystack/issues/8682