Installing Ollama for Large Language Models (LLM) in Windows

I kept reading about Ollama for Large Language Models (LLMs), so I decided to download it and try it out. Ollama is basically software that lets you run an LLM locally. We've covered other software for running LLMs before: I mentioned that the easiest way to do this is LM Studio (see https://lmstudio.ai/), and of course we've used Haystack and Hugging Face throughout my blog posts to run LLMs; they have their own software for that.

Ollama is a good alternative to LM Studio. It is easy to download and start using, plus it has support in Haystack (see below). So it sits somewhere in between: an end user can run LLMs with it (though through a command-line interface instead of a nice GUI), and it can also integrate into your Haystack or Hugging Face stack. Plus, being command-line based, it has a lot more options available than LM Studio did.

That being said, I admit that I'm not overly impressed with Ollama. It doesn't seem to 'do much', as it were, that couldn't just be done directly via Haystack or Hugging Face. Perhaps I'm missing the real point.

Loading Ollama

Be that as it may, let's learn how to install Ollama and (in a later section) integrate it into our stack using Haystack. This post covers installation on Windows, though it shouldn't be much different for other operating systems.

First navigate to the Ollama website at https://ollama.com/. You should see this:

[Image 1: The Ollama home page]

Click the download button in the middle of the page and it will take you to this page: https://ollama.com/download. You will see this:

[Image 2: The Ollama download page with operating system options]

For my part, I selected Windows, which seems to be the default. Or you can navigate directly here: https://ollama.com/download/windows

Now click the download button and you’ll download OllamaSetup.exe.

The instructions for Ollama are found on their GitHub page. These include install instructions similar to (but more detailed than) this post.

When you are ready, run the OllamaSetup.exe you downloaded and let it install. You should see this:

[Image 3: The OllamaSetup.exe installer]

Click Install and let the installation finish.

It should run as a small service, but if not, you can run Ollama directly by searching for the Ollama app like this:

[Image 4: Searching for the Ollama app in the Start menu]

Once you run it, it will appear as a service like this:

[Image 5: Ollama running as a background service]
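If you want a quick sanity check that the service is really up, Ollama listens on port 11434 by default and answers a plain HTTP GET. Here's a minimal sketch, assuming you have the requests package installed:

# Quick sanity check: the Ollama service answers on port 11434 by default.
import requests

response = requests.get("http://localhost:11434")
print(response.status_code, response.text)  # expect 200 and a short "Ollama is running" message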

Running Ollama in a Terminal

Now let’s run Windows Terminal. Presumably you already know how to do that but just in case, search for it like this:

[Image 6: Searching for Windows Terminal]

From here you can start a model using the run command like this:

[Image 7: Starting a model with the run command in the terminal]

In this case, the command is ollama run gemma2.

Ollama will automatically download the model you choose, store it locally, and then run it and you’ll get a session like this:

[Image 8: An interactive session with gemma2 in the terminal]

FYI, the answer Gemma 2 gave to my question was almost pure hallucination. Sigh.
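Incidentally, you don't have to go through the interactive terminal at all: the running service also exposes a local REST API on port 11434. A minimal sketch of calling it from Python (the model name and prompt are just examples, and this assumes the requests package is installed):

# Minimal sketch: query the local Ollama REST API instead of the interactive terminal.
import requests

payload = {
    "model": "gemma2",  # any model you have already pulled
    "prompt": "Summarize what Ollama does in one sentence.",
    "stream": False,  # return the whole answer at once instead of streaming
}
response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])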

Finding a Model

The Ollama website has a list of standard models found on this web page: https://ollama.com/search

[Image 9: The Ollama model search page]

However, Ollama also supports a community of models, similar to the Hugging Face ecosystem. You can search these custom models using the search bar on the home page (rather than the official model search page):

[Image 10: The search bar on the Ollama home page]

There are a huge number of official and community models available. Let’s take a look at the Llama3.3 model page:

[Image 11: The Llama3.3 model page]

You can select the number of parameters you want your model to have (in this case, 70b) and then copy the Ollama command to run it:

[Image 12: Selecting a parameter count and copying the run command]

You are now up and running with Ollama and local LLMs! Easy, right?
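If you'd rather script this than copy commands from the website, there is also an official ollama Python client (installed with pip install ollama) that can pull and query models. A hedged sketch, with the model tag purely as an example:

# Hedged sketch using the ollama Python client (pip install ollama); the model tag is an example.
import ollama

ollama.pull("llama3.3:70b")  # same effect as "ollama pull llama3.3:70b" on the command line
response = ollama.chat(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response["message"]["content"])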

The C Drive Problem

One thing I rather dislike about Ollama is that it installs only to the C drive and doesn't let you specify where to download models other than (of course!) the C drive. However, here are some helpful links that should get you around that problem. (Here and here.)
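For what it's worth, my reading of the Ollama docs is that the OLLAMA_MODELS environment variable controls where models get stored, so one workaround is to set it before the service starts. A hedged sketch (the D: path is just an example, and this assumes your Ollama version honors the variable):

# Hedged sketch: start the Ollama service with models stored on another drive.
# Assumes OLLAMA_MODELS is honored by your Ollama version and that D:\ollama\models exists.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_MODELS"] = r"D:\ollama\models"  # example path on a non-C drive

# "ollama serve" starts the same service the installer normally runs in the background.
subprocess.run(["ollama", "serve"], env=env)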

Integrating Ollama into Book Search Archive

I next went on to integrate Ollama into my open-sourced Book Search Archive to see how it performed compared to other models. You can find the codebase at the time of this blog post here.

First, you’ll need to install the needed integration with Haystack:

pip install ollama-haystack

I wrote a new wrapper just for Ollama models that looks like this:
# Imports needed by this snippet. StreamingGeneratorModel is my own abstract base
# class, defined elsewhere in the Book Search Archive codebase.
from typing import Callable, Optional

from haystack.dataclasses import StreamingChunk
from haystack_integrations.components.generators.ollama import OllamaGenerator


class OllamaModel(StreamingGeneratorModel):
    def __init__(self,
                 model_name: str = 'gemma2',
                 url="http://localhost:11434",
                 temperature: float = 0.6,
                 streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
                 verbose: bool = True) -> None:

        super().__init__(verbose=verbose, streaming_callback=streaming_callback)

        if self._verbose:
            print("Warming up Ollama Large Language Model: " + model_name)

        self._model: OllamaGenerator = OllamaGenerator(
            model=model_name,
            url=url,
            streaming_callback=self._default_streaming_callback_func,
            generation_kwargs={
                "temperature": temperature,
                # "num_gpu": 1,  # Number of GPUs to use
                # "num_ctx": 2048,  # Reduce context window
                # "num_batch": 512,  # Reduce batch size
                # "mirostat": 0,  # Disable mirostat sampling
                # "seed": 42,  # Set a fixed seed for reproducibility
            },
        )

    def generate(self, prompt: str) -> str:
        return self._model.run(prompt)

Note that I now have a StreamingGeneratorModel abstract class to abstract away how streaming works. Previously only Hugging Face models could stream; now both Hugging Face models and Ollama models can. The key change is the call to Haystack's OllamaGenerator component.
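I won't reproduce the whole base class here, but the idea is roughly the sketch below (a simplified illustration, not the exact code in the repo):

# Simplified sketch of the StreamingGeneratorModel idea (not the exact code in the repo).
from abc import ABC, abstractmethod
from typing import Callable, Optional

from haystack.dataclasses import StreamingChunk


class StreamingGeneratorModelSketch(ABC):
    def __init__(self,
                 verbose: bool = True,
                 streaming_callback: Optional[Callable[[StreamingChunk], None]] = None) -> None:
        self._verbose = verbose
        self._streaming_callback = streaming_callback

    def _default_streaming_callback_func(self, chunk: StreamingChunk) -> None:
        # Forward each streamed chunk to the user-supplied callback, if one was given.
        if self._streaming_callback is not None:
            self._streaming_callback(chunk)

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...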

I created the model like this:

model: gen.GeneratorModel = gen.OllamaModel(model_name="gemma2")

Then I pass that to the RAG pipeline. (I really need a better way to handle models than creating a custom wrapper for each one. I'll add that to the TODO. It seemed like a good idea at the time. ☹)
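For readers who aren't following the Book Search Archive code, here is a bare-bones sketch of dropping OllamaGenerator straight into a Haystack pipeline (this is not my actual RAG pipeline, and the prompt template is just an example):

# Bare-bones sketch: OllamaGenerator in a Haystack 2.x pipeline (not the Book Search Archive pipeline).
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator

template = "Answer the question as briefly as you can.\nQuestion: {{ question }}\nAnswer:"

pipeline = Pipeline()
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("llm", OllamaGenerator(model="gemma2", url="http://localhost:11434"))
pipeline.connect("prompt_builder", "llm")  # PromptBuilder's prompt output feeds the generator's prompt input

result = pipeline.run({"prompt_builder": {"question": "What does Ollama do?"}})
print(result["llm"]["replies"][0])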

Note that I'm using 'gemma2' instead of gemma-1.1-2b-it as I did for a local Hugging Face model. According to the Ollama model page, Gemma 2 is a 9B parameter model! Yet it seems to run faster for me than the 2B parameter model I was using with Hugging Face. I'm not sure why this is. It might be mostly an illusion due to Ollama caching models whereas Hugging Face loads the model each time. But even the generation, while not fast on my wimpy laptop, seems snappier than the 2B model I was previously using. So, color me impressed so far!

(Update: it occurred to me later that the default context window for Ollama's Gemma 2 model is only 2k tokens, whereas my Gemma 1 model defaulted to an 8k context window. That explains why they run at about the same speed. ☹ But I'm still impressed by how quickly Ollama warms up. Whatever Ollama is doing seems to be effective, probably because the model is actually warmed up when you start the service and cached after that.)
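If I wanted a fairer comparison, the commented-out num_ctx option in the wrapper's generation_kwargs is the knob for that. Something like the following should raise the window to 8k, assuming the machine has the memory for it:

# Sketch: ask Ollama for a larger context window via generation_kwargs (memory permitting).
from haystack_integrations.components.generators.ollama import OllamaGenerator

llm = OllamaGenerator(
    model="gemma2",
    url="http://localhost:11434",
    generation_kwargs={"num_ctx": 8192},  # match the 8k window I had with Gemma 1
)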

Another nice feature is that you can pass a URL right into the OllamaGenerator Haystack class. That means you can use the same generator whether Ollama is running locally or behind a remote API (whereas with Hugging Face those were two different generators). It's not a huge deal, but it makes it conceptually easier to treat local and API models the same; the local case just happens to run on localhost.
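For example, pointing the same generator at a remote Ollama server is just a different URL (the hostname below is made up):

# Same component, remote server: only the URL changes (hostname below is hypothetical).
from haystack_integrations.components.generators.ollama import OllamaGenerator

remote_llm = OllamaGenerator(model="gemma2", url="http://my-ollama-box:11434")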

And I did confirm that streaming works fine for OllamaGenerator. (Note that I pass self._default_streaming_callback_func instead of streaming_callback for a variety of complex reasons I explained in this post.)
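If you just want to watch tokens appear as they are generated, a callback that prints each chunk is enough. A minimal sketch:

# Minimal sketch: stream tokens to the console as OllamaGenerator produces them.
from haystack.dataclasses import StreamingChunk
from haystack_integrations.components.generators.ollama import OllamaGenerator


def print_chunk(chunk: StreamingChunk) -> None:
    print(chunk.content, end="", flush=True)  # print each piece as it arrives


llm = OllamaGenerator(model="gemma2",
                      url="http://localhost:11434",
                      streaming_callback=print_chunk)
llm.run("Explain retrieval-augmented generation in one sentence.")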

Conclusions

Ollama is a great, easy way to get started with local LLMs. It is more technical than LM Studio but also more flexible. I'm not sure if this is just my imagination, but Ollama seems a lot faster than running a model locally via Hugging Face (at least on my wimpy laptop). I was able to get okay results out of a 9B parameter Gemma 2 model.

I'll cover more in our next post. In the meantime, you can find the official Haystack Ollama Integration Guide here.

Links: Matt Williams has excellent basic and advanced courses on Ollama: https://www.youtube.com/watch?v=9KEUFe4KQAI&list=PLvsHpqLkpw0fIT-WbjY-xBRxTftjwiTLB
