Google Gemma Demo: Setting up an LLM with Text Streaming

In a previous post, I introduced Google’s Gemma. Gemma was just one part of that post, so I wanted to write a single short post about getting a simple but effective Large Language Model (LLM) like Gemma set up and working on your laptop.

What is Google Gemma?

Everyone knows about Google’s premier LLM, Gemini, because it is integrated into their search engine. What many don’t know is that Google also released a number of simplified open-source versions of the Gemini model into the Hugging Face ecosystem. Those models are collectively called ‘Gemma.’

If you want to set up a good but relatively small LLM on, say, your laptop, Gemma is probably one of the best options available to you. I’ve had good luck with it despite its (relatively) small size.

You can find my full code for this post here if you wish to try it out. Or you can try it out in a Google Colab found here.

Installs and Imports

First, you’ll need to install the Hugging Face ecosystem:

pip install -U transformers

Then you’ll need a few Hugging Face imports:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
from huggingface_hub import login

Accessing Gemma via Hugging Face Website

Gemma models are gated, so you need to log in to Hugging Face and obtain permission before you can use them. If you don’t already have a login for the Hugging Face website, go to this page to sign up.

You’ll then need to request access to the specific Gemma model that you end up using. I’ll provide several possible model links below, but all of them will look like this:

[Screenshot: the gemma-1.1-2b-it page on the Hugging Face website, which looks similar to a GitHub repo page, with tabs for Model card, Files and versions, and Community. The Model card tab shows a notice that you must agree to Google’s usage license before you can access Gemma.]

You can see which models you have access to via this page. Then go to the tokens page to set up a Hugging Face token for your program to use.

Logging Into Hugging Face

Once you have access to a Gemma model via the Hugging Face website, have a login, and have created a token, you’ll be able to log in to the Hugging Face ecosystem from within your code. The way I went about this is to put a copy of one of my Hugging Face tokens into a text file called ‘huggingface_secret.txt’. You could put the token directly into your code, but then you risk letting it escape onto the internet should you accidentally push it to a public git repo. So it is good practice to save it either in a secret file like this or in an environment variable.

Here is my code to login using my secret file:

# Log in to Hugging Face using a token stored in a local secret file
secret_file = r'D:\Documents\Secrets\huggingface_secret.txt'
try:
    with open(secret_file, 'r') as file:
        secret_text = file.read().strip()  # strip any trailing newline from the token
except FileNotFoundError:
    print(f"The file '{secret_file}' does not exist.")
except Exception as e:
    print(f"An error occurred: {e}")
else:
    login(secret_text)
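
If you’d rather not keep a file around, here’s a minimal sketch of the environment-variable approach instead. The variable name HF_TOKEN is just what I’m assuming here; use whatever name you set in your own environment:

import os

# Alternative sketch: read the token from an environment variable instead of a file.
# Assumes the login import from huggingface_hub above; 'HF_TOKEN' is just an example name.
hf_token = os.environ.get("HF_TOKEN")
if hf_token:
    login(hf_token)
else:
    print("HF_TOKEN is not set; export it or fall back to the secret file approach.")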

Instantiate the Model and Tokenizer

Now we need to instantiate and load the Gemma model along with its tokenizer, so that we can tokenize requests and send them to the model. I’m going to use google/gemma-1.1-2b-it as my model because it’s small enough to comfortably run on my laptop. (Note that this tutorial is partially stolen from Google’s example code.)

model_name = 'google/gemma-1.1-2b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name
)

Gemma 1 vs Gemma 2

I’m going to use Gemma 1. However, Gemma 2 is really the better choice if your computer can handle it. There are several Gemma options available on Hugging Face’s website. You may want to use google/gemma-2-9b-it as your model instead. Or, if you have enough computing power, you can even use google/gemma-2-27b-it.

You’re probably wondering what the ‘it’ at the end of the model name means: it stands for ‘instruction-tuned.’ Because I’m using an instruction-tuned model, you’ll see tags like '<eos>' at the end of the output. Several of these models (though not the one I’m using) also have non-instruction-tuned versions (for example, try google/gemma-2-9b). However, instruction-tuned models are generally more useful than base models because they are trained to respond to a prompt, write to the end of some semantically relevant point, and then stop their text stream.
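
As a side note, instruction-tuned checkpoints usually ship with a chat template, and prompting through it often behaves better than passing raw text. Here’s a minimal sketch, assuming the tokenizer loaded above (the message text is just an example):

# Sketch: build a prompt with the tokenizer's chat template (instruction-tuned models only).
# Assumes the 'tokenizer' loaded above; the message content is just an example.
messages = [
    {"role": "user", "content": "Explain in one sentence what instruction tuning is."},
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the marker that tells the model it's its turn to reply
    return_tensors="pt",
)
# chat_inputs is a tensor of token ids that can be passed straight to model.generate(chat_inputs, ...)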

Tokenizing a User Query

Let’s now make a query to send to the model and tokenize it. We’re going to stick with the sword and sorcery theme of past posts:

input_text = "Write me a poem with rhyming lines about a Dungeons and Dragons adventure."
inputs = tokenizer(input_text, return_tensors="pt")
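
If you’re curious about what the tokenizer actually produced, you can peek at it with a quick sketch like this (the exact token count will vary):

# The tokenizer returns a dict of tensors: input_ids (the token ids) and attention_mask.
print(inputs["input_ids"].shape)                 # e.g. a [1, N] tensor: one row of token ids
print(tokenizer.decode(inputs["input_ids"][0]))  # decodes back to the prompt text (plus a leading <bos> token)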

Streaming the Output

Now let’s generate a response from the Gemma LLM and stream it to the console as it writes it:

streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(**inputs, streamer=streamer, max_length=2000, do_sample=True, temperature=0.9)

This last block requires a bit of explanation. We previously imported the TextStreamer component from the Hugging Face transformers package. This component lets us avoid waiting for the entire output of the large language model before printing the result. I much prefer watching it slowly write the response so that I can start reading right away. This also gives me a pretty good idea of how fast (or rather how slow) my laptop is with and without use of a GPU.

We instantiate a TextStreamer and pass the tokenizer to it so that it can decode tokens back into text as they are generated. We also tell the TextStreamer instance not to echo the prompt we gave it (i.e. skip_prompt=True).

The real work is in the final line, the model.generate() call. We assign the return value to ‘_’ because we don’t need it; the streamer prints the text to the console as it is generated.

We pass the tokenized inputs as well as the streamer into model.generate(). In addition, we give it a max length of 2000 tokens and specify ‘do_sample=True’. Setting do_sample=True randomizes the output so that you get a different result each time, which is more fun and honestly tends to improve the results. The temperature=0.9 setting controls how much randomness is used when sampling. Don’t let it get too high or the output won’t seem coherent any more. I put 0.9 for the temperature, but really you should probably lower it a bit.
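
If you want to experiment, a sketch like this compares a slightly cooler run against a fully deterministic (greedy) one:

# Sketch: the same generate call with a lower temperature, then a deterministic (greedy) run.
_ = model.generate(**inputs, streamer=streamer, max_length=2000, do_sample=True, temperature=0.7)
_ = model.generate(**inputs, streamer=streamer, max_length=2000, do_sample=False)  # same output every time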

Gemma’s Poem

And here is the result I get back from Gemma:
A quest undertaken, brave and bold,
Through dungeons deep, a story untold.
With sword and spell, our heroes strong,
Against darkness' grasp, their path must belong.

Through shadowy caves, we tread and stride,
Where goblins hiss and trolls are wide.
With cunning minds and hearts of fire,
We face our foes, beyond compare.

From hidden halls to perilous heights,
Our journey takes us through the night.
Boss battles rage, magic keen,
Our resolve unwavering, our spirits keen.

Through epic fights and wondrous deeds,
Our adventure takes us far afield.
A triumph won, a foe undone,
Our reputation forever bound.

So let us sing of our epic tale,
A triumph of courage and the bold.
For in the realm of magic and might,
Our Dungeons and Dragons adventure shines bright.<eos>

Not bad at all! Really pretty impressive.

Adding Cuda (Optional)

Right now, this code runs on the CPU, and I find the results a bit on the painfully slow side. My laptop does have a small GPU, so it would be nice to utilize it. However, that requires a few more installs and some environment setup. Let’s at least go over how to use the GPU.

You’ll need to install several things for this to work:

  1. Install pytorch for your environment. See also this post for more instructions on how to set things up.
  2. pip install accelerate

Once you have those set up, add the following to your imports:

import torch

And then we’ll need to adjust the code as follows. First, let’s determine whether a GPU is available:

device = "cuda" if torch.cuda.is_available() else "cpu"

This line of code uses the torch package we imported to check whether CUDA is available and sets a variable to either ‘cuda’ or ‘cpu.’

Now modify the model to use the GPU if available:

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
)

And move the tokenized inputs onto the same device:

inputs = tokenizer(input_text, return_tensors="pt").to(device)

Now re-run your code and (at least on my laptop) it is quite a bit faster! Plus, it will still work on a CPU if no GPU is available.
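
If you want to confirm where the model actually ended up, a quick sanity check like this works (assuming the model loaded above):

# Sketch: check which device and dtype the model weights ended up on.
print(next(model.parameters()).device)  # e.g. 'cuda:0' when the GPU is in use, 'cpu' otherwise
print(next(model.parameters()).dtype)   # torch.bfloat16 on GPU, torch.float32 on CPU with the settings above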
