Writing a Robust “Do‑It‑All” Gemini API Caller in Python
- By Bruce Nielson
- ML & AI Specialist
In our last post, we talked about how Google slashed the rate limits on their free tier. How can we best deal with that problem programmatically? I wrote a utility function that wraps calls to the Gemini API and does all of the following:
- Send either a plain “generate content” call or a full chat message
- Catch and parse rate‑limit errors (ResourceExhausted)
- Back off for the exact number of seconds Google suggests
However, even with this functionality, you can get stuck in an infinite loop of retries, because Google will lie to you about how long you need to wait before you can try again. (Presumably this happens once you go over your now very small daily allotment, though I somewhat suspect they rate limit you before you've actually reached it.)
The "send_message" Utility
Drop llm_message_utils.py into your project (from BookSearchArchive):
from google.api_core.exceptions import ResourceExhausted
from google.generativeai import ChatSession, GenerativeModel
from google.generativeai.types.generation_types import GenerationConfig, GenerateContentResponse
from typing import Any, List, Union
import time
import re


def _extract_retry_seconds(exc: ResourceExhausted, default: int = 15) -> int:
    """
    Parses "retry_delay { seconds: N }" from exc.details.
    Returns N if found, else returns default.
    """
    try:
        details = str(getattr(exc, "details", ""))
        match = re.search(r'retry_delay\s*{\s*seconds:\s*(\d+)', details)
        if match:
            return int(match.group(1))
    except Exception:
        pass
    return default


def send_message(
    model: Union[ChatSession, GenerativeModel],
    message: str,
    tools: List = None,
    stream: bool = False,
    config: GenerationConfig = None,
    **generation_kwargs: Any
) -> Union[GenerateContentResponse, str]:
    # Build GenerationConfig from kwargs if needed
    if config is None and generation_kwargs:
        config = GenerationConfig(**generation_kwargs)
    try:
        if isinstance(model, ChatSession):
            # Full chat support (streaming, tools)
            return model.send_message(
                message,
                generation_config=config,
                tools=tools,
                stream=stream
            )
        else:
            # Single-shot content generation
            response = model.generate_content(
                contents=message,
                generation_config=config,
                tools=tools,
                stream=stream
            )
            return getattr(response, "text", None) or "[No response text]"
    except ResourceExhausted as e:
        delay = _extract_retry_seconds(e, default=15)
        print(f"Rate limit hit. Backing off for {delay} seconds…")
        time.sleep(delay)
        # WARNING: If you’ve hit your daily quota, this will loop forever!
        return send_message(
            model,
            message,
            tools=tools,
            stream=stream,
            config=config,
            **generation_kwargs
        )
How it Works
Let's start with the _extract_retry_seconds function. It takes a ResourceExhausted error and pulls out the "retry_delay" value, which is supposed to be the number of seconds to wait before you retry. As mentioned in our last post, this may do you no good if you've reached the daily limit.
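To make the parsing concrete, here is a minimal standalone sketch of the same regular expression applied to a sample string. The sample text is my assumption about roughly what the stringified error details look like, not captured output from the API, and parse_retry_seconds is a hypothetical stand-in for the helper above so that the snippet runs without the Google libraries installed.

```python
import re

# Same pattern as in _extract_retry_seconds above.
RETRY_DELAY_PATTERN = re.compile(r'retry_delay\s*{\s*seconds:\s*(\d+)')


def parse_retry_seconds(details: str, default: int = 15) -> int:
    """Return N from 'retry_delay { seconds: N }' if present, else default."""
    match = RETRY_DELAY_PATTERN.search(details)
    return int(match.group(1)) if match else default


# Hypothetical sample; the real error text may be formatted differently.
sample_details = "quota exceeded ... retry_delay {\n  seconds: 37\n}"

with_hint = parse_retry_seconds(sample_details)
without_hint = parse_retry_seconds("quota exceeded, no hint given")
print(with_hint, without_hint)
```

The \s* pieces are what let the pattern tolerate the newlines and indentation between the braces, and the default only kicks in when no "retry_delay" block is found at all.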
The send_message function does the real work. At a minimum, you need to pass a model and a message/prompt. The 'model' can be either a GenerativeModel or a ChatSession taken from a GenerativeModel. ChatSessions have a lot of extra functionality, like tracking history. I try to abstract away some of the pain here by allowing you to pass either. I'm assuming that if you pass a ChatSession you want to talk to the Large Language Model (LLM) in a chat session format; otherwise you just want to generate some content based on the prompt.
If this is a chat session, I call:
model.send_message(
    message,
    generation_config=config,
    tools=tools,
    stream=stream
)
Note that I allow you to pass in configuration arguments, a list of Tools, and a boolean value for streaming or not.
If you just want to generate content, I instead call:
response = model.generate_content(
    contents=message,
    generation_config=config,
    tools=tools,
    stream=stream
)
return getattr(response, "text", None) or "[No response text]"
Note that for a chat session we return the full GenerateContentResponse (the conversation history itself is tracked on the ChatSession). But if we're generating content, we just return a regular string.
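Because the return type depends on which kind of object you passed in, calling code may want to normalize it. Here is a minimal sketch, assuming all you need is the text. The as_text helper is hypothetical (it is not part of llm_message_utils.py), and FakeResponse is a stand-in for GenerateContentResponse so the snippet runs without the Google libraries.

```python
from typing import Any


def as_text(result: Any, fallback: str = "[No response text]") -> str:
    """Return plain text whether we were handed a str or a response object."""
    if isinstance(result, str):
        return result
    try:
        text = result.text
    except (AttributeError, ValueError):
        # Assumption: a response with no usable text may raise here.
        return fallback
    return text or fallback


class FakeResponse:
    """Stand-in for a response object that exposes a .text attribute."""

    def __init__(self, text):
        self.text = text


print(as_text("plain string"))
print(as_text(FakeResponse("chat reply")))
print(as_text(FakeResponse(None)))
```

This sketch does not try to handle streamed responses, where you would normally iterate over the chunks before asking for the text.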
We wrap all of that in a try/except block and trap ResourceExhausted errors. From there we just call the _extract_retry_seconds helper function and then sleep for the number of seconds Google told us to. If they didn't give us a number, we default to 15 seconds.
except ResourceExhausted as e:
    delay = _extract_retry_seconds(e, default=15)
    print(f"Rate limit hit. Backing off for {delay} seconds…")
    time.sleep(delay)
    # WARNING: If you’ve hit your daily quota, this will loop forever!
    return send_message(
        model,
        message,
        tools=tools,
        stream=stream,
        config=config,
        **generation_kwargs
    )
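One way to defuse that infinite-loop warning is to cap the number of attempts. The following is a minimal sketch of the idea, not the code used in BookSearchArchive. The names call_with_retries, max_attempts, and RateLimitError are all hypothetical, with RateLimitError standing in for ResourceExhausted so the snippet runs without the Google libraries.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for ResourceExhausted (hypothetical, for illustration only)."""


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimitError up to max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts:
                # Give up instead of looping forever (e.g. daily quota is gone).
                raise
            print(f"Rate limit hit (attempt {attempt}). Backing off for {delay} seconds...")
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")


# Demo: fail twice, then succeed. Sleeping is stubbed out to keep the demo fast.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("simulated quota error")
    return "ok"


result = call_with_retries(flaky, max_attempts=5, sleep=lambda seconds: None)
print(result, calls["count"])
```

Using a loop rather than recursion also keeps the call stack flat however many retries happen, and re-raising on the last attempt lets the caller decide what to do when the quota really is exhausted.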
I can now route all my LLM calls through this utility function and get retries for free. We could obviously add a lot more here, like exponential backoff or smarter checking for how long to wait.
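As a rough illustration of the exponential backoff idea, the sketch below computes a delay that doubles on each attempt, is capped at a maximum, and never drops below the wait the server suggested. The function backoff_delay and its parameters are hypothetical names, and the specific numbers are my assumptions rather than values recommended by Google.

```python
import random


def backoff_delay(
    attempt: int,
    suggested: float = 0.0,
    base: float = 2.0,
    cap: float = 300.0,
    jitter: float = 0.0,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    Doubles from `base` on each attempt, is capped at `cap`, never drops
    below the server-suggested wait, and adds up to `jitter` random seconds.
    """
    exponential = min(cap, base * (2 ** (attempt - 1)))
    delay = max(suggested, exponential)
    return delay + random.uniform(0.0, jitter)


# With jitter left at 0 the schedule is deterministic.
schedule = [backoff_delay(n) for n in range(1, 6)]
print(schedule)
```

In send_message, the value returned by _extract_retry_seconds could be passed in as the suggested wait, so the server's hint still acts as a floor while repeated failures wait progressively longer.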
We'll be using this little utility in BookSearchArchive for all our calls, in preparation for the ReAct-based research agent we're going to build next. That agent will run its own queries to research and find an answer to a user's query.