How DSPy Optimizes Prompts

In our previous posts (here, here, and here), we looked at DSPy and how it turns Large Language Models (LLMs) into structured, type-safe Python functions. You define a class with input and output fields, and DSPy builds the prompt for you automatically.

But here’s the thing: sometimes your initial prompt isn’t perfect. The model might get the instructions a little wrong, or your few-shot examples might not be ideal. That’s where DSPy optimizers like MIPROv2 come in. MIPROv2 is a tool in DSPy that automatically tweaks your prompts and few-shot examples to make the model behave better. You can read more about it here.

In this post, we’ll show you how DSPy can optimize prompts using MIPROv2. We’ll go from an instruction that gives poor results to one that gives consistently correct results — and we’ll show you exactly what changes along the way. You’ll see how the model’s instructions and examples evolve, and why this makes your LLM programs more reliable.

You can find the full code for this blog post in my GitHub repo here.

A Poorly Defined Prompt

The fact is that DSPy is so good at building prompts that I actually struggled to come up with a simple example where it could automatically improve one for you. It often got 100% accuracy on sentiment analysis right out of the box. So, to give it a real challenge, I rewrote my Classify signature. Originally we had:

import dspy
from typing import Literal

class Classify(dspy.Signature):
    """Classify sentiment of a given sentence."""
    sentence: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField()

And now we have the much vaguer:

class AnalyzeText(dspy.Signature):
    """Process the input."""  # ← Useless instruction
    text: str = dspy.InputField(desc="input data")
    label: Literal["Bingo!", "Hmmm...", "Oink!"] = dspy.OutputField(desc="output")  # Shuffled order!

Instead of being called Classify (which is a dead giveaway that we're doing sentiment analysis), it's now called AnalyzeText. And sentence and sentiment are now text and label. Plus, the labels are 'Bingo!', 'Oink!', and 'Hmmm...'.

The first time I tried this, Gemini figured out on its own that 'Bingo!' was positive, 'Oink!' was negative, and 'Hmmm...' was neutral. So I had to swap the labels around: now 'Bingo!' is neutral, 'Oink!' is positive, and 'Hmmm...' is negative. There is no way Gemini can figure out the correct labels from the names alone!
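For reference, the inverted ground truth I'm using to label the data can be captured in a small dict. (SENTIMENT_TO_LABEL and expected_label are just names I'm using for this post, not part of DSPy or the repo.)

```python
# The inverted, counterintuitive mapping from true sentiment to nonsense label.
SENTIMENT_TO_LABEL = {
    "positive": "Oink!",
    "neutral": "Bingo!",
    "negative": "Hmmm...",
}

def expected_label(sentiment: str) -> str:
    """Return the counterintuitive label the model must learn to emit."""
    return SENTIMENT_TO_LABEL[sentiment]

print(expected_label("positive"))  # → Oink!
```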

And sure enough, it does a terrible job labeling the data out of the box:

============================================================
BEFORE OPTIMIZATION
============================================================

📋 Initial Instruction:
"Process the input."

🎭 INVERTED Nonsense Categories (counterintuitive!):
   'Oink!'   = positive sentiment (opposite of what you'd expect!)
   'Bingo!'  = neutral/mixed sentiment (not positive!)
   'Hmmm...' = negative sentiment (not thoughtful!)

The model will naturally guess WRONG without training!

🧪 Testing on dev set (without optimization):
  ✗ 'The first half was amazing but then it fell a...' → Hmmm...  (expected: Bingo!)
  ✗ 'Not terrible but nothing special....' → Hmmm...  (expected: Bingo!)
  ✗ 'An absolute masterpiece in every way!...' → Bingo!   (expected: Oink!)
  ✓ 'I wanted to like it but it was just awful....' → Hmmm...  (expected: Hmmm...)
  ✗ 'Has its moments but overall just average....' → Hmmm...  (expected: Bingo!)

📊 Initial accuracy: 1/5 (20%)
    ↑ Should be near random chance (~33%) with no instruction!

I'm sort of surprised it even got 20% right.

Running the Optimizer

Now that we’ve seen how DSPy builds prompts automatically, let’s see how it can improve them using the MIPROv2 optimizer.

We’ll take our deliberately confusing example above to make the optimizer necessary. We also define a small training set and validation set with tricky examples to test the model. The metric simply checks whether the model predicts the expected label:

def metric(example, pred, trace=None):
    return example.label == pred.label
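If you want to sanity-check the metric outside of DSPy, note that it only relies on both objects exposing a .label attribute, so simple stand-ins work. (The SimpleNamespace objects below are mocks for illustration, not real DSPy examples or predictions.)

```python
from types import SimpleNamespace

def metric(example, pred, trace=None):
    return example.label == pred.label

# SimpleNamespace objects stand in for dspy Example / Prediction instances,
# which both expose a .label attribute.
examples = [SimpleNamespace(label="Bingo!"), SimpleNamespace(label="Oink!")]
preds = [SimpleNamespace(label="Hmmm..."), SimpleNamespace(label="Oink!")]

# Average the per-example scores to get an accuracy figure.
accuracy = sum(metric(ex, pr) for ex, pr in zip(examples, preds)) / len(examples)
print(f"Accuracy: {accuracy:.0%}")  # → Accuracy: 50%
```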

As we saw, this initially only gets 20% correct. The full initial prompt can also be inspected using:

lm.inspect_history(n=1)

This shows exactly how DSPy was instructing the LLM before optimization. Here is the prompt that is built initially:

System message:

Your input fields are:
1. `text` (str): input data
Your output fields are:
1. `label` (Literal['Bingo!', 'Hmmm...', 'Oink!']): output
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## text ## ]]
{text}

[[ ## label ## ]]
{label}        # note: the value you produce must exactly match (no extra characters) one of: Bingo!; Hmmm...; Oink!

[[ ## completed ## ]]
In adhering to this structure, your objective is:
        Process the input.

Let's run the MIPROv2 optimizer:

from dspy.teleprompt import MIPROv2

optimizer = MIPROv2(metric=metric, auto="light")
optimized = optimizer.compile(
    student=program,
    trainset=trainset,
    valset=devset,
)

...

Once compilation finishes, we can pull out the optimized instruction and compare it to the original:

if hasattr(optimized, 'predictor'):
    optimized_instruction = optimized.predictor.signature.__doc__
    original_instruction = program.predictor.signature.__doc__
    print(f'Before: "{original_instruction}"')
    print(f'After:  "{optimized_instruction}"')

After the MIPROv2 optimizer runs we get much better results:

============================================================
AFTER OPTIMIZATION
============================================================

📋 Optimized Instruction:
"You are a sentiment analysis model. Classify the given text into one of the following categories: "Oink!" for strongly positive, "Hmmm..." for negative, and "Bingo!" for mixed or neutral sentiments."

✨ INSTRUCTION CHANGED! ✨

  Before: 'Process the input.'

  After:  'You are a sentiment analysis model. Classify the given text into one of the following categories: "Oink!" for strongly positive, "Hmmm..." for negative, and "Bingo!" for mixed or neutral sentiments.'

  The optimizer learned the nonsense mapping!

🧪 Testing optimized version on dev set:
  ✓ 'The first half was amazing but then it fell a...' → Bingo!   (expected: Bingo!)
  ✓ 'Not terrible but nothing special....' → Bingo!   (expected: Bingo!)
  ✓ 'An absolute masterpiece in every way!...' → Oink!    (expected: Oink!)
  ✓ 'I wanted to like it but it was just awful....' → Hmmm...  (expected: Hmmm...)
  ✓ 'Has its moments but overall just average....' → Bingo!   (expected: Bingo!)

📊 Optimized accuracy: 5/5 (100%)
    ↑ Should be much better now!

Wow! 100% success now, despite the deliberately misleading labels! Okay, how does it do it? What does the final prompt look like?

System message:

Your input fields are:
1. `text` (str): input data
Your output fields are:
1. `label` (Literal['Bingo!', 'Hmmm...', 'Oink!']): output
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## text ## ]]
{text}

[[ ## label ## ]]
{label}        # note: the value you produce must exactly match (no extra characters) one of: Bingo!; Hmmm...; Oink!

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are a sentiment analysis model. Classify the given text into one of the following categories: "Oink!" for strongly positive, "Hmmm..." for negative, and "Bingo!" for mixed or neutral sentiments.

Whoa! It changed the instruction from "Process the input." (my deliberately vague default) to "You are a sentiment analysis model. Classify the given text into one of the following categories: 'Oink!' for strongly positive, 'Hmmm...' for negative, and 'Bingo!' for mixed or neutral sentiments."

No wonder it's now getting 100% correct!
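To build some intuition for what just happened, here's a heavily simplified toy sketch of the search loop: propose candidate instructions, score each one with our metric on the dev set, and keep the winner. (The real MIPROv2 uses an LLM to propose candidates and a Bayesian search over them; fake_model, devset, and candidates below are illustrative stand-ins, not real DSPy code.)

```python
def fake_model(instruction: str, text: str) -> str:
    """Pretend LLM: only an instruction that spells out the mapping works."""
    if "Oink!" in instruction:
        return {"amazing": "Oink!", "awful": "Hmmm...", "average": "Bingo!"}[text]
    return "Hmmm..."  # a vague instruction just guesses one label

# Tiny dev set of (text, expected label) pairs using the inverted mapping.
devset = [("amazing", "Oink!"), ("awful", "Hmmm..."), ("average", "Bingo!")]

# Candidate instructions: the original vague one and a proposed replacement.
candidates = [
    "Process the input.",
    'Classify sentiment: "Oink!" = positive, "Hmmm..." = negative, "Bingo!" = neutral.',
]

def score(instruction: str) -> float:
    """Fraction of dev examples the instruction gets right."""
    return sum(fake_model(instruction, t) == lbl for t, lbl in devset) / len(devset)

best = max(candidates, key=score)
print(best)  # the instruction that explains the label mapping wins
```

The real optimizer is far more sophisticated, but the core idea is the same: the metric on the dev set is the only feedback signal needed to discover that the labels are inverted.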

Conclusion

What this example shows is the real power of DSPy’s optimizer workflow. With just a few lines of Python, we went from a deliberately confusing instruction that produced almost random results to a fully optimized prompt that got 100% accuracy on our dev set.

The key takeaway is that DSPy doesn’t just wrap an LLM in a function — it gives you control over how the model interprets your instructions and examples, and it can automatically improve them. That means:

You can use arbitrary input/output field names and even nonsense labels, and the optimizer will learn the correct mapping.

You can systematically improve your prompts using empirical evidence from a test set, instead of manually guessing how to rewrite them.

The process is reproducible and type-safe, so you can confidently switch models or update instructions without breaking your code. If you do switch models, just re-run your optimizer and it will build prompts appropriate for the new model!

In short, DSPy + MIPROv2 turns what used to be trial-and-error prompt engineering into a structured, programmatic, and testable process. For anyone building reliable LLM-powered applications, this is a huge productivity and quality-of-results win.
