Build evals to measure prompt quality systematically — before and after every change.
Without evals, you're flying blind. You can't know if your prompt changes helped or hurt, and you can't catch regressions before they reach production.
Evals are the testing framework for prompts. Build them before you start iterating, not after.
A good eval dataset has:
Building this dataset is the hardest part. Invest in it: a bad eval is worse than no eval, because it gives you false confidence in a prompt that may be failing.
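The eval runner in this lesson reads `inputs` and `expected` from each example, so one possible shape for a dataset entry is sketched below. The `EvalExample` class, the `tags` field, and the sentiment examples are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    """One eval case: template variables plus the expected answer."""
    inputs: dict       # values substituted into the prompt template
    expected: str      # reference output used for grading
    tags: tuple = ()   # hypothetical field for slicing results, e.g. ("edge_case",)


# Hypothetical examples for a sentiment-classification prompt.
eval_dataset = [
    EvalExample({"review": "Arrived broken, refund please."}, "negative"),
    EvalExample({"review": "Does exactly what it says."}, "positive"),
    EvalExample({"review": ""}, "neutral", tags=("edge_case",)),
]

# Each example's inputs fill the placeholders of a prompt template.
template = "Classify the sentiment of this review: {review}"
prompts = [template.format(**ex.inputs) for ex in eval_dataset]
```

Keeping inputs and expected outputs together in one record makes it easy to rerun the same cases against every prompt revision.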
Exact match — For classification and extraction tasks. The output must match the expected label. Fast, cheap, unambiguous.
Semantic similarity — For generation tasks where wording can vary. Use embedding distance or another model to check if meaning matches.
LLM-as-judge — Use a strong model to evaluate outputs against your criteria. More expensive but handles nuanced quality dimensions.
Human review — Gold standard for tasks where quality is subjective. Use for calibration and spot-checking automated evals.
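The three automated grading methods above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `embed` and `call_model` are caller-supplied functions standing in for whatever embedding and model clients you use, and the 0.8 threshold, the judge prompt wording, and the `Score: <n>` reply format are all assumptions.

```python
import math
import re


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_match(output, expected, embed, threshold=0.8):
    """Return 1.0 if the two texts' embeddings are at least `threshold` similar.

    `embed` is a hypothetical caller-supplied function: text -> list of floats.
    """
    return float(cosine_similarity(embed(output), embed(expected)) >= threshold)


# Assumed judge prompt; adapt the criteria to your own task.
JUDGE_PROMPT = (
    "Rate the response from 1 to 5 against the reference.\n"
    "Reference: {expected}\nResponse: {output}\n"
    "Reply with 'Score: <n>'."
)


def llm_judge(output, expected, call_model):
    """Ask a judge model for a 1-5 score; return None if the reply cannot be parsed.

    `call_model` is a hypothetical caller-supplied function: prompt -> reply text.
    """
    reply = call_model(JUDGE_PROMPT.format(expected=expected, output=output))
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Returning `None` for an unparseable judge reply, rather than a default score, keeps judge failures visible so they can be routed to human review instead of silently skewing the average.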
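def run_eval(prompt_template, examples):
    """Return the mean judge score of a prompt template over the eval examples."""
    # call_model (model client) and judge (grading function) are defined elsewhere.
    if not examples:
        raise ValueError("eval dataset is empty")
    results = []
    for example in examples:
        output = call_model(prompt_template.format(**example.inputs))
        score = judge(output, example.expected)
        results.append(score)
    return sum(results) / len(results)

# Run before and after every change
baseline_score = run_eval(old_prompt, eval_dataset)
new_score = run_eval(new_prompt, eval_dataset)
# The percent format assumes judge returns scores in [0, 1]
print(f"Delta: {new_score - baseline_score:+.2%}")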
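An average can stay flat while individual examples get worse, so it can help to compare scores example by example as well. The sketch below assumes you keep the per-example score lists rather than only the mean. `find_regressions` and the sample scores are hypothetical.

```python
def find_regressions(old_scores, new_scores, tolerance=0.0):
    """Return indices of examples whose score dropped by more than `tolerance`."""
    if len(old_scores) != len(new_scores):
        raise ValueError("score lists must cover the same examples")
    return [
        i
        for i, (old, new) in enumerate(zip(old_scores, new_scores))
        if old - new > tolerance
    ]


# Hypothetical per-example scores: both runs have the same mean,
# yet the example at index 1 got worse under the new prompt.
old_scores = [1.0, 1.0, 0.0, 1.0]
new_scores = [1.0, 0.0, 1.0, 1.0]
regressed = find_regressions(old_scores, new_scores)
```

Reviewing the regressed examples by hand shows whether a change traded one failure for another, which the aggregate delta alone cannot reveal.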
Define acceptable quality before you start: "We ship when 90% of eval examples score 4/5 or higher." This prevents perpetual iteration and anchors decisions in data.
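A ship criterion like the one quoted above can be checked mechanically. The following is a minimal sketch, assuming integer judge scores on a 1 to 5 scale. The function names and sample scores are hypothetical, and the defaults mirror the "90% at 4/5 or higher" example.

```python
def pass_rate(scores, min_score=4):
    """Fraction of examples scoring at least `min_score`."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(s >= min_score for s in scores) / len(scores)


def ready_to_ship(scores, min_score=4, required_rate=0.90):
    """True when the pass rate meets the quality bar agreed on up front."""
    return pass_rate(scores, min_score) >= required_rate


# Hypothetical judge scores for ten eval examples: nine of ten score 4 or higher.
judge_scores = [5, 4, 4, 3, 5, 4, 5, 4, 4, 5]
decision = ready_to_ship(judge_scores)
```

Fixing `min_score` and `required_rate` before iterating keeps the bar from drifting to fit whatever the latest prompt happens to achieve.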