Ask most teams how they know their LLM-powered feature is working and the honest answer is some version of: we tried it, and it seemed good. That is vibes-based testing. It is fine for a prototype and dangerous for a production system, because it cannot tell you when a change has quietly made things worse.
Traditional software has a clear notion of correct: a function returns the right value or it does not. Language model output does not work that way — there are many acceptable answers and many subtly wrong ones, and the difference is often a matter of degree. That is exactly why evaluation needs to be deliberate, and exactly why most teams avoid building it.
Why "it seemed good" fails
Informal testing breaks down in production for specific reasons. It is not repeatable — you cannot rerun "we tried it" after a change and compare. It does not scale — a handful of manual checks cannot represent the range of real usage. And it is biased toward the cases you thought of, which are rarely the cases that break.
The real cost shows up at change time. You update a prompt, swap a model, or adjust retrieval — and you have no way to know whether overall quality rose, fell, or held. Without evaluation, every change is a gamble and regressions reach users before anyone notices.
Building an evaluation set
The foundation is a set of test cases that represents how the system is actually used. A good evaluation set is built deliberately, and it is:
- Drawn from real usage. The best cases come from genuine user queries — including the awkward, ambiguous, and unexpected ones — not from inputs invented by the team that built the system.
- Inclusive of edge cases. Deliberately include the hard inputs: ambiguous questions, missing information, things the system should refuse or hand off.
- Paired with a definition of good. Each case needs a clear notion of what an acceptable answer looks like — sometimes an exact answer, more often a set of criteria a correct response must satisfy.
- Maintained over time. When something breaks in production, that case joins the evaluation set so the same regression cannot return unnoticed.
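One way to make these properties concrete is to store each case with its origin and its definition of good as named, checkable criteria. The sketch below is illustrative only: `EvalCase`, `failed_criteria`, `add_regression_case`, and the refund example are hypothetical names and data, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A criterion is a named predicate over the response text.
Criterion = Callable[[str], bool]


@dataclass
class EvalCase:
    case_id: str
    query: str
    origin: str  # assumed labels: "production", "edge_case", "regression"
    criteria: Dict[str, Criterion] = field(default_factory=dict)


def failed_criteria(case: EvalCase, response: str) -> List[str]:
    """Return the names of the criteria the response does not satisfy."""
    return [name for name, check in case.criteria.items() if not check(response)]


def add_regression_case(suite: List[EvalCase], case: EvalCase) -> None:
    """Append a case taken from a production failure, refusing duplicate ids."""
    if any(existing.case_id == case.case_id for existing in suite):
        raise ValueError(f"duplicate case id: {case.case_id}")
    suite.append(case)


# Hypothetical case: a query that once produced a wrong promise in production.
suite: List[EvalCase] = []
refund_case = EvalCase(
    case_id="refund-window-001",
    query="Can I return an item after 45 days?",
    origin="regression",
    criteria={
        "mentions_policy_window": lambda r: "30 days" in r,
        "does_not_promise_refund": lambda r: "will be refunded" not in r.lower(),
    },
)
add_regression_case(suite, refund_case)
```

Expressing the definition of good as several small criteria, rather than one expected string, lets a response pass in many different wordings while still failing for a specific, named reason.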
How to score output that has no single right answer
With test cases in hand, you need a way to judge responses. Three approaches, used together, cover most needs.
Deterministic checks
Some properties can be verified by code with no ambiguity: valid format, required fields present, no forbidden content, citations included. These checks are cheap, fast, and deterministic: the same response always gets the same verdict. Use them wherever a property can be checked mechanically.
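As a minimal sketch of such checks, assume the system is expected to return a JSON object with `answer` and `citations` fields. That schema, the field names, and the forbidden pattern are assumptions made for illustration, not something the approach requires.

```python
import json
import re
from typing import Dict

REQUIRED_FIELDS = ("answer", "citations")  # assumed response schema
# Assumed example of forbidden content: strings shaped like a US SSN.
FORBIDDEN_PATTERNS = (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),)


def deterministic_checks(raw: str) -> Dict[str, bool]:
    """Run mechanical checks on a raw response and report each result by name."""
    results = {
        "valid_format": False,      # parses as a JSON object
        "required_fields": False,   # every required field is present
        "has_citations": False,     # citations is a non-empty list
        "no_forbidden_content": not any(p.search(raw) for p in FORBIDDEN_PATTERNS),
    }
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return results
    if not isinstance(payload, dict):
        return results
    results["valid_format"] = True
    results["required_fields"] = all(name in payload for name in REQUIRED_FIELDS)
    citations = payload.get("citations")
    results["has_citations"] = isinstance(citations, list) and len(citations) > 0
    return results
```

Returning one named result per check, instead of a single pass or fail, makes it clear which property broke when a change causes a regression.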
Model-assisted evaluation
For qualities such as relevance, coherence, or faithfulness to a source, a separate model can score responses against a clear rubric. This scales far better than manual review and, with a well-written rubric, is reasonably consistent — though it should be validated against human judgement before being trusted.
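A rough sketch of rubric-based scoring, and of checking it against human scores, might look like the following. The `judge` argument is a hypothetical callable that sends a prompt to whichever model you use and returns its raw text; the rubric wording, the 1 to 5 scale, and the one-point tolerance are all assumptions chosen for illustration.

```python
from typing import Callable, List

# Assumed rubric; a real one should be written for the quality being judged.
RUBRIC = """Score the RESPONSE for faithfulness to the SOURCE on a 1-5 scale.
5: every claim is supported by the source.
3: mostly supported, with minor unsupported detail.
1: contradicts or ignores the source.
Reply with the integer score only."""

# Hypothetical interface: takes a prompt, returns the judge model's raw reply.
Judge = Callable[[str], str]


def judge_score(judge: Judge, source: str, response: str) -> int:
    """Ask the judge model for a rubric score and parse it strictly."""
    prompt = f"{RUBRIC}\n\nSOURCE:\n{source}\n\nRESPONSE:\n{response}"
    reply = judge(prompt).strip()
    if reply not in {"1", "2", "3", "4", "5"}:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(reply)


def agreement_rate(judge_scores: List[int], human_scores: List[int],
                   tolerance: int = 1) -> float:
    """Fraction of cases where judge and human scores differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and the same length")
    close = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return close / len(judge_scores)
```

Parsing the reply strictly and raising on anything unexpected keeps a malformed judge output from silently counting as a score. What agreement rate is high enough to trust the judge is a decision for the team, not something this sketch settles.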
Human review
For the qualities that matter most — genuine helpfulness, appropriate tone, nuanced correctness — human judgement remains the standard. The point is not to review everything by hand, but to review a meaningful sample regularly and to use it to keep the automated methods honest.
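Drawing the review sample in code, with a fixed seed, keeps the selection from drifting toward the cases reviewers happen to find interesting. The helper below is a minimal sketch; `review_sample` is a hypothetical name, and uniform sampling is an assumption that a team may well replace with sampling weighted toward risky or low-scoring outputs.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def review_sample(records: Sequence[T], k: int, seed: int) -> List[T]:
    """Draw up to k records uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(records), min(k, len(records)))
```

Recording the seed alongside the review results means the same sample can be drawn again later, for example to rescore it with an updated automated check.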
Evaluation as routine, not event
The shift that matters is making evaluation continuous. The suite should run automatically on every change to a prompt, model, or pipeline — before it ships — so a regression shows up as a failed check rather than a user complaint. In production, sampling and scoring live output keeps the picture current, since real usage drifts in ways no fixed suite fully anticipates.
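A pre-ship gate of this kind can be sketched as a comparison between the stored results of the current version and the results of the proposed change. The function names and the policy shown here (block on any case that used to pass and now fails) are assumptions; a real pipeline might tolerate some noise or weight cases differently.

```python
from typing import Dict, List, Tuple


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; results map case id to pass or fail."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


def regressed_cases(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case ids that passed on the baseline but fail, or are missing, on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def check_change(baseline: Dict[str, bool],
                 candidate: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (ok_to_ship, regressed case ids) under the assumed strict policy."""
    failing = regressed_cases(baseline, candidate)
    return (not failing, failing)


# Hypothetical results: the overall pass rate is unchanged, yet case "b" regressed.
baseline = {"a": True, "b": True, "c": True, "d": False}
candidate = {"a": True, "b": False, "c": True, "d": True}
ok, failing = check_change(baseline, candidate)
```

The example data shows why the gate compares case by case: both versions pass three of four cases, so an aggregate pass rate alone would hide the regression on `b`. In a CI job, a false `ok` would typically translate into a non-zero exit code so the change is blocked before it ships.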
Evaluation is not the unglamorous overhead of building with language models. It is the thing that turns an LLM feature from something that seemed good once into something you can change, improve, and trust over time.