Evals: The New Unit Test

In the traditional software world, testing was simple. You write a function, you write a test, it either passes or it doesn't. Deterministic. You could trust that if it passed locally, it'd behave the same way in production.

AI software breaks that contract.

When your application is powered by an LLM, you're no longer dealing with deterministic outputs. You can run the exact same input a hundred times and get slightly different results each time. That's not a bug — that's the nature of the technology. But it creates a real problem for engineers trying to ship reliable systems: how do you test something you can't fully predict?

That's where evals come in.

What evals actually are

"Evals" is one of those terms the industry uses loosely, so let me define what I mean when I use it.

Evals are to AI engineers what unit tests and integration tests are to traditional software engineers. Not a replacement — your regular test suite still matters — but an evolution of the same underlying goal: confidence that your system does what you expect.

The difference is that with AI software, you're not asking "does this always return the right answer?" You're asking "does this return the right answer reliably enough, and in what situations does it not?"

That's a fundamentally different question, and it requires a different kind of testing infrastructure.

Offline evals

The first category is offline evals. These are your known test cases — scripted scenarios that represent how your agent should behave in situations you've already thought through.

Think of it like unit testing. You define an input, you define what a successful output looks like, and you run your agent against it. The difference is that passing isn't always binary. You might check whether the response contains certain information or follows a particular format, or in some cases use an LLM-as-judge to assess quality.
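To make that concrete, here's a minimal sketch of what an offline eval case can look like. The names (`EvalCase`, `run_case`, `contains`) and the refund scenario are hypothetical illustrations, not Cora's actual harness; the point is just that a case pairs an input with checks that don't have to be a single binary assertion.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    name: str
    input: str
    checks: list[Callable[[str], bool]] = field(default_factory=list)

def contains(snippet: str) -> Callable[[str], bool]:
    """Check helper: does the agent's output mention this snippet?"""
    return lambda output: snippet.lower() in output.lower()

# Hypothetical scenario: a refund email should mention the refund window.
refund_case = EvalCase(
    name="refund_request_mentions_policy_window",
    input="Hi, I'd like a refund for my last order.",
    checks=[
        contains("30 days"),
        lambda out: len(out) < 1500,  # stays reasonably concise
    ],
)

def run_case(case: EvalCase, run_agent: Callable[[str], str]) -> bool:
    """Run one case; `run_agent` is your agent's entry point."""
    output = run_agent(case.input)
    return all(check(output) for check in case.checks)
```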

The primary use case is regression testing. At Cora, when someone on the team is making prompt changes or adding new tools to an agent, we need to know those changes aren't silently breaking expected behavior elsewhere in the system. In the old world you'd run your test suite and things were green or red. In the AI world you're watching pass rates — are the things that used to work still working?

There's also a stability dimension that just doesn't exist in traditional testing. A test that passes once might only pass 95 times out of 100 at volume. Whether that's acceptable depends entirely on the use case — but you have to know. A single passing run doesn't tell you much. Running evals at scale gives you a real picture of reliability.
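Extending the same hypothetical `run_case` sketch, measuring stability just means re-running each case at volume and tracking a pass rate instead of a single green or red result:

```python
def pass_rate(case: EvalCase, run_agent, runs: int = 100) -> float:
    """Estimate how reliably a case passes by re-running it at volume."""
    passes = sum(run_case(case, run_agent) for _ in range(runs))
    return passes / runs

# A single passing run hides flakiness; a rate makes the trade-off explicit.
# rate = pass_rate(refund_case, run_agent)
# assert rate >= 0.95, f"{refund_case.name} regressed: {rate:.0%}"
```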

Online evals

The second category is online evals, or what we call hypercare testing at Cora. This is where it gets interesting.

Offline evals cover the scenarios you thought of. Online evals are for everything else.

When an agent is running in production, it's going to encounter inputs you never anticipated. A customer sends an email that doesn't fit any pattern you planned for. The agent handles it in a way that technically follows the rules but isn't what you'd actually want. No offline eval would have caught it — you didn't know to write that test case.

Online evals are about monitoring execution in production and judging whether the agent is behaving the way you actually want. And "judging" is the right word here — it's different from a unit test. It's closer to the way a manager evaluates an employee. Not just "did you follow the policy?" but "did you handle that situation the right way?"
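As an illustration of what that judging can look like in code, here's a sketch of an LLM-as-judge pass over a single production trace. The `llm` callable, the rubric wording, and the JSON verdict format are all assumptions made for the example, not a description of Cora's hypercare pipeline.

```python
import json

# Illustrative rubric; the policy summary would come from your own policy prompt.
JUDGE_PROMPT = """You are reviewing how a support agent handled a customer email.
Policy summary: {policy}
Customer email: {email}
Agent response: {response}

Did the agent handle this the way a careful human teammate would?
Reply with JSON: {{"verdict": "pass" or "flag", "reason": "..."}}"""

def judge_trace(llm, policy: str, email: str, response: str) -> dict:
    """Score one production trace; `llm` is any callable that takes a prompt and returns text."""
    raw = llm(JUDGE_PROMPT.format(policy=policy, email=email, response=response))
    result = json.loads(raw)
    # Flagged traces go to human review and become candidate offline eval cases.
    return result
```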

That analogy goes further than it might seem.

Prompting an agent is like writing a policy

When you're building an AI agent, you're essentially writing a policy for how it should behave. At the beginning that policy is loose — you define the broad strokes and leave a lot to the agent's judgment. That works fine until it doesn't.

Over time, you observe the agent operating in ways you didn't expect. Not wrong exactly — it followed the policy — but not right either. So you tighten it. In this situation, do this. In that situation, don't do that. The policy grows more specific over time, and the agent's behavior gets closer to what you actually want.
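Purely as an invented illustration (these rules are not Cora's), this is the shape that tightening tends to take inside the prompt itself:

```python
# Invented example of a policy prompt that has accreted specific rules
# as unexpected behavior showed up in production.
AGENT_POLICY = """You handle inbound customer email for our store.
Be helpful, accurate, and concise.

Specific rules added over time:
- If the customer asks about a refund, always state the 30-day window.
- If an email contains a legal threat, do not reply; escalate to a human.
- Never promise a delivery date you cannot see in the order system."""
```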

This is the core loop of AI agent development. Online evals are what make that loop work. They surface the situations where behavior diverged from expectations, so you have something concrete to go back and address.

The feedback loop

Here's where the two come together.

When an online eval flags unexpected behavior, that's not just a production alert — it's a new test case. We take that scenario, move it into our offline eval suite, refine the prompt to handle it correctly, and make sure it passes before it goes back into production.

The offline suite grows over time as a living record of everything we've seen go wrong and how we've addressed it. And because these are real failure modes pulled from production — not hypotheticals — they're the most valuable test cases you have. You know they matter because production already proved it.

So the workflow ends up looking like this: run offline evals during development to catch regressions, monitor online evals in production to catch edge cases, and when something surfaces online, loop it back into the offline suite to prevent it from happening again.
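As a final sketch, here's one way that last step could look, reusing the hypothetical `EvalCase` and `contains` helpers from earlier. The shape of the `flagged` dict and its field names are assumptions for the example.

```python
def promote_to_offline(flagged: dict, expected_snippets: list[str],
                       suite: list[EvalCase]) -> None:
    """Turn a flagged production trace into a permanent regression case.

    `flagged` is assumed to carry the original input and a trace id;
    `expected_snippets` is what, after human review, a correct response
    must contain.
    """
    suite.append(
        EvalCase(
            name=f"prod_{flagged['trace_id']}",
            input=flagged["input"],
            checks=[contains(snippet) for snippet in expected_snippets],
        )
    )
```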

Why this framing matters

There are a lot of ways to define evals, and the term means different things to different teams. Some people mean benchmark performance on academic datasets. Some mean red-teaming. Some mean LLM-as-judge scoring pipelines.

When I talk about evals, I mean the practical testing and monitoring infrastructure that lets you build AI agents and actually trust them in production. The unit test analogy is intentional — it's not a perfect mapping, but it's the right mental model for engineers coming from a traditional software background. It's how you bring engineering discipline to a part of the stack that doesn't naturally lend itself to it.

The underlying goal hasn't changed. We still want to ship software we can trust. Evals are just how we get there when the output isn't guaranteed.