From Unit Test to Continuous Eval: How to Test AI-Powered Systems

Two disciplines, not one replacing the other

We design AI systems around two complementary testing disciplines: deterministic tests for domain logic and continuous evals for model behavior. Each covers a different kind of error, and understanding the boundary between them decides whether the system goes to production with confidence or with hope.

The most frequent temptation for teams coming from classic TDD is to assume either that their tests are enough or that evals replace them. Both readings are wrong. Here is how the work splits and how to integrate both into a serious pipeline.

What TDD still covers perfectly

Most of the logic in an AI system is still deterministic and therefore testable with classic tests:

Input and output validation. Before and after every model call there's code that validates schemas, ranges, formats. That gets tested with unit tests.
Agent domain logic. The business rules the agent applies — escalation thresholds, qualification criteria, pricing policies — live in pure domain classes. With LLM-port stubs, that logic gets tested in a fully deterministic way.
External-system integrations. Every adapter (HubSpot, SAP, Postgres) has its unit test with doubles and its integration test against a controlled environment.
Tool validation. Every MCP tool the agent exposes has a defined contract — parameters, ranges, responses. That gets tested before executing anything with AI.
Error handling and fallbacks. What the system does when the LLM fails, when a tool times out, when external data comes in corrupt. Classic logic, classic testing.

In the systems we deliver, deterministic test coverage stays around 70-80% of lines. AI adds a new layer; it doesn't shrink the one already there.
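As an illustration of testing agent domain logic behind an LLM-port stub, here is a minimal sketch. Every name in it (LLMPort, EscalationPolicy, StubLLM, handle) and the 0.8 threshold are hypothetical, chosen only for the example; the point is that the test never touches a real model.

```python
from dataclasses import dataclass
from typing import Protocol


class LLMPort(Protocol):
    """Hypothetical port: the only thing the domain knows about the model."""

    def classify_urgency(self, message: str) -> float: ...


@dataclass(frozen=True)
class EscalationPolicy:
    """Pure domain rule: escalate when urgency reaches the threshold."""

    threshold: float = 0.8  # illustrative value, not taken from a real project

    def should_escalate(self, urgency: float) -> bool:
        # Output validation: a score outside [0, 1] is treated as a model error.
        if not 0.0 <= urgency <= 1.0:
            raise ValueError(f"urgency out of range: {urgency}")
        return urgency >= self.threshold


class StubLLM:
    """Deterministic double: always returns the score it was built with."""

    def __init__(self, score: float) -> None:
        self._score = score

    def classify_urgency(self, message: str) -> float:
        return self._score


def handle(message: str, llm: LLMPort, policy: EscalationPolicy) -> str:
    """Ask the model for an urgency score, then apply the domain rule."""
    urgency = llm.classify_urgency(message)
    return "escalate" if policy.should_escalate(urgency) else "auto_reply"


def test_escalates_at_threshold() -> None:
    assert handle("any text", StubLLM(0.8), EscalationPolicy()) == "escalate"


def test_auto_replies_below_threshold() -> None:
    assert handle("any text", StubLLM(0.79), EscalationPolicy()) == "auto_reply"


def test_rejects_out_of_range_model_output() -> None:
    try:
        handle("any text", StubLLM(1.7), EscalationPolicy())
    except ValueError:
        return
    raise AssertionError("expected ValueError for out-of-range urgency")
```

Because the stub fixes the model's answer, these tests exercise only the threshold rule and the output validation, which is exactly the part that should never be flaky.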

What TDD doesn't cover and never will

Deterministic tests miss three kinds of error because, by design, they can't see them:

Model decisions in edge cases. The agent gets an ambiguous request and decides; the model reasoned defensibly, but the decision is suboptimal for your business. No unit test catches it because the code executed correctly.

Silent degradation from external changes. The LLM provider updates the model. Responses change subtly: the agent starts citing sources less, or escalating to a human more than needed. Tests still pass; quality drops.

Behavior on real data that wasn't in your dataset. The agent works fine with the PoC's 50 cases and starts failing on real scenarios nobody had anticipated.

All three get tackled with evals: a representative set of cases, a judgment rubric (human, programmatic, or calibrated LLM-as-judge), a metric measured periodically. The output isn't PASS/FAIL — it's a score compared to the historical baseline.
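A minimal sketch of that loop, assuming a programmatic rubric (case-insensitive exact match) and an absolute drop threshold. The names EvalCase, run_eval, and has_regressed are hypothetical, and the agent is any callable from prompt to answer; a human or LLM-as-judge rubric would plug in through the same function signature.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str


Agent = Callable[[str], str]          # prompt -> answer
Rubric = Callable[[str, str], float]  # (actual, expected) -> score in [0, 1]


def exact_match(actual: str, expected: str) -> float:
    """Simplest programmatic rubric: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0


def run_eval(cases: Sequence[EvalCase], agent: Agent, rubric: Rubric) -> float:
    """Score every case and return the mean: a metric, not a PASS/FAIL verdict."""
    if not cases:
        raise ValueError("eval dataset is empty")
    scores = [rubric(agent(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def has_regressed(score: float, baseline: float, max_drop: float) -> bool:
    """True when the score fell below the baseline by more than max_drop (absolute)."""
    return (baseline - score) > max_drop


# Illustrative data: a canned "agent" that gets three of four cases right.
CASES = [
    EvalCase("c1", "refund after 40 days?", "escalate"),
    EvalCase("c2", "reset my password", "auto_reply"),
    EvalCase("c3", "invoice is wrong", "escalate"),
    EvalCase("c4", "opening hours?", "auto_reply"),
]
CANNED = {
    "refund after 40 days?": "escalate",
    "reset my password": "auto_reply",
    "invoice is wrong": "auto_reply",  # the one wrong decision
    "opening hours?": "auto_reply",
}


def fake_agent(prompt: str) -> str:
    return CANNED[prompt]
```

Note that the single failing case does not fail anything on its own: only the aggregate score, compared against the stored baseline, decides whether to raise an alert.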

The updated testing pyramid

The classic pyramid (unit tests at the bottom, integration in the middle, e2e at the top) still holds, but in AI systems it gains one more layer:

              ╱ Continuous evals
            ╱   (model behavior,
           ╱    LLM-as-judge, drift)
          ╱─────────────────────────
         ╱ E2E tests
        ╱  (critical business flows)
       ╱──────────────────────────
      ╱ Integration tests
     ╱  (adapters against real systems)
    ╱─────────────────────────────────
   ╱ Unit tests
  ╱  (pure domain, validators, tools)

Evals sit on top of the pyramid because they cover fewer cases but detect the problems no lower layer can see. They don't replace any layer; they complement all of them.

Evals have their own subdivisions:

Retrieval evals over RAG: recall@k, MRR, groundedness.
Action evals of the agent: tool success rate, decision precision.
Response evals: factual accuracy, tone, format, citation correctness.

Each with its dataset, its rubric, its acceptance threshold.
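For the retrieval metrics, recall@k and MRR have standard definitions that fit in a few lines. The sketch below assumes documents are identified by string IDs and that each query has a known set of relevant IDs; the function names are illustrative. Groundedness is left out because it usually needs a judge rather than a formula.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set is empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("no queries to score")
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Two illustrative queries with made-up document IDs.
RUNS = [
    (["d3", "d1", "d7"], {"d1", "d9"}),  # first relevant hit at rank 2
    (["d2", "d4"], {"d2"}),              # first relevant hit at rank 1
]
```

With these two queries, the first scores a recall@3 of 0.5 (one of two relevant documents retrieved) and the MRR over both is 0.75.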

How both approaches coexist in CI

The pipeline we apply in serious projects has three validation tiers:

Tier 1 — every commit. Unit tests + lint + type-check. Suite running in under three minutes. Blocks the commit if it fails. Catches 90% of the errors before they reach production.

Tier 2 — every pull request. Integration tests + E2E on critical flows. Suite running in ten to fifteen minutes. Additionally, a light eval over a reduced dataset (20-50 cases) if the PR touches prompts, retrieval, or agent logic.

Tier 3 — daily or post-release. Full eval over the extended dataset (200-500 cases). Compares against the historical baseline. Triggers an alert if a metric drops beyond the threshold.

This stratification balances speed and coverage. A change in a REST endpoint doesn't need to trigger the full eval; a change in the system prompt does. The rule comes from which layer of the system the change touches.
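The routing rule can be sketched as a function from the list of changed files to the checks a PR must run. The directory layout (prompts/, retrieval/, agent/) is an assumption made for the example, not a convention the pipeline requires; in practice the patterns come from wherever the project keeps its model-facing code.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Hypothetical repository layout: paths whose changes can alter model behavior.
EVAL_SENSITIVE_PATTERNS = ("prompts/*", "retrieval/*", "agent/*")


def needs_light_eval(changed_files: Sequence[str]) -> bool:
    """True if any changed file touches prompts, retrieval, or agent logic."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in EVAL_SENSITIVE_PATTERNS
    )


def checks_for_pull_request(changed_files: Sequence[str]) -> list[str]:
    """Tier 2 checks: always integration and E2E, light eval only when relevant."""
    checks = ["integration", "e2e"]
    if needs_light_eval(changed_files):
        checks.append("light_eval")
    return checks
```

A PR that only edits a REST handler gets the two classic suites; one that edits a file under prompts/ also gets the reduced-dataset eval. The full eval stays on the daily or post-release schedule regardless.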

Anti-patterns that destroy trust in the system

Four mistakes we often see in teams adopting AI, each of which undermines the discipline:

Treating evals as tests. The eval isn't PASS/FAIL — it's a metric with a trend. An individual case can fail without breaking the build; what breaks the build is an aggregate drop in the metric.

Pretending TDD covers the model. Classic teams sometimes assume that if tests pass, the agent works. That's true for the deterministic half; false for the model-behavior half. You need both disciplines, not one.

Eval datasets with synthetic cases only. Cases generated with an LLM to test another LLM tend to converge to what the model does well. The useful dataset contains real production cases — including the ones that failed in the first month of use.

Uncalibrated LLM-as-judge. Using LLM-as-judge without comparing its scores against human review on 50-100 cases produces a metric that looks objective but isn't. Calibration isn't optional.
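One way to run that comparison, offered as an assumption rather than as the only valid method, is to have humans and the judge label the same cases and compute both raw agreement and Cohen's kappa, which corrects agreement for chance. The sketch assumes categorical labels such as "pass" and "fail"; what kappa value counts as acceptable is a project decision.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Share of cases where the judge gave the same label as the human reviewer."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Probability that both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every case.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels for four cases; a real calibration set would be 50-100.
HUMAN = ["pass", "pass", "fail", "fail"]
JUDGE = ["pass", "fail", "fail", "fail"]
```

Here the judge agrees with the human on three of four cases (0.75), but kappa is only 0.5, because half of that agreement would be expected by chance given how often each rater says "fail". That gap is what raw agreement hides.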

A real case: a classic-TDD team adopting evals

A client of ours with a consolidated Symfony / Spring Boot team and mature TDD discipline incorporated a first AI agent twelve months ago. The challenge wasn't technical (the team knew how to test) but conceptual: understanding what each discipline covers and what it doesn't.

What we built during the first month:

Agent domain with classic coverage of 85% on pure classes. LLM-port stubs, deterministic tests, fast cycle.
Initial eval dataset with 80 cases curated from the PoC and the first real cases. Each case with input, expected output, and explicit rubric.
CI pipeline with three tiers (commit / PR / daily), automatically running light evals when the PR touched prompts.
Metrics dashboard showing baseline and per-commit deviation. Slack alerts if a metric dropped more than 3% off baseline.

Result at six months: two model migrations executed without drama (each one a single PR, four days from PR to promotion, no critical regressions); the dataset grown to 240 cases, with real incidents incorporated; and a team that now reasons about the cost-benefit of each testing layer instead of defending TDD or distrusting evals, as it did a year ago.

How we build it in production

When we step into a project where a classic-TDD team is going to incorporate AI, the first deliverable isn't the agent. It's the testing map: what each pyramid layer covers in this specific use case, what eval dataset the agent needs, and how it all integrates into the existing CI without breaking the team's rhythm.

On top of that map, deterministic tests stay exactly as they were — the team keeps working with TDD where it makes sense. Evals get added as a new layer in CI and as a new practice in the workflow. The team keeps what it earned and adds the discipline the AI system needs.

Behind each of these systems is a team of engineers who believe quality in AI systems isn't about replacing TDD with evals but about combining them with judgment. Deterministic tests protect the predictable; evals protect the probabilistic. Both disciplines, not one.

If your team comes from classic TDD and is incorporating AI in some piece of the system, we can audit the current testing setup and hand you the plan to add the evals layer without dismantling what already works.