Curriculum / Day 8: Ship AI to Production

AI Evaluation & Testing

Unit tests check exact outputs. LLMs are non-deterministic — same input, different output every time. You need evals: automated quality checks using rubrics, LLM-as-judge, and eval datasets. You'll learn eval-driven development — write evals first, then iterate prompts to pass them.

80 min (+30 min boss challenge) · Difficulty: ★★★☆☆
Bridge: Unit tests + CI/CD → Evals + eval-driven development

Use this at work tomorrow

Write 10 eval cases for any AI feature your team ships — catch prompt regressions before users do.

Learning Objectives

  1. Understand why expect(output).toBe(exact) fails for AI
  2. Build eval datasets from real user queries and expert labels
  3. Implement LLM-as-judge evaluation with scoring rubrics
  4. Set up eval-driven development: write evals → iterate prompts → measure
  5. Create an eval suite that catches regressions in your Day 3 RAG app

Ship It: Eval suite for your RAG app

By the end of this day, you'll build and deploy an eval suite for your RAG app. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can build an automated eval pipeline using LLM-as-judge and Promptfoo to catch AI regressions before deploying.

1 = no idea · 5 = ship it blindfolded
Predict First — Then Learn

What's the #1 reason AI projects fail in production?

Week 2: Production AI — Evals Are Your Test Suite

Week 1 taught you to build AI features. Week 2 teaches you to ship them with confidence. The #1 reason AI projects fail in production isn't bad models — it's no way to measure quality. Evals are your test suite for non-deterministic systems. Without them, every deployment is a coin flip.

💡Evals are your test suite for AI — without them, every deployment is a coin flip.
Quick Pulse Check

Why can't traditional unit tests fully cover AI feature quality?

The Eval Mental Model: Think CI/CD for AI

You already write unit tests and integration tests. AI evals are the same idea: define expected behavior, run the system, check the output. The difference: AI outputs are non-deterministic, so you need fuzzy matching. Instead of assertEquals('hello'), you check 'does the response contain the key information?' or 'is the tone professional?'. This is where LLM-as-judge comes in.
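Before reaching for a judge model, the cheapest form of fuzzy matching is checking that the response covers the expected key facts. A minimal sketch (keywordCoverage is a hypothetical helper, not part of the course code):

```typescript
// Hypothetical helper: pass if the response covers enough of the
// expected key facts, regardless of exact wording.
function keywordCoverage(response: string, keyFacts: string[]): number {
  const text = response.toLowerCase();
  const hits = keyFacts.filter((fact) => text.includes(fact.toLowerCase()));
  return hits.length / keyFacts.length;
}

const answer = "Our refund policy allows returns within 30 days of purchase.";
const coverage = keywordCoverage(answer, ["refund", "30 days"]);
// Threshold-based, not exact-match: 80% coverage counts as a pass.
console.log(coverage >= 0.8 ? "PASS" : "FAIL"); // → PASS
```

Keyword checks catch blatant misses for free; the judge-model approach below handles the cases where wording and meaning diverge.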

💡AI evals = unit tests with fuzzy matching. Check meaning, not exact strings.
Quick Pulse Check

What replaces assertEquals() in AI evals?

Predict First — Then Learn

In LLM-as-judge, which model grades the outputs?

LLM-as-Judge: Using AI to Test AI

The most powerful eval technique: use a strong model (GPT-4o) to grade outputs from your production model (GPT-4o-mini). Define rubrics with criteria and scores. The judge model evaluates each criterion. This scales better than manual review and catches regressions automatically. It's not perfect — judge models have biases — but it's 10x better than 'looks good to me' testing.

💡LLM-as-judge: use GPT-4o to grade GPT-4o-mini. Scales better than manual review, catches regressions.
Quick Pulse Check

What's a known limitation of LLM-as-judge?

Types of Evals: What to Test

Core eval types: (1) Factuality — is the answer correct given the context? (2) Relevance — does the answer address the question? (3) Faithfulness — does the answer only use provided context (no hallucination)? (4) Harmfulness — does it contain unsafe content? (5) Style — does it match your tone/format requirements? Start with factuality and faithfulness for RAG systems. Add others as you find failure modes.
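For example, a faithfulness check can be phrased as a judge prompt that only credits claims grounded in the retrieved context. A sketch with illustrative wording (the template names are placeholders, not a fixed API):

```typescript
// Illustrative faithfulness rubric for a judge prompt.
// `context` is the retrieved RAG context, `answer` is the model output.
function faithfulnessPrompt(context: string, answer: string): string {
  return `You are grading an AI answer for faithfulness.
Context:
${context}

Answer:
${answer}

Score 1-5: 5 = every claim is supported by the context,
1 = the answer invents facts not present in the context.
Reply with the number only.`;
}

console.log(faithfulnessPrompt("Plans start at $10/mo.", "Plans start at $10/mo."));
```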

💡Start with factuality + faithfulness evals for RAG. Add style and harmfulness as you find failure modes.
Predict First — Then Learn

If you run the same eval suite twice, will scores be identical?

Building an Eval Pipeline with Promptfoo

Promptfoo is the open-source standard for LLM evals. It runs your prompts against test datasets and scores results. Think of it as Jest/Vitest for AI. You define test cases in YAML, run `promptfoo eval`, and get a report showing pass/fail rates. Integrate it into CI to catch regressions before deployment. Every serious AI team uses something like this.
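A minimal `promptfooconfig.yaml` might look like this (the test values are illustrative; check the Promptfoo docs for the full catalog of assertion types):

```yaml
prompts:
  - "Answer using only the provided context:\n{{context}}\n\nQuestion: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      context: "Refunds are accepted within 30 days."
      question: "What is the refund window?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The answer states the refund window without inventing details."
```

Running `promptfoo eval` against a file like this produces the pass/fail report described above.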

💡Promptfoo is Jest for AI — YAML test cases, CLI runner, CI integration, pass/fail reports.
Quick Pulse Check

Where should you integrate Promptfoo eval runs?

The Full Evolution

Watch one function evolve through every concept you just learned.

Production Gotchas

Golden rule: build your eval set from real user queries, not imagined ones. Production failures are always weirder than your test cases. Start with 20-50 eval cases covering happy path + known failure modes. LLM-as-judge costs money — run full evals on PRs, not every commit. Eval scores will fluctuate 2-5% between runs (non-determinism) — set thresholds with margin. Version your prompts alongside eval results so you can correlate changes.

Code Comparison

Unit Tests vs AI Evals

Deterministic testing vs non-deterministic AI evaluation

Unit Test (exact match) · Traditional
// Testing a deterministic function
describe("calculateTotal", () => {
  it("sums items correctly", () => {
    const result = calculateTotal([
      { price: 10.00 },
      { price: 5.50 },
    ]);
    // Exact match — always the same output
    expect(result).toBe(15.50);
  });

  it("applies discount", () => {
    const result = calculateTotal(
      [{ price: 100 }],
      { discount: 0.1 }
    );
    expect(result).toBe(90.00);
  });
});
// Deterministic: same input → same output
// Binary: pass or fail, nothing in between
AI Eval (rubric-based scoring) · AI Engineering
// Testing a non-deterministic AI system
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

async function evalResponse(
  question: string,
  aiAnswer: string,
  groundTruth: string
) {
  const { object } = await generateObject({
    model: openai("gpt-4o"), // Strong judge grades the cheaper model
    schema: z.object({
      factuality: z.number().min(1).max(5),
      relevance: z.number().min(1).max(5),
      reasoning: z.string(),
    }),
    prompt: `Grade this AI response:
Question: ${question}
Expected: ${groundTruth}
Actual: ${aiAnswer}

Score 1-5 on factuality and relevance.`,
  });
  return object;
}

// Run across the eval dataset. evalSet and getAIAnswer are assumed
// to exist elsewhere; the callback must be async to use await.
const results = await Promise.all(
  evalSet.map(async ({ q, expected }) =>
    evalResponse(q, await getAIAnswer(q), expected)
  )
);
const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
const avgScore = avg(results.map((r) => r.factuality));
assert(avgScore >= 4.0, "Quality regression!");

KEY DIFFERENCES

  • Unit tests: exact match, deterministic, binary pass/fail
  • AI evals: fuzzy scoring (1-5), non-deterministic, threshold-based
  • LLM-as-judge: use GPT-4o to grade GPT-4o-mini outputs
  • Run evals as CI checks to catch regressions before deploy

Bridge Map: Unit tests + CI/CD → Evals + eval-driven development


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

AI Eval Dashboard

Build and deploy an evaluation dashboard for your Day 7 capstone (or any AI feature). Create a golden eval dataset, implement LLM-as-judge scoring, and build a dashboard that tracks quality over time. This is how production AI teams prevent regressions.

~4h estimated
Next.js 14+ · Vercel AI SDK · OpenAI GPT-4o (for judge) · Tailwind CSS · Recharts or Chart.js (visualization) · Vercel (deploy)

Acceptance Criteria

  • Create a golden eval dataset with 20+ question/expected-answer pairs
  • Implement LLM-as-judge scoring on factuality, relevance, and faithfulness
  • Run the full eval suite and compute pass rates and average scores
  • Build a visual dashboard showing scores, per-question breakdown, and trends
  • Support multiple eval runs with timestamp-based comparison
  • Identify and highlight the weakest-performing test cases
  • Deploy to a public URL (Vercel, Netlify, etc.)

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Plan the architecture: eval dataset storage, judge API, results database, and dashboard UI.

npx create-next-app@latest ai-eval-dashboard --typescript --tailwind --app
Create folders: /data/evals (golden datasets), /lib/judge (scoring logic), /app/dashboard
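A golden dataset entry can be as simple as a typed record. The shape below is one possible layout for the files in /data/evals (the field names are illustrative, not a required schema):

```typescript
// One possible shape for a golden eval case stored in /data/evals.
interface GoldenCase {
  id: string;
  question: string;
  expectedAnswer: string; // expert-written ground truth
  mustMention: string[];  // key facts the answer must include
}

const goldenSet: GoldenCase[] = [
  {
    id: "refund-001",
    question: "What is the refund window?",
    expectedAnswer: "Refunds are accepted within 30 days of purchase.",
    mustMention: ["30 days"],
  },
];

console.log(`${goldenSet.length} golden case(s) loaded`);
```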

Deploy Tip

Push to GitHub and import into Vercel. Pre-load sample eval results so the dashboard has data on first visit. This project shows hiring managers you build quality infrastructure, not just features.


After Learning — Rate Your Confidence Again

I can build an automated eval pipeline using LLM-as-judge and Promptfoo to catch AI regressions before deploying.

1 = no idea · 5 = ship it blindfolded