Fine-tuning & Model Customization
When should you fine-tune vs prompt-engineer vs use RAG? Most teams get this wrong. You'll build the decision framework, then practice each approach: optimize prompts first (cheap), add RAG for domain knowledge (medium), and fine-tune only when needed (expensive). By the end you'll know exactly which tool to reach for.
Use this at work tomorrow
Before fine-tuning, try prompt optimization + few-shot examples — it's free and often good enough.
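One way to try this is few-shot prompting: put a handful of input/output examples in the message history so the model infers your format without any training. A minimal sketch, assuming a hypothetical ticket-classification task (the `FEW_SHOT` examples and message shape here are illustrative, not from a specific API):

```typescript
// Few-shot customization: the model learns the output format from
// examples in the conversation, at zero training cost.
type Message = { role: "system" | "user" | "assistant"; content: string };

const FEW_SHOT: Message[] = [
  { role: "user", content: "The checkout page times out constantly." },
  { role: "assistant", content: "category: bug, priority: high" },
  { role: "user", content: "Love the new dashboard, great work!" },
  { role: "assistant", content: "category: praise, priority: low" },
];

// Builds the full message list for one classification call.
function buildMessages(ticket: string): Message[] {
  return [
    {
      role: "system",
      content: "Classify support tickets. Reply as: category: <x>, priority: <y>",
    },
    ...FEW_SHOT,
    { role: "user", content: ticket },
  ];
}
```

Pass the result of `buildMessages()` as the `messages` array in your model call. If two to four examples get you consistent output, you may never need to fine-tune.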
Learning Objectives
1. Build a decision framework: prompt engineering → RAG → fine-tuning
2. Optimize prompts systematically (the cheapest 'customization' lever)
3. Prepare fine-tuning data in the correct JSONL format
4. Fine-tune a classifier using OpenAI's fine-tuning API
5. Evaluate: did fine-tuning actually beat a well-crafted prompt? (often it doesn't)
Ship It: Decision framework + fine-tuned classifier
By the end of this day, you'll build and deploy a decision framework + fine-tuned classifier. This isn't a toy — it's a real project for your portfolio.
I can decide when to fine-tune vs use prompts/RAG, prepare JSONL training data, and evaluate a fine-tuned model against baseline.
Fine-Tuning: When Prompts Aren't Enough
Prompt engineering gets you 80% of the way. Fine-tuning gets you the last 20% — and it's cheaper per call. Fine-tuned models learn your specific style, format, and domain knowledge. Instead of spending 500 tokens on a system prompt for every call, the model already 'knows' what you want. But fine-tuning is expensive to SET UP, so you need a clear decision framework.
Why is fine-tuning cheaper per call than prompt engineering?
When should you fine-tune instead of using prompt engineering?
The Decision Framework: Prompt vs Fine-Tune vs RAG
Use this framework: (1) Start with prompting — it costs nothing to change and works for 80% of cases. (2) Add RAG if the model needs access to your data — it keeps the model current without retraining. (3) Fine-tune only when: you need a specific style/format consistently, your system prompt is > 500 tokens, you're making thousands of calls/day (cost savings justify it), or you need smaller model performance to match larger models on your specific task.
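The framework above can be sketched as a single function. This is a simplified illustration; the field names and thresholds (`systemPromptTokens > 500`, `callsPerDay > 1000`) are assumptions drawn from the rules of thumb in this section, not hard limits:

```typescript
// Decision framework as code: prompt by default, RAG for knowledge,
// fine-tune only when style/format consistency or volume justifies it.
type Needs = {
  needsPrivateOrFreshData: boolean; // model must use your docs / recent data
  systemPromptTokens: number;       // size of the prompt you'd otherwise send
  callsPerDay: number;              // production volume
  needsStrictStyle: boolean;        // exact tone/format on every output
};

function chooseApproach(n: Needs): "prompt" | "rag" | "fine-tune" {
  // Knowledge belongs in retrieval, not in model weights
  if (n.needsPrivateOrFreshData) return "rag";
  // Style/format needs or volume make training cost worth it
  if (n.needsStrictStyle || n.systemPromptTokens > 500 || n.callsPerDay > 1000)
    return "fine-tune";
  // Default: cheapest lever to change
  return "prompt";
}
```

Note the ordering matters: a knowledge need routes to RAG even when volume is high, because fine-tuning is unreliable for factual recall.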
If your model needs to know about recent company data, what should you use?
What percentage of fine-tuning quality comes from data quality?
JSONL Data Preparation: The Critical Step
Fine-tuning quality is 90% data quality. The training format is JSONL: one JSON example per line, each with messages (system, user, assistant). You need 50-200 high-quality examples minimum. Bad examples = bad model. Curate your training data like you'd curate a database migration — carefully, with validation, and with tests. Every typo, wrong answer, or inconsistent format in your training data becomes a feature of the model.
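Because every bad line becomes a feature of the model, it's worth validating each JSONL line before upload. A minimal sketch of such a check (the specific error messages and rules here are illustrative; OpenAI's upload step also validates the file):

```typescript
// Validate one JSONL training line: must parse as JSON and contain a
// messages array with non-empty content and an assistant turn.
type ChatMessage = { role: string; content: string };

function validateExample(line: string): string[] {
  const errors: string[] = [];
  let parsed: { messages?: ChatMessage[] } = {};
  try {
    parsed = JSON.parse(line);
  } catch {
    return ["not valid JSON"];
  }
  const msgs = parsed.messages ?? [];
  if (msgs.length < 2) errors.push("need at least a user and an assistant message");
  if (!msgs.some((m) => m.role === "assistant")) errors.push("missing assistant message");
  for (const m of msgs) {
    if (!m.content?.trim()) errors.push(`empty content for role "${m.role}"`);
  }
  return errors; // [] means the example is usable
}

// One well-formed JSONL line (system, user, assistant):
const good =
  '{"messages":[{"role":"system","content":"Classify tickets."},{"role":"user","content":"App crashes on login"},{"role":"assistant","content":"bug"}]}';
```

Run every line through a check like this before uploading; fixing one malformed example now is far cheaper than retraining later.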
What's the correct JSONL training format for OpenAI fine-tuning?
What's the minimum number of training examples for fine-tuning?
Fine-Tuning in Practice: OpenAI API
The process: (1) Prepare JSONL training file with 50+ examples. (2) Upload the file to OpenAI. (3) Create a fine-tuning job targeting a base model (gpt-4o-mini recommended). (4) Wait 15-60 minutes for training. (5) Test the fine-tuned model. (6) If good, deploy. The fine-tuned model has a unique ID (ft:gpt-4o-mini:your-org:custom-model) and you use it like any other model in the AI SDK.
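Steps 2 and 3 can be sketched against the REST endpoints directly (the official `openai` Node SDK wraps these same calls). The file path, model snapshot name, and helper function are placeholders; nothing runs until you call `createFineTuneJob()` with a real API key in `OPENAI_API_KEY`:

```typescript
// Upload a JSONL file and start a fine-tuning job via the OpenAI REST API.
import fs from "node:fs";

const API = "https://api.openai.com/v1";
const headers = { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` };

// Fine-tuned model IDs look like ft:gpt-4o-mini:your-org:custom-model
function isFineTunedModelId(id: string): boolean {
  return id.startsWith("ft:");
}

async function createFineTuneJob(path: string): Promise<string> {
  // 1. Upload the JSONL training file
  const form = new FormData();
  form.append("purpose", "fine-tune");
  form.append("file", new Blob([fs.readFileSync(path)]), "training.jsonl");
  const file = await (
    await fetch(`${API}/files`, { method: "POST", headers, body: form })
  ).json();

  // 2. Create the job against a fine-tunable base snapshot
  const job = await (
    await fetch(`${API}/fine_tuning/jobs`, {
      method: "POST",
      headers: { ...headers, "Content-Type": "application/json" },
      body: JSON.stringify({ training_file: file.id, model: "gpt-4o-mini-2024-07-18" }),
    })
  ).json();
  return job.id; // poll this job until its status is "succeeded"
}
```

Once the job succeeds, the response includes the `ft:...` model ID, which you drop into your existing model call like any other model name.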
Evaluation: Is the Fine-Tuned Model Better?
Never deploy a fine-tuned model without comparing it to your prompted baseline. Use your Day 8 eval pipeline: run the SAME eval dataset against (1) base model + system prompt, (2) fine-tuned model. Compare scores. The fine-tuned model should score higher AND use fewer tokens (no long system prompt needed). If it doesn't beat the prompted baseline, your training data needs work.
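A minimal sketch of that comparison, assuming a classification task where exact-match accuracy is the score. The `Grader` shape is a stand-in for whatever calls your models (your Day 8 pipeline may score differently):

```typescript
// Run the SAME eval set against a prompted baseline and a fine-tuned
// model, then compare exact-match accuracy.
type EvalCase = { input: string; expected: string };
type Grader = (input: string) => Promise<string>;

// Exact-match accuracy with light normalization of model output.
function accuracy(expected: string[], got: string[]): number {
  const hits = expected.filter((e, i) => e === got[i].trim().toLowerCase()).length;
  return hits / expected.length;
}

async function compare(cases: EvalCase[], baseline: Grader, fineTuned: Grader) {
  const exp = cases.map((c) => c.expected);
  const baseOut = await Promise.all(cases.map((c) => baseline(c.input)));
  const ftOut = await Promise.all(cases.map((c) => fineTuned(c.input)));
  const baseScore = accuracy(exp, baseOut);
  const ftScore = accuracy(exp, ftOut);
  // Deploy only if the fine-tuned model wins on the same test set
  return { baseline: baseScore, fineTuned: ftScore, shipFineTune: ftScore > baseScore };
}
```

Track token usage alongside accuracy: a fine-tuned model that merely ties the baseline can still win on cost if it drops a 400-token system prompt.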
If a fine-tuned model doesn't beat the prompted baseline, what's most likely wrong?
Production Gotchas
Fine-tuning is NOT a silver bullet. Common failures: (1) Training on too few examples (<50) → model doesn't generalize. (2) Training on inconsistent examples → model outputs are random. (3) Not evaluating against baseline → deployed model is actually worse. (4) Fine-tuning for knowledge retrieval → use RAG instead, models forget facts. (5) Training data contains PII → now it's baked into the model. Start with 50 examples, evaluate, add 50 more, evaluate again. Iterate.
Code Comparison
Long System Prompt vs Fine-Tuned Model
Spending tokens on instructions every call vs training the model once
// ❌ 400+ token system prompt on every call
const result = await generateText({
model: openai("gpt-4o-mini"),
system: `You are a technical support agent for
Acme API Platform. Follow these rules exactly:
TONE: Professional but friendly. Never use slang.
Always address the user by name if provided.
FORMAT: Always structure responses as:
1. Acknowledge the issue
2. Provide solution steps (numbered)
3. Offer follow-up help
KNOWLEDGE: Acme API supports REST and GraphQL.
Auth uses OAuth2 with JWT tokens.
Rate limit: 1000 req/min for Pro, 100 for Free.
Pricing: Free ($0), Pro ($49/mo), Enterprise (custom).
RULES: Never discuss competitors.
Never reveal internal system details.
Always suggest upgrade for Free tier limitations.
If unsure, create a support ticket.
`, // ~400 tokens = $0.00006 per call
prompt: userMessage,
});
// Cost for 10K calls/day: $0.60 just for prompt
// Plus: fragile, easy to forget rules

// ✅ Fine-tuned model already knows the rules
const result = await generateText({
model: openai("ft:gpt-4o-mini:acme:support-v3"),
// No system prompt needed — or minimal one
system: "You are Acme support.", // ~5 tokens
prompt: userMessage,
});
// The model already knows:
// - Acme's tone and format
// - Product details and pricing
// - Rules about competitors and upgrades
// - How to structure responses
// Cost for 10K calls/day: $0.0075 for prompt
// 80x cheaper prompt tokens
// Plus: more consistent behavior
// Training cost: ~$5-20 one-time for 200 examples
// Break-even: ~2 days of production traffic
KEY DIFFERENCES
- Long system prompts cost tokens on every single call
- Fine-tuned models 'know' your rules without being told each time
- Break-even: training cost vs saved tokens over time
- Use fine-tuning for FORMAT and STYLE, not for KNOWLEDGE (use RAG for that)
Bridge Map: Config files + templates → Fine-tuning + prompt optimization
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
Fine-Tuning Decision Framework + Classifier
Build and deploy a complete fine-tuning workflow: curate training data, fine-tune a model (or simulate the process), evaluate it against a prompted baseline, and build a decision dashboard that shows whether fine-tuning was worth it. This teaches the most important AI engineering skill: knowing when NOT to fine-tune.
Acceptance Criteria
- Create 50+ high-quality training examples in JSONL format for a classification task
- Run a fine-tuning job using the OpenAI API (or simulate with documented steps)
- Evaluate the fine-tuned model against a well-prompted baseline on the same test set
- Build a comparison dashboard showing accuracy, cost, latency, and quality metrics
- Include a decision framework visualization: when to prompt vs RAG vs fine-tune
- Document the decision: was fine-tuning worth it? (often it isn't)
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Plan the workflow: data curation → training → evaluation → comparison → decision.
npx create-next-app@latest fine-tuning-lab --typescript --tailwind --app
Create folders: /data (training examples), /lib/eval (evaluation logic), /app/dashboard
Deploy Tip
Push to GitHub and import into Vercel. Pre-load comparison results so the dashboard has data. The decision framework is the most valuable part — make it visually clear and actionable.