Day 12 · Ship AI to Production

Fine-tuning & Model Customization

When should you fine-tune vs prompt-engineer vs use RAG? Most teams get this wrong. You'll build the decision framework, then practice each approach: optimize prompts first (cheap), add RAG for domain knowledge (medium), and fine-tune only when needed (expensive). By the end you'll know exactly which tool to reach for.

85 min (+30 min boss) · ★★★★
🎯 Bridge: Config files + templates → Fine-tuning + prompt optimization

Use this at work tomorrow

Before fine-tuning, try prompt optimization + few-shot examples — it's free and often good enough.

Learning Objectives

  1. Build a decision framework: prompt engineering → RAG → fine-tuning
  2. Optimize prompts systematically (the cheapest 'customization' lever)
  3. Prepare fine-tuning data in the correct JSONL format
  4. Fine-tune a classifier using OpenAI's fine-tuning API
  5. Evaluate: did fine-tuning actually beat a well-crafted prompt? (often it doesn't)

Ship It: Decision framework + fine-tuned classifier

By the end of this day, you'll build and deploy a decision framework + fine-tuned classifier. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can decide when to fine-tune vs use prompts/RAG, prepare JSONL training data, and evaluate a fine-tuned model against baseline.

1 = no idea · 5 = ship it blindfolded

Fine-Tuning: When Prompts Aren't Enough

Prompt engineering gets you 80% of the way. Fine-tuning gets you the last 20% — and it's cheaper per call. Fine-tuned models learn your specific style, format, and domain knowledge. Instead of spending 500 tokens on a system prompt for every call, the model already 'knows' what you want. But fine-tuning is expensive to SET UP, so you need a clear decision framework.

💡Fine-tuning gets the last 20% and is cheaper per call — but expensive to set up. Use the decision framework.
Quick Pulse Check

Why is fine-tuning cheaper per call than prompt engineering?

Predict First — Then Learn

When should you fine-tune instead of using prompt engineering?

The Decision Framework: Prompt vs Fine-Tune vs RAG

Use this framework: (1) Start with prompting — it costs nothing to change and works for 80% of cases. (2) Add RAG if the model needs access to your data — it keeps answers current without retraining. (3) Fine-tune only when: you need a specific style or format consistently, your system prompt exceeds 500 tokens, you're making thousands of calls per day (the savings justify the setup cost), or you need a smaller model to match a larger model's performance on your specific task.
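
The criteria above can be written down as a routing function. This is a sketch: the signal names and the calls-per-day threshold are illustrative, not official guidance.

```typescript
// The decision framework as a routing function. Signal names and
// the callsPerDay threshold are illustrative, not official guidance.
type Signals = {
  needsPrivateData: boolean;     // answers must draw on your documents
  needsConsistentStyle: boolean; // strict format/tone on every response
  systemPromptTokens: number;    // size of your current system prompt
  callsPerDay: number;           // production call volume
};

function decideApproach(s: Signals): "prompt" | "rag" | "fine-tune" {
  // (2) RAG whenever the model needs your data: fine-tuning is a
  // poor fit for knowledge retrieval.
  if (s.needsPrivateData) return "rag";
  // (3) Fine-tune only for consistent style at a scale that pays
  // back the setup cost.
  if (
    s.needsConsistentStyle &&
    (s.systemPromptTokens > 500 || s.callsPerDay >= 1000)
  ) {
    return "fine-tune";
  }
  // (1) Default: keep iterating on prompts, which is free.
  return "prompt";
}
```

Notice the ordering: data access wins over style, because fine-tuning can't keep facts current.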

💡Start with prompts (free to iterate), add RAG for data access, fine-tune only for consistent style at scale.
Quick Pulse Check

If your model needs to know about recent company data, what should you use?

Predict First — Then Learn

What percentage of fine-tuning quality comes from data quality?

JSONL Data Preparation: The Critical Step

Fine-tuning quality is 90% data quality. The training format is JSONL: one JSON example per line, each with messages (system, user, assistant). You need 50-200 high-quality examples minimum. Bad examples = bad model. Curate your training data like you'd curate a database migration — carefully, with validation, and with tests. Every typo, wrong answer, or inconsistent format in your training data becomes a feature of the model.
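
As a sketch, assuming a ticket-classification task (the labels, phrasing, and file name here are made up), the format looks like this:

```typescript
// The JSONL shape for a hypothetical ticket classifier. Note the
// IDENTICAL system prompt in every example — inconsistency here
// becomes inconsistency in the model.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type TrainingExample = { messages: ChatMessage[] };

const SYSTEM = "Classify the support ticket as: billing, bug, or other.";

const examples: TrainingExample[] = [
  {
    messages: [
      { role: "system", content: SYSTEM },
      { role: "user", content: "I was charged twice this month." },
      { role: "assistant", content: "billing" },
    ],
  },
  {
    messages: [
      { role: "system", content: SYSTEM },
      { role: "user", content: "The /users endpoint returns 500 errors." },
      { role: "assistant", content: "bug" },
    ],
  },
  // ...at least 48 more, all validated by hand
];

// One JSON object per line: no pretty-printing, no trailing commas.
const jsonl = examples.map((e) => JSON.stringify(e)).join("\n");
// fs.writeFileSync("train.jsonl", jsonl + "\n");
```

The one-object-per-line rule is why it's called JSONL — the file as a whole is NOT valid JSON, and pretty-printed objects will fail upload validation.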

💡JSONL format: one {messages: [...]} per line. 50-200 examples minimum. Every typo becomes a model feature.
Quick Pulse Check

What's the correct JSONL training format for OpenAI fine-tuning?

Predict First — Then Learn

What's the minimum number of training examples for fine-tuning?

Fine-Tuning in Practice: OpenAI API

The process: (1) Prepare JSONL training file with 50+ examples. (2) Upload the file to OpenAI. (3) Create a fine-tuning job targeting a base model (gpt-4o-mini recommended). (4) Wait 15-60 minutes for training. (5) Test the fine-tuned model. (6) If good, deploy. The fine-tuned model has a unique ID (ft:gpt-4o-mini:your-org:custom-model) and you use it like any other model in the AI SDK.
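
Steps (2)–(4) can be sketched against OpenAI's REST API (the official openai Node SDK has equivalent helpers); the model snapshot name, file name, and polling interval here are assumptions:

```typescript
// Sketch: upload a JSONL file, start a fine-tuning job, poll until
// done. Requires OPENAI_API_KEY in the environment; the model
// snapshot name is an assumption — check the current docs.
import fs from "node:fs";

const API = "https://api.openai.com/v1";
const headers = { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` };

async function startFineTune(trainingPath: string): Promise<string> {
  // (2) Upload the training file with purpose "fine-tune"
  const form = new FormData();
  form.append("purpose", "fine-tune");
  form.append("file", new Blob([fs.readFileSync(trainingPath)]), "train.jsonl");
  const file = await (
    await fetch(`${API}/files`, { method: "POST", headers, body: form })
  ).json();

  // (3) Create the fine-tuning job against a base model
  const job = await (
    await fetch(`${API}/fine_tuning/jobs`, {
      method: "POST",
      headers: { ...headers, "Content-Type": "application/json" },
      body: JSON.stringify({ training_file: file.id, model: "gpt-4o-mini-2024-07-18" }),
    })
  ).json();

  // (4) Poll until training finishes (typically 15-60 minutes),
  // then return the ft:... model ID for use with the AI SDK
  while (true) {
    const j = await (
      await fetch(`${API}/fine_tuning/jobs/${job.id}`, { headers })
    ).json();
    if (j.status === "succeeded") return j.fine_tuned_model;
    if (j.status === "failed") throw new Error("fine-tuning job failed");
    await new Promise((r) => setTimeout(r, 60_000));
  }
}
```

The returned ID is what you drop into `openai("ft:...")` in your app — nothing else about your calling code changes.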

💡5 steps: prepare JSONL → upload → create job → wait 15-60 min → test and deploy.

Evaluation: Is the Fine-Tuned Model Better?

Never deploy a fine-tuned model without comparing it to your prompted baseline. Use your Day 8 eval pipeline: run the SAME eval dataset against (1) base model + system prompt, (2) fine-tuned model. Compare scores. The fine-tuned model should score higher AND use fewer tokens (no long system prompt needed). If it doesn't beat the prompted baseline, your training data needs work.
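
Only the comparison logic needs sketching; the per-case predictions come from ordinary generateText calls (baseline model + long system prompt vs the ft: model + short prompt), elided here. The helper names are illustrative:

```typescript
// Minimal A/B scoring sketch. Run the SAME eval set through both
// configurations and compare accuracy; helper names are made up.
type EvalCase = { input: string; expected: string };

function accuracy(predictions: string[], cases: EvalCase[]): number {
  const correct = predictions.filter(
    (p, i) => p.trim().toLowerCase() === cases[i].expected.toLowerCase()
  ).length;
  return correct / cases.length;
}

// baseline: gpt-4o-mini + your full system prompt
// candidate: ft:gpt-4o-mini:your-org:custom-model + a ~5-token prompt
function verdict(baseline: number, tuned: number): string {
  if (tuned > baseline) return "ship the fine-tuned model";
  return "keep the prompted baseline and fix the training data first";
}
```

Normalizing predictions (trim + lowercase) before comparing matters for classifiers — otherwise you penalize formatting noise instead of real errors.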

💡Always A/B test: fine-tuned model vs prompted baseline on the SAME eval dataset. No improvement = bad data.
Quick Pulse Check

If a fine-tuned model doesn't beat the prompted baseline, what's most likely wrong?

The Full Evolution

Watch one function evolve through every concept you just learned.

Production Gotchas

Fine-tuning is NOT a silver bullet. Common failures: (1) Training on too few examples (<50) → the model doesn't generalize. (2) Training on inconsistent examples → model outputs are erratic. (3) Not evaluating against a baseline → the deployed model is actually worse. (4) Fine-tuning for knowledge retrieval → use RAG instead; fine-tuned models are unreliable at recalling facts. (5) Training data contains PII → it's now baked into the model. Start with 50 examples, evaluate, add 50 more, evaluate again. Iterate.
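
Several of these gotchas can be caught before you spend money on a job. A hypothetical pre-flight validator (the PII regex is deliberately naive — real PII scanning needs more than this):

```typescript
// Pre-flight checks on a training set, mirroring gotchas (1), (2),
// and (5). Returns a list of problems; empty means "looks OK".
type Example = { messages: { role: string; content: string }[] };

function validateTrainingSet(examples: Example[]): string[] {
  const problems: string[] = [];
  // (1) Too few examples → poor generalization
  if (examples.length < 50) {
    problems.push(`only ${examples.length} examples (need 50+)`);
  }
  // (2) Inconsistent system prompts → erratic outputs
  const systems = new Set(examples.map((e) => e.messages[0]?.content));
  if (systems.size > 1) {
    problems.push("system prompts are not identical across examples");
  }
  // (5) PII baked into the model: naive email check as an illustration
  const email = /[\w.+-]+@[\w-]+\.[\w.]+/;
  for (const [i, e] of examples.entries()) {
    if (e.messages.some((m) => email.test(m.content))) {
      problems.push(`example ${i} may contain an email address`);
    }
  }
  return problems;
}
```

Run this in CI on your /data folder, the same way you'd lint a migration before applying it.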

Code Comparison

Long System Prompt vs Fine-Tuned Model

Spending tokens on instructions every call vs training the model once

System Prompt (every call pays) · Traditional
// ❌ 400+ token system prompt on every call
const result = await generateText({
  model: openai("gpt-4o-mini"),
  system: `You are a technical support agent for
Acme API Platform. Follow these rules exactly:

TONE: Professional but friendly. Never use slang.
Always address the user by name if provided.

FORMAT: Always structure responses as:
1. Acknowledge the issue
2. Provide solution steps (numbered)
3. Offer follow-up help

KNOWLEDGE: Acme API supports REST and GraphQL.
Auth uses OAuth2 with JWT tokens.
Rate limit: 1000 req/min for Pro, 100 for Free.
Pricing: Free ($0), Pro ($49/mo), Enterprise (custom).

RULES: Never discuss competitors.
Never reveal internal system details.
Always suggest upgrade for Free tier limitations.
If unsure, create a support ticket.
`,  // ~400 tokens = $0.00006 per call
    prompt: userMessage,
});
// Cost for 10K calls/day: $0.60 just for prompt
// Plus: fragile, easy to forget rules
Fine-Tuned Model (learned once) · AI Engineering
// ✅ Fine-tuned model already knows the rules
const result = await generateText({
  model: openai("ft:gpt-4o-mini:acme:support-v3"),
  // No system prompt needed — or minimal one
  system: "You are Acme support.", // ~5 tokens
  prompt: userMessage,
});
// The model already knows:
// - Acme's tone and format
// - Product details and pricing
// - Rules about competitors and upgrades
// - How to structure responses

// Cost for 10K calls/day: $0.0075 for prompt
// 80x cheaper prompt tokens
// Plus: more consistent behavior

// Training cost: ~$5-20 one-time for 200 examples
// Break-even: roughly 1-5 weeks at this traffic ($5-20 / ~$0.59 saved per day)

KEY DIFFERENCES

  • Long system prompts cost tokens on every single call
  • Fine-tuned models 'know' your rules without being told each time
  • Break-even: training cost vs saved tokens over time
  • Use fine-tuning for FORMAT and STYLE, not for KNOWLEDGE (use RAG for that)
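
The break-even bullet is worth working through with your own traffic numbers. A quick sketch using the figures from the comparison above (the price per token and training cost are assumptions; check current rates):

```typescript
// Back-of-the-envelope break-even, using the numbers from the
// comparison above. Prices are assumptions — verify against the
// current pricing page before deciding.
const pricePerToken = 0.15 / 1_000_000; // $/input token (assumed)
const callsPerDay = 10_000;

const promptedCost = 400 * pricePerToken * callsPerDay; // 400-token system prompt
const tunedCost = 5 * pricePerToken * callsPerDay;      // 5-token system prompt
const dailySavings = promptedCost - tunedCost;          // ≈ $0.59/day

const trainingCost = 10; // one-time, midpoint of the ~$5-20 range
const breakEvenDays = trainingCost / dailySavings;      // ≈ 17 days
```

At lower volumes the break-even stretches out fast — at 1K calls/day it's months, which is exactly when the framework says to stay with prompts.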

Bridge Map: Config files + templates → Fine-tuning + prompt optimization


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

Fine-Tuning Decision Framework + Classifier

Build and deploy a complete fine-tuning workflow: curate training data, fine-tune a model (or simulate the process), evaluate it against a prompted baseline, and build a decision dashboard that shows whether fine-tuning was worth it. This teaches the most important AI engineering skill: knowing when NOT to fine-tune.

~4h estimated
Next.js 14+Vercel AI SDKOpenAI Fine-tuning APIRecharts (comparison charts)Tailwind CSSVercel (deploy)

Acceptance Criteria

  • Create 50+ high-quality training examples in JSONL format for a classification task
  • Run a fine-tuning job using the OpenAI API (or simulate with documented steps)
  • Evaluate the fine-tuned model against a well-prompted baseline on the same test set
  • Build a comparison dashboard showing accuracy, cost, latency, and quality metrics
  • Include a decision framework visualization: when to prompt vs RAG vs fine-tune
  • Document the decision: was fine-tuning worth it? (often it isn't)
  • Deploy to a public URL (Vercel, Netlify, etc.)

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Plan the workflow: data curation → training → evaluation → comparison → decision.

npx create-next-app@latest fine-tuning-lab --typescript --tailwind --app
Create folders: /data (training examples), /lib/eval (evaluation logic), /app/dashboard

Deploy Tip

Push to GitHub and import into Vercel. Pre-load comparison results so the dashboard has data. The decision framework is the most valuable part — make it visually clear and actionable.


After Learning — Rate Your Confidence Again

I can decide when to fine-tune vs use prompts/RAG, prepare JSONL training data, and evaluate a fine-tuned model against baseline.

1 = no idea · 5 = ship it blindfolded