Fine-tuning & Model Customization
When should you fine-tune vs prompt-engineer vs use RAG? Most teams get this wrong. You'll build the decision framework, then practice each approach: optimize prompts first (cheap), add RAG for domain knowledge (medium), and fine-tune only when needed (expensive). By the end you'll know exactly which tool to reach for.
Use this at work tomorrow
Before fine-tuning, try prompt optimization + few-shot examples — it's free and often good enough.
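One way to try this is few-shot prompting: put a handful of input/output examples in the message history so the model infers your format without any training. A minimal sketch, assuming a hypothetical ticket-classification task (the `FEW_SHOT` examples and message shape here are illustrative, not from a specific API):

```typescript
// Few-shot customization: the model learns the output format from
// examples in the conversation, at zero training cost.
type Message = { role: "system" | "user" | "assistant"; content: string };

const FEW_SHOT: Message[] = [
  { role: "user", content: "The checkout page times out constantly." },
  { role: "assistant", content: "category: bug, priority: high" },
  { role: "user", content: "Love the new dashboard, great work!" },
  { role: "assistant", content: "category: praise, priority: low" },
];

// Builds the full message list for one classification call.
function buildMessages(ticket: string): Message[] {
  return [
    {
      role: "system",
      content: "Classify support tickets. Reply as: category: <x>, priority: <y>",
    },
    ...FEW_SHOT,
    { role: "user", content: ticket },
  ];
}
```

Pass the result of `buildMessages()` as the `messages` array in your model call. If two to four examples get you consistent output, you may never need to fine-tune.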
Learning Objectives
1. Build a decision framework: prompt engineering → RAG → fine-tuning
2. Optimize prompts systematically (the cheapest 'customization' lever)
3. Prepare fine-tuning data in the correct JSONL format
4. Fine-tune a classifier using OpenAI's fine-tuning API
5. Evaluate: did fine-tuning actually beat a well-crafted prompt? (often it doesn't)
Ship It: Decision framework + fine-tuned classifier
By the end of this day, you'll build and deploy a decision framework + fine-tuned classifier. This isn't a toy — it's a real project for your portfolio.
I can decide when to fine-tune vs use prompts/RAG, prepare JSONL training data, and evaluate a fine-tuned model against baseline.
Fine-Tuning: When Prompts Aren't Enough
Prompt engineering gets you 80% of the way. Fine-tuning gets you the last 20% — and it's cheaper per call. Fine-tuned models learn your specific style, format, and domain knowledge. Instead of spending 500 tokens on a system prompt for every call, the model already 'knows' what you want. But fine-tuning is expensive to SET UP, so you need a clear decision framework.
Why is fine-tuning cheaper per call than prompt engineering?
When should you fine-tune instead of using prompt engineering?
The Decision Framework: Prompt vs Fine-Tune vs RAG
Use this framework: (1) Start with prompting — it costs nothing to change and works for 80% of cases. (2) Add RAG if the model needs access to your data — it keeps the model current without retraining. (3) Fine-tune only when: you need a specific style/format consistently, your system prompt is > 500 tokens, you're making thousands of calls/day (cost savings justify it), or you need smaller model performance to match larger models on your specific task.
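The framework above can be sketched as a single function. This is a simplified illustration; the field names and thresholds (`systemPromptTokens > 500`, `callsPerDay > 1000`) are assumptions drawn from the rules of thumb in this section, not hard limits:

```typescript
// Decision framework as code: prompt by default, RAG for knowledge,
// fine-tune only when style/format consistency or volume justifies it.
type Needs = {
  needsPrivateOrFreshData: boolean; // model must use your docs / recent data
  systemPromptTokens: number;       // size of the prompt you'd otherwise send
  callsPerDay: number;              // production volume
  needsStrictStyle: boolean;        // exact tone/format on every output
};

function chooseApproach(n: Needs): "prompt" | "rag" | "fine-tune" {
  // Knowledge belongs in retrieval, not in model weights
  if (n.needsPrivateOrFreshData) return "rag";
  // Style/format needs or volume make training cost worth it
  if (n.needsStrictStyle || n.systemPromptTokens > 500 || n.callsPerDay > 1000)
    return "fine-tune";
  // Default: cheapest lever to change
  return "prompt";
}
```

Note the ordering matters: a knowledge need routes to RAG even when volume is high, because fine-tuning is unreliable for factual recall.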
If your model needs to know about recent company data, what should you use?
What percentage of fine-tuning quality comes from data quality?
JSONL Data Preparation: The Critical Step
Fine-tuning quality is 90% data quality. The training format is JSONL: one JSON example per line, each with messages (system, user, assistant). You need 50-200 high-quality examples minimum. Bad examples = bad model. Curate your training data like you'd curate a database migration — carefully, with validation, and with tests. Every typo, wrong answer, or inconsistent format in your training data becomes a feature of the model.
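Because every bad line becomes a feature of the model, it's worth validating each JSONL line before upload. A minimal sketch of such a check (the specific error messages and rules here are illustrative; OpenAI's upload step also validates the file):

```typescript
// Validate one JSONL training line: must parse as JSON and contain a
// messages array with non-empty content and an assistant turn.
type ChatMessage = { role: string; content: string };

function validateExample(line: string): string[] {
  const errors: string[] = [];
  let parsed: { messages?: ChatMessage[] } = {};
  try {
    parsed = JSON.parse(line);
  } catch {
    return ["not valid JSON"];
  }
  const msgs = parsed.messages ?? [];
  if (msgs.length < 2) errors.push("need at least a user and an assistant message");
  if (!msgs.some((m) => m.role === "assistant")) errors.push("missing assistant message");
  for (const m of msgs) {
    if (!m.content?.trim()) errors.push(`empty content for role "${m.role}"`);
  }
  return errors; // [] means the example is usable
}

// One well-formed JSONL line (system, user, assistant):
const good =
  '{"messages":[{"role":"system","content":"Classify tickets."},{"role":"user","content":"App crashes on login"},{"role":"assistant","content":"bug"}]}';
```

Run every line through a check like this before uploading; fixing one malformed example now is far cheaper than retraining later.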
What's the correct JSONL training format for OpenAI fine-tuning?
What's the minimum number of training examples for fine-tuning?
Fine-Tuning in Practice: OpenAI API
The process: (1) Prepare JSONL training file with 50+ examples. (2) Upload the file to OpenAI. (3) Create a fine-tuning job targeting a base model (gpt-4o-mini recommended). (4) Wait 15-60 minutes for training. (5) Test the fine-tuned model. (6) If good, deploy. The fine-tuned model has a unique ID (ft:gpt-4o-mini:your-org:custom-model) and you use it like any other model in the AI SDK.
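Steps 2 and 3 can be sketched against the REST endpoints directly (the official `openai` Node SDK wraps these same calls). The file path, model snapshot name, and helper function are placeholders; nothing runs until you call `createFineTuneJob()` with a real API key in `OPENAI_API_KEY`:

```typescript
// Upload a JSONL file and start a fine-tuning job via the OpenAI REST API.
import fs from "node:fs";

const API = "https://api.openai.com/v1";
const headers = { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` };

// Fine-tuned model IDs look like ft:gpt-4o-mini:your-org:custom-model
function isFineTunedModelId(id: string): boolean {
  return id.startsWith("ft:");
}

async function createFineTuneJob(path: string): Promise<string> {
  // 1. Upload the JSONL training file
  const form = new FormData();
  form.append("purpose", "fine-tune");
  form.append("file", new Blob([fs.readFileSync(path)]), "training.jsonl");
  const file = await (
    await fetch(`${API}/files`, { method: "POST", headers, body: form })
  ).json();

  // 2. Create the job against a fine-tunable base snapshot
  const job = await (
    await fetch(`${API}/fine_tuning/jobs`, {
      method: "POST",
      headers: { ...headers, "Content-Type": "application/json" },
      body: JSON.stringify({ training_file: file.id, model: "gpt-4o-mini-2024-07-18" }),
    })
  ).json();
  return job.id; // poll this job until its status is "succeeded"
}
```

Once the job succeeds, the response includes the `ft:...` model ID, which you drop into your existing model call like any other model name.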
Evaluation: Is the Fine-Tuned Model Better?
Never deploy a fine-tuned model without comparing it to your prompted baseline. Use your Day 8 eval pipeline: run the SAME eval dataset against (1) base model + system prompt, (2) fine-tuned model. Compare scores. The fine-tuned model should score higher AND use fewer tokens (no long system prompt needed). If it doesn't beat the prompted baseline, your training data needs work.
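A minimal sketch of that comparison, assuming a classification task where exact-match accuracy is the score. The `Grader` shape is a stand-in for whatever calls your models (your Day 8 pipeline may score differently):

```typescript
// Run the SAME eval set against a prompted baseline and a fine-tuned
// model, then compare exact-match accuracy.
type EvalCase = { input: string; expected: string };
type Grader = (input: string) => Promise<string>;

// Exact-match accuracy with light normalization of model output.
function accuracy(expected: string[], got: string[]): number {
  const hits = expected.filter((e, i) => e === got[i].trim().toLowerCase()).length;
  return hits / expected.length;
}

async function compare(cases: EvalCase[], baseline: Grader, fineTuned: Grader) {
  const exp = cases.map((c) => c.expected);
  const baseOut = await Promise.all(cases.map((c) => baseline(c.input)));
  const ftOut = await Promise.all(cases.map((c) => fineTuned(c.input)));
  const baseScore = accuracy(exp, baseOut);
  const ftScore = accuracy(exp, ftOut);
  // Deploy only if the fine-tuned model wins on the same test set
  return { baseline: baseScore, fineTuned: ftScore, shipFineTune: ftScore > baseScore };
}
```

Track token usage alongside accuracy: a fine-tuned model that merely ties the baseline can still win on cost if it drops a 400-token system prompt.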
If a fine-tuned model doesn't beat the prompted baseline, what's most likely wrong?
Production Gotchas
Fine-tuning is NOT a silver bullet. Common failures: (1) Training on too few examples (<50) → model doesn't generalize. (2) Training on inconsistent examples → model outputs are random. (3) Not evaluating against baseline → deployed model is actually worse. (4) Fine-tuning for knowledge retrieval → use RAG instead, models forget facts. (5) Training data contains PII → now it's baked into the model. Start with 50 examples, evaluate, add 50 more, evaluate again. Iterate.
Code Comparison
Long System Prompt vs Fine-Tuned Model
Spending tokens on instructions every call vs training the model once
// ❌ 400+ token system prompt on every call
const result = await generateText({
model: openai("gpt-4o-mini"),
system: `You are a technical support agent for
Acme API Platform. Follow these rules exactly:
TONE: Professional but friendly. Never use slang.
Always address the user by name if provided.
FORMAT: Always structure responses as:
1. Acknowledge the issue
2. Provide solution steps (numbered)
3. Offer follow-up help
KNOWLEDGE: Acme API supports REST and GraphQL.
Auth uses OAuth2 with JWT tokens.
Rate limit: 1000 req/min for Pro, 100 for Free.
Pricing: Free ($0), Pro ($49/mo), Enterprise (custom).
RULES: Never discuss competitors.
Never reveal internal system details.
Always suggest upgrade for Free tier limitations.
If unsure, create a support ticket.
`, // ~400 tokens = $0.00006 per call
prompt: userMessage,
});
// Cost for 10K calls/day: $0.60 just for prompt
// Plus: fragile, easy to forget rules

// ✅ Fine-tuned model already knows the rules
const result = await generateText({
model: openai("ft:gpt-4o-mini:acme:support-v3"),
// No system prompt needed — or minimal one
system: "You are Acme support.", // ~5 tokens
prompt: userMessage,
});
// The model already knows:
// - Acme's tone and format
// - Product details and pricing
// - Rules about competitors and upgrades
// - How to structure responses
// Cost for 10K calls/day: $0.0075 for prompt
// 80x cheaper prompt tokens
// Plus: more consistent behavior
// Training cost: ~$5-20 one-time for 200 examples
// Break-even: ~2 days of production traffic
KEY DIFFERENCES
- Long system prompts cost tokens on every single call
- Fine-tuned models 'know' your rules without being told each time
- Break-even: training cost vs saved tokens over time
- Use fine-tuning for FORMAT and STYLE, not for KNOWLEDGE (use RAG for that)
Bridge Map: Config files + templates → Fine-tuning + prompt optimization
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
Fine-Tuning Decision Framework + Classifier
Build and deploy a complete fine-tuning workflow: curate training data, fine-tune a model (or simulate the process), evaluate it against a prompted baseline, and build a decision dashboard that shows whether fine-tuning was worth it. This teaches the most important AI engineering skill: knowing when NOT to fine-tune.
Acceptance Criteria
- Create 50+ high-quality training examples in JSONL format for a classification task
- Run a fine-tuning job using the OpenAI API (or simulate with documented steps)
- Evaluate the fine-tuned model against a well-prompted baseline on the same test set
- Build a comparison dashboard showing accuracy, cost, latency, and quality metrics
- Include a decision framework visualization: when to prompt vs RAG vs fine-tune
- Document the decision: was fine-tuning worth it? (often it isn't)
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Plan the workflow: data curation → training → evaluation → comparison → decision.
npx create-next-app@latest fine-tuning-lab --typescript --tailwind --app
Create folders: /data (training examples), /lib/eval (evaluation logic), /app/dashboard
Deploy Tip
Push to GitHub and import into Vercel. Pre-load comparison results so the dashboard has data. The decision framework is the most valuable part — make it visually clear and actionable.