LLMOps & Observability
You use APM to monitor APIs. LLMOps adds AI-specific observability: prompt tracing, token usage tracking, latency per model, quality scores, and cost dashboards. You'll integrate real tools (Langfuse) and learn what to log, trace, and alert on in production AI systems.
Use this at work tomorrow
Add token usage and latency logging to every LLM call — find your most expensive prompts.
Learning Objectives
- Instrument LLM calls with structured logging (prompt, tokens, latency, cost)
- Build end-to-end traces for multi-step AI pipelines (RAG, agents)
- Integrate Langfuse for production-grade AI observability
- Set up alerting: cost spikes, quality drops, latency degradation
- Ship an observable AI pipeline with real-time tracing dashboard
Ship It: Observable AI pipeline
By the end of this day, you'll build and deploy an observable AI pipeline. This isn't a toy — it's a real project for your portfolio.
I can implement structured LLM logging, distributed tracing, and Langfuse integration for production AI observability.
What makes debugging AI features harder than traditional features?
LLMOps: DevOps for AI Features
You know DevOps: CI/CD, monitoring, alerting, logging. LLMOps applies those same principles to AI features. The difference: LLM outputs are non-deterministic, so you need specialized observability. You can't grep logs for errors when the 'error' is a subtly wrong answer. This is where structured logging, tracing, and platforms like Langfuse come in.
What's the key difference between DevOps and LLMOps?
What's the minimum you should log for every LLM call?
Structured Logging for LLM Calls
Every LLM call should log: (1) input prompt (or hash for privacy), (2) model used, (3) token counts (prompt + completion), (4) latency, (5) cost estimate, (6) user ID, (7) any errors. This data lets you debug issues, track costs, spot regressions, and understand user patterns. Use structured JSON logging — not console.log() — so you can query and alert on it.
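The fields above can be sketched as a typed log entry. This is a minimal illustration, not a standard schema: the `LlmLogEntry` shape, the `buildLogEntry` helper, and the per-token prices are this example's assumptions — check your provider's current pricing.

```typescript
import { createHash } from "crypto";

// One structured record per LLM call. Every field here maps to a
// question you'll ask in production: which model? how much? how slow?
interface LlmLogEntry {
  timestamp: string;
  model: string;
  promptHash: string; // hash instead of raw prompt to avoid logging PII
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUsd: number;
  userId: string;
  error: string | null;
}

// Illustrative per-1K-token prices — not real, look up your provider's.
const PRICE_PER_1K = { prompt: 0.00015, completion: 0.0006 };

function buildLogEntry(opts: {
  model: string;
  prompt: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  userId: string;
  error?: string;
}): LlmLogEntry {
  return {
    timestamp: new Date().toISOString(),
    model: opts.model,
    promptHash: createHash("sha256").update(opts.prompt).digest("hex").slice(0, 16),
    promptTokens: opts.promptTokens,
    completionTokens: opts.completionTokens,
    latencyMs: opts.latencyMs,
    costUsd:
      (opts.promptTokens / 1000) * PRICE_PER_1K.prompt +
      (opts.completionTokens / 1000) * PRICE_PER_1K.completion,
    userId: opts.userId,
    error: opts.error ?? null,
  };
}

// Emit as one JSON line so log tooling can filter and alert on fields.
const entry = buildLogEntry({
  model: "gpt-4o-mini",
  prompt: "What is LLMOps?",
  promptTokens: 12,
  completionTokens: 85,
  latencyMs: 742,
  userId: "user_123",
});
console.log(JSON.stringify(entry));
```

Because each field is a queryable key rather than free text, "show me the 10 most expensive prompts this week" becomes a one-line log query.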
Why use structured JSON logging instead of console.log() for LLM calls?
How many steps does a typical RAG query touch before returning a response?
Tracing: Follow a Request Through Your AI Pipeline
A RAG query hits 5+ steps: embed query → search vectors → re-rank → build prompt → LLM call → parse output. When something goes wrong, you need to see the full trace. Langfuse and similar tools provide trace views: each step with its input, output, latency, and cost. Think of it as your browser DevTools Network tab for AI pipelines.
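The pipeline above can be sketched with a hand-rolled tracer to show what tools like Langfuse record per step. The `Span`/`Trace` shapes and the stubbed pipeline stages are illustrative assumptions, not the Langfuse API.

```typescript
// One recorded step in the pipeline: what went in, what came out, how long.
interface Span {
  name: string;
  input: unknown;
  output: unknown;
  latencyMs: number;
}

class Trace {
  spans: Span[] = [];
  // Run one pipeline step and record its input, output, and latency.
  async step<T>(name: string, input: unknown, fn: () => Promise<T>): Promise<T> {
    const start = Date.now();
    const output = await fn();
    this.spans.push({ name, input, output, latencyMs: Date.now() - start });
    return output;
  }
}

// Simulated RAG pipeline: each stage becomes one span, so a wrong
// answer can be pinned to the exact step that produced bad output.
async function answer(query: string) {
  const trace = new Trace();
  const embedding = await trace.step("embed", query, async () => [0.1, 0.2]);
  const docs = await trace.step("search", embedding, async () => ["doc-42"]);
  const prompt = await trace.step("build-prompt", docs, async () =>
    `Context: ${docs.join("\n")}\n\nQuestion: ${query}`,
  );
  const text = await trace.step("llm-call", prompt, async () => "stub answer");
  return { text, spans: trace.spans };
}

// The spans array is the "Network tab" view: one row per pipeline step.
answer("What is LLMOps?").then(({ spans }) =>
  console.log(spans.map((s) => `${s.name}: ${s.latencyMs}ms`).join("\n")),
);
```

If retrieval returned the wrong documents, you'd see it in the `search` span's output before the LLM ever ran — that's the debugging win tracing buys you.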
When a RAG answer is wrong, what does tracing help you identify?
Langfuse Integration: Observability in Practice
Langfuse is the open-source standard for LLM observability. It captures traces, scores, and costs for every LLM call. Integration is lightweight: wrap your AI SDK calls with Langfuse's trace context. You get a dashboard showing: latency trends, cost per feature, quality scores (from evals), and error rates. It's the Datadog/New Relic for AI features.
Prompt Version Management
Prompts are code. Version them like code. Store prompts in your repo (not hardcoded in function calls), tag versions, and track which version produced which results. When a user reports a bad answer, you need to know: which prompt version, which model, what input. Without this, debugging AI in production is impossible. Some teams use Langfuse's prompt management, others use simple git-versioned files.
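A git-versioned prompt file can be as simple as the sketch below: prompt text lives in the repo as data, and the version tag travels with every call. The key format (`name@version`) and fields are this example's convention, not a standard.

```typescript
// Prompts as versioned repo data — edit history comes for free via git.
const PROMPTS = {
  "summarize@v2.0": {
    version: "v2.0",
    template: "Summarize:\n\n{{input}}",
  },
  "summarize@v2.1": {
    version: "v2.1",
    template: "Summarize the following in 3 bullet points:\n\n{{input}}",
  },
} as const;

function renderPrompt(key: keyof typeof PROMPTS, input: string) {
  const { version, template } = PROMPTS[key];
  // Return the version alongside the rendered text so the caller can
  // attach it to trace metadata (e.g. { promptVersion: version }).
  return { version, text: template.replace("{{input}}", input) };
}

const rendered = renderPrompt("summarize@v2.1", "LLMOps notes");
console.log(rendered.version, "→", rendered.text.split("\n")[0]);
```

When a bad answer comes in, the logged version tag tells you whether v2.1 or v2.0 produced it — and git blame tells you who changed the template and when.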
Why version prompts like code?
The Full Evolution
Watch one function evolve through every concept you just learned.
Production Gotchas
Don't log full prompts if they contain user PII — hash or redact sensitive fields. Langfuse adds ~2-5ms latency per traced call — negligible for LLM calls that take 500ms+. Set up alerts for: latency spikes (model provider issues), cost spikes (runaway loops), and error rate spikes (API failures). Keep 30 days of traces minimum — AI bugs often surface weeks later when users report 'it used to work.' Separate your eval scores from user feedback — they measure different things.
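Two of these gotchas fit in a few lines of code: hashing prompts before they hit the logs, and a naive cost-spike check. The helpers (`redactPrompt`, `costSpike`) and the 3× threshold are illustrative assumptions, not a recommended production policy.

```typescript
import { createHash } from "crypto";

// Replace raw user text with a stable hash: logs stay queryable
// (same prompt → same hash) without storing the content itself.
function redactPrompt(prompt: string): string {
  return "sha256:" + createHash("sha256").update(prompt).digest("hex").slice(0, 12);
}

// Fire an alert when today's spend exceeds N× the trailing daily
// average — the classic signature of a runaway agent loop.
function costSpike(dailyCosts: number[], todayUsd: number, factor = 3): boolean {
  const avg = dailyCosts.reduce((a, b) => a + b, 0) / dailyCosts.length;
  return todayUsd > avg * factor;
}

console.log(redactPrompt("my SSN is 123-45-6789"));
console.log(costSpike([2.1, 1.9, 2.3], 9.5)); // average is 2.1, so 9.5 trips the alert
```

In practice you'd run a check like this on a schedule against your logged `costUsd` fields and page whoever is on call — the point is that the structured logs make the query trivial.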
Code Comparison
console.log vs Structured LLM Observability
Basic logging vs production LLM observability with Langfuse
// ❌ Unstructured logging
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { message } = await req.json();
  console.log("User asked:", message);
  const result = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: message,
  });
  console.log("AI responded:", result.text);
  return Response.json({ text: result.text });
}
// Problems:
// - Can't query logs by model, cost, or user
// - No latency tracking
// - No token or cost data
// - Can't correlate prompt versions with outputs
// - Good luck debugging at 3 AM

// ✅ Structured observability with Langfuse
import { Langfuse } from "langfuse";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
});

export async function POST(req: Request) {
  const { message, userId } = await req.json();

  // Create a trace for this request
  const trace = langfuse.trace({
    name: "chat",
    userId,
    metadata: { promptVersion: "v2.1" },
  });

  const generation = trace.generation({
    name: "llm-call",
    model: "gpt-4o-mini",
    input: message,
  });

  const result = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: message,
  });

  generation.end({
    output: result.text,
    usage: {
      promptTokens: result.usage.promptTokens,
      completionTokens: result.usage.completionTokens,
    },
  });

  // Flush buffered events before the serverless function suspends
  await langfuse.flushAsync();

  // Now you can: query by user, track costs,
  // spot regressions, debug with full trace
  return Response.json({ text: result.text });
}

KEY DIFFERENCES
- Every LLM call gets a trace with input, output, tokens, and cost
- User ID enables per-user debugging and cost tracking
- Prompt version tagging lets you correlate changes with quality
- Dashboard shows latency trends, costs, and error rates
Bridge Map: APM / Datadog / logging → LLM tracing + AI-specific monitoring
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
Observable AI Pipeline
Build and deploy production-grade observability for an AI pipeline: structured logging for every LLM call, Langfuse integration for tracing, a cost/latency dashboard, and prompt versioning. This is the infrastructure that makes AI features maintainable.
Acceptance Criteria
- Add structured logging to all LLM calls (model, tokens, cost, latency, user context)
- Integrate Langfuse (or similar) for end-to-end request tracing
- Build a dashboard showing cost per day, latency per model, token usage trends, and cache hit rates
- Implement prompt versioning: extract prompts into versioned files with performance tracking
- Set up alerts for cost spikes and quality drops
- Show traces with per-step breakdown (embedding → retrieval → generation)
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with an AI endpoint, logging infrastructure, and a dashboard page.
npx create-next-app@latest ai-observability --typescript --tailwind --app
Create /lib/logger.ts for structured logging and /lib/tracing.ts for Langfuse.
Deploy Tip
Push to GitHub and import into Vercel. Pre-seed the dashboard with sample observability data. Set OPENAI_API_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY in Vercel environment variables.