Day 11 · Ship AI to Production

LLMOps & Observability

You use APM to monitor APIs. LLMOps adds AI-specific observability: prompt tracing, token usage tracking, latency per model, quality scores, and cost dashboards. You'll integrate real tools (Langfuse) and learn what to log, trace, and alert on in production AI systems.

80 min (+30 min boss) · ★★★☆☆
📊
Bridge: APM / Datadog / logging → LLM tracing + AI-specific monitoring

Use this at work tomorrow

Add token usage and latency logging to every LLM call — find your most expensive prompts.

Learning Objectives

  1. Instrument LLM calls with structured logging (prompt, tokens, latency, cost)
  2. Build end-to-end traces for multi-step AI pipelines (RAG, agents)
  3. Integrate Langfuse for production-grade AI observability
  4. Set up alerting: cost spikes, quality drops, latency degradation
  5. Ship an observable AI pipeline with real-time tracing dashboard

Ship It: Observable AI pipeline

By the end of this day, you'll build and deploy an observable AI pipeline. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can implement structured LLM logging, distributed tracing, and Langfuse integration for production AI observability.

1 = no idea · 5 = ship it blindfolded
Predict First — Then Learn

What makes debugging AI features harder than traditional features?

LLMOps: DevOps for AI Features

You know DevOps: CI/CD, monitoring, alerting, logging. LLMOps applies those same principles to AI features. The difference: LLM outputs are non-deterministic, so you need specialized observability. You can't grep logs for errors when the 'error' is a subtly wrong answer. This is where structured logging, tracing, and platforms like Langfuse come in.

💡LLMOps = DevOps for AI. You need specialized observability because 'wrong answer' isn't in your error logs.
Quick Pulse Check

What's the key difference between DevOps and LLMOps?

Predict First — Then Learn

What's the minimum you should log for every LLM call?

Structured Logging for LLM Calls

Every LLM call should log: (1) input prompt (or hash for privacy), (2) model used, (3) token counts (prompt + completion), (4) latency, (5) cost estimate, (6) user ID, (7) any errors. This data lets you debug issues, track costs, spot regressions, and understand user patterns. Use structured JSON logging — not console.log() — so you can query and alert on it.

💡Log 7 fields per LLM call as structured JSON. console.log() is useless for querying and alerting.
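As a sketch, those seven fields map onto one JSON log line. The helper below is illustrative — `buildLogEntry` and the per-token prices are assumptions for the example, not a real library API; check your provider's pricing page for current rates:

```typescript
import { createHash } from "crypto";

interface LLMLogEntry {
  timestamp: string;
  promptHash: string; // hash instead of raw prompt, for privacy
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUsd: number;
  userId: string;
  error: string | null;
}

// Illustrative prices per 1M tokens -- placeholders, not current rates
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

export function buildLogEntry(params: {
  prompt: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  userId: string;
  error?: string;
}): LLMLogEntry {
  const price = PRICES[params.model] ?? { input: 0, output: 0 };
  const costUsd =
    (params.promptTokens * price.input +
      params.completionTokens * price.output) / 1_000_000;
  return {
    timestamp: new Date().toISOString(),
    promptHash: createHash("sha256").update(params.prompt).digest("hex").slice(0, 16),
    model: params.model,
    promptTokens: params.promptTokens,
    completionTokens: params.completionTokens,
    latencyMs: params.latencyMs,
    costUsd,
    userId: params.userId,
    error: params.error ?? null,
  };
}

// Emit as one JSON line so your log platform can index every field
console.log(JSON.stringify(buildLogEntry({
  prompt: "Summarize this doc",
  model: "gpt-4o-mini",
  promptTokens: 1200,
  completionTokens: 300,
  latencyMs: 850,
  userId: "user-42",
})));
```

One JSON object per call means you can later query "cost by user" or "p95 latency by model" instead of grepping free-form strings.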
Quick Pulse Check

Why use structured JSON logging instead of console.log() for LLM calls?

Predict First — Then Learn

How many steps does a typical RAG query touch before returning a response?

Tracing: Follow a Request Through Your AI Pipeline

A RAG query hits 5+ steps: embed query → search vectors → re-rank → build prompt → LLM call → parse output. When something goes wrong, you need to see the full trace. Langfuse and similar tools provide trace views: each step with its input, output, latency, and cost. Think of it as your browser DevTools Network tab for AI pipelines.

💡Tracing is your browser DevTools Network tab for AI — see every step's input, output, latency, and cost.
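To make the idea concrete without the Langfuse API, here is a hypothetical in-memory recorder that captures the same per-step data a trace view shows — all names (`Trace`, `step`, the stubbed pipeline) are invented for illustration:

```typescript
interface Span {
  name: string;
  input: unknown;
  output?: unknown;
  startMs: number;
  endMs?: number;
}

class Trace {
  spans: Span[] = [];

  // Wrap one pipeline step: record its input, output, and latency
  async step<T>(name: string, input: unknown, fn: () => Promise<T>): Promise<T> {
    const span: Span = { name, input, startMs: Date.now() };
    this.spans.push(span);
    const output = await fn();
    span.output = output;
    span.endMs = Date.now();
    return output;
  }
}

// Usage: each RAG stage becomes one span, so a wrong answer can be
// pinned to the stage that produced bad data (stubbed steps here)
async function answer(query: string, trace: Trace): Promise<string> {
  const vector = await trace.step("embed", query, async () => [0.1, 0.2]);
  const docs = await trace.step("retrieve", vector, async () => ["doc-1", "doc-2"]);
  return trace.step("generate", docs, async () => `Answer based on ${docs.length} docs`);
}
```

The point is the data shape: when retrieval returns the wrong documents, the trace shows it at the "retrieve" span instead of leaving you to guess from the final answer.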
Quick Pulse Check

When a RAG answer is wrong, what does tracing help you identify?

Langfuse Integration: Observability in Practice

Langfuse is the open-source standard for LLM observability. It captures traces, scores, and costs for every LLM call. Integration is lightweight: wrap your AI SDK calls with Langfuse's trace context. You get a dashboard showing: latency trends, cost per feature, quality scores (from evals), and error rates. It's the Datadog/New Relic for AI features.

💡Langfuse = Datadog for AI. Lightweight integration, dashboard for latency, cost, quality, and errors.
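Langfuse computes these dashboard views for you, but there's no magic underneath — a hypothetical roll-up over structured call records looks like this (the `CallRecord` shape is invented for illustration):

```typescript
interface CallRecord {
  day: string;      // e.g. "2024-01-01"
  model: string;
  costUsd: number;
  latencyMs: number;
}

// Aggregate raw call records into the two views a dashboard plots:
// total cost per day, and median latency per model
export function rollup(records: CallRecord[]) {
  const costPerDay = new Map<string, number>();
  const latencies = new Map<string, number[]>();
  for (const r of records) {
    costPerDay.set(r.day, (costPerDay.get(r.day) ?? 0) + r.costUsd);
    const l = latencies.get(r.model) ?? [];
    l.push(r.latencyMs);
    latencies.set(r.model, l);
  }
  const p50PerModel = new Map<string, number>();
  for (const [model, l] of latencies) {
    const sorted = [...l].sort((a, b) => a - b);
    p50PerModel.set(model, sorted[Math.floor(sorted.length / 2)]);
  }
  return { costPerDay, p50PerModel };
}
```

This is also roughly what the boss-challenge dashboard charts: feed records in, plot the two maps.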

Prompt Version Management

Prompts are code. Version them like code. Store prompts in your repo (not hardcoded in function calls), tag versions, and track which version produced which results. When a user reports a bad answer, you need to know: which prompt version, which model, what input. Without this, debugging AI in production is impossible. Some teams use Langfuse's prompt management; others use simple git-versioned files.

💡Prompts are code — version them, tag them, track which version produced which results.
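A minimal sketch of the git-versioned-files approach — the file path, version labels, and `renderPrompt` helper are all hypothetical:

```typescript
// lib/prompts.ts -- prompts live in the repo; every change bumps the version
export interface PromptVersion {
  version: string;
  template: string;
}

const SUMMARIZE: Record<string, PromptVersion> = {
  "v1.0": { version: "v1.0", template: "Summarize: {{input}}" },
  "v2.1": { version: "v2.1", template: "Summarize in 3 bullet points: {{input}}" },
};

export const SUMMARIZE_CURRENT = "v2.1";

export function renderPrompt(version: string, input: string): { version: string; prompt: string } {
  const p = SUMMARIZE[version];
  if (!p) throw new Error(`Unknown prompt version: ${version}`);
  // Return the version alongside the text so it gets logged with the call
  return { version: p.version, prompt: p.template.replace("{{input}}", input) };
}
```

Returning the version with the rendered prompt is the key move: it flows into your trace metadata, so a bad answer can be tied back to the exact prompt that produced it.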
Quick Pulse Check

Why version prompts like code?

The Full Evolution

Watch one function evolve through every concept you just learned.

Production Gotchas

Don't log full prompts if they contain user PII — hash or redact sensitive fields. Langfuse adds ~2-5ms latency per traced call — negligible for LLM calls that take 500ms+. Set up alerts for: latency spikes (model provider issues), cost spikes (runaway loops), and error rate spikes (API failures). Keep 30 days of traces minimum — AI bugs often surface weeks later when users report 'it used to work.' Separate your eval scores from user feedback — they measure different things.
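One way to handle the PII point above is to hash sensitive fields before logging — the `redact` helper and field list are illustrative; adapt them to your own data:

```typescript
import { createHash } from "crypto";

// Hypothetical list -- replace with the sensitive fields in your payloads
const SENSITIVE_FIELDS = new Set(["email", "phone", "ssn"]);

// Replace sensitive values with a stable truncated hash: you can still
// correlate log entries for the same value without storing the raw PII.
export function redact(payload: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(payload)) {
    out[key] = SENSITIVE_FIELDS.has(key)
      ? "sha256:" + createHash("sha256").update(value).digest("hex").slice(0, 12)
      : value;
  }
  return out;
}
```

Because the hash is deterministic, "all traces for this user's email" is still a valid query even though the email itself never reaches your logs.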

Code Comparison

console.log vs Structured LLM Observability

Basic logging vs production LLM observability with Langfuse

console.log (unstructured) · Traditional
// ❌ Unstructured logging
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { message } = await req.json();

  console.log("User asked:", message);

  const result = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: message,
  });

  console.log("AI responded:", result.text);

  return Response.json({ text: result.text });
}

// Problems:
// - Can't query logs by model, cost, or user
// - No latency tracking
// - No token or cost data
// - Can't correlate prompt versions with outputs
// - Good luck debugging at 3 AM
Langfuse Tracing (structured) · AI Engineering
// ✅ Structured observability with Langfuse
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
});

export async function POST(req: Request) {
  const { message, userId } = await req.json();

  // Create a trace for this request
  const trace = langfuse.trace({
    name: "chat",
    userId,
    metadata: { promptVersion: "v2.1" },
  });

  const generation = trace.generation({
    name: "llm-call",
    model: "gpt-4o-mini",
    input: message,
  });

  const result = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: message,
  });

  generation.end({
    output: result.text,
    usage: {
      promptTokens: result.usage.promptTokens,
      completionTokens: result.usage.completionTokens,
    },
  });

  // Flush queued events before the serverless function exits
  await langfuse.flushAsync();

  // Now you can: query by user, track costs,
  // spot regressions, debug with full trace
  return Response.json({ text: result.text });
}

KEY DIFFERENCES

  • Every LLM call gets a trace with input, output, tokens, and cost
  • User ID enables per-user debugging and cost tracking
  • Prompt version tagging lets you correlate changes with quality
  • Dashboard shows latency trends, costs, and error rates

Bridge Map: APM / Datadog / logging → LLM tracing + AI-specific monitoring


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

Observable AI Pipeline

Build and deploy production-grade observability for an AI pipeline: structured logging for every LLM call, Langfuse integration for tracing, a cost/latency dashboard, and prompt versioning. This is the infrastructure that makes AI features maintainable.

~4h estimated
Next.js 14+ · Vercel AI SDK · Langfuse · Recharts (charts) · Tailwind CSS · Vercel (deploy)

Acceptance Criteria

  • Add structured logging to all LLM calls (model, tokens, cost, latency, user context)
  • Integrate Langfuse (or similar) for end-to-end request tracing
  • Build a dashboard showing cost per day, latency per model, token usage trends, and cache hit rates
  • Implement prompt versioning: extract prompts into versioned files with performance tracking
  • Set up alerts for cost spikes and quality drops
  • Show traces with per-step breakdown (embedding → retrieval → generation)
  • Deploy to a public URL (Vercel, Netlify, etc.)

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with an AI endpoint, logging infrastructure, and a dashboard page.

npx create-next-app@latest ai-observability --typescript --tailwind --app
Create /lib/logger.ts for structured logging and /lib/tracing.ts for Langfuse

Deploy Tip

Push to GitHub and import into Vercel. Pre-seed the dashboard with sample observability data. Set OPENAI_API_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY in Vercel environment variables.


After Learning — Rate Your Confidence Again

I can implement structured LLM logging, distributed tracing, and Langfuse integration for production AI observability.

1 = no idea · 5 = ship it blindfolded