Day 11 · Ship AI to Production

LLMOps & Observability

You use APM to monitor APIs. LLMOps adds AI-specific observability: prompt tracing, token usage tracking, latency per model, quality scores, and cost dashboards. You'll integrate real tools (Langfuse) and learn what to log, trace, and alert on in production AI systems.

80 min (+30 min boss) · ★★★☆☆
📊
Bridge: APM / Datadog / logging → LLM tracing + AI-specific monitoring

Use this at work tomorrow

Add token usage and latency logging to every LLM call — find your most expensive prompts.

Learning Objectives

  1. Instrument LLM calls with structured logging (prompt, tokens, latency, cost)
  2. Build end-to-end traces for multi-step AI pipelines (RAG, agents)
  3. Integrate Langfuse for production-grade AI observability
  4. Set up alerting: cost spikes, quality drops, latency degradation
  5. Ship an observable AI pipeline with real-time tracing dashboard

Ship It: Observable AI pipeline

By the end of this day, you'll build and deploy an observable AI pipeline. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can implement structured LLM logging, distributed tracing, and Langfuse integration for production AI observability.

1 = no idea · 5 = ship it blindfolded
Predict First — Then Learn

What makes debugging AI features harder than traditional features?

LLMOps: DevOps for AI Features

You know DevOps: CI/CD, monitoring, alerting, logging. LLMOps applies those same principles to AI features. The difference: LLM outputs are non-deterministic, so you need specialized observability. You can't grep logs for errors when the 'error' is a subtly wrong answer. This is where structured logging, tracing, and platforms like Langfuse come in.

💡LLMOps = DevOps for AI. You need specialized observability because 'wrong answer' isn't in your error logs.
Quick Pulse Check

What's the key difference between DevOps and LLMOps?

Predict First — Then Learn

What's the minimum you should log for every LLM call?

Structured Logging for LLM Calls

Every LLM call should log: (1) input prompt (or hash for privacy), (2) model used, (3) token counts (prompt + completion), (4) latency, (5) cost estimate, (6) user ID, (7) any errors. This data lets you debug issues, track costs, spot regressions, and understand user patterns. Use structured JSON logging — not console.log() — so you can query and alert on it.

💡Log 7 fields per LLM call as structured JSON. console.log() is useless for querying and alerting.
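As a sketch, those seven fields map onto one JSON log line. The helper below is illustrative — `buildLogEntry` and the per-token prices are assumptions for the example, not a real library API; check your provider's pricing page for current rates:

```typescript
import { createHash } from "crypto";

interface LLMLogEntry {
  timestamp: string;
  promptHash: string; // hash instead of raw prompt, for privacy
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUsd: number;
  userId: string;
  error: string | null;
}

// Illustrative prices per 1M tokens -- placeholders, not current rates
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

export function buildLogEntry(params: {
  prompt: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  userId: string;
  error?: string;
}): LLMLogEntry {
  const price = PRICES[params.model] ?? { input: 0, output: 0 };
  const costUsd =
    (params.promptTokens * price.input +
      params.completionTokens * price.output) / 1_000_000;
  return {
    timestamp: new Date().toISOString(),
    promptHash: createHash("sha256").update(params.prompt).digest("hex").slice(0, 16),
    model: params.model,
    promptTokens: params.promptTokens,
    completionTokens: params.completionTokens,
    latencyMs: params.latencyMs,
    costUsd,
    userId: params.userId,
    error: params.error ?? null,
  };
}

// Emit as one JSON line so your log platform can index every field
console.log(JSON.stringify(buildLogEntry({
  prompt: "Summarize this doc",
  model: "gpt-4o-mini",
  promptTokens: 1200,
  completionTokens: 300,
  latencyMs: 850,
  userId: "user-42",
})));
```

One JSON object per call means you can later query "cost by user" or "p95 latency by model" instead of grepping free-form strings.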
Quick Pulse Check

Why use structured JSON logging instead of console.log() for LLM calls?

Predict First — Then Learn

How many steps does a typical RAG query touch before returning a response?

Tracing: Follow a Request Through Your AI Pipeline

A RAG query hits 5+ steps: embed query → search vectors → re-rank → build prompt → LLM call → parse output. When something goes wrong, you need to see the full trace. Langfuse and similar tools provide trace views: each step with its input, output, latency, and cost. Think of it as your browser DevTools Network tab for AI pipelines.

💡Tracing is your browser DevTools Network tab for AI — see every step's input, output, latency, and cost.
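To make the idea concrete without the Langfuse API, here is a hypothetical in-memory recorder that captures the same per-step data a trace view shows — all names (`Trace`, `step`, the stubbed pipeline) are invented for illustration:

```typescript
interface Span {
  name: string;
  input: unknown;
  output?: unknown;
  startMs: number;
  endMs?: number;
}

class Trace {
  spans: Span[] = [];

  // Wrap one pipeline step: record its input, output, and latency
  async step<T>(name: string, input: unknown, fn: () => Promise<T>): Promise<T> {
    const span: Span = { name, input, startMs: Date.now() };
    this.spans.push(span);
    const output = await fn();
    span.output = output;
    span.endMs = Date.now();
    return output;
  }
}

// Usage: each RAG stage becomes one span, so a wrong answer can be
// pinned to the stage that produced bad data (stubbed steps here)
async function answer(query: string, trace: Trace): Promise<string> {
  const vector = await trace.step("embed", query, async () => [0.1, 0.2]);
  const docs = await trace.step("retrieve", vector, async () => ["doc-1", "doc-2"]);
  return trace.step("generate", docs, async () => `Answer based on ${docs.length} docs`);
}
```

The point is the data shape: when retrieval returns the wrong documents, the trace shows it at the "retrieve" span instead of leaving you to guess from the final answer.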
Quick Pulse Check

When a RAG answer is wrong, what does tracing help you identify?

Langfuse Integration: Observability in Practice

Langfuse is the open-source standard for LLM observability. It captures traces, scores, and costs for every LLM call. Integration is lightweight: wrap your AI SDK calls with Langfuse's trace context. You get a dashboard showing: latency trends, cost per feature, quality scores (from evals), and error rates. It's the Datadog/New Relic for AI features.

💡Langfuse = Datadog for AI. Lightweight integration, dashboard for latency, cost, quality, and errors.
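Langfuse computes these dashboard views for you, but there's no magic underneath — a hypothetical roll-up over structured call records looks like this (the `CallRecord` shape is invented for illustration):

```typescript
interface CallRecord {
  day: string;      // e.g. "2024-01-01"
  model: string;
  costUsd: number;
  latencyMs: number;
}

// Aggregate raw call records into the two views a dashboard plots:
// total cost per day, and median latency per model
export function rollup(records: CallRecord[]) {
  const costPerDay = new Map<string, number>();
  const latencies = new Map<string, number[]>();
  for (const r of records) {
    costPerDay.set(r.day, (costPerDay.get(r.day) ?? 0) + r.costUsd);
    const l = latencies.get(r.model) ?? [];
    l.push(r.latencyMs);
    latencies.set(r.model, l);
  }
  const p50PerModel = new Map<string, number>();
  for (const [model, l] of latencies) {
    const sorted = [...l].sort((a, b) => a - b);
    p50PerModel.set(model, sorted[Math.floor(sorted.length / 2)]);
  }
  return { costPerDay, p50PerModel };
}
```

This is also roughly what the boss-challenge dashboard charts: feed records in, plot the two maps.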

Prompt Version Management

Prompts are code. Version them like code. Store prompts in your repo (not hardcoded in function calls), tag versions, and track which version produced which results. When a user reports a bad answer, you need to know: which prompt version, which model, what input. Without this, debugging AI in production is impossible. Some teams use Langfuse's prompt management; others use simple git-versioned files.

💡Prompts are code — version them, tag them, track which version produced which results.
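A minimal sketch of the git-versioned-files approach — the file path, version labels, and `renderPrompt` helper are all hypothetical:

```typescript
// lib/prompts.ts -- prompts live in the repo; every change bumps the version
export interface PromptVersion {
  version: string;
  template: string;
}

const SUMMARIZE: Record<string, PromptVersion> = {
  "v1.0": { version: "v1.0", template: "Summarize: {{input}}" },
  "v2.1": { version: "v2.1", template: "Summarize in 3 bullet points: {{input}}" },
};

export const SUMMARIZE_CURRENT = "v2.1";

export function renderPrompt(version: string, input: string): { version: string; prompt: string } {
  const p = SUMMARIZE[version];
  if (!p) throw new Error(`Unknown prompt version: ${version}`);
  // Return the version alongside the text so it gets logged with the call
  return { version: p.version, prompt: p.template.replace("{{input}}", input) };
}
```

Returning the version with the rendered prompt is the key move: it flows into your trace metadata, so a bad answer can be tied back to the exact prompt that produced it.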
Quick Pulse Check

Why version prompts like code?

The Full Evolution

Watch one function evolve through every concept you just learned.

Production Gotchas

Don't log full prompts if they contain user PII — hash or redact sensitive fields. Langfuse adds ~2-5ms latency per traced call — negligible for LLM calls that take 500ms+. Set up alerts for: latency spikes (model provider issues), cost spikes (runaway loops), and error rate spikes (API failures). Keep 30 days of traces minimum — AI bugs often surface weeks later when users report 'it used to work.' Separate your eval scores from user feedback — they measure different things.
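One way to handle the PII point above is to hash sensitive fields before logging — the `redact` helper and field list are illustrative; adapt them to your own data:

```typescript
import { createHash } from "crypto";

// Hypothetical list -- replace with the sensitive fields in your payloads
const SENSITIVE_FIELDS = new Set(["email", "phone", "ssn"]);

// Replace sensitive values with a stable truncated hash: you can still
// correlate log entries for the same value without storing the raw PII.
export function redact(payload: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(payload)) {
    out[key] = SENSITIVE_FIELDS.has(key)
      ? "sha256:" + createHash("sha256").update(value).digest("hex").slice(0, 12)
      : value;
  }
  return out;
}
```

Because the hash is deterministic, "all traces for this user's email" is still a valid query even though the email itself never reaches your logs.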

Code Comparison

console.log vs Structured LLM Observability

Basic logging vs production LLM observability with Langfuse

console.log (unstructured) · Traditional
// ❌ Unstructured logging
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { message } = await req.json();

  console.log("User asked:", message);

  const result = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: message,
  });

  console.log("AI responded:", result.text);

  return Response.json({ text: result.text });
}

// Problems:
// - Can't query logs by model, cost, or user
// - No latency tracking
// - No token or cost data
// - Can't correlate prompt versions with outputs
// - Good luck debugging at 3 AM
Langfuse Tracing (structured) · AI Engineering
// ✅ Structured observability with Langfuse
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
});

export async function POST(req: Request) {
  const { message, userId } = await req.json();

  // Create a trace for this request
  const trace = langfuse.trace({
    name: "chat",
    userId,
    metadata: { promptVersion: "v2.1" },
  });

  const generation = trace.generation({
    name: "llm-call",
    model: "gpt-4o-mini",
    input: message,
  });

  const result = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: message,
  });

  generation.end({
    output: result.text,
    usage: {
      promptTokens: result.usage.promptTokens,
      completionTokens: result.usage.completionTokens,
    },
  });

  // Flush queued events before the serverless function exits
  await langfuse.flushAsync();

  // Now you can: query by user, track costs,
  // spot regressions, debug with full trace
  return Response.json({ text: result.text });
}

KEY DIFFERENCES

  • Every LLM call gets a trace with input, output, tokens, and cost
  • User ID enables per-user debugging and cost tracking
  • Prompt version tagging lets you correlate changes with quality
  • Dashboard shows latency trends, costs, and error rates

Bridge Map: APM / Datadog / logging → LLM tracing + AI-specific monitoring


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

Observable AI Pipeline

Build and deploy production-grade observability for an AI pipeline: structured logging for every LLM call, Langfuse integration for tracing, a cost/latency dashboard, and prompt versioning. This is the infrastructure that makes AI features maintainable.

~4h estimated
Next.js 14+ · Vercel AI SDK · Langfuse · Recharts (charts) · Tailwind CSS · Vercel (deploy)

Acceptance Criteria

  • Add structured logging to all LLM calls (model, tokens, cost, latency, user context)
  • Integrate Langfuse (or similar) for end-to-end request tracing
  • Build a dashboard showing cost per day, latency per model, token usage trends, and cache hit rates
  • Implement prompt versioning: extract prompts into versioned files with performance tracking
  • Set up alerts for cost spikes and quality drops
  • Show traces with per-step breakdown (embedding → retrieval → generation)
  • Deploy to a public URL (Vercel, Netlify, etc.)

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with an AI endpoint, logging infrastructure, and a dashboard page.

npx create-next-app@latest ai-observability --typescript --tailwind --app
Create /lib/logger.ts for structured logging and /lib/tracing.ts for Langfuse

Deploy Tip

Push to GitHub and import into Vercel. Pre-seed the dashboard with sample observability data. Set OPENAI_API_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY in Vercel environment variables.


After Learning — Rate Your Confidence Again

I can implement structured LLM logging, distributed tracing, and Langfuse integration for production AI observability.

1 = no idea · 5 = ship it blindfolded