Day 6 · Build AI Products

Multimodal AI & Streaming UX

You handle file uploads. Now you'll make AI understand them — extracting structured JSON from images, analyzing documents, and processing audio. Plus, you'll master the UX patterns that make AI apps feel magical: streaming responses, optimistic UI, and progressive loading.

80 min (+30 min boss) · Difficulty: ★★★☆☆
🖼️
Bridge: File uploads + loading states → Vision/audio APIs + streaming UX

Use this at work tomorrow

Add a streaming AI response to any chat-like interface in your app — it transforms the UX.

Learning Objectives

  1. Extract structured data from images (receipts → JSON, screenshots → code)
  2. Process documents with vision models (PDFs, diagrams, handwriting)
  3. Build streaming AI responses with real-time token-by-token display
  4. Implement AI UX patterns: optimistic updates, progressive loading, error recovery
  5. Ship a receipt scanner that extracts structured data from photos

Ship It: Receipt scanner + streaming chat

By the end of this day, you'll build and deploy a receipt scanner + streaming chat. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can build multimodal AI features (image/audio → structured data), implement streaming UX, and choose the right AI UX pattern for each use case.

1 = no idea · 5 = ship it blindfolded
Predict First — Then Learn

A 1024×1024 image costs ~765 tokens on GPT-4o (~$0.003). You send 50 product photos. What's the cost?

Multimodal = New Input Types for Your AI APIs

You already handle file uploads — images, PDFs, audio. Multimodal AI processes these same file types but extracts meaning. Instead of just storing a receipt image, you can extract { vendor: 'Starbucks', total: 5.75, items: ['latte'] } as structured JSON. Same upload pipeline, AI-powered extraction.

💡Multimodal = same file upload pipeline + AI understanding. Image → structured JSON replaces months of OCR work.
Quick Pulse Check

What does 'multimodal' mean in the context of AI APIs?

Structured Data from Images: The Killer Use Case

The most practical multimodal skill: image → structured JSON. Receipts → expense data. Screenshots → UI code. Diagrams → descriptions. Handwriting → text. Use generateObject() with a Zod schema and a vision model — same structured output pattern from Day 1, but with image input. This replaces months of OCR/computer vision work.

💡Image → structured JSON is the killer use case. Same generateObject() + Zod pattern, just add image input. Resize images to save tokens.
Quick Pulse Check

You want to extract receipt data as typed JSON from a photo. What API pattern?

Predict First — Then Learn

Why does ChatGPT stream tokens instead of waiting for the full response?

Streaming: The UX Pattern That Makes AI Feel Magical

Waiting 3-5 seconds for a full response feels broken. Streaming token-by-token feels alive. The Vercel AI SDK provides streamText() (server) and useChat() (client) for this. It uses the same ReadableStream Web API you know. This is why ChatGPT, Cursor, and every great AI app streams — the perceived latency drops from seconds to milliseconds.

💡Streaming drops perceived latency from seconds to milliseconds. streamText() + useChat() = same ReadableStream API you know.
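That ReadableStream claim is easy to verify without any SDK; a minimal, framework-free sketch of consuming a token stream chunk by chunk (the `tokenStream` helper below is a stand-in for a real model response):

```typescript
// Consume a streamed response chunk by chunk: the same ReadableStream
// mechanism that streamText() and useChat() build on under the hood.
async function readChunks(stream: ReadableStream<Uint8Array>): Promise<string[]> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  const chunks: string[] = [];
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // In a UI you would append each chunk to state here instead of buffering
    chunks.push(decoder.decode(value, { stream: true }));
  }
  return chunks;
}

// Stand-in for a model response: a stream that emits each token as a chunk
function tokenStream(tokens: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    start(controller) {
      for (const t of tokens) controller.enqueue(encoder.encode(t));
      controller.close();
    },
  });
}

readChunks(tokenStream(["Hello", " ", "world"])).then((chunks) => {
  console.log(chunks.join("")); // "Hello world"
});
```

The same `getReader()` loop works on a `fetch()` response body, which is exactly what the AI SDK manages for you.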
Predict First — Then Learn

Your AI auto-categorizes uploaded documents. The user uploads and waits for the AI. What UX pattern?

AI UX Patterns: Beyond the Chat Interface

Not everything needs to be a chatbot. AI UX patterns include: streaming text (token by token), optimistic UI (show placeholder while AI generates), progressive enrichment (show basic answer → enrich with details), skeleton loading with AI-specific messaging ('Analyzing your image...'), and graceful degradation (fallback when AI fails). The best AI apps feel fast even when the model is slow.

💡5 key patterns: streaming, optimistic UI, progressive enrichment, AI skeleton loading, graceful degradation. Not everything is a chatbot.
Quick Pulse Check

Your AI search takes 4 seconds. Which UX pattern makes it feel faster?
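One of the five patterns above, graceful degradation, fits in a tiny generic wrapper; a sketch under our own naming (`withFallback` is illustrative, not an SDK API):

```typescript
// Graceful degradation: run an AI call, but fall back to a default value
// on failure so the UI can still render something useful.
async function withFallback<T>(
  aiCall: () => Promise<T>,
  fallback: T,
): Promise<{ value: T; degraded: boolean }> {
  try {
    return { value: await aiCall(), degraded: false };
  } catch {
    // The model errored or timed out: serve the fallback and flag it so
    // the UI can show "AI analysis unavailable" instead of a broken page
    return { value: fallback, degraded: true };
  }
}

// Usage sketch: fall back to a neutral category if classification fails
// const { value, degraded } = await withFallback(() => classifyDoc(file), "uncategorized");
```

The `degraded` flag matters: silently serving fallbacks hides outages, while surfacing it lets the UI degrade honestly.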

The Full Evolution

Watch one function evolve through every concept you just learned.

Production Gotchas

  • Image tokens are expensive: a 1024×1024 image costs ~765 tokens on GPT-4o (~$0.003). Resize images before sending to cut costs.
  • Audio transcription (Whisper) is separate from the chat models; it's a different API endpoint.
  • For PDFs, convert each page to an image first, or extract the text.
  • Rate limit file-heavy endpoints more aggressively; users love uploading 50 images at once.
  • Always validate file type and size server-side.
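The ~765-token figure comes from OpenAI's tile-based pricing for high-detail images: 85 base tokens plus 170 per 512×512 tile, counted after downscaling. A quick estimator, assuming that published formula still holds:

```typescript
// Estimate image token cost for GPT-4o-style high-detail input:
// 85 base tokens + 170 per 512x512 tile, counted after the image is
// scaled to fit 2048x2048 and its shortest side is scaled to 768px.
function estimateImageTokens(width: number, height: number): number {
  // Scale to fit within a 2048x2048 square
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;
  // Scale so the shortest side is at most 768px
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.ceil(w * shrink);
  h = Math.ceil(h * shrink);
  // Count the 512x512 tiles the scaled image covers
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024)); // 765: the figure quoted above
console.log(estimateImageTokens(512, 512));   // 255: resizing cuts cost ~3x
```

At ~$0.003 per 1024×1024 image, 50 product photos land around $0.15; pre-resizing to 512×512 roughly triples your budget's reach.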

Code Comparison

File Upload vs Vision AI Understanding

Processing files traditionally vs with AI — from metadata to understanding

Image Upload (metadata only) · Traditional
// Traditional image processing
import sharp from "sharp";

const file = formData.get("image") as File;
const buffer = Buffer.from(await file.arrayBuffer());
const metadata = await sharp(buffer).metadata();

return {
  width: metadata.width,
  height: metadata.height,
  format: metadata.format,
  size: file.size,
};
// Can extract: dimensions, format, size
// CANNOT understand what's IN the image
Vision AI (content understanding) · AI Engineering
// AI image understanding + extraction
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const { object } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: z.object({
    vendor: z.string(),
    total: z.number(),
    date: z.string(),
    items: z.array(z.object({
      name: z.string(),
      price: z.number(),
    })),
  }),
  messages: [{
    role: "user",
    content: [
      { type: "text",
        text: "Extract receipt data." },
      { type: "image", image: imageBuffer },
    ],
  }],
});
// Returns typed JSON:
// { vendor: "Starbucks", total: 5.75,
//   items: [{ name: "Latte", price: 5.75 }] }

KEY DIFFERENCES

  • Traditional: extract metadata (dimensions, format, size)
  • Vision AI: understand + extract structured data from content
  • Same generateObject() pattern from Day 1 — add image input
  • Replaces months of OCR/CV work with a single API call

Loading Spinner vs Streaming Response

Traditional loading vs AI streaming UX

Traditional Loading State · Traditional
// Wait for full response, show spinner
import { useState } from "react";

const [loading, setLoading] = useState(false);
const [data, setData] = useState("");

async function handleSubmit() {
  setLoading(true);

  // User sees: ⏳ spinner for 3-5 seconds
  const res = await fetch("/api/analyze", {
    method: "POST",
    body: formData,
  });
  const result = await res.json();

  setData(result.text);  // All at once
  setLoading(false);
}

// UX: Nothing... nothing... WALL OF TEXT
// Feels slow even if only 3 seconds
Streaming AI Response · AI Engineering
// Stream response token by token
"use client";
import { useChat } from "ai/react";

export default function Chat() {
  const { messages, input, handleInputChange,
    handleSubmit, isLoading } = useChat();

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          <strong>{m.role}:</strong>
          {m.content}
          {/* Text appears word by word! */}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input}
          onChange={handleInputChange} />
      </form>
    </div>
  );
}
// UX: Words flow in naturally ✨
// Feels instant even if total is 5 seconds

KEY DIFFERENCES

  • Spinner → wall of text feels slow (even at 3 seconds)
  • Streaming → words flow in naturally (feels instant)
  • useChat() handles streaming, state, and conversation history
  • Same Web Streams API you'd use for file downloads
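The useChat() client above needs a matching server route; a minimal sketch assuming the Vercel AI SDK (exact helper names such as toDataStreamResponse vary slightly between SDK versions, so check the version you install):

```typescript
// app/api/chat/route.ts: server half of the streaming chat (sketch)
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Tokens start flowing as soon as the model emits them
  const result = streamText({
    model: openai("gpt-4o-mini"),
    messages,
  });

  // Wraps the token stream in a Response that useChat() knows how to consume
  return result.toDataStreamResponse();
}
```

Nothing else is required on the server: the SDK handles chunked encoding, and useChat() on the client handles reassembly and state.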

Bridge Map: File uploads + loading states → Vision/audio APIs + streaming UX


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

Receipt Scanner + Streaming Chat

Build and deploy a multimodal AI app that extracts structured data from receipt photos and lets users ask follow-up questions about their receipts via a streaming chat. Combine vision AI with real-time UX.

~3h estimated
Next.js 14+ · Vercel AI SDK · OpenAI GPT-4o-mini (vision) · Zod · Tailwind CSS · Vercel (deploy)

Acceptance Criteria

  • Accept image uploads (receipt photos) via drag-and-drop or file picker
  • Send images to a vision model and extract structured data (items, prices, total, date, merchant)
  • Display extracted receipt data in a clean, editable format
  • Add streaming chat where users can ask questions about the receipt data
  • Show progressive loading states ('Analyzing receipt...', 'Extracting items...')
  • Handle errors: blurry images, non-receipt images, API failures
  • Deploy to a public URL (Vercel, Netlify, etc.)
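For the blurry-image and non-receipt criteria, one option is a server-side plausibility check on the extracted object before displaying it; a hypothetical sketch (the Receipt shape and looksValid name are illustrative, not part of any SDK):

```typescript
// Sanity-check extracted receipt data: vision models can hallucinate
// plausible-looking values on blurry or non-receipt images.
interface Receipt {
  vendor: string;
  total: number;
  items: { name: string; price: number }[];
}

function looksValid(r: Receipt): boolean {
  // A receipt with no vendor or no line items is probably a bad extraction
  if (!r.vendor.trim() || r.items.length === 0) return false;
  if (r.total <= 0 || r.items.some((i) => i.price < 0)) return false;
  const itemSum = r.items.reduce((sum, i) => sum + i.price, 0);
  // The total should be at least the item sum (tax and tip push it higher);
  // a total well below the items suggests the model misread the image
  return r.total >= itemSum - 0.01;
}
```

When the check fails, return a "Couldn't read this receipt, try a clearer photo" response instead of showing made-up numbers.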

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with an upload page, a processing API route, and a chat API route.

npx create-next-app@latest receipt-scanner --typescript --tailwind --app
Plan two API routes: /api/scan (image → structured data) and /api/chat (follow-up questions)

Deploy Tip

Push to GitHub and import into Vercel. Include 2-3 sample receipt images users can try without uploading their own. Set your OPENAI_API_KEY in Vercel environment variables.


After Learning — Rate Your Confidence Again

I can build multimodal AI features (image/audio → structured data), implement streaming UX, and choose the right AI UX pattern for each use case.

1 = no idea · 5 = ship it blindfolded