Day 6 · Build AI Products

Multimodal AI & Streaming UX

You handle file uploads. Now you'll make AI understand them — extracting structured JSON from images, analyzing documents, and processing audio. Plus, you'll master the UX patterns that make AI apps feel magical: streaming responses, optimistic UI, and progressive loading.

80 min (+30 min boss) · Difficulty: ★★★☆☆
🖼️
Bridge: File uploads + loading states → Vision/audio APIs + streaming UX

Use this at work tomorrow

Add a streaming AI response to any chat-like interface in your app — it transforms the UX.

Learning Objectives

  1. Extract structured data from images (receipts → JSON, screenshots → code)
  2. Process documents with vision models (PDFs, diagrams, handwriting)
  3. Build streaming AI responses with real-time token-by-token display
  4. Implement AI UX patterns: optimistic updates, progressive loading, error recovery
  5. Ship a receipt scanner that extracts structured data from photos

Ship It: Receipt scanner + streaming chat

By the end of this day, you'll build and deploy a receipt scanner + streaming chat. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can build multimodal AI features (image/audio → structured data), implement streaming UX, and choose the right AI UX pattern for each use case.

1 = no idea · 5 = ship it blindfolded
Predict First — Then Learn

A 1024×1024 image costs ~765 tokens on GPT-4o (~$0.003). You send 50 product photos. What's the cost?

Multimodal = New Input Types for Your AI APIs

You already handle file uploads — images, PDFs, audio. Multimodal AI processes these same file types but extracts meaning. Instead of just storing a receipt image, you can extract { vendor: 'Starbucks', total: 5.75, items: ['latte'] } as structured JSON. Same upload pipeline, AI-powered extraction.

💡Multimodal = same file upload pipeline + AI understanding. Image → structured JSON replaces months of OCR work.
Quick Pulse Check

What does 'multimodal' mean in the context of AI APIs?

Structured Data from Images: The Killer Use Case

The most practical multimodal skill: image → structured JSON. Receipts → expense data. Screenshots → UI code. Diagrams → descriptions. Handwriting → text. Use generateObject() with a Zod schema and a vision model — same structured output pattern from Day 1, but with image input. This replaces months of OCR/computer vision work.

💡Image → structured JSON is the killer use case. Same generateObject() + Zod pattern, just add image input. Resize images to save tokens.
Quick Pulse Check

You want to extract receipt data as typed JSON from a photo. What API pattern?

Predict First — Then Learn

Why does ChatGPT stream tokens instead of waiting for the full response?

Streaming: The UX Pattern That Makes AI Feel Magical

Waiting 3-5 seconds for a full response feels broken. Streaming token-by-token feels alive. The Vercel AI SDK provides streamText() (server) and useChat() (client) for this. It uses the same ReadableStream Web API you know. This is why ChatGPT, Cursor, and every great AI app streams — the perceived latency drops from seconds to milliseconds.

💡Streaming drops perceived latency from seconds to milliseconds. streamText() + useChat() = same ReadableStream API you know.
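That ReadableStream claim is easy to verify without any SDK; a minimal, framework-free sketch of consuming a token stream chunk by chunk (the `tokenStream` helper below is a stand-in for a real model response):

```typescript
// Consume a streamed response chunk by chunk: the same ReadableStream
// mechanism that streamText() and useChat() build on under the hood.
async function readChunks(stream: ReadableStream<Uint8Array>): Promise<string[]> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  const chunks: string[] = [];
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // In a UI you would append each chunk to state here instead of buffering
    chunks.push(decoder.decode(value, { stream: true }));
  }
  return chunks;
}

// Stand-in for a model response: a stream that emits each token as a chunk
function tokenStream(tokens: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    start(controller) {
      for (const t of tokens) controller.enqueue(encoder.encode(t));
      controller.close();
    },
  });
}

readChunks(tokenStream(["Hello", " ", "world"])).then((chunks) => {
  console.log(chunks.join("")); // "Hello world"
});
```

The same `getReader()` loop works on a `fetch()` response body, which is exactly what the AI SDK manages for you.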
Predict First — Then Learn

Your AI auto-categorizes uploaded documents. The user uploads and waits for the AI. What UX pattern?

AI UX Patterns: Beyond the Chat Interface

Not everything needs to be a chatbot. AI UX patterns include: streaming text (token by token), optimistic UI (show placeholder while AI generates), progressive enrichment (show basic answer → enrich with details), skeleton loading with AI-specific messaging ('Analyzing your image...'), and graceful degradation (fallback when AI fails). The best AI apps feel fast even when the model is slow.

💡5 key patterns: streaming, optimistic UI, progressive enrichment, AI skeleton loading, graceful degradation. Not everything is a chatbot.
Quick Pulse Check

Your AI search takes 4 seconds. Which UX pattern makes it feel faster?
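One of the five patterns above, graceful degradation, fits in a tiny generic wrapper; a sketch under our own naming (`withFallback` is illustrative, not an SDK API):

```typescript
// Graceful degradation: run an AI call, but fall back to a default value
// on failure so the UI can still render something useful.
async function withFallback<T>(
  aiCall: () => Promise<T>,
  fallback: T,
): Promise<{ value: T; degraded: boolean }> {
  try {
    return { value: await aiCall(), degraded: false };
  } catch {
    // The model errored or timed out: serve the fallback and flag it so
    // the UI can show "AI analysis unavailable" instead of a broken page
    return { value: fallback, degraded: true };
  }
}

// Usage sketch: fall back to a neutral category if classification fails
// const { value, degraded } = await withFallback(() => classifyDoc(file), "uncategorized");
```

The `degraded` flag matters: silently serving fallbacks hides outages, while surfacing it lets the UI degrade honestly.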

The Full Evolution

Watch one function evolve through every concept you just learned.

Production Gotchas

  • Image tokens are expensive: a 1024×1024 image costs ~765 tokens on GPT-4o (~$0.003). Resize images before sending to cut costs.
  • Audio transcription (Whisper) is separate from the chat models; it's a different API endpoint.
  • For PDFs, convert each page to an image first, or extract the text.
  • Rate limit file-heavy endpoints more aggressively; users love uploading 50 images at once.
  • Always validate file type and size server-side.
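The ~765-token figure comes from OpenAI's tile-based pricing for high-detail images: 85 base tokens plus 170 per 512×512 tile, counted after downscaling. A quick estimator, assuming that published formula still holds:

```typescript
// Estimate image token cost for GPT-4o-style high-detail input:
// 85 base tokens + 170 per 512x512 tile, counted after the image is
// scaled to fit 2048x2048 and its shortest side is scaled to 768px.
function estimateImageTokens(width: number, height: number): number {
  // Scale to fit within a 2048x2048 square
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;
  // Scale so the shortest side is at most 768px
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.ceil(w * shrink);
  h = Math.ceil(h * shrink);
  // Count the 512x512 tiles the scaled image covers
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024)); // 765: the figure quoted above
console.log(estimateImageTokens(512, 512));   // 255: resizing cuts cost ~3x
```

At ~$0.003 per 1024×1024 image, 50 product photos land around $0.15; pre-resizing to 512×512 roughly triples your budget's reach.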

Code Comparison

File Upload vs Vision AI Understanding

Processing files traditionally vs with AI — from metadata to understanding

Image Upload (metadata only) · Traditional
// Traditional image processing
import sharp from "sharp";

const file = formData.get("image") as File;
const buffer = Buffer.from(await file.arrayBuffer());
const metadata = await sharp(buffer).metadata();

return {
  width: metadata.width,
  height: metadata.height,
  format: metadata.format,
  size: file.size,
};
// Can extract: dimensions, format, size
// CANNOT understand what's IN the image
Vision AI (content understanding) · AI Engineering
// AI image understanding + extraction
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const { object } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: z.object({
    vendor: z.string(),
    total: z.number(),
    date: z.string(),
    items: z.array(z.object({
      name: z.string(),
      price: z.number(),
    })),
  }),
  messages: [{
    role: "user",
    content: [
      { type: "text",
        text: "Extract receipt data." },
      { type: "image", image: imageBuffer },
    ],
  }],
});
// Returns typed JSON:
// { vendor: "Starbucks", total: 5.75,
//   items: [{ name: "Latte", price: 5.75 }] }

KEY DIFFERENCES

  • Traditional: extract metadata (dimensions, format, size)
  • Vision AI: understand + extract structured data from content
  • Same generateObject() pattern from Day 1 — add image input
  • Replaces months of OCR/CV work with a single API call

Loading Spinner vs Streaming Response

Traditional loading vs AI streaming UX

Traditional Loading State · Traditional
// Wait for full response, show spinner
import { useState } from "react";

const [loading, setLoading] = useState(false);
const [data, setData] = useState("");

async function handleSubmit() {
  setLoading(true);

  // User sees: ⏳ spinner for 3-5 seconds
  const res = await fetch("/api/analyze", {
    method: "POST",
    body: formData,
  });
  const result = await res.json();

  setData(result.text);  // All at once
  setLoading(false);
}

// UX: Nothing... nothing... WALL OF TEXT
// Feels slow even if only 3 seconds
Streaming AI Response · AI Engineering
// Stream response token by token
"use client";
import { useChat } from "ai/react";

export default function Chat() {
  const { messages, input, handleInputChange,
    handleSubmit, isLoading } = useChat();

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          <strong>{m.role}:</strong>
          {m.content}
          {/* Text appears word by word! */}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input}
          onChange={handleInputChange} />
      </form>
    </div>
  );
}
// UX: Words flow in naturally ✨
// Feels instant even if total is 5 seconds

KEY DIFFERENCES

  • Spinner → wall of text feels slow (even at 3 seconds)
  • Streaming → words flow in naturally (feels instant)
  • useChat() handles streaming, state, and conversation history
  • Same Web Streams API you'd use for file downloads
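The useChat() client above needs a matching server route; a minimal sketch assuming the Vercel AI SDK (exact helper names such as toDataStreamResponse vary slightly between SDK versions, so check the version you install):

```typescript
// app/api/chat/route.ts: server half of the streaming chat (sketch)
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Tokens start flowing as soon as the model emits them
  const result = streamText({
    model: openai("gpt-4o-mini"),
    messages,
  });

  // Wraps the token stream in a Response that useChat() knows how to consume
  return result.toDataStreamResponse();
}
```

Nothing else is required on the server: the SDK handles chunked encoding, and useChat() on the client handles reassembly and state.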

Bridge Map: File uploads + loading states → Vision/audio APIs + streaming UX


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

Receipt Scanner + Streaming Chat

Build and deploy a multimodal AI app that extracts structured data from receipt photos and lets users ask follow-up questions about their receipts via a streaming chat. Combine vision AI with real-time UX.

~3h estimated
Next.js 14+ · Vercel AI SDK · OpenAI GPT-4o-mini (vision) · Zod · Tailwind CSS · Vercel (deploy)

Acceptance Criteria

  • Accept image uploads (receipt photos) via drag-and-drop or file picker
  • Send images to a vision model and extract structured data (items, prices, total, date, merchant)
  • Display extracted receipt data in a clean, editable format
  • Add streaming chat where users can ask questions about the receipt data
  • Show progressive loading states ('Analyzing receipt...', 'Extracting items...')
  • Handle errors: blurry images, non-receipt images, API failures
  • Deploy to a public URL (Vercel, Netlify, etc.)
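For the blurry-image and non-receipt criteria, one option is a server-side plausibility check on the extracted object before displaying it; a hypothetical sketch (the Receipt shape and looksValid name are illustrative, not part of any SDK):

```typescript
// Sanity-check extracted receipt data: vision models can hallucinate
// plausible-looking values on blurry or non-receipt images.
interface Receipt {
  vendor: string;
  total: number;
  items: { name: string; price: number }[];
}

function looksValid(r: Receipt): boolean {
  // A receipt with no vendor or no line items is probably a bad extraction
  if (!r.vendor.trim() || r.items.length === 0) return false;
  if (r.total <= 0 || r.items.some((i) => i.price < 0)) return false;
  const itemSum = r.items.reduce((sum, i) => sum + i.price, 0);
  // The total should be at least the item sum (tax and tip push it higher);
  // a total well below the items suggests the model misread the image
  return r.total >= itemSum - 0.01;
}
```

When the check fails, return a "Couldn't read this receipt, try a clearer photo" response instead of showing made-up numbers.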

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with an upload page, a processing API route, and a chat API route.

npx create-next-app@latest receipt-scanner --typescript --tailwind --app
Plan two API routes: /api/scan (image → structured data) and /api/chat (follow-up questions)

Deploy Tip

Push to GitHub and import into Vercel. Include 2-3 sample receipt images users can try without uploading their own. Set your OPENAI_API_KEY in Vercel environment variables.


After Learning — Rate Your Confidence Again

I can build multimodal AI features (image/audio → structured data), implement streaming UX, and choose the right AI UX pattern for each use case.

1 = no idea · 5 = ship it blindfolded