RAG Deep Dive
RAG is the #1 AI pattern in production. You'll go far beyond the basics — learn multiple chunking strategies, debug retrieval failures, handle the #1 cause of hallucination (bad retrieval), and build a Q&A system that cites its sources. This is the day that separates AI engineers from tutorial followers.
Use this at work tomorrow
Build a Q&A bot over your team's internal docs — Confluence, Notion, or README files.
Learning Objectives
1. Master the RAG pipeline: chunk → embed → store → retrieve → generate
2. Compare chunking strategies: fixed-size, recursive, semantic, by-heading
3. Debug RAG failures: bad retrieval, context overflow, hallucination grounding
4. Add source citations with [1], [2] notation for trustworthy answers
5. Ship a document Q&A system that answers from YOUR data
Ship It: Document Q&A system
By the end of this day, you'll build and deploy a document Q&A system. This isn't a toy — it's a real project for your portfolio.
I can build a RAG pipeline that chunks documents, embeds them, retrieves relevant context, and generates grounded answers with citations.
How does RAG reduce LLM hallucinations?
RAG = Query Your Own Data with AI
RAG stands for Retrieval-Augmented Generation. Think of it as: query your database, but instead of rendering the data directly in a template, you pass it to an LLM as context to generate a natural language answer. It's the #1 AI pattern in production because it grounds LLM responses in YOUR data, dramatically reducing hallucination.
What does the 'R' in RAG do?
The RAG Pipeline: Chunk → Embed → Store → Retrieve → Generate
The RAG pipeline is a 5-step data flow. (1) Chunk: split documents into manageable pieces. (2) Embed: convert chunks to vectors. (3) Store: save vectors in a vector database. (4) Retrieve: find chunks similar to the user's query. (5) Generate: pass retrieved chunks as context to an LLM. Each step has trade-offs — chunk size affects retrieval quality, embedding model affects accuracy, and the generation prompt affects answer quality.
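The five steps can be sketched end to end with a toy in-memory index. This is a teaching sketch, not production code: the letter-frequency "embedding" and the plain array are stand-ins for a real embedding model and vector database, and `docText` is a made-up sample document.

```typescript
type Chunk = { id: number; text: string; vector: number[] };

// 1. Chunk: naive fixed-size split (better strategies below)
function chunkDoc(doc: string, size = 80): string[] {
  const out: string[] = [];
  for (let i = 0; i < doc.length; i += size) out.push(doc.slice(i, i + size));
  return out;
}

// 2. Embed: letter-frequency vector as a stand-in embedding;
//    a real pipeline calls an embedding model here
function fakeEmbed(text: string): number[] {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const code = ch.charCodeAt(0) - 97;
    if (code >= 0 && code < 26) v[code]++;
  }
  return v;
}

// Cosine similarity: the same metric a vector DB uses
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na * nb) || 1);
}

// 3. Store: an array stands in for the vector DB (indexing time)
const docText =
  "React hooks let you use state in function components. " +
  "The useEffect hook runs after every render by default.";
const index: Chunk[] = chunkDoc(docText).map((text, id) => ({
  id, text, vector: fakeEmbed(text),
}));

// 4. Retrieve: top-k chunks by similarity to the query (query time)
function retrieve(query: string, k = 1): Chunk[] {
  const q = fakeEmbed(query);
  return [...index]
    .sort((a, b) => cosine(q, b.vector) - cosine(q, a.vector))
    .slice(0, k);
}

// 5. Generate: retrieved chunks become the LLM's context
//    (in production: pass retrieve(query) into your generateText call)
```

Notice that steps 1-3 run once at indexing time, while steps 4-5 run on every query — that split is why retrieval quality and generation quality must be monitored separately.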
In the RAG pipeline, which step happens at query time (not during indexing)?
You split a 10,000-word document every 500 characters. What's the biggest problem?
Chunking Strategies: Fixed-Size Is Just the Beginning
Fixed-size (500 chars) is the simplest but often worst strategy — it splits mid-sentence. Recursive splitting follows document structure (paragraphs → sentences → words). Semantic chunking groups by meaning. Heading-based chunking follows document hierarchy. For code: chunk by function/class. The right strategy depends on your data. Bad chunking is the #1 cause of bad RAG.
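Heading-based chunking is worth seeing concretely. A minimal sketch for markdown docs — each section becomes one chunk that carries its heading, so the retriever gets the document's own structure for free:

```typescript
// Heading-based chunking: split a markdown document at its headings,
// keeping each heading attached to the body it introduces.
function chunkByHeading(markdown: string): { heading: string; body: string }[] {
  const chunks: { heading: string; body: string }[] = [];
  let heading = "";
  let body: string[] = [];
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // New section starts: flush the previous one
      if (heading || body.length) {
        chunks.push({ heading, body: body.join("\n").trim() });
      }
      heading = line.replace(/^#+\s*/, "");
      body = [];
    } else {
      body.push(line);
    }
  }
  if (heading || body.length) chunks.push({ heading, body: body.join("\n").trim() });
  return chunks;
}
```

In practice you'd also split oversized sections (a 5,000-char section still needs a recursive pass), but keeping the heading with its body is the core idea.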
You're chunking API documentation. What's the best strategy?
Your RAG app gives bad answers. What should you debug FIRST?
RAG Failure Modes: What Goes Wrong in Production
Garbage retrieval → hallucinated answers. If the retriever pulls irrelevant chunks, the LLM will still generate a confident answer from nonsense context. Other failure modes: context overflow (too many chunks), lost-in-the-middle (LLMs ignore middle chunks), stale data (embeddings from old docs), and adversarial queries that retrieve unrelated content. Debugging RAG means debugging retrieval first.
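One cheap defense against garbage retrieval: check the similarity scores before generating. A minimal sketch, assuming your vector DB returns chunks with a similarity score; the 0.35 threshold is an arbitrary placeholder you'd tune on your own queries.

```typescript
type Scored = { content: string; score: number };

// If even the best chunk scores below the threshold, refuse to answer
// instead of letting the LLM improvise from irrelevant context.
function selectContext(scoredChunks: Scored[], minScore = 0.35): Scored[] | null {
  const relevant = scoredChunks.filter((c) => c.score >= minScore);
  return relevant.length > 0 ? relevant : null; // null → answer "I don't know"
}
```

Returning null here is the system-level version of the "say I don't know" prompt — it catches the cooking-recipes-for-React-hooks case before the LLM ever sees the bad context.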
Your RAG system retrieves chunks about 'cooking recipes' for a question about 'React hooks'. What happens?
The Full Evolution
Watch one function evolve through every concept you just learned.
Production Gotchas
Chunk overlap prevents losing context at boundaries (50-100 char overlap is standard). Always include document metadata (source, date, section) in your chunks — you'll need it for citations and filtering. Monitor retrieval quality separately from generation quality — if retrieval is bad, no prompt can fix generation. Re-rank after retrieval for better quality (reorders by actual relevance, not just vector similarity).
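The overlap and metadata gotchas combine naturally in one chunker. A sketch of fixed-size chunking with a 75-char overlap and per-chunk metadata (the `MetaChunk` shape is an assumption — match it to whatever your vector DB stores):

```typescript
type MetaChunk = { content: string; source: string; position: number };

// Overlapping fixed-size chunks: a sentence that straddles a boundary
// survives intact in at least one chunk. Metadata travels with each
// chunk so citations and filtering work later.
function chunkWithOverlap(
  text: string,
  source: string,
  size = 500,
  overlap = 75
): MetaChunk[] {
  const chunks: MetaChunk[] = [];
  const step = size - overlap; // advance less than `size` to overlap
  for (let i = 0; i < text.length; i += step) {
    chunks.push({ content: text.slice(i, i + size), source, position: i });
    if (i + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The `position` field also lets you deep-link a citation back to the exact spot in the source document.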
Code Comparison
Data Query: SQL + Template vs RAG
Traditional data display vs RAG-powered answers
// Traditional: query DB, render template
const docs = await db.query(
  "SELECT * FROM docs WHERE topic = $1",
  [userQuestion]
);
return docs.map(doc => ({
  title: doc.title,
  snippet: doc.content.slice(0, 200),
  link: doc.url,
}));
// Returns: list of links
// User must read & synthesize themselves

// RAG: retrieve context, generate answer
// 1. Embed user's question
const { embedding } = await embed({
  model: openai.embedding(
    "text-embedding-3-small"
  ),
  value: userQuestion,
});
// 2. Retrieve relevant chunks
const chunks = await vectorDB.query({
  vector: embedding, topK: 5,
});
// 3. Generate grounded answer
const { text } = await generateText({
  model: openai("gpt-4o-mini"),
  system: `Answer based ONLY on the context.
If the answer isn't there, say "I don't know."`,
  prompt: `Context:
${chunks.map(c => c.content).join("\n\n")}
Question: ${userQuestion}`,
});
// Returns: synthesized answer with context

KEY DIFFERENCES
- Traditional: user searches → reads → synthesizes answer manually
- RAG: user asks → system retrieves → LLM synthesizes → user gets answer
- RAG pipeline: Chunk → Embed → Store → Retrieve → Generate
- The 'only answer from context' prompt prevents hallucination
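The citation half of the challenge comes down to prompt construction: number each chunk and tell the model to cite those numbers. A sketch, assuming chunks carry a `source` field; the exact wording is one workable phrasing, not the only one.

```typescript
type SourceChunk = { content: string; source: string };

// Label each chunk [1], [2], ... so the model can cite them inline.
function buildCitationPrompt(question: string, chunks: SourceChunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.content}`)
    .join("\n\n");
  return [
    "Answer ONLY from the numbered context below.",
    "Cite sources inline as [1], [2] after each claim.",
    'If the answer is not in the context, say "I don\'t know."',
    "",
    `Context:\n${context}`,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Because you assigned the numbers yourself, you can map each `[n]` in the answer back to its chunk's `source` and render real links in the UI.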
Chunking: Fixed vs Recursive
Why chunking strategy matters for RAG quality
// Fixed-size: simple but crude
function chunkFixed(text: string, size = 500) {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
// Problem: "The React useEffect hook
// runs after every-"
// [CHUNK BOUNDARY]
// "-render by default."
// Splits mid-sentence! Context lost.

// Recursive: follows document structure
function chunkRecursive(
  text: string,
  maxSize = 500
) {
  // Split by paragraphs first
  const paragraphs = text.split("\n\n");
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    if ((current + para).length > maxSize) {
      if (current) chunks.push(current.trim());
      current = para;
    } else {
      current += "\n\n" + para;
    }
  }
  if (current) chunks.push(current.trim());
  return chunks;
}
// Respects paragraph boundaries!
// Each chunk is a complete thought.

KEY DIFFERENCES
- Fixed-size is easy to implement but splits mid-sentence
- Recursive follows document structure (paragraphs → sentences)
- Bad chunking = bad retrieval = hallucinated answers
- Always test your chunking on real docs — look at the boundaries
Bridge Map: Database queries → Retrieval + AI generation
Click any bridge to see the translation
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
Document Q&A System
Build and deploy a RAG-powered document Q&A system that lets users upload documents, ask questions in natural language, and get accurate answers with source citations. This is the #1 AI pattern in production — build it for real.
Acceptance Criteria
- Accept document uploads (text, markdown, or PDF) and chunk them intelligently
- Generate and store embeddings for all document chunks
- Retrieve the most relevant chunks for a user's question using vector similarity
- Generate answers grounded in the retrieved context with [1], [2] source citations
- Handle 'I don't know' gracefully when the answer isn't in the documents
- Support multiple documents with source attribution
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with a document upload page and API routes for processing and querying.
npx create-next-app@latest doc-qa --typescript --tailwind --app
Plan three API routes: /api/upload, /api/embed, /api/ask
Deploy Tip
Push to GitHub and import into Vercel. For the demo, pre-load a few sample documents so reviewers can try it immediately without uploading. Set your OPENAI_API_KEY in Vercel environment variables.
Sign in to submit your deployed project.
I can build a RAG pipeline that chunks documents, embeds them, retrieves relevant context, and generates grounded answers with citations.