Skip to content

02 — Ship RAG That Measurably Retrieves (Week 3)

Mission

Ship a question-answering app over a real corpus you actually care about — your notes, a niche hobby's documentation, a codebase's docs, a set of manuals or filings — where retrieval quality is measured, and the README shows the numbers improving: baseline → hybrid search → reranking.

Why this rung

RAG is where most AI products quietly fail, and almost always in the retrieval half — the generator gets blamed for answers the retriever never gave it the facts for. The top-1% move is treating retrieval as its own measurable system: hit-rate before answer quality, changes justified by numbers. One honest week of this teaches you more than every "advanced RAG" listicle combined — because you'll watch a fashionable technique not help on your corpus, and cutting it will be the right call.

The corpus must be real and yours: messy formatting, inconsistent length, duplicates. That mess is the curriculum. Good corpora in-domain: your runbooks / ops docs, a set of man pages, the RFCs for a protocol you work with, a cloud provider's service docs, or a slice of MITRE ATT&CK.

The mental model

RAG is two systems wearing one trenchcoat: a search engine and a writer. Users see the writer, so the writer takes the blame — but in practice most bad answers are the search engine's fault: the facts never made it into the context. That's the whole reason this week's eval measures retrieval separately from generation. If hit-rate is low, no prompt engineering on the generator can save you; if hit-rate is high and answers are still wrong, now it's a generation problem. Diagnose in that order, always.

Embeddings are lossy compression of meaning: a text becomes a point in space, and "similar meaning" becomes "nearby point." That buys you paraphrase-tolerance — a query about "reset my password" finds a doc titled "credential recovery" — but the compression throws away exactly what keyword search keeps: exact strings. Product codes, error IDs, proper names, rare jargon all blur in vibe-space. Hybrid search works because the two retrievers have complementary blind spots — dense finds what's meant, BM25 finds what's literally said — which is why fusing them is the single most reliable upgrade in retrieval, ahead of anything exotic.

The reranker is the careful reader behind the fast skimmer. Your first-stage retriever scores query and document separately (that's what makes it fast enough to search everything), so it can only judge resemblance. A cross-encoder reads query and candidate together and judges relevance — far too slow to run over the corpus, cheap to run over the top fifty. Fast-but-shallow over everything, slow-but-sharp over the shortlist: the same two-stage shape as every mature search system you've ever used.

The gotcha — chunking silently destroys context. Split a document into paragraphs and each chunk forgets which product, which year, which section it belonged to; it embeds as an orphan and retrieves as one (this is exactly what contextual retrieval repairs). And a second gotcha, about the genre: most "advanced RAG" techniques were validated on someone else's corpus. On yours, some will do nothing — this week you'll likely watch one fashionable improvement flatline, and cutting it because your numbers said so is the top-1% behavior being trained.

The path

Start here (the first hour): corpus chosen and loaded — files readable from Python, counted, one chunk embedded, a similarity score between two chunks printed. If the corpus looks messy, resist the urge to clean it; naive-and-working is Monday's whole job.

Default pick (if you haven't chosen in 30 minutes): a man-page / runbook assistant — index the man pages (or your own ops runbooks) for the tools you use daily and ask it real operational questions ("how do I rate-limit with tc?", "which flag makes rsync delete extraneous files?"). You know the right answers cold, so the eval set is fast to build and the failures are obvious on sight.

Build order — each step feeds the next:

  1. [ ] Mon — dumb baseline, end to end. Fixed-size chunking, any embedding model, top-k cosine, answering something. (Hint: a vector store's local/embedded mode — no server to run.)
  2. [ ] Tue — the mirror before the polish. The 50-question eval set, built before any quality work. (Hint: draft candidate questions with a model from your own docs, then hand-review every one and reject what you'd never really ask. Include 5+ out-of-corpus questions.)
  3. [ ] Wed — measure the baseline. hit-rate@k and MRR over the 50. This number is the villain of the week — put it in the README now, unflattering as it is.
  4. [ ] Thu — hybrid. Add BM25 alongside dense, fuse the rankings (RRF is fine), re-measure. Row two of the table.
  5. [ ] Fri — rerank. Cross-encoder over the top ~50 candidates, re-measure. Row three. (Expect one row this week to disappoint; it stays in the table anyway.)
  6. [ ] Sat — generation + honesty. Answer synthesis with citations, the weak-retrieval "I don't know" path, answer quality judged via your eval template.
  7. [ ] Sun — deploy + publish. Results table front and center, the anatomy-of-a-failure section, build-log entry.

Spec — must-haves

  • [ ] A real corpus, ≥100 documents/chunks worth of genuinely useful content.
  • [ ] An eval set of ≥50 question → expected-source (and where possible expected-answer) pairs, drawn from questions you'd really ask. Build this first — it's the week's most valuable artifact.
  • [ ] Three retrieval configurations, evaluated on the same set:
    1. Baseline — embeddings + cosine top-k;
    2. Hybrid — dense + keyword (BM25), fused;
    3. Hybrid + reranker — cross-encoder reranking the candidates.
  • [ ] Retrieval metrics per config: hit-rate@k and MRR. Answer-quality metrics on top (faithfulness / correctness via LLM-as-judge from your eval template).
  • [ ] Citations in the UI — every answer shows which chunks it stands on.
  • [ ] An "I don't know" path when retrieval comes back weak — measured too (it should trigger on your out-of-corpus questions, and you should have out-of-corpus questions).
  • [ ] Deployed, with the results table front and center in the README.
  • [ ] Secure it + license/PII note: state the corpus's license and provenance in the README, and scrub PII/secrets before indexing. If the corpus has any access boundaries, retrieval must respect them — RAG's signature failure is answering from a document the asker shouldn't see (retrieval is an exfiltration path if filtering is an afterthought).

Eval bar

  • The results table exists: 3 configs × (hit-rate@k, MRR, answer quality, latency, cost).
  • Hybrid+rerank beats baseline on your set — or the README explains why it didn't and what you chose instead. A negative result honestly analyzed clears the bar.
  • ≥5 out-of-corpus questions correctly refused.

JIT learning — pull when stuck

  • sentence-transformers docs — embeddings and cross-encoder rerankers from one library; the semantic-search example is your baseline.
  • Qdrant documentation — a vector DB with painless local mode and good hybrid-search docs (any of Qdrant/pgvector/Chroma is fine; pick one and move).
  • Anthropic — Contextual retrieval — a strong, concrete write-up on why naive chunking loses context and what fixes cost/buy; includes real benchmark deltas (~25 min).
  • Ragas docs — steal metric definitions (faithfulness, context precision/recall) even if you implement scoring with your own template.
  • Eugene Yan — Patterns for building LLM systems — the retrieval and eval sections synthesize the research honestly (~30 min for those sections).

Key ideas

  • RAG = search engine + writer; the writer takes the blame for the search engine's misses.
  • Evaluate retrieval separately from generation; diagnose retrieval first, always.
  • Embeddings compress meaning and lose exact strings; BM25 keeps strings and loses meaning.
  • Hybrid wins because the blind spots are complementary, not because it's fancy.
  • Rerankers read query+candidate together — too slow for the corpus, cheap for the shortlist.
  • Chunks forget their parent document; context must be put back deliberately.
  • Techniques validated on other corpora owe you nothing — measure on yours.

Check yourself

  • Hit-rate@5 is 0.9 but answers are wrong. Where do you look next, and why?
  • Why does a query containing an exact error code embarrass pure vector search?
  • Why can't you just run the cross-encoder over the whole corpus and skip stage one?

Publish

  • The app + repo with the results table and an "anatomy of a failure" section: one query that fails, traced to why (chunking? embedding blind spot? reranker overruled a hit?).
  • Build-log entry with the numbers.

Stretch

  • Add query rewriting (expansion/decomposition) as config #4 — measure it; keep it only if it pays.
  • Swap embedding models (e.g. a small local one vs a hosted one) and show what that alone does to hit-rate — embedding choice is an underrated lever.

Proof

"I've built RAG over a real messy corpus with a 50-question eval set — I can show you hit-rate and MRR moving from baseline through hybrid to reranked, and point to the query level why."