February 14, 2025

5 Hard Lessons from Shipping RAG to Production

What nobody tells you about retrieval-augmented generation until you're debugging at 2am — chunking strategies, embedding drift, and latency traps that will bite you.

Tags: RAG, LangChain, LLMs, Production, Python

After shipping three production RAG systems in the last year, I’ve accumulated a list of things I wish someone had told me before I started. None of these are in the official docs. All of them cost real time.

Lesson 1: Chunk size is a hyperparameter, not a default

Every tutorial defaults to 512 tokens. That number is meaningless for your specific corpus. I’ve seen systems where 128-token chunks produced dramatically better recall than 1024-token ones — and vice versa.

What actually matters:

  • The granularity of questions your users ask
  • How self-contained your source documents are
  • Whether you need sentence-level or concept-level retrieval

Run ablation studies on chunk sizes before you go live. Build a small eval set (50–100 question-answer pairs) and measure recall@k for each configuration. An hour of setup here saves days of debugging later.
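The ablation above is little more than a loop over your eval pairs. A minimal sketch, where `toy_retrieve` and the eval pairs are stand-ins for your own retriever and labeled data:

```python
# Sketch of a chunk-size ablation: measure recall@k on a small eval set.
# Run it once per chunk-size configuration and compare the numbers.

def recall_at_k(eval_pairs, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = 0
    for question, gold_chunk_id in eval_pairs:
        top_k = retrieve(question, k=k)  # returns a list of chunk ids
        if gold_chunk_id in top_k:
            hits += 1
    return hits / len(eval_pairs)

# Toy stand-in retriever so the sketch runs end to end.
def toy_retrieve(question, k=5):
    index = {"refunds": ["c1", "c7", "c3"], "shipping": ["c2", "c9"]}
    return index.get(question, [])[:k]

eval_pairs = [("refunds", "c7"), ("shipping", "c2"), ("refunds", "c99")]
print(recall_at_k(eval_pairs, toy_retrieve, k=5))  # 2 of 3 gold chunks found
```

Swap `toy_retrieve` for a closure over your real index, and the same harness doubles as your regression test later.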

Lesson 2: Embedding models drift. Your index doesn’t.

You upgrade your embedding model from text-embedding-ada-002 to a newer model because benchmarks look better. What you didn’t plan for: every vector in your index is now in a different semantic space than your queries.

The fix: version your embedding model alongside your index. When you upgrade the model, you must re-embed and re-index everything. This sounds obvious. It is not obvious at 11pm when a product manager is asking why search quality suddenly dropped by 40%.

Keep your embedding model pinned in your dependency config:

EMBEDDING_MODEL = "text-embedding-3-small"  # DO NOT change without re-indexing
EMBEDDING_VERSION = "v2"                    # Bump this when you do change it
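Pinning the values is only half the fix; the other half is refusing to serve queries against a mismatched index. A minimal startup guard, with hypothetical names for the metadata you would store alongside the index at build time:

```python
# Startup guard (sketch): refuse to serve if the index was built with a
# different embedding model/version than the one the app is configured to use.

EMBEDDING_MODEL = "text-embedding-3-small"  # DO NOT change without re-indexing
EMBEDDING_VERSION = "v2"                    # Bump this when you do change it

def check_index_compatibility(index_metadata: dict) -> None:
    """index_metadata is whatever you persisted next to the index at build time."""
    built_with = (index_metadata.get("embedding_model"),
                  index_metadata.get("embedding_version"))
    expected = (EMBEDDING_MODEL, EMBEDDING_VERSION)
    if built_with != expected:
        raise RuntimeError(
            f"Index built with {built_with}, app expects {expected}; re-index first."
        )

# Passes silently when the versions line up:
check_index_compatibility({"embedding_model": "text-embedding-3-small",
                           "embedding_version": "v2"})
```

Failing loudly at boot is far cheaper than the silent 40% quality drop described above.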

Lesson 3: The retrieval step is where most latency hides

Most teams I’ve talked to blame the LLM for their slow response times. Usually it’s the vector search. A naive Pinecone query on a 10M-vector index with no namespace filtering can take 600ms+. That’s before you’ve sent a single token to GPT-4.

Optimizations that actually moved the needle for us:

  1. Namespace by tenant/document-type to reduce search space
  2. Use metadata filtering to pre-narrow the candidate set
  3. Cache embeddings for common queries (Redis with 1-hour TTL)
  4. Use a smaller, faster embedding model for retrieval; reserve the big model for generation

After these changes, our P95 retrieval latency dropped from 580ms to 45ms.
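Item 3 is the easiest win to prototype. Here is a pure-Python sketch of the query-embedding cache; in production we backed this with Redis and a 1-hour TTL, but the shape is the same, and `embed` is a stand-in for your real embedding call:

```python
import time

# TTL cache for query embeddings (sketch). Production version: Redis, 1h TTL.
class TTLCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # query -> (expires_at, embedding)

    def get_or_compute(self, query, compute):
        now = time.monotonic()
        entry = self._store.get(query)
        if entry and entry[0] > now:
            return entry[1]  # cache hit: skip the embedding-model round trip
        value = compute(query)
        self._store[query] = (now + self.ttl, value)
        return value

calls = []
def embed(q):  # stand-in for the real embedding API call
    calls.append(q)
    return [float(len(q)), 0.0]

cache = TTLCache(ttl_seconds=3600)
cache.get_or_compute("reset password", embed)
cache.get_or_compute("reset password", embed)  # served from cache
print(len(calls))  # 1 — the model was only called once
```

For head queries ("reset password", "pricing") the hit rate is high enough that this alone shaved tens of milliseconds off our P95.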

Lesson 4: Hybrid search beats pure vector search for most real-world queries

Pure semantic search is bad at exact lookups. If a user asks “what does RFC 2119 say about MUST vs SHOULD”, semantic similarity will give you tangentially related documents about specifications. What you need is a document that contains the literal string “RFC 2119”.

The answer: BM25 + vector search, combined with a reranker.

from langchain.retrievers import EnsembleRetriever

# bm25_retriever and vector_retriever are built elsewhere
# (e.g. BM25Retriever.from_documents(...) and your vector store's .as_retriever())
retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

The 0.4/0.6 split is a reasonable starting point. Tune it based on your query mix.
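Under the hood, EnsembleRetriever merges the two ranked lists with weighted reciprocal rank fusion (RRF). A self-contained sketch with toy document ids makes the weighting concrete:

```python
# Weighted reciprocal rank fusion (RRF): fuse ranked lists from multiple
# retrievers. A document scores weight / (c + rank + 1) per list it appears in.

def weighted_rrf(ranked_lists, weights, c=60):
    """Return doc ids sorted by fused score; c dampens low-rank contributions."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["rfc-2119", "style-guide", "faq"]      # exact keyword match wins
vector_hits = ["spec-overview", "rfc-2119", "faq"]  # semantic neighbours
print(weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6]))
# ['rfc-2119', 'faq', 'spec-overview', 'style-guide']
```

Note how "rfc-2119" wins because both retrievers rank it highly: exactly the behavior you want for the RFC 2119 query above.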

Lesson 5: Evaluate before you ship, not after

“We’ll add evals later” is the fastest way to deploy a system that confidently hallucinates. Build your evaluation harness before you start optimizing. Two things you absolutely need:

  1. Faithfulness score — does the answer actually follow from the retrieved context?
  2. Answer relevance — does the answer address what was asked?

Tools like RAGAS make this tractable without needing a massive labeled dataset. Run evals in CI. Fail the build if faithfulness drops below your threshold. Treat RAG accuracy like you treat test coverage.
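The CI gate itself can be trivial. A sketch with hypothetical names and hard-coded scores; in practice the scores would come from your eval harness (e.g. RAGAS output) rather than a literal dict:

```python
# CI gate (sketch): fail the build if any eval metric drops below its floor.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevance": 0.80}

def gate(scores: dict) -> list:
    """Return the failing metrics; an empty list means the build passes."""
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]

failures = gate({"faithfulness": 0.91, "answer_relevance": 0.78})
print(failures)  # ['answer_relevance'] — this build should fail
```

Wire `gate` into your pipeline (exit non-zero when the list is non-empty) and regressions get caught before users see them.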


The throughline in all five lessons: RAG is a system, not a feature. It needs the same engineering rigor as any other production system — observability, versioning, evaluation, and iteration. Treat it that way from day one.

Found this useful? I write about AI engineering, distributed systems, and cloud infrastructure.