Production RAG in 2026: Architecture Patterns That Survive Real-World Use
Most prototype RAG systems break under production load. This guide covers chunking, embedding selection, vector database choice, reranking, and evaluation — the five patterns that separate production RAG from demos.
Published 2026-04-28 by Milton Acosta III
Why most RAG demos don't survive production
A weekend RAG prototype works because the demo data is small, queries are predictable, and there's no one querying the system at scale. Production RAG fails in five common ways:
- Chunks too small or too large — small chunks lose context; large chunks dilute retrieval relevance.
- Wrong embedding model for the domain — general-purpose embeddings miss domain vocabulary.
- Vector database that doesn't scale — a local Chroma instance works for 1K docs; production needs millions of vectors with latency targets.
- No reranking — pure cosine similarity returns plausible-but-wrong chunks at the top.
- No evaluation — teams ship RAG with no way to measure quality, no regression testing, no way to know when the system is broken.
Chunking strategy
Default to 500-1500 tokens per chunk with 10-20% overlap. This handles most prose well. For dense reference material (regulations, contracts, technical specs), shrink to 300-500 tokens. For narrative content (long-form articles), expand to 1000-2000.
Test multiple chunk sizes during development. The right size is empirical, not theoretical.
Key decisions:
- Semantic chunking — split at natural boundaries (paragraph, section, topic shift) rather than fixed character counts. Tools like LangChain's RecursiveCharacterTextSplitter or LlamaIndex's SemanticSplitterNodeParser handle this.
- Overlap — 10-20% overlap preserves context across chunk boundaries. Higher overlap costs storage and dilutes retrieval; lower overlap loses cross-boundary context.
- Metadata enrichment — attach source, section, date, author to each chunk. Filter retrieval by metadata for high-precision use cases.
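A minimal sketch of this chunking setup, assuming LangChain's RecursiveCharacterTextSplitter with a tiktoken-backed length function; the chunk size, overlap, and metadata fields are illustrative defaults to tune per corpus.

```python
# Token-aware chunking with ~15% overlap and metadata enrichment.
# Assumes the langchain-text-splitters and tiktoken packages; the metadata
# field names below are illustrative.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,       # tokens per chunk, tuned empirically per corpus
    chunk_overlap=120,    # ~15% overlap preserves cross-boundary context
)

def chunk_document(text: str, source: str, section: str, date: str) -> list[dict]:
    """Split one document and attach filterable metadata to every chunk."""
    return [
        {"text": chunk, "metadata": {"source": source, "section": section, "date": date}}
        for chunk in splitter.split_text(text)
    ]
```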
Embedding model selection
OpenAI's text-embedding-3-large (3072 dim) is a strong default. Cohere embed-v3, Voyage AI embed-3, and open-source models like BGE-large or E5-mistral are competitive.
Match the embedding model to the domain:
- General text — OpenAI 3-large, Cohere embed-v3
- Legal, medical, scientific — domain-specific models (e.g. SPECTER for scientific papers)
- Code — code-specific embeddings (CodeBERT, Voyage code-3)
- Multilingual — Cohere multilingual or BGE-M3
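As a concrete starting point, here is a hedged sketch of batch embedding with the OpenAI Python SDK; the batch size is an assumption, and production code should add rate-limit backoff.

```python
# Embed chunks in batches with text-embedding-3-large (3072-dimensional output).
# Batch size is illustrative; persist vectors alongside chunk metadata.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str], batch_size: int = 128) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(
            model="text-embedding-3-large",
            input=texts[i : i + batch_size],
        )
        vectors.extend(item.embedding for item in resp.data)
    return vectors
```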
Vector database choice
| Database | Best for | Avoid for |
|---|---|---|
| Pinecone | Production scale, managed, simple API | Self-hosted, on-prem requirements |
| Weaviate | Hybrid search (vector + keyword), self-hostable | Pure simplicity |
| Qdrant | Open-source, performant, rich filtering | Smallest possible footprint |
| pgvector | Already using Postgres, modest scale | Billions of vectors |
| Milvus | Massive scale (billions of vectors) | Small projects (operational complexity) |
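To make the table concrete, here is a sketch of a metadata-filtered similarity query against Qdrant; the collection name, payload keys, and filter value are illustrative assumptions, and the same pattern maps onto the other databases above.

```python
# Top-k cosine-similarity retrieval from Qdrant with a metadata filter.
# Collection name and payload keys ("source", "text") are illustrative.
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")

def retrieve(query_vector: list[float], source: str, k: int = 50):
    """Return the top-k chunks from one source, ranked by similarity."""
    return qdrant.search(
        collection_name="docs",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[models.FieldCondition(key="source", match=models.MatchValue(value=source))]
        ),
        limit=k,
    )
```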
Reranking
Top-k vector retrieval returns plausible-but-imperfect candidates. A reranker — a more expensive model that scores query-document relevance directly — reorders the top 20-50 candidates to surface the genuinely best matches.
Cohere Rerank, BGE rerankers, and sentence-transformers cross-encoder models trained on MS MARCO work well. Reranking adds 100-300ms of latency but typically improves answer quality by 10-30% on hard queries.
Production retrieval pipeline:
- Embed query
- Retrieve top 50 candidates from vector DB by cosine similarity
- Rerank with cross-encoder
- Pass top 5-10 reranked chunks to the LLM as context
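A minimal sketch of that four-step pipeline, reusing the embedding and retrieval helpers sketched earlier plus a sentence-transformers cross-encoder; the checkpoint name is one of the MS MARCO models mentioned above, and the payload layout is an assumption.

```python
# Retrieve broadly, rerank with a cross-encoder, keep only the best chunks.
# Assumes embed_texts() and retrieve() from the earlier sketches and that
# each hit stores its text under payload["text"].
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, source: str, final_k: int = 8) -> list[str]:
    query_vector = embed_texts([query])[0]                      # 1. embed query
    hits = retrieve(query_vector, source=source, k=50)          # 2. top-50 candidates
    texts = [hit.payload["text"] for hit in hits]
    scores = reranker.predict([(query, t) for t in texts])      # 3. cross-encoder rerank
    ranked = sorted(zip(scores, texts), key=lambda p: p[0], reverse=True)
    return [t for _, t in ranked[:final_k]]                     # 4. context for the LLM
```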
Evaluation framework
Without evaluation, you can't ship RAG safely. Build:
- Golden test set — 100-500 query/expected-answer pairs covering normal, edge, and adversarial cases.
- Automated evals — Run the test set on every retrieval pipeline change. Use LLM-as-judge for answer quality and classical metrics (NDCG, MRR) for retrieval quality.
- Online metrics — track answer click-through, user feedback ratings, follow-up question rate.
- Drift detection — alert when retrieval quality degrades over time. Common causes: a stale index or a shifting document distribution.
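For the retrieval half of that loop, here is a minimal MRR sketch over a golden test set; the record schema and the CI threshold in the comment are assumptions.

```python
# Mean reciprocal rank over a golden test set of
# {"query": ..., "relevant_ids": {...}} records (schema is illustrative).
def mean_reciprocal_rank(test_set: list[dict], retrieve_ids) -> float:
    """retrieve_ids(query) returns ranked chunk IDs from the pipeline under test."""
    total = 0.0
    for case in test_set:
        ranked = retrieve_ids(case["query"])
        rank = next((i + 1 for i, cid in enumerate(ranked) if cid in case["relevant_ids"]), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(test_set)

# Run on every pipeline change; fail CI on regression, e.g.:
# assert mean_reciprocal_rank(golden_set, pipeline.retrieve_ids) >= 0.7
```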
Operating considerations
- Cost monitoring — embedding costs scale with document corpus changes; LLM costs scale with traffic. Budget per-query and alert on anomalies.
- Index refresh strategy — most documents change infrequently, so incremental updates are far cheaper than a full re-embed.
- Caching — cache LLM responses for repeated queries. Even semantic caching (matching on cosine similarity of query embeddings) cuts costs dramatically.
- Observability — log query, retrieved chunks, reranked chunks, LLM input, LLM output. Production debugging requires this trace.
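A minimal sketch of that semantic-caching idea, matching new queries against cached query embeddings; the 0.95 similarity threshold is an assumption to tune against observed false hits.

```python
# Semantic cache keyed on cosine similarity of query embeddings.
# The threshold is illustrative; set it too low and stale answers leak through.
import numpy as np

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.vectors: list[np.ndarray] = []   # stored pre-normalized
        self.answers: list[str] = []

    def get(self, query_vector: list[float]) -> str | None:
        if not self.vectors:
            return None
        q = np.asarray(query_vector)
        sims = np.stack(self.vectors) @ (q / np.linalg.norm(q))
        best = int(np.argmax(sims))
        return self.answers[best] if sims[best] >= self.threshold else None

    def put(self, query_vector: list[float], answer: str) -> None:
        v = np.asarray(query_vector)
        self.vectors.append(v / np.linalg.norm(v))
        self.answers.append(answer)
```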
When NOT to use RAG
RAG isn't always the answer:
- Small, fixed knowledge base — just stuff it in the system prompt.
- Procedural knowledge — fine-tuning often outperforms RAG.
- Highly precise retrieval — sometimes a structured database query is better.
- Latency-critical use cases — RAG adds retrieval latency. Cache or denormalize.