The day your RAG leaves the notebook
The first RAG demo almost always works. You index 50 PDFs, set up Pinecone or pgvector, plug in an LLM and ask three questions whose answers you already know are in the documentation. The model answers coherently. Someone claps. We call it a proof of concept.
The problem starts the day that same system has to answer a salesperson asking about a customer who isn't in the official documentation, a technician using the team's internal jargon, and a finance director who needs a figure that lives in a database, not a PDF. That's the day RAG stops being a demo and turns into an engineering problem.
We've spent the last couple of years putting RAGs into production for companies with catalogs of thousands of products, heavy document bases, and agents that take real actions on CRMs and ERPs. These are the seven problems that show up systematically between PoC and production, and how we solve them.
Problem 1: Chunking without domain context
The standard tutorial tells you to split documents into 500-token chunks with 50-token overlap. You apply it, index, and discover the agent returns fragments that miss the information that actually matters.
A typical case: a retail client of ours with a catalog of about 12,000 SKUs. Each product sheet had a name, internal code, technical specifications and price. Size-based chunking cut fragments in the middle of tables, leaving the product code in one chunk and the specifications in another. The model retrieved the specs, didn't know which product they belonged to, and answered with the wrong code.
The fix isn't bigger chunks — it's chunking by the document's semantic structure, not by token count. For product sheets: chunk = full sheet, with the SKU repeated at the top as an anchor. For contracts: chunk = clause, with section and subsection prefixed. For Confluence: chunk = H2 section with the full breadcrumb in metadata.
This requires a custom parser per document type. It's boring work, but it's what separates a RAG that works from one that guesses.
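As a rough illustration of structure-based chunking, here is a minimal Python sketch. The `Chunk` type, the field names (`sku`, `name`, `specs`, `price`) and the sample product are assumptions made up for the example, not the client's actual schema; a real parser would be written per document type.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_product_sheet(sheet: dict) -> Chunk:
    """Return one chunk per product sheet, with the SKU repeated as the first line."""
    sku = sheet["sku"]
    spec_lines = [f"{key}: {value}" for key, value in sheet["specs"].items()]
    body = "\n".join(
        [f"SKU: {sku}", f"Name: {sheet['name']}", *spec_lines, f"Price: {sheet['price']}"]
    )
    return Chunk(text=body, metadata={"sku": sku, "doc_type": "product_sheet"})


def chunk_by_sections(title: str, sections: list[tuple[str, str]]) -> list[Chunk]:
    """Return one chunk per (heading, body) section, prefixed with its breadcrumb."""
    chunks = []
    for heading, body in sections:
        breadcrumb = f"{title} > {heading}"
        chunks.append(
            Chunk(text=f"{breadcrumb}\n{body}", metadata={"breadcrumb": breadcrumb})
        )
    return chunks


# Made-up example data, only to show the shape of the output.
sheet = {
    "sku": "AB-1042",
    "name": "Cordless drill",
    "specs": {"Voltage": "18 V", "Weight": "1.4 kg"},
    "price": "89.00 EUR",
}
chunk = chunk_product_sheet(sheet)
print(chunk.text)
```

Because the whole sheet stays in one chunk and starts with its SKU, the specifications can no longer be retrieved without the code they belong to.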
Problem 2: Embeddings that don't understand your jargon
Embedding models are trained on the internet. If your company uses "purchase order" internally and your documents talk about "PO-2024" and "order intent" interchangeably, the embedding won't know they're the same thing. The user asks about "POs" and the system retrieves generic fragments about purchase orders across any sector.
Three ways to tackle it, in order of increasing cost:
- Glossary injected into the query. Before embedding the user's question, you expand the query with internal synonyms. "How many POs do we have open?" becomes "How many POs (purchase orders, order intent) do we have open?". Near-zero cost, solves 70% of cases.
- Trained re-ranker. After the initial vector search, a smaller model re-orders results using a dataset labeled by the team. Medium cost, noticeable improvement.
- Fine-tuned embeddings. Adapt an embeddings model to your domain with (query, relevant doc) pairs. High cost, only worth it when the internal jargon is very specialized (regulated sectors, technical B2B).
Start with option 1. Most PoCs never go beyond it — and most don't need to.
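Option 1 can be sketched in a few lines of Python. The `GLOSSARY` contents and the `expand_query` helper are hypothetical names for illustration; the matching rule (whole word, optional plural "s", first occurrence only) is one simple choice among many.

```python
import re

# Hypothetical glossary: internal term -> synonyms that appear in the documents.
GLOSSARY = {
    "PO": ["purchase orders", "order intent"],
}


def expand_query(query: str, glossary: dict[str, list[str]]) -> str:
    """Append known synonyms in parentheses after the first match of each term."""
    expanded = query
    for term, synonyms in glossary.items():
        # Whole-word, case-sensitive match with an optional plural "s".
        pattern = re.compile(rf"\b{re.escape(term)}s?\b")
        addition = " (" + ", ".join(synonyms) + ")"
        expanded = pattern.sub(lambda m: m.group(0) + addition, expanded, count=1)
    return expanded


print(expand_query("How many POs do we have open?", GLOSSARY))
# prints: How many POs (purchase orders, order intent) do we have open?
```

The expanded string is what gets embedded; the user's original wording can still be shown in the UI and passed to the LLM unchanged.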
Problem 3: High recall, low precision (and it costs money)
The PoC instinct is to crank top_k up to 20. "More context, better answer." False for two reasons.
First, the model gets confused. Feeding 20 similar fragments into the prompt dilutes the reasoning. The LLM ends up mixing information from several chunks and producing answers that look coherent but are invented from correct pieces.
Second, it costs money. Every token sent to the model is billed. An agent running top_k=20 over 1,000 daily queries can quadruple the LLM bill compared to a well-tuned top_k=5, because it sends four times as many retrieved chunks with every query.
Our formula: low top_k (3-5) + re-ranker that orders by relevance + a hard filter on minimum score. If the top 3 chunks don't beat a similarity threshold, the agent answers "I don't have enough information" instead of inventing. That last part is counterintuitive coming from a demo — having the system "give up" feels like a bug. In production it's what protects user trust.
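A minimal sketch of the selection step, assuming the re-ranker has already produced a relevance score between 0 and 1 for each hit. The `Hit` type, `top_k=4` and `min_score=0.75` are illustrative values, not recommendations, and here the agent abstains when no hit clears the threshold.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    text: str
    score: float  # relevance score from the re-ranker, assumed to be in [0, 1]


def select_context(hits: list[Hit], top_k: int = 4, min_score: float = 0.75) -> list[Hit]:
    """Keep the top_k highest-scoring hits that also reach min_score."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    return [h for h in ranked[:top_k] if h.score >= min_score]


def answer_or_abstain(hits: list[Hit]) -> str:
    context = select_context(hits)
    if not context:
        return "I don't have enough information."
    # Placeholder: a real agent would put the selected chunks into the LLM prompt here.
    return "ANSWER FROM: " + ", ".join(h.chunk_id for h in context)
```

The threshold has to be tuned per embedding model and re-ranker, since score distributions are not comparable across them.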
Problem 4: No citation, no trust
An agent that answers "the customer has 3 outstanding invoices" without linking to those invoices isn't usable inside a serious business. The person reading the answer has no way to verify and won't use it to make a decision.
Citation isn't decoration. It's the proof that the model didn't invent. And forcing citations also dramatically reduces hallucinations — the model "knows" it'll have to justify itself and sticks to the retrieved fragments.
In projects where RAG mixes documents and structured data (a common case: Confluence + Postgres), the agent must be able to cite both: a link to the document and to the SQL row. Internally this is built with metadata that travels with each chunk and with SQL tools that return, alongside the result, the query that was executed. The final answer includes both references.
Operating rule: if it can't be cited, it isn't answered. If the model tries to answer without a citation, a guardrail intercepts it and retries with a stricter prompt.
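One way such a guardrail could look, as a sketch under assumptions: citations are written as `[source-id]` in the answer, and `generate` is a hypothetical wrapper around whatever LLM call the agent uses. Neither is prescribed by any particular framework.

```python
import re
from typing import Callable

CITATION = re.compile(r"\[(\w[\w-]*)\]")  # assumed citation format: [source-id]


def cited_ids(answer: str) -> set[str]:
    return set(CITATION.findall(answer))


def is_grounded(answer: str, allowed_ids: set[str]) -> bool:
    """True if the answer cites at least one source and only retrieved sources."""
    ids = cited_ids(answer)
    return bool(ids) and ids <= allowed_ids


def answer_with_citations(
    generate: Callable[[str], str],  # hypothetical wrapper around the LLM call
    question: str,
    allowed_ids: set[str],
    max_retries: int = 1,
) -> str:
    prompt = question
    for _ in range(max_retries + 1):
        answer = generate(prompt)
        if is_grounded(answer, allowed_ids):
            return answer
        # The answer was rejected: retry with a stricter instruction.
        prompt = (
            f"{question}\n"
            "Cite a source id in square brackets for every claim. "
            f"Allowed ids: {sorted(allowed_ids)}."
        )
    return "I don't have enough information."
```

Checking that cited ids belong to the retrieved set catches both missing citations and invented ones; it does not verify that the cited source actually supports the claim, which needs a separate check.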
Problem 5: Re-indexing, the hidden cost
The PoC indexes 50 PDFs once. Production indexes documents that change every day. If you re-index everything every night, you pay for embeddings that haven't changed and block the system during indexing. If you don't re-index, the RAG ages.
The piece missing in almost every PoC is an incremental indexing pipeline: detect changes in the source (Drive, SharePoint, Confluence, Postgres), figure out which chunks have changed, generate embeddings only for those, and update the vector store. In a client with a corporate Drive of around 30,000 documents, this change cut monthly embeddings cost by 92% versus daily full re-indexing.
Requirements to build it: a hash per chunk to detect real changes (no false positives from metadata), a document-level version, and a job that runs every N minutes instead of every night. Not trivial, but without this the RAG is expensive and always lags behind reality.
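The change-detection core of such a pipeline can be sketched as follows. The function names and the idea of keeping a `chunk_id -> hash` mapping from the previous run are assumptions for illustration; the embedding and vector-store calls are left out on purpose.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Hash only the chunk text, so metadata-only edits do not trigger re-embedding."""
    normalized = " ".join(text.split())  # ignore whitespace-only differences
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_reindex(stored: dict[str, str], current: dict[str, str]):
    """Compare stored hashes with the current chunk texts.

    stored:  chunk_id -> hash saved at the last indexing run
    current: chunk_id -> chunk text as it exists in the source now
    Returns (ids to embed and upsert, ids to delete from the vector store).
    """
    to_embed = [cid for cid, text in current.items() if stored.get(cid) != chunk_hash(text)]
    to_delete = [cid for cid in stored if cid not in current]
    return to_embed, to_delete


stored = {"a": chunk_hash("hello world"), "b": chunk_hash("old"), "c": chunk_hash("gone")}
current = {"a": "hello   world", "b": "new", "d": "fresh"}
print(plan_reindex(stored, current))
# prints: (['b', 'd'], ['c'])
```

Only the ids in the first list are sent to the embeddings API, which is where the savings over full re-indexing come from.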
Problem 6: Continuous evals, not a one-off benchmark
"We tested 50 questions and it answered well" is not a metric. It's a static benchmark from day zero.
The problem: a RAG in production degrades for reasons that aren't obvious. The LLM provider updates the model's behavior. 200 new documents get added and introduce ambiguities. A team changes internal naming conventions. Any of these can drop quality without anyone noticing until a user complains.
What does work: a curated eval dataset (50-200 questions with expected answers and expected fragments), a pipeline that runs it automatically every week or after every prompt change, and retrieval metrics (recall@k, MRR) plus answer metrics (groundedness, factual accuracy assessed by LLM-as-judge calibrated against human review).
When a metric drops below a threshold, an alert fires. This turns the RAG into a system with a technical SLO, not an "experiment that seemed to be going well".
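The two retrieval metrics are simple to compute once each eval case records the chunk ids that were retrieved and the ones that were expected. The case format (`retrieved`, `expected`) and the alert thresholds below are assumptions for the sketch; the answer metrics need an LLM judge and are not shown.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases: list[dict], k: int = 5) -> dict[str, float]:
    """Average both metrics over the eval set (assumed non-empty)."""
    n = len(cases)
    return {
        "recall@k": sum(recall_at_k(c["retrieved"], c["expected"], k) for c in cases) / n,
        "mrr": sum(reciprocal_rank(c["retrieved"], c["expected"]) for c in cases) / n,
    }


# Two made-up eval cases.
cases = [
    {"retrieved": ["c1", "c7", "c3"], "expected": {"c1", "c3"}},
    {"retrieved": ["c9", "c4"], "expected": {"c4"}},
]
metrics = evaluate(cases, k=2)

# Illustrative thresholds; any metric below its floor would trigger an alert.
FLOORS = {"recall@k": 0.8, "mrr": 0.7}
alerts = [name for name, value in metrics.items() if value < FLOORS[name]]
print(metrics, alerts)
```

With these two cases both averages come out at 0.75, so only recall@k falls below its floor and would raise an alert.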
Problem 7: When RAG isn't the answer
RAG is oversold. There are questions vector search answers badly by design — and forcing them through it is the cause of many stalled PoCs.
Three examples:
- "How many contracts did we sign in March?": That's SQL, not RAG. The answer lives in a database. The agent should call a tool that runs SELECT COUNT(*) FROM contracts WHERE month = 3, not retrieve fragments.
- "Generate a monthly incident summary": That's SQL + aggregation + generation. RAG adds nothing.
- "Who is the security lead in infrastructure?": That's exact search over an HR system, not semantic search.
The right pattern: the agent chooses which tool to use based on the question. RAG for documentary knowledge, SQL/tool calling for structured data, exact search for identifiers. "RAG-only" is the symptom of a PoC that hasn't matured into a hybrid system.
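To make the pattern concrete, here is a toy sketch using Python's built-in `sqlite3`. The keyword-based `route` function is only a stand-in: in a real agent the LLM would pick the tool through tool calling. The `contracts` table and its `month` column mirror the query above and are otherwise made up. The SQL tool returns the executed query next to the result, so the answer can cite it.

```python
import sqlite3


def count_contracts(conn: sqlite3.Connection, month: int) -> dict:
    """SQL tool: return the count together with the query that produced it."""
    query = "SELECT COUNT(*) FROM contracts WHERE month = ?"
    (count,) = conn.execute(query, (month,)).fetchone()
    return {"result": count, "query": query, "params": (month,)}


def route(question: str) -> str:
    """Toy keyword router; a real agent would let the LLM choose the tool."""
    q = question.lower()
    if q.startswith("how many") or "summary" in q:
        return "sql"
    if q.startswith("who is"):
        return "exact_search"
    return "rag"


# In-memory database with made-up rows, only for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER PRIMARY KEY, month INTEGER)")
conn.executemany("INSERT INTO contracts (month) VALUES (?)", [(3,), (3,), (4,)])

if route("How many contracts did we sign in March?") == "sql":
    print(count_contracts(conn, month=3))
```

The parameterized query keeps user input out of the SQL string, which matters once the month or any other filter is extracted from a free-text question.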
RAG as one of six context layers
There's a perspective that cleanly separates a mature technical RAG from a PoC RAG: understanding that RAG isn't the context of the model, but one of six context layers that a well-built agent orchestrates:
- System prompt — who the agent is, how it behaves, what it can and can't do
- Long-term memory — what it knows about this user, this session, this case
- RAG — which documents are relevant to the current question
- Tools — what actions it can execute and what they return
- Conversation history — what's been said before
- User prompt — the actual question
This framing captures well why PoCs get stuck: they treat RAG as the whole context instead of one piece inside a complete context design.
Designing all six layers well — what information lives in each one, how much token budget it gets, how they sync with each other — is what we call context engineering. And it's where the difference between an agent that works and one that guesses lives today.
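As a small illustration of the token-budget part, the sketch below splits a total context budget across the six layers. The percentages and the 8,000-token total are placeholders chosen for the example; the right split depends on the agent, the model and the use case.

```python
# Illustrative percentages only; they must add up to 100.
LAYER_PERCENT = {
    "system_prompt": 10,
    "long_term_memory": 10,
    "rag": 35,
    "tools": 20,
    "conversation_history": 15,
    "user_prompt": 10,
}


def allocate_budget(total_tokens: int, percent: dict[str, int]) -> dict[str, int]:
    """Split a total token budget across context layers using integer percentages."""
    if sum(percent.values()) != 100:
        raise ValueError("percentages must add up to 100")
    return {layer: total_tokens * share // 100 for layer, share in percent.items()}


budget = allocate_budget(8000, LAYER_PERCENT)
print(budget["rag"])  # prints: 2800
```

An explicit budget like this makes trade-offs visible: raising top_k, for example, has to be paid for by trimming history or tool output.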
How we build it in production
When we step into a RAG project, the first deliverable is rarely the indexer. It's a map of the questions the agent has to answer, segmented by type (documentary, structured data, mixed), estimated volume, and risk level if it fails.
On top of that map we decide which parts are RAG, which parts are tool calling over structured data, and which parts are hybrid. Then we set up the incremental indexing pipeline, the eval dataset with real customer cases, the mandatory-citation guardrails, and the metrics dashboard.
Behind those pipelines is a team of engineers who believe serious engineering starts by understanding the business, not by picking the model. Before embeddings or vector stores, we map how your team makes decisions and what failure you can't afford. The technical layer comes after — and we treat it the way we treat any system that has to last years in production.
Production means the system still works six months from now, not that it answered well today. That's the line between PoC and project.
If you have a RAG stuck between demo and production, or your team is vectorizing everything without a clear plan for evals, citation, and re-indexing, we can audit your pipeline and hand you a well-grounded plan to take it to production.


