Context Engineering: The New Software Engineering (and What Happens When You Skip It)

You switch from Claude to GPT and it stops working

It happens almost every time.

You've had an AI agent in production for three months. It works reasonably well. A new model comes out — looks better, looks cheaper — and you decide to try the migration. You swap the endpoint, tweak a couple of parameters, and suddenly the agent responds differently. Some things better, some worse. Answers that used to cite sources now make them up. The tone goes off-key. Edge cases you had under control start failing again.

First instinct: blame the new model. "It isn't as good as the old one." In reality, that's almost never it. What's happened is that your agent was leaning on idiosyncrasies of the old model — its particular way of interpreting ambiguous instructions, its default assumptions — and the new model, just as capable, interprets them differently. Switching models shouldn't break your agent. If it does, it's a sign your context was badly designed.

This is what separates today's AI projects that hold up from the ones that die the first time a provider changes: the difference isn't made by the model, it's made by the context the model receives on every call. And context isn't written — it's designed. We're starting to call that discipline context engineering, and it's the new layer where the quality of production AI is decided.

Prompt engineering isn't context engineering

For a couple of years, the conversation around applied AI was dominated by prompt engineering: optimizing the exact phrase you send to the model. "You are an expert at…", "think step by step", "answer only in JSON". That was useful when projects were simple and context was a single text box. Today, in agents operating on real systems, the prompt is just one of the pieces the model receives, and rarely the most important.

The shift in framing is this: the model doesn't reason over your prompt. It reasons over everything it sees in each call. And what it sees isn't a sentence — it's a package made of six layers, each with its function, its cost, and its tradeoffs.

Designing that package well is what we call context engineering. And the difference between a team that does it and one that just "prompts" is the difference between an agent that holds up for months in production and one that breaks at the first turn.

The six context layers

Every call an agent makes to a model is composed, explicitly or implicitly, of six kinds of information:

System prompt. The agent's identity and rules. "You are a commercial agent operating on HubSpot. You never close deals without human approval. You always cite the source of every piece of data you provide." Sits at the start of the prompt and doesn't change between calls. It's the most reinforced layer — the model weights it more than the rest.
Long-term memory. Facts the agent should remember between sessions, projects, or users. "This client prefers meetings on Tuesdays at 10. The last conversation ended with a pending proposal." It isn't general context; it's persistent context about specific entities.
RAG (retrieval-augmented generation). Document fragments retrieved based on the current question. Changes every call because it depends on what's being asked. It's the layer connecting the model to knowledge that doesn't fit in the system prompt.
Tools and their responses. The actions available to the agent (each with its contract) and the results of the ones it has already executed in the current session. If the agent just queried the CRM, that query's result is in the context when it decides what to do next.
Conversation history. What's been said so far in the session. Grows with each turn. The easiest layer to lose control of — and the most expensive when it gets out of hand.
User prompt. The actual instruction of the current turn. The sentence that triggers the next response. Usually the smallest in tokens and, paradoxically, the only one traditional "prompt engineering" focuses on.

Each layer has its size, its update frequency, its per-token cost, and its impact on model behavior. Designing the agent well means deciding what information lives in each one and why.

Decision 1: what goes in the system prompt and what goes in RAG

The most common decision and the most poorly understood. The simplified rule:

System prompt = what's stable, reinforced, and small. Agent identity, inviolable rules, expected format, closed list of tools.
RAG = what changes, is large, or depends on the question. Operational documentation, catalogs, contracts, past cases.

The classic mistake is putting in the system prompt things that should live in RAG. "Here's the full return policy, all the products, all the procedures." The moment that policy hits 30 pages, it doesn't fit. And even if it did, it'd be wasteful: if only 2% of queries touch the return policy, you're paying to send 30 pages on every call.

The reverse mistake also happens: putting in RAG things that should be in the system prompt. "Your identity as an agent is in this document retrieved by similarity." That's fragile — the day a more relevant fragment displaces the identity, the agent goes off the rails.

An operational rule we apply: if information is used on every call, it goes in the system prompt; if used occasionally, it goes in RAG; if it's huge and used occasionally, it goes through tool calling to an external system.

Decision 2: how much history to keep (and what costs you not deciding)

History is the layer that most easily breaks in production. Each turn adds tokens to the prompt, and cost grows linearly. In a 20-turn conversation with 800-token responses each, history weighs over 16K tokens — just in history, before adding RAG, tools, and system prompt.

Three strategies, and it's worth picking one explicitly:

Keep everything. Simple, expensive, fails when the conversation exceeds the model's window. Only viable in agents with short sessions and low traffic.

Summarize periodically. Every N turns, the model itself (or a cheaper one) summarizes the previous turns into a paragraph and the originals are discarded. Cuts cost, but loses nuance. Works well when what matters is the facts, not the exact wording.

Structured memory. Instead of preserving history text, the agent extracts relevant facts and stores them in structured memory. What gets passed to the model each turn isn't a transcript but synthesized facts. More complex to build, but radically cheaper in operation.

The wrong decision — the one we see most often — is not deciding. Keeping everything "just in case" and discovering three months in that the LLM bill has doubled because of history weight in every call. The choice of what to do with history is an architectural decision, not an implementation detail.

Decision 3: when a tool replaces RAG

RAG is a good hammer, but not every piece of data is a nail. There are questions vector search answers badly by design, and where the right fix isn't tuning the chunking — it's not using RAG.

Three examples where tool calling beats RAG:

"How many contracts did we sign last month?" — SQL tool over the CRM database, not fragment retrieval.
"What's the current account balance?" — Tool that calls the ERP API, not a static document that could be stale.
"What's the availability of this resource?" — Tool that checks the calendar, not a fragment about availability policies.

The rule: if the answer can change between the indexing moment and the query moment, it isn't RAG's job. It's tool calling against the live source.

The right way to design a hybrid agent is to decide, for each class of question, which tool to use — and to expose both (RAG and the specific tool) to the model with clear descriptions so the model picks. "If the question is about concrete numbers or current state, use the tool. If it's about documentary knowledge, use RAG." The model applies the rule, and the agent responds with the right source.

Decision 4: long-term memory (and when you don't need it)

Persistent memory is the most overrated layer. Almost every project thinks they need it; most don't need it at all. Before setting it up, three questions are worth asking:

Is there real multiplicity of users or contexts? If the agent operates on a single project with a single dataset, "agent memory" can simply be files in the project repository. No vector store needed.
What exactly do you want to remember? If the answer is vague ("everything important"), there's no design yet. If the answer is concrete ("each individual client's communication preferences"), then there's a real use case.
When does the remembered information get consulted? Only at predictable moments (start of a conversation with a known user)? That's structured-slot memory. Anywhere, by semantic similarity? That's vector memory.

When there is a real case, what gets stored isn't the conversation transcripts. It's extracted facts: "prefers morning meetings", "common objection: price", "last project state: waiting for legal validation". Those facts get retrieved on each call, not the transcripts.

A poorly designed memory architecture doesn't just fail to help — it degrades the agent, by injecting noise into the context and displacing relevant fragments.

Context as code

What separates a serious project from an improvised one is how each context layer is treated. The rule we apply: every context layer is treated as code.

Versioned in git, with a clear changelog.
Automated tests for expected behavior (if you change the system prompt, the critical cases must still pass).
Continuous eval pipeline over a representative dataset, with alerts if quality drops.
A/B testing between versions of the system prompt or the retrieval logic.
Operational documentation per layer, not just for the agent as a black box.

This sounds obvious when said. In practice it's what almost no AI project has set up. Most operate with a system prompt in a .txt file someone edited two weeks ago and nobody knows what changed. Without versioning, no rollback. Without evals, the team finds out something broke when a user complains. Without A/B, prompt changes are bets.

Switching models, in this framing, stops being a scare and becomes a managed exercise: swap the model, run the evals, see which cases degraded, adjust the affected layers, run the evals again. It's normal engineering, not ghost hunting.

What a real context-engineering team looks like

There's no "prompt engineer" sitting next to the team. There are engineers — the same ones working on the product backend — who understand that context layers are part of the architecture and get designed, versioned, and tested like any other component.

In projects where we apply this, there's a consistent pattern:

The system prompt lives in a system_prompt.md file in the repo, with change history.
The retrieval logic (what's searched, how it's chunked, what filters apply) lives in tested code, not a generic library call.
Tool descriptions are versioned and modified with the same friction as the tools' own code.
There's an eval dataset with real customer cases (including the ones that failed in production) that runs automatically on every change.
The token budget per layer is measured. We know that in agent X, the system prompt consumes ~1,200 tokens, RAG averages ~2,800, tools and responses ~1,500, history ~3,500. That accounting allows optimization decisions backed by data.

The most useful part of this pattern: the day the model changes, the day an integrated system updates, the day a new edge case appears, the team knows which layer to act on. They aren't doing engineering by feel; they have the components mapped.

How we build it in production

In AI projects where the client wants something that holds up for years, we dedicate the first weeks to something that still surprises: not picking the model or building the agent. Mapping the context.

We list what information the model needs to do its job. We decide, for each piece, what layer it lives in. We design the token budget per layer. We set up the agent's repo with the layers separated, versioned, with tests. Only then do we plug in the model, knowing what it'll be asked and with what.

The result of that work isn't visible in a demo. It's visible six months later, when the client changes an integrated system and the agent keeps working because the affected layer was one and it was mapped. Or when a new model shows up and the migration is a managed change, not an odyssey.

Behind each of these systems is a team of engineers who believe AI engineering isn't writing clever prompts — it's designing the six layers that prompt doesn't even see. Technical excellence isn't measured by the magic of the first prompt; it's measured because the agent survives two model changes and three RAG provider changes without the business noticing.

Production means changing a context piece doesn't break the system — because each piece is code, not folklore. That's the difference between context engineering and lucky prompting.

If your AI agent starts behaving oddly when you switch models or feed new data, it isn't the model — it's that the context isn't designed. We can audit the six layers, hand you the map of how they live today and the plan to stop them depending on luck.