Prompt Engineering Is Now a Subset of a Bigger Job

The wording stopped being where the work is

For the first two years I used these models seriously, the craft lived in the wording. I'd spend ten minutes on a single sentence — the exact phrasing of an instruction, the order of the constraints, whether to ask or tell. A better-worded prompt reliably meant a better outcome. That felt like the skill.

Then it stopped being true — not all at once, but gradually enough that I only noticed in retrospect. I was running a project with persistent memory, subagents handling discrete tasks, a structured file of project constraints loaded every session. When I looked honestly at where the variance in output quality was coming from, it was almost never the wording of the prompt itself. It was whether the model had the right files, the right history, the right constraint in the window at the moment it acted. What it saw. The prompt wording had become nearly a rounding error.

That's the thesis here, and I'll state it once clearly: prompt engineering didn't die. It shrank — from being the job to being one tool inside a bigger job. The bigger job is curating the whole context window: deciding what goes in, in what order, how much, from where. That curation is what determines what the model does with your prompt, and it matters more than the prompt itself.

Prompt engineering became a subset

This isn't a pattern I noticed alone, and the convergence is what makes it interesting.

In mid-2025, a loose cluster of practitioners landed on the same framing independently, within a few weeks of each other. Andrej Karpathy called it context engineering: "the delicate art and science of filling the context window with just the right information for the next step." That's not a rebranding of prompt engineering. It's a different unit of analysis: the entire context window at the moment the model acts, not the single instruction you hand it.

Simon Willison made the case for why the term stuck — unlike "prompt engineering," you can mostly infer what "context engineering" means straight from the words. That self-describing quality is rarer than it should be in this field. My own read on why it landed: it names the actual activity — deciding what information to curate into the context, and when — instead of defining itself against what it isn't.

A Gartner research note in July 2025 put it flatly: "context engineering is in, and prompt engineering is out." Analyst firms tend to lag practitioners; when one has caught up, the shift has already happened.

Anthropic's September 2025 engineering post offered the tightest definition: context engineering is "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference." The word maintaining is doing a lot of work there. Not just filling — maintaining. The implication is that context is active, not static: you keep it right over time, task by task, not just at session start.

And from Cognition, Walden Yan drew the operational conclusion: context engineering is "effectively the #1 job of engineers building AI agents." Not one concern among several — the first one.

The leverage moved because the models changed. Early models rewarded clever phrasing; current models are already highly capable, and the ceiling on what they can do from a well-curated context far exceeds what better wording alone can extract. Prompt engineering is now a subset of context engineering — one of the inputs you curate, not the whole game.

The shift is mechanical, not fashion

The reason curation beats dumping isn't taste. It's measurable degradation: output quality drops as the window fills, well before you approach the token limit. I dug into the evidence (the Chroma study, lost-in-the-middle, the attention-budget framing) in a separate post on context rot and agent memory; I won't re-run it here.

The reason it matters for the rest of this post: if the fuller context measurably degrades output, then the question "what do I put in the window" is a performance question, not just an organizational nicety. Every discipline that follows — how reasoning models change the equation, when to use multiple agents, how to run evals — is downstream of the same mechanical fact. The art is curation. That's what context engineering means in practice.
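To make "what do I put in the window" concrete, here is a minimal sketch of curation as a budgeted selection problem. Everything in it is an assumption for illustration: the `ContextItem` type, the relevance scores (which would have to come from some upstream ranking step), and the four-characters-per-token estimate, which is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    name: str         # hypothetical label, e.g. a file path or a memory note
    text: str
    relevance: float  # assumed to come from an upstream scoring step


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def curate(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the most relevant items that still fit in the token budget."""
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen


items = [
    ContextItem("constraints.md", "x" * 400, relevance=0.9),     # about 100 tokens
    ContextItem("old_chat_log.txt", "x" * 4000, relevance=0.2),  # about 1000 tokens
    ContextItem("api_schema.json", "x" * 800, relevance=0.7),    # about 200 tokens
]
selected = curate(items, budget=500)
print([item.name for item in selected])
```

The large, low-relevance log is the item that gets left out. A greedy pass is not the only reasonable policy, but it shows the shape of the decision: a budget, a ranking, and a willingness to drop things.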

Reasoning models broke the playbook

The advice I gave in 2023 now backfires on a specific class of models, and the reversal is clean enough that I'll say so plainly.

When reasoning models arrived (models that run a chain of internal deliberation before producing a response), the old playbook didn't just become less effective. Parts of it became harmful: heuristics that reliably helped on earlier models can degrade output on current reasoning models. Understanding why gets you most of the way to using them well.

Less prompting often wins. Anthropic's guidance for extended thinking makes this explicit: rather than prescribing a step-by-step procedure, let the model think deeply. A high-level direction — "think thoroughly about this" or "work through this carefully" — tends to outperform a hand-written breakdown of every reasoning step, because the model's own chain of deliberation is better than what you'd prescribe. When you write the steps yourself, you're competing with the reasoning process, not augmenting it. That's a disorienting inversion if you spent two years learning to write meticulous, structured prompts.

"Think step by step" can hurt. This was the defining heuristic of the 2023 era — add the phrase, get better outputs. On reasoning models, it frequently doesn't help and can actively interfere. OpenAI's reasoning guidance says these models "perform best with straightforward prompts" and explicitly notes that "think step by step" is unnecessary and may hinder performance: the model is already running that process internally, and your instruction becomes a redundant, potentially conflicting overlay. A 2025 paper revisiting chain-of-thought — arXiv 2506.14641 — found that for strong models, zero-shot chain-of-thought can match or outperform few-shot exemplars, particularly when the exemplars constrain the solution path.

Few-shot can actively compress the model's solution space. You hand it three examples of how you've approached this kind of problem; it concludes that this is how such problems should be approached. If your examples don't cover the full range, you've traded flexible reasoning for pattern-matched retrieval — exactly what you paid for the reasoning model to avoid.

You now steer with parameters, not wording. This is the most practically important shift. The lever that controls how much effort a reasoning model expends isn't a keyword in your prompt — it's a parameter passed at inference time. Anthropic has moved toward higher-level effort settings that modulate thinking depth without requiring precise token counts; OpenAI exposes a comparable reasoning effort control; the 2026 trend is toward adaptive thinking where the model calibrates its own deliberation based on apparent task difficulty. The exact API surface shifts model by model and month by month — hard-coding field names here would be a mistake. Which models expose which of these knobs — and where each one is actually strongest — is the whole point of my 2026 model comparison.

The honest caveat. None of this means you should use a reasoning model by default. They can over-think: give a reasoning model a simple lookup or a formatting task and you'll pay in latency and cost for a lengthy internal deliberation before an answer you could have gotten from a cheaper model in milliseconds. Extended thinking earns its cost on genuinely multi-step, ambiguous, or compositional problems. For everything simpler, the non-reasoning pass is often faster, cheaper, and no less accurate. The inversion is real — but reserve it for the problems that actually need it.
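One way to picture the last two points together is a small routing function that decides, per task, whether a reasoning model is warranted and how much effort to request. This is a sketch under loud assumptions: the `Task` fields, the thresholds, and the "low" / "medium" / "high" labels are hypothetical placeholders, not any provider's actual parameter names, and mapping them onto a real API is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    description: str
    steps_estimate: int  # hypothetical: rough count of dependent reasoning steps
    ambiguous: bool      # hypothetical: does the task need interpretation?


@dataclass(frozen=True)
class Route:
    use_reasoning: bool
    effort: Optional[str]  # placeholder labels; None when no reasoning model is used


def route(task: Task) -> Route:
    """Pick a model class and effort level from rough task shape (thresholds are illustrative)."""
    if task.steps_estimate <= 1 and not task.ambiguous:
        return Route(use_reasoning=False, effort=None)  # lookup or formatting: skip deliberation
    if task.steps_estimate <= 3 and not task.ambiguous:
        return Route(use_reasoning=True, effort="low")
    if task.steps_estimate <= 6:
        return Route(use_reasoning=True, effort="medium")
    return Route(use_reasoning=True, effort="high")


print(route(Task("Reformat this date as ISO 8601", steps_estimate=1, ambiguous=False)))
print(route(Task("Plan a schema migration with rollback", steps_estimate=8, ambiguous=True)))
```

The point of the sketch is where the decision lives: in code that sets a parameter, not in extra adjectives added to the prompt.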

What still works, and what quietly died

Persona padding died. "You are a world-class expert in..." had a measurable effect on earlier models — assigning a role seemed to calibrate how the model positioned its response. Current models don't need the persona framing; they need accurate task framing. If you want expertise applied in a specific direction, describe the direction and the constraints. Telling the model it's a senior engineer wastes tokens that could carry actual requirements. The persona is overhead.

Say it explicitly. The "fill in the blanks" era rewarded clever indirection — leaving obvious implications unspecified and trusting the model to infer them. Modern models follow instructions more literally, which sounds like a downgrade but is actually precision. If you want a specific format, specify it. If you want the explanation omitted, say so. If you want a direct answer without hedging, ask for a direct answer. The model will do what you asked, not what you implied — which means the instructions you write are the actual spec. Write the actual spec.

Positive over negative. A wall of prohibitions — "don't do X, never do Y, avoid Z" — is harder to process than the same information stated as requirements. "Use markdown headers for structure" outperforms "don't write an unstructured prose response." The prohibitions aren't wrong; they just make the model work backward from the negation to find the target. Describe the target directly.

Motivation still helps, and this is the one honest exception to the anti-anthropomorphism trend. Explaining why a constraint exists helps current models generalize it correctly. "Keep the summary under 200 words" gets applied narrowly. "Keep the summary under 200 words because it will be read aloud in a ninety-second slot" gets applied more flexibly across the edge cases the rule doesn't explicitly cover. The model uses the explanation to understand the intent, and that's not anthropomorphism. That's giving the model the actual problem, not just the rule. The why is context; context is signal; this is the job.
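The habits in this section (explicit format, positively stated requirements, a reason attached to each constraint) can be treated as a small data structure instead of free-form text. The sketch below is one hypothetical way to do that; `Requirement`, `PromptSpec`, and `render` are names invented here, and the rendered layout is an assumption, not a recommended template.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    rule: str      # stated positively: what to do, not what to avoid
    why: str = ""  # optional motivation that helps the model generalize the rule


@dataclass
class PromptSpec:
    task: str
    output_format: str
    requirements: list[Requirement] = field(default_factory=list)


def render(spec: PromptSpec) -> str:
    """Render the spec as plain text: task, explicit format, then requirements with reasons."""
    lines = [
        f"Task: {spec.task}",
        f"Output format: {spec.output_format}",
        "Requirements:",
    ]
    for req in spec.requirements:
        reason = f" (because {req.why})" if req.why else ""
        lines.append(f"- {req.rule}{reason}")
    return "\n".join(lines)


spec = PromptSpec(
    task="Summarize the attached incident report.",
    output_format="Plain text, one paragraph.",
    requirements=[
        Requirement("Keep the summary under 200 words", why="it will be read aloud in a fixed time slot"),
        Requirement("Lead with the customer impact"),
    ],
)
print(render(spec))
```

Keeping the reason next to the rule also makes it harder to add a constraint without knowing why it exists.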

Use more agents — but only for reading

The 2025 multi-agent debate had two respected teams reach opposite conclusions within the same month, and the unusual thing is that both were right.

Cognition published "Don't Build Multi-Agents," arguing that multi-agent systems compound fragility: context doesn't transfer cleanly between agents, decisions from one can contradict decisions from another, and the seams between agents are exactly where failure concentrates. Their default recommendation: single-threaded, unless you have a specific reason not to be.

Around the same time, Anthropic published results from a multi-agent research system, a lead agent coordinating subagents, that outperformed a single-agent baseline by 90.2% on their internal research evaluation. That figure is a relative improvement on Anthropic's own eval, not a general performance guarantee. But the directional finding is real: for the right shape of task, a coordinated multi-agent system can beat what any single agent does alone.

Both teams were describing the same system from different angles. The reconciliation is in the shape of the work. Multi-agent wins for breadth-first, parallelizable, read-heavy tasks: research, investigation, auditing, review. You fan out subagents across independent areas, they read without writing to shared state, they return compact findings to a coordinator that synthesizes. No conflicting edits, no shared mutation, no seams that break. The parallelism is a genuine win.

Single-threaded wins where writes and shared state are involved. When agents coordinate around data or code being mutated, the seams between them create exactly the failure Cognition described. A 2025 UC Berkeley study (MAST), arXiv:2503.13657, analyzed failures in multi-agent systems and found that inter-agent misalignment, agents working from inconsistent or conflicting pictures of shared state, accounted for roughly 32% of failures. That's one of the largest failure categories in the study, and much of it is a coordination problem around shared state.

The practical rule: parallelize reads, keep writes single-threaded. Fan out investigation, synthesis, and review. Keep implementation in one thread that owns the state. A subagent that read a hundred files and returned a compact finding did its job cleanly; now one agent executes, in sequence, against the actual files.
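The rule can be sketched with nothing more than a thread pool. In this illustration the "subagents" are plain functions standing in for model calls, and the file contents, area names, and `NOTES.md` output are all made up. What matters is the shape: the fan-out step only reads, and exactly one function produces the modified state.

```python
from concurrent.futures import ThreadPoolExecutor


def investigate(area: str, files: dict[str, str]) -> str:
    """Stand-in for a read-only subagent: inspects files and returns a compact finding."""
    hits = sorted(name for name, body in files.items() if area in body)
    return f"{area}: {len(hits)} file(s) mention it ({', '.join(hits) or 'none'})"


def apply_edits(files: dict[str, str], findings: list[str]) -> dict[str, str]:
    """Single writer: the only place where a modified file set is produced."""
    updated = dict(files)
    updated["NOTES.md"] = "\n".join(findings)
    return updated


files = {
    "auth.py": "login retry logic",
    "db.py": "retry on timeout",
    "ui.py": "login form",
}
areas = ["login", "retry", "cache"]

# Fan out: each investigation reads the shared files and writes nothing.
with ThreadPoolExecutor(max_workers=3) as pool:
    findings = list(pool.map(lambda area: investigate(area, files), areas))

# Fan in: one thread owns the state and applies changes in sequence.
files = apply_edits(files, findings)
print(files["NOTES.md"])
```

Because the readers never mutate anything, their order of completion doesn't matter, and `pool.map` still returns the findings in the order the areas were submitted.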

I walk through the orchestration side of this — clean-context reviewers, write isolation — in the post on verifying AI-generated code. The short version: a reviewer that didn't write the code sees it fresh. Isolation buys both independence and a clean context budget in one move.

Treat prompts like code

Most teams iterate on prompts by feel: change something, test it on a few examples, decide it seems better, ship it. That works until it doesn't — and the failure is usually silent. A prompt that looks better in casual testing can silently regress on the cases that weren't in front of you.

The fix is evals — a small, version-controlled set of golden cases you run against every prompt change. Not a full ML pipeline; the entry-level version is a folder of input–expected-output pairs you can run in minutes. Some checks are deterministic: does the output contain the required field? Is the format valid? Does the length fall in range? Others need an LLM-as-judge: is this explanation accurate? Does this match the required tone? The deterministic checks are fast and cheap; the LLM-as-judge handles dimensions that can't reduce to a regex.
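An entry-level version of the deterministic half might look like the sketch below. The golden cases, the field names, the length limit, and `fake_model` are all hypothetical stand-ins; in a real setup the cases would live in version-controlled files and `model` would wrap an actual call. The LLM-as-judge half is not shown.

```python
import json
from typing import Callable

# Hypothetical golden cases; in practice these would be files in the repo.
GOLDEN_CASES = [
    {"input": "Invoice 1042 from Acme, total $310.50", "required_fields": ["vendor", "total"]},
    {"input": "Receipt: Globex, amount due $12.00", "required_fields": ["vendor", "total"]},
]


def check_output(raw: str, required_fields: list[str], max_chars: int = 200) -> list[str]:
    """Deterministic checks: length in range, valid JSON object, required fields present."""
    failures: list[str] = []
    if len(raw) > max_chars:
        failures.append("too long")
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["invalid JSON"]
    if not isinstance(parsed, dict):
        return failures + ["not a JSON object"]
    for name in required_fields:
        if name not in parsed:
            failures.append(f"missing field: {name}")
    return failures


def run_evals(model: Callable[[str], str]) -> float:
    """Return the fraction of golden cases whose output passes every check."""
    passed = 0
    for case in GOLDEN_CASES:
        if not check_output(model(case["input"]), case["required_fields"]):
            passed += 1
    return passed / len(GOLDEN_CASES)


def fake_model(text: str) -> str:
    # Stand-in for a real model call: answers one case correctly, the other as prose.
    if "Acme" in text:
        return json.dumps({"vendor": "Acme", "total": 310.5})
    return "Globex owes $12"


print(f"pass rate: {run_evals(fake_model):.0%}")
```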

A "better" prompt is a hypothesis, not a fact. A 2025 paper, When "Better" Prompts Hurt, documented this directly: a generic "improved" template dropped an extraction task's pass rate from 100% to 90%. The prompt was more polished, more thorough, better-reasoned — and worse. Without a golden set to run it against, that regression would have shipped.

A version-controlled golden set turns "iterate by vibes" into a development process. The prompt lives in your repo, each version has a commit, and the diff is reviewable. A change that looks like an improvement but regresses the eval set gets caught before it ships. You can pull it back.
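The "gets caught before it ships" step can be as small as a comparison of per-case results between the current prompt and the candidate. The function below is a hypothetical sketch of such a gate: it assumes you already have a pass/fail result for each golden case under both versions, and it blocks on any case that used to pass and now fails.

```python
def regression_gate(baseline: list[bool], candidate: list[bool]) -> tuple[bool, list[int]]:
    """Compare per-case results; return (ok_to_ship, indices of cases that regressed)."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must cover the same golden cases")
    regressions = [
        index
        for index, (before, after) in enumerate(zip(baseline, candidate))
        if before and not after
    ]
    return (not regressions, regressions)


# Illustrative data: ten cases, all passing before, one failing after the "improvement".
baseline_results = [True] * 10
candidate_results = [True] * 10
candidate_results[3] = False

ok, regressed = regression_gate(baseline_results, candidate_results)
print(ok, regressed)
```

Reporting which cases regressed, not just the aggregate pass rate, matters: a candidate can fix one case and break another while the headline number stays flat.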

Start smaller than you think you need. Five to ten representative cases usually cover most of the variance that matters. You add to the set when you see a new failure mode in production: the failure becomes the test case that prevents the same failure from returning. Run the eval before and after any change to the system prompt, memory files, or document context. It gates prompt changes, but also anything else that alters what the model sees.

Spend tokens on signal, not noise

Bloated context doesn't just degrade output — it costs money. Curation is also a cost optimization.

The biggest lever on the token bill is prompt caching. Both major providers offer it, and the discounts are large. Anthropic's cache reads run roughly 90% cheaper than the equivalent uncached input tokens; OpenAI's caching applies automatically, at a discount of roughly 50% or more depending on the model. On any workload that reruns the same system prompt, which describes most agent loops, caching cuts the input cost sharply.
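A quick back-of-the-envelope calculation shows how much the discount matters on an agent loop. The sketch below is deliberately simplified: the price of $3 per million input tokens is a made-up number, the first call is assumed to pay full price for the prefix, and it ignores any cache-write surcharge and cache expiry, both of which vary by provider.

```python
def input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_mtok: float,
    cache_read_discount: float,
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for `calls` requests sharing a static prefix."""
    uncached_tokens = calls * (prefix_tokens + suffix_tokens)
    # Simplification: first call pays full price for the prefix, later calls read it from cache.
    cached_prefix_tokens = prefix_tokens * (1 + (calls - 1) * (1 - cache_read_discount))
    cached_tokens = cached_prefix_tokens + calls * suffix_tokens
    return (
        uncached_tokens * price_per_mtok / 1_000_000,
        cached_tokens * price_per_mtok / 1_000_000,
    )


# Hypothetical workload: 20k-token static prefix, 500 volatile tokens, 100 calls.
uncached, cached_90 = input_cost(20_000, 500, 100, price_per_mtok=3.0, cache_read_discount=0.9)
_, cached_50 = input_cost(20_000, 500, 100, price_per_mtok=3.0, cache_read_discount=0.5)
print(f"uncached ${uncached:.2f}, 90% discount ${cached_90:.2f}, 50% discount ${cached_50:.2f}")
```

Under these assumptions the same 100 calls cost about $6.15 uncached, about $3.18 with a 50% read discount, and about $0.80 with a 90% read discount. The larger the static prefix relative to the volatile tail, the closer the saving gets to the headline discount.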

The structural rule: static content first, volatile content last. Cache hits depend on exact prefix matching: everything after the first point of difference is a miss. Anything that changes per-request goes at the end, after the static content. System prompt, loaded memory files, and document context first; dynamic variables last. One ordering flip, such as a timestamp before the system prompt, can eliminate caching entirely.
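The effect of ordering is easy to see by measuring how much of two consecutive requests is identical from the start. The sketch below uses character-level comparison as a rough proxy for what a provider would do at the token level; the prompt contents and function names are invented for the example.

```python
import os.path


def assemble(static_parts: list[str], volatile: str) -> str:
    """Static content first, per-request content last."""
    return "\n\n".join([*static_parts, volatile])


def assemble_volatile_first(static_parts: list[str], volatile: str) -> str:
    """The ordering mistake: per-request content ahead of the static prefix."""
    return "\n\n".join([volatile, *static_parts])


def shared_prefix_len(a: str, b: str) -> int:
    # os.path.commonprefix compares strings character by character.
    return len(os.path.commonprefix([a, b]))


static_parts = [
    "SYSTEM: You review pull requests for this repository.",
    "MEMORY: The project targets Python 3.12 and uses pytest.",
    "DOCS: Style guide, version 2.",
]
request_1 = "ts=2026-01-01T10:00:00 Q: review PR 1"
request_2 = "ts=2026-01-01T10:05:00 Q: review PR 2"

good = shared_prefix_len(assemble(static_parts, request_1), assemble(static_parts, request_2))
bad = shared_prefix_len(
    assemble_volatile_first(static_parts, request_1),
    assemble_volatile_first(static_parts, request_2),
)
print(f"static-first shares {good} chars; volatile-first shares {bad} chars")
```

With the static content first, the two requests are identical through the whole static block and diverge only inside the timestamp. With the timestamp first, they diverge after 18 characters and nothing useful is left to reuse.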

This is the throughline. What goes in the context window, in what order, and how much — that's context engineering, and it shows up in every direction at once: better outputs from leaner signal, lower costs from structure that enables caching. The token bill and the output quality are both downstream of the same curation decision.

It was never the wording

Every tactic in this post (curating the window, prompting reasoning models less, parallelizing only reads, running evals like code, structuring the prompt so the static prefix caches) is the same principle applied from a different angle. More context is not better context. The discipline is curation, not accumulation.

If you think the job is writing better prompts, you keep optimizing the one message you hand the model. If you think the job is curating the entire context window — what goes in, in what order, how much, from where — you have a different set of levers, and those are the ones that move the needle now.

Prompt engineering still exists. Write clear instructions. State requirements explicitly. Give the model the why behind your constraints. But it's one input you curate, not the whole game. The work moved. The wording was never where most of the leverage was.

If you're building with AI agents and want to talk through context engineering, reasoning-model workflows, or shipping production software with this kind of discipline, get in touch.