
Why Most AI Agent Failures Are Context Failures

The models are good enough. The problem is what you feed them.

Sombra Team·March 6, 2026

Developers blame the model. The agent hallucinated an API. It used a deprecated pattern. It proposed a migration path that contradicts the docs you read yesterday. The model must be too stupid, or too prone to making things up, or it needs a bigger context window.

Mostly, the model is fine. What's broken is the information it was given.

Prompt engineering is over. Context engineering replaced it.

For a few years, the applied AI world cared about prompt engineering — how to write instructions that make a model behave. It was useful. Better prompts do produce better outputs.

But prompt engineering treats the model as a static function: write the perfect input string, get good output. Few-shot prompting and in-context learning (ICL) work well for showing a model the output you want, but even they break down with agents — systems that run multi-step workflows, call tools, maintain state across turns, and read external data. In an agent, the hand-written prompt is a small fraction of what determines the outcome. The rest is the entire token state: system instructions, tool definitions, retrieved documents, conversation history, external context.
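The difference is easy to see in code. A minimal sketch of what actually reaches the model on a single agent turn (all names hypothetical — this is the shape of the token state, not any particular framework's API):

```python
def build_agent_context(system_prompt, tool_defs, retrieved_docs, history, user_msg):
    """Assemble the full token state for one agent turn.

    The hand-written prompt (user_msg) is only one of five inputs;
    the other four are usually far larger and dominate the outcome.
    """
    parts = [
        system_prompt,                                   # static instructions
        "\n".join(f"TOOL: {t}" for t in tool_defs),      # tool definitions
        "\n".join(f"DOC: {d}" for d in retrieved_docs),  # retrieved context
        "\n".join(history),                              # prior turns
        f"USER: {user_msg}",                             # the "prompt"
    ]
    return "\n\n".join(parts)
```

Prompt engineering optimises the last element. Context engineering is responsible for all five.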

Anthropic's engineering team calls context engineering "the discipline of designing and building dynamic systems that provides the right information and tools, in the right format, at the right time." It's not about writing a better system prompt. It's about controlling the information environment the model operates in.

Sombra operates directly in this space: it provides context with a citation chain and makes it dynamically available to an agent, on demand.

Four documented ways context breaks

Research from 2025 mapped out a taxonomy of four context failure modes — poisoning, distraction, confusion, and clash. Together they explain most of what goes wrong with agents in production.

Poisoning

A hallucination appears early in a conversation and embeds in the context. The model treats it as fact from that point forward — referencing it, building on it, citing it. The error compounds every turn. Copy-pasting unverified information into agent prompts makes this worse, not better.

Distraction

As context grows, the model's attention spreads thin. Chroma Research measured this directly: performance degrades well before the technical context limit. They call it "context rot." The effective working window for most models sits under 256k tokens, regardless of the spec sheet. Gemini 2.5 Pro shows measurable distraction beyond 100k tokens. Smaller models hit the wall much sooner.

Forty pages of raw documentation in a prompt don't give your agent more knowledge. They give it more noise. Past a threshold, the agent makes worse decisions than it would with no context at all.

Confusion

Irrelevant information doesn't sit inert — it actively misleads. Models perform measurably worse when given tools they don't need for the current task. DeepSeek-v3 degrades above 30 tools and fails above 100. Llama 3.1 8B fails with 46 tools but works with 19.

If you dump your entire research archive into an agent's context, the pages about a different project aren't harmless padding. They pull the model's attention away from the task.
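The remedy is dynamic tool selection: expose only the tools relevant to the current task. As a sketch — crude keyword overlap stands in for the embedding similarity a production system would use, and all names are hypothetical:

```python
def select_tools(task, tools, max_tools=5):
    """Score each tool's description against the task and keep the top few.

    tools: dict mapping tool name -> plain-text description.
    Fewer irrelevant tools in context means fewer wrong calls.
    """
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(desc.lower().split())), name)
        for name, desc in tools.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:max_tools] if score > 0]
```

This is the mechanism behind the 44% improvement figure cited below: the model never sees the tools that would have pulled its attention off task.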

Clash

Contradictory information in the same context produces confident nonsense. A study on prompt fragmentation measured a 39% average performance drop from inconsistent prompts. OpenAI's o3 dropped from 98.1 to 64.1 on a benchmark when the same content was split and shuffled rather than presented coherently.

Paste docs from three different library versions into the same prompt and the model doesn't pick the right one. It averages across them and produces something that matches none.
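The fix is mechanical: resolve the clash before the prompt is built, not after. A minimal sketch, assuming saved docs carry a version label (the labels and return shape are illustrative):

```python
def resolve_version_clash(docs, target_version):
    """Drop documentation belonging to other library versions.

    docs: list of (version, text) pairs, e.g. ("6.x", "...").
    Serving a single version avoids the averaging failure entirely.
    """
    kept = [text for version, text in docs if version == target_version]
    dropped = sum(1 for version, _ in docs if version != target_version)
    return kept, dropped
```

One coherent version beats three authoritative but contradictory ones.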

More information makes agents worse

The dream of million-token context windows was that curation would become unnecessary. Throw everything in, let the model sort it out. The reality: transformer attention scales quadratically with token count. Every token added creates pairwise relationships with every other token. Past a point, the model spends more compute attending to noise than to signal.

The best agent architectures in 2025 got simpler, not more complex. Manus — one of the most capable agent systems built — was rewritten five times in six months. The biggest performance gains came from removing things. Stripping complex tool definitions. Eliminating management agents. Reducing scaffolding. Their conclusion: "If your harness is getting more complex while models improve, you are likely over-engineering."

The same applies to knowledge. Raw saved pages injected wholesale are bad context. Dense, structured, signal-only material — where every token earns its place — is good context.

The older idea underneath

Context engineering is a 2025 term. The underlying idea is older.

Clark and Chalmers argued in 1998 (The Extended Mind) that cognition extends into the environment through tools. Your notebook isn't just an aid to thinking — it's part of the cognitive system. Risko and Gilbert formalised this as "cognitive offloading": externalising memory tasks to free working memory for harder reasoning.

The research consistently shows offloading improves task performance at high cognitive load. Grinschgl et al. (2021) documented the trade-off: you remember less of what you offloaded, but you perform better on the actual task.

That trade-off used to matter. You offloaded your notes to a tool, but you still had to retrieve and re-read them yourself. Now retrieval is done by an AI agent. The agent doesn't care whether you remember the material. It cares whether the material is structured, accurate, and available.

The best cognitive offloading device for AI agents isn't one that helps you recall what you saved. It's one that helps your agent receive what you saved as clean context.

Personal knowledge management needs a different retrieval model

PKM — the practice of collecting, classifying, and retrieving knowledge for your work — has been a discipline since the late 1990s. The tools that grew up around it (Notion, Obsidian, Roam, Logseq) assume a human does the retrieval. You browse, search, follow backlinks, traverse graphs. The metaphor is a library where you are the reader.

When your primary consumer of saved knowledge is an AI agent with a token window, the retrieval interface changes. The agent doesn't browse your notes. It receives a block of tokens and works with what's there. It can't follow backlinks. It can't "explore" a graph. It needs a flat, coherent, well-structured block of text at the right density.

Distillation — taking a collection of sources and synthesising the signal into a structured context note — is the retrieval interface that works for agents. Not search. Not graph traversal. Dense, curated, focused context that fits the attention budget.
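What a distilled note looks like as data is worth making concrete. A sketch of one possible shape (section names and structure are illustrative, not a Sombra format): facts and gotchas as short declarative lines, code preserved verbatim, everything rendered as one flat block an agent can consume.

```python
def render_context_note(title, facts, gotchas, code_examples):
    """Render a distilled note as a single flat, structured text block."""
    lines = [f"# {title}", "", "## Facts"]
    lines += [f"- {f}" for f in facts]
    lines += ["", "## Gotchas"]
    lines += [f"- {g}" for g in gotchas]
    for example in code_examples:
        lines += ["", "```", example, "```"]  # code survives verbatim
    return "\n".join(lines)
```

Note what is absent: no links to follow, no graph to traverse, no navigation at all. The structure is the retrieval interface.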

What follows from all this

If context quality determines agent quality, the workflow follows:

Save selectively. Not every page you read belongs in your agent's knowledge. Save the authoritative sources — official docs, well-written guides, primary references. Skip the Stack Overflow threads and Medium posts unless they contain something the docs don't.

Organise by task, not by topic. Your agent doesn't need everything you know about React. It needs the context for the migration you're doing this week. Collections scoped to a project or task produce better context than sprawling topic archives.

Distil ruthlessly. Write a focused context note that captures facts, decisions, patterns, and gotchas — preserving code examples verbatim, cutting everything else to signal. One distillation replaces fifty pages of raw docs. This is the single highest-leverage thing you can do for your AI agent.

Serve context through infrastructure, not clipboard. MCP exists for this. Connect your research library once. Every AI tool you use reads it. The context is always there, always current, always structured.
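At its core, "serve through infrastructure" means a lookup function exposed as a tool. A library-free sketch (names hypothetical; with the official MCP SDKs the function body stays the same — it is simply registered as a tool so every connected client can call it):

```python
NOTES = {}  # stands in for your saved, distilled research library

def save_note(slug, text):
    """Store a distilled context note under a stable identifier."""
    NOTES[slug] = text

def get_context(slug):
    """The tool an MCP server would expose: given a note identifier,
    return the distilled block, ready for the agent's context window."""
    return NOTES.get(slug, f"No note saved under '{slug}'.")
```

Once this sits behind a protocol instead of a clipboard, every agent you run pulls the same current, structured context on demand.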

The numbers

Some of the evidence, for those who want it:

  • Isolating sub-agent contexts improved performance by 90.2% over single-agent approaches (Anthropic's multi-agent research). Even how context is partitioned matters.
  • Dynamic tool selection improved performance by 44% on Llama 3.1 8B. Fewer irrelevant tools, better results. (Schmid, 2025)
  • KV-cache optimisation — keeping stable context prefixes — reduces cost by 10x on Claude Sonnet. Well-structured context is cheaper as well as better.
  • Provence, a context pruner, achieves 95% content reduction while preserving task-relevant information. Most raw context is noise.
  • Sharded prompts cause a 39% average performance drop. OpenAI o3 drops from 98.1 to 64.1.

The pattern is consistent. Less noise, more signal, better results. The models are ready. Whether your research workflow feeds them signal or noise is the variable.

Sombra is a research library for AI agents. Save web pages, organise into collections, distil what matters, serve it through MCP. Start free.

References