The Agent Fallacy

topics: AI, Technology, Product

By Noemi Titarenco
Mushroom spore cloud representing AI orchestration

Why We Need Orchestration, Not Chatbots

"Autonomous Agents" are a dumb marketing term that's actively hurting us. Yes, so is "AI" but the problem with Agents is that engineers are drinking this kool-aid, too.

Much like "RAG" (Retrieval-Augmented Generation) was positioned to make Enterprise adopt AI faster by promising a cure for hallucinations (something I've broken down here) the concept of the Agent is meant to intensify the speed of adoption of LLMs and other models in engineering through emotion. It stems from a deep, emotional attachment to the concept of AGI. It sells the dream that we are on the brink of creating digital life. Anthropic, on their main Claude Code page, states, "With Claude, you can build AI agents that plan, act, and collaborate more effectively."

Nothing we've built so far works like an autonomous employee. If we're being honest with ourselves, it doesn't really make sense. Those who do try to build things that are "agentic" end up with fragile products because, frankly, none of these tools are capable of making complex, multi-step decisions on their own. And if you're building "agents" that aren't making decisions, I'm wondering if you've, like, heard of Zapier? Automations are cool, but they're not new.

BUT! There are amazing AI products already, in particular AI coding tools. Both Cursor and Claude Code use this weird "agent" language.

What actually makes sense when building an AI-centric product like these coding tools is treating LLMs as endpoints. They are black boxes where you can insert noise and get interesting, structured data out. This is extremely powerful. But when you reframe that black box as an "Agent" that autonomously makes decisions, you end up with a poor product.

We can see the alternative and superior path emerging in tools like Claude Code or Cursor, as long as we ignore how they actually talk about it.

Take Cursor's "Plan Mode" as an example. If you've never used it before, it's basically a tool that is called - like a specialized prompt system that you can call. All of these products are essentially building moats with the quality of these 'tools.'

OK, back to Plan Mode. The way it works is, when you ask it to build a feature, it doesn't just run off and start hacking away at the code. Instead, it builds a markdown file outlining a plan. The engineer then reads this file, edits it, suggests improvements (via chat), or swaps out parts of the plan entirely. Some engineers go ahead and implement the plan manually if it's not too complicated. It's a great tool, and a testament to how many angles of engineering work tools like Cursor address.

This Evaluation Step, where the user steps in midway through the process, is what drives high-quality outcomes. "Agents," by their very definition, don't lend themselves to this concept. They imply a "fire and forget" workflow that rarely works in practice. Ironically, in the UI, every new chat in Cursor is called an "Agent."

The term "Agent" implies a single stream of thought, a single linear prompt leading to a single action, leading to another prompt, and so on. But effective AI tools are rarely linear. Under the hood, they are often sending out multiple requests to the AI black box, or multiple black boxes in the case of multi-model infrastructures like what Cursor uses. They might fan out ten requests, use a second model to evaluate the responses, accept the best two, and merge them for the final action.

It is less like a single robot butler and more like a spiderweb, where every node involves interactions between one model evaluating another, or a user evaluating the model.

Real reliability in AI doesn't come from using the smartest model and breathing life into it (lol). It comes from building smarter systems of evaluation and user interaction. We don't need agents that go off and do things by themselves; we need interfaces that let humans act as QA and provide guidance while the models do the heavy lifting.

The Spore Cloud and Natural Selection

When I am building these applications, and when I describe this to people who aren't engineers, I visualize the process not as a robot butler doing work, but as a mushroom spore explosion.

In nature, a fungus doesn't release a single spore and hope it lands in the perfect spot. It releases a cloud. It relies on volume and redundancy to ensure survival.

Effective AI orchestration works the same way. You don't send one precious prompt to the model. You trigger a "spore cloud" of requests: parallel processes exploring different angles of the problem (each with its own variation of the context), and then you rely on a process of Natural Selection to determine which output survives.
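
If you want to see the shape of that in code, here's a minimal sketch in TypeScript. `callModel` is a hypothetical stand-in for whatever client you use to hit a model endpoint, and `score` collapses the evaluation layers I'll describe below into a single number; the point is the pattern, not the specifics.

```typescript
// Hypothetical stand-in for whatever client hits your model endpoint.
type CallModel = (prompt: string) => Promise<string>;

// Fire the "spore cloud": one request per prompt variant, all in parallel.
async function sporeCloud(
  promptVariants: string[],
  callModel: CallModel,
  score: (output: string) => number, // the evaluation layers, collapsed into a score
): Promise<string> {
  const settled = await Promise.allSettled(promptVariants.map((p) => callModel(p)));

  // Failed requests simply die off; we only rank the survivors.
  const outputs = settled
    .filter((r): r is PromiseFulfilledResult<string> => r.status === "fulfilled")
    .map((r) => r.value);

  if (outputs.length === 0) throw new Error("Every spore died; widen the cloud and retry.");

  // Natural Selection: the highest-scoring output survives.
  return outputs.reduce((best, o) => (score(o) > score(best) ? o : best));
}
```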

Dynamic Prompt Construction

This selection process begins before the request is even sent, with a technique I call Dynamic Prompt Construction.

There is a tendency to over-engineer this phase, using LLMs to write prompts for other LLMs. But I've found that manual construction, good old-fashioned string concatenation, is often the superior approach. You essentially splice together different context strings, user choices, and system instructions to create permutations of a prompt.

The beauty of modern LLMs is that they are incredibly resilient. They can digest inputs that aren't perfectly structured. This allows us to evolve our prompts manually, testing which permutation of context pieces produces the best results.
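
As a rough sketch (the field names here are mine, not a fixed schema), the construction really is just concatenation over a set of context variants:

```typescript
// Pieces we might splice together; the names are illustrative, not a fixed schema.
interface PromptPieces {
  systemInstructions: string;   // e.g. "You are writing a blog section..."
  userChoices: string;          // e.g. tone/audience tags rendered as text
  contextVariants: string[];    // alternative bundles of background context
  task: string;                 // the actual ask for this step
}

// Good old-fashioned string concatenation: one prompt per context variant.
function buildPromptPermutations(p: PromptPieces): string[] {
  return p.contextVariants.map(
    (context) =>
      `${p.systemInstructions}\n\n` +
      `Voice and audience:\n${p.userChoices}\n\n` +
      `Context:\n${context}\n\n` +
      `Task:\n${p.task}`,
  );
}
```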

Survival of the Fittest

Once the requests (the spores) are sent, the Evaluation Phase begins. This is the survival of the fittest. We filter the responses through three distinct layers of selection:

  1. Code Evaluators: The first layer is rigid. Does the output match our schema? If I asked for JSON, is it valid JSON? If the spore fails this basic anatomy check, it dies immediately. We can also do strict elimination if the response contains words or characters we want to avoid, or if it falls outside the character range we want to accept. Alternatively, we can replace those words or phrases, like taking out em dashes in a writing task and seeing if the result can be salvaged. (There's a minimal sketch of this layer right after this list.)
  2. LLM Evaluators: The second layer is qualitative. We can use smaller, faster models to grade the output of larger models. Does this response match the tone? Is it hallucinating?
  3. User Evaluators: This is the ultimate filter. In my apps, users approve or reject an output, rate it on a scale (say, 1-10), or, most importantly, tweak it directly. I can send the tweaked version along with the original to an LLM to start gauging what is missing from the original context and get closer to the intended result. I'll touch on an example where tweaking can be really impactful later on.
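
Here's roughly what that first, rigid layer can look like in code. The specific rules (forbidden strings, a character range) are illustrative, not a spec:

```typescript
// Layer 1: a rigid, code-only filter. The rule values are illustrative.
interface CodeEvalRules {
  expectJson: boolean;
  forbidden: string[];   // strings we refuse to accept in the output
  minChars: number;
  maxChars: number;
}

// Returns true if the response survives the first layer of selection.
function codeEvaluate(raw: string, rules: CodeEvalRules): boolean {
  // Basic anatomy check: if we asked for JSON, it must parse.
  if (rules.expectJson) {
    try {
      JSON.parse(raw);
    } catch {
      return false; // the spore dies immediately
    }
  }

  // Strict elimination on length and forbidden content.
  if (raw.length < rules.minChars || raw.length > rules.maxChars) return false;
  if (rules.forbidden.some((bad) => raw.includes(bad))) return false;
  // (For prose tasks you might instead strip the offending strings and
  //  see if the result can be salvaged.)

  return true;
}
```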

The Infrastructure Gap

The difficult reality, however, is that building this "Spore Cloud" architecture is currently painful.

I have built this specific pattern three or four times now across different products, and I find myself reinventing the wheel every time. In the world of web development, we have libraries like React to handle state and component updates. We have nothing comparable for AI orchestration.

We are missing a "React for Context Orchestration": a generalized framework that tracks the state of these parallel requests, manages the repository of prompt permutations, and logs which inputs led to which outputs. Right now, we are all writing custom glue code to manage this chaos. We need tools that treat dynamic in-app prompt engineering and context management not as an art, but as a version-controlled, state-managed engineering discipline.

Case Study: The Blog Post Generator

Let's look at a specific example: creating a 2,000-word blog post.

If you were building a standard chatbot wrapper, you would ask the user to type out their desired voice, tone, and authority into a text box. But I stay away from open text fields when they're not essential. Users inevitably pollute them and I think it puts undue pressure on the user to do a good job. Even if they aren't trying to break the system, they rarely provide the right information for a high-quality result.

Creating a 2,000-word blog post is also a great example for this because you can imagine doing it with one of the chatbot tools available, but never doing it well. My goal here is to show you how prompt and context orchestration can create much better results.

For the topic, let's imagine we do let the user type out their main keyword or what they want their blog post to be about, and we have them select from a list of tags to define their audience, reading level, intended outcome and tone.
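
One possible shape for that input, with the keyword as the only open text field and everything else constrained to tags we control (the tag values are just examples):

```typescript
// A possible shape for the user's brief. The specific tag options are examples only.
interface BlogBrief {
  keyword: string;                                          // the one open text field
  audience: "beginners" | "practitioners" | "executives";
  readingLevel: "casual" | "intermediate" | "expert";
  outcome: "educate" | "convert" | "entertain";
  tone: "playful" | "serious" | "aspirational";
}

const brief: BlogBrief = {
  keyword: "organizing small spaces",
  audience: "practitioners",
  readingLevel: "casual",
  outcome: "educate",
  tone: "aspirational",
};
```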

Now, how do we best build a blog post?

The Limits of "Attention"

Once we have the topic and voice, we face a technical reality: LLMs are bad at writing long-form content.

Despite the phenomenal "Attention" mechanism, models struggle to maintain narrative threads at 2,000 words. They tend to become "obtuse," forgetting the nuance of a paragraph written three pages ago or contradicting themselves. This is why I send out a spore cloud: the requests are always as tiny as I can make them. Massive contexts are not spores. This holds true even for very performant models, no matter how good they get: as context length increases, performance consistently degrades across all models.

To solve this, we don't ask for the whole post. We build it in steps, not too different from how a human might work through a creative process. So first, we ask for the structure: how many sections, what the headings are, and so on.

We send a request for a JSON object containing the title, the H1, H2, and H3 tags, and the estimated word count for each section. We check that the JSON is valid. Then, we pause. We run this through the ultimate evaluation layer: the user.
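
For illustration, the outline response and its first-pass validation might look something like this; the field names are my own guesses rather than a fixed contract:

```typescript
// Expected shape of the outline response (field names are illustrative).
interface OutlineSection {
  heading: string;
  level: "h1" | "h2" | "h3";
  estimatedWords: number;
}
interface Outline {
  title: string;
  sections: OutlineSection[];
}

// Layer-1 check: the response must be valid JSON and roughly match the shape
// before we ever show it to the user.
function parseOutline(raw: string): Outline | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return null;
    const outline = data as Outline;
    if (typeof outline.title !== "string" || !Array.isArray(outline.sections)) return null;
    return outline;
  } catch {
    return null; // not even valid JSON: this spore dies
  }
}
```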

We allow the user to tweak the outline directly. But here is the trick: when a user tweaks a title from "Conquer the Clutter: 25 Rules for Keeping Tiny Homes Immaculate" to "The Designer's Guide: 25 Pro Tips for Organizing Small Spaces," we don't just update the text. We can trigger a background "spore" request to analyze why they made that change. The system learns: "User prefers a sophisticated, aspirational voice rooted in professional design expertise." We add this metadata to our context packet for the next step.
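
A sketch of that background spore, reusing the same hypothetical `callModel` stand-in as before:

```typescript
// Background "spore": ask a model why the user changed the heading, and fold
// the answer into the context packet for later steps.
async function learnFromTweak(
  original: string,
  tweaked: string,
  callModel: (prompt: string) => Promise<string>,
): Promise<string> {
  const prompt =
    `A user rewrote this heading.\n` +
    `Original: ${original}\n` +
    `Rewritten: ${tweaked}\n` +
    `In one sentence, describe the voice or preference implied by the change.`;
  return callModel(prompt); // e.g. "User prefers an aspirational, pro-design voice."
}
```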

Context Management: The "Fresh Chat" Philosophy


This is where the distinction between a Chatbot and an API Endpoint becomes critical.

If you try to write this blog post in a continuous chat interface (like ChatGPT or Claude), you are at the mercy of their context management logic. As the conversation gets long, these systems truncate context to save tokens. They do this generically, often cutting off the initial instructions you gave at the start. (Why do they do this? I'm gonna repeat myself 🙃 "As context length increases, performance consistently degrades across all models.")

With In-App Context Orchestration, we treat the LLM as a stateless API endpoint.

When it is time to write "Section 3," we do not rely on the chat history. We construct a brand new, surgical context. We concatenate strings to build a prompt that contains only what is needed:

  1. The User's Voice settings.
  2. The meta-data from their previous tweaks (e.g., "User prefers serious tone").
  3. The specific heading for Section 3.
  4. The full text of Section 2 (for continuity).

We treat every single section as a fresh "conversation."
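
In code, that surgical construction is nothing fancier than the same string concatenation from earlier; the parameter names below are illustrative:

```typescript
// Build the fresh, surgical context for one section. Nothing else from the
// "conversation" leaks in.
function buildSectionPrompt(args: {
  voiceSettings: string;      // 1. the user's voice/tone tags, rendered as text
  tweakMetadata: string;      // 2. what we learned from their earlier edits
  sectionHeading: string;     // 3. the specific heading to write now
  previousSection: string;    // 4. the full text of the prior section, for continuity
}): string {
  return (
    `You are writing one section of a blog post.\n\n` +
    `Voice:\n${args.voiceSettings}\n\n` +
    `Style notes from the author's edits:\n${args.tweakMetadata}\n\n` +
    `Previous section (for continuity only):\n${args.previousSection}\n\n` +
    `Now write the section titled: "${args.sectionHeading}"`
  );
}
```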

This approach yields significantly better results than a massive, degrading context window. It allows us to be smarter than the model. It doesn't matter if the model has a 200k-token window; a curated context of 500 tokens will almost always outperform a messy history of 10,000. Just because you can have a massive context doesn't mean you should.

The Spore Cloud in Action

Finally, because we have broken the task into chunks, we can introduce parallelism.

For any given section, we can fan out requests to different models or with different context permutations (leaving off the metadata, using alternative phrasing for the voice settings). As the gap between proprietary and open-source models narrows, the specific model matters less than the Dynamic Prompt Construction.

We aren't relying on a single "Agent" to be a genius writer. We are relying on a traditional software system that creates an output methodically, with checkpoints and validation, using AI as an endpoint. Of course, it's not just any endpoint: it requires helpers throughout to construct the perfect context, enable user evaluation, and validate the best result.

Conclusion: The Era of the Invisible Endpoint

We need to fundamentally reframe how we build. We need to demote the LLM from "Agent" to "Endpoint."

The current trend is to put AI front and center: to slap a "Generate with AI" button on everything and demand that non-experts write perfect prompts to get value. Or worse, to call everything an Agent when it's just a chat. Users are understandably fatigued. They are tired of having AI shoved down their throats without receiving immediate, tangible utility. (Professional programmers are an exception, likely because they are already used to writing large amounts of precise instructions, and now they get to write less.)

The most interesting products of the next few years won't be the ones that show off their AI. They will be the ones that hide it.

To build these products, we must return to the basics of interface design. We need to stop relying on the chat format as a crutch for lazy UX. Users are constantly giving us data through their behavior and we can use this data to enhance their experience. At every single step of a user's flow, we should ask: Could an AI endpoint support this specific moment?

  • Would it make the experience smoother?
  • Would it make the data entry faster?
  • Would it make the output more delightful?

If the answer is yes, we should incorporate it. It probably won't be a "chat"; it could come in all shapes and sizes, but most importantly, it will be a value-add that was impossible before affordable LLMs. We should aim for benefits that respect the user's intent without demanding their labor.

When you stop treating the model as an Agent that needs to be talked to, and start treating it as a probabilistic engine that can generate chunks of text and data to be orchestrated, you find surprising ways to create software that feels "smart" without ever feeling artificial.

That is the promise of In-App Context Orchestration. It is the shift from building chatbots that talk, to building software that feels magical again.

But the theory is only half the battle. This is just the beginning; I can’t wait to share the next phase of building with AI soon.