Why Generalist Agents Fail and Specialists Win
Most AI agent demos promise the world. “I can do anything.” “I can solve any task.” “I am your universal assistant.”
But when you actually use them in production, the promise collapses. You type: “Build me a landing page.” The agent churns. It produces a generic component with broken CSS. You spend 45 minutes fixing it.
Net time saved: negative thirty minutes.
Trust score: zero.
This is the generalist trap.
We are building agents with the goal of being “capable of everything,” and as a result, they’re reliably great at nothing.
The agents that are actually sticking—the ones with retention curves that look like a smile instead of a cliff—are doing something much less sexy. In Jobs-to-be-Done terms, the successful agents are designed around a specific Job, not a broad and all-encompassing goal.
And until we stop building “Do Anything” agents and start building “Do Something Specific” ones, too many agent demos will continue to overpromise and underdeliver.
The Context Collapse Problem
Why do general-purpose agents fail so hard in real life? Because of context collapse. In any real-world job, a huge portion of the requirements is understood without ever being said.
If a VP of Marketing tells a copywriter, “Write a launch post for this feature,” the copywriter knows specifics about voice, strategy, campaign history, and the stakeholders. They take all that into account without being explicitly told.
A “Marketing Agent” knows none of this. It only knows the prompt: “Write a launch post.” So it produces a generic, emoji-filled, synergy-heavy post that sounds like all the other AI-generated slop on LinkedIn.
The user reads it and thinks: *“This is garbage. I have to rewrite the whole thing.”* A Job-First Agent avoids this trap by narrowing the scope to where the required context is bounded.
The Fix: Design for the Context, Not Just the Text
The Specialist Agent succeeds at this same job not by being smarter, but by being narrower.
It doesn’t try to “do marketing.” It tries to “Draft a Launch Post based on Release Notes.”
Because it is specialized, it knows what inputs it needs to do well:
- Input 1: The raw release notes (Facts).
- Input 2: The "Banned Words" list (Voice).
- Input 3: The campaign goal (Strategy).
It guides the user to provide the missing context before it starts. The Generalist guesses. The Specialist asks for the right ingredients. That is why the Specialist wins: it respects the complexity instead of pretending it doesn't exist.
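This ingredient-gathering behavior can be sketched as a pre-flight check. Everything below is hypothetical—the input names and the `draft_launch_post` function are invented for illustration—but it shows the shape of a specialist that asks instead of guessing:

```python
# Hypothetical sketch: a specialist agent that refuses to run
# until it has the inputs the job actually requires.

REQUIRED_INPUTS = {
    "release_notes": "The raw facts about what shipped.",
    "banned_words": "Voice guardrails (words the brand never uses).",
    "campaign_goal": "The strategy this post is supposed to serve.",
}

def draft_launch_post(inputs: dict) -> str:
    missing = [name for name in REQUIRED_INPUTS if not inputs.get(name)]
    if missing:
        # The Specialist asks for the right ingredients instead of guessing.
        asks = "\n".join(f"- {name}: {REQUIRED_INPUTS[name]}" for name in missing)
        return f"Before I draft anything, I need:\n{asks}"
    # With the context bounded, generation is the easy part (LLM call omitted).
    return f"[Draft grounded in {len(inputs)} inputs]"

print(draft_launch_post({"release_notes": "v2.1 adds SSO"}))
```

A generalist would have produced *something* from the bare prompt; the specialist's first output is a request for the two missing ingredients.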
The Specialist Advantage
Agents that are winning aren't hypothetical. They exist and they’re ruthlessly narrow.
Klarna’s Support Agent
The Job: “Resolve routine disputes instantly.”
The Struggle: Customers hate waiting 24 hours to ask “Where is my refund?”
Klarna didn’t build a “Chat with Klarna” bot to talk about the weather. They built an agent integrated into their order management system.
- Trigger: Customer asks “Where is my order?”
- Action: It checks the shipping API. It checks the policy. It processes the refund or updates the status.
- Result: It handles two-thirds of all customer chats (2.3 million conversations) and does the work of 700 full-time agents.
It wins because it doesn’t try to be a friend. It tries to be a Resolution Machine.
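The shape of a Resolution Machine can be sketched in a few lines. Every function below is an invented stand-in (Klarna's real integration is not public); the point is the structure: a specific trigger, a direct action, and explicit escalation when the request is out of scope.

```python
# Hypothetical sketch of a "Resolution Machine": trigger -> action -> result.
# shipping_status and refund_allowed are stand-ins for real API/policy calls.

def shipping_status(order_id: str) -> str:
    return "in_transit"          # stand-in for a shipping API lookup

def refund_allowed(order_id: str) -> bool:
    return True                  # stand-in for a refund-policy check

def handle_chat(message: str, order_id: str) -> str:
    text = message.lower()
    if "refund" in text and refund_allowed(order_id):
        return f"Refund processed for {order_id}."       # resolution, not chat
    if "order" in text:
        return f"Order {order_id} is {shipping_status(order_id)}."
    return "Escalating to a human agent."                # out of scope: don't pretend

print(handle_chat("Where is my order?", "A123"))
```

Note the last branch: a narrow agent that knows its boundary and hands off is more trustworthy than a broad one that improvises.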
Cursor
The Job: “Keep me in the flow.”
The Struggle: Every time I hit a syntax error, I have to context-switch to Stack Overflow or documentation.
Cursor isn’t just a “chatbot in the IDE.” It’s a Context-Aware Editor.
- Trigger: A compiler error or a natural language command (“Fix this bug”).
- Action: It reads the entire codebase (not just the file). It applies the fix directly to the code (diff view).
- Result: Developers accept 70%+ of its suggestions because it understands the project context, not just the syntax.
It wins because it turns “debugging” (a 20-minute detour) into “tab-to-fix” (a 1-second action).
Harvey
The Job: “Synthesize complex legal precedents into a first draft.”
The Struggle: Associates spend hours searching internal databases to find relevant clauses and case law.
Harvey doesn’t try to be a generic writer. It tries to be a Legal Research Assistant.
- Trigger: “Draft a memo on X based on our previous cases.”
- Action: It retrieves relevant documents from the firm’s secure database. It synthesizes the arguments. It cites the sources.
- Result: It turns hours of research into a high-quality first draft for partner review.
It wins because it understands legal reasoning and firm-specific precedent, not just grammar.
How to Scope an Agent to a Job
If you’re building an AI agent, stop asking: “What can the model do?” Start asking: “What is the hiring criteria for this job?”
To define a Job-First Agent, you need to answer three questions. If you can’t answer them, you don’t have a product yet.
The Trigger: When does the user hire this agent?
- Bad Answer: “Whenever they want help.” (Too vague).
- Good Answer: “When a customer asks for a refund.” (Specific event).
- Good Answer: “When a pull request fails the build.” (Specific event).
The trigger defines the moment of hire. Without a specific trigger, the user has to remember to use your tool. With a specific trigger, you can integrate directly into the workflow.
The Input: What does the agent need to know to do the job without asking?
- Bad Answer: “Whatever the user types in the chat.” (High friction).
- Good Answer: “The customer’s order history and the refund policy.” (Auto-ingested).
- Good Answer: “The error log and the changed files.” (Auto-ingested).
The more you can ingest automatically, the less the user has to prompt. The best agents are “zero-shot” for the user—they just work.
The Success Criteria: How do we know the job is done?
- Bad Answer: “The user is satisfied.” (Subjective).
- Good Answer: “The refund is processed in Stripe.” (Binary).
- Good Answer: “The build passes.” (Binary).
Binary success criteria allow you to measure performance. Subjective criteria lead to “vibes-based” product management.
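The three questions can be captured in a small spec object. The names here (`JobSpec`, `refund_agent`) are invented for illustration; the discipline is the point—if you can't fill all three fields, you don't have a product yet:

```python
# Hypothetical scoping checklist: an agent is defined by a specific
# trigger, auto-ingested inputs, and a binary success criterion.
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobSpec:
    trigger: str                      # specific event, e.g. "PR fails the build"
    inputs: list[str]                 # what it ingests without asking the user
    is_done: Callable[[dict], bool]   # binary, measurable; no "vibes"

refund_agent = JobSpec(
    trigger="Customer asks for a refund",
    inputs=["order_history", "refund_policy"],
    is_done=lambda state: state.get("refund_processed", False),
)
```

Because `is_done` returns a boolean over observable state, you can count successes per week instead of arguing about satisfaction.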
The Shift from “Magic” to “Labor”
The first wave of Generative AI (2022-2023) was about Magic. “Look, it wrote a poem about a pirate!” “Look, it made a picture of a cat in space!”
The current wave is about Labor. “Look, it triaged 500 tickets.” “Look, it migrated this database schema.”
Magic is fun. Labor is valuable. And truly valuable labor is specific.
You don’t hire a “General Employee” to “do work.” You hire a “React Developer” or a “Content Marketer” or a “Paralegal.” You give them a job description. You give them constraints. You measure their output.
Your AI agents need the same specificity. Don’t build a “Marketing Agent.” Build an agent that “Keeps users informed of technical changes.” Don’t build an “Analytics Agent.” Build an agent that “Protects revenue by flagging recurring dips.”
Specific jobs get hired. Vague goals get fired. And in the world of AI agents, specificity is the deepest moat you have.