The numbers everyone is quoting this month are not flattering. 88% of agent pilots never reach production. 22% of the deployments that do ship report negative ROI at the 12-month mark. Across all enterprises, only 23% report significant ROI from agents at all. Those are Forrester and KPMG numbers from the last six weeks, and they are consistent across surveys that did not coordinate.
The headlines on these surveys mostly read "AI hype meets reality." That is not the useful version of the story. The useful version is that the 12% of pilots that do survive look almost identical to each other, the 88% that die look almost identical to each other, and the difference has very little to do with model quality, vendor selection, or any of the things vendors want to sell you.
Forrester's own root-cause breakdown of the failures is the place to start. 41% died from unclear success criteria. 33% died from insufficient tool or data access. 26% died from drift in evaluation coverage. Those are the three reasons. Almost nothing else.
Here is what each one looks like in practice.
Reason one. Nobody wrote down what working means.
The most common version of this is a pilot that launched with a sentence like "see if the agent can help support." That is not a success criterion. That is a vibe.
A real success criterion has a number, a scope, and a time horizon. "The agent closes 60% of inbound tier-1 tickets without a human touch, with escalation reversal under 2%, measured weekly over the next six weeks." That is a thing you can fail or pass. "Help support" is a thing you can argue about forever.
The reason pilots launch without numbers is that the team running the pilot does not want to commit to one. The vendor doesn't push for one because vague criteria are easier to claim victory against. The exec sponsoring the work doesn't push for one because they aren't close enough to the work to know what is realistic. Six weeks in, nobody knows whether the pilot is on track, the team running it gets quiet, the exec loses interest, and the project gets reclassified as "ongoing." A quarter later it is dead, but politely.
The fix is to write the criterion before the pilot starts. One sentence. Three components. A number, a scope, a horizon. If you cannot write that sentence, you do not have a pilot, you have a demo.
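One way to keep that sentence honest is to write it down as data, so that passing or failing is a lookup and not a debate. The sketch below is a minimal illustration, not a prescribed tool: the class name, field names, and metric names are all hypothetical, and the example values are the ones from the tier-1 ticket sentence above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    """One measurable claim: a number, a scope, and a time horizon."""
    metric: str                    # hypothetical metric name
    target: float                  # the number
    scope: str                     # which work the number applies to
    horizon_weeks: int             # the time horizon
    higher_is_better: bool = True

    def passed(self, observed: float) -> bool:
        # "At least X" for rates you want high, "under X" for rates you want low.
        if self.higher_is_better:
            return observed >= self.target
        return observed < self.target


def pilot_passed(criteria, observed):
    """True only if every criterion is met by the observed metrics."""
    return all(c.passed(observed[c.metric]) for c in criteria)


# The example sentence from the text, written as two criteria.
criteria = [
    SuccessCriterion("autonomous_close_rate", 0.60, "inbound tier-1 tickets", 6),
    SuccessCriterion("escalation_reversal_rate", 0.02, "inbound tier-1 tickets", 6,
                     higher_is_better=False),
]

# Illustrative weekly readout, not real data.
week_three = {"autonomous_close_rate": 0.63, "escalation_reversal_rate": 0.015}
print(pilot_passed(criteria, week_three))
```

If a field cannot be filled in, that is the same signal as not being able to write the sentence.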
Reason two. The agent does not actually have the tools it needs.
This one shows up after the agent is wired up but before it touches real work. The agent has been given read access to the CRM. It has been given read access to the ticketing system. It has not been given write access to either, because somebody upstream got nervous in week two and walked back the scope.
The agent is now a fancy template engine. It reads the ticket. It drafts a reply. A human pastes the reply, clicks send, marks the ticket closed. The agent does not close tickets. The agent does not update records. The agent does not perform actions in any system. That is not an agent. That is a writing assistant. It will not produce ROI because it has not been allowed to actually do the work.
This pattern is usually presented as security caution. It is not. It is fear of accountability. Nobody wants to be the person who let the agent send the wrong email or refund the wrong order. So write access gets locked down to nothing, the agent gets neutered, and the pilot dies of irrelevance.
The fix is not to throw the gates open. The fix is to scope the write access narrowly and explicitly, then watch it. "The agent can close tickets in the 'shipping question' category, but not refund tickets, and cannot edit customer records." That is a scope an exec can defend. "The agent has read access to everything" is a scope that produces no value.
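A narrow scope like that can be expressed as a deny-by-default allowlist with an audit trail, which covers both halves of "scope it, then watch it." The sketch below assumes a made-up permission model: the action names, the category strings, and the `attempt` helper are hypothetical, and a real deployment would enforce the same rule in the ticketing system's own permissions rather than in agent-side code alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteScope:
    """Explicit allowlist of (action, category) pairs. Anything else is denied."""
    allowed: frozenset

    def permits(self, action: str, category: str) -> bool:
        return (action, category) in self.allowed


# The scope from the text: close 'shipping question' tickets and nothing else.
SCOPE = WriteScope(frozenset({("close_ticket", "shipping question")}))

# Every attempt is recorded, allowed or not, so the scope can be watched.
audit_log = []


def attempt(action: str, category: str, scope: WriteScope = SCOPE) -> bool:
    """Check an action against the scope and log the decision."""
    allowed = scope.permits(action, category)
    audit_log.append({"action": action, "category": category, "allowed": allowed})
    return allowed


print(attempt("close_ticket", "shipping question"))          # inside the scope
print(attempt("close_ticket", "refund"))                     # refund tickets are out
print(attempt("edit_customer_record", "shipping question"))  # record edits are out
```

The design choice that matters is deny-by-default: widening the scope later means adding one pair to the allowlist, which is a change somebody has to make on purpose.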
If your pilot does not include at least one action the agent can take end to end without a human, you do not have a pilot. You have a co-pilot. The productivity gains from co-pilots are measurable but small, and they are not what was on the slide that got the budget approved.
Reason three. Nobody is watching whether it still works.
Agents drift. The input distribution shifts when marketing runs a new campaign and the ticket mix changes. The tool responses shift when a vendor updates an API. The team's definition of a good answer shifts when a new customer segment comes online. None of that is dramatic. All of it adds up.
Most pilots launch with no held-out evaluation set. The team that built the agent reviewed some sample outputs in week one, decided they looked fine, and moved on. There is no weekly evaluation. There is no regression baseline. When the agent starts performing worse, nobody notices until somebody escalates a specific bad case, and by then the trust is gone.
The fix is a small held-out set of representative inputs, maybe 30 to 100 examples, with known expected outputs, that you run the agent against every week. Track accuracy over time. If it drops, you have a signal weeks before a human catches it through complaints. The set does not have to be large. It has to be representative and it has to actually run.
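The weekly run itself can be very small. The sketch below assumes the agent can be called as a plain function and that outputs can be graded by exact match, which is a simplification: free-text answers usually need a rubric or a human grader. The function names, the 5-point tolerance, and the choice of the first recorded run as the baseline are all illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple


def accuracy(agent: Callable[[str], str],
             examples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of held-out examples where the agent's output matches the expected one."""
    if not examples:
        raise ValueError("held-out set is empty")
    correct = sum(1 for inp, expected in examples if agent(inp) == expected)
    return correct / len(examples)


def regressed(history: Sequence[float], current: float,
              tolerance: float = 0.05) -> bool:
    """Flag a drop of more than `tolerance` below the first recorded run."""
    if not history:
        return False  # nothing to compare against yet
    baseline = history[0]
    return current < baseline - tolerance


# Toy stand-ins for a real agent and a real held-out set.
held_out = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]


def toy_agent(text: str) -> str:
    return text.upper()


history = []                      # one accuracy value per weekly run
score = accuracy(toy_agent, held_out)
if regressed(history, score):
    print("accuracy dropped, investigate this week")
history.append(score)
```

Scheduling it is the part that matters. A cron job or a CI pipeline that runs this weekly and posts the number somewhere visible is enough.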
The 12% that survive look like this.
The pilots that ship and produce real ROI share three properties. None of them are about model selection.
One named owner with ROI accountability. Not a steering committee. Not "the AI team and the support team jointly." One person whose performance review for the next two quarters includes "did the agent ship and did it produce the number we said it would." Without an owner, the pilot has no advocate when somebody upstream gets nervous. With one, the project has a forcing function.
Scoped write access to the systems that matter. The agent can close tickets, post messages, update records, send emails, or whatever the actual work is, within an explicit scope. The team accepts that mistakes will happen at a knowable rate, and has a rollback plan when they do.
A weekly evaluation run on a held-out set. Small, representative, written down. Run every week. Anomalies investigated within the week. This is the difference between "we have an agent in production" and "we know our agent is still doing what we deployed."
That is the pattern. Vendors don't sell it because none of the three are products. They are operating discipline, and they cost almost nothing once the agent itself exists.
The four-question checklist before you greenlight a pilot
Before you sign off on the next agent pilot in your org, ask these four questions. If the answer to any of them is unclear, the pilot is going to fail. You do not have to cancel it. You have to fix the answer first.
One. What is the single sentence that describes success, with a number, a scope, and a time horizon?
Two. What action can the agent take end to end, in production, without a human touch?
Three. Who is the single person whose performance review depends on the answer to question one?
Four. What is the held-out evaluation set, where does it live, and who runs it weekly?
That is it. Four questions, two minutes, and you will have already done more diligence than most of the 88%.
If you're looking at a stalled pilot and trying to figure out which of the three failure modes you are in, or want a second pair of eyes on the checklist before greenlighting the next one, reach out. We run this audit for clients every week.