Why do most enterprise AI pilots fail to deliver ROI?

MIT's 2025 study found 95% of enterprise generative-AI pilots produced no measurable P&L impact, and that the cause was implementation, not model quality. In our experience the failure sits in the 'last mile' — embedding the output where work happens and giving the agent long enough to improve — not in the model.

What does 'the last mile' mean for AI adoption?

It's everything between a capable model and actual use: which channel the work is delivered on, who owns the outcome once it arrives, and whether anyone acts on it. A tool people have to remember to open rarely gets used; a colleague who emails you the work does.

Should you kill an AI pilot that underperforms in the first weeks?

Not on output quality alone. An agent that learns from corrections looks ordinary early and improves later — week-one polish is a poor kill signal. Judge instead on whether the work is owned, whether anyone acts on it, and whether corrections are sticking.

We made our agents email people. That's when the AI started working.

Capability vs Delivery

01 The decision that mattered wasn't technical

The single most useful thing we did to make AI work inside our company was decide that our agents would send emails.

Not build a smarter model. Not buy a better tool. Decide that Lola, our Chief of Staff agent, would land a briefing in your inbox at 5am — and that Cody would send a deploy note, and the night shift would attach a finished draft — on the channels people already live in. No new app. No new tab. No “remember to go and prompt the thing.”

That sounds too small to matter. It was the difference between agents people use and agents people forget.[S2]

02 The number everyone misreads

In 2025, MIT looked at 300 enterprise generative-AI deployments. Enterprise spend on generative AI was an estimated thirty to forty billion dollars. 95% of the pilots produced no measurable impact on the P&L. One in twenty created real value.[S1]

The usual read is “the models aren’t ready.” That’s the comfortable conclusion, because it means you wait and the problem solves itself. It’s also wrong. The same study found the divide isn’t driven by model quality or regulation[S1] — over 80% of these companies had already put ChatGPT or Copilot in front of staff. The capability was in the building. It just didn’t show up in the numbers.

It didn’t show up because a tool dropped next to a busy person doesn’t take work off their plate. It adds a thing to manage. The capability arrives; the last mile — getting the work to actually leave a human’s hands and land somewhere useful — never gets built. That mile is unglamorous. No demo, no launch, no slide. So it gets skipped, and the pilot joins the 95%.

03 Failure mode one: nobody adopts a tool they have to open

A colleague who emails you a competitor breakdown gets read. A dashboard you have to remember to log into gets ignored by March. Same content. Different last mile.

When we place an agent, we don’t ask “what interface will we give it.” We ask “where does this work already happen, and how would a human colleague deliver it there.” Lola pushes; she doesn’t wait to be pulled. The morning brief is in the inbox before anyone’s awake. Pipeline changes arrive as a message, not a report you have to fetch. The agent meets people where they work — email, Telegram, a deck attached to a thread — because that’s where adoption lives. Build a chat box nobody opens and you’ve bought capability you’ll never collect on.
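To make push versus pull concrete, here is a minimal Python sketch. It is an illustration, not our production code: the `build_brief` helper, the addresses, and the brief items are all hypothetical. The point is that the agent composes a finished message addressed to a person, instead of rendering a page someone has to remember to visit.

```python
from email.message import EmailMessage


def build_brief(recipient: str, items: list[str]) -> EmailMessage:
    """Compose a morning brief as an ordinary email (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = "lola@example.com"  # placeholder sender address
    msg["To"] = recipient
    msg["Subject"] = f"Morning brief: {len(items)} items"
    # The body is the finished work itself, not a link to a dashboard.
    msg.set_content("\n".join(f"- {item}" for item in items))
    return msg


# Hypothetical items, for illustration only.
brief = build_brief(
    "you@example.com",
    ["Pipeline: two deals moved stage", "Draft ready for review"],
)
print(brief["Subject"])
```

Handing `brief` to the standard library's `smtplib.SMTP.send_message` would deliver it. The mail host and credentials are left out here because they depend on your setup.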

04 Failure mode two: killing the slow starters

Here’s the part that gets expensive. A real agent — one that learns from your corrections — is unremarkable in week one. It performs about like an off-the-shelf model, because it hasn’t been corrected enough yet to stop making your specific mistakes. The value compounds later, after it has been told three times that the tone is wrong and started getting it right.

Most pilots never reach later. They’re judged at week two, the output looks ordinary, and the verdict is “doesn’t work.” The MIT funnel shows the carnage: 60% of firms evaluate enterprise systems, 20% pilot, 5% reach production.[S1] A lot of that drop-off isn’t capability. It’s calling a slow starter a failure.
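The funnel figures above can be turned into step-to-step survival rates with a few lines of Python. This is a sketch under one assumption that the study summary does not spell out here: that each stage is a subset of the one before it, so dividing adjacent percentages is meaningful. The `step_rates` name is ours, for illustration.

```python
def step_rates(funnel: dict[str, float]) -> dict[str, float]:
    """Share of firms at each stage that carry on to the next stage.

    Assumes each stage is a subset of the previous one.
    """
    stages = list(funnel)
    return {
        f"{earlier} -> {later}": funnel[later] / funnel[earlier]
        for earlier, later in zip(stages, stages[1:])
    }


# Share of firms reaching each stage, as cited from the MIT study.
funnel = {"evaluate": 0.60, "pilot": 0.20, "production": 0.05}

for step, rate in step_rates(funnel).items():
    print(f"{step}: {rate:.0%}")
```

Read that way, roughly one evaluator in three goes on to pilot, and one pilot in four reaches production, which works out to about one in twelve from evaluation to production.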

This cuts both ways, and we’ve been on the other side of it. We once ran an internal agent whose job was to “coordinate.” It produced 764 messages in thirteen days that nobody acted on, and we shut it down.[S3] We wrote up that failure in the Major Tom post-mortem. So the lesson isn’t “never kill an agent.” It’s that week-one output quality is the wrong kill signal. The right signals are whether the work is owned, whether anyone acts on it, and whether the corrections are sticking. Judge on those and you can tell a slow-starting winner from real noise. Judge on week-one polish and you’ll kill the wrong ones.
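One way to keep a review honest is to write the three signals down as a checklist that deliberately has no field for output polish. The Python sketch below is illustrative only: the `PilotSignals` fields, the `failing_signals` helper, and both thresholds are assumptions we made up for the example, not measured values, and the two sample pilots are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotSignals:
    has_owner: bool           # a named person is accountable for the output
    acted_on_rate: float      # share of outputs a human acted on, 0.0 to 1.0
    repeat_error_rate: float  # share of corrected mistakes that come back, 0.0 to 1.0


def failing_signals(
    s: PilotSignals,
    min_acted_on: float = 0.1,  # assumed threshold, tune for your own team
    max_repeat: float = 0.5,    # assumed threshold, tune for your own team
) -> list[str]:
    """List which of the three signals look bad. Output quality is not an input."""
    problems = []
    if not s.has_owner:
        problems.append("nobody owns the work")
    if s.acted_on_rate < min_acted_on:
        problems.append("nobody acts on the output")
    if s.repeat_error_rate > max_repeat:
        problems.append("corrections are not sticking")
    return problems


# Hypothetical pilots, for illustration only.
slow_starter = PilotSignals(has_owner=True, acted_on_rate=0.4, repeat_error_rate=0.2)
noise = PilotSignals(has_owner=True, acted_on_rate=0.0, repeat_error_rate=0.2)

print(failing_signals(slow_starter))
print(failing_signals(noise))
```

An empty list is not proof that a pilot is working, only that none of these three signals gives a reason to stop it yet.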

05 What the last mile looks like Monday morning

Our agents have run in production with our team since January 2026, on the same inbox, pipeline, and client data the humans use — not a sandbox. The work that actually moves is the work that was embedded, not just enabled. Lola owns the pipeline and the inbox and reports into them daily. The night shift turns the day’s backlog into reviewed drafts by morning. Cody ships code while the team sleeps. None of that is impressive because of the model underneath. It moved the needle because the output had somewhere to go and someone who owned it.

The companies in the 5% share this pattern, and the MIT authors name it: they judge AI by business outcomes rather than benchmarks, and they put it where the structured, repeatable work actually sits — back-office and operations, where the returns concentrate, rather than the sales-and-marketing pilots that soak up most budgets and return the least.[S1] That’s a last-mile choice, not a model choice.

06 One thing to take from this

Your AI doesn’t fail in the model. It fails in the last mile — and the last mile is unglamorous work nobody puts on a slide.

If a pilot isn’t moving the P&L, the fix is almost never a better model. It’s the boring questions: where does this work land, who owns it once it lands, and did anyone give it longer than two weeks to get good. Get those right and ordinary models produce real outcomes. Skip them — buy the capability, skip the mile — and you’ve built the 95%.
