Why Most AI Projects Fail
I have sat in the room when the demo lands. The screen lights up, the model answers the impossible question, somebody senior says “this changes everything”, and a budget appears out of nowhere. Six months later I am in a different room, quieter, where the same project is being quietly defunded. Nobody calls it a failure. It just stops being mentioned.
I have seen this cycle enough times now to be unsentimental about it. Most AI projects fail. Not because the model was bad — the model is almost never the problem — but because the organisation around the model was never built to carry it. The uncomfortable truth is that AI projects fail for the same boring reasons every IT project has always failed: unclear ownership, poor data, no operational discipline, security treated as an afterthought, expectations set by a salesperson rather than an engineer. AI just amplifies all of it, because it fails confidently and it fails in prose.
This is a flagship opinion piece, so I am going to take positions rather than hedge. If you want the constructive version of the argument — what AI actually is once the hype burns off — read it alongside AI is becoming infrastructure, which is the spine of everything I believe about this subject. This article is the autopsy. That one is the blueprint.
The demo-to-production cliff
The single most expensive misunderstanding in this field is the belief that a working demo is a nearly-finished product. It is not. A demo and a production system are not the same thing at different stages of completion. They are different things entirely, and the gap between them is where most projects die.
A proof of concept is built to succeed. You pick the friendly question, the clean document, the happy path, the one example that makes the room gasp. That is fine — that is what a PoC is for. The dishonesty creeps in when the PoC’s success rate gets quietly extrapolated into a production promise. The slick demo answered ten curated questions perfectly. Production has to answer ten thousand questions it has never seen, from users actively trying to break it, against data that changes hourly, while staying within budget, latency, and compliance limits nobody mentioned in the demo.
I think of it as a cliff rather than a slope. The work does not get gradually harder. It falls off the edge.
The PoC lives on the left. Production lives on the right. Everything in the middle is the work that does not demo well, gets no applause, and decides whether the thing survives. When I qualify an AI opportunity now — using the same discipline I describe in the AI consultancy toolkit — my first question is never “can we build a demo?” It is “who is going to operate this at 3am in March?” If there is no answer, the project is already failing; it just does not know it yet.
Garbage in, confident garbage out
The oldest law in computing did not retire when the transformers arrived. It got a promotion. “Garbage in, garbage out” used to produce obviously broken output — a null where a name should be, a report that clearly did not add up. You could see the garbage. An LLM does something far more dangerous: it takes your garbage data and returns it as fluent, confident, well-formatted prose. Garbage in, authoritative garbage out.
This is the failure mode that terrifies me most, because it is invisible to exactly the people making decisions on it. If your CRM has three conflicting records for the same customer, a traditional report shows three rows and somebody notices. Ask an LLM and it will smoothly synthesise a single confident answer — and you will never know it averaged three contradictory facts into one plausible lie. The model does not flag uncertainty unless you force it to. It is a fluency engine, not a truth engine, and fluency is precisely what disarms a reader’s scepticism.
Most organisations dramatically underestimate how bad their data is, because they have never had to confront it head-on. The dashboards papered over it. Retrieval rips the paper off. The first time you point a serious retrieval pipeline at a real SharePoint estate, you find duplicate policies, three versions of the “current” pricing sheet, a 2019 document that contradicts the 2024 one, and an org chart describing people who left. The model dutifully grounds itself in all of it. This is why I am so insistent about doing grounding properly — retrieval is not a magic trick that fixes bad data, it is a magnifying glass held over it.
The hard position: if you would not trust a junior analyst to write a report from your data unsupervised, you cannot trust an LLM to do it either, and the LLM is a far more convincing liar.
Magical thinking and the vendor hype machine
A large share of AI failures are baked in before a single line of code is written, in the gap between what the vendor demo promised and what the technology can actually do. The market is awash with magical thinking, and the people fuelling it are not engineers — they are marketing departments with a quarterly number to hit.
I work in presales. I understand the gravitational pull of a good demo, and I have a low tolerance for the version of it that sets a customer up to fail. When a slide promises “AI that understands your entire business”, what is actually being sold is a probabilistic text generator with a context window and a retrieval index. Those are genuinely useful. They are not magic. The damage is done when a senior stakeholder, primed by twelve months of breathless coverage, sets a success criterion that no system could meet — “it should just know the answer to anything” — and then judges a perfectly good tool a failure for not being omniscient.
If you cannot say in one sentence what the AI is allowed to be wrong about, you have not scoped the project. You have placed a bet.
The cure is unglamorous and it is the presales engineer’s actual job: translate the magical expectation into a measurable one before the contract is signed. Not “summarise our knowledge” but “draft a first-pass answer to tier-one support questions, correctly cited, that a human approves before it reaches a customer”. The second framing can succeed. The first can only disappoint. This is the same expectation-management discipline that decides whether a deal survives the journey from proposal to production.
Nobody owns it
Here is the failure that hides in plain sight. The pilot works. Everyone is pleased. And then the question nobody wants to answer arrives: whose job is this now?
An astonishing number of AI initiatives are orphans. They are championed by an enthusiast — often someone brilliant and slightly bored in their actual role — who builds something genuinely clever in their evenings. Then that person gets promoted, or leaves, or simply runs out of evenings, and the system has no product owner, no budget line, no roadmap, and no one accountable when it breaks. It does not get switched off. It just decays. Models drift, the index goes stale, an API key expires, and one day it is quietly wrong about everything and nobody is watching.
Worse is the shadow-IT version, which I am seeing constantly now. A team, frustrated with the official backlog, wires up a copilot or a third-party AI tool against company data without telling anyone. It works. It spreads. Now there is a production system processing sensitive data that security has never reviewed, that has no owner of record, and that nobody can confidently switch off because three departments depend on it. The convenience that made it spread is exactly what makes it dangerous.
Ownership is not a nice-to-have. It is the difference between a system and a science project. A real owner means a named human accountable for the outcome, a budget that survives the next reorganisation, and a roadmap that treats the thing as a product with a lifecycle rather than a clever demo frozen in time. I treat this as a qualification gate: no owner, no project. I would rather kill an AI initiative at the proposal stage than let it become an unowned liability eighteen months later.
The model is a new attack surface
Security is where I see the most dangerous combination: high stakes and low awareness. People are bolting LLMs onto their most sensitive systems while reasoning about security as though it were a normal web app. It is not. An LLM connected to your data and your tools is a new and genuinely strange attack surface, and the old playbook does not fully cover it.
Start with prompt injection, which is not a theoretical worry — it is the defining vulnerability class of this technology. If your model reads untrusted content — a web page, an inbound email, a document a user uploaded, a calendar invite — that content can contain instructions, and the model has no reliable way to distinguish data it should read from commands it should obey. A support assistant that ingests customer emails can be told, by a customer email, to ignore its rules and exfiltrate the last ten tickets. We have spent decades learning to separate code from data; LLMs cheerfully blend them back together.
Then there is oversharing through copilots, which is quietly the most common real-world incident. Deploy an enterprise copilot over a document estate where permissions were always “a bit loose but nobody could find anything anyway”, and you have just built a search engine that finds everything. The salary spreadsheet that was technically shared with the whole company but buried in a forgotten folder is now one polite question away. The AI did not breach anything. It just made existing bad permissions usable, and that is enough to cause a disaster.
The defensive posture has to assume the model can be manipulated and that its outputs and tool-calls are untrusted by default:
# The model is untrusted by default. Constrain it like one.
ai_assistant:
data_access:
enforce_user_permissions: true # the model sees only what the asking user can see
no_service_account_shortcuts: true # never run as an all-seeing identity
tools:
allowlist: [search_kb, create_ticket]
deny: [delete, send_external_email, run_shell]
human_in_the_loop: [send_external_email]
inputs:
treat_retrieved_content_as_untrusted: true
strip_instructions_from_documents: true
logging:
prompts: true
tool_calls: true
retention_days: 90
The principle is old and it still holds: least privilege, defence in depth, log everything, trust nothing. The novelty is that one of the things you must now distrust is the AI itself. I build the underlying platform with this assumption baked in, which is part of why I care so much about getting the infrastructure for AI workloads right — security at the model layer is worthless if the box underneath is wide open.
Governance and the questions nobody asked
Closely related, and just as fatal, is governance — the set of questions that are boring to ask and catastrophic to skip. Who is allowed to use this? On what data? Where does that data physically live, and does sending it to a US-hosted API breach a contract or a regulation? Can we produce an audit trail of what the model was asked and what it answered? If a regulator asks why a decision was made, can we explain it, or does the answer disappear into a 70-billion-parameter shrug?
These questions kill more projects than any technical limitation — usually late, expensively, just before go-live, when legal or compliance finally looks at what has been built and says no. I have watched a genuinely excellent pilot get vetoed at the final gate because nobody had asked, on day one, whether the data was allowed to leave the tenant. It was not. Months of work, dead, over a question that takes an afternoon to answer at the start.
Data residency, model risk, access control, auditability, retention — these are not blockers invented to slow you down. They are the difference between a system you are allowed to run and a system that gets you fined. The same logic that makes me build repeatable health checks for customer environments applies here: governance is not a one-off sign-off, it is a posture you maintain. Bolt it on at the end and it becomes the thing that says no. Build it in from the start and it becomes the thing that lets you say yes safely.
No one is on call
The last failure mode is the one that separates people who have run production systems from people who have only built demos: operational readiness. Most AI projects have none.
Ask a team how they will know the model has started giving worse answers and you usually get a blank look. There is no evaluation harness, so quality is measured by vibes and the occasional complaint. There is no monitoring of answer quality, latency, cost, or refusal rate, so a model update that quietly degrades output goes unnoticed for weeks. There is no rollback plan, so when a new prompt or model makes things worse, the only option is to thrash in production. And there is no one on call, because everyone assumed an AI system, unlike every other production system in history, would somehow look after itself.
It will not. An AI feature in production is a production system, and production systems need the unglamorous scaffolding: a set of evals you can run on every change, monitoring that watches quality and cost, alerting when something drifts, and a human who is accountable when the alert fires. None of this is novel. It is the same operational maturity we expect of a database or a payment gateway. The mistake is exempting AI from it because it feels new and magical. Magic does not page someone at 3am. Engineering does.
What success actually requires
If the failures are mostly old IT failures wearing a new costume, then so are the cures, and I find that genuinely encouraging. We are not short of knowledge about how to run reliable systems. We are just refusing to apply it to AI because the hype told us this time was different. It is not different. It is the same, amplified.
A project that survives the cliff has, in my experience, the same handful of things in place from the start. There is a named owner who is accountable for the outcome and has a budget that outlives the excitement. The expectations were set by an engineer, not a slide, and written down as something measurable the system is actually allowed to fail at. The data was treated as the real project, because it is — retrieval is grounded properly and the bad data was confronted, not papered over. Governance and security were designed in on day one, with least privilege, audit trails, and a clear answer to where the data lives. And the thing is operated like production: evals on every change, monitoring on quality and cost, a rollback that works, and a human on call.
Notice what is not on that list. The choice of model. The cleverness of the prompt. The frontier benchmark scores. Those are real, but they are the part everyone already obsesses over, and they are almost never why a project dies. The model is not the product — the system around it is, and the system is built from ownership, data, governance, security, and operations.
So when someone shows me a dazzling demo and asks how fast we can ship, I have learned to be the person who asks the unwelcome questions. Who owns this? How bad is the data, really? Where is it allowed to go? What happens when it is wrong, and who finds out? They are not exciting questions. They will not make the room gasp. But they are the entire difference between a system that is still running, quietly and reliably, in three years — and one more orphaned pilot that nobody quite remembers switching off.