Building an AI Consultancy Toolkit

Most of what passes for “AI consulting” right now is one of two things. It is either a hype-led demo — someone wires a chatbot to a sample PDF, the room makes impressed noises, and nobody asks what happens on day ninety. Or it is a strategy deck: forty slides of maturity curves and “transformation pillars” written by people who have never deployed a model, never watched a retrieval pipeline return confidently wrong answers, and never had to explain to a CFO why the proof of concept that wowed everyone cannot go live.

I sit in the middle of that, in technical presales and solutions architecture, and I have come to believe the gap between those two failure modes is where the actual job lives. The demo people can build but cannot scope. The strategy people can scope but cannot build. The work that matters — and the work customers will pay for and trust — is qualifying hard, discovering honestly, and turning that into something defensible an engineering team can actually deliver.

The single most useful thing in my toolkit is not a framework. It is the willingness to say no. Most requests that arrive with the words “can you add AI to this?” should not become projects, and a consultant who cannot disqualify is just an expensive order-taker. Everything below is built around that discipline. The frameworks exist to make a defensible “no” — or a defensible “yes” — repeatable instead of a matter of mood.

I write this as someone who actually builds the stuff. At home I run local models on a single RTX 3090, an Ollama-based assistant I call Project Atlas, and a stack of retrieval and automation that I have broken and rebuilt enough times to know where the bodies are buried. That hands-on credibility is not a vanity point. It is the thing that lets me sit across from a customer and tell the difference between a five-day integration and a six-month research project, because I have personally been burned by mistaking one for the other.

The problem with how this gets sold

The typical AI enquiry arrives backwards. A customer has read that competitors are “using AI”, a board member has asked an awkward question, and someone has been told to “look into it”. So the request lands as a solution looking for a problem: we want a copilot, we want to chat with our documents, can the system summarise tickets. There is no use case underneath it, only a capability they have heard exists.

What is almost always missing at this point is the unglamorous half of the conversation. Nobody has decided what decision the AI would actually change. Nobody knows whether the data it would need is clean, accessible, or even allowed to be used that way. There is no definition of “good” — no number, no threshold, no acceptance criteria — so the project can never be declared finished or successful. And crucially there is rarely a sponsor with the authority to make the trade-off decisions the work will demand.

I have written separately about why most AI projects fail, and the post-mortems nearly always trace back to this opening moment. The failure is designed in before a line of code is written, because the enquiry was treated as a brief instead of a symptom. The job of good consulting is to refuse the brief and diagnose the symptom.

If you cannot name the decision the model will change, you do not have a use case. You have a wish.

Qualification — telling an opportunity from a science project

Qualification is where the money is saved or lost, and it happens before discovery proper. I am trying to answer one question: is there a real opportunity here, or a science project dressed as one? A science project is interesting, open-ended, and funded by enthusiasm; it has no owner and no end state. A real opportunity has tension behind it — someone is genuinely losing time, money, or sleep — and someone with authority wants it fixed.

Five things have to be present, and I score them deliberately. Value: is there a quantifiable outcome, or at least a defensible estimate of one? “Save advisors ten minutes per case across two hundred cases a day” is value. “Be more innovative” is not. Data readiness: does the data the use case depends on actually exist, in a place we can reach, in a state we can trust? Ownership: is there a named business owner who will live with the result, not just an IT contact relaying messages? Risk fit: is this a domain where a wrong answer is recoverable, or one where a single hallucination is a regulatory incident? Sponsor: is there someone who will make decisions and defend a budget when the novelty wears off?
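The value example above is easy to sanity-check with arithmetic. The sketch below is illustrative only: the function name and the 220 working days a year are assumptions, not figures from a real engagement.

```python
def annual_hours_saved(minutes_per_case: float, cases_per_day: int, working_days: int) -> float:
    """Convert a per-case time saving into hours saved per year."""
    return minutes_per_case * cases_per_day * working_days / 60


# Ten minutes per case across two hundred cases a day, as in the example.
daily_minutes = 10 * 200

# working_days=220 is an assumed placeholder; use the customer's real figure.
hours_per_year = annual_hours_saved(10, 200, working_days=220)

print(daily_minutes)          # 2000
print(round(hours_per_year))  # 7333
```

Two thousand minutes a day is roughly thirty-three advisor hours, which is the kind of number a sponsor can defend. Multiply by a loaded hourly cost only if the customer supplies one.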

Here is the scoring model I actually use. It is deliberately blunt.

Dimension	0 — disqualify	1 — caution	2 — strong
Value	No measurable outcome	Plausible but unquantified	Named metric and baseline
Data readiness	Doesn’t exist or off-limits	Exists but messy or siloed	Accessible, governed, trustworthy
Ownership	IT-only, no business owner	Business owner, low engagement	Engaged owner who feels the pain
Risk fit	Zero tolerance for error	Errors need human review	Errors are cheap and recoverable
Sponsor	Curiosity, no budget	Budget but no decision power	Funded sponsor who decides

I total the five. A zero on any single dimension is a hard stop regardless of the total: a brilliant use case with no usable data is not a project, it is a data project pretending to be an AI one. Eight to ten and I will write a proposal with confidence. Five to seven and I will propose a paid discovery to resolve the unknowns, never a build. Below five I say no, and I say it plainly, because letting a doomed engagement start is the most expensive kindness a consultant can offer.
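The scoring rule can be written down in a few lines. This is a minimal sketch rather than the actual artefact: the `Scorecard` class, the `qualify` function, and the returned strings are hypothetical names chosen for illustration, while the thresholds and the hard-stop rule come from the text.

```python
from dataclasses import dataclass

DIMENSIONS = ("value", "data_readiness", "ownership", "risk_fit", "sponsor")


@dataclass(frozen=True)
class Scorecard:
    """One score per dimension: 0 (disqualify), 1 (caution) or 2 (strong)."""
    value: int
    data_readiness: int
    ownership: int
    risk_fit: int
    sponsor: int

    def scores(self) -> dict[str, int]:
        return {name: getattr(self, name) for name in DIMENSIONS}


def qualify(card: Scorecard) -> str:
    scores = card.scores()
    for name, score in scores.items():
        if score not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1 or 2, got {score}")

    # A zero anywhere is a hard stop, whatever the total.
    zeros = [name for name, score in scores.items() if score == 0]
    if zeros:
        return "hard stop: " + ", ".join(zeros)

    total = sum(scores.values())
    if total >= 8:
        return "proposal"
    return "paid discovery"  # totals of five to seven


print(qualify(Scorecard(2, 2, 2, 1, 1)))  # proposal
print(qualify(Scorecard(2, 0, 2, 2, 2)))  # hard stop: data_readiness
```

Because a zero is a hard stop, the lowest total that reaches the thresholds is five, so the sketch needs no separate below-five branch.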

Saying no well is a skill. I do not say “this is a bad idea”. I say “here is what would have to be true for this to work, and here is which of those things is missing today”. That reframes the no as a map. Half the time the customer comes back six months later with the missing piece solved, and now they trust me because I did not take their money the first time.

flowchart TD
  A[AI enquiry arrives] --> B{Real decision identified}
  B -- No --> X[Disqualify or reframe]
  B -- Yes --> C{Data exists and usable}
  C -- No --> D[Data readiness engagement first]
  C -- Yes --> E{Owner and sponsor present}
  E -- No --> X
  E -- Yes --> F{Risk tolerance fits errors}
  F -- No --> X
  F -- Yes --> G[Paid discovery workshop]
  G --> H[Use case scoring and architecture]
  H --> I{Score eight or above}
  I -- No --> J[Smaller pilot or stop]
  I -- Yes --> K[Defensible proposal]
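Read top to bottom, the flowchart is a sequence of gates, and the order matters: the cheapest disqualifiers come first. A rough sketch follows, with hypothetical function and parameter names; only the gate order and the outcomes are taken from the diagram.

```python
def triage(decision_identified: bool, data_usable: bool,
           owner_and_sponsor: bool, risk_fits: bool) -> str:
    """Walk the pre-discovery gates in the order the flowchart shows."""
    if not decision_identified:
        return "disqualify or reframe"
    if not data_usable:
        return "data readiness engagement first"
    if not owner_and_sponsor:
        return "disqualify or reframe"
    if not risk_fits:
        return "disqualify or reframe"
    return "paid discovery workshop"


def after_discovery(score: int) -> str:
    """Final gate, applied to the use case score produced in discovery."""
    return "defensible proposal" if score >= 8 else "smaller pilot or stop"


print(triage(True, False, True, True))  # data readiness engagement first
print(after_discovery(7))               # smaller pilot or stop
```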

The discovery workshop

Once an opportunity qualifies, discovery is where I earn the right to propose. I run it as a structured workshop, not a casual chat, because the structure is what surfaces the things nobody volunteers. I want the business owner in the room, a couple of the people who actually do the work, and someone who knows where the data lives. Half a day, sometimes a full day. It is paid, and saying so up front is itself a qualifier — people who will not fund discovery were never going to fund delivery.

I move through five things in order. First, process mapping: walk the real workflow end to end, the version that happens on a Tuesday when someone is off sick, not the tidy diagram in the quality manual. Second, where decisions are made: every step where a human judges, chooses, or interprets is a candidate for AI assistance and, equally, a candidate for AI to get dangerously wrong. Third, what data exists and whether it can be trusted: not just “do you have the documents” but are they current, are they contradictory, who updates them, and what happens when they are wrong. Fourth, what good looks like: I push hard for a number and an acceptance threshold, because a use case without a definition of success is a use case that never ends. Fifth, build versus buy: whether this capability is differentiating enough to build or whether a product already does it for less than our day rate.

The most valuable output of discovery is often a smaller, sharper problem than the one we walked in with. “Chat with all our documents” becomes “answer the twelve questions the support desk asks most, from these four documents, with a citation”. That narrowing is the work. It is also where I find the landmines — the data that turns out to live in someone’s personal mailbox, the decision that is actually governed by a regulation nobody mentioned, the “simple” classification that requires judgement no current model can reliably make.

The data readiness assessment

I treat data readiness as a gate, not a footnote, because it is the most common reason a confident proposal turns into a quiet disaster. The questions are unglamorous and that is the point.

data_readiness:
  exists:        true        # does the data physically exist
  accessible:    true        # can we reach it via api or export
  governed:      false       # is there a clear owner and lawful basis
  current:       true        # is it kept up to date, or stale
  structured:    partial     # structured, semi, or free text
  trustworthy:   unknown     # is it correct, or full of contradictions
  volume:        sufficient  # enough to be useful, not so much it is noise
  sensitivity:   high        # pii, commercial, regulated

Any false or unknown on governed, trustworthy, or accessible stops the build conversation and starts a data conversation instead. I have learned the hard way that a retrieval system over untrustworthy source data does not fail loudly. It fails politely, with fluent, well-formatted, completely wrong answers, and by the time anyone notices, trust in the whole project is gone. This is the same instinct I bring to repeatable customer health checks — measure the ground truth before you promise anything built on top of it.
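The gate itself is simple enough to check mechanically. The sketch below assumes the assessment has already been loaded into a plain dictionary with the same keys as the block above; `blocking_fields` is a hypothetical helper, and treating a missing field as unknown is an assumption rather than something the assessment specifies.

```python
GATE_FIELDS = ("governed", "trustworthy", "accessible")


def blocking_fields(assessment: dict) -> list[str]:
    """Return the gate fields answered false or unknown."""
    blocked = []
    for field in GATE_FIELDS:
        # Assumption: a field nobody filled in counts as unknown.
        answer = assessment.get(field, "unknown")
        if answer is False or answer == "unknown":
            blocked.append(field)
    return blocked


assessment = {
    "exists": True,
    "accessible": True,
    "governed": False,
    "current": True,
    "structured": "partial",
    "trustworthy": "unknown",
    "volume": "sufficient",
    "sensitivity": "high",
}

print(blocking_fields(assessment))  # ['governed', 'trustworthy']
```

An empty list means the build conversation can continue; anything else names the data conversation that has to happen first.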

Build, buy, or integrate

The next decision is whether we build anything at all. My default bias is simple: do not build what you can buy, because bespoke AI carries a maintenance tail most customers underestimate. I frame it as three options. Buy when a mature product already does the job and the use case is not a competitive differentiator; most “summarise my tickets” and “draft my emails” requests are already features in tools the customer owns. Integrate when the value is in connecting existing capabilities to the customer’s specific data and workflow, which is the sweet spot for most engagements. Build only when the use case is genuinely differentiating, the data is proprietary, and no product fits. That is the rarest case, and the one people reach for first.
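As a rough illustration, the three options reduce to a short decision function. The name `sourcing_option` and its boolean inputs are hypothetical, and falling back to "integrate" when neither of the other two cases applies is an assumption based on it being described as the sweet spot.

```python
def sourcing_option(product_exists: bool, differentiating: bool,
                    proprietary_data: bool) -> str:
    """Pick buy, build, or integrate from three yes/no answers."""
    # Buy: a mature product does the job and the use case is not a differentiator.
    if product_exists and not differentiating:
        return "buy"
    # Build: only when all three build conditions hold at once.
    if differentiating and proprietary_data and not product_exists:
        return "build"
    # Everything else: connect existing capabilities to the customer's data.
    return "integrate"


print(sourcing_option(product_exists=True, differentiating=False, proprietary_data=False))  # buy
print(sourcing_option(product_exists=False, differentiating=True, proprietary_data=True))   # build
print(sourcing_option(product_exists=True, differentiating=True, proprietary_data=True))    # integrate
```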

The honest version of this conversation often loses me the bigger project and wins me a customer for life. Telling someone their idea is already a checkbox in their Microsoft 365 licence is not a lost sale. It is the reason they call you for the hard thing next year.

The architecture pattern catalogue

When we do build or integrate, I reach for a small catalogue of patterns and resist the urge to over-engineer. The mistake I see most often is reaching for an autonomous agent when an API call would do, because agents demo well and feel modern. They are also the hardest thing to make reliable.

Pattern	Use when	Cost and risk	Avoid when
Just an API	The model needs no private context, task is one shot	Lowest, mostly prompt design	You need grounding in private data
RAG	Answers must be grounded in a known document set	Moderate, retrieval quality is the work	The knowledge is procedural not factual
Fine-tune	You need a consistent style or format at scale	High, needs data and re-training	Facts change often, RAG is cheaper
Agent or tool-calling	The task spans multiple steps and systems	Highest, reliability is hard	A deterministic workflow would do

Most of the value I deliver is plain retrieval-augmented generation over a curated, governed document set, with citations, and a human in the loop. It is unfashionable and it works. I reach for fine-tuning rarely, because the moment the underlying facts change, a fine-tuned model is confidently out of date and RAG is not. I reach for agents only when the workflow genuinely needs to plan and act across systems, and even then I bound them tightly with deterministic tools rather than letting them improvise. I build mine on Ollama and n8n precisely so I can keep the clever bits small and the plumbing inspectable. This is also why I keep arguing that AI is becoming infrastructure: the durable engineering is in retrieval, data pipelines, and orchestration, not in the model itself, which is increasingly a commodity you swap out.
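One way to keep the catalogue honest is to force the choice through explicit questions. The sketch below is an illustration, not a rule from the table: the `UseCase` fields and the order in which they are checked are assumptions, and real cases will need judgement the function cannot encode.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_private_context: bool
    spans_multiple_systems: bool
    deterministic_workflow_suffices: bool
    needs_consistent_style_at_scale: bool
    facts_change_often: bool


def choose_pattern(uc: UseCase) -> str:
    """Suggest the cheapest pattern from the catalogue that fits the use case."""
    if uc.spans_multiple_systems:
        # Agents are the last resort: prefer a deterministic workflow if one would do.
        if uc.deterministic_workflow_suffices:
            return "deterministic workflow"
        return "agent or tool-calling"
    if uc.needs_private_context:
        return "RAG"
    # Fine-tuning only pays off when the underlying facts are stable.
    if uc.needs_consistent_style_at_scale and not uc.facts_change_often:
        return "fine-tune"
    return "just an API"


support_answers = UseCase(
    needs_private_context=True,
    spans_multiple_systems=False,
    deterministic_workflow_suffices=False,
    needs_consistent_style_at_scale=False,
    facts_change_often=True,
)
print(choose_pattern(support_answers))  # RAG
```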

Risk, governance, and the boring checklist that saves you

Before anything goes in a proposal I run a governance pass, because the questions that sink AI projects in regulated organisations are never technical. Where does the data go, and does that cross a boundary the customer’s legal team would object to? What is the lawful basis for using this data this way? How do we handle a wrong answer — is there a human review step, an audit trail, a way to trace why the system said what it said? Who is accountable when it fails, and have they agreed to be? What is the fallback when the model or the provider is unavailable?

I keep this as a literal checklist and I do not skip it because the customer is enthusiastic. Enthusiasm is exactly when the boring questions get waved away, and exactly when they matter most. A consultant who raises governance early looks slower than the demo merchant in the next meeting. They also look a great deal smarter eighteen months later.
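A literal checklist can be as small as a dictionary of questions and a function that reports which ones still lack a written answer. This is a sketch under assumptions: the keys, the `open_questions` helper, and the sample answers are made up for illustration, while the questions paraphrase the governance pass described above.

```python
GOVERNANCE_QUESTIONS = {
    "data_boundary": "Where does the data go, and does it cross a boundary legal would object to?",
    "lawful_basis": "What is the lawful basis for using this data this way?",
    "wrong_answers": "How is a wrong answer handled: human review, audit trail, traceability?",
    "accountability": "Who is accountable when it fails, and have they agreed to be?",
    "fallback": "What is the fallback when the model or provider is unavailable?",
}


def open_questions(answers: dict[str, str]) -> list[str]:
    """Return the keys of questions with no written answer yet."""
    return [key for key in GOVERNANCE_QUESTIONS if not answers.get(key, "").strip()]


# Invented sample answers, for illustration only.
answers = {
    "data_boundary": "Stays in the customer's own tenant",
    "lawful_basis": "",
    "accountability": "Head of support, agreed in the workshop",
}

print(open_questions(answers))  # ['lawful_basis', 'wrong_answers', 'fallback']
```

Anything in the returned list stays out of the proposal until someone has written the answer down.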

From discovery to a defensible proposal

Everything above feeds one output: a proposal that survives scrutiny. By defensible I mean every claim in it traces back to something we established in discovery — the value to the scored metric, the architecture to the data assessment, the timeline to the genuine unknowns rather than to optimism. It states what we are not doing as clearly as what we are, because uncontrolled scope is the quiet killer. And it sets expectations honestly: where the risks are, what could push the timeline, and what “good enough to go live” actually means.

The handover from a signed proposal to a delivery team is its own discipline, and I have written about that journey from proposal to production — the short version is that a proposal which hides the hard parts to win the deal simply moves the failure to the delivery team and burns the relationship anyway. A defensible proposal is one I would be happy to be held to in twelve months, which is a much higher bar than one that wins the meeting.

What I have got wrong

I have made every mistake in this article at least once. I have let a project start that I knew in my gut was a science project, because the customer was keen and the number was big, and it ended exactly as my gut predicted — a beautiful proof of concept that could never survive contact with real data. I have under-weighted governance because the use case was exciting, and watched it stall in legal for three months. I have proposed an agent where a scheduled script would have been more reliable and a tenth of the cost, because the agent was more fun to build.

The pattern is the same each time: I let enthusiasm override the discipline of the frameworks. The frameworks are not there for the easy decisions. They are there to hold the line when everyone in the room, including me, wants to skip to the fun part.

Where this goes next

The toolkit keeps evolving. I am turning the qualification scorecard and the data readiness assessment into actual artefacts — structured templates that feed a workflow rather than living in my head and a slide. There is a natural project in connecting them to the same retrieval-and-automation stack I use for everything else, so that discovery notes become searchable, comparable knowledge across engagements rather than documents that die in a folder. That is the same instinct that drives my whole approach to building knowledge instead of documents, and it is where I think presales is heading anyway, as I argued in the future of technical presales: less polished pitching, more genuine engineering judgement applied early and honestly.

The thing I will not automate is the no. Disqualifying a bad opportunity depends on reading a room, hearing what a sponsor does not say, and having the credibility to be believed when you tell someone their favourite idea will not work. That credibility comes from having actually built the thing. It is the reason I keep running models on my own hardware, breaking my own pipelines, and writing it all down — not because the homelab is the job, but because it is the thing that lets me sit across a table and tell the difference between what is possible and what merely demos well.

The best AI consulting I do looks, from the outside, like talking people out of things. That is not a failure of ambition. It is the whole value.