Building Repeatable Customer Health Checks

Most of my career has a recurring scene in it. A customer asks for a “health check” of something — their Microsoft 365 tenant, their VMware estate, their backup posture, their Citrix farm — and a good engineer spends two or three days clicking through admin consoles, copying screenshots into a Word document, and writing the same observations they wrote for the last customer with the names changed. The customer pays a day rate for someone to look at a screen and type. The output is a forty-page document that is already going stale by the time it lands in their inbox.

I have done this dozens of times. I have also been the person who reviewed those documents, and the uncomfortable truth is that the quality varied wildly depending on who ran it. One engineer would catch a missing Conditional Access policy; another would miss it but write three paragraphs about mailbox sizes nobody asked about. Same service, same price, completely different value. That is not a delivery problem. That is a design problem.

This article is about the pattern I have been building to fix it. The Microsoft 365 AI health check is one concrete instance of it, but the interesting thing was never the M365 part. It was realising that almost every infrastructure review is the same shape underneath, and that the shape can be turned into a product instead of a heroic effort.

Why the manual way is broken

A manual health check fails on four fronts at once, and they compound.

It is slow. Days of an experienced engineer’s time go into navigating interfaces, and most of that time is data gathering, not thinking. The judgement — the bit the customer is actually paying a senior person for — is maybe ten percent of the effort. The rest is clerical.

It is inconsistent. The findings depend on which consoles the engineer remembered to open and what they happened to notice. There is no guarantee that two reviews of identical tenants produce the same report, which means the report tells you as much about the reviewer as about the environment.

It goes stale instantly. A screenshot of a Conditional Access policy is true for exactly as long as nobody changes it. The document is a photograph of a moving thing. A month later it is fiction, and nobody re-runs it because re-running it means another three days.

And it does not scale. The only way to do more health checks is to consume more engineer-days, which is exactly the resource you are short of. The economics are linear and the margin is thin, because you are essentially reselling a person’s afternoon.

The customer is not paying for someone to click through consoles. They are paying for an opinion they can trust. Everything between those two things is waste.

Once I framed it that way, the goal became obvious. Collect the facts by machine, deterministically, every time. Spend the human — and the AI — only on the judgement and the narrative. Make the whole thing re-runnable so that “what changed since last quarter” is a button, not a project.

Design principles I settled on

The first and most important decision was to separate fact-collection from judgement, completely and structurally. These are different kinds of work with different failure modes, and mixing them is what makes manual reviews unreliable.

Facts are collected programmatically, through APIs, never through a human reading a screen. If Microsoft Graph can tell me whether security defaults are enabled, I do not want an engineer eyeballing a toggle. The collector asks the API and records the answer. This is the same argument I made in “why every infrastructure engineer should learn Python”: the API is the source of truth, and driving it by hand is the slow, unauditable path.

Judgement on those facts is deterministic wherever it can be. “MFA is not enforced for admins” is not an opinion; it is a rule applied to a fact. Rules like that belong in code, where they are explicit, testable, and identical for every customer. I do not want a language model deciding whether something is a finding. I want it deciding how to explain a finding that the rules have already established.
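To make “a rule applied to a fact” concrete, here is a minimal sketch of the admin-MFA rule as a pure function. It assumes the policies arrive as dictionaries shaped like the Microsoft Graph Conditional Access policy resource (`state`, `conditions.users.includeRoles`, `grantControls.builtInControls`). The function name and the placeholder role IDs are hypothetical, and a production rule would also need to account for exclusions and alternative grant controls.

```python
def admin_mfa_enforced(policies: list[dict], admin_role_ids: set[str]) -> bool:
    """True only if every admin role is covered by an enabled MFA policy."""
    covered: set[str] = set()
    for policy in policies:
        if policy.get("state") != "enabled":
            continue  # disabled and report-only policies protect nobody
        grant = policy.get("grantControls") or {}
        if "mfa" not in (grant.get("builtInControls") or []):
            continue  # policy is enforced but does not demand MFA
        users = (policy.get("conditions") or {}).get("users") or {}
        covered.update(users.get("includeRoles") or [])
    return admin_role_ids <= covered


# Placeholder role IDs for illustration; real tenants use role template GUIDs.
ADMIN_ROLES = {"role-global-admin", "role-security-admin"}

policies = [
    {
        "state": "enabled",
        "conditions": {"users": {"includeRoles": ["role-global-admin"]}},
        "grantControls": {"builtInControls": ["mfa"]},
    },
    {
        # Report-only: evaluated and logged, but not enforced.
        "state": "enabledForReportingButNotEnforced",
        "conditions": {"users": {"includeRoles": ["role-security-admin"]}},
        "grantControls": {"builtInControls": ["mfa"]},
    },
]

# False: the security admin role is only covered in report-only mode.
print(admin_mfa_enforced(policies, ADMIN_ROLES))
```

Because the function takes plain data and returns a boolean, it can be unit-tested with hand-written fixtures and will give the same verdict for every customer.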

That leaves a clear division of labour. Deterministic rules decide what is true and what is wrong. The AI decides how to say it — the narrative, the prioritisation, the readability, turning a list of thirty raw findings into a report a busy IT manager will actually read and act on. The model is a writer, not an auditor.

Output is templated, so quality does not depend on who ran the tool. The branding, the section structure, the executive summary format — all fixed. The engineer running it this week and the engineer running it next month produce documents that look and read the same, because the document is generated, not authored.

And everything is versioned and re-runnable. Each run is stored against the customer with a timestamp. Because the data model is stable between runs, I can diff them and show drift — the policy that was disabled last month, the new licences that appeared, the Secure Score that quietly dropped. That diff is something a manual process simply cannot produce, and it is often the most valuable page in the report.
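The drift comparison can be sketched in a few lines, assuming each stored run can be flattened into a mapping of fact key to value. The key names and values below are invented for illustration and are not the real schema.

```python
def diff_runs(previous: dict, current: dict) -> dict:
    """Compare two flat {fact_key: value} snapshots and report drift."""
    added = {k: current[k] for k in current.keys() - previous.keys()}
    removed = {k: previous[k] for k in previous.keys() - current.keys()}
    changed = {
        k: (previous[k], current[k])  # (old value, new value)
        for k in previous.keys() & current.keys()
        if previous[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots from two quarterly runs.
last_quarter = {
    "m365.ca.policy.admin_mfa.state": "enabled",
    "m365.secure_score": 71,
}
this_quarter = {
    "m365.ca.policy.admin_mfa.state": "disabled",
    "m365.secure_score": 62,
    "m365.licences.E5": 25,
}

drift = diff_runs(last_quarter, this_quarter)
for key, (old, new) in sorted(drift["changed"].items()):
    print(f"{key}: {old} -> {new}")
```

This only works because the keys mean the same thing in both runs, which is the practical payoff of keeping the data model stable.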

The reusable framework

The architecture is deliberately boring, because boring is what makes it reusable. It is a pipeline of five stages, and the whole point is that only the first stage knows or cares which product is being checked.

flowchart TD
    A[Collectors] --> B[Normalised data model]
    B --> C[Rules and scoring engine]
    C --> D[AI report writer]
    D --> E[Branded output]
    F[Check definitions] --> C
    G[Local LLM via Ollama] --> D
    B --> H[Historical store]
    H --> C

Collectors are the only product-specific code. An M365 collector talks to Microsoft Graph. A VMware collector talks to vCenter via its REST API or PowerCLI. A Veeam collector hits the Veeam REST API. A Citrix collector queries the CVAD or DaaS APIs. Each one has a single job: pull raw configuration and posture data and hand it onward. Nothing in a collector makes a judgement. It does not decide what is good or bad. It just gathers.
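As an illustration of how little a collector needs to do, here is a sketch of an M365 collector. The two Graph paths are real v1.0 endpoints; everything else (the function names, the injected `fetch_json` callable, and the output keys) is an assumption made for the example. Injecting the HTTP call keeps authentication out of the sketch and lets the collector be exercised without a tenant.

```python
from typing import Any, Callable

# Hypothetical type: any function that takes a URL and returns parsed JSON.
FetchJson = Callable[[str], dict[str, Any]]

GRAPH = "https://graph.microsoft.com/v1.0"


def collect_m365(fetch_json: FetchJson) -> dict[str, Any]:
    """Gather raw facts only; no pass/fail judgement happens here."""
    defaults = fetch_json(f"{GRAPH}/policies/identitySecurityDefaultsEnforcementPolicy")
    policies = fetch_json(f"{GRAPH}/identity/conditionalAccess/policies")
    return {
        "graph.security_defaults.enabled": defaults.get("isEnabled"),
        "graph.conditional_access.policies": policies.get("value", []),
    }


def fake_fetch(url: str) -> dict[str, Any]:
    """Stand-in for an authenticated Graph client, returning canned data."""
    if url.endswith("identitySecurityDefaultsEnforcementPolicy"):
        return {"isEnabled": False}
    return {"value": [{"displayName": "Require MFA for admins", "state": "enabled"}]}


raw = collect_m365(fake_fetch)
print(sorted(raw))
```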

The normalised data model is the heart of the whole thing, and I will say more about why below. Every collector writes into the same structured shape — entities, properties, relationships, a score-able set of facts — regardless of which platform produced them. A “control” from M365 and a “control” from Veeam land in the same schema. This is what lets the rest of the pipeline stay product-agnostic.
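A heavily reduced sketch of such a model follows. It flattens everything to a mapping from a dotted source path to a value, which is far simpler than the entities and relationships described above, and the class name, field names and example paths are all hypothetical. The one method it shares with the engine code later in this article is `resolve`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataModel:
    """One product-agnostic store of facts, keyed by dotted source path."""

    customer: str
    collected_at: str
    facts: dict[str, Any] = field(default_factory=dict)

    def record(self, source: str, value: Any) -> None:
        """Called by collectors; the model does not care which product wrote it."""
        self.facts[source] = value

    def resolve(self, source: str) -> Any:
        """Called by the engine; fail loudly rather than let a check pass on no data."""
        if source not in self.facts:
            raise KeyError(f"no collector populated {source!r}")
        return self.facts[source]


model = DataModel(customer="Example Ltd", collected_at="2024-01-01T00:00:00Z")
model.record("graph.security_defaults.enabled", False)
model.record("veeam.jobs.last_result", {"Nightly-VMs": "Success"})

print(model.resolve("graph.security_defaults.enabled"))
```

Raising on a missing path is a deliberate choice in this sketch: a check that silently evaluates against nothing would look like a pass.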

The rules and scoring engine reads the normalised data and the check definitions, and produces findings. A check definition is a small, declarative description of one thing worth evaluating: what to look at, what counts as a pass, how severe a failure is, and which category it rolls up into. Rules are deterministic and unit-tested. The scoring engine aggregates findings into category scores and an overall posture.
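The aggregation step could look something like the sketch below. The formula, a weighted pass rate per category expressed as a percentage, is an assumption chosen for illustration; the article does not specify how scores are actually calculated. Findings are shown as plain dictionaries to keep the example self-contained.

```python
from collections import defaultdict


def score_by_category(findings: list[dict]) -> dict[str, float]:
    """Weighted pass rate per category, as a percentage (assumed formula)."""
    earned: dict[str, float] = defaultdict(float)
    possible: dict[str, float] = defaultdict(float)
    for finding in findings:
        possible[finding["category"]] += finding["weight"]
        if finding["passed"]:
            earned[finding["category"]] += finding["weight"]
    # Skip categories whose checks all carry zero weight to avoid dividing by zero.
    return {
        category: round(100 * earned[category] / possible[category], 1)
        for category in possible
        if possible[category]
    }


# Illustrative findings; categories and weights are made up.
findings = [
    {"category": "identity", "weight": 10, "passed": False},
    {"category": "identity", "weight": 5, "passed": True},
    {"category": "backup", "weight": 4, "passed": True},
]

scores = score_by_category(findings)
print(scores)
```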

The AI report writer takes the findings — never the raw data — and writes the human-facing document: an executive summary, prioritised recommendations, and readable explanations of each finding. It runs against a local model through Ollama, for reasons I will get to.

Branded output renders the writer’s prose into the deliverable: a templated document with the consultancy’s styling, consistent every time.

A check definition looks like this

The unit of work is the check definition, and keeping it small and declarative is what makes the system extensible. Here is roughly the shape I use, in YAML:

- id: m365.identity.admin_mfa
  title: MFA enforced for privileged roles
  product: m365
  category: identity
  severity: critical
  source: graph.conditional_access.policies
  rule: >
    any(policy.state == "enabled"
        and "All" in policy.conditions.roles
        and "mfa" in policy.grant_controls
        for policy in policies)
  pass_message: All privileged roles require multi-factor authentication.
  fail_message: One or more admin roles can sign in without MFA.
  remediation: >
    Create a Conditional Access policy targeting privileged
    directory roles that requires multi-factor authentication.
  weight: 10

And the engine that consumes it stays tiny and generic, because all the product knowledge lives in the definition and the collector, not in the loop:

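def evaluate(check, data_model):
    """Apply one check definition to the normalised data and return a Finding."""
    # Finding and run_rule are defined elsewhere in the engine.
    # Look up the facts the check points at, e.g. "graph.conditional_access.policies".
    facts = data_model.resolve(check["source"])
    # The rule is deterministic: same facts in, same verdict out.
    passed = run_rule(check["rule"], facts)
    return Finding(
        check_id=check["id"],
        category=check["category"],
        severity=check["severity"],
        passed=passed,
        message=check["pass_message"] if passed else check["fail_message"],
        # Remediation advice only applies to failures.
        remediation=None if passed else check["remediation"],
        weight=check["weight"],
    )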

Adding a new check is adding a YAML block. Adding a new product is writing one collector that fills the data model, then writing check definitions against it. The M365 instance and a hypothetical Veeam instance share every line of the engine, the writer and the renderer. That is the leverage. It is the same instinct, turning a one-off engagement into a system, that I wrote about in “from proposal to production”: the value is in the repeatable machinery, not the heroic delivery.
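If adding a check really is just adding a block of YAML, it is worth validating each definition as it is loaded so that a typo cannot reach the engine. The sketch below assumes the definition has already been parsed into a dictionary. The required fields are taken from the example definition above, while the severity scale and the ID-prefix convention are assumptions.

```python
REQUIRED_FIELDS = {
    "id", "title", "product", "category", "severity", "source",
    "rule", "pass_message", "fail_message", "remediation", "weight",
}
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}  # assumed scale


def validate_check(check: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition is usable."""
    problems = []
    missing = sorted(REQUIRED_FIELDS - set(check))
    if missing:
        problems.append(f"missing fields: {', '.join(missing)}")
    if check.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {check.get('severity')!r}")
    weight = check.get("weight")
    if not isinstance(weight, (int, float)) or weight <= 0:
        problems.append("weight must be a positive number")
    # Assumed convention: ids are namespaced by product, e.g. "m365.identity.admin_mfa".
    product = check.get("product")
    if product and not str(check.get("id", "")).startswith(f"{product}."):
        problems.append("id should start with the product prefix")
    return problems


good = {
    "id": "m365.identity.admin_mfa",
    "title": "MFA enforced for privileged roles",
    "product": "m365",
    "category": "identity",
    "severity": "critical",
    "source": "graph.conditional_access.policies",
    "rule": "...",
    "pass_message": "All privileged roles require multi-factor authentication.",
    "fail_message": "One or more admin roles can sign in without MFA.",
    "remediation": "Create a Conditional Access policy requiring MFA.",
    "weight": 10,
}
bad = {"id": "veeam.jobs.encryption", "product": "m365", "severity": "urgent"}

print(validate_check(good))
print(validate_check(bad))
```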

Grounding the AI so it cannot invent findings

The single biggest risk in putting a language model anywhere near a customer-facing audit document is that it confidently writes something that is not true. A report that invents a finding is worse than no report, because it destroys trust in the whole exercise. So the writer is constrained hard.

The model never sees the customer’s raw data and is never asked to discover problems. It receives a structured list of findings that the deterministic engine has already decided are true, and its instructions are explicit: write only about these findings, do not introduce new ones, do not speculate about anything not present in the input. Every claim in the prose must trace back to a finding object with an ID.

In practice this means the prompt carries the findings as data and the model’s job is transformation, not generation of fact. If a finding is not in the list, the model has nothing to say about it, and the template will not have a slot for it. I also do a cheap post-check: the rendered report’s claims are cross-referenced against the finding IDs, and anything that does not map gets flagged. This is the same discipline that separates the AI projects that survive contact with reality from the ones that do not, which is most of them, a theme I went into in “why most AI projects fail”. The model is a writer working from a brief. It is not allowed to do research.
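One way to implement that brief and post-check is sketched below. It assumes a convention the article does not describe: the model is told to cite each finding ID in square brackets, which makes the cross-reference a simple pattern match. The prompt wording, function names and sample report are all illustrative.

```python
import json
import re

# Matches dotted ids in square brackets, e.g. [m365.identity.admin_mfa].
ID_PATTERN = re.compile(r"\[([a-z0-9_]+(?:\.[a-z0-9_]+)+)\]")


def build_prompt(findings: list[dict]) -> str:
    """Carry the findings as data; the model only rephrases them."""
    return (
        "Write a report section using ONLY the findings below.\n"
        "Do not add findings. Do not speculate beyond the input.\n"
        "Cite the finding id in square brackets after every claim.\n\n"
        f"FINDINGS:\n{json.dumps(findings, indent=2)}\n"
    )


def ungrounded_ids(report: str, findings: list[dict]) -> set[str]:
    """Ids cited in the prose that the rules engine never produced."""
    known = {f["check_id"] for f in findings}
    return set(ID_PATTERN.findall(report)) - known


def uncited_paragraphs(report: str) -> list[str]:
    """Paragraphs that make claims without pointing at any finding."""
    paragraphs = [p.strip() for p in report.split("\n\n") if p.strip()]
    return [p for p in paragraphs if not ID_PATTERN.search(p)]


findings = [{
    "check_id": "m365.identity.admin_mfa",
    "severity": "critical",
    "message": "One or more admin roles can sign in without MFA.",
}]
report = (
    "Admin accounts can sign in without MFA [m365.identity.admin_mfa].\n\n"
    "Legacy authentication is still enabled [m365.identity.legacy_auth].\n\n"
    "Backups look unhealthy."
)

print(ungrounded_ids(report, findings))
print(uncited_paragraphs(report))
```

In this example the second paragraph cites an ID the engine never emitted and the third cites nothing at all, so both would be flagged for a human to review before the report goes out.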

Making it safe enough to point at a customer

You cannot build a tool that hoovers up a customer’s tenant configuration without taking the security of the tool itself seriously. Three things matter.

It is read-only. Every collector authenticates with credentials scoped to read, never write. The M365 app registration requests read permissions on Graph and nothing else. There is no code path in any collector that modifies the customer environment, because there is no business reason for one and every reason against.

It uses least privilege. Rather than a single god-mode account, each collector gets exactly the permissions it needs to read exactly what it reads. If a customer wants to review the consent grant before they approve it, it should be short and obviously harmless.
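The read-only and least-privilege rules can also be enforced by the tool itself before it ever requests consent. The sketch below is a deliberately conservative guard over Microsoft Graph permission names: it flags anything containing `ReadWrite` and anything without a plain `Read` segment. The function name is hypothetical, and the list of scopes is an example rather than the tool’s actual permission set.

```python
def non_read_only(scopes: list[str]) -> list[str]:
    """Return any requested permission that is not plainly read-only."""
    flagged = []
    for scope in scopes:
        parts = scope.split(".")
        # Conservative: anything without an explicit "Read" segment is flagged,
        # so a reviewer has to allow it deliberately.
        if "ReadWrite" in parts or "Read" not in parts:
            flagged.append(scope)
    return flagged


# Example scopes only; a real collector would list exactly what it reads.
requested = ["Policy.Read.All", "Directory.Read.All", "Reports.Read.All"]

print(non_read_only(requested))
print(non_read_only(requested + ["Directory.ReadWrite.All", "Mail.Send"]))
```

Run as part of a build or start-up check, a guard like this turns “there is no code path that writes” from a promise into something that fails loudly if someone adds a write scope.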

And the data goes nowhere it should not. The findings are written up by a local model running on my own hardware through Ollama, on the GPU box I built for exactly this kind of work: a single RTX 3090 with 24 GB of VRAM, sized for VRAM-per-pound rather than raw speed. A customer’s identity and infrastructure posture is precisely the sort of data you do not want flowing into a third-party API to be summarised. Keeping the model local means the sensitive part of the pipeline never leaves a machine I control. That is not a performance decision; it is a trust decision, and it is one of the strongest arguments for running your own models that I make when customers ask why I bothered.
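For reference, calling a local model through Ollama needs nothing beyond the standard library. The sketch below targets Ollama’s `/api/generate` endpoint on its default local port. The model name is a placeholder, since the article does not say which model is used, and the function names are hypothetical.

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default, so the prompt never leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"
DEFAULT_MODEL = "llama3.1:8b"  # placeholder; substitute whichever model is installed


def build_request(prompt: str, model: str = DEFAULT_MODEL) -> urllib.request.Request:
    """Build the POST request without sending it, so it can be inspected or tested."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=300) as resp:
        return json.loads(resp.read())["response"]


request = build_request("Summarise these findings: ...")
print(request.full_url)
```

Keeping the URL as a constant pointing at localhost makes the trust boundary visible in code review: any change that sends findings elsewhere has to touch that line.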

The business case

The reason I care about this is not technical elegance. It is margin and consistency.

A manual health check is a bespoke services effort: linear cost, thin margin, quality that depends on the individual. The automated version is a product. The expensive engineering happens once, in building the collectors and the data model and the check library. After that, running a check for a new customer costs minutes of compute and a short review, not days of senior time. The same deliverable, produced at a fraction of the cost, at higher and more predictable quality.

That changes what the engagement is. The engineer is freed from data-gathering and spends their time on the part that genuinely needs a human: interpreting the findings in the customer’s context, having the conversation, deciding what actually matters for this organisation this quarter. That is where presales value lives, and it is exactly the shift I argued for in “the future of technical presales” and stocked the toolbox for in “building an AI consultancy toolkit”. The tool does the clerical work. The human does the judgement. Both are now doing the thing they are good at.

It also turns a one-off into a relationship. Because the check is re-runnable and shows drift, it naturally becomes a recurring service — a quarterly posture review that gets more valuable each time because the history accumulates. That is a far better business than selling the same forty-page snapshot to a new customer every week.

What I got wrong, and what I learnt

The data model is the hard part. I went in thinking the collectors would be the difficult bit — wrangling Graph, vCenter, Veeam, all their quirks. They are fiddly but tractable. The genuinely hard, genuinely intellectual work is designing a normalised shape that an M365 control and a VMware setting and a backup policy can all live in without being mangled. Get that right and everything downstream is easy and shared. Get it wrong and you end up with product-specific code leaking into the engine and the writer, and you are back to building a separate tool for every platform. I rebuilt the data model twice. I should have spent longer on it before writing a single collector.

Customers trust consistency more than they trust brilliance. I expected pushback on the reports being “machine-generated”. The opposite happened. The fact that the same check produces the same structured assessment every time, that two environments can be compared on the same scale, that last quarter’s report and this quarter’s are directly comparable — that is what built confidence. A slightly less eloquent report that is rigorously consistent beats a beautifully written one-off that nobody can reproduce.

The AI is a writer, not an auditor, and the moment I forgot that I got burned. An early version let the model reason a little more freely about the data, and it produced lovely prose containing a finding that was simply not true — it had pattern-matched a plausible-sounding problem. Nobody shipped it, but it was the wake-up call that hardened the grounding. The model’s freedom ends at how to phrase what the rules already decided.

Where this goes next

The obvious next step is making the checks continuous rather than episodic. The whole pipeline is already re-runnable, so scheduling it is mostly plumbing — the same n8n orchestration spine that runs my other automations can trigger a check on a cadence, store the result, and alert when a score drops or a new critical finding appears. A health check that runs itself every night and only speaks up when something changes is a different and better product than a document.
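The “only speaks up when something changes” logic is small enough to sketch. The version below assumes each stored run can be summarised as an overall score plus a list of failing critical check IDs. That summary shape, the function name and the five-point threshold are all assumptions for the example; the scheduler that calls it is out of scope.

```python
def needs_alert(previous: dict, current: dict, max_drop: float = 5.0) -> list[str]:
    """Return human-readable reasons to alert; an empty list means stay quiet."""
    reasons = []
    drop = previous["score"] - current["score"]
    if drop >= max_drop:
        reasons.append(f"score dropped {drop:.1f} points")
    already_known = set(previous["critical_failures"])
    for check_id in sorted(set(current["critical_failures"]) - already_known):
        reasons.append(f"new critical finding: {check_id}")
    return reasons


# Hypothetical run summaries from two consecutive nights.
last_night = {"score": 71.0, "critical_failures": []}
tonight = {"score": 62.0, "critical_failures": ["m365.identity.admin_mfa"]}

print(needs_alert(last_night, tonight))
print(needs_alert(tonight, tonight))
```

Comparing against the previous run rather than a fixed baseline means a known, accepted critical finding does not raise the same alert every night.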

After that, benchmarking. Once you have run the same check across many environments, you can tell a customer not just “your Secure Score is 62” but “that sits in the bottom quartile for organisations your size”. Anonymised, aggregated, careful with the data — but genuinely useful context that no single review can offer.
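The arithmetic behind a statement like “bottom quartile” is a percentile rank against a peer group. A minimal sketch follows, using made-up peer scores; how peers are selected and anonymised is a separate problem that this example does not address.

```python
from bisect import bisect_left


def percentile_rank(score: float, peer_scores: list[float]) -> float:
    """Percentage of peers scoring strictly below the given score."""
    if not peer_scores:
        raise ValueError("need at least one peer score")
    ranked = sorted(peer_scores)
    return 100.0 * bisect_left(ranked, score) / len(ranked)


def quartile_label(rank: float) -> str:
    """Turn a percentile rank into the phrase used in the report."""
    if rank < 25:
        return "bottom quartile"
    if rank < 50:
        return "second quartile"
    if rank < 75:
        return "third quartile"
    return "top quartile"


# Illustrative peer scores only; not real benchmark data.
peers = [55, 64, 66, 70, 72, 75, 78, 81, 85, 90]

rank = percentile_rank(62, peers)
print(rank, quartile_label(rank))
```

With these invented numbers, only one of the ten peers scores below 62, which gives a rank of 10.0 and the label “bottom quartile”.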

And eventually, self-service. A portal where a customer connects their own tenant under a read-only consent and runs the check themselves, gets the report, and watches their posture over time. The engineer steps in for the conversation, not the click-through. That is the end state I am building towards: the tool handles the facts and the writing, and the human is reserved entirely for judgement.

A closing thought

The thing I keep coming back to is that the health check was never really a document. The document was just the artefact we produced because producing it by hand was all we knew how to do. What the customer actually wants is a trustworthy, current, comparable opinion about whether their environment is in good shape — and the manual process was a slow, expensive, inconsistent way of approximating that.

Once you collect the facts by machine, decide the findings by rule, and use the model only to make it readable, the document stops being the work. It becomes a side effect. And the engineer who used to spend three days producing it gets to spend those three days being the thing a machine cannot be: someone with an opinion worth paying for.