Microsoft 365 AI Health Check

How I built an automated Microsoft 365 health assessment with Graph, n8n and a local LLM that turns raw tenant findings into a prioritised report.

A Microsoft 365 health check, done properly, is a few days of tedious clicking followed by an evening of writing it up. I have done enough of them to resent both halves.

The clicking is the same every time. You log into the admin centre, the Entra portal, the Security portal, the compliance portal, the Exchange admin centre, and you copy posture out of a dozen blades into a spreadsheet. The writing-up is where it gets worse, because two engineers looking at the same tenant will produce two different reports, with different priorities, different tone, and different things quietly forgotten. The output quality depends entirely on who happened to pick up the job and how tired they were when they wrote it.

This bothered me for a long time before I did anything about it. A health check is, fundamentally, a repeatable thing. The facts come from an API. The judgement is mostly pattern matching against the same set of known-good positions. So I built an assistant that collects the facts deterministically and writes the report consistently, running entirely on infrastructure I own. This article is how it works, what I got wrong, and where it goes next. It is a concrete instance of the broader argument I make in building repeatable customer health checks — that a health check is a product, not a craft.

The problem with how we do this

The honest version of a manual M365 health check is that it is inconsistent, slow, and unauditable.

Inconsistent, because there is no fixed checklist that survives contact with a real tenant. Everyone has their own mental list. Mine is good. My colleague’s is good in different places. Neither is written down in a way that guarantees the same coverage twice. Some engineers obsess over Conditional Access and skip licensing waste entirely. Some produce a beautiful Secure Score narrative and never look at who holds Global Administrator.

Slow, because the data collection is manual and the data is spread across portals that genuinely do not want to talk to each other. A thorough check takes two to three days, most of which is not thinking but navigation and copy-paste.

Unauditable, because the report is prose written from memory and a spreadsheet. If a customer asks “how did you conclude our MFA coverage was 71%?” three months later, the honest answer is often “I counted it in my head at the time.” That is not good enough when the report drives remediation spend.

A finding nobody can reproduce is an opinion wearing a suit.

I wanted findings that were reproducible, coverage that was guaranteed, and a report whose quality did not depend on my mood. That meant separating the two things a health check actually is: gathering facts, and reasoning about them.

Design decisions

The central decision, the one everything else hangs off, is the collect-then-reason split. The system collects raw tenant data deterministically, normalises it, runs deterministic rules over it to produce facts, and only then hands those facts to an LLM to turn into a prioritised narrative. The LLM never touches the tenant. It never calls Graph. It never decides what is true. It writes.

This matters because of everything I learned writing why most AI projects fail: a language model asked to both gather and judge will confidently invent the gathering. If you let it near the API and ask it “is MFA enforced?”, it will reason its way to a plausible answer rather than a correct one. The fix is to never give it that job. Facts are computed in code. The model is a writer with a locked source of truth.

The second decision was read-only, least-privilege access. I created a dedicated app registration in Entra ID using client-credentials auth — no user, no delegated session, no interactive sign-in. It holds only the application Graph scopes it genuinely needs, all of them read-only: Directory.Read.All, Policy.Read.All, Reports.Read.All, SecurityEvents.Read.All, RoleManagement.Read.Directory, Sites.Read.All. There is no write scope anywhere in the consent. The worst this credential can do, if it leaks, is read configuration. That is a property I can put in writing to a customer’s security team, and it is the difference between them saying yes and saying no.
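The read-only claim is easy to check mechanically. The sketch below is a naming-convention heuristic, not an authoritative classification of Graph permissions: it assumes a permission is read-only when its dotted name contains a `Read` segment and no `ReadWrite` segment. The `GRANTED` list is copied from the paragraph above; the function names are hypothetical.

```python
# Heuristic audit of granted application permissions (assumption: names
# follow the Resource.Read.Constraint pattern used by the list above).
GRANTED = [
    "Directory.Read.All",
    "Policy.Read.All",
    "Reports.Read.All",
    "SecurityEvents.Read.All",
    "RoleManagement.Read.Directory",
    "Sites.Read.All",
]


def is_read_only(permission: str) -> bool:
    """True when the name has a 'Read' segment and no 'ReadWrite' segment."""
    parts = permission.split(".")
    return "Read" in parts and "ReadWrite" not in parts


def non_read_only(permissions):
    """Return every permission the heuristic cannot confirm as read-only."""
    return [p for p in permissions if not is_read_only(p)]


print(non_read_only(GRANTED))  # prints []
```

Anything this returns deserves a manual look before the consent request goes to a customer. A permission such as `Sites.ReadWrite.All` would be listed.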

The third decision was n8n as the orchestrator rather than a monolithic script. n8n is already the automation spine of my lab — it is the same tool I lean on across nearly everything, for the reasons in why every infrastructure engineer should learn Python (n8n is glue, Python is the muscle inside the glue). Using it here gave me retry handling, credential storage, scheduling, and a visual map of the pipeline for free. The collectors are n8n HTTP nodes with Python in Code nodes where the logic gets gnarly.

The fourth decision was that the model stays local. Tenant configuration is sensitive. Nothing about this tenant (not user counts, not policy names, not admin lists) goes to a public AI service. The report writer is a local model on Ollama, the same runtime that powers Project Atlas. I will come back to this because the data-handling story is the part customers care about most.

What it actually inspects

The deterministic collectors gather a fixed set of areas every run, which is how I guarantee coverage. The list is opinionated and it is the same every time:

  • Identity and Conditional Access — every CA policy, its state (on, off, report-only), its assignments and exclusions. Excluded users are flagged because exclusions are where good policy goes to die.
  • MFA coverage — registration and capability pulled from the authentication methods and registration reports, expressed as a real percentage with the denominator shown.
  • Licensing and assignment waste — assigned versus enabled SKUs, unused premium licences, P2 features being paid for but not used, accounts holding licences that have not signed in for 90 days.
  • Secure Score — current score, the control profiles, and the highest-impact unactioned controls.
  • Sharing posture — Exchange external forwarding and mail flow rules, SharePoint and OneDrive external sharing settings, Teams guest and external access.
  • Privileged roles — who holds the directory roles that matter, how many Global Admins exist, whether they have MFA, and whether PIM is in use or the roles are permanently assigned.

Each of these produces structured facts, not prose. “9 Global Administrators, 2 without registered MFA, none under PIM” is a fact. The model’s job is to know that this is bad, explain why, and rank it.
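As an illustration of a percentage with the denominator shown, here is a minimal sketch of how an MFA coverage fact might be computed. It assumes each collected record carries a boolean `isMfaRegistered` field, as the Graph user registration details records do. The function name, the output shape and the sample users are hypothetical.

```python
def mfa_coverage(registration_details):
    """Compute MFA registration coverage and keep the denominator visible."""
    total = len(registration_details)
    registered = sum(1 for u in registration_details if u.get("isMfaRegistered"))
    percent = round(100 * registered / total, 1) if total else 0.0
    return {
        "registered": registered,
        "total": total,
        "percent": percent,
        "metric": f"{registered} of {total} users MFA-registered ({percent}%)",
    }


# Illustrative records only, not real tenant data.
sample = [
    {"userPrincipalName": "a@example.com", "isMfaRegistered": True},
    {"userPrincipalName": "b@example.com", "isMfaRegistered": True},
    {"userPrincipalName": "c@example.com", "isMfaRegistered": False},
    {"userPrincipalName": "d@example.com", "isMfaRegistered": True},
]

print(mfa_coverage(sample)["metric"])  # prints: 3 of 4 users MFA-registered (75.0%)
```

Storing `registered` and `total` alongside the percentage is what lets the number be re-derived later if a customer questions it.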

How it fits together

The architecture is a straight line with one deliberate firebreak in the middle. Data flows from Graph through the collectors into normalised JSON, the rules engine turns that JSON into graded findings, and only the findings — never the raw tenant dump — cross into the LLM.

The firebreak is the line between the rules engine and the LLM. Upstream of it, everything is deterministic and reproducible: same tenant, same day, same findings. Downstream, the model does language, not facts.
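The straight line can be sketched as a single function whose last step is the only place the writer is called, and which passes it nothing but findings. This is an illustrative skeleton, not the actual n8n workflow: `run_health_check` and the stub stages below are all hypothetical names.

```python
def run_health_check(collect, normalise, rules, write_report):
    """Collect, normalise, grade, then hand only the findings to the writer."""
    raw = collect()                  # raw Graph responses
    canonical = normalise(raw)       # normalised JSON
    findings = []
    for rule in rules:               # deterministic grading
        findings.extend(rule(canonical))
    return write_report(findings)    # firebreak: raw and canonical stay behind


# Stub stages so the skeleton can be run without a tenant.
def fake_collect():
    return {"policies": [{"displayName": "Example policy", "state": "disabled"}]}


def fake_normalise(raw):
    return {"disabled_policies": [p["displayName"] for p in raw["policies"]
                                  if p["state"] == "disabled"]}


def disabled_policy_rule(canonical):
    return [{"id": "CA-EXAMPLE", "severity": "medium", "evidence": {"policy": name}}
            for name in canonical["disabled_policies"]]


seen_by_writer = []


def fake_writer(findings):
    seen_by_writer.append(findings)  # record exactly what crossed the firebreak
    return f"{len(findings)} finding(s)"


report = run_health_check(fake_collect, fake_normalise,
                          [disabled_policy_rule], fake_writer)
print(report)  # prints: 1 finding(s)
```

Because the writer is just a callable that takes findings, swapping the stub for a call to a local model changes nothing about what the model is allowed to see.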

Authentication is the dull, important part. The collector starts by exchanging the app credentials for a token:

curl -s -X POST \
  "https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token" \
  -d "client_id=${CLIENT_ID}" \
  -d "client_secret=${CLIENT_SECRET}" \
  -d "scope=https://graph.microsoft.com/.default" \
  -d "grant_type=client_credentials"

Then the collectors page through Graph. The key detail, and the one that bit me, is that Graph hands you data in pages via @odata.nextLink, and you have to follow the chain or you silently under-report. A naive single call to the CA policies endpoint looks like it works on a small tenant and quietly truncates on a large one. The collector loop is boring on purpose:

# n8n Code node: page every Graph collection, never trust the first response
import time

import requests

def collect_all(url, token):
    items, headers = [], {"Authorization": f"Bearer {token}"}
    while url:
        r = requests.get(url, headers=headers, timeout=30)
        if r.status_code == 429:                       # throttled
            wait = int(r.headers.get("Retry-After", 10))
            time.sleep(wait)
            continue
        r.raise_for_status()
        body = r.json()
        items.extend(body.get("value", []))
        url = body.get("@odata.nextLink")              # follow the chain
    return items

ca_policies = collect_all(
    "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies",
    token,
)

Once everything is collected and normalised, the rules engine grades it. A rule is a tiny pure function: it takes the canonical JSON and emits a finding with an id, a severity, the evidence, and the metric it measured. Nothing here is clever. That is the point — clever is where reproducibility goes to die.

{
  "id": "ROLE-001",
  "area": "Privileged Roles",
  "severity": "high",
  "title": "Excessive Global Administrators",
  "evidence": {
    "global_admins": 9,
    "without_mfa": 2,
    "pim_managed": 0
  },
  "metric": "9 Global Administrators, 2 without MFA, 0 under PIM",
  "recommended_target": "fewer than 5, all MFA-enforced, all PIM-eligible"
}

The model receives an array of these objects. It does not receive the tenant. It receives facts that have already been judged true and graded, and its job is to write them up well and in the right order.
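To make "a tiny pure function" concrete, here is a sketch of a rule that would emit the ROLE-001 finding shown above. The input shape (a `global_admins` list with `mfa_registered` and `pim_eligible` flags) and the function name are assumptions for illustration. The threshold comes from the recommended target of fewer than 5.

```python
def rule_excessive_global_admins(canonical):
    """Pure rule: canonical JSON in, one finding (or None) out."""
    admins = canonical["global_admins"]        # assumed shape: list of dicts
    total = len(admins)
    if total < 5:                              # target from the finding: fewer than 5
        return None
    without_mfa = sum(1 for a in admins if not a["mfa_registered"])
    pim_managed = sum(1 for a in admins if a["pim_eligible"])
    return {
        "id": "ROLE-001",
        "area": "Privileged Roles",
        "severity": "high",
        "title": "Excessive Global Administrators",
        "evidence": {
            "global_admins": total,
            "without_mfa": without_mfa,
            "pim_managed": pim_managed,
        },
        "metric": (f"{total} Global Administrators, "
                   f"{without_mfa} without MFA, {pim_managed} under PIM"),
        "recommended_target": "fewer than 5, all MFA-enforced, all PIM-eligible",
    }


# Illustrative input: nine admins, the first two without MFA, none under PIM.
canonical = {
    "global_admins": [
        {"mfa_registered": i >= 2, "pim_eligible": False} for i in range(9)
    ]
}

finding = rule_excessive_global_admins(canonical)
print(finding["metric"])  # prints: 9 Global Administrators, 2 without MFA, 0 under PIM
```

The function reads nothing but its argument and has no side effects, so the same canonical JSON always yields the same finding.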

Stopping the model from lying

This is the part I spent the most time on, because a language model writing a security report is a liability unless you cage it carefully. Left to its own devices it will smooth over gaps, invent plausible-sounding posture to fill a section, and assert things the data never said. Confidently. Every time.

The grounding system prompt does three things: it tells the model it is a writer not an investigator, it forbids any claim not present in the findings, and it gives it a fixed template so the structure is never its decision. A sketch of it:

You are writing a Microsoft 365 health check report.

You will be given a JSON array of FINDINGS. Each finding is already
verified and graded. These are your ONLY source of truth.

Rules:
- Never state a fact that is not present in the findings JSON.
- Never invent metrics, counts, policy names, or percentages.
- If an area has no findings, write exactly: "No issues detected in
  this area." Do not speculate about why.
- Order the report by severity: critical, then high, then medium, then low.
- For each finding: state the issue, why it matters, the evidence
  verbatim from the JSON, and the recommended action.
- Do not soften severities. Do not add reassurance the data
  does not support.

Output Markdown using the section headings provided. Nothing else.

The single most effective line in that prompt is the instruction to quote evidence verbatim from the JSON. It forces the model to anchor every sentence to a value it was actually given, and it makes hallucination visibly inconsistent with the rendered numbers, so it shows up immediately in review. The “No issues detected” escape hatch matters just as much — without an explicit thing to say about an empty area, the model will fill the silence, and what it fills it with is fiction.
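The verbatim-evidence rule also makes a crude automated check possible ahead of human review. The sketch below is an assumption about how such a check could look, not part of the system described here: it flags any number in the model's prose that does not appear anywhere in the findings JSON. It matches digits as strings, so it can miss a wrong number that happens to exist elsewhere in the findings.

```python
import json
import re


def numbers_in(text):
    """Every integer or decimal literal that appears in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def ungrounded_numbers(prose, findings):
    """Numbers the prose uses that appear nowhere in the findings JSON."""
    allowed = numbers_in(json.dumps(findings))
    return sorted(numbers_in(prose) - allowed)


findings = [{
    "id": "ROLE-001",
    "severity": "high",
    "evidence": {"global_admins": 9, "without_mfa": 2, "pim_managed": 0},
    "metric": "9 Global Administrators, 2 without MFA, 0 under PIM",
}]

# Hypothetical model output containing one invented figure.
prose = "There are 9 Global Administrators, 2 without MFA. MFA coverage is 71%."

print(ungrounded_numbers(prose, findings))  # prints ['71']
```

A non-empty result does not prove a hallucination, but it is a cheap signal for which paragraphs to read first.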

I also learned to stop trusting free-form output and lean on templating. The model writes the prose for each finding, but the document skeleton, the executive summary table, the severity counts and the cover page are rendered from the findings JSON by a deterministic template. The model never decides how many high-severity issues there are. It is told, and the count is computed in code. Templating beats free-form every time you can get away with it.
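A minimal sketch of the deterministic side of that split: the severity counts, the ordering and the summary table are all derived from the findings in code. The function names and the Markdown table layout are assumptions, and the sample findings are illustrative.

```python
from collections import Counter

SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def severity_counts(findings):
    """Count findings per severity, including severities with zero findings."""
    counts = Counter(f["severity"] for f in findings)
    return {s: counts.get(s, 0) for s in SEVERITY_ORDER}


def order_findings(findings):
    """Sort findings critical-first so the ordering is never the model's choice."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))


def summary_table(findings):
    """Render the executive summary table as Markdown."""
    counts = severity_counts(findings)
    lines = ["| Severity | Findings |", "| --- | --- |"]
    lines += [f"| {s} | {counts[s]} |" for s in SEVERITY_ORDER]
    return "\n".join(lines)


findings = [
    {"id": "LIC-EXAMPLE", "severity": "medium"},
    {"id": "ROLE-001", "severity": "high"},
    {"id": "CA-EXAMPLE", "severity": "high"},
]

print(summary_table(findings))
print([f["id"] for f in order_findings(findings)])
```

Python's `sorted` is stable, so findings of equal severity keep the order the rules engine produced them in.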

What I got wrong

Graph pagination cost me a real finding once. Early on, on a large tenant, my CA collection truncated at the first page and the report cheerfully concluded the tenant had far fewer policies than it did. Nobody was harmed because I caught it in review, but it taught me to never trust a single Graph response and to assert expected counts where I can. Always follow @odata.nextLink.

Throttling is not an edge case, it is Tuesday. Run enough collectors in parallel against a big tenant and Graph will start returning 429 with a Retry-After. My first version did not honour it and the run failed halfway. Now every call respects Retry-After and the collectors are deliberately a bit patient. A health check that takes four minutes instead of two but never falls over is the better product.
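One way to honour Retry-After without retrying forever is to cap the number of attempts. The sketch below is a hypothetical helper, not the collector code itself. It takes a `send` callable returning `(status, headers, body)` so it can be exercised without a network, and it assumes Retry-After arrives as a number of seconds.

```python
import time


def get_with_retry(send, max_attempts=5, default_wait=10, sleep=time.sleep):
    """Call send() until it is not throttled, waiting as the server asks."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            # Assumes Retry-After is given in seconds.
            sleep(int(headers.get("Retry-After", default_wait)))
    raise RuntimeError(f"still throttled after {max_attempts} attempts")


# Simulated responses: throttled twice, then success.
responses = iter([
    (429, {"Retry-After": "3"}, None),
    (429, {}, None),
    (200, {}, {"value": []}),
])
waits = []  # record the waits instead of actually sleeping

status, body = get_with_retry(lambda: next(responses), sleep=waits.append)
print(status, waits)  # prints: 200 [3, 10]
```

Failing loudly after a bounded number of attempts keeps a stuck run from hanging the whole pipeline.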

Consent and permissions are the genuinely hard part. I expected the engineering to be the work. It was not. The work was getting the scopes exactly right — least privilege means you discover a missing permission at the worst moment, mid-run, as a 403 — and getting a customer’s tenant admin to grant admin consent to an app they did not create. That conversation is easier precisely because the app is read-only and I can prove it, but it is still the bottleneck. The technology was the easy half.

The model lied to me before I grounded it. My first naive version handed the model a big blob of tenant JSON and asked for a report. It produced something beautiful and partly fictional. It asserted MFA percentages that were not in the data, named a Conditional Access policy that did not exist, and was completely calm about all of it. That failure is the whole reason for the collect-then-reason split and the grounding prompt. It is the single clearest lesson of the project, and it generalises: an LLM must be grounded in real data or it will lie confidently. I would not build any reporting system on a model again without a hard firebreak between facts and prose.

Where this goes next

The current system runs against one tenant on demand and produces a point-in-time report. Three things are coming.

Scheduled drift detection. Run the same collection on a schedule, store the normalised JSON in Git, and diff it. A health check that runs once is a photograph. A health check that runs weekly and tells you “someone added a Global Admin and disabled a CA policy on Tuesday” is a smoke alarm. Because the findings are already structured JSON, the diff is nearly free — this is exactly the kind of repeatable, version-controlled output I argue for in building knowledge instead of documents.
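A sketch of how the diff might work on two normalised snapshots: flatten each to path/value pairs and compare the keys. The snapshot shape and the policy name are hypothetical. Empty lists and dicts produce no paths here, so a real implementation would need to decide how to represent them.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {path: scalar} pairs."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}/{index}"))
    else:
        out[prefix] = obj
    return out


def diff_snapshots(old, new):
    """Report paths that were added, removed or changed between two runs."""
    a, b = flatten(old), flatten(new)
    return {
        "added": {k: b[k] for k in b.keys() - a.keys()},
        "removed": {k: a[k] for k in a.keys() - b.keys()},
        "changed": {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]},
    }


# Illustrative snapshots only.
last_week = {"global_admins": 9, "ca_policies": {"Require MFA": "enabled"}}
this_week = {"global_admins": 10, "ca_policies": {"Require MFA": "disabled"}}

drift = diff_snapshots(last_week, this_week)
print(drift["changed"])
```

Keeping the snapshots in Git gives the history for free; a diff like this only has to answer what changed between two commits in terms a report can quote.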

Multi-tenant. As a presales and consultancy tool this needs to fan out across many customer tenants from one orchestrator, each with its own isolated app registration and credential, results kept strictly separate. The collect-then-reason architecture already supports this; the work is credential management and tenant isolation, not new logic.

Benchmarking. Once there is a corpus of anonymised, structured findings across tenants, a finding stops being absolute and becomes relative — “your Secure Score is 47, the median for tenants your size is 63.” That comparison is far more persuasive to a customer than a bare number, and it is only possible because the findings were structured and reproducible from day one.
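The comparison itself is small once the corpus exists. In this sketch the peer scores are invented for illustration and the function name is hypothetical. Only the 47 and the resulting median of 63 echo the example above.

```python
from statistics import median


def benchmark(score, peer_scores):
    """Place one tenant's score against the median of comparable tenants."""
    peer_median = median(peer_scores)
    return {
        "score": score,
        "peer_median": peer_median,
        "delta": score - peer_median,
    }


# Hypothetical anonymised scores for tenants of a similar size.
peers = [52, 58, 63, 66, 71]

result = benchmark(47, peers)
print(f"your Secure Score is {result['score']}, "
      f"the median for tenants your size is {result['peer_median']}")
```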

Closing thought

The thing I keep coming back to is that this project did not make the LLM smarter. It made the LLM smaller. The model’s job shrank from “assess this tenant” to “write up these graded facts in the right order,” and the report got better the more I took away from it.

That is the lesson I would hand to anyone building automation around a language model. The intelligence that matters here is not in the model — it is in the deterministic collection, the least-privilege access, and the rules that decide what is true. The model is a competent writer that I refuse to let near the source of truth. Keep the facts in code, keep the model on a short leash, and the result is something a customer can trust and an engineer can reproduce. That is the whole product. The model, as ever, is not.