Project Atlas

Atlas is the assistant I built because I was tired of explaining myself to a stranger every morning.

Every time I opened a hosted chatbot I had to re-establish who I was, what I was working on, what my homelab looked like, which acronyms meant what in my world. The conversation started from zero. The model was clever, but it knew nothing about me, and worse, it could not actually do anything. It could write me a paragraph about Microsoft Graph. It could not go and query my tenant.

Atlas is my attempt to fix both problems at once: an assistant that remembers my world and can act on it, running on hardware I own. It is the recurring brain of the lab I describe in building an AI infrastructure lab at home, and over the past year it has become the thing several other projects quietly plug into.

This is the honest account of how it works, what I got wrong, and where it goes next.

Why not just use ChatGPT

This is the first question anyone sensible asks, and it deserves a real answer rather than a privacy slogan.

There are four reasons I run my own, and only one of them is privacy.

Privacy is the obvious one. A lot of what I would want an assistant to be useful with is exactly the stuff I should not paste into someone else’s service: client tenant details, internal architecture, half-finished thinking about a deal. If the assistant is going to be genuinely useful it has to see genuinely sensitive context, and I would rather that context never leave the network.

Control is the second. Hosted models change underneath you. The behaviour shifts, the guardrails move, the price moves, the model gets deprecated. When the assistant is part of my daily workflow I do not want it changing personality because a vendor shipped a new system prompt on a Tuesday.

The third reason is the one people underrate: it knows my projects. A generic model is brilliant and amnesiac. Atlas has a knowledge base of my actual notes, so when I ask “what did I decide about VLAN segmentation,” it answers from what I wrote, not from the average of the internet.

The fourth, and the one that changed everything, is that it can act. Atlas is wired into n8n, so it can hit Microsoft Graph, poke Home Assistant, or search my files and come back with a real answer. A model that can only talk is a clever notepad. A model that can call tools is an assistant.

A model on its own is just a text box. The value is everything you build around it.

None of this means hosted models are bad. For throwaway questions I still reach for one. But the assistant I depend on is mine.

The hardest part is memory

If you only take one thing from this article, take this: memory is the hard problem, not the model.

People imagine an assistant remembers things the way a person does. It does not. A language model has no memory at all between calls. Everything it “knows” in a conversation is just text you stuffed into the context window this time. So the entire design of Atlas is really a design about what text goes into that window, and where the durable stuff lives in between.

I ended up with three distinct kinds of memory, and keeping them separate was the single best decision I made.

The first is short-term chat memory — the running conversation, held in Open WebUI, scoped to the current thread. This is genuinely ephemeral and I treat it as such. It is the working scratchpad, nothing more. When the thread ends, it is gone, and that is correct. You do not want yesterday’s tangent silently leaking into today’s reasoning.

The second is the durable knowledge base — Markdown files in a Git repository. This is the real memory. It is plain text, version-controlled, diffable, and it outlives any model, any tool, any platform. This is the same conviction that drives the whole site and that I set out in building knowledge instead of documents: the durable thing is the knowledge, written down in a format that will still open in twenty years.

The third is retrieval — the machinery that, at question time, pulls the relevant slices of that knowledge base into the context window. This is where RAG lives, and where most of the disappointment lives too.

The mistake I made early was conflating these three. I tried to make the chat history into the long-term memory, and it became an unmanageable, lossy mess. Separating “the conversation” from “what I actually know” from “how I fetch what I know” is what made Atlas reliable.
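The boundary between the three can be made concrete as a prompt-assembly step. This is a minimal sketch, not Atlas's actual code: `Chunk`, `build_prompt`, and the section labels are hypothetical names chosen for illustration. The point is that each kind of memory arrives through its own argument and nothing else reaches the context window.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # path of the note the text came from
    text: str


def build_prompt(system_prompt: str, thread: list[str], chunks: list[Chunk], question: str) -> str:
    """Assemble the text for one model call from the three kinds of memory."""
    # Retrieval output: slices of the durable knowledge base, labelled by file.
    context = "\n\n".join(f"[{c.source}]\n{c.text}" for c in chunks)
    # Short-term memory: only the current thread, never older conversations.
    history = "\n".join(thread)
    return (
        f"{system_prompt}\n\n"
        f"Context from notes:\n{context or '(none retrieved)'}\n\n"
        f"Conversation so far:\n{history or '(new thread)'}\n\n"
        f"Question: {question}"
    )
```

Because old threads have no parameter here, yesterday's tangent has no route into today's prompt unless it was written down as a note and retrieved.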

The knowledge layer, and the truth about RAG

My notes are Markdown in Git. That is the second brain I describe in building a second brain, and Atlas reads from the same repository. Nothing duplicated, one source of truth.

The retrieval pipeline is conventional. A small ingestion job chunks the Markdown, generates embeddings with a local embedding model, and stores them in a vector database. At question time, Atlas embeds the question, finds the nearest chunks, and prepends them to the prompt with instructions to answer from that context.
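The embed-then-find-nearest step can be sketched in a few lines. Everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real local embedding model, the in-memory list stands in for the vector database, and the sample chunk texts are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_chunks(question: str, index: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    """Return the k chunk texts whose vectors are closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Ingestion: embed every chunk once and keep (text, vector) pairs as the "store".
chunks = ["the battery charges overnight", "vlan segmentation notes", "solar panel output"]
index = [(c, embed(c)) for c in chunks]
```

A real embedding model replaces `embed` and a vector database replaces the list, but the shape of the question-time lookup stays the same.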

Here is the part the tutorials skip: naive RAG is mediocre, and it is mediocre for boring reasons.

The first reason is chunking. If you split documents on a fixed character count you slice sentences in half and orphan the heading from the paragraph that gives it meaning. I switched to chunking on Markdown structure — headings and sections — so a chunk is a coherent idea, not an arbitrary 500 characters. That single change did more for answer quality than any model swap.
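Chunking on structure can be sketched as follows, assuming notes use ATX-style `#` headings. `chunk_by_heading` is a hypothetical helper, not the ingestion job described above.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading section."""
    chunks: list[dict] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section, if it has any content.
            if heading or any(l.strip() for l in body):
                chunks.append({"heading": heading, "text": "\n".join(body).strip()})
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush the final section.
    if heading or any(l.strip() for l in body):
        chunks.append({"heading": heading, "text": "\n".join(body).strip()})
    return chunks
```

Each chunk keeps its heading next to its paragraph, which is exactly what fixed-width splitting loses. This sketch does not skip fenced code blocks, so a `# comment` line inside one would be mistaken for a heading; a real ingestion job would need to track fences.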

The second reason is that semantic similarity is not relevance. The nearest vectors to your question are often near because they share vocabulary, not because they answer it. Ask “how is the battery charged” and you will happily retrieve five chunks that mention batteries and none that mention the charging schedule. Pure vector search has no notion of “the bit that actually answers this.”

The third reason is the killer: RAG quality is mostly data hygiene. If your notes are contradictory, stale, or vague, retrieval faithfully serves up contradictory, stale, vague context, and the model dutifully launders it into a confident wrong answer. The assistant is only ever as good as the knowledge you feed it. I spent weeks tuning retrieval parameters before I accepted that the problem was my notes, not my cosine similarity threshold.

So I stopped treating RAG as magic and started treating it as a search problem with an LLM on the end. I added a keyword pass alongside the vector pass — hybrid retrieval — and a re-ranking step so the chunks that survive are the ones that actually look like answers. It is less glamorous than “AI that reads your documents” and far more useful.
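One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion. That technique is my assumption for this sketch; nothing above says which fusion method Atlas uses, and the chunk IDs are invented for the example. The re-ranking step is not shown.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per item, so a chunk that ranks
    well in both the keyword and the vector pass rises to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["battery-overview", "charging-schedule", "solar-notes"]
keyword_hits = ["charging-schedule", "tariff-notes"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `charging-schedule` comes first because both passes found it, even though the vector pass alone ranked it second.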

Tooling: how Atlas actually does things

Retrieval makes Atlas knowledgeable. Tools make it useful.

The mechanism is tool-calling. The model is told, in its system prompt, that it has a set of tools it can invoke by emitting a structured request. Open WebUI passes those requests to n8n, the spine I lean on across the whole lab, and each tool is just an n8n workflow with a webhook trigger. When Atlas decides it needs live data, it calls the workflow, n8n does the real work against a real API, and the result comes back into the conversation for the model to reason over.

Three tools earn their keep daily. One queries Microsoft Graph — the same app-registration-and-client-credentials plumbing behind the Microsoft 365 AI health check — so Atlas can answer questions about a tenant from live config rather than guesswork. One hits Home Assistant, so I can ask whether the battery is charging and get the actual state, which ties straight into the AI battery optimiser. One searches my files, a deliberately separate path from RAG for when I want an exact filename or a grep, not a fuzzy semantic match.

Here is the shape of a tool definition as Atlas sees it. The model never touches credentials; it only knows the tool exists and what it returns.

{
  "name": "query_home_assistant",
  "description": "Get the live state of a Home Assistant entity. Use for real-time values like battery level, solar generation, or whether a device is on. Do not guess these values.",
  "parameters": {
    "type": "object",
    "properties": {
      "entity_id": {
        "type": "string",
        "description": "The HA entity, e.g. sensor.battery_level"
      }
    },
    "required": ["entity_id"]
  },
  "endpoint": "https://n8n.lab.internal/webhook/ha-state"
}

And the n8n side is a small workflow: a webhook node receives the call, an HTTP node queries Home Assistant with a token from the vault, and a function node trims the response to just the fields the model needs. Sending the model less is almost always better than sending it more.

[Webhook]  ->  [HTTP Request: GET /api/states/{{entity_id}}]  ->  [Function: pick(state, attributes.unit)]  ->  [Respond]
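The trimming step might look like this. In n8n the node itself would be written in JavaScript, so this Python version only illustrates the logic. `trim_state` is a hypothetical name, and the field names assume the usual Home Assistant state object, where the unit lives under `attributes.unit_of_measurement`.

```python
def trim_state(ha_response: dict) -> dict:
    """Keep only the fields the model needs from a Home Assistant state object."""
    attributes = ha_response.get("attributes", {})
    return {
        "entity_id": ha_response.get("entity_id"),
        "state": ha_response.get("state"),
        # Home Assistant reports units under `unit_of_measurement`.
        "unit": attributes.get("unit_of_measurement"),
    }
```

Timestamps, context IDs, and the rest of the attribute bag are dropped, so the model sees three fields instead of a page of JSON.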

The discipline that matters: tools return data, they do not make decisions. n8n fetches the battery state. Atlas decides what to say about it. Keeping the judgement in the model and the actions in audited workflows is what lets me trust the thing.
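The calling side of that discipline can be sketched as a small dispatcher. The `{"name", "arguments"}` call shape, the `TOOLS` registry, and the injected `post` function are assumptions for illustration rather than how Open WebUI or n8n implement it. Only the tool name and endpoint come from the definition shown earlier.

```python
import json
from typing import Callable

# Registry keyed by tool name; the entry mirrors the tool definition above.
TOOLS = {
    "query_home_assistant": {
        "required": ["entity_id"],
        "endpoint": "https://n8n.lab.internal/webhook/ha-state",
    }
}


def dispatch(call: dict, post: Callable[[str, dict], dict]) -> str:
    """Validate a model-emitted tool call and forward it to its webhook.

    `post` is injected so the transport (and its credentials) stays outside
    this function; it takes a URL and a JSON-able payload and returns a dict.
    """
    tool = TOOLS.get(call.get("name", ""))
    if tool is None:
        # The model asked for a tool that does not exist: report, do not guess.
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    args = call.get("arguments", {})
    missing = [p for p in tool["required"] if p not in args]
    if missing:
        return json.dumps({"error": f"missing parameters: {missing}"})
    # Return the data untouched; interpreting it is the model's job.
    return json.dumps(post(tool["endpoint"], args))
```

The dispatcher never interprets the payload. It rejects calls to tools that are not registered, checks required parameters, and hands back whatever the workflow returned.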

Persona, guardrails, and saying “I don’t know”

A system prompt is not flavour text. It is the constitution of the assistant, and I rewrote mine more times than any other part of the project.

Atlas has a deliberately plain persona — a competent, terse engineering colleague, not a chirpy helper. Voice matters less than two behaviours I had to engineer hard: grounding and humility.

Grounding means it answers from retrieved context and tool output, and flags when it is going beyond them. Humility means it is allowed — encouraged — to say it does not know. The default failure mode of every model is confident fabrication, and in an assistant that can act, a confident wrong answer is worse than no answer. I would rather Atlas say “I don’t have a note on that” than invent one.

You are Atlas, a local engineering assistant for Kris's homelab and work.

Rules:
- Answer from the provided context and tool results. If they do not contain
  the answer, say so plainly. Do not guess.
- For any live value (battery, tenant config, device state) you MUST call a
  tool. Never state a current value from memory.
- Prefer "I don't know" or "I have no note on that" over a plausible
  invention. Being wrong is more expensive than being silent.
- Be concise. You are talking to an engineer, not writing marketing copy.
- When you use a note, name the source file so it can be checked.

That last line — name the source — turned out to be a quiet superpower. Citing the file it drew from makes Atlas auditable. I can click through and check, and when it cites a file that does not actually say what it claimed, that is a signal to go and fix the note. The assistant becomes a test of my own knowledge base.
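Part of that audit can be automated. As a rough sketch, assuming citations appear in answers as `.md` paths, a check like the following flags any file the answer names that was never in the retrieved context. `unsupported_citations` is a hypothetical helper, and it only catches invented sources, not a real file that says something different from what was claimed.

```python
import re

# Matches path-like tokens ending in .md, e.g. network/vlans.md
CITED_FILE = re.compile(r"[\w./-]+\.md\b")


def unsupported_citations(answer: str, retrieved_sources: set[str]) -> set[str]:
    """Return files the answer cites that were not in the retrieved context."""
    cited = set(CITED_FILE.findall(answer))
    return cited - retrieved_sources
```

An empty result means every cited file was at least handed to the model. Whether the file actually supports the claim still needs a human click-through.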

How it fits together

The whole thing is the standard lab stack: Ollama for inference, Open WebUI as the front end, n8n as the orchestration spine, all in Docker with compose files in Git as the source of truth.

flowchart TD
    User[Me] --> UI[Open WebUI]
    UI --> LLM[Ollama assistant model]
    LLM -->|needs knowledge| RET[Retrieval]
    RET --> VEC[Vector store]
    VEC --> KB[Git Markdown notes]
    LLM -->|needs to act| N8N[n8n workflows]
    N8N --> GRAPH[Microsoft Graph]
    N8N --> HA[Home Assistant]
    N8N --> FILES[File search]
    N8N --> LLM
    RET --> LLM
    LLM --> UI

The flow is worth narrating because the loop is the point. I ask a question. Open WebUI sends it to the assistant model on Ollama. The model decides whether it needs knowledge, in which case retrieval pulls note chunks from the vector store backed by the Git repository, or whether it needs to act, in which case it calls an n8n workflow that hits a real system. The results come back, the model reasons over them, and I get an answer that is grounded in either my own notes or live state, and I am usually told where it came from.

On model choice: the assistant role wants instruction-following and reliable tool-calling far more than raw cleverness. I settled on Qwen2.5 in the 7B–14B range, quantised to Q4_K_M, because it follows the system prompt tightly and emits clean tool calls, which a more “intelligent” but sloppier model would not. This is exactly the lesson from my journey into local LLMs: you pick the model for the job. For Atlas the job is being obedient and grounded, not winning benchmarks. The bigger models I keep for one-off heavy reasoning, not for the assistant that runs all day on the RTX 3090.

The model is not the product. The system around it is.

What I got wrong

Plenty.

I built memory before I built retrieval discipline, and ended up with an assistant that confidently remembered things that were never true, because I had let stale chat history leak into its context. Separating the three memory types fixed it, but I should have designed that boundary on day one.

I over-trusted RAG. I assumed that “give the model my notes” would just work, and spent weeks tuning retrieval when the real fault was that my notes contradicted each other. Garbage in, confidently phrased garbage out. The data hygiene is the work; the embeddings are the easy bit.

I gave it too many tools too early. Every tool you add is another way for the model to misfire — to call the wrong one, or to hallucinate a tool that does not exist. A handful of reliable, well-described tools beats a sprawling toolbox the model cannot navigate.

And I under-invested in guardrails until it embarrassed me by inventing a tenant setting that did not exist. That is when “prefer I don’t know” went to the top of the system prompt and stayed there.

Where this goes next

Three directions, all concrete.

The first is agentic loops. Today Atlas mostly does one retrieval or one tool call per turn. I want it to plan — call a tool, look at the result, decide the next call, iterate towards an answer — within strict limits on how many steps it may take before it has to report back. The ceiling matters; an unbounded agent is a great way to generate a large bill of nonsense.
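A bounded loop of that kind could look roughly like this. It is a sketch of the idea, not a design Atlas has today: `decide` stands in for the model, `call_tool` for the n8n dispatch, and the step format and default ceiling are assumptions.

```python
from typing import Callable


def run_agent(
    question: str,
    decide: Callable[[str, list[str]], dict],
    call_tool: Callable[[dict], str],
    max_steps: int = 4,
) -> str:
    """Iterate plan -> tool call -> observe, stopping at a hard step ceiling.

    `decide` stands in for the model: given the question and the observations
    so far, it returns either {"answer": ...} or {"tool_call": {...}}.
    """
    observations: list[str] = []
    for _ in range(max_steps):
        step = decide(question, observations)
        if "answer" in step:
            return step["answer"]
        observations.append(call_tool(step["tool_call"]))
    # Ceiling reached: report back rather than keep spending steps.
    return f"Stopped after {max_steps} steps without an answer."
```

The ceiling is enforced by the loop, not by asking the model to behave, so a model that never settles on an answer still terminates.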

The second is better retrieval. Hybrid search and re-ranking got me a long way, but I want query rewriting, where the model reformulates a vague question into a better search before retrieving, and I want to chunk on meaning rather than just on headings.

The third, and the one I have neglected longest, is evals. I have no systematic way to tell whether a change made Atlas better or just different. I am building a small set of golden questions with known good answers, so that when I swap a model or tweak the prompt I can measure it instead of guessing. Until that exists, every “improvement” is a vibe.
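A first version of such a harness can be very small. The sketch below assumes each golden case is a question plus substrings a good answer must contain. That scoring rule is deliberately crude and is my assumption, as are the names `run_evals` and `must_contain`.

```python
from typing import Callable


def run_evals(golden: list[dict], ask: Callable[[str], str]) -> float:
    """Score an assistant against golden questions; returns the pass rate.

    Each case has a `question` and a list of `must_contain` substrings that
    a good answer is expected to include (case-insensitive).
    """
    if not golden:
        return 0.0
    passed = 0
    for case in golden:
        answer = ask(case["question"]).lower()
        if all(term.lower() in answer for term in case["must_contain"]):
            passed += 1
    return passed / len(golden)
```

Running the same golden set before and after a model swap or prompt tweak turns "better or just different" into two numbers that can be compared. Cases that expect "I don't know" are worth including so the humility rule is tested too.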

Closing thought

Atlas started as a vanity project — I wanted my own JARVIS. What it actually taught me is that the interesting engineering in an AI assistant is almost never the AI.

It is the boundary between three kinds of memory. It is the data hygiene of the notes underneath. It is the discipline of letting tools fetch and the model judge. It is a system prompt that gives the thing permission to be uncertain.

The model is the cheap part. You can swap it in an evening. The knowledge base, the retrieval that respects it, the tools that act safely, and the honesty about what the assistant does not know — that is the part that took a year and is still not finished.

Atlas is only ever as good as the knowledge I feed it. Which means, quietly, the project was never really about building an assistant. It was about getting serious about what I actually know, and writing it down well enough that a machine could use it.