Lessons from Building a Docker Homelab

Every homelab starts the same way. You spin up one container to try something, it works, and you forget about it. Then another. Then a database for that one. Then a second thing that talks to the first, on a port you picked because it was free that afternoon. Six months later you have a server doing useful work and absolutely no idea how to rebuild it if the disk dies.

That was me. The lab that now runs my AI infrastructure at home — Ollama, Open WebUI, n8n, Home Assistant, the whole spine — did not arrive as a tidy design. It accreted. The most valuable thing I have done in three years of running it was not adding a service. It was going back and making the mess reproducible.

This is the operational counterpart to the more architectural infrastructure for AI workloads piece. That one is about what the hardware should be. This one is about how you actually run forty-odd containers on it without losing your mind, and the specific things I got wrong on the way. None of it is theory. All of it cost me an evening at some point.

The sprawl problem nobody admits to

The honest failure mode of a homelab is not a dramatic outage. It is entropy. You end up with a pile of docker run commands that live nowhere except your shell history and a vague memory. Ports clash, so you start picking arbitrary high numbers and writing them on a sticky note. Two services both want 8080. You “temporarily” expose a database to the host to debug something and never close it. A container called app_final_v2 is doing something load-bearing and you genuinely cannot remember what.

If you cannot rebuild the host from a fresh OS in an afternoon, you do not have a homelab. You have a pet you are afraid of.

The thing that makes this insidious is that it works. Sprawl is functional right up until the moment it isn’t — a power cut, a failed SSD, a docker system prune you ran while tired. Then you discover that “it works” and “I can reproduce it” are completely different properties, and you only ever invested in the first one.

The fix is not a tool. It is a decision: the configuration of every service lives in a file, in Git, and the running state is downstream of that file. Everything else here is a consequence of taking that seriously.

Compose as code, Git as the source of truth

I run everything as docker-compose, and every compose file is in a Git repository. Not a backup of the compose file. The compose file itself, edited in place, committed when it changes. The repo is structured one directory per stack, each with its own docker-compose.yml and a .env that is pointedly not committed.

homelab/
├── caddy/
│   ├── docker-compose.yml
│   ├── Caddyfile
│   └── .env
├── monitoring/
│   ├── docker-compose.yml
│   ├── prometheus.yml
│   └── .env
├── ai/
│   ├── docker-compose.yml      # open-webui (Ollama runs native on the GPU box)
│   └── .env
└── .gitignore                  # *.env, acme.json, data/

People ask where Portainer fits, because I do run it. The answer matters. Portainer is for visibility, not authority. It is a lovely window — container status, logs, resource use, a quick restart when I am on my phone. But the moment you start editing stacks inside Portainer’s UI, you have created a second source of truth, and the two will drift. I have been burned by exactly this: a change made in the Portainer web editor that existed nowhere in Git, lost the next time I redeployed from the repo. So the rule is firm. Portainer reads. Git writes. If I want a change to persist, it goes in the file and gets committed, and Portainer simply reflects reality.

This is the same instinct that made me move this whole site to plain-text, version-controlled content. Configuration that lives only in a UI is configuration you do not really own. The compose files win every time because they are the only artefact that survives the host.

Docker networking, properly this time

Networking is where most homelabs quietly go wrong, because the defaults are forgiving enough to hide the mistakes. Docker drops every container on a default bridge where everything can talk to everything, and you can publish ports to the host with a single line. Both of those conveniences are traps.

The first real lesson was to stop publishing ports I did not need to. Every ports: entry is a hole in the host firewall. A database does not need a published port. The only thing that needs to reach Postgres is the application sitting next to it, and they can find each other over a private network by container name. So I create user-defined bridge networks, one per logical stack, and only publish to the host the handful of things that genuinely must be reachable from outside — really just the reverse proxy.

networks:
  web:        # shared with Caddy, the only externally reachable plane
    external: true
  internal:   # private, never published
    external: false

User-defined bridges give you something the default bridge does not: automatic DNS. Inside a network, postgres resolves to the Postgres container. No IP addresses, no links, no host ports. The app talks to redis:6379 and postgres:5432 directly, on the internal network, and neither of those ports is exposed to the host at all.
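A minimal sketch of that layout, assuming an n8n stack backed by Postgres. The image tags, database name and user are illustrative assumptions rather than the lab's real values; the DB_POSTGRESDB_* variables are the settings n8n documents for an external database.

```yaml
# n8n/docker-compose.yml (sketch; tags, database name and user are assumptions)
services:
  n8n:
    image: n8nio/n8n:1.70.0            # illustrative pinned tag
    restart: unless-stopped
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres    # resolved by Docker's DNS on the internal network
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
    networks:
      - web        # so the reverse proxy can reach it
      - internal   # so it can reach Postgres
    # no ports: entry

  postgres:
    image: postgres:16.4               # illustrative pinned tag
    restart: unless-stopped
    environment:
      - POSTGRES_DB=n8n
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - internal   # the only network it joins; no ports: entry, so it is never published on the host

networks:
  web:
    external: true   # created once, shared with the reverse proxy
  internal:          # created by this stack, private to it

volumes:
  postgres_data:
```

From inside the n8n container, postgres:5432 connects. From the LAN there is no published port to connect to.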

Most of the remaining Docker networking confusion comes from not knowing which driver you actually want:

bridge — the default and the right answer almost always. NAT behind the host, isolated, DNS by name. Use it.
host — the container shares the host’s network stack directly. Fast, no isolation, and it ignores ports: entirely. I reach for it rarely — occasionally for something doing service discovery or needing the real client IP — and I treat it as a smell.
macvlan — gives a container its own MAC and IP on the physical LAN, as if it were a separate machine on the network. Genuinely useful for something like a Pi-hole or a service that wants to look like real hardware, but it bypasses the host firewall and does not talk to the host easily. Powerful and sharp. I use it deliberately, never casually.
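As a sketch of how the two less common drivers look in a compose file: the parent interface name, subnet, addresses and image tags below are all assumptions for illustration and would need to match the real LAN. Home Assistant appears only as a familiar example of something that does discovery; the text does not say how it is actually networked in this lab.

```yaml
services:
  pihole:
    image: pihole/pihole:2024.07.0     # illustrative tag
    restart: unless-stopped
    networks:
      lan:
        ipv4_address: 192.168.1.241    # assumed free address inside ip_range below

  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:2024.12.0   # illustrative tag
    restart: unless-stopped
    network_mode: host                 # shares the host's stack; a ports: entry would have no effect

networks:
  lan:
    driver: macvlan
    driver_opts:
      parent: eth0                     # assumed name of the host's physical NIC
    ipam:
      config:
        - subnet: 192.168.1.0/24
          gateway: 192.168.1.1
          ip_range: 192.168.1.240/28   # keeps Docker's allocations clear of the router's DHCP pool
```

The macvlan container gets its own address on the LAN, so anything the host firewall would have filtered has to be handled on the container or at the router instead.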

The mental model I wish I had started with: the published-port trap is thinking that exposing a port is how containers communicate. It is not. Containers on the same user-defined network already talk to each other freely. Publishing a port only makes a container reachable from outside Docker, through the host. Once that clicked, the set of exposed services shrank dramatically, and the database came off the public network where it had no business being.

Architecture: how it actually fits together

Here is the shape of the network now. One ingress point, TLS terminated once, a public-facing plane shared with the proxy, and private planes that the outside world cannot see at all.

flowchart TD
    Internet[Internet] --> Router[Router and Firewall]
    Router -->|443 only| Caddy[Caddy reverse proxy]

    subgraph web [web network exposed]
        Caddy --> WebUI[Open WebUI]
        Caddy --> N8N[n8n]
        Caddy --> Grafana[Grafana]
        Caddy --> Kuma[Uptime Kuma]
    end

    subgraph internal [internal network private]
        N8N --> Postgres[(Postgres)]
        WebUI --> Ollama[Ollama on GPU box, bare metal]
        Grafana --> Prometheus[Prometheus]
        Prometheus --> Cadvisor[cAdvisor]
        Prometheus --> NodeExp[node-exporter]
    end

    Portainer[Portainer] -.read only view.-> web
    Portainer -.read only view.-> internal

The important property of that picture is where the databases sit. Postgres is on the internal network only, and nothing from the internet can reach it. The dotted lines are just Portainer's read-only view. Ollama is not in Docker at all: it runs natively on the separate GPU box, reachable only across the lab LAN, so the containers call it over the network rather than hosting it. Caddy is the single front door, and the only thing the router forwards is 443.

Why Caddy, and why I left Nginx Proxy Manager

I started on Nginx Proxy Manager. It is a fine tool and a kind on-ramp — a clean UI, click to add a proxy host, click to get a Let’s Encrypt certificate. For a handful of services it is genuinely pleasant.

The problem is the same one as Portainer’s stack editor, only worse. NPM’s configuration lives in its own database, behind a UI. It is not in Git. Adding a service means a human clicking through a form, which means it is not reproducible and not reviewable. As the lab grew past a dozen services, the gap between “what is proxied” and “what is in my repo” became the single biggest piece of undocumented state I owned.

Caddy solves this by putting the entire routing intent in one plain-text file — the Caddyfile — committed to the same repo as everything else. There is no UI and no database. The file is the source of truth, and Caddy’s headline feature is that it obtains and renews Let’s Encrypt certificates automatically, with no ACME plumbing to wire up. A service is exposed when it has a block in the Caddyfile, and not before.

# Caddyfile — the routing intent for the whole lab, in Git
chat.lab.example.com {
    reverse_proxy open-webui:8080
}

That block is the whole story for a service: a hostname, and the container and port to send it to. Caddy reaches open-webui by name over the shared web network — the container publishes no host ports of its own — terminates TLS, and keeps the certificate valid without being asked.

The service compose, then, carries no proxy configuration at all. It just joins the web network so Caddy can reach it:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:0.5.20
    container_name: open-webui
    restart: unless-stopped
    environment:
      - OLLAMA_BASE_URL=http://gpu-box.lab.internal:11434   # native Ollama, over the LAN
    networks:
      - web
      - internal
    # note: no ports: published. Caddy reaches it over the web network.

networks:
  web:
    external: true
  internal:
    external: true

Caddy itself is the only container that publishes ports to the host. It mounts the Caddyfile read-only and persists its certificates in a named volume:

services:
  caddy:
    image: caddy:2.8
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    environment:
      - CF_API_TOKEN=${CF_API_TOKEN}      # DNS challenge for internal-only hosts; needs a Caddy build that includes the Cloudflare DNS module
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data                  # certificates and ACME state
      - caddy_config:/config
    networks:
      - web

volumes:
  caddy_data:
  caddy_config:

networks:
  web:
    external: true                        # the shared network the proxied services also join

The thing that matters most is that nothing is exposed unless I write a block for it. With NPM, a service became reachable the moment I clicked it in, and that click left no trace in Git; with a Caddyfile, the proxied surface is exactly the set of hostnames in one reviewable file, and a service with no block is simply invisible from outside. That is the only safe default, and it lives in version control where I can diff it. TLS everywhere, automatically, and the routing intent in a single committed file.

A note on the Docker socket: label-based proxies such as Traefik need to mount it so they can watch containers come and go, and even read-only that is a real trust decision, because a container that can read the socket learns a great deal about the host. Caddy reading a static Caddyfile needs no such access. It never touches the socket, which is one fewer privileged mount in the lab and one fewer thing to reason about when I think about blast radius.

The .env and secrets discipline

The networking and proxy work is wasted if your secrets are sitting in the compose file in Git. They were, for an embarrassingly long time. Database passwords, the Cloudflare API token, n8n’s encryption key — all committed in plain text because it was easier.

The discipline now is absolute. Every secret lives in a .env file next to its compose file, and .env is in .gitignore before the first commit. The compose file references variables; it never contains values:

    environment:
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}

# .env  — never committed, backed up separately and encrypted
POSTGRES_PASSWORD=...
N8N_ENCRYPTION_KEY=...
CF_API_TOKEN=...

I keep a committed .env.example with the keys and no values, so the repo documents what a stack needs without leaking anything. The real .env files are backed up encrypted, separately from the code, because losing n8n’s encryption key means losing every stored credential it holds. If you take one thing from this section: run git log -p against your old compose files and check what is in your history. Secrets in Git history are still secrets in Git, even after you delete them from the current file. I had to rotate a few.
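One way to make a missing secret fail loudly instead of deploying with an empty value, as a sketch: Compose's ${VAR:?message} interpolation aborts when the variable is unset or empty, whereas a plain ${VAR} only warns and substitutes an empty string. The messages and image tags here are illustrative assumptions.

```yaml
services:
  postgres:
    image: postgres:16.4               # illustrative pinned tag
    environment:
      # Compose refuses to start the stack if this is missing from .env
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD in .env}

  n8n:
    image: n8nio/n8n:1.70.0            # illustrative pinned tag
    environment:
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY:?set N8N_ENCRYPTION_KEY in .env}
```

On a freshly rebuilt host this turns a forgotten .env into an immediate, named error rather than a database that comes up with a blank password.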

Monitoring you actually act on

There are two distinct questions monitoring answers, and conflating them is why so many homelab dashboards are beautiful and useless.

The first is “is it up?”. For that I run Uptime Kuma. It pings every service, shows a wall of green, and shouts at me when something goes red. It is simple, it is reliable, and it is the thing I actually look at. The second question is “how is it behaving over time?” — CPU, memory, container restarts, disk filling up. For that I run Prometheus scraping cAdvisor (per-container metrics) and node-exporter (host metrics), with Grafana on top for dashboards.
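A minimal prometheus.yml for that second question might look like the following. The container names and the scrape interval are assumptions; 8080 and 9100 are the default ports for cAdvisor and node-exporter, and Prometheus reaches both by name over the stack's private network.

```yaml
# monitoring/prometheus.yml (sketch; target names and interval are assumptions)
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: cadvisor            # per-container CPU, memory, restarts
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: node-exporter       # host CPU, memory, disk
    static_configs:
      - targets: ["node-exporter:9100"]
```

Neither exporter needs a published port, since only Prometheus has to reach them.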

The honest lesson here is about alerting, not collection. My first instinct was to alert on everything, and I quickly trained myself to ignore the alerts, which is worse than having none. An alert that does not change your behaviour is just noise wearing a uniform. So I cut it down hard, to a handful of things I will genuinely get up and fix: a service down for more than a few minutes, host disk above 85%, memory pressure that will start killing containers, the backup job failing. Everything else is a dashboard I look at when curious, not a notification that interrupts dinner. Fewer, sharper alerts that you act on every time beat a comprehensive system you have learned to dismiss.
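That short list could be expressed as Prometheus alerting rules roughly as follows. This is a sketch under several assumptions: the five-minute and ten-minute windows and the 10% memory threshold are guesses at "a few minutes" and "memory pressure", the root filesystem is assumed to be labelled mountpoint="/", and backup_last_success_timestamp_seconds is a hypothetical metric the backup job would have to publish itself (for example through node-exporter's textfile collector). Delivering the notification needs Alertmanager or Grafana alerting, which is not shown.

```yaml
# monitoring/alerts.yml (sketch; thresholds and the backup metric are assumptions)
groups:
  - name: homelab
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} has been down for 5 minutes"

      - alert: HostDiskAbove85Percent
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.85
        for: 10m
        annotations:
          summary: "Root filesystem on {{ $labels.instance }} is more than 85% full"

      - alert: HostMemoryLow
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
        annotations:
          summary: "Less than 10% of memory available on {{ $labels.instance }}"

      - alert: BackupStale
        # hypothetical metric: Unix time of the last successful backup run
        expr: time() - backup_last_success_timestamp_seconds > 36 * 3600
        annotations:
          summary: "No successful backup in the last 36 hours"
```

Four rules, each tied to something worth getting up for, is the whole file; anything that would be added "just in case" belongs on a dashboard instead.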

Updates: where I most changed my mind

I started with Watchtower pointed at everything, set to pull and recreate any container with a newer image automatically. It felt responsible. It was not.

The problem is that “latest” moves under you, and not always in the direction you want. One morning a service I depended on had silently jumped a major version overnight, its config schema had changed, and it would not start. Nothing in my repo had changed. I had not touched it. An automatic update had broken a working system while I slept, and I spent the morning working out which of forty containers had quietly changed.

So I stopped auto-updating everything, and I would not go back. The policy now:

Pin tags. No service runs :latest. Every image is pinned to a specific version — caddy:2.8, open-webui:0.5.20. The running version is therefore recorded in Git, which means I can see exactly what changed and when.
Update deliberately. Updating a service is a commit: I bump the tag in the compose file, redeploy that one stack, watch it come up, and move on. If it breaks, git revert and I am back to a known-good version in seconds.
Watchtower still runs, but only in notify mode. It tells me an update is available; it does not apply it. The decision stays mine.
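A sketch of what a notify-only Watchtower might look like in compose. WATCHTOWER_MONITOR_ONLY is the setting that stops it from applying updates; the image tag, the schedule and the notification URL variable's value are assumptions.

```yaml
services:
  watchtower:
    image: containrrr/watchtower:1.7.1           # illustrative pinned tag
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # it must see running containers; a real trust decision
    environment:
      - WATCHTOWER_MONITOR_ONLY=true             # report new images, never pull and recreate
      - WATCHTOWER_SCHEDULE=0 0 6 * * *          # six-field cron (with seconds): daily at 06:00
      - WATCHTOWER_NOTIFICATIONS=shoutrrr
      - WATCHTOWER_NOTIFICATION_URL=${WATCHTOWER_NOTIFICATION_URL}   # kept in .env like any other secret
```

One caveat worth checking against the Watchtower documentation: it looks for a newer image behind the tag a container is already running. A floating minor tag such as caddy:2.8 will report patch rebuilds, but a fully pinned tag such as open-webui:0.5.20 will not announce that 0.5.21 exists, so new version numbers still need another signal such as release notifications.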

Pinned tags and deliberate updates are slightly more work each week and dramatically less work the one week something would have broken. The lab is more boring now. Boring is the goal.

Alongside that sits plain image hygiene. Pinning versions means old images pile up, so a periodic docker image prune and an occasional docker system prune (carefully — never with --volumes unless I mean it) keeps the disk from filling. Disk full is one of the few things that takes the whole host down, which is exactly why it is on the short list of things I alert on.

Backups, and the only thing that actually matters

Here is the realisation that reorganised how I think about the whole lab. Containers are disposable. Volumes are not. A container is just a running copy of an image I can pull again in thirty seconds. The compose file that defines it is in Git. Neither of those needs backing up in any meaningful sense — they are already reproducible. The thing that is irreplaceable is the data in the volumes: the Postgres databases, n8n’s workflows, Home Assistant’s history, Grafana’s dashboards, Open WebUI’s chats.

So my backups target volumes, and they follow 3-2-1: three copies of the data, on two different media, with one off-site. In practice that is the live volumes on the server, a nightly snapshot to the NAS, and an encrypted copy pushed off-site. The backup I care about is a pg_dump of the databases and a tarball of the named volumes, run nightly, with the job itself monitored. A backup that fails silently is the same as no backup, and I learned that the way everyone does, by needing one that was not there.
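One way to keep the backup job itself in the repo is to define it as on-demand services in the stack that owns the data. This is a sketch, not the lab's actual job: the NAS mount path, database name and user, volume name and image tags are all assumptions, and the encrypted off-site copy is left out.

```yaml
# added to the stack that owns the database (sketch; paths, names and tags are assumptions)
services:
  pg-backup:
    image: postgres:16.4                 # should match the server's major version
    profiles: ["backup"]                 # skipped by a plain `up`; run explicitly
    networks:
      - internal
    environment:
      - PGPASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - /mnt/nas/backups:/backup         # assumed NAS mount on the host
    entrypoint: ["/bin/sh", "-c"]
    command:
      # $$ is Compose's escape for a literal $, so the date is evaluated in the container
      - pg_dump -h postgres -U n8n -d n8n -Fc -f /backup/n8n-$$(date +%F).dump

  volume-backup:
    image: alpine:3.20
    profiles: ["backup"]
    volumes:
      - n8n_data:/data:ro                # read-only view of the named volume
      - /mnt/nas/backups:/backup
    entrypoint: ["/bin/sh", "-c"]
    command:
      - tar -czf /backup/n8n_data-$$(date +%F).tar.gz -C /data .

networks:
  internal:

volumes:
  n8n_data:
```

A nightly cron entry or systemd timer can then call docker compose run --rm pg-backup and docker compose run --rm volume-backup; each exits non-zero on failure, which is the signal the monitoring needs. The database goes through pg_dump rather than the tarball because archiving a live Postgres data directory does not give a consistent copy.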

The test that matters is not “does the backup run”. It is “can I restore it”. I have done it from cold deliberately, on a spare box, and proving I could rebuild the lab from Git plus the volume backups was the moment it stopped being a pet I was afraid of.

What I got wrong

A short, honest list, because the mistakes taught me more than the successes:

latest tags everywhere. Covered above. The root of more 7am debugging than anything else.
Secrets committed to Git. Also covered, also rotated in a hurry. Set up .gitignore before the first commit, not after the first leak.
The single host as a SPOF. Every container runs on one bare-metal Ubuntu box. Backups mean I can rebuild, but I cannot fail over. That is a deliberate, eyes-open trade-off for a homelab, but I do not pretend it is resilient.
The load-bearing container with no documentation. I had a small custom container doing something important and undocumented. When it broke I had to reverse-engineer my own work. Now anything non-obvious gets a comment in the compose file and a line in the repo’s README. The glue scripts that hold this together get the same treatment — part of why I argue every infrastructure engineer should learn Python is so that glue is readable and maintainable rather than a black box.

Where this goes next

The roadmap is concrete rather than aspirational. The first job is finishing the VLAN segmentation I have been slowly doing: trusted, IoT and lab on separate networks, so that a compromised IoT device cannot reach the database plane. The flat network is the oldest debt in the lab.

Second is removing the single point of failure for the always-on services. The mini-PC fleet is the obvious target: a small Docker Swarm or a lightweight k3s across two or three N100 boxes so the critical services — DNS, the proxy, monitoring — survive one host dying. The AI workloads stay on the GPU box because they are pinned to that hardware anyway.

Third is treating deployment itself as code: a CI pipeline that, on a push to the homelab repo, validates the compose files and redeploys the changed stacks, so the gap between “committed” and “running” closes to zero. That makes the lab a genuine practice ground for the patterns I use professionally, which is the same reason I treat the home lab as a learning platform rather than just a place to run services.

Closing thought

The homelab taught me something I keep carrying into client work: the running system is never the artefact worth protecting. The artefact is the description of the system: the compose files, the Caddyfile, the .env.example, the documented mistakes. Get that right and the running system becomes a cheap, regenerable consequence. Get it wrong and you are one dead SSD away from archaeology.

I did not learn that by reading it. I learned it by getting it wrong, on my own hardware, on a quiet evening when something I could not reproduce stopped working. That is the real value of a homelab: it lets you make every one of these mistakes where the only thing at risk is your weekend, not someone’s production tenant.