Designing Infrastructure for AI Workloads
There is a comfortable lie going around that infrastructure no longer matters for AI. You just call an API. Someone else owns the GPUs, the cooling, the network, the lot. Your job is a prompt and a credit card.
I do not buy it. I have spent enough time as a solutions architect, and enough nights in my own home lab, to know that the API is the thin layer on top of a very physical, very opinionated stack. The moment you care about cost per token, data residency, latency, or running anything yourself, the infrastructure underneath reasserts itself with force. AI did not abolish infrastructure. It raised the stakes.
This article is the technical backbone for a lot of what I write elsewhere. When I size a GPU in another piece, or wave my hand at “the inference box”, this is where the reasoning lives. It is deliberately opinionated, because the worst infrastructure decisions I have seen came from people refusing to take a position.
Why this still matters when you can “just call an API”
The API model is genuinely good for a class of problems. Spiky demand, frontier-model quality, no desire to own hardware — call the API and move on. I do exactly that for plenty of things.
But “just call an API” quietly assumes your data is allowed to leave the building, your latency budget tolerates a round trip to someone else’s data centre, your costs scale linearly and forgivingly, and the model you depend on will still exist, unchanged, next quarter. For a homelab those assumptions are mostly fine. For a regulated enterprise tenant they are frequently false, which is half the reason most AI projects fail — they treat a production system as a demo with a bigger prompt.
The interesting truth I keep coming back to is that AI is becoming infrastructure in the same way databases and networks did. It stops being a feature you bolt on and becomes a substrate everything else assumes is there. And substrate has to be designed, not summoned.
So I will treat an AI workload the way I treat any other tier-one system: what does it consume, where does it hurt, and what happens when it grows.
The real constraint is VRAM, not FLOPS
Everyone fixates on GPU compute. The marketing is all TFLOPS and tensor cores. In practice, for the workloads most of us actually run, the wall you hit first is video memory.
A model has to fit in VRAM to run well. If it does not fit, you either offload layers to system RAM — which collapses your throughput as data shuffles across the PCIe bus — or you drop to a smaller model or a harsher quantisation. VRAM is the gate. Everything else is negotiable.
This is why I chose a single RTX 3090 with 24GB for the inference box rather than something newer and faster with less memory. I sized for VRAM-per-pound, not raw speed, and I have never regretted it. Twenty-four gigabytes is, in my experience, the practical line where local models stop being toys. Below it you are constantly compromising; above it you are paying data-centre prices. I explain the model side of this in more depth in my journey into local LLMs, but the hardware logic is simple: buy memory.
A rough way to think about what fits. A model’s weights consume, in bytes, roughly the parameter count multiplied by the bytes-per-parameter of the quantisation. At Q4_K_M — around 4.5 bits per weight, my usual default — an 8B model is comfortably under 6GB of weights, a 13B sits around 8GB, and a 70B lands near 40GB and simply will not fit on a 24GB card without aggressive offload or a brutal 2-bit quant that hurts quality.
VRAM budget on a 24GB card (Q4_K_M, rough)
  8B    ~5-6 GB weights + KV cache   → easy, long context
  13B   ~8 GB weights + KV cache     → comfortable
  34B   ~20 GB weights + KV cache    → tight, short-ish context
  70B   ~40 GB weights               → does not fit, don't pretend
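The weights arithmetic above can be sketched in a few lines. This is a rough estimate only: the function name is hypothetical, 4.5 bits per weight is the approximate Q4_K_M figure quoted above, and real model files carry some extra metadata and mixed-precision layers that this ignores.

```python
def weight_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the weights in GB (10^9 bytes).

    Assumes every parameter is stored at the same average bit width,
    which is a simplification of how real quantised files are laid out.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


VRAM_GB = 24  # the card discussed in this article

for size in (8, 13, 34, 70):
    gb = weight_gb(size)
    verdict = "fits before KV cache" if gb < VRAM_GB else "does not fit"
    print(f"{size}B at ~4.5 bits/weight: ~{gb:.1f} GB of weights ({verdict})")
```

The estimate lands slightly under the table's figures for the smaller models, which is expected: the table is rounded up to leave room for file overhead.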
But weights are only half the story, and the half everyone forgets is the KV cache. Every token in the context window has to keep its key and value tensors resident in VRAM so attention can see them. That cache grows linearly with context length and with batch size. Push the context from 4k to 32k tokens and the KV cache can swallow several gigabytes on its own. So your usable context length is not a free model property — it is a memory budget you spend out of the same 24GB the weights already claimed.
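A minimal sketch of the KV cache arithmetic, assuming a standard transformer that stores one key and one value tensor per layer. The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an illustrative assumption for an 8B-class model with grouped-query attention, not a measurement from my box; models without grouped-query attention will be several times larger.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Bytes of KV cache for `batch` sequences of `tokens` tokens each.

    The leading 2 is one key tensor plus one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


# Assumed shape for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128)

for context in (4096, 32768):
    size = kv_cache_bytes(context, **shape)
    print(f"{context:>6} tokens: {size / GIB:.1f} GiB of KV cache")
```

Under those assumptions the cache grows eightfold when the context does, which is the linear growth described above.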
That gives a real design rule. Inference wants enough VRAM for weights plus your worst-case KV cache (context length times concurrency). Fine-tuning is a different animal: full fine-tuning needs memory for weights, gradients, and optimiser states, often three to four times the footprint of the weights at the same precision before activations are counted, which is why I do parameter-efficient tuning (LoRA / QLoRA) at home and rent a fat cloud GPU for the rare full run. Sizing for fine-tuning on consumer hardware means sizing for QLoRA or not at all.
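Where the "three to four times" comes from can be shown with simple counting. This is a sketch under stated assumptions: every tensor is held at the same precision, Adam-style optimisers keep two state values per parameter, momentum-only optimisers keep one, and activation memory is ignored entirely. The function names are hypothetical.

```python
def full_finetune_multiplier(optimizer_states_per_param: int = 2) -> int:
    """Copies of the parameter count held in memory during full fine-tuning.

    1 for the weights, 1 for the gradients, plus the optimiser states
    (2 for Adam-style optimisers, 1 for plain momentum).
    """
    return 1 + 1 + optimizer_states_per_param


def full_finetune_gb(params_billions: float, bytes_per_value: int = 4,
                     optimizer_states_per_param: int = 2) -> float:
    """Rough memory in GB for full fine-tuning, excluding activations."""
    return (params_billions * bytes_per_value
            * full_finetune_multiplier(optimizer_states_per_param))


print("Adam-style multiplier:", full_finetune_multiplier(2))
print("Momentum-only multiplier:", full_finetune_multiplier(1))
print("8B model, 32-bit, Adam-style:", full_finetune_gb(8), "GB before activations")
```

Even before activations, an 8B model at 32-bit precision is far beyond a 24GB card, which is the practical case for QLoRA at home.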
Batching is the lever that turns VRAM into throughput. A single request barely tickles the GPU; batching many concurrent requests amortises the cost of streaming weights and keeps the tensor cores fed. The catch is that batching multiplies the KV cache. So the honest sizing question is not “will the model fit” but “will the model plus the KV cache for N concurrent users at C context length fit”. For a household, N is tiny and it all just works. For an enterprise serving hundreds of sessions, N dominates and you are buying VRAM by the rack.
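The honest sizing question can be turned into a small check. This is a sketch, not a capacity planner: the function name, the 1.5 GB runtime overhead, and the 1 GB of KV cache per session are all assumptions chosen for illustration, and units are treated loosely as GB throughout.

```python
def max_sessions(vram_gb: float, weights_gb: float,
                 kv_gb_per_session: float, overhead_gb: float = 1.5) -> int:
    """How many concurrent sessions fit after weights and runtime overhead.

    kv_gb_per_session is the worst-case KV cache for one session at the
    context length you intend to allow.
    """
    if kv_gb_per_session <= 0:
        raise ValueError("kv_gb_per_session must be positive")
    free_gb = vram_gb - weights_gb - overhead_gb
    if free_gb <= 0:
        return 0  # the weights alone do not leave room for any context
    return int(free_gb // kv_gb_per_session)


# An 8B-class model on a 24GB card, assuming ~1 GB of KV cache per session.
print("8B sessions:", max_sessions(24, 4.5, 1.0))
# A 70B-class model on the same card: no room at all.
print("70B sessions:", max_sessions(24, 40, 1.0))
```

For a household the answer is comfortably more than the number of people in the house. For an enterprise, the same function run with hundreds of sessions tells you how many cards you are buying.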
On consumer versus data-centre cards: a data-centre card buys you ECC memory, much more VRAM per card, NVLink for pooling memory across cards, and a support contract. A 3090 buys you 24GB at a fraction of the price, with no ECC, an NVLink connector that NVIDIA dropped from later consumer generations, and a power-hungry consumer board. For learning, prototyping, and personal inference, the consumer card wins on every metric that matters to me. For multi-tenant production with uptime guarantees, it does not. Know which problem you are solving before you spend.
Storage: three different jobs people lump into one
“Fast storage” is not a requirement, it is a slogan. AI workloads have three distinct storage profiles and conflating them is how you end up with an expensive array that is wrong for all three.
First, model weights. These get loaded into VRAM at startup and on every model switch. What you want here is low-latency, high-throughput sequential read — fast NVMe. Loading a 40GB model off a spinning disk is a coffee break; off a decent Gen4 NVMe it is a few seconds. I keep all the Ollama models on NVMe for exactly this reason. It is the difference between a model swap being invisible and being a wince.
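The coffee-break-versus-seconds difference is just size divided by sustained read throughput. A minimal sketch, where the throughput figures (roughly 0.15 GB/s for a spinning disk, roughly 5 GB/s for a Gen4 NVMe drive) are ballpark assumptions rather than benchmarks from my hardware:

```python
def load_seconds(model_gb: float, read_gb_per_s: float) -> float:
    """Best-case time to stream a model file from disk, ignoring
    filesystem overhead and the copy from host memory into VRAM."""
    return model_gb / read_gb_per_s


# Assumed sustained sequential read speeds, for illustration only.
media = {"spinning disk": 0.15, "Gen4 NVMe": 5.0}

for name, speed in media.items():
    print(f"40 GB model from {name}: ~{load_seconds(40, speed):.0f} s")
```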
Second, datasets for training or RAG ingestion. This is a throughput game, often sequential, often large. Bulk capacity matters more than latency. This is what the NAS is for — datasets and corpora live there, get pulled to fast local storage when actively worked on, and everything important is backed up with proper 3-2-1 thinking.
Third, and the one people get wrong, the vector database. A vector DB doing similarity search is a low-latency random-read workload, much closer to a transactional database than to a dataset dump. Put it on NVMe, give it RAM for its index, and do not let it share spindles with your backup jobs. Where the vector DB physically lives is a real decision: I keep it on the always-on side of the lab, on the mini-PC fleet near the orchestration layer, not on the GPU box — because retrieval should keep working even when the GPU is busy or offline. The retrieval layer behind Project Atlas is built on exactly that separation.
# the storage split that matters
#
# Ollama runs natively on the GPU box, not in Docker. Its model
# directory points straight at fast NVMe, set on the host service:
#   OLLAMA_MODELS=/mnt/nvme/models
#   OLLAMA_KEEP_ALIVE=30m   (don't reload weights constantly)
#
# Everything else is a container; the storage split below is the point:
services:
  qdrant:
    image: qdrant/qdrant
    volumes:
      - /mnt/nvme/qdrant:/qdrant/storage   # vector index: low-latency NVMe
  ingest-staging:
    image: busybox
    volumes:
      - /mnt/nas/datasets:/data            # bulk corpora on the NAS
The principle: match the medium to the access pattern, not to a generic idea of “fast”.
Networking: east-west, model pulls, and latency
Networking for AI is mostly invisible until it is the bottleneck, and then it is the only thing anyone talks about.
Three flows matter. East-west traffic is the chatter between your own services — the n8n workflow calling the model, the model calling out to retrieval, the embeddings service answering the vector DB. In a single-box setup this is loopback and free. The moment you spread services across hosts, this traffic crosses the wire, and a chatty RAG pipeline can put a surprising amount of east-west load on the network. Keep tightly coupled services close.
Bandwidth for model pulls is the bursty one. Pulling a fresh model is tens of gigabytes at once. On home gigabit that is a few minutes; nobody cares. In a cluster pulling the same model to twenty nodes it is a thundering herd, and you want a local registry so you pull once and distribute internally.
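The model-pull arithmetic is worth making explicit. This sketch assumes a 20 GB model and a 1 Gbit/s link running at 90% efficiency; both numbers and both function names are illustrative assumptions.

```python
def pull_seconds(size_gb: float, link_gbit_per_s: float,
                 efficiency: float = 0.9) -> float:
    """Approximate transfer time: gigabytes to gigabits, divided by usable link rate."""
    return size_gb * 8 / (link_gbit_per_s * efficiency)


def upstream_gb(size_gb: float, nodes: int, local_registry: bool) -> float:
    """Data fetched over the internet link when `nodes` machines need one model."""
    return size_gb if local_registry else size_gb * nodes


print(f"One 20 GB pull on gigabit: ~{pull_seconds(20, 1.0) / 60:.1f} min")
print("20 nodes, no registry:", upstream_gb(20, 20, local_registry=False), "GB upstream")
print("20 nodes, local registry:", upstream_gb(20, 20, local_registry=True), "GB upstream")
```

The registry does not make any single pull faster. It turns twenty external pulls into one, and moves the rest onto the internal network where bandwidth is cheap.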
Latency for interactive use is the one users feel directly. Time-to-first-token is a latency budget, and every hop spends it. This is the strongest technical argument for keeping inference local: when the interaction is conversational, a local model that adds only tens of milliseconds of network overhead can beat a faster cloud model sitting behind a 100ms+ round trip. For batch work latency is largely irrelevant; for a chat assistant it is the whole experience.
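Treating time-to-first-token as a budget means adding up the hops. The figures below are invented purely to illustrate the shape of the trade-off, with the cloud model deliberately given the faster prefill; they are not measurements.

```python
def time_to_first_token_ms(hops: dict) -> float:
    """Sum every hop that sits between the user's request and the first token."""
    return sum(hops.values())


# Hypothetical budgets in milliseconds.
local = {"network_round_trip": 2, "queueing": 0, "prompt_prefill": 250}
cloud = {"network_round_trip": 120, "queueing": 40, "prompt_prefill": 150}

print("local:", time_to_first_token_ms(local), "ms")
print("cloud:", time_to_first_token_ms(cloud), "ms")
```

With these assumed numbers the slower local model still answers first, because the cloud path spends more of its budget before the model has started work.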
At home this is one flat network being slowly segmented into VLANs — trust, IoT, and lab. Segmenting the lab is not just tidiness; it is the network half of governance, which I will come back to.
GPU placement: bare metal, passthrough, MIG, and containers
Here is where I take a firm position, because the options genuinely differ — and because I changed my mind by getting it wrong first.
The inference box is bare-metal Ubuntu — AMD Ryzen, an RTX 3090, the NVIDIA driver and Ollama installed straight on the host. The question every local AI builder faces is how the GPU reaches the workload, and there are really three answers.
Bare metal installs the driver on the host and runs the workload there directly, with nothing between the runtime and the card. Lowest overhead, highest stability, least flexibility. This is what I use, and I arrived at it the hard way. I first tried PCIe passthrough — handing the whole physical card to a single VM to get snapshots and rebuildability for free. It worked, but it was measurably slower and far more fragile across kernel and driver updates than running on the metal, and the snapshots bought me less than the fragility cost. For a single-GPU box that does one job — serve models — the hypervisor earned nothing, so I removed it. The longer version of that story is in the bare-metal box that runs Atlas.
PCIe passthrough is still the right answer when you genuinely need the card inside a VM — when the same host must also run unrelated guests, or when isolation between tenants matters more than raw simplicity. It hands the whole physical GPU to one VM at near-native speed. The cost is a tower of configuration that breaks in subtle ways every time something underneath it updates, which is exactly the cost that pushed me off it for a single-purpose box.
MIG — Multi-Instance GPU — slices one physical data-centre card into several hardware-isolated instances, each with its own slice of compute and memory. It is brilliant for multi-tenant serving where you want guaranteed isolation between workloads. It is also a data-centre-card feature. My consumer 3090 cannot do it, so the question is academic at home but central in the enterprise, where MIG is how you stop one noisy tenant starving another.
On containers: every portable service in the lab runs in Docker — reproducible, defined in compose, disposable — and that is the right unit for an AI service. The one deliberate exception is Ollama itself, which runs natively on the GPU box so it talks to the driver with nothing in the way. The lazy take is “containerise everything”; the honest answer is that the model runtime earns its place on bare metal for the same reason the box does — it is faster and simpler when one process owns one card — while everything around it stays in a container. That split is a recurring theme in lessons from building a Docker homelab.
Power and cooling: the constraint nobody costs
This is the section people skip, and it is the one that has changed my thinking most.
A 3090 under sustained inference pulls real power — comfortably 300+ watts, the host around it adding more. Run that around the clock and it is a continuous load that shows up on the electricity bill and as heat in the room. Heat is not a metaphor; a GPU dumping a few hundred watts into a small space needs that heat moved, and “the room got uncomfortably warm” is a genuine engineering signal that you are under-provisioned on cooling.
Which leads to the metric that actually matters: cost per token, and where you pay it. Cloud inference bundles power, cooling, and amortised hardware into a per-token price. Running locally, you pay for the card once and then pay for every watt-hour forever. At low utilisation the cloud is cheaper because you are not paying to keep idle silicon warm. At high, steady utilisation, owned hardware wins because you have amortised the capital and you are only buying electricity. The crossover point depends entirely on your duty cycle, and almost nobody calculates it before deciding.
Cost-per-token sanity check (illustrative)
Local: (card cost / lifetime tokens) + (kW × hours × £/kWh) / tokens
Cloud: £ per million tokens, all-in, zero capital
Low duty cycle → cloud wins (you stop paying when idle)
High duty cycle → local wins (capital amortised, only power left)
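The sanity check above can be sketched as a function of duty cycle. Every input here is a placeholder assumption (card price, lifetime, throughput, load and idle power, tariff), and the function name is hypothetical; the point is the shape of the curve, not the specific pounds and pence.

```python
def local_cost_per_million(card_cost: float, years: float, duty_cycle: float,
                           tokens_per_second: float, load_watts: float,
                           idle_watts: float, price_per_kwh: float) -> float:
    """Local cost per million tokens: amortised capital plus electricity.

    duty_cycle is the fraction of wall-clock time spent generating tokens.
    Idle power is charged too, because the box stays on between requests.
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    hours = years * 365 * 24
    busy_hours = hours * duty_cycle
    tokens = tokens_per_second * busy_hours * 3600
    kwh = (load_watts * busy_hours + idle_watts * (hours - busy_hours)) / 1000
    return (card_cost + kwh * price_per_kwh) / tokens * 1e6


# Placeholder inputs: £700 card, 3 years, 50 tok/s, 350 W busy, 50 W idle, £0.25/kWh.
for duty in (0.01, 0.1, 0.5):
    cost = local_cost_per_million(700, 3, duty, 50, 350, 50, 0.25)
    print(f"duty cycle {duty:>4.0%}: £{cost:.2f} per million tokens")
```

Compare each line against whatever per-million price your cloud provider quotes. The duty cycle at which the local figure drops below it is your crossover.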
There is a pleasing symmetry here with my home energy work: the same time-of-use tariff that makes me shift the battery charge into cheap half-hours makes me think about scheduling heavy local inference and fine-tuning into those same windows. Power-aware compute is not a data-centre-only idea. It is just more visible when the meter is yours.
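Scheduling a heavy job into the cheap half-hours is a sliding-window minimum. A minimal sketch, assuming you already have a list of half-hourly prices for the day; the prices below are made up, and the function name is hypothetical.

```python
def cheapest_window(prices: list, slots_needed: int) -> tuple:
    """Return (start_index, total_price) of the cheapest contiguous run of slots.

    Uses a sliding sum so each slot is visited once.
    """
    if not 0 < slots_needed <= len(prices):
        raise ValueError("slots_needed must be between 1 and len(prices)")
    window = sum(prices[:slots_needed])
    best_start, best_cost = 0, window
    for start in range(1, len(prices) - slots_needed + 1):
        # Slide the window one slot right: add the new slot, drop the old one.
        window += prices[start + slots_needed - 1] - prices[start - 1]
        if window < best_cost:
            best_start, best_cost = start, window
    return best_start, best_cost


# Invented half-hourly prices in pence per kWh.
prices = [30, 28, 12, 8, 7, 9, 25, 31]
start, cost = cheapest_window(prices, slots_needed=3)
print(f"Start the job at slot {start}; the window sums to {cost}p per kWh-slot")
```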
Governance: who can call what, with whose data
An AI system that anyone can call with any data is not a capability, it is a liability. Governance is infrastructure, not paperwork.
Three questions decide the design. Data residency — where is the data allowed to be processed? This is frequently the deciding factor between local and cloud, and it is the single biggest reason a regulated customer cannot “just call an API”. If the data may not leave the tenant, the inference comes to the data. Access — who, or which service, is permitted to call which model, and with what scope? Treat the model endpoint like any other privileged API: authenticated, authorised, rate-limited, not an open port on the lab network. Audit — every call logged with who, when, what prompt, what context was retrieved, and what came back. Without that you cannot answer the questions that always eventually get asked.
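The access and audit questions can be sketched as a thin wrapper around the model call. This is an illustration under assumptions, not my actual setup: the caller names, model names, and the `generate` callable are hypothetical stand-ins, and a real deployment would put authentication and rate limiting in front of this.

```python
import json
from datetime import datetime, timezone

# Hypothetical allow-list: which caller may use which model.
ACL = {
    "n8n-workflows": {"chat-8b", "embedder"},
    "health-check-report": {"chat-8b"},
}


def is_authorised(caller: str, model: str, acl: dict = ACL) -> bool:
    """Deny by default: unknown callers get an empty set of models."""
    return model in acl.get(caller, set())


def audit_record(caller, model, prompt, retrieved_ids, response) -> str:
    """One JSON line per call: who, when, what prompt, what context, what came back."""
    return json.dumps({
        "when": datetime.now(timezone.utc).isoformat(),
        "who": caller,
        "model": model,
        "prompt": prompt,
        "retrieved": list(retrieved_ids),
        "response": response,
    })


def call_model(caller, model, prompt, retrieved_ids, generate):
    """Check access, call the model via `generate`, and return (response, audit line)."""
    if not is_authorised(caller, model):
        raise PermissionError(f"{caller} may not call {model}")
    response = generate(prompt)
    return response, audit_record(caller, model, prompt, retrieved_ids, response)
```

A fuller version would also write an audit line for denied calls, since refused requests are often the ones someone asks about later.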
This is where the enterprise lesson and the home lab meet. The Microsoft 365 health check work taught me that the report is only trusted if you can show exactly what was queried and why. The same discipline applies to my own assistant: the retrieval is logged, the tool calls go through n8n where they are recorded, and the model only sees what it is allowed to see. Governance you build in from the start is a feature. Governance you bolt on after an incident is a remediation project.
What I got wrong, and would not repeat
I under-specified storage the first time round and ran models off a slow disk, then spent weeks blaming the GPU for slow model switches that were entirely an I/O problem. Match the medium to the access pattern early.
I also under-estimated the KV cache badly. I sized a model to fit comfortably, then watched it OOM the moment I gave it a long context and a couple of concurrent users, because I had budgeted for weights and forgotten that context costs memory too. Now I size for weights plus worst-case cache, always.
And I ignored power for far too long, treating the electricity as free until the bill and the room temperature both disagreed. The home lab is honest in a way the cloud is not — you feel every watt — and that feedback made me a better architect for the day job.
Where this goes next
The roadmap is concrete. Finish the VLAN segmentation so the lab network is properly isolated from trust and IoT, which closes the network side of governance. Stand up a small local model registry so model pulls are cached internally rather than re-fetched. Add proper GPU metrics into the existing Prometheus and Grafana so VRAM headroom, power draw, and tokens-per-second are graphed, not guessed — you cannot manage a constraint you do not measure. And formalise the cost-per-token model into a real spreadsheet with my actual duty cycle and tariff, so the local-versus-cloud decision for each workload is a number, not a vibe.
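One possible starting point for the GPU metrics item is to parse the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total,power.draw --format=csv,noheader,nounits` and emit lines in the Prometheus text format. The sketch below only handles the parsing and formatting on a sample line; the metric names and prefix are hypothetical, and it assumes every field is numeric (some cards report power as not available).

```python
def parse_gpu_csv(line: str) -> dict:
    """Parse one 'used, total, power' line (MiB, MiB, watts) into metrics."""
    used_mib, total_mib, power_w = (float(field) for field in line.split(","))
    return {
        "vram_used_mib": used_mib,
        "vram_total_mib": total_mib,
        "vram_headroom_mib": total_mib - used_mib,
        "power_watts": power_w,
    }


def to_prometheus(metrics: dict, prefix: str = "homelab_gpu_") -> str:
    """Render metrics as simple 'name value' lines for a text-file collector."""
    return "\n".join(f"{prefix}{name} {value}" for name, value in metrics.items())


# A made-up sample line in the shape nvidia-smi produces with the flags above.
sample = "10240, 24576, 312.45"
print(to_prometheus(parse_gpu_csv(sample)))
```

Tokens-per-second does not come from the driver, so that number would have to be exported by whatever sits in front of the model runtime.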
Longer term, a second GPU is tempting, but only if a workload genuinely needs the pooled VRAM. Buying compute I have no use for is the mistake I tell customers not to make.
Closing thought
The cloud did not delete infrastructure. It abstracted it, charged you for it, and let you forget it was there. AI workloads pull it straight back into view, because they are heavy, hungry, latency-sensitive, and bound by where data is allowed to live. The architects who do well in this era are not the ones who learned to call an API. They are the ones who still understand what the API is standing on.
Design the substrate deliberately. Buy memory, match storage to access patterns, keep interactive inference close, respect the watts, govern from day one. Do that and the model becomes what it should always have been — the easy part.