My Journey into Local LLMs

The first time I ran a language model on my own hardware, it was slow, the output was mediocre, and I was completely hooked. There is a particular feeling the first time a model answers you with no network involved at all — no API key, no token meter ticking over, no terms of service deciding what you are allowed to ask. The weights are on a disk you own. The inference happens on silicon in a box three feet away. It is yours.

That feeling is not a good enough reason on its own, and I want to be honest about that up front. The frontier hosted models are still smarter than anything I can run at home, and for the genuinely hard problems they remain the right tool. But somewhere between “this is a toy” and “this replaces everything” there is a large, useful, and rapidly growing space where running models locally is the correct engineering decision — and most people never explore it because they assume it is either trivial or impossible. It is neither.

This is the article I wish I had read before I started. It is about why you would bother running LLMs locally at all, the GPU reality that nobody tells you, what quantisation actually trades, how to pick a model for a job instead of a leaderboard, and how Ollama ties it together. It sits underneath a lot of the rest of my work: the assistant in Project Atlas runs on exactly this stack, and the hardware reasoning here is the practical companion to the deeper sizing exercise in designing infrastructure for AI workloads.

Why local at all

The honest answer is that no single reason justifies it, but a stack of them does.

Privacy is the one everyone reaches for first, and it is real. When I run a health-check report or summarise a customer’s tenant configuration, that data never leaves my network. There is no clause in a vendor’s data-processing agreement to read three times. For anything touching client systems, “the inference happened on a machine I control” is a sentence that ends a lot of awkward conversations.

Control is the quieter, more important one. A hosted model can change underneath you without warning — a new version, a different refusal behaviour, a deprecated endpoint, a tightened rate limit on the Tuesday you had a deadline. A local model is a file. It does exactly what it did yesterday because it is byte-for-byte the same thing. When I build automation on top of a model, that determinism is worth a great deal. I describe the broader version of this argument in AI is becoming infrastructure: the moment you depend on something operationally, you want to own its failure modes.

Then there is cost and latency. Bulk work — classifying a thousand documents, drafting boilerplate, summarising logs overnight — is exactly the kind of high-volume, low-stakes task where per-token pricing adds up and where a model sitting warm on local VRAM answers in milliseconds with no round trip. No rate limits. No 429s at 2am when a batch job runs hot.

And finally, learning. You do not really understand how these systems behave until you have watched one fill your VRAM and slow to a crawl because you asked for one more billion parameters than your card could hold. Running models locally taught me more about how they actually work than any amount of using them through a polished API ever did.

If hosted models are the frontier, local models are the foundation. You want to own the foundation.

What local is not is a replacement for the best hosted models on the hardest tasks. When I need genuine reasoning over a thorny architecture problem, the largest frontier models are still ahead, and pretending otherwise is how people end up disappointed. The skill is knowing which jobs fall on which side of the line.

The GPU reality: VRAM is king

Here is the single most important thing I learned, and it is the thing the benchmark culture obscures: for local inference, VRAM is king. Not clock speed, not the headline TFLOPS, not the marketing tier of the card. The question that decides whether a model runs at all is simply: does it fit in video memory?

A language model has to load its weights into VRAM to run quickly. If the weights plus the context plus the working memory fit, you get fast inference. If they do not fit, you spill into system RAM over the PCIe bus, and performance falls off a cliff — we are talking an order of magnitude slower, the difference between a conversation and a coffee break.

This is why I run a single RTX 3090 with 24GB of VRAM rather than something newer and faster with less memory. A 4070-class card might win a gaming benchmark, but with 12GB it simply cannot hold the models I actually want to run at a quality I am happy with. The 3090, bought second-hand off the back of the gaming market, gives me 24GB for a sane price, and 24GB turns out to be the practical line where local LLMs get genuinely useful. I made the same VRAM-per-pound argument when I built the wider AI infrastructure lab at home, and I would make it again today.

What actually fits on 24GB, in my experience:

7–8B models run comfortably, fast, with a generous context window. This is the daily-driver class.
13–14B models run fine at a sensible quant — a little slower, noticeably more capable for harder instruct and coding work.
32–34B models are the stretch: doable at Q4, with a tighter context budget and patience.
70B models only at aggressive quantisation, with a small context, and slowly. It works, it is occasionally worth it, but you feel every gigabyte.

The lesson that took me longest to internalise: a 14B model that fits entirely in VRAM will beat a 70B model that is half spilling into system RAM, every time, on responsiveness. Fit first. Cleverness second.

Quantisation, explained properly

If you come from an engineering background, quantisation deserves a proper explanation rather than the hand-wave it usually gets.

A model’s weights are originally trained in 16-bit floating point. Quantisation stores those weights at lower precision — 8-bit, 5-bit, 4-bit, sometimes lower — so each parameter takes fewer bytes. The whole file shrinks roughly in proportion. An 8B model in full FP16 is around 16GB; at 4-bit it is closer to 4.5GB. That is the difference between “barely fits” and “fits four times over with room for a big context”.

The format I live in is GGUF, the packaging used by llama.cpp and therefore by Ollama. Within GGUF you choose a quantisation level, and the naming looks cryptic until you decode it. The one you will see most is Q4_K_M: 4-bit, “K-quant” method, medium variant. The K schemes are smart — they spend more bits on the weights that matter most and fewer on the rest, which is why a modern Q4_K_M holds up far better than a naive 4-bit quant from a few years ago.

The trade is precisely this: lower bits means smaller file, less VRAM, and faster inference, at the cost of some quality. Quality loss is usually measured as perplexity — how surprised the model is by real text — and the curve is the key insight. Going from FP16 down to Q6 or Q5 costs almost nothing measurable. Q4_K_M sits at the sweet spot of the curve, where you get most of the size saving for a quality drop you genuinely struggle to notice on most tasks. Below 4-bit, the curve turns sharp and quality degrades fast.

So my pragmatic rule:

Q4_K_M is my default. Best balance of size, speed, and quality for almost everything.
Q5_K_M / Q6_K when the model is small enough that I have VRAM to spare and want the last few percent of quality — coding and structured-output tasks benefit most.
Q8 only when I am specifically checking how much quantisation is costing me on a task.
Below Q4 only to squeeze a model in that otherwise would not fit at all, and only after testing that it has not gone stupid.

The perplexity numbers are a guide, not a verdict. What matters is whether the quant still does your job.

Choosing a model by the job, not the leaderboard

This is the part where I have changed my mind the most. When I started, I chased leaderboards — whatever topped the charts that week got downloaded. It was a waste of bandwidth. The recurring lesson, the one I keep relearning, is that the model is not the product. The product is the thing you build around it, and different models are simply better at different jobs.

The model is not the product.

In rough terms, here is how I actually allocate work across the models I keep around:

Llama 3.1 (8B) — my reliable generalist. Good instruction-following, sane defaults, well-behaved in automation. When in doubt, start here.
Qwen2.5 — my pick for coding and tight instruct work. The coder variants in particular punch well above their parameter count, and for structured output it is consistently the strongest small model I run.
Mistral — fast, lean, good at summarisation and high-volume bulk work where I want throughput.
gpt-oss — a genuinely capable open-weight option that I reach for when I want stronger reasoning locally and can spare the VRAM.
DeepSeek — the one I pull out for harder reasoning and maths-flavoured problems when I want to see how far local can stretch.

None of these is “the best”. Each is the best at something. A leaderboard collapses that into a single number and throws away the only information I care about. The way I evaluate is to keep a small folder of my own real tasks — a tricky n8n transformation, a Conditional Access summary, a Python refactor — and run candidate models against them. That homemade eval set has been worth more than every public benchmark combined.

Ollama as the runtime

Everything above would be academic without a runtime that makes it pleasant, and for me that is Ollama. It wraps llama.cpp, handles model download and storage, manages what is loaded in VRAM, and exposes a clean HTTP API. It is the layer that turned “an interesting weekend” into “a thing I actually use”.

The day-to-day is unremarkable in the best way:

# Pull a model at a specific quant and chat with it
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M

# See what is downloaded and what is currently loaded in VRAM
ollama list
ollama ps

Two settings matter more than any other. The first is the context window, num_ctx. Bigger context lets the model see more at once, but — and this is the trap — context costs VRAM, and it costs it on top of the weights. Double the context and you can push a model that fit comfortably into spilling over the edge. I size context to the job, not to the maximum the model supports.

The second is keeping models warm. The first request after a model loads is slow because the weights have to stream into VRAM; subsequent requests are fast while it stays resident. Ollama unloads models after an idle timeout to free memory, so for anything latency-sensitive I keep the primary model pinned in memory and accept that it is occupying VRAM full time. That trade — memory for responsiveness — is one you make deliberately.

The thing that makes Ollama genuinely useful for building, though, is the Modelfile. It lets me bake a base model, a system prompt, and parameters into a named, version-controlled artefact:

# Modelfile — a focused assistant for infra summarisation
FROM qwen2.5:14b-instruct-q4_K_M

PARAMETER num_ctx 8192
PARAMETER temperature 0.3
PARAMETER top_p 0.9

SYSTEM """
You are an infrastructure assistant. Answer concisely and technically.
Prefer British English. When asked for config, return valid, copy-pasteable
snippets. If you are unsure, say so rather than inventing detail.
"""

ollama create infra-assistant -f ./Modelfile
ollama run infra-assistant

Now infra-assistant is a reproducible thing I can commit to Git, the same way I treat every other piece of my homelab as code, in the spirit of my Docker homelab lessons. The API is equally plain — a POST to /api/generate or /api/chat — which is exactly how n8n and the rest of Atlas talk to it. The model becomes just another service on the network with an endpoint.

A router mindset: where local fits versus hosted

The mistake is treating “local versus hosted” as a religious choice. It is not. It is a routing decision you make per task, and once I started thinking of it that way the whole thing clicked.

Cheap, private, bulk, latency-sensitive, or determinism-critical work goes local. Hard reasoning, frontier-level capability, the genuinely novel problem — that goes hosted. The trick is having an explicit rule for which is which, rather than defaulting to whichever you happen to have a tab open for.

flowchart TD
    A[Incoming task] --> B{Sensitive or private data}
    B -- Yes --> L[Run local]
    B -- No --> C{High volume or bulk}
    C -- Yes --> L
    C -- No --> D{Needs frontier reasoning}
    D -- Yes --> H[Use hosted frontier model]
    D -- No --> E{Coding or structured output}
    E -- Yes --> Q[Local Qwen2.5 at Q5]
    E -- No --> F{General chat or summary}
    F -- Yes --> G[Local Llama or Mistral at Q4]
    F -- No --> H
    L --> M{Fits in 24GB at Q4}
    M -- Yes --> N[Load and keep warm]
    M -- No --> O[Drop a size or raise quant pressure]

In practice the router lives in n8n. A workflow inspects the task, decides the destination, and only escalates to a hosted model when the local one is genuinely the wrong tool. Most days, most tasks never leave the building.

Lessons learnt

Some of these I learned the slow way.

Context length is not free, and it is the silent VRAM killer. I have spent more time than I would like debugging an out-of-memory error that turned out to be a context window I had bumped up and forgotten about. The weights are the obvious cost; the KV cache for a large context creeps up behind you.

Tokens per second sets the experience, and expectations matter. On the 3090, an 8B model at Q4 runs fast enough to feel conversational. A 70B at aggressive quant produces tokens at reading pace at best. Neither is wrong — but if you expect frontier-API speed from a 70B on a single consumer card you will be perpetually disappointed. Know the number for each model before you build a UX on top of it.

Benchmarks lie, or at least mislead. A model that tops a public leaderboard can be mediocre at the specific shape of work you do, and a humbler model can be excellent at it. Test on your own tasks. I will say it again because it cost me real time to learn: build a small eval set of your actual work and trust that over any chart.

The ecosystem moves weekly. A new model, a better quant, a runtime improvement, a context-length breakthrough — it genuinely changes month to month. This is exhilarating and exhausting in equal measure. I have learned to pin a working stack and resist upgrading mid-project, while keeping a scratch environment for the churn. Treat it like any fast-moving dependency: control when the change lands.

I got the GPU question wrong at first. My instinct was to optimise for speed, and I nearly bought a faster card with less memory. The 3090 with its 24GB was the right call, and it was VRAM, not flops, that made it right. If I were buying again today I would still start the question with “how much VRAM” and only then ask “how fast”.

Where this goes next

A few things I am actively working towards.

Small models keep getting better, faster than I expected. The 7–8B class today is comfortably ahead of where the 13B class sat eighteen months ago, and that trend is the most exciting thing in the space because it directly expands what fits on hardware I already own. More capability per gigabyte is the gift that keeps giving.

Speculative decoding is next on my list to set up properly — a small draft model proposing tokens that a larger model verifies, buying real speed without a quality cost. On a single card the gains are meaningful, and it is exactly the kind of optimisation that makes a 14B feel closer to an 8B in responsiveness.

I want to fine-tune my own small model on my knowledge base and writing, rather than relying on retrieval alone — a model that has genuinely internalised my domain, paired with the retrieval layer that already feeds Project Atlas. And underpinning all of it, a proper, growing eval harness, because the only way to navigate a weekly-moving ecosystem without thrashing is to be able to measure, quickly and honestly, whether a change actually made my real work better.

Closing thought

I did not get into local LLMs because they were better than the hosted frontier. They are not, and I have tried hard not to pretend otherwise. I got into them because owning the whole stack — the weights, the hardware, the runtime, the failure modes — taught me how this technology actually behaves, and made me a far better judge of when to reach for the frontier and when not to bother.

The gap between what I can run at home and what the largest hosted models can do will open and close as the field lurches forward. None of that changes the core of it. There is a box three feet away that answers when I ask, on data that never leaves my network, doing exactly the same thing today as it did yesterday. For a growing share of the work I actually do, that is not a compromise. It is the better answer.