Picking a Local Model: Quant, VRAM and the 8B Sweet Spot

When I started running models locally I made the mistake everyone makes: I reached for the biggest one that would physically fit in my GPU’s memory, ran it once, watched it crawl, and concluded that local AI was not ready. The model was not the problem. My selection criteria were. “Biggest that fits” is the single worst way to choose a local model, and unlearning it was the thing that made the whole setup actually useful.

There are three dials that matter — parameter count, quantisation, and the VRAM you have to spend — and choosing well is about balancing them against the job, not maximising any one of them.

The biggest model your card can hold is almost never the right model to run on it. The right one is the smallest that does the job well, run fast enough that you actually use it.

The three dials

Parameter count is the headline number — 8B, 14B, 70B. More parameters generally means more capability, but with steeply diminishing returns for most real tasks and a steeply rising cost in memory and speed. The jump from a 3B to an 8B model is enormous in practice. The jump from a 14B to a 70B is real but far smaller, and it is rarely worth what it costs you in latency for everyday work.

Quantisation is the dial people understand least and that matters most. A model’s weights are originally stored as high-precision numbers, typically 16-bit floats; quantisation compresses them to fewer bits (8-bit, 5-bit, 4-bit), trading a little quality for a lot of memory and speed. The crucial, counterintuitive fact is that, for the same memory budget, a larger model at a lower-bit quant usually beats a smaller model at higher precision. A 14B model squeezed to 4-bit often outperforms an 8B model at 8-bit, while using similar VRAM: roughly 7 GB against 8 GB for the weights alone.
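
The memory side of that trade is simple arithmetic, and a minimal sketch makes it concrete. The function name `weight_gb` is mine, and the figures cover the weights only. Real 4-bit formats tend to average somewhat more than their nominal bit width, so treat the output as a lower bound rather than a promise.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # params * bits / 8 gives bytes; the "billion" cancels against the "giga".
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("8B  @ 16-bit", 8, 16),
    ("8B  @ 8-bit", 8, 8),
    ("14B @ 4-bit", 14, 4),
    ("70B @ 4-bit", 70, 4),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
```

On these rough numbers the 14B model at 4-bit needs slightly less memory for its weights than the 8B model at 8-bit, which is why the two are fair rivals for the same budget.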

VRAM is the hard ceiling everything else has to fit under. It is non-negotiable in a way the other two are not: exceed it and the model either refuses to load or spills into system memory and slows to a crawl. Every choice is really a negotiation against this fixed budget.
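
Weights are not the only thing that has to fit under that ceiling: the context you feed the model takes memory too, in the form of the key/value cache. Here is a rough sketch of the budget check. The layer and head counts are my assumption for an 8B Llama-style model, the 4.5 bits per weight is an assumed average for a 4-bit quant, and the 1 GB runtime overhead is a guess, so check all three against your own setup.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for a single sequence."""
    # Every token stores one key and one value vector per layer per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def fits(weights_gb: float, cache_gb: float, vram_gb: float,
         overhead_gb: float = 1.0) -> bool:
    """True if weights + cache + a guessed runtime overhead fit in VRAM."""
    return weights_gb + cache_gb + overhead_gb <= vram_gb


# Assumed shape for an 8B Llama-style model: 32 layers, 8 KV heads, head dim 128.
cache = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
weights = 8 * 4.5 / 8  # 8B parameters at an assumed ~4.5 bits per weight
print(f"KV cache: ~{cache:.2f} GB, weights: ~{weights:.1f} GB")
print("fits in 8 GB:", fits(weights, cache, vram_gb=8))
print("fits in 6 GB:", fits(weights, cache, vram_gb=6))
```

The cache grows linearly with context length, so a model that loads comfortably with a short prompt can still hit the ceiling once the context fills up.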

flowchart LR
    A[The job] --> B{How much VRAM?}
    B --> C[Pick the largest params that fit at a sane quant]
    C --> D[Q4 or Q5 usually beats full-precision smaller]
    D --> E[Measure speed on YOUR box]
    E --> F{Fast enough to actually use?}
    F -->|no| C
    F -->|yes| G[Ship it]

Why 8B is the sweet spot for most jobs

For the bulk of what I do locally — drafting, summarising, classifying, the retrieval-augmented work behind Atlas — an 8B model at a 4- or 5-bit quant is the sweet spot, and I keep being surprised by how rarely I need more. It fits comfortably in a single consumer GPU’s memory with room left for context. It runs fast enough that the assistant feels responsive rather than something I set going and walk away from. And it is more than capable enough for tasks that are really about applying knowledge I have retrieved, rather than dredging obscure facts out of the weights.

That last point is the key. Most of my local tasks are not tests of how much the model knows. They are tests of how well it can work with the context I hand it. And for “read this and do something sensible with it”, an 8B model is plenty. The capability I would gain from a 70B model is mostly capability I do not need, bought at a latency I would resent.

When bigger is actually worth it

This is not an argument that small always wins. There are jobs where the larger model earns its keep, and refusing to use one out of minimalist principle is the same mistake as always reaching for the biggest. Genuinely hard reasoning — multi-step problems where a smaller model loses the thread — benefits from more parameters. Tasks that lean on broad world knowledge baked into the weights, rather than knowledge you retrieve and supply, reward size. And anything where quality matters far more than speed, and you are happy to wait, can justify the big model.

The discipline is matching the model to the task instead of running one model for everything. I keep a small one for fast interactive work and reach for a larger one deliberately, for the specific jobs that need it. The mistake is treating model choice as a single permanent decision rather than a per-task one.

Measure on your own box

The numbers people quote online are from someone else’s hardware, with someone else’s context length, doing someone else’s task. The only benchmark that matters is tokens per second on your card, with your typical prompt. So I measure. Pull the model, run a representative prompt, watch the actual throughput, and decide whether it is fast enough to be a tool I reach for rather than a demo I ran once.

# Pull a sensible default and see how it actually performs here
ollama pull llama3.1:8b
ollama run llama3.1:8b --verbose "Summarise the following in three bullets: ..."
# watch the eval rate — tokens/sec on THIS box is the only number that counts
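
The same measurement can be scripted against Ollama’s local HTTP API, which is handy when comparing several models on one prompt. This is a minimal sketch that assumes a server on the default `localhost:11434`. The function names are mine; `eval_count` and `eval_duration` (in nanoseconds) are the fields the `/api/generate` endpoint returns in a non-streaming reply.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] * 1e9 / response["eval_duration"]


def measure(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation against a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as reply:
        return tokens_per_second(json.load(reply))


# A reply-shaped dict with made-up numbers, only to show the arithmetic:
# 300 tokens generated in 6 seconds.
sample = {"eval_count": 300, "eval_duration": 6_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Calling `measure("llama3.1:8b", prompt)` with a prompt typical of your own work gives the number that matters. Run it a few times, because the first call also pays for loading the model into memory.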

“Fast enough to actually use” is a real threshold and it is personal. Below it, a model is a thing you avoid, and an AI tool you avoid is worthless no matter how clever it is. Above it, the model disappears into the workflow and you stop thinking about it, which is the whole goal.

The thing I keep relearning

The model is not the product. The context you feed it is, and the speed at which it responds decides whether you use it at all. I spent my first weeks of local AI optimising the wrong variable — chasing parameter count as if it were a score — and the setup only became genuinely useful once I started choosing the smallest model that did each job well and running it fast.

A well-chosen 8B model at a 4-bit quant, fed good retrieved context and responding quickly, will beat a 70B model you are too impatient to wait for, every single day. Pick for the job, quantise without fear, measure on your own hardware, and let go of the instinct to maximise the headline number. The headline number was never the point.