What It Actually Takes to Run Your Own AI: One Consumer GPU, a Routed Model Mesh, and $100 a Month
Every "local AI vs cloud" article I've read shares one quiet feature: the author never touched the hardware they describe.
They quote NVIDIA A100s at €15,000 each. They build €60,000 GPU "nodes." They add "0.5 FTE of staffing," AWS SageMaker line items, €36,000 a year. The numbers are clean, the tables are tidy, and the whole thing is fiction — generated from training data by people benchmarking a spreadsheet they invented.
I want to do the opposite: show you the real system. What I run, how it's wired, what it costs, and the discipline that keeps it cheap. It's less glamorous than a fantasy datacenter. It's also something you could actually build.
The Hardware: One Card, Doing Everything
The "AI infrastructure" is a single consumer graphics card. An RTX 4070 Ti. 12 GB of VRAM. Around €800, bought as a general-purpose GPU.
There's no rack and no A100. That card lives inside a bare-metal host that is also running dozens of other containers. Zoom out and the whole homelab is 35 stacks and roughly 100 Docker containers spread across three machines — a GPU box, a second Docker host (vm-docker), and a NAS. The AI doesn't get a dedicated palace. It shares the building with everything else I self-host.
The Wall Nobody Shows You
Here's what those local models actually do on 12 GB of VRAM — real ollama --verbose generation speeds, measured this week:
| Model | Size | Speed | Fits 12 GB? |
|---|---|---|---|
| Llama 3.2 (3B) | 2.0 GB | 152 tok/s | ✓ easily |
| Mistral 7B | 4.4 GB | 98 tok/s | ✓ |
| Llama 3.1 (8B) | 4.9 GB | 84 tok/s | ✓ |
| Qwen2.5-Coder (32B) | 19 GB | 3.9 tok/s | ✗ spills to CPU |
Read the last row twice, because it governs everything.
The 32-billion-parameter model — the impressive one on every capabilities chart — runs at 3.9 tokens per second. Roughly one word every two seconds. Unusable for anything interactive. The 8B model on the same card is 20× faster.
Why? The 32B needs 19 GB; I have 12. The instant a model overflows VRAM, it spills into system RAM and CPU and falls off a cliff. This is the VRAM wall, and it's the single fact the fantasy benchmarks paper over. On a 12 GB consumer card your real ceiling is 8B-class. You don't buy past it with patience — you buy past it with VRAM.
Which is exactly why the next hardware step is a 24 GB RTX 3090, bought opportunistically second-hand. Not for prestige — to move that 32B model from 3.9 tok/s to genuinely usable, and pull more of the mesh back in-house. The whole upgrade is a response to one measured number.
So I Don't Run a Model. I Route a Mesh.
This is the part the "local vs cloud" framing misses entirely. The question was never which one. It's which job goes where.
A small proxy — LiteLLM — sits in front of everything and routes each request to the cheapest box that can do it well:
| Job | Goes to | Why |
|---|---|---|
| Bulk reasoning, drafts, classification | DeepSeek V3 (cloud API) | ~$0.27 per million input tokens — cents where frontier charges dollars |
| Step-by-step reasoning | DeepSeek R1 | replaced my local 7B reasoner — faster and cheaper than running it myself |
| Long context / code | Kimi K2 (Moonshot, 262k context) | holds an entire codebase or book in one prompt |
| Needs to be fast | Groq (Llama 3.3 70B) | absurd throughput when latency matters |
| Hard reasoning, the actual writing | Claude | the 5% that has to be genuinely good |
| Private / offline / high-volume | Local Ollama (Llama 3.1 8B, nomic-embed) | never leaves the building; zero marginal cost |
And it's not just routing — it's failover. The config has explicit fallback chains: if DeepSeek's API is down, the request drops to a local model; if the local model can't carry it, it escalates to Claude. Embeddings and the private RAG path have no silent fallback at all — if the retrieval service is down, the system errors out rather than quietly answering without sources. That's a deliberate security choice, not an accident.
That's the real answer to "local or cloud": neither — a routed mesh, chosen per task, with the cheap options first and the expensive ones held in reserve.
The Cost Has Two Regimes, Not One
Here's where the fabricated €36,000/year falls apart completely. My AI spend lives in two buckets:
1. A flat subscription for interactive work. The heavy agent sessions — the ones writing code, refactoring, driving the homelab — run on a $100/month flat Claude plan. Not metered. Ten requests or ten thousand, same bill. I did not spend $3,000 on API calls; that sentence describes a pricing model I don't use.
2. Pay-per-million credits for the pipelines. Everything automated — document OCR, email triage, RAG, the article generator you may be reading right now — runs through the mesh on DeepSeek and Kimi API credits. And the unit here is the million tokens. DeepSeek V3 is around $0.27 per million in, ~$1.10 out. The pipelines cost single-digit dollars a month. Local inference on the GPU costs a fraction of a cent per answer in electricity, on a card that idles at 8 W.
So the honest total for a working, daily-use AI stack is: one flat subscription, a handful of dollars in API credit, and a few euros of power. The fantasy benchmark wasn't just wrong on the numbers — it modelled the wrong animal: a corporation replacing a cloud contract, instead of one person assembling more capability than a 2020 enterprise had, for less than a phone plan.
The 5-Hour Problem (and the Console That Fixes It)
A flat plan has a catch: a rolling usage window. Push hard enough in a session and you hit the limit, and the interactive agent goes quiet for a while.
That used to mean I was blocked. Now it doesn't. An OpenWebUI console sits on top of the same LiteLLM mesh, so when the Claude window is spent, I keep working — the same prompts route straight to the DeepSeek and Kimi credits instead. The flat plan handles the bulk; the pay-per-million mesh absorbs the overflow. This handoff is the piece I'm actively wiring now, and it's the difference between "blocked until the window resets" and "never fully blocked."
That's the real resilience story. Not redundant A100s. A second front door onto a cheaper set of models.
Compression: Spending Fewer Tokens on Purpose
Cheap models still reward discipline, and a lot of the optimization is simply not spending tokens you don't need to — on both ends:
- Input side: prompt caching (a caching proxy in front of Claude so repeated context isn't re-billed every call), and compressed memory files — the long instruction and context files the agents load get squeezed to a denser form that says the same thing in far fewer tokens.
- Output side: terse-by-default response modes that strip filler, articles, and pleasantries — easily a 50–75% cut in output tokens for the same information, which on a per-million meter is real money and, locally, real watts.
In/out compression isn't a gimmick. On metered models it's the bill; on the local GPU it's the power draw and the latency. Same lever, three payoffs.
The Actual Job: Harden, Optimize, Reduce
None of this was designed once and left alone. The real work is a loop, and every iteration has the same three verbs:
- Harden — tighten what's exposed, remove silent fallbacks where a wrong-but-confident answer is dangerous, keep private data on private paths.
- Optimize — better routing, more caching, the right model for each job instead of the biggest model for every job.
- Reduce consumption — schedule heavy jobs off-peak (some providers are markedly cheaper at off-peak hours), push non-urgent work to CPU so the GPU can sleep, wake the GPU on demand instead of idling it hot, and organize tasks so the machine does less to deliver the same result.
That loop is the system. The hardware is almost incidental. What makes it work is treating a closet like infrastructure: measure, route, cache, schedule, repeat.
Where This Is Heading (Stated as Intent, Not Fact)
I'll be precise about the difference between what runs today and what I'm aiming at — because blurring the two is how these articles turn into fiction.
What's planned: the 24 GB RTX 3090 to break the VRAM wall and pull more of the mesh in-house; finishing the OpenWebUI handoff so the 5-hour overflow is fully automatic; and a longer-term move toward a dedicated GPU host so the AI stops sharing a card with everything else.
What I'm still questioning: there's a whole shelf of heavier orchestration — agent frameworks, long-running autonomous daemons, multi-agent "swarm" tooling. The honest answer is I haven't deployed them, and I'm not convinced they earn their place here yet. On a single 12 GB card, a long-running multi-agent framework competes for the exact VRAM my real workloads need. The bar isn't "is it impressive" — it's "does it earn its VRAM and its complexity against what I already get from a simple routed mesh." So far, for a solo operator, the simple mesh keeps winning. That may change when the 3090 lands. I'll measure it, not assume it.
That's the discipline: name the destination, but never describe it as if you'd already arrived.
What This Actually Proves
Strip away the spec sheet and here's what's left.
A complete, sovereign, mostly-private AI stack — one that classifies my documents, searches my archives, transcribes recordings, triages my inbox, runs retrieval over my own corpus, and drafts my work — runs on a single gaming GPU, a routed mesh of cheap-and-frontier models, one flat subscription, and a handful of dollars in credit. It lives in a closet next to 100 other containers. And it gets cheaper and tougher every iteration, by design.
That's the story the fantasy benchmarks hide, probably without meaning to. The frontier of AI automation isn't the hyperscaler datacenter with its €60,000 nodes. It's the collapse in the cost of doing real work — the capability that needed an enterprise budget in 2020 now needs a hobbyist's, plus the discipline to wire it well.
When automation lived in a megacorp datacenter, you could at least see it coming. The next wave doesn't have a datacenter. It has a fan, 12 GB of VRAM, a routing table, and a $100 subscription. It's already running.
Run by a practitioner, on real hardware, with measured numbers. If you're building something similar — a routed model mesh, a sovereign homelab, a cheaper way to do serious AI work — contact me. Happy to compare notes and share what I've learned.**