Frédéric Guariento
Cybersecurity · AI · Digital sovereignty: field notes

2026-06-26 · EN

What It Actually Takes to Run Your Own AI: One Consumer GPU, a Routed Model Mesh, and $100 a Month

Every "local AI vs cloud" article I've read shares one quiet feature: the author never touched the hardware they describe.

They quote NVIDIA A100s at €15,000 each. They build €60,000 GPU "nodes." They add "0.5 FTE of staffing," AWS SageMaker line items, €36,000 a year. The numbers are clean, the tables are tidy, and the whole thing is fiction — generated from training data by people benchmarking a spreadsheet they invented.

I want to do the opposite: show you the real system. What I run, how it's wired, what it costs, and the discipline that keeps it cheap. It's less glamorous than a fantasy datacenter. It's also something you could actually build.

The Hardware: One Card, Doing Everything

The "AI infrastructure" is a single consumer graphics card. An RTX 4070 Ti. 12 GB of VRAM. Around €800, bought as a general-purpose GPU.

There's no rack and no A100. That card lives inside a bare-metal host that is also running dozens of other containers. Zoom out and the whole homelab is 35 stacks and roughly 100 Docker containers spread across three machines — a GPU box, a second Docker host (vm-docker), and a NAS. The AI doesn't get a dedicated palace. It shares the building with everything else I self-host.

The Wall Nobody Shows You

Here's what those local models actually do on 12 GB of VRAM — real ollama --verbose generation speeds, measured this week:

Model | Size | Speed | Fits 12 GB?
Llama 3.2 (3B) | 2.0 GB | 152 tok/s | ✓ easily
Mistral 7B | 4.4 GB | 98 tok/s | ✓
Llama 3.1 (8B) | 4.9 GB | 84 tok/s | ✓
Qwen2.5-Coder (32B) | 19 GB | 3.9 tok/s | ✗ spills to CPU

Read the last row twice, because it governs everything.

The 32-billion-parameter model, the impressive one on every capabilities chart, runs at 3.9 tokens per second. That is roughly three words per second, so a 500-token answer takes more than two minutes. Unusable for anything interactive. The 8B model on the same card is more than 20× faster.
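To make that gap concrete, here is a minimal sketch of the arithmetic. The speeds are the measured figures from the table; the 500-token reply length is an assumption picked for illustration, not a measurement.

```python
# Generation speeds (tokens per second) copied from the table above.
SPEEDS_TOK_PER_S = {
    "Llama 3.2 (3B)": 152.0,
    "Mistral 7B": 98.0,
    "Llama 3.1 (8B)": 84.0,
    "Qwen2.5-Coder (32B)": 3.9,
}


def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to stream `tokens` at a steady generation speed."""
    return tokens / tok_per_s


def speedup(fast_tok_per_s: float, slow_tok_per_s: float) -> float:
    """How many times faster the first speed is than the second."""
    return fast_tok_per_s / slow_tok_per_s


if __name__ == "__main__":
    reply_tokens = 500  # assumed reply length, for illustration only
    for model, speed in SPEEDS_TOK_PER_S.items():
        wait = seconds_to_generate(reply_tokens, speed)
        print(f"{model:22s} {wait:7.1f} s for {reply_tokens} tokens")
```

At these speeds the 8B model finishes the assumed reply in about six seconds, while the 32B model needs a little over two minutes.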

Why? The 32B needs 19 GB; I have 12. The instant a model overflows VRAM, it spills into system RAM and CPU and falls off a cliff. This is the VRAM wall, and it's the single fact the fantasy benchmarks paper over. On a 12 GB consumer card your real ceiling is 8B-class. You don't buy past it with patience — you buy past it with VRAM.
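The fit check behind the VRAM wall can be sketched in a few lines. The model sizes come from the table; the 1.5 GB of headroom for KV cache and runtime overhead is an assumed round number, not something measured on this card, and real overhead grows with context length.

```python
VRAM_GB = 12.0  # RTX 4070 Ti, as described in the article

# Model sizes (GB) copied from the table above.
MODEL_SIZE_GB = {
    "Llama 3.2 (3B)": 2.0,
    "Mistral 7B": 4.4,
    "Llama 3.1 (8B)": 4.9,
    "Qwen2.5-Coder (32B)": 19.0,
}


def fits_in_vram(model_gb: float, vram_gb: float = VRAM_GB,
                 headroom_gb: float = 1.5) -> bool:
    """True if the weights plus an assumed headroom fit entirely in VRAM.

    headroom_gb is a hypothetical allowance for KV cache and runtime
    overhead; tune it for your own context lengths.
    """
    return model_gb + headroom_gb <= vram_gb


if __name__ == "__main__":
    for name, size in MODEL_SIZE_GB.items():
        verdict = "fits" if fits_in_vram(size) else "spills to CPU"
        print(f"{name:22s} {size:5.1f} GB  {verdict}")
```

Under the same assumed headroom, calling `fits_in_vram(19.0, vram_gb=24.0)` returns True, which is the whole argument for a 24 GB card.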

Which is exactly why the next hardware step is a 24 GB RTX 3090, bought opportunistically second-hand. Not for prestige — to move that 32B model from 3.9 tok/s to genuinely usable, and pull more of the mesh back in-house. The whole upgrade is a response to one measured number.

So I Don't Run a Model. I Route a Mesh.

This is the part the "local vs cloud" framing misses entirely. The question was never which one. It's which job goes where.

A small proxy — LiteLLM — sits in front of everything and routes each request to the cheapest box that can do it well:

Job | Goes to | Why
Bulk reasoning, drafts, classification | DeepSeek V3 (cloud API) | ~$0.27 per million input tokens; cents where frontier charges dollars
Step-by-step reasoning | DeepSeek R1 | replaced my local 7B reasoner; faster and cheaper than running it myself
Long context / code | Kimi K2 (Moonshot, 262k context) | holds an entire codebase or book in one prompt
Needs to be fast | Groq (Llama 3.3 70B) | absurd throughput when latency matters
Hard reasoning, the actual writing | Claude | the 5% that has to be genuinely good
Private / offline / high-volume | Local Ollama (Llama 3.1 8B, nomic-embed) | never leaves the building; zero marginal cost
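The routing table above boils down to a lookup from job type to backend. The sketch below is plain Python for illustration, not LiteLLM's configuration format; the `Job` names and the model aliases are hypothetical labels, and the rule that private data always stays local is one reading of the last row.

```python
from enum import Enum


class Job(Enum):
    BULK = "bulk"                      # drafts, classification
    STEP_REASONING = "step_reasoning"
    LONG_CONTEXT = "long_context"      # whole codebases, books
    LOW_LATENCY = "low_latency"
    HARD_REASONING = "hard_reasoning"  # the 5% that must be good
    PRIVATE = "private"                # offline / high-volume


# Hypothetical aliases; a real proxy config would define its own names.
ROUTES = {
    Job.BULK: "deepseek-v3",
    Job.STEP_REASONING: "deepseek-r1",
    Job.LONG_CONTEXT: "kimi-k2",
    Job.LOW_LATENCY: "groq-llama-3.3-70b",
    Job.HARD_REASONING: "claude",
    Job.PRIVATE: "local-llama3.1-8b",
}


def route(job: Job, private: bool = False) -> str:
    """Pick a backend alias for a job; private data never leaves the box."""
    if private:
        return ROUTES[Job.PRIVATE]
    return ROUTES[job]


if __name__ == "__main__":
    print(route(Job.BULK))                # deepseek-v3
    print(route(Job.BULK, private=True))  # local-llama3.1-8b
```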

And it's not just routing — it's failover. The config has explicit fallback chains: if DeepSeek's API is down, the request drops to a local model; if the local model can't carry it, it escalates to Claude. Embeddings and the private RAG path have no silent fallback at all — if the retrieval service is down, the system errors out rather than quietly answering without sources. That's a deliberate security choice, not an accident.
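A minimal sketch of that failover behaviour, assuming a hypothetical `send(model, prompt)` callable that raises `BackendDown` when a backend is unreachable. The aliases and the chain table are illustrative; this is not LiteLLM's own fallback syntax. The empty chain for the private RAG path is what makes it fail closed.

```python
from typing import Callable, Dict, List


class BackendDown(Exception):
    """Raised by a backend call when the service is unreachable."""


# Hypothetical fallback chains. An empty list means "fail closed".
FALLBACKS: Dict[str, List[str]] = {
    "deepseek-v3": ["local-llama3.1-8b", "claude"],
    "private-rag": [],  # no silent fallback: error out instead
    "local-embed": [],
}


def call_with_fallback(model: str, prompt: str,
                       send: Callable[[str, str], str]) -> str:
    """Try the primary model, then each fallback in order.

    Raises BackendDown if every backend in the chain fails, so a path
    with an empty chain errors out rather than answering from elsewhere.
    """
    last_error = None
    for name in [model, *FALLBACKS.get(model, [])]:
        try:
            return send(name, prompt)
        except BackendDown as exc:
            last_error = exc  # remember the failure and try the next one
    raise BackendDown(f"all backends failed for {model}") from last_error
```

Keeping the chains in data rather than in control flow means the security decision (which paths may never fall back) is visible in one place.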

That's the real answer to "local or cloud": neither — a routed mesh, chosen per task, with the cheap options first and the expensive ones held in reserve.

The Cost Has Two Regimes, Not One

Here's where the fabricated €36,000/year falls apart completely. My AI spend lives in two buckets:

1. A flat subscription for interactive work. The heavy agent sessions — the ones writing code, refactoring, driving the homelab — run on a $100/month flat Claude plan. Not metered. Ten requests or ten thousand, same bill. I did not spend $3,000 on API calls; that sentence describes a pricing model I don't use.

2. Pay-per-million credits for the pipelines. Everything automated — document OCR, email triage, RAG, the article generator you may be reading right now — runs through the mesh on DeepSeek and Kimi API credits. And the unit here is the million tokens. DeepSeek V3 is around $0.27 per million in, ~$1.10 out. The pipelines cost single-digit dollars a month. Local inference on the GPU costs a fraction of a cent per answer in electricity, on a card that idles at 8 W.
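The two cost claims in the second bucket can be checked with a short calculation. The DeepSeek V3 prices are the ones quoted above; the monthly token volumes, the 285 W load draw, and the €0.25/kWh electricity price are assumptions chosen for illustration, so substitute your own.

```python
def api_cost_usd(tokens_in: int, tokens_out: int,
                 usd_per_m_in: float = 0.27,
                 usd_per_m_out: float = 1.10) -> float:
    """Metered cost in USD, priced per million input and output tokens."""
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out


def local_energy_kwh(tokens: int, tok_per_s: float, watts: float) -> float:
    """Energy in kWh to generate `tokens` at a steady speed and power draw."""
    seconds = tokens / tok_per_s
    return watts * seconds / 3_600_000  # watt-seconds to kWh


if __name__ == "__main__":
    # Assumed monthly pipeline volume: 10M tokens in, 2M tokens out.
    print(f"pipelines: ${api_cost_usd(10_000_000, 2_000_000):.2f} per month")

    # One 500-token local answer on the 8B model (84 tok/s from the table),
    # assuming 285 W under load and EUR 0.25 per kWh.
    kwh = local_energy_kwh(500, 84.0, 285.0)
    print(f"local answer: {kwh * 0.25 * 100:.4f} euro cents")
```

With those assumed volumes the pipelines land under five dollars a month, and a single local answer costs a small fraction of a cent in power.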

So the honest total for a working, daily-use AI stack is: one flat subscription, a handful of dollars in API credit, and a few euros of power. The fantasy benchmark wasn't just wrong on the numbers — it modelled the wrong animal: a corporation replacing a cloud contract, instead of one person assembling more capability than a 2020 enterprise had, for less than a phone plan.

The 5-Hour Problem (and the Console That Fixes It)

A flat plan has a catch: a rolling usage window. Push hard enough in a session and you hit the limit, and the interactive agent goes quiet for a while.

That used to mean I was blocked. Now it doesn't. An OpenWebUI console sits on top of the same LiteLLM mesh, so when the Claude window is spent, I keep working — the same prompts route straight to the DeepSeek and Kimi credits instead. The flat plan handles the bulk; the pay-per-million mesh absorbs the overflow. This handoff is the piece I'm actively wiring now, and it's the difference between "blocked until the window resets" and "never fully blocked."
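One way to picture the handoff is a rolling-window budget that sends work to the metered mesh once the flat plan is spent. This is a sketch under loud assumptions: the real plan's limits are not a simple request count, the `limit` value is hypothetical, and the backend aliases are illustrative labels.

```python
import time
from collections import deque
from typing import Deque, Optional


class RollingWindowBudget:
    """Allow at most `limit` requests inside any rolling window."""

    def __init__(self, limit: int, window_s: float = 5 * 3600) -> None:
        self.limit = limit          # hypothetical cap, not the plan's real one
        self.window_s = window_s    # five hours by default
        self._stamps: Deque[float] = deque()

    def try_spend(self, now: Optional[float] = None) -> bool:
        """Record one request if the window has room; return False if not."""
        now = time.monotonic() if now is None else now
        # Drop requests that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


def pick_backend(budget: RollingWindowBudget,
                 now: Optional[float] = None) -> str:
    """Use the flat plan while it has room, otherwise overflow to credits."""
    return "claude-flat-plan" if budget.try_spend(now) else "deepseek-v3"
```

The `now` parameter exists only so the behaviour can be exercised with fixed timestamps; in normal use it is left out and the monotonic clock is used.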

That's the real resilience story. Not redundant A100s. A second front door onto a cheaper set of models.

Compression: Spending Fewer Tokens on Purpose

Cheap models still reward discipline, and a lot of the optimization is simply not spending tokens you don't need to — on both ends:

In/out compression isn't a gimmick. On metered models it's the bill; on the local GPU it's the power draw and the latency. Same lever, three payoffs.

The Actual Job: Harden, Optimize, Reduce

None of this was designed once and left alone. The real work is a loop, and every iteration has the same three verbs:

That loop is the system. The hardware is almost incidental. What makes it work is treating a closet like infrastructure: measure, route, cache, schedule, repeat.

Where This Is Heading (Stated as Intent, Not Fact)

I'll be precise about the difference between what runs today and what I'm aiming at — because blurring the two is how these articles turn into fiction.

What's planned: the 24 GB RTX 3090 to break the VRAM wall and pull more of the mesh in-house; finishing the OpenWebUI handoff so the 5-hour overflow is fully automatic; and a longer-term move toward a dedicated GPU host so the AI stops sharing a card with everything else.

What I'm still questioning: there's a whole shelf of heavier orchestration — agent frameworks, long-running autonomous daemons, multi-agent "swarm" tooling. The honest answer is I haven't deployed them, and I'm not convinced they earn their place here yet. On a single 12 GB card, a long-running multi-agent framework competes for the exact VRAM my real workloads need. The bar isn't "is it impressive" — it's "does it earn its VRAM and its complexity against what I already get from a simple routed mesh." So far, for a solo operator, the simple mesh keeps winning. That may change when the 3090 lands. I'll measure it, not assume it.

That's the discipline: name the destination, but never describe it as if you'd already arrived.

What This Actually Proves

Strip away the spec sheet and here's what's left.

A complete, sovereign, mostly-private AI stack — one that classifies my documents, searches my archives, transcribes recordings, triages my inbox, runs retrieval over my own corpus, and drafts my work — runs on a single gaming GPU, a routed mesh of cheap-and-frontier models, one flat subscription, and a handful of dollars in credit. It lives in a closet next to 100 other containers. And it gets cheaper and tougher every iteration, by design.

That's the story the fantasy benchmarks hide, probably without meaning to. The frontier of AI automation isn't the hyperscaler datacenter with its €60,000 nodes. It's the collapse in the cost of doing real work — the capability that needed an enterprise budget in 2020 now needs a hobbyist's, plus the discipline to wire it well.

When automation lived in a megacorp datacenter, you could at least see it coming. The next wave doesn't have a datacenter. It has a fan, 12 GB of VRAM, a routing table, and a $100 subscription. It's already running.


Run by a practitioner, on real hardware, with measured numbers. If you're building something similar (a routed model mesh, a sovereign homelab, a cheaper way to do serious AI work), contact me. Happy to compare notes and share what I've learned.

#LocalAI #Ollama #LiteLLM #DeepSeek #DigitalSovereignty #Homelab #SelfHosted #EdgeAI #AIInfrastructure #TechStrategy #Europe