Edge AI: Running Models on Your Own Hardware
The headlines have been hard to miss. The biggest AI companies in the world are burning through money at a rate that's difficult to comprehend: reportedly billions of dollars per year to run the models you interact with every day. OpenAI is widely reported to be operating at a loss, and Anthropic is reported to be in a similar position. The prevailing assumption has been that the economics will eventually work themselves out as hardware gets cheaper and models get more efficient.
That may be true. But a parallel shift is already happening: AI is starting to move off the cloud entirely, onto personal hardware and private servers. Understanding why — and what that actually looks like in practice — requires getting clear on a few foundational ideas.
The Cloud AI Model (and Why It Costs So Much)
When you send a message to ChatGPT or Claude, the flow looks like this:
graph LR
A[Your Device] -->|send message| B[Cloud API]
B -->|route request| C[Data Center\nthousands of GPUs]
C -->|run inference| C
C -->|stream tokens| B
B -->|deliver response| A
Every token you receive was generated by a massive GPU cluster somewhere, running a model that might have hundreds of billions of parameters. The hardware is extraordinary, the electricity bills are enormous, and the cost per query adds up fast at scale.
This is the fundamental tension: the models that produce the best results are the ones that are most expensive to run. And right now, by most public accounts, the companies offering those models are charging less than it costs to run them, betting that scale and efficiency improvements will eventually close the gap.
The Alternative: Local Inference
Local inference flips the model entirely:
graph LR
A[Your Device] -->|send message| B[Local Model\non your hardware]
B -->|run inference| B
B -->|deliver response| A
No round trip. No API billing. No data leaving your machine. The model runs on your CPU or GPU, and the result comes back in seconds.
This isn't new — researchers have been running models locally for years. What's changed is that the models have gotten good enough, and the tools have gotten simple enough, that it's becoming accessible to anyone. Tools like Ollama make pulling and running a local model about as involved as installing an app.
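As a rough illustration of how little glue code is involved, the sketch below sends a prompt to a locally running Ollama server over its HTTP API. It assumes Ollama is installed and listening on its default address (`http://localhost:11434`) and that a model has already been pulled. The model name `llama3` and the helper names `build_generate_request` and `generate` are illustrative choices, not something this article prescribes.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False, the whole completion arrives in the "response" field.
    return body["response"]


# Usage (needs a running server and a pulled model, e.g. `ollama pull llama3`):
#   print(generate("llama3", "Explain quantization in one sentence."))
```

Nothing in this exchange leaves the machine: the request goes to `localhost`, and the only dependency is Python's standard library.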
Parameters: What They Are and Why They Matter
Every AI model you've heard of is described by a parameter count. The number attached to local models — 7B, 13B, 70B — refers to how many parameters they contain.
A parameter is just a number. One of billions that define how a model processes and generates language. During training, these values are adjusted continuously across enormous amounts of text until the model produces good outputs. Once training is done, those values are frozen. The model you download is essentially a very large file of those numbers.
More parameters generally means the model can hold more nuance and reason through harder problems. It also means the model needs more memory to run.
| Parameter Count | Example Models | RAM Required | What It's Good For |
|---|---|---|---|
| 1–3B | Phi-3 Mini, Gemma 2B | 2–4 GB | Simple Q&A, mobile devices, edge hardware |
| 7–8B | Llama 3 8B, Mistral 7B | 5–8 GB | General use, runs on most modern laptops |
| 13–14B | Llama 2 13B, Qwen 14B | 10–14 GB | Stronger reasoning, needs a decent machine |
| 32–34B | Qwen2.5-Coder 32B | 20–28 GB | Near-professional quality, high-end hardware |
| 70B+ | Llama 3 70B | 40–80 GB | Cloud-competitive, requires a GPU workstation |
| 100B–1T+ (estimated) | GPT-4, Claude | Cloud only | State of the art — not runnable locally |
The practical breakpoint for most people is the 7B range — capable enough to be useful, small enough to run comfortably on a laptop with 8–16GB of RAM. A 7B model won't match GPT-4 on hard reasoning tasks. But for summarizing, drafting, explaining, and light coding help, it's often good enough.
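The table can be read as a simple lookup: given the memory you can spare, which tier fits? The sketch below encodes the upper end of each RAM range from the table. The function name `largest_tier_for_ram` is hypothetical, and `available_gb` is assumed to mean memory actually free for the model, not the machine's total RAM.

```python
from typing import Optional

# RAM figures copied from the table above (upper end of each range, in GB).
MODEL_TIERS = [
    ("1-3B", 4),
    ("7-8B", 8),
    ("13-14B", 14),
    ("32-34B", 28),
    ("70B+", 80),
]


def largest_tier_for_ram(available_gb: float) -> Optional[str]:
    """Return the largest parameter tier whose RAM figure fits, or None."""
    best = None
    # Tiers are listed smallest first, so the last one that fits is the largest.
    for tier, required_gb in MODEL_TIERS:
        if required_gb <= available_gb:
            best = tier
    return best


print(largest_tier_for_ram(16))  # prints 13-14B
print(largest_tier_for_ram(8))   # prints 7-8B
```

Using the upper end of each range is a deliberately conservative choice; a smaller model in a tier may fit in less.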
Quantization: The Compression Trick That Makes It Practical
There's a problem with that table above. A 7B model at full precision needs around 28GB of RAM — out of reach for most laptops. The RAM numbers I listed are lower than that because of a technique called quantization.
Quantization is controlled precision reduction. Every parameter is stored as a number, and numbers take up space depending on how precisely they're represented. A 32-bit float can store very fine-grained decimal values. An 8-bit integer can only represent 256 distinct values. Quantization converts the model's parameters to lower precision — accepting a small quality loss in exchange for a dramatic reduction in size.
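To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric "absmax" 8-bit quantization, applied to a handful of made-up weights. It is a teaching example under stated assumptions, not the method any particular tool uses: production formats are more elaborate and typically quantize weights in small blocks, each with its own scale. The function names and sample values are illustrative.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest-magnitude weight to +/-127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


# Made-up example weights; real models have billions of these.
weights = [0.82, -0.31, 0.05, -1.27, 0.64]
quantized, scale = quantize_int8(weights)   # quantized is [82, -31, 5, -127, 64]
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now needs one byte instead of four, plus a single shared scale. The cost is that every value is snapped to the nearest of 255 levels, which is where the small quality loss comes from.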
| Format | Bits Per Parameter | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP32 (full precision) | 32 bits | ~28 GB | Reference baseline |
| FP16 (half precision) | 16 bits | ~14 GB | Imperceptible difference |
| Q8 (8-bit) | 8 bits | ~7 GB | Very close to FP16 |
| Q4 (4-bit) | 4 bits | ~4 GB | Slight degradation, rarely noticeable |
| Q2 (2-bit) | 2 bits | ~2 GB | Noticeable quality loss |
Q4 is the sweet spot. A 7B model at Q4 fits in about 4GB of RAM, with quality that's hard to distinguish from the full-precision version on everyday tasks. That's the difference between needing a workstation-class GPU and needing a modern laptop.
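The sizes in the table above follow from simple arithmetic: parameter count times bits per parameter, divided by eight to get bytes. The sketch below reproduces them for a 7B model. It counts the weights only, so actual memory use will be somewhat higher once the runtime and context are loaded; the function name `weight_size_gb` is illustrative.

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Same formats as the table above, applied to a 7-billion-parameter model.
for name, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{name}: ~{weight_size_gb(7e9, bits):.1f} GB")
```

At 4 bits the weights come to 3.5 GB, which the table rounds up to roughly 4 GB.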
Without quantization, "run AI locally" would still be a niche research project. It's the technical unlock that made everything else in this conversation possible.
Cloud vs. Local: An Honest Comparison
This isn't a binary choice — it's a tradeoff with clear lines depending on what you're doing.
| | Cloud Models (GPT-4, Claude) | Local Models (7B–70B) |
|---|---|---|
| Quality | State of the art | Good to very good |
| Privacy | Data sent to third party | Stays on your device |
| Cost | Per-token billing | Hardware only (one-time) |
| Latency | Network round-trip | No network hop; speed depends on your hardware |
| Offline | Requires internet | Fully offline capable |
| Hard reasoning | Excellent | Decent at 32B+, weaker at 7B |
| Long context | 128K–1M tokens | Typically 8K–128K |
For most everyday tasks — drafting, summarizing, explaining, light coding help — a local 7B or 13B model is good enough. For hard reasoning, complex multi-step tasks, or long-context work, cloud models still hold a clear advantage.
That gap is narrowing. A year ago, 7B models were noticeably weaker on anything beyond simple prompts. Today, specialized models like Qwen2.5-Coder 32B score competitively with GPT-4-class models on published coding benchmarks. The trajectory is clear.
Why This Matters Beyond the Hardware
The shift from cloud to local isn't just a technical curiosity. It's a change in who controls the AI you use.
Every query you send to a cloud provider crosses their servers. Your data is subject to their logging policies, their pricing decisions, and whether their business remains solvent. For individuals, that's a privacy concern. For enterprises, it's a compliance and sovereignty concern. For countries, it's a national security argument — which is why the term "sovereign AI" has entered policy conversations at the governmental level.
Local models change the equation. You own the weights. You run the inference. You control what gets stored and what doesn't. The compute cost becomes a one-time hardware purchase rather than a recurring subscription tied to how much you use the model.
Two things are happening at once, for related reasons: the economics of cloud AI are under pressure, and the quality of open-weight models is improving. Hardware is getting more efficient. Research on small, capable models is accelerating. The gap between what you get from a $20/month subscription and what you can run for free on your own machine is shrinking every few months.
What's happening isn't AI getting cheaper. It's AI moving toward a fundamentally different economic model — one that puts the infrastructure back in your hands. That shift is already underway.