So here’s the thing about cloud AI: it’s fast, it’s convenient, and one day you check your API usage and realize you’ve spent €47 on tokens for a script that summarizes your grocery lists.

That was me. So I bought a Mac Mini M2 Pro and decided to run everything locally.

No, I’m not rich. Yes, the Mac pays for itself in API savings over time. Let me show you the math.

Why Local AI in a Home Lab?

Three reasons:

  1. Cost: After the hardware investment, inference is free (electricity aside). Forever.
  2. Privacy: My data stays on my network. No API logs, no training on my prompts.
  3. Latency: No network round-trip. Small models start answering almost immediately.

And one bonus reason that surprised me: reliability. When your internet goes down (and it will), your local models don’t care.

The Hardware

| Component | Spec |
| --- | --- |
| Machine | Apple Mac Mini M2 Pro |
| CPU | 12-core (8 performance + 4 efficiency) |
| RAM | 32GB unified memory |
| GPU | 19-core Metal |
| Storage | 512GB SSD |

The key spec: 32GB unified memory. This is what makes it all work. Apple’s unified memory means the GPU and CPU share the same memory pool — models load once and both can access them. No VRAM bottleneck like on traditional GPUs.

Total power draw: ~15W idle, ~50W under heavy inference. My electricity meter has stopped complaining.
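
To put a number on that, here is a minimal sketch that turns the wattage figures above into a monthly electricity cost. The two hours of heavy inference per day and the €0.30/kWh price are placeholder assumptions, not measurements, so plug in your own.

```python
def monthly_energy_kwh(idle_watts: float, load_watts: float,
                       load_hours_per_day: float, days: int = 30) -> float:
    """Energy used per month in kWh, splitting each day into load and idle hours."""
    idle_hours = 24 - load_hours_per_day
    watt_hours_per_day = idle_watts * idle_hours + load_watts * load_hours_per_day
    return watt_hours_per_day * days / 1000


def monthly_energy_cost(kwh: float, price_per_kwh: float) -> float:
    """Cost of the given energy at a flat price per kWh."""
    return kwh * price_per_kwh


# Wattage from the post: ~15 W idle, ~50 W under heavy inference.
# Assumptions (not from the post): 2 h/day of heavy inference, EUR 0.30 per kWh.
kwh = monthly_energy_kwh(15, 50, load_hours_per_day=2)
cost = monthly_energy_cost(kwh, 0.30)
print(f"{kwh:.1f} kWh/month, about EUR {cost:.2f}")
```

With those assumed inputs the machine uses roughly 13 kWh a month, which is a few euros at most.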

The Stack: Ollama

Ollama is the simplest way to run LLMs locally. One binary, one command, models just work.

# Install (on macOS, the desktop app from ollama.com or `brew install ollama` also works)
curl -fsSL https://ollama.com/install.sh | sh
 
# Pull a model
ollama pull qwen3.5
 
# Run it
ollama run qwen3.5
# >>> Tell me about homelabs
# (instant response, no API key needed)

That’s it. No Docker, no Python environments, no dependency hell. Just a binary and some model files.

The Models

Here’s what I’m running and why:

| Model | Size | Why I Keep It |
| --- | --- | --- |
| qwen3.5 | 6.6GB | Daily driver. Fast, smart enough for most tasks |
| mistral-small3.2 | 15GB | When I need better reasoning. Slower but worth it |
| qwen2.5:14b | 9.3GB | Alternative general-purpose model |
| qwen2.5-coder:14b | 9.3GB | Code-specific. Better at programming tasks |
| llama3.1:8b | 4.9GB | Meta’s model. Good baseline for comparison |
| qwen3:4b | 2.4GB | Lightweight. Runs on literally anything |
| devstral-small-2 | 8.2GB | Mistral’s coding model |
| gpt-oss:20b | 12GB | Experimental. For when I’m feeling adventurous |
| moondream | 1.7GB | Vision model. Can “see” images |
| gemma4:12b | 7.6GB | Vision + text. My go-to for image analysis |
| nomic-embed-text | 274MB | Embedding model. Powers RAG and search |
| bge-m3 | 1.2GB | Multilingual embeddings. French-friendly |
| glm-ocr | 800MB | OCR specialist. Reads screenshots and docs |

Total disk usage: ~80GB going by the sizes above. Fits comfortably on the 512GB SSD alongside everything else.
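
If you want to check the total for your own setup, the sizes can be added up in a few lines. This is a minimal sketch using the sizes as listed in the table above, with decimal units (1GB = 1000MB); `size_to_gb` and `MODEL_SIZES` are names made up for the example.

```python
def size_to_gb(size: str) -> float:
    """Convert a size string such as '6.6GB' or '274MB' to gigabytes."""
    size = size.strip().upper()
    if size.endswith("GB"):
        return float(size[:-2])
    if size.endswith("MB"):
        return float(size[:-2]) / 1000
    raise ValueError(f"unrecognized size: {size!r}")


# Sizes copied from the model table in the post.
MODEL_SIZES = {
    "qwen3.5": "6.6GB",
    "mistral-small3.2": "15GB",
    "qwen2.5:14b": "9.3GB",
    "qwen2.5-coder:14b": "9.3GB",
    "llama3.1:8b": "4.9GB",
    "qwen3:4b": "2.4GB",
    "devstral-small-2": "8.2GB",
    "gpt-oss:20b": "12GB",
    "moondream": "1.7GB",
    "gemma4:12b": "7.6GB",
    "nomic-embed-text": "274MB",
    "bge-m3": "1.2GB",
    "glm-ocr": "800MB",
}

total_gb = sum(size_to_gb(s) for s in MODEL_SIZES.values())
print(f"{len(MODEL_SIZES)} models, {total_gb:.1f} GB total")
```

On a live system, `ollama list` shows the installed models with their on-disk sizes, so you can feed it real numbers instead.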

The Strategy

You don’t need 13 models. You need 3:

  1. A fast general model (qwen3.5 or qwen3:4b) — for quick questions, summaries, everyday tasks
  2. A strong reasoning model (mistral-small3.2 or qwen2.5-coder:14b) — for code, analysis, complex tasks
  3. A vision model (gemma4:12b or moondream) — for image analysis

The rest are for experimentation. Which is the whole point of a home lab.

The Real Magic: Hermes Agent

Here’s where it gets interesting. I don’t just chat with models — I have an agent that uses them.

Hermes Agent is an AI assistant that runs locally and can:

  • Execute shell commands on my servers
  • Manage Docker containers
  • Read and write files
  • Control my smart home via Home Assistant
  • Write and deploy Ansible playbooks
  • Schedule recurring tasks (cron jobs)
  • Remember things across sessions

And it does most of this using my local Ollama models. Only the heaviest reasoning goes to a larger hosted model, so day-to-day work needs no API keys and no cloud dependency.

The Setup

# In Hermes config
model: qwen3.5:397b  # Main model (cloud for complex tasks)
auxiliary:
  vision: gemma4:12b        # Image analysis
  compression: mistral-small3.2  # Summarizing long outputs
  skills_hub: qwen3:4b      # Quick skill lookups
  approval: qwen3:4b        # Safety checks

Hermes routes tasks to the right model. Vision tasks go to gemma4. Code review goes to qwen2.5-coder. Quick lookups use the lightweight qwen3:4b. Complex reasoning hits the big model.
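
The routing idea itself is simple enough to sketch. This is not how Hermes implements it (I am not showing its internals here); it is a hypothetical lookup table that mirrors the roles described above, with the task-type names invented for the example.

```python
# Hypothetical routing table: task type -> Ollama model tag.
# Model tags come from the post; the task-type keys are made up.
ROUTES = {
    "vision": "gemma4:12b",
    "code_review": "qwen2.5-coder:14b",
    "compression": "mistral-small3.2",
    "lookup": "qwen3:4b",
    "approval": "qwen3:4b",
}

# Anything unrecognized falls through to the main ("big") model from the config.
DEFAULT_MODEL = "qwen3.5:397b"


def pick_model(task_type: str) -> str:
    """Return the model tag for a task type, or the main model if unknown."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


for task in ("vision", "lookup", "complex_reasoning"):
    print(f"{task} -> {pick_model(task)}")
```

The point of the fallback is that cheap, well-defined tasks get an explicit small model, and everything else defaults to the strongest one.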

Pretty neat, right? It’s like having a team of specialists, all running on hardware I own.

What It Actually Does

Here’s a real example. Last week:

Me: "Check disk usage on all servers and warn me if anything is above 80%"

Hermes: SSH'd into 6 servers, ran df -h, parsed the output, 
        found ubu-immich at 82%, and sent me a notification.
        Total time: 8 seconds.

Another one:

Me: "Update Plex on the ZimaBoard"

Hermes: SSH'd into zima-ubu-serv-1, pulled the new Plex image, 
        restarted the container, verified it was responding, 
        and reported success.

All local. All private. All mine.
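
The disk-usage check from the first example boils down to parsing `df -h` output and filtering on the percentage column. Here is a minimal sketch of that step, assuming the usual Linux column layout; the function names and the sample output are invented for illustration, and the SSH part is left out.

```python
def parse_df(output: str) -> list[tuple[str, int]]:
    """Return (mount_point, percent_used) pairs from Linux-style `df -h` output."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # The mount point is the 6th column and may itself contain spaces.
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # ignore lines that do not match the expected layout
        rows.append((fields[5].strip(), int(fields[4].rstrip("%"))))
    return rows


def over_threshold(output: str, threshold: int = 80) -> list[tuple[str, int]]:
    """Keep only the mount points whose usage is above the threshold."""
    return [(mount, pct) for mount, pct in parse_df(output) if pct > threshold]


# Made-up sample output, just to exercise the parser.
SAMPLE = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   82G   18G  82% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
/dev/sdb1       500G  200G  300G  40% /mnt/photos
"""

print(over_threshold(SAMPLE))
```

In a real run the `output` string would come from each server, for example via `subprocess.run(["ssh", host, "df", "-h"], capture_output=True, text=True)`.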

The Numbers: Local vs Cloud

Let’s do the math. Here is roughly how much I use each model:

| Model | Daily requests | Cloud equivalent cost | Monthly (cloud) |
| --- | --- | --- | --- |
| qwen3.5 (general) | ~200 | $0.002/1K tokens | ~€24 |
| mistral-small3.2 (reasoning) | ~50 | $0.008/1K tokens | ~€18 |
| gemma4:12b (vision) | ~30 | $0.005/1K tokens | ~€6 |
| Embeddings | ~100 | $0.0001/1K tokens | ~€2 |

Cloud total: ~€50/month.

My Mac Mini M2 Pro cost €1,400. At ~€50/month in avoided API spend, it breaks even after about 28 months. Everything after that is free inference.
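
The break-even arithmetic can be sketched in a few lines, using the monthly figures from the table and the purchase price above. `break_even_months` is a helper name invented for the example, and the optional running-cost parameter is an addition of mine for anyone who wants to factor in electricity.

```python
import math

HARDWARE_COST_EUR = 1400

# Monthly cloud-equivalent costs from the table in the post.
MONTHLY_CLOUD_COST_EUR = {
    "qwen3.5 (general)": 24,
    "mistral-small3.2 (reasoning)": 18,
    "gemma4:12b (vision)": 6,
    "Embeddings": 2,
}


def break_even_months(hardware_cost: float, monthly_savings: float,
                      monthly_running_cost: float = 0.0) -> int:
    """Whole months until avoided cloud spend covers the hardware price."""
    net = monthly_savings - monthly_running_cost
    if net <= 0:
        raise ValueError("no net savings, so the hardware never pays for itself")
    return math.ceil(hardware_cost / net)


monthly_total = sum(MONTHLY_CLOUD_COST_EUR.values())
months = break_even_months(HARDWARE_COST_EUR, monthly_total)
print(f"EUR {monthly_total}/month avoided -> break-even after {months} months")
```

Heavier usage shortens the payback period proportionally: doubling the avoided monthly spend halves the number of months.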

The Honest Limitations

Let me be real. Local AI is not a perfect replacement for cloud:

  1. No GPT-4/Claude-level reasoning — The best local models are good, but they’re not frontier. Complex reasoning tasks still benefit from cloud models.

  2. Memory limits — 32GB is a lot, but you can’t run a 70B parameter model at full speed. Quantized models trade quality for size.

  3. No multimodal magic — Local vision models work, but they’re not GPT-4V. Don’t expect to analyze complex diagrams perfectly.

  4. Setup friction — Ollama is simple, but orchestrating multiple models with an agent takes work. It’s not “sign up and go.”

But for 90% of what I actually do? Local is enough. More than enough.
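
To make the memory limit from point 2 concrete, here is a rough back-of-the-envelope sketch: weight size is about parameters times bits per weight, divided by 8. The ~4.85 bits per weight for a 4-bit quantization and the 8GB of headroom for the OS, context cache, and other apps are my assumptions, not measured values.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   ram_gb: float, headroom_gb: float = 8.0) -> bool:
    """True if the weights plus some headroom fit in the available memory."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= ram_gb


# Assumption: ~4.85 bits per weight for a typical 4-bit quantization.
for params in (4, 14, 70):
    size = weights_gb(params, 4.85)
    verdict = "fits" if fits_in_memory(params, 4.85, ram_gb=32) else "does not fit"
    print(f"{params}B at ~4.85 bits: {size:.1f} GB, {verdict} in 32 GB")
```

Under those assumptions a 70B model needs over 40GB for the weights alone, which is why it is out of reach on a 32GB machine while a 14B model has room to spare.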

Quick Start: Run Your First Local Model

# 1. Install Ollama (on macOS, the desktop app or `brew install ollama` also works)
curl -fsSL https://ollama.com/install.sh | sh
 
# 2. Pull a model (start small)
ollama pull qwen3:4b
 
# 3. Chat
ollama run qwen3:4b
# >>> Write a haiku about Docker containers
# Layers stacked with care,
# Images wait in the dark,
# Compose brings them life.
 
# 4. Try a bigger model
ollama pull mistral-small3.2
ollama run mistral-small3.2
 
# 5. API mode (for scripts and agents)
ollama serve  # Skip this if Ollama is already running in the background
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:4b",
  "prompt": "Explain VLANs in one paragraph",
  "stream": false
}'

That’s it. You’re running AI locally. No API key. No credit card. No usage limits.
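
For scripts, the same API call is easy to make from Python with only the standard library. This is a minimal sketch, assuming Ollama is listening on its default port; `build_request` and `generate` are helper names I made up, and `"stream": False` asks for a single JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("qwen3:4b", "Explain VLANs in one paragraph"))` does the same thing as the curl command above.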

What I Learned

  • Start with one model: Don’t download 13 at once. Pick qwen3:4b or llama3.1:8b and learn how you actually use it.
  • RAM is king: More RAM means bigger models, which means better output. 16GB minimum, 32GB comfortable, 64GB dreamy.
  • Quantization is your friend: Q4_K_M quantized models are roughly 60 to 70% smaller than their 16-bit originals, with maybe 5% quality loss. The tradeoff is worth it.
  • Use a model router: Don’t use a 14B model for a yes/no question. Route tasks to appropriately sized models.
  • Local AI + agent = superpower: Chatting with a model is fun. Having an agent that uses models to manage your infrastructure is a game changer.
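
The size saving from quantization follows from bits per weight, compared against a 16-bit baseline. Here is a minimal sketch; the Q4_K_M figure of about 4.85 bits per weight is an approximation I am assuming, and the exact value varies a little by model.

```python
def reduction_vs_fp16(bits_per_weight: float) -> float:
    """Fractional size reduction relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


# Approximate bits per weight; treat these as ballpark assumptions.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

for name, bits in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name}: about {reduction_vs_fp16(bits):.0%} smaller than FP16")
```

The quality side of the tradeoff cannot be computed this way; it has to be judged on your own prompts.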

Running local models? I’d love to hear what hardware you’re using and which models you keep coming back to. Still paying per token? Grab a Mac Mini and pull qwen3:4b — you’ll see the difference in 5 minutes.

Now if you’ll excuse me, I have some embeddings to index. 🧠