OpenClaw + Ollama: Run AI Models Locally (Hardware Guide)
Run OpenClaw with Ollama for fully local AI — no API costs, no data leaves your machine. Hardware requirements by model size, setup walkthrough, and when local beats cloud.
Local models are why most people pick OpenClaw over a cloud-only assistant. The tradeoff is hardware — you bring the compute, you get the privacy and zero per-token cost.
This page is the practical guide: what hardware runs what, how to set it up with Ollama, and when local actually beats cloud.
TL;DR
| Your hardware | Models you can run | Use it for |
|---|---|---|
| 8 GB RAM/VRAM | 7-8B (Llama 3.1, Mistral) | Daily assistant, simple skills |
| 16 GB | 14B (Qwen 14B, Llama 13B) | Tool-calling agents, light coding |
| 24 GB unified (M-series) | 32B quantized | Serious coding agent, multi-step automations |
| 32 GB+ | 32B full / 70B quantized | Multi-agent, near-cloud quality |
| 64 GB+ | 70B full, MoE | Production-grade local stack |
| < 8 GB | Cloud-only | Use Claude, GPT, or Gemini |
If you're under 16GB, you're better off paying $5/month for cloud API access than fighting OOM errors. If you're at 24GB unified memory or above, local is competitive.
Quick setup with Ollama
1. Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | shStart it:
ollama serve2. Pull a model
ollama pull llama3.1:8b # 4.7 GB, good general-purpose
ollama pull qwen2.5:14b # 9 GB, better tool-calling
ollama pull qwen2.5-coder:7b # 4.7 GB, for code tasksBrowse the full library at ollama.com/library.
3. Configure OpenClaw
Edit ~/.openclaw/openclaw.json:
{
"providers": {
"ollama": {
"baseUrl": "http://localhost:11434/v1",
"models": ["llama3.1:8b", "qwen2.5:14b"]
}
},
"agents": {
"default": {
"model": "ollama/llama3.1:8b"
}
}
}Restart and verify:
openclaw gateway restart
openclaw agent --message "Test — are you running locally?"If the response comes back, you're running fully local.
Hardware tiers in detail
Tier 1 — Light local (8 GB unified / VRAM)
Realistic models: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B.
These handle conversation, summarization, and simple tool calls. They struggle with multi-step reasoning, complex coding, and structured output. Token rate on Apple Silicon M2 8GB: ~20-30 tok/sec.
Good for: a personal Telegram bot, daily journal summarization, basic Q&A.
Tier 2 — Useful local (16 GB)
Realistic models: Qwen 2.5 14B, Llama 2 13B, Qwen2.5-Coder 14B.
The minimum where local feels "good enough" for daily use. Tool calling is reliable, code generation is competent, multi-step reasoning works on focused tasks. ~10-20 tok/sec on M-series.
Good for: a serious coding assistant, multi-skill agents, MCP-driven workflows.
Tier 3 — Strong local (24 GB unified memory)
Realistic models: DeepSeek-R1 32B (4-bit), Llama 3.3 70B (heavy quantization).
The official OpenClaw docs describe 24 GB as "suitable only for lighter prompts with higher latency" — translation: you can run good models, but expect 5-10 tok/sec and longer time-to-first-token on big contexts.
Good for: a primary daily-driver agent that replaces cloud usage for most tasks.
Tier 4 — Production local (32-64 GB+)
Realistic models: DeepSeek-R1 32B full, Llama 3.3 70B, Mixtral 8x22B.
Cloud-competitive quality for most tasks. Multi-agent setups become viable here — one model running, multiple agents sharing it. 5-15 tok/sec.
Good for: 24/7 multi-agent orchestration, fully air-gapped deployments.
When 24 GB isn't enough
The official docs note: "≥2 maxed-out Mac Studios or equivalent GPU rig (~$30k+)" is the recommended hardware for serious multi-agent local stacks. If your goal is replacing a team's worth of Claude/GPT-4 usage with local infrastructure, that's the realistic price floor.
Mixing local and cloud
Most OpenClaw users end up with a hybrid:
{
"providers": {
"ollama": { "baseUrl": "http://localhost:11434/v1" },
"anthropic": { "apiKey": "${ANTHROPIC_API_KEY}" }
},
"agents": {
"personal": { "model": "ollama/qwen2.5:14b" },
"research": { "model": "anthropic/claude-sonnet-4-5" }
}
}The personal agent handles your private data on-device. The research agent calls Claude when you need deep reasoning. Same gateway, same channels, both available.
For the cloud-orchestrator + local-text-workers pattern (a cloud model orchestrates, local models do bulk text work), see the OpenClaw local models guide.
Embeddings (memory search)
Local embeddings are separate from chat models. OpenClaw's memory search needs a dedicated embedding model:
ollama pull nomic-embed-text # Fast, 274 MB, good default{
"providers": {
"ollama": {
"embeddingModel": "nomic-embed-text"
}
}
}If you're on Apple Silicon, native MLX embedding support via oMLX is tracked in the OpenClaw roadmap.
Common gotchas
- Ollama not bound to network: by default Ollama listens on
localhostonly. If OpenClaw runs in Docker, setOLLAMA_HOST=0.0.0.0and use the host IP inbaseUrl. - Proxy interference: as of v2026.5.19 there's an open bug where SSRF defenses ignore
NO_PROXYwhen calling local Ollama embeddings. Workaround: disable the proxy for embedding traffic. - Cron model preflight: if your cron job's primary model is a local Ollama target and the local server is offline at preflight time, the entire run is skipped (cloud fallbacks ignored). Tracked bug — for production cron, prefer cloud as primary with local as fallback.
- Reasoning models (
<think>/<final>tags): some local reasoning models leak reasoning content. Configure your provider to strip these tags, or use a model without explicit reasoning channels.
→ Ollama model library · System requirements · OpenClaw + Ollama setup · Local models guide
FAQ
- Can OpenClaw run fully offline with local models?
- Yes. Configure Ollama (or any OpenAI-compatible local server) as your provider in `openclaw.json` and your agent runs without any cloud API. The only network traffic is whatever your tools need — web search, channel APIs, etc. For a fully air-gapped setup, disable cloud-dependent skills.
- What hardware do I need to run OpenClaw with local models?
- Depends on the model size. 7-8B models (Mistral, Llama 3.1) need 8GB unified memory / VRAM. 14B models need 16GB. 32B models need 32GB+. 70B models need 64GB+. Apple Silicon is unusually efficient because unified memory is shared between CPU and GPU. See the table below for the full breakdown.
- Which local model should I use with OpenClaw?
- For general assistant use on a 16GB machine, start with Llama 3.1 8B or Qwen 2.5 7B — both handle conversation, tool calling, and skill invocation reasonably well. For coding, Qwen2.5-Coder 7B or 14B is the current pick. For 24GB+ machines, DeepSeek-R1 32B or Llama 3.3 70B (quantized) deliver near-cloud quality.
- Why is my local model slower than ChatGPT?
- Because ChatGPT runs on data-center GPUs with batched inference at scale. A consumer machine running a 32B model at 24GB does ~5-15 tokens/sec. That's normal — the tradeoff is total privacy and zero API cost. Use smaller models (7-8B) for snappier responses, or offload heavy tasks to cloud and keep local for sensitive ones.
- Can I mix local and cloud models in OpenClaw?
- Yes. Configure multiple providers in `openclaw.json` and route different agents to different models. Common pattern: a `personal` agent on local Ollama (private data), a `research` agent on Claude or GPT-4o (heavy reasoning). The cloud-orchestrator + local-text-workers pattern is also tracked in the OpenClaw docs.