Skip to main content
HowOpenClaw

OpenClaw + Ollama: Run AI Models Locally (Hardware Guide)

Run OpenClaw with Ollama for fully local AI — no API costs, no data leaves your machine. Hardware requirements by model size, setup walkthrough, and when local beats cloud.

Local models are why most people pick OpenClaw over a cloud-only assistant. The tradeoff is hardware — you bring the compute, you get the privacy and zero per-token cost.

This page is the practical guide: what hardware runs what, how to set it up with Ollama, and when local actually beats cloud.

TL;DR

Your hardwareModels you can runUse it for
8 GB RAM/VRAM7-8B (Llama 3.1, Mistral)Daily assistant, simple skills
16 GB14B (Qwen 14B, Llama 13B)Tool-calling agents, light coding
24 GB unified (M-series)32B quantizedSerious coding agent, multi-step automations
32 GB+32B full / 70B quantizedMulti-agent, near-cloud quality
64 GB+70B full, MoEProduction-grade local stack
< 8 GBCloud-onlyUse Claude, GPT, or Gemini

If you're under 16GB, you're better off paying $5/month for cloud API access than fighting OOM errors. If you're at 24GB unified memory or above, local is competitive.


Quick setup with Ollama

1. Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Start it:

ollama serve

2. Pull a model

ollama pull llama3.1:8b          # 4.7 GB, good general-purpose
ollama pull qwen2.5:14b          # 9 GB, better tool-calling
ollama pull qwen2.5-coder:7b     # 4.7 GB, for code tasks

Browse the full library at ollama.com/library.

3. Configure OpenClaw

Edit ~/.openclaw/openclaw.json:

{
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434/v1",
      "models": ["llama3.1:8b", "qwen2.5:14b"]
    }
  },
  "agents": {
    "default": {
      "model": "ollama/llama3.1:8b"
    }
  }
}

Restart and verify:

openclaw gateway restart
openclaw agent --message "Test — are you running locally?"

If the response comes back, you're running fully local.


Hardware tiers in detail

Tier 1 — Light local (8 GB unified / VRAM)

Realistic models: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B.

These handle conversation, summarization, and simple tool calls. They struggle with multi-step reasoning, complex coding, and structured output. Token rate on Apple Silicon M2 8GB: ~20-30 tok/sec.

Good for: a personal Telegram bot, daily journal summarization, basic Q&A.

Tier 2 — Useful local (16 GB)

Realistic models: Qwen 2.5 14B, Llama 2 13B, Qwen2.5-Coder 14B.

The minimum where local feels "good enough" for daily use. Tool calling is reliable, code generation is competent, multi-step reasoning works on focused tasks. ~10-20 tok/sec on M-series.

Good for: a serious coding assistant, multi-skill agents, MCP-driven workflows.

Tier 3 — Strong local (24 GB unified memory)

Realistic models: DeepSeek-R1 32B (4-bit), Llama 3.3 70B (heavy quantization).

The official OpenClaw docs describe 24 GB as "suitable only for lighter prompts with higher latency" — translation: you can run good models, but expect 5-10 tok/sec and longer time-to-first-token on big contexts.

Good for: a primary daily-driver agent that replaces cloud usage for most tasks.

Tier 4 — Production local (32-64 GB+)

Realistic models: DeepSeek-R1 32B full, Llama 3.3 70B, Mixtral 8x22B.

Cloud-competitive quality for most tasks. Multi-agent setups become viable here — one model running, multiple agents sharing it. 5-15 tok/sec.

Good for: 24/7 multi-agent orchestration, fully air-gapped deployments.

When 24 GB isn't enough

The official docs note: "≥2 maxed-out Mac Studios or equivalent GPU rig (~$30k+)" is the recommended hardware for serious multi-agent local stacks. If your goal is replacing a team's worth of Claude/GPT-4 usage with local infrastructure, that's the realistic price floor.


Mixing local and cloud

Most OpenClaw users end up with a hybrid:

{
  "providers": {
    "ollama": { "baseUrl": "http://localhost:11434/v1" },
    "anthropic": { "apiKey": "${ANTHROPIC_API_KEY}" }
  },
  "agents": {
    "personal": { "model": "ollama/qwen2.5:14b" },
    "research": { "model": "anthropic/claude-sonnet-4-5" }
  }
}

The personal agent handles your private data on-device. The research agent calls Claude when you need deep reasoning. Same gateway, same channels, both available.

For the cloud-orchestrator + local-text-workers pattern (a cloud model orchestrates, local models do bulk text work), see the OpenClaw local models guide.


Local embeddings are separate from chat models. OpenClaw's memory search needs a dedicated embedding model:

ollama pull nomic-embed-text     # Fast, 274 MB, good default
{
  "providers": {
    "ollama": {
      "embeddingModel": "nomic-embed-text"
    }
  }
}

If you're on Apple Silicon, native MLX embedding support via oMLX is tracked in the OpenClaw roadmap.


Common gotchas

  • Ollama not bound to network: by default Ollama listens on localhost only. If OpenClaw runs in Docker, set OLLAMA_HOST=0.0.0.0 and use the host IP in baseUrl.
  • Proxy interference: as of v2026.5.19 there's an open bug where SSRF defenses ignore NO_PROXY when calling local Ollama embeddings. Workaround: disable the proxy for embedding traffic.
  • Cron model preflight: if your cron job's primary model is a local Ollama target and the local server is offline at preflight time, the entire run is skipped (cloud fallbacks ignored). Tracked bug — for production cron, prefer cloud as primary with local as fallback.
  • Reasoning models (<think>/<final> tags): some local reasoning models leak reasoning content. Configure your provider to strip these tags, or use a model without explicit reasoning channels.

Ollama model library · System requirements · OpenClaw + Ollama setup · Local models guide

FAQ

Can OpenClaw run fully offline with local models?
Yes. Configure Ollama (or any OpenAI-compatible local server) as your provider in `openclaw.json` and your agent runs without any cloud API. The only network traffic is whatever your tools need — web search, channel APIs, etc. For a fully air-gapped setup, disable cloud-dependent skills.
What hardware do I need to run OpenClaw with local models?
Depends on the model size. 7-8B models (Mistral, Llama 3.1) need 8GB unified memory / VRAM. 14B models need 16GB. 32B models need 32GB+. 70B models need 64GB+. Apple Silicon is unusually efficient because unified memory is shared between CPU and GPU. See the table below for the full breakdown.
Which local model should I use with OpenClaw?
For general assistant use on a 16GB machine, start with Llama 3.1 8B or Qwen 2.5 7B — both handle conversation, tool calling, and skill invocation reasonably well. For coding, Qwen2.5-Coder 7B or 14B is the current pick. For 24GB+ machines, DeepSeek-R1 32B or Llama 3.3 70B (quantized) deliver near-cloud quality.
Why is my local model slower than ChatGPT?
Because ChatGPT runs on data-center GPUs with batched inference at scale. A consumer machine running a 32B model at 24GB does ~5-15 tokens/sec. That's normal — the tradeoff is total privacy and zero API cost. Use smaller models (7-8B) for snappier responses, or offload heavy tasks to cloud and keep local for sensitive ones.
Can I mix local and cloud models in OpenClaw?
Yes. Configure multiple providers in `openclaw.json` and route different agents to different models. Common pattern: a `personal` agent on local Ollama (private data), a `research` agent on Claude or GPT-4o (heavy reasoning). The cloud-orchestrator + local-text-workers pattern is also tracked in the OpenClaw docs.