Can OpenClaw run fully offline with local models?

Yes. Configure Ollama (or any OpenAI-compatible local server) as your provider in `openclaw.json` and your agent runs without any cloud API. The only network traffic is whatever your tools need — web search, channel APIs, etc. For a fully air-gapped setup, disable cloud-dependent skills.

What hardware do I need to run OpenClaw with local models?

Depends on the model size. 7-8B models (Mistral, Llama 3.1) need 8GB unified memory / VRAM. 14B models need 16GB. 32B models need 32GB+. 70B models need 64GB+. Apple Silicon is unusually efficient because unified memory is shared between CPU and GPU. See the table below for the full breakdown.

Which local model should I use with OpenClaw?

For general assistant use on a 16GB machine, start with Llama 3.1 8B or Qwen 2.5 7B — both handle conversation, tool calling, and skill invocation reasonably well. For coding, Qwen2.5-Coder 7B or 14B is the current pick. For 24GB+ machines, DeepSeek-R1 32B or Llama 3.3 70B (quantized) deliver near-cloud quality.

Why is my local model slower than ChatGPT?

Because ChatGPT runs on data-center GPUs with batched inference at scale. A consumer machine running a 32B model at 24GB does ~5-15 tokens/sec. That's normal — the tradeoff is total privacy and zero API cost. Use smaller models (7-8B) for snappier responses, or offload heavy tasks to cloud and keep local for sensitive ones.

Can I mix local and cloud models in OpenClaw?

Yes. Configure multiple providers in `openclaw.json` and route different agents to different models. Common pattern: a `personal` agent on local Ollama (private data), a `research` agent on Claude or GPT-4o (heavy reasoning). The cloud-orchestrator + local-text-workers pattern is also tracked in the OpenClaw docs.

OpenClaw + Ollama: Run AI Models Locally (Hardware Guide)

Local models are why most people pick OpenClaw over a cloud-only assistant. The tradeoff is hardware — you bring the compute, you get the privacy and zero per-token cost.

This page is the practical guide: what hardware runs what, how to set it up with Ollama, and when local actually beats cloud.

TL;DR

Your hardware	Models you can run	Use it for
8 GB RAM/VRAM	7-8B (Llama 3.1, Mistral)	Daily assistant, simple skills
16 GB	14B (Qwen 14B, Llama 13B)	Tool-calling agents, light coding
24 GB unified (M-series)	32B quantized	Serious coding agent, multi-step automations
32 GB+	32B full / 70B quantized	Multi-agent, near-cloud quality
64 GB+	70B full, MoE	Production-grade local stack
< 8 GB	Cloud-only	Use Claude, GPT, or Gemini

If you're under 16GB, you're better off paying $5/month for cloud API access than fighting OOM errors. If you're at 24GB unified memory or above, local is competitive.

Quick setup with Ollama

1. Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Start it:

ollama serve

2. Pull a model

ollama pull llama3.1:8b          # 4.7 GB, good general-purpose
ollama pull qwen2.5:14b          # 9 GB, better tool-calling
ollama pull qwen2.5-coder:7b     # 4.7 GB, for code tasks

Browse the full library at ollama.com/library.

3. Configure OpenClaw

Edit ~/.openclaw/openclaw.json:

{
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434/v1",
      "models": ["llama3.1:8b", "qwen2.5:14b"]
    }
  },
  "agents": {
    "default": {
      "model": "ollama/llama3.1:8b"
    }
  }
}

Restart and verify:

openclaw gateway restart
openclaw agent --message "Test — are you running locally?"

If the response comes back, you're running fully local.

Hardware tiers in detail

Tier 1 — Light local (8 GB unified / VRAM)

Realistic models: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B.

These handle conversation, summarization, and simple tool calls. They struggle with multi-step reasoning, complex coding, and structured output. Token rate on Apple Silicon M2 8GB: ~20-30 tok/sec.

Good for: a personal Telegram bot, daily journal summarization, basic Q&A.

Tier 2 — Useful local (16 GB)

Realistic models: Qwen 2.5 14B, Llama 2 13B, Qwen2.5-Coder 14B.

The minimum where local feels "good enough" for daily use. Tool calling is reliable, code generation is competent, multi-step reasoning works on focused tasks. ~10-20 tok/sec on M-series.

Good for: a serious coding assistant, multi-skill agents, MCP-driven workflows.

Tier 3 — Strong local (24 GB unified memory)

Realistic models: DeepSeek-R1 32B (4-bit), Llama 3.3 70B (heavy quantization).

The official OpenClaw docs describe 24 GB as "suitable only for lighter prompts with higher latency" — translation: you can run good models, but expect 5-10 tok/sec and longer time-to-first-token on big contexts.

Good for: a primary daily-driver agent that replaces cloud usage for most tasks.

Tier 4 — Production local (32-64 GB+)

Realistic models: DeepSeek-R1 32B full, Llama 3.3 70B, Mixtral 8x22B.

Cloud-competitive quality for most tasks. Multi-agent setups become viable here — one model running, multiple agents sharing it. 5-15 tok/sec.

Good for: 24/7 multi-agent orchestration, fully air-gapped deployments.

When 24 GB isn't enough

The official docs note: "≥2 maxed-out Mac Studios or equivalent GPU rig (~$30k+)" is the recommended hardware for serious multi-agent local stacks. If your goal is replacing a team's worth of Claude/GPT-4 usage with local infrastructure, that's the realistic price floor.

Mixing local and cloud

Most OpenClaw users end up with a hybrid:

{
  "providers": {
    "ollama": { "baseUrl": "http://localhost:11434/v1" },
    "anthropic": { "apiKey": "${ANTHROPIC_API_KEY}" }
  },
  "agents": {
    "personal": { "model": "ollama/qwen2.5:14b" },
    "research": { "model": "anthropic/claude-sonnet-4-5" }
  }
}

The personal agent handles your private data on-device. The research agent calls Claude when you need deep reasoning. Same gateway, same channels, both available.

For the cloud-orchestrator + local-text-workers pattern (a cloud model orchestrates, local models do bulk text work), see the OpenClaw local models guide.

New in v2026.5.20

Per-agent lean local-model mode. agents.list[].experimental.localModelLean can now be enabled on a single configured agent instead of globally. Useful when you want one lightweight local agent (smaller context, leaner tool set, more forgiving of slow first-token times) alongside agents that run on cloud models with full context budgets.

{
  "agents": {
    "list": [
      {
        "name": "local-helper",
        "model": "ollama/qwen2.5:14b",
        "experimental": { "localModelLean": true }
      },
      {
        "name": "research",
        "model": "anthropic/claude-sonnet-4-5"
      }
    ]
  }
}

Embeddings (memory search)

Local embeddings are separate from chat models. OpenClaw's memory search needs a dedicated embedding model:

ollama pull nomic-embed-text     # Fast, 274 MB, good default

{
  "providers": {
    "ollama": {
      "embeddingModel": "nomic-embed-text"
    }
  }
}

If you're on Apple Silicon, native MLX embedding support via oMLX is tracked in the OpenClaw roadmap.

OpenAI-compatible embeddings are now core (v2026.5.26)

You no longer need Ollama specifically for embeddings. OpenClaw ships a built-in OpenAI-compatible embedding provider that works with any local or hosted OpenAI-style /v1/embeddings endpoint (LM Studio, vLLM, llama.cpp, Infinity, or a hosted API). Point it at your endpoint and model:

{
  "memory": {
    "embeddings": {
      "provider": "openai-compatible",
      "baseUrl": "http://localhost:1234/v1",
      "model": "text-embedding-nomic-embed-text-v1.5"
    }
  }
}

openclaw doctor validates the endpoint and reports an unreachable embeddings provider instead of silently degrading memory search. If the configured embedding provider is temporarily unavailable, OpenClaw now aborts the sync rather than downgrading an existing semantic vector index to keyword-only (v2026.5.26).

Common gotchas

Ollama not bound to network: by default Ollama listens on localhost only. If OpenClaw runs in Docker, set OLLAMA_HOST=0.0.0.0 and use the host IP in baseUrl.
Proxy interference: a v2026.5.19 bug where SSRF defenses ignored NO_PROXY for local Ollama embeddings was fixed in v2026.5.22 — OpenClaw now bypasses the managed proxy for configured local embedding origins while keeping SSRF guardrails on unconfigured targets. If you're on an older build, the workaround is still to disable the proxy for embedding traffic; otherwise just upgrade.
Cron model preflight: older builds skipped a whole scheduled run if the cron job's primary model was a local Ollama target that was offline at preflight time (cloud fallbacks ignored). Fixed in v2026.5.28 — cron now preflights model fallbacks before skipping work, so a configured cloud fallback runs instead. For production cron it's still cleanest to set cloud as primary with local as fallback.
Reasoning models (<think>/<final> tags): some local reasoning models leak reasoning content. Configure your provider to strip these tags, or use a model without explicit reasoning channels.

→ Ollama model library · System requirements · OpenClaw + Ollama setup · Local models guide