Choosing a model

The setup wizard offers four model options. Your hardware decides which is realistic.

Model	VRAM	Disk	Speed	Notes
qwen3.6:35b-a3b (default)	24 GB+	~23 GB	Fast	MoE. 35B params, only 3B active at a time. Best quality, surprisingly fast for its size.
qwen3.5:9b	8 GB+	~5.2 GB	Medium	Older fallback. Works fine on a 12 GB card.
gemma3:4b	4 GB+ or CPU	~3.3 GB	Slow	Google. CPU-friendly.
llama3.2:3b	CPU only	~2 GB	Very slow	Anything with 4 GB RAM. Goals take much longer.

What “fast” actually means

For the recommended qwen3.6 on a 24 GB+ GPU:

A goal cycle is typically 200-600 seconds (3-10 min).
One existence-loop call (picking the next goal) is ~5-30s.
One planning call is ~30-90s.
One LLM step (ollama_chat) is ~50-100s.
Validation (semantic + fact-check) is ~30-120s.

On CPU with llama3.2:3b: multiply everything by ~10. The system still works but you’ll see fewer goal completions per hour.

Why qwen3.6 is the default

Local-only (no cloud calls)
MoE means inference is fast (3B active params) but reasoning quality matches 30B+ dense models
Handles 32K context comfortably, important because the existence prompt is large (worldview + suffering + peer state + capabilities)
Doesn’t produce schema-shaped output with empty fields under format=json (a problem with some thinking-mode models)

Switching models later

Edit config.json:

{
  "ollama": {
    "default_model": "qwen3.5:9b"
  }
}

Then python3 hollow.py stop and python3 hollow.py to restart. Make sure the model is pulled first: ollama pull qwen3.5:9b.

What about Claude / OpenAI / etc.?

Not supported and won’t be. The system is designed for sustained 24/7 unattended operation; cloud LLMs would burn $100+/day in goal-selection alone. Everything runs locally on whatever model you pick, and the substrate’s design (suffering, capability locks, lesson promotion) is calibrated for local inference speeds.

The exception is invoke_claude: agents can submit implementation requests to YOU, the operator. If you happen to use Claude Code as your editor, you can hook MCP and read/satisfy those requests there. But it’s a human-in-the-loop channel, not an agent-side LLM call.