- `.env.example` — Added Ollama section with comments.
- `scripts/check_ollama_health.py` — New script (stdlib only). Checks server reachability, pulled models, and loaded models; auto-preloads if needed. Exit 0 = healthy.
- `.claude/skills/trade.md` — Step 0 now runs the health check first; abort if models are not loaded.
- `.claude/skills/workshop.md` — Step 0 now runs the health check plus an AWS credential verify.
- `CLAUDE.md` — LLM Configuration section replaced with local model tables, cost breakdown, health check instructions, and launchd startup config.
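The health-check script described above could look roughly like this (a minimal sketch, stdlib only; it assumes Ollama's standard endpoints — `/api/tags` for pulled models, `/api/ps` for loaded models, and an empty `/api/generate` request to preload a model — and the two model names from this config):

```python
import json
import sys
import urllib.request

OLLAMA = "http://localhost:11434"  # default local Ollama address
REQUIRED = {"qwen3.5:9b", "qwen3.5:35b-a3b"}


def get_json(path):
    """GET a JSON endpoint on the local Ollama server."""
    with urllib.request.urlopen(OLLAMA + path, timeout=5) as resp:
        return json.load(resp)


def missing_models(tags, required):
    """Names in `required` that are absent from an /api/tags response."""
    pulled = {m["name"] for m in tags.get("models", [])}
    return set(required) - pulled


def main():
    try:
        tags = get_json("/api/tags")  # models pulled to disk
    except OSError:
        print("Ollama server unreachable")
        return 1
    absent = missing_models(tags, REQUIRED)
    if absent:
        print("Not pulled:", ", ".join(sorted(absent)))
        return 1
    loaded = {m["name"] for m in get_json("/api/ps").get("models", [])}
    for name in REQUIRED - loaded:
        # Auto-preload: a generate request with no prompt loads the model
        # into memory; keep_alive=-1 keeps it resident indefinitely.
        req = urllib.request.Request(
            OLLAMA + "/api/generate",
            data=json.dumps({"model": name, "keep_alive": -1}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=120).read()
    print("healthy")
    return 0


# sys.exit(main()) when run as a script: exit 0 = healthy
```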
**Why:** The M3 Max has 128GB of unified memory. Two models (~36GB peak total) run permanently at $0 cost vs ~$0.03 per crew run on Bedrock. NUM_PARALLEL=10 lets all 10 ICs hit qwen3.5:9b simultaneously. Thinking mode is disabled for agents via `extra_body={"think": False}`, saving tokens and latency with no quality loss for focused IC work. Workshop sessions keep Bedrock Sonnet for deep reasoning quality.
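The `extra_body={"think": False}` mechanism can be sketched as the request payload it produces (`build_chat_request` is a hypothetical helper — OpenAI-style clients forward `extra_body` keys into the JSON body, where Ollama reads `think`):

```python
import json


def build_chat_request(model, user_msg):
    """Hypothetical helper: chat payload with thinking mode disabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        # What extra_body={"think": False} adds to the request body:
        # the model skips its thinking phase, cutting tokens and latency.
        "think": False,
    }


payload = build_chat_request("qwen3.5:9b", "Rank these tickers by momentum")
print(json.dumps(payload, indent=2))
```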
**Still needed (manual steps — Ollama not running at time of config):**
1. `ollama serve`, or start the Ollama app
2. `ollama pull qwen3.5:9b` and `ollama pull qwen3.5:35b-a3b`
3. Set launchd env vars `OLLAMA_KEEP_ALIVE=-1`, `OLLAMA_FLASH_ATTENTION=1`, `OLLAMA_NUM_PARALLEL=10`, then restart Ollama
4. Run `python scripts/check_ollama_health.py` to verify
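Step 3 can be done from a terminal with `launchctl setenv` (one way to set environment variables for launchd-managed apps on macOS; the restart commands are a sketch assuming the Ollama menu-bar app):

```shell
# Env vars for launchd-launched apps (values from step 3 above)
launchctl setenv OLLAMA_KEEP_ALIVE "-1"      # keep models resident indefinitely
launchctl setenv OLLAMA_FLASH_ATTENTION "1"  # enable flash attention
launchctl setenv OLLAMA_NUM_PARALLEL "10"    # allow 10 concurrent requests

# Restart Ollama so the new environment is picked up
osascript -e 'quit app "Ollama"' 2>/dev/null
open -a Ollama

# Verify (step 4)
python scripts/check_ollama_health.py
```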