Run Gemma 4 26B-A4B locally on Windows with OpenCode as the agentic coding harness — fully offline, no data leaves your machine.
┌──────────────────────────────────────────────────────────────────┐
│ llama.cpp server (GPU) ←── OpenCode (terminal UI) │
│ Gemma 4 26B-A4B Q4_K_M Tool calls: read, write, bash, │
│ http://127.0.0.1:8080 grep, glob, edit, web fetch │
└──────────────────────────────────────────────────────────────────┘
This is a Windows fork of gemma-code, which was built for macOS with Apple Silicon. This version replaces the custom Python harness with OpenCode, a Go-based agentic CLI that connects to any OpenAI-compatible endpoint.
| Component | Role |
|---|---|
| Gemma 4 26B-A4B | MoE model — 26B total params, 4B active per token. 256K context window. Fast inference despite large capacity. |
| llama.cpp | Runs the GGUF model locally with GPU acceleration (CUDA, Vulkan, or CPU). Exposes an OpenAI-compatible API. |
| OpenCode | Terminal-based agentic UI with built-in tools (file I/O, shell, grep, web fetch). Talks to llama.cpp as its backend. |
| Minimum | Recommended | |
|---|---|---|
| OS | Windows 10 (64-bit) | Windows 11 |
| GPU (NVIDIA) | RTX 3090 / 4090 (24 GB VRAM) | RTX 5090 (32 GB) |
| GPU (AMD) | RX 7900 XTX (24 GB) via Vulkan | — |
| System RAM | 32 GB | 64 GB |
| Disk | ~20 GB free | ~25 GB free |
| Software | — | CUDA Toolkit 12.4+ (NVIDIA) or Vulkan SDK (AMD) |
GPU VRAM is the bottleneck. The Q4_K_M quant is ~16 GB; you need ~20 GB VRAM to run it with reasonable context. A 24 GB GPU (RTX 3090/4090) is the practical minimum for full GPU offload.
No 24 GB GPU? Options:
- Use a smaller quant (Q2_K, ~10 GB) — lower quality but fits 16 GB VRAM
- Use partial GPU offload (
-ngl 20instead of-ngl 99) — slower, uses system RAM for remaining layers- Use CPU-only inference — works but very slow (~5-15x slower)
Choose one method:
Option A — Pre-built binaries (recommended)
Download the latest release from github.com/ggml-org/llama.cpp/releases.
Pick the right zip for your GPU:
| GPU | Download |
|---|---|
| NVIDIA (CUDA 12.4) | llama-bXXXX-bin-win-cuda-12.4-x64.zip |
| NVIDIA (CUDA 13.x) | llama-bXXXX-bin-win-cuda-13.1-x64.zip |
| AMD / Intel / any (Vulkan) | llama-bXXXX-bin-win-vulkan-x64.zip |
| CPU only | llama-bXXXX-bin-win-cpu-x64.zip |
Extract to a short path (e.g. C:\llama\) and add it to your PATH:
# Extract
Expand-Archive -Path "llama-bXXXX-bin-win-cuda-12.4-x64.zip" -DestinationPath "C:\llama"
# Add to PATH (current session)
$env:PATH = "C:\llama;" + $env:PATH
# Add to PATH (permanent — run as admin)
[Environment]::SetEnvironmentVariable("Path", "C:\llama;" + [Environment]::GetEnvironmentVariable("Path", "Machine"), "Machine")Verify:
llama-server --versionOption B — winget
winget install llama.cppNote: the winget package may ship Vulkan or CPU-only. For CUDA, use Option A.
Option C — Build from source
Requires Visual Studio 2022 with "Desktop development with C++" workload. Run from a Developer Command Prompt:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
REM For NVIDIA CUDA:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
REM For Vulkan (AMD / Intel / NVIDIA):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
REM CPU only:
cmake -B build
cmake --build build --config ReleaseExecutables end up in build\bin\Release\.
# Install the HuggingFace CLI if you don't have it
pip install huggingface-hub
# Download Q4_K_M (~16 GB)
huggingface-cli download ggml-org/gemma-4-26b-a4b-it-GGUF gemma-4-26b-a4b-it-Q4_K_M.gguf --local-dir C:\models\gemma4Or download directly from huggingface.co/ggml-org/gemma-4-26b-a4b-it-GGUF.
Important: Use the 26B-A4B variant (not E4B). The 26B-A4B has a 256K context window versus 128K on E4B.
Which quant?
| Quant | Size | VRAM needed | Quality |
|---|---|---|---|
Q4_K_M |
~16 GB | ~20 GB | Good — default choice |
Q5_K_M |
~19 GB | ~22 GB | Better |
Q2_K |
~10 GB | ~13 GB | Lower — fits 16 GB GPUs |
Q8_0 |
~27 GB | ~30 GB | Near-lossless — needs 32 GB VRAM |
PowerShell (recommended):
.\start_server.ps1 -Model C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.ggufCommand Prompt:
start_server.bat C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.ggufManual command (if you prefer):
llama-server -m "C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf" `
-ngl 99 -c 32768 -fa on `
-ctk q8_0 -ctv q4_0 `
--host 127.0.0.1 --port 8080Wait for the "model loaded" message. You can verify at http://127.0.0.1:8080 — llama.cpp includes a built-in chat web UI.
Key flags explained:
| Flag | Purpose |
|---|---|
-ngl 99 |
Offload all layers to GPU (reduce for partial offload) |
-c 32768 |
Context window size in tokens |
-fa on |
Flash Attention — faster, less memory |
-ctk q8_0 -ctv q4_0 |
KV cache quantisation — saves ~3x VRAM on context |
--host 127.0.0.1 |
Listen on localhost only |
--port 8080 |
API port (OpenCode connects here) |
Adjusting context size:
| Context | Approx. capacity | Extra VRAM (with KV quant) |
|---|---|---|
| 32K tokens | ~25,000 words | ~0.5 GB |
| 64K tokens | ~50,000 words | ~1 GB |
| 128K tokens | ~100,000 words | ~2 GB |
Choose one method:
# Scoop
scoop install opencode
# Chocolatey
choco install opencode
# npm
npm install -g opencode-ai
# Go
go install github.com/anomalyco/opencode@latestOr download the Windows binary from github.com/anomalyco/opencode/releases.
Verify:
opencode --versionCopy the opencode.json from this directory to your project root (the directory you'll run OpenCode from):
copy opencode.json C:\your\project\opencode.jsonOr create opencode.json in your project root with this content:
{
"$schema": "https://opencode.ai/config.json",
"model": "gemma-local/gemma-4-26b",
"provider": {
"gemma-local": {
"npm": "@ai-sdk/openai-compatible",
"name": "Gemma 4 26B-A4B (local llama.cpp)",
"options": {
"baseURL": "http://127.0.0.1:8080/v1"
},
"models": {
"gemma-4-26b": {
"name": "Gemma 4 26B-A4B (Q4_K_M)",
"tool_call": true,
"limit": {
"context": 32768,
"output": 8192
}
}
}
}
},
"agent": {
"build": {
"model": "gemma-local/gemma-4-26b"
},
"general": {
"model": "gemma-local/gemma-4-26b"
},
"plan": {
"model": "gemma-local/gemma-4-26b"
},
"title": {
"model": "gemma-local/gemma-4-26b"
}
}
}Context limit: The
contextvalue inopencode.jsonshould match the-cflag you passed to llama-server. If you increased context to 65536, update it here too.
With the llama.cpp server running in one terminal, open a second terminal in your project directory:
cd C:\your\project
opencode -m gemma-local/gemma-4-26bOpenCode will connect to your local Gemma model and you can start coding.
Terminal 1: Terminal 2:
───────────────────────────────── ─────────────────────────────────────
.\start_server.ps1 -Model ... cd C:\your\project
(wait for "model loaded") opencode -m gemma-local/gemma-4-26b
→ start working
- Start llama-server (Terminal 1)
- Wait for "model loaded"
- Run
opencode -m gemma-local/gemma-4-26bin your project directory (Terminal 2)
OpenCode provides these built-in tools (no extra configuration needed):
| Tool | Description |
|---|---|
read |
Read file contents |
write |
Create or overwrite a file |
edit |
Precise substring replacement |
glob |
Find files by pattern |
grep |
Search text inside files |
bash |
Run shell commands |
webfetch |
Fetch and read a web page |
apply_patch |
Apply unified diffs |
todowrite |
Track task progress |
- "DLL not found" — Install the Visual C++ Redistributable. If using CUDA, ensure the CUDA Toolkit
bindirectory is on your PATH. - Crashes immediately — Reduce context size:
-c 8192to rule out OOM. Check GPU VRAM withnvidia-smi. - Very slow — Ensure GPU offload is working (
-ngl 99). Checknvidia-smi— GPU utilisation should be high during inference. If VRAM is full, reduce-cor use a smaller quant.
- Verify the server is running: open http://127.0.0.1:8080/health in a browser.
- Check the port matches between
start_serverandopencode.json. - Windows Firewall may block the port — allow
llama-server.exethrough or use a different port.
- Gemma 4's GGUF includes a native chat template with tool-calling support. If tool calls aren't being made, check the llama-server logs for errors.
- Try a different quant — some community quants may have incomplete chat templates.
- Use the official
ggml-org/gemma-4-26b-a4b-it-GGUFquants which embed the correct template.
- Use partial offload:
-ngl 30(some layers on GPU, rest on CPU RAM) - Use a smaller quant: Q2_K (~10 GB) fits 16 GB GPUs
- Use CPU only:
-ngl 0(slow but works with enough system RAM)
- Keep paths short (e.g.
C:\llama\,C:\models\) to avoid Windows path-length issues - Both forward slashes and backslashes work in model paths
| Feature | macOS (gemma-code) | Windows (this fork) |
|---|---|---|
| Harness | Custom Python REPL | OpenCode (Go binary) |
| GPU backend | Metal (Apple Silicon) | CUDA / Vulkan / HIP |
| KV cache compression | TurboQuant (turbo4) | Standard quant (q8_0/q4_0) |
| Tool calling | Native API + text-tools fallback | OpenCode built-in tools |
| Multi-agent orchestrator | Yes (orchestrator.py) | Not included |
| CaseVault integration | Yes (vault.py) | Not included |
| Permission layer | Yes (permissions.py) | Not included |
| Session persistence | Built-in | OpenCode manages sessions |
| MCP server support | No | Yes (OpenCode feature) |
| Custom tools | Python tool functions | JS/TS files in .opencode/tools/ |
To use more of Gemma's 256K context:
# 64K context (good for 50+ page documents)
.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Ctx 65536
# 128K context (needs ~22 GB VRAM with KV quant)
.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Ctx 131072Update opencode.json to match:
"limit": {
"context": 65536,
"output": 8192
}.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Port 9090Update opencode.json:
"baseURL": "http://127.0.0.1:9090/v1"OpenCode supports MCP servers. Add to opencode.json:
{
"mcpServers": {
"my-server": {
"type": "stdio",
"command": "path/to/mcp-server.exe",
"args": [],
"env": []
}
}
}Create .opencode/tools/ in your project directory and add .ts or .js files. Each file becomes a tool. See OpenCode custom tools docs.
gemma-opencode-windows/
start_server.bat Windows batch launcher for llama-server
start_server.ps1 PowerShell launcher (recommended)
opencode.json OpenCode provider config — copy to your project root
README.md This guide