Gemma Code for Windows (OpenCode Edition)

Run Gemma 4 26B-A4B locally on Windows with OpenCode as the agentic coding harness — fully offline, no data leaves your machine.

┌──────────────────────────────────────────────────────────────────┐
│  llama.cpp server (GPU)  ←──  OpenCode (terminal UI)            │
│  Gemma 4 26B-A4B Q4_K_M      Tool calls: read, write, bash,    │
│  http://127.0.0.1:8080        grep, glob, edit, web fetch       │
└──────────────────────────────────────────────────────────────────┘

This is a Windows fork of gemma-code, which was built for macOS with Apple Silicon. This version replaces the custom Python harness with OpenCode, a Go-based agentic CLI that connects to any OpenAI-compatible endpoint.

Why this setup?

Component	Role
Gemma 4 26B-A4B	MoE model — 26B total params, 4B active per token. 256K context window. Fast inference despite large capacity.
llama.cpp	Runs the GGUF model locally with GPU acceleration (CUDA, Vulkan, or CPU). Exposes an OpenAI-compatible API.
OpenCode	Terminal-based agentic UI with built-in tools (file I/O, shell, grep, web fetch). Talks to llama.cpp as its backend.

System requirements

	Minimum	Recommended
OS	Windows 10 (64-bit)	Windows 11
GPU (NVIDIA)	RTX 3090 / 4090 (24 GB VRAM)	RTX 5090 (32 GB)
GPU (AMD)	RX 7900 XTX (24 GB) via Vulkan	—
System RAM	32 GB	64 GB
Disk	~20 GB free	~25 GB free
Software	—	CUDA Toolkit 12.4+ (NVIDIA) or Vulkan SDK (AMD)

GPU VRAM is the bottleneck. The Q4_K_M quant is ~16 GB; you need ~20 GB VRAM to run it with reasonable context. A 24 GB GPU (RTX 3090/4090) is the practical minimum for full GPU offload.

No 24 GB GPU? Options:

Use a smaller quant (Q2_K, ~10 GB) — lower quality but fits 16 GB VRAM

Use partial GPU offload (-ngl 20 instead of -ngl 99) — slower, uses system RAM for remaining layers

Use CPU-only inference — works but very slow (~5-15x slower)

Setup

Step 1: Install llama.cpp

Choose one method:

Option A — Pre-built binaries (recommended)

Download the latest release from github.com/ggml-org/llama.cpp/releases.

Pick the right zip for your GPU:

GPU	Download
NVIDIA (CUDA 12.4)	`llama-bXXXX-bin-win-cuda-12.4-x64.zip`
NVIDIA (CUDA 13.x)	`llama-bXXXX-bin-win-cuda-13.1-x64.zip`
AMD / Intel / any (Vulkan)	`llama-bXXXX-bin-win-vulkan-x64.zip`
CPU only	`llama-bXXXX-bin-win-cpu-x64.zip`

Extract to a short path (e.g. C:\llama\) and add it to your PATH:

# Extract
Expand-Archive -Path "llama-bXXXX-bin-win-cuda-12.4-x64.zip" -DestinationPath "C:\llama"

# Add to PATH (current session)
$env:PATH = "C:\llama;" + $env:PATH

# Add to PATH (permanent — run as admin)
[Environment]::SetEnvironmentVariable("Path", "C:\llama;" + [Environment]::GetEnvironmentVariable("Path", "Machine"), "Machine")

Verify:

llama-server --version

Option B — winget

winget install llama.cpp

Note: the winget package may ship Vulkan or CPU-only. For CUDA, use Option A.

Option C — Build from source

Requires Visual Studio 2022 with "Desktop development with C++" workload. Run from a Developer Command Prompt:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

REM For NVIDIA CUDA:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

REM For Vulkan (AMD / Intel / NVIDIA):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

REM CPU only:
cmake -B build
cmake --build build --config Release

Executables end up in build\bin\Release\.

Step 2: Download the Gemma 4 model

# Install the HuggingFace CLI if you don't have it
pip install huggingface-hub

# Download Q4_K_M (~16 GB)
huggingface-cli download ggml-org/gemma-4-26b-a4b-it-GGUF gemma-4-26b-a4b-it-Q4_K_M.gguf --local-dir C:\models\gemma4

Or download directly from huggingface.co/ggml-org/gemma-4-26b-a4b-it-GGUF.

Important: Use the 26B-A4B variant (not E4B). The 26B-A4B has a 256K context window versus 128K on E4B.

Which quant?

Quant	Size	VRAM needed	Quality
`Q4_K_M`	~16 GB	~20 GB	Good — default choice
`Q5_K_M`	~19 GB	~22 GB	Better
`Q2_K`	~10 GB	~13 GB	Lower — fits 16 GB GPUs
`Q8_0`	~27 GB	~30 GB	Near-lossless — needs 32 GB VRAM

Step 3: Start the llama.cpp server

PowerShell (recommended):

.\start_server.ps1 -Model C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf

Command Prompt:

start_server.bat C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf

Manual command (if you prefer):

llama-server -m "C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf" `
  -ngl 99 -c 32768 -fa on `
  -ctk q8_0 -ctv q4_0 `
  --host 127.0.0.1 --port 8080

Wait for the "model loaded" message. You can verify at http://127.0.0.1:8080 — llama.cpp includes a built-in chat web UI.

Key flags explained:

Flag	Purpose
`-ngl 99`	Offload all layers to GPU (reduce for partial offload)
`-c 32768`	Context window size in tokens
`-fa on`	Flash Attention — faster, less memory
`-ctk q8_0 -ctv q4_0`	KV cache quantisation — saves ~3x VRAM on context
`--host 127.0.0.1`	Listen on localhost only
`--port 8080`	API port (OpenCode connects here)

Adjusting context size:

Context	Approx. capacity	Extra VRAM (with KV quant)
32K tokens	~25,000 words	~0.5 GB
64K tokens	~50,000 words	~1 GB
128K tokens	~100,000 words	~2 GB

Step 4: Install OpenCode

Choose one method:

# Scoop
scoop install opencode

# Chocolatey
choco install opencode

# npm
npm install -g opencode-ai

# Go
go install github.com/anomalyco/opencode@latest

Or download the Windows binary from github.com/anomalyco/opencode/releases.

Verify:

opencode --version

Step 5: Configure OpenCode

Copy the opencode.json from this directory to your project root (the directory you'll run OpenCode from):

copy opencode.json C:\your\project\opencode.json

Or create opencode.json in your project root with this content:

{
  "$schema": "https://opencode.ai/config.json",
  "model": "gemma-local/gemma-4-26b",
  "provider": {
    "gemma-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma 4 26B-A4B (local llama.cpp)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gemma-4-26b": {
          "name": "Gemma 4 26B-A4B (Q4_K_M)",
          "tool_call": true,
          "limit": {
            "context": 32768,
            "output": 8192
          }
        }
      }
    }
  },
  "agent": {
    "build": {
      "model": "gemma-local/gemma-4-26b"
    },
    "general": {
      "model": "gemma-local/gemma-4-26b"
    },
    "plan": {
      "model": "gemma-local/gemma-4-26b"
    },
    "title": {
      "model": "gemma-local/gemma-4-26b"
    }
  }
}

Context limit: The context value in opencode.json should match the -c flag you passed to llama-server. If you increased context to 65536, update it here too.

Step 6: Run OpenCode

With the llama.cpp server running in one terminal, open a second terminal in your project directory:

cd C:\your\project
opencode -m gemma-local/gemma-4-26b

OpenCode will connect to your local Gemma model and you can start coding.

Quick-start checklist

Terminal 1:                              Terminal 2:
─────────────────────────────────        ─────────────────────────────────────
.\start_server.ps1 -Model ...           cd C:\your\project
  (wait for "model loaded")             opencode -m gemma-local/gemma-4-26b
                                           → start working

Start llama-server (Terminal 1)
Wait for "model loaded"
Run opencode -m gemma-local/gemma-4-26b in your project directory (Terminal 2)

OpenCode tools

OpenCode provides these built-in tools (no extra configuration needed):

Tool	Description
`read`	Read file contents
`write`	Create or overwrite a file
`edit`	Precise substring replacement
`glob`	Find files by pattern
`grep`	Search text inside files
`bash`	Run shell commands
`webfetch`	Fetch and read a web page
`apply_patch`	Apply unified diffs
`todowrite`	Track task progress

Troubleshooting

llama-server won't start

"DLL not found" — Install the Visual C++ Redistributable. If using CUDA, ensure the CUDA Toolkit bin directory is on your PATH.
Crashes immediately — Reduce context size: -c 8192 to rule out OOM. Check GPU VRAM with nvidia-smi.
Very slow — Ensure GPU offload is working (-ngl 99). Check nvidia-smi — GPU utilisation should be high during inference. If VRAM is full, reduce -c or use a smaller quant.

OpenCode can't connect

Verify the server is running: open http://127.0.0.1:8080/health in a browser.
Check the port matches between start_server and opencode.json.
Windows Firewall may block the port — allow llama-server.exe through or use a different port.

Tool calling not working

Gemma 4's GGUF includes a native chat template with tool-calling support. If tool calls aren't being made, check the llama-server logs for errors.
Try a different quant — some community quants may have incomplete chat templates.
Use the official ggml-org/gemma-4-26b-a4b-it-GGUF quants which embed the correct template.

"Model too large for GPU"

Use partial offload: -ngl 30 (some layers on GPU, rest on CPU RAM)
Use a smaller quant: Q2_K (~10 GB) fits 16 GB GPUs
Use CPU only: -ngl 0 (slow but works with enough system RAM)

Path issues

Keep paths short (e.g. C:\llama\, C:\models\) to avoid Windows path-length issues
Both forward slashes and backslashes work in model paths

Differences from the macOS version

Feature	macOS (gemma-code)	Windows (this fork)
Harness	Custom Python REPL	OpenCode (Go binary)
GPU backend	Metal (Apple Silicon)	CUDA / Vulkan / HIP
KV cache compression	TurboQuant (turbo4)	Standard quant (q8_0/q4_0)
Tool calling	Native API + text-tools fallback	OpenCode built-in tools
Multi-agent orchestrator	Yes (orchestrator.py)	Not included
CaseVault integration	Yes (vault.py)	Not included
Permission layer	Yes (permissions.py)	Not included
Session persistence	Built-in	OpenCode manages sessions
MCP server support	No	Yes (OpenCode feature)
Custom tools	Python tool functions	JS/TS files in .opencode/tools/

Advanced configuration

Increasing context window

To use more of Gemma's 256K context:

# 64K context (good for 50+ page documents)
.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Ctx 65536

# 128K context (needs ~22 GB VRAM with KV quant)
.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Ctx 131072

Update opencode.json to match:

"limit": {
  "context": 65536,
  "output": 8192
}

Using a different port

.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Port 9090

Update opencode.json:

"baseURL": "http://127.0.0.1:9090/v1"

Adding MCP servers

OpenCode supports MCP servers. Add to opencode.json:

{
  "mcpServers": {
    "my-server": {
      "type": "stdio",
      "command": "path/to/mcp-server.exe",
      "args": [],
      "env": []
    }
  }
}

Custom tools

Create .opencode/tools/ in your project directory and add .ts or .js files. Each file becomes a tool. See OpenCode custom tools docs.

Project structure

gemma-opencode-windows/
  start_server.bat     Windows batch launcher for llama-server
  start_server.ps1     PowerShell launcher (recommended)
  opencode.json        OpenCode provider config — copy to your project root
  README.md            This guide

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gemma Code for Windows (OpenCode Edition)

Why this setup?

System requirements

Setup

Step 1: Install llama.cpp

Step 2: Download the Gemma 4 model

Step 3: Start the llama.cpp server

Step 4: Install OpenCode

Step 5: Configure OpenCode

Step 6: Run OpenCode

Quick-start checklist

OpenCode tools

Troubleshooting

llama-server won't start

OpenCode can't connect

Tool calling not working

"Model too large for GPU"

Path issues

Differences from the macOS version

Advanced configuration

Increasing context window

Using a different port

Adding MCP servers

Custom tools

Project structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
opencode.json		opencode.json
start_server.bat		start_server.bat
start_server.ps1		start_server.ps1

Folders and files

Latest commit

History

Repository files navigation

Gemma Code for Windows (OpenCode Edition)

Why this setup?

System requirements

Setup

Step 1: Install llama.cpp

Step 2: Download the Gemma 4 model

Step 3: Start the llama.cpp server

Step 4: Install OpenCode

Step 5: Configure OpenCode

Step 6: Run OpenCode

Quick-start checklist

OpenCode tools

Troubleshooting

llama-server won't start

OpenCode can't connect

Tool calling not working

"Model too large for GPU"

Path issues

Differences from the macOS version

Advanced configuration

Increasing context window

Using a different port

Adding MCP servers

Custom tools

Project structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages