Skip to content

cbjeyes/gemma-opencode-windows

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

Gemma Code for Windows (OpenCode Edition)

Run Gemma 4 26B-A4B locally on Windows with OpenCode as the agentic coding harness — fully offline, no data leaves your machine.

┌──────────────────────────────────────────────────────────────────┐
│  llama.cpp server (GPU)  ←──  OpenCode (terminal UI)            │
│  Gemma 4 26B-A4B Q4_K_M      Tool calls: read, write, bash,    │
│  http://127.0.0.1:8080        grep, glob, edit, web fetch       │
└──────────────────────────────────────────────────────────────────┘

This is a Windows fork of gemma-code, which was built for macOS with Apple Silicon. This version replaces the custom Python harness with OpenCode, a Go-based agentic CLI that connects to any OpenAI-compatible endpoint.


Why this setup?

Component Role
Gemma 4 26B-A4B MoE model — 26B total params, 4B active per token. 256K context window. Fast inference despite large capacity.
llama.cpp Runs the GGUF model locally with GPU acceleration (CUDA, Vulkan, or CPU). Exposes an OpenAI-compatible API.
OpenCode Terminal-based agentic UI with built-in tools (file I/O, shell, grep, web fetch). Talks to llama.cpp as its backend.

System requirements

Minimum Recommended
OS Windows 10 (64-bit) Windows 11
GPU (NVIDIA) RTX 3090 / 4090 (24 GB VRAM) RTX 5090 (32 GB)
GPU (AMD) RX 7900 XTX (24 GB) via Vulkan
System RAM 32 GB 64 GB
Disk ~20 GB free ~25 GB free
Software CUDA Toolkit 12.4+ (NVIDIA) or Vulkan SDK (AMD)

GPU VRAM is the bottleneck. The Q4_K_M quant is ~16 GB; you need ~20 GB VRAM to run it with reasonable context. A 24 GB GPU (RTX 3090/4090) is the practical minimum for full GPU offload.

No 24 GB GPU? Options:

  • Use a smaller quant (Q2_K, ~10 GB) — lower quality but fits 16 GB VRAM
  • Use partial GPU offload (-ngl 20 instead of -ngl 99) — slower, uses system RAM for remaining layers
  • Use CPU-only inference — works but very slow (~5-15x slower)

Setup

Step 1: Install llama.cpp

Choose one method:

Option A — Pre-built binaries (recommended)

Download the latest release from github.com/ggml-org/llama.cpp/releases.

Pick the right zip for your GPU:

GPU Download
NVIDIA (CUDA 12.4) llama-bXXXX-bin-win-cuda-12.4-x64.zip
NVIDIA (CUDA 13.x) llama-bXXXX-bin-win-cuda-13.1-x64.zip
AMD / Intel / any (Vulkan) llama-bXXXX-bin-win-vulkan-x64.zip
CPU only llama-bXXXX-bin-win-cpu-x64.zip

Extract to a short path (e.g. C:\llama\) and add it to your PATH:

# Extract
Expand-Archive -Path "llama-bXXXX-bin-win-cuda-12.4-x64.zip" -DestinationPath "C:\llama"

# Add to PATH (current session)
$env:PATH = "C:\llama;" + $env:PATH

# Add to PATH (permanent — run as admin)
[Environment]::SetEnvironmentVariable("Path", "C:\llama;" + [Environment]::GetEnvironmentVariable("Path", "Machine"), "Machine")

Verify:

llama-server --version

Option B — winget

winget install llama.cpp

Note: the winget package may ship Vulkan or CPU-only. For CUDA, use Option A.

Option C — Build from source

Requires Visual Studio 2022 with "Desktop development with C++" workload. Run from a Developer Command Prompt:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

REM For NVIDIA CUDA:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

REM For Vulkan (AMD / Intel / NVIDIA):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

REM CPU only:
cmake -B build
cmake --build build --config Release

Executables end up in build\bin\Release\.


Step 2: Download the Gemma 4 model

# Install the HuggingFace CLI if you don't have it
pip install huggingface-hub

# Download Q4_K_M (~16 GB)
huggingface-cli download ggml-org/gemma-4-26b-a4b-it-GGUF gemma-4-26b-a4b-it-Q4_K_M.gguf --local-dir C:\models\gemma4

Or download directly from huggingface.co/ggml-org/gemma-4-26b-a4b-it-GGUF.

Important: Use the 26B-A4B variant (not E4B). The 26B-A4B has a 256K context window versus 128K on E4B.

Which quant?

Quant Size VRAM needed Quality
Q4_K_M ~16 GB ~20 GB Good — default choice
Q5_K_M ~19 GB ~22 GB Better
Q2_K ~10 GB ~13 GB Lower — fits 16 GB GPUs
Q8_0 ~27 GB ~30 GB Near-lossless — needs 32 GB VRAM

Step 3: Start the llama.cpp server

PowerShell (recommended):

.\start_server.ps1 -Model C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf

Command Prompt:

start_server.bat C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf

Manual command (if you prefer):

llama-server -m "C:\models\gemma4\gemma-4-26b-a4b-it-Q4_K_M.gguf" `
  -ngl 99 -c 32768 -fa on `
  -ctk q8_0 -ctv q4_0 `
  --host 127.0.0.1 --port 8080

Wait for the "model loaded" message. You can verify at http://127.0.0.1:8080 — llama.cpp includes a built-in chat web UI.

Key flags explained:

Flag Purpose
-ngl 99 Offload all layers to GPU (reduce for partial offload)
-c 32768 Context window size in tokens
-fa on Flash Attention — faster, less memory
-ctk q8_0 -ctv q4_0 KV cache quantisation — saves ~3x VRAM on context
--host 127.0.0.1 Listen on localhost only
--port 8080 API port (OpenCode connects here)

Adjusting context size:

Context Approx. capacity Extra VRAM (with KV quant)
32K tokens ~25,000 words ~0.5 GB
64K tokens ~50,000 words ~1 GB
128K tokens ~100,000 words ~2 GB

Step 4: Install OpenCode

Choose one method:

# Scoop
scoop install opencode

# Chocolatey
choco install opencode

# npm
npm install -g opencode-ai

# Go
go install github.com/anomalyco/opencode@latest

Or download the Windows binary from github.com/anomalyco/opencode/releases.

Verify:

opencode --version

Step 5: Configure OpenCode

Copy the opencode.json from this directory to your project root (the directory you'll run OpenCode from):

copy opencode.json C:\your\project\opencode.json

Or create opencode.json in your project root with this content:

{
  "$schema": "https://opencode.ai/config.json",
  "model": "gemma-local/gemma-4-26b",
  "provider": {
    "gemma-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma 4 26B-A4B (local llama.cpp)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gemma-4-26b": {
          "name": "Gemma 4 26B-A4B (Q4_K_M)",
          "tool_call": true,
          "limit": {
            "context": 32768,
            "output": 8192
          }
        }
      }
    }
  },
  "agent": {
    "build": {
      "model": "gemma-local/gemma-4-26b"
    },
    "general": {
      "model": "gemma-local/gemma-4-26b"
    },
    "plan": {
      "model": "gemma-local/gemma-4-26b"
    },
    "title": {
      "model": "gemma-local/gemma-4-26b"
    }
  }
}

Context limit: The context value in opencode.json should match the -c flag you passed to llama-server. If you increased context to 65536, update it here too.


Step 6: Run OpenCode

With the llama.cpp server running in one terminal, open a second terminal in your project directory:

cd C:\your\project
opencode -m gemma-local/gemma-4-26b

OpenCode will connect to your local Gemma model and you can start coding.


Quick-start checklist

Terminal 1:                              Terminal 2:
─────────────────────────────────        ─────────────────────────────────────
.\start_server.ps1 -Model ...           cd C:\your\project
  (wait for "model loaded")             opencode -m gemma-local/gemma-4-26b
                                           → start working
  1. Start llama-server (Terminal 1)
  2. Wait for "model loaded"
  3. Run opencode -m gemma-local/gemma-4-26b in your project directory (Terminal 2)

OpenCode tools

OpenCode provides these built-in tools (no extra configuration needed):

Tool Description
read Read file contents
write Create or overwrite a file
edit Precise substring replacement
glob Find files by pattern
grep Search text inside files
bash Run shell commands
webfetch Fetch and read a web page
apply_patch Apply unified diffs
todowrite Track task progress

Troubleshooting

llama-server won't start

  • "DLL not found" — Install the Visual C++ Redistributable. If using CUDA, ensure the CUDA Toolkit bin directory is on your PATH.
  • Crashes immediately — Reduce context size: -c 8192 to rule out OOM. Check GPU VRAM with nvidia-smi.
  • Very slow — Ensure GPU offload is working (-ngl 99). Check nvidia-smi — GPU utilisation should be high during inference. If VRAM is full, reduce -c or use a smaller quant.

OpenCode can't connect

  • Verify the server is running: open http://127.0.0.1:8080/health in a browser.
  • Check the port matches between start_server and opencode.json.
  • Windows Firewall may block the port — allow llama-server.exe through or use a different port.

Tool calling not working

  • Gemma 4's GGUF includes a native chat template with tool-calling support. If tool calls aren't being made, check the llama-server logs for errors.
  • Try a different quant — some community quants may have incomplete chat templates.
  • Use the official ggml-org/gemma-4-26b-a4b-it-GGUF quants which embed the correct template.

"Model too large for GPU"

  • Use partial offload: -ngl 30 (some layers on GPU, rest on CPU RAM)
  • Use a smaller quant: Q2_K (~10 GB) fits 16 GB GPUs
  • Use CPU only: -ngl 0 (slow but works with enough system RAM)

Path issues

  • Keep paths short (e.g. C:\llama\, C:\models\) to avoid Windows path-length issues
  • Both forward slashes and backslashes work in model paths

Differences from the macOS version

Feature macOS (gemma-code) Windows (this fork)
Harness Custom Python REPL OpenCode (Go binary)
GPU backend Metal (Apple Silicon) CUDA / Vulkan / HIP
KV cache compression TurboQuant (turbo4) Standard quant (q8_0/q4_0)
Tool calling Native API + text-tools fallback OpenCode built-in tools
Multi-agent orchestrator Yes (orchestrator.py) Not included
CaseVault integration Yes (vault.py) Not included
Permission layer Yes (permissions.py) Not included
Session persistence Built-in OpenCode manages sessions
MCP server support No Yes (OpenCode feature)
Custom tools Python tool functions JS/TS files in .opencode/tools/

Advanced configuration

Increasing context window

To use more of Gemma's 256K context:

# 64K context (good for 50+ page documents)
.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Ctx 65536

# 128K context (needs ~22 GB VRAM with KV quant)
.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Ctx 131072

Update opencode.json to match:

"limit": {
  "context": 65536,
  "output": 8192
}

Using a different port

.\start_server.ps1 -Model C:\models\gemma4\model.gguf -Port 9090

Update opencode.json:

"baseURL": "http://127.0.0.1:9090/v1"

Adding MCP servers

OpenCode supports MCP servers. Add to opencode.json:

{
  "mcpServers": {
    "my-server": {
      "type": "stdio",
      "command": "path/to/mcp-server.exe",
      "args": [],
      "env": []
    }
  }
}

Custom tools

Create .opencode/tools/ in your project directory and add .ts or .js files. Each file becomes a tool. See OpenCode custom tools docs.


Project structure

gemma-opencode-windows/
  start_server.bat     Windows batch launcher for llama-server
  start_server.ps1     PowerShell launcher (recommended)
  opencode.json        OpenCode provider config — copy to your project root
  README.md            This guide

About

Gemma 4 26B-A4B on Windows with OpenCode as the agentic coding harness and llama.cpp as the local inference backend

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors