This guide details how to deploy a Ray cluster on Google Kubernetes Engine (GKE) configured for training models such as Gemma 3 with the NVIDIA NeMo-RL framework.
The cluster provisions B200 Spot instances (8 GPUs per node) and mounts a Google Cloud Storage (GCS) bucket using the GCS FUSE CSI driver.
- An authenticated `kubectl` context pointing to your target GKE cluster.
- The `manifests/01_Infra/ray-cluster-b200-nemo.yaml` manifest.
- A valid HuggingFace Access Token exported into your terminal (`export HF_TOKEN="hf_..."`) to download the gated Gemma 3 weights.
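Before continuing, a quick sanity check can save a failed launch later. The helper below is hypothetical (not part of the repo's scripts) and only verifies that the exported token looks like a HuggingFace token:

```shell
# Hypothetical pre-flight check (not part of the repo's scripts).
check_hf_token() {
  case "$1" in
    hf_*) echo "ok" ;;         # looks like a HuggingFace token
    "")   echo "missing" ;;    # HF_TOKEN not exported
    *)    echo "malformed" ;;  # set, but not an hf_-prefixed token
  esac
}

check_hf_token "$HF_TOKEN"
```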
If you have an existing or stale Ray cluster running on the target node pools, you must tear it down first to free up the expensive GPU resources and avoid scheduling deadlocks.
```bash
# Delete any existing testing cluster
kubectl delete raycluster ray-cluster-b200

# Delete the target NeMo cluster if explicitly restarting
kubectl delete raycluster ray-cluster-b200-nemo
```

The manifest defines a RayCluster Custom Resource (an instance of the KubeRay CRD). The KubeRay operator will pick this up and begin provisioning the Head and Worker pods.

```bash
kubectl apply -f manifests/01_Infra/ray-cluster-b200-nemo.yaml
```

Note: The YAML configures `replicas: 2` under `workerGroupSpecs`, requesting 16 B200 GPUs in total. Adjust this value in the YAML if you need a different scale.
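For orientation, the fields discussed above sit in the manifest roughly as follows. This is a trimmed sketch assuming the standard KubeRay `RayCluster` schema; the group name and API version are illustrative, so treat the actual manifest as authoritative:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster-b200-nemo
spec:
  headGroupSpec:
    ...                          # head pod template, dashboard port, etc.
  workerGroupSpecs:
    - groupName: b200-workers    # illustrative name
      replicas: 2                # 2 workers x 8 GPUs = 16 B200 GPUs
      template:
        spec:
          containers:
            - name: ray-worker
              resources:
                limits:
                  nvidia.com/gpu: 8
```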
Ray pods initialize via an initContainer (`install-ray`) that downloads the Ray Python packages into a temporary volume before the main container starts. This takes roughly 45-60 seconds.

Watch the pod spin-up process:

```bash
kubectl get pods -l ray.io/cluster=ray-cluster-b200-nemo -w
```

Wait until both the ray-head and ray-worker pods reach the Running state and report Ready (e.g., 4/4 and 3/3).
Once the pods are running, connect to the Head node and query Ray to confirm that all worker GPUs are registered in the pool.

```bash
# Fetch the head pod name
HEAD_POD=$(kubectl get pods -l ray.io/cluster=ray-cluster-b200-nemo -l ray.io/node-type=head --no-headers | awk '{print $1}' | head -n 1)

# Execute 'ray status' (ensuring the pip-installed PATH is targeted)
kubectl exec $HEAD_POD -c ray-head -- /bin/bash -c "export PATH=/tmp/ray/packages/bin:\$PATH; ray status"
```

If successful, the output will report 232 active CPUs and 16.0 active GPUs (assuming 2 worker replicas).
You can monitor the cluster's GPU utilization, view logs, and track the real-time progress of jobs via the Ray Dashboard. Set up a local port-forward to the Ray Head service using kubectl:

```bash
kubectl port-forward svc/ray-cluster-b200-nemo-head-svc 8265:8265
```

Once that is running, open your web browser and navigate to http://localhost:8265.
Because the `grpo-*.yaml` is configured to write logs to `/data/nemo-rl-logs`, metrics stream directly into Google Cloud Storage via the Kubernetes FUSE mount. You can run TensorBoard on the cluster to track these logs!
Spin up the TensorBoard daemon in the background of the Head Pod:
```bash
RAY_HEAD_POD=$(kubectl get pods -l ray.io/cluster=ray-cluster-b200-nemo -l ray.io/node-type=head -o 'jsonpath={.items[0].metadata.name}')
kubectl exec $RAY_HEAD_POD -- bash -c "nohup tensorboard --logdir=/data/nemo-rl-logs/ --host=0.0.0.0 --port=6006 > /tmp/tensorboard.log 2>&1 &"
```

Then, execute a local port-forward to your machine:

```bash
kubectl port-forward $RAY_HEAD_POD 6006:6006
```

Open http://localhost:6006 in your browser. As vLLM generates rollouts and rewards are computed, this dashboard will update from the logs in your Cloud Storage bucket.
Step 1: Clean Up the View
On the left side of your screen, under the Run list, you will see exp_001, exp_002, etc. Uncheck every box except your current active run. This hides the old, aborted test runs and leaves one clean line on the graphs.

Step 2: The Only 3 Graphs You Need to Care About

At the very top of the page, click the SCALARS tab. Search for and pin these three metrics:
- Average Reward (or Accuracy) -> `metrics/reward` or `metrics/avg_reward`
  - What it means: This is the ultimate "is it working" line. It combines how well your model is formatting the `<think>` tags and whether it actually got the right math answer.
  - What you want: A steady climb up and to the right from your baseline (e.g., 30%).
- Mean Generation Length -> `metrics/mean_generation_length` (or `generation_tokens`)
  - What it means: How long Gemma 3 thinks before it emits a final answer.
  - What you want: In reasoning tasks, you want this line to go up over time! It shows the model is learning to spend more tokens exploring the problem logically before rushing an answer.
- KL Divergence Error -> `metrics/kl_error` (or `policy_kl`)
  - What it means: This measures how far the model's policy has drifted from the original, untrained reference model.
  - What you want: A low, stable line! If this spikes wildly and stays high, the model has likely "mode-collapsed" (it found a hack to cheat the reward system, e.g., by abandoning English and spamming XML tags). If the KL penalty is working, this line stays flat.
(Everything else, like `samples_per_sec` or `policy_training_lck`, is just hardware diagnostics. Keep your eyes on the Reward and Generation Length!)
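For intuition on what the KL metric tracks, the toy sketch below (plain Python, not NeMo-RL code) computes the KL divergence between a policy's next-token distribution and the frozen reference model's. A small drift yields a small KL; dumping almost all probability mass on one token (mode collapse) makes it spike:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]   # frozen reference model's next-token distribution

drifted = [0.60, 0.25, 0.15]     # mild policy drift
print(round(kl_divergence(drifted, reference), 4))    # -> 0.0241

collapsed = [0.98, 0.01, 0.01]   # mode collapse: nearly all mass on one token
print(round(kl_divergence(collapsed, reference), 4))  # -> 0.2768
```

The GRPO KL penalty pushes the policy back toward the reference whenever this number grows, which is why a flat, low line is the healthy signature.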
Your infrastructure is now fully provisioned and distributed! You do not need a local Python environment or a local Ray installation.

To initiate the training pipeline, we use a bash wrapper (`launch_grpo.sh`) that runs `kubectl exec` to proxy into the Head pod and trigger Ray's `JobSubmissionClient` from inside the cluster:
```bash
# Ensure your token is exported locally!
export HF_TOKEN="your_huggingface_token"

# Execute the launcher
./scripts/launch_grpo.sh
```

This script will automatically:

- Find your Ray Head Pod.
- Inject your `$HF_TOKEN`, `setup_nemo_rl.sh`, and the custom `grpo-gemma*.yaml` manifests directly into the Head node's `/workspace`.
- Broadcast the codebase to all Workers via `ray job submit` and commence OpenMathInstruct-2 GRPO training!
The framework aggregates all generation trajectories, prompt inputs, and computed rewards at every step. Because these are not currently synced to an external dashboard, they reside inside the Ray Head Pod.

You can view the raw generated strings and see exactly how the LLM reasoned its way to each answer by executing this command locally in your terminal. It targets the `train_data_step[X].jsonl` file and uses jq to format the 'prompt' and generated 'text' fields:
```bash
RAY_HEAD_POD=$(kubectl get pods -l ray.io/cluster=ray-cluster-b200-nemo -l ray.io/node-type=head -o 'jsonpath={.items[0].metadata.name}')
kubectl exec -it $RAY_HEAD_POD -- bash -c "tail -n 1 /data/nemo-rl-logs/grpo-gemma3-1b-it-1n8g-fsdp2tp1-b200/exp_001/train_data_step1.jsonl | jq '. | {prompt: .content[0][0], generated_text: .content[0][1], rewards: .rewards}'"
```

Running that will output a clean JSON block showing the exact math problem, what Gemma printed inside its `<reasoning>` and `<answer>` tags, and the reward score it received for that specific trajectory!

To view different steps as training progresses, simply change `train_data_step1.jsonl` to `train_data_step60.jsonl` (or whichever step you want to inspect). Note: Ensure you update `exp_001` to match your actual experiment run directory.
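If you prefer Python over jq, the same extraction can be done locally on a copied JSONL file. The record schema below is inferred from the jq filter above (a `content` list of (prompt, generation) pairs plus a `rewards` list), so treat it as an assumption:

```python
import json

def summarize_trajectory(jsonl_line: str) -> dict:
    """Extract prompt, generated text, and rewards from one training-data record.

    Assumes the record layout implied by the jq filter in this guide:
    record["content"][0] is a (prompt, generated_text) pair.
    """
    record = json.loads(jsonl_line)
    return {
        "prompt": record["content"][0][0],
        "generated_text": record["content"][0][1],
        "rewards": record["rewards"],
    }

# Hypothetical sample record for illustration:
sample = json.dumps({
    "content": [["What is 2+2?", "<reasoning>2+2=4</reasoning><answer>4</answer>"]],
    "rewards": [1.0],
})
print(summarize_trajectory(sample)["rewards"])  # -> [1.0]
```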
During each evaluation phase, the model is tested on its mathematical reasoning capabilities. The following table showcases the model's trajectory across different training steps, highlighting how its <think> process evolves over time:
| Step | Generation Profile | Step | Generation Profile |
|---|---|---|---|
| Step 10 | ![]() | Step 20 | ![]() |
| Step 30 | ![]() | Step 40 | ![]() |
| Step 50 | ![]() | | |
Gemma 3 introduces several new special control tokens. NeMo-RL's math_hf_data_processor handles them natively via HuggingFace's apply_chat_template:
- `<start_of_turn>` and `<end_of_turn>`: Injected automatically around user/model turns when `add_generation_prompt=True` is used.
- `<|thought|>` (Reasoning Mode): Turned off by default in the chat template. Setting `policy.tokenizer.chat_template_kwargs: {enable_thinking: true}` in your YAML config injects this token to force the model into its internal reasoning phase.
- `<|file_separator|>` and `<|n_th_step|>`: Strictly used for multimodal layout or tool-calling steps. NeMo-RL safely ignores these during text-only generation.
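To make the turn tokens concrete, here is a toy formatter illustrating the layout Gemma's chat template produces for a single user turn. This is a simplification (the real `apply_chat_template` Jinja template, BOS handling, etc. are authoritative):

```python
def format_gemma_turns(messages, add_generation_prompt=True):
    """Simplified sketch of Gemma-style turn wrapping; the real chat template may differ."""
    out = []
    for msg in messages:
        out.append(f"<start_of_turn>{msg['role']}\n{msg['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # An open model turn cues the LLM to start generating its reply.
        out.append("<start_of_turn>model\n")
    return "".join(out)

print(format_gemma_turns([{"role": "user", "content": "Solve 3*7."}]))
```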
By default, NeMo-RL will implicitly resume from the last available PyTorch checkpoint if it detects an existing results/<experiment_name> folder on the local disk. There is no explicit --no-resume flag in the YAML config.
If you abort a training job and want to start a brand new run with the exact same experiment name, you must delete the old results folders from both the Head Pod AND all Worker Pods (because PyTorch FSDP saves the fragmented model shards onto the local disks of the worker pods).
Run these commands locally to completely flush the cluster's local disks before launching a fresh job:
```bash
EXPERIMENT="grpo-gemma3-1b-it-1n8g-fsdp2tp1-b200"

# 1. Wipe the Head Pod's configuration checkpoint
RAY_HEAD_POD=$(kubectl get pods -l ray.io/cluster=ray-cluster-b200-nemo -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
kubectl exec $RAY_HEAD_POD -- bash -c "rm -rf /opt/nemo-rl/results/$EXPERIMENT /opt/nemo-rl/logs/$EXPERIMENT"

# 2. Wipe every Worker Pod's FSDP shard weights
WORKER_PODS=$(kubectl get pods -l ray.io/cluster=ray-cluster-b200-nemo -l ray.io/node-type=worker -o jsonpath='{.items[*].metadata.name}')
for POD in $WORKER_PODS; do
  kubectl exec $POD -- bash -c "rm -rf /opt/nemo-rl/results/$EXPERIMENT /opt/nemo-rl/logs/$EXPERIMENT"
done

echo "Cluster wiped. Safe to launch!"
```

Because this pipeline uses Fully Sharded Data Parallel (FSDP) to train across all GPUs, PyTorch distributes the model state across multiple nodes in Distributed Checkpoint (DCP) format. Before evaluating or serving these weights with standard HuggingFace APIs (like vLLM), they must be converted and merged into a unified format.
Do not use save_consolidated: true inside your YAML config with model_save_format: safetensors. A bug in the underlying nemo_automodel subsystem skips mathematical concatenation of FSDP embedding/projection shards, resulting in corrupted outputs and AssertionErrors during vLLM initialization.
The Flawless Workflow:
Train natively using model_save_format: "pytorch" and save_consolidated: false. This guarantees the checkpoints flush rapidly and safely during distribution without metadata corruption.
Once your training halts or saves a step sequence, you have two options to merge the checkpoint for evaluation.
We authored a standalone offline script (`scripts/merge_benchmarking/manual_merge_fsdp.py`) that loads the partitioned DCP shards into CPU memory, identifies the sharded dimensions, and uses `torch.cat()` to concatenate the partitions into a unified HuggingFace `.safetensors` model.
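Conceptually, the merge is just concatenation along each tensor's sharded dimension. The sketch below illustrates this with plain Python lists standing in for row-sharded weight matrices (the real script uses `torch.cat` on DCP tensors):

```python
def merge_row_shards(shards):
    """Concatenate row-wise shards of a 2-D weight back into one full matrix.

    Illustrative only: stands in for torch.cat(shards, dim=0) in the real merge.
    """
    merged = []
    for shard in shards:
        merged.extend(shard)
    return merged

# A 4x2 "embedding" split row-wise across two FSDP workers:
shard0 = [[0.1, 0.2], [0.3, 0.4]]
shard1 = [[0.5, 0.6], [0.7, 0.8]]
full = merge_row_shards([shard0, shard1])
print(len(full), len(full[0]))  # -> 4 2
```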
We have bundled this into our evaluation script. You simply run:

```bash
./scripts/merge_benchmarking/run_math500_eval.sh
```

This script will automatically:

- `kubectl cp` the merge script to the Ray Head Pod.
- Execute the merge and copy the metadata/config indices.
- Copy the HF `tokenizer.model` from your cluster's GCS cache.
- Broadcast the unified weights to all Worker node SSDs (avoiding GCS-Fuse `mmap` crashes).
- Launch the native NeMo-RL evaluation job on MATH-500.
If you prefer to use the official NVIDIA-NeMo/RL conversion workflow without our offline patch, you must configure your checkpointing as follows:

```yaml
checkpointing:
  model_save_format: "torch_save"  # Must use torch_save instead of safetensors
  save_consolidated: false         # Avoids the FSDP consolidation bug
```

Once the model step saves in PyTorch DCP format, execute their converter:

```bash
# Example for a GRPO checkpoint at step 170
uv run python examples/converters/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf
```

Note: Adjust the paths according to your training output directory structure. Once the conversion is complete, override `generation.model_name` in the evaluation script to point to the `results/grpo/hf` directory.
We utilized our bundled evaluation script to measure the zero-shot Pass@1 accuracy against the MATH-500 dataset, explicitly comparing the vanilla google/gemma-3-1b-it model against our fine-tuned policy after 30 FSDP/GRPO steps.
Vanilla Zero-Shot Baseline:

```bash
export HF_TOKEN="your_hf_token" && ./scripts/merge_benchmarking/run_math500_eval.sh --vanilla-model google/gemma-3-1b-it
```

```
============================================================
model_name='gemma-3-1b-it' dataset_name='math500'
max_new_tokens=8192 temperature=0.0 top_p=1.0 top_k=-1 seed=42
metric=pass@1 num_tests_per_prompt=1
score=0.4020 (201.0/500)
============================================================
```
Fine-Tuned GRPO Model (Step 30):

```bash
./scripts/merge_benchmarking/run_math500_eval.sh --skip-sync=false --skip-merge=false
```

```
============================================================
model_name='hf_merged_model_eval' dataset_name='math500'
max_new_tokens=8192 temperature=0.0 top_p=1.0 top_k=-1 seed=42
metric=pass@1 num_tests_per_prompt=1
score=0.4640 (232.0/500)
============================================================
```
Result: A +6.2 percentage-point absolute gain in mathematical reasoning accuracy after only a handful of GRPO training steps!
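The scores above are plain ratios, so the gain is easy to verify by hand:

```python
def pass_at_1(correct: float, total: int) -> float:
    """Zero-shot Pass@1 with one sample per prompt is simply correct / total."""
    return correct / total

baseline = pass_at_1(201.0, 500)    # vanilla gemma-3-1b-it
finetuned = pass_at_1(232.0, 500)   # GRPO policy at step 30
gain = round((finetuned - baseline) * 100, 1)
print(f"{baseline:.4f} -> {finetuned:.4f} (+{gain} points absolute)")
```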
NeMo-RL is designed for modular reward engineering. Rather than entangling rewards with the PPO/GRPO core loops, the framework executes simple Python functions mapped via YAML configs.
To add a completely new reward (e.g., regex penalties, semantic formatting, custom logical rules):
Open /opt/nemo-rl/nemo_rl/environments/rewards.py (either on your active cluster or in your fork) and define your function. It must accept the ground truth and model response, returning a (float_reward, boolean_passed) tuple.
```python
def no_swearing_penalty(ground_truth: str, response: str) -> tuple[float, bool]:
    if "darn" in response.lower():
        return -1.0, False  # Apply heavy penalty
    return 0.0, True  # Neutral/Pass
```

Open `/opt/nemo-rl/nemo_rl/environments/vlm_environment.py`. Locate the `_instantiate_reward_functions` method and map a YAML string identifier to your new python function:
```python
elif reward_func_name == "no_swearing":
    reward_func = no_swearing_penalty
```

You can now construct multi-objective reward pipelines directly from your experiment's configuration file. The framework will automatically invoke `combine_reward_functions()` to calculate the weighted sum of all functions listed:
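The weighted combination can be pictured with the minimal sketch below. It is illustrative only (the real `combine_reward_functions()` lives inside NeMo-RL); the toy reward functions mirror the required shape, returning `(float_reward, boolean_passed)` tuples, with weights echoing a config like the YAML fragment that follows:

```python
def combine_rewards(reward_fns, ground_truth, response):
    """Weighted sum over (callable, weight) pairs, mirroring the YAML reward_functions list."""
    total, all_passed = 0.0, True
    for fn, weight in reward_fns:
        reward, passed = fn(ground_truth, response)
        total += weight * reward
        all_passed = all_passed and passed
    return total, all_passed

# Toy reward functions for illustration:
format_reward = lambda gt, r: (1.0, True) if "<answer>" in r else (0.0, False)
exact_reward = lambda gt, r: (1.0, True) if gt in r else (0.0, False)

score, ok = combine_rewards([(format_reward, 0.2), (exact_reward, 0.8)],
                            "42", "<answer>42</answer>")
print(score, ok)  # -> 1.0 True
```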
```yaml
env:
  reward_functions:
    - name: "format"
      weight: 0.2
      kwargs:
        think_tag: "think"
        answer_tag: "answer"
    - name: "exact_alnum"
      weight: 0.8
    - name: "no_swearing"
      weight: 1.0  # Applied as a heavy penalty multiplier
```

If your Gemma 3 training job crashes immediately during vLLM worker initialization with the FlashInfer assertion error, this is because Gemma 3 has a head size of 256, which breaks FlashInfer's default block size of 16.
To fix this, ensure your configuration YAML (e.g., manifests/02_Job/grpo-gemma3-1b-it-1n8g-fsdp2tp1-b200.yaml) explicitly defines the block_size: 32 override specifically inside the vllm_kwargs dictionary (NOT vllm_cfg), as NeMo RL requires explicit kwargs passing for the vLLM engine constructor:
```yaml
policy:
  generation:
    vllm_kwargs:
      block_size: 32
```





