Expected Behavior
I'm trying to test Tunix on a small v5p cluster using the qwen example script.
I'm following the quick-start and using XPK to create a Pathways-enabled GKE cluster with TPU v5p. Here is the script I'm using to create the GKE cluster:
# install gcloud beta commands
gcloud components install beta
# create pathways cluster
export CLUSTER_NAME='<redacted>'
export ZONE='us-central1-a'
export TPU_TYPE='v5p-8' # e.g. v5p-16
export CLUSTER_CPU_MACHINE_TYPE=n2d-standard-32 # adjust this to use a beefier CPU node
export PROJECT='<redacted>'
NETWORK_NAME=${CLUSTER_NAME}-mtu9k-wx
NETWORK_FW_NAME=${NETWORK_NAME}-fw-wx
export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"
# run `gcloud auth application-default login` and
# `gcloud auth login --update-adc` if you encounter a permission issue when creating the network.
# Check if this is the service account you want to use.
gcloud auth list
gcloud compute networks create ${NETWORK_NAME} \
--mtu=8896 \
--project=${PROJECT} \
--subnet-mode=auto \
--bgp-routing-mode=regional
gcloud compute firewall-rules create ${NETWORK_FW_NAME} \
--network ${NETWORK_NAME} \
--allow tcp,icmp,udp \
--project=${PROJECT}
xpk cluster create-pathways \
--cluster $CLUSTER_NAME \
--cluster-cpu-machine-type=$CLUSTER_CPU_MACHINE_TYPE \
--num-slices=1 \
--tpu-type=$TPU_TYPE \
--zone $ZONE \
--project $PROJECT \
--spot \
--custom-cluster-arguments="${CLUSTER_ARGUMENTS}"
And here is the command I use to launch the Tunix workload:
export ZONE=us-central1-a
export PROJECT='<redacted>'
xpk workload create-pathways \
--cluster=<redacted> \
--workload=tunix-rl \
--command="WANDB_MODE=disabled TPU_MIN_LOG_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=0 TPU_STDERR_LOG_LEVEL=0 JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 ENABLE_PATHWAYS_PERSISTENCE='1' source run_qwen.sh" \
--num-slices=1 \
--script-dir . \
--zone $ZONE \
--project $PROJECT \
--tpu-type=v5p-8 \
--base-docker-image <redacted> \
--priority=medium
Actual Behavior
I run into this exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/app/tunix/cli/grpo_main.py", line 831, in <module>
app.run(main)
File "/opt/venv/lib/python3.12/site-packages/absl/app.py", line 367, in run
_run_main(main, args)
File "/opt/venv/lib/python3.12/site-packages/absl/app.py", line 312, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/app/tunix/cli/grpo_main.py", line 827, in main
pipeline.run_grpo_trainer()
File "/app/tunix/cli/grpo_main.py", line 798, in run_grpo_trainer
self._run(mode=mode)
File "/app/tunix/cli/grpo_main.py", line 753, in _run
grpo_trainer.train(dataset)
File "/app/tunix/rl/grpo/grpo_learner.py", line 449, in train
super().train(train_ds, eval_ds, skip_jit)
File "/app/tunix/rl/rl_learner.py", line 529, in train
first_item = next(full_batch_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration
--------------------
For simplicity, Grain has removed its internal frames from the traceback of the following exception. Set --grain_py_traceback_filtering=off to include these.
Dataset gsm8k downloaded and prepared to data/train/gsm8k/1.0.0. Subsequent calls will reuse this data.
I0501 21:01:14.979062 10 client.cc:221] Client::~Client() starting.
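The failing line is a bare `next()` on the batch iterator. The sketch below is plain Python, not Tunix code, and `batches` and `first_or_none` are hypothetical names used only for illustration. It shows that the same `StopIteration` appears whenever an iterator yields nothing, for example if fewer examples than one full batch are available. Whether that is what happens here is an assumption I have not verified.

```python
from itertools import islice


def batches(items, batch_size):
    """Hypothetical batcher: yields full batches only, dropping any remainder."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if len(batch) < batch_size:
            return  # incomplete (or empty) batch: stop without yielding
        yield batch


def first_or_none(iterator):
    """Return the first item, or None instead of raising StopIteration."""
    return next(iterator, None)


# 3 items with batch_size=8 never fill a batch, so the iterator is empty.
empty = batches(range(3), batch_size=8)
try:
    next(empty)
    raised = False
except StopIteration:
    raised = True
```

Using `next(iterator, None)` and checking for `None` would let a caller report "dataset produced no batches" rather than surface a raw `StopIteration`.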
Steps to Reproduce the Problem
- Use XPK to create a Pathways-enabled GKE cluster (script above)
- Use XPK to deploy the training script (command above)
Environment
- OS: [e.g., Ubuntu, etc.]
- Project Version: [e.g., 0.0.1]
Checklist
Would you like to help us fix it?