[bug] batch iterator exception when running Qwen example #1474

@andrewsykim

Expected Behavior

I'm testing Tunix on a small TPU v5p cluster using the Qwen example script.

I'm following the quick-start and using XPK to create a Pathways-enabled GKE cluster with TPU v5p. Here is the script I'm using to create the GKE cluster:

# install gcloud beta commands
gcloud components install beta

# create pathways cluster
export CLUSTER_NAME='<redacted>'
export ZONE='us-central1-a'
export TPU_TYPE='v5p-8' # e.g. v5p-16
export CLUSTER_CPU_MACHINE_TYPE=n2d-standard-32 # adjust this to use a beefier CPU node
export PROJECT='<redacted>'

NETWORK_NAME=${CLUSTER_NAME}-mtu9k-wx
NETWORK_FW_NAME=${NETWORK_NAME}-fw-wx

export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"

# Run `gcloud auth application-default login` and
# `gcloud auth login --update-adc` if you hit a permission error when creating the network.

# Check if this is the service account you want to use.
gcloud auth list

gcloud compute networks create ${NETWORK_NAME} \
    --mtu=8896 \
    --project=${PROJECT} \
    --subnet-mode=auto \
    --bgp-routing-mode=regional

gcloud compute firewall-rules create ${NETWORK_FW_NAME} \
    --network ${NETWORK_NAME} \
    --allow tcp,icmp,udp \
    --project=${PROJECT}

xpk cluster create-pathways \
    --cluster $CLUSTER_NAME \
    --cluster-cpu-machine-type=$CLUSTER_CPU_MACHINE_TYPE \
    --num-slices=1 \
    --tpu-type=$TPU_TYPE \
    --zone $ZONE \
    --project $PROJECT \
    --spot \
    --custom-cluster-arguments="${CLUSTER_ARGUMENTS}"

And here is the script I use to launch the Tunix workload:

export ZONE=us-central1-a
export PROJECT='<redacted>'

xpk workload create-pathways \
  --cluster=<redacted> \
  --workload=tunix-rl \
  --command="WANDB_MODE=disabled TPU_MIN_LOG_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=0 TPU_STDERR_LOG_LEVEL=0 JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 ENABLE_PATHWAYS_PERSISTENCE='1' source run_qwen.sh" \
  --num-slices=1 \
  --script-dir . \
  --zone $ZONE \
  --project $PROJECT \
  --tpu-type=v5p-8 \
  --base-docker-image <redacted> \
  --priority=medium

Actual Behavior

I run into this exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/tunix/cli/grpo_main.py", line 831, in <module>
    app.run(main)
  File "/opt/venv/lib/python3.12/site-packages/absl/app.py", line 367, in run
    _run_main(main, args)
  File "/opt/venv/lib/python3.12/site-packages/absl/app.py", line 312, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/app/tunix/cli/grpo_main.py", line 827, in main
    pipeline.run_grpo_trainer()
  File "/app/tunix/cli/grpo_main.py", line 798, in run_grpo_trainer
    self._run(mode=mode)
  File "/app/tunix/cli/grpo_main.py", line 753, in _run
    grpo_trainer.train(dataset)
  File "/app/tunix/rl/grpo/grpo_learner.py", line 449, in train
    super().train(train_ds, eval_ds, skip_jit)
  File "/app/tunix/rl/rl_learner.py", line 529, in train
    first_item = next(full_batch_iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration
--------------------
For simplicity, Grain has removed its internal frames from the traceback of the following exception. Set --grain_py_traceback_filtering=off to include these.
Dataset gsm8k downloaded and prepared to data/train/gsm8k/1.0.0. Subsequent calls will reuse this data.
I0501 21:01:14.979062      10 client.cc:221] Client::~Client() starting.
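
The traceback ends at `next(full_batch_iterator)` raising `StopIteration`, which is what Python does when an iterator yields no items at all. The root cause is not established in this report. One way to end up with an empty batch iterator, shown here purely as an assumption, is a batch size larger than the number of available examples while incomplete batches are dropped. The sketch below reproduces that failure mode in plain Python and shows a guard that turns the bare `StopIteration` into a descriptive error. `batches` and `first_batch` are hypothetical helpers for illustration, not Tunix or Grain APIs.

```python
from itertools import islice


def batches(examples, batch_size, drop_remainder=True):
    """Yield lists of `batch_size` items.

    Hypothetical stand-in for a real data pipeline (assumption, not Tunix code).
    With drop_remainder=True, a trailing partial batch is discarded.
    """
    it = iter(examples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch or (drop_remainder and len(batch) < batch_size):
            return
        yield batch


def first_batch(batch_iterator):
    """Return the first batch, or raise a descriptive error if there is none."""
    sentinel = object()
    first = next(batch_iterator, sentinel)  # default avoids a bare StopIteration
    if first is sentinel:
        raise ValueError(
            "batch iterator produced no batches; "
            "check dataset size, batch size, and any filtering"
        )
    return first


# Three examples with a batch size of eight: nothing is ever yielded,
# so a plain next() on this iterator would raise StopIteration.
empty = batches(range(3), batch_size=8)
try:
    first_batch(empty)
except ValueError as err:
    print(f"caught: {err}")
```

If the real cause turns out to be something else, such as an empty dataset split or an over-aggressive filter, the same guard still applies because it only checks whether the iterator produced anything.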

Steps to Reproduce the Problem

  1. Use XPK to create a Pathways-enabled GKE cluster (first script above)
  2. Use XPK to deploy the training script (second script above)

Environment

  • OS: [e.g., Ubuntu, etc.]
  • Project Version: [e.g., 0.0.1]

Checklist

  • I have searched the existing issues for a similar bug report.
  • I have provided all the required information in the "Environment" section.
  • I have provided a minimal, reproducible example.

Would you like to help us fix it?

Labels

type:bug (Something isn't working)