Conversation
…aloading / checkpointing settings
…o tpu7x-recipe-gcs
…instructions in README
…nvironment configuration instructions for the DeepSeek3-671B workload
```
checkpointStorageTargetDataFileSizeBytes=209715200 \
dataset_type='grain' \
grain_file_type=arrayrecord \
grain_train_files=${DATASET_BUCKET_MOUNTED_PATH} \
```
Are we using GCS direct loading here? If so, should we update this name to DATASET_FOLDER_PATH etc., as we didn't mount the bucket?
Per our offline sync, the dataset uses GCSFuse, not GCS direct. In that case, PV and PVC creation should be included in the recipe.
Thanks, I have added the PV/PVC instructions to the GCS bucket setup instructions.
Updated with more details for pv/pvc and the yaml files for dataset_pvc and checkpoint_pvc.
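For context, a GCSFuse PV/PVC pair like the `dataset_pvc.yaml` discussed here generally follows GKE's documented gcsfuse CSI static-provisioning pattern. The sketch below is illustrative only; every name in it (`dataset-pv`, `dataset-pvc`, `YOUR_DATASET_BUCKET`, the 64Gi size, the storage class) is a placeholder and not the recipe's actual content:

```shell
# Illustrative only: write a GCSFuse PV/PVC pair in the shape GKE documents
# for the gcsfuse CSI driver. All names and sizes below are placeholders.
cat > dataset_pvc_sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dataset-pv
spec:
  accessModes: ["ReadWriteMany"]
  capacity:
    storage: 64Gi
  storageClassName: dummy-storage-class
  mountOptions:
    - implicit-dirs
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: YOUR_DATASET_BUCKET   # replace with your bucket name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: dummy-storage-class
  volumeName: dataset-pv
  resources:
    requests:
      storage: 64Gi
EOF
echo "wrote dataset_pvc_sketch.yaml"
```

The PV's `volumeHandle` is what carries the bucket name, which is why later comments in this thread ask for it to be a placeholder rather than a hard-coded bucket.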
```
@@ -0,0 +1,341 @@
# Instructions for training DeepSeek3-671B on TPU Ironwood (tpu7x-4x8x8) with Google Cloud Storage (GCS)
```
Hey @seonjunmoon:
For this draft, I have several suggestions:
- Keep and update the existing introduction, Workload Details, Prerequisite sections in CMCS recipe: https://github.com/AI-Hypercomputer/tpu-recipes/tree/cba377b6e0cba6cdfe81d5cbec882a4abc071f2b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/k8s.
- Put the output bucket setup in GCS bucket setup too, add permission setups if needed.
- I highly recommend securing some capacity to test this recipe end-to-end, as it is customer-facing. At first glance, it appears the DATASET_BUCKET environment variable is being overwritten in the script; additionally, variables like CHECKPOINT_BUCKET seem not used. Please utilize some small-scale capacity to test the recipe before merging.
Thanks for the comments!
- Yes, thanks, I have added the existing introduction, Workload Details, and Prerequisite sections from the CMCS recipe.
- For the output bucket, it is actually the checkpoint bucket that we attach first. I have this line in the README instructions: `export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"`. I hope this is clear.
- Not sure if we will have any, but I will see if there is any capacity available for 256 chips (64 nodes). DATASET_BUCKET and CHECKPOINT_BUCKET are what the user would replace in the script as instructed in the README. Please let me know if anything is unclear.
Question 2: I didn't see the line `export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"` in the README; could you please check whether the change was uploaded?
Question 3: Please see: https://github.com/AI-Hypercomputer/tpu-recipes/pull/174/changes#r2908172426.
Just replied to your comments, please let me know if anything's not clear.
```
export BASE_OUTPUT_DIR=""
export WORKLOAD_IMAGE=""
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET=""
```
This will overwrite the existing setups.
This is what I described in the README: the user would update these lines with the variables they defined (DATASET_BUCKET, CHECKPOINT_BUCKET, etc.). This is what I have in the README:

In `run_recipe.sh`, update these lines:

```
export PROJECT_ID="your-project-id"
export CLUSTER_NAME="your-cluster-name"
export ZONE="your-zone"
export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"
export DATASET_BUCKET="${DATASET_BUCKET}" # e.g. "my-dataset-bucket"
export DATASET_BUCKET_MOUNTED_PATH="/tmp/dataset" # Ensure this matches where XPK mounts the bucket
```
Same as above, I didn't see these revisions in the README; please check the code.
Sorry, I think my recent commit undid the changes I made in the previous commit. I have updated the README again.
```
export WORKLOAD_IMAGE=""
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET=""
export DATASET_BUCKET_MOUNTED_PATH=""
```
Nit: need to add an introduction for this variable.
Same as above: I have comments on each variable in the 'Configuring and Starting workload' section. Please let me know if anything is unclear to you.
I didn't see this section either... could you please point me to the correct location?
Sorry, I think my recent commit undid the changes I made in the previous commit. I have updated the README again.
…equisites to the DeepSeek3-671B training README.
```
## GCS Bucket setup
1. Create two buckets: one to hold the dataset and one to use for checkpoints. To create regional HNS buckets use the following commands:

# Set variables
```
Reply to question 3 of the comment: https://github.com/AI-Hypercomputer/tpu-recipes/pull/174/changes#r2897869089. Here it says that we will directly export these values; is my understanding incorrect?
This is to create the buckets if needed; when we set the variables in run_recipe.sh, we use these to initialize the env variables. If this causes confusion, I can explicitly have the user initialize the bucket names again in run_recipe.sh. What do you think?
…environment variables and workload priority in run_recipe.sh
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/README.md
```
- machine-type:a3-highgpu-8
csi:
  driver: gcsfuse.csi.storage.gke.io
  volumeHandle: tess-tpu-checkpointing-us-central1
```
This needs to be some sort of placeholder for the user to replace with their bucket name. @lepan-google may be able to share how she has approached this in other recipes.
Just updated it as a placeholder with a comment.
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/checkpoint_pvc.yaml
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/dataset_pvc.yaml
```
- machine-type:a3-highgpu-8 # known machine for setting purpose
csi:
  driver: gcsfuse.csi.storage.gke.io
  volumeHandle: tess-tpu-dataloading-us-central1
```
Needs to be a placeholder for the user to replace with their bucket name.
Deleted checkpoint_pvc as discussed.
```
- Sequence Length: 4096
- Precision: bf16
- Chips: 256 (4x8x8 topology)
```
Should we add that it uses GCS for dataloading and checkpointing? And specify what dataset is used?
Just added more details about GCS and dataset.
```
# Checkpoint Bucket PV/PVC
python3 xpk.py storage attach my-checkpoint-bucket --type=gcsfuse --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point=/tmp/ckpt --readonly=false --bucket=$CHECKPOINT_BUCKET --size=64 --auto-mount=false --manifest=checkpoint_pvc.yaml
```
As discussed in chat, since we are doing direct-to-GCS checkpointing, this can be removed.
```
- `WORKLOAD_NAME`: A unique name for your workload. This is set in
  `run_recipe.sh` using the following command:
  `export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"`
- `GKE_VERSION`: The GKE version, `1.34.0-gke.2201000` or later.
```
Where did this GKE version come from? Do we know what version we used for our runs?
I couldn't find the exact one we used, so I kept the version listed in the recipe without storage.
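As an aside on the `WORKLOAD_NAME` line quoted above: `printf "%.26s"` truncates the prefix to 26 characters before the timestamp suffix is appended, which likely keeps the full name well under Kubernetes' 63-character object-name limit. A small sketch with a hypothetical user name:

```shell
# Hypothetical user name; underscores become dashes, then the prefix
# is truncated to 26 characters before the timestamp suffix is added.
FAKE_USER="some_user"
PREFIX="$(printf "%.26s" "${FAKE_USER//_/-}-deepseekv3-671b-4096-fsdp")"
echo "${#PREFIX}"   # 26
echo "${PREFIX}"    # some-user-deepseekv3-671b-
WORKLOAD_NAME="${PREFIX}-$(date +%Y%m%d-%H%M)"
echo "${WORKLOAD_NAME}"
```

Note that the truncation can leave the prefix ending in a dash, as it does here; the date suffix still makes the full name valid and unique per run.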
```
- Libtpu version: 0.0.32.dev20251215+nightly
- Jax version: 0.8.2.dev20251215
- Maxtext version: maxtext-tutorial-v1.5.0
- Python: 3.11
- XPK: 1.8.0
```
If possible, it would be great to list the versions we used.
I have updated the Jax and Maxtext versions to what we used. For the Libtpu version, I couldn't find which version was used, so I kept the one that the CMCS team's recipe used.
I don't think we can mix and match version like that. Can you ask in the Max and friends chat to see if there is a way to find the libtpu version?
Thanks, I have found the version used for the workload and updated it.
```
uv venv --seed ${HOME}/.local/bin/venv-docker --python 3.12 --clear
source ${HOME}/.local/bin/venv-docker/bin/activate
pip install --upgrade pip

# Make sure you're running on a Virtual Environment with python 3.12
if [[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]]; then { echo "You have the correct Python version 3.12"; } else { >&2 echo "Error: Python version must be 3.12"; false;} fi
```
Oh thanks, just updated to 3.11 which is what we used.
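For readability, the quoted one-liner can be unpacked into an equivalent multi-line check; this is a sketch assuming the required version is now 3.11 per the reply (the variable names are mine, not the recipe's):

```shell
# Readable rewrite of the single-line version check (illustrative only).
REQUIRED="3.11"
CURRENT="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)"
if [[ "${CURRENT}" == "${REQUIRED}" ]]; then
  echo "You have the correct Python version ${REQUIRED}"
else
  >&2 echo "Error: Python version must be ${REQUIRED} (found ${CURRENT:-unknown})"
  # a real script would `false` / `exit 1` here, as the original one-liner does
fi
```

The behavior is the same; the multi-line form is just easier to update when the required version changes (as it did here, from 3.12 to 3.11).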
```
export PROJECT=cloud-tpu-multipod-dev
export CLUSTER=bodaborg-tpu7x-nap-users
export ZONE=us-central1-c
export RECIPE_REPO="path-to-this-recipe-repo" # Update
```
Let's add more detail here so it's clear where path should end.
e.g. "your/dir/tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk"
Also I don't see this used? Is it supposed to be used as part of the manifest flag?
I previously had this variable for returning to the recipe repo after attaching storage through xpk, but I then removed that part from the command instructions, so I no longer need this variable and have removed it. Thank you!
I see below XPK is installed via pip. In this case you can remove the cd into the XPK directory, and then have users execute the xpk command from the recipe directory (that way the dataset_pvc.yaml file can be found without having to specify the absolute path using the above mentioned variable).
That is true, we can use the globally installed xpk here, just updated.
```
cd ~/xpk

# Dataset Bucket PV/PVC
python3 xpk.py storage attach my-dataset-bucket --type=gcsfuse --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point=/tmp/dataset --readonly=false --bucket=$DATASET_BUCKET --size=64 --auto-mount=false --manifest=dataset_pvc.yaml
```
This requires the cluster to be created; please move this section after the section with the cluster creation commands.
Right, I just moved the GCS bucket setup section after the cluster creation step.
```
Be sure to update `volumeHandle` in the yamls with your correct bucket names. Creating a bucket and attaching xpk storage is a one time setup.

# Set variables
export PROJECT=cloud-tpu-multipod-dev
```
Remove these values and use placeholders instead.
Just replaced as "", thanks.
```
# Set variables
export DATASET_BUCKET="dataloading-bucket-name"
export CHECKPOINT_BUCKET="checkpoint-bucket-name"
export REGION="us-central1"
```
Replace us-central1 with a placeholder.
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/run_recipe.sh
```
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp-gcs")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET_MOUNTED_PATH="/tmp/dataset"

export MAXTEXT_ROOT="${HOME}/maxtext" # Update this to your maxtext root
```
If this is a user-defined path, shall we change this to `export MAXTEXT_ROOT={MAXTEXT_ROOT}` and ask the user to define it in the README?
I've updated run_recipe.sh to make MAXTEXT_ROOT an empty string by default and let user define this path. I've also added a note in the README to explicitly call out setting MAXTEXT_ROOT. Thanks!
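A minimal sketch of the guard this change implies, so an empty `MAXTEXT_ROOT` fails fast instead of `cd`-ing somewhere unexpected (the error message wording here is hypothetical, not the recipe's):

```shell
# Fail fast when the user has not filled in MAXTEXT_ROOT yet.
export MAXTEXT_ROOT=""   # user sets this, e.g. "${HOME}/maxtext"
if [[ -z "${MAXTEXT_ROOT}" ]]; then
  >&2 echo "Error: set MAXTEXT_ROOT in run_recipe.sh to your MaxText checkout"
else
  cd "${MAXTEXT_ROOT}"
fi
```

Defaulting the variable to an empty string plus an explicit check tends to be safer than a hard-coded `${HOME}/maxtext`, since a wrong but non-empty default would silently run against the wrong checkout.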
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/run_recipe.sh
```bash
# Make sure BASE_OUTPUT_DIR is set in run_recipe.sh before running this.
gcloud storage buckets create ${BASE_OUTPUT_DIR} --project=${PROJECT_ID} --location=US --default-storage-class=STANDARD --uniform-bucket-level-access
```
`--location=US` may not be true for all cases.
```
export DATASET_BUCKET_MOUNTED_PATH=""
export MAXTEXT_ROOT="" # e.g., ${HOME}/maxtext. Update this to the absolute path where you cloned the MaxText repository
cd "$MAXTEXT_ROOT"
```
these two lines are not needed.