Tpu7x recipe gcs#174

Open
seonjunmoon wants to merge 19 commits into AI-Hypercomputer:main from seonjunmoon:tpu7x-recipe-gcs
Conversation

@seonjunmoon

No description provided.

checkpointStorageTargetDataFileSizeBytes=209715200 \
dataset_type='grain' \
grain_file_type=arrayrecord \
grain_train_files=${DATASET_BUCKET_MOUNTED_PATH} \
Contributor:
Are we using GCS direct loading here? If so, should we update this name to DATASET_FOLDER_PATH etc as we didn't mount the bucket.

Contributor:

Per offline sync, the dataset uses GCSFuse, not GCS direct. In that case, PV and PVC creation should be included in the recipe.

Author:

Thanks, I have added PV/PVC instructions to the GCS bucket setup instructions.

Author:

Updated with more details for PV/PVC and the YAML files for dataset_pvc and checkpoint_pvc.
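For reference, a minimal GCSFuse PV/PVC pair like the ones discussed might look as follows. This is a sketch only: the resource names, bucket placeholder, and storage size here are illustrative, not the recipe's actual dataset_pvc.yaml contents.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dataset-pv            # illustrative name
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 64Gi
  storageClassName: dummy-storage-class
  mountOptions:
    - implicit-dirs
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: your-dataset-bucket   # replace with your bucket name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc           # illustrative name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 64Gi
  storageClassName: dummy-storage-class
  volumeName: dataset-pv
```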

@@ -0,0 +1,341 @@
# Instructions for training DeepSeek3-671B on TPU Ironwood (tpu7x-4x8x8) with Google Cloud Storage (GCS)
Contributor:

Hey @seonjunmoon:

For this draft, I have several suggestions:

Author:

Thanks for the comments!

  • Yes, thanks. I have added the existing Introduction, Workload Details, and Prerequisites sections from the CMCS recipe.
  • For the output bucket, it is actually the checkpoint bucket that we attach at first. I have this line in the README instructions:
    export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}". I hope this is clear.
  • Not sure if we would have any, but I will see if there's any capacity available for 256 chips (64 nodes). For DATASET_BUCKET and CHECKPOINT_BUCKET, they are what the user would replace in the script as instructed in the README. Please let me know if anything is unclear.

Contributor:

Question 2: I didn't see the line export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}" in the README; could you please check if the change got uploaded?

Question 3: Please see: https://github.com/AI-Hypercomputer/tpu-recipes/pull/174/changes#r2908172426.

Author:

Just replied to your comments; please let me know if anything's not clear.

export BASE_OUTPUT_DIR=""
export WORKLOAD_IMAGE=""
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET=""
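The `WORKLOAD_NAME` line above packs two bash features into one expression; a small sketch of what each part does (the sample user name is made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical user name, just for illustration
SAMPLE_USER="jane_doe"

# ${VAR//_/-} replaces every underscore with a hyphen
SAFE_USER="${SAMPLE_USER//_/-}"

# printf "%.26s" truncates its argument to at most 26 characters
SHORT="$(printf "%.26s" "${SAFE_USER}-deepseekv3-671b-4096-fsdp")"

echo "${SHORT}"   # jane-doe-deepseekv3-671b-4 (26 chars)
```

The real command then appends `-$(date +%Y%m%d-%H%M)` so each run gets a unique, length-bounded workload name.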
Contributor:

This will overwrite the existing setups.

Author (@seonjunmoon, Mar 6, 2026):

This is what I described in the README: the user would update these lines with the variables they defined (DATASET_BUCKET, CHECKPOINT_BUCKET, etc.). This is what I have in the README:

In run_recipe.sh, update these lines:
export PROJECT_ID="your-project-id"
export CLUSTER_NAME="your-cluster-name"
export ZONE="your-zone"
export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"
export DATASET_BUCKET="${DATASET_BUCKET}" # e.g. "my-dataset-bucket"
export DATASET_BUCKET_MOUNTED_PATH="/tmp/dataset" # Ensure this matches where XPK mounts the bucket

Contributor:

Same as above, I didn't see these revisions in the README; please check the code.

Author:

Sorry, I think my recent commit undid the changes I made in the previous commit. I have updated the README again.

export WORKLOAD_IMAGE=""
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET=""
export DATASET_BUCKET_MOUNTED_PATH=""
Contributor:

Nit: need to add an introduction for this variable.

Author:

Same as above: I have comments on each variable in the 'Configuring and Starting workload' section. Please let me know if anything is unclear to you.

Contributor:

I didn't see this section either... Could you please point me to the correct location?

Author:

Sorry, I think my recent commit undid the changes I made in the previous commit. I have updated the README again.

…equisites to the DeepSeek3-671B training README.
## GCS Bucket setup
1. Create two buckets: one to hold the dataset and one to use for checkpoints. To create regional HNS buckets, use the following commands:
```
# Set variables
Contributor:

Reply to question 3 of the comment: https://github.com/AI-Hypercomputer/tpu-recipes/pull/174/changes#r2897869089. Here it says that we will export these values directly; is my understanding incorrect?

Author:

This is to create the buckets if needed; when we set variables in run_recipe.sh, we use these to initialize the env variables. If this causes confusion, I can explicitly have the user initialize the bucket names again in run_recipe.sh. What do you think?

- machine-type:a3-highgpu-8
csi:
driver: gcsfuse.csi.storage.gke.io
volumeHandle: tess-tpu-checkpointing-us-central1

This needs to be some sort of placeholder for the user to replace with their bucket name. @lepan-google may be able to share how she has approached this in other recipes.

Author:

Just updated it as a placeholder with comment.

- machine-type:a3-highgpu-8 # known machine for setting purpose
csi:
driver: gcsfuse.csi.storage.gke.io
volumeHandle: tess-tpu-dataloading-us-central1

Needs to be a placeholder for the user to replace with their bucket name.

Author:

Deleted checkpoint_pvc as discussed.

Comment on lines +12 to +14
- Sequence Length: 4096
- Precision: bf16
- Chips: 256 (4x8x8 topology)

Should we add that it uses GCS for dataloading and checkpointing? And specify what dataset is used?

Author:

Just added more details about GCS and dataset.

Comment on lines +81 to +83
# Checkpoint Bucket PV/PVC
python3 xpk.py storage attach my-checkpoint-bucket --type=gcsfuse --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point=/tmp/ckpt --readonly=false --bucket=$CHECKPOINT_BUCKET --size=64 --auto-mount=false --manifest=checkpoint_pvc.yaml
```

As discussed in chat, since we are doing direct-to-GCS checkpointing, this can be removed.

Author:

Just updated, thanks!

- `WORKLOAD_NAME`: A unique name for your workload. This is set in
`run_recipe.sh` using the following command:
`export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"`
- `GKE_VERSION`: The GKE version, `1.34.0-gke.2201000` or later.

Where did this GKE version come from? Do we know what version we used for our runs?

Author:

I couldn't find the exact one we used, so I kept the version instruction from the recipe without storage.

Comment on lines +222 to +226
- Libtpu version: 0.0.32.dev20251215+nightly
- Jax version: 0.8.2.dev20251215
- Maxtext version: maxtext-tutorial-v1.5.0
- Python: 3.11
- XPK: 1.8.0

Where are these versions from?


If possible, it would be great to list the versions we used.

Author:

I have updated the Jax and Maxtext versions to what we used. For the Libtpu version, I couldn't find which one was used, so I kept the one that the CMCS team's recipe used.


I don't think we can mix and match version like that. Can you ask in the Max and friends chat to see if there is a way to find the libtpu version?

Author:

Thanks, I have found the version used for the workload and updated it.


Did this get updated?

Comment on lines +236 to +241
uv venv --seed ${HOME}/.local/bin/venv-docker --python 3.12 --clear
source ${HOME}/.local/bin/venv-docker/bin/activate
pip install --upgrade pip

# Make sure you're running on a Virtual Environment with python 3.12
if [[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]]; then { echo "You have the correct Python version 3.12"; } else { >&2 echo "Error: Python version must be 3.12"; false;} fi
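The one-line check above is dense; an equivalent multi-line sketch of the same logic (the required version is a parameter here, set to 3.12 to match this snippet):

```shell
#!/usr/bin/env bash
# Multi-line equivalent of the inline Python version check above
REQUIRED_PY="3.12"   # per this snippet; adjust if the recipe standardizes on another version
ACTUAL_PY="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)"

if [[ "${ACTUAL_PY}" == "${REQUIRED_PY}" ]]; then
  echo "You have the correct Python version ${REQUIRED_PY}"
else
  >&2 echo "Error: Python version must be ${REQUIRED_PY} (found ${ACTUAL_PY:-unknown})"
fi
```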

This is 3.12 but above says 3.11?

Author:

Oh thanks, just updated to 3.11 which is what we used.

@seonjunmoon seonjunmoon requested a review from mkmg March 24, 2026 21:07
@seonjunmoon seonjunmoon marked this pull request as ready for review March 30, 2026 22:53
export PROJECT=cloud-tpu-multipod-dev
export CLUSTER=bodaborg-tpu7x-nap-users
export ZONE=us-central1-c
export RECIPE_REPO="path-to-this-recipe-repo" # Update

Let's add more detail here so it's clear where path should end.

e.g. "your/dir/tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk"

Also I don't see this used? Is it supposed to be used as part of the manifest flag?

Author:

I previously had this variable as part of going back to the recipe repo after attaching storage through xpk, but I then removed that part from the command instructions, so I don't need this variable anymore and have removed it. Thank you!


I see below that XPK is installed via pip. In this case you can remove the cd into the XPK directory and have users execute the xpk command from the recipe directory (that way the dataset_pvc.yaml file can be found without having to specify the absolute path using the above-mentioned variable).

Author:

That is true, we can use the globally installed xpk here, just updated.

Comment on lines +78 to +82
cd ~/xpk

# Dataset Bucket PV/PVC
python3 xpk.py storage attach my-dataset-bucket --type=gcsfuse --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point=/tmp/dataset --readonly=false --bucket=$DATASET_BUCKET --size=64 --auto-mount=false --manifest=dataset_pvc.yaml
```

This requires the cluster to be created; please move this section after the section with the cluster creation commands.

Author:

Right, I just moved the GCS bucket setup section after the cluster creation step.


I'm not seeing this change?

@seonjunmoon seonjunmoon requested a review from mkmg March 31, 2026 23:39
Be sure to update `volumeHandle` in the yamls with your correct bucket names. Creating a bucket and attaching xpk storage is a one time setup.
```
# Set variables
export PROJECT=cloud-tpu-multipod-dev
Collaborator:

remove these values, instead using placeholders, etc.

Author:

Just replaced them with "", thanks.

# Set variables
export DATASET_BUCKET="dataloading-bucket-name"
export CHECKPOINT_BUCKET="checkpoint-bucket-name"
export REGION="us-central1"
Collaborator:

replace us-central1 with a placeholder

Author:

ditto.

export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp-gcs")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET_MOUNTED_PATH="/tmp/dataset"

export MAXTEXT_ROOT="${HOME}/maxtext" # Update this to your maxtext root
Collaborator:

if this is a user-defined path, shall we change this to export MAXTEXT_ROOT={MAXTEXT_ROOT} and ask the user to define it in the README?

Author:

I've updated run_recipe.sh to make MAXTEXT_ROOT an empty string by default and let the user define this path. I've also added a note in the README to explicitly call out setting MAXTEXT_ROOT. Thanks!
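A minimal sketch of the fail-fast guard this implies for run_recipe.sh (the error text here is illustrative, not the recipe's actual wording):

```shell
#!/usr/bin/env bash
# MAXTEXT_ROOT is intentionally empty by default; the user must fill it in
export MAXTEXT_ROOT=""   # e.g. ${HOME}/maxtext

# Fail fast with a clear message if the user forgot to set it
if [[ -z "${MAXTEXT_ROOT}" ]]; then
  ERR_MSG="Error: set MAXTEXT_ROOT to the absolute path of your MaxText clone"
  >&2 echo "${ERR_MSG}"
fi
```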

@seonjunmoon seonjunmoon requested a review from junjieqian April 2, 2026 22:40

```bash
# Make sure BASE_OUTPUT_DIR is set in run_recipe.sh before running this.
gcloud storage buckets create ${BASE_OUTPUT_DIR} --project=${PROJECT_ID} --location=US --default-storage-class=STANDARD --uniform-bucket-level-access
Collaborator:
--location=US may not be true for all cases.

Collaborator @junjieqian left a comment:

LGTM.. but please get an approval from storage TLs as well.
Thank you

export DATASET_BUCKET_MOUNTED_PATH=""
export MAXTEXT_ROOT="" # e.g., ${HOME}/maxtext. Update this to the absolute path where you cloned the MaxText repository
cd "$MAXTEXT_ROOT"

Collaborator:

these two lines are not needed.
