Conversation
…aloading / checkpointing settings
…o tpu7x-recipe-gcs
…instructions in README
…nvironment configuration instructions for the DeepSeek3-671B workload
```
checkpointStorageTargetDataFileSizeBytes=209715200 \
dataset_type='grain' \
grain_file_type=arrayrecord \
grain_train_files=${DATASET_BUCKET_MOUNTED_PATH} \
```
Are we using GCS direct loading here? If so, should we update this name to DATASET_FOLDER_PATH etc., as we didn't mount the bucket?
Per our offline sync, the dataset uses GCSFuse, not GCS direct. In that case, PV and PVC creation should be included in the recipe.
Thanks, I have added the PV/PVC instructions to the GCS bucket setup instructions.
Updated with more details for pv/pvc and the yaml files for dataset_pvc and checkpoint_pvc.
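For context, a GCSFuse PV/PVC pair like the `dataset_pvc.yaml` discussed here generally follows GKE's documented gcsfuse CSI static-provisioning pattern. The sketch below is illustrative only; every name in it (`dataset-pv`, `dataset-pvc`, `YOUR_DATASET_BUCKET`, the 64Gi size, the storage class) is a placeholder and not the recipe's actual content:

```shell
# Illustrative only: write a GCSFuse PV/PVC pair in the shape GKE documents
# for the gcsfuse CSI driver. All names and sizes below are placeholders.
cat > dataset_pvc_sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dataset-pv
spec:
  accessModes: ["ReadWriteMany"]
  capacity:
    storage: 64Gi
  storageClassName: dummy-storage-class
  mountOptions:
    - implicit-dirs
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: YOUR_DATASET_BUCKET   # replace with your bucket name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: dummy-storage-class
  volumeName: dataset-pv
  resources:
    requests:
      storage: 64Gi
EOF
echo "wrote dataset_pvc_sketch.yaml"
```

The PV's `volumeHandle` is what carries the bucket name, which is why later comments in this thread ask for it to be a placeholder rather than a hard-coded bucket.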
```
@@ -0,0 +1,341 @@
# Instructions for training DeepSeek3-671B on TPU Ironwood (tpu7x-4x8x8) with Google Cloud Storage (GCS)
```
Hey @seonjunmoon:
For this draft, I have several suggestions:
- Keep and update the existing introduction, Workload Details, Prerequisite sections in CMCS recipe: https://github.com/AI-Hypercomputer/tpu-recipes/tree/cba377b6e0cba6cdfe81d5cbec882a4abc071f2b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/k8s.
- Put the output bucket setup in GCS bucket setup too, add permission setups if needed.
- I highly recommend securing some capacity to test this recipe end-to-end, as it is customer-facing. At first glance, it appears the DATASET_BUCKET environment variable is being overwritten in the script; additionally, variables like CHECKPOINT_BUCKET seem not used. Please utilize some small-scale capacity to test the recipe before merging.
Thanks for the comments!
- Yes, thanks, I have added the existing introduction, Workload Details, and Prerequisite sections from the CMCS recipe.
- For the output bucket, it is actually the checkpoint bucket that we attach first. I have this line in the README instructions: `export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"`. I hope this is clear.
- Not sure if we will have any, but I will see if there is any capacity available for 256 chips (64 nodes). DATASET_BUCKET and CHECKPOINT_BUCKET are what the user would replace in the script as instructed in the README. Please let me know if anything is unclear.
Question 2: I didn't see the line `export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"` in the README; could you please check whether the change was uploaded?
Question 3: Please see: https://github.com/AI-Hypercomputer/tpu-recipes/pull/174/changes#r2908172426.
Just replied to your comments, please let me know if anything's not clear.
```
export BASE_OUTPUT_DIR=""
export WORKLOAD_IMAGE=""
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET=""
```
This will overwrite the existing setups.
This is what I described in the README: the user would update these lines with the variables they defined (DATASET_BUCKET, CHECKPOINT_BUCKET, etc.). This is what I have in the README:

In `run_recipe.sh`, update these lines:

```
export PROJECT_ID="your-project-id"
export CLUSTER_NAME="your-cluster-name"
export ZONE="your-zone"
export BASE_OUTPUT_DIR="gs://${CHECKPOINT_BUCKET}"
export DATASET_BUCKET="${DATASET_BUCKET}" # e.g. "my-dataset-bucket"
export DATASET_BUCKET_MOUNTED_PATH="/tmp/dataset" # Ensure this matches where XPK mounts the bucket
```
Same as above, I didn't see these revisions in the README; please check the code.
Sorry, I think my recent commit undid the changes I made in the previous commit. I have updated the README again.
```
export WORKLOAD_IMAGE=""
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET=""
export DATASET_BUCKET_MOUNTED_PATH=""
```
Nit: need to add an introduction for this variable.
Same as above: I have comments on each variable in the 'Configuring and Starting workload' section. Please let me know if anything is unclear to you.
I didn't see this section either... could you please point me to the correct location?
Sorry, I think my recent commit undid the changes I made in the previous commit. I have updated the README again.
…equisites to the DeepSeek3-671B training README.
```
## GCS Bucket setup
1. Create two buckets: one to hold the dataset and one to use for checkpoints. To create regional HNS buckets use the following commands:

# Set variables
```
Reply to question 3 of the comment: https://github.com/AI-Hypercomputer/tpu-recipes/pull/174/changes#r2897869089. Here it says that we will directly export these values; is my understanding incorrect?
This is to create the buckets if needed; when we set the variables in run_recipe.sh, we use these to initialize the env variables. If this causes confusion, I can explicitly have the user initialize the bucket names again in run_recipe.sh. What do you think?
…environment variables and workload priority in run_recipe.sh
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/README.md
```
- machine-type:a3-highgpu-8
csi:
  driver: gcsfuse.csi.storage.gke.io
  volumeHandle: tess-tpu-checkpointing-us-central1
```
This needs to be some sort of placeholder for the user to replace with their bucket name. @lepan-google may be able to share how she has approached this in other recipes.
Just updated it as a placeholder with a comment.
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/checkpoint_pvc.yaml
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/dataset_pvc.yaml
```
- machine-type:a3-highgpu-8 # known machine for setting purpose
csi:
  driver: gcsfuse.csi.storage.gke.io
  volumeHandle: tess-tpu-dataloading-us-central1
```
Needs to be a placeholder for the user to replace with their bucket name.
Deleted checkpoint_pvc as discussed.
```
- Sequence Length: 4096
- Precision: bf16
- Chips: 256 (4x8x8 topology)
```
Should we add that it uses GCS for dataloading and checkpointing? And specify what dataset is used?
Just added more details about GCS and dataset.
```
# Checkpoint Bucket PV/PVC
python3 xpk.py storage attach my-checkpoint-bucket --type=gcsfuse --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point=/tmp/ckpt --readonly=false --bucket=$CHECKPOINT_BUCKET --size=64 --auto-mount=false --manifest=checkpoint_pvc.yaml
```
As discussed in chat, since we are doing direct-to-GCS checkpointing, this can be removed.
```
- `WORKLOAD_NAME`: A unique name for your workload. This is set in
  `run_recipe.sh` using the following command:
  `export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"`
- `GKE_VERSION`: The GKE version, `1.34.0-gke.2201000` or later.
```
Where did this GKE version come from? Do we know what version we used for our runs?
I couldn't find the exact one we used, so I kept the version listed in the recipe without storage.
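As an aside on the `WORKLOAD_NAME` line quoted above: `printf "%.26s"` truncates the prefix to 26 characters before the timestamp suffix is appended, which likely keeps the full name well under Kubernetes' 63-character object-name limit. A small sketch with a hypothetical user name:

```shell
# Hypothetical user name; underscores become dashes, then the prefix
# is truncated to 26 characters before the timestamp suffix is added.
FAKE_USER="some_user"
PREFIX="$(printf "%.26s" "${FAKE_USER//_/-}-deepseekv3-671b-4096-fsdp")"
echo "${#PREFIX}"   # 26
echo "${PREFIX}"    # some-user-deepseekv3-671b-
WORKLOAD_NAME="${PREFIX}-$(date +%Y%m%d-%H%M)"
echo "${WORKLOAD_NAME}"
```

Note that the truncation can leave the prefix ending in a dash, as it does here; the date suffix still makes the full name valid and unique per run.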
```
- Libtpu version: 0.0.32.dev20251215+nightly
- Jax version: 0.8.2.dev20251215
- Maxtext version: maxtext-tutorial-v1.5.0
- Python: 3.11
- XPK: 1.8.0
```
If possible, it would be great to list the versions we used.
I have updated the Jax and Maxtext versions to what we used. For the Libtpu version, I couldn't find which version was used, so I kept the one that the CMCS team's recipe used.
I don't think we can mix and match version like that. Can you ask in the Max and friends chat to see if there is a way to find the libtpu version?
Thanks, I have found the version used for the workload and updated it.
```
uv venv --seed ${HOME}/.local/bin/venv-docker --python 3.12 --clear
source ${HOME}/.local/bin/venv-docker/bin/activate
pip install --upgrade pip

# Make sure you're running on a Virtual Environment with python 3.12
if [[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]]; then { echo "You have the correct Python version 3.12"; } else { >&2 echo "Error: Python version must be 3.12"; false;} fi
```
Oh thanks, just updated to 3.11 which is what we used.
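For readability, the quoted one-liner can be unpacked into an equivalent multi-line check; this is a sketch assuming the required version is now 3.11 per the reply (the variable names are mine, not the recipe's):

```shell
# Readable rewrite of the single-line version check (illustrative only).
REQUIRED="3.11"
CURRENT="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)"
if [[ "${CURRENT}" == "${REQUIRED}" ]]; then
  echo "You have the correct Python version ${REQUIRED}"
else
  >&2 echo "Error: Python version must be ${REQUIRED} (found ${CURRENT:-unknown})"
  # a real script would `false` / `exit 1` here, as the original one-liner does
fi
```

The behavior is the same; the multi-line form is just easier to update when the required version changes (as it did here, from 3.12 to 3.11).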
```
export PROJECT=cloud-tpu-multipod-dev
export CLUSTER=bodaborg-tpu7x-nap-users
export ZONE=us-central1-c
export RECIPE_REPO="path-to-this-recipe-repo" # Update
```
Let's add more detail here so it's clear where path should end.
e.g. "your/dir/tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk"
Also I don't see this used? Is it supposed to be used as part of the manifest flag?
I previously had this variable for returning to the recipe repo after attaching storage through xpk, but I then removed that part from the command instructions, so I no longer need this variable and have removed it. Thank you!
I see below XPK is installed via pip. In this case you can remove the cd into the XPK directory, and then have users execute the xpk command from the recipe directory (that way the dataset_pvc.yaml file can be found without having to specify the absolute path using the above mentioned variable).
That is true, we can use the globally installed xpk here, just updated.
```
cd ~/xpk

# Dataset Bucket PV/PVC
python3 xpk.py storage attach my-dataset-bucket --type=gcsfuse --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point=/tmp/dataset --readonly=false --bucket=$DATASET_BUCKET --size=64 --auto-mount=false --manifest=dataset_pvc.yaml
```
This requires the cluster to be created; please move this section after the section with the cluster creation commands.
Right, I just moved the GCS bucket setup section after the cluster creation step.
```
Be sure to update `volumeHandle` in the yamls with your correct bucket names. Creating a bucket and attaching xpk storage is a one time setup.

# Set variables
export PROJECT=cloud-tpu-multipod-dev
```
Remove these values and use placeholders instead.
Just replaced as "", thanks.
```
# Set variables
export DATASET_BUCKET="dataloading-bucket-name"
export CHECKPOINT_BUCKET="checkpoint-bucket-name"
export REGION="us-central1"
```
Replace us-central1 with a placeholder.
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/run_recipe.sh
```
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp-gcs")-$(date +%Y%m%d-%H%M)"
export DATASET_BUCKET_MOUNTED_PATH="/tmp/dataset"

export MAXTEXT_ROOT="${HOME}/maxtext" # Update this to your maxtext root
```
If this is a user-defined path, shall we change this to `export MAXTEXT_ROOT={MAXTEXT_ROOT}` and ask the user to define it in the README?
I've updated run_recipe.sh to make MAXTEXT_ROOT an empty string by default and let user define this path. I've also added a note in the README to explicitly call out setting MAXTEXT_ROOT. Thanks!
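A minimal sketch of the guard this change implies, so an empty `MAXTEXT_ROOT` fails fast instead of `cd`-ing somewhere unexpected (the error message wording here is hypothetical, not the recipe's):

```shell
# Fail fast when the user has not filled in MAXTEXT_ROOT yet.
export MAXTEXT_ROOT=""   # user sets this, e.g. "${HOME}/maxtext"
if [[ -z "${MAXTEXT_ROOT}" ]]; then
  >&2 echo "Error: set MAXTEXT_ROOT in run_recipe.sh to your MaxText checkout"
else
  cd "${MAXTEXT_ROOT}"
fi
```

Defaulting the variable to an empty string plus an explicit check tends to be safer than a hard-coded `${HOME}/maxtext`, since a wrong but non-empty default would silently run against the wrong checkout.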
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8-gcs/xpk/run_recipe.sh
```bash
# Make sure BASE_OUTPUT_DIR is set in run_recipe.sh before running this.
gcloud storage buckets create ${BASE_OUTPUT_DIR} --project=${PROJECT_ID} --location=US --default-storage-class=STANDARD --uniform-bucket-level-access
```
`--location=US` may not be true for all cases.
```
export DATASET_BUCKET_MOUNTED_PATH=""
export MAXTEXT_ROOT="" # e.g., ${HOME}/maxtext. Update this to the absolute path where you cloned the MaxText repository
cd "$MAXTEXT_ROOT"
```
these two lines are not needed.