Conversation
cbalioglu
left a comment
I believe one major thing missing right now is some benchmark numbers as well as verification that the new file system implementation works end-to-end (i.e. a real world training run, asset metadata loading, Hugging Face model exports, and any other place where we used regular file system calls work without any regressions). Have you been able to verify those?
I verified this and continue to do so on a real training run where checkpointing happens on S3.
I need to move checkpoints/model.yaml to the remote location as well... right now we have, both locally and in the S3 folder:

```
8.8M  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/data_reader
4.7G  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg
4.0K  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.run
4.0K  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.stderr
0     fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.stdout
5.6G  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/model
9.3G  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/optimizer
12K   fs2_s3test_12/ws_1.d2b3ae4f/step_1200/trainer
```
The issue is that s3fs removes the folder contents but does not remove the folders themselves...
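For reference, a minimal cleanup sketch over the generic fsspec API (`remove_tree` is my illustration, not the fs2 implementation): delete the contents recursively, then remove any empty directory marker left behind.

```python
import fsspec


def remove_tree(uri: str) -> None:
    """Remove a directory tree, then any empty directory entry left behind.

    Hypothetical helper: s3fs can leave empty "folder" markers after the
    contents are deleted, so check for and remove the marker as well.
    """
    fs, path = fsspec.core.url_to_fs(uri)
    fs.rm(path, recursive=True)
    if fs.exists(path):  # directory marker survived the recursive delete
        fs.rmdir(path)
```

The in-memory filesystem (`memory://`) follows the same code path, which makes the helper easy to unit-test without a real bucket.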
@cbalioglu: what tests should I run to make sure it's OK?
OK, catching up on the hg models folder copy.
Force-pushed from 9a06bc7 to 1d0203f.
Some rebase issues; fixing them now.
fsspec is imported unconditionally in file_system.py but was only available transitively via torch and huggingface_hub.
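One way to make the dependency explicit without hard-failing at import time is a guarded import (a sketch; `require_fsspec` is a hypothetical helper, not existing fs2 code):

```python
try:
    import fsspec
except ImportError:  # not installed as a direct dependency
    fsspec = None


def require_fsspec():
    """Return the fsspec module, failing with an actionable message if absent."""
    if fsspec is None:
        raise RuntimeError(
            "fsspec is required for remote (e.g. s3://) checkpoint paths; "
            "install it with `pip install fsspec`."
        )
    return fsspec
```

The cleaner fix is simply declaring fsspec as a direct dependency, rather than relying on torch or huggingface_hub pulling it in transitively.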
Verify the FileSystemRegistry dispatch chain works for local paths after the LocalFileSystem → GlobalFileSystem singleton swap.
pathlib.Path normalizes s3:// to s3:/, which caused FileSystemRegistry to fall through to the local filesystem. The pattern check should match both URI forms.
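The collapse is easy to reproduce with the standard library, and a pattern accepting both forms covers it (the regex below is an illustrative sketch, not the actual fs2 check):

```python
import re
from pathlib import Path

# pathlib collapses the duplicate slash after the scheme:
assert str(Path("s3://bucket/ckpt/step_1200")) == "s3:/bucket/ckpt/step_1200"

# Match both "s3://" and the pathlib-normalized "s3:/" form:
S3_URI = re.compile(r"^s3:/{1,2}")

assert S3_URI.match("s3://bucket/ckpt")
assert S3_URI.match(str(Path("s3://bucket/ckpt")))
assert not S3_URI.match("/local/path")
```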
Taking this on from @artemru for the final push.
**Small smoke test**

A quick S3 checkpointing benchmark (Llama 3.1 8B, 8x H100, FSDP):

```shell
AWS_PROFILE=YOUR_PROFILE_RW torchrun --nproc-per-node=8 -m recipes.lm.sft "/tmp/fsspec_s3_test" \
  --checkpoint-dir s3://YOUR_BUCKET/path/ \
  --config-file recipes/lm/sft/configs/llama3_2_1b_instruct_gsm8k.yaml \
  --config model.name=llama3_1_8b --config tokenizer.name=llama3_1_8b \
  --config dataset.chat_mode=false --config dataset.max_seq_len=2048 --config dataset.max_num_tokens=2048 \
  --config trainer.data_parallelism=fsdp --config regime.num_steps=400 \
  --config regime.checkpoint_every_n_steps=200 --config regime.validate_every_n_steps=200
```

Disable HF export with
Actual data sizes (from S3):
Per-rank S3 throughput is ~85-115 MB/s. Checkpoint saves are async during training. HF export is the bottleneck, since it is single-stream from rank 0.
Required for S3 checkpoint support via fsspec. Without it, the s3:// protocol silently fails to register.
What does this PR do? Please describe:
This PR allows saving and reloading checkpoints from a remote location using the fsspec integration!
Wraps the standard fsspec interface in the fs2 `FileSystem` API
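In spirit, the wrapping looks something like the sketch below (the class and method names are illustrative assumptions, not the actual fs2 `FileSystem` surface):

```python
import fsspec


class FsspecFileSystem:
    """Hypothetical adapter exposing an fsspec filesystem behind a
    FileSystem-style interface (names are assumptions for illustration)."""

    def __init__(self, uri: str) -> None:
        # Resolve the protocol (s3://, file://, memory://, ...) once.
        self._fs, _ = fsspec.core.url_to_fs(uri)

    def open(self, path: str, mode: str = "rb"):
        return self._fs.open(path, mode)

    def exists(self, path: str) -> bool:
        return self._fs.exists(path)

    def make_directory(self, path: str) -> None:
        self._fs.makedirs(path, exist_ok=True)
```

Because fsspec dispatches on the URI scheme, the same adapter works unchanged against S3, local disk, or the in-memory backend used for tests.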
`--checkpoint-dir` support in the CLI. Tested with an e2e example:

```shell
python -m recipes.lm.sft "/tmp/tmp/fs2_s3test_1" --checkpoint-dir s3://bucket/folder/ \
  --config-file recipes/lm/sft/configs/llama3_2_1b_instruct_gsm8k.yaml \
  --config regime.checkpoint_every_n_steps=200
```

including restarting as well!
Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.
Check list: