adding integration with fsspec FileSystem #1126

Merged
cirquit merged 52 commits into main from adding_fsspec on Mar 17, 2026
Conversation

@artemru
Contributor

@artemru artemru commented Apr 18, 2025

What does this PR do? Please describe:

This PR allows saving and reloading checkpoints from a remote location using the fsspec integration!

Wrapping the standard fsspec interface into the fs2 FileSystem API (a rough sketch follows the list below):

  • Using the wrapped FS instead of the native one
  • Adding some basic path resolvers (to be used in dataset loading and the model registry)
  • Adding some basic configurations for s3, hf, gcp
  • Adding a cat method (for potentially faster file reads in concrete fsspec implementations)
  • Changing is_local to is_local_path
  • Introducing checkpoint-dir support in the CLI
  • Fixing the file-moving logic for hf and normal checkpoints (S3 cannot do a move operation the way NFS can)
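
As a rough illustration of the wrapping idea only (all names below are hypothetical and not the actual fairseq2 API):

from fsspec.core import url_to_fs
from fsspec.implementations.local import LocalFileSystem


class WrappedFileSystem:
    """Hypothetical sketch: adapt an fsspec filesystem to a small
    FileSystem-style API (names are illustrative, not fairseq2's)."""

    def __init__(self, uri: str) -> None:
        # e.g. "s3://bucket/x" -> S3FileSystem, "/tmp/x" -> LocalFileSystem
        self._fs, self._path = url_to_fs(uri)

    def is_local_path(self) -> bool:
        return isinstance(self._fs, LocalFileSystem)

    def cat(self) -> bytes:
        # One-shot read; concrete implementations (e.g. s3fs) can serve
        # this faster than open() followed by read().
        return self._fs.cat_file(self._path)

    def move(self, dst: str) -> None:
        # S3 has no atomic rename as on NFS: fall back to copy + delete.
        if self.is_local_path():
            self._fs.mv(self._path, dst)
        else:
            self._fs.copy(self._path, dst, recursive=True)
            self._fs.rm(self._path, recursive=True)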

Tested with the e2e example:

python -m recipes.lm.sft "/tmp/tmp/fs2_s3test_1" --checkpoint-dir s3://bucket/folder/ \
  --config-file recipes/lm/sft/configs/llama3_2_1b_instruct_gsm8k.yaml \
  --config regime.checkpoint_every_n_steps=200

Restarting was tested as well!

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 18, 2025
@artemru artemru self-assigned this Apr 18, 2025
@artemru artemru marked this pull request as ready for review April 23, 2025 13:10
@artemru artemru requested a review from cbalioglu as a code owner April 23, 2025 13:10
Comment thread src/fairseq2/setup/_root.py Outdated
Comment thread src/fairseq2/file_system.py
Contributor

@cbalioglu cbalioglu left a comment


I believe one major thing missing right now is some benchmark numbers as well as verification that the new file system implementation works end-to-end (i.e. a real world training run, asset metadata loading, Hugging Face model exports, and any other place where we used regular file system calls work without any regressions). Have you been able to verify those?

Comment thread src/fairseq2/assets/download_manager.py Outdated
Comment thread src/fairseq2/checkpoint/manager.py Outdated
Comment thread src/fairseq2/file_system.py Outdated
Comment thread src/fairseq2/file_system.py Outdated
Comment thread src/fairseq2/file_system.py Outdated
Comment thread src/fairseq2/file_system.py Outdated
Comment thread src/fairseq2/checkpoint/manager.py Outdated
Contributor

@caciolai caciolai left a comment


Some minor comments

@artemru
Contributor Author

artemru commented Jan 26, 2026

I believe one major thing missing right now is some benchmark numbers as well as verification that the new file system implementation works end-to-end (i.e. a real world training run, asset metadata loading, Hugging Face model exports, and any other place where we used regular file system calls work without any regressions). Have you been able to verify those?

I verified and continue doing so on a real run where the checkpointing happens on S3.
Is there any e2e workflow I can reuse for tests that checks that all saved primitives are correct?

@artemru
Contributor Author

artemru commented Jan 26, 2026

I need to move checkpoints/model.yaml to the remote location as well...
OK, done: model def + hg models.

Now we have locally:

0       /tmp/tmp/fs2_s3test_12/ws_1.d2b3ae4f/cache
4.0K    /tmp/tmp/fs2_s3test_12/ws_1.d2b3ae4f/config.yaml
192K    /tmp/tmp/fs2_s3test_12/ws_1.d2b3ae4f/logs
212K    /tmp/tmp/fs2_s3test_12/ws_1.d2b3ae4f/metrics
312K    /tmp/tmp/fs2_s3test_12/ws_1.d2b3ae4f/tb

and in the S3 folder:
model.yaml step_1000 step_1200 step_1400 step_1600 step_1800 step_2000 step_2200 step_2400 step_2600 step_800
with the inner structure:

8.8M    fs2_s3test_12/ws_1.d2b3ae4f/step_1200/data_reader
4.7G    fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg
4.0K    fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.run
4.0K    fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.stderr
0       fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.stdout
5.6G    fs2_s3test_12/ws_1.d2b3ae4f/step_1200/model
9.3G    fs2_s3test_12/ws_1.d2b3ae4f/step_1200/optimizer
12K     fs2_s3test_12/ws_1.d2b3ae4f/step_1200/trainer

@artemru
Contributor Author

artemru commented Jan 26, 2026

The issue is that s3fs removes the folder contents but does not remove the folders themselves...
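
A minimal illustration of the quirk and a defensive cleanup, assuming a hypothetical bucket path (fs.rm, fs.invalidate_cache, and fs.exists are standard fsspec calls):

import fsspec

fs = fsspec.filesystem("s3")
path = "bucket/folder/step_1200"  # hypothetical path

fs.rm(path, recursive=True)   # deletes the objects under the prefix
fs.invalidate_cache(path)     # drop the cached listing before re-checking
if fs.exists(path):           # a directory marker entry may remain
    fs.rm(path)               # remove the marker itself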

Comment thread src/fairseq2/checkpoint/__init__.py
@artemru
Contributor Author

artemru commented Jan 26, 2026

@cbalioglu: which tests should I run to make sure it's OK?

@artemru
Contributor Author

artemru commented Jan 26, 2026

OK, now picking up the hg models folder for the copy as well.

Comment thread src/fairseq2/checkpoint/hg.py
Comment thread src/fairseq2/file_system.py
Comment thread src/fairseq2/recipe/internal/metric_recorders.py
Comment thread src/fairseq2/recipe/internal/train_model.py
Comment thread src/fairseq2/checkpoint/hg.py
@artemru
Contributor Author

artemru commented Feb 6, 2026

Some rebase issues; fixing them.

artemru and others added 10 commits February 6, 2026 13:48
fsspec is imported unconditionally in file_system.py but was only
available transitively via torch and huggingface_hub.
Verify the FileSystemRegistry dispatch chain works for local paths
after the LocalFileSystem → GlobalFileSystem singleton swap.
pathlib.Path normalizes s3:// to s3:/ which caused FileSystemRegistry
to fall through to local filesystem. Match both URI forms in pattern check.
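
The quirk from that last commit message is easy to reproduce in an interpreter; a minimal illustration (the pattern name is hypothetical):

import re
from pathlib import Path

# pathlib collapses the double slash after the scheme:
print(Path("s3://bucket/folder"))  # -> s3:/bucket/folder

# so the dispatch check has to accept both URI forms:
S3_PATTERN = re.compile(r"^s3:/{1,2}")  # hypothetical pattern
assert S3_PATTERN.match("s3://bucket/folder")
assert S3_PATTERN.match(str(Path("s3://bucket/folder")))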
@cirquit
Contributor

cirquit commented Mar 16, 2026

I believe one major thing missing right now is some benchmark numbers as well as verification that the new file system implementation works end-to-end (i.e. a real world training run, asset metadata loading, Hugging Face model exports, and any other place where we used regular file system calls work without any regressions). Have you been able to verify those?

Taking this on from @artemru for the final push.

  • Added fsspec as an explicit dependency given that we use it in file_system.py (93be990)
  • Added some tests for GlobalFileSystem (60338fe)
  • pathlib.Path normalizes s3:// to s3:/, which caused FileSystemRegistry to fall through to the local filesystem instead of using S3. Match both URI forms in the pattern check (a6577da). This is ultimately a hotfix rather than a proper patch, given that the mangling comes from the use of pathlib.Path, which simply does not behave in line with the way we access S3 with s3fs. Should be reworked in its own PR.

Small smoke test

A quick S3 checkpointing benchmark (Llama 3.1 8B, 8x H100, FSDP):

  • Add checkpoint: "hg://meta-llama/Llama-3.1-8B" and tokenizer: llama3_1_8b to the asset card llama.yaml (also something to be reworked in the future)
  • pip install s3fs and configure an AWS profile in ~/.aws/config, then:
AWS_PROFILE=YOUR_PROFILE_RW torchrun --nproc-per-node=8 -m recipes.lm.sft "/tmp/fsspec_s3_test" \
    --checkpoint-dir s3://YOUR_BUCKET/path/ \
    --config-file recipes/lm/sft/configs/llama3_2_1b_instruct_gsm8k.yaml \
    --config model.name=llama3_1_8b --config tokenizer.name=llama3_1_8b \
    --config dataset.chat_mode=false --config dataset.max_seq_len=2048 --config dataset.max_num_tokens=2048 \
    --config trainer.data_parallelism=fsdp --config regime.num_steps=400 \
    --config regime.checkpoint_every_n_steps=200 --config regime.validate_every_n_steps=200

Disable HF export with --config regime.save_model_only=true. Resume by re-running the same command.

Actual data sizes (from S3):

  • Model shard: 4.0 GB × 8 ranks = 29.9 GB
  • Optimizer shard: 8.0 GB × 8 ranks = 59.8 GB
  • Total checkpoint: ~90 GB/step
  • HF export: 7 safetensors files = 32.1 GB
Operation                 Wall clock   Data                Throughput
Checkpoint save           ~2m          ~90 GB (8 ranks)    ~87 MB/s/rank
Resume: model load        ~34s         ~30 GB (8 ranks)    ~110 MB/s/rank
Resume: optimizer load    ~65s         ~60 GB (8 ranks)    ~115 MB/s/rank
Resume total              ~1m 40s      ~90 GB
HF export                 ~5-7m        ~32 GB              ~76 MB/s single stream (single-threaded export, bottleneck)
Save+export (2 steps)     16m 12s
Save only (2 steps)       ~4m 14s                          (save_model_only=true)

Per-rank S3 throughput is ~85-115 MB/s. Checkpoint saves are async during training. HF export is the bottleneck — single-stream from rank 0.
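
As a quick sanity check of the per-rank figures (pure arithmetic on the table above; the helper name is illustrative):

def per_rank_mb_s(total_gb: float, seconds: float, ranks: int = 8) -> float:
    # aggregate GB -> MB, then per second and per rank
    return total_gb * 1000 / seconds / ranks

print(f"{per_rank_mb_s(29.9, 34):.0f} MB/s/rank")  # model load     -> ~110
print(f"{per_rank_mb_s(59.8, 65):.0f} MB/s/rank")  # optimizer load -> ~115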

cirquit added 2 commits March 16, 2026 21:47
Required for S3 checkpoint support via fsspec. Without it, the s3://
protocol silently fails to register.
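
A quick way to confirm the protocol is wired up once s3fs is installed (standard fsspec call):

import fsspec

# fsspec lists "s3" in its protocol registry, but constructing the
# filesystem raises ImportError when s3fs itself is missing; with s3fs
# installed this succeeds instead of silently falling back.
fs = fsspec.filesystem("s3")
print(type(fs).__name__)  # S3FileSystem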
@cirquit
Contributor

cirquit commented Mar 16, 2026

Nits: Added s3fs as a direct dependency (8d7bf20) and updated the CHANGELOG.md (f229254)

Contributor

@cirquit cirquit left a comment


LGTM after refactoring.

@cirquit cirquit merged commit 869c234 into main Mar 17, 2026
19 checks passed
@cirquit cirquit deleted the adding_fsspec branch March 17, 2026 00:02
@cirquit cirquit restored the adding_fsspec branch March 17, 2026 00:02