Conversation
cbalioglu
left a comment
I believe one major thing missing right now is some benchmark numbers as well as verification that the new file system implementation works end-to-end (i.e. a real world training run, asset metadata loading, Hugging Face model exports, and any other place where we used regular file system calls work without any regressions). Have you been able to verify those?
I verified this and continue to do so on a real training run where checkpointing happens on S3.
I need to move checkpoints/model.yaml to the remote location as well... right now we have, both locally and in the S3 folder:

```
8.8M  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/data_reader
4.7G  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg
4.0K  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.run
4.0K  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.stderr
0     fs2_s3test_12/ws_1.d2b3ae4f/step_1200/hg.stdout
5.6G  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/model
9.3G  fs2_s3test_12/ws_1.d2b3ae4f/step_1200/optimizer
12K   fs2_s3test_12/ws_1.d2b3ae4f/step_1200/trainer
```
The issue is that s3fs removes the folder contents but does not remove the folders themselves...
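For reference, a minimal cleanup sketch over the generic fsspec API (`remove_tree` is my illustration, not the fs2 implementation): delete the contents recursively, then remove any empty directory marker left behind.

```python
import fsspec


def remove_tree(uri: str) -> None:
    """Remove a directory tree, then any empty directory entry left behind.

    Hypothetical helper: s3fs can leave empty "folder" markers after the
    contents are deleted, so check for and remove the marker as well.
    """
    fs, path = fsspec.core.url_to_fs(uri)
    fs.rm(path, recursive=True)
    if fs.exists(path):  # directory marker survived the recursive delete
        fs.rmdir(path)
```

The in-memory filesystem (`memory://`) follows the same code path, which makes the helper easy to unit-test without a real bucket.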
@cbalioglu: what tests should I run to make sure it's OK?
OK, catching up on the hg models folder copy.
Force-pushed from 9a06bc7 to 1d0203f.
Some rebase issues; fixing them now.
fsspec is imported unconditionally in file_system.py but was only available transitively via torch and huggingface_hub.
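One way to make the dependency explicit without hard-failing at import time is a guarded import (a sketch; `require_fsspec` is a hypothetical helper, not existing fs2 code):

```python
try:
    import fsspec
except ImportError:  # not installed as a direct dependency
    fsspec = None


def require_fsspec():
    """Return the fsspec module, failing with an actionable message if absent."""
    if fsspec is None:
        raise RuntimeError(
            "fsspec is required for remote (e.g. s3://) checkpoint paths; "
            "install it with `pip install fsspec`."
        )
    return fsspec
```

The cleaner fix is simply declaring fsspec as a direct dependency, rather than relying on torch or huggingface_hub pulling it in transitively.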
Verify the FileSystemRegistry dispatch chain works for local paths after the LocalFileSystem → GlobalFileSystem singleton swap.
pathlib.Path normalizes s3:// to s3:/, which caused FileSystemRegistry to fall through to the local filesystem. The pattern check should match both URI forms.
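The collapse is easy to reproduce with the standard library, and a pattern accepting both forms covers it (the regex below is an illustrative sketch, not the actual fs2 check):

```python
import re
from pathlib import Path

# pathlib collapses the duplicate slash after the scheme:
assert str(Path("s3://bucket/ckpt/step_1200")) == "s3:/bucket/ckpt/step_1200"

# Match both "s3://" and the pathlib-normalized "s3:/" form:
S3_URI = re.compile(r"^s3:/{1,2}")

assert S3_URI.match("s3://bucket/ckpt")
assert S3_URI.match(str(Path("s3://bucket/ckpt")))
assert not S3_URI.match("/local/path")
```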
Taking this on from @artemru for the final push.
**Small smoke test**

A quick S3 checkpointing benchmark (Llama 3.1 8B, 8x H100, FSDP):

```shell
AWS_PROFILE=YOUR_PROFILE_RW torchrun --nproc-per-node=8 -m recipes.lm.sft "/tmp/fsspec_s3_test" \
  --checkpoint-dir s3://YOUR_BUCKET/path/ \
  --config-file recipes/lm/sft/configs/llama3_2_1b_instruct_gsm8k.yaml \
  --config model.name=llama3_1_8b --config tokenizer.name=llama3_1_8b \
  --config dataset.chat_mode=false --config dataset.max_seq_len=2048 --config dataset.max_num_tokens=2048 \
  --config trainer.data_parallelism=fsdp --config regime.num_steps=400 \
  --config regime.checkpoint_every_n_steps=200 --config regime.validate_every_n_steps=200
```

Disable HF export with
Actual data sizes (from S3):
Per-rank S3 throughput is ~85-115 MB/s. Checkpoint saves are async during training. HF export is the bottleneck, since it is single-stream from rank 0.
Required for S3 checkpoint support via fsspec. Without it, the s3:// protocol silently fails to register.
What does this PR do? Please describe:
This PR allows saving and reloading checkpoints from a remote location using the fsspec integration!
Wraps the standard fsspec interface in the fs2 `FileSystem` API
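In spirit, the wrapping looks something like the sketch below (the class and method names are illustrative assumptions, not the actual fs2 `FileSystem` surface):

```python
import fsspec


class FsspecFileSystem:
    """Hypothetical adapter exposing an fsspec filesystem behind a
    FileSystem-style interface (names are assumptions for illustration)."""

    def __init__(self, uri: str) -> None:
        # Resolve the protocol (s3://, file://, memory://, ...) once.
        self._fs, _ = fsspec.core.url_to_fs(uri)

    def open(self, path: str, mode: str = "rb"):
        return self._fs.open(path, mode)

    def exists(self, path: str) -> bool:
        return self._fs.exists(path)

    def make_directory(self, path: str) -> None:
        self._fs.makedirs(path, exist_ok=True)
```

Because fsspec dispatches on the URI scheme, the same adapter works unchanged against S3, local disk, or the in-memory backend used for tests.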
`--checkpoint-dir` support in the CLI. Tested with an e2e example:

```shell
python -m recipes.lm.sft "/tmp/tmp/fs2_s3test_1" --checkpoint-dir s3://bucket/folder/ \
  --config-file recipes/lm/sft/configs/llama3_2_1b_instruct_gsm8k.yaml \
  --config regime.checkpoint_every_n_steps=200
```

including restarting as well!
Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.
Check list: