fix: add NCCL-safe utilities by Yunnglin · Pull Request #233 · modelscope/twinkle

Yunnglin · 2026-06-24T12:41:05Z

Summary

This PR introduces production-grade fault tolerance and observability for distributed training, along with a complete E2E test infrastructure overhaul.

Changes

1. NCCL-safe fault tolerance (`src/twinkle/utils/nccl_safe.py`)

Three layers of protection to prevent NCCL hangs in production (TWINKLE_FAIL_FAST=0):

Layer 1 - safe_loss(): Wraps loss instances (as a Loss subclass) to catch computation errors and return graph-connected zero loss for FSDP ReduceScatter compatibility.
Layer 2 - @nccl_safe: Detects forward/backward boundaries and forces zero-gradient backward when forward ran but backward didn't, preventing DP rank desynchronization.
Layer 3 - @nccl_safe_megatron: Catches pre-communication errors in Megatron methods where the entire body is NCCL-critical.

2. Deep health check

Add ping() method to all model backends (mock/transformers/megatron)
Add /twinkle/healthz/deep gateway endpoint that probes model actor liveness
Update entrypoint.sh watchdog to check deep health URL
Returns 503 when model actors are dead (OOM/SIGKILL)

3. DPO ragged logps handling

Fix forward_backward in Megatron backend: flatten and pad_and_stack ragged ref_outputs['logps'] from variable-length PP microbatches
Fix _tensor_output_to_rows in common.py: handle ragged list[list] after to_cpu_safe_output serialization

4. Megatron backend `sync=True`

Add sync=True to twinkle-native forward_backward @remote_function, consistent with base class and tinker_forward_backward

5. `TWINKLE_FAIL_FAST` propagation

Propagate TWINKLE_FAIL_FAST: "0" to all service runtime_env in cookbook configs
Add env propagation in launcher/env_propagation.py

6. E2E test infrastructure

New tests/server/integration/e2e_helpers.py with shared utilities
Backend-agnostic tests: test_sft_e2e.py, test_dpo_e2e.py, test_grpo_e2e.py
NCCL-safe fault tolerance tests: test_nccl_safe_twinkle_e2e.py, test_nccl_safe_tinker_e2e.py
Refactored test_full_cycle_e2e.py (checkpoint save/load/resume)
One-click server startup: tests/server/start_e2e_server.py
Megatron E2E configs: server_config_4b_e2e_megatron.yaml

Testing

All tests verified on real GPU environment (4x model + 1x sampler):

SFT E2E (twinkle + tinker, transformers + megatron): loss decreases
DPO E2E (twinkle + tinker): loss decreases, reward margins increase
GRPO E2E: sampling + training completes without timeout
NCCL-safe E2E: graceful degradation under fault injection
Full cycle: save/load/resume checkpoint correctness

gemini-code-assist

Code Review

This pull request introduces NCCL-safe fault tolerance utilities (safe_loss and @nccl_safe) to prevent distributed training hangs by catching errors gracefully and forcing zero-gradient backward passes when necessary. It also propagates the TWINKLE_FAIL_FAST environment variable, updates server configurations, and adds extensive E2E and unit tests. However, several critical issues were identified in the new nccl_safe.py implementation: first, safe_loss returns a function wrapper instead of a Loss instance, which will fail pipeline assertions expecting a Loss subclass; second, the fallback path in _force_zero_backward can raise an AttributeError if model.model is a list, and detaching the parameter breaks the autograd graph connectivity required for FSDP hooks to fire.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…ack backward

…tron backend

- Add ping() method to all model backends (mock/transformers/megatron) - Add check_model_health() to ModelManagement that calls ping() - Add /healthz endpoint on model deployment (returns 503 if actors dead) - Add /twinkle/healthz/deep on gateway that probes all model deployments - Update entrypoint.sh watchdog to check deep health URL - Update validation middleware to skip sticky session for /healthz/* paths - Fix start_e2e_server.py DEFAULT_CONFIG path

…nly for PP>1 When variable_seq_lengths=True with Pipeline Parallelism, logps from different DP ranks / microbatches have different seq_lens. After HTTP serialization + collect_tensor_dict, they become ragged nested lists that torch.as_tensor cannot handle. Changes: - megatron_model.py: Flatten and pad_and_stack ragged ref_outputs logps in forward_backward before passing to MegatronModel (twinkle client) - common.py: Handle ragged list[list] in _tensor_output_to_rows by flattening microbatch nesting and pad_and_stack (tinker client) - nccl_safe.py: Add traceback to safe_loss and nccl_safe_megatron error logging for better diagnostics - Add tinker client DPO PP E2E test

Copilot

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

tests/server/integration/test_full_cycle_e2e.py:63

SAVE_DIR is hard-coded to a /mnt/nas2/... path, which makes this E2E test fail outside that specific environment. Use an env var override and default to a writable temp directory.

Yunnglin added 2 commits June 24, 2026 12:52

update tinker client sdk

6774894

fix: add NCCL-safe utilities and GRPO backend failure isolation

f9cb736

Copilot AI review requested due to automatic review settings June 24, 2026 12:41

Copilot started reviewing on behalf of Yunnglin June 24, 2026 12:41 View session

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread src/twinkle/utils/nccl_safe.py

Comment thread src/twinkle/utils/nccl_safe.py Outdated

Comment thread src/twinkle/utils/nccl_safe.py Outdated

Copilot AI reviewed Jun 24, 2026

Yunnglin added 8 commits June 24, 2026 21:02

fix(nccl_safe): make SafeLossWrapper a Loss subclass and robust fallb…

dff504d

…ack backward

test(e2e): adjust full-cycle and nccl-safe integration tests for Mega…

2669aea

…tron backend

check e2e

2caee8f

update nccl safe

e6fd665

update test scripte

5d67de9

fix lint

72ffacf

Yunnglin requested a review from Copilot June 26, 2026 02:15

Copilot started reviewing on behalf of Yunnglin June 26, 2026 02:16 View session

Copilot AI reviewed Jun 26, 2026

View reviewed changes

Yunnglin added 5 commits June 26, 2026 12:18

fix review

40e0a8d

fix nccl hang

99b596c

update test

61a7e2b

update reg test

70b2556

update reg test

b42cf06

Yunnglin changed the title ~~fix: add NCCL-safe utilities and GRPO backend failure isolation~~ fix: add NCCL-safe utilities Jun 26, 2026

tastelikefeet approved these changes Jun 26, 2026

View reviewed changes

revert comment

b80d14a

Yunnglin merged commit 1d35323 into main Jun 26, 2026
2 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add NCCL-safe utilities#233

fix: add NCCL-safe utilities#233
Yunnglin merged 16 commits into
mainfrom
fix/grpo-backend-failure-isolation

Yunnglin commented Jun 24, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Yunnglin commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

1. NCCL-safe fault tolerance (src/twinkle/utils/nccl_safe.py)

2. Deep health check

3. DPO ragged logps handling

4. Megatron backend sync=True

5. TWINKLE_FAIL_FAST propagation

6. E2E test infrastructure

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Yunnglin commented Jun 24, 2026 •

edited

Loading

1. NCCL-safe fault tolerance (`src/twinkle/utils/nccl_safe.py`)

4. Megatron backend `sync=True`

5. `TWINKLE_FAIL_FAST` propagation