fix: add NCCL-safe utilities#233
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces NCCL-safe fault tolerance utilities (safe_loss and @nccl_safe) to prevent distributed training hangs by catching errors gracefully and forcing zero-gradient backward passes when necessary. It also propagates the TWINKLE_FAIL_FAST environment variable, updates server configurations, and adds extensive E2E and unit tests. However, several critical issues were identified in the new nccl_safe.py implementation: first, safe_loss returns a function wrapper instead of a Loss instance, which will fail pipeline assertions expecting a Loss subclass; second, the fallback path in _force_zero_backward can raise an AttributeError if model.model is a list, and detaching the parameter breaks the autograd graph connectivity required for FSDP hooks to fire.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
- Add ping() method to all model backends (mock/transformers/megatron) - Add check_model_health() to ModelManagement that calls ping() - Add /healthz endpoint on model deployment (returns 503 if actors dead) - Add /twinkle/healthz/deep on gateway that probes all model deployments - Update entrypoint.sh watchdog to check deep health URL - Update validation middleware to skip sticky session for /healthz/* paths - Fix start_e2e_server.py DEFAULT_CONFIG path
…nly for PP>1 When variable_seq_lengths=True with Pipeline Parallelism, logps from different DP ranks / microbatches have different seq_lens. After HTTP serialization + collect_tensor_dict, they become ragged nested lists that torch.as_tensor cannot handle. Changes: - megatron_model.py: Flatten and pad_and_stack ragged ref_outputs logps in forward_backward before passing to MegatronModel (twinkle client) - common.py: Handle ragged list[list] in _tensor_output_to_rows by flattening microbatch nesting and pad_and_stack (tinker client) - nccl_safe.py: Add traceback to safe_loss and nccl_safe_megatron error logging for better diagnostics - Add tinker client DPO PP E2E test
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (1)
tests/server/integration/test_full_cycle_e2e.py:63
- SAVE_DIR is hard-coded to a
/mnt/nas2/...path, which makes this E2E test fail outside that specific environment. Use an env var override and default to a writable temp directory.
Summary
This PR introduces production-grade fault tolerance and observability for distributed training, along with a complete E2E test infrastructure overhaul.
Changes
1. NCCL-safe fault tolerance (
src/twinkle/utils/nccl_safe.py)Three layers of protection to prevent NCCL hangs in production (
TWINKLE_FAIL_FAST=0):safe_loss(): Wraps loss instances (as aLosssubclass) to catch computation errors and return graph-connected zero loss for FSDP ReduceScatter compatibility.@nccl_safe: Detects forward/backward boundaries and forces zero-gradient backward when forward ran but backward didn't, preventing DP rank desynchronization.@nccl_safe_megatron: Catches pre-communication errors in Megatron methods where the entire body is NCCL-critical.2. Deep health check
ping()method to all model backends (mock/transformers/megatron)/twinkle/healthz/deepgateway endpoint that probes model actor livenessentrypoint.shwatchdog to check deep health URL3. DPO ragged logps handling
forward_backwardin Megatron backend: flatten andpad_and_stackraggedref_outputs['logps']from variable-length PP microbatches_tensor_output_to_rowsincommon.py: handle raggedlist[list]afterto_cpu_safe_outputserialization4. Megatron backend
sync=Truesync=Trueto twinkle-nativeforward_backward@remote_function, consistent with base class andtinker_forward_backward5.
TWINKLE_FAIL_FASTpropagationTWINKLE_FAIL_FAST: "0"to all serviceruntime_envin cookbook configslauncher/env_propagation.py6. E2E test infrastructure
tests/server/integration/e2e_helpers.pywith shared utilitiestest_sft_e2e.py,test_dpo_e2e.py,test_grpo_e2e.pytest_nccl_safe_twinkle_e2e.py,test_nccl_safe_tinker_e2e.pytest_full_cycle_e2e.py(checkpoint save/load/resume)tests/server/start_e2e_server.pyserver_config_4b_e2e_megatron.yamlTesting
All tests verified on real GPU environment (4x model + 1x sampler):