Fix GRPO weight-sync hangs and HCCL resource exhaustion on NPU #102
addsubmuldiv merged 11 commits into modelscope:main from
Conversation
Summary of Changes

This pull request addresses several critical stability issues in the GRPO training flow's model-to-vLLM sampler weight synchronization path, particularly on NPU platforms. The changes aim to eliminate intermittent hangs, reduce HCCL initialization failures due to resource exhaustion, resolve shared memory cleanup errors, and fix shutdown warnings. By refactoring the metadata communication protocol, reusing HCCL communicators, and hardening IPC and memory management, the PR significantly improves the reliability, observability, and debuggability of the weight-sync pipeline.
Code Review
This pull request significantly improves the stability and robustness of the weight synchronization process for NPU environments by replacing the unreliable ZMQ PUB/SUB protocol with a more robust REQ/REP handshake, adding timeouts, and enhancing the vLLM IPC mechanism. However, it introduces critical security vulnerabilities due to insecure deserialization; specifically, the use of ZMQ's recv_pyobj() (which uses Python's pickle module) for inter-process communication over TCP and IPC sockets poses a significant risk of Remote Code Execution (RCE). This should be addressed by switching to a secure serialization format like JSON for metadata exchange. Additionally, an inconsistency was found in hccl_checkpoint_engine.py where the receiver supports chunking but the sender does not, which needs to be aligned.
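The review's suggested fix is to exchange metadata as JSON rather than pickled Python objects. As a minimal, hypothetical sketch (the field names `name`/`dtype`/`shape`/`bucket_id` are illustrative, not the PR's actual schema), the helpers below show why JSON closes the RCE hole: `json.loads` only builds plain data types, whereas `recv_pyobj()` calls `pickle.loads`, which can execute arbitrary code from a malicious peer.

```python
import json

# Hedged sketch: serializing weight-sync metadata as JSON instead of pickle.
# The schema below is illustrative, not the PR's actual metadata format.

def encode_metadata(meta: dict) -> bytes:
    # JSON restricts the payload to plain data types (dict/list/str/int/...),
    # so the wire format cannot smuggle in objects that run code on load.
    return json.dumps(meta).encode("utf-8")

def decode_metadata(payload: bytes) -> dict:
    # Unlike ZMQ's recv_pyobj() (which calls pickle.loads), json.loads
    # never constructs arbitrary Python objects.
    return json.loads(payload.decode("utf-8"))
```

With a ZMQ socket this would be used as `sock.send(encode_metadata(meta))` on one side and `decode_metadata(sock.recv())` on the other, replacing `send_pyobj`/`recv_pyobj`.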
Pull request overview
This PR fixes multiple stability issues in the GRPO weight-sync pipeline, specifically targeting the model → vLLM sampler weight transfer path on Ascend NPU (HCCL) and the vLLM IPC path:
Changes:
- Refactored the HCCL checkpoint engine's metadata control plane from unreliable ZMQ PUB/SUB to strict REQ/REP handshakes with per-bucket synchronization, preventing message loss and bucket-step desynchronization across ranks.
- Added weight chunking support to handle weights larger than the bucket size, replacing the previous hard error for oversized weights, and set the NPU default to `rebuild_group=False` to avoid HCCL resource exhaustion from repeated communicator rebuilds.
- Improved vLLM IPC robustness with per-sync unique ZMQ endpoints (avoiding cross-actor collisions), configurable timeouts, hardened SHM cleanup, and proper device binding with `local_rank`.
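The per-bucket REQ/REP synchronization in the first change can be illustrated with a small sketch. This is not the PR's code: the transport is modeled with `queue.Queue` where the real engine uses ZMQ REQ/REP sockets, and the function names are invented. The point is the protocol shape: each bucket transfer is preceded by an explicit request keyed by `bucket_id`, so a lost or stale message surfaces as a timeout or mismatch instead of silently desynchronizing ranks.

```python
import queue

# Hedged sketch of a per-bucket REQ/REP handshake. Transport is modeled with
# queue.Queue; the actual implementation uses ZMQ sockets.

class BucketDesyncError(RuntimeError):
    pass

def serve_bucket(req_q: queue.Queue, rep_q: queue.Queue, bucket_id: int, meta: dict) -> None:
    # Sender side: wait for the receiver to ask for exactly this bucket.
    request = req_q.get(timeout=5.0)
    if request["bucket_id"] != bucket_id:
        raise BucketDesyncError(
            f"peer asked for bucket {request['bucket_id']}, serving {bucket_id}")
    rep_q.put({"bucket_id": bucket_id, "meta": meta})

def fetch_bucket(req_q: queue.Queue, rep_q: queue.Queue,
                 bucket_id: int, timeout_s: float = 5.0) -> dict:
    # Receiver side: request the bucket, then block (bounded) for the reply.
    req_q.put({"bucket_id": bucket_id})
    try:
        reply = rep_q.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"no reply for bucket {bucket_id} within {timeout_s}s")
    if reply["bucket_id"] != bucket_id:
        raise BucketDesyncError(
            f"expected bucket {bucket_id}, got {reply['bucket_id']}")
    return reply["meta"]
```

Contrast with PUB/SUB: a subscriber that joins a beat late simply misses the message; here the receiver either gets its bucket, a desync error, or a timeout it can log.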
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| `src/twinkle/utils/framework.py` | Fixes import path from `.platform` to `.platforms` for GPU/NPU device classes |
| `src/twinkle/checkpoint_engine/mixin.py` | Sets NPU default `rebuild_group=False` with env-var rollback (`TWINKLE_CKPT_HCCL_REBUILD_GROUP`) |
| `src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py` | Major refactor: PUB/SUB → REQ/REP metadata protocol, chunked weight transfer, timeout-based error handling, ZMQ context lifecycle management |
| `src/twinkle/sampler/vllm_sampler/vllm_engine.py` | Per-sync unique IPC endpoints, configurable IPC timeouts, chunked weight transfer, IPC socket cleanup |
| `src/twinkle/sampler/vllm_sampler/vllm_worker_extension.py` | Fixed device binding, chunk reassembly, SHM cleanup hardening, timeout-based error handling |
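Several of the files above add chunked weight transfer for weights larger than the staging bucket. A minimal sketch of that idea (names are illustrative; the real code works with device buffers, not plain bytes): the sender splits the payload into pieces no larger than the bucket size, and the receiver reassembles them in order, validating the total length before rebuilding the tensor.

```python
# Hedged sketch of chunked weight transfer: split a payload into
# <= bucket_size pieces and reassemble with a length check, instead of
# raising a hard error for oversized weights. Names are illustrative.

def split_into_chunks(payload: bytes, bucket_size: int) -> list[bytes]:
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [payload[i:i + bucket_size] for i in range(0, len(payload), bucket_size)]

def reassemble(chunks: list[bytes], expected_len: int) -> bytes:
    out = b"".join(chunks)
    if len(out) != expected_len:
        # Catches dropped or duplicated chunks before the tensor is rebuilt.
        raise RuntimeError(f"reassembled {len(out)} bytes, expected {expected_len}")
    return out
```

The review comment about sender/receiver asymmetry matters precisely here: if only the receiver understands chunking, an oversized weight still fails on the send side.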
PR type
Problem Statement

In the GRPO training flow, the model -> vLLM sampler weight-sync path had multiple stability issues:

- Intermittent hangs in the metadata exchange on `recv_pyobj()`.
- HCCL initialization failures (`not found specific resource` / `link establish timeout`).
- Shared memory cleanup errors (`BufferError: cannot close exported pointers exist`).
- Warnings during shutdown.

Root Causes

- The metadata control plane used ZMQ PUB/SUB, which can lose messages under process timing jitter, causing bucket-step desynchronization across ranks.

Key Changes
HCCL metadata protocol refactor (control plane)
- Replaced the PUB/SUB metadata flow with a strict REQ/REP handshake keyed by `bucket_id`.

File:
Default communicator reuse on NPU
- Defaults to `rebuild_group=False` (with env-based rollback).

File:
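A hypothetical sketch of how such an env-gated default could look (the helper name and the assumption that non-NPU platforms keep rebuilding are mine, not the PR's code): the `TWINKLE_CKPT_HCCL_REBUILD_GROUP` variable overrides the platform default, giving an escape hatch if communicator reuse misbehaves.

```python
import os

# Hedged sketch of the rollback switch: reuse the HCCL communicator on NPU by
# default; TWINKLE_CKPT_HCCL_REBUILD_GROUP=1 restores rebuild-per-sync.
# Helper name and the non-NPU default are assumptions for illustration.

def should_rebuild_group(is_npu: bool) -> bool:
    env = os.environ.get("TWINKLE_CKPT_HCCL_REBUILD_GROUP")
    if env is not None:
        # Explicit override wins on any platform.
        return env == "1"
    # Reusing the communicator on NPU avoids exhausting HCCL link resources
    # through repeated init/teardown cycles.
    return not is_npu
```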
vLLM IPC robustness improvements
- Generates a per-sync unique `zmq_handle` to avoid endpoint collisions across actors.

File:
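One way to get collision-free endpoints is to derive them from a fresh UUID per sync. The endpoint format below is illustrative only; the PR's actual `zmq_handle` scheme may differ.

```python
import uuid

# Hedged sketch: a per-sync, per-actor IPC endpoint. A fixed path would let
# two actors (or two consecutive syncs) bind the same socket and hang.

def make_zmq_handle(actor_id: str) -> str:
    # uuid4 makes collisions across actors and across retries effectively
    # impossible; the /tmp path and naming are illustrative.
    return f"ipc:///tmp/twinkle_sync_{actor_id}_{uuid.uuid4().hex}.sock"
```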
Worker-side receive and memory handling fixes
- Binds each worker to its device via `local_rank` (e.g. `torch.get_device(local_rank)`).
- Hardened SHM cleanup to eliminate the `BufferError` risk.

File:
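The `BufferError: cannot close exported pointers exist` mentioned in the problem statement is what CPython raises when a shared-memory segment is closed while views into its buffer are still alive. A stdlib-only sketch of the hardening pattern (function name and flow are illustrative, not the PR's code): release every exported view before calling `close()`.

```python
from multiprocessing import shared_memory

# Hedged sketch of SHM cleanup hardening: release all memoryviews into the
# shared buffer before close(), otherwise close() raises
# "BufferError: cannot close exported pointers exist".

def consume_and_release(shm_name: str, nbytes: int) -> bytes:
    shm = shared_memory.SharedMemory(name=shm_name)  # attach, don't create
    view = memoryview(shm.buf)[:nbytes]
    try:
        data = bytes(view)   # copy the payload out while the view is alive
    finally:
        view.release()       # drop the exported pointer first...
        shm.close()          # ...so close() cannot raise BufferError
    # unlink() is left to the segment's creator.
    return data
```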
Compatibility & Config
- `TWINKLE_CKPT_HCCL_REBUILD_GROUP=1`: roll back to rebuilding the HCCL communicator group on each sync.
- `TWINKLE_CKPT_HCCL_META_TIMEOUT_S`: timeout (seconds) for the HCCL metadata handshake.
- `TWINKLE_VLLM_IPC_TIMEOUT_S`: timeout (seconds) for vLLM IPC operations.

Outcome