From reading the code, it seems that the GRPO advantage computation effectively degenerates to using raw rewards.
Because a randomly generated UUID is assigned to each sample as its grouping key, every sample forms a group of size 1. The group-based normalization step in GRPO therefore becomes a no-op, and the computed advantage is effectively just the raw reward (up to scaling or numerical safeguards).
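To make the degenerate case concrete, here is a minimal sketch assuming the standard GRPO normalization `A_i = (r_i - mean(group)) / (std(group) + eps)`. The function name and grouping scheme are illustrative, not the actual code in this repo:

```python
import uuid
from collections import defaultdict


def grpo_advantages(rewards, group_keys, eps=1e-6):
    """Hypothetical sketch of GRPO-style group-relative normalization."""
    groups = defaultdict(list)
    for r, k in zip(rewards, group_keys):
        groups[k].append(r)
    stats = {}
    for k, rs in groups.items():
        mean = sum(rs) / len(rs)
        std = (sum((r - mean) ** 2 for r in rs) / len(rs)) ** 0.5
        stats[k] = (mean, std)
    return [(r - stats[k][0]) / (stats[k][1] + eps)
            for r, k in zip(rewards, group_keys)]


rewards = [1.0, 0.0, 0.5, 1.0]

# Shared prompt id: advantages are relative within the group, as intended.
shared = grpo_advantages(rewards, ["prompt-0"] * 4)

# Per-sample UUID keys: every group has size 1, so mean(group) == r_i and
# std(group) == 0. The normalized advantage collapses (to ~0 here; an
# implementation that skips normalization for singleton groups instead
# passes the raw reward through unchanged).
degenerate = grpo_advantages(rewards, [str(uuid.uuid4()) for _ in rewards])
```

Either way, the group-relative part of GRPO no longer contributes any signal once every group is a singleton.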
Could you clarify whether this behavior is intentional?
- If so, what was the design rationale for disabling group-relative normalization while still referring to the algorithm as GRPO?
- If not, should the grouping key instead correspond to a shared task/prompt/rollout identifier, so that advantages are normalized across multiple samples from the same task?
Thanks in advance for any clarification.