From reading the code, it seems that the GRPO advantage computation effectively degenerates to using raw rewards.
Because a randomly generated UUID is assigned to each sample as its grouping key, every sample forms a group of size 1. The group-based normalization step in GRPO therefore becomes a no-op, and the computed advantage is effectively just the raw reward (up to scaling or numerical safeguards).
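To make the degenerate case concrete, here is a minimal sketch assuming the standard GRPO normalization `A_i = (r_i - mean(group)) / (std(group) + eps)`. The function name and grouping scheme are illustrative, not the actual code in this repo:

```python
import uuid
from collections import defaultdict


def grpo_advantages(rewards, group_keys, eps=1e-6):
    """Hypothetical sketch of GRPO-style group-relative normalization."""
    groups = defaultdict(list)
    for r, k in zip(rewards, group_keys):
        groups[k].append(r)
    stats = {}
    for k, rs in groups.items():
        mean = sum(rs) / len(rs)
        std = (sum((r - mean) ** 2 for r in rs) / len(rs)) ** 0.5
        stats[k] = (mean, std)
    return [(r - stats[k][0]) / (stats[k][1] + eps)
            for r, k in zip(rewards, group_keys)]


rewards = [1.0, 0.0, 0.5, 1.0]

# Shared prompt id: advantages are relative within the group, as intended.
shared = grpo_advantages(rewards, ["prompt-0"] * 4)

# Per-sample UUID keys: every group has size 1, so mean(group) == r_i and
# std(group) == 0. The normalized advantage collapses (to ~0 here; an
# implementation that skips normalization for singleton groups instead
# passes the raw reward through unchanged).
degenerate = grpo_advantages(rewards, [str(uuid.uuid4()) for _ in rewards])
```

Either way, the group-relative part of GRPO no longer contributes any signal once every group is a singleton.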
Could you clarify whether this behavior is intentional?
- If so, what was the design rationale for disabling group-relative normalization while still referring to the algorithm as GRPO?
- If not, should the grouping key instead correspond to a shared task/prompt/rollout identifier, so that advantages are normalized across multiple samples from the same task?
Thanks in advance for any clarification.