GRPO advantage computation uses random UUIDs instead of group_ids, effectively disabling group normalization #36

@yueyiwen-create

From reading the code, it seems that the GRPO advantage computation effectively degenerates to using raw rewards.

Since a randomly generated UUID is assigned to each sample as the grouping key, every sample forms a group of size 1. As a result, the group-based normalization step in GRPO becomes a no-op, and the computed advantage is effectively equivalent to the raw reward (up to scaling or numerical safeguards).
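To make the concern concrete, here is a minimal sketch of group-relative normalization. It is not the repository's code: the function name `group_normalized_advantages`, the `eps` value, and the singleton fallback (mean 0, std 1 for groups of size 1) are all assumptions, the last one chosen because it matches the "equivalent to the raw reward" behavior described above.

```python
import uuid
from collections import defaultdict
from statistics import mean, pstdev


def group_normalized_advantages(rewards, group_keys, eps=1e-6):
    """Hypothetical GRPO-style normalization: (r - group_mean) / (group_std + eps)."""
    groups = defaultdict(list)
    for reward, key in zip(rewards, group_keys):
        groups[key].append(reward)

    stats = {}
    for key, group_rewards in groups.items():
        if len(group_rewards) == 1:
            # Assumed fallback for singleton groups: no centering, unit scale.
            stats[key] = (0.0, 1.0)
        else:
            stats[key] = (mean(group_rewards), pstdev(group_rewards))

    return [
        (reward - stats[key][0]) / (stats[key][1] + eps)
        for reward, key in zip(rewards, group_keys)
    ]


rewards = [1.0, 0.0, 1.0, 0.0]

# One random UUID per sample: every group has size 1.
random_keys = [str(uuid.uuid4()) for _ in rewards]
adv_random = group_normalized_advantages(rewards, random_keys)

# One key per task: samples from the same task are normalized together.
shared_keys = ["task-a", "task-a", "task-b", "task-b"]
adv_shared = group_normalized_advantages(rewards, shared_keys)
```

Under these assumptions, `adv_random` matches the raw rewards up to the `eps` scaling, while `adv_shared` is centered within each task. Without the singleton fallback, a group of size 1 would have a mean equal to its own reward, so the advantage would be zero instead.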

Could you clarify whether this behavior is intentional?
If so, what was the design rationale for disabling group-relative normalization while still referring to the algorithm as GRPO?
If not, should the grouping key instead correspond to a shared task/prompt/rollout identifier so that advantages are normalized across multiple samples from the same task?
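For illustration, the alternative I have in mind could look roughly like the sketch below: one identifier is generated per prompt and shared by all rollouts of that prompt. The function `assign_group_ids` and the field names `group_id` and `rollout_index` are hypothetical and only meant to show the idea, not the project's actual data layout.

```python
import uuid


def assign_group_ids(prompts, n_rollouts):
    """Hypothetical helper: one group id per prompt, shared by all its rollouts."""
    samples = []
    for prompt in prompts:
        group_id = str(uuid.uuid4())  # generated once per prompt, not once per sample
        for rollout_index in range(n_rollouts):
            samples.append(
                {
                    "prompt": prompt,
                    "group_id": group_id,
                    "rollout_index": rollout_index,
                }
            )
    return samples


samples = assign_group_ids(["p0", "p1"], n_rollouts=4)
```

With this assignment, each group contains `n_rollouts` samples, so the group mean and standard deviation are computed over multiple rewards for the same prompt.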

Thanks in advance for any clarification.
