
Phase 1: Curriculum learning framework — shared reward library, base env, task subclasses, YAML configs, unified train/eval#7

Draft
Copilot wants to merge 2 commits into main from copilot/add-curriculum-learning-framework

Conversation


Copilot AI commented Apr 2, 2026

Multiple isolated task directories (go2/walk/, go2/standup_copilot/, go2/upside_down_recovery/, go2/upside_down_standup/) each duplicated Go2Env, its reward methods, and hardcoded configs, with no shared infrastructure and no way to chain policies across tasks. This PR adds a new pilla_rl/ package that consolidates everything into a clean, extensible curriculum learning architecture. Existing task directories are untouched.

Reward library (pilla_rl/rewards/)

  • 19 standalone pure functions extracted from all existing envs — no self, explicit tensor args, each returning a (num_envs,) tensor
  • REWARD_REGISTRY: a dict mapping string names to functions for dynamic lookup
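The pure-function style above can be sketched as follows. This is a hedged illustration, not the exact pilla_rl code: the function and registry names (tracking_lin_vel, REWARD_REGISTRY) come from this PR, but the precise signatures in pilla_rl/rewards/ may differ.

```python
# Sketch of a reward function in the pure-function style: no `self`,
# explicit tensor args, returns a (num_envs,) tensor.
import torch

def tracking_lin_vel(base_lin_vel: torch.Tensor,
                     commands: torch.Tensor,
                     tracking_sigma: float) -> torch.Tensor:
    """Exponential tracking of the commanded xy linear velocity."""
    lin_vel_error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1)
    return torch.exp(-lin_vel_error / tracking_sigma)

# String names -> functions, for dynamic lookup from YAML reward scales.
REWARD_REGISTRY = {
    "tracking_lin_vel": tracking_lin_vel,
}
```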

Base environment (pilla_rl/envs/base_env.py)

  • BaseQuadrupedEnv holds all duplicated logic: scene/robot setup, buffers, PD control, step()/reset()/reset_idx()
  • Rewards wired via registry + inspect.signature() auto-resolution — no more getattr(self, "_reward_" + name)
  • Three overridable hooks for per-task variation: _check_termination(), _reset_robot_pose(), _compute_observations()
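The registry wiring with inspect.signature() auto-resolution can be sketched like this. A minimal, self-contained sketch assuming a simple two-source lookup (env attribute, then reward_cfg key); the real _resolve_reward_args in base_env.py may map more names.

```python
# Sketch of signature-based reward argument resolution, replacing the
# old getattr(self, "_reward_" + name) pattern.
import inspect

REWARD_REGISTRY = {
    # Toy scalar reward for illustration; real entries take tensors.
    "lin_vel_z": lambda base_lin_vel: -base_lin_vel ** 2,
}

class EnvSketch:
    """Stand-in for BaseQuadrupedEnv with just the wiring shown."""
    def __init__(self):
        self.base_lin_vel = 2.0                  # scalar stand-in for a buffer
        self.reward_cfg = {"tracking_sigma": 0.25}

    def _resolve_reward_args(self, fn):
        # Map each parameter name to env state or a reward_cfg scalar.
        args = {}
        for name in inspect.signature(fn).parameters:
            if hasattr(self, name):              # env state buffer
                args[name] = getattr(self, name)
            elif name in self.reward_cfg:        # config scalar
                args[name] = self.reward_cfg[name]
            else:
                raise KeyError(f"cannot resolve reward argument {name!r}")
        return args

    def compute_reward(self, name, scale=1.0):
        fn = REWARD_REGISTRY[name]
        return scale * fn(**self._resolve_reward_args(fn))
```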

Task subclasses (pilla_rl/envs/)

| Class | Termination | Reset | Obs dims |
| --- | --- | --- | --- |
| WalkEnv | pitch/roll limit | upright default | 45 |
| StandupEnv | episode only | random euler ±60°, random DOF | 48 |
| RecoveryEnv | z < 0.05 or episode | 70% upside-down / 30% on-side | 48 |
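A thin subclass then only overrides the hooks it needs. A sketch, with BaseQuadrupedEnv stubbed down to the hook contract (the real class lives in pilla_rl/envs/base_env.py); the 0.05 m threshold is the value from the table above.

```python
# Sketch of per-task variation via the _check_termination() hook.
import torch

class BaseQuadrupedEnv:
    """Stub: only the termination hook contract is shown here."""
    def _check_termination(self) -> torch.Tensor:
        # default: terminate when pitch or roll exceeds the configured limit
        return (self.base_euler[:, :2].abs() > self.termination_limit).any(dim=1)

class RecoveryEnv(BaseQuadrupedEnv):
    def _check_termination(self) -> torch.Tensor:
        # recovery never terminates on orientation; only when the base
        # sinks below 5 cm (the episode-length cutoff is handled elsewhere)
        return self.base_pos[:, 2] < 0.05
```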

YAML configs + loader (pilla_rl/configs/, pilla_rl/config_loader.py)

  • Task configs for all 4 tasks faithfully matching original get_cfgs() values, plus env_class for dynamic dispatch
  • load_task_config(path) + instantiate_env(cfg, num_envs, show_viewer) via dynamic import
  • Example curriculum config recovery_to_walk.yaml with reward_overrides, command_overrides, load_from: "previous"
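A curriculum config along these lines might look as follows. This is a hypothetical sketch: the keys reward_overrides, command_overrides, and load_from: "previous" come from this PR, but the stage layout, task paths below the tasks/ directory, and all override values are illustrative only.

```yaml
# Illustrative shape of a curriculum config like recovery_to_walk.yaml.
stages:
  - task: pilla_rl/configs/tasks/walk.yaml
    load_from: "previous"          # warm-start from the prior stage's checkpoint
    reward_overrides:
      tracking_lin_vel: 1.5        # illustrative scale override
    command_overrides:
      lin_vel_x_range: [-1.0, 1.0] # illustrative command range
```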

Unified entry points

```shell
# Train any task from a single script
python -m pilla_rl.train --config pilla_rl/configs/tasks/walk.yaml --num_envs 4096

# Transfer learning / curriculum continuation
python -m pilla_rl.train \
    --config pilla_rl/configs/tasks/standup.yaml \
    --resume_from logs/go2-walking/model_5000.pt

# Evaluate
python -m pilla_rl.evaluate \
    --config pilla_rl/configs/tasks/walk.yaml \
    --checkpoint logs/go2-walking/model_5000.pt
```
  • train.py saves cfgs.pkl for backward compatibility with existing eval scripts
  • Uses rsl-rl-lib==2.3.3 consistently
Original prompt

Context

The pilla_rl repository currently has multiple isolated task directories (go2/walk/, go2/standup/, go2/standup_copilot/, go2/upside_down_recovery/, go2/upside_down_standup/) each with near-identical copies of Go2Env, go2_train.py, go2_eval.py, and go2_teleop.py. Reward functions are defined as _reward_* methods directly on each env class, and reward scales are hardcoded in get_cfgs() functions inside each go2_train.py. There is no mechanism to chain policies across tasks, no shared infrastructure, and no way to adjust rewards incrementally during training.

We need to build Phase 1 of a curriculum learning framework that consolidates all this into a clean, extensible architecture. The existing task directories and files MUST NOT be modified or deleted — they should continue working as-is. All new code goes into a new pilla_rl/ package directory at the repo root.

What to implement

1. Shared Reward Function Library (pilla_rl/rewards/reward_functions.py)

Create a centralized reward function library with all reward functions extracted from every existing task env as standalone pure functions (not methods). Each function should:

  • Accept explicit tensor arguments (e.g., base_lin_vel, commands, tracking_sigma) rather than accessing self
  • Return a per-environment reward tensor of shape (num_envs,)
  • Be stateless — no side effects

Include a REWARD_REGISTRY dict mapping string names to functions. Every reward function from these existing env files must be included:

From go2/walk/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/walk/go2_env.py):

  • tracking_lin_vel, tracking_ang_vel, lin_vel_z, action_rate, similar_to_default, base_height

From go2/standup_copilot/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/standup_copilot/go2_env.py):

  • upright_orientation, stability, stand_up_progress, recovery_effort, joint_regularization

From go2/upside_down_recovery/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/upside_down_recovery/go2_env.py):

  • recovery_progress, legs_not_in_air, energy_efficiency, forward_progress, minimize_base_roll

From go2/upside_down_standup/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/upside_down_standup/go2_env.py):

  • standup_height, complete_standup, height_when_upright

Also create pilla_rl/rewards/__init__.py.

2. Base Quadruped Environment (pilla_rl/envs/base_env.py)

Create a BaseQuadrupedEnv class that contains ALL the shared logic currently duplicated across env files:

  • Scene creation (Genesis sim setup, plane + robot URDF loading)
  • Buffer initialization (all the torch.zeros buffers)
  • PD control setup
  • step() method: action clipping, PD control, physics step, buffer updates (pos, quat, euler, lin_vel, ang_vel, gravity, dof_pos, dof_vel), command resampling, termination check, reward computation, observation computation
  • reset() and reset_idx() methods
  • get_observations(), get_privileged_observations()
  • Reward wiring via the REWARD_REGISTRY — use inspect.signature() to auto-resolve reward function arguments from env state (replacing the old getattr(self, "_reward_" + name) pattern)

The key difference from the old env: reward functions are looked up from REWARD_REGISTRY instead of being methods on the class. The _resolve_reward_args() method should inspect the function signature and map parameter names to env attributes:

  • base_lin_vel → self.base_lin_vel
  • base_ang_vel → self.base_ang_vel
  • base_pos → self.base_pos
  • base_euler → self.base_euler
  • commands → self.commands
  • dof_pos → self.dof_pos
  • dof_vel → self.dof_vel
  • actions → self.actions
  • last_actions → self.last_actions
  • default_dof_pos → self.default_dof_pos
  • tracking_sigma → self.reward_cfg["tracking_sigma"]
  • target_height / base_height_target → self.reward_cfg["base_height_target"]

The base class should have overridable methods for the parts that vary per task:

  • _check_termination() — default: pitch/roll limit termination
  • _reset_robot_pose(envs_idx) — default: reset to upright standing pose
  • _compute_observations() — default: 45-dim obs (ang_vel, gravity, commands, dof_pos, dof_vel, actions)

3. Task-Specific Environment Subclasses

Create thin subclasses in pilla_rl/envs/:

pilla_rl/envs/walk_env.py defines WalkEnv(BaseQuadrupedEnv):

  • Default termination (pitch/roll > configured limits)
  • Default reset (upright standing)
  • Default 45-dim observations

pilla_rl/envs/standup_env.py defines StandupEnv(BaseQuadrupedEnv):

  • Lenient termination (commented out pitch/roll, only episode length)
  • Random pose reset (random euler angles, random DOF positions from joint ...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Add Phase 1 of curriculum learning framework Phase 1: Curriculum learning framework — shared reward library, base env, task subclasses, YAML configs, unified train/eval Apr 2, 2026
Copilot AI requested a review from Macbull April 2, 2026 16:20
