
Phase 1: Curriculum learning framework — shared reward library, base env, task subclasses, YAML configs, unified train/eval#7

Draft
Copilot wants to merge 2 commits into main from copilot/add-curriculum-learning-framework

Conversation


Copilot AI commented Apr 2, 2026

Multiple isolated task directories (go2/walk/, go2/standup_copilot/, go2/upside_down_recovery/, go2/upside_down_standup/) each duplicated Go2Env, its reward methods, and hardcoded configs, with no shared infrastructure and no way to chain policies across tasks. This PR adds a new pilla_rl/ package that consolidates everything into a clean, extensible curriculum learning architecture. Existing task directories are untouched.

Reward library (pilla_rl/rewards/)

  • 19 standalone pure functions extracted from all existing envs — no self, explicit tensor args, each returning a (num_envs,) tensor
  • REWARD_REGISTRY: a dict mapping string names to functions for dynamic lookup
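The pure-function style above can be sketched as follows. This is a hedged illustration, not the exact pilla_rl code: the function and registry names (tracking_lin_vel, REWARD_REGISTRY) come from this PR, but the precise signatures in pilla_rl/rewards/ may differ.

```python
# Sketch of a reward function in the pure-function style: no `self`,
# explicit tensor args, returns a (num_envs,) tensor.
import torch

def tracking_lin_vel(base_lin_vel: torch.Tensor,
                     commands: torch.Tensor,
                     tracking_sigma: float) -> torch.Tensor:
    """Exponential tracking of the commanded xy linear velocity."""
    lin_vel_error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1)
    return torch.exp(-lin_vel_error / tracking_sigma)

# String names -> functions, for dynamic lookup from YAML reward scales.
REWARD_REGISTRY = {
    "tracking_lin_vel": tracking_lin_vel,
}
```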

Base environment (pilla_rl/envs/base_env.py)

  • BaseQuadrupedEnv holds all duplicated logic: scene/robot setup, buffers, PD control, step()/reset()/reset_idx()
  • Rewards wired via registry + inspect.signature() auto-resolution — no more getattr(self, "_reward_" + name)
  • Three overridable hooks for per-task variation: _check_termination(), _reset_robot_pose(), _compute_observations()
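The registry wiring with inspect.signature() auto-resolution can be sketched like this. A minimal, self-contained sketch assuming a simple two-source lookup (env attribute, then reward_cfg key); the real _resolve_reward_args in base_env.py may map more names.

```python
# Sketch of signature-based reward argument resolution, replacing the
# old getattr(self, "_reward_" + name) pattern.
import inspect

REWARD_REGISTRY = {
    # Toy scalar reward for illustration; real entries take tensors.
    "lin_vel_z": lambda base_lin_vel: -base_lin_vel ** 2,
}

class EnvSketch:
    """Stand-in for BaseQuadrupedEnv with just the wiring shown."""
    def __init__(self):
        self.base_lin_vel = 2.0                  # scalar stand-in for a buffer
        self.reward_cfg = {"tracking_sigma": 0.25}

    def _resolve_reward_args(self, fn):
        # Map each parameter name to env state or a reward_cfg scalar.
        args = {}
        for name in inspect.signature(fn).parameters:
            if hasattr(self, name):              # env state buffer
                args[name] = getattr(self, name)
            elif name in self.reward_cfg:        # config scalar
                args[name] = self.reward_cfg[name]
            else:
                raise KeyError(f"cannot resolve reward argument {name!r}")
        return args

    def compute_reward(self, name, scale=1.0):
        fn = REWARD_REGISTRY[name]
        return scale * fn(**self._resolve_reward_args(fn))
```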

Task subclasses (pilla_rl/envs/)

| Class | Termination | Reset | Obs dims |
| --- | --- | --- | --- |
| WalkEnv | pitch/roll limit | upright default | 45 |
| StandupEnv | episode only | random euler ±60°, random DOF | 48 |
| RecoveryEnv | z < 0.05 or episode | 70% upside-down / 30% on-side | 48 |
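A thin subclass then only overrides the hooks it needs. A sketch, with BaseQuadrupedEnv stubbed down to the hook contract (the real class lives in pilla_rl/envs/base_env.py); the 0.05 m threshold is the value from the table above.

```python
# Sketch of per-task variation via the _check_termination() hook.
import torch

class BaseQuadrupedEnv:
    """Stub: only the termination hook contract is shown here."""
    def _check_termination(self) -> torch.Tensor:
        # default: terminate when pitch or roll exceeds the configured limit
        return (self.base_euler[:, :2].abs() > self.termination_limit).any(dim=1)

class RecoveryEnv(BaseQuadrupedEnv):
    def _check_termination(self) -> torch.Tensor:
        # recovery never terminates on orientation; only when the base
        # sinks below 5 cm (the episode-length cutoff is handled elsewhere)
        return self.base_pos[:, 2] < 0.05
```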

YAML configs + loader (pilla_rl/configs/, pilla_rl/config_loader.py)

  • Task configs for all 4 tasks faithfully matching original get_cfgs() values, plus env_class for dynamic dispatch
  • load_task_config(path) + instantiate_env(cfg, num_envs, show_viewer) via dynamic import
  • Example curriculum config recovery_to_walk.yaml with reward_overrides, command_overrides, load_from: "previous"
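A curriculum config along these lines might look as follows. This is a hypothetical sketch: the keys reward_overrides, command_overrides, and load_from: "previous" come from this PR, but the stage layout, task paths below the tasks/ directory, and all override values are illustrative only.

```yaml
# Illustrative shape of a curriculum config like recovery_to_walk.yaml.
stages:
  - task: pilla_rl/configs/tasks/walk.yaml
    load_from: "previous"          # warm-start from the prior stage's checkpoint
    reward_overrides:
      tracking_lin_vel: 1.5        # illustrative scale override
    command_overrides:
      lin_vel_x_range: [-1.0, 1.0] # illustrative command range
```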

Unified entry points

```shell
# Train any task from a single script
python -m pilla_rl.train --config pilla_rl/configs/tasks/walk.yaml --num_envs 4096

# Transfer learning / curriculum continuation
python -m pilla_rl.train \
    --config pilla_rl/configs/tasks/standup.yaml \
    --resume_from logs/go2-walking/model_5000.pt

# Evaluate
python -m pilla_rl.evaluate \
    --config pilla_rl/configs/tasks/walk.yaml \
    --checkpoint logs/go2-walking/model_5000.pt
```
  • train.py saves cfgs.pkl for backward compatibility with existing eval scripts
  • Uses rsl-rl-lib==2.3.3 consistently
Original prompt

Context

The pilla_rl repository currently has multiple isolated task directories (go2/walk/, go2/standup/, go2/standup_copilot/, go2/upside_down_recovery/, go2/upside_down_standup/) each with near-identical copies of Go2Env, go2_train.py, go2_eval.py, and go2_teleop.py. Reward functions are defined as _reward_* methods directly on each env class, and reward scales are hardcoded in get_cfgs() functions inside each go2_train.py. There is no mechanism to chain policies across tasks, no shared infrastructure, and no way to adjust rewards incrementally during training.

We need to build Phase 1 of a curriculum learning framework that consolidates all this into a clean, extensible architecture. The existing task directories and files MUST NOT be modified or deleted — they should continue working as-is. All new code goes into a new pilla_rl/ package directory at the repo root.

What to implement

1. Shared Reward Function Library (pilla_rl/rewards/reward_functions.py)

Create a centralized reward function library with all reward functions extracted from every existing task env as standalone pure functions (not methods). Each function should:

  • Accept explicit tensor arguments (e.g., base_lin_vel, commands, tracking_sigma) rather than accessing self
  • Return a per-environment reward tensor of shape (num_envs,)
  • Be stateless — no side effects

Include a REWARD_REGISTRY dict mapping string names to functions. Every reward function from these existing env files must be included:

From go2/walk/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/walk/go2_env.py):

  • tracking_lin_vel, tracking_ang_vel, lin_vel_z, action_rate, similar_to_default, base_height

From go2/standup_copilot/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/standup_copilot/go2_env.py):

  • upright_orientation, stability, stand_up_progress, recovery_effort, joint_regularization

From go2/upside_down_recovery/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/upside_down_recovery/go2_env.py):

  • recovery_progress, legs_not_in_air, energy_efficiency, forward_progress, minimize_base_roll

From go2/upside_down_standup/go2_env.py (https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/upside_down_standup/go2_env.py):

  • standup_height, complete_standup, height_when_upright

Also create pilla_rl/rewards/__init__.py.

2. Base Quadruped Environment (pilla_rl/envs/base_env.py)

Create a BaseQuadrupedEnv class that contains ALL the shared logic currently duplicated across env files:

  • Scene creation (Genesis sim setup, plane + robot URDF loading)
  • Buffer initialization (all the torch.zeros buffers)
  • PD control setup
  • step() method: action clipping, PD control, physics step, buffer updates (pos, quat, euler, lin_vel, ang_vel, gravity, dof_pos, dof_vel), command resampling, termination check, reward computation, observation computation
  • reset() and reset_idx() methods
  • get_observations(), get_privileged_observations()
  • Reward wiring via the REWARD_REGISTRY — use inspect.signature() to auto-resolve reward function arguments from env state (replacing the old getattr(self, "_reward_" + name) pattern)

The key difference from the old env: reward functions are looked up from REWARD_REGISTRY instead of being methods on the class. The _resolve_reward_args() method should inspect the function signature and map parameter names to env attributes:

  • base_lin_vel → self.base_lin_vel
  • base_ang_vel → self.base_ang_vel
  • base_pos → self.base_pos
  • base_euler → self.base_euler
  • commands → self.commands
  • dof_pos → self.dof_pos
  • dof_vel → self.dof_vel
  • actions → self.actions
  • last_actions → self.last_actions
  • default_dof_pos → self.default_dof_pos
  • tracking_sigma → self.reward_cfg["tracking_sigma"]
  • target_height / base_height_target → self.reward_cfg["base_height_target"]

The base class should have overridable methods for the parts that vary per task:

  • _check_termination() — default: pitch/roll limit termination
  • _reset_robot_pose(envs_idx) — default: reset to upright standing pose
  • _compute_observations() — default: 45-dim obs (ang_vel, gravity, commands, dof_pos, dof_vel, actions)

3. Task-Specific Environment Subclasses

Create thin subclasses in pilla_rl/envs/:

pilla_rl/envs/walk_env.py defines WalkEnv(BaseQuadrupedEnv):

  • Default termination (pitch/roll > configured limits)
  • Default reset (upright standing)
  • Default 45-dim observations

pilla_rl/envs/standup_env.py defines StandupEnv(BaseQuadrupedEnv):

  • Lenient termination (commented out pitch/roll, only episode length)
  • Random pose reset (random euler angles, random DOF positions from joint ...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Add Phase 1 of curriculum learning framework Phase 1: Curriculum learning framework — shared reward library, base env, task subclasses, YAML configs, unified train/eval Apr 2, 2026
Copilot AI requested a review from Macbull April 2, 2026 16:20
