Agent-Logs-Url: https://github.com/code-name-57/pilla_rl/sessions/c6726929-a02e-4608-b1da-2496b3e18e07
Co-authored-by: Macbull <11361002+Macbull@users.noreply.github.com>
Copilot changed the title from "[WIP] Add Phase 1 of curriculum learning framework" to "Phase 1: Curriculum learning framework — shared reward library, base env, task subclasses, YAML configs, unified train/eval" on Apr 2, 2026.
Multiple isolated task directories (`go2/walk/`, `go2/standup_copilot/`, `go2/upside_down_recovery/`, `go2/upside_down_standup/`) each duplicated `Go2Env`, reward methods, and hardcoded configs, with no shared infrastructure and no way to chain policies across tasks. This adds a new `pilla_rl/` package that consolidates everything into a clean, extensible curriculum learning architecture. Existing task directories are untouched.

**Reward library (`pilla_rl/rewards/`)**

- Standalone pure functions: no `self`, explicit tensor args, each returns a `(num_envs,)` tensor
- `REWARD_REGISTRY: dict` maps string names → functions for dynamic lookup

**Base environment (`pilla_rl/envs/base_env.py`)**

- `BaseQuadrupedEnv` holds all duplicated logic: scene/robot setup, buffers, PD control, `step()`/`reset()`/`reset_idx()`
- `inspect.signature()` auto-resolution of reward arguments — no more `getattr(self, "_reward_" + name)`
- Overridable hooks: `_check_termination()`, `_reset_robot_pose()`, `_compute_observations()`

**Task subclasses (`pilla_rl/envs/`)**

- `WalkEnv`, `StandupEnv`, `RecoveryEnv`

**YAML configs + loader (`pilla_rl/configs/`, `pilla_rl/config_loader.py`)**

- Configs carry the old `get_cfgs()` values, plus `env_class` for dynamic dispatch
- `load_task_config(path)` + `instantiate_env(cfg, num_envs, show_viewer)` via dynamic import
- `recovery_to_walk.yaml` with `reward_overrides`, `command_overrides`, and `load_from: "previous"`

**Unified entry points**

- `train.py` saves `cfgs.pkl` for backward compatibility with existing eval scripts
- Uses `rsl-rl-lib==2.3.3` consistently

**Original prompt**
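The config loading and dynamic-dispatch pieces above can be sketched as follows. This is a hedged sketch, not the repo's exact API: it assumes `env_class` in the YAML is a dotted import path, assumes PyYAML as the parser, and the `resolve_env_class` helper name is my own.

```python
import importlib

# Hedged sketch of the described loader. The YAML schema (a dotted
# env_class path plus cfg sections) and the resolve_env_class helper
# are assumptions, not the repo's exact implementation.
def load_task_config(path):
    import yaml  # PyYAML, assumed available for the YAML task configs
    with open(path) as f:
        return yaml.safe_load(f)

def resolve_env_class(dotted_path):
    # Dynamic dispatch: "pilla_rl.envs.walk_env.WalkEnv" -> class object.
    module_name, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

def instantiate_env(cfg, num_envs, show_viewer):
    env_cls = resolve_env_class(cfg["env_class"])
    return env_cls(cfg, num_envs=num_envs, show_viewer=show_viewer)
```

Resolving the class at runtime keeps the config file the single source of truth for which env a task uses, so adding a task needs no changes to `train.py`.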
**Context**

The `pilla_rl` repository currently has multiple isolated task directories (`go2/walk/`, `go2/standup/`, `go2/standup_copilot/`, `go2/upside_down_recovery/`, `go2/upside_down_standup/`), each with near-identical copies of `Go2Env`, `go2_train.py`, `go2_eval.py`, and `go2_teleop.py`. Reward functions are defined as `_reward_*` methods directly on each env class, and reward scales are hardcoded in `get_cfgs()` functions inside each `go2_train.py`. There is no mechanism to chain policies across tasks, no shared infrastructure, and no way to adjust rewards incrementally during training.

We need to build Phase 1 of a curriculum learning framework that consolidates all this into a clean, extensible architecture. The existing task directories and files MUST NOT be modified or deleted — they should continue working as-is. All new code goes into a new `pilla_rl/` package directory at the repo root.

**What to implement**
**1. Shared Reward Function Library (`pilla_rl/rewards/reward_functions.py`)**

Create a centralized reward function library with all reward functions extracted from every existing task env as standalone pure functions (not methods). Each function should:

- Take explicit tensor arguments (`base_lin_vel`, `commands`, `tracking_sigma`) rather than accessing `self`
- Return a `(num_envs,)` tensor

Include a `REWARD_REGISTRY` dict mapping string names to functions. Every reward function from these existing env files must be included:

- From [`go2/walk/go2_env.py`](https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/walk/go2_env.py): `tracking_lin_vel`, `tracking_ang_vel`, `lin_vel_z`, `action_rate`, `similar_to_default`, `base_height`
- From [`go2/standup_copilot/go2_env.py`](https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/standup_copilot/go2_env.py): `upright_orientation`, `stability`, `stand_up_progress`, `recovery_effort`, `joint_regularization`
- From [`go2/upside_down_recovery/go2_env.py`](https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/upside_down_recovery/go2_env.py): `recovery_progress`, `legs_not_in_air`, `energy_efficiency`, `forward_progress`, `minimize_base_roll`
- From [`go2/upside_down_standup/go2_env.py`](https://github.com/code-name-57/pilla_rl/blob/6ec378d971380f4ff253e419728716896b6dad29/go2/upside_down_standup/go2_env.py): `standup_height`, `complete_standup`, `height_when_upright`

Also create `pilla_rl/rewards/__init__.py`.

**2. Base Quadruped Environment (`pilla_rl/envs/base_env.py`)**
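To make the pure-function contract concrete, here is a hedged sketch of one registry entry. The exponential-kernel body follows the usual legged-robot velocity-tracking reward, not necessarily the repo's exact math.

```python
import torch

def tracking_lin_vel(base_lin_vel, commands, tracking_sigma):
    # Exponential kernel on the planar velocity-tracking error; all inputs
    # arrive as explicit tensor/scalar arguments, never via self.
    lin_vel_error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1
    )
    return torch.exp(-lin_vel_error / tracking_sigma)  # shape: (num_envs,)

# String name -> function, so YAML configs can reference rewards by name.
REWARD_REGISTRY = {
    "tracking_lin_vel": tracking_lin_vel,
}
```

Because every function in the registry shares this signature convention, the base env can assemble arguments generically instead of hardcoding per-reward plumbing.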
Create a `BaseQuadrupedEnv` class that contains ALL the shared logic currently duplicated across env files:

- Scene/robot setup and state buffers (`torch.zeros` buffers)
- `step()` method: action clipping, PD control, physics step, buffer updates (pos, quat, euler, lin_vel, ang_vel, gravity, dof_pos, dof_vel), command resampling, termination check, reward computation, observation computation
- `reset()` and `reset_idx()` methods
- `get_observations()`, `get_privileged_observations()`
- Reward computation driven by `REWARD_REGISTRY` — use `inspect.signature()` to auto-resolve reward function arguments from env state (replacing the old `getattr(self, "_reward_" + name)` pattern)

The key difference from the old env: reward functions are looked up from `REWARD_REGISTRY` instead of being methods on the class. The `_resolve_reward_args()` method should inspect the function signature and map parameter names to env attributes:

- `base_lin_vel` → `self.base_lin_vel`
- `base_ang_vel` → `self.base_ang_vel`
- `base_pos` → `self.base_pos`
- `base_euler` → `self.base_euler`
- `commands` → `self.commands`
- `dof_pos` → `self.dof_pos`
- `dof_vel` → `self.dof_vel`
- `actions` → `self.actions`
- `last_actions` → `self.last_actions`
- `default_dof_pos` → `self.default_dof_pos`
- `tracking_sigma` → `self.reward_cfg["tracking_sigma"]`
- `target_height` / `base_height_target` → `self.reward_cfg["base_height_target"]`
_check_termination()— default: pitch/roll limit termination_reset_robot_pose(envs_idx)— default: reset to upright standing pose_compute_observations()— default: 45-dim obs (ang_vel, gravity, commands, dof_pos, dof_vel, actions)3. Task-Specific Environment Subclasses
Create thin subclasses in
pilla_rl/envs/:pilla_rl/envs/walk_env.py—WalkEnv(BaseQuadrupedEnv):pilla_rl/envs/standup_env.py—StandupEnv(BaseQuadrupedEnv):This pull request was created from Copilot chat.