
MAPPO Implementation #721

Open

Harshul-18 wants to merge 49 commits into assume-framework:main from Harshul-18:with_PPO_and_DDPG

Conversation


@Harshul-18 Harshul-18 commented Jan 14, 2026

Description

Implemented the PPO algorithm in the ASSUME framework.

Checklist

  • Implement PPO-specific Actor and Critic classes in neural_network_architecture.py.
  • Add a PPO-specific Buffer class in buffer.py (a rough sketch of such a buffer follows this list).
  • Update the existing learning_role.py to work with PPO.
  • Configure the input data pipeline to be compatible with PPO.
  • Implement the MAPPO algorithm in mappo.py.
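
As a rough illustration of the on-policy buffer this checklist refers to, here is a minimal sketch of a PPO-style rollout buffer. The class layout, field names, and shapes are assumptions for illustration, not the PR's actual code:

```python
import numpy as np


class RolloutBuffer:
    """Minimal sketch of an on-policy buffer: filled once per rollout,
    consumed by a single policy update, then reset."""

    def __init__(self, size: int, n_agents: int, obs_dim: int, act_dim: int):
        self.obs = np.zeros((size, n_agents, obs_dim), dtype=np.float32)
        self.actions = np.zeros((size, n_agents, act_dim), dtype=np.float32)
        self.rewards = np.zeros((size, n_agents), dtype=np.float32)
        self.values = np.zeros((size, n_agents), dtype=np.float32)
        self.log_probs = np.zeros((size, n_agents), dtype=np.float32)
        self.pos = 0

    def add(self, obs, actions, rewards, values, log_probs) -> None:
        """Store one transition per agent at the current write position."""
        self.obs[self.pos] = obs
        self.actions[self.pos] = actions
        self.rewards[self.pos] = rewards
        self.values[self.pos] = values
        self.log_probs[self.pos] = log_probs
        self.pos += 1

    def reset(self) -> None:
        """On-policy data is discarded after each update."""
        self.pos = 0
```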

Additional Comments

  • The results of the already implemented algorithms might differ from previously benchmarked results due to changes in common hyper-parameters such as learning_rate, episodes, noise, constants, etc.

Results

| Scenario        | PPO | TD3 | DDPG |
|-----------------|-----|-----|------|
| example_02a     |     |     |      |
| example_02b     |     |     |      |
| example_02c     |     |     |      |
| example_02d     |     |     |      |
| example_02e     |     |     |      |
| custom_scenario |     |     |      |

self.save_critic_params(directory=f"{directory}/critics")
self.save_actor_params(directory=f"{directory}/actors")

def save_critic_params(self, directory: str) -> None:
Contributor

This code is completely redundant. I see why it is not in the BaseAlgorithm class, because not all algorithms have actors and critics, but does it make sense to have another parent class? Please check.
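
One way to read this suggestion (illustrative only; the class name ActorCriticAlgorithm and the stub bodies are hypothetical): introduce an intermediate parent between the base class and the actor-critic algorithms that owns the shared save logic.

```python
class BaseAlgorithm:
    """Stand-in for the framework's existing base class."""


class ActorCriticAlgorithm(BaseAlgorithm):
    """Hypothetical intermediate parent for algorithms that have both
    actors and critics (e.g. DDPG, TD3, MAPPO)."""

    def save_actor_params(self, directory: str) -> None:
        ...  # shared implementation, written once

    def save_critic_params(self, directory: str) -> None:
        ...  # shared implementation, written once

    def save_params(self, directory: str) -> None:
        # Defined here instead of being repeated in every subclass
        self.save_critic_params(directory=f"{directory}/critics")
        self.save_actor_params(directory=f"{directory}/actors")
```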


def update_policy(self) -> None:
    """
    Update actor and critic networks using the DDPG algorithm.
Contributor

@kim-mskw kim-mskw Jan 14, 2026

Make this a self-explanatory docstring that does not depend on the docstring of TD3.
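
For illustration, a self-contained docstring could read as follows (the wording is a suggestion, not the PR's actual text):

```python
def update_policy(self) -> None:
    """
    Update the actor and critic networks using the DDPG algorithm.

    Samples a batch of transitions from the replay buffer, updates the
    critic by minimizing the TD error against the target networks'
    bootstrapped Q-value, updates the actor via the deterministic
    policy gradient (maximizing the critic's Q-estimate of the actor's
    actions), and finally soft-updates both target networks.
    """
```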

Comment thread assume/reinforcement_learning/buffer.py Outdated
    value: np.ndarray,
    log_prob: np.ndarray,
) -> None:
    """Add a transition to the buffer."""
Contributor

Not a proper docstring.
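
A fuller docstring for add() could, for example, document each parameter; the extra parameters, shapes, and descriptions below are illustrative assumptions based on the names visible in the diff:

```python
import numpy as np


def add(
    self,
    obs: np.ndarray,
    actions: np.ndarray,
    rewards: np.ndarray,
    value: np.ndarray,
    log_prob: np.ndarray,
) -> None:
    """
    Add one transition per agent to the rollout buffer.

    Args:
        obs: Observations at the current step, shape (n_agents, obs_dim).
        actions: Actions taken by each agent, shape (n_agents, act_dim).
        rewards: Reward received by each agent, shape (n_agents,).
        value: Critic value estimates, later used for advantage
            estimation, shape (n_agents,).
        log_prob: Log-probabilities of the sampled actions under the
            current policy, needed for the PPO ratio, shape (n_agents,).
    """
```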

Comment thread assume/reinforcement_learning/buffer.py Outdated
    start_idx += batch_size

def _get_samples(self, indices: np.ndarray) -> RolloutBufferSamples:
    """Convert numpy arrays to torch tensors for given indices."""
Contributor

Also not a proper docstring. In the original implementation of the PPO the docstrings are good; use these! https://github.com/assume-framework/assume/pull/462/files#diff-e7cf9fcee75b300570d21e1894f6aa672a59d0faa37d8afdcf9611d0ff28fab7R498

self.all_rewards = defaultdict(lambda: defaultdict(list))
self.all_regrets = defaultdict(lambda: defaultdict(list))
self.all_profits = defaultdict(lambda: defaultdict(list))
# PPO algorithm specific caches for on-policy learning
Contributor

The learning_role knows the algorithm; can we make these things algorithm-specific?
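
One illustrative way to do this (all names below are hypothetical, not the framework's actual API): let each algorithm class declare the caches it needs and have the learning role build only those.

```python
from collections import defaultdict


class MAPPO:
    # Hypothetical declaration: only the on-policy algorithm asks for
    # the PPO-specific values/log_probs caches
    cache_names = ["obs", "actions", "rewards", "values", "log_probs"]


class LearningRole:
    def __init__(self, algorithm):
        # Caches are created from the algorithm's declaration, so TD3
        # and DDPG never carry PPO-only state
        self.caches = {
            name: defaultdict(lambda: defaultdict(list))
            for name in algorithm.cache_names
        }


role = LearningRole(MAPPO())
```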

device
)

if cache["values"].get(timestamp):
Contributor

Why the if here? Is there a case where there are no values in the cache, and if so, why?

Author

I wrote it earlier to check the execution of the other files and the algorithm. But yes, it is not required here, and it is incorrect to explicitly add zeros or any other data if for some reason the values aren't added. The if..else.. can be removed, and values can be handled in the same manner as rewards_data, actions_data, etc.

# Add to rollout buffer
if self.rollout_buffer is not None:
    self.rollout_buffer.add(
        obs=to_numpy(obs_data),
Contributor

Why is this necessary here and not for the replay buffer? We already convert the data into numpy at the end of transform_buffer_data.

self.rl_eval = inter_episodic_data["all_eval"]
self.avg_rewards = inter_episodic_data["avg_all_eval"]
self.buffer = inter_episodic_data["buffer"]
self.rollout_buffer = inter_episodic_data["rollout_buffer"]
Contributor

Only load one buffer; get rid of the redundancy and make this algorithm-specific.
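
A minimal sketch of what that could look like (the rl_algorithm_name attribute and the key names here are assumptions):

```python
def load_inter_episodic_buffer(self, inter_episodic_data: dict) -> None:
    # Load exactly one buffer, chosen by the algorithm, instead of
    # carrying "buffer" and "rollout_buffer" around in parallel
    key = "rollout_buffer" if self.rl_algorithm_name == "mappo" else "buffer"
    self.buffer = inter_episodic_data[key]
```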

Comment thread assume/reinforcement_learning/neural_network_architecture.py


class ActorPPO(nn.Module):
    activation_function_limit = {
Contributor

This is used across many classes and is redundant. I would suggest moving it to the learning utils.
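
For illustration, the mapping could live once in learning_utils and be imported everywhere; the entries below are placeholders, not the PR's actual values:

```python
# learning_utils.py -- hypothetical shared home for the mapping
import torch.nn as nn

ACTIVATION_FUNCTION_LIMIT = {
    "tanh": {"function": nn.Tanh, "min": -1.0, "max": 1.0},
    "sigmoid": {"function": nn.Sigmoid, "min": 0.0, "max": 1.0},
}
```

ActorPPO and the other actor classes would then import this single definition instead of each redefining it.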

Comment thread assume/reinforcement_learning/neural_network_architecture.py
Comment on lines +286 to +309
    curr_action = noise

    # For PPO, store dummy log_prob and value during initial exploration
    if self.algorithm == "mappo":
        self._last_log_prob = th.tensor(0.0, device=self.device)
        self._last_value = th.tensor(0.0, device=self.device)

else:
    # if we are not in the initial exploration phase we choose the action
    # with the actor neural net and add noise to the action
    curr_action = self.actor(next_observation).detach()
    noise = self.action_noise.noise(
        device=self.device, dtype=self.float_type
    )
    curr_action += noise

    # Check if we're using the PPO algorithm
    if self.algorithm == "mappo":
        # PPO: use get_action_and_log_prob for proper stochastic sampling
        curr_action, log_prob = self.actor.get_action_and_log_prob(
            next_observation.unsqueeze(0)
        )
        curr_action = curr_action.squeeze(0).detach()
        self._last_log_prob = log_prob.squeeze(0).detach()

        # Get value estimate from critic (if available)
        if hasattr(self.learning_role, "critics") and self.unit_id in self.learning_role.critics:
            critic = self.learning_role.critics[self.unit_id]
            self._last_value = critic(next_observation.unsqueeze(0)).squeeze().detach()
        else:
            self._last_value = th.tensor(0.0, device=self.device)

        # PPO uses a stochastic policy, so no external noise is needed
        noise = th.zeros_like(curr_action, dtype=self.float_type)
Contributor

PPO should not have an initial experience mode, or should it? Since it is a rollout buffer, it will be gone after one update anyhow. Do you know papers with PPO and initial experience? Can we integrate this better? If you look into the original implementation, the get_actions function was moved to the algorithm file and hence can have all the algorithmic features, which is cleaner!

Author

Yeah, PPO should not have an initial experience mode, theoretically, based on my knowledge.

But when I was running the training, the rewards graph was showing a decreasing trend almost all of the time. So I thought to add an initial experience phase where random actions are performed (which were mostly rewarded positively), hoping that the first update might then improve from the start; it did improve slightly on the custom scenario example. The original idea I had was to always save the K samples with the maximum rewards, where K would be less than the buffer size and a hyper-parameter, so that newer updates improve while keeping the best performances in mind. (Yet to do research on it, but I think that deviates from PPO.)

For now, for this use case, the initial experience mode could be set to 0.
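
For context on the stochastic-sampling path discussed above, here is a minimal sketch of what a get_action_and_log_prob method typically does for a PPO actor, assuming a diagonal Gaussian policy; the network layout and sizes are illustrative assumptions:

```python
import torch as th
import torch.nn as nn
from torch.distributions import Normal


class ActorPPO(nn.Module):
    """Illustrative diagonal-Gaussian PPO actor."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        # State-independent log standard deviation, a common PPO choice
        self.log_std = nn.Parameter(th.zeros(act_dim))

    def get_action_and_log_prob(self, obs: th.Tensor):
        mu = self.mu_net(obs)
        dist = Normal(mu, self.log_std.exp())
        action = dist.sample()
        # Sum log-probs over action dimensions for a diagonal Gaussian
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob
```

Because sampling from this distribution is already stochastic, no external action noise is needed, which is why the snippet above zeroes out the noise on the mappo branch.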

- suggest subclass of A2C algorithms so that we can outsource the loading and saving functions to the base algorithm class and inherit from it
@kim-mskw kim-mskw self-requested a review January 14, 2026 12:05
Contributor

@kim-mskw kim-mskw left a comment

Thanks for the initial implementation!

I added multiple comments about code redundancy and handling in the framework. I pushed non-working versions of what I mean so that you get a clear picture of how it could be improved. Many of these thoughts have already been made in https://github.com/assume-framework/assume/pull/462/files#diff-d9acf7632f3702baead73bae50128bce95b87863f5575f2731c87efbeacbc508, which is a working version of the PPO on an old branch. Please revisit and further revise your implementation.

Also, PPO is known to be less sample efficient, so before comparing them, please let it train longer.

@kim-mskw kim-mskw self-assigned this Feb 4, 2026
@Harshul-18 Harshul-18 requested a review from kim-mskw April 28, 2026 07:05
Harshul-18 added 2 commits May 8, 2026 01:55
…t erased due to updates before in example_02a/config.yaml tiny configuration
Comment thread assume/common/base.py
- do gradient steps and episodes_collecting_initial_experience tests actually in OffPolicy algorithm config
Comment thread assume/reinforcement_learning/algorithms/base_algorithm.py Outdated
kim-mskw added 2 commits May 8, 2026 10:14
- make default behavior of the A2CAlgorithm class independent of off-/on-policy to avoid mistakes
- use the "uses_target_networks" flag for actor and critic creation as well; otherwise they were created but never used?
- same consistency enforced for extract_policy
self.buffer.add(
    obs=transform_buffer_data(cache["obs"], device, self.rl_strats.keys()),
    actions=transform_buffer_data(
        cache["actions"], device, self.rl_strats.keys()
Contributor

I do not agree with this implementation. We talked about this, and I told you to use the implementation on the main branch, which never sorts rl_strats.keys(); the keys are used as a single source of truth for unit order.
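
For reference, the reason no sorting is needed (a small self-contained illustration; the unit ids are made up): Python dicts preserve insertion order, so iterating rl_strats.keys() already yields one stable unit order, whereas sorted() would silently reorder the buffer columns relative to the order in which the data was written.

```python
rl_strats = {"pp_unit_03": object(), "pp_unit_01": object()}

# dict preserves insertion order: ["pp_unit_03", "pp_unit_01"]
unit_ids = list(rl_strats.keys())  # single source of truth

# sorted() would reorder to ["pp_unit_01", "pp_unit_03"] and
# desynchronize buffer columns from the write-time unit order
assert unit_ids != sorted(unit_ids)
```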

logger.debug("Updating Policy")

# Keeping strategy order aligned with rollout-buffer column order.
sorted_unit_ids = sorted(self.learning_role.rl_strats.keys())
Contributor

Same issue with the sorting as before.

x = F.relu(layer(x))

x = self.q1_layers[-1](x) # Output layer (no activation)
x = nn.Sequential(*self.q1_layers)(x)
Contributor

I am a bit hesitant about this change, because this PR is huge as it is and this seems to be a style question. I will revert it; we can do these changes in a separate PR.

…the order in learning role as single source of truth
return np.zeros((0, 0, 1), dtype=np.float32)

# Get sorted lists of units and timestamps (for consistent ordering)
all_times = sorted(nested_dict.keys())
Contributor

This is a super important function. Why did you want to change that?
We must use the key order with which the data comes in here and not sort it at any given place. We talked about taking the approach from the main branch. I commented on it multiple times; please explain why you did it that way.

Returns:
th.Tensor: Shape (n_timesteps, n_powerplants, feature_dim)
Shape (n_timesteps, n_powerplants, feature_dim).
"""
Contributor

Why did you suggest this default behavior?

raise ValueError(
    "Error, while transforming RL data for buffer: No data found to determine feature dimension"
)
feature_dim = 1
Contributor

This may be sensible for PPO, but not for all other algorithms; this is not robust error handling.
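
One illustrative way to address this (purely a sketch; the function and variable names are assumptions): keep the hard error for the off-policy algorithms and apply the fallback only where the on-policy pipeline can tolerate it.

```python
def determine_feature_dim(sample, algorithm: str) -> int:
    if sample:
        return len(sample)
    if algorithm == "mappo":
        # On-policy fallback: an empty rollout can be tolerated here
        return 1
    raise ValueError(
        "Error while transforming RL data for buffer: "
        "no data found to determine feature dimension"
    )
```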

