MAPPO Implementation #721
Conversation
….py ADDED RolloutBuffer-in-buffer.py
self.save_critic_params(directory=f"{directory}/critics")
self.save_actor_params(directory=f"{directory}/actors")

def save_critic_params(self, directory: str) -> None:
This code is completely redundant. I see why it is not in the BaseAlgorithm class, since not all algorithms have actors and critics, but does it make sense to have another parent class? Please check.
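A minimal sketch of what such an intermediate parent class could look like, so that saving actors and critics lives in one place. Class and method names here are illustrative, not taken from the ASSUME codebase:

```python
# Illustrative sketch only: class and attribute names are assumptions, not ASSUME API.
import os
import torch as th


class ActorCriticAlgorithm:
    """Hypothetical shared base class for algorithms that own actors and critics."""

    def __init__(self):
        self.actors = {}   # unit_id -> actor network
        self.critics = {}  # unit_id -> critic network

    def save_params(self, directory: str) -> None:
        self.save_critic_params(directory=f"{directory}/critics")
        self.save_actor_params(directory=f"{directory}/actors")

    def save_critic_params(self, directory: str) -> None:
        os.makedirs(directory, exist_ok=True)
        for unit_id, critic in self.critics.items():
            th.save(critic.state_dict(), f"{directory}/critic_{unit_id}.pt")

    def save_actor_params(self, directory: str) -> None:
        os.makedirs(directory, exist_ok=True)
        for unit_id, actor in self.actors.items():
            th.save(actor.state_dict(), f"{directory}/actor_{unit_id}.pt")
```

Off-policy algorithms such as MATD3 and on-policy ones such as MAPPO could both inherit from it, while BaseAlgorithm stays free of actor/critic assumptions.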
def update_policy(self) -> None:
    """
    Update actor and critic networks using the DDPG algorithm.
Make this a self-explanatory docstring that does not depend on the docstring of TD3.
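A possible self-contained wording; this is only a suggestion and does not reference the TD3 implementation:

```python
# Suggested docstring wording only; the method body is elided.
def update_policy(self) -> None:
    """
    Update the actor and critic networks using the DDPG algorithm.

    A batch of transitions is sampled from the replay buffer. The critic is
    trained by minimizing the mean squared error between its Q-value estimate
    and the TD target computed from the target actor and target critic. The
    actor is then updated by maximizing the critic's Q-value for the actions
    it proposes, and the target networks are moved towards the online networks
    with a soft (Polyak) update.
    """
    ...
```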
    value: np.ndarray,
    log_prob: np.ndarray
) -> None:
    """Add a transition to the buffer."""
start_idx += batch_size

def _get_samples(self, indices: np.ndarray) -> RolloutBufferSamples:
    """Convert numpy arrays to torch tensors for given indices."""
Also not a proper docstring. In the original implementation of the PPO the docstrings are good, use those! https://github.com/assume-framework/assume/pull/462/files#diff-e7cf9fcee75b300570d21e1894f6aa672a59d0faa37d8afdcf9611d0ff28fab7R498
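A sketch of a fuller docstring in that spirit; the exact fields of RolloutBufferSamples listed here are assumptions:

```python
# Suggested docstring only; the body and the RolloutBufferSamples fields are assumptions.
import numpy as np


def _get_samples(self, indices: np.ndarray) -> "RolloutBufferSamples":
    """
    Retrieve the buffer entries at the given indices as torch tensors.

    Args:
        indices (np.ndarray): Flat indices into the stored rollout data.

    Returns:
        RolloutBufferSamples: Observations, actions, old values, old log
        probabilities, advantages and returns for the selected indices,
        converted to tensors on the buffer's device.
    """
    ...
```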
self.all_rewards = defaultdict(lambda: defaultdict(list))
self.all_regrets = defaultdict(lambda: defaultdict(list))
self.all_profits = defaultdict(lambda: defaultdict(list))
# PPO algorithm specific caches for on-policy learning
The learning_role knows the algorithm, can we make these things algorithm-specific?
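For illustration, a sketch of keeping the PPO-only caches algorithm-specific inside the learning role; the attribute self.rl_algorithm_name and the method name are assumptions:

```python
# Sketch only: self.rl_algorithm_name and the method name are assumptions.
from collections import defaultdict


def create_learning_caches(self) -> None:
    # caches that every algorithm needs
    self.all_rewards = defaultdict(lambda: defaultdict(list))
    self.all_regrets = defaultdict(lambda: defaultdict(list))
    self.all_profits = defaultdict(lambda: defaultdict(list))

    # on-policy caches only exist when the configured algorithm needs them
    if self.rl_algorithm_name == "mappo":
        self.all_values = defaultdict(lambda: defaultdict(list))
        self.all_log_probs = defaultdict(lambda: defaultdict(list))
```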
    device
)

if cache["values"].get(timestamp):
Why the if here? Is there a case where there are no values in the cache, and if so, why?
I wrote it earlier to check the execution of the other files and the algorithm. But yes, it is not required here, and it is incorrect to explicitly add zeros or any other data if for some reason the values aren't added. The if/else can be removed, and the values can be used in the same manner as rewards_data, actions_data, etc.
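A minimal sketch of the agreed handling: read the value and log-prob caches exactly like the reward cache, with no if/else fallback. The cache layout (dict keyed by timestamp) follows the snippet above; the helper name is illustrative:

```python
# Illustrative helper; cache layout follows the snippet above, names are assumptions.
def collect_step_data(cache: dict, timestamp):
    rewards_data = cache["rewards"][timestamp]
    actions_data = cache["actions"][timestamp]
    values_data = cache["values"][timestamp]      # no zero-filling if missing
    log_probs_data = cache["log_probs"][timestamp]
    return rewards_data, actions_data, values_data, log_probs_data
```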
# Add to rollout buffer
if self.rollout_buffer is not None:
    self.rollout_buffer.add(
        obs = to_numpy(obs_data),
Why is this necessary here and not for the replay buffer? We already convert the data into numpy at the end of transform_buffer_data.
self.rl_eval = inter_episodic_data["all_eval"]
self.avg_rewards = inter_episodic_data["avg_all_eval"]
self.buffer = inter_episodic_data["buffer"]
self.rollout_buffer = inter_episodic_data["rollout_buffer"]
Only load one buffer, get rid of the redundancy, and make this algorithm-specific.
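A sketch of loading exactly one buffer, selected by the configured algorithm; the rl_algorithm attribute and its on_policy flag are assumptions, the dict keys follow the snippet above:

```python
# Sketch only: self.rl_algorithm and the on_policy flag are assumptions.
def load_inter_episodic_data(self, inter_episodic_data: dict) -> None:
    self.rl_eval = inter_episodic_data["all_eval"]
    self.avg_rewards = inter_episodic_data["avg_all_eval"]

    # one buffer only: a rollout buffer for on-policy algorithms such as MAPPO,
    # the replay buffer otherwise
    if self.rl_algorithm.on_policy:
        self.buffer = inter_episodic_data["rollout_buffer"]
    else:
        self.buffer = inter_episodic_data["buffer"]
```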
class ActorPPO(nn.Module):
    activation_function_limit = {
This is used across many classes and is redundant. I would suggest moving it to the learning utils.
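A sketch of hoisting the mapping into the learning utils module; the module path, the constant name, and the limits listed here are assumptions for illustration only:

```python
# learning_utils.py -- sketch: constant name and the limits are assumptions
ACTIVATION_FUNCTION_LIMIT = {
    "softsign": (-1.0, 1.0),
    "tanh": (-1.0, 1.0),
    "sigmoid": (0.0, 1.0),
}

# neural_network_architecture.py -- import the shared mapping instead of
# redefining activation_function_limit as a class attribute in ActorPPO etc.
# from assume.reinforcement_learning.learning_utils import ACTIVATION_FUNCTION_LIMIT
```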
    curr_action = noise

    # For PPO, store dummy log_prob and value during initial exploration
    if self.algorithm == "mappo":
        self._last_log_prob = th.tensor(0.0, device=self.device)
        self._last_value = th.tensor(0.0, device=self.device)

else:
    # if we are not in the initial exploration phase we chose the action with the actor neural net
    # and add noise to the action
    curr_action = self.actor(next_observation).detach()
    noise = self.action_noise.noise(
        device=self.device, dtype=self.float_type
    )
    curr_action += noise
    # Check if we're using PPO algorithm
    if self.algorithm == "mappo":
        # PPO: use get_action_and_log_prob for proper stochastic sampling
        curr_action, log_prob = self.actor.get_action_and_log_prob(next_observation.unsqueeze(0))
        curr_action = curr_action.squeeze(0).detach()
        self._last_log_prob = log_prob.squeeze(0).detach()

        # Get value estimate from critic (if available)
        if hasattr(self.learning_role, 'critics') and self.unit_id in self.learning_role.critics:
            critic = self.learning_role.critics[self.unit_id]
            self._last_value = critic(next_observation.unsqueeze(0)).squeeze().detach()
        else:
            self._last_value = th.tensor(0.0, device=self.device)

        # PPO uses stochastic policy, no external noise needed
        noise = th.zeros_like(curr_action, dtype=self.float_type)
PPO should not have an initial experience mode, or should it? Since it is a rollout buffer, it will be gone after one update anyhow. Do you know papers with PPO and initial experience? Can we integrate this better? If you look into the original implementation, the get_actions function was moved to the algorithm file and hence can have all the algorithmic features, which is cleaner (see the sketch after this exchange)!
Yeah, theoretically PPO should not have an initial experience mode, based on my knowledge.
But when I was running the training, the rewards graph was showing a decreasing trend almost all of the time. So I thought to add the initial experience phase where random actions are performed (which were mostly rewarded positively), hoping the first update would improve things from the start; it did improve slightly on the custom scenario example. The original idea I had was to always keep the K samples with the maximum rewards, where K would be smaller than the buffer size and a hyperparameter, so that newer updates improve while keeping the best performances in mind. (Yet to do research on it, but I think that deviates from PPO.)
For now, at this point of the use case, the initial experience mode can be set to 0.
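A rough sketch of the suggestion to move get_actions into the algorithm class, so that the PPO-specific sampling (and any initial-experience handling) lives in one place. Signatures and attribute names are assumptions; only get_action_and_log_prob and the critics dict appear in the diff above:

```python
# Hypothetical MAPPO-owned action selection, kept out of the bidding strategy.
import torch as th


def get_actions(self, strategy, next_observation: th.Tensor):
    actor = strategy.actor
    critic = self.learning_role.critics[strategy.unit_id]

    # PPO samples from the policy distribution, so no external action noise is added
    action, log_prob = actor.get_action_and_log_prob(next_observation.unsqueeze(0))
    value = critic(next_observation.unsqueeze(0))

    return (
        action.squeeze(0).detach(),
        log_prob.squeeze(0).detach(),
        value.squeeze().detach(),
    )
```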
- suggest subclass of A2C algorithms so that we can outsource the loading and saving functions to the base algorithm class and inherit from it
kim-mskw
left a comment
Thanks for the initial implementation!
I added multiple comments about code redundancy and handling in the framework. I pushed non-working versions of what I mean so that you get a clear picture of how it could be improved. Many of these thoughts have already been addressed in https://github.com/assume-framework/assume/pull/462/files#diff-d9acf7632f3702baead73bae50128bce95b87863f5575f2731c87efbeacbc508, which is a working version of the PPO on an old branch. Please revisit and further revise your implementation.
Also, PPO is known to be less sample efficient, so before comparing them, please let it train longer.
….py ADDED RolloutBuffer-in-buffer.py
…r continue_learning
…eters passing structure
…t erased due to updates before in example_02a/config.yaml tiny configuration
- do gradient steps and episode_collecting_initial_expereince tests actually in OffPolicy algorithm config
- make default behavior of A2CAlgorithm class independent of off-/on-policy to avoid mistakes
- use "uses_target_networks" flag for actor and critic creation as well, otherwise were created but never used? - same conistency enforced for extract_policy
self.buffer.add(
    obs=transform_buffer_data(cache["obs"], device, self.rl_strats.keys()),
    actions=transform_buffer_data(
        cache["actions"], device, self.rl_strats.keys()
I do not agree with this implementation. We talked about this, and I told you to use the implementation on the main branch, which never sorts rl_strats.keys(); they are used as a single source of truth for unit order.
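For illustration, a sketch of the intended pattern where the insertion order of rl_strats is the single source of truth and is never sorted. Names reuse the snippet above; the helper itself is hypothetical:

```python
# Hypothetical helper; transform_buffer_data and rl_strats come from the diff above.
def add_to_buffer(self, cache: dict, device) -> None:
    # rl_strats insertion order defines the unit/column order everywhere; no sorted()
    unit_ids = list(self.rl_strats.keys())

    self.buffer.add(
        obs=transform_buffer_data(cache["obs"], device, unit_ids),
        actions=transform_buffer_data(cache["actions"], device, unit_ids),
    )
```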
| logger.debug("Updating Policy") | ||
|
|
||
| # Keeping strategy order aligned with rollout-buffer column order. | ||
| sorted_unit_ids = sorted(self.learning_role.rl_strats.keys()) |
Same issue with sorting as before.
refactor convert_to_tensor function
    x = F.relu(layer(x))

x = self.q1_layers[-1](x)  # Output layer (no activation)
x = nn.Sequential(*self.q1_layers)(x)
I am a bit hesitant about this change, because this PR is huge as it is and this seems to be a style question. I will revert it; we can do these changes in a separate PR.
…the order in learning role as single source of truth
    return np.zeros((0, 0, 1), dtype=np.float32)

# Get sorted lists of units and timestamps (for consistent ordering)
all_times = sorted(nested_dict.keys())
This is a super important function. Why did you want to change that?
We must use the key order in which they come in here and not sort them anywhere. We talked about taking the approach from the main branch. I commented on this multiple times; please explain why you did it that way.
Returns:
    th.Tensor: Shape (n_timesteps, n_powerplants, feature_dim)
    Shape (n_timesteps, n_powerplants, feature_dim).
"""
Why did you suggest this default behavior?
raise ValueError(
    "Error, while transforming RL data for buffer: No data found to determine feature dimension"
)
feature_dim = 1
This is sensible for the PPO then, but not for all other algorithms; this is not robust error handling.
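A sketch of what more robust handling could look like: raise for algorithms that cannot tolerate missing data and fall back only where that is explicitly allowed. The function name and the allow_empty_features flag are assumptions:

```python
# Sketch only: function name and the allow_empty_features flag are assumptions.
import numpy as np


def infer_feature_dim(nested_dict: dict, allow_empty_features: bool = False) -> int:
    """Determine the feature dimension from the first available entry."""
    for per_unit in nested_dict.values():
        for value in per_unit.values():
            return np.asarray(value).reshape(-1).shape[0]

    if allow_empty_features:
        # e.g. PPO caches that legitimately start out empty
        return 1
    raise ValueError(
        "Error while transforming RL data for buffer: no data found to determine feature dimension"
    )
```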
Description
Implemented the PPO algorithm in the ASSUME framework.
Checklist
- neural_network_architecture.py
- buffer.py
- learning_role.py to work with PPO
- mappo.py
Additional Comments
Results