MAPPO Implementation #721
Conversation
….py ADDED RolloutBuffer-in-buffer.py
self.save_critic_params(directory=f"{directory}/critics")
self.save_actor_params(directory=f"{directory}/actors")

def save_critic_params(self, directory: str) -> None:
This code is completely redundant. I see why it is not in the BaseAlgorithm class, since not all algorithms have actors and critics, but does it make sense to have another parent class? Please check.
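A minimal sketch of what such an intermediate parent class could look like, so that saving actors and critics lives in one place. Class and method names here are illustrative, not taken from the ASSUME codebase:

```python
# Illustrative sketch only: class and attribute names are assumptions, not ASSUME API.
import os
import torch as th


class ActorCriticAlgorithm:
    """Hypothetical shared base class for algorithms that own actors and critics."""

    def __init__(self):
        self.actors = {}   # unit_id -> actor network
        self.critics = {}  # unit_id -> critic network

    def save_params(self, directory: str) -> None:
        self.save_critic_params(directory=f"{directory}/critics")
        self.save_actor_params(directory=f"{directory}/actors")

    def save_critic_params(self, directory: str) -> None:
        os.makedirs(directory, exist_ok=True)
        for unit_id, critic in self.critics.items():
            th.save(critic.state_dict(), f"{directory}/critic_{unit_id}.pt")

    def save_actor_params(self, directory: str) -> None:
        os.makedirs(directory, exist_ok=True)
        for unit_id, actor in self.actors.items():
            th.save(actor.state_dict(), f"{directory}/actor_{unit_id}.pt")
```

Off-policy algorithms such as MATD3 and on-policy ones such as MAPPO could both inherit from it, while BaseAlgorithm stays free of actor/critic assumptions.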
def update_policy(self) -> None:
    """
    Update actor and critic networks using the DDPG algorithm.
Make this a self-explanatory docstring that does not depend on the docstring of TD3.
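A possible self-contained wording; this is only a suggestion and does not reference the TD3 implementation:

```python
# Suggested docstring wording only; the method body is elided.
def update_policy(self) -> None:
    """
    Update the actor and critic networks using the DDPG algorithm.

    A batch of transitions is sampled from the replay buffer. The critic is
    trained by minimizing the mean squared error between its Q-value estimate
    and the TD target computed from the target actor and target critic. The
    actor is then updated by maximizing the critic's Q-value for the actions
    it proposes, and the target networks are moved towards the online networks
    with a soft (Polyak) update.
    """
    ...
```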
    value: np.ndarray,
    log_prob: np.ndarray
) -> None:
    """Add a transition to the buffer."""
start_idx += batch_size

def _get_samples(self, indices: np.ndarray) -> RolloutBufferSamples:
    """Convert numpy arrays to torch tensors for given indices."""
Also not a proper docstring. In the original implementation of the PPO the docstrings are good, use those! https://github.com/assume-framework/assume/pull/462/files#diff-e7cf9fcee75b300570d21e1894f6aa672a59d0faa37d8afdcf9611d0ff28fab7R498
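A sketch of a fuller docstring in that spirit; the exact fields of RolloutBufferSamples listed here are assumptions:

```python
# Suggested docstring only; the body and the RolloutBufferSamples fields are assumptions.
import numpy as np


def _get_samples(self, indices: np.ndarray) -> "RolloutBufferSamples":
    """
    Retrieve the buffer entries at the given indices as torch tensors.

    Args:
        indices (np.ndarray): Flat indices into the stored rollout data.

    Returns:
        RolloutBufferSamples: Observations, actions, old values, old log
        probabilities, advantages and returns for the selected indices,
        converted to tensors on the buffer's device.
    """
    ...
```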
self.all_rewards = defaultdict(lambda: defaultdict(list))
self.all_regrets = defaultdict(lambda: defaultdict(list))
self.all_profits = defaultdict(lambda: defaultdict(list))
# PPO algorithm specific caches for on-policy learning
The learning_role knows the algorithm, can we make these things algorithm-specific?
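For illustration, a sketch of keeping the PPO-only caches algorithm-specific inside the learning role; the attribute self.rl_algorithm_name and the method name are assumptions:

```python
# Sketch only: self.rl_algorithm_name and the method name are assumptions.
from collections import defaultdict


def create_learning_caches(self) -> None:
    # caches that every algorithm needs
    self.all_rewards = defaultdict(lambda: defaultdict(list))
    self.all_regrets = defaultdict(lambda: defaultdict(list))
    self.all_profits = defaultdict(lambda: defaultdict(list))

    # on-policy caches only exist when the configured algorithm needs them
    if self.rl_algorithm_name == "mappo":
        self.all_values = defaultdict(lambda: defaultdict(list))
        self.all_log_probs = defaultdict(lambda: defaultdict(list))
```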
    device
)

if cache["values"].get(timestamp):
Why the if here? Is there a case where there are no values in the cache, and if so, why?
I wrote it earlier to check the execution of the other files and the algorithm. But yes, it is not required here, and it is incorrect to explicitly add zeros or any other data if for some reason the values aren't added. The if/else can be removed, and the values can be used in the same manner as rewards_data, actions_data, etc.
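A minimal sketch of the agreed handling: read the value and log-prob caches exactly like the reward cache, with no if/else fallback. The cache layout (dict keyed by timestamp) follows the snippet above; the helper name is illustrative:

```python
# Illustrative helper; cache layout follows the snippet above, names are assumptions.
def collect_step_data(cache: dict, timestamp):
    rewards_data = cache["rewards"][timestamp]
    actions_data = cache["actions"][timestamp]
    values_data = cache["values"][timestamp]      # no zero-filling if missing
    log_probs_data = cache["log_probs"][timestamp]
    return rewards_data, actions_data, values_data, log_probs_data
```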
# Add to rollout buffer
if self.rollout_buffer is not None:
    self.rollout_buffer.add(
        obs = to_numpy(obs_data),
Why is this necessary here and not for the replay buffer? We already convert the data into numpy at the end of transform_buffer_data.
self.rl_eval = inter_episodic_data["all_eval"]
self.avg_rewards = inter_episodic_data["avg_all_eval"]
self.buffer = inter_episodic_data["buffer"]
self.rollout_buffer = inter_episodic_data["rollout_buffer"]
Only load one buffer, get rid of the redundancy, and make this algorithm-specific.
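A sketch of loading exactly one buffer, selected by the configured algorithm; the rl_algorithm attribute and its on_policy flag are assumptions, the dict keys follow the snippet above:

```python
# Sketch only: self.rl_algorithm and the on_policy flag are assumptions.
def load_inter_episodic_data(self, inter_episodic_data: dict) -> None:
    self.rl_eval = inter_episodic_data["all_eval"]
    self.avg_rewards = inter_episodic_data["avg_all_eval"]

    # one buffer only: a rollout buffer for on-policy algorithms such as MAPPO,
    # the replay buffer otherwise
    if self.rl_algorithm.on_policy:
        self.buffer = inter_episodic_data["rollout_buffer"]
    else:
        self.buffer = inter_episodic_data["buffer"]
```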
class ActorPPO(nn.Module):
    activation_function_limit = {
This is used across many classes and is redundant. I would suggest moving it to the learning utils.
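A sketch of hoisting the mapping into the learning utils module; the module path, the constant name, and the limits listed here are assumptions for illustration only:

```python
# learning_utils.py -- sketch: constant name and the limits are assumptions
ACTIVATION_FUNCTION_LIMIT = {
    "softsign": (-1.0, 1.0),
    "tanh": (-1.0, 1.0),
    "sigmoid": (0.0, 1.0),
}

# neural_network_architecture.py -- import the shared mapping instead of
# redefining activation_function_limit as a class attribute in ActorPPO etc.
# from assume.reinforcement_learning.learning_utils import ACTIVATION_FUNCTION_LIMIT
```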
    curr_action = noise

    # For PPO, store dummy log_prob and value during initial exploration
    if self.algorithm == "mappo":
        self._last_log_prob = th.tensor(0.0, device=self.device)
        self._last_value = th.tensor(0.0, device=self.device)

else:
    # if we are not in the initial exploration phase we chose the action with the actor neural net
    # and add noise to the action
    curr_action = self.actor(next_observation).detach()
    noise = self.action_noise.noise(
        device=self.device, dtype=self.float_type
    )
    curr_action += noise
    # Check if we're using PPO algorithm
    if self.algorithm == "mappo":
        # PPO: use get_action_and_log_prob for proper stochastic sampling
        curr_action, log_prob = self.actor.get_action_and_log_prob(next_observation.unsqueeze(0))
        curr_action = curr_action.squeeze(0).detach()
        self._last_log_prob = log_prob.squeeze(0).detach()

        # Get value estimate from critic (if available)
        if hasattr(self.learning_role, 'critics') and self.unit_id in self.learning_role.critics:
            critic = self.learning_role.critics[self.unit_id]
            self._last_value = critic(next_observation.unsqueeze(0)).squeeze().detach()
        else:
            self._last_value = th.tensor(0.0, device=self.device)

        # PPO uses stochastic policy, no external noise needed
        noise = th.zeros_like(curr_action, dtype=self.float_type)
PPO should not have an initial experience mode, or should it? Since it is a rollout buffer, it will be gone after one update anyhow. Do you know papers with PPO and initial experience? Can we integrate this better? If you look into the original implementation, the get_actions function was moved to the algorithm file and hence can have all the algorithmic features, which is cleaner (see the sketch after this exchange)!
Yeah, theoretically PPO should not have an initial experience mode, based on my knowledge.
But when I was running the training, the rewards graph was showing a decreasing trend almost all of the time. So I thought to add the initial experience phase where random actions are performed (which were mostly rewarded positively), hoping the first update would improve things from the start; it did improve slightly on the custom scenario example. The original idea I had was to always keep the K samples with the maximum rewards, where K would be smaller than the buffer size and a hyperparameter, so that newer updates improve while keeping the best performances in mind. (Yet to do research on it, but I think that deviates from PPO.)
For now, at this point of the use case, the initial experience mode can be set to 0.
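A rough sketch of the suggestion to move get_actions into the algorithm class, so that the PPO-specific sampling (and any initial-experience handling) lives in one place. Signatures and attribute names are assumptions; only get_action_and_log_prob and the critics dict appear in the diff above:

```python
# Hypothetical MAPPO-owned action selection, kept out of the bidding strategy.
import torch as th


def get_actions(self, strategy, next_observation: th.Tensor):
    actor = strategy.actor
    critic = self.learning_role.critics[strategy.unit_id]

    # PPO samples from the policy distribution, so no external action noise is added
    action, log_prob = actor.get_action_and_log_prob(next_observation.unsqueeze(0))
    value = critic(next_observation.unsqueeze(0))

    return (
        action.squeeze(0).detach(),
        log_prob.squeeze(0).detach(),
        value.squeeze().detach(),
    )
```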
- suggest subclass of A2C algorithms so that we can outsource the loading and saving functions to the base algorithm class and inherit from it
kim-mskw
left a comment
Thanks for the initial implementation!
I added multiple comments about code redundancy and handling in the framework. I pushed non-working versions of what I mean so that you get a clear picture of how it could be improved. Many of these thoughts have already been addressed in https://github.com/assume-framework/assume/pull/462/files#diff-d9acf7632f3702baead73bae50128bce95b87863f5575f2731c87efbeacbc508, which is a working version of the PPO on an old branch. Please revisit and further revise your implementation.
Also, PPO is known to be less sample efficient, so before comparing them, please let it train longer.
….py ADDED RolloutBuffer-in-buffer.py
…r continue_learning
…eters passing structure
…t erased due to updates before in example_02a/config.yaml tiny configuration
- do gradient steps and episode_collecting_initial_expereince tests actually in OffPolicy algorithm config
- make default behavior of A2CAlgorithm class independent of off-/on-policy to avoid mistakes
- use "uses_target_networks" flag for actor and critic creation as well, otherwise were created but never used? - same conistency enforced for extract_policy
self.buffer.add(
    obs=transform_buffer_data(cache["obs"], device, self.rl_strats.keys()),
    actions=transform_buffer_data(
        cache["actions"], device, self.rl_strats.keys()
I do not agree with this implementation. We talked about this, and I told you to use the implementation on the main branch, which never sorts rl_strats.keys(); they are used as a single source of truth for unit order.
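For illustration, a sketch of the intended pattern where the insertion order of rl_strats is the single source of truth and is never sorted. Names reuse the snippet above; the helper itself is hypothetical:

```python
# Hypothetical helper; transform_buffer_data and rl_strats come from the diff above.
def add_to_buffer(self, cache: dict, device) -> None:
    # rl_strats insertion order defines the unit/column order everywhere; no sorted()
    unit_ids = list(self.rl_strats.keys())

    self.buffer.add(
        obs=transform_buffer_data(cache["obs"], device, unit_ids),
        actions=transform_buffer_data(cache["actions"], device, unit_ids),
    )
```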
| logger.debug("Updating Policy") | ||
|
|
||
| # Keeping strategy order aligned with rollout-buffer column order. | ||
| sorted_unit_ids = sorted(self.learning_role.rl_strats.keys()) |
Same issue with sorting as before.
refactor convert_to_tensor function
    x = F.relu(layer(x))

x = self.q1_layers[-1](x)  # Output layer (no activation)
x = nn.Sequential(*self.q1_layers)(x)
I am a bit hesitant about this change, because this PR is huge as it is and this seems to be a style question. I will revert it; we can do these changes in a separate PR.
…the order in learning role as single source of truth
    return np.zeros((0, 0, 1), dtype=np.float32)

# Get sorted lists of units and timestamps (for consistent ordering)
all_times = sorted(nested_dict.keys())
This is a super important function. Why did you want to change that?
We must use the key order in which they come in here and not sort them anywhere. We talked about taking the approach from the main branch. I commented on this multiple times; please explain why you did it that way.
Returns:
    th.Tensor: Shape (n_timesteps, n_powerplants, feature_dim)
    Shape (n_timesteps, n_powerplants, feature_dim).
"""
Why did you suggest this default behavior?
raise ValueError(
    "Error, while transforming RL data for buffer: No data found to determine feature dimension"
)
feature_dim = 1
This is sensible for the PPO then, but not for all other algorithms; this is not robust error handling.
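A sketch of what more robust handling could look like: raise for algorithms that cannot tolerate missing data and fall back only where that is explicitly allowed. The function name and the allow_empty_features flag are assumptions:

```python
# Sketch only: function name and the allow_empty_features flag are assumptions.
import numpy as np


def infer_feature_dim(nested_dict: dict, allow_empty_features: bool = False) -> int:
    """Determine the feature dimension from the first available entry."""
    for per_unit in nested_dict.values():
        for value in per_unit.values():
            return np.asarray(value).reshape(-1).shape[0]

    if allow_empty_features:
        # e.g. PPO caches that legitimately start out empty
        return 1
    raise ValueError(
        "Error while transforming RL data for buffer: no data found to determine feature dimension"
    )
```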
Description
Implemented the PPO algorithm in the ASSUME framework.
Checklist
- neural_network_architecture.py
- buffer.py
- learning_role.py to work with PPO
- mappo.py
Additional Comments
Results