Link to the Repository: https://github.com/shaginhekvs/Synthetic_Data_Hackathon

Link to Medium article: https://medium.com/@shaginhekvs/avengers-rl-small-specialists-take-on-a-giant-3fa036402697
Welcome to the Synthetic Data Hackathon repository, where we explore whether specialized, compact agents coordinating together can compete with massive, untrained models! Our project combines custom environments, modern training techniques, and a thrilling "Avengers RL" concept that pits tiny trained experts against colossal but naive giants.
- Ali Alami Idrissi
- Kazuma Choji
- Keshav Singh
We developed a suite of custom reinforcement learning environments using the flexible OpenEnv framework, designed to test agents across varied control challenges:
- CartPole: Classic balancing act requiring precise feedback control
- Gym Environments: Integration with Gymnasium's full ecosystem, including classic control tasks
- MountainCar: Momentum-based challenge demanding strategic energy management
- LunarLander: Complex 2D aerial navigation with physics-based landing mechanics
The Super Complex Environment: But why stop at single tasks? We created a groundbreaking Sequential Environment that seamlessly combines all four environments into one epic, multi-phase challenge. This composite environment tests true versatility, requiring agents to transition between completely different control paradigms within a single episode.
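In outline, the phase-sequencing logic of such a composite environment might look like the sketch below. The class and method names are illustrative, not the repo's actual code, and `StubEnv` stands in for real Gym sub-environments so the example is self-contained:

```python
class StubEnv:
    """Tiny stand-in sub-environment that terminates after a fixed number of steps."""
    def __init__(self, n_steps):
        self.n_steps = n_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return 0.0, 1.0, self.t >= self.n_steps, {}


class SequentialEnv:
    """Illustrative composite environment: runs each sub-environment to
    completion, then hands control to the next one within the same episode.
    Sub-envs only need Gym-style reset()/step()."""

    def __init__(self, sub_envs):
        self.sub_envs = sub_envs  # list of (name, env) pairs, in phase order
        self.phase = 0

    def reset(self):
        self.phase = 0
        name, env = self.sub_envs[0]
        obs = env.reset()
        return obs, {"phase": name}

    def step(self, action):
        name, env = self.sub_envs[self.phase]
        obs, reward, done, info = env.step(action)
        if done and self.phase < len(self.sub_envs) - 1:
            # Phase transition: reset the next sub-env instead of ending the episode.
            self.phase += 1
            name, next_env = self.sub_envs[self.phase]
            obs = next_env.reset()
            done = False
        return obs, reward, done, dict(info, phase=name)
```

The episode only terminates when the final sub-environment finishes, so a single run spans all control paradigms.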
We harnessed the power of Group Relative Policy Optimization (GRPO) to train a 3B-parameter LLAMA 3.2 model on each individual environment. GRPO's efficiency in optimizing language models for RL tasks allowed us to achieve mastery-level performance across:
- CartPole: Zero to hero in stabilization techniques
- LunarLander: Precision maneuvers and soft touchdowns
- MountainCar: Momentum mastery through intelligent hill ascensions
- BipedalWalker: Efficient gait patterns for rough terrain navigation
Each model was trained until surpassing the standard success thresholds, resulting in specialized experts ready to tackle their domains.
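GRPO scores each sampled rollout relative to the others in its group, so an episode's return has to be collapsed into a single scalar reward. A minimal sketch of the kind of scoring function one might use here; the normalization, success bonus, and constants are illustrative assumptions, not the repo's actual reward code:

```python
def episode_reward(total_return, success_threshold, max_return):
    """Map an episode's raw return to a normalized score in [0, 1],
    plus a bonus for clearing the environment's success threshold.
    All constants are illustrative."""
    score = max(0.0, min(1.0, total_return / max_return))
    if total_return >= success_threshold:
        score += 0.5  # success bonus encourages clearing the bar, not just scoring
    return score
```

Rollouts that merely accumulate reward and rollouts that actually solve the task are separated by the bonus, which gives GRPO's group-relative comparison a clear signal.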
The climax arrives in our mixed Sequential Environment, where our trained 3B-parameter specialists face off against massive OSS 20B models. These open-source behemoths bring raw computational might but lack domain-specific training. While we haven't achieved clear superiority yet, our experiments show promising results with comparable performance in specific domains:
- Specialized Coordination: Our ensemble of small experts shows comparable performance to the untrained giants in some tasks
- Efficiency Gains: Order-of-magnitude smaller models with potential for future advantage
- Versatility: Demonstrates capability for switching between radically different control tasks, setting foundation for future improvements
The results suggest that intelligence through specialization and coordination has strong potential to compete with brute computational force, though more work is needed to achieve clear superiority.
Avengers RL explores whether specialized agents with simple coordination can challenge massive models.
We train separate 3B-parameter LLAMA models on individual environments (CartPole, MountainCar, LunarLander, BipedalWalker). During inference, a deterministic router switches between specialists based on the current environment phase in our multi-task Sequential Environment. This "Avengers" approach pits the coordinated specialists against massive untrained 20B-parameter OSS models.
Our heroes hail from standard Gymnasium environments, each developing unique superpowers:
| Hero | Home Environment | Core Skill |
|---|---|---|
| CartPole Man | CartPole-v1 | Lightning-fast balance and stabilization |
| Walker | BipedalWalker-v3 | Efficient locomotion over treacherous terrain |
| Jumper | MountainCarContinuous-v0 | Momentum generation and gap-crossing mastery |
| Lander Girl | LunarLanderContinuous-v2 | Precision thrust control and feather-light landings |
Each agent graduates only after reliably clearing their environment's success thresholds. Architecturally flexible, they can be neural policies or 3B-parameter LLAMA adapters fine-tuned for their specialty.
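For reference, the "graduation" check can be written against Gymnasium's registered reward thresholds for these tasks. The values below match the thresholds Gymnasium registers for each environment at the time of writing, but are worth verifying against the installed version:

```python
# Gymnasium's registered reward thresholds for each hero's home environment.
SUCCESS_THRESHOLDS = {
    "CartPole-v1": 475.0,
    "BipedalWalker-v3": 300.0,
    "MountainCarContinuous-v0": 90.0,
    "LunarLanderContinuous-v2": 200.0,
}

def has_graduated(env_id, mean_return):
    """A specialist 'graduates' once its mean evaluation return clears the threshold."""
    return mean_return >= SUCCESS_THRESHOLDS[env_id]
```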
Step into the Endgame-v0: our masterfully crafted composite environment that stitches multiple Gym tasks into one continuous, heart-pounding episode. No smooth transitions here; this is pure controlled chaos, where success demands adapting to wildly different control paradigms!
The environment unfolds across dramatic phases:
- Balance & Approach - Maintain pole stability (CartPole) while advancing toward objectives
- Bridge Run - Navigate uneven terrain (BipedalWalker) with adaptive gaits
- Gap Jump - Build momentum and vault valleys (MountainCar)
- Final Landing - Control precision descent and achieve perfect touchdowns (LunarLander)
Our smart wrapper handles seamless phase transitions, environment resets, and reward standardization. Observations cleverly encode active phases, timers, and normalized sensor data, while action spaces transparently forward to current sub-environments.
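The observation layout described above might look like the following sketch. The field sizes, ordering, and the one-hot phase encoding are assumptions for illustration, not the wrapper's actual layout:

```python
import numpy as np

N_PHASES = 4     # CartPole, BipedalWalker, MountainCar, LunarLander
MAX_OBS_DIM = 24 # large enough for the widest sub-env observation (BipedalWalker)

def encode_observation(phase_idx, phase_timer, sub_obs):
    """Pack one-hot phase flags, a normalized phase timer, and the
    zero-padded sub-environment observation into one fixed-size vector."""
    one_hot = np.zeros(N_PHASES)
    one_hot[phase_idx] = 1.0
    timer = np.array([phase_timer])      # e.g. fraction of the phase elapsed
    padded = np.zeros(MAX_OBS_DIM)
    padded[: len(sub_obs)] = sub_obs     # sub-env sensor data, zero-padded
    return np.concatenate([one_hot, timer, padded])
```

A fixed-size vector like this lets every specialist (and the router) consume the same observation shape regardless of which sub-environment is active.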
At the heart of Avengers RL is a simple deterministic router logic that switches between specialists based on the current phase of the environment:
- CartPole Phase → CartPole specialist takes control
- MountainCar Phase → MountainCar specialist takes control
- LunarLander Phase → LunarLander specialist takes control
- BipedalWalker Phase → BipedalWalker specialist takes control
The router reads the observation vector to determine which environment phase is active (encoded in the observation) and routes control to the appropriate specialist. No learning is involved: pure rule-based switching!
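The rule-based routing reduces to a lookup keyed on the phase flags at the front of the observation. This sketch assumes a one-hot phase encoding and illustrative specialist names; the repo's actual encoding may differ:

```python
# Hypothetical specialist registry: phase name -> adapter identifier.
SPECIALISTS = {
    "cartpole": "cartpole_adapter",
    "bipedalwalker": "bipedalwalker_adapter",
    "mountaincar": "mountaincar_adapter",
    "lunarlander": "lunarlander_adapter",
}

# Assumed phase order matching the one-hot flags at the front of the observation.
PHASE_NAMES = ["cartpole", "bipedalwalker", "mountaincar", "lunarlander"]

def route(observation):
    """Pick the specialist for the active phase. No learning involved:
    the argmax over the one-hot phase flags is the whole routing policy."""
    phase_idx = max(range(len(PHASE_NAMES)), key=lambda i: observation[i])
    return SPECIALISTS[PHASE_NAMES[phase_idx]]
```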
Our implementation keeps things simple and focused:
- Specialist Training - Train each specialist independently on their environment
- Sequential Environment - Combine all environments into composite multi-phase challenges
- Deterministic Routing - Hard-coded switching logic based on environment phase
- Ensemble Execution - Run all specialists together through the router, with a single base model using lightweight LoRA adapters for each specialist
We measure victory across comprehensive metrics:
- Success Rate: Percentage of complete multi-phase runs
- Phase Scores: Domain-specific mean rewards for each hero
- Switch Frequency: How often the team changes leaders
- Energy Efficiency: Cumulative control effort optimization
- Computational Efficiency: Inference cost vs. baseline models
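Some of these metrics can be computed directly from per-step logs. A sketch under assumed conventions: each run is a list of `(phase, reward, specialist)` tuples, and a "complete" run is defined here as one that reaches the final landing phase (both conventions are illustrative):

```python
def summarize(episodes):
    """Compute success rate, per-phase mean reward, and switch frequency
    from per-step logs. episodes: list of runs; each run is a list of
    (phase, reward, specialist) tuples, one per step."""
    n = len(episodes)
    completed = sum(1 for ep in episodes if ep and ep[-1][0] == "lunarlander")
    switches = sum(
        sum(1 for a, b in zip(ep, ep[1:]) if a[2] != b[2]) for ep in episodes
    )
    phase_rewards = {}
    for ep in episodes:
        for phase, reward, _ in ep:
            phase_rewards.setdefault(phase, []).append(reward)
    return {
        "success_rate": completed / n,
        "mean_switches": switches / n,
        "phase_mean_reward": {p: sum(r) / len(r) for p, r in phase_rewards.items()},
    }
```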
Enter Thanos: Our baseline adversary is a colossal 20B-parameter untrained quantized model: pure raw potential without skill or training. He represents scale without wisdom!
Our Avengers ensemble: A handful of 3B-parameter trained specialists plus deterministic router logic.
The Results: While our achievements show promise, with performance matching the baseline in some domains, clear superiority over OSS models remains elusive. Our experiments demonstrate the viability of the specialization + coordination approach but highlight areas for future improvement. Work in progress: Specialization + Coordination ≈ Current Goliath, with potential ≫ Brute Size.
Avengers RL is both a playful tribute and a serious prototype in hierarchical RL. It demonstrates that multiple narrow experts, each competent in isolation, can be orchestrated to rival monolithic models, though more research is needed to achieve decisive victories.
This work creates concrete testbeds for studying mixture-of-experts routing, model efficiency, and emergent collaborative intelligence.
Tagline: Avengers didn't win Endgame V1, Thanos too strong... but they'll be back stronger for V2!
Before running any training notebooks, ensure the corresponding environment servers are running:
CartPole Environment (Port 8030):
start_cartpole_server.sh # or ./OpenEnv/scripts/start_cartpole_server.sh
# Expected port: 8030 - serves CartPole environment for training

MountainCar Environment (Port 8050):
start_mountaincar_server.sh # or ./OpenEnv/scripts/start_mountaincar_server.sh
# Expected port: 8050 - serves MountainCarContinuous environment for training

LunarLander Environment (Port 8090):
start_lunarlander_server.sh # or ./OpenEnv/scripts/start_lunarlander_server.sh
# Expected port: 8090 - serves LunarLanderContinuous environment for training

Sequential Environment (Port 8060):
start_sequential_server.sh # or ./OpenEnv/scripts/start_sequential_server.sh
# Expected port: 8060 - serves combined sequential environment for multi-experiment testing

Note: Start the environment servers in a separate terminal session before launching training notebooks. Each server provides HTTP REST API endpoints for environment interaction.
Specialist Training (GRPO on individual environments):
- CartPole Training - GRPO training of LLAMA 3.2 3B for CartPole mastery (requires CartPole server on port 8030)
- LunarLander Training - GRPO training of LLAMA 3.2 3B for LunarLander mastery (requires LunarLander server on port 8090)
- MountainCar Training - GRPO training of LLAMA 3.2 3B for MountainCar mastery (requires MountainCar server on port 8050)
- BipedalWalker Training - Final implementation for BipedalWalker mastery (requires OpenAI Gym/BipedalWalker)
Ensemble & Evaluation:
- Avengers RL Endgame - Sequential environment inference, comparison against OSS models, router logic (requires Sequential server on port 8060)
Custom OpenEnv Environments (OpenEnv/src/envs/):
- CartPole Environment - RL environment wrapper for CartPole task (server: port 8030)
- MountainCar Environment - RL environment wrapper for MountainCar (server: port 8050)
- LunarLander Environment - RL environment wrapper for LunarLander (server: port 8090)
- Gym Environment - Framework integration with Gymnasium
- Sequential Environment - Multi-phase composite environment combining all environments (server: port 8060)
Comprehensive Test Suite (Synthetic_Data_Hackathon/OpenEnv/tests/):
- test_cartpole_env.py - CartPole environment API tests and integration
- test_mountaincar_env.py - MountainCar environment API tests and integration
- test_lunarEnv.py - LunarLander environment API tests and integration
- test_bipedalwalker_env.py - BipedalWalker environment API tests and integration
- test_gym_environment.py - Generic Gym environment API tests
- test_sequential_environment.py - Sequential multi-environment integration tests
Many thanks to Unsloth, AMD, and Meta for sponsoring the colossal AMD MI300 GPUs that powered our fine-tuning and experimentation needs. This work was made possible through their generous hardware support and commitment to advancing AI research.