| title | GridMind |
|---|---|
| emoji | ⚡ |
| colorFrom | blue |
| colorTo | green |
| sdk | gradio |
| sdk_version | 6.13.0 |
| python_version | 3.10 |
| app_file | app.py |
| pinned | false |
OpenEnv Hackathon 2026 - India Submission
An advanced reinforcement learning system and custom environment simulating decentralised power distribution under strategic misreporting, delayed cascading failures, and grid overloads.
Modern electrical grids are highly vulnerable to cascading failures, where a localized overload triggers trip-outs that place greater demand on the remaining lines and can end in a city-wide blackout.
GridMind addresses this critical stability challenge. Human operators prevent blackouts using "defensive curtailment" (strategically cutting power to low-priority zones to shield critical infrastructure like hospitals). We use Reinforcement Learning (Recurrent PPO with LSTM) and LLM-Native Alignment (GRPO) to train autonomous agents to master this curtailment capability.
Furthermore, in decentralised smart grids, local distribution zones act as self-interested agents that often misreport (exaggerate) demand to secure more power. GridMind implements a Reputation System that penalizes dishonesty and incentivizes zone cooperation.
GridMind includes a fully interactive web interface built with Gradio. The demo allows you to:
- Manual Mode: Test your skills as a human operator adjusting load sliders for residential, commercial, and hospital zones to maintain balance without tripping faults.
- AI Mode: Deploy the trained RecurrentPPO agent to handle dynamic demands, stochastic failures, and load fluctuations autonomously.
The environment simulates a 3-zone grid:
- Zone 1 (Residential): Low priority (criticality weight: `0.5`)
- Zone 2 (Commercial): Medium priority (criticality weight: `0.75`)
- Zone 3 (Hospital / Critical): High priority (criticality weight: `1.0`)
graph TD
A[Environment Reset] --> B[Generate Demands & Total Power]
B --> C[Format State Description state_to_text]
C --> D[PPO LSTM / GRPO LLM Decision]
D --> E[Normalize Allocation Action]
E --> F[Stochastic Dynamics & Overload Detection]
F --> G[Queue Delayed Failures]
G --> H[Update Reputation System]
H --> I[Evaluate Composable Rubric]
I --> J[Check Episode Termination]
J -- No --> B
J -- Yes --> K[Generate Episode Summary]
If supply allocated to a zone exceeds
Important: The delayed effect of overloads creates a complex temporal credit assignment problem that standard feedforward (MLP) RL agents fail to solve because they lack memory of previous steps.
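To make the delayed-failure idea concrete, here is a minimal sketch of a queue that releases an overload's consequence a fixed number of steps after it was triggered. The class name `DelayedFailureQueue`, its methods, and the three-step delay are hypothetical choices for illustration; the actual mechanism lives in `gridops_env.py` and may differ.

```python
from collections import deque


class DelayedFailureQueue:
    """Holds overload events and releases them a fixed number of steps later."""

    def __init__(self, delay_steps: int = 3):  # delay length is an assumed value
        self.delay_steps = delay_steps
        self._pending = deque()  # entries are (trigger_step, zone_index)

    def schedule(self, current_step: int, zone: int) -> None:
        """Record that an overload in `zone` will cause a failure later."""
        self._pending.append((current_step + self.delay_steps, zone))

    def pop_due(self, current_step: int) -> list[int]:
        """Return zones whose delayed failure fires at or before this step."""
        due = []
        while self._pending and self._pending[0][0] <= current_step:
            due.append(self._pending.popleft()[1])
        return due


queue = DelayedFailureQueue(delay_steps=3)
queue.schedule(current_step=5, zone=2)       # overload observed at step 5
fired_early = queue.pop_due(current_step=6)  # nothing fires yet
fired_late = queue.pop_due(current_step=8)   # failure surfaces three steps later
```

Because the penalty arrives several steps after the action that caused it, an agent that only sees the current observation has nothing to connect the two, which is the credit assignment problem described above.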
If a zone overbids / misreports demand (
- Reputation Decay: Reputation is penalized by `-0.1` per step.
- Reputation Recovery: Honest steps recover reputation by `+0.02` (clamped between `0.2` and `2.0`).
- Coordinated Weighting: Grid allocation dynamically prioritizes zones based on `priority * demand * reputation`. Lying results in lower future allocations.
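The reputation rules above can be sketched in a few lines. The decay, recovery, and clamp values come from the list; the function names, the choice to clamp after every update, and the proportional normalization of `priority * demand * reputation` are assumptions made for this example, not necessarily what the environment does internally.

```python
PRIORITY = [0.5, 0.75, 1.0]  # criticality weights: residential, commercial, hospital
REP_MIN, REP_MAX = 0.2, 2.0  # clamp range for reputation


def update_reputation(reputation: float, misreported: bool) -> float:
    """Apply -0.1 for a dishonest step, +0.02 for an honest one, then clamp."""
    delta = -0.1 if misreported else 0.02
    return min(REP_MAX, max(REP_MIN, reputation + delta))


def weighted_allocation(demands, reputations, priorities=PRIORITY):
    """Split power in proportion to priority * demand * reputation (assumed rule)."""
    scores = [p * d * r for p, d, r in zip(priorities, demands, reputations)]
    total = sum(scores)
    if total <= 0:
        # Degenerate case: fall back to an even split.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


demands = [0.4, 0.3, 0.3]
honest = weighted_allocation(demands, [1.0, 1.0, 1.0])

# Zone 1 misreports for three consecutive steps and loses reputation.
rep_zone1 = 1.0
for _ in range(3):
    rep_zone1 = update_reputation(rep_zone1, misreported=True)
penalized = weighted_allocation(demands, [rep_zone1, 1.0, 1.0])
```

With identical reported demand, Zone 1's share in `penalized` is smaller than in `honest`, and recovering the lost reputation at `+0.02` per honest step takes five times as long as losing it.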
Aligned with the OpenEnv design principles, the reward uses a composable rubric rather than a monolithic score:
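As a rough illustration of what "composable" can mean here, the sketch below scores a step as a weighted sum of independent components and keeps the per-component breakdown. The component names, weights, and the `info` keys are all hypothetical placeholders; they are not the rubric GridMind actually uses.

```python
from typing import Callable, Dict, Tuple

StepInfo = Dict[str, float]


def demand_served(info: StepInfo) -> float:
    """Reward the fraction of demand that was met (hypothetical key)."""
    return info["served_fraction"]


def blackout_penalty(info: StepInfo) -> float:
    """Penalize each blackout that occurred this step (hypothetical key)."""
    return -float(info["blackouts"])


def stability(info: StepInfo) -> float:
    """Reward a stable grid state (hypothetical key)."""
    return info["stability"]


# name -> (weight, scoring function); the weights are illustrative only.
RUBRIC: Dict[str, Tuple[float, Callable[[StepInfo], float]]] = {
    "demand_served": (1.0, demand_served),
    "blackout_penalty": (2.0, blackout_penalty),
    "stability": (0.5, stability),
}


def score(info: StepInfo, rubric=RUBRIC):
    """Return the total reward and the per-component breakdown."""
    parts = {name: weight * fn(info) for name, (weight, fn) in rubric.items()}
    return sum(parts.values()), parts


total, breakdown = score({"served_fraction": 0.9, "blackouts": 1, "stability": 0.8})
```

Keeping the breakdown alongside the total makes it possible to see which component drove a reward and to add or remove components without touching the others.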
We evaluated four simulation modes under extreme demand peaks and line-stress conditions:
- Baseline: Random load distribution.
- Selfish: Zones constantly overreport demand, maximizing local short-term allocations.
- Coordinated: Allocations utilize zone priority and demand weighting.
- Advanced: The full stack, integrating coalition bonuses, reputation tracking, and global stability constraints.
| Simulation Mode | Avg Reward/Step | Avg Blackouts | Avg Stability | Avg Misreporting Rate | Coalition Activation |
|---|---|---|---|---|---|
| Baseline | -1.495 | 2.111 | 0.333 | 11.1% | 47.1% |
| Selfish | 1.544 | 0.111 | 0.944 | 0.0% | 44.8% |
| Coordinated | 1.288 | 0.333 | 0.778 | 0.0% | 48.3% |
| Advanced (Ours) | 1.666 | 0.000 | 0.944 | 0.0% | 54.9% |
Tip: In these runs, Advanced mode averaged 0.000 blackouts even in high-stress, unstable grid states, which we attribute to active coalition formation and the suppression of strategic lying.
We trained a RecurrentPPO (PPO + LSTM) agent using Stable-Baselines3 and sb3-contrib to handle the grid's temporal dependencies.
We evaluated the trained agent over 50 full episodes against a Random baseline in the high-stress environment:
| Metric | Random Policy | PPO LSTM Agent | Improvement / Reduction |
|---|---|---|---|
| Avg Reward / Episode | -2399.724 | -914.028 | +61.9% |
| Avg Blackout Penalty | 88.253 | 46.482 | -47.3% (reduction) |
| Avg Grid Stability | 0.331 | 0.641 | +93.5% |
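The last column can be reproduced from the other two with a relative-change calculation. The sketch below assumes the change is measured against the magnitude of the Random Policy value; the helper name `percent_change` is ours. It works from the rounded table entries, so the final digit can differ slightly from the reported figure (stability comes out near 93.7% rather than 93.5%, presumably because the report used unrounded values).

```python
def percent_change(baseline: float, value: float) -> float:
    """Signed change relative to the magnitude of the baseline, in percent."""
    return (value - baseline) / abs(baseline) * 100.0


# (Random Policy, PPO LSTM Agent) pairs copied from the table above.
rows = {
    "Avg Reward / Episode": (-2399.724, -914.028),
    "Avg Blackout Penalty": (88.253, 46.482),
    "Avg Grid Stability": (0.331, 0.641),
}

changes = {name: round(percent_change(b, v), 1) for name, (b, v) in rows.items()}
```

Dividing by `abs(baseline)` keeps the sign meaningful when the baseline reward is negative: a less negative reward shows up as a positive improvement.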
Standard MLP networks have no memory. Because grid overloading has a delayed cascading impact, the MLP policy cannot associate an overload action on step
GridMind supports LLM-native decision-making via a structured text serializer:
- Prompt Generation: The `state_to_text()` method translates floats (demands, supplies, reputations, faults) into structured natural language:

      === POWER GRID STATE (Step 12/50 - 24% complete) ===
      Grid Zones:
        Zone 1 [Residential (low)]: demand=0.345, supply=0.333, reputation=1.00, status=✅ Healthy
        Zone 2 [Commercial (medium)]: demand=0.290, supply=0.333, reputation=0.90, status=✅ Healthy
        Zone 3 [Hospital/Critical (HIGH)]: demand=0.365, supply=0.334, reputation=1.00, status=⚠️ FAULT DETECTED
      Episode so far:
        Blackouts: 0
        Total unmet demand: 0.120
        Total reward: 14.50
      Task: Allocate power to 3 zones as fractions summing to 1.0.
      Priority: Serve Zone 3 (Hospital) first.
      Avoid overloads - they cascade into blackouts.
      Reply with exactly 3 space-separated floats. Example: 0.20 0.30 0.50

- GRPO Optimization: Using Group Relative Policy Optimization (GRPO) on small models such as Qwen2-0.5B, the LLM is trained directly on reward feedback from the environment. This pipeline runs efficiently in Google Colab (see deliverables above).
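Since the prompt asks for exactly three space-separated floats, the reply has to be parsed back into an allocation before the environment can use it. The following is a minimal sketch of that step; `parse_allocation` is a hypothetical helper, and returning `None` for malformed replies is an assumed convention that lets the caller apply a fallback action or a format penalty.

```python
import re


def parse_allocation(reply: str, n_zones: int = 3):
    """Extract n_zones non-negative floats from an LLM reply and normalize them.

    Returns None when the reply does not contain exactly n_zones numbers
    or when the numbers do not sum to a positive value.
    """
    tokens = re.findall(r"[-+]?\d*\.?\d+", reply)
    if len(tokens) != n_zones:
        return None
    values = [max(0.0, float(t)) for t in tokens]  # clip negatives to zero
    total = sum(values)
    if total <= 0:
        return None
    return [v / total for v in values]  # fractions summing to 1.0


allocation = parse_allocation("0.20 0.30 0.50")
unnormalized = parse_allocation("2 3 5")
malformed = parse_allocation("give the hospital everything")
```

Normalizing here mirrors the "Normalize Allocation Action" stage of the environment loop, so a reply such as `2 3 5` is treated the same as `0.20 0.30 0.50`.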
- `env/`: Core custom Gymnasium environment.
  - `gridops_env.py`: Grid simulator logic, delayed failure queues, reputation dynamics, and reward composability.
- `train/`: Simulation analysis, plotting and helper code.
  - `train.py`: Multi-seed environment baseline evaluator and standard PPO setup.
  - `plots.py`: Plot generator for reward curves, blackout accumulation, and reputation trends.
  - `analyze.py`: Research analysis layer including ablation studies and cascade delay tracking.
- `models/`: Pre-trained RL checkpoints.
  - `recurrent_ppo_grid.zip`: Trained RecurrentPPO model.
  - `vecnorm.pkl`: Observation normalization statistics.
- `plots/`: Training graphs, ablation charts, and visual comparison plots.
- `app.py`: Gradio app code for local execution and Hugging Face hosting.
- `requirements.txt`: Python dependencies.
- `openenv.yaml`: Environment metadata manifest.
Clone the repository and enter the project directory:

    git clone https://github.com/TechLearnr4S/GridMind
    cd GridMind

Make sure you have a Python virtual environment set up (recommended: Python 3.10), then install the dependencies:

    pip install -r requirements.txt

Launch the local web server:

    python app.py

Open http://127.0.0.1:7860 in your web browser to play manual mode or deploy the AI coordinator.

To re-run the simulation baselines, ablation studies, and save all analysis plots:

    python train/train.py

Run the GRPO-aligned recurrent model evaluation script:

    python eval_grpo.py

- Dakshin (Dakshin10) - Lead Reinforcement Learning Engineer & Environment Designer.


