⚡ GridMind: Teaching an AI to Prevent Power Grid Blackouts

title: GridMind
emoji: ⚡
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.13.0
python_version: 3.10
app_file: app.py
pinned: false

OpenEnv Hackathon 2026 — India Submission
An advanced reinforcement learning system and custom environment simulating decentralised power distribution under strategic misreporting, delayed cascading failures, and grid overloads.

📖 Project Overview

Modern electrical grids are vulnerable to cascading failures: a localized overload trips a line, the remaining lines pick up its load, and the chain can end in a city-wide blackout.

GridMind addresses this stability challenge. Human operators prevent blackouts using "defensive curtailment" (strategically cutting power to low-priority zones to shield critical infrastructure such as hospitals). We use Reinforcement Learning (Recurrent PPO with LSTM) and LLM-Native Alignment (GRPO) to train autonomous agents to learn this curtailment strategy.

In decentralised smart grids, local distribution zones also act as self-interested agents that may exaggerate their reported demand to secure more power. GridMind includes a Reputation System that penalizes misreporting and rewards zone cooperation.

🎮 Interactive Live Demo

GridMind includes a fully interactive web interface built with Gradio. The demo allows you to:

Manual Mode: Test your skills as a human operator adjusting load sliders for residential, commercial, and hospital zones to maintain balance without tripping faults.
AI Mode: Deploy the trained RecurrentPPO agent to handle dynamic demands, stochastic failures, and load fluctuations autonomously.

🌐 Environment Architecture: `GridOpsEnv`

The environment simulates a 3-zone grid:

Zone 1 (Residential): Low priority (criticality weight: 0.5)
Zone 2 (Commercial): Medium priority (criticality weight: 0.75)
Zone 3 (Hospital / Critical): High priority (criticality weight: 1.0)

graph TD
    A[Environment Reset] --> B[Generate Demands & Total Power]
    B --> C[Format State Description state_to_text]
    C --> D[PPO LSTM / GRPO LLM Decision]
    D --> E[Normalize Allocation Action]
    E --> F[Stochastic Dynamics & Overload Detection]
    F --> G[Queue Delayed Failures]
    G --> H[Update Reputation System]
    H --> I[Evaluate Composable Rubric]
    I --> J[Check Episode Termination]
    J -- No --> B
    J -- Yes --> K[Generate Episode Summary]
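
The "Normalize Allocation Action" step in the flow above can be sketched as follows. This is a minimal illustration rather than the code in gridops_env.py: the function name `normalize_allocation` and the even-split fallback for an all-zero action are assumptions.

```python
def normalize_allocation(action, eps=1e-8):
    """Map a raw action to non-negative fractions that sum to 1.0."""
    raw = [max(0.0, float(a)) for a in action]  # negative supply is not meaningful
    total = sum(raw)
    if total < eps:
        # Assumption: fall back to an even split when nothing is allocated.
        return [1.0 / len(raw)] * len(raw)
    return [a / total for a in raw]


print(normalize_allocation([2.0, 1.0, 1.0]))   # [0.5, 0.25, 0.25]
```

Normalizing inside the environment means both the PPO policy and the LLM can emit unconstrained numbers while the simulator always receives a valid split.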

🧠 Key Environmental Dynamics

1. Delayed Cascade Failure Queue

If the supply allocated to a zone exceeds $1.3 \times$ its demand, that line overloads. Unlike in simpler simulations, an overload does not cause a blackout immediately. Instead, an overload at step $t$ enters a Failure Queue and fires at step $t+k$, where the delay $k$ is stochastic. When the delay expires, the grid suffers a sudden capacity reduction.

Important

The delayed effect of overloads creates a temporal credit assignment problem that standard feedforward (MLP) RL agents struggle to solve, because they keep no memory of previous steps.
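
A minimal sketch of such a delayed failure queue is shown below. Only the 1.3 overload threshold comes from the description above; the class name `FailureQueue`, the uniform delay range, and the method names are assumptions made for illustration.

```python
import random

OVERLOAD_FACTOR = 1.3  # threshold stated in the text


class FailureQueue:
    """Holds pending failures as (due_step, zone) pairs."""

    def __init__(self, min_delay=1, max_delay=3, seed=None):
        # Assumption: the delay k is drawn uniformly from [min_delay, max_delay].
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.pending = []

    def record_overloads(self, step, supply, demand):
        """Queue a delayed failure for each zone whose supply exceeds 1.3x demand."""
        for zone, (s, d) in enumerate(zip(supply, demand)):
            if s > OVERLOAD_FACTOR * d:
                k = self.rng.randint(self.min_delay, self.max_delay)
                self.pending.append((step + k, zone))

    def pop_due(self, step):
        """Return the zones whose failure fires at this step and drop them."""
        due = [zone for due_step, zone in self.pending if due_step <= step]
        self.pending = [(t, z) for t, z in self.pending if t > step]
        return due


queue = FailureQueue(min_delay=2, max_delay=2, seed=0)
queue.record_overloads(step=0, supply=[0.6, 0.2, 0.2], demand=[0.3, 0.3, 0.4])
for t in range(1, 4):
    print(t, queue.pop_due(t))   # zone 0 fires at step 2, not at step 0
```

The gap between `record_overloads` and `pop_due` is what separates the agent's action from its consequence.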

2. Reputation System

If a zone reports more than $1.3\times$ its actual demand, the environment flags the report as dishonest:

Reputation Decay: Each dishonest step lowers the zone's reputation by 0.1.
Reputation Recovery: Each honest step raises it by 0.02. Reputation is clamped between 0.2 and 2.0.
Coordinated Weighting: Grid allocation weights each zone by priority * demand * reputation, so lying results in lower future allocations.
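
The rules above can be sketched in a few lines. The constants come from this section and the zone list; the function names, the use of reported demand in the weighting, and the even-split guard are assumptions rather than the environment's actual code.

```python
PRIORITY = [0.5, 0.75, 1.0]      # criticality weights for zones 1-3
REP_MIN, REP_MAX = 0.2, 2.0      # clamp range for reputation


def update_reputation(rep, reported, actual):
    """Lower reputation by 0.1 for an overbid (> 1.3x actual), else raise it by 0.02."""
    delta = -0.1 if reported > 1.3 * actual else 0.02
    return min(REP_MAX, max(REP_MIN, rep + delta))


def coordinated_allocation(reported, reputation):
    """Split power in proportion to priority * demand * reputation."""
    weights = [p * d * r for p, d, r in zip(PRIORITY, reported, reputation)]
    total = sum(weights)
    if total <= 0:
        # Assumption: even split when every weight is zero.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


rep = 1.0
for _ in range(3):               # three dishonest steps in a row
    rep = update_reputation(rep, reported=0.6, actual=0.3)
print(round(rep, 2))             # 0.7

honest = coordinated_allocation([0.3, 0.3, 0.3], [1.0, 1.0, 1.0])
penalized = coordinated_allocation([0.3, 0.3, 0.3], [rep, 1.0, 1.0])
print(honest[0] > penalized[0])  # True: the lower reputation shrinks zone 1's share
```

Because decay (0.1) is five times faster than recovery (0.02), one dishonest step takes five honest steps to undo.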

3. Composable Reward Rubric

Aligned with the OpenEnv design principles, the reward uses a composable rubric rather than a monolithic score:

$$\text{Reward} = 2.0 \cdot \text{ServedRatio} + 0.5 \cdot \text{Stability} - 0.1 \cdot \text{UnmetDemand} - 0.05 \cdot \text{WastedPower} - 1.0 \cdot \text{BlackoutPenalty}$$
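
As a sketch of what "composable" can mean in practice, the function below returns both the total reward and the contribution of each term, using the weights from the formula above. The function name, the dictionary layout, and the example inputs are assumptions; how each term is measured is defined by the environment and is not reproduced here.

```python
WEIGHTS = {
    "served_ratio": 2.0,
    "stability": 0.5,
    "unmet_demand": -0.1,
    "wasted_power": -0.05,
    "blackout_penalty": -1.0,
}


def rubric_reward(served_ratio, stability, unmet_demand, wasted_power, blackout_penalty):
    """Return (total reward, per-term contributions) for one step."""
    terms = {
        "served_ratio": served_ratio,
        "stability": stability,
        "unmet_demand": unmet_demand,
        "wasted_power": wasted_power,
        "blackout_penalty": blackout_penalty,
    }
    parts = {name: WEIGHTS[name] * value for name, value in terms.items()}
    return sum(parts.values()), parts


# Hypothetical step: 90% of demand served, stable grid, no blackout.
total, parts = rubric_reward(0.9, 1.0, 0.1, 0.05, 0.0)
print(round(total, 4))
for name, value in parts.items():
    print(f"  {name}: {value:+.4f}")
```

Keeping the per-term breakdown makes it easier to see which component drives a low score when diagnosing a policy.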

📊 Empirical Results

1. Game-Theoretic Simulation Modes

We evaluated four simulation modes under extreme demand peaks and line stress conditions:

Baseline: Random load distribution.
Selfish: Zones constantly overreport demand to maximize their local short-term allocations.
Coordinated: Allocations are weighted by zone priority and demand.
Advanced: The full stack, integrating coalition bonuses, reputation tracking, and global stability constraints.

Evaluation Metrics across Scenario Modes (50-step episodes)

| Simulation Mode | Avg Reward/Step | Avg Blackouts | Avg Stability | Avg Misreporting Rate | Coalition Activation |
| --- | --- | --- | --- | --- | --- |
| Baseline | `-1.495` | `2.111` | `0.333` | `11.1%` | `47.1%` |
| Selfish | `1.544` | `0.111` | `0.944` | `0.0%` | `44.8%` |
| Coordinated | `1.288` | `0.333` | `0.778` | `0.0%` | `48.3%` |
| Advanced (Ours) | `1.666` | `0.000` | `0.944` | `0.0%` | `54.9%` |

Tip

In this evaluation, the Advanced mode averaged 0.000 blackouts under high-stress, unstable grid conditions by forming active coalitions and suppressing strategic misreporting.

2. Reinforcement Learning: Recurrent PPO (LSTM) vs. Random Policy

We trained a RecurrentPPO (PPO + LSTM) agent using Stable-Baselines3 and sb3-contrib to handle the grid's temporal dependencies.

We evaluated the trained agent over 50 full episodes against a Random baseline in the high-stress environment:

| Metric | Random Policy | PPO LSTM Agent | Improvement / Reduction |
| --- | --- | --- | --- |
| Avg Reward / Episode | `-2399.724` | `-914.028` | `+61.9%` |
| Avg Blackout Penalty | `88.253` | `46.482` | `-47.3%` (reduction) |
| Avg Grid Stability | `0.331` | `0.641` | `+93.5%` |
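
For readers who want to check the last column, the sketch below recomputes each percentage as the change relative to the magnitude of the Random Policy value. The helper name `relative_change` is an assumption, and the inputs are the rounded table entries, so the stability figure comes out near +93.7% rather than the reported +93.5%, presumably because the report used unrounded values.

```python
def relative_change(baseline, agent):
    """Signed percentage change of `agent` relative to the magnitude of `baseline`."""
    return 100.0 * (agent - baseline) / abs(baseline)


rows = {
    "Avg Reward / Episode": (-2399.724, -914.028),
    "Avg Blackout Penalty": (88.253, 46.482),
    "Avg Grid Stability": (0.331, 0.641),
}
for name, (random_policy, ppo_lstm) in rows.items():
    print(f"{name}: {relative_change(random_policy, ppo_lstm):+.1f}%")
```

Dividing by the absolute value of the baseline keeps the sign meaningful when the baseline reward is negative.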

Why LSTM is Critical

Standard MLP policies have no memory. Because an overload at step $t$ only takes effect after a stochastic delay, an MLP policy cannot readily associate that overload with the blackout it causes several steps later (for example at step $t+2$). The LSTM policy carries a hidden state across steps, which lets it retain information about prior allocations and overload flags and curtail load before the queued failure triggers.
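
In practice, using a recurrent policy means threading its state through the evaluation loop. The sketch below shows that pattern with a stand-in policy so it runs without the trained checkpoint. `StubRecurrentPolicy`, its overload-counting "state", and the dictionary observation are all hypothetical; only the `predict(obs, state=..., episode_start=..., deterministic=...)` call shape mirrors what sb3-contrib documents for `RecurrentPPO`, where `episode_start` is an array of flags rather than a single bool.

```python
class StubRecurrentPolicy:
    """Stand-in with the same predict() call shape as a recurrent SB3 policy."""

    def predict(self, obs, state=None, episode_start=False, deterministic=True):
        # The "hidden state" here is just a count of overloads seen this episode.
        seen = 0 if (state is None or episode_start) else state
        seen += int(obs["overload"])
        # Curtail the low-priority zone once any overload has been observed.
        action = [0.2, 0.3, 0.5] if seen else [0.34, 0.33, 0.33]
        return action, seen


def run_episode(policy, observations):
    """Roll a policy over a list of observations, threading its state through."""
    state, episode_start, actions = None, True, []
    for obs in observations:
        action, state = policy.predict(
            obs, state=state, episode_start=episode_start, deterministic=True
        )
        episode_start = False
        actions.append(action)
    return actions


obs_seq = [{"overload": True}, {"overload": False}, {"overload": False}]
print(run_episode(StubRecurrentPolicy(), obs_seq)[-1])   # [0.2, 0.3, 0.5]
```

If the returned state were discarded and `None` passed on every call, the policy would see only the current observation and behave like a memoryless one, which is the failure mode described above.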

🧠 LLM-Native Integration & GRPO Training

GridMind supports LLM-native decision-making via a structured text serializer:

Prompt Generation: The state_to_text() method translates floats (demands, supplies, reputations, faults) into structured natural language:

=== POWER GRID STATE (Step 12/50 — 24% complete) ===

Grid Zones:
  Zone 1 [Residential (low)]: demand=0.345, supply=0.333, reputation=1.00, status=✅ Healthy
  Zone 2 [Commercial (medium)]: demand=0.290, supply=0.333, reputation=0.90, status=✅ Healthy
  Zone 3 [Hospital/Critical (HIGH)]: demand=0.365, supply=0.334, reputation=1.00, status=⚠️ FAULT DETECTED

Episode so far:
  Blackouts: 0
  Total unmet demand: 0.120
  Total reward: 14.50

Task: Allocate power to 3 zones as fractions summing to 1.0.
Priority: Serve Zone 3 (Hospital) first. Avoid overloads — they cascade into blackouts.
Reply with exactly 3 space-separated floats. Example: 0.20 0.30 0.50

GRPO Optimization: Using Group Relative Policy Optimization (GRPO) with a small language model such as Qwen2-0.5B, the LLM is trained directly on reward feedback from the environment. This pipeline runs in Google Colab (see deliverables above).
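
The two pieces that connect the prompt format to GRPO can be sketched as follows: parsing a reply into an allocation, and turning the rewards of a group of sampled replies into group-relative advantages. This is an illustration under stated assumptions, not the repository's training code: the function names, the `-1.0` penalty for unparseable replies, the use of the population standard deviation, and the toy reward (share of power sent to Zone 3) are all hypothetical.

```python
import math
import statistics


def parse_allocation(reply, n_zones=3):
    """Parse 'a b c' into fractions summing to 1.0, or return None if invalid."""
    parts = reply.strip().split()
    if len(parts) != n_zones:
        return None
    try:
        values = [float(p) for p in parts]
    except ValueError:
        return None
    if not all(math.isfinite(v) and v >= 0 for v in values) or sum(values) <= 0:
        return None
    total = sum(values)
    return [v / total for v in values]


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled replies."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def score_group(replies, reward_fn, invalid_penalty=-1.0):
    """Reward each sampled reply, then normalize within the group."""
    rewards = []
    for reply in replies:
        alloc = parse_allocation(reply)
        rewards.append(invalid_penalty if alloc is None else reward_fn(alloc))
    return rewards, group_relative_advantages(rewards)


# Four hypothetical replies to the same prompt; the toy reward stands in for the environment.
replies = ["0.20 0.30 0.50", "0.40 0.40 0.20", "give all to hospital", "1 1 2"]
rewards, advantages = score_group(replies, reward_fn=lambda a: a[2])
for reply, adv in zip(replies, advantages):
    print(f"{reply!r}: advantage {adv:+.2f}")
```

Because advantages are computed relative to the other replies in the group, no separate value network is needed; a reply is reinforced only if it beats its siblings for the same grid state.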

📁 Repository Structure

env/: Core custom Gymnasium environment.
- gridops_env.py: Grid simulator logic, delayed failure queues, reputation dynamics, and reward composability.
train/: Simulation analysis, plotting and helper code.
- train.py: Multi-seed environment baseline evaluator and standard PPO setup.
- plots.py: Plot generator for reward curves, blackout accumulation, and reputation trends.
- analyze.py: Research analysis layer including ablation studies and cascade delay tracking.
models/: Pre-trained RL checkpoints.
- recurrent_ppo_grid.zip: Trained RecurrentPPO model.
- vecnorm.pkl: Observation normalization statistics.
plots/: Training graphs, ablation charts, and visual comparison plots.
app.py: Gradio app code for local execution and Hugging Face hosting.
requirements.txt: Python dependencies.
openenv.yaml: Environment metadata manifest.

🚀 How to Run Locally

1. Clone the Repository

git clone https://github.com/TechLearnr4S/GridMind
cd GridMind

2. Install Dependencies

Make sure you have a Python virtual environment set up (recommended: Python 3.10):

pip install -r requirements.txt

3. Run the Gradio Web Application

Launch the local web server:

python app.py

Open http://127.0.0.1:7860 in your web browser to play manual mode or deploy the AI coordinator.

4. Run Analysis & Generate Charts

To re-run the simulation baselines, ablation studies, and save all analysis plots:

python train/train.py

5. Evaluate the Pre-trained LSTM Model

Run the GRPO-aligned recurrent model evaluation script:

python eval_grpo.py

👥 Contributors

Dakshin (Dakshin10) — Lead Reinforcement Learning Engineer & Environment Designer.
