A reinforcement learning environment for cloud infrastructure resource allocation, built with PyTorch and Gradio. This project demonstrates intelligent resource management through interactive visualization and property-based testing.
This project uses PyTorch for tensor-based state representation and is compatible with OpenAI Gym-style interfaces.
Cloud infrastructure providers face a critical challenge: dynamically allocating computational resources to meet fluctuating demand while minimizing costs and maintaining system stability. Over-provisioning wastes money on idle resources, while under-provisioning risks system crashes and poor user experience.
This project simulates this real-world problem as a reinforcement learning environment where an agent must learn to:
- Balance resource utilization between 40-70% for optimal efficiency
- Respond to stochastic request rate fluctuations
- Minimize infrastructure costs while maintaining service quality
- Avoid system instability from resource exhaustion
We model cloud resource allocation as a Markov Decision Process (MDP) where:
- State: System metrics (CPU utilization, memory utilization, request rate, allocated resources, latency)
- Actions: Discrete resource adjustments (increase, decrease, or maintain server instances)
- Rewards: Shaped to encourage efficient operation and penalize waste or instability
- Dynamics: Stochastic request patterns simulate real-world variability
The solution provides:
- Interactive Gradio UI with real-time visualizations for manual exploration
- PyTorch-based state representation for seamless integration with deep RL algorithms
- Comprehensive grading system that evaluates stability, efficiency, and performance
- Property-based testing ensuring correctness across diverse scenarios
- Python 3.8+: Core programming language
- PyTorch: Tensor-based state representation for neural network integration
- Gradio: Interactive web UI with real-time plotting
- NumPy: Numerical computations and array operations
- Matplotlib: Visualization of system metrics over time
- Hypothesis: Property-based testing for robust validation
- Pytest: Unit and integration testing framework
- PyTorch: Industry-standard deep learning framework, enables easy integration with RL algorithms (DQN, PPO, A3C)
- Gradio: Rapid prototyping of interactive demos, perfect for showcasing RL environments
- Hypothesis: Discovers edge cases automatically through property-based testing
- Matplotlib: Publication-quality plots for analyzing agent behavior
The environment state is a 4-dimensional continuous vector:
state = [cpu, memory, request_rate, resources]
# Example: [65.3, 58.2, 47.0, 3]
Observation Format: [cpu, memory, request_rate, resources]
State Components:
- CPU Utilization (0-100%): Percentage of CPU capacity being used
- Memory Utilization (0-100%): Percentage of memory capacity being used
- Request Rate (0-∞): Number of incoming requests per timestep
- Allocated Resources (1-∞): Number of server instances currently allocated
PyTorch Integration:
# Get state as PyTorch tensor for neural network input
state_tensor = env.get_state_tensor()  # Returns torch.Tensor
Three discrete actions control resource allocation:
Action Format: {0: decrease, 1: maintain, 2: increase}
| Action | Value | Effect |
|---|---|---|
| Decrease | 0 | Remove 1 server instance (minimum 1) |
| Maintain | 1 | Keep current allocation unchanged |
| Increase | 2 | Add 1 server instance |
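The action semantics in the table can be sketched as a small helper. This is a minimal illustration rather than the project's code: `apply_action` is a hypothetical name, and the constants simply mirror the action values listed above.

```python
# Action values as listed in the table above.
ACTION_DECREASE, ACTION_MAINTAIN, ACTION_INCREASE = 0, 1, 2


def apply_action(resources: int, action: int) -> int:
    """Return the instance count after one action (hypothetical helper)."""
    if action not in (ACTION_DECREASE, ACTION_MAINTAIN, ACTION_INCREASE):
        raise ValueError(f"unknown action: {action}")
    delta = action - 1  # maps 0 -> -1, 1 -> 0, 2 -> +1
    return max(1, resources + delta)  # never drop below one instance
```

With this sketch, decreasing from a single instance leaves the allocation at 1, which matches the "minimum 1" note in the table.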
The environment simulates realistic cloud behavior:
- Resource-Utilization Relationship:
  CPU_util = (request_rate × cpu_per_request) / (resources × capacity) × 100
  Memory_util = (request_rate × memory_per_request) / (resources × capacity) × 100
- Stochastic Request Patterns:
  - Request rate fluctuates with Gaussian noise: N(base_rate, std_dev)
  - Simulates real-world traffic variability
- Latency Calculation:
  Latency = request_rate / max(allocated_resources, 1)
  - Higher request rate → higher latency
  - More resources → lower latency
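The three relationships above can be written out as plain functions. The sketch below uses only the standard library; the function names, the example parameter values (`per_request_cost=1.0`, `capacity=40.0`), and the clamping of the sampled request rate at zero are assumptions made for illustration, not values taken from the project.

```python
import random


def utilization(request_rate, per_request_cost, resources, capacity):
    """Utilization in percent, following the formula above (not clamped)."""
    return (request_rate * per_request_cost) / (resources * capacity) * 100.0


def sample_request_rate(base_rate, std_dev, rng):
    """Draw one request rate from N(base_rate, std_dev).

    Clamping at zero is an assumption; the text only gives the range 0-∞.
    """
    return max(0.0, rng.gauss(base_rate, std_dev))


def latency(request_rate, allocated_resources):
    """Requests per allocated instance, guarding against zero instances."""
    return request_rate / max(allocated_resources, 1)


rng = random.Random(0)  # seeded so repeated runs give the same draw
rate = sample_request_rate(60.0, 15.0, rng)
cpu = utilization(rate, per_request_cost=1.0, resources=3, capacity=40.0)
```

Under these assumed parameters, 60 requests spread over 3 instances gives 50% utilization and a latency of 20.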
Episodes terminate when:
- Max steps reached (default: 100 timesteps)
- CPU utilization > 95% (system instability)
- Memory utilization > 95% (system instability)
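A termination check following these three conditions might look like the sketch below. `termination_reason` and its return strings are hypothetical, and treating `step >= max_steps` as the step limit is an assumption about how steps are counted.

```python
from typing import Optional


def termination_reason(step: int, cpu: float, memory: float,
                       max_steps: int = 100,
                       critical: float = 95.0) -> Optional[str]:
    """Return why the episode ends, or None if it should continue."""
    if cpu > critical:
        return "cpu_instability"
    if memory > critical:
        return "memory_instability"
    if step >= max_steps:
        return "max_steps"
    return None
```

Returning a reason instead of a bare boolean makes it easier to tell early failures apart from episodes that ran to completion.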
The reward function shapes agent behavior through three components, with all rewards normalized to the range [0.0, 1.0] for consistency and comparability across episodes.
Encourages keeping CPU and memory in optimal range (40-70%):
if 40% ≤ utilization ≤ 70%:
    reward = +1.0 (scaled by distance from center)
elif utilization < 40%:
    reward = -0.3 - (distance_below / 100) # Over-provisioning penalty
else: # utilization > 70%
    reward = -1.0 - (distance_above / 100) # Under-provisioning penalty
Encourages minimizing resource usage:
cost_penalty = -0.05 × allocated_resources
Total reward:
total_reward = cpu_reward + memory_reward + cost_penalty
if not (cpu_optimal and memory_optimal):
    total_reward -= 0.1 # Additional penalty for mixed states
# Normalize to [0.0, 1.0] range
normalized_reward = (total_reward + 5.0) / 7.0
normalized_reward = max(0.0, min(1.0, normalized_reward))
Reward Range: [0.0, 1.0] (normalized)
Design Rationale:
- Positive rewards only when both CPU and memory are optimal
- Stronger penalties for under-provisioning (instability risk) than over-provisioning
- Resource cost encourages efficiency without sacrificing performance
- Normalization ensures consistent reward magnitudes across episodes
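The reward components described in this section can be combined into a runnable sketch. It is not the project's `RewardCalculator`: the function names are hypothetical, and the in-range reward is a flat +1.0 because the text does not say how the "distance from center" scaling is computed.

```python
LOW, HIGH = 40.0, 70.0  # optimal utilization range from the text


def in_range(utilization: float) -> bool:
    return LOW <= utilization <= HIGH


def utilization_reward(utilization: float) -> float:
    """Per-metric reward; flat +1.0 in range is an assumption."""
    if in_range(utilization):
        return 1.0
    if utilization < LOW:
        return -0.3 - (LOW - utilization) / 100.0  # over-provisioning
    return -1.0 - (utilization - HIGH) / 100.0     # under-provisioning


def normalized_reward(cpu: float, memory: float, resources: int) -> float:
    """Sum the components, apply the mixed-state penalty, map to [0, 1]."""
    total = utilization_reward(cpu) + utilization_reward(memory)
    total += -0.05 * resources  # resource cost penalty
    if not (in_range(cpu) and in_range(memory)):
        total -= 0.1
    return max(0.0, min(1.0, (total + 5.0) / 7.0))
```

For example, with CPU at 90%, memory at 50%, and 3 instances, the raw total is -1.2 + 1.0 - 0.15 - 0.1 = -0.45, which normalizes to 4.55 / 7 = 0.65.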
The EpisodeGrader evaluates complete trajectories using three metrics:
Measures consistency of utilization levels:
stability = exp(-avg_variance / 100)
- Lower variance → higher stability → better score
- Penalizes erratic resource allocation patterns
Measures resource utilization efficiency:
efficiency = 1.0 / avg_allocated_resources
- Fewer resources → higher efficiency → better score
- Encourages lean resource usage
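The stability and efficiency formulas above can be sketched with the standard library. The text does not define `avg_variance`; the sketch assumes it is the mean of the population variances of CPU and memory utilization over the episode, and the function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def stability_score(cpu_history, memory_history):
    """exp(-avg_variance / 100), with avg_variance assumed to be the mean
    of the CPU and memory population variances over the episode."""
    avg_variance = (pvariance(cpu_history) + pvariance(memory_history)) / 2.0
    return math.exp(-avg_variance / 100.0)


def efficiency_score(resource_history):
    """1 / average number of allocated instances."""
    return 1.0 / mean(resource_history)
```

An episode that holds 3 instances throughout scores 1/3 on efficiency, and perfectly flat utilization gives a stability of 1.0.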
Normalized average reward:
performance = (avg_reward + 5.0) / 7.0 # Map [-5, 2] to [0, 1]
final_score = 0.3×stability + 0.3×efficiency + 0.4×performance
passed = final_score ≥ threshold (default: 0.0)
Grader Output:
{
'final_score': 0.523,
'passed': True,
'stability_score': 0.847,
'efficiency_score': 0.333,
'avg_reward': 0.156,
'avg_cpu': 52.3,
'avg_memory': 48.7,
'avg_latency': 12.4
}
The environment supports three difficulty levels to evaluate agent robustness and generalization:
Easy (task_name="easy"):
- Base Request Rate: 40 requests/timestep
- Request Rate Std Dev: 5.0
- Characteristics: Low randomness, stable request patterns, predictable system behavior
- Use Case: Initial testing and baseline evaluation
Medium (task_name="medium", default):
- Base Request Rate: 60 requests/timestep
- Request Rate Std Dev: 15.0
- Characteristics: Moderate randomness, occasional request spikes, balanced challenge
- Use Case: Standard evaluation and agent training
Hard (task_name="hard"):
- Base Request Rate: 80 requests/timestep
- Request Rate Std Dev: 25.0
- Characteristics: High randomness, frequent system instability, challenging dynamics
- Use Case: Stress testing and robustness evaluation
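To get a feel for how the three parameter sets differ, the sketch below samples request traces from each. It assumes the sets above map to the `task_name` values `easy`, `medium`, and `hard` in that order; `TaskConfig`, `TASKS`, and `sample_requests` are hypothetical names, not the contents of `env/config.py`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    base_request_rate: float
    request_rate_std: float


# Values copied from the difficulty levels listed above.
TASKS = {
    "easy": TaskConfig(40.0, 5.0),
    "medium": TaskConfig(60.0, 15.0),
    "hard": TaskConfig(80.0, 25.0),
}


def sample_requests(task_name: str, steps: int, seed: int = 0) -> list:
    """Sample a request-rate trace with Gaussian noise, clamped at zero."""
    cfg = TASKS[task_name]
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(cfg.base_request_rate, cfg.request_rate_std))
            for _ in range(steps)]
```

Plotting `sample_requests("easy", 100)` next to `sample_requests("hard", 100)` shows how much wider the swings become at the higher difficulty.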
from env.environment import CloudResourceEnv
# Create environment with specific task difficulty
env_easy = CloudResourceEnv(task_name="easy")
env_medium = CloudResourceEnv(task_name="medium") # Default
env_hard = CloudResourceEnv(task_name="hard")
# Run episode
state = env_hard.reset()
# ... interact with environment
A simple heuristic baseline policy is provided for performance comparison.
IMPORTANT: Activate your virtual environment first!
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Run baseline script
python run_baseline.py
The baseline uses a simple CPU-based heuristic:
if cpu_utilization > 70%:
    action = 2 # Increase resources
elif cpu_utilization < 40%:
    action = 0 # Decrease resources
else:
    action = 1 # Maintain resources
Example output:
Cumulative Reward: 45.23
Average CPU: 52.30%
Average Memory: 48.70%
Final Score: 0.523
- Cumulative Reward: Sum of all rewards during the episode (higher is better)
- Average CPU/Memory: Mean utilization percentages (target: 40-70%)
- Final Score: Overall performance metric from EpisodeGrader (0.0-1.0 scale)
Use these baseline results to compare against your own RL agents or policies.
The system follows a modular design with four primary components:
┌─────────────────────────────────────────────────────────────┐
│ Gradio UI (app.py) │
│ - Display system metrics (CPU, memory, requests) │
│ - Action buttons (increase, decrease, maintain) │
│ - Episode progress and grading results │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Environment (env/environment.py) │
│ - Maintains state (CPU, memory, requests, resources) │
│ - Processes actions and updates state │
│ - Manages episode lifecycle (reset, step, termination) │
└──────────┬──────────────────────────────┬───────────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌─────────────────────────────┐
│ Reward Calculator │ │ Episode Grader │
│ (env/reward.py) │ │ (env/grader.py) │
│ - Evaluates decisions │ │ - Evaluates episodes │
│ - Encourages efficiency │ │ - Calculates scores │
│ - Penalizes waste │ │ - Assesses stability │
└──────────────────────────┘ └─────────────────────────────┘
- Environment: Simulates cloud system dynamics, manages state transitions, coordinates with reward module
- Reward Calculator: Provides feedback signals to guide learning, rewards optimal utilization (40-70%), penalizes over/under-provisioning
- Episode Grader: Evaluates complete trajectories, calculates stability and efficiency scores, determines pass/fail status
- UI: Enables manual interaction, visualizes system behavior, displays grading results
Before you begin, ensure you have:
- Python 3.8 or higher (Python 3.9+ recommended)
- pip package manager (usually included with Python)
- Virtual environment (recommended)
- Basic Python knowledge: Classes, functions, numpy arrays
- Terminal/command line familiarity
To check your Python version:
python --version
# or
python3 --version
Clone the repository:
git clone <repository-url>
cd cloud-resource-allocation-rl
IMPORTANT: Always activate your virtual environment before running any Python commands.
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
You should see (venv) in your terminal prompt when the virtual environment is active.
pip install -r requirements.txt
Dependencies installed:
- torch>=2.0.0 - PyTorch for tensor operations
- gradio - Interactive web UI
- numpy - Numerical computations
- matplotlib - Plotting and visualization
- hypothesis - Property-based testing
- pytest - Testing framework
# Run tests to verify setup
pytest tests/test_environment.py -v
# Launch demo
python app.py
pytest tests/test_environment.py -v
If tests pass, you're ready to go!
## Demo Instructions
### Quick Start
1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Launch the interactive demo**: `python app.py`
3. **Open your browser** to `http://127.0.0.1:7860`
Click "Reset Environment" to start a new episode with randomized initial conditions.
Monitor the left panel:
- CPU Utilization (target: 40-70%)
- Memory Utilization (target: 40-70%)
- Request Rate (stochastic)
- Allocated Resources (your control)
- Latency (lower is better)
Use action buttons to control resources:
- ⬆️ Increase Resources: Add 1 server instance
- ➡️ Maintain Resources: Keep current allocation
- ⬇️ Decrease Resources: Remove 1 instance (min 1)
Observe how your actions affect:
- CPU Utilization Over Time: Shows target zones (40-70% optimal, 95% critical)
- Resource Allocation Over Time: Tracks your resource decisions
For faster exploration:
- Select number of steps (1-20)
- Choose action to repeat
- Click "▶️ Run Multiple Steps"
When the episode ends, review:
- Final Score and Pass/Fail status
- Stability, Efficiency, and Performance metrics
- Average CPU, Memory, and Latency
- Total steps and cumulative reward
- Optimal Strategy: Keep CPU and memory between 40-70%
- Watch for Spikes: Request rate changes randomly each step
- Latency Matters: More resources = lower latency
- Cost vs Performance: Balance efficiency with stability
- Good Episode: Stable utilization around 50-60%, minimal resource changes, positive cumulative reward
- Poor Episode: Erratic utilization, frequent resource adjustments, negative cumulative reward
- Failed Episode: CPU or memory exceeds 95% (early termination)
Execute the full test suite:
pytest tests/
Run tests with verbose output:
pytest tests/ -v
Run only property-based tests:
pytest tests/test_environment_properties.py tests/test_reward_properties.py tests/test_grader_properties.py -v
Run a specific test file:
pytest tests/test_environment.py -v
- Launch the UI: Run `python app.py`
- Click "Reset Environment": Initializes a new episode with random starting conditions
- Observe Initial State: Note the CPU utilization, memory utilization, and request rate
- Take Actions:
- If CPU/memory > 70%: Click "Increase Resources" to add a server instance
- If CPU/memory < 40%: Click "Decrease Resources" to remove an instance
- If CPU/memory is 40-70%: Click "Maintain Resources" to keep current allocation
- Watch the Dynamics: Request rate fluctuates randomly each step, affecting utilization
- Complete the Episode: Continue until the episode terminates (100 steps or critical utilization)
- Review Results: Check the grading report showing stability, efficiency, and overall score
from env.environment import CloudResourceEnv
# Create environment
env = CloudResourceEnv(max_steps=100, initial_resources=3)
# Reset to start new episode
state = env.reset()
print(f"Initial state: CPU={state[0]:.1f}%, Memory={state[1]:.1f}%, "
      f"Requests={state[2]}, Resources={state[3]}")
# Run episode
done = False
cumulative_reward = 0
while not done:
    # Simple policy: increase if CPU > 70%, decrease if CPU < 40%, else maintain
    if state[0] > 70:
        action = CloudResourceEnv.ACTION_INCREASE
    elif state[0] < 40:
        action = CloudResourceEnv.ACTION_DECREASE
    else:
        action = CloudResourceEnv.ACTION_MAINTAIN
    # Execute action
    state, reward, done, info = env.step(action)
    cumulative_reward += reward
    print(f"Step {info['step']}: Action={action}, Reward={reward:.2f}, "
          f"CPU={state[0]:.1f}%, Resources={state[3]}")
print(f"Episode finished! Total reward: {cumulative_reward:.2f}")
from env.environment import CloudResourceEnv
from env.grader import EpisodeGrader
# Run an episode
env = CloudResourceEnv()
grader = EpisodeGrader()
state = env.reset()
done = False
while not done:
    action = 1  # Maintain resources (simple baseline)
    state, reward, done, info = env.step(action)
# Grade the episode
results = grader.grade_episode(
    env.episode_states,
    env.episode_actions,
    env.episode_rewards
)
# 'final_score' matches the key shown in the Grader Output example
print(f"Episode Score: {results['final_score']:.3f}")
print(f"Passed: {results['passed']}")
print(f"Stability: {results['stability_score']:.3f}")
print(f"Efficiency: {results['efficiency_score']:.3f}")
print(f"Avg Reward: {results['avg_reward']:.3f}")
The state represents the current observation of the system. In this environment, the state is a 4-dimensional vector:
state = [cpu_utilization, memory_utilization, request_rate, allocated_resources]
# Example: [65.3, 58.2, 47, 3]
- CPU Utilization: Percentage (0-100) indicating how much CPU capacity is being used
- Memory Utilization: Percentage (0-100) indicating how much memory is being used
- Request Rate: Number of incoming requests per time step
- Allocated Resources: Number of server instances currently allocated (minimum 1)
An action is a decision the agent makes to modify resource allocation. There are three discrete actions:
- Action 0 (Decrease): Remove one server instance (minimum 1 instance maintained)
- Action 1 (Maintain): Keep current allocation unchanged
- Action 2 (Increase): Add one server instance
The reward is a numerical signal indicating how good the current state is. The reward function encourages:
- Optimal Utilization (40-70%): Positive rewards when CPU and memory are in this range
- Avoiding Over-Provisioning: Negative rewards when utilization is too low (wasting resources)
- Avoiding Under-Provisioning: Negative rewards when utilization is too high (risking instability)
Reward calculation considers:
- CPU utilization relative to target range (40-70%)
- Memory utilization relative to target range (40-70%)
- Resource cost penalty (encourages using fewer instances when possible)
An episode is a complete simulation run from initialization to termination. Episodes terminate when:
- Maximum steps reached (default: 100 steps)
- CPU utilization exceeds 95% (system instability)
- Memory utilization exceeds 95% (system instability)
The environment simulates realistic cloud behavior:
- Stochastic Request Rate: Incoming requests fluctuate randomly each step, simulating real-world variability
- Resource-Utilization Relationship: More resources → lower utilization; fewer resources → higher utilization
- Request Impact: Higher request rate → higher CPU and memory utilization
- Start with Manual Exploration: Use the UI to get intuition before writing agent code
- Watch for Patterns: Notice how request rate changes affect utilization
- Test Edge Cases: Try maintaining 1 resource with high requests, or 10 resources with low requests
- Observe Termination: See what happens when you let CPU/memory exceed 95%
- Over-Provisioning: Allocating too many resources leads to low utilization and negative rewards (wasted money)
- Under-Provisioning: Allocating too few resources leads to high utilization and negative rewards (instability risk)
- Optimal Zone: Keeping utilization between 40-70% yields positive rewards
- Stochastic Challenges: Random request fluctuations make perfect control impossible
- Print State Transitions: Add print statements to see how state evolves
- Track Reward Components: Modify reward calculator to log CPU reward, memory reward, and resource penalty separately
- Visualize Episodes: Plot CPU/memory utilization over time to identify patterns
- Compare Policies: Run multiple episodes with different strategies and compare results
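As a starting point for comparing policies, the sketch below runs two rule-based policies on a toy stand-in for the environment and averages how often CPU stays in the 40-70% range. Everything here is hypothetical: the simplified dynamics, the `capacity_per_instance` value, and the function names are assumptions for illustration. For real comparisons, swap the toy rollout for `CloudResourceEnv`.

```python
import random


def threshold_policy(cpu):
    """CPU-threshold heuristic: scale up above 70%, down below 40%."""
    if cpu > 70.0:
        return 2
    if cpu < 40.0:
        return 0
    return 1


def always_maintain(cpu):
    """Never change the allocation."""
    return 1


def toy_episode(policy, steps=100, base_rate=60.0, std_dev=15.0,
                capacity_per_instance=40.0, seed=0):
    """Roll out a policy on simplified dynamics (not the real environment)."""
    rng = random.Random(seed)
    resources, in_range = 3, 0
    for step in range(1, steps + 1):
        rate = max(0.0, rng.gauss(base_rate, std_dev))
        cpu = rate / (resources * capacity_per_instance) * 100.0
        if cpu > 95.0:  # instability ends the episode early
            return {"steps": step, "in_range": in_range / step, "crashed": True}
        in_range += 40.0 <= cpu <= 70.0
        resources = max(1, resources + policy(cpu) - 1)
    return {"steps": steps, "in_range": in_range / steps, "crashed": False}


def compare(policies, episodes=20):
    """Average the in-range fraction of each policy over seeded episodes."""
    return {
        name: sum(toy_episode(p, seed=s)["in_range"]
                  for s in range(episodes)) / episodes
        for name, p in policies.items()
    }


scores = compare({"threshold": threshold_policy, "maintain": always_maintain})
```

Seeding each episode keeps the comparison fair: every policy sees the same request traces, so differences in the scores come from the policy rather than from the noise.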
- Modify Reward Function: Adjust target utilization ranges in `env/reward.py`
  reward_calc = RewardCalculator(
      target_cpu_range=(50.0, 80.0),  # Change optimal range
      target_memory_range=(50.0, 80.0),
      resource_cost_weight=0.2  # Increase cost penalty
  )
- Change Environment Parameters: Adjust dynamics in `env/environment.py`
  env = CloudResourceEnv(
      max_steps=200,  # Longer episodes
      initial_resources=5,  # Start with more resources
      base_request_rate=100  # Higher baseline load
  )
- Implement Simple Policies: Create rule-based agents to test different strategies
  def aggressive_policy(state):
      """Increase when CPU or memory utilization > 60%; otherwise maintain"""
      return 2 if state[0] > 60 or state[1] > 60 else 1

  def conservative_policy(state):
      """Increase only when CPU or memory utilization > 80%; otherwise decrease"""
      return 2 if state[0] > 80 or state[1] > 80 else 0
- Add New State Features: Include additional metrics like network bandwidth or disk I/O
- Implement Multi-Step Actions: Allow increasing/decreasing by 2 or 3 instances at once
- Add Cost Models: Introduce different instance types with varying costs and capacities
- Create Visualization Tools: Plot episode trajectories, reward distributions, or policy heatmaps
- Train an RL Agent: Implement Q-learning, DQN, or PPO to learn optimal policies
- Add Constraints: Introduce budget limits, SLA requirements, or resource quotas
- Multi-Objective Optimization: Balance cost, performance, and reliability simultaneously
- Realistic Workload Patterns: Model daily/weekly traffic patterns instead of random fluctuations
- Add Property Tests: Write new Hypothesis tests for custom features
- Benchmark Policies: Create a test suite comparing different strategies
- Stress Testing: Test environment behavior under extreme conditions
cloud-resource-allocation-rl/
├── env/ # Core environment logic
│ ├── __init__.py
│ ├── environment.py # CloudResourceEnv class (state, actions, dynamics)
│ ├── reward.py # RewardCalculator class (reward shaping)
│ ├── grader.py # EpisodeGrader class (performance evaluation)
│ └── config.py # Configuration dataclasses
├── tests/ # Test suite
│ ├── __init__.py
│ ├── test_environment.py # Unit tests for environment
│ ├── test_environment_properties.py # Property tests for environment
│ ├── test_reward.py # Unit tests for reward calculator
│ ├── test_reward_properties.py # Property tests for reward calculator
│ ├── test_grader.py # Unit tests for grader
│ └── test_grader_properties.py # Property tests for grader
├── app.py # Gradio interactive UI
├── requirements.txt # Python dependencies
└── README.md # This file
Issue: ModuleNotFoundError: No module named 'gradio'
- Solution: Run `pip install -r requirements.txt` to install dependencies
Issue: UI doesn't open in browser
- Solution: Manually navigate to `http://127.0.0.1:7860` or check terminal for the correct URL
Issue: Tests fail with hypothesis errors
- Solution: Ensure Hypothesis is installed: `pip install hypothesis`
Issue: Python version error
- Solution: Upgrade to Python 3.8+; run `python --version` to check the current version
This project is ready for deployment on Hugging Face Spaces:
Ensure your repository contains:
- `app.py` - Main Gradio application
- `requirements.txt` - All dependencies
- `env/` directory - Environment modules
- `README.md` - This documentation
- Go to huggingface.co/spaces
- Click "Create new Space"
- Choose:
- SDK: Gradio
- Hardware: CPU Basic (free tier works fine)
- Visibility: Public or Private
Option A - Git:
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
git push hf main
Option B - Web Interface:
- Upload files directly through the Hugging Face web interface
The space will automatically:
- Install dependencies from `requirements.txt`
- Run `app.py`
- Launch the Gradio interface
- Check build logs for any errors
- Test the deployed app at `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
For local deployment:
# Standard launch
python app.py
# Custom port
python app.py --server-port 8080
# Share publicly (temporary link)
# Modify app.py: demo.launch(share=True)
No environment variables required. All configuration is in code.
Issue: Build fails on Hugging Face
- Solution: Check `requirements.txt` has all dependencies
- Solution: Ensure Python 3.8+ compatibility
Issue: App crashes on startup
- Solution: Check logs for import errors
- Solution: Verify all `env/` modules are uploaded
Issue: Slow performance
- Solution: Upgrade to GPU hardware (if needed for RL training)
- Solution: Current CPU tier is sufficient for demo
- torch>=2.0.0: PyTorch for tensor operations and neural network integration
- numpy: Numerical computations and state representation
- gradio: Interactive web UI for manual environment control
- matplotlib: Real-time plotting and visualization
- hypothesis: Property-based testing framework
- pytest: Unit testing framework
This project is provided for educational purposes.