AI-powered project generator that transforms natural language descriptions into complete, production-ready codebases, validated inside a CubeSandbox microVM.
📄 Paper submitted to ICML 2026
A novel approach to autonomous code generation using multi-agent systems with iterative self-healing and comprehensive validation across diverse programming paradigms.
- Planning Agent: Analyzes errors and generates comprehensive fix strategies using tool-augmented reasoning
- Correction Agent: Executes fixes with code understanding and validation
- Iterative Self-Healing: Automatically detects and resolves dependency conflicts, build errors, and test failures
- Natural language to production-ready code
- Multi-file project generation with proper structure
- Support for modern languages and frameworks
- Intelligent dependency resolution
- Best practices and design patterns
- Every shell command from the planner runs inside a CubeSandbox microVM
- Project tree mirrored to `/workspace` on session start; subsequent edits are written through
- Drop-in `e2b_code_interpreter` SDK — no Docker daemon required
- Sandbox is killed automatically when the pipeline exits
- 40 Programming Challenges across 4 languages:
- CUDA: GPU computing and parallel algorithms (10 challenges)
- Go: Concurrent systems and distributed computing (10 challenges)
- Rust: Memory-safe systems programming (10 challenges)
- TypeScript: Type-safe applications and frameworks (10 challenges)
- 4-Tier Difficulty System: From fundamentals to production systems
- Comprehensive benchmarking and metrics collection
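The suite layout above (4 languages x 10 challenges, binned into 4 tiers) can be sketched as a small data model. This is purely illustrative; the `Challenge` class and tier-assignment rule are assumptions, not AlphaStack's actual schema.

```python
from dataclasses import dataclass

# Hypothetical data model for the 40-challenge evaluation suite;
# field names and tier assignment are illustrative, not AlphaStack's schema.
@dataclass(frozen=True)
class Challenge:
    language: str   # "cuda", "go", "rust", or "typescript"
    tier: int       # 1 (fundamentals) through 4 (production systems)
    name: str

def build_suite() -> list:
    """Enumerate 4 languages x 10 challenges, spread over tiers 1-4."""
    suite = []
    for lang in ("cuda", "go", "rust", "typescript"):
        for i in range(10):
            tier = i // 3 + 1  # illustrative: 3 challenges per lower tier, 1 at tier 4
            suite.append(Challenge(lang, tier, f"{lang}-challenge-{i + 1}"))
    return suite

suite = build_suite()
print(len(suite), "challenges")  # 40 challenges
```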
```mermaid
graph LR
A[Natural Language Input] --> B[AI Analysis & Blueprint]
B --> C[Multi-File Code Generation]
C --> D[Dependency Resolution]
D --> E[CubeSandbox Provisioning]
E --> F[Build Validation]
F --> G{Build Success?}
G -->|No| H[Planning Agent]
H --> I[Correction Agent]
I --> F
G -->|Yes| J[Test Execution]
J --> K{Tests Pass?}
K -->|No| H
K -->|Yes| L[Production-Ready Project]
style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
style B fill:#9B59B6,stroke:#6C3483,stroke-width:2px,color:#fff
style C fill:#E67E22,stroke:#A04000,stroke-width:2px,color:#fff
style D fill:#3498DB,stroke:#1F618D,stroke-width:2px,color:#fff
style E fill:#1ABC9C,stroke:#117A65,stroke-width:2px,color:#fff
style F fill:#E74C3C,stroke:#922B21,stroke-width:2px,color:#fff
style L fill:#27AE60,stroke:#186A3B,stroke-width:2px,color:#fff
```
Core Generation Pipeline:
- Blueprint Generation: Analyzes requirements and creates software architecture
- Folder Structure: Generates project hierarchy with proper organization
- File Generation: Creates all necessary files with content (source, config, tests, docs)
- Metadata Management: Tracks dependencies, entry points, and test commands
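The four stages above can be sketched as a simple function chain. The stage names mirror the list; the bodies are placeholder stubs under assumed data shapes, not AlphaStack's real implementation.

```python
# Minimal sketch of the blueprint -> folders -> files -> metadata pipeline.
# All function bodies are illustrative stubs, not the real generator.
def generate_blueprint(prompt: str) -> dict:
    """Stand-in for LLM-driven requirements analysis."""
    return {"name": "demo", "modules": ["api", "tests"]}

def plan_folders(blueprint: dict) -> list:
    """Derive a folder hierarchy from the blueprint's modules."""
    return [f"src/{m}/" for m in blueprint["modules"]]

def generate_files(folders: list) -> dict:
    """Emit file paths -> contents for each planned folder."""
    return {folder + "__init__.py": "" for folder in folders}

def collect_metadata(files: dict) -> dict:
    """Track entry points, test commands, and the generated file list."""
    return {"entry_point": "src/api/__init__.py",
            "test_command": "pytest",
            "files": sorted(files)}

def run_pipeline(prompt: str) -> dict:
    blueprint = generate_blueprint(prompt)
    return collect_metadata(generate_files(plan_folders(blueprint)))

meta = run_pipeline("A Flask REST API")
print(meta["test_command"])  # pytest
```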
Intelligent Error Resolution:
- Error Tracking: Monitors all errors across build and test phases
- Tool-Augmented Planning: Uses file operations, command execution, and analysis tools
- Context-Aware Fixes: Understands project structure and dependencies
- Iterative Refinement: Continues until success or max iterations reached
Validation & Testing:
- CubeSandbox Isolation: Every shell command runs inside a microVM with `/workspace` mirroring the project tree
- Command Detection: Automatically identifies build/test commands
- Log Analysis: Extracts and analyzes error messages
- Success Verification: Validates complete pipeline execution
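The "Command Detection" step above could be as simple as mapping project manifest files to canonical build/test commands. The mapping below is a hypothetical heuristic, not AlphaStack's actual detector.

```python
# Illustrative build/test command detection from manifest files.
# The mapping is an assumption, not AlphaStack's real detection logic.
MANIFEST_COMMANDS = {
    "Cargo.toml":     ("cargo build", "cargo test"),
    "go.mod":         ("go build ./...", "go test ./..."),
    "package.json":   ("npm install", "npm test"),
    "pyproject.toml": ("pip install .", "pytest"),
}

def detect_commands(files):
    """Return (build_command, test_command) for the first recognized manifest."""
    for manifest, commands in MANIFEST_COMMANDS.items():
        if manifest in files:
            return commands
    return None

print(detect_commands(["src/main.rs", "Cargo.toml"]))  # ('cargo build', 'cargo test')
```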
Requirements:
- Python 3.9+
- Google Gemini API Key
- CubeSandbox (optional, for sandboxed validation)
```bash
# Clone and install
git clone https://github.com/HyperKuvid-Labs/alpha-stack.git
cd alpha-stack
pip install .

# Configure API key
alphastack setup
```

CubeSandbox Installation (Recommended):
```bash
# One-click install of the local CubeSandbox stack
curl -sL https://github.com/tencentcloud/CubeSandbox/raw/master/deploy/one-click/online-install.sh | bash

# Create a sandbox template based on the official code-runner image
cubemastercli tpl create-from-image \
  --image ccr.ccs.tencentyun.com/ags-image/sandbox-code:latest \
  --writable-layer-size 1G

# Wire the template id into AlphaStack (overrideable via env vars)
alphastack sandbox --template-id <id>
```

Sandbox environment variables:
| Variable | Default | Purpose |
|---|---|---|
| `E2B_API_URL` | `http://127.0.0.1:3000` | CubeSandbox API endpoint |
| `E2B_API_KEY` | `dummy` | API key (CubeSandbox local mode does not enforce it) |
| `CUBE_TEMPLATE_ID` | (unset) | Template id for the sandbox image |
If `CUBE_TEMPLATE_ID` (or the saved config) is missing, the testing pipeline falls back to running shell commands directly on the host and prints instructions on how to set it up.
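The fallback behaviour could look like the sketch below: prefer an explicit `CUBE_TEMPLATE_ID`, otherwise fall back to host execution. The env-var names follow the table above; the function itself and the saved-config shape are assumptions.

```python
import os

# Illustrative resolution of sandbox vs. host execution mode.
# Env-var names match the README table; the logic is a sketch.
def resolve_execution_mode(env=None, saved_config=None):
    env = os.environ if env is None else env
    template_id = env.get("CUBE_TEMPLATE_ID") or (saved_config or {}).get("template_id")
    if template_id:
        return {"mode": "sandbox",
                "template_id": template_id,
                "api_url": env.get("E2B_API_URL", "http://127.0.0.1:3000")}
    # No template configured: run shell commands directly on the host.
    return {"mode": "host", "template_id": None}

print(resolve_execution_mode(env={}))                               # host fallback
print(resolve_execution_mode(env={"CUBE_TEMPLATE_ID": "tpl-123"}))  # sandbox mode
```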
Interactive Mode:
```bash
alphastack
# Follow the interactive prompts to generate your project
```

Command Line:
```bash
# Generate a project
alphastack generate "A Flask REST API with user authentication and JWT tokens"

# Specify output directory
alphastack generate "Python CLI tool for file processing" -o /path/to/output

# Generate with custom name
alphastack generate "React TypeScript dashboard with charts"

# List generated projects
alphastack list

# Clean up projects
alphastack clean
```

Example Projects:
```bash
# Web Applications
alphastack generate "Express.js REST API with MongoDB and authentication"
alphastack generate "FastAPI service with PostgreSQL and async operations"

# CLI Tools
alphastack generate "Python CLI tool for image compression with progress bar"
alphastack generate "Go CLI for log analysis with concurrent processing"

# Data Processing
alphastack generate "Rust program for parallel CSV processing"
alphastack generate "Python script for web scraping with retry logic"

# System Programming
alphastack generate "CUDA kernel for matrix multiplication optimization"
alphastack generate "Go service with gRPC and protocol buffers"
```

AlphaStack includes a comprehensive evaluation framework with 40 carefully designed programming challenges across 4 modern languages, organized into 4 difficulty tiers:
CUDA:
- Focus: Parallel computing, memory management, kernel optimization
- Challenges: Vector operations → Matrix operations → Sparse algorithms → Ray tracing engines
- Tier 4 Example: Ray tracing engine with BVH acceleration structure

Go:
- Focus: Distributed systems, goroutines, channels, service architecture
- Challenges: Worker pools → REST APIs → Load balancers → Raft consensus
- Tier 4 Example: Full Raft consensus protocol implementation

Rust:
- Focus: Memory safety, ownership, lifetimes, zero-cost abstractions
- Challenges: Custom iterators → HTTP parsers → Procedural macros → Custom allocators
- Tier 4 Example: Custom bump allocator as global allocator with FFI

TypeScript:
- Focus: Type system, generics, inference, compile-time safety
- Challenges: Event emitters → Type-safe routers → DI containers → Full-stack RPC
- Tier 4 Example: End-to-end type-safe RPC framework with inference
| Tier | Focus | Complexity | Lines of Code | Time |
|---|---|---|---|---|
| Tier 1 | Fundamentals | Single concept, basic algorithms | 150-400 | 2-4h |
| Tier 2 | Architecture | Multiple modules, abstractions | 400-700 | 4-8h |
| Tier 3 | Advanced | Domain expertise, algorithms | 500-900 | 8-16h |
| Tier 4 | Production | Complete systems, optimization | 800-1500 | 16-32h |
- Success Rate: Percentage of challenges solved correctly
- Build Success: Projects that compile/build without errors
- Test Pass Rate: Projects with passing test suites
- Iteration Count: Average iterations needed for error resolution
- Time to Solution: End-to-end generation time
- Code Quality: Adherence to best practices and patterns
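The aggregate metrics above could be computed from per-challenge run records like this. The record fields (`built`, `tests_passed`, `iterations`) are illustrative, not the framework's actual schema.

```python
# Illustrative aggregation of benchmark metrics from per-challenge records.
def summarize(runs):
    """Compute success rate, build success, and average iteration count."""
    n = len(runs)
    return {
        "success_rate": sum(r["tests_passed"] for r in runs) / n,
        "build_success": sum(r["built"] for r in runs) / n,
        "avg_iterations": sum(r["iterations"] for r in runs) / n,
    }

runs = [
    {"built": True,  "tests_passed": True,  "iterations": 1},
    {"built": True,  "tests_passed": False, "iterations": 5},
    {"built": False, "tests_passed": False, "iterations": 5},
    {"built": True,  "tests_passed": True,  "iterations": 2},
]
print(summarize(runs))
# {'success_rate': 0.5, 'build_success': 0.75, 'avg_iterations': 3.25}
```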
Evaluation Location: src/prompts/eval/ contains all challenge specifications and test cases.
```
alpha-stack/
├── src/
│   ├── agents/              # Multi-agent system
│   │   ├── planner.py       # Planning agent for error analysis
│   │   └── corrector.py     # Correction agent for fixes
│   ├── sandbox/             # CubeSandbox integration
│   │   └── cube.py          # CubeSession + SandboxShellManager
│   ├── testing/             # Planner-driven testing pipeline
│   │   ├── eval_generator.py  # Test-file blueprint + generator
│   │   └── testing.py       # TestingPipeline (sandbox lifecycle)
│   ├── prompts/             # Jinja2 prompt templates
│   │   └── eval/            # Evaluation challenges
│   │       ├── cuda/        # 10 CUDA challenges
│   │       ├── go/          # 10 Go challenges
│   │       ├── rust/        # 10 Rust challenges
│   │       └── typescript/  # 10 TypeScript challenges
│   ├── utils/               # Core utilities
│   │   ├── helpers.py       # Helper functions
│   │   ├── prompt_manager.py  # Template management
│   │   ├── error_tracker.py # Error tracking
│   │   └── tools.py         # Tool definitions
│   ├── generator.py         # Main generation logic
│   ├── eval_generator.py    # Evaluation system
│   ├── cli.py               # Command-line interface
│   ├── tui.py               # Terminal UI
│   └── config.py            # Configuration management
├── website/                 # Project website
├── test_runner.py           # Development test runner
└── pyproject.toml           # Project metadata
```
- Primary Model: Google Gemini (configurable via `MODEL_NAME`)
- Alternative Support: OpenRouter API for evaluation framework
- Context Management: Intelligent prompt engineering with Jinja2 templates
Planning Agent (src/agents/planner.py):
- Analyzes build/test errors using structured error tracking
- Generates comprehensive fix plans with tool-based reasoning
- Maintains project structure cache for efficient planning
- Supports different error types (dependency, docker, common errors)
Correction Agent (src/agents/corrector.py):
- Executes planned fixes with code understanding
- Validates code changes before application
- Uses language-specific parsers for syntax validation
- Tracks changes to prevent infinite loops
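The loop-prevention point above can be sketched as a change tracker that refuses to re-apply a fix whose resulting file state has already been seen. The class name and hashing scheme are illustrative assumptions.

```python
import hashlib

# Illustrative change tracking to prevent infinite fix loops: a repeated
# (path, content) state signals the corrector is cycling, not progressing.
class ChangeTracker:
    def __init__(self):
        self.seen = set()

    def record(self, path, content):
        """Return True if this file state is new, False if already applied."""
        digest = hashlib.sha256(f"{path}\0{content}".encode()).hexdigest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True

tracker = ChangeTracker()
print(tracker.record("main.py", "print('v1')"))  # True: first application
print(tracker.record("main.py", "print('v1')"))  # False: repeated fix, likely a loop
print(tracker.record("main.py", "print('v2')"))  # True: genuinely new change
```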
Features:
- Per-pipeline microVM provisioned from a configured template
- Project tree mirrored to `/workspace` on session start; subsequent edits are written through
- Shell commands stream stdout/stderr live so the planner can detect stalls
- Sandbox is killed automatically when the pipeline exits (no leaked microVMs)
- Falls back to host execution when no template is configured
Testing Framework (src/testing/testing.py + src/sandbox/cube.py):
- `CubeSession` owns the sandbox handle and file mirroring
- `SandboxShellManager` is a drop-in replacement for the host `ShellManager`
- Real-time log capture and analysis
- Iterative error resolution with max round limits
- Success/failure validation with detailed reporting
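The "log capture and analysis" step above amounts to pulling error lines out of raw build output. The regex below is a simple heuristic sketch, not AlphaStack's actual parser.

```python
import re

# Illustrative extraction of error lines from a captured build log.
# Matches lines starting with "error"/"Error"/"ERROR" and compiler-style
# codes like "error[E0425]:"; purely a heuristic.
ERROR_PATTERN = re.compile(r"^(?:error|ERROR|Error)\b.*|.*\berror\[[A-Z0-9]+\]:.*")

def extract_errors(log):
    return [line for line in log.splitlines() if ERROR_PATTERN.match(line.strip())]

log = """\
Compiling demo v0.1.0
error[E0425]: cannot find value `x` in this scope
warning: unused variable `y`
Error: build failed
"""
print(extract_errors(log))
```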
Template System:
- Jinja2-based prompt templates for consistency
- Context-aware prompt rendering
- Specialized templates for different generation phases:
- Software blueprint generation
- Folder structure planning
- File content generation
- Error correction strategies
- Sandbox-aware planner instructions
- Languages: Python, JavaScript/TypeScript, Go, Rust, Java, C/C++, CUDA, and more
- Frameworks: Flask, FastAPI, Express.js, React, Vue, Next.js, etc.
- Project Types: Web APIs, CLI tools, data processors, system utilities, GPU kernels
- File Types: Source code, configuration, tests, documentation
- Dependency Resolution: Automatically resolves missing packages and version conflicts
- Build Fixes: Corrects syntax errors, import issues, configuration problems
- Test Fixes: Addresses failing tests, missing test dependencies, assertion errors
- Max Iterations: Configurable (default: 5 per phase)
- Startup: Sub-second microVM provisioning per pipeline run
- Test Execution: Isolated `/workspace` mirroring the project tree
- Success Rate: >80% on Tier 1-2 challenges
- Lifecycle: Single sandbox per project run, killed on completion
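The error categories handled above (dependency, build, test) could be triaged with a keyword classifier like the sketch below. The keyword lists are illustrative assumptions, not the planner's real taxonomy.

```python
# Illustrative triage of error messages into the categories named above.
# Keyword lists are assumptions; order gives dependency errors priority.
CATEGORIES = {
    "dependency": ("no module named", "no matching distribution", "unresolved import"),
    "test": ("assertionerror", "test failed", "expected"),
    "build": ("syntaxerror", "compilation", "undefined reference"),
}

def classify_error(message):
    lowered = message.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unknown"

print(classify_error("ModuleNotFoundError: No module named 'flask'"))  # dependency
print(classify_error("AssertionError: test failed at line 3"))         # test
print(classify_error("SyntaxError: invalid syntax"))                   # build
```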
This work introduces a novel approach to autonomous code generation that addresses key challenges in AI-assisted software development:
- Multi-Agent Architecture: Separation of planning and correction concerns for better error resolution
- Iterative Self-Healing: Autonomous error detection and correction without human intervention
- Comprehensive Validation: End-to-end validation from build to test execution inside CubeSandbox microVMs
- Cross-Language Evaluation: Diverse evaluation suite spanning different programming paradigms
- Tool-Augmented Reasoning: Integration of file operations and command execution for context-aware fixes
- How effectively can multi-agent systems autonomously resolve software errors?
- What is the success rate across different programming paradigms and difficulty levels?
- How many iterations are typically required for convergence to a working solution?
- What types of errors can be automatically resolved vs. requiring human intervention?
The evaluation framework (src/prompts/eval/) provides a standardized benchmark with:
- 40 challenges across 4 languages and 4 difficulty tiers
- Clear success criteria (build success, test pass rate)
- Reproducible evaluation inside CubeSandbox microVMs
- Metrics for iteration count, time to solution, and code quality
For more details on the evaluation suite, see src/prompts/eval/README.md
We welcome contributions! Areas of interest:
- Additional programming language support
- New evaluation challenges
- Performance optimizations
- Documentation improvements
- Bug fixes and error handling
MIT License - see LICENSE file for details
- Repository: github.com/HyperKuvid-Labs/alpha-stack
- Issues: github.com/HyperKuvid-Labs/alpha-stack/issues
- Evaluation Suite: src/prompts/eval/
For research collaborations or questions about the ICML 2026 submission, please open an issue or contact the AlphaStack Team.
AlphaStack - Transforming Ideas into Code
Submitted to ICML 2026
