# Claude Code mode for autonomous AI agent research
Inspired by Andrej Karpathy's autoresearch, this framework enables AI agents to autonomously experiment, iterate, and improve systems while humans only set the research agenda.
You don't touch the code directly. Instead, you use the setup script to create a research project, then let Claude Code do the autonomous research:

1. Create a research project - using the setup script
2. Open in Claude Code - the AI reads `program.md`
3. Autonomous iteration - Claude experiments, evaluates, and improves
4. Review results - see what worked and what didn't
```shell
python setup.py prompt ./my-experiment \
  --task "Extract the main sentiment from text (positive, negative, or neutral)" \
  --eval-cases ./examples/sentiment-classification/eval_cases.json \
  --max-experiments 20 \
  --time 30
```

This creates:
```
my-experiment/
├── program.md        # Instructions for Claude (read this first!)
├── prompt.txt        # The prompt to optimize
├── eval.py           # Evaluation script
├── eval_cases.json   # Test cases
└── README.md         # Project documentation
```
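The schema of `eval_cases.json` isn't shown above; a plausible minimal shape is a list of input/expected pairs. The field names below (`input`, `expected`) are assumptions — check the file that `setup.py` actually generates:

```python
import json

# Hypothetical eval_cases.json contents for the sentiment task.
# The real schema is defined by setup.py; these field names are assumed.
cases = [
    {"input": "I love this product!", "expected": "positive"},
    {"input": "Terrible experience, would not recommend.", "expected": "negative"},
    {"input": "The package arrived on Tuesday.", "expected": "neutral"},
]

def validate_cases(cases: list[dict]) -> None:
    """Sanity-check that every case has the fields an eval script would need."""
    for i, case in enumerate(cases):
        assert "input" in case and "expected" in case, f"case {i} is malformed"
        assert case["expected"] in {"positive", "negative", "neutral"}

validate_cases(cases)
with open("eval_cases.json", "w") as f:
    json.dump(cases, f, indent=2)
```

A quick validation pass like this is cheap insurance before handing the file to an autonomous loop, where a malformed case could silently skew every experiment.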
```shell
cd my-experiment
claude-code .
```

Then tell Claude:

> Hi! Please read program.md and let's start the autonomous research.
Claude will then:

1. Read the current `prompt.txt`
2. Run a baseline evaluation
3. Start iterating on the prompt
4. Report progress after each experiment
5. Stop when it reaches the goal or budget
After the research session, check the final prompt.txt to see the optimized result.
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Claude reads program.md and current target file          │
│ 2. Claude proposes a small change to target file            │
│ 3. Claude runs the evaluation command                       │
│ 4. Claude checks if metric improved                         │
│ 5. If better: keep changes                                  │
│    If worse: revert to best                                 │
│ 6. Repeat until max_experiments or time budget              │
└─────────────────────────────────────────────────────────────┘
```
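The loop above is a greedy hill climb, and can be sketched in a few lines of Python. This is a simplified stand-in, not the framework's implementation: `run_eval` and `propose_change` are stubs for "run eval.py" and "Claude edits the file":

```python
import random

def run_eval(prompt: str) -> float:
    """Stub for running eval.py: score the current prompt.
    Here it just rewards longer prompts so the loop has something to climb."""
    return min(len(prompt) / 100, 1.0)

def propose_change(prompt: str) -> str:
    """Stub mutation: in the real loop, Claude edits prompt.txt itself."""
    return prompt + random.choice([" Be concise.", " Think step by step."])

def research_loop(prompt: str, max_experiments: int = 20) -> tuple[str, float]:
    best_prompt, best_score = prompt, run_eval(prompt)   # baseline evaluation
    for _ in range(max_experiments):
        candidate = propose_change(best_prompt)          # small change
        score = run_eval(candidate)                      # run the evaluation
        if score > best_score:                           # better: keep it
            best_prompt, best_score = candidate, score
        # worse: best_prompt stays untouched (i.e. revert to best)
    return best_prompt, best_score

prompt, score = research_loop("Classify the sentiment of the text.")
```

The key property to preserve in any variant is that the loop only ever moves from the best-known state, so the metric is monotonically non-decreasing across experiments.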
| Principle | Description |
|---|---|
| Single file to modify | The agent only touches one file, keeping diffs reviewable |
| Fixed time budget | The session runs within a total time budget, keeping runs bounded and comparable |
| Claude Code native | Leverages Claude Code's full capabilities for autonomous research |
| Simple and observable | Minimal dependencies, easy to run anywhere |
Optimize LLM system prompts for specific tasks:
```shell
python setup.py prompt ./sentiment-analysis \
  --task "Classify text sentiment as positive, negative, or neutral" \
  --eval-cases ./examples/sentiment-classification/eval_cases.json \
  --max-experiments 20 \
  --time 30
```

Optimize machine learning model configurations:
```shell
python setup.py ml ./training-experiment \
  --task "Optimize neural network for MNIST classification" \
  --dataset "https://example.com/mnist.pkl" \
  --max-experiments 15 \
  --time 45
```

Optimize Retrieval-Augmented Generation pipelines:
```shell
python setup.py rag ./rag-experiment \
  --task "Optimize RAG for technical documentation Q&A" \
  --eval-cases ./rag_eval_cases.json \
  --max-experiments 20 \
  --time 30
```

Optimizes: chunk size, overlap, top-k retrieval, reranking, and generation parameters.
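Those knobs can be pictured as a single configuration object that each experiment perturbs. The dataclass below is an illustration of the parameter space, not the file the framework generates — the field names and defaults are assumptions:

```python
from dataclasses import dataclass, asdict

@dataclass
class RAGConfig:
    """Hypothetical knobs a RAG experiment might tune (names are illustrative)."""
    chunk_size: int = 512      # tokens per document chunk
    chunk_overlap: int = 64    # tokens shared between adjacent chunks
    top_k: int = 5             # chunks retrieved per query
    rerank: bool = True        # apply a reranker to retrieved chunks
    temperature: float = 0.2   # generation parameter

    def __post_init__(self) -> None:
        # Guard against configs the loop could propose but that make no sense.
        assert 0 <= self.chunk_overlap < self.chunk_size, "overlap must fit in a chunk"

baseline = RAGConfig()
candidate = RAGConfig(chunk_size=256, top_k=8)  # one experiment's small change
print(asdict(candidate))
```

Encoding validity constraints in the config itself (as in `__post_init__`) keeps an autonomous loop from wasting experiments on configurations that would fail anyway.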
Optimize tool descriptions and function calling prompts:
```shell
python setup.py tools ./agent-experiment \
  --task "Optimize tool selection for web search agent" \
  --eval-cases ./tool_scenarios.json \
  --max-experiments 15 \
  --time 25
```

Optimizes: tool descriptions, parameter documentation, system prompts, and tool ordering.
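To make "tool descriptions" concrete: tools are typically declared in a JSON-schema function-calling style, and the description strings are exactly the text the agent iterates on. This schema is a hypothetical example, not one generated by the framework:

```python
# Hypothetical tool schema in the common JSON-schema function-calling style.
# The "description" strings are the text the optimization loop rewrites.
web_search_tool = {
    "name": "web_search",
    "description": "Search the web for up-to-date information. "
                   "Use this when the answer may have changed recently.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A focused search query of 3-8 keywords.",
            },
            "num_results": {
                "type": "integer",
                "description": "How many results to return (default 5).",
            },
        },
        "required": ["query"],
    },
}
```

Because the model decides when and how to call a tool purely from this text, small wording changes (e.g. adding "Use this when...") can measurably shift tool-selection accuracy, which is what the evaluation scenarios score.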
```
python setup.py <type> <output_dir> [options]

Types:
  prompt   Prompt optimization for LLM tasks
  ml       ML hyperparameter tuning
  rag      RAG pipeline optimization
  tools    Tool/function calling optimization

Options:
  --task, -t             Task description (required)
  --eval-cases, -e       Evaluation cases JSON file (for prompt, rag, tools types)
  --dataset, -d          Dataset URL (for ML type)
  --max-experiments, -n  Maximum number of experiments (default: 20)
  --time-budget          Total time budget in minutes (default: 30)
```

The examples/sentiment-classification/ directory contains a complete example for prompt optimization.
To try it:
```shell
cd examples/sentiment-classification
claude-code .
```

Then tell Claude: "Please read program.md and let's start the autonomous research."
- Python 3.10+
- Claude Code - The Claude CLI tool
- uv (for dependency management in research projects)
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - feel free to use this for your own projects.
- Inspired by Andrej Karpathy's autoresearch
- Built for Claude Code
Wake up to better systems, automatically.