
AutoResearch

Claude Code mode for autonomous AI agent research

Python 3.10+ | License: MIT | Claude Code

Inspired by Andrej Karpathy's autoresearch, this framework enables AI agents to autonomously experiment, iterate, and improve systems while humans only set the research agenda.

The Core Idea

You don't touch the code directly. Instead, you use the setup script to create a research project, then let Claude Code do the autonomous research:

  1. Create a research project - use the setup script
  2. Open it in Claude Code - the AI reads program.md
  3. Autonomous iteration - Claude experiments, evaluates, and improves
  4. Review results - see what worked and what didn't

Quick Start

1. Create a Prompt Optimization Research Project

python setup.py prompt ./my-experiment \
  --task "Extract the main sentiment from text (positive, negative, or neutral)" \
  --eval-cases ./examples/sentiment-classification/eval_cases.json \
  --max-experiments 20 \
  --time 30

This creates:

my-experiment/
├── program.md          # Instructions for Claude (read this first!)
├── prompt.txt          # The prompt to optimize
├── eval.py             # Evaluation script
├── eval_cases.json     # Test cases
└── README.md           # Project documentation
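
The format of eval_cases.json depends on the eval.py that setup.py generates; as a minimal sketch, a test-case file for the sentiment task could look like the following (the input/expected schema is an assumption, not the confirmed format):

import json

# Hypothetical eval cases for the sentiment task; the real schema is
# whatever the generated eval.py expects.
cases = [
    {"input": "I love this product!", "expected": "positive"},
    {"input": "The package arrived late and damaged.", "expected": "negative"},
    {"input": "The box contains two cables.", "expected": "neutral"},
]

with open("eval_cases.json", "w") as f:
    json.dump(cases, f, indent=2)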


2. Open in Claude Code

cd my-experiment
claude-code .

3. Tell Claude What to Do

Hi! Please read program.md and let's start the autonomous research.

Claude will then:

  • Read the current prompt.txt
  • Run a baseline evaluation
  • Start iterating on the prompt
  • Report progress after each experiment
  • Stop when it reaches the goal or budget


4. Review Results

After the research session, check the final prompt.txt to see the optimized result.


How It Works

The Research Loop

┌─────────────────────────────────────────────────────────────┐
│  1. Claude reads program.md and current target file         │
│  2. Claude proposes a small change to target file           │
│  3. Claude runs the evaluation command                      │
│  4. Claude checks if metric improved                        │
│  5. If better: keep changes                                 │
│     If worse: revert to best                                │
│  6. Repeat until max_experiments or time budget             │
└─────────────────────────────────────────────────────────────┘
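
In code, the keep-or-revert logic looks roughly like the sketch below. This is only an illustration of the loop Claude follows, not part of the framework: the eval command, the score parsing, and the snapshot file name are assumptions.

import shutil
import subprocess
import time

TARGET = "prompt.txt"            # the single file the agent may modify
BEST = "prompt.best.txt"         # snapshot of the best version so far
MAX_EXPERIMENTS = 20
TIME_BUDGET_S = 30 * 60

def run_eval() -> float:
    # Assumes the generated eval.py prints its metric as the last stdout line.
    out = subprocess.run(["uv", "run", "eval.py"], capture_output=True, text=True)
    return float(out.stdout.strip().splitlines()[-1])

shutil.copy(TARGET, BEST)
best_score = run_eval()          # steps 3-4: baseline evaluation
start = time.time()
for _ in range(MAX_EXPERIMENTS):
    if time.time() - start > TIME_BUDGET_S:
        break                    # step 6: stop at the time budget
    # step 2: Claude proposes a small change to TARGET here
    score = run_eval()
    if score > best_score:       # step 5: better, keep the change
        best_score = score
        shutil.copy(TARGET, BEST)
    else:                        # worse, revert to the best version
        shutil.copy(BEST, TARGET)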

Design Principles

  • Single file to modify - the agent only touches one file, keeping diffs reviewable
  • Fixed time budget - each experiment runs for the same duration, enabling fair comparison
  • Claude Code native - leverages Claude Code's full capabilities for autonomous research
  • Simple and observable - minimal dependencies, easy to run anywhere

Available Research Types

Prompt Optimization

Optimize LLM system prompts for specific tasks:

python setup.py prompt ./sentiment-analysis \
  --task "Classify text sentiment as positive, negative, or neutral" \
  --eval-cases ./examples/sentiment-classification/eval_cases.json \
  --max-experiments 20 \
  --time 30

ML Hyperparameter Tuning

Optimize machine learning model configurations:

python setup.py ml ./training-experiment \
  --task "Optimize neural network for MNIST classification" \
  --dataset "https://example.com/mnist.pkl" \
  --max-experiments 15 \
  --time 45
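
The ML project gives the agent a configuration to mutate between training runs. The exact target file is defined by the generated project; the dictionary below is only an assumed illustration of the kind of knobs involved:

# Assumed hyperparameters for the MNIST task; names and defaults are
# illustrative, not the file setup.py actually generates.
hyperparams = {
    "learning_rate": 1e-3,
    "batch_size": 64,
    "hidden_units": 128,
    "dropout": 0.2,
    "epochs": 5,
}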

RAG Optimization

Optimize Retrieval-Augmented Generation pipelines:

python setup.py rag ./rag-experiment \
  --task "Optimize RAG for technical documentation Q&A" \
  --eval-cases ./rag_eval_cases.json \
  --max-experiments 20 \
  --time 30

Optimizes: Chunk size, overlap, top-k retrieval, reranking, and generation parameters.
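
As a rough illustration of those knobs (parameter names and defaults here are assumptions, not the generated config):

# Hypothetical RAG parameters the agent could tune; the real target file
# produced by setup.py may use different names.
rag_params = {
    "chunk_size": 512,      # tokens per document chunk
    "chunk_overlap": 64,    # tokens shared between adjacent chunks
    "top_k": 5,             # chunks retrieved per query
    "rerank": True,         # apply a reranking pass after retrieval
    "temperature": 0.2,     # generation parameter for the answer model
}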

Tool/Function Calling

Optimize tool descriptions and function calling prompts:

python setup.py tools ./agent-experiment \
  --task "Optimize tool selection for web search agent" \
  --eval-cases ./tool_scenarios.json \
  --max-experiments 15 \
  --time 25

Optimizes: Tool descriptions, parameter documentation, system prompts, and tool ordering.
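
For example, the agent might iterate on a tool definition like the following (a generic function-calling-style schema used for illustration; the project's actual format is defined by setup.py):

# Hypothetical tool definition; the agent's experiments would rewrite
# fields like "description" and the parameter docs to improve selection.
web_search_tool = {
    "name": "web_search",
    "description": (
        "Search the web for up-to-date information. Use for questions "
        "about current events or facts not in the conversation."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
        },
        "required": ["query"],
    },
}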

Setup Script Options

python setup.py <type> <output_dir> [options]

Types:
  prompt     Prompt optimization for LLM tasks
  ml         ML hyperparameter tuning
  rag        RAG pipeline optimization
  tools      Tool/function calling optimization

Options:
  --task, -t              Task description (required)
  --eval-cases, -e        Evaluation cases JSON file (for prompt, rag, tools types)
  --dataset, -d           Dataset URL (for ML type)
  --max-experiments, -n   Maximum number of experiments (default: 20)
  --time                  Total time budget in minutes (default: 30)

Example: Sentiment Classification

The examples/sentiment-classification/ directory contains a complete example for prompt optimization.

To try it:

cd examples/sentiment-classification
claude-code .

Then tell Claude: "Please read program.md and let's start the autonomous research."

Requirements

  • Python 3.10+
  • Claude Code - The Claude CLI tool
  • uv (for dependency management in research projects)

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - feel free to use this for your own projects.

Acknowledgments

Inspired by Andrej Karpathy's autoresearch experiment.

Wake up to better systems, automatically.
