Use Self-Evolution Finetuning to supercharge your LLM with only 1 prompt!

LLM agents are powerful but sometimes still need explicit learning to solve unfamiliar problems. However, finetuning and prompt engineering are expensive and time-consuming. Self-Evolution is an easy and powerful way to let an LLM teach itself and evolve to get better at your tasks.

LiteEvo is an easy-to-use tool for self-evolution tuning of your models. Instead of tuning model weights, it lets the model repeatedly attempt your task and evolve a playbook (skill): a structured guidance document that captures successful strategies, common pitfalls, and learned heuristics.

LiteEvo Self-Evolution requires only an agent (Claude Code, Codex, or any LLM), 1–10 tasks, and under 10 minutes to finetune a playbook that boosts LLM agent performance!

Over iterations, the playbook accumulates distilled knowledge that dramatically improves task performance, all without changing a single model parameter.
Prerequisites:
- The Python package manager `uv` to run the evolution.
- A working LLM agent. This can be any CLI agent (`claude`, `codex`, and `gemini-cli` are supported) or an OpenAI-compatible API.
- A specific task you want the LLM to be good at, with a criterion for what kind of output is expected.
```bash
# Install
git clone https://github.com/wbopan/liteevo.git && cd liteevo

# Evolve Claude Code
uv run evolve \
  --provider claude \
  --task "Generate a random number from 1-100" \
  --criterion "The number should be uniformly random" \
  --step-size 15 \
  --batch-size 5
```

Watch as the model discovers that it needs actual randomization strategies instead of picking "random-looking" numbers like 37 and 42!
**Playbook**: A structured JSON document containing guidelines, workflows, checklists, and examples. The model updates this document as it learns from successes and failures.

**Evolution Loop**: The iterative process of (sketched in code below):
1. Generate a response using the current playbook
2. Accumulate results into batches
3. Reflect on batch performance against the criteria
4. Update the playbook with learned insights

**Criteria**: Success conditions that define what "good" looks like for each task. These drive the model's self-reflection.
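
To make the loop concrete, here is a minimal Python sketch of one evolution run. All names are illustrative (the real loop lives in `src/liteevolve/evolve.py`), and `llm` stands for any prompt-in, text-out callable:

```python
import json

def evolve(llm, tasks, criteria, step_size=10, batch_size=3):
    """Illustrative sketch of the evolution loop, not the library's code."""
    playbook = {"playbook_version": 0, "sections": {}, "logs": []}
    batch = []
    for step in range(step_size):
        task = tasks[step % len(tasks)]
        criterion = criteria[step % len(criteria)]
        # 1. Generate a response using the current playbook.
        answer = llm(f"Playbook:\n{json.dumps(playbook)}\n\nTask: {task}")
        # 2. Accumulate results into batches.
        batch.append({"task": task, "criterion": criterion, "answer": answer})
        if len(batch) == batch_size:
            # 3.+4. Reflect on the batch against the criteria and update the
            # playbook (assumes the model returns bare JSON here; the real
            # tool extracts the last fenced JSON block instead).
            updated = llm(
                "Review these attempts against their criteria and return an "
                "improved playbook as JSON.\n"
                f"Attempts: {json.dumps(batch)}\n"
                f"Current playbook: {json.dumps(playbook)}"
            )
            playbook = json.loads(updated)
            batch = []
    return playbook
```
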
| Parameter | Description | Typical Value |
|---|---|---|
| `step-size` | Total number of generation iterations | 10-20 |
| `batch-size` | Steps between playbook updates | 3-5 |
| `playbook schema` | Structure defining the playbook format | See below |
| `update prompt` | Template for playbook update instructions | Customizable |

Trade-offs:
- Smaller batch → faster iteration, but risk of overfitting to individual examples
- Larger batch → more stable learning, but slower improvement cycles

For example, `--step-size 15 --batch-size 5` yields three playbook updates, while `--step-size 15 --batch-size 3` yields five.

The playbook is a commented JSON structure:

```json
{
  "playbook_version": 0,
  "title": "Task Strategy Guide",
  "description": "Brief description of what this playbook helps with",
  "sections": {
    "guidelines": [
      "General tips and principles (max 10)"
    ],
    "workflow": [
      "Step-by-step process (max 10)"
    ],
    "checklist": [
      "Items to verify before completing (max 10)"
    ],
    "examples": [
      "Illustrative examples (max 10)"
    ]
  },
  "logs": [
    "v0: Initial. v1: Added randomization strategy",
    "v1: Found pattern. v2: Added edge case handling"
  ]
}
```

```
liteevo/
├── prompts/
│ ├── UPDATE_PLAYBOOK.jinja2 # Playbook update template
│ ├── GENERATE_ANSWER.jinja2 # Generation template
│ └── PLAYBOOK_SCHEMA.txt # Default playbook schema
├── src/liteevolve/
│ ├── __init__.py
│ ├── evolve.py # Core evolution loop
│ ├── provider.py # LLM provider implementations
│ └── cli.py # Command-line interface
└── pyproject.toml
```

```
uv run evolve [OPTIONS]
```

| Option | Description |
|---|---|
| `--provider` | LLM provider: `claude`, `codex`, `gemini`, `openai`, or `cli` |
| `--task` or `--tasks` | Single task string or glob pattern for task files |
| `--criterion` or `--criteria` | Single criterion string or glob pattern for criteria files |

| Option | Default | Description |
|---|---|---|
| `--provider-args` | - | Provider-specific arguments |
| `--output-dir` | `outputs/YYYY-MM-DD-HHMMSS/` | Output directory |
| `--step-size` | 10 | Number of evolution steps |
| `--batch-size` | 3 | Steps per playbook update |
| `--prompt-update-playbook` | `prompts/UPDATE_PLAYBOOK.jinja2` | Update template path |
| `--prompt-generate-answer` | `prompts/GENERATE_ANSWER.jinja2` | Generation template path |
| `--schema-playbook` | `prompts/PLAYBOOK_SCHEMA.txt` | Playbook schema path |

LiteEvo supports multiple LLM providers out of the box:

Uses the Claude Code CLI. No API key is required if Claude Code is already configured.

```bash
uv run evolve --provider claude \
  --task "Your task here" \
  --criterion "Success criterion"
```

With custom arguments:

```bash
uv run evolve --provider claude \
  --provider-args "--model claude-sonnet-4-20250514" \
  --task "Your task" --criterion "Criterion"
```

Supports the OpenAI API and any compatible endpoint (Azure, local models, etc.).

```bash
# Using environment variable
export OPENAI_API_KEY="sk-..."
uv run evolve --provider openai \
  --provider-args "model=gpt-4" \
  --task "Your task" --criterion "Criterion"

# With explicit API key
uv run evolve --provider openai \
  --provider-args "model=gpt-4,api_key=sk-..." \
  --task "Your task" --criterion "Criterion"

# With custom endpoint (e.g., Azure, local)
uv run evolve --provider openai \
  --provider-args "model=gpt-4,base_url=https://your-endpoint.com/v1,api_key=..." \
  --task "Your task" --criterion "Criterion"

# With temperature control
uv run evolve --provider openai \
  --provider-args "model=gpt-4,temperature=0.5" \
  --task "Your task" --criterion "Criterion"
```

OpenAI provider-args format: `key1=value1,key2=value2,...`

| Key | Required | Description |
|---|---|---|
| `model` | Yes | Model name (e.g., `gpt-4`, `gpt-4o`) |
| `api_key` | No* | API key (*uses the `OPENAI_API_KEY` env var if not set) |
| `base_url` | No | Custom API endpoint (default: OpenAI) |
| `temperature` | No | Sampling temperature (default: 0.7) |
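
The exact parsing belongs to the provider implementation, but a `key=value` list like this splits naturally into a dict. A minimal sketch, not the library's code (values containing commas would need a real parser):

```python
def parse_provider_args(spec: str) -> dict[str, str]:
    """Split "key1=value1,key2=value2" into a dict of strings."""
    args = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        args[key.strip()] = value.strip()
    return args

print(parse_provider_args("model=gpt-4,temperature=0.5"))
# -> {'model': 'gpt-4', 'temperature': '0.5'}
```
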
Uses the Gemini CLI.

```bash
uv run evolve --provider gemini \
  --task "Your task" --criterion "Criterion"
```

Uses the Codex CLI (`codex exec`).

```bash
uv run evolve --provider codex \
  --task "Your task" --criterion "Criterion"
```

Use any CLI tool that accepts a prompt and returns output to stdout.

```bash
uv run evolve --provider cli \
  --provider-args "/path/to/your/llm-cli" \
  --task "Your task" --criterion "Criterion"
```

The custom CLI must:
- Accept a prompt as a command-line argument
- Return the response to stdout
- Return exit code 0 on success
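
For example, a conforming wrapper can be a few lines of Python. This sketch assumes the `openai` package and an `OPENAI_API_KEY` in the environment; any backend works as long as the three rules above hold:

```python
#!/usr/bin/env python3
"""llm-cli: take the prompt as argv[1], print the completion to stdout."""
import sys

from openai import OpenAI  # assumes `pip install openai`


def main() -> int:
    if len(sys.argv) < 2:
        print("usage: llm-cli PROMPT", file=sys.stderr)
        return 1  # non-zero exit signals failure
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": sys.argv[1]}],
    )
    print(response.choices[0].message.content)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Make the script executable (`chmod +x`) and pass its path via `--provider-args` as shown above.
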
Evolve a playbook for a single task with repeated iterations:

```bash
uv run evolve --provider claude \
  --task "Write a haiku about programming" \
  --criterion "Must follow 5-7-5 syllable structure strictly" \
  --step-size 15 --batch-size 3
```

Use glob patterns to load multiple tasks and criteria:

```bash
# Assuming examples/ascii_digit/tasks/001.txt, 002.txt, ... and matching criteria files
uv run evolve --provider openai \
  --provider-args "model=gpt-4" \
  --tasks "examples/ascii_digit/tasks/*.txt" \
  --criteria "examples/ascii_digit/criteria/*.txt" \
  --step-size 30 --batch-size 5
```
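
If you do not have task files yet, a few lines of Python can lay out the paired structure those globs expect. The file contents below are made up for illustration; only the directory layout mirrors the example above:

```python
from pathlib import Path

# Hypothetical task/criterion pairs for the ascii_digit layout above.
pairs = {
    "001": ("Draw the digit 3 as ASCII art", "Output must be recognizable as a 3"),
    "002": ("Draw the digit 7 as ASCII art", "Output must be recognizable as a 7"),
}

tasks_dir = Path("examples/ascii_digit/tasks")
criteria_dir = Path("examples/ascii_digit/criteria")
tasks_dir.mkdir(parents=True, exist_ok=True)
criteria_dir.mkdir(parents=True, exist_ok=True)

for name, (task, criterion) in pairs.items():
    (tasks_dir / f"{name}.txt").write_text(task)
    (criteria_dir / f"{name}.txt").write_text(criterion)
```

Note that task and criteria file counts must match (see the notes at the end).
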
Use your own prompt templates:

```bash
uv run evolve --provider claude \
  --task "Solve math problems" \
  --criterion "Answer must be correct" \
  --prompt-update-playbook "./my-update-template.jinja2" \
  --prompt-generate-answer "./my-generate-template.jinja2" \
  --schema-playbook "./my-schema.txt"
```

After evolution, the output directory contains:

```
outputs/2024-01-15-143022/
├── playbooks/
│   ├── playbook_v1.txt
│   ├── playbook_v2.txt
│   └── playbook_v3.txt      # Final evolved playbook
└── generations/
    ├── 000_task000_v0.txt   # Step 0, task 0, playbook v0
    ├── 001_task000_v0.txt
    ├── 002_task000_v0.txt
    ├── playbook_v1.txt      # Full update response for v1
    ├── 003_task000_v1.txt   # Now using playbook v1
    └── ...
```
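
To use the evolved playbook elsewhere, grab the highest-numbered version and prepend it to your own prompts. A small sketch, assuming the layout above:

```python
from pathlib import Path

# Pick the latest playbook_vN.txt by version number.
run_dir = Path("outputs/2024-01-15-143022")
latest = max(
    run_dir.glob("playbooks/playbook_v*.txt"),
    key=lambda p: int(p.stem.split("_v")[1]),
)
playbook = latest.read_text()
prompt = f"{playbook}\n\nTask: Generate a random number from 1-100"
print(prompt)
```
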
When customizing templates, the following variables are available:

| Variable | Type | Description |
|---|---|---|
| `config` | `EvolutionConfig` | Contains `step_size`, `batch_size`, etc. |
| `step_id` | `int` | Current step (0-indexed) |
| `tasks` | `list[str]` | All task inputs |
| `generations` | `list[str]` | All generated outputs so far |
| `criteria` | `list[str]` | All success criteria |
| `playbooks` | `list[str]` | All playbook versions |
| `current_task` | `str` | Current task being processed |
| `current_criterion` | `str` | Current task's criterion |
| `current_playbook` | `str` | Latest playbook version |
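
For instance, a stripped-down generation template might use only a few of these variables. A sketch rendered with the `jinja2` package (illustrative, not the shipped `GENERATE_ANSWER.jinja2`):

```python
from jinja2 import Template  # assumes `pip install jinja2`

template = Template(
    "You are on step {{ step_id }}.\n"
    "Playbook:\n{{ current_playbook }}\n\n"
    "Task: {{ current_task }}\n"
    "Success criterion: {{ current_criterion }}"
)

print(template.render(
    step_id=0,
    current_playbook='{"playbook_version": 0}',
    current_task="Generate a random number from 1-100",
    current_criterion="The number should be uniformly random",
))
```
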
Notes:
- Playbook extraction looks for the last `json` or `jsonc` fenced code block in the update response
- The Claude provider uses `claude -p "prompt"` to generate responses
- Task and criteria counts must match when using glob patterns


