AI research assistant for hypothesis-driven ML experiments.
Daedalus AI is a Claude Code MCP plugin that manages the full experiment lifecycle: hypothesis formation, experiment design, training execution, log monitoring, result analysis, and autonomous parameter exploration. It works as a set of tools that Claude Code can use alongside its native code editing capabilities.
Running ML experiments involves a repetitive loop: change a parameter, launch training, wait hours, analyze results, decide what to try next. Daedalus automates the bookkeeping and gives Claude Code the tools to manage this loop — while you focus on the research questions.
- Hypothesis-driven: Every experiment starts with a hypothesis and predictions, not just "try this"
- One variable at a time: the system enforces scientific discipline by changing a single variable per experiment
- Full lifecycle tracking: Draft → Running → Completed → Analyzed, with structured reflections
- Persistent memory: Research notes, decisions, and dead ends survive across sessions
- Literature integration: Search papers, build a library, generate structured reviews
- Smart exploration: Convergence detection, parameter sensitivity, data-driven suggestions
- Works with any framework: HuggingFace, PyTorch, TensorFlow, custom scripts — Daedalus doesn't care what you train with
- Local or remote: Run on your machine or SSH to GPU servers with automatic sentinel monitoring
- Multi-GPU: torchrun, accelerate, deepspeed launchers with GPU selection
- Batch execution: Distribute experiments across hosts with round-robin scheduling
- Claude Code integration: MCP tools + slash commands + code editing = the AI can fix bugs, not just track experiments
# 1. Install the Python package
pip install daedalus-ai
# or from source:
pip install -e .
# 2. Add the marketplace and install the plugin
/plugin marketplace add sirCamp/daedalus-ai
/plugin install daedalus-ai@daedalus-ai

The plugin registers the MCP server, slash commands, agents, and skills automatically.
# 1. Install the package
pip install -e .
# 2. Register MCP server + install agents, commands, skills + auto-approve tools
daedalus install-mcp
# 3. Restart Claude Code

| Scope | Flag | Where it writes | Use case |
|---|---|---|---|
| `user` (default) | `--scope user` or omit | `~/.claude.json` | Recommended: available in ALL projects |
| `project` | `--scope project` | `.mcp.json` in project dir | Shared via git, per-project |
| `local` | `--scope local` | `~/.claude.json` (local entry) | This project only, not shared |
# Global install (recommended, default)
daedalus install-mcp
# Project-only install (creates .mcp.json, committable to git)
daedalus install-mcp --scope project

Since it's installed in editable mode, Python code changes are immediate. But plugin files (commands, skills, agents) are copied to ~/.claude/, so after updating:
# Marketplace install: update via Claude Code
/plugin update daedalus-ai@daedalus-ai
# Manual install: re-install plugin files + restart Claude Code
daedalus install-mcp

# 1. Create a project
daedalus init my-research
cd my-research
# 2. Edit daedalus.yaml — define your scripts, parameters, stack
# 3. Edit program.md — define your research goals and metrics
# 4. Open Claude Code and start experimenting
# Use /daedalus:experiment, /daedalus:research-status, or just ask in natural language

my-research/
├── daedalus.yaml # Project config: runner, scripts, stack
├── program.md # Research goals and metrics (editable by agent)
├── runner_config.yaml # SSH host configuration
├── ledger/
│ ├── experiments.jsonl # Experiment log (append-only)
│ ├── papers.jsonl # Literature library
│ ├── notes.jsonl # Research memory (decisions, insights, dead ends)
│ └── narrative.md # Auto-generated narrative
└── runs/ # Work directories (auto-created)
└── exp_abc123/
├── logs/
│ ├── stdout.log
│ └── stderr.log
└── results.json
project_name: my-research

runner:
  type: local  # or: ssh

higher_is_better:
  accuracy: true
  loss: false

# Stack — libraries and frameworks (shown to Claude Code for context)
stack:
  python: "3.11"
  requirements: requirements.txt
  libraries:
    - transformers
    - datasets
    - torch
  docs:
    - https://huggingface.co/docs/transformers/

# Scripts registry — what scripts exist and their parameters
scripts:
  train:
    path: train.py
    description: "Main training script"
    parameters:
      model_name:
        type: str
        default: "bert-base-uncased"
        description: "HuggingFace model ID"
      learning_rate:
        type: float
        default: 2e-5
        range: [1e-6, 1e-3]
      batch_size:
        type: int
        default: 16
        choices: [8, 16, 32, 64]
      epochs:
        type: int
        default: 3
  eval:
    path: evaluate.py
    description: "Run evaluation"
    parameters:
      model_path:
        type: str
        required: true

Experiments support multi-GPU training via the `launcher` field in the experiment config:
# In the experiment config (set via create_experiment tool or CLI)
launcher: accelerate # "python" (default), "torchrun", "accelerate", "deepspeed"
num_gpus: 4 # Number of GPUs
gpu_ids: [0, 1, 2, 3]  # Specific GPUs (sets CUDA_VISIBLE_DEVICES)

| Launcher | Command generated |
|---|---|
| `python` | `python train.py --args` |
| `torchrun` | `torchrun --nproc_per_node=4 train.py --args` |
| `accelerate` | `accelerate launch --num_processes=4 train.py --args` |
| `deepspeed` | `deepspeed --num_gpus=4 train.py --args` |
To use a single specific GPU: `gpu_ids: [2]` → sets `CUDA_VISIBLE_DEVICES=2`.
hosts:
  gpu-a100:
    host: 10.0.0.1
    user: ubuntu
    key_path: ~/.ssh/id_rsa
    remote_work_dir: /data/daedalus
    python_path: /opt/conda/bin/python
  gpu-h100:
    host: 10.0.0.2
    user: admin
    password: mypassword  # sshpass-based auth (alternative to key_path)
    remote_work_dir: /science/daedalus

| Command | Description |
|---|---|
| `daedalus init NAME` | Initialize a research project (works on new or existing directories) |
| `daedalus status` | Show experiment counts, active runs, best result |
| `daedalus ledger [-n N]` | Show last N experiments |
| `daedalus narrative` | Auto-generated research narrative |
| `daedalus context [-m MODE]` | Dump context (reason/design/reflect/review/full) |
| Command | Description |
|---|---|
| `daedalus hypothesis STATEMENT -r RATIONALE` | Create draft experiment |
| `daedalus run EXP_ID [-H HOST]` | Launch experiment |
| `daedalus run-batch [EXP_IDS] [-H HOST ...]` | Launch multiple experiments across hosts (round-robin) |
| `daedalus poll EXP_ID [-H HOST]` | Check running status |
| `daedalus logs EXP_ID [-n TAIL]` | View training logs (stdout + stderr) |
| `daedalus cancel EXP_ID [-H HOST]` | Cancel running experiment |
| `daedalus record EXP_ID -r METRIC=VALUE` | Record results manually |
| `daedalus reflect EXP_ID -a ANALYSIS -c STATUS` | Add analysis |
| `daedalus diff EXP_A EXP_B` | Compare two experiments |
| Command | Description |
|---|---|
| `daedalus data inspect PATH` | Dataset stats, distributions, samples, token estimates |
| `daedalus data validate PATH -f FORMAT` | Validate against format requirements |
| Command | Description |
|---|---|
| `daedalus papers search QUERY` | Search Semantic Scholar |
| `daedalus papers add ARXIV_ID` | Add paper to library |
| `daedalus papers list [-t TAG]` | Browse library |
| Command | Description |
|---|---|
| `daedalus sync [-H HOST]` | Upload scripts and requirements to remote |
| `daedalus setup-env [-H HOST]` | Install dependencies on remote |
| Command | Description |
|---|---|
| `daedalus watch EXP_ID` | Wait for experiment to complete |
| `daedalus watch --all` | Wait for all running experiments |
| `daedalus watch --stream` | JSONL event stream (for Claude Code) |
| `daedalus watch --verbose` | Human-readable progress to stderr |
| Command | Description |
|---|---|
| `daedalus agent reason` | Run one reasoning cycle |
| `daedalus agent design` | Design the next experiment |
| `daedalus agent reflect` | Reflect on latest results |
| `daedalus agent plan` | Create a structured research plan |
| `daedalus agent loop [-n N] [--plan]` | Autonomous research loop (optionally plan-driven) |
| `daedalus plan show` | Show current plan with step statuses |
| `daedalus plan next` | Show next actionable step |
| Command | Description |
|---|---|
| `daedalus install-mcp` | Register MCP server globally + install plugin files (default) |
| `daedalus install-mcp --scope project` | Register MCP server for this project only (`.mcp.json`) |
| `daedalus uninstall-mcp` | Remove MCP registration |
When registered as an MCP server, Daedalus exposes 30 tools to Claude Code:
| Tool | Description |
|---|---|
| `daedalus_get_context` | Research context: goals, history, papers, memory, suggestions |
| `daedalus_suggest_next` | Data-driven parameter suggestions with convergence detection |
| `daedalus_get_insights` | Aggregated insights: top configs, parameter sensitivity, dead ends, explored ranges |
| Tool | Description |
|---|---|
| `daedalus_list_experiments` | List experiments by status |
| `daedalus_get_experiment` | Full details of one experiment |
| `daedalus_create_experiment` | Create draft with hypothesis + config |
| `daedalus_launch_experiment` | Start training |
| `daedalus_poll_experiment` | Check status, auto-fetch results on completion |
| `daedalus_record_results` | Manually record results |
| `daedalus_add_reflection` | Record analysis + auto-save insight to memory |
| Tool | Description |
|---|---|
| `daedalus_compare_experiments` | Config diff + metric deltas |
| `daedalus_get_experiment_logs` | Recent training output (stdout + stderr) |
| Tool | Description |
|---|---|
| `daedalus_save_note` | Save a research note (decision, insight, dead_end, convergence, todo) |
| `daedalus_read_notes` | Read notes — filter by category, experiment, or keyword |
| `daedalus_update_program` | Edit program.md — append, replace section, or rewrite |
| Tool | Description |
|---|---|
| `daedalus_search_papers` | Query Semantic Scholar and ACL Anthology |
| `daedalus_add_paper` | Add to library by arXiv ID or ACL ID, with notes and tags |
| `daedalus_list_papers` | Browse library by tag |
| `daedalus_get_paper_details` | Full paper info from library, Semantic Scholar, or ACL Anthology |
| `daedalus_literature_review` | Structured review: papers vs experiments, gaps, new papers |
| Tool | Description |
|---|---|
| `daedalus_inspect_dataset` | Stats, distributions, token estimates |
| `daedalus_validate_dataset` | Format checks (18 formats supported) |
| `daedalus_list_scripts` | Available scripts and parameters |
| Tool | Description |
|---|---|
| `daedalus_batch_run` | Launch multiple experiments across hosts (round-robin) |
| `daedalus_remote_exec` | Execute a command on the remote SSH host (read-only direct, mutating needs confirmation) |
| `daedalus_human_confirm` | Ask researcher for approval |
| Tool | Description |
|---|---|
| `daedalus_get_plan` | Get the current research plan with all steps, autonomy mode, and guardrail status |
| `daedalus_create_plan` | Create a plan with autonomy mode and guardrails (3-7 steps, priorities, dependencies) |
| `daedalus_update_plan_step` | Mark steps done/failed, change priority, set requires_confirmation |
| `daedalus_update_plan` | Switch autonomy mode, adjust guardrails, pause/resume the plan |
After installing (`daedalus install-mcp` copies everything), these commands are available in Claude Code:

| Command | Description |
|---|---|
| `/daedalus:experiment` | Full experiment workflow: design → launch → monitor → analyze |
| `/daedalus:batch-run` | Launch multiple experiments across hosts |
| `/daedalus:analyze` | Analyze completed results and record reflections |
| `/daedalus:compare` | Compare two experiments side by side |
| `/daedalus:logs` | Show training logs (stdout + stderr) |
| `/daedalus:inspect-data PATH` | Inspect and validate a dataset |
| `/daedalus:watch [EXP_ID]` | Monitor running experiments |
| `/daedalus:research-status` | Project overview: progress, best results, next steps |
| `/daedalus:literature-review [TOPIC]` | Structured literature review with gap analysis |
| `/daedalus:plan` | Create or review a structured research plan |
Daedalus maintains persistent research memory across Claude Code sessions via `ledger/notes.jsonl`. Notes are categorized:

| Category | Use case |
|---|---|
| `decision` | Key research decisions and their rationale |
| `insight` | Learned patterns from experiments |
| `dead_end` | Approaches that don't work (auto-saved on rejected hypotheses) |
| `convergence` | Parameter convergence and saturation detections (auto-saved) |
| `todo` | Next steps and ideas to try |
| `general` | Uncategorized notes |
Memory is integrated at three levels:
- Context: `get_context` includes recent notes in reason/design/full modes
- Auto-save: `add_reflection` automatically saves insights and dead ends
- Agent loop: the reasoning step reads notes first to recall previous sessions
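For example, a session might persist and recall a finding like this (a sketch of the tool calls; argument names beyond the documented categories are assumptions):

```python
# Illustrative tool calls; the exact argument names are
# assumptions, not the real save_note / read_notes signatures
save_note(
    category="dead_end",
    content="LR above 5e-4 diverges on this dataset; do not revisit.",
)

# A later session recalls it before designing the next experiment
read_notes(category="dead_end", keyword="learning rate")
```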
Daedalus supports structured research planning — like Claude Code's plan mode, but for experiments.
# Agent creates a multi-step plan
daedalus agent plan
# Run the loop following the plan
daedalus agent loop --plan --autonomous -n 5
# View the plan
daedalus plan show
# See what's next
daedalus plan next

Each plan step has:
- Description: what to test (one variable)
- Rationale: why it matters
- Expected outcome: what we predict
- Priority: 1 (highest) to 5
- Dependencies: which steps must complete first
- Status: pending → running → done/failed/skipped
- Requires confirmation: override for risky steps (even in autonomous mode)
The autonomous loop follows the plan: picks the next actionable step (highest priority with dependencies met), designs the experiment, runs it, reflects, marks it done, and moves to the next step. If all steps complete, the loop stops. The agent can also add new steps during reflection as results evolve.
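A minimal sketch of that step-selection rule, assuming plan steps carry the fields listed above (the real schema may differ):

```python
def next_actionable(steps):
    """Highest-priority pending step whose dependencies are all done.

    A sketch of the rule described above; field names (status,
    priority, depends_on) are assumptions, not the actual schema.
    """
    done = {s["id"] for s in steps if s["status"] == "done"}
    ready = [
        s for s in steps
        if s["status"] == "pending"
        and all(dep in done for dep in s.get("depends_on", []))
    ]
    # Priority 1 is highest, so take the minimum
    return min(ready, key=lambda s: s["priority"], default=None)
```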
Plans support two execution modes:
| Mode | Behavior | Use case |
|---|---|---|
| `autonomous` (default) | Launches directly — Claude Code chat is already human-in-the-loop | Normal usage via MCP |
| `supervised` | Adds explicit code-level confirmation gate before each launch | CLI `run_loop` or extra caution |
# Create an autonomous plan via MCP
# (Claude Code tool call)
create_plan(
goal="Find optimal learning rate",
steps=[...],
autonomy="autonomous",
max_experiments=10,
max_consecutive_failures=2,
)
# Switch mode mid-plan
update_plan(autonomy="autonomous") # "go overnight"
update_plan(autonomy="supervised") # "I'm back, let me review"When experiments are linked to plan steps (via plan_step_id in launch_experiment), Daedalus automatically:
- Links the experiment to the plan step on launch
- Marks the step as done/failed when the experiment completes
- Resets/increments failure counters
- Copies reflection notes to the plan step
No manual `update_plan_step` calls needed for these transitions.
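For instance (a sketch; `plan_step_id` is documented above, the other argument names and values are illustrative):

```python
# Claude Code tool call; exp_id is illustrative and argument
# names besides plan_step_id are assumptions
launch_experiment(
    exp_id="exp_abc123",
    plan_step_id="step_2",  # ties this run to the plan step
)
# On completion, Daedalus marks step_2 done/failed and copies the
# reflection notes onto it; no update_plan_step call needed.
```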
Safety rails prevent runaway execution in autonomous mode:
| Guardrail | Default | Effect |
|---|---|---|
| `max_experiments` | unlimited | Auto-pause after N total experiments |
| `max_consecutive_failures` | 2 | Auto-pause after N failures in a row |
| Per-step `requires_confirmation` | false | Block launch even in autonomous mode |
When a guardrail triggers, the plan is paused (not stopped). Resume with `update_plan(paused=false)`.
| Tool | Description |
|---|---|
| `daedalus_get_plan` | Full plan state: steps, autonomy mode, guardrail counters, next actionable |
| `daedalus_create_plan` | Create plan with autonomy mode and guardrails |
| `daedalus_update_plan_step` | Update step status, priority, notes, requires_confirmation |
| `daedalus_update_plan` | Switch autonomy mode, adjust guardrails, pause/resume |
When experiment count exceeds 20, older experiments are automatically compressed into aggregated statistics (status distribution, parameter ranges, hypothesis outcomes) while recent experiments keep full detail. This prevents context overflow while preserving all relevant information.
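The aggregation is along these lines (an illustrative sketch, not the actual implementation; field names are assumptions):

```python
from collections import Counter

def compress_history(older_experiments):
    """Summarize older experiments into aggregate statistics,
    as described above. Illustrative field names only."""
    statuses = Counter(e["status"] for e in older_experiments)
    lrs = sorted(
        e["config"]["learning_rate"]
        for e in older_experiments
        if "learning_rate" in e.get("config", {})
    )
    return {
        "count": len(older_experiments),
        "status_distribution": dict(statuses),
        "learning_rate_range": (lrs[0], lrs[-1]) if lrs else None,
        "hypothesis_outcomes": {
            k: statuses[k] for k in ("confirmed", "rejected") if k in statuses
        },
    }
```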
`suggest_next` detects when parameters have converged and stops suggesting them:
| Check | Trigger | What it means |
|---|---|---|
| Plateau | Last 3+ values show <2% improvement | Metric has stabilized |
| Range saturation | >80% of declared range covered, optimum is interior | No room to explore |
| Categorical exhaustion | All choices tested | Best choice identified |
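The plateau rule, for instance, amounts to something like this (a sketch, assuming a higher-is-better metric; not the actual implementation):

```python
def plateaued(history, window=3, tol=0.02):
    """True if the last `window` metric values improved by less than
    `tol` (2%) overall, per the plateau trigger above. Sketch only."""
    if len(history) < window:
        return False
    recent = history[-window:]
    base = recent[0]
    if base == 0:
        return False
    return (max(recent) - base) / abs(base) < tol
```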
`get_insights` produces a structured analysis:
- Top configurations ranked by target metric
- Parameter sensitivity — which parameters have the most impact
- Dead ends — rejected hypotheses and failed experiments
- Explored ranges — coverage of declared parameter ranges
Papers are stored locally in `ledger/papers.jsonl` with:
- Title, authors, year, abstract, citation count
- Key findings — curated bullet points relevant to your research
- Relevance note — why this paper matters
- Tags — for filtering
The `literature_review` tool generates a structured review:
- Papers in library with key findings
- Papers vs experiments — which papers support which experimental outcomes
- Gaps — experiments without paper backing, rejected hypotheses needing literature
- New papers — optionally search Semantic Scholar for recent work
Daedalus validates datasets against 18 format specifications:
| Format | Required Fields | Use Case |
|---|---|---|
| `sft` | prompt, completion | Supervised fine-tuning |
| `chat` | messages (role/content) | Chat format |
| `dpo` | prompt, chosen, rejected | Direct preference optimization |
| `grpo` | prompt | Group relative policy optimization |
| `classification` | text, label | Text classification |
| `nli` | premise, hypothesis, label | Natural language inference |
| `sentiment` | text, label | Sentiment analysis |
| `multi_label` | text, labels | Multi-label classification |
| `ner` | tokens, tags | Named entity recognition |
| `token_classification` | tokens, tags | Token-level classification |
| `regression` | text, score | Regression tasks |
| `sts` | sentence1, sentence2, score | Semantic textual similarity |
| `qa` | question, answer | Question answering (open-domain) |
| `extractive_qa` | question, context, answers | Extractive QA (SQuAD-like) |
| `retrieval` | query, positive | Information retrieval |
| `ranking` | query, candidates | Learning to rank |
| `tabular` | target | Tabular/sklearn (features auto-detected) |
| `text_pair` | text1, text2 | Generic text pair |
Checks performed: required fields, format validation, duplicates (MD5), empty fields, class balance, split leakage.
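The duplicate and leakage checks boil down to hashing each example (a sketch of the idea; the actual checks may differ):

```python
import hashlib
import json

def row_md5(row):
    """MD5 of an example's canonical JSON, per the duplicate check above."""
    return hashlib.md5(json.dumps(row, sort_keys=True).encode()).hexdigest()

def find_duplicates(rows):
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        h = row_md5(row)
        if h in seen:
            dupes.append(i)
        seen.add(h)
    return dupes

def find_split_leakage(train_rows, test_rows):
    """Test examples that also appear verbatim in the training split."""
    train_hashes = {row_md5(r) for r in train_rows}
    return [i for i, r in enumerate(test_rows) if row_md5(r) in train_hashes]
```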
The `suggest_next` tool analyzes experiment history and suggests what to try:
| Strategy | When | What it does |
|---|---|---|
| Baseline | 0 experiments | Returns defaults from scripts registry |
| Explore neighbor | 1 experiment | Suggests 2x and 0.5x of current values |
| Bisect | 2+ experiments | Binary search between best and second-best |
| Trend extrapolate | 3+ monotonic points | Extrapolates observed trend |
| Untried choice | Categorical params | Suggests values never tested |
| Unexplored range | Never-varied params | Identifies parameters worth exploring |
| Converged | 3+ plateau points | Reports convergence, stops suggesting |
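As an illustration of the bisect strategy (a sketch; the actual midpoint rule may differ), a geometric midpoint suits log-scale parameters like learning rate:

```python
import math

def bisect_suggestion(best_value, second_best_value, log_scale=True):
    """Midpoint between the two best values of a parameter, as in the
    bisect strategy above. Geometric mean for log-scale parameters
    (e.g. learning rate), arithmetic mean otherwise. Sketch only."""
    if log_scale:
        return math.sqrt(best_value * second_best_value)
    return (best_value + second_best_value) / 2

# e.g. best lr 2e-5, runner-up 1e-4  ->  ~4.5e-5 to try next
```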
When using Daedalus via MCP in Claude Code, experiments are monitored with background watchdog agents — not sleep loops or inline polling.
Two strategies based on expected training time:
Short watch (ETA < 30 min) — frequent updates, needs relaunching:
Agent(
description="Watch exp_abc123",
prompt="Monitor experiment exp_abc123. Do 3-5 cycles of:
(1) call daedalus_poll_experiment with exp_id exp_abc123,
(2) call daedalus_get_experiment_logs with exp_id exp_abc123 and tail=30,
(3) check for problems (NaN loss, OOM, process died).
If completed or problem detected, return immediately with a report.
Otherwise sleep 60 and repeat. After 5 cycles, return a progress report.",
subagent_type="general-purpose",
run_in_background=true
)
Long watch (ETA > 30 min, overnight, unattended) — sleeps first, then polls until done:
Agent(
description="Watch exp_abc123 (long)",
prompt="Monitor experiment exp_abc123 which has an ETA of ~3 hours.
First sleep 9000 (2.5h) to avoid wasting poll cycles.
Then do poll cycles every 60s until completed or problem detected
(NaN loss, OOM, process died). On completion or problem, return
immediately with a full report. Max 60 poll cycles after the initial sleep.",
subagent_type="general-purpose",
run_in_background=true
)
For long watch, calculate the initial sleep as ~80% of the ETA. The watchdog handles the full training duration without needing anyone to relaunch it — ideal for overnight runs.
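The sleep arithmetic from the example above, spelled out:

```python
eta_seconds = 3 * 3600                  # ~3 h ETA
initial_sleep = int(0.8 * eta_seconds)  # 8640 s, about 2.4 h
# The long-watch prompt above rounds this to 9000 s (2.5 h),
# leaving the 60 s poll cycles to cover the remaining ~30 min.
```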
Each watchdog returns one of:
- COMPLETED — final metrics, ready for analysis
- ALERT — problem detected (NaN, OOM, CUDA error), with diagnosis
- PROGRESS — (short watch only) still running, current step/loss/ETA
When a short watchdog returns, the main agent shows status, does productive work, and relaunches. When a long watchdog returns, it means the experiment completed or failed — go straight to analysis.
Anti-patterns: Never use `Bash(sleep N)` in the main conversation. Never use `CronCreate` or any invented polling mechanism. Never poll inline in a loop.
Here's what a monitoring session looks like in practice — Claude Code launches experiments, spawns watchdogs, does productive work, and reacts when results come in:
Agent: Two experiments launched. Let me start monitoring.
[Agent("Watch exp_6eea6a", ..., run_in_background=true)]
[Agent("Watch exp_34555d", ..., run_in_background=true)]
While they train, let me check if we can improve logging for next runs...
[Read train.py]
[Edit train.py — adds log_level="info", save_steps=50]
Done. For future experiments:
- log_level="info" — loss values will print to stdout
- save_steps=50 — checkpoints every 50 steps for earlier visibility
Current Status
| Experiment | Model | Step | ETA | Health |
|------------|-----------|---------|-------|---------|
| exp_6eea6a | Model-0.8B | ~51/375 | ~3.5h | Healthy |
| exp_34555d | Model-2B | ~39/375 | ~4.8h | Healthy |
Watchdogs running. They'll report back when epoch 1 completes.
--- (watchdog returns ~5 min later) ---
Agent "Watch exp_6eea6a" completed:
exp_6eea6a: step 72/375, loss=2.31, ~2.5h remaining — healthy
Agent: [relaunches watchdog for exp_6eea6a]
While waiting, let me search for papers on learning rate schedules...
[daedalus_search_papers("warmup learning rate fine-tuning")]
[daedalus_add_paper(...)]
--- (watchdog returns: COMPLETED) ---
Agent "Watch exp_6eea6a" completed:
exp_6eea6a COMPLETED — eval_loss=0.28, accuracy=0.94
Agent: [daedalus_compare_experiments(exp_ids=["exp_6eea6a", "exp_34555d_baseline"])]
[daedalus_add_reflection(...)]
[daedalus_save_note(category="insight", ...)]
exp_6eea6a vs baseline:
- accuracy: 0.91 -> 0.94 (+3.3%)
- eval_loss: 0.35 -> 0.28 (-20%)
Hypothesis CONFIRMED: lower learning rate improved convergence.
exp_34555d still running (~1.5h). Preparing next experiment config...
The cycle repeats: watchdog returns → show status → react → productive work → relaunch → wait.
The watcher tails training logs in real-time and emits structured events:
daedalus watch exp_007 --stream --log-interval 10

{"event": "WATCHING", "exp_id": "exp_007"}
{"event": "PROGRESS", "exp_id": "exp_007", "key": "epoch", "value": "1/10"}
{"event": "METRIC", "exp_id": "exp_007", "metric": "loss", "value": 0.345}
{"event": "METRIC", "exp_id": "exp_007", "metric": "eval_loss", "value": 0.287}
{"event": "ALERT", "exp_id": "exp_007", "alert_type": "loss_spike", "detail": "Loss jumped 5.2x"}
{"event": "COMPLETED", "exp_id": "exp_007", "results": {"eval": {"acc": 0.92}}}- Metrics: loss, eval_loss, accuracy, learning_rate (HF Trainer dict + key=value formats)
- Progress: epoch N/M, step N/M, tqdm percentage bars
- Alerts: NaN, Inf, CUDA OOM, loss spike (configurable threshold), training stall (no output for N minutes)
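A sketch of the kind of line parsing this implies (illustrative patterns, not the watcher's actual regexes):

```python
import re

# Illustrative patterns for the metric/progress/alert lines listed above
METRIC_RE = re.compile(
    r"\b(eval_loss|loss|accuracy|learning_rate)\s*[=:]\s*"
    r"(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)
PROGRESS_RE = re.compile(r"\b(epoch|step)\s*:?\s*(\d+)\s*/\s*(\d+)")
ALERT_RE = re.compile(r"\b(nan|inf|CUDA out of memory)\b", re.IGNORECASE)

def parse_line(exp_id, line):
    """Map one raw log line to a watcher event, per the list above."""
    if m := ALERT_RE.search(line):
        return {"event": "ALERT", "exp_id": exp_id, "detail": m[1]}
    if m := METRIC_RE.search(line):
        return {"event": "METRIC", "exp_id": exp_id,
                "metric": m[1], "value": float(m[2])}
    if m := PROGRESS_RE.search(line):
        return {"event": "PROGRESS", "exp_id": exp_id,
                "key": m[1], "value": f"{m[2]}/{m[3]}"}
    return None
```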
For SSH experiments, a sentinel script runs inside screen on the remote host:
- Survives SSH disconnects
- Parses both stdout and stderr logs
- 5-level completion detection: results.json, trainer_state.json, model artifacts, exit code, log heuristics
- Writes structured events to `events.jsonl`
- Local watcher reads events via single SSH calls (no persistent connection)
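The single-SSH-call read is essentially this (a sketch; the host name and remote path are illustrative):

```python
import json
import subprocess

def read_remote_events(host, events_path):
    """Fetch the sentinel's events.jsonl in one SSH call, with no
    persistent connection, as described above. Illustrative sketch."""
    proc = subprocess.run(
        ["ssh", host, f"cat {events_path}"],
        capture_output=True, text=True, check=True,
    )
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]

# events = read_remote_events("gpu-a100", "/data/daedalus/exp_007/events.jsonl")
```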
┌─────────────────────────────────────────────────┐
│ Claude Code │
│ (the brain — reasons, edits code, decides) │
│ │
│ Native tools: Read, Write, Edit, Bash, Grep │
│ Daedalus tools: daedalus_* (via MCP) │
└──────────┬──────────────────────────┬────────────┘
│ MCP (JSON-RPC/stdio) │ direct
▼ ▼
┌──────────────────┐ ┌──────────────────────────┐
│ Daedalus MCP │ │ Your codebase │
│ Server │ │ (training scripts, │
│ │ │ data, configs) │
│ - ToolExecutor │ └──────────────────────────┘
│ - 30 tools │
│ - Ledger (JSONL)│
│ - Memory (JSONL)│
│ - Runners │
└──────┬───────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Runners │
│ │
│ LocalRunner SSHRunner + Sentinel │
│ - subprocess - ssh/scp (no paramiko) │
│ - logs/stdout.log - sentinel.sh in screen │
│ - results.json - events.jsonl on remote │
│ - sshpass for password auth │
└──────────────────────────────────────────────────┘
Key design principle: Claude Code is the brain. Daedalus is the hands. The watcher is the eyes. No AI runs inside Daedalus — it's pure bookkeeping, polling, and log parsing. All intelligence comes from Claude Code, which can both manage experiments AND edit the code.
Daedalus automatically detects result formats:
- `results.json` — Daedalus native (nested `{"eval": {"metric": value}}`) or flat JSON
- `eval_results.json` / `trainer_state.json` — HuggingFace Trainer
- `results.csv` — CSV with metric,value columns
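A sketch of that detection order (illustrative, not the actual loader):

```python
import json
from pathlib import Path

def load_results(run_dir):
    """Check the known result files in order and normalize to a flat
    metric dict, a sketch of the detection described above."""
    d = Path(run_dir)
    if (p := d / "results.json").exists():
        data = json.loads(p.read_text())
        return data.get("eval", data)  # nested {"eval": {...}} or flat
    for name in ("eval_results.json", "trainer_state.json"):
        if (p := d / name).exists():
            return json.loads(p.read_text())
    if (p := d / "results.csv").exists():
        lines = p.read_text().splitlines()
        # assumes a "metric,value" header row, then one row per metric
        return {m: float(v) for m, v in
                (line.split(",") for line in lines[1:] if line)}
    return None
```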
Two ready-to-run examples are included in `examples/`:
RandomForest on Iris dataset. Runs in seconds, no GPU needed.
cd examples/sklearn-iris
daedalus init . # Initialize the project
# Open Claude Code and ask: "Run a baseline experiment"

Demonstrates: experiment lifecycle, parameter exploration, convergence detection.
SFT with TRL on SmolLM2-135M (135M params). Runs in a few minutes on CPU.
cd examples/sft-tiny
pip install -r requirements.txt
daedalus init .
# Open Claude Code and ask: "Create a plan to find the best learning rate"

Demonstrates: plan mode, HuggingFace integration, learning rate search, reflection.
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests (479 tests)
python -m pytest tests/ -q
# Run with coverage
python -m pytest tests/ --cov=daedalus --cov-report=term-missing

MIT