Daedalus AI

AI research assistant for hypothesis-driven ML experiments.

Daedalus AI is a Claude Code MCP plugin that manages the full experiment lifecycle: hypothesis formation, experiment design, training execution, log monitoring, result analysis, and autonomous parameter exploration. It works as a set of tools that Claude Code can use alongside its native code editing capabilities.

Why Daedalus?

Running ML experiments involves a repetitive loop: change a parameter, launch training, wait hours, analyze results, decide what to try next. Daedalus automates the bookkeeping and gives Claude Code the tools to manage this loop — while you focus on the research questions.

  • Hypothesis-driven: Every experiment starts with a hypothesis and predictions, not just "try this"
  • One variable at a time: The system enforces changing a single variable per experiment, keeping results attributable
  • Full lifecycle tracking: Draft → Running → Completed → Analyzed, with structured reflections
  • Persistent memory: Research notes, decisions, and dead ends survive across sessions
  • Literature integration: Search papers, build a library, generate structured reviews
  • Smart exploration: Convergence detection, parameter sensitivity, data-driven suggestions
  • Works with any framework: HuggingFace, PyTorch, TensorFlow, custom scripts — Daedalus doesn't care what you train with
  • Local or remote: Run on your machine or SSH to GPU servers with automatic sentinel monitoring
  • Multi-GPU: torchrun, accelerate, deepspeed launchers with GPU selection
  • Batch execution: Distribute experiments across hosts with round-robin scheduling
  • Claude Code integration: MCP tools + slash commands + code editing = the AI can fix bugs, not just track experiments

Installation

Option A — Plugin Marketplace (recommended)

# 1. Install the Python package
pip install daedalus-ai
# or from source:
pip install -e .

# 2. In Claude Code, add the marketplace and install the plugin
/plugin marketplace add sirCamp/daedalus-ai
/plugin install daedalus-ai@daedalus-ai

The plugin registers the MCP server, slash commands, agents, and skills automatically.

Option B — Manual setup

# 1. Install the package
pip install -e .

# 2. Register MCP server + install agents, commands, skills + auto-approve tools
daedalus install-mcp

# 3. Restart Claude Code

Scopes (manual setup only)

| Scope | Flag | Where it writes | Use case |
|-------|------|-----------------|----------|
| user (default) | --scope user or omit | ~/.claude.json | Recommended: available in ALL projects |
| project | --scope project | .mcp.json in project dir | Shared via git, per-project |
| local | --scope local | ~/.claude.json (local entry) | This project only, not shared |

# Global install (recommended, default)
daedalus install-mcp

# Project-only install (creates .mcp.json, committable to git)
daedalus install-mcp --scope project

Updating

If the package is installed in editable mode (pip install -e .), Python code changes take effect immediately. Plugin files (commands, skills, agents), however, are copied to ~/.claude/, so after updating:

# Marketplace install: update via Claude Code
/plugin update daedalus-ai@daedalus-ai

# Manual install: re-install plugin files + restart Claude Code
daedalus install-mcp

Quick Start

# 1. Create a project
daedalus init my-research
cd my-research

# 2. Edit daedalus.yaml — define your scripts, parameters, stack
# 3. Edit program.md — define your research goals and metrics

# 4. Open Claude Code and start experimenting
#    Use /daedalus:experiment, /daedalus:research-status, or just ask in natural language

Project Structure

my-research/
├── daedalus.yaml          # Project config: runner, scripts, stack
├── program.md             # Research goals and metrics (editable by agent)
├── runner_config.yaml     # SSH host configuration
├── ledger/
│   ├── experiments.jsonl  # Experiment log (append-only)
│   ├── papers.jsonl       # Literature library
│   ├── notes.jsonl        # Research memory (decisions, insights, dead ends)
│   └── narrative.md       # Auto-generated narrative
└── runs/                  # Work directories (auto-created)
    └── exp_abc123/
        ├── logs/
        │   ├── stdout.log
        │   └── stderr.log
        └── results.json
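
The ledger files are plain append-only JSONL, so they can be inspected or post-processed with standard tools. A minimal Python sketch (the record fields shown are illustrative assumptions, not a documented schema):

import json

# Each line of ledger/experiments.jsonl is one experiment record.
# The field name "status" is an assumption for illustration.
with open("ledger/experiments.jsonl") as f:
    experiments = [json.loads(line) for line in f if line.strip()]

completed = [e for e in experiments if e.get("status") == "completed"]
print(f"{len(completed)} of {len(experiments)} experiments completed")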

Configuration

daedalus.yaml

project_name: my-research

runner:
  type: local              # or: ssh

higher_is_better:
  accuracy: true
  loss: false

# Stack — libraries and frameworks (shown to Claude Code for context)
stack:
  python: "3.11"
  requirements: requirements.txt
  libraries:
    - transformers
    - datasets
    - torch
  docs:
    - https://huggingface.co/docs/transformers/

# Scripts registry — what scripts exist and their parameters
scripts:
  train:
    path: train.py
    description: "Main training script"
    parameters:
      model_name:
        type: str
        default: "bert-base-uncased"
        description: "HuggingFace model ID"
      learning_rate:
        type: float
        default: 2e-5
        range: [1e-6, 1e-3]
      batch_size:
        type: int
        default: 16
        choices: [8, 16, 32, 64]
      epochs:
        type: int
        default: 3
  eval:
    path: evaluate.py
    description: "Run evaluation"
    parameters:
      model_path:
        type: str
        required: true
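
The registry is descriptive; the generated commands (see Multi-GPU Support below) pass parameters to the script as command-line flags. A hypothetical train.py skeleton matching the registry above — the flag-per-parameter convention is an assumption based on those generated commands:

import argparse
import json

# Hypothetical skeleton for the registered train script.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, default="bert-base-uncased")
parser.add_argument("--learning_rate", type=float, default=2e-5)
parser.add_argument("--batch_size", type=int, default=16, choices=[8, 16, 32, 64])
parser.add_argument("--epochs", type=int, default=3)
args = parser.parse_args()

# ... training loop ...

# Write metrics where Daedalus looks for them (see Results Auto-Detection).
with open("results.json", "w") as f:
    json.dump({"eval": {"accuracy": 0.0}}, f)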

Multi-GPU Support

Experiments support multi-GPU training via the launcher field in the experiment config:

# In the experiment config (set via create_experiment tool or CLI)
launcher: accelerate   # "python" (default), "torchrun", "accelerate", "deepspeed"
num_gpus: 4            # Number of GPUs
gpu_ids: [0, 1, 2, 3]  # Specific GPUs (sets CUDA_VISIBLE_DEVICES)

| Launcher | Command generated |
|----------|-------------------|
| python | python train.py --args |
| torchrun | torchrun --nproc_per_node=4 train.py --args |
| accelerate | accelerate launch --num_processes=4 train.py --args |
| deepspeed | deepspeed --num_gpus=4 train.py --args |

To use a single specific GPU: gpu_ids: [2] → sets CUDA_VISIBLE_DEVICES=2.
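
The launcher, num_gpus, and gpu_ids fields jointly determine the environment and command line. A rough Python sketch of that mapping, mirroring the table above (illustrative, not Daedalus's actual internals):

import os

def build_command(launcher, num_gpus, gpu_ids, script="train.py"):
    # Mirrors the launcher table above; illustrative only.
    env = dict(os.environ)
    if gpu_ids:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    commands = {
        "python": ["python", script],
        "torchrun": ["torchrun", f"--nproc_per_node={num_gpus}", script],
        "accelerate": ["accelerate", "launch", f"--num_processes={num_gpus}", script],
        "deepspeed": ["deepspeed", f"--num_gpus={num_gpus}", script],
    }
    return commands[launcher], env

cmd, env = build_command("accelerate", 4, [0, 1, 2, 3])
# cmd == ["accelerate", "launch", "--num_processes=4", "train.py"]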

Multi-Host SSH (runner_config.yaml)

hosts:
  gpu-a100:
    host: 10.0.0.1
    user: ubuntu
    key_path: ~/.ssh/id_rsa
    remote_work_dir: /data/daedalus
    python_path: /opt/conda/bin/python
  gpu-h100:
    host: 10.0.0.2
    user: admin
    password: mypassword      # sshpass-based auth (alternative to key_path)
    remote_work_dir: /science/daedalus

CLI Commands

Project Management

| Command | Description |
|---------|-------------|
| daedalus init NAME | Initialize a research project (works on new or existing directories) |
| daedalus status | Show experiment counts, active runs, best result |
| daedalus ledger [-n N] | Show last N experiments |
| daedalus narrative | Auto-generated research narrative |
| daedalus context [-m MODE] | Dump context (reason/design/reflect/review/full) |

Experiment Lifecycle

| Command | Description |
|---------|-------------|
| daedalus hypothesis STATEMENT -r RATIONALE | Create draft experiment |
| daedalus run EXP_ID [-H HOST] | Launch experiment |
| daedalus run-batch [EXP_IDS] [-H HOST ...] | Launch multiple experiments across hosts (round-robin) |
| daedalus poll EXP_ID [-H HOST] | Check running status |
| daedalus logs EXP_ID [-n TAIL] | View training logs (stdout + stderr) |
| daedalus cancel EXP_ID [-H HOST] | Cancel running experiment |
| daedalus record EXP_ID -r METRIC=VALUE | Record results manually |
| daedalus reflect EXP_ID -a ANALYSIS -c STATUS | Add analysis |
| daedalus diff EXP_A EXP_B | Compare two experiments |

Data Tools

| Command | Description |
|---------|-------------|
| daedalus data inspect PATH | Dataset stats, distributions, samples, token estimates |
| daedalus data validate PATH -f FORMAT | Validate against format requirements |

Literature

| Command | Description |
|---------|-------------|
| daedalus papers search QUERY | Search Semantic Scholar |
| daedalus papers add ARXIV_ID | Add paper to library |
| daedalus papers list [-t TAG] | Browse library |

Remote Management

| Command | Description |
|---------|-------------|
| daedalus sync [-H HOST] | Upload scripts and requirements to remote |
| daedalus setup-env [-H HOST] | Install dependencies on remote |

Monitoring

| Command | Description |
|---------|-------------|
| daedalus watch EXP_ID | Wait for experiment to complete |
| daedalus watch --all | Wait for all running experiments |
| daedalus watch --stream | JSONL event stream (for Claude Code) |
| daedalus watch --verbose | Human-readable progress to stderr |

Agent & Plan

| Command | Description |
|---------|-------------|
| daedalus agent reason | Run one reasoning cycle |
| daedalus agent design | Design the next experiment |
| daedalus agent reflect | Reflect on latest results |
| daedalus agent plan | Create a structured research plan |
| daedalus agent loop [-n N] [--plan] | Autonomous research loop (optionally plan-driven) |
| daedalus plan show | Show current plan with step statuses |
| daedalus plan next | Show next actionable step |

MCP Integration

| Command | Description |
|---------|-------------|
| daedalus install-mcp | Register MCP server globally + install plugin files (default) |
| daedalus install-mcp --scope project | Register MCP server for this project only (.mcp.json) |
| daedalus uninstall-mcp | Remove MCP registration |

MCP Tools

When registered as an MCP server, Daedalus exposes 30 tools to Claude Code:

Context & Analytics

| Tool | Description |
|------|-------------|
| daedalus_get_context | Research context: goals, history, papers, memory, suggestions |
| daedalus_suggest_next | Data-driven parameter suggestions with convergence detection |
| daedalus_get_insights | Aggregated insights: top configs, parameter sensitivity, dead ends, explored ranges |

Experiment Management

| Tool | Description |
|------|-------------|
| daedalus_list_experiments | List experiments by status |
| daedalus_get_experiment | Full details of one experiment |
| daedalus_create_experiment | Create draft with hypothesis + config |
| daedalus_launch_experiment | Start training |
| daedalus_poll_experiment | Check status, auto-fetch results on completion |
| daedalus_record_results | Manually record results |
| daedalus_add_reflection | Record analysis + auto-save insight to memory |

Comparison & Logs

| Tool | Description |
|------|-------------|
| daedalus_compare_experiments | Config diff + metric deltas |
| daedalus_get_experiment_logs | Recent training output (stdout + stderr) |

Research Memory

| Tool | Description |
|------|-------------|
| daedalus_save_note | Save a research note (decision, insight, dead_end, convergence, todo) |
| daedalus_read_notes | Read notes — filter by category, experiment, or keyword |
| daedalus_update_program | Edit program.md — append, replace section, or rewrite |

Literature

| Tool | Description |
|------|-------------|
| daedalus_search_papers | Query Semantic Scholar and ACL Anthology |
| daedalus_add_paper | Add to library by arXiv ID or ACL ID, with notes and tags |
| daedalus_list_papers | Browse library by tag |
| daedalus_get_paper_details | Full paper info from library, Semantic Scholar, or ACL Anthology |
| daedalus_literature_review | Structured review: papers vs experiments, gaps, new papers |

Data & Scripts

| Tool | Description |
|------|-------------|
| daedalus_inspect_dataset | Stats, distributions, token estimates |
| daedalus_validate_dataset | Format checks (18 formats supported) |
| daedalus_list_scripts | Available scripts and parameters |

Batch & Remote

| Tool | Description |
|------|-------------|
| daedalus_batch_run | Launch multiple experiments across hosts (round-robin) |
| daedalus_remote_exec | Execute a command on the remote SSH host (read-only direct, mutating needs confirmation) |
| daedalus_human_confirm | Ask researcher for approval |

Research Plan

| Tool | Description |
|------|-------------|
| daedalus_get_plan | Get the current research plan with all steps, autonomy mode, and guardrail status |
| daedalus_create_plan | Create a plan with autonomy mode and guardrails (3-7 steps, priorities, dependencies) |
| daedalus_update_plan_step | Mark steps done/failed, change priority, set requires_confirmation |
| daedalus_update_plan | Switch autonomy mode, adjust guardrails, pause/resume the plan |

Slash Commands

After installation (daedalus install-mcp copies the plugin files), these commands are available in Claude Code:

| Command | Description |
|---------|-------------|
| /daedalus:experiment | Full experiment workflow: design → launch → monitor → analyze |
| /daedalus:batch-run | Launch multiple experiments across hosts |
| /daedalus:analyze | Analyze completed results and record reflections |
| /daedalus:compare | Compare two experiments side by side |
| /daedalus:logs | Show training logs (stdout + stderr) |
| /daedalus:inspect-data PATH | Inspect and validate a dataset |
| /daedalus:watch [EXP_ID] | Monitor running experiments |
| /daedalus:research-status | Project overview: progress, best results, next steps |
| /daedalus:literature-review [TOPIC] | Structured literature review with gap analysis |
| /daedalus:plan | Create or review a structured research plan |

Research Memory

Daedalus maintains persistent research memory across Claude Code sessions via ledger/notes.jsonl. Notes are categorized:

| Category | Use case |
|----------|----------|
| decision | Key research decisions and their rationale |
| insight | Learned patterns from experiments |
| dead_end | Approaches that don't work (auto-saved on rejected hypotheses) |
| convergence | Parameter convergence and saturation detections (auto-saved) |
| todo | Next steps and ideas to try |
| general | Uncategorized notes |

Memory is integrated at three levels:

  • Context: get_context includes recent notes in reason/design/full modes
  • Auto-save: add_reflection automatically saves insights and dead ends
  • Agent loop: The reasoning step reads notes first to recall previous sessions
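
In tool-call form this looks roughly like the following (sketch; argument names other than category are assumptions):

# Persist a decision during a session
save_note(category="decision",
          content="Freeze embeddings for all runs; unfreezing hurt eval_loss.")

# Recall it in a later session before designing the next experiment
read_notes(category="decision")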

Plan Mode

Daedalus supports structured research planning — like Claude Code's plan mode, but for experiments.

# Agent creates a multi-step plan
daedalus agent plan

# Run the loop following the plan
daedalus agent loop --plan --autonomous -n 5

# View the plan
daedalus plan show

# See what's next
daedalus plan next

Each plan step has:

  • Description: what to test (one variable)
  • Rationale: why it matters
  • Expected outcome: what we predict
  • Priority: 1 (highest) to 5
  • Dependencies: which steps must complete first
  • Status: pending → running → done/failed/skipped
  • Requires confirmation: override for risky steps (even in autonomous mode)

The autonomous loop follows the plan: picks the next actionable step (highest priority with dependencies met), designs the experiment, runs it, reflects, marks it done, and moves to the next step. If all steps complete, the loop stops. The agent can also add new steps during reflection as results evolve.

Autonomy Modes

Plans support two execution modes:

| Mode | Behavior | Use case |
|------|----------|----------|
| autonomous (default) | Launches directly — Claude Code chat is already human-in-the-loop | Normal usage via MCP |
| supervised | Adds explicit code-level confirmation gate before each launch | CLI run_loop or extra caution |

# Create an autonomous plan via MCP
# (Claude Code tool call)
create_plan(
    goal="Find optimal learning rate",
    steps=[...],
    autonomy="autonomous",
    max_experiments=10,
    max_consecutive_failures=2,
)

# Switch mode mid-plan
update_plan(autonomy="autonomous")  # "go overnight"
update_plan(autonomy="supervised")  # "I'm back, let me review"

Auto-Bookkeeping

When experiments are linked to plan steps (via plan_step_id in launch_experiment), Daedalus automatically:

  • Links the experiment to the plan step on launch
  • Marks the step as done/failed when the experiment completes
  • Resets/increments failure counters
  • Copies reflection notes to the plan step

No manual update_plan_step calls needed for these transitions.
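
In tool-call form (sketch; the identifiers are placeholders and argument names other than plan_step_id are assumptions):

# Launch linked to a plan step; completion bookkeeping then happens automatically
launch_experiment(exp_id="exp_abc123", plan_step_id="step_2")
# When exp_abc123 completes, step_2 is marked done/failed and counters update.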

Guardrails

Safety rails prevent runaway execution in autonomous mode:

| Guardrail | Default | Effect |
|-----------|---------|--------|
| max_experiments | unlimited | Auto-pause after N total experiments |
| max_consecutive_failures | 2 | Auto-pause after N failures in a row |
| Per-step requires_confirmation | false | Block launch even in autonomous mode |

When a guardrail triggers, the plan is paused (not stopped). Resume with update_plan(paused=false).
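
The checks themselves are simple bookkeeping; a minimal sketch of the pause logic described above (illustrative, not Daedalus's actual code):

def guardrail_violation(plan: dict):
    # Returns a reason to pause the plan, or None. Field names are assumptions.
    if plan.get("max_experiments") and plan["total_experiments"] >= plan["max_experiments"]:
        return "max_experiments reached"
    if plan["consecutive_failures"] >= plan.get("max_consecutive_failures", 2):
        return "max_consecutive_failures reached"
    return None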

MCP Tools for Plan Management

| Tool | Description |
|------|-------------|
| daedalus_get_plan | Full plan state: steps, autonomy mode, guardrail counters, next actionable |
| daedalus_create_plan | Create plan with autonomy mode and guardrails |
| daedalus_update_plan_step | Update step status, priority, notes, requires_confirmation |
| daedalus_update_plan | Switch autonomy mode, adjust guardrails, pause/resume |

Smart Context & Convergence

Context Compression

When experiment count exceeds 20, older experiments are automatically compressed into aggregated statistics (status distribution, parameter ranges, hypothesis outcomes) while recent experiments keep full detail. This prevents context overflow while preserving all relevant information.
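
Conceptually, the compression looks like this (sketch of the idea; field names are assumptions):

from collections import Counter

def compress_history(experiments, keep_recent=20):
    # Keep the most recent experiments verbatim; reduce older ones to aggregates.
    older, recent = experiments[:-keep_recent], experiments[-keep_recent:]
    summary = {
        "count": len(older),
        "status_distribution": dict(Counter(e["status"] for e in older)),
        "hypothesis_outcomes": dict(Counter(e.get("outcome", "unknown") for e in older)),
    }
    return summary, recent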

Convergence Detection

suggest_next detects when parameters have converged and stops suggesting them:

| Check | Trigger | What it means |
|-------|---------|---------------|
| Plateau | Last 3+ values show <2% improvement | Metric has stabilized |
| Range saturation | >80% of declared range covered, optimum is interior | No room to explore |
| Categorical exhaustion | All choices tested | Best choice identified |
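
The plateau check, for instance, reduces to a few lines (sketch; assumes a higher-is-better metric):

def has_plateaued(values, window=3, threshold=0.02):
    # Last `window` values show < 2% relative spread -> converged.
    # Sketch of the check above, not the actual implementation.
    if len(values) < window:
        return False
    recent = values[-window:]
    spread = max(recent) - min(recent)
    return min(recent) > 0 and spread / min(recent) < threshold

print(has_plateaued([0.89, 0.91, 0.912, 0.913]))  # True: last 3 within ~0.3%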

Insights Generator

get_insights produces a structured analysis:

  • Top configurations ranked by target metric
  • Parameter sensitivity — which parameters have the most impact
  • Dead ends — rejected hypotheses and failed experiments
  • Explored ranges — coverage of declared parameter ranges

Literature Management

Paper Library

Papers are stored locally in ledger/papers.jsonl with:

  • Title, authors, year, abstract, citation count
  • Key findings — curated bullet points relevant to your research
  • Relevance note — why this paper matters
  • Tags — for filtering

Literature Review

The literature_review tool generates a structured review:

  • Papers in library with key findings
  • Papers vs experiments — which papers support which experimental outcomes
  • Gaps — experiments without paper backing, rejected hypotheses needing literature
  • New papers — optionally search Semantic Scholar for recent work

Dataset Validation

Daedalus validates datasets against 18 format specifications:

| Format | Required Fields | Use Case |
|--------|-----------------|----------|
| sft | prompt, completion | Supervised fine-tuning |
| chat | messages (role/content) | Chat format |
| dpo | prompt, chosen, rejected | Direct preference optimization |
| grpo | prompt | Group relative policy optimization |
| classification | text, label | Text classification |
| nli | premise, hypothesis, label | Natural language inference |
| sentiment | text, label | Sentiment analysis |
| multi_label | text, labels | Multi-label classification |
| ner | tokens, tags | Named entity recognition |
| token_classification | tokens, tags | Token-level classification |
| regression | text, score | Regression tasks |
| sts | sentence1, sentence2, score | Semantic textual similarity |
| qa | question, answer | Question answering (open-domain) |
| extractive_qa | question, context, answers | Extractive QA (SQuAD-like) |
| retrieval | query, positive | Information retrieval |
| ranking | query, candidates | Learning to rank |
| tabular | target | Tabular/sklearn (features auto-detected) |
| text_pair | text1, text2 | Generic text pair |

Checks performed: required fields, format validation, duplicates (MD5), empty fields, class balance, split leakage.
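
For example, a minimal JSONL file that satisfies the dpo required fields:

import json

# Each record needs prompt, chosen, and rejected (per the table above).
records = [
    {"prompt": "Translate 'hello' to French.",
     "chosen": "Bonjour.",
     "rejected": "Hola."},
]
with open("train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Then validate: daedalus data validate train.jsonl -f dpo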

Parameter Exploration

The suggest_next tool analyzes experiment history and suggests what to try:

| Strategy | When | What it does |
|----------|------|--------------|
| Baseline | 0 experiments | Returns defaults from scripts registry |
| Explore neighbor | 1 experiment | Suggests 2x and 0.5x of current values |
| Bisect | 2+ experiments | Binary search between best and second-best |
| Trend extrapolate | 3+ monotonic points | Extrapolates observed trend |
| Untried choice | Categorical params | Suggests values never tested |
| Unexplored range | Never-varied params | Identifies parameters worth exploring |
| Converged | 3+ plateau points | Reports convergence, stops suggesting |
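
The bisect strategy, for example, proposes the midpoint between the two best values seen so far. A sketch of the idea (not the actual implementation):

def bisect_suggestion(history):
    # history: (param_value, metric) pairs, higher metric is better.
    ranked = sorted(history, key=lambda vm: vm[1], reverse=True)
    best, second = ranked[0][0], ranked[1][0]
    return (best + second) / 2

print(bisect_suggestion([(1e-5, 0.89), (5e-5, 0.91), (2e-4, 0.85)]))  # 3e-05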

Experiment Monitoring

MCP Monitoring Pattern (Claude Code)

When using Daedalus via MCP in Claude Code, experiments are monitored with background watchdog agents — not sleep loops or inline polling.

Two strategies based on expected training time:

Short watch (ETA < 30 min) — frequent updates, needs relaunching:

Agent(
  description="Watch exp_abc123",
  prompt="Monitor experiment exp_abc123. Do 3-5 cycles of:
    (1) call daedalus_poll_experiment with exp_id exp_abc123,
    (2) call daedalus_get_experiment_logs with exp_id exp_abc123 and tail=30,
    (3) check for problems (NaN loss, OOM, process died).
    If completed or problem detected, return immediately with a report.
    Otherwise sleep 60 and repeat. After 5 cycles, return a progress report.",
  subagent_type="general-purpose",
  run_in_background=true
)

Long watch (ETA > 30 min, overnight, unattended) — sleeps first, then polls until done:

Agent(
  description="Watch exp_abc123 (long)",
  prompt="Monitor experiment exp_abc123 which has an ETA of ~3 hours.
    First sleep 9000 (2.5h) to avoid wasting poll cycles.
    Then do poll cycles every 60s until completed or problem detected
    (NaN loss, OOM, process died). On completion or problem, return
    immediately with a full report. Max 60 poll cycles after the initial sleep.",
  subagent_type="general-purpose",
  run_in_background=true
)

For long watch, calculate the initial sleep as ~80% of the ETA. The watchdog handles the full training duration without needing anyone to relaunch it — ideal for overnight runs.

Each watchdog returns one of:

  • COMPLETED — final metrics, ready for analysis
  • ALERT — problem detected (NaN, OOM, CUDA error), with diagnosis
  • PROGRESS — (short watch only) still running, current step/loss/ETA

When a short watchdog returns, the main agent shows status, does productive work, and relaunches. When a long watchdog returns, it means the experiment completed or failed — go straight to analysis.

Anti-patterns: Never use Bash(sleep N) in the main conversation. Never use CronCreate or any invented polling mechanism. Never poll inline in a loop.

Real-World Example

Here's what a monitoring session looks like in practice — Claude Code launches experiments, spawns watchdogs, does productive work, and reacts when results come in:

Agent: Two experiments launched. Let me start monitoring.

       [Agent("Watch exp_6eea6a", ..., run_in_background=true)]
       [Agent("Watch exp_34555d", ..., run_in_background=true)]

       While they train, let me check if we can improve logging for next runs...

       [Read train.py]
       [Edit train.py — adds log_level="info", save_steps=50]

       Done. For future experiments:
       - log_level="info" — loss values will print to stdout
       - save_steps=50 — checkpoints every 50 steps for earlier visibility

       Current Status

       | Experiment | Model     | Step    | ETA   | Health  |
       |------------|-----------|---------|-------|---------|
       | exp_6eea6a | Model-0.8B | ~51/375 | ~3.5h | Healthy |
       | exp_34555d | Model-2B   | ~39/375 | ~4.8h | Healthy |

       Watchdogs running. They'll report back when epoch 1 completes.

--- (watchdog returns ~5 min later) ---

Agent "Watch exp_6eea6a" completed:
       exp_6eea6a: step 72/375, loss=2.31, ~2.5h remaining — healthy

Agent: [relaunches watchdog for exp_6eea6a]
       While waiting, let me search for papers on learning rate schedules...
       [daedalus_search_papers("warmup learning rate fine-tuning")]
       [daedalus_add_paper(...)]

--- (watchdog returns: COMPLETED) ---

Agent "Watch exp_6eea6a" completed:
       exp_6eea6a COMPLETED — eval_loss=0.28, accuracy=0.94

Agent: [daedalus_compare_experiments(exp_ids=["exp_6eea6a", "exp_34555d_baseline"])]
       [daedalus_add_reflection(...)]
       [daedalus_save_note(category="insight", ...)]

       exp_6eea6a vs baseline:
       - accuracy: 0.91 -> 0.94 (+3.3%)
       - eval_loss: 0.35 -> 0.28 (-20%)
       Hypothesis CONFIRMED: lower learning rate improved convergence.

       exp_34555d still running (~1.5h). Preparing next experiment config...

The cycle repeats: watchdog returns → show status → react → productive work → relaunch → wait.

CLI Monitoring

The watcher tails training logs in real-time and emits structured events:

daedalus watch exp_007 --stream --log-interval 10
{"event": "WATCHING", "exp_id": "exp_007"}
{"event": "PROGRESS", "exp_id": "exp_007", "key": "epoch", "value": "1/10"}
{"event": "METRIC", "exp_id": "exp_007", "metric": "loss", "value": 0.345}
{"event": "METRIC", "exp_id": "exp_007", "metric": "eval_loss", "value": 0.287}
{"event": "ALERT", "exp_id": "exp_007", "alert_type": "loss_spike", "detail": "Loss jumped 5.2x"}
{"event": "COMPLETED", "exp_id": "exp_007", "results": {"eval": {"acc": 0.92}}}

Detected Patterns

  • Metrics: loss, eval_loss, accuracy, learning_rate (HF Trainer dict + key=value formats)
  • Progress: epoch N/M, step N/M, tqdm percentage bars
  • Alerts: NaN, Inf, CUDA OOM, loss spike (configurable threshold), training stall (no output for N minutes)

Remote Sentinel

For SSH experiments, a sentinel script runs inside screen on the remote host:

  • Survives SSH disconnects
  • Parses both stdout and stderr logs
  • 5-level completion detection: results.json, trainer_state.json, model artifacts, exit code, log heuristics
  • Writes structured events to events.jsonl
  • Local watcher reads events via single SSH calls (no persistent connection)
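
The single-SSH-call pattern amounts to something like this (sketch; the host alias and remote path are placeholders):

import json
import subprocess

# One-shot read of the sentinel's event log; no persistent connection.
out = subprocess.run(
    ["ssh", "gpu-a100", "tail", "-n", "50", "/data/daedalus/exp_abc123/events.jsonl"],
    capture_output=True, text=True, check=True,
).stdout
events = [json.loads(line) for line in out.splitlines() if line.strip()]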

Architecture

┌──────────────────────────────────────────────────┐
│                   Claude Code                    │
│  (the brain — reasons, edits code, decides)      │
│                                                  │
│  Native tools: Read, Write, Edit, Bash, Grep     │
│  Daedalus tools: daedalus_* (via MCP)            │
└──────────┬──────────────────────────┬────────────┘
           │ MCP (JSON-RPC/stdio)     │ direct
           ▼                          ▼
┌──────────────────┐    ┌──────────────────────────┐
│  Daedalus MCP    │    │  Your codebase           │
│  Server          │    │  (training scripts,      │
│                  │    │   data, configs)         │
│  - ToolExecutor  │    └──────────────────────────┘
│  - 30 tools      │
│  - Ledger (JSONL)│
│  - Memory (JSONL)│
│  - Runners       │
└──────┬───────────┘
       │
       ▼
┌──────────────────────────────────────────────────┐
│  Runners                                         │
│                                                  │
│  LocalRunner          SSHRunner + Sentinel       │
│  - subprocess         - ssh/scp (no paramiko)    │
│  - logs/stdout.log    - sentinel.sh in screen    │
│  - results.json       - events.jsonl on remote   │
│                       - sshpass for password auth│
└──────────────────────────────────────────────────┘

Key design principle: Claude Code is the brain. Daedalus is the hands. The watcher is the eyes. No AI runs inside Daedalus — it's pure bookkeeping, polling, and log parsing. All intelligence comes from Claude Code, which can both manage experiments AND edit the code.

Results Auto-Detection

Daedalus automatically detects result formats:

  • results.json — Daedalus native (nested {"eval": {"metric": value}}) or flat JSON
  • eval_results.json / trainer_state.json — HuggingFace Trainer
  • results.csv — CSV with metric,value columns
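
A training script can emit any of these directly; for instance (metric names are just examples):

import csv
import json

# Daedalus-native nested results.json
with open("results.json", "w") as f:
    json.dump({"eval": {"accuracy": 0.94, "loss": 0.28}}, f)

# Or the CSV fallback with metric,value columns
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["metric", "value"])
    writer.writerow(["accuracy", 0.94])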

Examples

Two ready-to-run examples are included in examples/:

sklearn-iris — Classical ML

RandomForest on the Iris dataset. Runs in seconds, no GPU needed.

cd examples/sklearn-iris
daedalus init .          # Initialize the project
# Open Claude Code and ask: "Run a baseline experiment"

Demonstrates: experiment lifecycle, parameter exploration, convergence detection.

sft-tiny — LLM Fine-tuning

SFT with TRL on SmolLM2-135M (135M params). Runs in a few minutes on CPU.

cd examples/sft-tiny
pip install -r requirements.txt
daedalus init .
# Open Claude Code and ask: "Create a plan to find the best learning rate"

Demonstrates: plan mode, HuggingFace integration, learning rate search, reflection.

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests (479 tests)
python -m pytest tests/ -q

# Run with coverage
python -m pytest tests/ --cov=daedalus --cov-report=term-missing

License

MIT
